Slides - TCE Events
Security and Deduplication in the Cloud
Danny Harnik - IBM Haifa Research Labs
What is Deduplication
Deduplication: storing only a single copy of redundant data
Applied at the file or block level
Major savings in backup environments (saves more than 90% in
common business scenarios)
“most impactful storage technology”
April 2008: IBM acquires Diligent
July 2009: EMC acquires Data Domain
July 2010: Dell acquires Ocarina
How are files deduped?
Fingerprint each file using a hash function
Common hashes used: SHA-1, SHA-256, others…
Store an index of all the hashes already in the system
New file:
Compute hash
Look hash up in index table
If new → add to index
If known hash → store as pointer to existing data
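A minimal sketch of the index lookup described above, assuming file-level deduplication with SHA-256 and an in-memory dictionary as the index. The class and method names (`DedupStore`, `put`, `get`) are hypothetical and only illustrate the flow; a real system would keep the index and the data on persistent storage.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: one physical copy per distinct digest."""

    def __init__(self):
        self.index = {}   # digest -> stored bytes (the single physical copy)
        self.files = {}   # file name -> digest (a pointer to existing data)

    def put(self, name: str, data: bytes) -> bool:
        """Store a file; return True if it was deduplicated."""
        digest = hashlib.sha256(data).hexdigest()   # compute hash
        known = digest in self.index                # look hash up in index
        if not known:
            self.index[digest] = data               # new: add to index
        self.files[name] = digest                   # either way, keep a pointer
        return known

    def get(self, name: str) -> bytes:
        return self.index[self.files[name]]


store = DedupStore()
first = store.put("a.mp3", b"same bytes")
second = store.put("b.mp3", b"same bytes")   # same content, second name
```

After both calls the store holds two names but only one copy of the data.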
Client-side deduplication
Save bandwidth as well as storage.
Also known as “source-based dedupe” or “WAN deduplication”
Client computes hash and sends to server
If new → server requests client for the file (upload data)
Otherwise (dedupe) → skip upload and register the client as
another owner of the file
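The exchange above can be sketched as follows, assuming SHA-256 and an in-process "server" object instead of a network. The names `Server`, `offer`, `upload` and `client_store` are hypothetical; the point is only that the second client sends no payload.

```python
import hashlib


class Server:
    def __init__(self):
        self.index = {}    # digest -> file bytes
        self.owners = {}   # digest -> set of user names

    def offer(self, user: str, digest: str) -> bool:
        """Client announces a hash. Return True if an upload is needed."""
        if digest in self.index:
            self.owners[digest].add(user)   # dedupe: just register the owner
            return False
        return True

    def upload(self, user: str, data: bytes) -> None:
        digest = hashlib.sha256(data).hexdigest()
        self.index[digest] = data
        self.owners[digest] = {user}


def client_store(server: Server, user: str, data: bytes) -> int:
    """Store a file and return the number of payload bytes actually sent."""
    digest = hashlib.sha256(data).hexdigest()
    if server.offer(user, digest):
        server.upload(user, data)
        return len(data)
    return 0   # server already has the data: upload skipped


server = Server()
song = b"let it be" * 1000
sent_by_alice = client_store(server, "alice", song)
sent_by_bob = client_store(server, "bob", song)
```

Note that the return value of `offer` is exactly the signal a client can observe to learn whether some other user already stored the file.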
[Figure: the client hashes “Let it be.mp3” to 2fd4e1 and sends the hash to the server, whose index already contains 2fd4e1]
Deduplication and privacy
Our attacks are relevant to the following setting:
Client-side deduplication
Cross-user deduplication
If two or more users store the same file, only a single copy is stored.
Cloud storage and deduplication
Cloud storage services are gaining popularity
Online file backup and synchronization is huge
Lots to gain from deduplication
Use/used cross-user client-side deduplication
Mozy
Dropbox
Memopal
…
MP3Tunes
Deduplication and privacy I
Harnik, Pinkas & Shulman-Peleg,
IEEE Security & Privacy, Vol. 8, 2010
Client learns if an object is already in system
A narrow “peep hole” to contents of other users
Discussed attacks and partial solutions
Illegal content searching
“Salary attack”
Covert channel
Several ways to prevent:
Encrypt or dedupe server side only
Dedupe only on long files
Noisy dedupe…
Deduplication and privacy II
Halevi, Harnik, Pinkas & Shulman-Peleg,
ACM CCS 2011
A more direct attack
Starting point: Suppose I get the hash value of your file…
The attack
Attacker obtains hash of victim’s file
Signs up for the service with own account
Attempts to upload a file, but swaps the hash value with
that of the victim’s file.
File is now registered to attacker
Download file…
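A toy illustration of the attack steps, assuming a server that accepts a matching hash as sufficient evidence of ownership. `NaiveServer` and its methods are hypothetical and do not model any particular service.

```python
import hashlib


class NaiveServer:
    def __init__(self):
        self.index = {}    # digest -> file bytes
        self.owners = {}   # digest -> set of users

    def upload(self, user, data):
        digest = hashlib.sha256(data).hexdigest()
        self.index[digest] = data
        self.owners.setdefault(digest, set()).add(user)

    def claim(self, user, digest):
        """Dedupe path: the hash alone is accepted as proof of ownership."""
        if digest not in self.index:
            return False
        self.owners[digest].add(user)
        return True

    def download(self, user, digest):
        if user not in self.owners.get(digest, set()):
            raise PermissionError("not an owner")
        return self.index[digest]


server = NaiveServer()
secret = b"victim's private document"
server.upload("victim", secret)

# The attacker never sees the file, only its (leaked) hash.
leaked_hash = hashlib.sha256(secret).hexdigest()
assert server.claim("attacker", leaked_hash)
stolen = server.download("attacker", leaked_hash)
```

A few dozen bytes of hash are enough to obtain the whole file, whatever its size.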
[Figure: the attacker's client holds “Any file” (hash e3b890) but sends the hash 2fd4e1 of “Let it be.mp3”, which the server's index already contains]
Obtaining the hash
1. Hash used for other services
Hash does not reveal “anything” on the file – not meant to be secret
2. Malicious software
Easier to send a small signature undetected
Also true for break-in at the server side
3. CDN attack
Alice sends all her friends the hash of a movie
Friends can download it from the server
Server essentially serves as a Content Distribution Network (CDN).
Might break its cost structure, if it planned on serving only a few restore ops.
Swapping the hash
[Dorrendorf & Pinkas 2011]
Dropship (April, 2011)
An implementation of the CDN attack over Dropbox
“written in Python. Allow you to download to your Dropbox any
file, which description we got in JSON format (similar as
description propagated in .torrent files).”
[Mulazzani, Schrittwieser, Leithner, Huber & Weippl 2011]
Implemented the attacks against two major storage servers
One service uses SHA-256 to identify files
Another uses a 160-bit hash value which was not identified
Implemented the attack on Dropbox
In Usenix Security 2011
A non-issue in upcoming cloud storage standards
SOLUTIONS !
Naïve Solutions
Use a non-standard hash
(e.g. Hash(“service name” | file) )
But all clients must know hash function
Irrelevant in most scenarios (CDN, malicious software, etc.)
Better naïve Solutions
Use a challenge-response phase
For every upload, server picks a random nonce, and
asks client to compute Hash( nonce | file )
This requires client to have the file
But the server, too, must now retrieve the file from secondary
storage, and compute the hash
Alternative: Pre-compute Hash( nonce | file) and store
together with hash
Back to root cause of problem: short hash represents file entirely.
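The challenge-response idea with a pre-computed answer might look roughly like this. It is a sketch under assumptions: SHA-256, a 16-byte nonce, and hypothetical names (`ChallengeServer`, `respond`). The `files` dictionary stands in for secondary storage.

```python
import hashlib
import secrets


def respond(nonce: bytes, data: bytes) -> str:
    """Client side: Hash(nonce | file), which requires the whole file."""
    return hashlib.sha256(nonce + data).hexdigest()


class ChallengeServer:
    def __init__(self):
        self.files = {}     # digest -> file bytes (slow secondary storage)
        self.pending = {}   # digest -> (nonce, expected answer), precomputed

    def store(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.files[digest] = data
        self._refresh(digest, data)
        return digest

    def _refresh(self, digest: str, data: bytes) -> None:
        nonce = secrets.token_bytes(16)
        self.pending[digest] = (nonce, respond(nonce, data))

    def challenge(self, digest: str) -> bytes:
        return self.pending[digest][0]

    def verify(self, digest: str, answer: str) -> bool:
        ok = secrets.compare_digest(answer, self.pending[digest][1])
        # A nonce must not be reused, so the file has to be read again here.
        self._refresh(digest, self.files[digest])
        return ok


server = ChallengeServer()
data = b"quarterly salaries"
digest = server.store(data)
honest = server.verify(digest, respond(server.challenge(digest), data))
cheat = server.verify(digest, respond(server.challenge(digest), b"guess"))
```

The `_refresh` call inside `verify` shows the cost noted above: each fresh challenge needs another pass over the stored file, or a supply of answers computed in advance.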
Proofs of Ownership (POWs)
Server preprocesses the file
Stores some short information per file (few bytes only)
Proof stage: a challenge response – done only during file upload
Honest client has access to the file
Server has access only to the preprocessed information; it cannot retrieve files from secondary storage.
Must be bandwidth efficient
Client computation should be efficient (time & memory)
Security definition: Malicious client may have:
Partial knowledge of file (file has k min-entropy to it)
May receive additional information from accomplices (m bits)
If k - m > s, where s is the security parameter, then the proof fails w.h.p.
[Figure: the file (k bits of min-entropy), the attacker's prior knowledge, accomplice data, and the security parameter s]
Proofs of Retrievability (PORs)
Role reversal: Server proves to client that it actually stores the client's file
Strong extraction based definition (we use a relaxed notion)
State of the art solutions all send a pre-processed file to the server.
E.g. [NR05],[JK07],[SW08],[DVW09]
Cannot be done in our setting
In general, POR without preprocessing is a good POW
Our first solution is a Merkle tree based POR
Solution – first attempt
[Figure: Merkle tree built over the file]
Preprocessing: server stores root of tree
Proof: server asks client to present paths to t random leaves
√ very efficient
A client which knows only a p fraction of the file succeeds with prob < p^t.
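A minimal sketch of the Merkle-tree proof, assuming SHA-256, a block count that is a power of two, and hypothetical helper names (`build_tree`, `auth_path`, `verify`). The server keeps only `root`; the client must produce the challenged blocks and their sibling paths.

```python
import hashlib
import random


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(blocks):
    """Return all tree levels, leaves first. Assumes a power-of-two block count."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def auth_path(levels, index):
    """Sibling hashes from leaf `index` up to (not including) the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])   # sibling differs in the lowest bit
        index //= 2
    return path


def verify(root, index, block, path):
    """Server side: recompute the root from one block and its sibling path."""
    node = h(block)
    for sibling in path:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root


blocks = [bytes([i]) * 64 for i in range(8)]   # toy file of 8 blocks
levels = build_tree(blocks)
root = levels[-1][0]                           # all the server keeps

t = 3
# Illustration only: a real server would use a cryptographic RNG.
challenge = random.sample(range(len(blocks)), t)
proofs = [(i, blocks[i], auth_path(levels, i)) for i in challenge]
accepted = all(verify(root, i, b, p) for i, b, p in proofs)
```

Each proof costs one block plus log2(n) hashes, which is why the slide calls this step very efficient.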
Problem and solution
Does not suffice when the file's min-entropy is low (e.g. the attacker already knows 90% of the file)
Solution: Apply tree to an erasure coding of the file
Satisfies security of POW and POR.
Efficient encoding?
Must pay either:
Large memory
Multiple disk accesses
Bad for large files
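To see why a well-informed attacker is a problem for the plain tree, the bound from the first attempt (success probability below p^t) can be turned into a required number of leaves. The sketch below assumes a target cheating probability of 2^-s; the function name `leaves_needed` and the choice s = 20 are illustrative only.

```python
import math


def leaves_needed(p: float, s: int) -> int:
    """Smallest t with p**t <= 2**-s, for a client that knows a p fraction of the leaves."""
    return math.ceil(s * math.log(2) / -math.log(p))


t_90 = leaves_needed(0.9, 20)    # attacker knows 90% of the file
t_99 = leaves_needed(0.99, 20)   # attacker knows 99% of the file
```

The number of challenged leaves grows quickly as p approaches 1. With an erasure code, an attacker who cannot reconstruct the file must be missing a sizeable fraction of the encoded blocks, which keeps p bounded away from 1.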
[Figure: File → Erasure code → Merkle tree]
Protocols with small space
Limit solution to use an L byte buffer for all the
computation
For example: L=64MB
Relax security guarantees:
Can only tolerate L bytes of accomplice data.
[Figure: the file, the attacker's prior knowledge, and accomplice data limited to L, with security parameter s]
Second protocol: hash to small space
First hash file to a buffer of L bytes. Then construct Merkle-tree over
the buffer.
Reducer: use pairwise-independent hashing
Security: the POW rejects (w.h.p.) an adversary that
Faces a file with at least k bits of min-entropy
Receives fewer than min(L, k-s) bits from an accomplice
[Figure: File → Reducer → Reduced file → Merkle tree]
Is this efficient enough ?
Still not really practical
File size M
Buffer size L
Reducer requires Ω(M·L) time
We want to push it further down…
Third protocol: Reduce and Mix
In Reducer: XOR each block to a constant number of random
locations
Runs in O(M+L) time
Add a mixing phase
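A rough sketch of the reduce phase only, to show why it runs in O(M+L) time: each file block is XORed into a constant number of buffer slots. The block size, the number of copies, and the way slot positions are derived (a PRNG seeded with the file's hash) are all assumptions made for illustration; the slides do not specify them, and `random.Random` is not a cryptographic generator. The mixing phase is not shown.

```python
import hashlib
import random

BLOCK = 64  # bytes per block (illustrative value)


def xor_into(buf: bytearray, offset: int, block: bytes) -> None:
    for j, byte in enumerate(block):
        buf[offset + j] ^= byte


def reduce_file(data: bytes, buffer_blocks: int, copies: int = 4) -> bytearray:
    """XOR every file block into `copies` pseudorandom buffer slots."""
    buf = bytearray(buffer_blocks * BLOCK)          # O(L) to allocate
    # Assumption: slot choices come from a seed that both sides can derive.
    rng = random.Random(hashlib.sha256(data).digest())
    for start in range(0, len(data), BLOCK):        # one pass over the file: O(M)
        block = data[start:start + BLOCK].ljust(BLOCK, b"\0")
        for _ in range(copies):                     # constant work per block
            xor_into(buf, rng.randrange(buffer_blocks) * BLOCK, block)
    return buf


reduced = reduce_file(b"x" * 10_000, buffer_blocks=16)
```

The Merkle tree would then be built over the reduced (and mixed) buffer rather than over the file itself.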
Hypothesis: reduce + mix forms a good code
Security defined against a generalized block-fixing source distribution
[Figure: File → Reducer → Reduced file → Mixer → Reduced & mixed file → Merkle tree]
Performance of the different phases of the low space PoW
When is it worth the effort?
Summary
Identified security implications of client-side deduplication
Introduced POWs to enable client-side deduplication in the cloud
The challenge: offer meaningful privacy guarantees with a limited toll
on the resources