Transcript Document

Fast File Clone in ZFS
Design Proposal
Pavel Zakharov
7/18/2015
Introduction
Description: Copy files almost instantly by copying
by reference.
Motivations:
 VMware VAAI support: NAS Full File Clone and
Fast File Clone.
 Save memory and disk space
Existing alternatives:
 dataset clone
 deduplication.
Regular Copy
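[Diagram: dnode 1 points through indirect blocks L2 and L1 to data blocks A, B, C. The copy, dnode 2, gets its own indirect blocks L2' and L1' and its own copied blocks A', B', C'.]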
File copy is currently a very costly operation that has to
duplicate every data block of the original file.
Direct References
File Clone: Method 1
Clone: fast file copy
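[Diagram: dnode 2 is a clone of dnode 1 and initially shares its indirect blocks (L2, L1) and data blocks A, B, C. When data block 3 of dnode 2 is modified, a new block C' is written and the change is propagated upward: the blkptrs are updated in new indirect blocks L1' and L2', while A and B stay shared.]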
 Clone is similar to a snapshot.
 Clone references same blocks as original file.
 Only modified blocks are written out.
Diagrams
[Diagram: dnode 1 and its clone, dnode 2, both point to the same shared data blocks (A). After data is modified, a dnode also holds private data blocks (B).]
A cleaner way to represent shared and private data.
Reference node: reference original data
[Diagram: cloning dnode 1 produces dnode 2 (the clone) and a hidden reference dnode. All three reference the original data blocks (A).]
In order to make the original file writable, we use an approach similar to a
dataset clone. A dataset clone is performed on top of a read-only snapshot.
Likewise, when a file is cloned, a hidden read-only dnode is created; it references
all the original blocks.
Reference node: avoid refcount
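[Diagram: file 1 (dnode 1) and file 2 (dnode 2, the clone) each modify data, replacing some original blocks (A) with private blocks (B and C). The reference dnode keeps pointing to every original block, including garbage data that no file uses anymore.]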
The extra dnode is used to keep references to the original data even if it is not
used anymore. As long as clones exist, original data blocks are not freed. This
avoids having to implement any kind of complicated refcount algorithm. Original
data is freed only when the refnode is destroyed.
System Attributes
file 1: dnode 34
    pflags: clone
    refnode: 55
reference dnode: dnode 55
    birth: txg 200
    pflags: clone_ref
    clones: 34, 56
file 2 (clone): dnode 56
    pflags: clone
    refnode: 55
For cloned files:
 a flag is set indicating it is a clone.
 a new attribute is created: refnode. It points to the reference dnode.
For the reference dnode:
 a flag is set indicating it is the reference dnode for other clones.
 new attribute: clones. It is an array of dnode numbers representing all the
clones.
 new attribute: birth. Txg of when the reference was created.
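As an illustration, the attribute layout above can be modeled in a few lines of Python. This is only a sketch: the Dnode class, the clone_file helper, and the use of a plain dict as the object set are assumptions made for the example, not part of the proposal. Only the attribute names (pflags, refnode, birth, clones) and the object numbers come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Dnode:
    """Hypothetical in-memory stand-in for a dnode's system attributes."""
    pflags: Set[str] = field(default_factory=set)    # may hold "clone" or "clone_ref"
    refnode: Optional[int] = None                    # clones only: object number of the refnode
    birth: Optional[int] = None                      # refnodes only: txg of creation
    clones: List[int] = field(default_factory=list)  # refnodes only: object numbers of clones


def clone_file(objset: Dict[int, Dnode], src: int, ref_obj: int,
               clone_obj: int, txg: int) -> None:
    """First clone of a regular file: create the refnode and the clone,
    and mark the source file as a clone of the same refnode."""
    objset[ref_obj] = Dnode(pflags={"clone_ref"}, birth=txg,
                            clones=[src, clone_obj])
    objset[src].pflags.add("clone")
    objset[src].refnode = ref_obj
    objset[clone_obj] = Dnode(pflags={"clone"}, refnode=ref_obj)


# Reproduce the slide: file 1 is dnode 34, cloned at txg 200.
objset = {34: Dnode()}
clone_file(objset, src=34, ref_obj=55, clone_obj=56, txg=200)
```

After the call, both dnode 34 and dnode 56 carry the clone flag and point to dnode 55, which lists them in its clones attribute.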
System Attributes
file 1: dnode 34
    pflags: clone
    refnode: 55
reference dnode: dnode 55
    birth: txg 200
    pflags: clone_ref
    clones: 34, 56 (as an array) or clones: 57 (as a ZAP object)
file 2 (clone): dnode 56
    pflags: clone
    refnode: 55
New Attributes
 Refnode: object number of the reference dnode.
 Birth: txg when the reference node was created.
 Clones: array of all dnodes that are clones of the refnode. Alternatively, clones could point to a ZAP object storing the clones list.
 pflags: new flags for the pflags attribute: clone and clone_ref.
[Diagram: ZAP object, dnode 57, holding the entries 34 and 56.]
Freeing Blocks
[Diagram: the reference dnode (dnode 55, birth: txg 200) and file 1 (dnode 34) reference blocks A, B, C. Refnode birth: txg = 200. A block with birth txg <= 200 is kept when replaced (KEEP); a block with birth txg > 200 is freed when replaced (FREE). Example: original data with birth txg = 177 is kept when it is replaced; old data with birth txg = 202 is freed when it is replaced by data modified at txg 205.]
When overwriting or destroying data, only free blocks that are born after the reference node.
Freeing Blocks
 Blocks that were born after the reference node are treated
the same way as regular blocks.
 Blocks born before the reference node are only freed when
the reference node is destroyed.
Any writes sent to a file right after it has been cloned cannot be assigned the
same txg as the reference node.
 The reference node is destroyed when:
Option 1: All clones are destroyed.
Option 2: All but one clone is destroyed (harder).
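The keep-or-free rule can be written as a single comparison. The sketch below is illustrative: the function name and its arguments are hypothetical, and it assumes that a file with no reference node frees its blocks as usual.

```python
from typing import Optional


def free_when_replaced(block_birth_txg: int,
                       refnode_birth_txg: Optional[int]) -> bool:
    """Return True if a block may be freed when it is overwritten or its
    file is destroyed, False if it must be kept for the reference node."""
    if refnode_birth_txg is None:
        # Assumption: not a clone, so regular freeing applies.
        return True
    # Blocks born at or before the refnode's txg may still be shared.
    return block_birth_txg > refnode_birth_txg
```

With a refnode born at txg 200, a block born at txg 177 is kept and a block born at txg 202 is freed. The comparison is strict because writes issued after the clone are never assigned the refnode's own txg.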
Multiple Clones
[Diagram: file 1 is cloned a second time and the clone list of the reference dnode is updated. File 2 has been modified and holds private blocks (B); file 1 and file 3 reference only the original blocks (A).]
file 1: dnode 34
    pflags: clone
    refnode: 55
file 2 (clone): dnode 56
    pflags: clone
    refnode: 55
file 3 (clone): dnode 63
    pflags: clone
    refnode: 55
reference dnode: dnode 55
    birth: txg 200
    pflags: clone_ref
    clones: 34, 56, 63
file 1 was not modified after
the cloning operation, thus
file 3 can link to the same
reference node.
Nested Clones
If a clone is modified and
then cloned again, a new
reference node will be
created.
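The two cases (reuse the refnode for an unmodified file, create a nested refnode for a modified clone) can be sketched as one function. Everything here is an assumption made for illustration: the dict-based object set, the "modified" flag used to detect changes since the last clone, the alloc callback that hands out object numbers, and the choice to let the new refnode take the modified clone's place in the old refnode's clone list, which is how the diagrams appear to behave.

```python
from typing import Callable, Dict


def clone(objset: Dict[int, dict], src: int,
          alloc: Callable[[], int], txg: int) -> int:
    """Clone file `src` and return the object number of the new clone."""
    node = objset[src]
    parent = node["refnode"]
    if parent is not None and not node["modified"]:
        # Unmodified since it was cloned: link to the same reference node.
        ref = parent
    else:
        # Regular file, or a clone that has diverged: new reference node.
        ref = alloc()
        objset[ref] = {"birth": txg, "clones": [src], "refnode": parent}
        if parent is not None:
            # Nested case: the new refnode replaces src in the old list.
            siblings = objset[parent]["clones"]
            siblings[siblings.index(src)] = ref
        node["refnode"] = ref
        node["modified"] = False
    new = alloc()
    objset[ref]["clones"].append(new)
    objset[new] = {"refnode": ref, "modified": False}
    return new
```

Starting from file 1 as dnode 34 and handing out object numbers 55, 56, 63, 64 in order, cloning dnode 34 twice yields a single refnode 55 with clones 34, 56, 63. If dnode 56 is modified and then cloned instead, refnode 63 is created at the new txg, refnode 55 lists 34 and 63, and refnode 63 lists 56 and 64.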
[Diagram: file 2 is modified ("modify data 2") and then cloned. A new reference dnode is created for it and the clone list of the original reference dnode is updated. Diagram labels: Data 1, Data 2.]
file 1: dnode 34
    pflags: clone
    refnode: 55
reference dnode: dnode 55
    birth: txg 200
    pflags: clone_ref
    clones: 34, 63 (63 replaces 56)
reference dnode: dnode 63
    birth: txg 240
    pflags: clone_ref
    clone_obj: 55
    clones: 56, 64
file 2 (clone): dnode 56
    pflags: clone
    refnode: 63 (previously 55)
file 3 (clone): dnode 64
    pflags: clone
    refnode: 63
Integration with other ZFS features
Problem
Traversing blocks in-order without going twice over the same block now becomes
problematic.
ZFS Send/Receive
 ZFS send/receive loses all information about block pointers and txgs.
 Traversal must be done in multiple passes: first send the reference nodes, then send the clones.
 Several passes are required for nested clones: one extra pass for each clone depth.
dataset
    dnode 1: regular
    dnode 4: clone of 9
    dnode 7: regular
    dnode 9: refnode
    dnode 12: regular
    dnode 13: clone of 9
First Pass: dnodes 1, 7, 9, 12
Second Pass: dnodes 4, 13
Solution: implicit references
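The multi-pass ordering for direct references can be sketched by grouping objects by clone depth. The function name and the mapping used as input are hypothetical; the sketch assumes every object records the refnode it depends on (or nothing for regular files and top-level refnodes).

```python
from typing import Dict, List, Optional


def send_passes(refnode_of: Dict[int, Optional[int]]) -> List[List[int]]:
    """Group object numbers into send passes. Pass 0 holds objects with no
    refnode; each further pass holds objects one clone level deeper."""
    def depth(obj: int) -> int:
        d = 0
        while refnode_of[obj] is not None:
            obj = refnode_of[obj]
            d += 1
        return d

    passes: Dict[int, List[int]] = {}
    for obj in sorted(refnode_of):
        passes.setdefault(depth(obj), []).append(obj)
    return [passes[d] for d in sorted(passes)]


# The dataset from the slide: dnodes 4 and 13 are clones of refnode 9.
dataset = {1: None, 4: 9, 7: None, 9: None, 12: None, 13: 9}
print(send_passes(dataset))  # [[1, 7, 9, 12], [4, 13]]
```

A chain of nested refnodes adds one pass per level, which is the cost the slide points out.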
Implicit References
File Clone: Method 2
Implicit Block Pointers
 Shared data is only accessible/linked from the reference
node.
 The clones have embedded block pointers that indicate the data must be looked up in the reference node.
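[Diagram: direct references compared with implicit references. With direct references, the clone's indirect block (L1) points to the shared blocks (A) and to its private block (B). With implicit references, references to shared blocks are replaced by a special hole: reading data block 2 of the clone returns block 2 of the refnode, and overwriting data block 1 stores a private block (B) in the clone.]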
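A minimal sketch of the read and overwrite paths with implicit references follows. The ImplicitClone class and the SHARED sentinel, which stands in for the special hole, are hypothetical names; block contents are plain strings for readability.

```python
SHARED = object()  # stands in for the "special hole" embedded block pointer


class ImplicitClone:
    """Clone whose unmodified blocks are looked up in the reference node."""

    def __init__(self, refnode_blocks):
        self.refnode_blocks = refnode_blocks          # read-only shared data
        self.blocks = [SHARED] * len(refnode_blocks)  # every block starts shared

    def read(self, idx):
        own = self.blocks[idx]
        # A hole means: return the same block number from the refnode.
        return self.refnode_blocks[idx] if own is SHARED else own

    def write(self, idx, data):
        # Overwriting replaces the hole with a private block; the refnode
        # is never touched.
        self.blocks[idx] = data


clone = ImplicitClone(["A1", "A2", "A3"])
clone.write(0, "B1")
print(clone.read(0), clone.read(1))  # B1 A2
```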
Nested Clones: performance issue
When data is not found in the clone, the refnodes have to be traversed one
by one. This can degrade performance when multiple
refnodes are nested.
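[Diagram: the clone holds a private block (C), refnode 2 holds a block (B), and refnode 1 holds the original blocks (A). Reading data block 1 of the clone passes through refnode 2 before refnode 1 returns block 1.]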
Nested Clones: improvements
One solution to improve performance is to have a reference to
the dnode of the appropriate refnode in the embedded blkptr.
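The difference between walking the refnode chain and following a refnode recorded in the embedded blkptr can be sketched as below. The Hole and Node classes and the hop counter are assumptions made for illustration; the object numbers (clone 13, refnode 2 as dnode 9, refnode 1 as dnode 7) follow the diagram.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union


@dataclass(frozen=True)
class Hole:
    """Embedded blkptr; `target` optionally names the refnode with the data."""
    target: Optional[int] = None


@dataclass
class Node:
    blocks: List[Union[str, Hole]]
    refnode: Optional[int] = None  # next refnode in the chain, if any


def read_block(nodes: Dict[int, Node], obj: int, idx: int) -> Tuple[str, int]:
    """Return (data, number of refnodes visited) for block `idx` of `obj`."""
    hops = 0
    entry = nodes[obj].blocks[idx]
    while isinstance(entry, Hole):
        # Jump straight to the recorded refnode, else walk one level up.
        obj = entry.target if entry.target is not None else nodes[obj].refnode
        hops += 1
        entry = nodes[obj].blocks[idx]
    return entry, hops


refnode1 = Node(["A1", "A2", "A3"])
refnode2 = Node([Hole(), "B2", Hole()], refnode=7)
chained = {7: refnode1, 9: refnode2,
           13: Node([Hole(), Hole(), "C3"], refnode=9)}
direct = {7: refnode1, 9: refnode2,
          13: Node([Hole(7), Hole(9), "C3"], refnode=9)}
print(read_block(chained, 13, 0))  # ('A1', 2)
print(read_block(direct, 13, 0))   # ('A1', 1)
```

With the dnode number stored in the hole, a read visits a single refnode regardless of nesting depth.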
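[Diagram: clone (dnode 13), refnode 2 (dnode 9), refnode 1 (dnode 7). The clone's embedded blkptrs for blocks held by refnode 1 carry dnode number 7, so reading data block 1 of the clone goes directly to refnode 1, which returns block 1.]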
Analysis
Integration and comparison
Compare referencing methods
Direct References
 Access clones at the same speed as regular files.
 ZFS send/receive becomes non-trivial and potentially slower.
 Requires more changes in the DMU layer.
Implicit References
 Accessing clones is potentially slower for each nesting level.
 Higher ARC footprint.
 Requires more changes in the ZPL layer.
Integration with other ZFS features
ZIL
 A new record type must be implemented.
Snapshots
 File clone should not interfere with current snapshot logic.
 Special care has to be taken so that unreferenced clone-related data is destroyed
when a snapshot is destroyed.
Scrubbing
 Do not scrub cloned data multiple times. Easy with implicit references.
Send / Receive
 Do not send cloned data multiple times. Easy with implicit references.
ZFS features
 Clone feature should be downgraded from active to enabled when all clones are
deleted.
Space Quotas
Space quotas can be tricky. It is a similar situation
as with Linux reflinks.
If we treat a clone as a copy at the user level:
 Full ZPL size of each clone (shared + private
data) is accounted to owner’s userquota.
 Full ZPL size of each clone is accounted to
dataset’s refquota and refreservation.
 Shared data of refnode plus private data of each
clone is accounted to dataset’s quota and
reservation.
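A small worked example of the accounting above, under simplifying assumptions that are not in the proposal: all clones belong to one user and one dataset, files are overwritten in place so their size never changes, and the refnode still holds the whole original file. The function name and return keys are hypothetical.

```python
from typing import Dict, List


def quota_charges(file_size: int, overwritten_per_clone: List[int]) -> Dict[str, int]:
    """Bytes charged when `file_size` bytes are shared through a refnode and
    each clone has overwritten the given number of bytes with private data."""
    n_clones = len(overwritten_per_clone)
    return {
        # Each clone is charged its full ZPL size (shared + private data).
        "userquota_refquota": file_size * n_clones,
        # The dataset is charged the shared data once, plus private data.
        "quota": file_size + sum(overwritten_per_clone),
    }


# A 100 MiB file with two clones, one of which overwrote 10 MiB.
print(quota_charges(100, [0, 10]))
# {'userquota_refquota': 200, 'quota': 110}
```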
Accessing Reference Nodes
 A ZAP object in the dataset
links to all refnodes.
 ZPL layer can access this ZAP
object as a special read-only
folder.
 Inside this folder, each refnode
is displayed as a directory.
 Each refnode directory
contains one file to view the
refnode’s contents and
another file that contains the
relative paths to all its clones.
$refnodes
    refnode 7
        contents
        clones
    refnode 13
        contents
Integration with OS
[Diagram: call path from userland to kernel, top to bottom.]
NFS: Fast File Clone (NFS Fast File Clone support)
libc: fclone
system call: fclone (a new system call is required)
zfs vnode ops: vop_fclone (file clone support within the same dataset)
zfs znode ops: zfs_clone_node (file clone workhorse)
Other thoughts
Ditto blocks for highly cloned files.
Ability to unlink clone: obtain a hard copy.
When cloning a clone, avoid nesting existing refnode if
changes between the clone and its refnode are minor.
Alternative clone designs
 Use deduplication
 Link to dataset clones
 Work to do.
Compare clone alternatives
Direct/Implicit References
 Instant cloning.
 Scalable.
 Space wasted by refnode if shared data no longer referenced.
 Space quotas must be implemented.
Linked dataset clones
 Instant cloning.
 Not scalable as of now. Affects pool import times.
 Space wasted by snapshot if shared data no longer referenced.
 Space quotas must be modified for dataset clones that represent files.
Deduplication
 Slow cloning: need to traverse data.
 Scalable.
 No space wasted.
 Space quotas already implemented.
Q&A
THANK YOU