Fast File Clone in ZFS
Design Proposal
Pavel Zakharov
7/18/2015

Introduction
Description: Copy files almost instantly by copying by reference.
Motivations:
VMware VAAI support: NAS Full File Clone and Fast File Clone.
Save memory and disk space.
Existing alternatives: dataset clone, deduplication.

Regular Copy
[Diagram: copy of dnode 1 (L2, L1, data blocks A, B, C) into dnode 2 (L2', L1', copied blocks A', B', C').]
File copy is currently a very costly operation that has to duplicate every data block of the original file.

Direct References
File Clone: Method 1

Clone: fast file copy
[Diagram: dnode 1 and dnode 2 (clone) share data blocks 1, 2, 3. Labels: modify data block 3 of dnode 2 (new block C'), update blkptrs, propagate changes (L1', L2').]
Clone is similar to a snapshot.
Clone references the same blocks as the original file.
Only modified blocks are written out.

Diagrams
[Diagram: dnode 1 and dnode 2 both point to the same shared data (A); modified data becomes private data (B).]
A cleaner way to represent shared and private data.

Reference node: reference original data
[Diagram: dnode 1, dnode 2 (clone), and a hidden reference dnode, all referencing the same data blocks (A).]
In order to make the original file writable, we use an approach similar to a dataset clone. A dataset clone is performed on top of a read-only snapshot. Likewise, when a file is cloned, a hidden read-only dnode is created; it references all the original blocks.

Reference node: avoid refcount
[Diagram: file 1 (dnode 1), file 2 (dnode 2, clone), and the reference dnode. Labels: modify data, original data, garbage data.]
The extra dnode keeps references to the original data even if that data is no longer used by any file.
As long as clones exist, original data blocks are not freed.
This avoids having to implement any kind of complicated refcount algorithm.
Original data is freed only when the refnode is destroyed.
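The lifetime rule on the last two slides can be modeled in a few lines. What follows is a minimal Python sketch under stated assumptions, not ZFS code: `Pool`, `Dnode`, `clone_file`, `write_block`, and `delete_file` are hypothetical names, blocks are plain integers, and "held by the refnode" is tested by list membership as a simplification. Cloning a file that was modified after an earlier clone is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Toy block store: block id -> data."""
    blocks: dict = field(default_factory=dict)
    next_id: int = 0

    def alloc(self, data):
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data
        return bid

    def free(self, bid):
        del self.blocks[bid]


@dataclass(eq=False)  # identity comparison: two dnodes are never "equal"
class Dnode:
    blkptrs: list                                # block ids, one per offset
    refnode: "Dnode | None" = None               # set on cloned files
    clones: list = field(default_factory=list)   # set on the refnode


def clone_file(src):
    """Clone by reference: block pointers are copied, data is not."""
    if src.refnode is None:
        # First clone: create the hidden read-only refnode that pins
        # every original block.
        src.refnode = Dnode(blkptrs=list(src.blkptrs))
        src.refnode.clones.append(src)
    elif src.blkptrs != src.refnode.blkptrs:
        # Assumption of this sketch: modified sources are not handled.
        raise NotImplementedError("source modified since it was cloned")
    new = Dnode(blkptrs=list(src.blkptrs), refnode=src.refnode)
    src.refnode.clones.append(new)
    return new


def write_block(pool, f, index, data):
    """Overwrite one block; only the modified block is written out."""
    old = f.blkptrs[index]
    f.blkptrs[index] = pool.alloc(data)
    held = f.refnode is not None and old in f.refnode.blkptrs
    if not held:
        pool.free(old)  # private block: free it right away


def delete_file(pool, f):
    """Free private blocks; destroy the refnode with its last clone."""
    ref = f.refnode
    for bid in f.blkptrs:
        if ref is None or bid not in ref.blkptrs:
            pool.free(bid)
    f.blkptrs = []
    if ref is not None:
        ref.clones.remove(f)
        if not ref.clones:
            # No clone is left: the original data can finally be freed.
            for bid in ref.blkptrs:
                pool.free(bid)
            ref.blkptrs = []
```

No per-block counter appears anywhere in the sketch: a block is either private to one file or pinned by the refnode, which is the point the slide makes about avoiding a refcount algorithm.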
System Attributes
[Diagram:
file 1: dnode 34; pflags: clone; refnode: 55
reference dnode: dnode 55; birth: txg 200; pflags: clone_ref; clones: 34, 56
file 2 (clone): dnode 56; pflags: clone; refnode: 55]
For cloned files:
A flag is set indicating that the file is a clone.
A new attribute is created: refnode. It points to the reference dnode.
For the reference dnode:
A flag is set indicating that it is the reference dnode for other clones.
New attribute: clones. It is an array of dnode numbers representing all the clones.
New attribute: birth. The txg in which the reference node was created.

System Attributes
[Diagram:
file 1: dnode 34; pflags: clone; refnode: 55
reference dnode: dnode 55; birth: txg 200; pflags: clone_ref; clones: 34, 56 (alternatively, clones: 57)
file 2 (clone): dnode 56; pflags: clone; refnode: 55
ZAP: dnode 57; entries: 34, 56]
New Attributes
Refnode: object number of the reference dnode.
Birth: txg in which the reference node was created.
Clones: array of all dnodes that are clones of the refnode. Alternatively, clones could point to a ZAP object storing the clones list.
pflags: new flags for the pflags attribute: clone and clone_ref.

Freeing Blocks
[Diagram: reference dnode 55 (birth: txg 200) and file 1 (dnode 34). Original data with birth txg = 177 (<= 200): KEEP, keep when replaced. Old data with birth txg = 202 (> 200): FREE, free when replaced. Modified data is written at txg 205.]
When overwriting or destroying data, only free blocks that were born after the reference node.

Freeing Blocks
Blocks that were born after the reference node are treated the same way as regular blocks.
Blocks born before the reference node are only freed when the reference node is destroyed.
Any writes sent to a file right after it has been cloned cannot be assigned the same txg as the reference node.
The reference node is destroyed when:
Option 1: All clones are destroyed.
Option 2: All but one clone is destroyed (harder).

Multiple Clones
[Diagram: cloning file 1 again updates the clone list.
file 1: dnode 34; pflags: clone; refnode: 55
file 2 (clone): dnode 56; pflags: clone; refnode: 55 (file 2 has been modified)
file 3 (clone): dnode 63; pflags: clone; refnode: 55
reference dnode: dnode 55; birth: txg 200; pflags: clone_ref; clones: 34, 56, 63]
file 1 was not modified after the cloning operation, thus file 3 can link to the same reference node.

Nested Clones
If a clone is modified and then cloned again, a new reference node will be created.
[Diagram: file 2 is modified and then cloned; the clone list is updated.
file 1: dnode 34; pflags: clone; refnode: 55
reference dnode: dnode 55; birth: txg 200; pflags: clone_ref; clones: 34, 63 (previously 34, 56)
reference dnode 2: dnode 63; birth: txg 240; pflags: clone_ref; refnode: 55; clones: 56, 64
file 2 (clone): dnode 56; pflags: clone; refnode: 63 (previously 55)
file 3 (clone): dnode 64; pflags: clone; refnode: 63
Labels: Data 1, Data 2.]

Integration with other ZFS features
Problem
Traversing blocks in order without going over the same block twice now becomes problematic.
ZFS Send/Receive
ZFS send/receive loses all information about block pointers and txgs.
Traversal must be done in multiple passes: first send the reference nodes, then send the clones.
Several passes are required for nested clones: one extra pass for each clone depth.
[Diagram: dataset containing
dnode 1: regular
dnode 4: clone of 9
dnode 7: regular
dnode 9: refnode
dnode 12: regular
dnode 13: clone of 9
First pass: dnodes 1, 7, 9, 12
Second pass: dnodes 4, 13]
Solution: implicit references

Implicit References
File Clone: Method 2

Implicit Block Pointers
Shared data is only accessible/linked from the reference node.
The clones have embedded block pointers indicating that the data must be looked up in the reference node.

Implicit Direct References
[Diagram: refnode and clone. Labels: read data block 2, return block 2 of refnode, overwrite data block, replace references to shared blocks by special hole.]

Nested Clones: performance issue
When data is not found, the refnodes have to be traversed one by one.
This can cause performance degradation when multiple refnodes are nested.
[Diagram: clone, refnode 2, refnode 1. A read of data block 1 on the clone is passed down the chain of refnodes until block 1 is returned.]

Nested Clones: improvements
One solution to improve performance is to store, in the embedded blkptr, a reference to the dnode of the appropriate refnode.
[Diagram: clone (dnode 13), refnode 2 (dnode 9), refnode 1 (dnode 7). The embedded blkptrs carry the dnode number 7. Labels: read data block 1, return block 1.]

Analysis
Integration and comparison

Compare referencing methods
Direct References
Access clones at the same speed as regular files.
ZFS send/receive becomes non-trivial and potentially slower.
Requires more changes in the DMU layer.
Implicit References
Accessing clones is potentially slower for each nesting level.
Higher ARC footprint.
Requires more changes in the ZPL layer.

Integration with other ZFS features
ZIL
A new record type must be implemented.
Snapshots
File clone should not interfere with the current snapshot logic.
Special care has to be taken so that unreferenced clone-related data is destroyed when a snapshot is destroyed.
Scrubbing
Do not scrub cloned data multiple times. Easy with implicit references.
Send / Receive
Do not send cloned data multiple times. Easy with implicit references.
ZFS features
The clone feature should be downgraded from active to enabled when all clones are deleted.

Space Quotas
Space quotas can be tricky. The situation is similar to that of Linux reflinks.
If we treat a clone as a copy at the user level:
Full ZPL size of each clone (shared + private data) is accounted to the owner's userquota.
Full ZPL size of each clone is accounted to the dataset's refquota and refreservation.
Shared data of the refnode plus private data of each clone is accounted to the dataset's quota and reservation.

Accessing Reference Nodes
A ZAP object in the dataset links to all refnodes.
The ZPL layer can access this ZAP object as a special read-only folder.
Inside this folder, each refnode is displayed as a directory.
Each refnode directory contains one file to view the refnode's contents and another file that contains the relative paths to all its clones.
[Diagram:
$refnodes
  refnode 7
    contents
    clones
  refnode 13
    contents]

Integration with OS
[Diagram: call path from NFS down to ZFS.
NFS: Fast File Clone (NFS Fast File Clone support)
libc: fclone (userland)
system call: fclone (kernel; a new system call is required)
zfs vnode ops: vop_fclone (file clone support within the same dataset)
zfs znode ops: zfs_clone_node (file clone workhorse)]

Other thoughts
Ditto blocks for highly cloned files.
Ability to unlink a clone: obtain a hard copy.
When cloning a clone, avoid nesting the existing refnode if changes between the clone and its refnode are minor.
Alternative clone designs:
Use deduplication.
Link to dataset clones.
Work to do.

Compare clone alternatives

Direct/Implicit References:
Instant cloning.
Scalable.
Space wasted by refnode if shared data is no longer referenced.
Space quotas must be implemented.

Linked dataset clones:
Instant cloning.
Not scalable as of now. Affects pool import times.
Space wasted by snapshot if shared data is no longer referenced.
Space quotas must be modified for dataset clones that represent files.

Deduplication:
Slow cloning: need to traverse data.
Scalable.
No space wasted.
Space quotas already implemented.

Q&A

THANK YOU