A multi-channel architecture for high-performance
NAND flash-based storage system
Jeong-Uk Kang*, Jin-Soo Kim, Chanik Park,
Hyoungjun Park, Joonwon Lee
Agenda
• Introduction
• Background
• Multi-channel architecture
– Read Operation
– Write Operation
– Software Architecture
– Striping
– Interleaving
– Pipelining
– Combined System
• Evaluation
• Conclusion
• Appendix
2/23
Introduction (1/2)
• Flash memory throughput (K9LAG08U0M)
– Write: register copy time (twc X 256) + program time (tprog)
• About 2.4 MBps
– Read: page read time (tread) + register copy time (trc X 256)
• About 28.8 MBps
→ 256 = 2048 bytes (page size) / 8 bits (bus width)
Because each flash chip is slow on its own, parallelization is important.
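A minimal Python sketch of the timing model above: per-page time is the register copy time plus the busy time (tprog for writes, tread for reads). The function name, the cycle count passed in, and every timing value are assumptions chosen for illustration. None of them is taken from the slide or checked against the K9LAG08U0M datasheet. The example call assumes one bus cycle per byte on the 8-bit bus.

```python
def page_throughput_mbps(page_bytes, transfer_cycles, t_cycle_s, t_busy_s):
    """Throughput of one page operation in MB/s (1 MB = 10**6 bytes).

    total time = transfer_cycles * t_cycle_s   (register copy)
               + t_busy_s                      (tprog for writes, tread for reads)
    """
    total_s = transfer_cycles * t_cycle_s + t_busy_s
    return page_bytes / total_s / 1e6


# Hypothetical values, labeled assumptions (not datasheet figures).
PAGE_BYTES = 2048
CYCLES = 2048        # assumed: one 8-bit bus cycle per byte
T_CYCLE = 25e-9      # assumed twc = trc, in seconds
T_PROG = 800e-6      # assumed tprog, in seconds
T_READ = 20e-6       # assumed tread, in seconds

write_mbps = page_throughput_mbps(PAGE_BYTES, CYCLES, T_CYCLE, T_PROG)
read_mbps = page_throughput_mbps(PAGE_BYTES, CYCLES, T_CYCLE, T_READ)
print(f"write: {write_mbps:.1f} MBps, read: {read_mbps:.1f} MBps")
```

With these assumed values the sketch gives roughly 2.4 MBps for writes and 28.8 MBps for reads. The program time dominates the write path, which is why a single chip writes so much more slowly than it reads.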
3/23
Introduction (2/2)
• So how can it be parallelized?
– Striping
[Figure: Requests #1, #2 and #3; Request #1 is split into sub-requests #1-A and #1-B]
– Interleaving
[Figure: Requests #1, #2 and #3; labels #1 and #2]
– Pipelining
[Figure: Requests #1, #2 and #3]
4/23
Background
• NAND Flash memory
– Total size: 128MB (1024 Blocks)
– Block
• Size: 64 pages
– Page
• Size: 2048B with 64B spare size
• Operation
– Write, Read, Erase
• Features
– Erase before write
– Program/erase cycles: 10,000~1,000,000
– Read delay: 10~25us
– Program delay: 200~700us
– Erase delay: 2~3ms
– Data transfer: 50us (8- or 16-bit bus width)
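The geometry and the erase-before-write rule above can be sketched as a toy in-memory model. The class and method names are hypothetical and the model ignores timing, spare areas and wear. It only shows that a programmed page rejects a second write until its whole block is erased.

```python
NUM_BLOCKS = 1024        # from the slide
PAGES_PER_BLOCK = 64     # from the slide
PAGE_SIZE = 2048         # bytes, from the slide (spare area ignored)

CAPACITY_MIB = NUM_BLOCKS * PAGES_PER_BLOCK * PAGE_SIZE // (1024 * 1024)


class ToyNand:
    """Hypothetical model: None marks an erased (writable) page."""

    def __init__(self):
        self.pages = [[None] * PAGES_PER_BLOCK for _ in range(NUM_BLOCKS)]

    def program(self, block, page, data):
        if len(data) > PAGE_SIZE:
            raise ValueError("data does not fit in one page")
        if self.pages[block][page] is not None:
            # Erase before write: a programmed page cannot be overwritten.
            raise RuntimeError("page must be erased before it is written again")
        self.pages[block][page] = bytes(data)

    def read(self, block, page):
        return self.pages[block][page]

    def erase(self, block):
        # Erase works on a whole block, not on a single page.
        self.pages[block] = [None] * PAGES_PER_BLOCK


print(CAPACITY_MIB)  # 1024 blocks * 64 pages * 2048 B
```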
5/23
Multi-channel architecture
[Figure: DUMBO block diagram. Labels: Host, host interface, CPU, NOR flash, SDRAM, four Channel Managers (32-bit data, CTR and INT signals) and eight NAND flash chips. Detail of one Channel Manager: control logic, Buffer 1, Buffer 2 and NAND interface, with a 16-bit data path to the NAND flash.]
6/23
Read Operation
[Figure: read timing diagram with rows HOST, DUMBO and NAND]
• Phases: Read Set (RS), Read from NAND (RN: setup, busy, data transfer), Interrupt, Read Data (RD)
7/23
Write Operation
[Figure: write timing diagram with rows HOST, DUMBO and NAND]
• Phases: Write Data (WD), Write Set (WS), Write to NAND (WN: setup, data transfer), NAND Program (NP: busy), Interrupt, Write Confirm (WC)
8/23
Software Architecture
I/O Subsystem (Request Queue Management)
KERNEL
Block Device Driver
Flash Translation Layer (FTL)
Low-level Device Driver (I/O Scheduler, Interrupt Handler)
DUMBO
File System Data
FTL method: hybrid (the specific implementation is not described)
Garbage collection: none
9/23
Striping
[Figure: read timeline. Without striping, one Channel Manager runs RS, RN and RD. With striping, Channel Manager #1 and Channel Manager #2 each run RS, RN and RD.]
[Figure: write timeline. Without striping, one Channel Manager runs WD, WS, WN, NP and WC. With striping, Channel Manager #1 and Channel Manager #2 each run WD, WS, WN, NP and WC.]
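A minimal sketch of striping, assuming a request is cut into equal contiguous sub-requests, one per channel manager. The function names are hypothetical, and how DUMBO actually sizes and places sub-requests is not stated on the slide.

```python
def stripe(data, num_channels):
    """Split one request into num_channels contiguous sub-requests."""
    chunk = -(-len(data) // num_channels)  # ceiling division
    return [data[i * chunk:(i + 1) * chunk] for i in range(num_channels)]


def unstripe(parts):
    """Reassemble the original request from its sub-requests."""
    return b"".join(parts)


request = b"A" * 2048 + b"B" * 2048        # one 4 KB request
sub_requests = stripe(request, 2)          # one sub-request per channel manager
print([len(part) for part in sub_requests])
```

Each sub-request can then go through its own setup, NAND and data-transfer phases on a separate channel manager, so one large request finishes sooner than it would on a single channel.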
10/23
Interleaving
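[Figure: read timeline. Without interleaving, one Channel Manager runs RS, RN and RD for three requests one after another. With interleaving, Channel Manager #1 and Channel Manager #2 serve different requests at the same time.]
[Figure: write timeline. Without interleaving, one Channel Manager runs WD, WS, WN, NP and WC for three requests one after another. With interleaving, Channel Manager #1 and Channel Manager #2 serve different requests at the same time.]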
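A minimal sketch of interleaving, assuming whole requests are handed to channel managers round-robin and every request takes the same service time. Both assumptions and the function names are illustrative and are not taken from the slide.

```python
def interleave(requests, num_channels):
    """Assign whole requests to channel managers in round-robin order."""
    queues = [[] for _ in range(num_channels)]
    for index, request in enumerate(requests):
        queues[index % num_channels].append(request)
    return queues


def makespan(queues, service_time):
    """Finish time when every queue is served in parallel."""
    return max(len(queue) for queue in queues) * service_time


requests = ["Request #1", "Request #2", "Request #3"]
one_channel = interleave(requests, 1)
two_channels = interleave(requests, 2)
print(makespan(one_channel, 1.0), makespan(two_channels, 1.0))
```

Unlike striping, interleaving leaves each request intact, so it helps when several independent requests are queued rather than when a single request is large.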
11/23
Pipelining
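[Figure: read timeline. Without pipelining, one Channel Manager runs RS, RN and RD for each request in sequence. With pipelining, the Channel Manager alternates between Buffer #1 and Buffer #2.]
[Figure: write timeline. Without pipelining, one Channel Manager runs WD, WS, WN, NP and WC for each request in sequence. With pipelining, the Channel Manager alternates between Buffer #1 and Buffer #2.]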
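A minimal sketch of two-buffer pipelining, assuming each request has two stages that use different resources. For a write, stage 1 would be WD and stage 2 the WS, WN, NP and WC phases. For a read, stage 1 would be RS and RN, and stage 2 would be RD. The two-stage split, the stage durations and the function names are assumptions for illustration.

```python
def serial_time(num_requests, t_stage1, t_stage2):
    """Every request finishes both stages before the next one starts."""
    return num_requests * (t_stage1 + t_stage2)


def pipelined_time(num_requests, t_stage1, t_stage2, num_buffers=2):
    """Stage 1 of a later request overlaps stage 2 of an earlier one.

    A request holds one buffer from the start of stage 1 to the end of
    stage 2, so request i must wait for request i - num_buffers.
    """
    stage1_done = 0.0
    stage2_done = 0.0
    finished = []                      # stage-2 completion time per request
    for i in range(num_requests):
        buffer_free = finished[i - num_buffers] if i >= num_buffers else 0.0
        stage1_start = max(stage1_done, buffer_free)
        stage1_done = stage1_start + t_stage1
        stage2_start = max(stage1_done, stage2_done)
        stage2_done = stage2_start + t_stage2
        finished.append(stage2_done)
    return stage2_done


print(serial_time(3, 1.0, 2.0), pipelined_time(3, 1.0, 2.0))
```

In steady state the pipelined time per request approaches the longer of the two stages, so the gain is largest when both stages take about the same time.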
12/23
Combined system
[Figure: Requests #1 to #4 pass through interleaving, striping and pipelining; Requests #2 and #4 are split into sub-requests #2-1, #2-2, #4-1 and #4-2]
Example)
Write: 2.4 × 8 = 19.2 MBps
Read: 28.8 × 8 = 230.4 MBps
13/23
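The ideal figures in the combined-system example are a plain linear scaling of the single-chip rates. A minimal sketch, with hypothetical function names; the `efficiency` helper is an added convenience for comparing a measured rate against that ideal and is not part of the slide.

```python
def ideal_throughput_mbps(single_chip_mbps, num_chips):
    """Upper bound if every chip works in parallel with no overhead."""
    return single_chip_mbps * num_chips


def efficiency(measured_mbps, ideal_mbps):
    """Fraction of the ideal rate that a measurement reaches."""
    return measured_mbps / ideal_mbps


ideal_write = ideal_throughput_mbps(2.4, 8)
ideal_read = ideal_throughput_mbps(28.8, 8)
print(round(ideal_write, 1), round(ideal_read, 1))
```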
Evaluation (1/3)
[Figure: READ and WRITE results for STRIPING, INTERLEAVING and PIPELINING]
14/23
Evaluation (2/3)
Putting it all together
The selected configuration is S2:I2.
It can give the best results at 4KB.
- The striping sub-request size is larger than the page size
- Most file systems use 4KB as the maximum request size
→ Maximum performance: 23.3 MBps, 16.0 MBps
15/23
Evaluation (3/3)
Block Device Driver
16/23
Conclusion
• Requests were processed about 3.6 times faster than with a single channel.
• Performance reached only 80% of the ideal.
→ Using DMA could improve this.
• Experiments with real workloads are needed.
17/23
Appendix
Issues
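• What makes parallelization difficult?
– Mapping Algorithm
• For reads to be parallelized, the data must have been written in parallel.
• With hybrid mapping, writing in parallel lowers the success rate of sequential blocks. → It degrades to block mapping.
• Page mapping makes parallelization easy, but garbage collection is inefficient.
• Minor issues
– Reads and writes use different buffer memory
– Space wasted because of parallelization is not considered
– Writing to flash in two operations what could be written in one does not slow things down; it actually gets faster
– It is assumed that because the file system's write unit is 4KB, the device is also written in 4KB units
– Different flash chips within one Channel Manager cannot be programmed simultaneously
– The experiments do not appear to be based on a conventional file system such as EXT4
•
– The FTL resides in the kernel
– Overhead is measured by the amount of code in a partially implemented FTL
– No mapping or garbage collection method is given
A. Ban, Flash file system, United States Patent No. 5,404,485, April 1995, contains nothing about garbage collection.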
19/23
Plane Parallelism (1/2)
Sector → Copy to Register (93us) → Cell program (800us)
• Copy to Register: 2KB (page size) X twc (8-bit bus)
- Atomic write must be guaranteed (CPU-intensive operation)
- Data is lost if the power goes off
• Cell program
- Data is written independently of the CPU
- Once the cell program completes, the data is kept regardless of power
→ Register write speed: 35.7MB/S
→ Actual write speed: 4.3MB/S
20/23
Plane Parallelism (2/2)
[Figure: timing diagram of repeated Copy to Register and Copy to Cell operations within one flash chip and across 4 channels (4ch)]
1172us = 13.3MB/S
13.3 X 4 = 55.4MB/S
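The 1172 figure is consistent with four register copies issued back to back, followed by one cell program shared by all planes (4 × 93 + 800). That reading is an assumption, as is treating the 93 and 800 figures as microseconds. The function name is hypothetical.

```python
def multi_plane_time_us(num_planes, t_copy_us, t_program_us):
    """Register copies share the bus and run one after another;
    the cell program is assumed to run once for all planes together."""
    return num_planes * t_copy_us + t_program_us


single_plane = multi_plane_time_us(1, 93, 800)
four_planes = multi_plane_time_us(4, 93, 800)
print(single_plane, four_planes)
```

Under this model four planes move four times the data in about 1.3 times the time of one plane, which is where the gain from plane parallelism would come from.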
21/23
References
• K9XXG08UXM Datasheet, Samsung Electronics
22/23