Using Uncacheable Memory to Improve Unity Linux Performance
Ning Qu
Xiaogang Gou
Xu Cheng
Microprocessor Research and Development Center
Peking University
Issues
[Figure: Unity SoC architecture. Blocks labelled in the figure: UniCore32 CPU, CP0, CP1, UniCore-F64 (CP2), IMMU, DMMU, TLB, I-Cache, D-Cache, I_BUS, D_BUS, BIU, EMI, Main Memory, PCI Bridge, 10/100M MAC, 6 channel DMA, APB Bridge, and the System Control Modules (RTC, INTC, IIC, UART0, UART1, 28 GPIO, PowerM., OST, SPI, ResetC).]
Hardware table walking in main memory
No snooping
Cache coherency problem everywhere!!
Issues cont.
[Figure: I/O data path across three layers. The User Process holds the process I/O buffer, the Linux Kernel holds the kernel I/O buffer, and the I/O Device holds the I/O device buffer. DMA is marked between the kernel I/O buffer and the I/O device buffer.]
Poor temporal locality!
Motivation
Heavy cost of Cache coherency operations
  Many high-end embedded processors have a Cache, but many of them have very limited support for guaranteeing Cache coherency
Poor locality leads to more data Cache pollution
  A Cache relies on the property of locality
  Some programs have poor locality, for example TCP/IP processing
How to avoid the disadvantages?
  Uncacheable memory may be a solution!
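The pollution argument can be illustrated with a toy simulation. The sketch below assumes a fully associative LRU cache, which is a simplification and not the organization of the Unity D-Cache; the sizes and the name `hot_set_misses` are made up for illustration. It counts misses on a small, frequently reused working set while packet-like data that is touched only once streams through, first cached and then bypassing the cache.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative LRU cache holding `capacity` lines (toy model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)        # hit: mark as most recently used
        else:
            self.misses += 1
            self.lines[line] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict the least recently used line


def hot_set_misses(cache_lines, hot_lines, stream_lines, rounds, stream_cacheable):
    """Count misses on the hot working set while one-shot data streams through."""
    cache = LRUCache(cache_lines)
    hot = [("hot", i) for i in range(hot_lines)]
    for line in hot:                            # warm up the hot set
        cache.access(line)

    hot_misses = 0
    next_stream = 0
    for _ in range(rounds):
        # Streamed data is touched once and never reused (poor temporal locality).
        for _ in range(stream_lines):
            if stream_cacheable:
                cache.access(("stream", next_stream))
            next_stream += 1
        # The hot set is reused every round.
        for line in hot:
            before = cache.misses
            cache.access(line)
            hot_misses += cache.misses - before
    return hot_misses


polluted = hot_set_misses(64, 32, 64, 10, stream_cacheable=True)
bypassed = hot_set_misses(64, 32, 64, 10, stream_cacheable=False)
print("hot-set misses, stream cached:  ", polluted)
print("hot-set misses, stream bypassed:", bypassed)
```

With these illustrative sizes the streamed data evicts the whole hot set in every round when it is cached, and never when it bypasses the cache. Real caches are set-associative and real workloads are less regular, so only the direction of the effect carries over.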
Contributions
Analyze the scenarios in which the Cache does not perform well, and propose that uncacheable memory has two advantages
  Eliminate most Cache coherency operations
  Avoid Cache pollution
Apply uncacheable memory in Unity Linux to improve the I/O performance
  Some important aspects improve by 5% to 29%
Outline
Issues
Motivation
Contributions
Uncacheable Memory
Evaluation
Related Work
Conclusions
Recv Packet Flow
[Figure: receive path in four steps (step 1 to step 4), spanning the I/O Device, Kernel Space and User Space. Operations shown: DMA copy from the I/O Device into a kernel Buffer, flush cache, simple data processing on the Buffer, and CPU copy into the User Buffer. Annotation: using uncacheable memory.]
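Send Packet Flow
[Figure: send path in four steps (step 1 to step 4), spanning User Space, Kernel Space and the I/O Device. Operations shown: CPU copy from the User Buffer into a kernel Buffer, simple data processing on the Buffer, clean cache, and DMA copy from the Buffer to the I/O Device. Annotation: using uncacheable memory.]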
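The two packet flows can be written down as ordered step lists to make explicit which step an uncacheable kernel buffer removes. This is only a sketch: the function names are hypothetical, and the ordering of the middle steps is an assumption, since the slides label the steps only as step 1 to step 4.

```python
def recv_steps(uncacheable_kernel_buffer):
    """Receive path: I/O device -> kernel buffer -> user buffer."""
    steps = ["DMA copy: I/O device -> kernel buffer"]
    if not uncacheable_kernel_buffer:
        # A cached kernel buffer may hold stale lines after DMA wrote to memory.
        steps.append("flush cache")
    steps.append("simple data processing on kernel buffer")
    steps.append("CPU copy: kernel buffer -> user buffer")
    return steps


def send_steps(uncacheable_kernel_buffer):
    """Send path: user buffer -> kernel buffer -> I/O device."""
    steps = [
        "CPU copy: user buffer -> kernel buffer",
        "simple data processing on kernel buffer",
    ]
    if not uncacheable_kernel_buffer:
        # Dirty lines must reach main memory before the device reads it by DMA.
        steps.append("clean cache")
    steps.append("DMA copy: kernel buffer -> I/O device")
    return steps


for name, fn in (("recv", recv_steps), ("send", send_steps)):
    for uncacheable in (False, True):
        label = "uncacheable" if uncacheable else "cacheable"
        print(f"{name} ({label}):")
        for i, step in enumerate(fn(uncacheable), start=1):
            print(f"  {i}. {step}")
```

In both directions the uncacheable variant is one step shorter, because there is no cached copy of the kernel buffer to reconcile with main memory.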
Cacheable vs. Uncacheable
DMA send and receive cost analysis
Send
  CH processing
    1. copy from U to K
    2. clean data cache
  NC processing
    1. copy from U to K(N)
  side effect
    1. accessing uncacheable memory is slower
    2. no data cache pollution
    3. no cache clean operation
Receive
  CH processing
    1. clean&invalidate data cache
    2. copy from K to U
  NC processing
    1. copy from K(N) to U
  side effect
    1. accessing uncacheable memory is slower
    2. no data cache pollution
    3. no cache flush operation
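The trade-off in this cost analysis can be sketched as a small per-byte cost model. It reads CH as cacheable, NC as uncacheable, U as the user buffer, K as the kernel buffer and K(N) as the uncacheable kernel buffer, which is an interpretation of the slide's abbreviations. The `Costs` fields, the function names and every number are hypothetical and are not measurements from the talk; the model also ignores cache pollution, so it understates the benefit of NC.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    cached_access: float      # per byte touched through the data cache
    uncached_access: float    # per byte touched in uncacheable memory
    cache_maintenance: float  # per byte of clean or clean&invalidate


def ch_cost(nbytes, c):
    """CH: the copy touches U and K through the cache, plus one maintenance pass over K."""
    return nbytes * (2 * c.cached_access + c.cache_maintenance)


def nc_cost(nbytes, c):
    """NC: U is touched through the cache, K(N) is uncached, no maintenance pass."""
    return nbytes * (c.cached_access + c.uncached_access)


def nc_wins(c):
    """NC is cheaper when the uncached penalty is smaller than the maintenance it saves."""
    return c.uncached_access < c.cached_access + c.cache_maintenance


example = Costs(cached_access=1.0, uncached_access=2.5, cache_maintenance=2.0)
packet = 1500  # bytes, roughly one Ethernet frame
print("CH cost:", ch_cost(packet, example))
print("NC cost:", nc_cost(packet, example))
print("NC wins:", nc_wins(example))
```

Under this model the break-even point does not depend on the packet size: NC is cheaper whenever an uncached access costs less than a cached access plus the cache maintenance it replaces.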
Cacheable vs. Uncacheable cont.
[Figure: cost breakdown for DMA Send and DMA Recv. Cost components labelled in the figure: load U to Cache, load K to Cache, load U into Cache and store to K, Cache clean cost, Cache flush cost.]
Cacheable vs. Uncacheable cont.
[Figure: Recv and Send Performance, CH vs NC]
Using Uncacheable Memory
Implemented in Unity Linux, which is ported from Linux 2.4.17
Uncacheable page table
  eliminates Cache coherency operations when modifying the page tables
Uncacheable socket buffer for sending
  eliminates Cache coherency operations
  avoids data Cache pollution
Outline
Motivation
Issues
Contributions
Uncacheable Memory
Evaluation
Related Work
Conclusions
Methodology
Benchmarks: Netperf, Lmbench and the Modified Andrew Benchmark
Experimental environment
  160 MHz Unity network computer with 256 MB DRAM and an SoC built-in 10M/100M Ethernet card
  Dell 4600 server with two Intel Xeon PIII 700 MHz processors, 4 GB DRAM and a 1000M/100M Ethernet card
  All benchmarks are executed in single-user mode on NFS
Netperf Benchmark Results
[Figure: Netperf TCP_STREAM Send Performance]
Netperf Benchmark Results cont.
[Figure: Netperf TCP_RR Performance]
Lmbench Benchmark Results
[Figure: Lmbench Performance]
Modified Andrew Benchmark Results
[Figure: Modified Andrew Benchmark]
Related Work
Related work: accelerating uncacheable memory performance
  New memory type
    Intel write-combining
    MIPS R10000: uncached-accelerated page
  New instructions
    SPARC V9, ARM, Unity II: block move instructions
Future work: new memory type support
  Read like a common cache with low pollution
  Write like Write-Combining without write-allocate
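The idea behind write-combining can be illustrated with a toy count of bus transactions. The sketch below assumes a single combining buffer that covers one aligned block of `line_size` bytes; the function name and the sizes are hypothetical, and it does not describe the buffers of any particular processor.

```python
def bus_transactions(store_addresses, line_size, combine):
    """Count memory bus transactions for a sequence of uncached stores."""
    if not combine:
        return len(store_addresses)  # every store goes out on its own

    transactions = 0
    current_line = None
    for addr in store_addresses:
        line = addr // line_size
        if line != current_line:
            # The previous block (if any) is written out and a new one is
            # opened; each opened block results in exactly one burst write.
            transactions += 1
            current_line = line
    return transactions


# 64 sequential 4-byte stores, such as filling a 256-byte packet buffer
stores = list(range(0, 256, 4))
print("plain uncached stores:", bus_transactions(stores, 32, combine=False))
print("with write-combining: ", bus_transactions(stores, 32, combine=True))
```

Sequential stores benefit most, since each block fills completely before it is written out. Stores that alternate between blocks gain nothing in this single-buffer model.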
Conclusions
This paper focuses on the use of uncacheable memory
  Pros: eliminating coherency operations and avoiding data Cache pollution
  Cons: slower access time
Uncacheable memory can perform well with a careful design that takes the system's specific characteristics into account
Thank You!
Questions?