Using Uncacheable Memory to Improve Unity Linux Performance

Ning Qu
Xiaogang Gou
Xu Cheng
Microprocessor Research and Development Center
Peking University
Issues
[Figure: Unity SoC architecture. Block labels: UniCore32 CPU, UniCore-F64 (CP2), CP0, CP1, IMMU, DMMU, TLB, I-Cache, D-Cache, I_BUS, D_BUS, BIU, Main Memory, EMI, PCI Bridge, 10/100M MAC, 6-channel DMA, APB Bridge, System Control Modules, RTC, INTC, IIC, UART0, UART1, 28 GPIO, PowerM., OST, SPI, ResetC. Inset: CPU with ICache and DCache, Main Memory, DMA.]
• Hardware table walking in main memory
• No snooping
• Cache coherency problem everywhere!

Issues (cont.)
[Figure: I/O data path, drawn in two panels. Labels in each panel: User Process (process I/O buffer), Linux Kernel (kernel I/O buffer), DMA, I/O Device (I/O device buffer). Annotation: poor temporal locality!]

Motivation
• Heavy cost of cache coherency operations
  - Many high-end embedded processors have caches, but many of them offer only very limited support for guaranteeing cache coherency
• Poor locality leads to more data cache pollution
  - A cache relies on the property of locality
  - Some programs have poor locality, for example TCP/IP processing
• How to avoid these disadvantages?
  - Uncacheable memory may be a solution!

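The pollution argument above can be illustrated with a toy simulation. This is only a sketch under assumed parameters, not a model of the Unity cache: the direct-mapped organization, the 32-byte line, the 8 KB cache size, the 4 KB "hot" working set, and the names `DirectMappedCache` and `hot_hit_rate` are all hypothetical choices made for illustration. A hot working set is reused every round, while a stream of packet-like data is touched once and never again.

```python
from dataclasses import dataclass, field

LINE_SIZE = 32  # bytes per cache line (assumed for illustration)


@dataclass
class DirectMappedCache:
    """Toy direct-mapped cache that tracks only tags, not data."""
    num_lines: int
    tags: dict = field(default_factory=dict)

    def access(self, addr: int) -> bool:
        """Access one address; return True on a hit, False on a miss."""
        line = addr // LINE_SIZE
        index = line % self.num_lines
        tag = line // self.num_lines
        if self.tags.get(index) == tag:
            return True
        self.tags[index] = tag  # a miss allocates the line and evicts the old one
        return False


def hot_hit_rate(stream_cacheable: bool, rounds: int = 50) -> float:
    """Hit rate of a reused working set while one-shot data streams past it."""
    cache = DirectMappedCache(num_lines=256)      # 8 KB cache (assumed)
    hot = [i * LINE_SIZE for i in range(128)]     # 4 KB working set, reused every round
    stream_addr = 1 << 20                         # start of the streamed buffer
    hits = total = 0
    for _ in range(rounds):
        for addr in hot:
            hits += cache.access(addr)
            total += 1
        for _ in range(256):                      # 8 KB of data touched exactly once
            if stream_cacheable:
                cache.access(stream_addr)         # allocates a line, may evict hot data
            # an uncacheable access bypasses the cache and evicts nothing
            stream_addr += LINE_SIZE
    return hits / total
```

Under these assumptions, `hot_hit_rate(True)` is 0.0 because every round of streamed data evicts the whole working set, while `hot_hit_rate(False)` is 0.98 because only the first, cold round misses. The sketch ignores the slower access time of uncacheable memory, which is the price paid for keeping the working set resident.
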
Contributions
• Analyze the scenarios in which the cache does not perform well, and propose that uncacheable memory has two advantages
  - Eliminates most cache coherency operations
  - Avoids cache pollution
• Apply uncacheable memory in Unity Linux to improve I/O performance
  - Some important aspects improve by 5% to 29%

Outline
• Issues
• Motivation
• Contribution
• Uncacheable Memory
• Evaluation
• Related Work
• Conclusions

Recv Packet Flow
[Figure: receive packet flow in four steps across the I/O device, kernel space, and user space. Labels: DMA copy, flush cache, simple data processing, CPU copy, kernel buffers, user buffer. Annotation: using uncacheable memory.]

Send Packet Flow
[Figure: send packet flow in four steps across user space, kernel space, and the I/O device. Labels: CPU copy, simple data processing, clean cache, DMA copy, user buffer, kernel buffers. Annotation: using uncacheable memory.]

Cacheable vs. Uncacheable
DMA send and receive cost analysis
(CH: cacheable; NC: uncacheable; U: user buffer; K: kernel buffer; K(N): uncacheable kernel buffer)

                 Send                              Receive
CH processing    1. copy from U to K               1. clean & invalidate data cache
                 2. clean data cache               2. copy from K to U
NC processing    1. copy from U to K(N)            1. copy from K(N) to U
Side effect      1. accessing uncacheable          1. accessing uncacheable
                    memory is slower                  memory is slower
                 2. no data cache pollution        2. no data cache pollution
                 3. no cache clean operation       3. no cache flush operation

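The send column of the comparison above can be turned into a small back-of-the-envelope cost model. This is a sketch only: the per-line cost fields, the `Costs`, `send_cost_ch`, `send_cost_nc`, and `nc_wins` names, and every cycle count used with them are hypothetical placeholders, not measurements from the Unity platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    """Per-cache-line costs in cycles. All values passed in are placeholders."""
    copy_cached: float     # copy one line from U to K through the data cache
    copy_uncached: float   # copy one line from U to K(N), writes bypass the cache
    clean_line: float      # clean (write back) one dirty line before DMA reads it
    pollution_line: float  # later refill cost for a useful line the copy evicted


def send_cost_ch(lines: int, c: Costs) -> float:
    """CH send: copy from U to K, then clean the data cache; K pollutes the cache."""
    return lines * (c.copy_cached + c.clean_line + c.pollution_line)


def send_cost_nc(lines: int, c: Costs) -> float:
    """NC send: copy from U to K(N) only; slower access, no clean, no pollution."""
    return lines * c.copy_uncached


def nc_wins(lines: int, c: Costs) -> bool:
    """True when the uncacheable path is cheaper under this model."""
    return send_cost_nc(lines, c) < send_cost_ch(lines, c)


# Hypothetical numbers: uncached copies are slower, but cleaning and pollution cost more.
example = Costs(copy_cached=10, copy_uncached=25, clean_line=12, pollution_line=8)
```

With the placeholder values in `example`, a 48-line buffer costs 1440 cycles on the CH path and 1200 on the NC path, so `nc_wins(48, example)` is true. Raising `copy_uncached` to 40 reverses the outcome, which is the trade-off the side-effect rows describe: the uncacheable path pays off only when the extra access time is smaller than the coherency and pollution costs it removes.
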
Cacheable vs. Uncacheable (cont.)
[Figure: cost breakdown for DMA Send and DMA Recv. Recoverable labels: cache clean cost, cache flush cost, load U to cache, load K to cache, load K, load U into cache and store to K.]

Cacheable vs. Uncacheable (cont.)
[Figure: Recv and Send performance, CH vs. NC]

Using Uncacheable Memory
• Implemented in Unity Linux, ported from Linux 2.4.17
• Uncacheable page table
  - Eliminates cache coherency operations when modifying the page tables
• Uncacheable socket buffer for sending
  - Eliminates cache coherency operations
  - Avoids data cache pollution

Outline
• Issues
• Motivation
• Contribution
• Uncacheable Memory
• Evaluation
• Related Work
• Conclusions

Methodology
• Benchmarks: Netperf, Lmbench, and the Modified Andrew Benchmark
• Experimental environment
  - 160 MHz Unity network computer with 256 MB DRAM and an SoC built-in 10M/100M Ethernet card
  - Dell 4600 server with two Intel Xeon PIII 700 MHz processors, 4 GB DRAM, and a 1000M/100M Ethernet card
  - All benchmarks are executed in single-user mode on NFS

Netperf Benchmark Results
[Figure: Netperf TCP_STREAM send performance]

Netperf Benchmark Results (cont.)
[Figure: Netperf TCP_RR performance]

Lmbench Benchmark Results
[Figure: Lmbench performance]

Modified Andrew Benchmark Results
[Figure: Modified Andrew Benchmark]

Related Work
• Related work: accelerating uncacheable memory performance
  - New memory types
    - Intel write-combining
    - MIPS R10000: uncached-accelerated page
  - New instructions
    - SPARC V9, ARM, Unity II: block move instructions
• Future work: new memory type support
  - Reads like an ordinary cache, with low pollution
  - Writes like write-combining, without write-allocate

Conclusions
• This paper focuses on the use of uncacheable memory
  - Pros: eliminates coherency operations and avoids data cache pollution
  - Cons: slower access time
• Uncacheable memory can perform well with a careful design that takes the characteristics of the system into account

Thank You!
Questions?