Paper Report - National Sun Yat
Presenter: Zong Ze - Huang
Jiun-Hung Ding , Po-Chun Chang , Wei-Chung Hsu , Yeh-Ching Chung
Parallel and Distributed Systems(ICPADS), 2011 IEEE 17th
International Conference on
2015/7/20
A full system emulator, such as QEMU, can provide a versatile
virtual platform for software development. However, most current
system simulators do not have sufficient support for multi-processor
emulations to effectively utilize the underlying parallelism presented
by today’s multi-core processors. In this paper, we focus on
parallelizing a system emulator and implement a prototype parallel
emulator based on the widely used QEMU. Using this parallel
QEMU, emulating an ARM11MPCore platform on a quad-core Intel
i7 machine with the SPLASH-2 benchmarks, we have achieved
3.8x speedup over the original QEMU design.
Current design of QEMU is only suitable for single-core
processor emulation.
When executing a multi-threaded application on a multi-core machine, QEMU
emulates the execution of the application in serial and
cannot take advantage of the parallelism available in the
application and the underlying hardware.
Simulation
Micro-architectural simulation: SimpleScalar[5], Wattch[4]
Functional simulation: SimOS[11], Simics[12]
Increased simulation efficiency: Dynamic binary translation[3]
Full system simulation: RSIM[14], SimOS[11], QEMU[6], Simics[12], Mambo[2]
Coremu[16]
This paper: Implement a prototype parallel emulator based on the widely used QEMU
PQEMU: A Parallel System Emulator Based on QEMU
Propose a novel design of a multi-threaded
QEMU, called PQEMU.
Unified code cache design
Separate code cache design
QEMU work for Multi-core guest
[Figure: guest processors G0-G3 share one QEMU thread on the host machine (T0) in Round-Robin order; the host OS scheduler maps threads to physical cores P0-P3.]
PQEMU work for Multi-core guest
[Figure: each guest processor G0-G3 has its own thread on the host machine (T0-T3); the host OS scheduler maps the threads to physical cores P0-P3.]
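The difference between the two mappings can be sketched in a few lines. This is only an illustration of the thread layout, not the emulator's code: the function names are hypothetical, the "emulation step" is just a trace entry, and Python threads will not show a real speedup because of the interpreter lock.

```python
import threading

def run_round_robin(guest_cpus, steps):
    """QEMU-style mapping: one host thread visits every guest CPU in turn."""
    trace = []
    for _ in range(steps):
        for cpu in guest_cpus:  # Round-Robin over G0..Gn on the same thread
            trace.append((threading.current_thread().name, cpu))
    return trace

def run_one_thread_per_cpu(guest_cpus, steps):
    """PQEMU-style mapping: each guest CPU is driven by its own host thread."""
    trace = []
    trace_lock = threading.Lock()  # protects the shared trace list only

    def emulate(cpu):
        for _ in range(steps):
            with trace_lock:
                trace.append((threading.current_thread().name, cpu))

    threads = [
        threading.Thread(target=emulate, args=(cpu,), name=f"T{i}")
        for i, cpu in enumerate(guest_cpus)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return trace
```

In the first function every trace entry carries the same thread name; in the second, G0 is only ever paired with T0, G1 with T1, and so on, which is what lets the host OS scheduler spread the work over several physical cores.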
[Figure: QEMU architecture. Labels: IO thread (Keystroke receive, Screen update); CPU 0, 1 with states CPU Idle, Find, Invalidate, Build, Chain, Execute, Restore, Help Function, Flush and Exception/Interrupt Check; Code Cache; Soft MMU; Memory (SDRAM RAM Block, FLASH RAM Block); IO. Interrupt notification: Unchain.]
[Figure: PQEMU architecture. Labels: IO thread, I/O Device Model, Alarm signal, Emulation threads group (Emulation thread #0: CPU 0, Emulation thread #1: CPU 1), Unified Code Cache, Memory, IO.]
Translated Block(TB)
A unit of basic block
Two architecture states
Emulator executes in the code cache (dark grey boxes).
Emulation manager (white boxes).
TCG translation engine(TCG):
It is the binary translation engine in the system emulator.
Code Cache(CC):
The storage space for TB output after Build.
TB Descriptor(TBD):
It holds the meta-information of a TB in the code cache.
TB Descriptor Array(TBDA):
Simplifies the management of TB descriptors.
TB Hash Table(TBHT):
It is the central hash table, keyed by guest PC value, that Find Slow searches after Find Fast fails. Every in-use TBD has an index in this hash table that refers to it.
TB Descriptor Pointer(TBDP):
It is a field private to each guest core that holds the index of recently used TBDs (duplicated from the hash table above).
Memory Page Descriptor(MPD):
Accelerates the detection of guest SMC (self-modifying code) activity.
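A minimal sketch of how Find Fast, Find Slow and Build could fit together, assuming the per-core TBDP acts as a private cache in front of the shared TBHT. The class and field names are hypothetical, plain dictionaries stand in for the real tables, and locking is left out.

```python
class TranslationCache:
    """Toy model of the TB lookup path; not the emulator's data layout."""

    def __init__(self, num_cpus):
        self.tbht = {}                                 # shared: guest PC -> TB descriptor
        self.tbdp = [dict() for _ in range(num_cpus)]  # private per guest core
        self.builds = 0                                # how many translations were made

    def build(self, pc):
        # Stand-in for translation: create a descriptor and publish it.
        self.builds += 1
        tbd = {"pc": pc, "code": f"host code for {pc:#x}"}
        self.tbht[pc] = tbd
        return tbd

    def find(self, cpu, pc):
        tbd = self.tbdp[cpu].get(pc)   # Find Fast: private, recently used TBDs
        if tbd is not None:
            return tbd, "fast"
        tbd = self.tbht.get(pc)        # Find Slow: central hash table
        how = "slow"
        if tbd is None:
            tbd = self.build(pc)       # miss in both tables: translate
            how = "build"
        self.tbdp[cpu][pc] = tbd       # remember the result for the next Find Fast
        return tbd, how
```

With two guest cores, the first lookup of a PC on core 0 ends in Build, a repeat on core 0 is served by Find Fast, and the first lookup of the same PC on core 1 is served by Find Slow without a second translation.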
Independent: the two states never use the same shared component. ex: Find Slow and Restore.
Synchronous: a component is shared among all emulation threads. ex: Restore and Build.
Dependent: a component is shared according to the table, but no simultaneous access would happen in practice. ex: Build with Chain/Unchain/Execute.
Four independent sets:
Construct = { Find Fast, Find Slow, Build and Restore }
Link = { Chain, Unchain }
Use = { Execute }
Destruct = { Flush, Invalidate }
Two rules:
1. Any two states that live in the same set must run sequentially, except pure read operations like Find Fast, Find Slow and Execute; states in different sets can run in parallel.
2. Destruct requires exclusive access for efficiency reasons, since its states modify most of the shared components all at once.
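The four sets and two rules can be written down as a small predicate. This is one reading of the slide, stated as an assumption: the pure-read exception is taken to apply only when both states are pure reads, and the function name is hypothetical.

```python
SETS = {
    "Construct": {"Find Fast", "Find Slow", "Build", "Restore"},
    "Link": {"Chain", "Unchain"},
    "Use": {"Execute"},
    "Destruct": {"Flush", "Invalidate"},
}
PURE_READS = {"Find Fast", "Find Slow", "Execute"}

def set_of(state):
    """Return the name of the set a state belongs to."""
    for name, members in SETS.items():
        if state in members:
            return name
    raise ValueError(f"unknown state: {state}")

def may_run_in_parallel(a, b):
    """Apply the two rules to a pair of states on different emulation threads."""
    set_a, set_b = set_of(a), set_of(b)
    # Rule 2: Destruct needs exclusive access, so nothing overlaps with it.
    if "Destruct" in (set_a, set_b):
        return False
    # Rule 1: same set means sequential, unless both are pure reads (assumed reading).
    if set_a == set_b:
        return a in PURE_READS and b in PURE_READS
    # Different sets may overlap.
    return True
```

For example, two threads may both be in Execute, and Build may overlap with Chain, but two Builds are serialized and Flush excludes everything else.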
Deploy locks only at state combinations marked Synchronous:
Exclusive_rwlock, Build_lock and Chain_lock
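One way the three locks could be combined is sketched below. The slide only names the locks, so the roles are assumptions: Exclusive_rwlock is held in shared mode by ordinary states and in exclusive mode by Destruct, Build_lock serializes Build, and Chain_lock serializes Chain/Unchain. The `RWLock` class is a simple hand-written one (the Python standard library has none) and does not prevent writer starvation.

```python
import threading
from contextlib import contextmanager

class RWLock:
    """Minimal reader-writer lock; many shared holders or one exclusive holder."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    @contextmanager
    def shared(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                if self._readers == 0:
                    self._cond.notify_all()

    @contextmanager
    def exclusive(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True
        try:
            yield
        finally:
            with self._cond:
                self._writer = False
                self._cond.notify_all()

exclusive_rwlock = RWLock()
build_lock = threading.Lock()
chain_lock = threading.Lock()
code_cache = {}   # guest PC -> translated block (stand-in)

def build(pc):
    # Shared mode keeps Flush out; build_lock serializes concurrent Builds.
    with exclusive_rwlock.shared(), build_lock:
        code_cache.setdefault(pc, f"tb@{pc:#x}")

def chain(src, dst, links):
    # Chain and Unchain would both take chain_lock.
    with exclusive_rwlock.shared(), chain_lock:
        links[src] = dst

def flush(links):
    # Destruct: wait until no other state is active, then wipe everything.
    with exclusive_rwlock.exclusive():
        code_cache.clear()
        links.clear()
```

Build and Chain use different locks, so they can overlap with each other, while Flush waits for every shared holder to leave before it modifies the shared components.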
Independent relationship between Find Slow and Build, as follows:
Construct = { Build and Restore }
Search = { Find Fast and Find Slow }
Revise rule 1:
Any two states that live in the same set must run sequentially, except Search; states in different sets can run in parallel.
The optimization also introduces a redundancy problem:
It induces memory waste but does not impact correctness.
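A small, single-threaded replay of one possible interleaving illustrates the redundancy problem. It assumes the redundancy meant here is that two emulation threads can both miss in Find Slow and then both Build the same guest PC; the names and the append-only cache are hypothetical.

```python
code_cache = []   # append-only storage for translated blocks
tbht = {}         # guest PC -> index into code_cache

def find_slow(pc):
    """Look up a guest PC in the shared hash table; None means a miss."""
    return tbht.get(pc)

def build(pc):
    """Translate the block at pc, store it, and publish it in the hash table."""
    code_cache.append(f"host code for {pc:#x}")
    tbht[pc] = len(code_cache) - 1
    return tbht[pc]

# Interleaving: both threads search before either one has built the block.
miss_a = find_slow(0x8000)   # thread A misses
miss_b = find_slow(0x8000)   # thread B misses as well
tb_a = build(0x8000) if miss_a is None else miss_a
tb_b = build(0x8000) if miss_b is None else miss_b
```

After this sequence the code cache holds two identical translations of the same block and the hash table points at the later one. Either copy executes the guest code correctly, so the only cost is the wasted space.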
Duplicates all shared components for every
emulation thread except MPD.
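The separate code cache layout can be pictured as follows. This is a sketch under assumptions: the class and field names are hypothetical, a 4096-byte guest page is assumed, and synchronization on the shared MPD is omitted.

```python
class SeparateCodeCache:
    """Toy model: per-thread code cache and hash table, one shared MPD."""

    def __init__(self, num_threads):
        # Every emulation thread owns private copies of the translation tables.
        self.per_thread = [
            {"code_cache": {}, "tbht": {}} for _ in range(num_threads)
        ]
        # The MPD stays shared: guest page -> PCs translated from that page.
        self.mpd = {}

    def build(self, thread, pc, page_size=4096):
        private = self.per_thread[thread]
        private["code_cache"][pc] = f"host code for {pc:#x}"
        private["tbht"][pc] = pc
        self.mpd.setdefault(pc // page_size, set()).add(pc)

    def invalidate(self, page):
        """Drop every translation from a guest page; returns copies removed."""
        removed = 0
        for pc in self.mpd.pop(page, set()):
            for private in self.per_thread:   # must visit every private copy
                if private["tbht"].pop(pc, None) is not None:
                    del private["code_cache"][pc]
                    removed += 1
        return removed
```

Build touches only the calling thread's tables, so no lock is needed there, but a block that all four threads have translated costs four removals when its page is invalidated.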
[Figure: I/O handling over Time for a Single core guest (CPU, IO) and a Multi-cores guest (CPU 0, CPU 1, IO), shown for the Original design and for the parallel I/O system; numbered events 1-5.]
Run benchmark on various emulation designs
and compare with baseline QEMU.
P-UCC
P-UCC+IO
P-UCC+IO+FS
P-SCC
P-SCC+IO
Coremu
One working thread
On average, a 5~10% slowdown compared with baseline QEMU.
Four working threads
P-UCC+IO+FS achieves a 3.72x speedup compared with baseline QEMU.
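To put the 3.72x figure in context, the speedup can be converted to parallel efficiency, assuming the four working threads each run on their own host core (the helper name is hypothetical):

```python
def parallel_efficiency(speedup, workers):
    """Fraction of the ideal linear speedup that was actually achieved."""
    return speedup / workers

# 3.72x on four working threads
print(f"{parallel_efficiency(3.72, 4):.0%}")  # prints 93%
```

An efficiency of about 93% means only a small share of the ideal four-fold gain is lost to synchronization and other overhead.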
SCC needs more memory space and translation time, but it
eliminates most synchronization.
Invalidate in SCC incurs more overhead, because the update
has to be applied to all duplicated shared components.
Latency of guest interrupts in UCC is slightly worse than in
SCC, because of the contention for TB chaining and
unchaining.
SCC may be too costly in terms of memory overhead
when emulating a many-core guest, but is best for running
parallel applications with massive code sharing.