Paper Report - National Sun Yat

Presenter: Zong Ze - Huang
Jiun-Hung Ding, Po-Chun Chang, Wei-Chung Hsu, Yeh-Ching Chung
2011 IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS)
2015/7/20

A full system emulator, such as QEMU, can provide a versatile
virtual platform for software development. However, most current
system simulators do not have sufficient support for multi-processor
emulations to effectively utilize the underlying parallelism presented
by today’s multi-core processors. In this paper, we focus on
parallelizing a system emulator and implement a prototype parallel
emulator based on the widely used QEMU. Using this parallel
QEMU, emulating an ARM11MPCore platform on a quad-core Intel
i7 machine with the SPLASH-2 benchmarks, we have achieved
3.8x speedup over the original QEMU design.

Current design of QEMU is only suitable for single-core
processor emulation.

When executing a multi-threaded application on a multi-core machine, QEMU
emulates the execution of the application serially and
cannot take advantage of the parallelism available in the
application and the underlying hardware.
Simulation (related work):
Micro-architectural simulation: SimpleScalar[5], Wattch[4]
Functional simulation: SimOS[11], Simics[12]
Full system simulation: RSIM[14], SimOS[11], QEMU[6], Simics[12], Mambo[2]
Increased simulation efficiency: dynamic binary translation[3]
Coremu[16]
This paper: implement a prototype parallel emulator based on the widely used QEMU
PQEMU: A Parallel System Emulator Based on QEMU

Propose a novel design of a multi-threaded
QEMU, called PQEMU, in two variants:
Unified code cache design
Separate code cache design

[Figure: QEMU vs. PQEMU for a multi-core guest. In QEMU, guest processors G0-G3 are emulated round-robin by a single host thread T0, which the host OS scheduler places on the physical cores P0-P3. In PQEMU, each guest processor G0-G3 has its own host thread T0-T3, which the host OS scheduler maps onto the physical cores P0-P3.]
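The contrast in the figure can be sketched as follows. This is only an illustration: the function names and the trace format are hypothetical, and Python threads stand in for host threads without providing real parallel speedup.

```python
import threading

def run_round_robin(guest_cpus, steps):
    """QEMU-style: a single host thread steps every guest CPU in turn."""
    trace = []
    for _ in range(steps):
        for cpu in guest_cpus:
            trace.append((threading.current_thread().name, cpu))
    return trace

def run_thread_per_cpu(guest_cpus, steps):
    """PQEMU-style: one host thread per guest CPU; the host OS schedules them."""
    trace = []
    trace_lock = threading.Lock()

    def emulate(cpu):
        for _ in range(steps):
            with trace_lock:
                trace.append((threading.current_thread().name, cpu))

    threads = [
        threading.Thread(target=emulate, args=(cpu,), name=f"T{i}")
        for i, cpu in enumerate(guest_cpus)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return trace

guests = ["G0", "G1", "G2", "G3"]
serial_trace = run_round_robin(guests, steps=2)
parallel_trace = run_thread_per_cpu(guests, steps=2)
```

In the serial trace every guest step carries the same host thread name; in the second trace each guest processor is always stepped by its own thread.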
[Figure: emulator architecture and execution states. Labels: IO thread (keystroke receive, screen update, interrupt notification: Unchain); CPU 0, 1 with states CPU Idle, Find, Build, Chain, Execute, Restore, Invalidate, Flush and Help Function; Code Cache; Soft MMU; Exception/Interrupt Check; Memory (SDRAM, FLASH, RAM Block); IO.]
[Figure: PQEMU with a unified code cache. Labels: I/O Device Model; IO thread; Alarm signal; Emulation threads group (Emulation thread #0: CPU 0, Emulation thread #1: CPU 1); Unified Code Cache; Memory; IO.]

Translated Block (TB)
A unit of basic block.

Two architecture states
Emulator executes in the code cache (dark grey boxes).
Emulation manager (white boxes).

TCG translation engine (TCG):
It is the binary translation engine in the system emulator.

Code Cache (CC):
The storage space for TB output after Build.

TB Descriptor (TBD):
It holds the meta-information of a TB in the code cache.

TB Descriptor Array (TBDA):
It simplifies the management of TB descriptors.

TB Hash Table (TBHT):
It is the central hash table, keyed by guest PC value, that Find Slow
searches after Find Fast fails. Every in-use TBD is referenced by an
index in this hash table.

TB Descriptor Pointer (TBDP):
It is a field private to each guest core that holds the index of a
recently used TBD (duplicated from the TB Hash Table).

Memory Page Descriptor (MPD):
It accelerates the detection of guest self-modifying code (SMC) activity.
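A minimal sketch of how these components could cooperate during lookup, assuming Find Fast consults the per-core TBDP, Find Slow consults the shared TBHT, and Build runs only when both miss. The class names, the dictionary-based tables and the `lookup` helper are hypothetical; they are not taken from QEMU's source.

```python
from dataclasses import dataclass, field

@dataclass
class TBDescriptor:
    guest_pc: int
    host_code: str  # stand-in for translated code stored in the code cache

@dataclass
class GuestCPU:
    # TBDP: private to each guest core, guest PC -> index into the TBDA
    tbdp: dict = field(default_factory=dict)

class Emulator:
    def __init__(self):
        self.tbda = []   # TB Descriptor Array
        self.tbht = {}   # TB Hash Table: guest PC -> index into the TBDA
        self.builds = 0

    def build(self, pc):
        """Translate a block and publish its descriptor in the TBHT."""
        self.builds += 1
        self.tbda.append(TBDescriptor(pc, f"host code for {pc:#x}"))
        index = len(self.tbda) - 1
        self.tbht[pc] = index
        return index

    def lookup(self, cpu, pc):
        index = cpu.tbdp.get(pc)           # Find Fast: private, no sharing
        how = "fast"
        if index is None:
            index = self.tbht.get(pc)      # Find Slow: shared hash table
            how = "slow"
            if index is None:
                index = self.build(pc)     # Build: translate on a full miss
                how = "build"
            cpu.tbdp[pc] = index           # duplicate the index into the TBDP
        return how, self.tbda[index]

emu = Emulator()
cpu0, cpu1 = GuestCPU(), GuestCPU()
first = emu.lookup(cpu0, 0x1000)[0]    # full miss on cpu0
second = emu.lookup(cpu0, 0x1000)[0]   # hit in cpu0's TBDP
third = emu.lookup(cpu1, 0x1000)[0]    # cpu1 misses privately, hits the TBHT
```

In this sketch the second guest core reuses the block that the first one built, which is the sharing a unified code cache relies on.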

Independent: the two states never use the same shared component.
Ex: Find Slow and Restore.

Synchronous: a component is shared among all emulation threads.
Ex: Restore and Build.

Dependent: a component is shared, but no simultaneous access
happens in practice. Ex: Build with Chain/Unchain/Execute.

Four independent sets:
Construct = { Find Fast, Find Slow, Build, Restore }
Link = { Chain, Unchain }
Use = { Execute }
Destruct = { Flush, Invalidate }

Two rules:
1. Any two states in the same set must run sequentially,
except pure read operations such as Find Fast, Find
Slow and Execute; states in different sets may run in parallel.
2. Destruct requires exclusive access for efficiency
reasons, since its states modify most of the shared
components all at once.
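The sets and rules above can be written as a small predicate. This is one reading of the slide, assuming the pure-read exception applies only when both states are pure reads; the function names are hypothetical.

```python
SETS = {
    "Construct": {"Find Fast", "Find Slow", "Build", "Restore"},
    "Link": {"Chain", "Unchain"},
    "Use": {"Execute"},
    "Destruct": {"Flush", "Invalidate"},
}
PURE_READS = {"Find Fast", "Find Slow", "Execute"}

def set_of(state):
    """Return the name of the set a state belongs to."""
    for name, members in SETS.items():
        if state in members:
            return name
    raise KeyError(state)

def may_run_in_parallel(a, b):
    """Apply the two rules to a pair of states on different emulation threads."""
    set_a, set_b = set_of(a), set_of(b)
    # Rule 2: Destruct needs exclusive access.
    if "Destruct" in (set_a, set_b):
        return False
    # Rule 1: same set means sequential, unless both are pure reads.
    if set_a == set_b:
        return a in PURE_READS and b in PURE_READS
    # States in different sets may run in parallel.
    return True
```

For example, two threads may both be in Execute, and Build may overlap with Execute, but two Builds or a Chain and an Unchain must be serialized.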

Deploy locks only at the state combinations that are Synchronous.
Locks used: Exclusive_rwlock, Build_lock and Chain_lock.
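One way the three locks might be arranged is sketched below. The slide only names the locks, so the mapping is an assumption: Build and Chain take the read side of the exclusive lock plus their own mutex, and Flush takes the write side. The `RWLock` class is a hypothetical helper, since Python's standard library has no readers-writer lock, and this simple version does not prevent writer starvation.

```python
import threading
from contextlib import contextmanager

class RWLock:
    """Minimal readers-writer lock (hypothetical helper for this sketch)."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    @contextmanager
    def shared(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                self._cond.notify_all()

    @contextmanager
    def exclusive(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True
        try:
            yield
        finally:
            with self._cond:
                self._writer = False
                self._cond.notify_all()

exclusive_rwlock = RWLock()
build_lock = threading.Lock()
chain_lock = threading.Lock()

code_cache = {}   # guest PC -> translated block (stand-in)
chains = set()    # (source PC, destination PC) links between blocks

def build(pc):
    # Builds serialize among themselves but may overlap with chaining.
    with exclusive_rwlock.shared(), build_lock:
        code_cache.setdefault(pc, f"tb@{pc:#x}")

def chain(src, dst):
    with exclusive_rwlock.shared(), chain_lock:
        if src in code_cache and dst in code_cache:
            chains.add((src, dst))

def flush():
    # Destruct: waits until no Build or Chain is in flight, then clears all.
    with exclusive_rwlock.exclusive():
        code_cache.clear()
        chains.clear()
```

Taking the read side in Build and Chain lets them proceed concurrently with each other while still letting Flush exclude both.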

Treat Find Slow and Build as independent by splitting the
Construct set as follows:
Construct = { Build, Restore }
Search = { Find Fast, Find Slow }

Revise rule 1:
Any two states in the same set must run sequentially,
except those in Search; states in different sets may run in parallel.

The optimization also introduces a redundancy problem:
it induces memory waste but does not impact correctness.
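The redundancy problem can be illustrated with a deterministic interleaving, assuming Find Slow takes no lock and the later Build simply overwrites the hash-table entry. The data structures and names here are hypothetical simplifications.

```python
code_cache = []   # append-only storage for translated blocks
tbht = {}         # guest PC -> index of the block currently published

def find_slow(pc):
    """Pure read of the shared hash table; takes no lock in this sketch."""
    return tbht.get(pc)

def build(pc):
    """Translate the block and publish it, replacing any earlier entry."""
    code_cache.append(f"tb for {pc:#x}")
    tbht[pc] = len(code_cache) - 1
    return tbht[pc]

# Interleaving: both emulation threads miss before either one builds.
miss_a = find_slow(0x2000) is None
miss_b = find_slow(0x2000) is None
idx_a = build(0x2000) if miss_a else find_slow(0x2000)
idx_b = build(0x2000) if miss_b else find_slow(0x2000)

wasted_blocks = len(code_cache) - len(tbht)
```

Both threads end up with an equivalent translation of the same guest block, so execution stays correct; the cost is one extra block occupying the code cache.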

Separate code cache design: duplicates all shared components
for every emulation thread except the MPD.
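A minimal sketch of this layout, assuming the shared MPD records which guest PCs were translated from each guest page and that invalidation walks every private copy. The class, method names and page size are hypothetical.

```python
class SeparateCodeCaches:
    def __init__(self, n_threads):
        # One private code cache per emulation thread: no sharing, no locks.
        self.caches = [dict() for _ in range(n_threads)]
        # The MPD stays shared: guest page -> guest PCs translated from it.
        self.mpd = {}

    def build(self, thread_id, pc, page_size=4096):
        self.caches[thread_id][pc] = f"tb@{pc:#x}"
        self.mpd.setdefault(pc // page_size, set()).add(pc)

    def invalidate_page(self, page):
        """Drop every translation of a modified page from every private cache."""
        dropped = 0
        for pc in self.mpd.pop(page, set()):
            for cache in self.caches:
                if cache.pop(pc, None) is not None:
                    dropped += 1
        return dropped

scc = SeparateCodeCaches(n_threads=4)
for thread_id in range(4):
    scc.build(thread_id, 0x1000)   # the same guest block, translated 4 times
copies_before = sum(len(cache) for cache in scc.caches)
dropped = scc.invalidate_page(0x1000 // 4096)
```

The same guest block is translated once per thread, and a single invalidation has to visit every duplicate, which matches the memory and Invalidate costs noted in the conclusions.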
[Figure: timelines of CPU and IO work under the original and the parallel I/O system, for single-core and multi-core guests (CPU 0, CPU 1, IO).]

Run benchmarks on various emulation designs
and compare with baseline QEMU:
P-UCC
P-UCC+IO
P-UCC+IO+FS
P-SCC
P-SCC+IO
Coremu
One working thread
On average, a 5~10% slowdown compared with baseline QEMU.

Four working threads
P-UCC+IO+FS achieves a 3.72x speedup compared with baseline QEMU.
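As a rough reading of the four-thread result, parallel efficiency is the speedup divided by the number of host cores. The sketch assumes the four working threads each occupy one core of the quad-core host mentioned in the abstract.

```python
speedup = 3.72    # P-UCC+IO+FS vs. baseline QEMU, from the slide
host_cores = 4    # quad-core host (assumption: one working thread per core)

efficiency = speedup / host_cores
print(f"parallel efficiency = {efficiency:.0%}")
```

Under that assumption the efficiency is about 93% of ideal linear scaling.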

SCC needs more memory space and translation time, but it
eliminates most synchronization.

Invalidate in SCC incurs more overhead, because the update
has to be applied to all duplicated shared components.

Guest interrupt latency in UCC is slightly worse than in
SCC, because of the contention for TB chaining and
unchaining.

SCC may be too costly in terms of memory overhead
when emulating a many-core system, but is best for running
parallel applications with massive code sharing.
