Paper Report - National Sun Yat


HQEMU: A Multi-Threaded and Retargetable Dynamic Binary Translator on Multicores
Cite count: 7
Presenter: Zong-Ze Huang
Authors: Ding-Yong Hong, Chun-Chen Hsu, Pen-Chung Yew, Jan-Jan Wu, Wei-Chung Hsu, Pangfeng Liu, Chien-Min Wang, Yeh-Ching Chung
Proceedings of the Tenth International Symposium on Code Generation and Optimization (CGO), 2012

Dynamic binary translation (DBT) is a core technology to many
important applications such as system virtualization, dynamic binary
instrumentation and security. However, there are several factors that
often impede its performance: (1) emulation overhead before
translation; (2) translation and optimization overhead, and (3)
translated code quality. On the dynamic binary translator itself, the
issues also include its retargetability to support guest applications from
different instruction-set architectures (ISAs) to host machines also
with different ISAs, an important feature for system virtualization.

In this work, we take advantage of ubiquitous multicore platforms, using a multithreaded approach to implement the DBT. By running the translators and the dynamic binary optimizers on different threads on different cores, we can off-load the DBT overhead from the target applications, and thus afford the DBT more sophisticated optimization techniques as well as support for retargetability.

Using QEMU (a popular retargetable DBT for system virtualization)
and LLVM (Low Level Virtual Machine) as our building blocks, we
demonstrated in a multi-threaded DBT prototype, called HQEMU,
that it could improve QEMU performance by a factor of 2.4X and
4X on the SPEC 2006 integer and floating point benchmarks for x86
to x86-64 emulations, respectively, i.e. it is only 2.5X and 2.1X
slower than native execution of the same benchmarks on x86-64, as
opposed to 6X and 8.4X slowdown on QEMU. For ARM to x86-64
emulation, HQEMU could gain a factor of 2.4X speedup over
QEMU for the SPEC 2006 integer benchmarks.

Three factors often impede DBT performance:
1. Emulation overhead before translation
2. Translation and optimization overhead
3. Translated code quality

Proposed methods:
- Developed a multi-threaded retargetable DBT prototype, called HQEMU.
- Proposed a novel trace combination technique to improve the existing trace selection algorithm.
Related work:
- QEMU+LLVM [17]: can send only one trace at a time to LLVM; uses the TCG IR to achieve retargetability.
- COREMU [28]: maps multiple QEMU instances to multiple threads.
- PQEMU [11]: keeps one instance of QEMU but parallelizes it internally.
- NET algorithm [16]: used to choose the hot TBs.
- Sampling HPM profiles [19], and techniques proposed to improve the accuracy of HPM sampling profiles [7].

TCG (Tiny Code Generator) is QEMU's core translation engine, which provides a small set of IR operations (about 142 opcodes). It is a generic backend for a C compiler and is used in QEMU.

Flow: Guest code --(translation)--> TCG IR --(TCG compile)--> Host code

Advantages:
- Fast translation (compared with LLVM).
- Some code optimization, e.g. dead code elimination (compared with Dyngen).

Defects:
- Low code quality (compared with LLVM).
- Without further optimizations, there are often many redundant load and store operations left in the generated host code.

The goal is to design a DBT that not only emits high-quality host code but also exerts low overhead on the running application.

Two translators are designed for different purposes:
- The TCG translator, for fast translation.
- The LLVM translator, for generating high-quality host code.

When the LLVM optimizer receives an optimization request from the FIFO queue, it converts the TCG IRs to LLVM IRs. This design is retargetable and simplifies the backend translator tremendously.

Multi-threading is used to hide the optimization overhead: the LLVM translator runs on another thread without interfering with the execution of the program.
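The decoupling just described can be sketched as a simple producer/consumer pair: the emulation thread pushes optimization requests into a FIFO queue, while a separate optimizer thread drains it, so optimization never blocks emulation. A minimal Python sketch under that assumption — the request format, `optimizer_worker`, and the string stand-in for generated code are hypothetical, not HQEMU's actual interfaces:

```python
import queue
import threading

# FIFO queue carrying optimization requests from the emulation
# thread to the optimizer thread (names are illustrative).
requests = queue.Queue()
optimized = {}  # trace head -> "optimized code" (stand-in)

def optimizer_worker():
    """Drain requests and 'optimize' them off the critical path."""
    while True:
        trace = requests.get()
        if trace is None:          # shutdown sentinel
            break
        # Stand-in for converting TCG IR -> LLVM IR and optimizing.
        optimized[trace[0]] = f"optimized({'->'.join(trace)})"
        requests.task_done()

worker = threading.Thread(target=optimizer_worker)
worker.start()

# The emulation thread keeps running; it only enqueues hot traces.
requests.put(["TB1", "TB2", "TB3"])
requests.put(["TB7"])
requests.join()                    # wait here only for demonstration
requests.put(None)
worker.join()
print(optimized["TB1"])            # optimized(TB1->TB2->TB3)
```

In HQEMU the consumer performs the TCG-to-LLVM conversion and LLVM optimization passes; here a formatted string stands in for the emitted host code.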

Problem definition:
- The binary translator needs to save and restore program contexts when switching from one TB to another, because of different register mappings.
  - This happens even if two TBs have a direct transition path (e.g. through block chaining) and also have the same guest-to-host register mappings.
- (Figure: each of TB1, TB2, and TB3 loads the CPU state on entry and saves it on exit.)
- This frequent storing and reloading of registers leads to poor performance.

Proposed method:
- Merge many small TBs into larger ones, called traces.
  - Eliminate the redundant load and store operations by promoting such memory operations to register accesses within traces.
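The effect of trace formation on those redundant state transfers can be illustrated with a toy model in which each TB, taken alone, must load the CPU state on entry and save it on exit; merging TBs into a trace drops the interior save/load pairs. This is a simplified sketch, not HQEMU's IR — `tb_code` and `form_trace` are hypothetical helpers:

```python
# Toy model: a TB in isolation loads the guest CPU state on entry
# and saves it on exit.  Inside a merged trace, the interior
# save/load pairs are redundant and can be dropped, keeping guest
# registers in host registers across the whole trace.

def tb_code(name):
    return [("load", "cpu_state"), ("body", name), ("save", "cpu_state")]

def form_trace(tbs):
    trace = []
    for i, tb in enumerate(tbs):
        ops = tb_code(tb)
        if i > 0:
            ops = ops[1:]        # drop interior load
        if i < len(tbs) - 1:
            ops = ops[:-1]       # drop interior save
        trace.extend(ops)
    return trace

trace = form_trace(["TB1", "TB2", "TB3"])
mem_ops = [op for op in trace if op[0] in ("load", "save")]
# Three isolated TBs would perform 6 state transfers; the trace keeps 2.
print(len(mem_ops))  # 2
```

With three TBs the six boundary state transfers shrink to one load at trace entry and one save at trace exit, which is the promotion of memory operations to register accesses that the slide describes.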

If the dispatcher's directory lookup hits, the basic block has been translated before and a cyclic execution path has been found. Then:
1. The profiling routine is enabled to count each time this block is executed.
2. The prediction routine is enabled to record the head block into the recording list.
3. A direct jump is patched to redirect execution to the optimized code.
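The three steps can be sketched as follows. The `directory`, `on_branch`, and `patched` names are illustrative stand-ins rather than QEMU data structures; the profiling threshold of 50 is taken from the evaluation setup described later:

```python
# Sketch of hot-code detection: a directory of already-translated
# blocks, a per-head profile counter, and a patched direct jump once
# the counter crosses the threshold.  (Illustrative, not QEMU code.)

THRESHOLD = 50                    # trace profiling threshold from the setup

directory = {}                    # block pc -> translated code (stand-in)
profile_counts = {}
recording_list = []
patched = set()

def on_branch(head_pc):
    """Called when the dispatcher looks up a branch target."""
    if head_pc not in directory:
        directory[head_pc] = f"translated({head_pc:#x})"   # cold: translate
        return
    # Directory hit: a cyclic execution path through head_pc was found.
    profile_counts[head_pc] = profile_counts.get(head_pc, 0) + 1  # (1) profile
    if profile_counts[head_pc] == THRESHOLD:
        recording_list.append(head_pc)     # (2) record head block
        patched.add(head_pc)               # (3) patch direct jump to opt. code

for _ in range(THRESHOLD + 1):
    on_branch(0x400100)

print(0x400100 in patched)   # True
```

The first lookup misses and triggers fast TCG translation; subsequent hits bump the counter until the block head is recorded and the jump is patched.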

Problem definition:
- Trace optimization can only handle traces that are either a straight path or a simple loop; it cannot deal with a more complex control-flow graph (CFG).

Proposed method: trace merging
- Force the merging of problematic traces that frequently jump among themselves.
- Use a feedback-directed approach, with the help of the on-chip hardware performance monitor (HPM), to perform trace merging.

A trace has to meet three criteria to be considered a hot trace:
1. The trace is in a stable state.
   - Assume 100 traces and collect the most recent N = 10 sampling intervals in a circular queue.
   - A trace is considered to be in a stable state if it appears in all entries of the circular queue.
2. The count of the trace must be greater than a threshold.
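The stability criterion can be sketched with a bounded deque acting as the circular queue of the N = 10 most recent sampling intervals; a trace counts as stable only once it appears in every entry. A minimal sketch (illustrative, not HQEMU's profiler; `record_interval` and `is_stable` are hypothetical names):

```python
from collections import deque

# Circular queue of the N most recent sampling intervals; a bounded
# deque automatically evicts the oldest interval when full.
N = 10
intervals = deque(maxlen=N)

def record_interval(sampled_traces):
    """Store the set of traces sampled during one interval."""
    intervals.append(set(sampled_traces))

def is_stable(trace):
    """Stable = queue is full and the trace appears in every entry."""
    return len(intervals) == N and all(trace in s for s in intervals)

# T1 is sampled in every interval, T2 only in every other one.
for i in range(12):               # 12 intervals; only the last 10 are kept
    sampled = {"T1"} | ({"T2"} if i % 2 == 0 else set())
    record_interval(sampled)

print(is_stable("T1"), is_stable("T2"))  # True False
```

The count criterion would then be a second check against the trace's sample count before the dynamic optimizer considers it hot.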

Evaluation questions:
- HQEMU performance compared with QEMU.
- Multi-threaded HQEMU compared with single-threaded HQEMU.
- How many memory operations are reduced by trace formation and trace merging.
- The overhead of trace formation.

Host platform:
- 3.3 GHz quad-core Intel Core i7 processor
- 12 GBytes main memory
- 64-bit Gentoo Linux with kernel version 2.6.30

Target platforms:
- Two different ISAs, ARM and x86

LLVM version 2.8 is used.

The SPEC2006 benchmark suite is tested.
- The trace profiling threshold is set to 50, and the maximum length of a trace is 16 TBs.
- The trace-merging parameter in the dynamic optimizer is set to 8.

Four configurations are used to evaluate the effectiveness of HQEMU:
- QEMU: QEMU version 0.13 with the fast TCG translator.
- LLVM: the same modules as QEMU, except that the TCG translator is replaced by the LLVM translator.
- HQEMU-S: the single-threaded HQEMU, with the TCG and LLVM translators running on the same thread.
- HQEMU-M: the multi-threaded HQEMU, with the TCG and LLVM translators running on separate threads.

For the SPEC2006 CINT benchmarks with the test input set, HQEMU-M is faster than both the QEMU and the LLVM configurations.
- The average slowdown over native execution is 7.7X for QEMU, 12.8X for LLVM, and 4.X for HQEMU-M.

For the SPEC2006 CFP benchmarks with the test input set, HQEMU-M is again faster than both the QEMU and the LLVM configurations.
- The average slowdowns of QEMU and LLVM are both 9.95X, while HQEMU-M is only 3.3X.

From the four benchmarks that incur heavy translation, we can see that the proposed multi-threaded HQEMU is beneficial compared with the single-threaded HQEMU.

(Figures: (A) CINT (test input), (B) CFP (test input))

With the reference input set, programs spend much more time running in the optimized code caches, so the optimization overhead is largely amortized.
- The LLVM configuration outperforms QEMU.
- HQEMU shows a significant improvement over both QEMU and LLVM.

(Figures: (C) CINT (Ref input), (D) CFP (Ref input))

Trace formation and trace merging can greatly reduce the number of memory operations.

The translation time represents the time spent on trace generation by the thread of the LLVM translator. As the table shows, most benchmarks spend less than 1% of total time conducting trace translation.

Conclusion:
- The multi-threaded QEMU+LLVM hybrid (HQEMU) approach can achieve low translation overhead and good translated-code quality on the target binary applications.
- The proposed novel trace merging technique removes redundant memory operations.

My Comment
- This paper showed me additional methods for improving QEMU's performance.
- The experimental evaluation is very detailed.