Understanding software that doesn*t want to be understood
Download
Report
Transcript Understanding software that doesn*t want to be understood
UNDERSTANDING SOFTWARE THAT
DOESN’T WANT TO BE UNDERSTOOD
REVERSE ENGINEERING OBFUSCATED BINARIES
Saumya Debray
The University of Arizona
Tucson, AZ 85721
The Problem
Rapid analysis and understanding of malware
code essential for swift response to new threats
‒ Malicious software are usually heavily obfuscated
against analysis
Existing approaches to reverse engineering such
code are primitive
‒ not a lot of high-level tool support
‒ requires a lot of manual intervention
‒ slow, cumbersome, potentially error-prone
Delays development of countermeasures
Goals
Develop automated techniques for analysis and
reverse engineering of obfuscated binaries
semantics-based
‒ output is functionally equivalent to, but simpler than,
the input program
generality
‒ should work on any obfuscation
even ones we haven’t thought of yet!
‒ should minimize assumptions about obfuscations
Challenges
can’t make assumptions about obfuscations
‒ what do we leverage for deobfuscation?
‒ distinguishing code we care about from code we don’t
how do we know which instructions we care about?
scale
‒ “needle in haystack”
no. of instructions executed increases by 270x (VMprotect) to
4300x (Themida) [Lau 2008]
anti-analysis defenses
‒ runtime unpacking
‒ anti-emulation, anti-debug checks
Our Approach
no obfuscation-specific assumptions
‒ treat programs as input-to-output transformations
‒ use semantics-preserving transformations to simplify
execution traces
Taint analysis
(bit-level)
map flow of values
from input to output
Semanticspreserving
transformations
simplify logic of
input-to-output
transformation
Control flow
reconstruction
reconstruct logic of
simplified computation
control flow
graph
input
program
dynamic analysis to handle runtime unpacking
Ex 1:Emulation-based Obfuscation
bytecode logic (data)
input program
random seed
Obfuscator
mutation
engine
emulator (code)
examination of the code reveals only the
emulator’s logic
‒ actual program logic embedded in byte code
lots of “chaff” during execution
‒ separating emulator logic from payload logic tricky
emulators can be nested
Ex 2:Return-Oriented Programs (ROP)
Originally designed to bypass anti-code-injection
defenses
‒ stitches together existing code fragments ( “gadgets” ),
e.g., in system libraries
Logic can be difficult to discern
‒ gadgets are typically scattered across many different
functions and/or libraries
‒ gadgets can overlap in memory in weird ways
‒ control flow structures (if-else, loops, function calls) are
typically implemented using non-standard idioms
Example 1 (emulation-obfuscation)
factorial (Themida)
Example 2 (ROP)
factorial
o
original
ROP
Interactions between Obfuscations
Example: Unpacking + Emulation
unpack
input
output
output
instructions “tainted” as propagating
values from input to output
input-to-output
computation
(further simplified)
used to construct control flow graph
input
unpack
execution trace
Results
Ex. 1. binary search : Themida
original
obfuscated (cropped)
deobfuscated
Results
Ex. 2. Hunatcha (drive infection code) : ExeCryptor
original
obfuscated (cropped)
deobfuscated
Results
Ex. 3. fibonacci: ROP
original
obfuscated
deobfuscated
Results
Ex. 4. Win32/Kryptik.OHY: Code Virtualizer
obfuscated
multiple layers of runtime
code generation
unpacking code
the CFG shown materializes
incrementally
initial unpacker is
emulation-obfuscated
deobfuscated
Results: CFG Similarity
100
Similarity with original program (%)
90
80
70
60
50
OBFUSCATED
DEOBFUSCATED
40
30
20
10
0
Programs
Lessons and Issues
Static vs. dynamic analysis
‒ multiple layers of runtime code generation/unpacking
limits utility of static analysis
‒ dynamic analysis can run into problems of scale
O(n2) algorithms impractical ; even O(n log n) can be problematic
trade memory space for execution time/complexity
code coverage — multi-path exploration?
Taint propagation
‒ byte/word-level analyses may not be precise enough
we use (enhanced) bit-level taint propagation
Simplified trace → CFG: NP-hard
‒ semantic considerations?
Conclusions
Rapid analysis and understanding of malware
code essential for swift response to new threats
‒ need to deal with advanced code obfuscations
‒ obfuscation-specific solutions tend to be fragile
We describe a semantics-based framework for
automatic code deobfuscation
‒ no assumptions about the obfuscation(s) used
‒ promising results on obfuscators (e.g., Themida) not
handled by prior research
ADDITIONAL MATERIAL
Semantics-based simplification
Quasi-invariant locations: locations that have the
same value at each use.
Our transformations (currently):
‒ Arithmetic simplification
adaptation of constant folding to execution traces
consider quasi-invariant locations as constants
controlled to avoid over-simplification
‒ Data movement simplification
use pattern-driven rules to identify and simplify data movement.
‒ Dead code elimination
need to consider implicit destinations, e.g., condition code flags.