William Whistler REcon 2010 Who am I Long-time reverser Studied Computer Science at Oxford Enjoys reversing challenges ▪ T2 2007 winner, etc. Purpose.
Download
Report
Transcript William Whistler REcon 2010 Who am I Long-time reverser Studied Computer Science at Oxford Enjoys reversing challenges ▪ T2 2007 winner, etc. Purpose.
William Whistler
REcon 2010
Who am I
Long-time reverser
Studied Computer Science at Oxford
Enjoys reversing challenges
▪ T2 2007 winner, etc.
Purpose of this presentation
To describe the concepts behind a new application
for reversing called REvealer
Let’s model things like ‘real’ engineers
Allows us to ask the model questions
Include everything we discover
▪ Both through manual and automated techniques
Gradually move to higher-level concepts
But still allow drilling down to the exact detail
{define eax as tainted}
Step through,
instruction by
instruction
If the input to
something is
tainted, the output
is tainted
mov ebx, eax
{tainted: eax, ebx}
add edx, ecx
{tainted: eax, ebx}
mov eax, edx
{tainted: ebx}
add edx, ebx
{tainted: ebx, edx}
{define eax as tainted}
Used by
vulnerability
researchers to find
where data from
the network is used
But useful in other
reversing contexts
too
mov ebx, eax
{tainted: eax, ebx}
add edx, ecx
{tainted: eax, ebx}
mov eax, edx
{tainted: ebx}
add edx, ebx
{tainted: ebx, edx}
Dynamic, so deals with one execution path
No problems with loops/etc
Exact memory addresses are available at use
mov eax, dword ptr [esi+ecx]
Just look up esi and ecx to get the exact
memory address being read
Some complications, but still very useful
{eax tainted red} {ecx tainted green}
More than one taint
source
Analysis done
simultaneously
Locations can be
tainted by more
than one source
mov ebx, eax
{red: eax, ebx} {green: ecx}
add edx, ecx
{red: eax, ebx} {green: ecx, edx}
mov eax, edx
{red: ebx} {green: eax, ecx, edx}
add edx, ebx
{red: ebx, edx} {green: eax, ecx, edx}
Let’s define every external input to our code
as a taint source
CPU timestamp, file reads, network buffers,
API results…
Useful to see where values have come from
when looking at code for the first time
We want to deal with more than one
particular run through
Ideally all possible!
So we go static instead
But now we have some problems…
What does it access?
Indirect memory reads/writes
▪ mov eax, dword ptr [esi+ecx]
Where does it go?
Conditional jumps
▪ jz
Indirect jumps/calls
▪ jmp eax
Self-modifying code
Loops
Exceptions
100:
101:
102:
103:
104:
105:
106:
107:
108:
109:
test eax, eax
jz 103
…
test ecx, ecx
jz 106
…
test ebx, ebx
jz 109
…
…
100
101
102
103
104
105
106
107
103
104
106
107
105
106
107
106
107
Recombining paths
“OR” taint status of each location?
Store different possibilities somehow?
Let’s go with the former for now.
Which ones to recombine?
Obvious choice: by EIP?
100:
101:
102:
103:
104:
105:
106:
107:
108:
109:
test eax, eax
jz 103
…
test ecx, ecx
jz 106
…
test ebx, ebx
jz 109
…
…
100
101
102
103
104
105
106
107
But why be so strict?
Be user-guided, not fully automated
When is keeping paths separate useful?
Self-modifying code
Loop unrolling
Obfuscations
▪ Intrainstruction jumps
We want to ask what’s tainted at any point
the user chooses, not just termination
Repeating the whole analysis is slow
Let’s store what we learn along the way
Store all tainted sets at each point?
Store differences?
Store effects?
{eax tainted red} {ecx tainted green}
Store what changes
for each taint source
after each instruction
Query at any point
by working back up
Can either query for
the full set or a
subset of locations
mov ebx, eax
{red: +ebx} {green: no change}
add edx, ecx
{red: no change} {green: +edx}
mov eax, edx
{red: -eax} {green: +eax}
add edx, ebx
{red: +edx} {green: no change}
mov ebx, eax
If eax was tainted, ebx is now tainted
If eax was untainted, ebx is now untainted
add ebx, eax
If eax was tainted, ebx is now tainted
If eax was untainted, ebx remains as it was
Flags updated in a similar way
Uses the principle of “lazy evaluation”
If we only do specific queries, this is much
more efficient
Allows us to consider intermediary locations
as ‘internal’ taint sources
e.g. Ask what’s tainted by a function parameter
See what’s involved in calculating something,
not just the taint sources used
Plenty of practical uses
Where was this buffer encrypted?
Where was this checksum calculated?
Can even hide the instructions we’re not
interested in as a nice visualization
mov
mov
mov
mov
add
add
add
eax,
ebx,
ecx,
edx,
eax,
ecx,
eax,
100
200
300
400
ebx
edx
ecx
mov eax, 100
mov ebx, 200
add eax, ebx
eax
ecx
mov ecx, 300
mov edx, 400
add ecx, edx
mov eax,100
eax
mov ebx,200
mov ecx,300
ebx
ecx
add eax, ebx
mov
mov
mov
mov
add
add
add
eax,
ebx,
ecx,
edx,
eax,
ecx,
eax,
100
200
300
400
ebx
edx
ecx
mov edx,400
edx
add ecx, edx
ecx
eax
add eax, ecx
mov eax,100
eax=100
mov ebx,200
ebx=200
mov ecx,300
ecx=300
add eax, ebx
mov
mov
mov
mov
add
add
add
eax,
ebx,
ecx,
edx,
eax,
ecx,
eax,
100
200
300
400
ebx
edx
ecx
mov edx,400
edx=400
add ecx, edx
ecx=700
eax=300
add eax, ecx
eax=1000
Why restrict ourselves to the “route” taken by
information?
Emulate instructions when values are known
As well as storing the taint effects of each
instruction, store an evaluation function too
Allows us to deal with some (but not all)
branches and indirect jumps/calls
Opaque predicates
Similar to certain compiler optimizations
(constant folding)
For merged paths, similar choice to before
If the incoming values are different, just forget it
Find something more complicated to keep all
possibilities
Why restrict ourselves to exact values?
More limited information can still be useful
Many possibilities
Even/odd
▪ Or other modular congruencies
Positive/negative
Alphanumeric or not
Ranges
Instructions can
pass non-exact
value information
Can be specialized
for specific cases (as
with the mul here)
{eax totally unknown}
mul eax, 2
{eax is even}
inc eax
{eax is odd}
Requires “inverse” evaluation functions for
instructions
Many will not be possible or only provide limited
information
Allows the user to ask “what restrictions must
hold in order for a value to match”
Meeting the restrictions does not guarantee that
the value matches – not weakest preconditions
Every branch choice depends on values
e.g. a jz will branch one way if the z flag is set, the other if
not
We can spread this back up in the same way
And use this information to improve the analysis
Allows path-specific “input crafting”
“What must be true for this code to be reached”
Useful against control-flow obfuscation
Further links with compiler optimizations
So far, no use of constraint solvers or computer
algebra systems
Every input was considered “independent” of
other calculations
But that’s not always the case
e.g. polynomial equations
So we can use the external systems to help
But we can reduce the complexity of queries by using
our analysis information
So far, no functions – entirely interprocedural
No assumptions about code structure used
e.g. stack frames, functions always return, etc.
Haven’t been able to deal with APIs or loops
Simple solution: think of them as instructions
Bunch existing instructions together
Not just for labelling and visualizations; allows
function specific improvements
Perform costly optimizations only once
Allows custom inverse implementations
e.g. MD5 cracking
Recognize the API based on call address
Can use a symbolic base for modules
Define what we know about behaviour
e.g. Possible return values
External modelling
e.g. For writing/reading pipes or atoms
Can check for anti-debug/emulation too
Overwritten API code
Using undefined OS behaviour
The loop becomes an instruction with all potentially read
locations as inputs and all potentially written locations as
outputs
Can be refined further by checking internal constraints
A few cases have easy models
e.g. repeated adds = multiplication
Otherwise, do what we can
Anything too complicated is just defined as tainted by all
potential inputs with no evaluation
Again, the user can improve the analysis manually
Unroll iterations, add evaluators, etc.
Gradually forming more expressive units
Becoming increasingly higher-level
More flexible than decompilation
Can always zoom back in to the details
Reflects what’s happening in the analyst’s mind
Figuring out components and how they fit together
Easier to manage and navigate than just by function
Automated static analysis tools can’t miss bugs
False positives are a necessary evil for them
Whenever there’s ambiguity, they have to
assume the worst
We can allow the user to focus on the cases they
consider interesting
The ‘normal’ cases for some RE tasks
The ‘edge’ cases for others (e.g. vuln research)
Considering when accesses overlap
A form of alias analysis
?
?
mov [eax], 123
mov [ebx], 456
mov ecx, [eax]
eax
ebx
Pointers to objects don’t just magically appear
Based on results from allocation functions
So compare the taint sources!
Not immune to deliberate obfuscation
e.g. a loop searching memory, like egg hunters in
shellcode
… but still useful in many cases
esi
?
?
Sharing a taint source
– prioritize equal case
mov [eax], 123
mov [ebx], 456
mov ecx, [eax]
eax
ebx
esi
edi
?
?
No shared taint
source – prioritize
unequal case
mov [eax], 123
mov [ebx], 456
mov ecx, [eax]
eax
ebx
Perform runs to collect example values
Can be guided by our analysis
▪ Input values
▪ Which values to collect
▪ Where to intercept
Useful for still-unresolved indirect calls/jumps
Also useful to just display for the user!
Improve memory heuristics
Timing and hit count information also useful
Much like profiling
Opcode handlers become instructions
Suddenly we have a disassembler and
debugger with no additional effort
Can compare routines between VM and native
implementation
Could be useful for Java or Flash VMs too
Needs further investigation
Haven’t mentioned any sort of Intermediate
Language
The whole thing can be thought of as an
extensible IL
Share defined abstractions between projects
For library code, etc.
Integrate techniques
Share information as much as possible
Stay interactive
Both for analysis and visualization
Allow flexible abstraction
Reflect what happens in the analyst’s mind
Type recovery
C++ classes
Dealing with new obfuscation methods
White-box cryptography
Functional programming?
Under heavy development
Release likely in early 2011
GUI
Useful visualization of all this
Dynamic extensions
Hypervisor based?
Other information sources
Debugging symbols
Extensible… somehow!
Python?
Lua?
Rolf Rolles
Sean Heelan
Cyril Paciullo
Shawn Zivontsis
Jonathan Kay
Sérgio Silva
Tareq Saade
Daniel Hauenstein
Jason Geffner
Mailing list: http://revealer.co.uk
E-mail: [email protected]
Twitter: http://twitter.com/WillWhistler
Other useful links:
http://reddit.com/r/ReverseEngineering
http://ref.x86asm.net