Transcript malware_04
Chapter 4
Anti-Virus
Anti-Virus
Three
1.
tasks for anti-virus
Detection
o Infected or not? Provably undecidable…
2.
Identification
o May be separate from detection,
depending on detection method used
3.
Disinfection
o Remove the virus
Detection: Static Methods
Generic
methods
o Detects known and unknown viruses
o For example, anomaly detection
Virus-specific
methods
o Detects known viruses
o For example, signature detection
Static
--- virus code not running
Dynamic --- virus code running
Detection Outcomes
Detection Outcomes
Also
can have ghost positive
Virus remnant “detected”
o But virus is no longer there
How
can this happen?
o Previous disinfection was incomplete
Static Detection
Detection
without running virus code
Three approaches…
1. Scanners
o Signature
2.
Heuristics
o Look for “virus-like” code
3.
Integrity Checkers
o Hash/checksum
Scanners
On-demand
o Files scanned when you say so
On-access
o Constant scanning in background
o Whenever file is accessed, it’s scanned
Scanners
Signature scanning
o Viruses represented by “signature”
o Signature == pattern of bits in a virus (might
include wildcards)
“Hundreds of thousands of signatures”
Not feasible to scan one-by-one
o Multiple pattern search
o Efficiency is critical
We look in detail at several algorithms
Algorithm: Aho-Corasick
Developed
1975, bibliographic search
Based on finite automaton (graph)
o Circles are search states
o Edges are transitions
o Double circles are final states/output
And
a failure function
o What to do when no suitable transition
o I.e., where to resume “matching”
Algorithm: Aho-Corasick
When
virus scanning, search for virus
signature, which is bit string
For simplicity, illustrate algorithm
using English words
For our example…
Scan for any of the following words:
o hi, hips, hip, hit, chip
Algorithm: Aho-Corasick
Aho-Corasick
Example
Algorithm: Aho-Corasick
How
to construct automaton?
o And failure function
Build
the automaton --- next slide
o A “trie”, also known as a “prefix tree”
Then
determine failure function
o Two slides ahead
Aho-Corasick:
Trie
Labels added in
breadth-first
order
Closest to root
get smallest
numbers
Aho-Corasick: Failure Function
Depth
1 nodes
o Fail goes back to start state
For
other states
o Go back to earliest place where search
can resume
o Pseudo-code is in the book
Aho-Corasick
The
bottom line…
Linear search that can find multiple
signatures
o Like searching in parallel for related
signatures
Efficient
representation of
automaton is the challenge
o Both time and space issues
Algorithm: Veldman
Linear
search on “reduced” signatures
o Sequential search on reduced set
From
each signature, select 4
adjacent non-wildcard bytes
o Want as many signatures as possible to
have each selected 4-byte pattern
Then
use 2 hash tables to filter…
o Hash tables: 1st 2 bytes & 2nd 2 bytes
Algorithm: Veldman
Example
Suppose
the following 5 signatures
o blar?g, foo, greep, green, agreed
Select
4-byte patterns, no wildcards:
Algorithm: Veldman
Hashes
act as filters
Test things that pass thru both filters
o In this example, get things like “grar”
Algorithm: Veldman
Veldman
allows for wildcards and
complex signatures
o Aho-Corasick does not
But
both algorithms analyze every
byte of input
Is it possible to do better?
o That is, can we skip some of the input?
Algorithm: Wu-Manber
Like
Veldman’s algorithm
o But can skip over bytes that can’t
possibly match
o Faster, improved performance
Illustrate
algorithm with same
signatures used for Veldman’s:
o blar?g, foo, greep, green, agreed
Algorithm: Wu-Manber
Calculate
MINLEN
o Min length of any pattern substring
Two
hash tables
o SHIFT --- number of bytes that can
safely be skipped
o HASH --- mapping to signatures
Input
bytes denoted b1,b2,…,bn
Start at bMINLEN consider byte pairs
Algorithm: Wu-Manber
Example:
Suppose hash tables are…
Wu-Manber
Example
Here,
MINLEN =
3
Start at bMINLEN
Algorithm: Wu-Manber
How
to construct hash tables?
It’s a 4-step process
o Calculate MINLEN
o Initialize SHIFT table
o Fill SHIFT table
o Fill HASH table
Algorithm: Wu-Manber
Calculate
MINLEN
o Minimum number of adjacent, non-
wildcard bytes in any signature
For
this example, we have
o blar?g
o greep
o agreed
So
4
5
6
foo
green
we have MINLEN = 3
3
5
Algorithm: Wu-Manber
SHIFT
table
Extract MINLEN pattern substrings
o blar?g
o greep
o agreed
Extract
bla
gre
agr
foo
green
foo
gre
all distinct 2-byte sequences
o bl, la, fo, oo, gr, re, ag
If
input pair is not one of these, safe
to skip MINLEN - 1 bytes
Algorithm: Wu-Manber
SHIFT
table
Initialize SHIFT table to MINLEN – 1
For 2-byte pairs: bl, la, fo, oo, gr, re, ag
o Denote as xy
o Let qxy be rightmost ending position of xy
in any pattern substring
o For example, gr in agr and gre, but
in bla
o So, qgr = 3 while qbl = 2
o Then set SHIFT[xy] = MINLEN – qxy
Note:
Wildcard matches everything…
Algorithm: Wu-Manber
HASH
table
If SHIFT[xy] = MINLEN – qxy = 0
o Then we are at right edge of a pattern
So,
set HASH[xy] to all signatures with
pattern substring ending xy
For example
o HASH[gr] agreed
o HASH[
] greep, green
Algorithm: Wu-Manber
Here,
we illustrated simplest form of
the algorithm
More advanced forms can handle 10s
of thousands of signatures
Worst case performance is terrible
o Sequential search thru every byte of
input for every signature…
But
tests show it’s good in practice
Testing
How
can we know if scanner works?
Test on live viruses?
o Might not be a good idea
EICAR
standard antivirus test file
o Not too useful either
So,
what to do?
o Author doesn’t have any suggestions!
Improving Performance
“Grunt
scanning” --- scan everything
o Slow slow slow
Search
only beginning and end of files
Scan code entry point
o And points reachable from entry point
If
position of virus in file is known…
o Make it part of the “signature”
Limit
scans to size of virus(es)
Improving Performance
Only
scan certain types of files
o Not so viable today
Only
rescan files that have changed
o How to detect change?
o Where to store this info? Cache?
Database? Tagged to file?
o Updates to signatures? Must rescan…
o How to checksum efficiently?
Improving Performance
How
to checksum efficiently?
o Checksum entire file might take longer
than scanning it
o Only checksum parts that are scanned
How
to avoid checksum tampering?
o Encrypt? Where to store the key?
o Checksum the checksums?
o Other?
Improving Performance
Improve
the algorithm
o Maybe tailor algorithms to file type
Optimize
implementation
o May be of limited value
Other?
Static Heuristics
Like
having expert look at code…
Look for “virus-like” code
o Static, so we don’t execute the code
2
step process
o Gather data
o Analyze data
Static Heuristics
What
data to gather?
“Short signatures” or boosters
o Junk code
o Decryption loop
o Self-modifying code
o Undocumented API calls
o Unusual/non-compiler instructions
o Strings containing obscenities or “virus”
Stopper
--- thing virus would not do
Static Heuristics
Other
heuristics include…
Length of code
o Too short? May be appended virus
Statistical
analysis of instructions
o Handwritten assembly
o Encrypted code
Might
look for signature heuristics
o Common characteristics of signatures
Static Heuristics
Analysis
phase
May be simple…
o Weighted sum of various factors
o Unusual opcodes, etc.
…or
complex
o Machine learning (HMM, neural nets, etc.)
o Data mining
o Heuristic search (genetic algorithm, etc.)
Integrity Checkers
Look
for unauthorized change to files
Start with 100% clean files
Compute checksums/hashes
Store checksums
Recompute checksums and compare
o If they differ, a change has occurred
Integrity Checkers
3
types of integrity checkers
Offline --- recompute checksums
periodically (e.g., once/week)
Self-checking --- modify file to check
itself when run
o Essentially, a beneficial “virus”
o For example, virus scanner self-checks
Integrity
shell --- OS performs
checksum before file executed
Detection: Dynamic Methods
Detection
based on running the code
o Observe the “behavior”
Two
type of dynamic methods
o Behavior monitor/blockers
o Emulation
Behavior Monitor/Blocker
Monitor
program as running
Watch for “suspicious” behavior
What is suspicious?
o It’s too far from “normal”
What
is normal?
o A statistical measure --- mean, average
How
far is too far?
o Depends on variance, standard deviation
Behavior Monitor/Blocker
“Normal”
1.
monitored in 3 ways…
Actions that are permitted
o White list, positive detection
2.
Actions that are not permitted
o Black list, negative detection
Some combination of these two
Analogies to immune system
3.
o Distinguish self from non-self
Behavior Monitor/Blocker
“Care
must be taken… because
anomalous behavior does not
automatically imply viral behavior”
o That’s an understatement!
is the fundamental problem in
anomaly detection
This
o Potential for lots of false positives
Behavior Monitor/Blocker
Look
for short “dynamic signatures”
o Like signature detection, but input string
generated dynamically
But
what to monitor?
Infection-like behavior?
o Open an exe for read/write
o Read code start address from header
o Write start address to header
o Seek to end of exe, append to exe, etc.
Behavior Monitor/Blocker
How
to reduce false positives?
o Consider “ownership” --- some apps get
more leeway (e.g., browser clearing cache)
How
to prevent damage?
o “Dynamic” implies code actually running…
o System undo capability?
How
long to monitor?
o Monitoring increases overhead
o Can virus outlast monitor?
Emulation
Execute code, but not for real…
Instead, emulate execution
Emulation can provide all of the info gotten
thru code execution
o But much safer
“Execute” code in emulator
o Gather info for static/dynamic signatures or
heuristics
o Behavior blocker stuff applies too
Emulation
Emulation
and polymorphic detection
o Let virus decrypt itself
o Then use ordinary signature scan
When
has decryption occurred?
o Use some heuristics…
o Execution of code that was modified
(decrypted) or in such a memory location
o More than N bytes of modified code, etc.
Emulator Anatomy
Emulate
by single-stepping thru code?
o Easily detected by viruses (???)
o Danger of virus “escaping” emulator
“A
more elaborate emulation
mechanism is needed”
o Why?
Conceptually,
5 parts to an emulator
o Next slide please…
Emulator Anatomy
5
parts to new-and-improved
emulator
1. CPU emulation --- nothing more to say
2. Memory emulation
3. Hardware and OS emulation
4. Emulation controller
5. Extra analyses
Memory Emulation
This
could be difficult…
o 32-bit addressing, so 4G of “memory”
Do
we need to emulate all of this?
o No, most apps only uses small amount
Keep
track of memory that’s modified
and where it is located
o Only need to deal with memory that is
modified by a specific app/virus
Hardware/OS Emulation
Use
stripped-down, fake OS, due to…
o Copyright issues
o Size
o Startup time
o Emulator needs additional monitoring
What
about OS system calls?
o Return faked/fixed values
o Don’t faithfully emulate some low-level stuff
Emulation Controller
When
does emulation stop?
o Can’t expect to run code to completion…
Use
heuristics to decide when to stop
o Number of instructions?
o Amount of time?
o Threshold on percent of instructions
that modify memory?
o “Stoppers”? E.g., assume virus wouldn’t
write output before being malicious
Emulator: Extra Analyses
Post-emulation
analysis
For example, look at histogram of
instructions
o Does it match typical polymorphic?
o Does it match a metamorphic family?
Other
examples of post-emulation
analysis???
If at First You Don’t Succeed
Emulation
controller may re-invoke
emulator for the following reasons
o Rerun with different CPU parameters
o Test interrupt handlers
o Test multiple possible entry points
o Test for self-replication on “goat” files
o Test untaken branches in code
o Test “unused” memory locations
Emulator Optimizations
Improve
performance, reduce size
and/or complexity
o Use the real file system (with caution)
o “Data” files must be checked for
malware, use lots of stoppers
o Cache state --- if match is found to
previous (non-virus) run, goto next file
Cache register values, size, stack pointer and
contents, number of writes, checksums, etc.
Comparison of Techniques
Recall,
1.
2.
3.
4.
5.
the techniques considered…
Scanning
Static heuristics
Integrity check
Behavior blocker
Emulation
Comparison of Techniques
Scanning
Pros:
o Precise ID of malware
Cons:
o Requires up-to-date signatures
o Cannot detect new/unknown malware
Comparison of Techniques
Static
heuristics
Pros:
o Detect known and unknown malware
Cons:
o Detected malware not identified
o False positives
Comparison of Techniques
Integrity
check
Pros:
o Can be efficient and fast
o Detect known and unknown malware
Cons:
o Detected after infection & not identified
o Can’t detect in new/modified file
o Heavy burden on users/admins
Comparison of Techniques
Behavior
blocker
Pros:
o Known and unknown malware detected
Cons:
o Probably won’t identify malware
o High overhead
o False positives
o Malware runs on system before detected
Comparison of Techniques
Emulation
Pros:
o Known, unknown, polymorphic detection
o Malware executed in “safe” environment
Cons:
o Slow
o Malware might outlast emulator
o Might not provide identification
Detection: Bottom Line
Static
analysis is fast
o Good approach when it works
Dynamic
analysis can “peel away a
layer of obfuscation”
o Dynamic analysis is relatively costly
Verification, Quarantine,
Disinfect
What
to do after virus detected?
1. Verify that it really is a virus
2. Quarantine infected code
3. Disinfect --- remove infection
These
are done rarely, so can be slow
and costly in comparison to detection
Verification
After
detection comes verification
Why verify?
o Secondary test needed due to short,
general signature, or…
o …no signature, due to detection method
Behavior,
heuristic, emulation, etc.
o Do not usually provide identification
Writer
might try to make virus look
like some other virus
Verification
How
to verify?
“X-ray” the virus
If encrypted, decrypt it, or
frequency analysis might suffice
o Like simple substitution cipher
Extract
info/stats, etc.
Verification
After
x-ray analysis…
o Longer virus-specific signatures
o Checksum all or part of virus
o Call special-purpose verification code
Note
that these probably won’t work
on (good) metamorphic code
Quarantine
Isolate
detected virus from system
o Then ask user if it’s OK to disinfect
o Or do further analysis of virus
How
to quarantine virus?
o Copy to a “quarantine” directory?
o Hide it in “invisible” location?
o Encrypt it?
Disinfect
Disinfect
== remove infection
Not always possible to return file to
it’s original state
o E.g., file might have been overwritten
Disinfection
methods…
Delete the infected file
o Pros and cons?
Disinfect
Disinfection
methods…
Restore files from backup
o Pros and cons?
Use
virus-specific info
o Info may be found automatically --compare infected files with uninfected
o E.g., appended virus, changes start
address, appends itself to file, etc.
o Like a chosen plaintext attack
Disinfect
Disinfection
methods…
Use virus-behavior specific info
o E.g., prepended virus changes header
Save
some info about files
o Headers info, for example
o Then changed parts can be restored
o Integrates well with integrity checker
o Restore parts until checksum matches…
Disinfect
Disinfection
methods…
Use the virus to disinfect
o Stealth virus may give original code
Generic
disinfection
o Virus may restore code when executed
o Might be dangerous to run virus code…
o …emulation is a better strategy, maybe
even disinfect as part of detection
Virus Databases
What
to put in a virus database?
o Name of virus?
o Characteristics of virus?
o Signatures?
o Encrypted/hashed signatures?
o Disinfection info?
o Other info?
Virus Databases
How
to update database/signatures?
o Push or pull?
o Automatic or manual?
o How often to update?
o How to distribute updates?
o Distribute entire database or deltas?
Also
must be able to update AV
software
Virus Updates
Update
process is a BIG target
o AV’s machines that distribute updates
o Insider attack at AV site
o Trick user to getting “AV” from attacker
o Man-in-the-middle attack on
communications between user/AV
Virus Description Languages
AV
vendors have specialized virus
description languages
2 examples given in the book
Short Subjects
A
few quick points…
Anti-stealth techniques
Macro viruses
Compiler optimizations and detection
Anti-Stealth Techniques
Recall,
stealth viruses hide presence
Anti-stealth as part of AV?
o Detect and disable stealth --- check
that OS calls go to right place
o Bypass usual OS features --- direct calls
to BIOS, for example
Macro Virus Detection
Macro
viruses tricky to detect
o Macros are in source code
o Easy to change source
o Robust execution when errors occur
So,
any changes can create new virus
AV might create a new virus
o Eg, incomplete disinfection
Macro
virus can infect other macros
Macro Viruses
One
redeeming feature…
They operate in restricted domain
o So easier to determine “normal”
o Reduces number of false positives
Most/all
are not parasitic
o More like companion viruses
All
the usual detection techniques can
be applied
Macro Viruses: Disinfection
Delete
all macros in infected
document
Delete all associated macros
Delete macro if in doubt (heuristic)
Emulation to find all macros used by
infected macro, and delete them
Basic idea?
o Err on side of caution/deletion
Macro
viruses not so common today
Compiler Optimization
Compilers
use similar techniques as AV
“Optimizing compiler” for detection??
o Constant propagation – reduces variables
o Dead code (executed, but not needed)
o Polymorphics may have lots of dead code
If
used, efficiency could be an issue
o Compilers extensively studied
o Bad cases well-known, so virus writers
might take advantage of these