Simple Substitution Distance and Metamorphic Detection
Gayathri Shanmugam
Richard M. Low
Mark Stamp
The Idea
Metamorphic malware “mutates” with each infection
Measuring software similarity is one method of detection
But, how to measure similarity?
o Lots of relevant previous work
Here, an unusual and interesting distance measure is considered
Simple Substitution Distance
We treat each metamorphic copy as if it is an “encrypted” version of “base” virus
o Where the cipher is a simple substitution
Why simple substitution?
o Easy to work with, fast algorithm to solve
Why might this work?
o Simple substitution cryptanalysis gives results that match family statistics
o Accounts for modifications to files similar to some common metamorphic techniques
Motivation
Given a simple substitution ciphertext where plaintext is English…
o If we cryptanalyze using English language statistics, we expect a good score
o If we cryptanalyze using, say, French language statistics, we expect a not-so-good score
We can obtain opcode statistics for a metamorphic family
o Using simple substitution cryptanalysis, a virus of the same family should score well…
o …but, a benign exe should not score as well
o Assuming statistics of these families differ
Metamorphic Techniques
Many possible morphing strategies
Here, briefly consider
o Register swapping
o Garbage code insertion
o Equivalent substitution
o Transposition
o Formal grammar mutation
At a high level --- substitution, transposition, insertion, and deletion
Register Swap
Register swapping
o E.g., replace EBX register with EAX, provided EAX not in use
Very simple and used in some of the first metamorphic malware
Not very effective
o Why not?
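One likely answer, offered here as an assumption rather than as the authors' stated explanation: register swapping changes operands but leaves the opcode sequence untouched, so anything that looks only at opcodes still matches. A minimal Python sketch, using a made-up instruction list and a hypothetical `swap_registers` helper:

```python
def swap_registers(instructions, a, b):
    """Exchange registers a and b in every operand (hypothetical helper)."""
    mapping = {a: b, b: a}
    swapped = []
    for opcode, operands in instructions:
        # Only operands are rewritten; the opcode is copied unchanged.
        swapped.append((opcode, tuple(mapping.get(op, op) for op in operands)))
    return swapped

def opcodes(instructions):
    """Return just the opcode sequence of an instruction list."""
    return [opcode for opcode, _ in instructions]

# Illustrative instruction list, not taken from any real virus.
original = [("MOV", ("EBX", "1")), ("ADD", ("EBX", "ECX")), ("PUSH", ("EBX",))]
variant = swap_registers(original, "EBX", "EAX")
```

Here `opcodes(variant)` equals `opcodes(original)`, even though every use of EBX has become EAX.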
Garbage Insertion
Garbage code insertion
Two cases:
o Dead code --- inserted, but not executed
We can simply JMP over dead code
o Do-nothing instructions --- executed, but have no effect on program
Like NOP or ADD EAX,0
Relatively easy to implement
Effective at breaking signatures
Changes the opcode statistics
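The effect on opcode statistics can be seen on a toy sequence. The following Python sketch is illustrative only: `insert_garbage` is a hypothetical helper that inserts a do-nothing opcode after every second instruction, and the opcode list is made up.

```python
from collections import Counter

def insert_garbage(opcodes, every=2, junk="NOP"):
    """Insert a do-nothing opcode after every `every`-th instruction."""
    out = []
    for i, op in enumerate(opcodes, start=1):
        out.append(op)
        if i % every == 0:
            out.append(junk)
    return out

def relative_freq(opcodes):
    """Relative frequency of each opcode in the sequence."""
    counts = Counter(opcodes)
    return {op: c / len(opcodes) for op, c in counts.items()}

base = ["MOV", "ADD", "MOV", "CMP", "JMP", "MOV"]  # made-up example
morphed = insert_garbage(base)
```

In this example the relative frequency of MOV drops from 3/6 to 3/9, although the program's behaviour is unchanged.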
Code Substitution
Equivalent instruction substitution
o For example, can replace SUB EAX,EAX
with XOR EAX,EAX
Does not need to be 1 for 1 substitution
o That is, can also include insertion/deletion
Unlimited number of substitutions
o And can be very effective
Somewhat difficult to implement
Transposition
Transposition
o Reorder instructions that have no
dependency
For example, the sequence
MOV R1,R2
ADD R3,R4
can be reordered as
ADD R3,R4
MOV R1,R2
Can be highly effective
But, can be difficult to implement
o Sometimes applied only to subroutines
Formal Grammar Mutation
Formal grammar mutation
View morphing engine as nondeterministic automaton
o Allow transitions between any symbols
o Apply formal grammar rules
Obtain many variants, high variation
Really just a formalization of other approaches, not a separate technique
Previous Work
Easy to prove that “good” metamorphic
code is immune to signature detection
o Why?
But, many successes detecting hacker-produced metamorphic malware…
o HMM/PHMM/machine learning
o Graph-based techniques
o Statistics (chi-squared, naïve Bayes)
o Structural entropy
o Linear algebraic techniques
Topic of This Research
Measure similarity using simple substitution distance
We “decrypt” suspect file using statistics from a metamorphic family
o If decryption is good, we classify it as a member of the same metamorphic family
o If decryption is poor, we classify it as NOT a member of the given family
Simple Substitution Cipher
Simple substitution is one of the oldest and simplest means of encryption
A fixed key is used to substitute letters
o For example, Caesar’s cipher, substitute letter 3 positions ahead in alphabet
o In general, any permutation can be key
Simple substitution cryptanalysis?
o Statistical analysis of ciphertext
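As a small illustration of the cipher itself, here is a Python sketch in which the key is a 26-letter string and position i holds the ciphertext letter for the i-th plaintext letter. The function names are illustrative, and the text is assumed to be uppercase A to Z only.

```python
import string

ALPHABET = string.ascii_uppercase

def caesar_key(shift=3):
    """Key for Caesar's cipher: each letter maps to the one `shift` ahead."""
    return ALPHABET[shift:] + ALPHABET[:shift]

def encrypt(plaintext, key):
    """Replace each plaintext letter by the key letter in the same position."""
    return plaintext.translate(str.maketrans(ALPHABET, key))

def decrypt(ciphertext, key):
    """Invert the substitution defined by key."""
    return ciphertext.translate(str.maketrans(key, ALPHABET))

ciphertext = encrypt("ATTACKATDAWN", caesar_key(3))  # "DWWDFNDWGDZQ"
```

Any permutation of the alphabet can be passed as `key`; the Caesar key is just one of the 26! possibilities.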
Simple Substitution Cryptanalysis
Suppose you observe the ciphertext
PBFPVYFBQXZTYFPBFEQJHDXXQVAPTPQJKTOYQWIPBVWLXTOXBTFXQW
AXBVCXQWAXFQJVWLEQNTOZQGGQLFXQWAKVWLXQWAEBIPBFXFQVX
GTVJVWLBTPQWAEBFPBFHCVLXBQUFEVWLXGDPEQVPQGVPPBFTIXPFHXZH
VFAGFOTHFEFBQUFTDHZBQPOTHXTYFTODXQHFTDPTOGHFQPBQWAQJJ
TODXQHFOQPWTBDHHIXQVAPBFZQHCFWPFHPBFIPBQWKFABVYYDZBOT
HPBQPQJTQOTOGHFQAPBFEQJHDXXQVAVXEBQPEFZBVFOJIWFFACFCCF
HQWAUVWFLQHGFXVAFXQHFUFHILTTAVWAFFAWTEVOITDHFHFQAITIX
PFHXAFQHEFZQWGFLVWPTOFFA
Analyze frequency counts…
Likely that ciphertext “F” represents “E”
o And so on, at least for common letters
Simple Substitution Cryptanalysis
Can automate the cryptanalysis
1. Make initial guess for key using frequency counts
2. Compute oldScore
3. Modify key by swapping adjacent elements
4. Compute newScore
5. If newScore > oldScore, let oldScore = newScore
6. Else unswap key elements
7. Goto 3
How to compute score?
o Number of dictionary words in putative plaintext?
o Much better to use English digraph statistics
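The steps above can be sketched in Python as follows. Several details are assumptions made for illustration: the letter order in `ENGLISH_ORDER` is only approximate, the digraph statistics come from whatever `reference_text` the caller supplies, the score is a smoothed digraph log-likelihood, and the loop stops after a full pass with no improvement (the listed steps have no exit condition). Input is assumed to be uppercase A to Z only.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_uppercase
# Approximate English letters in descending frequency (assumption; sources differ).
ENGLISH_ORDER = "ETAOINSHRDLCUMWFGYPBVKJXQZ"

def digraph_logprobs(reference_text):
    """Log-probability of each letter pair, with add-one smoothing."""
    counts = Counter(zip(reference_text, reference_text[1:]))
    total = sum(counts.values()) + 26 * 26
    return {(a, b): math.log((counts[(a, b)] + 1) / total)
            for a in ALPHABET for b in ALPHABET}

def rank_letters(ciphertext):
    """Ciphertext letters in descending order of frequency."""
    counts = Counter(ciphertext)
    return sorted(ALPHABET, key=lambda c: -counts[c])

def decrypt(ciphertext, ranked, key):
    """Map the i-th ranked ciphertext letter to key[i]."""
    return ciphertext.translate(str.maketrans("".join(ranked), "".join(key)))

def score(ciphertext, ranked, key, logprobs):
    """Higher is better: digraph log-likelihood of the putative plaintext."""
    text = decrypt(ciphertext, ranked, key)
    return sum(logprobs[pair] for pair in zip(text, text[1:]))

def hill_climb(ciphertext, logprobs):
    ranked = rank_letters(ciphertext)
    key = list(ENGLISH_ORDER)                                # step 1
    old_score = score(ciphertext, ranked, key, logprobs)     # step 2
    improved = True
    while improved:                                          # step 7, with an exit
        improved = False
        for i in range(len(key) - 1):
            key[i], key[i + 1] = key[i + 1], key[i]          # step 3
            new_score = score(ciphertext, ranked, key, logprobs)  # step 4
            if new_score > old_score:                        # step 5
                old_score = new_score
                improved = True
            else:                                            # step 6
                key[i], key[i + 1] = key[i + 1], key[i]
    return ranked, key, old_score
```

Note that every call to `score` decrypts the whole ciphertext again, so the cost of each swap grows with the message length.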
Jakobsen’s Algorithm
Method on previous slide can be slow
o Why?
Jakobsen’s algorithm uses similar idea, but fast and efficient
o Ciphertext is only decrypted once
o So algorithm is (essentially) independent of length of message
o Then, only matrix manipulations required
Jakobsen’s Algorithm: Swapping
Assume plaintext is English, 26 letters
Let K = k1,k2,k3,…,k26 be putative key
Then we swap elements as follows
o And let “|” represent “swap”
Restart this swapping from the beginning whenever the score improves
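One way to order the swaps, assumed here because it is the ordering usually given for Jakobsen's algorithm, is nearest neighbours first: k1|k2, k2|k3, …, k25|k26, then k1|k3, k2|k4, …, and finally k1|k26. A minimal Python sketch of that schedule, using 0-based indices and a hypothetical function name:

```python
def swap_schedule(n=26):
    """Yield 0-based index pairs (i, i + step), smallest step first."""
    for step in range(1, n):
        for i in range(n - step):
            yield i, i + step

pairs = list(swap_schedule(26))
```

For n = 26 the schedule contains 25 + 24 + … + 1 = 325 pairs, so one full pass with no improvement tries every pair of key elements exactly once.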
Jakobsen’s Algorithm: Swapping
Minimum swaps is 26 choose 2, or 325
Maximum is unbounded
Each swap requires a score computation
Average number of swaps, experimentally:
o Ciphertext of length 500, average 1050 swaps
o Ciphertext of length 8000, avg just 630 swaps
So, work depends on length of ciphertext
o More ciphertext, better scores, fewer swaps
Jakobsen’s Algorithm: Scoring
Let D = {dij} be digraph distribution corresponding to putative key K
Let E = {eij} be digraph distribution of
English language
These matrices are 26 x 26
Compute score as
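The score formula is not shown here. Jakobsen's published algorithm scores a key by the sum of absolute differences between the two matrices, that is, the sum over all i, j of |dij - eij|, with a lower score meaning a better match. Assuming that definition, a minimal Python version is:

```python
def score(D, E):
    """Sum of |d_ij - e_ij| over all entries of two equal-sized matrices."""
    return sum(abs(d - e)
               for row_d, row_e in zip(D, E)
               for d, e in zip(row_d, row_e))

example = score([[1, 2], [3, 4]], [[1, 0], [5, 4]])  # |2-0| + |3-5| = 4
```

With this convention "the score improves" means the value decreases, and a score of 0 means D and E are identical.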
Jakobsen’s Algorithm
So far, nothing fancy here
o Could see all of this in a CS 265 assignment
Jakobsen’s trick: Determine new D matrix from old D without decrypting
How to do so?
o It turns out that swapping elements of K
swaps corresponding rows and columns of D
See example on next slides…
Swapping Example
To simplify, suppose 10 letter alphabet
E, T, A, O, I, N, S, R, H, D
Suppose you are given the ciphertext
TNDEODRHISOADDRTEDOAHENSINEOAR
DTTDTINDDRNEDNTTTDDISRETEEEEEAA
Frequency counts given by
Swapping Example
We choose the putative key K given here
The corresponding putative plaintext is
AOETRENDSHRIEENATE
RIDTOHSOTRINEAAEAS
OEENOTEOAAAEESHNA
TTTTTII
Corresponding digraph distribution D is
Swapping Example
Suppose we swap first 2 elements of K
Then decrypt using new K
And compute digraph matrix for new K
Previous key K
New key K
Swapping Example
Old D matrix vs new D matrix
What do you notice?
So what’s the point here?
This is good!
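The observation can be checked directly on the example ciphertext. In the Python sketch below, `KEY` is an assumption: it was read off by aligning the opening letters of the ciphertext and the putative plaintext given earlier (T decrypts to A, N to O, D to E, and so on), and `KEY[i]` is the ciphertext letter that decrypts to the i-th alphabet letter. D holds raw digraph counts rather than normalized frequencies.

```python
ALPHABET = "ETAOINSRHD"
CIPHERTEXT = ("TNDEODRHISOADDRTEDOAHENSINEOAR"
              "DTTDTINDDRNEDNTTTDDISRETEEEEEAA")
KEY = "DETNARIOSH"  # assumed key, inferred from the example as described above

def decrypt(ciphertext, key):
    """key[i] is the ciphertext letter standing for ALPHABET[i]."""
    table = dict(zip(key, ALPHABET))
    return "".join(table[c] for c in ciphertext)

def digraph_matrix(text):
    """D[i][j] counts occurrences of ALPHABET[i] followed by ALPHABET[j]."""
    index = {ch: i for i, ch in enumerate(ALPHABET)}
    n = len(ALPHABET)
    D = [[0] * n for _ in range(n)]
    for a, b in zip(text, text[1:]):
        D[index[a]][index[b]] += 1
    return D

def swap_rows_cols(D, a, b):
    """Return a copy of D with rows a, b swapped and columns a, b swapped."""
    D = [row[:] for row in D]
    D[a], D[b] = D[b], D[a]
    for row in D:
        row[a], row[b] = row[b], row[a]
    return D

old_D = digraph_matrix(decrypt(CIPHERTEXT, KEY))
new_key = KEY[1] + KEY[0] + KEY[2:]                      # swap first 2 elements of K
slow_D = digraph_matrix(decrypt(CIPHERTEXT, new_key))    # decrypt again, then count
fast_D = swap_rows_cols(old_D, 0, 1)                     # no decryption needed
```

Both routes give the same matrix, which is why the ciphertext never has to be decrypted again after the first D is built.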
Jakobsen’s Algorithm
Proposed Similarity Score
Extract opcode sequences from collection of (family) viruses
o All viruses from same metamorphic family
Determine n most common opcodes
o Symbol n+1 used for all “other” opcodes
Use resulting digraph statistics to form matrix E = {eij}
o Note that matrix is (n+1) x (n+1)
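A minimal Python sketch of building E from a family's opcode sequences. The function names are hypothetical, the opcode lists are made up, and dividing by the total number of digraphs is just one possible normalization, chosen here as an assumption.

```python
from collections import Counter

def top_opcodes(sequences, n):
    """The n most common opcodes across all sequences, most frequent first."""
    counts = Counter(op for seq in sequences for op in seq)
    return [op for op, _ in counts.most_common(n)]

def digraph_matrix(sequences, symbols):
    """(n+1) x (n+1) digraph frequencies; the last index means 'other'."""
    index = {op: i for i, op in enumerate(symbols)}
    other = len(symbols)
    M = [[0.0] * (other + 1) for _ in range(other + 1)]
    total = 0
    for seq in sequences:
        ids = [index.get(op, other) for op in seq]
        for a, b in zip(ids, ids[1:]):   # digraphs never span two files
            M[a][b] += 1
            total += 1
    if total:
        M = [[x / total for x in row] for row in M]
    return M

family = [["MOV", "ADD", "MOV", "PUSH"], ["MOV", "ADD", "CALL"]]  # toy data
common = top_opcodes(family, 2)
E = digraph_matrix(family, common)
```

In this toy family the two most common opcodes are MOV and ADD, so PUSH and CALL both fall into the "other" symbol and E is 3 x 3.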
Scoring a File
Given an executable we want to score…
Extract its opcode sequence
Use opcode digraph stats to get D = {dij}
o This matrix also (n+1) x (n+1)
Initial “key” K chosen to match monograph stats of virus family
o Most frequent opcode in exe maps to most frequent opcode in virus family, etc.
Score based on distance between D and E
o “Decrypt” D and score how closely it matches E
o Jakobsen’s algorithm used for “decryption”
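The whole scoring step can be sketched in Python as follows. This is not the authors' implementation. It assumes the sum-of-absolute-differences distance, a nearest-neighbour-first swap order, frequencies normalized by the number of digraphs, and that the "other" symbol takes part in swaps like any other. Ordering rows and columns by descending opcode frequency makes the initial key the identity, since rank i in the exe lines up with rank i in the family.

```python
from collections import Counter

def ranked_digraphs(sequence, n):
    """(n+1) x (n+1) digraph frequencies, indices ordered by opcode rank."""
    ranked = [op for op, _ in Counter(sequence).most_common(n)]
    index = {op: i for i, op in enumerate(ranked)}
    ids = [index.get(op, n) for op in sequence]      # index n means "other"
    pairs = list(zip(ids, ids[1:]))
    M = [[0.0] * (n + 1) for _ in range(n + 1)]
    for a, b in pairs:
        M[a][b] += 1.0 / len(pairs)
    return M

def distance(D, E):
    """Sum of absolute differences (assumed scoring function)."""
    return sum(abs(d - e) for rd, re in zip(D, E) for d, e in zip(rd, re))

def swap(D, a, b):
    """Swap rows a, b and columns a, b of D in place."""
    D[a], D[b] = D[b], D[a]
    for row in D:
        row[a], row[b] = row[b], row[a]

def simple_substitution_score(D, E):
    """Hill climb over key swaps using only matrix operations; lower is closer."""
    D = [row[:] for row in D]
    size = len(D)
    best = distance(D, E)
    step, i = 1, 0
    while step < size:
        j = i + step
        swap(D, i, j)
        new = distance(D, E)
        if new < best:
            best = new
            step, i = 1, 0            # improvement: restart the swap schedule
        else:
            swap(D, i, j)             # no improvement: undo the swap
            i += 1
            if i + step >= size:
                step, i = step + 1, 0
    return best
```

A file whose opcodes are a consistent relabeling of the family's opcodes ends with a score of 0, while a file with different digraph structure stays further from E however the key is permuted.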
Example
Suppose only 5 common opcodes in family
viruses (in descending frequency)
Extract following sequence from an exe
Initial “key” is
And “decrypt” is
Example
Given “decrypt”
Form D matrix
After swap
o And so on…
Scoring Algorithm
Quantifying Success
Consider these 2 scatterplots of scores
Which is better (and why)?
ROC Curves
Plot true-positive rate vs false-positive rate
o As “threshold” varies
Curve nearer 45-degree line is bad
Curve nearer upper-left is better
ROC Curves
Use ROC curves to quantify success
Area under the ROC curve (AUC)
o Probability that randomly chosen positive instance scores higher than a randomly chosen negative instance
AUC of 1.0 implies ideal detection
AUC of 0.5 means classification is no better than flipping a coin
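The probabilistic reading of AUC can be computed directly by comparing every positive score with every negative score. A minimal Python sketch, with a hypothetical function name and ties counted as half a win:

```python
def auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5    # ties count as half
    return wins / (len(positive_scores) * len(negative_scores))

perfect = auc([0.9, 0.8], [0.1, 0.2])   # 1.0
chance = auc([0.9, 0.1], [0.8, 0.2])    # 0.5
```

If the score is a distance, where family members are expected to score lower, negate the scores (or swap the two arguments) before calling `auc`.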
Parameter Selection
Tested the following parameters
o Opcode matrix size
o Scoring function
o Normalization
o Swapping strategy
None significant, except matrix size
o So we only give results for matrix size
Opcode Matrix Size
Obtained following results
So, ironically, we use 26 x 26 matrix
Test Data
Tested the following metamorphic families
o G2 --- known to be weak
o NGVCK --- highly metamorphic
o MWOR --- highly metamorphic and stealthy
MWOR “padding ratios” of 0.5 to 4.0
For G2 and NGVCK
o 50 files tested, cygwin utilities for benign files
For each MWOR padding ratio
o 100 files tested, Linux utilities for benign files
5-fold cross validation in each experiment
NGVCK and G2 Graphs
MWOR Score Graphs
MWOR ROC Curves
MWOR AUC Statistics
Efficiency
Conclusions
+ Simple substitution score, good results for challenging metamorphics
+ Scoring is fast and efficient
+ Applicable to other types of malware
- Requires opcodes
Related Work
Recently, we generalized Jakobsen’s algorithm to “combination” cipher
Simple substitution column transposition (SSCT)
Uses multiple D matrices
o One D matrix for each column
o Enables easy column manipulations
o Overall, fast and effective SSCT attack
SSCT
SSCT for malware detection
This might be stronger malware score
o Why?
Finding good test data is an issue
o Can we find/make data where SSCT outperforms simple substitution score?
Currently studying this problem
Homophonic Substitution
Homophonic substitution allows more than one ciphertext symbol for each plaintext symbol
o Easy to encrypt, but harder to break than simple substitution --- why?
Previous student developed Jakobsen-like algorithm for homophonic substitution
o Uses a nested hill climb approach
This could be tested on malware
HMM
A different way to attack simple substitution ciphers?
Train an HMM (of course!)
o Let A be 26 x 26, English digraph stats
o Then train, without updating A matrix
o Resulting B matrix is the key
o Can work for homophonic case too
Any problems with this?
HMM with Random Restarts
HMM requires lots of data to converge
Often, we don’t have lots of data
In such cases, try random restarts
o HMM should converge with less data if we start closer to the solution
o Try enough random restarts, might start close enough to converge
How many random restarts?
HMM with Random Restarts
Could be applied to malware detection
o However, slow and expensive
More relevant for cryptanalysis
Zodiac 340 cipher, for example
o This has previously been analyzed using millions of random restarts
References
G. Shanmugam, R.M. Low, and M. Stamp, Simple substitution distance and metamorphic detection, Journal of Computer Virology and Hacking Techniques, 9(3):159-170, 2013
A. Dhavare, R.M. Low, and M. Stamp, Efficient cryptanalysis of homophonic substitution ciphers, Cryptologia, 37(3):250-281, 2013
References
T. Berg-Kirkpatrick and D. Klein, Decipherment with a million random restarts, http://www.cs.berkeley.edu/~tberg/papers/emnlp2013.pdf