Data Mining Approach for Network Intrusion Detection

Zhen Zhang
Advisor: Dr. Chung-E Wang
04/24/2002
Department of Computer Science
California State University, Sacramento
Outline

Background
– Intrusion Detection: promises and challenges
– Data Mining in IDS: how can it help
Motivation
Approaches, tasks, problems and my contributions
Results
Conclusion and future work

Intrusion Detection
- Building a Secure Network

Primary assumptions
– System activities are observable
– Normal and intrusive activities have distinct evidence

Main techniques
– Misuse detection: patterns of well-known
attacks
– Anomaly detection: deviation from normal
usage
Data Mining in IDS

Shortfalls with current IDS (mostly misuse detection)
– Variants: intrusions change easily and frequently.
– False positives: legitimate activity raises alarms, which makes real intrusions difficult to pick out.
– False negatives: attacks for which there are no known signatures go undetected.
– Data overload: the amount of data grows rapidly.
What is Data Mining

Data Mining:
Extract patterns or deviations from data.

Many different types of algorithms:
Decision tree, link analysis, clustering, association, rule abduction, deviation analysis, and sequence analysis.

Software and Tools:
– MS SQL Server 2000
– Ripper and many others
How can Data Mining help

Variants
– Anomaly detection looks for deviation from normal usage, so variants of an exploit code are of no great concern.

False positives
– Identify recurring sequences of alarms that correspond to valid network activity.

False negatives
– Attacks for which signatures have not been developed
might be detected.

Data overload
– Data mining reduces large volumes of data to a manageable set of patterns and deviations.
Summary of my work

Identify objective
– Distinguish network attacks from normal traffic
– New area: several research projects, no commercial products
– Focus on the principle and basic implementation of concepts

Data collection
Data pre-processing on the tcpdump dataset
Apply data mining to the processed data
Investigate results
Software packages used: Visual Basic, Microsoft SQL Server 2000 with Analysis Server, Tcpdump
Data Collection

Tcpdump data (http://iris.cs.uml.edu:8080/)
– Tcpdump was run on the gateway to capture traffic between the LAN and external networks, as well as broadcast packets within the LAN
– Only headers were captured, no user data
– Filters kept only TCP and UDP packets
– Baseline and 4 simulated attacks
Tcpdump data format

TCP packet
– Time stamp
– Source IP address
– Source port
– Destination IP address
– Destination port
– Flags (SYN, FIN, PUSH, RST, or .)
– Data sequence number of this packet
– Data sequence number of the data expected in return
– Number of bytes of receive buffer space available
– Indication of whether or not the data is urgent
Tcpdump data format

UDP packet
– Time stamp
– Source IP address
– Source port
– Destination IP address
– Destination port
– Length of the packet
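
A minimal sketch of turning tcpdump text output into the fields listed for TCP and UDP packets. It assumes the classic numeric one-line format (address.port > address.port: ...); the regular expressions, the function name parse_line, and the sample lines in the usage note are illustrative assumptions, not taken from the original dataset.

```python
import re
from typing import Optional

# Assumed line shapes (classic tcpdump text output with numeric addresses):
#   TCP: <time> <ip>.<port> > <ip>.<port>: <flags> [<seq>:<seq>(<n>)] [ack <n>] [win <n>] [urg <n>]
#   UDP: <time> <ip>.<port> > <ip>.<port>: udp <length>
HEADER = re.compile(
    r"^(?P<time>\S+)\s+"
    r"(?P<src_ip>\d+\.\d+\.\d+\.\d+)\.(?P<src_port>\d+)\s+>\s+"
    r"(?P<dst_ip>\d+\.\d+\.\d+\.\d+)\.(?P<dst_port>\d+):\s+(?P<rest>.*)$"
)
TCP_REST = re.compile(
    r"^(?P<flags>[SFPR.]+)"
    r"(?:\s+(?P<seq>\d+):\d+\(\d+\))?"
    r"(?:\s+ack\s+(?P<ack>\d+))?"
    r"(?:\s+win\s+(?P<win>\d+))?"
    r"(?:\s+urg\s+(?P<urg>\d+))?"
)
UDP_REST = re.compile(r"^udp\s+(?P<length>\d+)")


def parse_line(line: str) -> Optional[dict]:
    """Return a dict of header fields, or None if the line is not recognized."""
    m = HEADER.match(line.strip())
    if m is None:
        return None
    rec = {
        "time": m["time"],
        "src_ip": m["src_ip"], "src_port": int(m["src_port"]),
        "dst_ip": m["dst_ip"], "dst_port": int(m["dst_port"]),
    }
    rest = m["rest"]
    udp = UDP_REST.match(rest)
    if udp:
        rec.update(protocol="udp", length=int(udp["length"]))
        return rec
    tcp = TCP_REST.match(rest)
    if tcp:
        rec.update(
            protocol="tcp",
            flags=tcp["flags"],                              # S, F, P, R, or "."
            seq=int(tcp["seq"]) if tcp["seq"] else None,     # sequence number of this packet
            ack=int(tcp["ack"]) if tcp["ack"] else None,     # sequence number expected in return
            window=int(tcp["win"]) if tcp["win"] else None,  # receive buffer space available
            urgent=tcp["urg"] is not None,                   # whether urgent data is indicated
        )
        return rec
    return None
```

For example, parse_line("12:00:00.000001 192.168.0.1.1025 > 192.168.0.2.80: S 1000:1000(0) win 8192") yields a TCP record with flags "S", and a line ending in "udp 64" yields a UDP record with length 64. Both lines are made-up samples.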
Example data
[Figure: example tcpdump data]
Data Pre-processing
- 80% ~ 90% of the work

Packet level information to connection level
– Group by same source/destination IP/Port
– Use flags, acks to determine status of the connection
» SF, REJ, S0, S1, S2, S3, S4, RSTOSn, RSTRSn, SS, SH, SHR, OOS1, OOS2
– Record start time, duration, protocol
– Calculate bytes in, bytes out, resent rate
– UDP is connectionless, so simply treat each packet as a connection
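
The packet-to-connection step can be sketched as follows. This is a simplified illustration, not the original Visual Basic implementation: it assumes each packet is a dict with time (seconds as a float), src_ip, src_port, dst_ip, dst_port, flags, length, and protocol keys, and it recognizes only three of the listed states (SF, REJ, S0), labelling everything else OTHER. Resent rate and port reuse are not handled.

```python
from collections import defaultdict


def endpoint(pkt, side):
    """(ip, port) pair for side "src" or "dst"."""
    return (pkt[side + "_ip"], pkt[side + "_port"])


def connection_key(pkt):
    """Direction-independent key so both directions land in the same group."""
    a, b = endpoint(pkt, "src"), endpoint(pkt, "dst")
    return (a, b) if a <= b else (b, a)


def summarize_tcp(packets):
    """Summarize one TCP connection from its time-ordered packets.

    Simplified state rules (an assumption for this sketch):
      S0  - SYN sent, no reply from the other side
      REJ - SYN answered by an RST
      SF  - SYN seen and a FIN seen (normal open and close)
    """
    first = packets[0]
    originator = endpoint(first, "src")
    sent = [p for p in packets if endpoint(p, "src") == originator]
    replies = [p for p in packets if endpoint(p, "src") != originator]
    syn = "S" in first["flags"]
    if syn and not replies:
        state = "S0"
    elif syn and any("R" in p["flags"] for p in replies):
        state = "REJ"
    elif syn and any("F" in p["flags"] for p in packets):
        state = "SF"
    else:
        state = "OTHER"
    return {
        "start": first["time"],
        "duration": packets[-1]["time"] - first["time"],
        "protocol": "tcp",
        "src": originator,
        "dst": endpoint(first, "dst"),
        "state": state,
        "bytes_out": sum(p["length"] for p in sent),
        "bytes_in": sum(p["length"] for p in replies),
    }


def build_connections(packets):
    """Group TCP packets into connections; treat each UDP packet as its own connection."""
    groups = defaultdict(list)
    connections = []
    for pkt in sorted(packets, key=lambda p: p["time"]):
        if pkt["protocol"] == "udp":
            connections.append({
                "start": pkt["time"], "duration": 0.0, "protocol": "udp",
                "src": endpoint(pkt, "src"), "dst": endpoint(pkt, "dst"),
                "state": None, "bytes_out": pkt["length"], "bytes_in": 0,
            })
        else:
            groups[connection_key(pkt)].append(pkt)
    connections.extend(summarize_tcp(g) for g in groups.values())
    return sorted(connections, key=lambda c: c["start"])
```

Keying on the unordered pair of endpoints keeps requests and replies together, so the reply flags are available when the connection state is decided.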
First round of processing
Intrinsic Features
Establish more information
– Count_per_dest: # of connections to this destination IP
– REJ_count_per_dest: # of connections that get the flag “REJ”
– S01_count_per_dest: # of connections that send a SYN packet but never get the ACK packet (S0), or receive an ACK on SYN that they never have sent (S1)
– Diff_Services_per_dest: # of unique services
– Diff_Service_Rate: Diff_Services / Count
Same Destination Temporal and Statistical Attributes (last 2 seconds)
Establish more information
– Count_per_service: # of connections to this type of service
– REJ_count_per_service: # of connections that get the flag “REJ” (SYN met by RST)
– S01_count_per_service: # of connections that send a SYN packet but never get the ACK packet (S0), or receive an ACK on SYN that they never have sent (S1)
– Diff_Hosts_per_service: # of unique destination hosts
– Diff_Hosts_Rate: Diff_Hosts / Count
Same Service Temporal and Statistical Attributes (last 2 seconds)
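
The same-destination and same-service attributes share one pattern: look back over the last 2 seconds, keep the connections that match on one field, and count. The sketch below assumes each connection is a dict with start (seconds), dst_ip, service, and state keys. The function name window_features and its generic output keys are hypothetical. The window is assumed to include the current connection, so the count is never zero.

```python
def window_features(connections, same, distinct, window=2.0):
    """Temporal and statistical attributes over the preceding `window` seconds.

    same     - field that must match the current connection ("dst_ip" or "service")
    distinct - field whose unique values are counted ("service" or "dst_ip")
    Returns one dict per connection, in order of start time.
    This is a simple O(n^2) scan, adequate only for illustration.
    """
    conns = sorted(connections, key=lambda c: c["start"])
    features = []
    for i, cur in enumerate(conns):
        # Connections up to and including the current one, inside the time window
        recent = [
            c for c in conns[: i + 1]
            if cur["start"] - c["start"] <= window and c[same] == cur[same]
        ]
        count = len(recent)
        diff = len({c[distinct] for c in recent})
        features.append({
            "count": count,                                              # Count_per_*
            "rej_count": sum(c["state"] == "REJ" for c in recent),       # REJ_count_per_*
            "s01_count": sum(c["state"] in ("S0", "S1") for c in recent),  # S01_count_per_*
            "diff": diff,                                                # Diff_*_per_*
            "diff_rate": diff / count,                                   # Diff_* / Count
        })
    return features
```

Calling window_features(conns, same="dst_ip", distinct="service") gives the same-destination attributes, and window_features(conns, same="service", distinct="dst_ip") gives the same-service ones.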
Second round of processing
Same Destination Temporal and Statistical Attributes
Final round of processing

Final, but important
– Reduce data amount
– Remove noise or trivial information
– Reorganize data, add new features if necessary

Challenges
– Hard to tell which data to reduce or remove
– Requires tremendous domain knowledge
– Need experiments and adjustments
Data Mining
Decision Tree Algorithm
Microsoft SQL Server 2000 Analysis Server
Steps:

– Use 80% of the baseline (normal) dataset as training data
– Use the remaining 20% as validation data, compute misclassification
– Use 20% of each of the four intrusion datasets as prediction data, compute misclassification
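
The split and scoring in these steps can be sketched independently of the mining tool. The decision tree itself was built in Analysis Server and is not reproduced here; predict stands in for any trained model. The random shuffle, the function names, and the use of state as the predicted column are assumptions, since the slides do not say how the 80% was selected.

```python
import random


def split_80_20(records, seed=0):
    """Shuffle a copy of the baseline records and split it 80% / 20%."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def sample_fraction(records, fraction=0.2, seed=0):
    """Draw a random fraction of an intrusion dataset for prediction."""
    k = int(len(records) * fraction)
    return random.Random(seed).sample(list(records), k)


def misclassification_rate(predict, records, target="state"):
    """Share of records whose predicted target differs from the recorded one.

    predict - hypothetical callable mapping a record to a predicted target value
    target  - column being predicted; "state" is an assumption for this sketch
    """
    wrong = sum(predict(r) != r[target] for r in records)
    return wrong / len(records)
```

A model trained only on normal traffic should misclassify more often on intrusion data than on the held-out normal data, and that gap is what the comparison relies on.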
Dependency Network
Decision Tree
Apply Data Mining Model to Validate/Predict
Results
% misclassification (by final state)
Normal: 149/1510 = 9.86%
Intrusion1: 443/2324 = 19.06%
Intrusion2: 376/1968 = 19.10%
Intrusion3: 386/2011 = 19.19%
Intrusion4: 437/2298 = 19.01%
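
The percentages can be re-derived from the counts with a short calculation. The figures agree with the counts when truncated to two decimals; with rounding, Normal, Intrusion2, and Intrusion4 would read 9.87%, 19.11%, and 19.02%. The dictionary and function names are illustrative.

```python
# (misclassified, total) per dataset, copied from the results
RESULTS = {
    "Normal": (149, 1510),
    "Intrusion1": (443, 2324),
    "Intrusion2": (376, 1968),
    "Intrusion3": (386, 2011),
    "Intrusion4": (437, 2298),
}


def percent_truncated(wrong, total):
    """Misclassification percentage truncated (not rounded) to two decimals."""
    return (wrong * 10000 // total) / 100


def ratio_to_normal(name):
    """How many times higher a dataset's misclassification rate is than Normal's."""
    wrong, total = RESULTS[name]
    base_wrong, base_total = RESULTS["Normal"]
    return (wrong / total) / (base_wrong / base_total)


for name, (wrong, total) in RESULTS.items():
    print(f"{name}: {wrong}/{total} = {percent_truncated(wrong, total):.2f}%")
```

Each intrusion dataset is misclassified at a little under twice the rate of the normal validation data.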
Conclusion and future improvement

Accuracy
– Preliminary experiments using data mining on the tcpdump data showed promising results
– Accuracy depends on sufficient training data and the right feature set

Performance
– 6 hours on one dataset (628775 records)

Size of time window
– 2 seconds or larger?

Automated process
– Call SQL Server data mining (DM) and DTS procedures from within VB
– Real-time monitoring and alarms
References

Intrusion Detection, Rebecca Gurley Bace, Macmillan Technical Publishing, 2000
Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, 2001
Data Mining with Microsoft SQL Server 2000, Claude Seidman, Microsoft Press, 2001
http://www.cs.columbia.edu/~sal/hpapers/USENIX/usenix.html
http://iris.cs.uml.edu:8080/network.html
http://www-nrg.ee.lbl.gov/. Network Research Group (NRG) of the Information and Computing Sciences Division (ICSD) at Lawrence Berkeley National Laboratory (LBNL) in Berkeley, California.
Thank You!