KDD cup 99 Dataset

ANFIS Classifier for Network Intrusion Detection System
Dr. Mohsen Kahani
http://www.um.ac.ir/~kahani/
Network Intrusion Detection
 Widespread use of computer networks
 Growing number of attacks, new hacking tools, and intrusive methods
 An Intrusion Detection System (IDS) is one way of
dealing with suspicious activities within a network.
 IDS
 Monitors the activities of a given environment
 Decides whether these activities are malicious
(intrusive) or legitimate (normal).
Dr. Kahani - Expert Systems and Knowledge Engineering
Soft Computing and IDS
 Many soft computing approaches have been applied
to the intrusion detection field.
 Our Novel Network IDS includes
 Neuro-Fuzzy
 Fuzzy
 Genetic algorithms
 Key Contributions
 Use of the outputs of the neuro-fuzzy networks as linguistic
variables that express how reliable the current output is.
KDD cup 99 Dataset
 Comparing different works in the IDS area requires a standard
dataset for evaluating computer network IDSes.
 For the Fifth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, TCP dump data of a simulated network was
collected and processed into training and test sets of features
defined for the connection records.
 We refer to this standard dataset as the KDD cup 99 dataset and use
it for our experiments.
KDD cup 99 Dataset
 41 features derived for each connection.
 A label which specifies the status of connection records as
either normal or specific attack type.
 Features fall in four categories
 Intrinsic features, e.g. duration of the connection, type of
the protocol (tcp, udp, etc.), network service (http, telnet, etc.).
 Content features, e.g. number of failed login attempts.
 Same-host features examine the connections in the past two
seconds that have the same destination host as the current
connection, and calculate statistics related to protocol
behavior, service, etc.
 Same-service features examine the connections in the past two
seconds that have the same service as the current connection.
Basic features of individual TCP connections

feature name     type         description
duration         continuous   length (number of seconds) of the connection
protocol_type    discrete     type of the protocol, e.g. tcp, udp, etc.
service          discrete     network service on the destination, e.g., http, telnet, etc.
src_bytes        continuous   number of data bytes from source to destination
dst_bytes        continuous   number of data bytes from destination to source
flag             discrete     normal or error status of the connection
land             discrete     1 if connection is from/to the same host/port; 0 otherwise
wrong_fragment   continuous   number of "wrong" fragments
urgent           continuous   number of urgent packets
Content features within a connection suggested by domain knowledge

feature name         type         description
hot                  continuous   number of "hot" indicators
num_failed_logins    continuous   number of failed login attempts
logged_in            discrete     1 if successfully logged in; 0 otherwise
num_compromised      continuous   number of "compromised" conditions
root_shell           discrete     1 if root shell is obtained; 0 otherwise
su_attempted         discrete     1 if "su root" command attempted; 0 otherwise
num_root             continuous   number of "root" accesses
num_file_creations   continuous   number of file creation operations
num_shells           continuous   number of shell prompts
num_access_files     continuous   number of operations on access control files
num_outbound_cmds    continuous   number of outbound commands in an ftp session
is_hot_login         discrete     1 if the login belongs to the "hot" list; 0 otherwise
is_guest_login       discrete     1 if the login is a "guest" login; 0 otherwise
Traffic features computed using a two-second time window

feature name         type         description
count                continuous   number of connections to the same host as the current connection in the past two seconds
Note: The following features refer to these same-host connections.
serror_rate          continuous   % of connections that have "SYN" errors
rerror_rate          continuous   % of connections that have "REJ" errors
same_srv_rate        continuous   % of connections to the same service
diff_srv_rate        continuous   % of connections to different services
srv_count            continuous   number of connections to the same service as the current connection in the past two seconds
Note: The following features refer to these same-service connections.
srv_serror_rate      continuous   % of connections that have "SYN" errors
srv_rerror_rate      continuous   % of connections that have "REJ" errors
srv_diff_host_rate   continuous   % of connections to different hosts
KDD CUP 99 Sample Data
0,tcp,http,SF,200,4213,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,15,15,0.00,0.00,0.00,0.00,1.00,0.00,0.00,31,255,1.00,0.00,0.03,0.02,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,293,4203,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,4,255,1.00,0.00,0.25,0.02,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,296,6903,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,2,0.00,0.00,0.00,0.00,1.00,0.00,1.00,2,255,1.00,0.00,0.50,0.03,0.00,0.00,0.00,0.00,normal.
0,udp,domain_u,SF,104,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0.00,0.00,0.00,0.00,1.00,0.00,1.00,56,56,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,domain_u,SF,103,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0.00,0.00,0.00,0.00,1.00,0.00,1.00,66,66,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,domain_u,SF,89,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0.00,0.00,0.00,0.00,1.00,0.00,1.00,76,76,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,normal.
0,udp,domain_u,SF,79,32,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,86,85,0.99,0.02,0.99,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,smtp,SF,1367,335,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,21,72,0.90,0.10,0.05,0.04,0.00,0.00,0.00,0.00,normal.
184,tcp,telnet,SF,1511,2957,0,0,0,3,0,1,2,1,0,0,1,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,1,3,1.00,0.00,1.00,0.67,0.00,0.00,0.00,0.00,buffer_overflow.
305,tcp,telnet,SF,1735,2766,0,0,0,3,0,1,2,1,0,0,1,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,2,4,1.00,0.00,0.50,0.50,0.00,0.00,0.00,0.00,buffer_overflow.
0,tcp,smtp,SF,1518,405,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,4,0.00,0.00,0.00,0.00,1.00,0.00,1.00,42,108,0.74,0.07,0.02,0.04,0.05,0.00,0.00,0.00,normal.
0,tcp,smtp,SF,1173,403,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,52,116,0.75,0.06,0.02,0.03,0.04,0.00,0.00,0.00,normal.
257,tcp,telnet,SF,181,1222,0,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,62,15,0.21,0.05,0.02,0.13,0.03,0.13,0.00,0.00,normal.
0,tcp,smtp,SF,2302,410,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,72,117,0.76,0.04,0.01,0.03,0.03,0.00,0.00,0.00,normal.
1,tcp,smtp,SF,1587,332,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,3,120,1.00,0.00,0.33,0.04,0.00,0.00,0.00,0.00,normal.
0,tcp,smtp,SF,1552,333,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,2,0.00,0.00,0.00,0.00,1.00,0.00,1.00,13,121,0.85,0.15,0.08,0.04,0.00,0.00,0.00,0.00,normal.
0,tcp,finger,SF,10,223,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,23,14,0.22,0.13,0.04,0.29,0.00,0.00,0.00,0.00,normal.
0,tcp,smtp,SF,971,335,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,16,120,0.94,0.12,0.06,0.03,0.00,0.00,0.00,0.00,normal.
1,tcp,smtp,SF,2007,335,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,3,0.00,0.00,0.00,0.00,1.00,0.00,1.00,26,129,0.92,0.12,0.04,0.03,0.00,0.00,0.00,0.00,normal.
0,tcp,finger,SF,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,3,16,0.67,0.67,0.33,0.31,0.00,0.00,0.00,0.00,normal.
0,tcp,smtp,SF,880,327,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,18,195,0.89,0.11,0.06,0.03,0.00,0.00,0.00,0.00,normal.
0,tcp,smtp,SF,4031,322,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,28,205,0.93,0.07,0.04,0.03,0.00,0.00,0.00,0.00,normal.
27,tcp,ftp,SF,916,2720,0,0,0,19,0,1,0,0,0,0,0,0,0,0,0,1,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,5,5,1.00,0.00,0.20,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,smtp,SF,2012,325,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,15,207,0.27,0.13,0.07,0.03,0.00,0.00,0.00,0.00,normal.
20,tcp,ftp,SF,239,774,0,0,0,4,0,1,0,0,0,0,0,0,0,0,0,1,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,55,34,0.62,0.04,0.02,0.00,0.00,0.00,0.00,0.00,normal.
23,tcp,ftp,SF,342,1072,0,0,0,6,0,1,0,0,0,0,0,0,0,0,0,1,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,65,40,0.62,0.03,0.02,0.00,0.00,0.00,0.00,0.00,normal.
1,tcp,smtp,SF,1609,364,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,4,0.00,0.00,0.00,0.00,1.00,0.00,1.00,75,187,0.37,0.03,0.01,0.03,0.00,0.00,0.00,0.00,normal.
21,tcp,ftp,SF,227,766,0,0,0,4,0,1,0,0,0,0,0,0,0,0,0,1,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,85,50,0.59,0.02,0.01,0.00,0.00,0.00,0.00,0.00,normal.
0,tcp,http,SF,54540,8314,0,0,0,2,0,1,1,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,111,111,1.00,0.00,0.01,0.00,0.00,0.00,0.01,0.01,back.
0,tcp,http,RSTR,53452,2920,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,3,3,0.00,0.00,0.33,0.33,1.00,0.00,0.00,112,112,1.00,0.00,0.01,0.00,0.00,0.00,0.02,0.02,back.
0,tcp,http,SF,54540,8314,0,0,0,2,0,1,1,0,0,0,0,0,0,0,0,0,3,3,0.00,0.00,0.33,0.33,1.00,0.00,0.00,113,113,1.00,0.00,0.01,0.00,0.00,0.00,0.02,0.02,back.
0,icmp,ecr_i,SF,1480,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,19,19,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,19,0.07,0.02,0.07,0.00,0.00,0.00,0.00,0.00,pod.
0,icmp,ecr_i,SF,1480,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,20,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,20,0.08,0.02,0.08,0.00,0.00,0.00,0.00,0.00,pod.
0,tcp,private,RSTR,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,1.00,1.00,1.00,0.00,0.00,255,1,0.00,0.02,0.00,0.00,0.00,0.00,0.00,1.00,portsweep.
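A minimal sketch of how one of the records above could be split into its 41 feature values and its class label. The order of the first nine fields is inferred from the sample records (for example, the flag value SF sits in the fourth position), and the helper name parse_record is hypothetical.

```python
import csv
import io

# Names of the first nine fields, in the order they appear in the sample
# records (an assumption inferred from the data, not stated on the slides).
BASIC_FIELDS = [
    "duration", "protocol_type", "service", "flag", "src_bytes",
    "dst_bytes", "land", "wrong_fragment", "urgent",
]
SYMBOLIC = {"protocol_type", "service", "flag"}


def parse_record(line):
    """Split one comma-separated record into (basic features, raw values, label)."""
    fields = next(csv.reader(io.StringIO(line.strip())))
    *values, label = fields
    label = label.rstrip(".")  # labels carry a trailing period
    basic = {
        name: (raw if name in SYMBOLIC else int(raw))
        for name, raw in zip(BASIC_FIELDS, values)
    }
    return basic, values, label


# Second sample record from the slide, split into pieces for readability.
sample = (
    "0,tcp,http,SF,293,4203,"
    "0,0,0,0,0,1,"
    "0,0,0,0,0,0,0,0,0,0,"
    "2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,"
    "4,255,1.00,0.00,0.25,0.02,0.00,0.00,0.00,0.00,"
    "normal."
)
basic, values, label = parse_record(sample)
print(len(values), basic["service"], basic["src_bytes"], label)
```

Only the nine basic fields are named here; the remaining values are kept as raw strings so that no column order is asserted beyond what the samples show.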
KDD cup 99 Dataset
 Attacks fall into four main categories
 DOS (Denial of service): making some computing or
memory resources too busy, so that legitimate users are denied
access to these resources.
 R2L (Remote to Local): unauthorized access from a remote
machine by exploiting the machine's vulnerabilities.
 U2R (User to Root): unauthorized access to local superuser
(root) privileges by exploiting the system's vulnerabilities.
 PROBE: host and port scans as precursors to other attacks.
An attacker scans a network to gather information or find
known vulnerabilities.
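As a small illustration of the four categories, the attack labels that appear in the sample records can be mapped to their category as sketched below. The dictionary covers only the labels shown in the sample data; the full dataset contains more attack types, and the names CATEGORY_OF and categorize are hypothetical.

```python
# Category of each label that occurs in the sample records.
# This is a partial mapping for illustration only.
CATEGORY_OF = {
    "normal": "Normal",
    "back": "DoS",
    "pod": "DoS",
    "portsweep": "PROBE",
    "buffer_overflow": "U2R",
}


def categorize(label):
    """Return the category of a raw label such as 'pod.', or None if unknown."""
    return CATEGORY_OF.get(label.rstrip("."))


print(categorize("pod."), categorize("portsweep."), categorize("buffer_overflow."))
```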
KDD Cup 99 Dataset cont.
 KDD dataset is divided into following record sets:
 Training
 Testing
 Original training dataset was too large for our
purpose, so the 10% training dataset was used for the
training phase.
KDD Cup 99 Sample Distribution

THE SAMPLE DISTRIBUTIONS ON THE SUBSET OF 10% DATA OF KDD CUP 99 DATASET

Class    Number of Samples   Samples Percent
Normal   97277               19.69%
Probe    4107                0.83%
DoS      391458              79.24%
U2R      52                  0.01%
R2L      1126                0.23%
Total    492021              100%

THE SAMPLE DISTRIBUTIONS ON THE TEST DATA WITH THE CORRECTED LABELS OF KDD CUP 99 DATASET

Class    Number of Samples   Samples Percent
Normal   60593               19.48%
Probe    4166                1.34%
DoS      229853              73.90%
U2R      228                 0.07%
R2L      16189               5.20%
Total    311029              100%
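The percentages in the two tables can be re-derived from the per-class counts, as in the sketch below. One observation from running it: the per-class counts of the 10% subset add up to 494020, which differs from the total printed in the first table, while the test-set counts add up to the printed 311029.

```python
# Per-class counts copied from the two tables above.
train_counts = {"Normal": 97277, "Probe": 4107, "DoS": 391458, "U2R": 52, "R2L": 1126}
test_counts = {"Normal": 60593, "Probe": 4166, "DoS": 229853, "U2R": 228, "R2L": 16189}


def distribution(counts):
    """Return the total and the share of each class in percent (two decimals)."""
    total = sum(counts.values())
    shares = {cls: round(100.0 * n / total, 2) for cls, n in counts.items()}
    return total, shares


train_total, train_pct = distribution(train_counts)
test_total, test_pct = distribution(test_counts)
print(train_total, train_pct)
print(test_total, test_pct)
```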
ANFIS
 ANFIS is an adaptive neuro-fuzzy inference system
 Ability to construct models solely based on samples of the
target system (Learning)
 Ability to adapt itself through repeated training (Adaptation)
 These abilities, among others, qualify ANFIS as a
fuzzy classifier for IDS
 Here we use ANFIS as a neuro-fuzzy classifier to
detect intrusions in computer networks based on the KDD
cup 99 dataset.
Generating Target fuzzy Inference System
 Grid partitioning
 all the possible rules are generated based on the number of MFs
for each input
 For example in a two dimensional input space, with three MFs
for each input, grid partitioning generates 3 × 3 = 9 rules.
 Subtractive clustering
 Subtractive Clustering is a fast, one-pass algorithm for
estimating the number of clusters and the cluster centers in a set
of data.
 The clusters’ information obtained by this method is used for
determining the initial number of rules and the antecedent
membership functions, which are then used to identify the FIS.
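The idea behind subtractive clustering can be sketched in a few lines of plain Python. This is a simplified illustration, not the exact procedure used in the experiments: it assumes data scaled to the unit hypercube, a squash factor of 1.5 for the subtraction radius, and a single stopping threshold (stop_ratio) in place of the full accept/reject test. All parameter names other than ra are assumptions.

```python
import math


def subtractive_clustering(points, ra=0.5, squash=1.5, stop_ratio=0.15):
    """Return cluster centers chosen among the data points (simplified sketch)."""
    alpha = 4.0 / ra ** 2
    beta = 4.0 / (squash * ra) ** 2

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Potential of each point: high when many other points lie close to it.
    potential = [
        sum(math.exp(-alpha * sqdist(p, q)) for q in points) for p in points
    ]
    first_peak = max(potential)
    centers = []
    while True:
        k = max(range(len(points)), key=potential.__getitem__)
        peak = potential[k]
        if peak < stop_ratio * first_peak:
            break  # remaining potential is too small to justify a new cluster
        centers.append(points[k])
        # Subtract the influence of the new center from every point.
        potential = [
            pot - peak * math.exp(-beta * sqdist(p, points[k]))
            for p, pot in zip(points, potential)
        ]
    return centers


data = [(0.10, 0.10), (0.12, 0.10), (0.10, 0.13),
        (0.90, 0.90), (0.88, 0.90), (0.90, 0.87)]
centers = subtractive_clustering(data, ra=0.5)
print(centers)
```

Each center found this way would seed one fuzzy rule, which is why the number of rules stays small even when the number of inputs is large.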
Initial SYSTEM ARCHITECTURE
 KDD features come in all forms: continuous, discrete, and
symbolic.
 Preprocessing: symbolic-valued attributes are mapped to numeric
ones.
 150000 randomly selected records from the 10% subset of the data
are used as training data.
 40000 randomly selected records are used as checking data
(for validating the model).
 Five trials of 40000 connections each, sampled from the same
source as the training dataset and overlapping neither the training
set nor each other, are used as testing data.
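The preprocessing step, mapping symbolic values to numbers, might look like the sketch below. The slides do not say which numeric codes were assigned, so numbering the values in order of first appearance is purely an assumption, and build_encoder is a hypothetical helper name.

```python
def build_encoder(values):
    """Assign consecutive integers to symbolic values in order of first appearance."""
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping))
    return mapping


# Protocol types as they occur in a handful of records.
protocols = ["tcp", "tcp", "udp", "icmp", "tcp"]
encoder = build_encoder(protocols)
encoded = [encoder[p] for p in protocols]
print(encoder, encoded)
```

The same helper would be applied separately to each symbolic attribute (protocol type, service, flag), so that every attribute gets its own code table.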
Initial SYSTEM ARCHITECTURE
 Subtractive Clustering Method with ra=0.5 (neighborhood
radius) partitions the training data and generates an FIS
structure.
 Then for further fine-tuning and adaptation of membership
functions, training dataset was used for training ANFIS while
the checking dataset was used for validating the model
identified.
 The final ANFIS contains 212 nodes and a total number of
248 fitting parameters, of which 164 are premise parameters
and 84 are consequent parameters.
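The premise and consequent counts are consistent with a small rule base. As a sketch, assume one Gaussian MF (two parameters) per input per rule and first-order Sugeno consequents (one coefficient per input plus a constant); with the 41 KDD features, two rules then reproduce 164 and 84, which add up to 248. The rule count of two is inferred from these numbers, not stated on the slide.

```python
def anfis_parameter_count(n_inputs, n_rules, params_per_mf=2):
    """Count premise and consequent parameters under the stated assumptions."""
    premise = n_inputs * n_rules * params_per_mf   # one MF per input per rule
    consequent = n_rules * (n_inputs + 1)          # linear terms plus a constant
    return premise, consequent, premise + consequent


premise, consequent, total = anfis_parameter_count(n_inputs=41, n_rules=2)
print(premise, consequent, total)

# For contrast: grid partitioning with only two MFs on each of the 41 inputs.
grid_rules = 2 ** 41
print(grid_rules)
```

The contrast with the grid-partitioning rule count shows why clustering-based rule generation is the practical choice for this many inputs.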
Initial SYSTEM ARCHITECTURE

Training ANFIS causes further
fine-tuning and adaptation of
initial membership functions.
Initial and final membership
functions of some input features
are illustrated here.
Initial SYSTEM ARCHITECTURE
 The ANFIS structure basically has a single output.
 An approximate class number is obtained by rounding off
the ANFIS output. Γ is the rounding parameter that yields
the integer value.
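The slides do not spell out how Γ enters the rounding. One plausible reading, sketched below as an assumption, is that Γ is the threshold on the fractional part of the output: at or above Γ the output is rounded up, otherwise down. With Γ = 0.5 this is ordinary rounding.

```python
import math


def to_class(output, gamma=0.5):
    """Round a real-valued ANFIS output to a class number (assumed rule)."""
    base = math.floor(output)
    return base + 1 if output - base >= gamma else base


print(to_class(1.2), to_class(1.5), to_class(0.4, gamma=0.3))
```

Under this reading, lowering Γ pushes borderline outputs toward the higher class number and raising it does the opposite.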
Standard metrics for evaluating
network IDSes
 Some Definitions
 Detection rate is the ratio between the number of
correctly detected attacks and the total number of attacks.
 False alarm (false positive) rate is the ratio between
the number of normal connections that are incorrectly
classified as attacks and the total number of normal
connections.
 Classification rate is the ratio between the number of
test instances correctly classified and the total number
of test instances.
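The three definitions translate directly into code. In this sketch an attack counts as detected when it is flagged as any attack class, which is one reading of "correctly detected"; the label strings and the function name evaluate are illustrative.

```python
def evaluate(actual, predicted, normal="normal"):
    """Compute detection, false alarm and classification rates as fractions."""
    pairs = list(zip(actual, predicted))
    attacks = [(a, p) for a, p in pairs if a != normal]
    normals = [(a, p) for a, p in pairs if a == normal]
    detected = sum(1 for _, p in attacks if p != normal)
    false_alarms = sum(1 for _, p in normals if p != normal)
    correct = sum(1 for a, p in pairs if a == p)
    return {
        "detection_rate": detected / len(attacks),
        "false_alarm_rate": false_alarms / len(normals),
        "classification_rate": correct / len(pairs),
    }


# Made-up labels, only to exercise the definitions.
actual = ["normal", "normal", "normal", "normal", "dos", "dos", "probe", "r2l"]
predicted = ["normal", "normal", "normal", "dos", "dos", "probe", "probe", "normal"]
rates = evaluate(actual, predicted)
print(rates)
```

Note that the second DoS record counts toward the detection rate (it was flagged as an attack) but not toward the classification rate (the class is wrong).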
Results

False alarm, detection and classification rates for training and checking data, Γ=0.5

Data       False Alarm Rate %   Detection Rate %   Classification Rate %
Training   0.61                 99.75              99.68
Checking   1.6                  91.00              92.44

[Figure: error measures vs. epoch number for the training dataset]
Results
 Experiment 1
 All records of the labeled (corrected) test dataset are used as
the testing data to evaluate our classifiers.
 False alarm, detection and classification rates for the test
data of the first experiment; Γ=0.5

Data   False Alarm Rate %   Detection Rate %   Classification Rate %
Test   1.6                  91.07              92.48
Results
 Experiment 2




5 trials of 40000 randomly selected samples each.
The results are averaged over the trials.
We compare our classifier with different fuzzy algorithms.

Comparing false alarm rate, detection rate and complexity of different algorithms

Algorithm                False Alarm Rate %   Detection Rate %   Complexity
Neuro-Fuzzy Classifier   0.59                 99.54              O(n)
SRPP [1]                 3.58                 99.08              O(n)
EFRID [7]                7                    98.96              O(n)
RIPPER [5]               2.02                 94.26              O(n × log2 n)
Final System architecture
Proposed System(Data Sources)

The distribution of the samples in the two subsets that were used for the training

SAMPLE DISTRIBUTIONS ON THE FIRST TRAINING AND CHECKING DATA RANDOMLY SELECTED OF 10% DATA OF KDD CUP 99 DATASET

         ANFIS-N             ANFIS-P             ANFIS-D             ANFIS-U             ANFIS-R
Class    Training  Checking  Training  Checking  Training  Checking  Training  Checking  Training  Checking
Normal   20000     2500      10000     1000      25000     6000      200       100       4000      2000
Probe    4000      107       4000      107       4000      107       50        25        1000      500
DoS      15000     2000      5000      500       20000     5000      50        25        2000      1000
U2R      40        12        40        12        40        12        46        6         40        12
R2L      1000      126       1000      126       1000      126       50        25        1000      126
Proposed System(Data Sources) cont.

SAMPLE DISTRIBUTIONS ON THE SECOND TRAINING AND CHECKING DATA RANDOMLY SELECTED OF 10% DATA OF KDD CUP 99 DATASET

         ANFIS-N             ANFIS-P             ANFIS-D             ANFIS-U             ANFIS-R
Class    Training  Checking  Training  Checking  Training  Checking  Training  Checking  Training  Checking
Normal   1500      1500      1500      1500      1500      1500      1500      1500      1500      1500
Probe    500       500       500       500       500       500       500       500       500       500
DoS      500       500       500       500       500       500       500       500       500       500
U2R      52        0         52        0         52        0         46        6         52        0
R2L      500       500       500       500       500       500       500       500       500       500
Proposed System(ANFIS Classifiers)




 The subtractive clustering method with ra=0.5 (neighborhood
radius) is used to partition the training sets and generate an
FIS structure for each ANFIS.
 For further fine-tuning and adaptation of the membership
functions, the training sets are used for training each ANFIS.
 Each ANFIS is trained for 50 epochs, and the final FIS is the
one associated with the minimum checking error.
 All the MFs of the input fuzzy sets are Gaussian functions
with two parameters.
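Two of the points above lend themselves to a short sketch: a Gaussian membership function with its two parameters (here called center c and width sigma, names chosen for illustration), and picking the epoch whose checking error is lowest. The error values below are invented solely to exercise the selection.

```python
import math


def gaussian_mf(x, c, sigma):
    """Membership degree of x in a Gaussian fuzzy set with center c and width sigma."""
    return math.exp(-0.5 * ((x - c) / sigma) ** 2)


def best_epoch(checking_errors):
    """Index of the epoch with the minimum checking error."""
    return min(range(len(checking_errors)), key=checking_errors.__getitem__)


print(gaussian_mf(0.5, c=0.5, sigma=0.1))   # full membership at the center
print(gaussian_mf(0.6, c=0.5, sigma=0.1))   # one sigma away from the center

# Illustrative checking errors per epoch (not measured values).
errors = [0.30, 0.22, 0.19, 0.21, 0.25]
print(best_epoch(errors))
```

Keeping the parameters from the epoch with the lowest checking error, rather than from the last epoch, guards against overfitting the training set.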
Proposed System(The Fuzzy Decision Module)






 A five-input, single-output Mamdani fuzzy inference system
 Centroid of area defuzzification
 Each input and output fuzzy set includes two MFs
 All the MFs are Gaussian functions which are specified by four parameters.
 The output of the fuzzy inference engine, which varies between -1 and 1,
specifies how intrusive the current record is:
1 means completely intrusive and -1 completely normal.

FUZZY ASSOCIATIVE MEMORY FOR THE PROPOSED FUZZY INFERENCE RULES

Normal: High, Low, -
PROBE: ¬High, High, Low
DoS: ¬High, High, Low
U2R: ¬High, High, Low
R2L: ¬High, High, Low
Output: Normal, Normal, Attack, Attack, Attack, Attack, Attack, Normal
Proposed System(Genetic Algorithm Module)


A chromosome consists of 320 bits of binary data.
Every 8 bits of a chromosome determine one of the four
parameters of an MF.
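With 8 bits per parameter, a 320-bit chromosome holds 320 / 8 = 40 parameters, that is, ten MFs with four parameters each. A sketch of the decoding step follows; the linear scaling of each 8-bit value to the range [low, high] is an assumption, since the slides do not give the parameter ranges, and decode is a hypothetical helper name.

```python
import random

BITS_PER_PARAM = 8
PARAMS_PER_MF = 4


def decode(chromosome, low=-1.0, high=1.0):
    """Turn a bit string into groups of four real-valued MF parameters."""
    if len(chromosome) % (BITS_PER_PARAM * PARAMS_PER_MF) != 0:
        raise ValueError("chromosome length must be a multiple of 32 bits")
    params = []
    for i in range(0, len(chromosome), BITS_PER_PARAM):
        raw = int(chromosome[i:i + BITS_PER_PARAM], 2)   # integer in 0..255
        params.append(low + (high - low) * raw / 255)    # assumed linear scaling
    return [params[i:i + PARAMS_PER_MF] for i in range(0, len(params), PARAMS_PER_MF)]


random.seed(0)
chromosome = "".join(random.choice("01") for _ in range(320))
mfs = decode(chromosome)
print(len(mfs), len(mfs[0]))
```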
Proposed System(Some Metrics)
 Cost Per Example
$$\mathrm{CPE} = \frac{1}{N} \sum_{i=1}^{m} \sum_{j=1}^{m} CM(i,j) \cdot C(i,j)$$
 Where CM is a confusion matrix
 Each column corresponds to the predicted class, while rows correspond to
the actual classes. An entry at row i and column j, CM(i, j), is the
number of instances that belong to class i and were identified as a
member of class j; for i ≠ j these are misclassifications. The entries
of the primary diagonal, CM(i, i), are the numbers of properly detected
instances.
 C is a cost matrix
 As with CM, entry C(i, j) is the cost penalty for misclassifying an
instance belonging to class i into class j.
 N is the total number of test instances.
 m is the number of classes.
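The CPE definition can be checked on a small example. In the sketch below the confusion matrix is invented for illustration; the two cost matrices are the KDD'99 competition costs and the equal-cost variant, with rows as actual and columns as predicted classes in the order Normal, PROBE, DoS, U2R, R2L.

```python
def cost_per_example(cm, cost):
    """CPE = (1/N) * sum over i, j of CM(i, j) * C(i, j)."""
    m = len(cm)
    n = sum(sum(row) for row in cm)
    total = sum(cm[i][j] * cost[i][j] for i in range(m) for j in range(m))
    return total / n


KDD_COST = [
    [0, 1, 2, 2, 2],
    [1, 0, 2, 2, 2],
    [2, 1, 0, 2, 2],
    [3, 2, 2, 0, 2],
    [4, 2, 2, 2, 0],
]
EQUAL_COST = [[0 if i == j else 1 for j in range(5)] for i in range(5)]

# Made-up confusion matrix with 200 test instances.
cm = [
    [90, 5, 5, 0, 0],
    [2, 18, 0, 0, 0],
    [1, 0, 49, 0, 0],
    [3, 0, 0, 2, 0],
    [10, 0, 0, 0, 15],
]
cpe_kdd = cost_per_example(cm, KDD_COST)
cpe_equal = cost_per_example(cm, EQUAL_COST)
print(cpe_kdd, cpe_equal)
```

With equal costs the CPE reduces to the plain misclassification rate, whereas the KDD costs penalize missed U2R and R2L attacks more heavily.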
Proposed System(Fitness Function For GA)
 Two different fitness functions
 Cost per example with equal misclassification costs

                    Predicted
Actual    Normal   PROBE   DoS   U2R   R2L
Normal    0        1       1     1     1
PROBE     1        0       1     1     1
DoS       1        1       0     1     1
U2R       1        1       1     0     1
R2L       1        1       1     1     0

 Cost per example used for evaluating the results of the
KDD'99 competition

                    Predicted
Actual    Normal   PROBE   DoS   U2R   R2L
Normal    0        1       2     2     2
PROBE     1        0       2     2     2
DoS       2        1       0     2     2
U2R       3        2       2     0     2
R2L       4        2       2     2     0
Proposed System(Data Sources For GA)

THE SAMPLE DISTRIBUTIONS ON THE SELECTED SUBSET OF 10% DATA OF KDD CUP 99 DATASET FOR THE OPTIMIZATION PROCESS WHICH IS USED BY GA

Class               Normal   Probe   DoS   U2R   R2L
Number of Samples   200      104     200   52    104
Results




10 subsets of training data for both series were used for the classifiers.
The genetic algorithm was performed three times, each time for one of the five
series of selected subsets.
In total, 150 different structures were used, and the reported result is the
average over these 150 structures.
Two different training datasets for training the classifiers and two different fitness
functions for optimizing the fuzzy decision-making module were used.

ABBREVIATIONS USED FOR OUR APPROACHES

Abbreviation   Approach
ESC-KDD-1      First training set with fitness function of KDD
ESC-EQU-1      First training set with fitness function of equal misclassification cost
ESC-KDD-2      Second training set with fitness function of KDD
ESC-EQU-2      Second training set with fitness function of equal misclassification cost
Results cont.

CLASSIFICATION RATE, DETECTION RATE (DTR), FALSE ALARM RATE (FA) AND COST PER EXAMPLE OF KDD (CPE) FOR THE DIFFERENT APPROACHES OF ESC-IDS ON THE TEST DATASET WITH CORRECTED LABELS OF KDD CUP 99 DATASET

Model       Normal   Probe   DoS    U2R    R2L    DTR    FA    CPE
ESC-KDD-1   98.2     84.1    99.5   14.1   31.5   95.3   1.9   0.1579
ESC-EQU-1   98.4     89.2    99.5   12.8   27.3   95.3   1.6   0.1687
ESC-KDD-2   96.5     79.2    96.8   8.3    13.4   91.6   3.4   0.2423
ESC-EQU-2   96.9     79.1    96.3   8.2    13.1   88.1   3.2   0.2493

CLASSIFICATION RATE, DETECTION RATE (DTR), FALSE ALARM RATE (FA) AND COST PER EXAMPLE OF KDD (CPE) FOR THE DIFFERENT ALGORITHMS PERFORMANCES ON THE TEST DATASET WITH CORRECTED LABELS OF KDD CUP 99 DATASET (N/R STANDS FOR NOT REPORTED)

Model              Normal   Probe   DoS    U2R    R2L    DTR    FA    CPE
ESC-IDS            98.2     84.1    99.5   14.1   31.5   95.3   1.9   0.1579
RSS-DSS            96.5     86.8    99.7   76.3   12.4   94.4   3.5   n/r
Parzen-Window      97.4     99.2    96.7   93.6   31.2   n/r    2.6   0.2024
Multi-Classifier   n/r      88.7    97.3   29.8   9.6    n/r    n/r   0.2285
Winner of KDD      99.5     83.3    97.1   13.2   8.4    91.8   0.6   0.2331
Runner Up of KDD   99.4     84.5    97.5   11.8   7.3    91.5   0.6   0.2356
PNrule             99.5     73.2    96.9   6.6    10.7   91.1   0.4   0.2371
