Uncovering Functional Networks
in Internet Traffic
Mark Meiss
September 25, 2006
1
Who am I?
Mark Meiss
• Ph.D. candidate in Computer Science
– Committee: Filippo Menczer, Alessandro
Vespignani, Katy Börner, Minaxi Gupta, Kay
Connelly
• Researcher at the Advanced Network
Management Laboratory (ANML)
– http://anml.iu.edu/
2
3
What’s the agenda?
The subject of today’s story:
• Finding a way to improve security without
compromising user privacy
• A case study in applied network science
This work was done with Filippo Menczer and
Alessandro Vespignani.
4
What do people do online?
There’s what we imagine…
surfing
sending email
playing games
5
What do people do online?
And there’s what is actually happening…
file sharing
worms & viruses
porn
6
Not just a value judgment
These applications all affect the health of a data
network.
There are legal problems, yes; but also…
• Crowding out other applications.
– (Napster was once over 70% of all IUB traffic)
• Compromised computers are used to launch
further attacks.
• “Common nuisances” are on the ’Net as well.
7
The bottom line
Network administrators
need to be able to identify
what applications
are being used on the network.
…but this can be very difficult.
8
A crash course
in data networks
We’ll use a running example:
• Buddy Bradley wants to read a web page about his
favorite band at Vulgar Entertainment, Inc.
9
Quick summary
• Each network conversation is identified by
four pieces of information
– Client address and port number
– Server address and port number
• The server uses a well-known port number
• The client uses an ephemeral port number
20
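As a minimal sketch of the summary above, a conversation can be modeled as a four-field key. The names `Conversation` and `looks_like_server_port` are hypothetical, and treating ports 0 to 1023 as the well-known range is a simplifying assumption. The addresses and ports are the example values from Buddy's session.

```python
from typing import NamedTuple

WELL_KNOWN_MAX = 1023  # assumption: treat ports 0-1023 as the "well-known" range


class Conversation(NamedTuple):
    """The four pieces of information that identify one network conversation."""
    client_addr: str
    client_port: int
    server_addr: str
    server_port: int


def looks_like_server_port(port: int) -> bool:
    """Heuristic only: a low port number suggests the server side."""
    return port <= WELL_KNOWN_MAX


conv = Conversation("192.168.65.33", 13029, "10.99.205.122", 80)
print(looks_like_server_port(conv.server_port))  # port 80 is in the low range
print(looks_like_server_port(conv.client_port))  # port 13029 is ephemeral
```

The heuristic is exactly the kind of convention that the next slide shows to be unreliable.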
So why is it hard to identify
applications?
• Well-known ports are a convention, not a rule
– Web, e-mail, etc. do have ports assigned by the IANA
– BitTorrent, Gnutella, Napster, etc. do not
• Client and server ports share the same namespace
• In practice…
– Any application can use any pair of port numbers
• Our focus: discovering what application is running
on a port with no assigned use.
21
The conventional solution
Let’s look inside
all of those packets!
22
Another problem
• Packet inspection doesn’t scale
– Modern high-speed networks run at 10 gigabits
per second or faster
(that’s one full DVD every few seconds)
– General-purpose computers can’t even copy
that data in real time
25
Introducing the “flow”
• We can summarize Buddy’s Web surfing as two
flows:
– 192.168.65.33:13029 to 10.99.205.122:80 (456 bytes)
– 10.99.205.122:80 to 192.168.65.33:13029 (63,211 bytes)
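A rough sketch of how packets could be rolled up into the two flows above. The function name `summarize` and the individual packet sizes are made up for illustration; only the per-direction byte totals come from the slide.

```python
from collections import defaultdict

CLIENT = ("192.168.65.33", 13029)
SERVER = ("10.99.205.122", 80)

# Hypothetical packets: (src_addr, src_port, dst_addr, dst_port, size_bytes).
# One request packet, then 43 full-size response packets and one short one.
packets = [(*CLIENT, *SERVER, 456)]
packets += [(*SERVER, *CLIENT, 1460)] * 43
packets += [(*SERVER, *CLIENT, 431)]


def summarize(packets):
    """Sum bytes per (src_addr, src_port, dst_addr, dst_port) key."""
    flows = defaultdict(int)
    for src, sport, dst, dport, size in packets:
        flows[(src, sport, dst, dport)] += size
    return dict(flows)


flows = summarize(packets)
for key, total in flows.items():
    print(key, total)
```

Forty-five packets collapse into two records, one per direction.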
28
Where do flows come from?
• Architectural features of Internet routers
allow them to export flow data
• Routers can’t summarize all the data
– Packets are sampled to construct the flows
– Typical sampling rate is around 1:100
29
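To illustrate what 1:100 sampling implies for the byte counts, here is a small sketch. It assumes deterministic "every 100th packet" sampling and a simple scale-up by the sampling rate; real routers may sample differently, and the function names and packet sizes are hypothetical.

```python
SAMPLING_RATE = 100  # 1 packet in 100, the typical rate quoted above


def sample_every_nth(packet_sizes, n=SAMPLING_RATE):
    """Deterministic 1-in-n sampling: keep every n-th packet."""
    return packet_sizes[n - 1::n]


def estimate_total_bytes(sampled_sizes, n=SAMPLING_RATE):
    """Scale the sampled byte count back up by the sampling rate."""
    return sum(sampled_sizes) * n


# Made-up traffic: 1000 large packets followed by 1000 small ones.
packet_sizes = [1500] * 1000 + [40] * 1000
sampled = sample_every_nth(packet_sizes)
estimate = estimate_total_bytes(sampled)
print(len(sampled), estimate, sum(packet_sizes))
```

The estimate is exact here only because the made-up traffic is so regular. With real traffic it is an approximation, and short conversations can be missed entirely.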
What can you do
with a flow?
• Usual answer:
– Treat a flow as a record in a relational database
– Who talked to port 1337?
– What proportion of our traffic is on port 80?
– Who is scanning for vulnerable systems?
– Which hosts are infected with this worm?
• These are useful and valid questions.
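The relational view can be sketched with Python's built-in `sqlite3` module. The table layout, column names, and the third row (a made-up host talking to port 1337) are assumptions for illustration and do not reflect the schema of any real flow collector.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE flows (
           src_addr TEXT, src_port INTEGER,
           dst_addr TEXT, dst_port INTEGER,
           bytes INTEGER)"""
)
conn.executemany(
    "INSERT INTO flows VALUES (?, ?, ?, ?, ?)",
    [
        ("192.168.65.33", 13029, "10.99.205.122", 80, 456),
        ("10.99.205.122", 80, "192.168.65.33", 13029, 63211),
        ("192.168.65.40", 40112, "10.99.205.200", 1337, 1200),  # made-up host
    ],
)

# Who talked to port 1337?
talkers = [row[0] for row in conn.execute(
    "SELECT DISTINCT src_addr FROM flows WHERE dst_port = 1337")]

# What proportion of our traffic (by bytes) is on port 80?
(share_80,) = conn.execute(
    """SELECT 1.0 * SUM(CASE WHEN src_port = 80 OR dst_port = 80
                             THEN bytes ELSE 0 END) / SUM(bytes)
       FROM flows"""
).fetchone()

print(talkers, round(share_80, 3))
```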
30
What can you do
with a flow?
• Our approach:
– Treat a flow as a directed, weighted edge
– The resulting network describes user behavior
• Hold that thought for now…
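A minimal sketch of the graph view, assuming the edge weight is the number of bytes sent from one host to another. The choice of bytes as the weight and the name `build_digraph` are assumptions.

```python
from collections import defaultdict


def build_digraph(flows):
    """Map (src_addr, dst_addr) -> total bytes: a weighted, directed edge list."""
    edges = defaultdict(int)
    for src, _sport, dst, _dport, nbytes in flows:
        edges[(src, dst)] += nbytes
    return dict(edges)


# Buddy's two flows from the earlier slide.
flows = [
    ("192.168.65.33", 13029, "10.99.205.122", 80, 456),
    ("10.99.205.122", 80, "192.168.65.33", 13029, 63211),
]
edges = build_digraph(flows)
print(edges)
```

Each direction of the conversation becomes its own edge, so the asymmetry between request and response is preserved.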
31
The Internet2/Abilene
network
• TCP/IP network
connecting research and
educational institutions
in the U.S.
– Over 200 universities
and corporate research
labs
• Also provides transit
service between Pacific
Rim and European
networks
32
Why study Abilene?
• Wide-area network that includes both domestic
and international traffic
• Heterogeneous user base including hundreds of
thousands of undergraduates
• High capacity network (10-Gbps fiber-optic links)
that has never been congested
• Research partnership gives access to
(anonymized) traffic data unavailable from
commercial networks
33
Flow collection
Flows are exported in Cisco’s
NetFlow v5 format
and anonymized before being
written to disk.
34
Data dimensions
• Observed Abilene on April 14, 2005
– About 200 terabytes of data exchanged
– This is roughly 25,000 DVDs of information
• 600 million flow records
– Almost 28 gigabytes on disk
– 15 million unique hosts involved
35
Weighted bipartite digraph
37
$s_{\mathrm{in}} = \sum_{i=1}^{M} w_{i,C}$
$s_{\mathrm{out}} = \sum_{j=1}^{N} w_{C,j}$
38
Multiple digraphs
Port 80 (Web)
Port 6346 (Gnutella)
Port 25 (Mail)
Port 19101 (???)
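One way to picture the per-port digraphs is to split the flow records by port before building edges. This sketch assumes each record already carries the application port as a field; the record layout and the name `split_by_port` are hypothetical.

```python
from collections import defaultdict


def split_by_port(records):
    """Build one weighted digraph per port.

    `records` are (client, server, port, nbytes) tuples.
    Returns {port: {(client, server): total_bytes}}.
    """
    graphs = defaultdict(lambda: defaultdict(int))
    for client, server, port, nbytes in records:
        graphs[port][(client, server)] += nbytes
    return {port: dict(edges) for port, edges in graphs.items()}


# Made-up records for two of the ports named above.
records = [
    ("c1", "s1", 80, 100),
    ("c2", "s1", 80, 50),
    ("c1", "s2", 25, 10),
    ("c1", "s1", 80, 25),
]
graphs = split_by_port(records)
print(sorted(graphs))
```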
39
Application correlation
• Consider the out-strength of a client in the
networks for ports p and q:
$s_i^p = \sum_j w_{ij}^p$
$s_i^q = \sum_j w_{ij}^q$
40
Application correlation
• Build a pair of vectors from the distribution of
strength values:
$\vec{p} = (s_1^p, \ldots, s_{|C|}^p)$
$\vec{q} = (s_1^q, \ldots, s_{|C|}^q)$
41
Application correlation
• Examine the cosine similarity of the vectors:
$\sigma(p, q) = \frac{\vec{p} \cdot \vec{q}}{\|\vec{p}\| \, \|\vec{q}\|}$
• When σ = 0, applications p and q are never used
together.
• When σ = 1, applications p and q are always used
together, and to the same extent.
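The two extremes can be checked with a short sketch. It builds one out-strength vector per port over a fixed list of clients and takes the cosine of the angle between them. The clients, servers, and weights are made up, and the function names are hypothetical.

```python
import math


def out_strengths(edges, clients):
    """Vector of out-strengths, one entry per client, in a fixed client order."""
    totals = {c: 0.0 for c in clients}
    for (src, _dst), w in edges.items():
        if src in totals:
            totals[src] += w
    return [totals[c] for c in clients]


def cosine_similarity(p, q):
    """Cosine of the angle between p and q (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


clients = ["a", "b", "c"]
web = out_strengths({("a", "s1"): 10, ("b", "s1"): 20}, clients)
mail = out_strengths({("a", "s3"): 1, ("b", "s3"): 2}, clients)
gnutella = out_strengths({("c", "s2"): 5}, clients)

sigma_web_mail = cosine_similarity(web, mail)          # same clients, proportional use
sigma_web_gnutella = cosine_similarity(web, gnutella)  # no clients in common
print(sigma_web_mail, sigma_web_gnutella)
```

In this toy data the Web and mail vectors point in the same direction, so their similarity is 1 up to rounding, while Web and Gnutella share no clients and score 0.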
42
Clustering applications
• We now have σ(p, q) for every pair of ports
• Convert these similarities into distances:
$d(p, q) = \frac{1}{\sigma(p, q) + \epsilon} - 1$
• If σ = 0, then d is large; if σ = 1, then d = 0
• Now apply Ward’s hierarchical clustering
algorithm
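A pure-Python sketch of this step, under several stated assumptions. The distance is taken to have the form d = 1/(σ + ε) − 1, which matches the two limiting cases on the slide. The value of ε and the clamp at zero are choices made here. The similarity values are made up. The clustering uses the standard Lance-Williams update for Ward's criterion, which is one common way to run Ward's method on a distance table and is not necessarily the implementation used in this work.

```python
import math

EPSILON = 1e-3  # assumption: small constant that keeps d finite when sigma = 0


def similarity_to_distance(sigma, eps=EPSILON):
    """Assumed form d = 1/(sigma + eps) - 1, clamped so it never goes negative."""
    return max(0.0, 1.0 / (sigma + eps) - 1.0)


def ward_cluster(labels, dist):
    """Agglomerative clustering with Ward's criterion (Lance-Williams update).

    `dist` maps frozenset({a, b}) -> distance between labels a and b.
    Returns the merges in order as (members_a, members_b, height).
    """
    clusters = {i: (lab,) for i, lab in enumerate(labels)}
    d = {(i, j): dist[frozenset((labels[i], labels[j]))]
         for i in range(len(labels)) for j in range(i + 1, len(labels))}
    merges, next_id = [], len(labels)
    while len(clusters) > 1:
        (i, j), dij = min(d.items(), key=lambda kv: kv[1])
        ni, nj = len(clusters[i]), len(clusters[j])
        updated = {}
        for k in clusters:
            if k in (i, j):
                continue
            nk = len(clusters[k])
            dki = d[tuple(sorted((k, i)))]
            dkj = d[tuple(sorted((k, j)))]
            val = ((ni + nk) * dki ** 2 + (nj + nk) * dkj ** 2
                   - nk * dij ** 2) / (ni + nj + nk)
            updated[(k, next_id)] = math.sqrt(max(val, 0.0))
        merges.append((clusters[i], clusters[j], dij))
        merged = clusters.pop(i) + clusters.pop(j)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(updated)
        clusters[next_id] = merged
        next_id += 1
    return merges


# Made-up similarities between four ports.
ports = [80, 25, 6346, 19101]
sigma = {
    frozenset((80, 25)): 0.9,
    frozenset((6346, 19101)): 0.8,
    frozenset((80, 6346)): 0.1,
    frozenset((80, 19101)): 0.1,
    frozenset((25, 6346)): 0.1,
    frozenset((25, 19101)): 0.1,
}
dist = {pair: similarity_to_distance(s) for pair, s in sigma.items()}
merges = ward_cluster(ports, dist)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

With these made-up numbers, the two high-similarity pairs merge first and the two resulting groups are joined last.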
43
44
Classifying unknown
applications
• To classify an unknown application, see
what known applications it clusters with
• Our classification experiment
– Take 16 unknown ports
– Guess function based on similarity data
– Validate or invalidate guesses based on external
evidence
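The "guess" step can be approximated very simply by listing the known applications most similar to the unknown port. This is a nearest-neighbor simplification and not a reading of the cluster tree itself; the similarity values and the name `nearest_known` are made up for illustration.

```python
def nearest_known(similarities, top=2):
    """Return the `top` known applications with the highest similarity.

    `similarities` maps a known application name to its similarity
    with the unknown port.
    """
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return ranked[:top]


# Made-up similarity scores between one unknown port and known applications.
sims_unknown_port = {
    "FTP": 0.7,
    "Hotline": 0.6,
    "Web": 0.1,
    "Mail": 0.05,
}
guess_basis = nearest_known(sims_unknown_port)
print(guess_basis)
```

A human still has to turn the neighbors into a guess about function, and then check that guess against external evidence.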
46
Example #1
• Port 388 is coupled with FTP and Hotline
– FTP is a file transfer application
– Hotline is an early file-sharing application
– Our guess: traditional file transfer application
• Actual identity: Unidata/LDM
– Used for moving large meteorological data sets
47
Example #2
• Port 19101 is coupled with instant
messaging and P2P applications
– Our guess: a P2P application that relies on
individual contact for file transfers
• Actual identity: Clubbox
– Korean file-sharing program
– Users trade large files on virtual hard drives
48
49
Overall results
• For our 16 guesses:
– 8 were unambiguously correct
– 6 were partially correct
• These turned out to be trojans and malware
• We learned that IRC + P2P = evil afoot
– 2 could not be confirmed or disproven
• Ports were in transient use during data collection
50
Implications
• We can identify the type of an application
without examining a single packet!
– Scalable
– Preserves user privacy
– Difficult to do with relational view of flow data
51
Broader application
• Generic view of the situation:
– Weighted network of entities derived from
activity with labeled classes of interaction
– Find the sub-network for each labeled class
– Use the network distributions to calculate
similarity scores for the classes
– Use the similarity scores to cluster the classes
– Classify unknown classes using these clusters
57
Thank you!
• Questions and comments…
58