The Magnificent EMM
Margaret H. Dunham
Michael Hahsler, Mallik Kotamarti, Charlie Isaksson
CSE Department
Southern Methodist University
Dallas, Texas 75275
lyle.smu.edu/~mhd
[email protected]
This material is based upon work supported by the National Science Foundation under Grant No IIS-0948893.
3/11/10, BYU
Objectives/Outline
• EMM Overview
• EMM + Stream Clustering
• EMM + Bioinformatics
Objectives/Outline
• EMM Overview
  • Why
  • What
  • How
• EMM + Stream Clustering
• EMM + Bioinformatics
Lots of Questions
• Why don’t data miners practice what they preach?
  • Continuous learning
• Why is training usually viewed as a one time thing?
  • Interleave learning & application
• Why do we usually ignore the temporal aspect of data streams?
  • Add time to online clustering
MM
A first order Markov Chain is a finite or countably infinite sequence of events {E1, E2, …} over discrete time points, where Pij = P(Ej | Ei), and at any time the future behavior of the process is based solely on the current state.
A Markov Model (MM) is a graph with m vertices or states, S, and directed arcs, A, such that:
• S = {N1, N2, …, Nm}, and
• A = {Lij | i ∈ 1, 2, …, m, j ∈ 1, 2, …, m}, and each arc Lij = <Ni, Nj> is labeled with a transition probability Pij = P(Nj | Ni).
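The definition above can be made concrete with a small sketch. The three-state model, its probabilities, and the helper names `is_valid_model` and `path_probability` are illustrative assumptions, not part of the original slides. The sketch stores each arc label Pij in a nested dictionary, checks that the outgoing probabilities of every state sum to 1, and multiplies arc labels along a path, which is all the first order property requires.

```python
import math

# Hypothetical three-state model; P[i][j] plays the role of Pij = P(Nj | Ni).
P = {
    "N1": {"N2": 2 / 3, "N3": 1 / 3},
    "N2": {"N1": 1.0},
    "N3": {"N1": 0.5, "N3": 0.5},
}

def is_valid_model(model):
    """Every state's outgoing transition probabilities must sum to 1."""
    return all(math.isclose(sum(arcs.values()), 1.0) for arcs in model.values())

def path_probability(model, path):
    """Probability of following `path`, given that the process starts in path[0].

    Only the current state is consulted at each step (first order property).
    A missing arc is treated as probability 0.
    """
    prob = 1.0
    for current, nxt in zip(path, path[1:]):
        prob *= model.get(current, {}).get(nxt, 0.0)
    return prob

valid = is_valid_model(P)
p_path = path_probability(P, ["N1", "N2", "N1", "N3"])  # (2/3) * 1 * (1/3)
```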
Problem with Markov Chains
• The required structure of the MC may not be certain at model construction time.
• As the real world being modeled by the MC changes, so should the structure of the MC.
• Not scalable: the MC grows linearly with the number of events.
• Our solution:
  • Extensible Markov Model (EMM)
  • Cluster real world events
  • Allow Markov chain to grow and shrink dynamically
EMM (Extensible Markov Model)
• Time Varying Discrete First Order Markov Model
• Continuously evolves
• Nodes are clusters of real world states.
• Learning continues during prediction phase.
• Learning:
  • Transition probabilities between nodes
  • Node labels (centroid of cluster)
• Nodes are added and removed as data arrives
EMM Definition
Extensible Markov Model (EMM): at any time t, an EMM consists of an MC with a designated current node, Nn, and algorithms to modify it, where the algorithms include:
• EMMCluster, which defines a technique for matching between input data at time t + 1 and existing states in the MC at time t.
• EMMIncrement, which updates the MC at time t + 1 given the MC at time t and the clustering measure result at time t + 1.
• EMMDecrement, which removes nodes from the EMM when needed.
EMM Cluster
• Nearest neighbor
• If no existing node is “close” enough, create a new node
• The label of a cluster is the centroid of the members in the cluster
• O(n), where n is the number of states
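A minimal sketch of the nearest neighbor step described above. The function name `emm_cluster`, the Euclidean distance, and the threshold value of 2.0 are assumptions made for illustration; the slides do not fix a distance measure or a threshold. The sketch scans all existing nodes once (O(n)), creates a new node if none is close enough, and otherwise updates the matched node's centroid incrementally.

```python
import math

def emm_cluster(point, nodes, threshold):
    """Return the index of the node that `point` is assigned to.

    `nodes` is a list of dicts with keys 'centroid' and 'count'; it is
    modified in place. Euclidean distance is an assumption of this sketch.
    """
    best, best_dist = None, math.inf
    for idx, node in enumerate(nodes):           # one pass over n states
        dist = math.dist(point, node["centroid"])
        if dist < best_dist:
            best, best_dist = idx, dist
    if best is None or best_dist > threshold:    # nothing "close": new node
        nodes.append({"centroid": list(point), "count": 1})
        return len(nodes) - 1
    node = nodes[best]
    node["count"] += 1
    # Running mean keeps the label equal to the centroid of the members.
    node["centroid"] = [c + (x - c) / node["count"]
                        for c, x in zip(node["centroid"], point)]
    return best

# Example vectors in the style of the count vectors used in this talk.
stream = [
    (18, 10, 3, 3, 1, 0, 0),
    (17, 10, 2, 3, 1, 0, 0),
    (16, 9, 2, 3, 1, 0, 0),
    (14, 8, 2, 3, 1, 0, 0),
]
nodes = []
assignments = [emm_cluster(p, nodes, threshold=2.0) for p in stream]
```

With this hypothetical threshold the first three vectors share one node and the fourth starts a second node; a different threshold or distance measure would give a different grouping.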
EMM Increment
[Figure: an EMM with nodes N1, N2, and N3 and arcs labeled with transition probabilities. The input vectors shown include <18,10,3,3,1,0,0>, <17,10,2,3,1,0,0>, <16,9,2,3,1,0,0>, <14,8,2,3,1,0,0>, <14,8,2,3,0,0,0>, and <18,10,3,3,1,1,0>.]
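The increment step can be sketched as a pair of counters plus a current node pointer. The class name `EMM`, its attribute names, and the node labels in the example are hypothetical. The sketch takes the node chosen by the clustering step as input, increments the count of the arc from the current node to that node, and then makes that node the current node.

```python
from collections import defaultdict

class EMM:
    """Minimal incremental Markov chain; a sketch, not the authors' code."""

    def __init__(self):
        self.node_counts = defaultdict(int)   # how often each node was visited
        self.arc_counts = defaultdict(int)    # (i, j) -> number of i -> j moves
        self.current = None                   # designated current node

    def increment(self, node):
        """Update the chain with the node returned by the clustering step."""
        if self.current is not None:
            self.arc_counts[(self.current, node)] += 1
        self.node_counts[node] += 1
        self.current = node

    def transition_probability(self, i, j):
        """Estimate Pij as count(i -> j) / count(all arcs leaving i)."""
        out = sum(c for (src, _), c in self.arc_counts.items() if src == i)
        return self.arc_counts.get((i, j), 0) / out if out else 0.0

emm = EMM()
for label in ["N1", "N2", "N1", "N3", "N1", "N2"]:
    emm.increment(label)
p12 = emm.transition_probability("N1", "N2")  # 2 of the 3 moves out of N1
```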
EMMDecrement
[Figure: an EMM with nodes N1, N2, N3, N5, and N6, shown before and after node N2 is deleted ("Delete N2"); arcs are labeled with transition probabilities.]
EMM Advantages
• Dynamic
• Adaptable
• Use of clustering
• Learns rare events
• Scalable:
  • Growth of EMM is not linear in the size of the data.
  • Hierarchical feature of EMM
• Creation/evaluation quasi-real time
• Distributed / Hierarchical extensions
EMM Sublinear Growth
Servent Data
Growth Rate Automobile Traffic
Minnesota Traffic Data
EMM River Prediction
[Figure: water level (m), 0 to 8, plotted over the input time series, comparing RLF prediction, EMM prediction, and observed values.]
Determining Rare Event
• Occurrence Frequency (OF_i) of an EMM state S_i is the normalized count of the state:
  $OF_i = \frac{n_i}{\sum_j n_j}$
• Normalized Transition Probability (NTP_{m,n}) from one state, S_m, to another, S_n, is a normalized transition count:
  $NTP_{m,n} = \frac{C_{m,n}}{\sum_j n_j}$
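Both measures are simple ratios, as the following sketch shows. The function names, the example counts, and the cutoff `RARE_THRESHOLD` are illustrative assumptions; the slides define the two measures but do not state how a threshold is chosen. Here `node_counts` holds the state counts n_i and `arc_counts` holds the transition counts C_{m,n}.

```python
def occurrence_frequency(node_counts):
    """OF_i = n_i / (sum of all state counts)."""
    total = sum(node_counts.values())
    return {state: n / total for state, n in node_counts.items()}

def normalized_transition_probability(arc_counts, node_counts):
    """NTP_{m,n} = C_{m,n} / (sum of all state counts)."""
    total = sum(node_counts.values())
    return {arc: c / total for arc, c in arc_counts.items()}

# Hypothetical counts from the state sequence N1 N2 N1 N3 N1 N2.
node_counts = {"N1": 3, "N2": 2, "N3": 1}
arc_counts = {("N1", "N2"): 2, ("N2", "N1"): 1,
              ("N1", "N3"): 1, ("N3", "N1"): 1}

of = occurrence_frequency(node_counts)
ntp = normalized_transition_probability(arc_counts, node_counts)

RARE_THRESHOLD = 0.2  # assumed cutoff, for illustration only
rare_states = sorted(s for s, f in of.items() if f < RARE_THRESHOLD)
```

A state or transition with a low score is seldom seen, which is one plausible way to flag it as a candidate rare event.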
EMM Rare Event Detection
Ozone Data: UCI ML, Jaccard similarity, 2536 instances, 73 attributes, 73 ozone days
Intrusion Data: Train DARPA 1999, Test DARPA 2000
Objectives/Outline
• EMM Overview
• EMM + Stream Clustering
  • Handle evolving clusters
  • Incorporate time in clustering
• EMM + Bioinformatics
Stream Data
A growing number of applications generate streams of data.
• Computer network monitoring data
• Call detail records in telecommunications
• Highway transportation traffic data
• Online web purchase log records
• Sensor network data
• Stock exchange, transactions in retail chains, ATM operations in banks, credit card transactions.
Clustering techniques play a key role in modeling and analyzing this data.
Stream Data Format
• Events arriving in a stream
• At any time, t, we can view the state of the problem as represented by a vector of n numeric values:
  Vt = <S1t, S2t, ..., Snt>

        S1    S2    …    Sn
  V1    S11   S21   …    Sn1
  V2    S12   S22   …    Sn2
  …     …     …     …    …
  Vq    S1q   S2q   …    Snq
  (Time increases from V1 to Vq.)
Traditional Clustering
TRAC-DS (Temporal Relationship Among Clusters for Data Streams)
Motivation
• Temporal ordering is a major feature of stream data.
• Many stream applications depend on this ordering
  • Prediction of future values
  • Anomaly (rare event) detection
  • Concept drift
Stream Clustering Requirements
• Dynamic updating of the clusters
• Completely online
• Identify outliers
• Identify concept drifts
• Barbara [2]:
  • compactness
  • fast
  • incremental processing
Data Stream Clustering
• At each point in time a data stream clustering ζ is a partitioning of D', the data seen thus far.
• Instead of the whole partitions C1, C2, ..., Ck only synopses Cc1, Cc2, ..., Cck are available, and k is allowed to change over time.
• The summaries Cci with i = 1, 2, ..., k typically contain information about the size, distribution and location of the data points in Ci.
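One common way to keep such a synopsis is to store only a point count, a per-dimension sum, and a per-dimension sum of squares. The class `ClusterSummary` below is a hypothetical sketch along those lines and is not taken from the slides. It shows that size, location (centroid), and distribution (standard deviation) can all be recovered without keeping the data points themselves.

```python
import math

class ClusterSummary:
    """Constant-size synopsis of one cluster; a sketch under stated assumptions."""

    def __init__(self, dim):
        self.n = 0                       # size
        self.linear_sum = [0.0] * dim    # per-dimension sum of values
        self.square_sum = [0.0] * dim    # per-dimension sum of squared values

    def add(self, point):
        """Absorb one data point; the point itself is not stored."""
        self.n += 1
        for d, x in enumerate(point):
            self.linear_sum[d] += x
            self.square_sum[d] += x * x

    def centroid(self):
        """Location of the cluster."""
        return [s / self.n for s in self.linear_sum]

    def std_dev(self):
        """Spread of the cluster in each dimension (population std. dev.)."""
        return [math.sqrt(max(q / self.n - (s / self.n) ** 2, 0.0))
                for s, q in zip(self.linear_sum, self.square_sum)]

summary = ClusterSummary(dim=2)
for point in [(1.0, 2.0), (3.0, 4.0)]:
    summary.add(point)
```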
TRAC-DS NOTE
• TRAC-DS is not:
  • Another stream clustering algorithm
• TRAC-DS is:
  • A new way of looking at clustering
  • Built on top of an existing clustering algorithm
• TRAC-DS may be used with any stream clustering algorithm
TRAC-DS Overview
TRAC-DS Definition
Given a data stream clustering ζ, a temporal relationship among clusters (TRAC-DS) overlays ζ with an EMM M in such a way that the following are satisfied:
(1) There is a one-to-one correspondence between the clusters in ζ and the states S in M.
(2) A transition aij in the EMM M represents the probability that, given a data point in cluster i, the next data point in the data stream will belong to cluster j, with i, j = 1, 2, ..., k.
(3) The EMM M is created online together with the data stream clustering.
Stream Clustering Operations *
• q_assign_point(ζ, x): Assigns the new data point x to an existing cluster.
• q_new_cluster(ζ, x): Creates a new cluster.
• q_remove_cluster(ζ, x): Removes a cluster. Here x is the cluster, i, to be removed. In this case the associated summary Cci is removed from ζ and k is decremented by one.
• q_merge_clusters(ζ, x): Merges two clusters.
• q_fade_clusters(ζ, x): Fades the cluster structure.
• q_split_clusters(ζ, x): Splits a cluster.
* Inspired by MONIC [13]
TRAC-DS Operations
• r_assign_point(M, s_c, y): Assigns the new data point to the state representing an existing cluster.
• r_new_cluster(M, s_c, y): Creates a state for a new cluster.
• r_remove_cluster(M, s_c, y): Removes a state.
• r_merge_clusters(M, s_c, y): Merges two states.
• r_fade_clusters(M, s_c, y): Fades the transition probabilities using an exponential decay $f(t) = 2^{-\lambda t}$.
• r_split_clusters(M, s_c, y): Splits states. Y clustering operations.
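Four of these operations (new cluster, assign point, remove cluster, fade) can be sketched in a few lines. The class `TracDSModel` and its method names are hypothetical Python stand-ins for the operations on the slide. Applying the decay f(t) = 2^(-λt) to transition counts rather than directly to probabilities is an assumption of this sketch: older transitions then weigh less once new data arrives. Merge and split are omitted.

```python
from collections import defaultdict

class TracDSModel:
    """Sketch of a few TRAC-DS state operations; not the authors' code."""

    def __init__(self, decay_rate=0.0):
        self.decay_rate = decay_rate        # lambda in f(t) = 2^(-lambda * t)
        self.states = set()
        self.counts = defaultdict(float)    # (i, j) -> faded transition count
        self.current = None                 # current state s_c

    def new_cluster(self, state):
        self.states.add(state)

    def assign_point(self, state):
        """Record a move from the current state to `state`."""
        if state not in self.states:
            raise KeyError(f"unknown state: {state!r}")
        if self.current is not None:
            self.counts[(self.current, state)] += 1.0
        self.current = state

    def remove_cluster(self, state):
        """Drop a state together with every arc entering or leaving it."""
        self.states.discard(state)
        for arc in [a for a in self.counts if state in a]:
            del self.counts[arc]
        if self.current == state:
            self.current = None

    def fade(self, t=1.0):
        """Multiply every transition count by 2^(-lambda * t)."""
        factor = 2.0 ** (-self.decay_rate * t)
        for arc in self.counts:
            self.counts[arc] *= factor

    def probability(self, i, j):
        out = sum(c for (src, _), c in self.counts.items() if src == i)
        return self.counts.get((i, j), 0.0) / out if out else 0.0

model = TracDSModel(decay_rate=1.0)
for name in "abc":
    model.new_cluster(name)
for name in "abacaba":
    model.assign_point(name)
model.fade(t=1.0)          # halves all counts, since 2^(-1) = 0.5
model.assign_point("c")    # a fresh a -> c transition at full weight
p_ac = model.probability("a", "c")
```

Without fading, the estimate for a -> c in this example would be 2/4; with fading the recent transition counts for more, giving 1.5/2.5.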
TRAC-DS Example
Objectives/Outline
• EMM Overview
• EMM + Stream Clustering
• EMM + Bioinformatics
  • Background
  • Preprocessing
  • Classification
  • Differentiation
DNA
• Basic building blocks of organisms
• Located in nucleus of cells
• Composed of 4 nucleotides
• Two strands bound together
http://www.visionlearning.com/library/module_viewer.php?mid=63
Central Dogma: DNA -> RNA -> Protein
DNA: CCTGAGCCAACTATTGATGAA
  (transcription)
RNA: CCUGAGCCAACUAUUGAUGAA
  (translation)
Protein (amino acids)
Source: www.bioalgorithms.info; chapter 6; Gene Prediction
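The transcription step shown on this slide amounts to replacing T with U in the sequence. The helper name `transcribe` is an assumption; the sketch follows the slide's convention of writing the RNA in the same order as the DNA string shown. It also splits the RNA into three-letter codons, the units read during translation, without attempting the codon-to-amino-acid lookup.

```python
def transcribe(dna):
    """Return the RNA string obtained by replacing thymine (T) with uracil (U)."""
    return dna.upper().replace("T", "U")

rna = transcribe("CCTGAGCCAACTATTGATGAA")

# Translation reads the RNA three nucleotides (one codon) at a time.
codons = [rna[i:i + 3] for i in range(0, len(rna), 3)]
```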
RNA
• Ribonucleic Acid
• Contains A, C, G but U (Uracil) instead of T
• Single stranded
• May fold back on itself
• Needed to create proteins
• Moves around cells; can act like a messenger
• mRNA: moves out of the nucleus to other parts of the cell
The Magical 16S
• Ribosomal RNA (rRNA) is at the heart of the protein creation process
• 16S rRNA
  • About 1542 nucleotides in length
  • In all living organisms
  • Important in the classification of organisms into phyla and class
• PROBLEM: An organism may actually contain many different copies of 16S, each slightly different.
• OUR WORK: Can we use EMM to quantify this diversity? Can we use it to classify different species of the same genus?
Using EMM with RNA Data
acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga
Moving Window
               A    C    G    T
  Pos 0-8      2    3    3    1
  Pos 1-9      1    3    3    2
  …
  Pos 34-42    2    4    2    1
Construct EMM with nodes representing clusters of count vectors
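The moving window preprocessing can be sketched as follows. The function name `window_counts` and its parameters are assumptions; the window width of 9 and the A, C, G, T ordering come from the slide. Each window position yields one count vector, and these vectors are the inputs that would then be clustered into EMM nodes.

```python
from collections import Counter

def window_counts(sequence, width=9, alphabet="ACGT"):
    """Slide a window over `sequence` and count each base in every window.

    Returns one tuple per window position, ordered as in `alphabet`.
    """
    seq = sequence.upper()
    vectors = []
    for start in range(len(seq) - width + 1):
        counts = Counter(seq[start:start + width])
        vectors.append(tuple(counts[base] for base in alphabet))
    return vectors

sequence = "acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga"
vectors = window_counts(sequence)
first, second, last = vectors[0], vectors[1], vectors[-1]
```

For the 43-base sequence above this gives 35 windows, from positions 0-8 through 34-42.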
EMM for Classification
TRAC-DS and Bioinformatics
• Efficient
  • Alignment free sequence analysis
  • Clustering reduces size of model
• Flexible
  • Any sequence
  • Applicability to Metagenomics
  • Scoring based on similarity between EMMs or EMM and input sequence
• Applications
  • Classification
  • Differentiation
Profile EMMs for Organism Classification
Profile EMM – E. coli
Differentiating Strains
• Is it possible to identify different species of the same genus?
• Initial test with EMM:
  • Bacillus has 21 species
  • Construct an EMM for each species using the training set (64%)
  • Test by matching unknown strains (36%) and placing each in the closest EMM
  • All unknown strains correctly classified except one: accuracy of 95%
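The "place in the closest model" step can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the scoring rule (average log transition probability with a small floor for unseen transitions), the function names, and the use of single letters as states in place of clusters of count vectors. The slides do not specify the scoring function actually used.

```python
import math
from collections import defaultdict

def train(sequence):
    """Count state-to-state transitions in one training sequence."""
    counts = defaultdict(lambda: defaultdict(int))
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    return counts

def score(model, sequence, floor=1e-6):
    """Average log transition probability of `sequence` under `model`.

    Unseen transitions get probability `floor` (an assumed smoothing choice).
    """
    total, steps = 0.0, 0
    for current, nxt in zip(sequence, sequence[1:]):
        row = model.get(current, {})
        out = sum(row.values())
        p = row.get(nxt, 0) / out if out else 0.0
        total += math.log(max(p, floor))
        steps += 1
    return total / steps if steps else float("-inf")

def classify(models, sequence):
    """Return the name of the model with the highest score."""
    return max(models, key=lambda name: score(models[name], sequence))

# Hypothetical training data: one state sequence per species.
models = {
    "species_a": train("ABABABAB"),
    "species_b": train("AABBAABB"),
}
label_1 = classify(models, "ABAB")
label_2 = classify(models, "AABB")
```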
Bibliography
1) C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “A framework for clustering evolving data streams,” Proceedings of the International Conference on Very Large Data Bases (VLDB), pp 81-92, 2003.
2) D. Barbara, “Requirements for clustering data streams,” SIGKDD Explorations, Vol 3, No 2, pp 23-27, 2002.
3) Margaret H. Dunham, Donya Quick, Yuhang Wang, Monnie McGee, Jim Waddle, “Visualization of DNA/RNA Structure using Temporal CGRs,” Proceedings of the IEEE 6th Symposium on Bioinformatics & Bioengineering (BIBE06), October 16-18, 2006, Washington D.C., pp 171-178.
4) S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams: Theory and practice,” IEEE Transactions on Knowledge and Data Engineering, Vol 15, No 3, pp 515-528, 2003.
5) Michael Hahsler and Margaret H. Dunham, “TRACDS: Temporal Relationship Among Clusters for Data Streams,” October 2009, submitted to SIAM International Conference on Data Mining.
6) Jie Huang, Yu Meng, and Margaret H. Dunham, “Extensible Markov Model,” Proceedings IEEE ICDM Conference, November 2004, pp 371-374.
7) Charlie Isaksson, Yu Meng, and Margaret H. Dunham, “Risk Leveling of Network Traffic Anomalies,” International Journal of Computer Science and Network Security, Vol 6, No 6, June 2006, pp 258-265.
8) Charlie Isaksson and Margaret H. Dunham, “A Comparative Study of Outlier Detection,” July 2009, Proceedings of the IEEE MLDM Conference, pp 440-453.
9) Mallik Kotamarti, Douglas W. Raiford, M. L. Raymer, and Margaret H. Dunham, “A Data Mining Approach to Predicting Phylum for Microbial Organisms Using Genome-Wide Sequence Data,” Proceedings of the IEEE Ninth International Conference on Bioinformatics and Bioengineering, pp 161-167, June 22-24 2009.
10) Yu Meng and Margaret H. Dunham, “Efficient Mining of Emerging Events in a Dynamic Spatiotemporal,” Proceedings of the IEEE PAKDD Conference, April 2006, Singapore. (Also in Lecture Notes in Computer Science, Vol 3918, 2006, Springer Berlin/Heidelberg, pp 750-754.)
11) Yu Meng and Margaret H. Dunham, “Mining Developing Trends of Dynamic Spatiotemporal Data Streams,” Journal of Computers, Vol 1, No 3, June 2006, pp 43-50.
12) MIT Lincoln Laboratory, DARPA Intrusion Detection Evaluation, http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/index.html, (2008).
13) M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and R. Schult, “MONIC: Modeling and monitoring cluster transitions,” Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, pp 706-711, 2006.