
Topic Models Based
Personalized Spam Filter
Sudarsun. S
Director – R & D, Checktronix India Pvt Ltd, Chennai
Venkatesh Prabhu. G
Research Associate, Checktronix India Pvt Ltd, Chennai
Valarmathi B
Professor, SKP Engineering College, Thiruvannamalai
ISCF - 2006
What is Spam ?
 Unsolicited, unwanted email
What is Spam Filtering ?
 Detection/Filtering of unsolicited content
What’s Personalized Spam Filtering ?
 Definition of “unsolicited” becomes personal
Approaches
 Origin-Based Filtering [ Generic ]
 Content-Based Filtering [ Personalized ]
Content-Based Filtering
What does the message contain ?
Images, Text, URL
Is it “irrelevant” to my preferences ?
How to define relevancy ?
How does the system understand relevancy ?
Supervised Learning
Teach the system about what I like and what I don’t
Unsupervised Learning
Decision made using latent patterns
Content-Based Filtering -- Methods
Bayesian Spam Filtering
 Simplest design / low computation cost
 Based on keyword distribution (see the sketch after this list)
 Cannot work on contexts
 Accuracy is around 60%
Topic Models based Text Mining
 Based on distribution of n-grams (key phrases)
 Addresses synonymy and polysemy
 Low run-time computation cost
 Unsupervised technique
Rule based Filtering
 Supervised technique based on hand-written rules
 Best accuracy for known cases
 Cannot adapt to new patterns
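As a point of reference for the Bayesian method above, here is a minimal keyword-distribution sketch (multinomial Naive Bayes with Laplace smoothing). The function names, the "spam"/"ham" labels, and the whitespace tokenizer are illustrative assumptions, not the implementation discussed in this talk.

```python
# Minimal keyword-distribution Bayesian filter (illustrative only);
# assumes both classes appear in the labelled training mails.
import math
from collections import Counter

def train_nb(mails, labels):
    priors = Counter(labels)                      # class frequencies
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in zip(mails, labels):
        counts[label].update(text.lower().split())
    return priors, counts

def classify_nb(text, priors, counts):
    vocab = set(counts["spam"]) | set(counts["ham"])
    total = sum(priors.values())
    scores = {}
    for label in ("spam", "ham"):
        n_words = sum(counts[label].values())
        log_p = math.log(priors[label] / total)
        for w in text.lower().split():
            # Laplace-smoothed keyword likelihood
            log_p += math.log((counts[label][w] + 1) / (n_words + len(vocab)))
        scores[label] = log_p
    return max(scores, key=scores.get)
```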
Topic Models
 Treats every word as a feature
 Represents the corpus as a higher-dimensional distribution
 SVD: Decomposes the higher-dimensional data into a small reduced sub-space containing only the dominant feature vectors
 PLSA: Documents can be understood as a mixture of topics
Rule Based Approaches
 N-Grams – Language Model Approach
 The more N-grams two texts have in common, the closer their patterns are.
LSA Model, In Brief
 Describes the underlying structure of text.
 Computes similarities between texts.
 Represents documents in a high-dimensional Semantic Space (Term–Document Matrix).
 The high-dimensional space is approximated by a low-dimensional space using Singular Value Decomposition (SVD).
 Decomposes the high-dimensional TDM into the U, S, V matrices (see the sketch below):
U: Left Singular Vectors ( reduced word vectors )
V: Right Singular Vectors ( reduced document vectors )
S: Array of Singular Values ( variances or scaling factors )
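A minimal sketch of this decomposition, assuming a NumPy term-document matrix A (rows = terms, columns = documents) and the standard fold-in formula for projecting new documents into the reduced space; the function names are illustrative.

```python
import numpy as np

def lsa_decompose(A, r):
    # U: reduced word vectors, s: singular values, Vt: reduced document vectors
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r, :]

def fold_in(doc_vec, U_r, s_r):
    # Project a new (pseudo-)document of raw term counts into the r-dim space.
    return doc_vec @ U_r / s_r

def cosine(a, b):
    # Similarity between two documents in the reduced semantic space.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```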
PLSA Model
 In the PLSA model, a document is a mixture of topics, and topics generate words.
The probabilistic latent factor model can be described as the following generative model:
 Select a document di from D with probability Pr(di).
 Pick a latent factor zk with probability Pr(zk | di).
 Generate a word wj from W with probability Pr(wj | zk).
where
$$\Pr(d_i, w_j) = \Pr(d_i)\,\Pr(w_j \mid d_i), \qquad \Pr(w_j \mid d_i) = \sum_{k=1}^{l} \Pr(w_j \mid z_k)\,\Pr(z_k \mid d_i)$$
 The aspect model parameters are computed using the EM algorithm (a sketch follows).
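A compact sketch of the EM fit for the aspect model above; the dense document-word count matrix n, the variable names, and the fixed iteration count are simplifying assumptions for illustration.

```python
import numpy as np

def plsa_em(n, K, iters=50, seed=0):
    # n: D x W document-word count matrix, K: number of latent factors z_k
    rng = np.random.default_rng(seed)
    D, W = n.shape
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: Pr(z_k | d_i, w_j), normalised over k
        p_z_dw = p_z_d[:, :, None] * p_w_z[None, :, :]        # D x K x W
        p_z_dw /= p_z_dw.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate Pr(w_j | z_k) and Pr(z_k | d_i)
        weighted = n[:, None, :] * p_z_dw                      # D x K x W
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d
```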
N-Gram Approach
 Language Model Approach
 Looks for repeated patterns
 Each word depends probabilistically on the n-1 preceding words:
$$P(w_1 \ldots w_n) = \prod_{i} P(w_i \mid w_{i-n+1} \ldots w_{i-1})$$
 Calculating and comparing the N-gram profiles (see the sketch below).
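A small sketch of building and comparing N-gram profiles; word trigrams and an L1 distance over relative frequencies are assumed choices, not necessarily the exact profile comparison used here.

```python
from collections import Counter

def ngram_profile(text, n=3):
    words = text.lower().split()
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    total = sum(grams.values()) or 1
    return {g: c / total for g, c in grams.items()}   # relative frequencies

def profile_distance(p, q):
    # Smaller distance means the two texts share more common N-grams.
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))
```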
Overall System Architecture
[Architecture diagram: Training Mails and the Test Mail go through a Preprocessor and are fed in parallel to the LSA Model, PLSA Model, N-Gram and other classifiers; a Combiner merges their predictions into the Final Result.]
Preprocessing
Feature Extraction
 Tokenizing
Feature Selection
 Pruning
 Stemming
 Weighting
Feature Representation
 Term Document Matrix Generation
Sub Spacing
 LSA / PLSA Model Projection
Feature Reduction
 Principal Component Analysis (a sketch of the full preprocessing flow follows this list)
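A rough sketch of the preprocessing stages listed above (tokenizing, pruning, stemming, weighting, TDM generation); the stopword list, the crude suffix-stripping "stemmer", and tf-idf weighting are stand-in assumptions for illustration.

```python
import math, re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def prune_and_stem(tokens):
    # crude suffix stripping, for illustration only
    return [t.rstrip("s") for t in tokens if t not in STOPWORDS and len(t) > 2]

def term_document_matrix(mails):
    docs = [Counter(prune_and_stem(tokenize(m))) for m in mails]
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in d)      # document frequency per term
    n = len(docs)
    # tf-idf weighted TDM: rows = terms, columns = documents
    tdm = [[d[t] * math.log(n / df[t]) for d in docs] for t in vocab]
    return vocab, tdm
```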
Principal Component Analysis - PCA
Data Reduction - Ignore the features of lesser significance
 Given N data vectors in k dimensions, find c <= k orthogonal vectors that best represent the data
 The original data set is reduced to N data vectors on c principal components (reduced dimensions)
Detects structure in the relationships between variables, which is then used to classify the data (see the sketch below).
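A minimal PCA sketch via eigendecomposition of the covariance matrix, keeping the c components of largest variance; the function and variable names are illustrative.

```python
import numpy as np

def pca_fit(X, c):
    # X: N x k data matrix (rows = samples); returns mean and top-c components
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:c]        # c most significant directions
    return mean, eigvecs[:, order]

def pca_transform(X, mean, components):
    return (X - mean) @ components                # N x c reduced data
```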
LSA Classification
[Pipeline diagram: Input Mails are turned into a Token List and projected through the LSA Model (M x R; M = vocab size, R = rank) into a 1 x R vector; PCA (R x R'; R = input size, R' = output size) reduces it to a 1 x R' vector, which a BPN neural network converts into a Score.]
PLSA Classification
[Pipeline diagram: Input Mails are turned into a Token List and projected through the PLSA Model (M x Z; M = vocab size, Z = aspects count) into a 1 x Z vector; PCA (Z x Z'; Z = input size, Z' = output size) reduces it to a 1 x Z' vector, which a BPN neural network converts into a Score.]
(P)LSA Classification
Model Training
 Build the global (P)LSA model using the training mails.
 Vectorize the training mails using the LSI/PLSA model.
 Reduce the dimensionality of the matrix of pseudo-vectors of the training documents using PCA.
 Feed the reduced matrix into the neural network for learning.
Model Testing
 Test mails are fed to the (P)LSA model for vectorization.
 The vector is reduced using the PCA model.
 The reduced vector is fed into the BPN neural network.
 The BPN network emits its prediction with a confidence score (see the pipeline sketch below).
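An end-to-end sketch of this training/testing flow using scikit-learn stand-ins (TfidfVectorizer for the weighted TDM, TruncatedSVD for the LSA projection, PCA, and MLPClassifier in place of the BPN); these components and sizes are assumptions, not the modules used by the authors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

def build_classifier(rank=100, reduced=30):
    # rank must stay below the vocabulary size; reduced below rank and
    # below the number of training mails.
    return make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(n_components=rank),
        PCA(n_components=reduced),
        MLPClassifier(hidden_layer_sizes=(50,), max_iter=500),
    )

# clf = build_classifier()
# clf.fit(train_mails, train_labels)       # labels: "spam" / "ham"
# clf.predict_proba([test_mail])           # prediction with confidence
```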
N-Gram method
 Construct an N-gram tree out of the training docs
 Documents make the leaves
 Nodes are the N-grams identified in the docs
 Weight of an N-gram = number of children
 A higher-order N-gram implies more weight
 Weight update: $W_t \leftarrow W_t \cdot S / (S + L)$ (see the sketch below), where
P: total number of docs sharing the N-gram
S: number of SPAM docs sharing the N-gram
L: P - S
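A one-function sketch of this spam-bias re-weighting rule; the argument names and the document-set representation are assumed for illustration.

```python
def reweight(weight, spam_docs, ham_docs):
    # Wt <- Wt * S / (S + L), where P = S + L docs share this N-gram
    S, L = len(spam_docs), len(ham_docs)
    return weight * S / (S + L) if (S + L) else 0.0

# A node shared by 4 spam and 1 ham mail keeps 80% of its weight:
# reweight(3.0, spam_docs=[1, 2, 3, 4], ham_docs=[5])  ->  2.4
```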
An Example N-Gram Tree
[Tree diagram: internal nodes N1-N4 hold the identified 1st-, 2nd- and 3rd-order N-grams; documents T1-T5 form the leaves.]
Combiner
 Mixture of Experts
 Get predictions from all the experts
 Use the maximum common prediction
 Use the prediction with the maximum confidence score (see the sketch below)
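A small sketch of the combiner logic: majority vote across the experts, falling back to the most confident expert; the (label, confidence) tuples are an assumed interface for the individual classifiers.

```python
from collections import Counter

def combine(predictions):
    # predictions: list of (label, confidence) pairs, one per expert
    votes = Counter(label for label, _ in predictions)
    top, count = votes.most_common(1)[0]
    if count > len(predictions) // 2:
        return top                                    # maximum common prediction
    return max(predictions, key=lambda p: p[1])[0]    # maximum confidence score
```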
Conclusion
 The objective is to filter mail messages based on the preferences of an individual
 Classification performance increases with
increased (incremental) training
 Initial learning is not necessary for LSA, PLSA &
N-Gram.
 Performs unsupervised filtering
 Performs fast prediction although background
training is a relatively slower process
References
[1] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos, "An Evaluation of Naïve Bayesian Anti-Spam Filtering", Proc. of the Workshop on Machine Learning in the New Information Age, 2000.
[2] W. Cohen, "Learning Rules that Classify E-mail", AAAI Spring Symposium on Machine Learning in Information Access, 1996.
[3] W. Daelemans, J. Zavrel, K. van der Sloot, and A. van den Bosch, "TiMBL: Tilburg Memory-Based Learner - version 4.0 Reference Guide", 2001.
[4] H. Drucker, D. Wu, and V. N. Vapnik, "Support Vector Machines for Spam Categorization", IEEE Trans. on Neural Networks, 1999.
[5] D. Mertz, "Spam Filtering Techniques. Six Approaches to Eliminating Unwanted E-mail", Gnosis Software Inc., September 2002.
[6] M. Vinther, "Junk Detection Using Neural Networks", MeeSoft Technical Report, June 2002. Available: http://logicnet.dk/reports/JunkDetection/JunkDetection.htm
[7] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, 41, 391-407, 1990.
[8] Sudarsun Santhiappan, Venkatesh Prabhu Gopalan, and Sathish Kumar Veeraswamy, "Role of Weighting on TDM in Improvising Performance of LSA on Text Data", Proceedings of IEEE INDICON 2006.
[9] Thomas Hofmann, "Probabilistic Latent Semantic Indexing", Proc. 22nd Int'l SIGIR Conf. on Research and Development in Information Retrieval, 1999.
[10] Sudarsun Santhiappan, Dalou Kalaivendhan, and Venkateswarlu Malapatti, "Unsupervised Contextual Keyword Relevance Learning and Measurement using PLSA", Proceedings of IEEE INDICON 2006.
[11] T. K. Landauer, P. W. Foltz, and D. Laham, "Introduction to Latent Semantic Analysis", Discourse Processes, 25, 259-284, 1998.
[12] G. Furnas, S. Deerwester, S. Dumais, T. Landauer, R. Harshman, L. Streeter, and K. Lochbaum, "Information Retrieval Using a Singular Value Decomposition Model of Latent Semantic Structure", Proc. 11th International Conference on Research and Development in Information Retrieval, Grenoble, France: ACM Press, pp. 465-480, 1988.
[13] M. Damashek, "Gauging Similarity via N-Grams: Language-Independent Sorting, Categorization and Retrieval of Text", Science, 267, 843-848.
[14] Shlomo Hershkop and Salvatore J. Stolfo, "Combining Email Models for False Positive Reduction", KDD'05, August 2005.
Any Queries…. ?
You can post your queries to [email protected]