Topic Models Based Personalized Spam Filter
Sudarsun. S
Director – R & D, Checktronix India Pvt Ltd, Chennai
Venkatesh Prabhu. G
Research Associate, Checktronix India Pvt Ltd, Chennai
Valarmathi B
Professor, SKP Engineering College, Thiruvannamalai
ISCF - 2006
What is Spam?
Unsolicited, unwanted email
What is Spam Filtering?
Detection/filtering of unsolicited content
What’s Personalized Spam Filtering?
The definition of “unsolicited” becomes personal
Approaches
Origin-Based Filtering [ Generic ]
Content-Based Filtering [ Personalized ]
Content Based Filtering
What does the message contain?
Images, text, URLs
Is it “irrelevant” to my preferences?
How to define relevancy?
How does the system understand relevancy?
Supervised Learning
Teach the system about what I like and what I don’t
Unsupervised Learning
Decision made using latent patterns
Content-Based Filtering -- Methods
Bayesian Spam Filtering
Simplest design / low computation cost
Based on keyword distribution (a minimal sketch follows this list)
Cannot capture context
Accuracy is around 60%
Topic Models based Text Mining
Based on distribution of n-grams (key phrases)
Addresses synonymy and polysemy
Run-time computation cost is low
Unsupervised technique
Rule-based Filtering
Supervised technique based on hand-written rules
Best accuracy for known cases
Cannot adapt to new patterns
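As referenced above, a minimal sketch of the keyword-distribution idea behind Bayesian filtering, assuming whitespace tokenisation and add-one smoothing; the function names and toy mails are illustrative, not the system described in this talk:

```python
# Hedged sketch: naive Bayes keyword scoring with add-one smoothing.
import math
from collections import Counter

def train(spam_mails, ham_mails):
    spam, ham = Counter(), Counter()
    for m in spam_mails:
        spam.update(m.lower().split())
    for m in ham_mails:
        ham.update(m.lower().split())
    return spam, ham

def spam_score(mail, spam, ham):
    # Sum of per-word log-likelihood ratios; > 0 suggests spam.
    s_total, h_total = sum(spam.values()), sum(ham.values())
    vocab = len(set(spam) | set(ham))
    score = 0.0
    for w in mail.lower().split():
        p_s = (spam[w] + 1) / (s_total + vocab)
        p_h = (ham[w] + 1) / (h_total + vocab)
        score += math.log(p_s / p_h)
    return score

spam_c, ham_c = train(["win cash now"], ["meeting at noon"])
print(spam_score("win free cash", spam_c, ham_c))
```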
Topic Models
Treats every word as a feature
Represents the corpus as a higher-dimensional distribution
SVD: decomposes the higher-dimensional data into a small reduced sub-space containing only the dominant feature vectors
PLSA: documents can be understood as a mixture of topics
Rule Based Approaches
N-Grams – Language Model Approach
The more n-grams two texts have in common, the closer their patterns are.
LSA Model, In Brief
Describes the underlying structure among texts.
Computes similarities between texts.
Represents documents in a high-dimensional semantic space (the Term-Document Matrix).
The high-dimensional space is approximated by a low-dimensional space using Singular Value Decomposition (SVD).
SVD decomposes the high-dimensional TDM into the U, S, V matrices (a minimal sketch follows this list):
U: Left Singular Vectors ( reduced word vectors )
V: Right Singular Vectors ( reduced document vectors )
S: Array of Singular Values ( variances or scaling factors )
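A minimal sketch of the decomposition, assuming a small dense term-document matrix; a real corpus would need sparse storage and a truncated SVD solver:

```python
# Hedged sketch: rank-R truncation of a toy term-document matrix.
import numpy as np

tdm = np.array([[2., 0., 1.],      # rows: terms
                [0., 1., 0.],
                [1., 0., 3.]])     # columns: documents

U, s, Vt = np.linalg.svd(tdm, full_matrices=False)
R = 2                              # keep the R dominant feature vectors
U_r = U[:, :R]                     # reduced word vectors
V_r = Vt[:R, :].T                  # reduced document vectors
S_r = np.diag(s[:R])               # dominant singular values
print(U_r @ S_r @ V_r.T)           # rank-R approximation of the TDM
```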
PLSA Model
In the PLSA model, a document is a mixture of topics, and topics generate words.
The probabilistic latent factor model can be described by the following generative process:
Select a document di from D with probability Pr(di).
Pick a latent factor zk with probability Pr(zk|di).
Generate a word wj from W with probability Pr(wj|zk).
where
$\Pr(d_i, w_j) = \Pr(d_i)\,\Pr(w_j \mid d_i)$,
$\Pr(w_j \mid d_i) = \sum_{k=1}^{l} \Pr(w_j \mid z_k)\,\Pr(z_k \mid d_i)$
The aspect model parameters are computed using the EM algorithm (a minimal sketch follows).
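A minimal sketch of that EM loop over a document-word count matrix, assuming dense numpy arrays; the updates follow the Pr(z|d) and Pr(w|z) parameterisation above:

```python
# Hedged sketch: EM for the PLSA aspect model on a toy count matrix.
import numpy as np

def plsa(n_dw, n_topics, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    p_z_d = rng.random((D, n_topics))          # Pr(z|d)
    p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((n_topics, W))          # Pr(w|z)
    p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(n_iters):
        # E-step: Pr(z|d,w) proportional to Pr(w|z) * Pr(z|d)
        p_z_dw = p_z_d[:, :, None] * p_w_z[None, :, :]
        p_z_dw /= p_z_dw.sum(1, keepdims=True) + 1e-12
        # M-step: re-estimate both distributions from expected counts
        n_zdw = n_dw[:, None, :] * p_z_dw
        p_w_z = n_zdw.sum(0)
        p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = n_zdw.sum(2)
        p_z_d /= p_z_d.sum(1, keepdims=True)
    return p_z_d, p_w_z

counts = np.array([[3., 0., 1.], [0., 2., 2.]])  # toy 2-doc corpus
p_z_d, p_w_z = plsa(counts, n_topics=2)
```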
N–Gram Approach
Language Model Approach
Looks for repeated patterns
Each word depends probabilistically on the n-1 preceding words:
$P(w_1 \ldots w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-n+1} \ldots w_{i-1})$
Classification is done by calculating and comparing the N-gram profiles (see the sketch below).
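A minimal sketch of profile construction and comparison, assuming word-level n-grams and Jaccard overlap of the most frequent entries; both choices are assumptions, not specified on the slide:

```python
# Hedged sketch: N-gram profiles compared by overlap.
from collections import Counter

def ngram_profile(text, n=2, top=50):
    tokens = text.lower().split()
    grams = Counter(tuple(tokens[i:i + n])
                    for i in range(len(tokens) - n + 1))
    return {g for g, _ in grams.most_common(top)}

def similarity(a, b, n=2):
    pa, pb = ngram_profile(a, n), ngram_profile(b, n)
    return len(pa & pb) / max(1, len(pa | pb))  # Jaccard overlap

print(similarity("buy cheap pills now", "buy cheap pills today"))
```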
Overall System Architecture
[Architecture diagram: training mails pass through a Preprocessor, which feeds the LSA model, the PLSA model, the N-Gram module, and optionally other classifiers; a test mail takes the same path, and the Combiner merges the individual predictions into the final result.]
Preprocessing
Feature Extraction: Tokenizing
Feature Selection: Pruning, Stemming, Weighting
Feature Representation: Term-Document Matrix generation
Sub Spacing: LSA / PLSA model projection
Feature Reduction: Principal Component Analysis (a sketch of this chain follows)
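A minimal sketch of this chain on raw mail strings; the stopword list and crude suffix stemmer are toy stand-ins for real feature selection:

```python
# Hedged sketch: tokenise -> prune -> stem -> weighted term-document matrix.
import numpy as np

STOPWORDS = {"the", "a", "is", "to"}

def tokenize(mail):
    return [w.strip(".,!?").lower() for w in mail.split()]

def stem(word):                       # toy stand-in for a real stemmer
    for suf in ("ing", "ed", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[:-len(suf)]
    return word

def build_tdm(mails):
    docs = [[stem(w) for w in tokenize(m) if w not in STOPWORDS]
            for m in mails]
    vocab = sorted({w for d in docs for w in d})
    tdm = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d:
            tdm[vocab.index(w), j] += 1
    df = (tdm > 0).sum(axis=1, keepdims=True)   # document frequency
    return tdm * np.log(len(docs) / df), vocab  # tf-idf weighting

tdm, vocab = build_tdm(["Buy cheap pills", "The meeting is moved"])
```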
Principal Component Analysis - PCA
Data reduction - ignore the features of lesser significance
Given N data vectors in k dimensions, find c <= k orthogonal vectors that best represent the data
The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
Detects structure in the relationships between variables, which is then used to classify the data (a minimal sketch follows)
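A minimal PCA sketch: centre the N x k data, take the top-c right singular vectors as principal directions, and project:

```python
# Hedged sketch: PCA via SVD of the centred data matrix.
import numpy as np

def pca(X, c):
    Xc = X - X.mean(axis=0)                 # centre the N x k data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:c]                     # c principal directions
    return Xc @ components.T, components    # N x c reduced data

X = np.random.default_rng(0).normal(size=(100, 5))
X_red, comps = pca(X, c=2)
print(X_red.shape)                          # (100, 2)
```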
LSA Classification
[Pipeline diagram: input mails are tokenised into a token list and projected through the LSA model (M x R; M: vocab size, R: rank) to a 1 x R vector; PCA (R x R'; R: InVar size, R': OutVar size) reduces it to a 1 x R' vector, from which the BPN produces a score.]
PLSA Classification
[Pipeline diagram: input mails are tokenised into a token list and projected through the PLSA model (M x Z; M: vocab size, Z: aspects count) to a 1 x Z vector; PCA (Z x Z'; Z: InVar size, Z': OutVar size) reduces it to a 1 x Z' vector, from which the BPN produces a score.]
(P)LSA Classification
Model Training
Build the global (P)LSA model using the training mails.
Vectorize the training mails using the LSI/PLSA model.
Reduce the dimensionality of the matrix of pseudo-vectors of the training documents using PCA.
Feed the reduced matrix into a neural network for learning.
Model Testing
The test mail is fed to the (P)LSA model for vectorization.
The vector is reduced using the PCA model.
The reduced vector is fed into the BPN neural network.
The BPN network emits its prediction with a confidence score (a minimal sketch of this test path follows).
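A minimal sketch of that test path, assuming the standard LSA fold-in (v = qᵀ U_R S_R⁻¹); `clf_score` is a hypothetical stand-in for the trained BPN, not the network itself:

```python
# Hedged sketch: fold a test mail into LSA space, reduce with PCA, score.
import numpy as np

def lsa_fold_in(q, U_r, s_r):
    # q: raw term-count vector of the test mail (length M)
    return q @ U_r / s_r                    # 1 x R pseudo-document vector

def classify(q, U_r, s_r, pca_components, clf_score):
    v = lsa_fold_in(q, U_r, s_r)            # project into LSA space
    v_red = v @ pca_components.T            # 1 x R' after PCA
    return clf_score(v_red)                 # confidence from the classifier

w = np.ones(2)                              # toy weights for the stand-in
score = classify(np.array([1., 0., 2.]),    # term counts, M = 3
                 U_r=np.eye(3)[:, :2], s_r=np.array([2., 1.]),
                 pca_components=np.eye(2),
                 clf_score=lambda v: 1 / (1 + np.exp(-v @ w)))
```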
N-Gram method
Construct an N-Gram tree out of training docs
Documents make the leaves
Nodes make the identified N-grams from docs
Weight of an N-gram = number of children
A higher-order N-gram implies more weight
Weight update: $W_t \gets W_t \times S / (S + L)$, where
P: total number of docs sharing an N-gram
S: number of SPAM docs sharing the N-gram
L: P - S, the number of legitimate docs sharing it (see the sketch below)
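A minimal sketch of that weighting rule, assuming the "children" of an n-gram node are the documents sharing it:

```python
# Hedged sketch: weight = child count, scaled by the spam share S/(S+L).
def ngram_weight(docs_sharing, spam_docs):
    P = len(docs_sharing)                  # docs sharing the n-gram
    S = len(docs_sharing & spam_docs)      # spam docs among them
    L = P - S                              # legitimate docs among them
    Wt = P                                 # base weight: number of children
    return Wt * S / (S + L) if P else 0.0

print(ngram_weight({"T1", "T2", "T3"}, {"T1", "T3"}))  # 3 * 2/3 = 2.0
```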
An Example N-Gram Tree
[Tree diagram: internal nodes N1-N4 are identified n-grams of 1st, 2nd and 3rd order; documents T1-T5 form the leaves under the n-grams they share.]
Combiner
Mixture of Experts
Get predictions from all the experts
Use the most common prediction, or
Use the prediction with the maximum confidence score (a minimal combiner sketch follows)
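A minimal combiner sketch implementing both rules: majority vote first, with the maximum-confidence expert breaking ties; the (label, confidence) pair format is an assumption:

```python
# Hedged sketch: mixture-of-experts vote with confidence tie-break.
from collections import Counter

def combine(predictions):
    # predictions: one (label, confidence) pair per expert
    votes = Counter(label for label, _ in predictions)
    (top, n), *rest = votes.most_common()
    if not rest or n > rest[0][1]:          # clear majority wins
        return top
    return max(predictions, key=lambda p: p[1])[0]  # tie-break

print(combine([("spam", 0.9), ("ham", 0.6), ("spam", 0.7)]))  # spam
print(combine([("spam", 0.9), ("ham", 0.95)]))                # ham
```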
Conclusion
The objective is to filter mail messages based on an individual's preferences
Classification performance increases with increased (incremental) training
Initial learning is not necessary for LSA, PLSA & N-Gram
Performs unsupervised filtering
Performs fast prediction, although background training is a relatively slower process
References
[1] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos, “An Evaluation of Naïve Bayesian Anti-Spam Filtering”, Proc. of the Workshop on Machine Learning in the New Information Age, 2000.
[2] W. Cohen, “Learning rules that classify e-mail”, AAAI Spring Symposium on Machine Learning in Information Access, 1996.
[3] W. Daelemans, J. Zavrel, K. van der Sloot, and A. van den Bosch, “TiMBL: Tilburg Memory-Based Learner - version 4.0
Reference Guide”, 2001.
[4] H. Drucker, D. Wu, and V. N. Vapnik., “Support Vector Machines for Spam Categorization”, IEEE Trans. on Neural networks,
1999.
[5] D. Mertz, “Spam Filtering Techniques. Six approaches to eliminating unwanted e-mail.”, Gnosis Software Inc., September 2002.
[6] M. Vinther, “Junk Detection using neural networks”, MeeSoft Technical Report, June 2002. Available: http://logicnet.dk/reports/JunkDetection/JunkDetection.htm.
[7] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. “Indexing By Latent Semantic Analysis”, Journal
of the American Society For Information Science, 41, 391-407. (1990)
[8] Sudarsun Santhiappan, Venkatesh Prabhu Gopalan, and Sathish Kumar Veeraswamy, “Role of Weighting on TDM in Improvising Performance of LSA on Text Data”, Proceedings of IEEE INDICON 2006.
[9] Thomas Hofmann, “Probabilistic Latent Semantic Indexing,” Proc. 22 Int’l SIGIR Conf. on Research and Development in
Information Retrieval, 1999
[10] Sudarsun Santhiappan, Dalou Kalaivendhan and Venkateswarlu Malapatti, “Unsupervised Contextual Keyword Relevance Learning and Measurement using PLSA”, Proceedings of IEEE INDICON 2006.
[11] Landauer, T. K., Foltz, P. W., & Laham, D., “Introduction to Latent Semantic Analysis”, Discourse Processes, 25, 259-284, 1998.
[12]G. Furnas, S. Deerwester, S. Dumais, T. Landauer, R. Harshman, L. Streeter and K. Lochbaum, "Information retrieval using a
singular value decomposition model of latent semantic structure," in The 11th International Conference on Research and
Development in Information Retrieval, Grenoble, France: ACM Press, pp. 465--480. (1988)
[13] Damashek, M., “Gauging Similarity via N-Grams: Language-Independent Sorting, Categorization and Retrieval of Text”, Science, 267, 843-848, 1995.
[14] Shlomo Hershkop, Salvatore J. Stolfo, “Combining Email Models for False Positive Reduction”, KDD’05, August 2005.
Any Queries?
You can post your queries to [email protected]