Characteristic Identifier Scoring and
Clustering for Email Classification
By
Mahesh Kumar Chhaparia
Email Clustering
• Given a set of unclassified emails, the objective is to produce high
purity clusters keeping the training requirements low.
• Outline:
– Characteristic Identifier Scoring and Clustering (CISC),
• Identifier Set
• Scoring
• Clustering
• Directed Training
– Comparison of CISC with some of the traditional ideas in email clustering
– Comparison of CISC with POPFile (Naïve-Bayes classifier),
– Caveats
– Conclusion
Evaluation
• Evaluation on Enron Email Dataset for the following users (purity
measured w.r.t the grouping already available):
User          Number of folders   Number of messages   Messages in smallest folder   Messages in largest folder
Lokay-M       11                  2489                 6                             1159
Beck-S        101                 1971                 3                             166
Sanders-R     30                  1188                 4                             420
Williams-w3   18                  2769                 3                             1398
Farmer-D      25                  3672                 5                             1192
Kitchen-L     47                  4015                 5                             715
Kaminski-V    41                  4477                 3                             547
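The purity figures reported in these slides can be read against the usual definition of cluster purity: the fraction of messages that fall in the majority folder of their cluster. The slides do not spell out the formula, so the sketch below assumes that standard definition; the function name and the toy data are illustrative only.

```python
from collections import Counter

def purity(clusters, folder_of):
    """Fraction of messages that share the majority folder of their cluster.

    clusters  -- iterable of clusters, each a list of message ids
    folder_of -- dict mapping message id -> existing folder label
    """
    total = 0
    majority = 0
    for cluster in clusters:
        if not cluster:
            continue
        counts = Counter(folder_of[m] for m in cluster)
        majority += counts.most_common(1)[0][1]   # size of the majority folder
        total += len(cluster)
    return majority / total if total else 0.0

# Toy data (hypothetical, not taken from the Enron set)
folder_of = {1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
clusters = [[1, 2, 3], [4, 5]]
print(purity(clusters, folder_of))
```

Here four of the five messages agree with the majority folder of their cluster. Note that purity rises trivially as the number of clusters grows, so it is best compared alongside the cluster count.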
CISC: Identifier Set
• Sender and Recipients
• Words from the subject starting with uppercase
• Tokens from the message body
– Word sequences with each word starting in uppercase (length [2,5] only)
split about stopwords (excluding them)
– Acronyms (length [2,5] only)
– Words followed by an apostrophe and ‘s’ e.g. TW’s extracted to TW
– Words or phrases in quotes e.g. “Trans Western”
– Words where any character (excluding the first) is in uppercase e.g. eSpeak,
ThinkBank etc.
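The identifier rules above can be approximated with a few regular expressions. This is a rough sketch, not the author's extractor: the stopword list, the function names, and the reading of "length [2,5]" as 2 to 5 words for sequences and 2 to 5 letters for acronyms are all assumptions.

```python
import re

# Tiny illustrative stopword list (assumption; the slides do not give one)
STOPWORDS = {"a", "an", "and", "at", "by", "for", "in", "of", "on", "the", "to"}

def capitalized_sequences(text):
    """Runs of 2 to 5 capitalized words; stopwords and punctuation end a run."""
    found = []
    for segment in re.split(r"[^\w\s']+", text):
        run = []
        for word in segment.split() + [""]:      # "" flushes the final run
            word = re.sub(r"'s$", "", word)       # TW's -> TW
            if word[:1].isupper() and word.lower() not in STOPWORDS:
                run.append(word)
            else:
                if 2 <= len(run) <= 5:
                    found.append(" ".join(run))
                run = []
    return found

def extract_identifiers(sender, recipients, subject, body):
    """Collect the candidate identifiers of one email as a set of strings."""
    ids = {sender, *recipients}
    ids.update(re.findall(r"\b[A-Z]\w*", subject))                      # capitalized subject words
    ids.update(capitalized_sequences(body))                             # capitalized word sequences
    ids.update(re.findall(r"\b[A-Z]{2,5}\b", body))                     # acronyms
    ids.update(re.findall(r"\b(\w+)['’]s\b", body))                     # possessives, TW's -> TW
    ids.update(re.findall(r'["“]([^"“”]+)["”]', body))                  # quoted words or phrases
    ids.update(re.findall(r"\b[A-Za-z][a-z]*[A-Z][A-Za-z]*\b", body))   # inner uppercase, eSpeak
    return ids

# Hypothetical message, for illustration only
ids = extract_identifiers(
    "alice@example.com",
    ["bob@example.com"],
    "Capacity Release update",
    'TW\'s Capacity Release Report was posted on eSpeak by the "Trans Western" team at EOL.',
)
print(sorted(ids))
```

Because the result is a set, an identifier matched by several rules (for example an acronym that also has an inner uppercase letter) is kept only once.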
CISC: Scoring
• Sender:
– Initial idea: generate clusters of email addresses with frequency of
communication above some threshold,
• (+) Identifies “good” clusters of communication
• (-) Difficult to score when an email has addresses spread across more
than one cluster
• (-) Fixed partitioning and difficult to update
CISC: Scoring (Contd…)
• Sender:
– Need a notion of soft clustering with both recipients and content
– Generate a measure of the sender's non-variability with respect to the addresses it
co-occurs with or the content it discusses in emails
– Example:
• 1 → {2,3} {3,4} {2,3,4} in Folder 1
• 2 → {1} {3} {4} {1} {3} {1,3} in Folder 2
• Emphasizes social clusters {1,2,3} {1,3,4}
• Classify 2 → {1,3,4}
– Traditionally: Folder 2 (address frequency based)
– CISC: Folder 1 (social cluster based)
– Difficult to say upfront which is better!
– Efficacy discussed later
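The example can be replayed in a few lines, reading each entry as a sender followed by the recipient sets of its messages. The sender-frequency rule stands for the traditional approach; the social-cluster rule uses Jaccard overlap between address groups as a stand-in, which is an assumption and not the score CISC actually computes. Function names are illustrative.

```python
# (sender, recipients) per message, grouped by folder, as in the example
folders = {
    "Folder 1": [(1, {2, 3}), (1, {3, 4}), (1, {2, 3, 4})],
    "Folder 2": [(2, {1}), (2, {3}), (2, {4}), (2, {1}), (2, {3}), (2, {1, 3})],
}

def by_sender_frequency(sender, folders):
    """Traditional rule: the folder where this sender has sent the most messages."""
    counts = {name: sum(1 for s, _ in msgs if s == sender)
              for name, msgs in folders.items()}
    return max(counts, key=counts.get)

def jaccard(a, b):
    """Overlap of two address sets, from 0 (disjoint) to 1 (identical)."""
    return len(a & b) / len(a | b)

def by_social_cluster(sender, recipients, folders):
    """Assumed rule: the folder holding the most similar group of participants."""
    group = {sender} | set(recipients)
    best = {name: max(jaccard(group, {s} | r) for s, r in msgs)
            for name, msgs in folders.items()}
    return max(best, key=best.get)

print(by_sender_frequency(2, folders))            # Folder 2
print(by_social_cluster(2, {1, 3, 4}, folders))   # Folder 1
```

Sender 2 appears as a sender only in Folder 2, so the frequency rule picks Folder 2. The participant group {1,2,3,4} has already occurred as a whole in Folder 1, so the group-overlap rule picks Folder 1.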
CISC: Scoring (Contd…)
• Words or Phrases:
– Generate a measure of its importance
– Using context captured through the co-occurring text
– Sample scenarios for score generation:
• Different functional groups in a company mentioning “Conference
Room” → Low score
• A single shipment discussion for company “CERN” → High score
• Several different topic discussions (financial, operational etc.) for
company “TW” → Low score
• Clustering: Pair each message with its highest-similarity message, then merge
clusters sharing at least one message to produce disjoint clusters
• Directed Training:
– For each cluster, identify a message likely to belong to majority class
– Suggest the user to classify this message
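The clustering step amounts to linking each message to its best match and taking the connected components, which a union-find structure does directly. The sketch below assumes a precomputed symmetric similarity matrix. The rule for choosing which message to show the user (highest total similarity to the rest of its cluster) is only a guess at "likely to belong to the majority class"; the slides do not say how that message is chosen.

```python
def cluster_by_best_match(similarity):
    """Link every message to its most similar message and merge linked groups.

    similarity -- square list of lists; similarity[i][j] scores messages i and j
    Returns a list of disjoint clusters (lists of message indices).
    """
    n = len(similarity)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i in range(n):
        others = [j for j in range(n) if j != i]
        if not others:
            continue
        best = max(others, key=lambda j: similarity[i][j])
        parent[find(i)] = find(best)        # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def suggest_for_labeling(cluster, similarity):
    """Assumed heuristic: the message most similar to the rest of its cluster."""
    return max(cluster, key=lambda i: sum(similarity[i][j] for j in cluster if j != i))

# Hypothetical similarity scores for five messages
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.3, 0.2, 0.1],
    [0.8, 0.3, 1.0, 0.1, 0.2],
    [0.1, 0.2, 0.1, 1.0, 0.7],
    [0.0, 0.1, 0.2, 0.7, 1.0],
]
for cluster in cluster_by_best_match(sim):
    print(cluster, "-> ask the user about message", suggest_for_labeling(cluster, sim))
```

Only each message's single best match creates a link, so a message needs to be close to just one other message in its class to join that class's cluster.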
Efficacy of TF-IDF Cosine Similarity
• Clustering using the traditional TF-IDF cosine similarity measure is not very
effective for emails!
User          TF-IDF (% Purity before merging)   TF-IDF (% Purity)   CISC (% Purity)
Lokay-M       57.69                              46.68               77.54
Beck-S        51.44                              9.63                59.66
Sanders-R     61.53                              37.45               70.03
Williams-w3   58.92                              61.71               90.61
Note:
• Both TF-IDF and CISC figures use only word and phrase tokens
• Number of clusters is different in both cases, but the purity figures indicate the
discriminative capability of the respective algorithms
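For reference, the TF-IDF cosine baseline can be sketched as follows. The weighting used here (raw term count times log(N/df)) is one common variant and is an assumption; the slides do not state which variant was evaluated. The documents are hypothetical token lists.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {token: weight} dict per document."""
    n = len(docs)
    df = Counter(token for doc in docs for token in set(doc))   # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: count * math.log(n / df[t]) for t, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical tokenized messages
docs = [
    ["capacity", "release", "report"],
    ["capacity", "release", "schedule"],
    ["lunch", "friday"],
]
v = tfidf_vectors(docs)
print(cosine(v[0], v[1]), cosine(v[0], v[2]))
```

The first two messages share two tokens and get a positive score, while the third shares none and scores zero. Every non-stopword contributes to the vectors here, in contrast to the small identifier set CISC retains.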
Efficacy of Social Cluster Based Scoring
• Results
User          CISC with social clusters (% Purity)   CISC without social clusters (% Purity)
Lokay-M       84.21                                  77.54
Beck-S        60.52                                  59.66
Sanders-R     78.28                                  70.03
Williams-w3   93.31                                  90.61
CISC vs. POPFile
• Results
# Training Messages   Lokay-M       Beck-S        Sanders-R     Williams-w3
100                   62.63         15.60         15.39         6.75
200                   66.62         19.70         31.79         35.50
300                   69.53         20.01         50.68         35.72
1000                  72.68         24.63         36.51         18.40
CISC                  80.47 (265)   52.81 (218)   75.67 (146)   91.40 (153)
CISC                  85.22 (614)   71.47 (587)   84.79 (332)   93.38 (365)
• Purity may sometimes (marginally) decrease with increasing training set in
POPFile!
Conclusion
• Given a set of unclassified emails, the proposed strategy obtains higher
clustering purity with lower training requirements than POPFile and a
TF-IDF-based method.
• Key differentiators:
– Scores senders by a combination of communication cluster and content
variability instead of the usual TF-IDF scoring or Naïve Bayes word model
(POPFile),
– Picks a set of high-selectivity features for the final message similarity model
rather than retaining most of the message content (i.e. all non-stopwords),
– Observes and uses the fact that any email in a class may be “close” to only
a small number of emails rather than to all emails in that class,
– Finally, helps lower training requirements through “directed training” rather
than indiscriminate training over as many emails as possible.
Future Work
• Design and evaluation for non-corporate datasets
• Tuning of message similarity scoring
– Different weights for the score components
– Different range normalization for different components to boost
proportionally
– Test feature score proportional to its length
• Richer feature set
– Phrases following ‘the’
– Test with a substring-free collection, e.g. “TW Capacity Release Report” and
“TW” replaced with “Capacity Release Report” and “TW”
• Hierarchical word scoring to change granularity of clustering
• Online classification using training directed feature extraction
• Merging high purity clusters effectively to further reduce training
requirements
Q & A