An Experimental Framework
for Email Categorization and
Management
Kenrick Mock
[email protected]
Project Overview
• Motivation: Email Overload
• Potential solution: Automatic categorization and
management techniques
• Problem: The potential solution is very
experimental. Email use and user interaction are
difficult to model, requiring a prototype that users
can try on actual email
• The purpose of this work is to present a Microsoft
Outlook 2000™ add-in that:
– Can be used as a first step toward more experimental
research into automatic email management techniques
– Helps manage the inbox via classification and
relevancy-based search
What’s the Problem with Email?
• Too much
• 6/26/2001 USA Today
– “Workers polled this year by market
researcher Gartner spent an average of 49
minutes a day on e-mail, 30% to 35% more
time than they did a year ago. Ferris
Research estimates management-level
workers will spend four hours a day on email by 2002.”
Solutions?
• Educate users
– Don’t send so much mail, don’t subscribe to lists
• Use technology in some way
– Current efforts are toward some type of
classification system that learns
[Diagram: a new folder “Conferences” holds emails regarding conferences. Training: the system learns what email belongs to “Conferences”. A new SIGIR email is classified into “Conferences”; a new Miss Cleo email is classified into “Trash”.]
This Project
• An architecture for exploring automatic
email management techniques
• Built on Outlook 2000
– Primary code in Visual Basic
• Produces DLL add-in for Outlook
– Visual C++ DLL component
• Hashes strings to longs (logical operators not
available in VB)
• Referenced from VB
– Not tested with Outlook 2002!
Architectural Overview
[Diagram components:
• Outlook, exposing the Outlook Object Model and Events
• VB Add-In DLL, containing the Outlook / Class Interface Glue
• C++ Helper DLL (Hash Strings)
• Message Class: AddTerms(), Display(), Get Vals, CompareMsg()
• Folder Class: AddMsg(), GetMessages via Dictionary, CompareMsg()]
Add-In Interface : Messages
• Message Class
– Mail folders scanned on startup, class instance created for
each mail item (except Trash, Sent Items).
– Message text is tokenized and stoplisted using
• Sender
• Recipients
• Subject
• Text Body
(possible to use more fields if desired)
– Text tokens are hashed to 32-bit longs to save space and
greatly speed up token comparison
• Hash function by Bob Jenkins
• 2 collisions on 87111 dictionary words
• 10x faster to compare longs vs. strings via strcmp on Pentium II
– CompareMsg function computes similarity between two
email messages
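The tokenize, stoplist, and hash steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the slides do not say which of Bob Jenkins' hash functions the C++ helper uses, so his one-at-a-time hash stands in here, and the stoplist and the names `STOPLIST`, `one_at_a_time`, and `hashed_tokens` are hypothetical.

```python
import re

# Assumption: a tiny illustrative stoplist; the add-in's real stoplist is not shown.
STOPLIST = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


def one_at_a_time(key: bytes) -> int:
    """Bob Jenkins' one-at-a-time hash, kept within 32 bits at every step."""
    h = 0
    for byte in key:
        h = (h + byte) & 0xFFFFFFFF
        h = (h + (h << 10)) & 0xFFFFFFFF
        h ^= h >> 6
    h = (h + (h << 3)) & 0xFFFFFFFF
    h ^= h >> 11
    h = (h + (h << 15)) & 0xFFFFFFFF
    return h


def hashed_tokens(text: str) -> list:
    """Lowercase, split on non-alphanumerics, drop stopwords, hash each token."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [one_at_a_time(w.encode("utf-8")) for w in words if w not in STOPLIST]


# Fields such as sender, recipients, subject and body would each pass through
# hashed_tokens(); later comparisons then work on integers instead of strings.
subject_terms = hashed_tokens("The call for papers is on the web")
```

Comparing two integers is a single machine operation, which is the kind of saving the strcmp comparison above refers to; the price is that rare hash collisions make two distinct words look identical.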
Add-In Interface : Folders
• Folder Class
– User-created mail folders are scanned on
startup and a folder instance created for each
mail folder (except Trash, Sent Items).
– Messages that the user has placed in each folder
are added to the folder’s classifier for training
– CompareMsg function computes similarity
between a new message and the classifier for
the folder
• i.e. can use to classify a new message into folders
Classifier Implementation
• CompareMsg
– It is the goal of this project to experiment with different
classifiers and algorithms as the implementation of
CompareMsg to find out what works and what doesn’t
– A simple classification scheme is implemented for now
• Nearest Neighbor, common terms & frequencies
– Other schemes that have been examined in the past:
• TF-IDF, Neural Networks, Bayesian, Rule Induction, SVM
• What should the classifier do when new email
arrives?
– Some options
• Move new email directly to classified folder
• Annotate email with a category tag
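A nearest-neighbor scheme based on common terms and frequencies might look like the sketch below. It is not the add-in's actual CompareMsg: the similarity measure (shared term counts divided by combined term counts) is an assumption chosen for simplicity, and `compare_msg`, `classify`, and the sample data are hypothetical. Plain word strings are used in place of hashed tokens to keep the example readable.

```python
from collections import Counter


def compare_msg(a: Counter, b: Counter) -> float:
    """Similarity from shared terms: overlap of term counts, scaled to [0, 1]."""
    shared = sum((a & b).values())  # minimum count of each common term
    total = sum((a | b).values())   # maximum count of each term in either message
    return shared / total if total else 0.0


def classify(new_msg: Counter, folders: dict) -> tuple:
    """Return (folder_name, score) for the single most similar training message."""
    best = (None, 0.0)
    for name, messages in folders.items():
        for msg in messages:
            score = compare_msg(new_msg, msg)
            if score > best[1]:
                best = (name, score)
    return best


# Hypothetical training data: one message per folder.
folders = {
    "Conferences": [Counter("sigir call for papers deadline".split())],
    "Trash": [Counter("free psychic reading call now".split())],
}
new_msg = Counter("sigir papers deadline extended".split())
category, score = classify(new_msg, folders)
```

The returned folder name can then drive either option above: moving the message, or only tagging it with the category.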
Classifier Usage Challenges
• In previous work, we built a proprietary rule
induction and tf-idf classifier into Outlook and
GroupWise that classified messages into
categories. It was tested on managers and
developers.
• Problems we encountered were usage-driven:
1. The need for constant re-training to keep up with
dynamically changing categories.
2. Classification errors are puzzling and instill distrust in
users.
3. Insufficient data may be available as training examples.
4. It is difficult for a user to examine or manually edit a
classifier.
Challenge 1: Categories Change
• Categories commonly change over time; “Topic
Drift” as in newsgroups
– Project ends or changes direction
– Conversation slowly changes topics
– General discussion might turn more technical
• Problems for learning algorithms
– Classifiers need to be re-trained; how well can they handle
it? How fast is it?
• Our users were willing to wait seconds, not minutes
• Most classifiers are not incremental; require re-training using all
positive/negative examples, not just new ones
• Re-training is often too slow for many algorithms (e.g. rule induction)
– Vector-based classifiers
• Fast to re-train but may have problems with threshold calculations
or new vocabulary not in the vector
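To illustrate why a vector-based classifier can be fast to re-train, here is a minimal sketch of a centroid that is updated one message at a time. The class `CentroidClassifier` and its methods are hypothetical and are not taken from the add-in or from the earlier proprietary classifier. A sparse dictionary is used so that new vocabulary can be added; a fixed-length vector would not allow that, which is the vocabulary problem noted above.

```python
from collections import Counter


class CentroidClassifier:
    """Running mean of term-count vectors for one category (hypothetical class)."""

    def __init__(self):
        self.n = 0             # number of training messages seen so far
        self.sums = Counter()  # per-term count totals across those messages

    def add_example(self, terms: Counter) -> None:
        """Incremental update: cost depends only on the size of the new message."""
        self.sums.update(terms)
        self.n += 1

    def centroid(self) -> dict:
        """Average term count per training message."""
        return {t: c / self.n for t, c in self.sums.items()} if self.n else {}

    def score(self, terms: Counter) -> float:
        """Cosine similarity between a message and the current centroid."""
        c = self.centroid()
        dot = sum(terms[t] * w for t, w in c.items())
        norm_c = sum(w * w for w in c.values()) ** 0.5
        norm_m = sum(v * v for v in terms.values()) ** 0.5
        return dot / (norm_c * norm_m) if norm_c and norm_m else 0.0


clf = CentroidClassifier()
clf.add_example(Counter({"budget": 2, "meeting": 1}))
before = clf.score(Counter({"sigir": 1}))  # term not yet in the centroid
clf.add_example(Counter({"sigir": 1}))     # topic drift: one new example
after = clf.score(Counter({"sigir": 1}))
```

No earlier examples are revisited on update. Deciding what score counts as a match is still left open, which is the threshold problem mentioned above.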
Challenge 2: Classifiers Make
Errors, Destroy User Trust
• Users tolerate few errors
• Want immediate corrections so the same error
won’t happen again
– Vector classifier may require several examples before
centroid shifts enough to include similar message
– Rule classifiers need explicit retrain
• Classification errors are inevitable
– Classifier may over-generalize or be too specific
– Errors could “break” users’ hard work setting up a folder
– In some cases it’s more work to fix errors than the
savings the tool is intended to provide!
• Trust is easy to lose, users abandon the system
Challenge 3: Insufficient Data
Available
• Many classifiers require a large amount of training
data, e.g. statistical-based classifiers
– May not have enough email available
– Users expect system to work well given only 6-12
training examples
– Effort to find more examples typically too high
– One solution: Bootstrap using data in existing folders
• What about negative examples? Can be problematic for some
classification algorithms
Challenge 4: Model Editing and
Understanding
• Some users want to manually fix or edit the
classifier
– These are naïve users, not programmers!
• Easy to understand, modify
– Rule-based classifiers
• More difficult
– Vector classifiers, may have many keywords
• Very difficult
– Neural Network
– SVM
Current Implementation
• Publicly available source, binaries for open
development purposes
• Simple nearest-neighbor classifier for Folders
– Speed, easy to train and classify
– May help classify user-created folders that really
encompass multiple sub-folders (e.g. “work” where
there are many work projects) better than classification
techniques that rely on global data
• Individual term frequencies of sub-folders topics will be low
• But message-to-message comparison may be high
– Don’t need negative examples
• Tag messages with category rather than move into
a folder
– Hopefully not too critical when misclassifications occur
Current Implementation : User
Interface
Upon startup of Outlook :
Scan Outlook folders, create classifiers and messages
View inbox grouped by category
Current Interface : New Email
New email is automatically classified into the
best-matching folder (but not moved, only grouped)
Current Interface : Related Email
• Interface also supports finding other email
similar to the current one
– Iterate through all email message class objects
invoking the comparison function
• Simple term-frequency comparison of both emails
for now
• Linear time, but not too bad
– 300 of the author’s messages scanned per second on a
400 MHz Pentium II
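The related-email search described above amounts to a linear scan that scores every stored message against the selected one. A minimal sketch, assuming a simple term-frequency overlap as the comparison function (the names `term_similarity` and `related` and the sample messages are hypothetical):

```python
import heapq
from collections import Counter


def term_similarity(a: Counter, b: Counter) -> float:
    """Shared term counts divided by combined term counts, in [0, 1]."""
    shared = sum((a & b).values())
    total = sum((a | b).values())
    return shared / total if total else 0.0


def related(current_id, messages: dict, k: int = 5) -> list:
    """Linear scan: score every other message against the current one, keep top k."""
    current = messages[current_id]
    scored = (
        (term_similarity(current, terms), msg_id)
        for msg_id, terms in messages.items()
        if msg_id != current_id
    )
    return [msg_id for score, msg_id in heapq.nlargest(k, scored) if score > 0]


# Hypothetical mailbox keyed by message id.
messages = {
    "m1": Counter("sigir deadline extended".split()),
    "m2": Counter("sigir deadline reminder".split()),
    "m3": Counter("lunch friday".split()),
    "m4": Counter("sigir hotel".split()),
}
similar = related("m1", messages)
```

Each query costs one comparison per stored message, so the scan time grows in direct proportion to the size of the mailbox.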
Current Interface: Related Email
Select a message, click on the button
List of similar messages displayed, click to open
Comments on Personal Use
• No formal user studies performed yet
• But, I’ve been using it…some anecdotes:
– Nearest Neighbor classifier OK, could be better
– Would be useful to index trash or sent-items
• If not indexed, there is no folder to classify into when junk
mail arrives so it gets put somewhere else
• Temporary solution: Make a “Trash” folder with examples
• But indexing trash could be a lot of messages…
– Grouping of incoming email useful?
• Not really needed for frequent email reading
• Useful when returning from a trip and need to triage the mail
– Relevant email
• Useful for finding uncoupled email threads
• Sent-Items would be useful to index here
Lots of Work To Do
• Experiment with other classifiers
– Need to see how they work for users on training issues, speed,
etc., not just classification accuracy
• Latch onto more events
– Better mail detection, drag & drop events
• Clean up code implementation
– Support persistence, speed issues on startup scan
– Implementation issues
– Compatibility with Outlook 2002, VB .NET
• Other forms of visualization / categorization
– E.g., color, thread information, graphical techniques
• Extend to other forms of Outlook data
– Calendaring, Notes, Files
Try It Out
• Source Code & Binaries available online
– http://www.math.uaa.alaska.edu/~afkjm/emailaddin/
– Only tested with Windows 2000 & Outlook 2000
– Feel free to use or modify code as you see fit
– Warning: Developer docs and code cleanup still need to
be done!
• But I’ll be glad to answer any questions!