An Experimental Framework for Email Categorization and Management
Kenrick Mock
[email protected]

Project Overview
• Motivation: Email Overload
• Potential solution: Automatic categorization and management techniques
• Problem: The potential solution is very experimental. Email use and user interaction are difficult to model, requiring a prototype that users can try on actual email
• The purpose of this work is to present a Microsoft Outlook 2000™ add-in that:
  – Can be used as a first step toward more experimental research into automatic email management techniques
  – Helps manage the inbox via classification and relevancy-based search

What’s the Problem with Email?
• Too much
• 6/26/2001 USA Today
  – “Workers polled this year by market researcher Gartner spent an average of 49 minutes a day on e-mail, 30% to 35% more time than they did a year ago. Ferris Research estimates management-level workers will spend four hours a day on email by 2002.”

Solutions?
• Educate users
  – Don’t send so much mail, don’t subscribe to lists
• Use technology in some way
  – Current efforts are toward some type of classification system that learns
[Diagram: a new folder “Conferences” holds emails regarding conferences. Training: the system learns what email belongs to “Conferences”. A new SIGIR email is classified into “Conferences”; a new Miss Cleo email is classified into “Trash”.]

This Project
• An architecture for exploring automatic email management techniques
• Built on Outlook 2000
  – Primary code in Visual Basic
    • Produces DLL add-in for Outlook
  – Visual C++ DLL component
    • Hashes strings to longs (logical operators not available in VB)
    • Referenced from VB
  – Not tested with Outlook 2002!
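The "hashes strings to longs" step can be sketched roughly as follows. This is only an illustration in Python, not the add-in's C++ helper. The slides credit a hash function by Bob Jenkins without saying which one, so the one-at-a-time hash used here is an assumption, as are the tokenizer, the tiny stoplist, and the names `one_at_a_time`, `hash_tokens` and `STOPLIST`.

```python
import re

# Hypothetical stoplist for illustration; the add-in's real stoplist is not given.
STOPLIST = {"the", "a", "an", "and", "of", "to", "for", "in", "is", "re"}

MASK32 = 0xFFFFFFFF  # keep every intermediate value within 32 bits


def one_at_a_time(data: bytes) -> int:
    """Bob Jenkins's one-at-a-time hash, returned as an unsigned 32-bit value."""
    h = 0
    for byte in data:
        h = (h + byte) & MASK32
        h = (h + (h << 10)) & MASK32
        h ^= h >> 6
    h = (h + (h << 3)) & MASK32
    h ^= h >> 11
    h = (h + (h << 15)) & MASK32
    return h


def hash_tokens(text: str) -> list:
    """Lowercase, tokenize, drop stoplisted words, and hash each remaining token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [one_at_a_time(t.encode("utf-8")) for t in tokens if t not in STOPLIST]


if __name__ == "__main__":
    # Each surviving token becomes one 32-bit integer.
    print(hash_tokens("Call for Papers: SIGIR"))
```

Comparing two hashed tokens is then a single integer comparison instead of a string comparison, at the cost of rare collisions.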
Architectural Overview
[Diagram components: Outlook; Outlook Object Model events; VB Add-In DLL; C++ Helper DLL (hash strings); Outlook / class interface glue; Message Class (AddTerms(), Display(), Get Vals, CompareMsg()); Folder Class (AddMsg(), GetMessages via Dictionary, CompareMsg())]

Add-In Interface : Messages
• Message Class
  – Mail folders scanned on startup, class instance created for each mail item (except Trash, Sent Items).
  – Message text is tokenized and stoplisted using
    • Sender
    • Recipients
    • Subject
    • Text Body (possible to use more fields if desired)
  – Text tokens are hashed to 32-bit longs to save space and greatly increase token comparison speed
    • Hash function by Bob Jenkins
    • 2 collisions on 87111 dictionary words
    • 10x faster to compare longs vs. strings via strcmp on Pentium II
  – CompareMsg function computes similarity between two email messages

Add-In Interface : Folders
• Folder Class
  – User-created mail folders are scanned on startup and a folder instance is created for each mail folder (except Trash, Sent Items).
  – Messages that the user has placed in each folder are added to the folder’s classifier for training
  – CompareMsg function computes similarity between a new message and the classifier for the folder
    • i.e. it can be used to classify a new message into folders

Classifier Implementation
• CompareMsg
  – It is the goal of this project to experiment with different classifiers and algorithms as the implementation of CompareMsg to find out what works and what doesn’t
  – A simple classification scheme is implemented for now
    • Nearest Neighbor, common terms & frequencies
  – Other schemes that have been examined in the past:
    • TF-IDF, Neural Networks, Bayesian, Rule Induction, SVM
• What should the classifier do when new email arrives?
  – Some options
    • Move new email directly to classified folder
    • Annotate email with a category tag

Classifier Usage Challenges
• In previous work, we built a proprietary rule induction and tf-idf classifier into Outlook and GroupWise that classified messages into categories. It was tested on managers and developers.
• Problems we encountered were usage-driven:
  1. The need for constant re-training to keep up with dynamically changing categories.
  2. Classification errors are puzzling and instill distrust in users.
  3. Insufficient data may be available as training examples.
  4. It is difficult for a user to examine or manually edit a classifier.

Challenge 1: Categories Change
• Common for categories to change over time; “Topic Drift” as in Newsgroups
  – Project ends or changes direction
  – Conversation slowly changes topics
  – General discussion might turn more technical
• Problems for learning algorithms
  – Classifiers need to be re-trained; how well can they handle it? How fast is it?
    • Our users were willing to wait seconds, not minutes
    • Most classifiers are not incremental; require re-training using all positive/negative examples, not just new ones
    • Re-training is often too slow for many algorithms (e.g. rule induction)
  – Vector-based classifiers
    • Fast to re-train but may have problems with threshold calculations or new vocabulary not in the vector

Challenge 2: Classifiers Make Errors, Destroy User Trust
• Users tolerate few errors
• Want immediate corrections so the same error won’t happen again
  – Vector classifier may require several examples before centroid shifts enough to include similar message
  – Rule classifiers need explicit retrain
• Classification errors are inevitable
  – Classifier may over-generalize or be too specific
  – Errors could “break” users’ hard work setting up a folder
  – In some cases it’s more work to fix errors than the savings the tool is intended to provide!
• Trust is easy to lose, users abandon the system

Challenge 3: Insufficient Data Available
• Many classifiers require a large amount of training data, e.g. statistical-based classifiers
  – May not have enough email available
  – Users expect system to work well given only 6-12 training examples
  – Effort to find more examples typically too high
  – One solution: Bootstrap using data in existing folders
• What about negative examples? Can be problematic for some classification algorithms

Challenge 4: Model Editing and Understanding
• Some users want to manually fix or edit the classifier
  – These are naïve users, not programmers!
• Easy to understand, modify
  – Rule-based classifiers
• More difficult
  – Vector classifiers, may have many keywords
• Very difficult
  – Neural Network
  – SVM

Current Implementation
• Publicly available source, binaries for open development purposes
• Simple nearest-neighbor classifier for Folders
  – Speed, easy to train and classify
  – May help classify user-created folders that really encompass multiple sub-folders (e.g. “work” where there are many work projects) better than classification techniques that rely on global data
    • Individual term frequencies of sub-folder topics will be low
    • But message-to-message comparison may be high
  – Don’t need negative examples
• Tag messages with category rather than move into a folder
  – Hopefully not too critical when misclassifications occur

Current Implementation : User Interface
• Upon startup of Outlook: scan Outlook folders, create classifiers and messages
• View inbox grouped by category

Current Interface : New Email
• New email automatically classified into the best-matching folder (but not moved, only grouped)

Current Interface : Related Email
• Interface also supports finding other email similar to the current one
  – Iterate through all email message class objects invoking the comparison function
    • Simple term-frequency comparison of both emails for now
    • Linear time, but not too bad
      – 300 of the author’s messages scanned per second on 400 MHz PII

Current Interface : Related Email
• Select a message, click on button
• List of similar messages displayed, click to open

Comments on Personal Use
• No formal user studies performed yet
• But, I’ve been using it…some anecdotes:
  – Nearest Neighbor classifier OK, could be better
  – Would be useful to index trash or sent-items
    • If not indexed, there is no folder to classify into when junk mail arrives so it gets put somewhere else
    • Temporary solution: Make a “Trash” folder with examples
    • But indexing trash could be a lot of messages…
  – Grouping of incoming email useful?
    • Not really needed for frequent email reading
    • Useful when returning from a trip and need to triage the mail
  – Relevant email
    • Useful for finding uncoupled email threads
    • Sent-Items would be useful to index here

Lots of Work To Do
• Experiment with other classifiers
  – Need to see how they fare with users on training issues, speed, etc., not just classification accuracy
• Latch onto more events
  – Better mail detection, drag & drop events
• Clean up code implementation
  – Support persistence, speed issues on startup scan
  – Implementation issues
  – Compatibility with Outlook 2002, VB .NET
• Other forms of visualization / categorization
  – E.g., color, thread information, graphical techniques
• Extend to other forms of Outlook data
  – Calendaring, Notes, Files

Try It Out
• Source Code & Binaries available online
  – http://www.math.uaa.alaska.edu/~afkjm/emailaddin/
  – Only tested with Windows 2000 & Outlook 2000
  – Feel free to use or modify code as you see fit
  – Warning: Developer docs and code cleanup still need to be done!
    • But I’ll be glad to answer any questions!
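
Appendix sketch: nearest-neighbor comparison
For readers who want a feel for the nearest-neighbor folder classifier and the related-email search described in the slides, here is a minimal Python sketch. It is not the add-in's Visual Basic code. The slides say only that the comparison uses common terms and frequencies, so the cosine similarity used here is an assumption, and `term_counts`, `compare_msg`, `classify` and `related` are hypothetical names (`compare_msg` loosely mirrors the CompareMsg function).

```python
import math
from collections import Counter


def term_counts(text):
    """Very rough tokenization: lowercase, split on whitespace, count terms."""
    return Counter(text.lower().split())


def compare_msg(a, b):
    """Cosine similarity of two term-count vectors (assumed measure)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(new_msg, folders):
    """Return (folder name, score) of the single most similar stored message.

    `folders` maps a folder name to the messages filed there. No negative
    examples are needed. Returns (None, 0.0) when nothing overlaps.
    """
    best_folder, best_score = None, 0.0
    for name, messages in folders.items():
        for msg in messages:
            score = compare_msg(new_msg, msg)
            if score > best_score:
                best_folder, best_score = name, score
    return best_folder, best_score


def related(msg, all_msgs, top_n=5):
    """Linear scan over every message; return the top (score, index) pairs."""
    scored = [(compare_msg(msg, other), i) for i, other in enumerate(all_msgs)]
    scored = [pair for pair in scored if pair[0] > 0.0]
    return sorted(scored, reverse=True)[:top_n]


if __name__ == "__main__":
    folders = {
        "Conferences": [term_counts("sigir call for papers deadline")],
        "Trash": [term_counts("miss cleo psychic reading free")],
    }
    # The result is used as a category tag; the message itself is not moved.
    print(classify(term_counts("sigir papers deadline extended"), folders))
```

Because each new message is compared against individual stored messages rather than a folder-wide summary, a broad folder such as “work” can still match well when only one of its sub-topics resembles the new message.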