Slides - Cognitive Computation Group

Incorporating Machine Learning in your application: Text classification

Vivek Srikumar

Goals of this tutorial

At the end of this session, you will be able to:

1. Get started with Learning Based Java
2. Use a generic, black box text classifier for different applications … and write your own text classifier, if needed
3. Understand how features can impact the classifier performance … and add features to improve your application

What is text classification?

[Figure: a document goes into a classifier (black box), which outputs some labels (✗ ✗ ✓ ✗).]

Several applications fit this framework

• Spam detection
• Sentiment classification

What else could you do if you had such a black box system that can classify text? Try to spend 30 seconds brainstorming.

Outline of this session

• Getting started with LBJ
• Writing our first classifier: Spam/Ham
• Playing with features
• Sentiment analysis, news group classification
• Your turn: The fame classifier

Writing classifiers

LEARNING BASED JAVA

What is Learning Based Java?

• A modeling language for learning and inference
• Supports
  • Programming using learned models
  • High level specification of features and constraints between classifiers
  • Inference with constraints
• The learning operator
  • Classifiers are functions defined in terms of data
  • Learning happens at compile time

What does LBJ do for you?

• Abstracts away the feature representation, learning and inference
• Allows you to write learning based programs
• Application developers can reason about the application at hand

Demo

• A learning based program

First, we will write an application that assumes the existence of a black box classifier.
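As a rough illustration of what "assuming the existence of a black box classifier" can look like in application code, here is a minimal plain-Java sketch. It is not the demo from the session: the class `InboxFilterSketch`, the method `keepHam`, the label strings "spam" and "ham", and the hand-written stub rule are all hypothetical names chosen for this example. The application only depends on a function from text to a label, so a learned classifier could later replace the stub.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Sketch: an application written against a black box text classifier. */
final class InboxFilterSketch {

    /** Keeps only the messages that the classifier does not label "spam". */
    static List<String> keepHam(List<String> messages,
                                Function<String, String> classifier) {
        List<String> kept = new ArrayList<>();
        for (String message : messages) {
            if (!"spam".equals(classifier.apply(message))) {
                kept.add(message);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a learned classifier: one hand-written rule.
        Function<String, String> stub =
                text -> text.toLowerCase().contains("save over") ? "spam" : "ham";

        List<String> inbox = Arrays.asList(
                "Subject: save over 70 % on name brand software",
                "Subject: please keep in touch");

        System.out.println(keepHam(inbox, stub));
    }
}
```

The point of the design is that `keepHam` never looks inside the classifier; swapping the stub for a trained model does not change the application logic.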

SPAM DETECTION

Spam detection

Which of these (if any) are email spam? How do you know?

Subject: save over 70 % on name brand software
ppharmacy devote fink tungstate brown lexicon pawnshop crescent application elegy donnelly hydrochloride common embargo shakespearean bassett trustee nucleolus chicano narbonne telltale tagging swirly lank delphinus bragging bravery cornea asiatic susanne

Subject: please keep in touch
just like to say that it has been great meeting and working with you all . i will be leaving enron effective july 5 th to do investment banking in hong kong . i will initially be based in new york and will be moving to hong kong after a few months . do contact me when you are in the vicinity .

What do we need to build a classifier?

1. Annotated documents *
2. A learning algorithm
3. A feature representation of the documents

* Here we are dealing with supervised learning

Our first LBJ program

    /** A learned text classifier; its definition comes from data. */
    discrete TextClassifier(Document d) <-
    learn TextLabel
      using WordFeatures
      from new DocumentReader("data/spam/train")
      5 rounds
      with SparseAveragedPerceptron {
        learningRate = 0.1;
        thickness = 3.5;
      }
      testFrom new DocumentReader("data/spam/test")
    end

Slide callouts:

• discrete TextClassifier(...): defines a classifier
• Document d: the object being classified
• TextLabel: the function being learned
• WordFeatures: the feature representation
• new DocumentReader("data/spam/train"): the source of the training data
• SparseAveragedPerceptron: the learning algorithm
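To give a feel for what settings such as "5 rounds", learningRate and thickness might control, here is a simplified perceptron over word features in plain Java. This is not LBJ's SparseAveragedPerceptron: it does no weight averaging, and the margin rule used for `thickness` is an assumption made for this sketch, since the slides do not describe the internals. The class `WordPerceptronSketch`, its methods, and the two toy training texts are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified perceptron over word features; NOT LBJ's implementation. */
final class WordPerceptronSketch {
    private final Map<String, Double> weights = new HashMap<>();
    private final double learningRate;
    private final double thickness;

    WordPerceptronSketch(double learningRate, double thickness) {
        this.learningRate = learningRate;
        this.thickness = thickness;
    }

    /** Lower-cased, whitespace-separated words of the text. */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    /** Sum of the weights of the words in the text. */
    double score(String text) {
        double sum = 0.0;
        for (String word : words(text)) {
            sum += weights.getOrDefault(word, 0.0);
        }
        return sum;
    }

    boolean isSpam(String text) {
        return score(text) > 0.0;
    }

    /** Makes `rounds` passes over the labeled texts (true means spam). */
    void train(List<String> texts, List<Boolean> spamLabels, int rounds) {
        for (int round = 0; round < rounds; round++) {
            for (int i = 0; i < texts.size(); i++) {
                double target = spamLabels.get(i) ? 1.0 : -1.0;
                // Assumed margin rule: update unless the example is already
                // on the correct side by more than `thickness`.
                if (target * score(texts.get(i)) <= thickness) {
                    for (String word : words(texts.get(i))) {
                        weights.merge(word, learningRate * target, Double::sum);
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        WordPerceptronSketch model = new WordPerceptronSketch(0.1, 3.5);
        // Toy training data, invented for this sketch.
        model.train(
                Arrays.asList("save money on software", "great meeting you all"),
                Arrays.asList(true, false),
                5);
        System.out.println(model.isSpam("save on software"));
        System.out.println(model.isSpam("great meeting"));
    }
}
```

In LBJ itself none of this code is written by hand; the learning happens when the LBJ source is compiled.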

Demo

• Let's build a spam detector
• How to train?
• How do different learning algorithms perform? Does this choice matter much?

Features

• Our current spam detector uses words as features
• Can we do better? Let's try it out
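One possible direction, sketched in plain Java rather than LBJ, is to add word pairs (bigrams) next to single words. The slides do not say which extra features the demo tried, so this is only an assumed example; `NgramFeatureSketch`, `wordFeatures` and `bigramFeatures` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: two simple feature extractors for a document's text. */
final class NgramFeatureSketch {

    /** One feature per lower-cased, whitespace-separated word. */
    static List<String> wordFeatures(String text) {
        List<String> features = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                features.add(word);
            }
        }
        return features;
    }

    /** One feature per pair of adjacent words, joined with an underscore. */
    static List<String> bigramFeatures(String text) {
        List<String> words = wordFeatures(text);
        List<String> features = new ArrayList<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            features.add(words.get(i) + "_" + words.get(i + 1));
        }
        return features;
    }

    public static void main(String[] args) {
        String subject = "save over 70 % on name brand software";
        System.out.println(wordFeatures(subject));
        System.out.println(bigramFeatures(subject));
    }
}
```

Bigrams keep a little word order ("name_brand") that a bag of single words loses, at the cost of many more distinct features.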

So far

• What is LBJ? How do we use it?
• Writing a simple spam detector
• Playing with features
• How much do we need to change to move to a different application?

MORE TEXT CLASSIFICATION

Sentiment classification

Which of these product reviews is positive? How do you know?

Review 1: I recently made the switch from PC to Mac, and I can say that I'm not sure why I waited so long. [...] my computer a few weeks I can't say much about the durability and longevity of the hardware, but I can say that the operating system (mine shipped with Lion) and software is top notch.

Review 2: I've been an Apple user for a long time, but my most recent MacBook Pro purchase has convinced me to reconsider. I've had several hardware issues, including a failed keyboard, battery failure, and a bad DVD drive. Now, the backlight on the display fails to turn on when waking from sleep

Classifying news groups

Which mailing list should this message be posted to? How do you know?

I am looking for Quick C or Microsoft C code for image decoding from file for VGA viewing and saving images from/to GIF, TIFF, PCX, or JPEG format. I have scoured the Internet, but its like trying to find a Dr. Seuss spell checker TSR. It must be out there, and there's no need to reinvent the wheel.

• alt.atheism
• comp.graphics
• comp.os.ms-windows.misc
• comp.sys.ibm.pc.hardware
• comp.sys.mac.hardware
• comp.windows.x
• misc.forsale
• rec.autos
• rec.motorcycles
• rec.sport.baseball
• rec.sport.hockey
• sci.crypt
• sci.electronics
• sci.med
• sci.space
• soc.religion.christian
• talk.politics.guns
• talk.politics.mideast
• talk.politics.misc
• talk.religion.misc

Demo

• Converting our spam classifier into a
  • Sentiment classifier
  • A newsgroup classifier
• Note: How different are these at the implementation level?

Most of the engineering lies in the features

[Figure: a document goes into a classifier (black box), which outputs some labels (✗ ✗ ✓ ✗).]

THE FAMOUS PEOPLE CLASSIFIER

The Famous People Classifier

f([photo]) = Politician
f([photo]) = Athlete
f([photo]) = Corporate Mogul

The NLP version of the fame classifier

Each person is represented by the sentences that mention them:

• Barack Obama: all sentences in the news in which the string "Barack Obama" occurs
• Roger Federer: all sentences in the news in which the string "Roger Federer" occurs
• Bill Gates: all sentences in the news in which the string "Bill Gates" occurs

Our goal

• Find famous athletes, corporate moguls and politicians

Athlete
• Michael Schumacher
• Michael Jordan
• …

Politician
• Bill Clinton
• George W. Bush
• …

Corporate Mogul
• Warren Buffett
• Larry Ellison
• …

Let's brainstorm

• How do we build a fame classifier?

Remember, we start off with just raw text from a news website.

One solution

• Let us label entities using features defined on mentions (for example, all sentences in the news in which the string "Barack Obama" occurs)
  • Identify mentions using the named entity recognizer
  • Define features based on the words, parts of speech and dependency trees
  • Train a classifier
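The word-based part of this recipe can be sketched in plain Java as follows. To stay self-contained, the sketch finds mentions by exact string match instead of a named entity recognizer, and it ignores parts of speech and dependency trees, so it is a simplified stand-in rather than the solution from the session. `MentionFeatureSketch`, `contextWordCounts` and the example sentences are invented for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Sketch: word-count features gathered from sentences mentioning an entity. */
final class MentionFeatureSketch {

    /**
     * Counts the words that co-occur with the entity, over all sentences
     * containing the entity string. Words of the name itself are skipped.
     */
    static Map<String, Integer> contextWordCounts(List<String> sentences,
                                                  String entity) {
        Set<String> nameTokens =
                new HashSet<>(Arrays.asList(entity.toLowerCase().split("\\s+")));
        Map<String, Integer> counts = new TreeMap<>();
        for (String sentence : sentences) {
            // Assumption: plain substring match stands in for the NER step.
            if (!sentence.contains(entity)) {
                continue;
            }
            for (String token : sentence.toLowerCase().split("[^a-z0-9]+")) {
                if (token.isEmpty() || nameTokens.contains(token)) {
                    continue;
                }
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sentences, not taken from any news source.
        List<String> news = Arrays.asList(
                "Roger Federer won the match in straight sets.",
                "Bill Gates spoke about software.",
                "The crowd cheered as Roger Federer won again.");
        System.out.println(contextWordCounts(news, "Roger Federer"));
    }
}
```

The resulting counts for each person could then serve as the feature representation handed to a learner, with Athlete, Politician and Corporate Mogul as the labels.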

Summary

1. Get started with Learning Based Java
2. Use a generic, black box text classifier for different applications … and write your own text classifier, if needed
3. Understand how features can impact the classifier performance … and add features to improve your application

Questions