
Chapter 7: Text mining
UIC - CS 594
Bing Liu
Text mining
 It refers to data mining using text documents as data.
 There are many special techniques for pre-processing text documents to make them suitable for mining.
 Most of these techniques are from the field of "Information Retrieval".
Information Retrieval (IR)
 Conceptually, information retrieval (IR) is the study of finding needed information, i.e., IR helps users find information that matches their information needs.
 Historically, information retrieval is about document retrieval, emphasizing the document as the basic unit.
 Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information.
 IR has become a center of focus in the Web era.
Information Retrieval
[Diagram: the IR process. A user's information needs are translated into queries; the queries are matched against stored information (search/select); the query result is then evaluated: does the information found match the user's information needs?]
Text Processing
 Word (token) extraction
 Stop words
 Stemming
 Frequency counts
Stop words
 Many of the most frequently used words in English are worthless in IR and text mining; these words are called stop words.
   the, of, and, to, …
   Typically about 400 to 500 such words
   For an application, an additional domain-specific stop word list may be constructed
 Why do we need to remove stop words?
   Reduce indexing (or data) file size: stop words account for 20-30% of total word counts.
   Improve efficiency: stop words are not useful for searching or text mining, and they always have a large number of hits.
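As a concrete illustration, here is a minimal Python sketch of stop word removal; the stop list is a tiny invented subset, not the 400-500 word list mentioned above:

```python
# Minimal stop word removal sketch; this stop list is a tiny illustrative
# subset, not a full 400-500 word list.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "it", "for", "on"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

if __name__ == "__main__":
    doc = "the cost of the new hardware and software is high"
    print(remove_stop_words(doc.split()))
    # -> ['cost', 'new', 'hardware', 'software', 'high']
```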
Stemming
 Techniques used to find the root/stem of a word. E.g.,
   user, users, used, using → stem: use
   engineering, engineered, engineer → stem: engineer
 Usefulness
   improving the effectiveness of IR and text mining
     matching similar words
     reducing indexing size: combining words with the same root may reduce indexing size by as much as 40-50%.
Basic stemming methods
 Remove endings:
   if a word ends with a consonant other than s, followed by an s, then delete the s.
   if a word ends in es, drop the s.
   if a word ends in ing, delete the ing unless the remaining word consists only of one letter or of th.
   if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
   …
 Transform words:
   if a word ends with "ies" but not "eies" or "aies", then "ies" → "y".
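The rules above can be turned into code almost directly. Below is a rough Python sketch that implements only the handful of rules listed on this slide (the rule ordering is my assumption); it is not a full stemmer such as Porter's:

```python
def simple_stem(word):
    """Apply the few suffix rules listed on this slide (not a full stemmer)."""
    vowels = set("aeiou")
    w = word.lower()
    # "ies" -> "y" unless preceded by "e" or "a" (e.g., "studies" -> "study")
    if w.endswith("ies") and not (w.endswith("eies") or w.endswith("aies")):
        return w[:-3] + "y"
    # word ends in "es": drop only the s
    if w.endswith("es"):
        return w[:-1]
    # trailing "s" after a consonant other than s: delete the s
    if w.endswith("s") and len(w) > 1 and w[-2] not in vowels and w[-2] != "s":
        return w[:-1]
    # drop "ing" unless what remains is a single letter or "th"
    if w.endswith("ing") and len(w) > 4 and w[:-3] != "th":
        return w[:-3]
    # drop "ed" if preceded by a consonant, unless a single letter remains
    if w.endswith("ed") and len(w) > 3 and w[-3] not in vowels:
        return w[:-2]
    return w

if __name__ == "__main__":
    for w in ["users", "takes", "engineering", "engineered", "studies"]:
        print(w, "->", simple_stem(w))
    # user, take, engineer, engineer, study
```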
Frequency counts
 Count the number of times a word occurs in a document.
 Count the number of documents in a collection that contain a word.
 Use occurrence frequencies to indicate the relative importance of a word in a document:
   if a word appears often in a document, the document likely "deals with" subjects related to that word.
Vector Space Representation
 A document is represented as a vector: (W1, W2, …, Wn)
   Binary:
     Wi = 1 if the corresponding term i (often a word) is in the document
     Wi = 0 if the term i is not in the document
   TF (Term Frequency):
     Wi = tfi, where tfi is the number of times term i occurred in the document
   TF*IDF (Term Frequency * Inverse Document Frequency):
     Wi = tfi*idfi = tfi*log(N/dfi), where dfi is the number of documents containing term i, and N is the total number of documents in the collection.
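A small Python sketch of the three weighting schemes defined above (binary, TF, and TF*IDF); log base 10 is an arbitrary choice here, since any base preserves the ranking:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build binary, TF, and TF*IDF weights for a small corpus.

    docs: list of token lists. Returns (vocabulary, list of weight dicts).
    """
    N = len(docs)
    counts = [Counter(d) for d in docs]
    vocab = sorted(set(t for d in docs for t in d))
    # df[t]: number of documents that contain term t
    df = {t: sum(1 for c in counts if t in c) for t in vocab}

    weights = []
    for c in counts:
        binary = {t: 1 if t in c else 0 for t in vocab}
        tf = {t: c.get(t, 0) for t in vocab}
        tfidf = {t: tf[t] * math.log10(N / df[t]) for t in vocab}
        weights.append({"binary": binary, "tf": tf, "tfidf": tfidf})
    return vocab, weights

if __name__ == "__main__":
    docs = [["hardware", "software", "software"],
            ["software", "users"],
            ["hardware", "users", "users"]]
    vocab, w = tf_idf_vectors(docs)
    print(vocab)
    print(w[0]["tfidf"])
```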
Vector Space and Document Similarity

 Each indexing term is a dimension. An indexing term is normally a word.
 Each document is a vector:
   Di = (ti1, ti2, ti3, ti4, ..., tin)
   Dj = (tj1, tj2, tj3, tj4, ..., tjn)
 Document similarity is defined as the cosine of the angle between the two vectors:

   Similarity(Di, Dj) = Σ_{k=1..n} tik·tjk / ( sqrt(Σ_{k=1..n} tik²) · sqrt(Σ_{k=1..n} tjk²) )
Query formats
 A query is a representation of the user's information needs
   Normally a list of words.
 Query as a simple question in natural language
   The system translates the question into executable queries.
 Query as a document
   "Find similar documents like this one"
   The system defines what the similarity is.
An Example
 A document space is defined by three terms: hardware, software, users
 A set of documents is defined as:
   A1 = (1, 0, 0),  A2 = (0, 1, 0),  A3 = (0, 0, 1)
   A4 = (1, 1, 0),  A5 = (1, 0, 1),  A6 = (0, 1, 1)
   A7 = (1, 1, 1),  A8 = (1, 0, 1),  A9 = (0, 1, 1)
 If the query is "hardware and software", what documents should be retrieved?
An Example (cont.)
 In Boolean query matching:
   documents A4 and A7 will be retrieved ("AND")
   retrieved: A1, A2, A4, A5, A6, A7, A8, A9 ("OR")
 In similarity matching (cosine):
   q = (1, 1, 0)
   S(q, A1) = 0.71,  S(q, A2) = 0.71,  S(q, A3) = 0
   S(q, A4) = 1,     S(q, A5) = 0.5,   S(q, A6) = 0.5
   S(q, A7) = 0.82,  S(q, A8) = 0.5,   S(q, A9) = 0.5
 Document retrieved set (with ranking) = {A4, A7, A1, A2, A5, A6, A8, A9}
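The scores above can be reproduced with a few lines of Python (self-contained, so the cosine function is repeated here):

```python
import math

def cosine(q, d):
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d)))

# Documents over the terms (hardware, software, users)
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)  # query "hardware and software"

# Boolean AND: documents containing both query terms
print([n for n, d in docs.items() if d[0] and d[1]])            # ['A4', 'A7']
# Cosine ranking (A3 scores 0 and ends up last)
ranked = sorted(docs, key=lambda n: cosine(q, docs[n]), reverse=True)
print([(n, round(cosine(q, docs[n]), 2)) for n in ranked])
```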
Relevance judgment for IR
 A measurement of the outcome of a search or retrieval.
 The judgment on what should or should not be retrieved.
 There is no simple answer to what is relevant and what is not: we need human users.
   difficult to define
   subjective
   depends on knowledge, needs, time, etc.
 The central concept of information retrieval.
Precision and Recall
 Given a query:
   Are all retrieved documents relevant?
   Have all the relevant documents been retrieved?
 Measures for system performance:
   The first question is about the precision of the search.
   The second is about the completeness (recall) of the search.
Precision and Recall (cont)
                 Relevant    Not relevant
  Retrieved         a             b
  Not retrieved     c             d

  P = a / (a + b)        R = a / (a + c)
Precision and Recall (cont)
 Precision measures how precise a search is: the higher the precision, the fewer unwanted documents.

   Precision = (number of relevant documents retrieved) / (total number of documents retrieved)

 Recall measures how complete a search is: the higher the recall, the fewer missing documents.

   Recall = (number of relevant documents retrieved) / (number of all relevant documents in the database)
Relationship of R and P
 Theoretically,
   R and P do not depend on each other.
 Practically,
   high recall is achieved at the expense of precision;
   high precision is achieved at the expense of recall.
 When will P = 1?
   Only when every retrieved document is relevant.
 When will P = 0?
   Only when none of the retrieved documents is relevant.
 Depending on the application, you may want a higher precision or a higher recall.
P-R diagram
[P-R diagram: precision (P) on the vertical axis and recall (R) on the horizontal axis, each ranging from 0.1 to 1.0; curves for System A, System B, and System C illustrate the precision-recall trade-off.]
Alternative measures
 Combining recall and precision: the F score

   F = 2PR / (R + P)

 Breakeven point: when P = R.
 These two measures are commonly used in text mining: classification and clustering.
 Accuracy is not normally used in the text domain because the set of relevant documents is almost always very small compared to the set of irrelevant documents.
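A minimal Python sketch of precision, recall, and the F score, computed from sets of retrieved and relevant document ids:

```python
def precision_recall_f(retrieved, relevant):
    """Compute precision, recall, and F score from two sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)                 # relevant and retrieved
    p = a / len(retrieved) if retrieved else 0.0
    r = a / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f

if __name__ == "__main__":
    # 6 of the 8 retrieved documents are relevant; 10 relevant documents exist.
    print(precision_recall_f(range(8), [0, 1, 2, 3, 4, 5, 10, 11, 12, 13]))
    # -> (0.75, 0.6, 0.666...)
```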
Web Search as a huge IR system
 A Web crawler (robot) crawls the Web to collect all the pages.
 Servers establish a huge inverted index database and other indexing databases.
 At query (search) time, search engines conduct different types of vector query matching.
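A toy Python sketch of building an inverted index (term → ids of the pages containing it), the core index structure mentioned above; the pages are invented:

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each term to the sorted list of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}

if __name__ == "__main__":
    pages = {
        1: "text mining uses information retrieval techniques",
        2: "web search engines index web pages",
        3: "search engines rank pages by relevance",
    }
    index = build_inverted_index(pages)
    print(index["search"])   # [2, 3]
    print(index["pages"])    # [2, 3]
```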
Different search engines
 The real differences among search engines are:
   their indexing weight schemes
   their query processing methods
   their ranking algorithms
 None of these is published by any of the search engine firms.
Vector Space Based Document Classification
Vector Space Representation
 Each doc j is a vector, with one component for each term (= word).
 We have a vector space:
   terms are attributes
   n docs live in this space
   even with stop word removal and stemming, we may have 10,000+ dimensions, or even 1,000,000+
Classification in Vector space
 Each training doc is a point (vector) labeled by its topic (= class).
 Hypothesis: docs of the same topic form a contiguous region of space.
 Define surfaces to delineate topics in the space.
[Diagram: regions of the vector space labeled Government, Science, and Arts.]
Test doc = Government
[Diagram: a test document falls inside the Government region of the space and is therefore classified as Government.]
Rocchio Classification Method
 Given the training documents, compute a prototype vector for each class.
 Given a test doc, assign it to the topic whose prototype (centroid) is nearest using cosine similarity.
Rocchio Classification
 From the document vectors, construct a prototype vector for each class cj:

   prototype(cj) = (α/|Cj|) · Σ_{d ∈ Cj} d/||d||  −  (β/|D − Cj|) · Σ_{d ∈ D − Cj} d/||d||

 α and β are parameters that adjust the relative impact of relevant and irrelevant training examples. Normally, α = 16 and β = 4.
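A Python sketch of Rocchio classification, assuming documents are already dense term-weight vectors; α = 16 and β = 4 as on the slide, and the tiny example corpus is invented:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rocchio_prototypes(docs, labels, classes, alpha=16.0, beta=4.0):
    """prototype(c) = alpha * mean of normalized docs in class c
                    - beta  * mean of normalized docs outside class c."""
    dims = len(docs[0])
    protos = {}
    for c in classes:
        pos = [normalize(d) for d, y in zip(docs, labels) if y == c]
        neg = [normalize(d) for d, y in zip(docs, labels) if y != c]
        proto = [0.0] * dims
        for k in range(dims):
            p = sum(v[k] for v in pos) / len(pos) if pos else 0.0
            n = sum(v[k] for v in neg) / len(neg) if neg else 0.0
            proto[k] = alpha * p - beta * n
        protos[c] = proto
    return protos

def rocchio_classify(doc, protos):
    """Assign the class whose prototype is nearest by cosine similarity."""
    return max(protos, key=lambda c: cosine(doc, protos[c]))

if __name__ == "__main__":
    docs = [[2, 0, 1], [3, 1, 0], [0, 2, 3], [1, 3, 2]]
    labels = ["hardware", "hardware", "software", "software"]
    protos = rocchio_prototypes(docs, labels, {"hardware", "software"})
    print(rocchio_classify([2, 1, 0], protos))  # 'hardware'
```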
Naïve Bayesian Classifier
 Given a set of training documents D,
   each document is considered an ordered list of words.
   wdi,k denotes the word wt in position k of document di, where each word is from the vocabulary V = <w1, w2, …, w|V|>.
   Let C = {c1, c2, …, c|C|} be the set of pre-defined classes.
 There are two naïve Bayesian models:
   one based on the multi-variate Bernoulli model (a word occurs or does not occur in a document);
   one based on the multinomial model (the number of word occurrences is considered).
Naïve Bayesian Classifier (multinomial model)

   P(cj) = Σ_{i=1..|D|} P(cj | di) / |D|                                                              (1)

   P(wt | cj) = (1 + Σ_{i=1..|D|} N(wt, di) P(cj | di)) / (|V| + Σ_{s=1..|V|} Σ_{i=1..|D|} N(ws, di) P(cj | di))   (2)

 N(wt, di) is the number of times the word wt occurs in document di. For training documents, P(cj | di) is in {0, 1}.

   P(cj | di) = P(cj) Π_{k=1..|di|} P(wdi,k | cj) / Σ_{r=1..|C|} P(cr) Π_{k=1..|di|} P(wdi,k | cr)     (3)
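A Python sketch of multinomial naïve Bayes: training follows equations (1) and (2) (with the add-one smoothing shown there), and classification follows equation (3). Probabilities are multiplied directly for readability; a real implementation would sum logs to avoid underflow. The toy corpus is invented:

```python
from collections import Counter

def train_multinomial_nb(docs, labels):
    """docs: list of token lists; labels: the class of each document.
    Returns class priors P(c_j), word probabilities P(w_t | c_j), and the vocabulary."""
    D = len(docs)
    classes = set(labels)
    vocab = set(t for d in docs for t in d)
    priors = {c: sum(1 for y in labels if y == c) / D for c in classes}   # eq. (1)
    word_counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        word_counts[y].update(d)
    cond = {}
    for c in classes:
        total = sum(word_counts[c].values())
        # eq. (2): add-one (Laplace) smoothing over the vocabulary
        cond[c] = {t: (1 + word_counts[c][t]) / (len(vocab) + total) for t in vocab}
    return priors, cond, vocab

def classify(doc, priors, cond, vocab):
    """eq. (3): pick the class maximizing P(c_j) * prod_k P(w_k | c_j)."""
    scores = {}
    for c in priors:
        score = priors[c]
        for t in doc:
            if t in vocab:                 # unseen words are ignored in this sketch
                score *= cond[c][t]
        scores[c] = score
    return max(scores, key=scores.get)

if __name__ == "__main__":
    docs = [["cheap", "software", "deal"], ["software", "users", "manual"],
            ["cheap", "cheap", "deal"], ["hardware", "users", "guide"]]
    labels = ["spam", "ok", "spam", "ok"]
    priors, cond, vocab = train_multinomial_nb(docs, labels)
    print(classify(["cheap", "software"], priors, cond, vocab))  # 'spam'
```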
k Nearest Neighbor Classification

 To classify document d into class c:
   Define the k-neighborhood N as the k nearest neighbors of d.
   Count the number n of documents in N that belong to c.
   Estimate P(c|d) as n/k.
 No training is needed (?). Classification time is linear in the training set size.
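A Python sketch of kNN classification with cosine similarity as the "nearness" measure, assuming dense term-weight vectors; the training data are invented:

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(doc, train_docs, train_labels, k=6):
    """Find the k nearest training docs (by cosine) and return the majority
    class together with the estimate P(c|d) = n/k."""
    nearest = sorted(range(len(train_docs)),
                     key=lambda i: cosine(doc, train_docs[i]), reverse=True)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    best_class, n = votes.most_common(1)[0]
    return best_class, n / k

if __name__ == "__main__":
    train = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 1]]
    labels = ["science", "science", "science", "arts", "arts", "science"]
    print(knn_classify([1, 1, 1], train, labels, k=3))  # ('science', 0.666...)
```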
Example
[Diagram: training documents from the Government, Science, and Arts classes plotted in the vector space.]
Example: k = 6 (6NN)
[Diagram: a test document with its 6 nearest neighbors among the Government, Science, and Arts training documents; P(science | test doc) is estimated as the fraction of those 6 neighbors labeled Science.]
Linear classifiers: Binary Classification

 Consider 2-class problems.
 Assume linear separability for now:
   in 2 dimensions, we can separate the classes by a line;
   in higher dimensions, we need hyperplanes.
 We can find a separating hyperplane by linear programming (e.g., perceptron):
   the separator can be expressed as ax + by = c.
UIC - CS 594
Bing Liu
35
Linear programming / Perceptron

 Find a, b, c such that:
   ax + by ≥ c for red points
   ax + by ≤ c for green points
[Diagram: a 2-D scatter of red and green points separated by the line ax + by = c.]
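A Python sketch of the perceptron for the 2-D case above; it finds some separating line when one exists, not necessarily a good one. The points and labels are invented:

```python
def perceptron_train(points, labels, epochs=100, lr=1.0):
    """Learn (a, b, c) with a*x + b*y >= c for label +1 and < c for label -1."""
    a = b = c = 0.0
    for _ in range(epochs):
        errors = 0
        for (x, y), t in zip(points, labels):
            pred = 1 if a * x + b * y - c >= 0 else -1
            if pred != t:                    # misclassified: nudge the line
                a += lr * t * x
                b += lr * t * y
                c -= lr * t
                errors += 1
        if errors == 0:                      # all points correctly separated
            break
    return a, b, c

if __name__ == "__main__":
    pts = [(2, 3), (3, 3), (1, 1), (0, 0.5)]
    lbs = [1, 1, -1, -1]
    a, b, c = perceptron_train(pts, lbs)
    print([(1 if a * x + b * y - c >= 0 else -1) for x, y in pts])  # [1, 1, -1, -1]
```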
Linear Classifiers (cont.)
 Many common text classifiers are linear classifiers.
 Despite this similarity, there are large performance differences:
   For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose?
   What to do for non-separable problems?
Which hyperplane?
 In general, there are lots of possible solutions for a, b, c.
 A Support Vector Machine (SVM) finds an optimal solution.
Support Vector Machine (SVM)
 SVMs maximize the margin around the separating hyperplane.
 The decision function is fully specified by a subset of the training samples, the support vectors.
 Quadratic programming problem.
 SVM: very good for text classification.
[Diagram: two classes separated by a hyperplane; the margin is maximized, and the support vectors are the points lying on the margin.]
Optimal hyperplane
 Let the training examples be (xi, yi), i = 1, 2, …, n, where xi is an n-dimensional vector and yi is its class, -1 or 1.
 The class represented by the subset with yi = -1 and the class represented by the subset with yi = +1 are linearly separable if there exists (w, b) such that
   wT xi + b ≥ 0 for yi = +1
   wT xi + b < 0 for yi = -1
 The margin of separation m is the separation between the hyperplane wT x + b = 0 and the closest data points (the support vectors).
 The goal of an SVM is to find the optimal hyperplane with the maximum margin of separation.
A Geometrical Interpretation
 The decision boundary should be as far away from the data of both classes as possible.
   We maximize the margin, m.
[Diagram: Class 1 and Class 2 point clouds separated by a hyperplane, with margin m between them.]
SVM formulation: separable case
 Thus, support vector machines (SVMs) are linear functions of the form f(x) = wT x + b, where w is the weight vector and x is the input vector.
 To find the linear function:

   Minimize:    (1/2) wT w
   Subject to:  yi (wT xi + b) ≥ 1,  i = 1, 2, ..., n

 This is a quadratic programming problem.
Non-separable case: Soft margin SVM

 To deal with cases where there may be no separating hyperplane due to noisy labels of both positive and negative training examples, the soft margin SVM is proposed:

   Minimize:    (1/2) wT w + C Σ_{i=1..n} ξi
   Subject to:  yi (wT xi + b) ≥ 1 − ξi,  i = 1, 2, ..., n
                ξi ≥ 0,  i = 1, ..., n

 where C ≥ 0 is a parameter that controls the amount of training error allowed.
Illustration: Non-separable case

 Support vectors:
   1. margin support vectors (ξi = 0): correctly classified, on the margin
   2. non-margin support vectors (ξi < 1): correctly classified, but inside the margin
   3. non-margin support vectors (ξi > 1): errors (misclassified)
[Diagram: points of types 1, 2, and 3 shown relative to the separating hyperplane and its margin.]
Extension to Non-linear Decision Surface

 In general, complex real-world applications may not be expressed with linear functions.
 Key idea: transform xi into a higher dimensional space to "make life easier".
   Input space: the space the xi are in.
   Feature space: the space of f(xi) after the transformation.
[Diagram: the mapping f(·) projects points from the input space into the feature space, where the two classes become linearly separable.]
Kernel Trick
 The mapping function f(·) is used to project data into a higher dimensional feature space:
   x = (x1, ..., xn) → f(x) = (f1(x), ..., fN(x))
 In a higher dimensional space, the data are more likely to be linearly separable.
 In SVM, the projection can be done implicitly rather than explicitly, because the optimization does not actually need the explicit projection.
   It only needs a way to compute inner products between pairs of training examples (e.g., x, z).
   Kernel: K(x, z) = <f(x) · f(z)>
   If you know how to compute K, you do not need to know f.
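A tiny numerical illustration of the kernel trick. The quadratic kernel K(x, z) = (x·z)² is my choice of example (the slide names no specific kernel); computing K in the input space gives the same value as explicitly mapping both points into the space of all pairwise products and taking the inner product there:

```python
import itertools

def explicit_map(x):
    """Degree-2 feature map: all products x_i * x_j (both orderings)."""
    return [xi * xj for xi, xj in itertools.product(x, repeat=2)]

def kernel(x, z):
    """Quadratic kernel K(x, z) = (x . z)^2, computed in the input space."""
    return sum(a * b for a, b in zip(x, z)) ** 2

if __name__ == "__main__":
    x, z = [1.0, 2.0, 3.0], [4.0, 0.0, -1.0]
    lhs = kernel(x, z)
    rhs = sum(a * b for a, b in zip(explicit_map(x), explicit_map(z)))
    print(lhs, rhs)   # both 1.0: (1*4 + 2*0 + 3*(-1))^2 = 1
```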
Comments on SVM

 SVMs are seen as the best-performing method by many.
 The statistical significance of most results is not clear.
 Kernels are an elegant and efficient way to map data into a better representation.
 SVMs can be expensive to train (quadratic programming).
 For text classification, a linear kernel is common and often sufficient.
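A hedged sketch of a linear-kernel SVM text classifier using scikit-learn, assuming scikit-learn is installed; the tiny corpus and its labels are invented for illustration:

```python
# Assumes scikit-learn is available: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_docs = [
    "the senate passed the budget bill",          # politics
    "parliament debates the new election law",    # politics
    "the team won the championship game",         # sports
    "the striker scored twice in the final",      # sports
]
train_labels = ["politics", "politics", "sports", "sports"]

# Vector space representation: TF-IDF with English stop word removal
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(train_docs)

# Linear SVM (linear kernel), common for text classification
clf = LinearSVC()
clf.fit(X, train_labels)

test = vectorizer.transform(["the coach praised the game plan"])
print(clf.predict(test))   # expected: ['sports']
```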
Document clustering
 We can still use the normal clustering techniques, e.g., partitional and hierarchical methods.
 Documents can be represented using the vector space model.
 For the distance function, the cosine similarity measure is commonly used.
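A hedged sketch of document clustering, assuming scikit-learn is installed: k-means on TF-IDF vectors. TfidfVectorizer L2-normalizes the vectors, so Euclidean k-means on them behaves much like clustering by cosine similarity. The corpus is invented, and the exact cluster assignment may vary:

```python
# Assumes scikit-learn is available.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stock markets fell as investors sold shares",
    "investors bought shares as markets recovered",
    "the new processor benchmark shows faster speed",
    "the processor speed improved in the benchmark",
]

# TF-IDF vectors are unit-length by default, so Euclidean k-means on them
# is close to clustering by cosine similarity.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)   # expected: the two market docs in one cluster, the two processor docs in the other
```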
Summary
 Text mining applies and adapts data mining techniques to the text domain.
 A significant amount of pre-processing is needed before mining, using information retrieval techniques.