Information Retrieval
and
Vector Space Model
Computational Linguistics Course
Instructor: Professor Cercone
Presenter: Morteza Zihayat
Outline
Introduction to IR
IR System Architecture
Vector Space Model (VSM)
How to Assign Weights?
TF-IDF Weighting
Example
Advantages and Disadvantages of VS Model
Improving the VS Model
Introduction to IR
The world's total yearly production of unique
information stored in the form of print, film, optical,
and magnetic content would require roughly 1.5
billion gigabytes of storage. This is the equivalent of
250 megabytes per person for each man, woman,
and child on earth.
(Lyman & Varian, 2000)
Growth of textual information
Textual information keeps growing in many forms: literature, the WWW, news, email, desktop documents, blogs, and intranets.
How can we help manage and exploit all this information?
Information overflow
What is Information Retrieval (IR)?
Narrow sense:
IR = search engine technologies (e.g., Google, library information systems)
IR = text matching/classification
Broad sense: IR = text information management:
General problem: how to manage text information?
How to find useful information? (retrieval)
Example: Google
How to organize information? (text classification)
Example: Automatically assign emails to different folders
How to discover knowledge from text? (text mining)
Example: Discover correlation of events
Formalizing IR Tasks
Vocabulary: V = {w1, w2, …, wT} of a language
Query: q = q1 q2 … qm, where qi ∈ V
Document: di = di1 di2 … di,mi, where dij ∈ V
Collection: C = {d1, d2, …, dN}
Relevant document set: R(q) ⊆ C (generally unknown and user-dependent)
The query provides a “hint” about which documents should be in R(q)
IR: find an approximation R’(q) of the relevant document set
Source: This slide is borrowed from [1]
Evaluation measures
The quality of many retrieval systems depends on
how well they manage to rank relevant
documents.
How can we evaluate rankings in IR?
IR researchers have developed evaluation measures
specifically designed to evaluate rankings.
Most of these measures combine precision and recall in a
way that takes account of the ranking.
Precision & Recall
Source: This slide is borrowed from [1]
In other words:
Precision is the percentage of relevant items in the returned set.
Recall is the percentage of all relevant documents in the collection that are in the returned set.
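As a minimal sketch, both measures can be computed from the returned set and the (known) relevant set; the set names here are purely illustrative:

```python
def precision_recall(returned, relevant):
    """Precision: fraction of returned docs that are relevant.
    Recall: fraction of all relevant docs that were returned."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 returned docs are relevant; 3 of the 6 relevant docs were found.
p, r = precision_recall({"d1", "d2", "d3", "d7"},
                        {"d1", "d2", "d3", "d4", "d5", "d6"})
# p == 0.75, r == 0.5
```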
Evaluating Retrieval Performance
Source: This slide is borrowed from [1]
IR System Architecture
[Diagram: documents pass through an INDEXING stage into document representations; the user submits a query through the INTERFACE, and its representation drives the SEARCHING stage; searching produces a ranking of results; the user’s relevance judgments feed a QUERY MODIFICATION (feedback) loop that refines the query.]
Indexing Documents
Break documents into words
Apply a stop list
Apply stemming
Construct the index
Searching
Given a query, score documents efficiently
The basic question:
Given a query, how do we know if document A is more relevant than document B?
Document A uses more query words than document B
Word usage in document A is more similar to that in the query
…
We need a way to compute the relevance between the query and the documents
The Notion of Relevance
[Diagram: a taxonomy of retrieval models, all estimating Relevance(Rep(q), Rep(d)) with different representations and similarity/inference machinery:
Similarity-based models: the vector space model (Salton et al., 75) and the prob. distr. model (Wong & Yao, 89); the vector space model is today’s lecture.
Probability-of-relevance models, P(r=1|q,d), r ∈ {0,1}: the regression model (Fox 83), and generative models via document generation (classical prob. model; Robertson & Sparck Jones, 76) or query generation (LM approach; Ponte & Croft, 98; Lafferty & Zhai, 01a).
Probabilistic inference models, P(d→q) or P(q→d): the prob. concept space model (Wong & Yao, 95) and the inference network model (Turtle & Croft, 91).]
Relevance = Similarity
Assumptions
Query and document are represented similarly
A query can be regarded as a “document”
Relevance(d,q) ≈ similarity(d,q)
R(q) = {d ∈ C | f(d,q) > θ}, where f(q,d) = similarity(Rep(q), Rep(d))
Key issues
How to represent query/document?
Vector Space Model (VSM)
How to define the similarity measure?
Vector Space Model (VSM)
The vector space model is one of the most widely
used models for ad-hoc retrieval
Used in information filtering, information
retrieval, indexing and relevancy rankings.
VSM
Represent a doc/query by a term vector
Term: basic concept, e.g., word or phrase
Each term defines one dimension
N terms define a high-dimensional space
E.g., d = (x1, …, xN), where xi is the “importance” of term i
Measure relevance by the distance between the
query vector and document vector in the vector
space
VS Model: Illustration
[Diagram: documents D1–D11 and a query plotted as vectors in a three-dimensional term space with axes “Starbucks”, “Microsoft”, and “Java”; documents whose vectors lie close to the query vector are the candidate matches.]
Some Issues with the VS Model
There is no consistent definition of the basic concept (term)
The model itself does not specify how to assign weights to words
Weight in the query indicates the importance of the term
How to Assign Weights?
Different terms have different importance in a text
A term weighting scheme plays an important role
for the similarity measure.
Higher weight = greater impact
We now turn to the question of how to weight
words in the vector space model.
There are three components in a weighting scheme:
gi: the global weight of the ith term
tij: the local weight of the ith term in the jth document
dj: the normalization factor for the jth document
TF Weighting
Idea: A term is more important if it occurs more
frequently in a document
Formulas: Let f(t,d) be the frequency count of term t
in doc d
Raw TF: TF(t,d) = f(t,d)
Log TF: TF(t,d)=log f(t,d)
Maximum frequency normalization:
TF(t,d) = 0.5 +0.5*f(t,d)/MaxFreq(d)
Normalization of TF is very important!
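The three TF formulas above can be written down directly (a minimal sketch; `f` is the raw frequency count f(t,d), and the log variant is defined here only for f > 0):

```python
from math import log

def raw_tf(f):
    # Raw TF: TF(t,d) = f(t,d)
    return f

def log_tf(f):
    # Log TF: TF(t,d) = log f(t,d), for f > 0
    return log(f)

def max_norm_tf(f, max_freq):
    # Maximum frequency normalization:
    # TF(t,d) = 0.5 + 0.5 * f(t,d) / MaxFreq(d)
    return 0.5 + 0.5 * f / max_freq

# A term occurring 4 times in a doc whose most frequent term occurs 8 times:
# raw_tf(4) == 4, max_norm_tf(4, 8) == 0.75
```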
TF Methods
[Table of TF weighting methods; not reproduced in this transcript.]
IDF Weighting
Idea: A term is more discriminative if it occurs in fewer documents
Formula:
IDF(t) = 1 + log(n/k)
n : total number of docs
k : # docs with term t (doc freq)
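A direct sketch of this IDF formula (the natural log is assumed here, since the slide does not fix the base):

```python
from math import log

def idf(n_docs, doc_freq):
    # IDF(t) = 1 + log(n / k): the rarer the term, the higher the weight.
    return 1 + log(n_docs / doc_freq)

# In a 1000-doc collection, a term found in 10 docs outweighs one found in 500:
# idf(1000, 10) > idf(1000, 500), and a term in every doc gets idf == 1.0
```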
IDF Weighting Methods
[Table of IDF weighting methods; not reproduced in this transcript.]
TF Normalization
Why?
Document length variation
“Repeated occurrences” are less informative than the “first
occurrence”
Two views of document length
A doc is long because it uses more words
A doc is long because it has more content
Generally penalize long docs, but avoid over-penalizing
TF-IDF Weighting
TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t)
Common in the document → high TF → high weight
Rare in the collection → high IDF → high weight
Imagine a word count profile, what kind of terms
would have high weights?
How to Measure Similarity?
Di = (wi1, …, wiN)
Q = (wq1, …, wqN)
(w = 0 if a term is absent)
Dot product similarity:
SC(Q, Di) = Σ(j=1..N) wqj * wij
Cosine similarity (normalized dot product):
sim(Q, Di) = ( Σ(j=1..N) wqj * wij ) / ( sqrt(Σ(j=1..N) wqj²) * sqrt(Σ(j=1..N) wij²) )
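Both measures can be sketched in a few lines; representing vectors as equal-length lists of weights is an assumption made purely for illustration:

```python
from math import sqrt

def dot(q, d):
    # SC(Q, Di) = sum over j of wqj * wij
    return sum(wq * wd for wq, wd in zip(q, d))

def cosine(q, d):
    # Normalized dot product; returns 0.0 if either vector is all zeros.
    denom = sqrt(dot(q, q)) * sqrt(dot(d, d))
    return dot(q, d) / denom if denom else 0.0

q = [1.0, 2.0, 0.0]
d = [2.0, 4.0, 0.0]  # same direction as q, but twice the length
# dot(q, d) == 10.0; cosine(q, d) ≈ 1.0 (direction matters, not length)
```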
VS Example: Raw TF & Dot Product
query = “information retrieval”
doc1 = “information retrieval search engine information”
doc2 = “travel information map travel”
doc3 = “government president congress”

IDF (fake): information 2.4, retrieval 4.5, travel 2.8, map 3.3, search 2.1, engine 5.4, government 2.2, president 3.2, congress 4.3

Raw TF, with TF*IDF in parentheses:
doc1: information 2 (4.8), retrieval 1 (4.5), search 1 (2.1), engine 1 (5.4)
doc2: travel 2 (5.6), information 1 (2.4), map 1 (3.3)
doc3: government 1 (2.2), president 1 (3.2), congress 1 (4.3)
query: information 1 (2.4), retrieval 1 (4.5)

Sim(q,doc1) = 4.8*2.4 + 4.5*4.5
Sim(q,doc2) = 2.4*2.4
Sim(q,doc3) = 0
Example
Q: “gold silver truck”
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
Document frequency of the jth term: dfj
Inverse document frequency: idfj = log10(n / dfj)
TF*IDF is used as the term weight here
Example (Cont’d)

Id  Term      df  idf = log10(3/df)
1   a         3   0
2   arrived   2   0.176
3   damaged   1   0.477
4   delivery  1   0.477
5   fire      1   0.477
6   gold      2   0.176
7   in        3   0
8   of        3   0
9   silver    1   0.477
10  shipment  2   0.176
11  truck     2   0.176
Example (Cont’d)
TF*IDF is used here (t1–t11 = a, arrived, damaged, delivery, fire, gold, in, of, silver, shipment, truck):

doc  t1  t2     t3     t4     t5     t6     t7  t8  t9     t10    t11
D1   0   0      0.477  0      0.477  0.176  0   0   0      0.176  0
D2   0   0.176  0      0.477  0      0      0   0   0.954  0      0.176
D3   0   0.176  0      0      0      0.176  0   0   0      0.176  0.176
Q    0   0      0      0      0      0.176  0   0   0.477  0      0.176

SC(Q, D1) = (0.176)(0.176) = 0.031
SC(Q, D2) = (0.954)(0.477) + (0.176)(0.176) = 0.486
SC(Q, D3) = (0.176)(0.176) + (0.176)(0.176) = 0.062
The ranking would be D2, D3, D1.
This SC uses the dot product.
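The worked example above can be reproduced in a few lines (a sketch; tokenization is plain whitespace splitting and the text is already lowercase):

```python
from collections import Counter
from math import log10

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

# Document frequency and idf = log10(n / df) over the n = 3 documents.
n = len(docs)
vocab = sorted({t for d in docs.values() for t in d.split()})
df = {t: sum(t in d.split() for d in docs.values()) for t in vocab}
idf = {t: log10(n / df[t]) for t in vocab}

def tfidf_vector(text):
    # Weight of each vocabulary term: raw TF in this text times idf.
    tf = Counter(text.split())
    return {t: tf[t] * idf[t] for t in vocab}

def dot(u, v):
    return sum(u[t] * v[t] for t in vocab)

q_vec = tfidf_vector(query)
scores = {name: round(dot(tfidf_vector(d), q_vec), 3) for name, d in docs.items()}
# scores == {'D1': 0.031, 'D2': 0.486, 'D3': 0.062}; ranking D2 > D3 > D1
```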
Advantages of VS Model
Empirically effective! (Top TREC performance)
Intuitive
Easy to implement
Well-studied/Most evaluated
The SMART system
Developed at Cornell: 1960-1999
Still widely used
Warning: Many variants of TF-IDF!
Disadvantages of VS Model
Assumes term independence
Assumes the query and documents are represented in the same way
Lots of parameter tuning!
Improving the VSM Model
We can improve the model by:
Reducing the number of dimensions
eliminating all stop words and very common terms
stemming terms to their roots
Latent Semantic Analysis
Not retrieving documents below a defined cosine threshold
Normalizing term frequencies: the normalized frequency of a term i in document j is tfij = freqij / maxk(freqkj)
[Formulas for the normalized document and query frequencies not reproduced.] [1]
Stop List
Function words do not bear useful information for IR
of, not, to, or, in, about, with, I, be, …
Stop list: contains stop words, which are not used as index terms
Prepositions
Articles
Pronouns
Some adverbs and adjectives
Some frequent words (e.g. document)
The removal of stop words usually improves IR
effectiveness
A few “standard” stop lists are commonly used.
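Stop-word removal is a simple filter at indexing time; this sketch uses a tiny illustrative stop list, not one of the standard lists:

```python
STOP_WORDS = {"of", "not", "to", "or", "in", "about", "with", "i", "be", "a", "the"}

def remove_stop_words(tokens):
    # Keep only tokens that are not on the stop list.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "shipment of gold in a truck".split()
# remove_stop_words(tokens) == ['shipment', 'gold', 'truck']
```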
Stemming
Reason:
◦ Different word forms may bear similar meaning (e.g. search,
searching): create a “standard” representation for them
Stemming:
◦ Removing some endings of word
dancer, dancers, dance, danced, dancing → dance
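A crude suffix-stripping stemmer, far simpler than Porter's algorithm and written purely to illustrate the idea, might look like this; the suffix list and minimum stem length are arbitrary choices:

```python
def crude_stem(word):
    # Strip one common English ending; try longer suffixes first,
    # and keep at least a 3-character stem.
    for suffix in ("ings", "ing", "ers", "er", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["dancer", "dancers", "danced", "dancing"]
# All four reduce to the same stem 'danc', so they match at retrieval time:
# [crude_stem(w) for w in words] == ['danc', 'danc', 'danc', 'danc']
```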
Stemming (Cont’d)
Two main methods:
◦ Linguistic/dictionary-based stemming
high stemming accuracy
high implementation and processing costs, and higher coverage
◦ Porter-style stemming
lower stemming accuracy
lower implementation and processing costs, and lower coverage
usually sufficient for IR
Latent Semantic Indexing (LSI) [3]
Reduces the dimensions of the term-document
space
Attempts to address the problems of synonymy and polysemy
Uses Singular Value Decomposition (SVD)
identifies patterns in the relationships between the terms
and concepts contained in an unstructured collection of text
Based on the principle that words that are used in
the same contexts tend to have similar meanings.
LSI Process
In general, the process involves:
constructing a weighted term-document matrix
performing a Singular Value Decomposition on the
matrix
using the matrix to identify the concepts contained in the
text
LSI statistically analyses the patterns of word
usage across the entire document collection
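The first step, constructing a weighted term-document matrix, can be sketched as follows; the tiny corpus and whitespace tokenization are illustrative, and the SVD itself would then be applied to this matrix with a numerical library:

```python
from collections import Counter
from math import log10

docs = [
    "gold silver truck",
    "shipment of gold arrived",
    "silver delivery arrived",
]

# Vocabulary and document frequencies over the collection.
vocab = sorted({t for d in docs for t in d.split()})
df = {t: sum(t in d.split() for d in docs) for t in vocab}

# Weighted term-document matrix A: rows = terms, columns = documents,
# A[i][j] = tf(term_i, doc_j) * log10(n / df(term_i)).
n = len(docs)
counts = [Counter(d.split()) for d in docs]
A = [[counts[j][t] * log10(n / df[t]) for j in range(n)] for t in vocab]

# SVD would factor A into term-concept, concept-strength, and
# concept-document parts, from which the top concepts are read off.
```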
References
Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and
Hinrich Schuetze
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/2.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/ir4up.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/e09-3009.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/07models-vsm.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/03vectorspaceimplementation-6per.pdf
https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture02.ppt
https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/vector_space_modelupdated.ppt
https://wiki.cse.yorku.ca/course_archive/201112/F/6339/_media/lecture_13_ir_and_vsm_.ppt
Document Classification based on Wikipedia Content,
http://www.iicm.tugraz.at/cguetl/courses/isr/opt/classification/Vector_Space_Model.html?timestamp=1318275702299
Thanks For Your Attention ….