INFM 700: Session 7 - University of Maryland, College Park
INFM 700: Session 7
Search (Part I)
Introduction to Information Retrieval
Paul Jacobs
The iSchool
University of Maryland
Monday, November 9, 2009
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
Goals for Search Sessions
- Understand the basic issues in information retrieval (searching primarily unstructured text)
- Know the techniques generally used by modern search engines
- Learn how search engines can be used most effectively in information architecture
Today’s Topics
Introduction to Information Retrieval
Keywords, inverted indices, and Boolean retrieval
The vector space model, ranked retrieval
Major issues
Some additional tricks
Examples: web search and site search
Levels of Structure
Different types of data:
- Structured data
- Semi-structured data
- Unstructured data
How do you provide access to unstructured data?
- Manually develop an organization system (add structure)
- Provide search capabilities
What is search?
Search is query-based access. How is this different from browsing?
Things one can search on:
- Content
- Metadata
- Organization systems
- Labels
- …
Some Key Concepts
Different search paradigms:
- Boolean, "keyword"
- "Natural language" or "free text" (full text) search
Current search engines are primarily full text and statistical.
The fundamental challenge: words & concepts
The basic method: weighting and context
Other tricks (there are many!):
- Structuring
- Popularity and importance (of pages, documents)
- Metadata and thesauri
- User feedback
The Central Problem in IR
[Diagram: Authors encode concepts in documents; a searcher encodes concepts in a query. Do these represent the same concepts?]
Architecture of IR Systems
[Diagram: Offline, documents pass through a representation function to produce document representations, which are stored in an index. Online, a query passes through a representation function to produce a query representation. A comparison function matches the query representation against the index to produce hits.]
How do we represent text?
Remember: computers don't "understand" documents or queries.
Simple, yet effective approach: "bag of words"
- Treat all the words in a document as index terms
- Assign a "weight" to each term based on "importance"
- Disregard order, structure, meaning, etc. of the words
Assumptions:
- Term occurrence is independent (of other terms)
- Document relevance is independent (of other documents)
- "Words" can be defined
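The bag-of-words idea can be sketched in a few lines of Python. This is a toy illustration; the tokenizer regex is an assumption for this sketch, and real systems define "words" far more carefully (as the next slide shows).

```python
from collections import Counter
import re

def bag_of_words(text):
    # Lowercase and pull out word tokens, then count occurrences;
    # order, structure, and meaning are all discarded.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

doc = "The quick brown fox jumped over the lazy dog's back."
print(bag_of_words(doc))
```

The resulting counts (e.g., "the" appears twice) are exactly the term frequencies used for weighting later.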
What's a word?
(Chinese) 天主教教宗若望保祿二世因感冒再度住進醫院。這是他今年第二度因同樣的病因住院。
(Arabic) وقال مارك ريجيف الناطق باسم الخارجية الإسرائيلية إن شارون قبل الدعوة وسيقوم للمرة الأولى بزيارة تونس، التي كانت لفترة طويلة المقر الرسمي لمنظمة التحرير الفلسطينية بعد خروجها من لبنان عام 1982.
(Russian) Выступая в Мещанском суде Москвы экс-глава ЮКОСа заявил не совершал ничего противозаконного, в чем обвиняет его генпрокуратура России.
(Hindi) भारत सरकार ने आर्थिक सर्वेक्षण में वर्वत्तीय र्वर्ि 2005-06 में सात फ़ीसदी वर्वकास दर हाससल करने का आकलन ककया है और कर सुधार पर ज़ोर ददया है
(Japanese) 日米連合で台頭中国に対処…アーミテージ前副長官提言
(Korean) 조재영 기자= 서울시는 25일 이명박 시장이 `행정중심복합도시'' 건설안에 대해 `군대라도 동원해 막고싶은 심정''이라고 말했다는 일부 언론의 보도를 부인했다.
Sample Document

McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.

But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.

But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.

Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.

"Bag of Words" view of this document:
14 × McDonald's
12 × fat
11 × fries
8 × new
6 × company, french, nutrition
5 × food, oil, percent, reduce, taste, Tuesday
…
Why does "bag of words" work (at all)?
- Words alone tell us a lot about content! (even scrambled, "355 back correction Dow pulls signaling" is recognizably a stock-market story)
- Words are our main tool for describing concepts
- Words in context are especially powerful ("blind Venetian" vs. "Venetian blind")
- Getting beyond words is hard
- Structure usually (but not always) can be guessed from content
Boolean Retrieval
Users express queries as a Boolean (logical) expression:
- "terms" (usually words or phrases) joined by AND, OR, NOT
- Can be arbitrarily nested
- Difference between "term" and "keyword"?
Retrieval is based on the notion of sets:
- Any given query divides the collection into two sets: retrieved and not-retrieved (the complement)
- Pure Boolean systems do not define an ordering of the results (no ranking)
AND/OR/NOT
[Venn diagram: sets A, B, and C inside the collection of all documents, illustrating intersection (AND), union (OR), and complement (NOT)]
Logic Tables

A AND B:
        B=0  B=1
A=0      0    0
A=1      0    1

A OR B:
        B=0  B=1
A=0      0    1
A=1      1    1

NOT B:
B=0 -> 1
B=1 -> 0

A NOT B (= A AND NOT B):
        B=0  B=1
A=0      0    0
A=1      1    0
Representing Documents

Document 1: "The quick brown fox jumped over the lazy dog's back."
Document 2: "Now is the time for all good men to come to the aid of their party."

Term    Doc 1  Doc 2
aid       0      1
all       0      1
back      1      0
brown     1      0
come      0      1
dog       1      0
fox       1      0
good      0      1
jump      1      0
lazy      1      0
men       0      1
now       0      1
over      1      0
party     0      1
quick     1      0
their     0      1
time      0      1
Boolean View of a Collection

Stopword list (removed before indexing): for, is, of, the, to

Term    Doc1 Doc2 Doc3 Doc4 Doc5 Doc6 Doc7 Doc8
aid      0    0    0    1    0    0    0    1
all      0    1    0    1    0    1    0    0
back     1    0    1    0    0    0    1    0
brown    1    0    1    0    1    0    1    0
come     0    1    0    1    0    1    0    1
dog      0    0    1    0    1    0    0    0
fox      0    0    1    0    1    0    1    0
good     0    1    0    1    0    1    0    1
jump     0    0    1    0    0    0    0    0
lazy     1    0    1    0    1    0    1    0
men      0    1    0    1    0    0    0    1
now      0    1    0    0    0    1    0    1
over     1    0    1    0    1    0    1    1
party    0    0    0    0    0    1    0    1
quick    1    0    1    0    0    0    0    0
their    1    0    0    0    1    0    1    0
time     0    1    0    1    0    1    0    0

Each column represents the view of a particular document: what terms are contained in this document? Each row represents the view of a particular term: what documents contain this term? To execute a query, pick out the rows corresponding to the query terms and then apply the logic table of the corresponding Boolean operator.
Sample Queries

Term    Doc1 Doc2 Doc3 Doc4 Doc5 Doc6 Doc7 Doc8
dog      0    0    1    0    1    0    0    0
fox      0    0    1    0    1    0    1    0

dog AND fox -> 0 0 1 0 1 0 0 0 -> Doc 3, Doc 5
dog OR fox  -> 0 0 1 0 1 0 1 0 -> Doc 3, Doc 5, Doc 7
dog NOT fox -> 0 0 0 0 0 0 0 0 -> empty
fox NOT dog -> 0 0 0 0 0 0 1 0 -> Doc 7

Term    Doc1 Doc2 Doc3 Doc4 Doc5 Doc6 Doc7 Doc8
good     0    1    0    1    0    1    0    1
party    0    0    0    0    0    1    0    1
over     1    0    1    0    1    0    1    1

good AND party          -> 0 0 0 0 0 1 0 1 -> Doc 6, Doc 8
good AND party NOT over -> 0 0 0 0 0 1 0 0 -> Doc 6
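Since a Boolean query just combines the rows above with set operations, the whole procedure can be sketched with Python sets (the term rows below are copied from the collection table; document numbers 1 through 8):

```python
# Each term maps to the set of documents that contain it
index = {
    "dog":   {3, 5},
    "fox":   {3, 5, 7},
    "good":  {2, 4, 6, 8},
    "party": {6, 8},
    "over":  {1, 3, 5, 7, 8},
}

# AND = intersection, OR = union, NOT = set difference
print(sorted(index["dog"] & index["fox"]))   # dog AND fox -> [3, 5]
print(sorted(index["dog"] | index["fox"]))   # dog OR fox -> [3, 5, 7]
print(sorted(index["fox"] - index["dog"]))   # fox NOT dog -> [7]
print(sorted((index["good"] & index["party"]) - index["over"]))  # -> [6]
```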
Inverted Index

Instead of storing the full term-document matrix, store for each term a sorted list of the documents that contain it (its "postings"):

Term    Postings
aid     4 -> 8
all     2 -> 4 -> 6
back    1 -> 3 -> 7
brown   1 -> 3 -> 5 -> 7
come    2 -> 4 -> 6 -> 8
dog     3 -> 5
fox     3 -> 5 -> 7
good    2 -> 4 -> 6 -> 8
jump    3
lazy    1 -> 3 -> 5 -> 7
men     2 -> 4 -> 8
now     2 -> 6 -> 8
over    1 -> 3 -> 5 -> 7 -> 8
party   6 -> 8
quick   1 -> 3
their   1 -> 5 -> 7
time    2 -> 4 -> 6
Boolean Retrieval

To execute a Boolean query, e.g. ( fox OR dog ) AND quick:
- Build the query syntax tree: AND( OR( fox, dog ), quick )
- For each clause, look up its postings: dog -> 3, 5; fox -> 3, 5, 7; quick -> 1, 3
- Traverse the postings and apply the Boolean operators: OR = union, so fox OR dog -> 3, 5, 7; intersecting that with quick (AND) leaves Doc 3

Efficiency analysis:
- Postings traversal is linear (assuming sorted postings)
- Start with the shortest postings first
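The linear traversal of sorted postings can be sketched as a standard two-pointer merge (the postings lists below follow the inverted index from the earlier slides):

```python
def intersect(p1, p2):
    # Linear merge of two sorted postings lists (AND):
    # advance the pointer sitting on the smaller document id.
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def union(p1, p2):
    # OR: merge both lists, keeping each document once, in sorted order.
    return sorted(set(p1) | set(p2))

fox, dog, quick = [3, 5, 7], [3, 5], [1, 3]
print(intersect(union(fox, dog), quick))  # ( fox OR dog ) AND quick -> [3]
```

Because each pointer only moves forward, intersecting two lists costs time proportional to their combined length, which is why starting with the shortest list pays off.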
Why Boolean Retrieval Works
Boolean operators approximate concepts. How so?
- AND can identify relationships between concepts (e.g., interest rate, web design)
- OR can identify alternate terminology (e.g., interest percentage, HTML layout, etc.)
- NOT can filter alternate meanings (e.g., conflict AND interest AND NOT rate, NOT spider)
Why Boolean Retrieval Fails
- It's really hard to come up with the "right" queries
- Casual searchers have difficulty with the logic
- Some concepts are just hard to express, e.g. "corporate mergers & acquisitions" (a relevant document may only say "IBM acquired Lotus")
- Relevance is not absolute: some documents are more relevant, or more helpful, than others
Ranked Retrieval in the Vector Space Model
Order documents by how likely they are to be relevant to the information need:
- Estimate relevance(q, d_i)
- Sort documents by relevance
- Display sorted results, usually one screen at a time
How do we estimate relevance?
- Assume that document d is relevant to query q if they share terms in common
- Replace relevance(q, d_i) with sim(q, d_i) (similarity)
- Compute similarity of vector representations
Vector Representation
"Bags of words" can be represented as vectors
- Why? Computational efficiency, ease of manipulation
- Geometric metaphor: "arrows"
A vector is a set of values recorded in any consistent order:
"The quick brown fox jumped over the lazy dog's back" -> [1, 1, 1, 1, 1, 1, 1, 1, 2]
where position 1 corresponds to "back", 2 to "brown", 3 to "dog", 4 to "fox", 5 to "jump", 6 to "lazy", 7 to "over", 8 to "quick", and 9 to "the" (which occurs twice).
Vector Space Model
[Diagram: documents d1 through d5 drawn as vectors in a space whose axes are terms t1, t2, t3, with angles θ and φ between vectors]
Assumption: documents that are "close together" in vector space "talk about" the same things. Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ "closeness").
Similarity Metric
How about Euclidean distance, |d_j − d_k|? It penalizes length differences: a long and a short document on the same topic end up far apart. Instead of Euclidean distance, use the "angle" between the vectors. It all boils down to the inner product (dot product) of vectors:

sim(d_j, d_k) = cos(θ) = (d_j · d_k) / (|d_j| |d_k|)
              = Σ_{i=1..n} w_{i,j} w_{i,k} / ( √(Σ_{i=1..n} w_{i,j}²) · √(Σ_{i=1..n} w_{i,k}²) )
Components of Similarity
The "inner product" (aka dot product) is the key to the similarity function:

d_j · d_k = Σ_{i=1..n} w_{i,j} w_{i,k}

Example: (1, 2, 3, 0, 2) · (2, 0, 1, 0, 2) = 1·2 + 2·0 + 3·1 + 0·0 + 2·2 = 9

The denominator handles document length normalization:

|d_j| = √(Σ_{i=1..n} w_{i,j}²)

Example: |(1, 2, 3, 0, 2)| = √(1 + 4 + 9 + 0 + 4) = √18 ≈ 4.24
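Putting the two components together, cosine similarity is the dot product divided by the product of the vector lengths. A quick sketch using the example vectors above:

```python
import math

def cosine(dj, dk):
    # cos(theta) = dot product over the product of vector lengths
    dot = sum(wj * wk for wj, wk in zip(dj, dk))
    norm_j = math.sqrt(sum(w * w for w in dj))
    norm_k = math.sqrt(sum(w * w for w in dk))
    return dot / (norm_j * norm_k)

dj = [1, 2, 3, 0, 2]
dk = [2, 0, 1, 0, 2]
print(sum(wj * wk for wj, wk in zip(dj, dk)))       # inner product: 9
print(round(math.sqrt(sum(w * w for w in dj)), 2))  # |dj|: 4.24
print(round(cosine(dj, dk), 3))                     # 9 / (4.243 * 3) = 0.707
```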
Term Weighting
Term weights consist of two components:
- Local: how important is the term in this document?
- Global: how important is the term in the collection?
Here's the intuition:
- Terms that appear often in a document should get high weights
- Terms that appear in many documents should get low weights
How do we capture this mathematically?
- Term frequency (local)
- Inverse document frequency (global)
TF.IDF Term Weighting

w_{i,j} = tf_{i,j} · log(N / n_i)

where
- w_{i,j} = weight assigned to term i in document j
- tf_{i,j} = number of occurrences of term i in document j
- N = number of documents in the entire collection
- n_i = number of documents containing term i
TF.IDF Example (N = 4 documents; idf = log10(N / n_i))

Term           Doc 1  Doc 2  Doc 3  Doc 4    idf
complicated      -      -      5      2     0.301
contaminated     4      1      3      -     0.125
fallout          5      -      4      3     0.125
information      6      3      3      2     0.000
interesting      -      1      -      -     0.602
nuclear          3      -      7      -     0.301
retrieval        -      6      1      4     0.125
siberia          2      -      -      -     0.602

Equivalently, as a weighted inverted index, each term's postings store (document, tf) pairs, e.g. nuclear -> (1, 3), (3, 7).
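The weights can be recomputed from the formula w_{i,j} = tf_{i,j} · log(N/n_i). A sketch using a few of the example's counts (base-10 log, N = 4, as in the table):

```python
import math

N = 4  # documents in the collection
# term -> {doc: tf}, a subset of the example table
tf = {
    "complicated": {3: 5, 4: 2},
    "information": {1: 6, 2: 3, 3: 3, 4: 2},
    "nuclear":     {1: 3, 3: 7},
}

def tfidf(term, doc):
    # w_ij = tf_ij * log10(N / n_i), where n_i = number of docs containing term i
    n_i = len(tf[term])
    return tf[term].get(doc, 0) * math.log10(N / n_i)

print(round(math.log10(N / len(tf["complicated"])), 3))  # idf: 0.301
print(round(tfidf("nuclear", 3), 3))                     # 7 * 0.301 = 2.107
print(tfidf("information", 1))                           # idf = 0, so weight 0.0
```

Note how "information", which appears in every document, gets idf 0 and therefore weight 0 everywhere: a term that cannot discriminate between documents contributes nothing to ranking.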
Document Scoring Algorithm
- Initialize accumulators to hold document scores
- For each query term t in the user's query:
  - Fetch t's postings
  - For each document in them, score_doc += w_{t,d} × w_{t,q}
- Apply length normalization to the scores at the end
- Return the top N documents
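The accumulator loop above can be sketched as follows. The postings weights follow the TF.IDF example (e.g., nuclear in Doc 1: 3 × 0.301 ≈ 0.903), but the document norms are made-up placeholder values for illustration; in practice they would be precomputed from the full document vectors.

```python
# term -> list of (doc_id, w_td) pairs, weights from the TF.IDF example
postings = {
    "nuclear": [(1, 0.903), (3, 2.107)],
    "fallout": [(1, 0.625), (3, 0.500), (4, 0.375)],
}
doc_norm = {1: 2.0, 3: 3.0, 4: 1.5}  # placeholder document vector lengths

def score(query_weights, top_n=10):
    # Term-at-a-time scoring: accumulate w_td * w_tq per document,
    # then length-normalize and return the top N.
    acc = {}
    for term, w_tq in query_weights.items():
        for doc, w_td in postings.get(term, []):
            acc[doc] = acc.get(doc, 0.0) + w_td * w_tq
    for doc in acc:
        acc[doc] /= doc_norm[doc]  # length normalization at the end
    return sorted(acc.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(score({"nuclear": 1.0, "fallout": 1.0}))
```

Doc 3 accumulates 2.107 + 0.500 = 2.607 before normalization and ranks first; only documents that share at least one term with the query ever get an accumulator.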
Summary thus far…
- Represent documents (and queries) as "bags of words" (terms)
- Derive term weights based on frequency
- Use weighted term vectors for each document and query
- Compute a vector-based similarity score
- Display sorted, ranked results
Issues and Tricks
- What's a word/term? We can ignore words ("stop words"), combine words (phrases), or split up ("stem") words; other special treatment (e.g., names, categories)
- Query formulation/suggestion
- Type of information need
- Popularity: based on link analysis/PageRank; based on click-through and other signals
- Structuring and tagging (e.g., "best bets")
Issues and Tricks (cont'd)
- Thesaurus/query expansion: based on meaning and conceptual relationships; based on decomposition/type
- User feedback / "More like this"
- Clustering/grouping of results
Morphological Variation
Handling morphology: related concepts have different forms
- Inflectional morphology: same part of speech
  - dogs = dog + PLURAL
  - broke = break + PAST
- Derivational morphology: different parts of speech
  - destruction = destroy + ion
  - researcher = research + er
Different morphological processes: prefixing, suffixing, infixing, reduplication
Stemming
Dealing with morphological variation: index stems instead of words
- Stem: a word equivalence class that preserves the central concept
How much to stem?
- organization -> organize -> organ?
- resubmission -> resubmit/submission -> submit?
- reconstructionism?
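A crude suffix-stripping stemmer illustrates the equivalence-class idea. This is a toy sketch with an invented rule list, far simpler than real stemmers such as Porter's; note how it happily over-stems "organization" to "organ", exactly the risk raised above.

```python
def crude_stem(word):
    # Strip the first matching suffix, trying longer suffixes first,
    # and only if a stem of at least 3 letters remains.
    for suffix in ("ization", "ation", "tion", "ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("dogs", "jumping", "researcher", "organization"):
    print(w, "->", crude_stem(w))  # dog, jump, research, organ
```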
Does Stemming Work?
Generally, yes! (in English)
- Helps more for longer queries and when there are fewer results
- But used very sparingly in web search (why?)
Lots of work done in this area:
- Donna Harman. (1991) How Effective is Suffixing? Journal of the American Society for Information Science, 42(1):7-15.
- Robert Krovetz. (1993) Viewing Morphology as an Inference Process. Proceedings of SIGIR 1993.
- David A. Hull. (1996) Stemming Algorithms: A Case Study for Detailed Evaluation. Journal of the American Society for Information Science, 47(1):70-84.
- And others…
Beyond Words…
Stemming and tokenization are specific instances of a general problem: what is the unit of indexing?
Other units of indexing:
- Concepts (e.g., from WordNet)
- Named entities
- Relations
- …
Recap
- Introduction to Information Retrieval
- Boolean retrieval
- Ranked retrieval: term weighting, the vector space model
- Advanced methods, things to think about
Next time: Deploying search engines