INFM 700: Session 7 - University of Maryland, College Park


INFM 700: Session 7
Search (Part I)
Introduction to Information Retrieval
Paul Jacobs
The iSchool
University of Maryland
Monday, November 9, 2009
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Goals for Search Sessions

- Understand the basic issues in information retrieval (searching primarily unstructured text)
- Know the techniques generally used by modern search engines
- Learn how search engines can be used most effectively in information architecture

Today's Topics

- Introduction to Information Retrieval
- Keywords, inverted indices, and Boolean retrieval
- The vector space model, ranked retrieval
- Major issues
- Some additional tricks
- Examples: web search and site search

Levels of Structure

- Different types of data
  - Structured data
  - Semi-structured data
  - Unstructured data
- How do you provide access to unstructured data?
  - Manually develop an organization system (add structure)
  - Provide search capabilities

What is search?

- Search is query-based access
  - How is this different from browsing?
- Things one can search on:
  - Content
  - Metadata
  - Organization systems
  - Labels
  - …

Some Key Concepts

- Different search paradigms
  - Boolean, "keyword"
  - "Natural language" or "free text" (full text) search
  - Current search engines are primarily full text and statistical
- The fundamental challenge: words & concepts
- The basic method: weighting and context
- Other tricks (there are many!)
  - Structuring
  - Popularity and importance (of pages, documents)
  - Metadata and thesauri
  - User feedback

The Central Problem in IR

[Diagram: authors express concepts in the documents they write; a searcher expresses concepts in a query. Do these represent the same concepts?]

Architecture of IR Systems

[Diagram: offline, documents pass through a representation function to produce document representations, which are stored in an index; online, the query passes through a representation function to produce a query representation, which a comparison function matches against the index to produce hits.]

How do we represent text?

- Remember: computers don't "understand" documents or queries
- Simple, yet effective approach: "bag of words"
  - Treat all the words in a document as index terms
  - Assign a "weight" to each term based on "importance"
  - Disregard order, structure, meaning, etc. of the words
- Assumptions
  - Term occurrence is independent (of other terms)
  - Document relevance is independent (of other documents)
  - "Words" can be defined

What's a word?

(Chinese) 天主教教宗若望保祿二世因感冒再度住進醫院。這是他今年第二度因同樣的病因住院。

(Arabic) وقال مارك ريجيف - الناطق باسم الخارجية الإسرائيلية - إن شارون قبل الدعوة وسيقوم للمرة الأولى بزيارة تونس، التي كانت لفترة طويلة المقر الرسمي لمنظمة التحرير الفلسطينية بعد خروجها من لبنان عام 1982.

(Russian) Выступая в Мещанском суде Москвы экс-глава ЮКОСа заявил не совершал ничего противозаконного, в чем обвиняет его генпрокуратура России.

(Hindi) भारत सरकार ने आर्थिक सर्वेक्षण में वित्तीय वर्ष 2005-06 में सात फ़ीसदी विकास दर हासिल करने का आकलन किया है और कर सुधार पर ज़ोर दिया है

(Japanese) 日米連合で台頭中国に対処…アーミテージ前副長官提言

(Korean) 조재영 기자= 서울시는 25일 이명박 시장이 '행정중심복합도시' 건설안에 대해 '군대라도 동원해 막고싶은 심정'이라고 말했다는 일부 언론의 보도를 부인했다.

Sample Document

McDonald's slims down spuds

Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.

But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.

But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.

Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.

"Bag of Words" view of this document:
14 × McDonald's
12 × fat
11 × fries
8 × new
6 × company, french, nutrition
5 × food, oil, percent, reduce, taste, Tuesday
…

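As a minimal sketch of how such a "bag of words" might be produced (the tokenizer and the short example text below are illustrative assumptions, not part of the original slide):

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on non-letter characters, and count term occurrences."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

doc = "McDonald's is cutting the fat in its french fries, the fast-food chain said Tuesday."
for term, count in bag_of_words(doc).most_common(5):
    print(term, count)
```
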
Why does "bag of words" work (at all)?

- Words alone tell us a lot about content!
  - "355 back correction Dow pulls signaling"
- Words are our main tool for describing concepts
- Words in context are especially powerful
  - "blind Venetian" vs. "Venetian blind"
- Getting beyond words is hard
- Structure usually (but not always) can be guessed from content

Boolean Retrieval

- Users express queries as a Boolean (logical) expression
  - "terms" (usually words or phrases) joined by AND, OR, NOT
  - Can be arbitrarily nested
  - Difference between "term" and "keyword"?
- Retrieval is based on the notion of sets
  - Any given query divides the collection into two sets: retrieved, not-retrieved (complement)
  - Pure Boolean systems do not define an ordering of the results (no ranking)

AND/OR/NOT

[Venn diagram: three overlapping sets A, B, C within the set of all documents, illustrating the regions selected by AND, OR, and NOT.]

Logic Tables

A OR B:
  A\B | 0 | 1
   0  | 0 | 1
   1  | 1 | 1

A AND B:
  A\B | 0 | 1
   0  | 0 | 0
   1  | 0 | 1

A NOT B (= A AND NOT B):
  A\B | 0 | 1
   0  | 0 | 0
   1  | 1 | 0

NOT B:
  B     | 0 | 1
  NOT B | 1 | 0

Representing Documents

Document 1: "The quick brown fox jumped over the lazy dog's back."
Document 2: "Now is the time for all good men to come to the aid of their party."

Stopword list (not indexed): for, is, of, the, to

Term   | Doc 1 | Doc 2
-------|-------|------
aid    |   0   |   1
all    |   0   |   1
back   |   1   |   0
brown  |   1   |   0
come   |   0   |   1
dog    |   1   |   0
fox    |   1   |   0
good   |   0   |   1
jump   |   1   |   0
lazy   |   1   |   0
men    |   0   |   1
now    |   0   |   1
over   |   1   |   0
party  |   0   |   1
quick  |   1   |   0
their  |   0   |   1
time   |   0   |   1

Boolean View of a Collection

[Term-document incidence matrix over eight documents (Doc 1 through Doc 8) for the terms aid, all, back, brown, come, dog, fox, good, jump, lazy, men, now, over, party, quick, their, time; each cell is 1 if the term occurs in the document and 0 otherwise.]

Each column represents the view of a particular document: What terms are contained in this document?

Each row represents the view of a particular term: What documents contain this term?

To execute a query, pick out the rows corresponding to the query terms and then apply the logic table of the corresponding Boolean operator.

Sample Queries

Term | Doc 1 | Doc 2 | Doc 3 | Doc 4 | Doc 5 | Doc 6 | Doc 7 | Doc 8
dog  |   0   |   0   |   1   |   0   |   1   |   0   |   0   |   0
fox  |   0   |   0   |   1   |   0   |   1   |   0   |   1   |   0

dog ∧ fox   → 0 0 1 0 1 0 0 0   dog AND fox → Doc 3, Doc 5
dog ∨ fox   → 0 0 1 0 1 0 1 0   dog OR fox → Doc 3, Doc 5, Doc 7
dog ∧ ¬fox  → 0 0 0 0 0 0 0 0   dog NOT fox → empty
fox ∧ ¬dog  → 0 0 0 0 0 0 1 0   fox NOT dog → Doc 7

Term  | Doc 1 | Doc 2 | Doc 3 | Doc 4 | Doc 5 | Doc 6 | Doc 7 | Doc 8
good  |   0   |   1   |   0   |   1   |   0   |   1   |   0   |   1
party |   0   |   0   |   0   |   0   |   0   |   1   |   0   |   1
over  |   1   |   0   |   1   |   0   |   1   |   0   |   1   |   1

good ∧ party        → 0 0 0 0 0 1 0 1   good AND party → Doc 6, Doc 8
good ∧ party ∧ ¬over → 0 0 0 0 0 1 0 0   good AND party NOT over → Doc 6

Inverted Index

[The term-document incidence matrix from the previous slide, stored term by term: for each term, keep only a postings list of the IDs of the documents that contain it.]

Term → Postings (document IDs), for example:
dog   → 3, 5
fox   → 3, 5, 7
good  → 2, 4, 6, 8
party → 6, 8
over  → 1, 3, 5, 7, 8

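A minimal sketch of building such an inverted index in Python (the toy documents and the helper name build_inverted_index are illustrative assumptions):

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted postings list of the doc IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z']+", text.lower()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "The quick brown fox jumped over the lazy dog's back.",
    2: "Now is the time for all good men to come to the aid of their party.",
}
index = build_inverted_index(docs)
print(index["fox"])    # [1]
print(index["party"])  # [2]
```
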
Boolean Retrieval

- To execute a Boolean query, e.g. (fox OR dog) AND quick:
  - Build the query syntax tree: AND( OR(fox, dog), quick )
  - For each clause, look up its postings: dog → 3, 5; fox → 3, 5, 7
  - Traverse the postings and apply the Boolean operator: OR = union, so fox OR dog → 3, 5, 7; then intersect the result with quick's postings for the AND
- Efficiency analysis
  - Postings traversal is linear (assuming sorted postings)
  - Start with the shortest postings list first

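A minimal sketch of the postings-merge step for this example query (the dog and fox postings are the ones shown on the slide; quick's postings are an assumed placeholder):

```python
def union(p1, p2):
    """OR: merge two sorted postings lists."""
    return sorted(set(p1) | set(p2))

def intersect(p1, p2):
    """AND: linear merge of two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

dog = [3, 5]
fox = [3, 5, 7]
quick = [1, 3]  # assumed postings, for illustration only

# (fox OR dog) AND quick
print(intersect(union(fox, dog), quick))  # -> [3]
```
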
Why Boolean Retrieval Works

- Boolean operators approximate concepts
- How so?
  - AND can identify relationships between concepts (e.g., interest rate, web design)
  - OR can identify alternate terminology (e.g., interest percentage, HTML layout, etc.)
  - NOT can filter alternate meanings (e.g., conflict AND interest AND NOT rate; NOT spider)

Why Boolean Retrieval Fails

- It's really hard to come up with the "right" queries
- Casual searchers have difficulty with the logic
- Some concepts are just hard to express, e.g. "corporate mergers & acquisitions" – IBM acquired Lotus
- Relevance is not absolute; some documents are more relevant, or more helpful, than others

Ranked Retrieval in the Vector Space Model

- Order documents by how likely they are to be relevant to the information need
  - Estimate relevance(q, di)
  - Sort documents by relevance
  - Display sorted results, usually one screen at a time
- How do we estimate relevance?
  - Assume that document d is relevant to query q if they share terms in common
  - Replace relevance(q, di) with sim(q, di) (similarity)
  - Compute similarity of vector representations

Vector Representation

- "Bags of words" can be represented as vectors
  - Why? Computational efficiency, ease of manipulation
  - Geometric metaphor: "arrows"
- A vector is a set of values recorded in any consistent order

"The quick brown fox jumped over the lazy dog's back"
→ [1 1 1 1 1 1 1 1 2]

1st position corresponds to "back"
2nd position corresponds to "brown"
3rd position corresponds to "dog"
4th position corresponds to "fox"
5th position corresponds to "jump"
6th position corresponds to "lazy"
7th position corresponds to "over"
8th position corresponds to "quick"
9th position corresponds to "the"

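A minimal sketch of turning a sentence into such a term-count vector (the fixed vocabulary and the crude "-ed" stripping that maps "jumped" to "jump" are assumptions made only to match the slide's example):

```python
import re
from collections import Counter

VOCAB = ["back", "brown", "dog", "fox", "jump", "lazy", "over", "quick", "the"]

def to_vector(text, vocab=VOCAB):
    """Count term occurrences and record them in a fixed vocabulary order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # crude normalization so "jumped" matches the slide's term "jump"
    tokens = [t[:-2] if t.endswith("ed") else t for t in tokens]
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

print(to_vector("The quick brown fox jumped over the lazy dog's back"))
# -> [1, 1, 1, 1, 1, 1, 1, 1, 2]
```
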
Vector Space Model

[Diagram: documents d1–d5 plotted in a space whose axes are terms t1, t2, t3; the angles θ and φ between vectors indicate how similar the documents are.]

Assumption: Documents that are "close together" in vector space "talk about" the same things.
Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ "closeness").

Similarity Metric

- How about |d1 – d2|?
- Instead of Euclidean distance, use the "angle" between the vectors
- It all boils down to the inner product (dot product) of vectors

\cos\theta = \frac{\vec{d}_j \cdot \vec{d}_k}{\lvert\vec{d}_j\rvert\,\lvert\vec{d}_k\rvert}

\mathrm{sim}(d_j, d_k) = \frac{\vec{d}_j \cdot \vec{d}_k}{\lvert\vec{d}_j\rvert\,\lvert\vec{d}_k\rvert}
  = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}\;\sqrt{\sum_{i=1}^{n} w_{i,k}^2}}

Components of Similarity

- The "inner product" (aka dot product) is the key to the similarity function

\vec{d}_j \cdot \vec{d}_k = \sum_{i=1}^{n} w_{i,j}\, w_{i,k}

Example: [1 2 3 0 2] · [2 0 1 0 2] = 1×2 + 2×0 + 3×1 + 0×0 + 2×2 = 9

- The denominator handles document length normalization

\lvert\vec{d}_j\rvert = \sqrt{\sum_{i=1}^{n} w_{i,j}^2}

Example: |[1 2 3 0 2]| = √(1 + 4 + 9 + 0 + 4) = √18 ≈ 4.24

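A minimal sketch of the cosine similarity computation on the slide's example vectors (pure Python, no external libraries assumed):

```python
import math

def cosine_similarity(dj, dk):
    """sim(dj, dk) = (dj · dk) / (|dj| |dk|), assuming equal-length weight vectors."""
    dot = sum(wj * wk for wj, wk in zip(dj, dk))
    norm_j = math.sqrt(sum(w * w for w in dj))
    norm_k = math.sqrt(sum(w * w for w in dk))
    return dot / (norm_j * norm_k)

dj = [1, 2, 3, 0, 2]
dk = [2, 0, 1, 0, 2]
print(cosine_similarity(dj, dk))  # dot = 9, |dj| ≈ 4.24, |dk| = 3, so ≈ 0.707
```
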
Term Weighting

- Term weights consist of two components
  - Local: how important is the term in this doc?
  - Global: how important is the term in the collection?
- Here's the intuition:
  - Terms that appear often in a document should get high weights
  - Terms that appear in many documents should get low weights
- How do we capture this mathematically?
  - Term frequency (local)
  - Inverse document frequency (global)

TF.IDF Term Weighting

w_{i,j} = \mathrm{tf}_{i,j} \times \log\frac{N}{n_i}

where:
  w_{i,j}    weight assigned to term i in document j
  tf_{i,j}   number of occurrences of term i in document j
  N          number of documents in the entire collection
  n_i        number of documents containing term i

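A minimal sketch of this weighting formula in Python (a base-10 logarithm is assumed here so the numbers line up with the worked example on the next slide):

```python
import math

def tfidf_weight(tf, df, num_docs):
    """w = tf * log10(N / n_i): term frequency scaled by inverse document frequency."""
    return tf * math.log10(num_docs / df)

# term "nuclear": tf = 3 in document 1, and it appears in 2 of the 4 documents
print(tfidf_weight(tf=3, df=2, num_docs=4))  # 3 * log10(2) ≈ 0.903
```
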
TF.IDF Example

Term frequencies (tf) in a collection of N = 4 documents, with idf = log10(N / n_i):

Term         | Doc 1 | Doc 2 | Doc 3 | Doc 4 | n_i | idf
-------------|-------|-------|-------|-------|-----|------
complicated  |       |       |   5   |   2   |  2  | 0.301
contaminated |   4   |   1   |   3   |       |  3  | 0.125
fallout      |   5   |       |   4   |   3   |  3  | 0.125
information  |   6   |   3   |   3   |   2   |  4  | 0.000
interesting  |       |   1   |       |       |  1  | 0.602
nuclear      |   3   |       |   7   |       |  2  | 0.301
retrieval    |       |   6   |   1   |   4   |  3  | 0.125
siberia      |   2   |       |       |       |  1  | 0.602

Equivalently, as an inverted index with (doc, tf) postings:
complicated  → (3,5) (4,2)
contaminated → (1,4) (2,1) (3,3)
fallout      → (1,5) (3,4) (4,3)
information  → (1,6) (2,3) (3,3) (4,2)
interesting  → (2,1)
nuclear      → (1,3) (3,7)
retrieval    → (2,6) (3,1) (4,4)
siberia      → (1,2)

Document Scoring Algorithm

- Initialize accumulators to hold document scores
- For each query term t in the user's query
  - Fetch t's postings
  - For each document in the postings, score_doc += w_{t,d} × w_{t,q}
- Apply length normalization to the scores at the end
- Return the top N documents

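A minimal sketch of this term-at-a-time scoring loop (the index layout, a dict from term to (doc, weight) postings, and the toy query weights and norms are assumptions for illustration):

```python
from collections import defaultdict

def score_documents(query_weights, index, doc_norms, top_n=10):
    """Term-at-a-time scoring with accumulators, then length normalization."""
    scores = defaultdict(float)
    for term, w_tq in query_weights.items():
        for doc_id, w_td in index.get(term, []):   # fetch t's postings
            scores[doc_id] += w_td * w_tq          # accumulate partial scores
    for doc_id in scores:
        scores[doc_id] /= doc_norms[doc_id]        # length normalization
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_n]

# toy example: postings store (doc_id, tf-idf weight); norms are made up
index = {"nuclear": [(1, 0.903), (3, 2.107)],
         "fallout": [(1, 0.625), (3, 0.500), (4, 0.375)]}
doc_norms = {1: 1.5, 3: 2.2, 4: 1.1}
print(score_documents({"nuclear": 1.0, "fallout": 1.0}, index, doc_norms))
```
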
Summary thus far…

- Represent documents (and queries) as "bags of words" (terms)
- Derive term weights based on frequency
- Use weighted term vectors for each document and query
- Compute a vector-based similarity score
- Display sorted, ranked results

Issues and Tricks

- What's a word/term?
  - We can ignore words ("stop words"), combine words (phrases), or split up ("stem") words
  - Other special treatment (e.g. names, categories)
- Query formulation/suggestion
- Type of information need
- Popularity
  - Based on link analysis/PageRank
  - Based on click-through and other signals
- Structuring and tagging (e.g., "best bets")

Issues and Tricks (cont'd)

- Thesaurus/query expansion
  - Based on meaning, conceptual relationships
  - Based on decomposition/type
- User feedback/"More like this"
- Clustering/grouping of results

Morphological Variation

- Handling morphology: related concepts have different forms
  - Inflectional morphology: same part of speech
    dogs = dog + PLURAL
    broke = break + PAST
  - Derivational morphology: different parts of speech
    destruction = destroy + ion
    researcher = research + er
- Different morphological processes:
  - Prefixing
  - Suffixing
  - Infixing
  - Reduplication

Stemming

- Dealing with morphological variation: index stems instead of words
  - Stem: a word equivalence class that preserves the central concept
- How much to stem?
  - organization → organize → organ?
  - resubmission → resubmit/submission → submit?
  - reconstructionism?

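As a quick illustration of what a stemmer does to these examples, here is a minimal sketch using NLTK's Porter stemmer (assuming the nltk package is installed; the exact output forms depend on the stemmer chosen):

```python
from nltk.stem.porter import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for word in ["organization", "organize", "organ",
             "resubmission", "submission", "submit"]:
    # Porter stemming is purely rule-based suffix stripping; it can conflate
    # distinct concepts (e.g., "organization" and "organ" may share a stem)
    print(word, "->", stemmer.stem(word))
```
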
Does Stemming Work?

- Generally, yes! (in English)
  - Helps more for longer queries, fewer results
  - Lots of work done in this area
  - But used very sparingly in web search – why?

Donna Harman. (1991) How Effective is Suffixing? Journal of the American Society for Information Science, 42(1):7-15.
Robert Krovetz. (1993) Viewing Morphology as an Inference Process. Proceedings of SIGIR 1993.
David A. Hull. (1996) Stemming Algorithms: A Case Study for Detailed Evaluation. Journal of the American Society for Information Science, 47(1):70-84.
And others…

Beyond Words…

- Stemming/tokenization = specific instances of a more general problem: what should the unit of indexing be?
- Other units of indexing
  - Concepts (e.g., from WordNet)
  - Named entities
  - Relations
  - …

Recap

- Introduction to Information Retrieval
- Boolean retrieval
- Ranked retrieval – term weighting, the vector space model
- Advanced methods, things to think about
- Next time: Deploying search engines