Transcript Document

Digital Libraries: Steps toward
information finding
Many slides in this presentation are from
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze,
Introduction to Information Retrieval, Cambridge University Press.
2008.
Book available online at http://nlp.stanford.edu/IR-book/information-retrieval-book.html
A brief introduction to Information
Retrieval
• Resource: Christopher D. Manning,
Prabhakar Raghavan and Hinrich Schütze,
Introduction to Information Retrieval,
Cambridge University Press. 2008.
• The entire book is available online, free, at
http://nlp.stanford.edu/IR-book/information-retrieval-book.html
• I will use some of the slides that they
provide to go with the book.
Author’s definition
• Information Retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers).
• Note the use of the word “usually.” We have
seen DL examples where the material is not
documents, and not text.
Examples and Scaling
• IR is about finding a needle in a haystack –
finding some particular thing in a very large
collection of similar things.
• Our examples are necessarily small so that we can comprehend them. Do remember that everything we say must scale to very large quantities.
Just how much information?
• Libraries are about access to information.
– What sense do you have about information
quantity?
– How fast is it growing?
– Are there implications of that quantity and its rate of increase?
How much information is there?
Soon most everything will be recorded and indexed.
Most bytes will never be seen by humans.
Data summarization, trend detection, and anomaly detection are key technologies.
These require algorithms, data and knowledge representation, and knowledge of the domain.
[Figure: a logarithmic scale of data volumes labeled Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta, placing a book, a photo, a movie, all books (words), all books multimedia, and "everything recorded!" at increasing points along the scale. Slide source: Jim Gray, Microsoft Research (modified).]
See Mike Lesk, "How much information is there?": http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian, "How much information": http://www.sims.berkeley.edu/research/projects/how-much-info/
Where does the information come
from?
• Many sources
– Corporations
– Individuals
– Interest groups
– News organizations
• Accumulated through crawling
Once we have a collection
• How will we ever find the needle in the
haystack? The one bit of information needed?
• After crawling, or other resource acquisition
step, we need to create a way to query the
information we have
– Next step: Index
• Example content: Shakespeare’s plays
Searching Shakespeare
• Which plays of Shakespeare contain the words
Brutus AND Caesar but NOT Calpurnia?
– See http://www.rhymezone.com/shakespeare/
• One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia.
• Why is that not the answer?
– Slow (for large corpora)
– NOT Calpurnia is non-trivial
– Other operations (e.g., find the word Romans near
countrymen) not feasible
– Ranked retrieval (best documents to return)
Term-document incidence
Brutus AND Caesar BUT NOT
Calpurnia
First approach – make a matrix with
terms on one axis and plays on the other
(rows: all the terms; columns: all the plays)

            Antony and  Julius   The      Hamlet  Othello  Macbeth
            Cleopatra   Caesar   Tempest
Antony          1          1        0        0       0        1
Brutus          1          1        0        1       0        0
Caesar          1          1        0        1       1        1
Calpurnia       0          1        0        0       0        0
Cleopatra       1          0        0        0       0        0
mercy           1          0        1        1       1        1
worser          1          0        1        1       1        0

1 if the play contains the word, 0 otherwise.
Incidence Vectors
• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented), then bitwise AND them.
• 110100 AND 110111 AND 101111 = 100100.
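As an illustration (my own sketch, not from the book's slides), the same Boolean query can be evaluated over the toy incidence matrix above by treating each term's row as a bit vector:

plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# 0/1 incidence vectors from the matrix above, one bit per play
# (leftmost bit = Antony and Cleopatra).
incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

n = len(plays)
mask = (1 << n) - 1  # keeps the complement within six bits

# Brutus AND Caesar AND NOT Calpurnia
result = incidence["Brutus"] & incidence["Caesar"] & (~incidence["Calpurnia"] & mask)

answer = [plays[i] for i in range(n) if result & (1 << (n - 1 - i))]
print(format(result, "06b"), answer)  # 100100 ['Antony and Cleopatra', 'Hamlet']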
Answer to query
• Antony and Cleopatra, Act III,
Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why,
Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
• Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Try another one
• What is the vector for the query
– Antony and mercy
• What would we do to find Antony OR mercy?
Basic assumptions about information
retrieval
• Collection: Fixed set of documents
• Goal: Retrieve documents with information
that is relevant to the user’s information
need and helps the user complete a task
The classic search model
• TASK: ultimately, there is some task to perform.
• Info need: some information is required in order to perform the task.
• Verbal form: the information need must be expressed in words (usually).
• Query: the information need must be expressed in the form of a query that can be processed.
• The query goes to the SEARCH ENGINE, which runs it against the Corpus and returns Results.
• Query refinement: it may be necessary to rephrase the query and try again.
The classic search model: potential pitfalls between task and query results
• TASK: get rid of mice in a politically correct way.
  – Misconception?
• Info need: info about removing mice without killing them.
  – Mistranslation?
• Verbal form: how do I trap mice alive?
  – Misformulation?
• Query: mouse trap
• The query goes to the SEARCH ENGINE, Results come back from the Corpus, and query refinement may follow.
How good are the results?
• Precision: What fraction of the retrieved results are relevant to the information need?
• Recall: What fraction of the available correct
results were retrieved?
• These are the basic concepts of information
retrieval evaluation.
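A minimal sketch (my own illustration, with made-up document IDs) of how these two measures are computed from sets of retrieved and relevant documents:

retrieved = {1, 2, 5, 8, 9, 13, 17, 21, 24, 30}   # what the system returned
relevant  = {2, 5, 9, 13, 21, 40, 41, 42}          # what actually meets the information need

hits = retrieved & relevant
precision = len(hits) / len(retrieved)   # fraction of retrieved results that are relevant
recall    = len(hits) / len(relevant)    # fraction of relevant documents that were retrieved
print(precision, recall)                 # 0.5 0.625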
Stop and think
• If you had to choose between precision and
recall, which would you choose?
• Give an example when each would be
preferable. Everyone, provide an example of
each. Everyone, comment on an example of
each provided by someone else.
– Ideally, do this in real time. However, you may
need to come back to do your comments after
others have finished.
Discuss
• What was the best example of precision as
being more important?
• What was the best example of recall being
more important?
• Try to come to consensus in your discussion.
Size considerations
• Consider N = 1 million documents, each with
about 1000 words.
• Avg 6 bytes/word including
spaces/punctuation
– 6GB of data in the documents.
• Say there are M = 500K distinct terms among
these.
The matrix does not work
• 500K x 1M matrix has half-a-trillion
0’s and 1’s.
• But it has no more than one billion
1’s.
– matrix is extremely sparse.
• What’s a better representation?
– We only record the 1 positions. (Why?)
– i.e., we don’t need to know which documents do not have a term, only those that do.
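A quick back-of-the-envelope check (my own sketch) of the numbers above makes the sparsity obvious:

N_docs        = 1_000_000   # documents
M_terms       = 500_000     # distinct terms
words_per_doc = 1_000       # average words per document

cells    = M_terms * N_docs          # 500,000,000,000 cells (half a trillion)
max_ones = N_docs * words_per_doc    # at most 1,000,000,000 ones

print(cells, max_ones, max_ones / cells)   # 500000000000 1000000000 0.002
# At most about 0.2% of the matrix can ever be 1 -- record only those positions.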
Inverted index
• For each term t, we must store a list of all
documents that contain t.
– Identify each by a docID, a document serial
number
• Can we use fixed-size arrays for this?
Brutus    → 1, 2, 4, 11, 31, 45, 173, 174
Caesar    → 1, 2, 4, 5, 6, 16, 57, 132
Calpurnia → 2, 31, 54, 101
What happens if we add document 14, which contains “Caesar.”
Inverted index
• We need variable-size postings lists.
  – On disk, a continuous run of postings is normal and best.
  – In memory, we can use linked lists or variable-length arrays.
  – There are some tradeoffs in size/ease of insertion.

Dictionary → Postings (sorted lists of docIDs):
Brutus    → 1, 2, 4, 11, 31, 45, 173, 174
Caesar    → 1, 2, 4, 5, 6, 16, 57, 132
Calpurnia → 2, 31, 54, 101
Sorted by docID (more later on why).
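In memory, a minimal sketch of this structure (my own illustration, not the book's implementation) is a mapping from each term to a sorted, variable-length list of docIDs:

import bisect

# term -> sorted list of docIDs (postings), using the example above
index = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

# Adding document 14, which contains "Caesar": with a variable-length list
# this is just an insertion that keeps the postings sorted by docID.
bisect.insort(index["Caesar"], 14)
print(index["Caesar"])   # [1, 2, 4, 5, 6, 14, 16, 57, 132]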
Inverted index construction
• Documents to be indexed: "Friends, Romans, countrymen."
• Tokenizer → token stream: Friends, Romans, Countrymen
• Linguistic modules (stop words, stemming, capitalization, cases, etc.) → modified tokens: friend, roman, countryman
• Indexer → inverted index:
    friend     → 2, 4
    roman      → 1, 2
    countryman → 13, 16
Stop and think
• Look at the previous slide.
• Describe exactly what happened at each
stage.
– What does tokenization do?
– What did the linguistic modules do?
• Why would we want these transformations?
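As one concrete illustration (my own, with a deliberately naive suffix-stripping rule standing in for a real stemmer and an empty stop list), the tokenizer and linguistic modules might look like this:

import re

def tokenize(text):
    # Tokenizer: split the character stream into a token stream.
    return re.findall(r"[A-Za-z']+", text)

def normalize(tokens, stop_words=frozenset()):
    # Linguistic modules (sketched): case folding, stop-word removal,
    # and a toy "stemmer" that only strips a trailing plural 's'.
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in stop_words:
            continue
        if tok.endswith("s") and len(tok) > 3:
            tok = tok[:-1]
        out.append(tok)
    return out

print(normalize(tokenize("Friends, Romans, countrymen.")))
# ['friend', 'roman', 'countrymen'] -- a real stemmer/lemmatizer would also
# map 'countrymen' to 'countryman'; this toy rule does not.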
Indexer steps: Token
sequence
• Sequence of (Modified token, Document ID) pairs.
Doc 1
I did enact Julius
Caesar I was killed
i' the Capitol;
Brutus killed me.
Doc 2
So let it be with
Caesar. The noble
Brutus hath told you
Caesar was ambitious
[The slide's table of (token, docID) pairs is not reproduced here.]
Note that the words in the table are exactly the same as in the two documents, and in the same order.
Indexer steps: Sort
• Sort by terms
  – and then by docID
• This is the core indexing step.
Indexer steps: Dictionary & Postings
• Multiple term entries in a single document are merged.
• Split into Dictionary and Postings.
• Document frequency information (the number of documents in which the term appears) is added.
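A compact sketch (my own, not the book's code) of these three indexer steps, assuming the documents have already been tokenized and normalized as above: collect (term, docID) pairs, sort them, then merge duplicates into a dictionary with document frequencies and postings lists.

from collections import defaultdict

def build_index(docs):
    """docs: dict mapping docID -> list of normalized tokens."""
    # Step 1: token sequence -- (term, docID) pairs.
    pairs = [(term, doc_id) for doc_id, terms in docs.items() for term in terms]
    # Step 2: sort by term, then by docID (the core indexing step).
    pairs.sort()
    # Step 3: merge duplicate entries, split into dictionary and postings,
    # and record document frequency for each term.
    postings = defaultdict(list)
    for term, doc_id in pairs:
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, dict(postings)

docs = {
    1: "i did enact julius caesar i was killed i the capitol brutus killed me".split(),
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious".split(),
}
dictionary, postings = build_index(docs)
print(dictionary["caesar"], postings["caesar"])   # 2 [1, 2]
print(dictionary["brutus"], postings["brutus"])   # 2 [1, 2]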
Spot check
(See the separate assignment for this in Blackboard.)
• Complete the indexing for the following two
“documents.”
– Of course, the examples have to be very small to be manageable. Imagine that you are indexing the entire news stories, not just these titles.
– Construct the charts as seen on the previous slide
– Put your solution in the Blackboard discussion board “Indexing – Spot Check 1.” You will find it on the content homepage.
Document 1:
Pearson and Google Jump
Into Learning
Management With a
New, Free System
Document 2:
Pearson adds free learning
management tools to Google
Apps for Education
How do we process a query?
• Using the index we just built, examine the
terms in some order, looking for the terms in
the query.
Query processing: AND
• Consider processing the query:
Brutus AND Caesar
– Locate Brutus in the Dictionary;
• Retrieve its postings.
– Locate Caesar in the Dictionary;
• Retrieve its postings.
– “Merge” the two postings:
Brutus → 2, 4, 8, 16, 32, 64, 128
Caesar → 1, 2, 3, 5, 8, 13, 21, 34
The merge
• Walk through the two postings
simultaneously, in time linear in the total
number of postings entries
Brutus       → 2, 4, 8, 16, 32, 64, 128
Caesar       → 1, 2, 3, 5, 8, 13, 21, 34
Intersection → 2, 8
If the list lengths are x and y, the merge takes O(x+y) operations.
What does that mean?
Crucial: postings sorted by docID.
Spot Check
• Let’s assume that
– the term mercy appears in documents 1, 2, 13, 18, 24, 35, 54
– the term noble appears in documents 1, 5, 7, 13,
22, 24, 56
• Show the document lists, then step through
the merge algorithm to obtain the search
results.
Stop and try it
• Make up a new pair of postings lists (like the
ones we just saw).
• Post it on the discussion board.
• Take a pair of postings lists that someone else
posted and walk through the merge process.
Post a comment on the posting saying how
many comparisons you had to do to complete
the merge.
Intersecting two postings lists
(a “merge” algorithm)
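The slide presents this merge as pseudocode; here is one way to render the two-pointer intersection in Python (my sketch, so treat the details as illustrative):

def intersect(p1, p2):
    # Walk the two postings lists simultaneously; both must be sorted by docID.
    # Takes O(x + y) operations for lists of length x and y.
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]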
Boolean queries: Exact match
• The Boolean retrieval model lets us pose any query that is a Boolean expression:
– Boolean Queries are queries using AND, OR and
NOT to join query terms
• Views each document as a set of words
• Is precise: document matches condition or not.
– Perhaps the simplest model to build
• Primary commercial retrieval tool for 3
decades.
• Many search systems you still use are Boolean:
– Email, library catalog, Mac OS X Spotlight
Query optimization
• Consider a query that is an AND of n terms, n > 2.
• For each of the terms, get its postings list, then AND them together.
• Example query: BRUTUS AND CALPURNIA AND CAESAR
• What is the best order for processing this query?
Query optimization
• Example query: BRUTUS AND CALPURNIA AND CAESAR
• Simple and effective optimization: process terms in order of increasing frequency.
• Start with the shortest postings list, then keep cutting further.
• In this example: first CAESAR, then CALPURNIA, then BRUTUS.
Optimized intersection algorithm for
conjunctive queries
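A sketch of that optimized conjunctive processing (my own, reusing the intersect function above; the postings below are hypothetical, chosen so that CAESAR is the rarest term and the processing order matches the slide):

def intersect_query(terms, index):
    # AND query over an inverted index (term -> sorted postings list).
    # Optimization: process terms in order of increasing postings-list length.
    postings = sorted((index[t] for t in terms), key=len)
    result = postings[0]
    for plist in postings[1:]:
        result = intersect(result, plist)   # two-pointer merge from above
        if not result:                      # can stop as soon as the result is empty
            break
    return result

# Hypothetical postings, with CAESAR the least frequent term.
index = {
    "BRUTUS":    [1, 2, 4, 11, 31, 45, 173, 174],
    "CALPURNIA": [2, 31, 54, 101],
    "CAESAR":    [5, 31],
}
print(intersect_query(["BRUTUS", "CALPURNIA", "CAESAR"], index))   # [31]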
More general optimization
• Example query: (MADDING OR CROWD) AND (IGNOBLE OR STRIFE)
• Get frequencies for all terms.
• Estimate the size of each OR by the sum of its terms’ frequencies (conservative).
• Process in increasing order of OR sizes.
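A small sketch (my own, with made-up document frequencies) of that estimate: sum the frequencies of the terms inside each OR as an upper bound on its size, then order the ORs accordingly.

# Hypothetical document frequencies for the query terms.
doc_freq = {"MADDING": 10, "CROWD": 300, "IGNOBLE": 5, "STRIFE": 50}

query = [("MADDING", "CROWD"), ("IGNOBLE", "STRIFE")]   # the two ANDed disjunctions

def or_size_estimate(disjunction):
    # The OR of several postings lists can never be longer than the sum of
    # their lengths, so the sum is a conservative (upper-bound) estimate.
    return sum(doc_freq[t] for t in disjunction)

order = sorted(query, key=or_size_estimate)
print(order)   # [('IGNOBLE', 'STRIFE'), ('MADDING', 'CROWD')] -- smaller OR first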
Scaling
• These basic techniques are pretty simple
• There are challenges
– Scaling
• as everything becomes digitized, how well do the
processes scale?
– Intelligent information extraction
• I want information, not just a link to a place that might
have that information.
Problem with Boolean search:
feast or famine
• Boolean queries often result in either too
few (=0) or too many (1000s) results.
• A query that is too broad yields hundreds
of thousands of hits
• A query that is too narrow may yield no hits
• It takes a lot of skill to come up with a
query that produces a manageable number
of hits.
– AND gives too few; OR gives too many
Ranked retrieval models
• Rather than a set of documents satisfying a
query expression, in ranked retrieval
models, the system returns an ordering
over the (top) documents in the collection
with respect to a query
• Free text queries: Rather than a query
language of operators and expressions, the
user’s query is just one or more words in a
human language
• In principle, these are different options, but
in practice, ranked retrieval models have
normally been associated with free text
queries and vice versa
Feast or famine: not a problem in
ranked retrieval
• When a system produces a ranked result
set, large result sets are not an issue
– Indeed, the size of the result set is not an issue
– We just show the top k ( ≈ 10) results
– We don’t overwhelm the user
– Premise: the ranking algorithm works
Scoring as the basis of ranked retrieval
• We wish to return, in order, the documents
most likely to be useful to the searcher
• How can we rank-order the documents in
the collection with respect to a query?
• Assign a score – say in [0, 1] – to each
document
• This score measures how well document
and query “match”.
Query-document matching scores
• We need a way of assigning a score to a
query/document pair
• Let’s start with a one-term query
• If the query term does not occur in the
document: score should be 0
• The more frequent the query term in the
document, the higher the score (should be)
• We will look at a number of alternatives for
this.
Take 1: Jaccard coefficient
• A commonly used measure of overlap of
two sets A and B
– jaccard(A,B) = |A ∩ B| / |A ∪ B|
– jaccard(A,A) = 1
– jaccard(A,B) = 0 if A ∩ B = ∅
• A and B don’t have to be the same size.
• Always assigns a number between 0 and 1.
Jaccard coefficient: Scoring example
• What is the query-document match score
that the Jaccard coefficient computes for
each of the two “documents” below?
• Query: ides of march
• Document 1: caesar died in march
• Document 2: the long march
Jaccard Example done
• Query: ides of march
• Document 1: caesar died in march
• Document 2: the long march
• A = {ides, of, march}
• B1 = {caesar, died, in, march}
  – A ∩ B1 = {march}
  – A ∪ B1 = {caesar, died, ides, in, of, march}
  – |A ∩ B1| / |A ∪ B1| = 1/6
• B2 = {the, long, march}
  – A ∩ B2 = {march}
  – A ∪ B2 = {ides, long, march, of, the}
  – |A ∩ B2| / |A ∪ B2| = 1/5
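The same computation as a small sketch (my own) over token sets:

def jaccard(a, b):
    # Jaccard coefficient of two sets: |A ∩ B| / |A ∪ B|.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

query = "ides of march".split()
doc1  = "caesar died in march".split()
doc2  = "the long march".split()

print(jaccard(query, doc1))   # 1/6 ≈ 0.167
print(jaccard(query, doc2))   # 1/5 = 0.2  -- the shorter document "wins"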
Issues with Jaccard for scoring
• It doesn’t consider term frequency (how
many times a term occurs in a document)
– Rare terms in a collection are more informative
than frequent terms. Jaccard doesn’t consider
this information
• We need a more sophisticated way of
normalizing for length
The problem with the first example was that document 2
“won” because it was shorter, not because it was a better
match. We need a way to take into account document length
so that longer documents are not penalized in calculating the
match score.
Next week
• We will have time for questions and discussion
of this material, but I will be looking to see
how much you have handled on your own.
• You get credit for posting a good question
• You get credit for a reasonable answer to a
good question
• You get credit for pointing out problems or
expanding on answers to questions.
References
• Marcos André Gonçalves, Edward A. Fox, Layne T. Watson, and Neill A. Kipp. 2004. Streams, structures, spaces, scenarios, societies (5S): A formal model for digital libraries. ACM Trans. Inf. Syst. 22, 2 (April 2004), 270-312. DOI=10.1145/984321.984325
  – http://doi.acm.org/10.1145/984321.984325
  – Let me know if you would like a copy of the paper.
• Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
  – Book available online at http://nlp.stanford.edu/IR-book/information-retrieval-book.html
  – Many of these slides are taken directly from the authors’ slides from the first chapter of the book.