Information Retrieval using the Boolean Model

Query

Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

Could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?

Slow (for large corpora).
NOT Calpurnia is non-trivial.
Other operations (e.g., find the phrase Romans and countrymen) are not feasible.

Term-document incidence

            Antony and  Julius   The
            Cleopatra   Caesar   Tempest  Hamlet  Othello  Macbeth
Antony          1          1        0        0       0        1
Brutus          1          1        0        1       0        0
Caesar          1          1        0        1       1        1
Calpurnia       0          1        0        0       0        0
Cleopatra       1          0        0        0       0        0
mercy           1          0        1        1       1        1
worser          1          0        1        1       1        0

1 if play contains word, 0 otherwise.

Incidence vectors

So we have a 0/1 vector for each term.

To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented), then bitwise AND them:

110100 AND 110111 AND 101111 = 100100
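As a minimal sketch (assuming the toy incidence matrix above, with the plays in the same column order), the query could be evaluated like this in Python:

    # Toy term-document incidence: one 0/1 vector per term.
    # Columns: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth.
    plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
             "Hamlet", "Othello", "Macbeth"]
    incidence = {
        "Brutus":    [1, 1, 0, 1, 0, 0],
        "Caesar":    [1, 1, 0, 1, 1, 1],
        "Calpurnia": [0, 1, 0, 0, 0, 0],
    }

    # Brutus AND Caesar AND NOT Calpurnia: complement Calpurnia, then AND position by position.
    result = [b & c & (1 - p) for b, c, p in zip(incidence["Brutus"],
                                                 incidence["Caesar"],
                                                 incidence["Calpurnia"])]
    print(result)                                        # [1, 0, 0, 1, 0, 0]
    print([plays[i] for i, bit in enumerate(result) if bit])
    # ['Antony and Cleopatra', 'Hamlet']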

Answers to query

    

Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

Bigger document collections

Consider N = 1 million documents, each with about 1K terms.

Average 6 bytes/term (including spaces and punctuation), so about 6GB of data in the documents.

Say there are M = 500K distinct terms among these.

Can’t build the matrix

A 500K x 1M matrix has half a trillion 0’s and 1’s.

But it has no more than one billion 1’s: with 1M documents of about 1K terms each, there are at most 10^9 (term, document) occurrences in total.

So the matrix is extremely sparse.

What’s a better representation? We only record the 1 positions.

Inverted index

  For each term T: store a list of all documents that contain T.

Do we use an array or a list for this?

Brutus    → 2  4  8  16  32  64  128
Calpurnia → 1  2  3  5  8  13  21  34
Caesar    → 13 16

What happens if the word Caesar is added to document 14?

Inverted index

Linked lists are generally preferred to arrays: they allow dynamic space allocation and easy insertion of new postings when a term appears in another document, at the cost of the space overhead of pointers.

Dictionary (terms) and Postings (docID lists), sorted by docID (more later on why):

Brutus    → 2  4  8  16  32  64  128
Calpurnia → 1  2  3  5  8  13  21  34
Caesar    → 13 16
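A minimal in-memory sketch of this dictionary-plus-postings structure (Python lists standing in for the linked lists), including the insertion of Caesar into document 14 mentioned above:

    import bisect

    index = {
        "Brutus":    [2, 4, 8, 16, 32, 64, 128],
        "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
        "Caesar":    [13, 16],
    }

    def add_posting(index, term, doc_id):
        """Insert doc_id into term's postings, keeping the list sorted by docID."""
        postings = index.setdefault(term, [])
        pos = bisect.bisect_left(postings, doc_id)
        if pos == len(postings) or postings[pos] != doc_id:
            postings.insert(pos, doc_id)

    # The word Caesar is added to document 14:
    add_posting(index, "Caesar", 14)
    print(index["Caesar"])   # [13, 14, 16]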

Inverted index construction

Documents to be indexed: "Friends, Romans, countrymen."

Tokenizer → token stream: Friends  Romans  Countrymen
Linguistic modules → modified tokens (more on these later): friend  roman  countryman
Indexer → inverted index:

friend     → 2  4
roman      → 1  2
countryman → 13 16
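A minimal sketch of this pipeline, assuming a simple regex tokenizer and a toy normalization step standing in for the linguistic modules (which are covered later); the second example document and the strip-a-trailing-s rule are illustrative only:

    import re
    from collections import defaultdict

    def tokenize(text):
        """Tokenizer: split raw text into a token stream."""
        return re.findall(r"[A-Za-z']+", text)

    def normalize(token):
        """Toy linguistic module: lowercase and strip a trailing 's'."""
        token = token.lower()
        return token[:-1] if token.endswith("s") else token

    def build_index(docs):
        """docs maps docID -> text; returns term -> sorted list of docIDs."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for token in tokenize(text):
                index[normalize(token)].add(doc_id)
        return {term: sorted(ids) for term, ids in index.items()}

    docs = {1: "Friends, Romans, countrymen.", 2: "Romans counted friends."}
    print(build_index(docs))
    # {'friend': [1, 2], 'roman': [1, 2], 'countrymen': [1], 'counted': [2]}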

Indexer steps

The input is a sequence of (modified token, docID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

(I,1) (did,1) (enact,1) (julius,1) (caesar,1) (I,1) (was,1) (killed,1) (i',1) (the,1) (capitol,1) (brutus,1) (killed,1) (me,1) (so,2) (let,2) (it,2) (be,2) (with,2) (caesar,2) (the,2) (noble,2) (brutus,2) (hath,2) (told,2) (you,2) (caesar,2) (was,2) (ambitious,2)

Sort by terms.

Core indexing step.

(ambitious,2) (be,2) (brutus,1) (brutus,2) (capitol,1) (caesar,1) (caesar,2) (caesar,2) (did,1) (enact,1) (hath,2) (I,1) (I,1) (i',1) (it,2) (julius,1) (killed,1) (killed,1) (let,2) (me,1) (noble,2) (so,2) (the,1) (the,2) (told,2) (was,1) (was,2) (with,2) (you,2)

  Multiple term entries in a single document are merged.

Frequency information is added.

Why frequency?

Will discuss later.

Term        Doc#  Freq
ambitious    2     1
be           2     1
brutus       1     1
brutus       2     1
capitol      1     1
caesar       1     1
caesar       2     2
did          1     1
enact        1     1
hath         2     1
I            1     2
i'           1     1
it           2     1
julius       1     1
killed       1     2
let          2     1
me           1     1
noble        2     1
so           2     1
the          1     1
the          2     1
told         2     1
was          1     1
was          2     1
with         2     1
you          2     1

The result is split into a Dictionary file and a Postings file.

Dictionary (term, number of docs, total frequency):

Term        #Docs  TotFreq
ambitious     1      1
be            1      1
brutus        2      2
capitol       1      1
caesar        2      3
did           1      1
enact         1      1
hath          1      1
I             1      2
i'            1      1
it            1      1
julius        1      1
killed        1      2
let           1      1
me            1      1
noble         1      1
so            1      1
the           2      2
told          1      1
was           2      2
with          1      1
you           1      1

Postings file: for each term, its list of (docID, freq) entries from the previous step, pointed to from the dictionary.
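A minimal sketch of these indexer steps (collect the pairs, sort, merge duplicates into frequencies, split into dictionary and postings); the function and variable names are illustrative:

    from collections import Counter, defaultdict

    def index_pairs(pairs):
        """pairs: iterable of (term, doc_id). Returns (dictionary, postings)."""
        counts = Counter(pairs)                        # merge step: (term, doc) -> freq
        postings = defaultdict(list)                   # term -> [(doc, freq), ...]
        for (term, doc), freq in sorted(counts.items()):   # sort by term, then docID
            postings[term].append((doc, freq))
        dictionary = {term: (len(plist), sum(f for _, f in plist))
                      for term, plist in postings.items()}  # (#docs, total freq)
        return dictionary, postings

    pairs = [("i", 1), ("did", 1), ("enact", 1), ("julius", 1), ("caesar", 1),
             ("i", 1), ("was", 1), ("killed", 1), ("caesar", 2), ("caesar", 2)]
    dictionary, postings = index_pairs(pairs)
    print(dictionary["caesar"])   # (2, 3): in 2 docs, 3 occurrences in total
    print(postings["caesar"])     # [(1, 1), (2, 2)]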

Where do we pay in storage?

We pay for the terms in the dictionary and for the pointers/postings entries (one per document per term); these are the same dictionary and postings tables as above.

The index we just built

How do we process a Boolean query? (Today’s focus.)

Later: what kinds of queries can we process?

Query processing

Consider processing the query:

Brutus AND Caesar

Locate Brutus in the Dictionary; retrieve its postings.
Locate Caesar in the Dictionary; retrieve its postings.
“Merge” (intersect) the two postings:

Brutus → 2  4  8  16  32  64  128
Caesar → 1  2  3  5  8  13  21  34

The merge

Walk through the two postings simultaneously, in time linear in the total number of postings entries. For the Brutus and Caesar lists above, the result of the merge is 2, 8.

If the list lengths are x and y, the merge takes O(x+y) operations.

Crucial: postings sorted by docID.

Basic postings intersection
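A minimal sketch of the basic intersection, assuming both postings lists are sorted by docID:

    def intersect(p1, p2):
        """Two-pointer merge of sorted postings lists; O(len(p1) + len(p2))."""
        answer = []
        i = j = 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]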

Boolean queries: Exact match

Queries using AND, OR and NOT together with query terms.
Views each document as a set of words.
Is precise: a document matches the condition or it does not.

The primary commercial retrieval tool for three decades. Professional searchers (e.g., lawyers) still like Boolean queries: you know exactly what you’re getting.

Example: WestLaw

http://www.westlaw.com/
Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992).
About 7 terabytes of data; 700,000 users.
The majority of users still use Boolean queries.

Example query: What is the statute of limitations in cases involving the federal tort claims act?
LIMIT! /3 STATUTE ACTION /s FEDERAL /2 TORT /3 CLAIM

Long, precise queries; proximity operators; incrementally developed; not like web search.

Query optimization

   What is the best order for query processing?

Consider a query that is an AND of t terms.

For each of the t terms, get its postings, then AND together.

Brutus    → 2  4  8  16  32  64  128
Calpurnia → 1  2  3  5  8  13  21  34
Caesar    → 13 16

Query: Brutus AND Calpurnia AND Caesar

Query optimization example

Process in order of increasing freq: start with the smallest set, then keep cutting further. (This is why we kept freq in the dictionary.)

Brutus    → 2  4  8  16  32  64  128
Calpurnia → 1  2  3  5  8  13  21  34
Caesar    → 13 16

Execute the query as (Caesar AND Brutus) AND Calpurnia.
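A minimal sketch of this ordering (the process_and_query name is illustrative; intersect is the same two-pointer merge as in the earlier sketch):

    def intersect(p1, p2):
        """Two-pointer merge of sorted postings lists (as in the earlier sketch)."""
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i]); i += 1; j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    def process_and_query(terms, index):
        """AND together the terms' postings, starting from the smallest list."""
        postings = sorted((index[t] for t in terms), key=len)
        result = postings[0]
        for plist in postings[1:]:
            result = intersect(result, plist)
            if not result:       # empty intermediate result: stop early
                break
        return result

    index = {
        "Brutus":    [2, 4, 8, 16, 32, 64, 128],
        "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
        "Caesar":    [13, 16],
    }
    # No document contains all three terms, so the answer here is empty.
    print(process_and_query(["Brutus", "Calpurnia", "Caesar"], index))   # []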

Query optimization

More general optimization

e.g., (madding OR crowd) AND (ignoble OR strife)

Get freq’s for all terms.

Estimate the size of each OR by the sum of its freq’s (conservative).

Process in increasing order of OR sizes.

Exercise

Recommend a query processing order for

(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

Term          Freq
eyes          213312
kaleidoscope   87009
marmalade     107913
skies         271658
tangerine      46653
trees         316812
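One way to apply the estimate from the previous slide (a sketch of one reasonable answer, not the only possible one):

    # Estimate each OR by the sum of its terms' frequencies, then process the
    # ANDs in increasing order of those estimates.
    freq = {"eyes": 213312, "kaleidoscope": 87009, "marmalade": 107913,
            "skies": 271658, "tangerine": 46653, "trees": 316812}

    clauses = [("tangerine", "trees"), ("marmalade", "skies"), ("kaleidoscope", "eyes")]
    estimates = {c: freq[c[0]] + freq[c[1]] for c in clauses}
    print(estimates)
    # {('tangerine', 'trees'): 363465, ('marmalade', 'skies'): 379571,
    #  ('kaleidoscope', 'eyes'): 300321}
    print(sorted(clauses, key=estimates.get))
    # [('kaleidoscope', 'eyes'), ('tangerine', 'trees'), ('marmalade', 'skies')]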

Beyond Boolean term search

   What about phrases?

Proximity: find Gates NEAR Microsoft. Need the index to capture position information in docs. More later.

Zones in documents: find documents with (author = Ullman) AND (text contains automata).

Evidence accumulation

1 vs. 0 occurrences of a search term, 2 vs. 1 occurrences, 3 vs. 2 occurrences, etc.

Need term frequency information in docs.

Used to compute a score for each document; matching documents are rank-ordered by this score.
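As one minimal illustration only (not the scoring scheme developed in later lectures), a score that simply accumulates query-term frequencies might look like this; the tf mapping is assumed to come from an indexer that recorded frequencies:

    # Toy evidence accumulation: score = sum of the query terms' frequencies in the doc.
    def score(query_terms, doc_id, tf):
        return sum(tf.get((term, doc_id), 0) for term in query_terms)

    tf = {("caesar", 1): 1, ("caesar", 2): 2, ("brutus", 1): 1, ("brutus", 2): 1}
    ranked = sorted([1, 2], key=lambda d: score(["caesar", "brutus"], d, tf), reverse=True)
    print(ranked)   # [2, 1] -- doc 2 has caesar twice, so it ranks higher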

Evaluating search engines

Measures for a search engine

How fast does it index? Number of documents/hour (for a given average document size).
How fast does it search? Latency as a function of index size.
Expressiveness of the query language; speed on complex queries.

Measures for a search engine

All of the preceding criteria are measurable: we can quantify speed/size, and we can make expressiveness precise.
The key measure: user happiness. What is this?

Speed of response and size of index are factors, but blindingly fast, useless answers won’t make a user happy.
We need a way of quantifying user happiness.

Measuring user happiness

   Issue: who is the user we are trying to make happy?

Depends on the setting.
Web engine: the user finds what they want and returns to the engine. Can measure the rate of return users.
eCommerce site: the user finds what they want and makes a purchase. Is it the end-user, or the eCommerce site, whose happiness we measure?

Measure time to purchase, or fraction of searchers who become buyers?

Measuring user happiness

Enterprise (company/govt/academic): care about “user productivity”. How much time do my users save when looking for information?

Many other criteria have to do with breadth of access, secure access, … more later.

Happiness: elusive to measure

Most common proxy: relevance of search results. But how do you measure relevance?

Will detail a methodology here, then examine its issues. It requires 3 elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-doc pair

Evaluating an IR system

Note: an information need is translated into a query.
Relevance is assessed relative to the information need, not the query.

E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.

Query: wine red white heart attack effective

Standard relevance benchmarks

TREC: the National Institute of Standards and Technology (NIST) has run a large IR benchmark for many years.
Reuters and other benchmark document collections are used.
“Retrieval tasks” are specified, sometimes as queries.
Human experts mark, for each query and for each doc, Relevant or Irrelevant, or at least do so for the subset of docs that some system returned for that query.

Precision and Recall

Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)

                 Relevant   Not Relevant
Retrieved           tp           fp
Not Retrieved       fn           tn

Precision P = tp / (tp + fp)
Recall    R = tp / (tp + fn)
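A minimal sketch computing both measures from sets of retrieved and relevant docIDs (toy numbers for illustration):

    def precision_recall(retrieved, relevant):
        """retrieved, relevant: sets of docIDs."""
        tp = len(retrieved & relevant)
        precision = tp / len(retrieved) if retrieved else 0.0   # tp / (tp + fp)
        recall    = tp / len(relevant)  if relevant  else 0.0   # tp / (tp + fn)
        return precision, recall

    print(precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5}))
    # (0.5, 0.6666666666666666)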

Accuracy – a different measure

Given a query, an engine classifies each doc as “Relevant” or “Irrelevant”.

Accuracy of an engine: the fraction of these classifications that is correct.

Why not just use accuracy?

How to build a 99.9999% accurate search engine on a low budget…

Search for: [               ]
0 matching results found.

Since almost all docs are irrelevant to any given query, an engine that retrieves nothing classifies nearly every doc correctly, yet it is useless.

People doing information retrieval want to find something and have a certain tolerance for junk.

Precision/Recall

  Can get high recall (but low precision) by retrieving all docs for all queries!

Recall is a non-decreasing function of the number of docs retrieved.
Precision usually decreases (in a good system).

Difficulties in using precision/recall

Should average over large corpus/query ensembles.
Need human relevance assessments; people aren’t reliable assessors.
Assessments have to be binary; what about nuanced assessments?

Heavily skewed by corpus/authorship; results may not translate from one domain to another.

Information Retrieval. Prabhakar Raghavan, Yahoo! Research. Lecture 1, from Chapters 1 and 8 of IIR.