CS144 Discussion Week 4
Information Retrieval
Young Cha
Oct. 25, 2013
Projects
• Project 2 deadline is 11pm today (10/25)
• 2 grace days → 11pm 10/27 (Sun)
• Please double check your implementation before submission
• Project 3 has 3 parts and 2 submission deadlines
  • Part A: Building indexes (due 11/1) No Grace Period Allowed!
  • Part B: Implementing Java search functions (due 11/8)
  • Part C: Publishing Java class as Web service (due 11/8)
  • You may resubmit your project 2 after fixing bugs
  • We don't grade your Part A submission, but we may check how different it is from your Part B/C submission → if it is largely different, briefly write down what has changed in README.txt
Boolean Model
• 3 documents
• Bag of words
• Order doesn’t matter
• Boolean query
• AND/OR/NOT
• doc1: Bruins beat Trojans
• doc2: Trojans envy Bruins
• doc3: Bruins! Go Bruins!
Inverted index:
lexicon/dictionary   postings list (docids)
bruins               1, 2, 3
beat                 1
trojans              1, 2
envy                 2
go                   3
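The lexicon and postings lists for the three documents can be built and queried with a few lines of Python. This is a minimal sketch, not the course's reference implementation: the helper names (tokenize, build_index, lookup) are made up for illustration, and tokenization is assumed to mean lowercasing and dropping punctuation.

```python
import re
from collections import defaultdict

docs = {
    1: "Bruins beat Trojans",
    2: "Trojans envy Bruins",
    3: "Bruins! Go Bruins!",
}

def tokenize(text):
    """Lowercase and keep alphabetic runs only, so 'Bruins!' becomes 'bruins'."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Map each term (lexicon entry) to its sorted postings list of docids."""
    postings = defaultdict(set)
    for docid, text in docs.items():
        for term in tokenize(text):
            postings[term].add(docid)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_index(docs)
all_ids = set(docs)

def lookup(term):
    """Return the docids containing the term (empty set if unseen)."""
    return set(index.get(term, []))

# Boolean queries map directly onto set operations.
both = lookup("bruins") & lookup("trojans")                     # bruins AND trojans
either = lookup("beat") | lookup("go")                          # beat OR go
only_bruins = lookup("bruins") & (all_ids - lookup("trojans"))  # bruins AND NOT trojans
```

Because the model is a bag of words, word order inside a document never enters the index; only membership does.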
Vector Model
• 3 documents
• Tf-idf
• f x log (N/n)
• Cosine similarity
  $\cos\theta = \frac{q \cdot d}{|q|\,|d|}$
• doc1: Bruins beat Trojans
• doc2: Trojans envy Bruins
• doc3: Bruins! Go Bruins!
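Inverted index with idf and (docid, tf) postings:
term      idf*   postings (docid, tf)
bruins    1      (1, 1), (2, 1), (3, 2)
beat      3      (1, 1)
trojans   1.5    (1, 1), (2, 1)
envy      3      (2, 1)
go        3      (3, 1)
* Used N/n instead of log(N/n) for simplicity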
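The tf-idf weights and the cosine ranking can be reproduced with a short Python sketch. It assumes f is the term's frequency in the doc, N the number of docs, and n the number of docs containing the term, and it follows the slide in using N/n rather than log(N/n). The query "go bruins" and all function names are illustrative choices, not part of the original example.

```python
import math
import re
from collections import Counter

docs = {
    1: "Bruins beat Trojans",
    2: "Trojans envy Bruins",
    3: "Bruins! Go Bruins!",
}

def tokenize(text):
    """Lowercase and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())

N = len(docs)
tf = {docid: Counter(tokenize(text)) for docid, text in docs.items()}
df = Counter(term for counts in tf.values() for term in counts)

# Simplified idf as on the slide: N/n instead of log(N/n).
idf = {term: N / n for term, n in df.items()}

def vectorize(counts):
    """Turn term counts into a sparse tf-idf vector; unknown terms are dropped."""
    return {term: f * idf[term] for term, f in counts.items() if term in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

doc_vectors = {docid: vectorize(counts) for docid, counts in tf.items()}
query = vectorize(Counter(tokenize("go bruins")))  # illustrative query
scores = {docid: cosine(query, vec) for docid, vec in doc_vectors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

For this query doc3 ranks first, since it is the only document containing go and it mentions bruins twice.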
Precision & Recall
• 1K docs in a corpus
• 50 relevant docs
• Among 10 docs retrieved by a search engine,
• 3 are relevant
• 7 are irrelevant
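• Precision? |R&D|/|D| = 3/10 = 0.3
• Recall? |R&D|/|R| = 3/50 = 0.06
[Figure: Venn diagram over all docs with R: Relevant and D: Retrieved; 3 docs in both, 7 retrieved only, 47 relevant only]
[Figure: precision-recall plot comparing Search Engine A and Search Engine B]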
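The two ratios can be checked with set arithmetic. In this sketch the docids are hypothetical: docs 1 to 50 stand for the relevant set, and the retrieved set is chosen so that exactly 3 of its 10 docs are relevant, matching the numbers above.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for sets of relevant and retrieved docids."""
    hits = len(relevant & retrieved)  # |R & D|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical docids: 50 relevant docs out of a 1K corpus.
relevant = set(range(1, 51))
# 10 retrieved docs: 48, 49, 50 are relevant; 51..57 are not.
retrieved = set(range(48, 58))

precision, recall = precision_recall(relevant, retrieved)
```

Note that the corpus size (1K) does not appear in either formula; only the relevant and retrieved sets matter.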
Index Size Estimation
• Given that
  • 100 M docs
  • 5 KB/doc
  • 400 unique words/doc
  • 20 bytes/word
  • 10 bytes/docid
• Questions
  • Document collection size? 100M x 5KB = 500GB
  • Inverted index size? 400GB + 200KB
    • Size of postings list? 100M x 400 x 10B = 400GB
    • Size of lexicon? (100M)^0.5 x 20B = 10,000 x 20B = 200KB (C=1, k=0.5 in $C \cdot n^{k}$)
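The estimates above are plain arithmetic, spelled out here as a Python sketch. It assumes decimal units (1 KB = 1,000 bytes) and, following the slide, plugs the number of docs in as n in $C \cdot n^{k}$; the constant names are made up for readability.

```python
NUM_DOCS = 100_000_000        # 100 M docs
DOC_SIZE_BYTES = 5_000        # 5 KB/doc, decimal units assumed
UNIQUE_WORDS_PER_DOC = 400
BYTES_PER_WORD = 20
BYTES_PER_DOCID = 10
C = 1                         # constants given for C * n^k
K = 0.5

# Document collection: one fixed-size record per doc.
collection_bytes = NUM_DOCS * DOC_SIZE_BYTES

# Postings: one docid entry per (doc, unique word) pair.
postings_bytes = NUM_DOCS * UNIQUE_WORDS_PER_DOC * BYTES_PER_DOCID

# Lexicon: vocabulary size estimated as C * n^k, one word entry each.
vocabulary_size = round(C * NUM_DOCS ** K)
lexicon_bytes = vocabulary_size * BYTES_PER_WORD

index_bytes = postings_bytes + lexicon_bytes

GB = 10 ** 9
KB = 10 ** 3
print(collection_bytes / GB, "GB collection")
print(postings_bytes / GB, "GB postings")
print(lexicon_bytes / KB, "KB lexicon")
```

The lexicon is negligible next to the postings, so the inverted index is roughly the same order of size as the collection itself under these assumptions.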
Topic-model based IR
• Topic models assume that there are hidden topics behind words
• An IR system with topic models can match a doc containing automobile with the query vehicle, as it assumes both words come from the same topic, car
[Figure: an Author's doc containing "automobile" and a Searcher's query can be matched through topic-model based IR]
Example
• Document corpus (textual dataset) → matrix
• Assumed hidden (latent) topics behind docs/words
– We can infer topics by analyzing co-occurrence of docs and words
– We can generate docs by multiplying assumed doc-topic and topic-word matrices

Document Corpus
doc1: auto auto ... vehicle vehicle …
doc2: film theater film theater …
doc3: film … theater … vehicle … auto …

D-W = D-T x T-W
(Inference: from D-W to D-T and T-W; Generation: from D-T and T-W to D-W)

D-W (Document-Word, Observed)
        auto  vehicle  film  theater
doc1    25    15       0     0
doc2    0     0        12    12
doc3    15    9        6     6

D-T (Document-Topic, Assumed)
        car  movie
doc1    5    0
doc2    0    4
doc3    3    2

T-W (Topic-Word, Assumed)
        auto  vehicle  film  theater
car     5     3        0     0
movie   0     0        3     3
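The generation direction can be verified by multiplying the two assumed matrices and comparing the product with the observed one. This is a small pure-Python sketch; the variable names and the matmul helper are illustrative only.

```python
words = ["auto", "vehicle", "film", "theater"]
topics = ["car", "movie"]
doc_ids = ["doc1", "doc2", "doc3"]

# D-T: how strongly each doc draws on each topic (assumed).
doc_topic = [
    [5, 0],  # doc1: car only
    [0, 4],  # doc2: movie only
    [3, 2],  # doc3: a mix of both
]

# T-W: how strongly each topic emits each word (assumed).
topic_word = [
    [5, 3, 0, 0],  # car
    [0, 0, 3, 3],  # movie
]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# D-W = D-T x T-W (generation).
doc_word = matmul(doc_topic, topic_word)
for doc_id, row in zip(doc_ids, doc_word):
    print(doc_id, dict(zip(words, row)))
```

Inference runs the other way: given only the observed D-W counts, a topic model has to recover plausible D-T and T-W factors.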
Latent Semantic Indexing (LSI) by SVD
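• SVD of the doc-word matrix C (D: n docs, W: p words)
  $C_{n \times p} = U_{n \times n} \, S_{n \times p} \, V^{T}_{p \times p}$, where S is diagonal
• Rank reduction to k (T: k topics)
  $C_k = U_k \, S_k \, V_k^{T}$, where $U_k$ is n x k (D-T), $S_k$ is k x k, and $V_k^{T}$ is k x p (T-W)
• $C_k$ (n x p) is the rank-k approximation of C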
Latent Semantic Indexing (LSI) by SVD
• Query is viewed as a document → query matching is a process to find a similar document
• The query q is a 1 x p vector over W (words)
• $C_k \, q^{T}$: (n x p) x (p x 1) = n x 1
• Each value in the resulting n x 1 vector represents the similarity between q and d_i
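The matching step can be sketched in pure Python by treating the query as a 1 x p word vector and multiplying it against the rank-k matrix. As an assumption for illustration, the sketch reuses the doc-word matrix from the earlier Example slide; that matrix has rank 2, so it equals its own rank-2 approximation and no SVD library is needed. The function names and the query are hypothetical.

```python
words = ["auto", "vehicle", "film", "theater"]
doc_ids = ["doc1", "doc2", "doc3"]

# Stand-in for C_k (n x p): the example D-W matrix, which already has rank 2.
C_k = [
    [25, 15, 0, 0],
    [0, 0, 12, 12],
    [15, 9, 6, 6],
]

def query_vector(query_terms):
    """Build the 1 x p query vector: 1 where the word is in the query, else 0."""
    return [1 if word in query_terms else 0 for word in words]

def score(matrix, q):
    """Compute matrix x q^T, giving one similarity value per doc (n x 1)."""
    return [sum(c * x for c, x in zip(row, q)) for row in matrix]

q = query_vector({"vehicle"})  # illustrative query
scores = score(C_k, q)
ranked = sorted(zip(doc_ids, scores), key=lambda pair: pair[1], reverse=True)
```

These are raw dot products; dividing by the vector lengths, as in cosine similarity, would keep long documents from dominating the ranking.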
Example Topics - PLSI
• We can group words by topic with the Topic-Word (T-W) matrix
Lucene Example
• Goal: build index for hotels to support keyword search
• Each Hotel item has id, name, city, description
• E.g. 1, Hotel Rivoli, Paris, If you like historical Paris …
• 40 hotels
• Requirements
• Search over name, city, description or full text
• In a search result page, you should show name, city and description
• May need to be combined with a relational database (RDB) for a complex query
• E.g. modern hotel in New York with price < $100
Lucene Example
• We first need to create an IndexWriter
Analyzer: Description
• StandardAnalyzer: A sophisticated general-purpose analyzer.
• WhitespaceAnalyzer: A very simple analyzer that just separates tokens using white space.
• StopAnalyzer: Removes common English words that are not usually useful for indexing (e.g. the, a, is, …).
• SnowballAnalyzer: An interesting experimental analyzer that works on word roots (a search on rain should also return entries with raining, rained, and so on).
Lucene Example
• Which field to store? to index?
Lucene Example
• Now we can perform search using the index
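The store-versus-index question and the search step can be illustrated with a toy model in Python. This sketch does not use Lucene and is not its API: ToyIndex, make_doc, and the combined content field are hypothetical names, and the second hotel is invented sample data. One reading of the requirements is that name, city, description, and a combined full-text field are indexed (searchable), while only id, name, city, and description are stored (returned for display).

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

class ToyIndex:
    """Toy stand-in for a search index: per-field postings plus stored fields."""

    def __init__(self, indexed_fields, stored_fields):
        self.indexed_fields = indexed_fields
        self.stored_fields = stored_fields
        self.postings = defaultdict(set)  # (field, term) -> doc numbers
        self.stored = []                  # doc number -> stored field values

    def add(self, doc):
        docnum = len(self.stored)
        # Stored fields are kept verbatim so results can be displayed.
        self.stored.append({f: doc[f] for f in self.stored_fields})
        # Indexed fields are tokenized so they can be searched.
        for field in self.indexed_fields:
            for term in tokenize(doc[field]):
                self.postings[(field, term)].add(docnum)

    def search(self, field, query):
        """Return stored fields of docs whose field contains every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        docnums = set.intersection(
            *(self.postings.get((field, t), set()) for t in terms)
        )
        return [self.stored[n] for n in sorted(docnums)]

def make_doc(hotel_id, name, city, description):
    """Add a combined 'content' field for full-text search."""
    content = " ".join([name, city, description])
    return {"id": hotel_id, "name": name, "city": city,
            "description": description, "content": content}

index = ToyIndex(
    indexed_fields=["name", "city", "description", "content"],
    stored_fields=["id", "name", "city", "description"],
)
index.add(make_doc("1", "Hotel Rivoli", "Paris", "If you like historical Paris …"))
# Hypothetical second hotel, invented for illustration only.
index.add(make_doc("2", "Hotel Example", "New York", "A modern hotel"))

paris_hits = index.search("city", "paris")
full_text_hits = index.search("content", "modern new york")
name_hits = index.search("name", "paris")
```

The content field is searchable but never returned, which mirrors the idea that a field needed only for matching does not have to be stored. A structured condition such as price < $100 is not something keyword matching handles here, which is where the relational database would come in.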
