CS144 Discussion Week 4
Information Retrieval
Young Cha
Oct. 25, 2013

Projects
• Project 2 deadline is 11pm today (10/25)
  • 2 grace days: 11pm 10/27 (Sun)
  • Please double check your implementation before submission
• Project 3 has 3 parts and 2 submission deadlines
  • Part A: Building indexes (by 11/1) No Grace Period Allowed!
  • Part B: Implementing Java search functions (by 11/8)
  • Part C: Publishing Java class as Web service (by 11/8)
  • You may resubmit your Project 2 after fixing bugs
  • We don't grade your Part A submission, but we may check how different it is from your Part B/C submission. If it is largely different, briefly write down what has changed in README.txt

Boolean Model
• 3 documents
  • doc1: Bruins beat Trojans
  • doc2: Trojans envy Bruins
  • doc3: Bruins! Go Bruins!
• Bag of words
  • Order doesn't matter
• Boolean query
  • AND/OR/NOT

lexicon/dictionary    postings list (docids)
bruins                1, 2, 3
beat                  1
trojans               1, 2
envy                  2
go                    3

Vector Model
• 3 documents
  • doc1: Bruins beat Trojans
  • doc2: Trojans envy Bruins
  • doc3: Bruins! Go Bruins!
• Tf-idf
  • $f \times \log(N/n)$
• Cosine similarity
  • $\cos\theta = \dfrac{q \cdot d}{|q|\,|d|}$

word       idf*    postings list (docid: tf)
bruins     1       1: 1,  2: 1,  3: 2
beat       3       1: 1
trojans    1.5     1: 1,  2: 1
envy       3       2: 1
go         3       3: 1
* Used N/n instead of log(N/n) for simplicity

Precision & Recall
• 1K docs in a corpus
• 50 relevant docs
• Among 10 docs retrieved by a search engine,
  • 3 are relevant
  • 7 are irrelevant
• Precision? $|R \cap D| / |D| = 3/10 = 0.3$
• Recall? $|R \cap D| / |R| = 3/50 = 0.06$
• R: Relevant, D: Retrieved
[Figure: Venn diagram over all docs: 7 retrieved only, 3 both relevant and retrieved, 47 relevant only]
[Figure: plot with Precision and Recall axes comparing Search Engine A and Search Engine B]

Index Size Estimation
• Given that
  • 100 M docs
  • 5 KB/doc
  • 400 unique words/doc
  • 20 bytes/word
  • 10 bytes/docid
• Questions
  • Document collection size? 100M x 5KB = 500GB
  • Inverted index size? 400GB + 200KB
    • Size of postings list? 100M x 400 x 10B = 400GB
    • Size of lexicon? ($C = 1$, $k = 0.5$ in $C \cdot n^k$) $(100\text{M})^{0.5}$ x 20B = 200KB

Topic-model based IR
• Topic models assume that there are hidden topics behind words
• An IR system with topic models can match a doc containing automobile for a query vehicle as it assumes they come from the same topic
[Figure: topic "car" linking the Author's word "automobile" and the Searcher's query; label: Can be matched]

Topic-model based IR Example
• Document corpus (textual dataset): matrix
• Assumed hidden (latent) topics behind docs/words
  – We can infer topics by analyzing co-occurrence of docs and words
  – We can generate docs by multiplying assumed doc-topic and topic-word matrices

Document Corpus
• doc1: auto auto ... vehicle vehicle …
• doc2: film theater film theater …
• doc3: film … theater … vehicle … auto …

D-W: Document-Word (Observed)
        auto   vehicle   film   theater
doc1    25     15        0      0
doc2    0      0         12     12
doc3    15     9         6      6

D-T: Document-Topic (Assumed)
        car    movie
doc1    5      0
doc2    0      4
doc3    3      2

T-W: Topic-Word (Assumed)
        auto   vehicle   film   theater
car     5      3         0      0
movie   0      0         3      3

D-W = D-T x T-W
• Inference: from D-W to D-T and T-W
• Generation: from D-T and T-W to D-W

Latent Semantic Indexing (LSI) by SVD
• $C_{n \times p} = U_{n \times n}\, S_{n \times p}\, V^{T}_{p \times p}$, where C is the D (docs) x W (words) matrix and S is diagonal
• Rank reduction to k: $C_k = U_k\, S_k\, V_k^{T}$
  • $U_k$: $n \times k$, D (docs) x T (topics), the D-T matrix
  • $S_k$: $k \times k$
  • $V_k^{T}$: $k \times p$, T (topics) x W (words), the T-W matrix
  • $C_k$: $n \times p$, the rank-k approximation of C

Latent Semantic Indexing (LSI) by SVD
• Query is viewed as a document, so query matching is a process to find a similar document
• q is a $1 \times p$ vector over W (words)
• $C_k\, q^{T}$: ($n \times p$) x ($p \times 1$) gives an $n \times 1$ vector over D (docs)
• Each value in the vector represents the similarity between q and $d_i$

Example Topics - PLSI
• We can group words with Topic-Word matrix

Lucene Example
• Goal: build index for hotels to support keyword search
• Each Hotel item has id, name, city, description
  • E.g. 1, Hotel Rivoli, Paris, If you like historical Paris …
  • 40 hotels
• Requirements
  • Search over name, city, description or full text
  • In a search result page, you should show name, city and description
  • May need to be incorporated with RDB for a complex query
    • E.g. modern hotel in New York with price < $100

Lucene Example
• We first need to create an IndexWriter

Analyzer: Description
• StandardAnalyzer: A sophisticated general-purpose analyzer.
• WhitespaceAnalyzer: A very simple analyzer that just separates tokens using white space.
• StopAnalyzer: Removes common English words that are not usually useful for indexing. (e.g. the, a, is, …)
• SnowballAnalyzer: An interesting experimental analyzer that works on word roots (a search on rain should also return entries with raining, rained, and so on).

Lucene Example
• Which field to store? to index?

Lucene Example
• Now we can perform search using the index
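
To make "search using the index" concrete, here is a minimal sketch of the Boolean model from earlier in this discussion, using the three Bruins/Trojans documents. It is plain Python, not Lucene, and only illustrates the lexicon and postings-list idea. The names `tokenize`, `build_index`, `boolean_and`, `boolean_or` and `boolean_not` are hypothetical, and the tokenizer (lowercase, letters only) is an assumption rather than what any particular analyzer does.

```python
import re
from collections import defaultdict

# The three example documents from the Boolean Model slide.
DOCS = {
    1: "Bruins beat Trojans",
    2: "Trojans envy Bruins",
    3: "Bruins! Go Bruins!",
}


def tokenize(text):
    # Assumption: lowercase the text and keep only runs of letters,
    # so "Bruins!" and "bruins" map to the same term.
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    # Map each lexicon term to a sorted postings list of docids.
    # Bag of words: word order and repeat counts are ignored.
    postings = defaultdict(set)
    for docid, text in docs.items():
        for term in tokenize(text):
            postings[term].add(docid)
    return {term: sorted(ids) for term, ids in postings.items()}


def boolean_and(index, *terms):
    # Docs containing every term: intersect the postings lists.
    sets = [set(index.get(t.lower(), [])) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []


def boolean_or(index, *terms):
    # Docs containing at least one term: union of the postings lists.
    sets = [set(index.get(t.lower(), [])) for t in terms]
    return sorted(set.union(*sets)) if sets else []


def boolean_not(index, term, all_docids):
    # Docs that do not contain the term.
    return sorted(set(all_docids) - set(index.get(term.lower(), [])))


INDEX = build_index(DOCS)
```

With this sketch, `INDEX["bruins"]` is `[1, 2, 3]`, matching the postings list on the slide, and `boolean_and(INDEX, "bruins", "trojans")` returns `[1, 2]`.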