Transcript PowerPoint
CS 430: Information Discovery
Lecture 23: Query Refinement

Course Administration
Midterm Examination: Grades for the midterm examination have been sent. Contact [email protected] if you have not received a grade.
Assignment 4: This will be an optional assignment for extra credit. It is due on Friday, December 7 at 5 p.m.

The Human in the Loop
[Diagram: the user in the loop — search the index, which returns hits; browse the repository, which returns objects.]

From the Midterm
Question 2(a): Why are precision and recall difficult measures to use for retrieval effectiveness when there is a user in the loop?

Query Refinement
[Flowchart: query formulation and search → display number of hits → decide next step → either reformulate the query (new query) or display the retrieved information; if there are no hits, formulate a new query.]

Reformulation of Query
Manual
• Add or remove search terms
• Change Boolean operators
• Change wild cards
Automatic
• Remove search terms
• Change weighting of search terms
• Add new search terms

Query Reformulation: Vocabulary Tools
Feedback
• Information about stop lists, stemming, etc.
• Numbers of hits on each term or phrase
Suggestions
• Thesaurus
• Browse lists of terms in the inverted index
• Controlled vocabulary

Query Reformulation: Document Tools
Feedback to the user consists of document excerpts or surrogates.
• Shows the user how the system has interpreted the query
Effective at suggesting how to restrict a search
• Shows examples of false hits
Less good at suggesting how to expand a search
• No examples of missed items

Example: Tilebars
The figure represents a set of hits from a text search. Each large rectangle represents a document or section of text. Each row represents a search term or subquery. The density of each small square indicates the frequency with which a term appears in a section of a document.
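The tilebar density grid described above can be sketched in a few lines: split each document into fixed-size segments and count how often each query term appears in each segment. The document text, terms, and segment size below are made-up illustrations, not from Hearst's system.

```python
# Sketch of computing tilebar densities: rows = query terms, columns = segments
# of a document, cell value = how often the term occurs in that segment.
# The example document and terms are invented for illustration.

def tilebar(document_words, terms, segment_size=4):
    """Return a grid of term counts per fixed-size document segment."""
    segments = [document_words[i:i + segment_size]
                for i in range(0, len(document_words), segment_size)]
    return [[seg.count(term) for seg in segments] for term in terms]

doc = ("query refinement uses feedback from the user the user reformulates "
       "the query after seeing feedback").split()
terms = ["query", "feedback", "user"]
grid = tilebar(doc, terms)
for term, row in zip(terms, grid):
    print(f"{term:>8}: {row}")
```

A display would then shade each cell in proportion to its count, giving the reader an at-a-glance view of where each term is concentrated in each document.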
Hearst 1995

Document Vectors as Points on a Surface
• Normalize all document vectors to be of length 1
• Then the ends of the vectors all lie on a surface with unit radius
• For similar documents, we can represent parts of this surface as a flat region
• Similar documents are represented as points that are close together on this surface
From Lecture 8

Theoretically Best Query
[Diagram: documents plotted as points, where x marks non-relevant documents and o marks relevant documents. The optimal query lies at the center of the cluster of relevant documents.]

Theoretically Best Query
For a specific query, Q, let:
  DR be the set of all relevant documents
  DN-R be the set of all non-relevant documents
  sim(Q, DR) be the mean similarity between query Q and the documents in DR
  sim(Q, DN-R) be the mean similarity between query Q and the documents in DN-R
The theoretically best query would maximize:
  F = sim(Q, DR) - sim(Q, DN-R)

Estimating the Best Query
In practice, DR and DN-R are not known. (The objective is to find them.) However, the results of an initial query can be used to estimate sim(Q, DR) and sim(Q, DN-R).
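The quantity F = sim(Q, DR) - sim(Q, DN-R) from the slides can be computed directly once similarity is fixed; the sketch below uses cosine similarity on unit-length vectors, as in the earlier slide on document vectors. The tiny three-term vectors are invented for illustration.

```python
import numpy as np

# Sketch of F = sim(Q, DR) - sim(Q, DN-R): mean cosine similarity to the
# relevant set minus mean cosine similarity to the non-relevant set.
# The 3-term document and query vectors are made-up examples.

def unit(v):
    """Normalize a vector to length 1, so dot product = cosine similarity."""
    return v / np.linalg.norm(v)

def mean_similarity(q, docs):
    """Mean cosine similarity between query q and a set of document vectors."""
    q = unit(q)
    return float(np.mean([unit(d) @ q for d in docs]))

relevant = [np.array([1.0, 1.0, 0.0]), np.array([0.9, 1.1, 0.1])]
non_relevant = [np.array([0.0, 0.1, 1.0]), np.array([0.1, 0.0, 1.0])]
q = np.array([1.0, 1.0, 0.2])

F = mean_similarity(q, relevant) - mean_similarity(q, non_relevant)
print(F)  # positive: q lies closer to the relevant cluster
```

A query that moved away from the relevant cluster would drive F toward zero or below, which is exactly what relevance feedback tries to avoid.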
Relevance Feedback (concept)
[Diagram: hits from the original search, where x marks documents identified as non-relevant and o marks documents identified as relevant. The reformulated query moves away from the original query, toward the relevant documents.]

Rocchio's Modified Query
Modified query vector =
  Original query vector
  + Mean of relevant documents found by original query
  - Mean of non-relevant documents found by original query

Query Modification
  Q1 = Q0 + (1/n1) Σ(i=1..n1) Ri - (1/n2) Σ(i=1..n2) Si
where:
  Q0 = vector for the initial query
  Q1 = vector for the modified query
  Ri = vector for relevant document i
  Si = vector for non-relevant document i
  n1 = number of relevant documents
  n2 = number of non-relevant documents
Rocchio 1971

Difficulties with Relevance Feedback
[Diagram: x marks non-relevant documents and o marks relevant documents, with the original query, the reformulated query, and the optimal query shown. Hits from the initial query are contained in the gray shaded area; the reformulated query moves toward them but can still fall short of the optimal query.]

Effectiveness of Relevance Feedback
Best when:
• Relevant documents are tightly clustered (similarities are large)
• Similarities between relevant and non-relevant documents are small

Positive and Negative Feedback
  Q1 = α Q0 + β (1/n1) Σ(i=1..n1) Ri - γ (1/n2) Σ(i=1..n2) Si
α, β and γ are weights that adjust the importance of the three vectors.
If γ = 0, the weights provide positive feedback, by emphasizing the relevant documents in the initial set.
If β = 0, the weights provide negative feedback, by reducing the emphasis on the non-relevant documents in the initial set.

When to Use Relevance Feedback
Relevance feedback is most important when the user wishes to increase recall, i.e., when it is important to find all relevant documents.
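The weighted Rocchio formula above is straightforward to implement; this is a minimal sketch, with invented term weights, and with negative weights clipped to zero after the subtraction (a common implementation choice, not part of the formula on the slide).

```python
import numpy as np

# Sketch of the Rocchio query modification:
#   Q1 = alpha*Q0 + beta*(1/n1)*sum(Ri) - gamma*(1/n2)*sum(Si)
# With alpha = beta = gamma = 1 this is the unweighted 1971 formula.
# The example query and document vectors are made-up illustrations.

def rocchio(q0, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Return the modified query vector Q1."""
    q1 = alpha * q0
    if relevant:                                   # positive feedback term
        q1 = q1 + beta * np.mean(relevant, axis=0)
    if non_relevant:                               # negative feedback term
        q1 = q1 - gamma * np.mean(non_relevant, axis=0)
    return np.clip(q1, 0.0, None)  # drop negative term weights (common practice)

q0 = np.array([1.0, 0.0, 0.0])
relevant = [np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.5, 0.0])]
non_relevant = [np.array([0.0, 0.0, 1.0])]
print(rocchio(q0, relevant, non_relevant))  # [2.   0.75 0.  ]
```

Setting gamma=0 gives the pure positive feedback of the slide above; setting beta=0 gives pure negative feedback.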
Under these circumstances, users can be expected to put effort into searching:
• Formulate queries thoughtfully with many terms
• Review results carefully to provide feedback
• Iterate several times
• Combine automatic query enhancement with studies of thesauruses and other manual enhancements

Latent Semantic Indexing
A very rough sketch of the basic idea, without any details or justification.
Objective: Replace indexes that use sets of index terms by indexes that use concepts.
Approach: Map the index term vector space into a lower-dimensional space, using singular value decomposition.

The Index Term Vector Space
The space has as many dimensions as there are terms in the word list.
[Diagram: documents d1 and d2 plotted as vectors in a three-dimensional space with axes t1, t2 and t3.]

Mathematical Concepts
Vector space theory (singular value decomposition)
Define M as the term-document matrix, with t rows (number of index terms) and n columns (number of documents). There exist matrices K, S and D, such that:
  M = K S Dᵀ
K is the matrix of eigenvectors of MMᵀ
D is the matrix of eigenvectors of MᵀM
S is an r x r diagonal matrix, where r is the rank of M, usually the smaller of t and n, and every element of S is non-negative.

Reduction of Dimension
Select the s largest elements of S and the corresponding columns of K and D. This gives a reduced matrix:
  Ms = Ks Ss Dsᵀ
It is claimed that the rows of this matrix represent concepts. Therefore calculation of the similarity between a query expressed in this space and a document is more effective than in the index term vector space.
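The decomposition and truncation described above can be sketched with numpy, whose SVD routine returns the factors K, S and Dᵀ directly rather than via the eigenvector construction on the slide. The 4-term x 3-document matrix is a made-up example.

```python
import numpy as np

# Sketch of LSI dimension reduction: factor the term-document matrix M with
# the SVD, keep the s largest singular values, and rebuild Ms = Ks Ss Dsᵀ.
# The tiny matrix and term labels are invented for illustration.

M = np.array([[1.0, 1.0, 0.0],   # term "retrieval"
              [1.0, 0.0, 0.0],   # term "query"
              [0.0, 1.0, 1.0],   # term "feedback"
              [0.0, 0.0, 1.0]])  # term "user"

# numpy returns K (t x r), the singular values S (length r, descending),
# and Dᵀ (r x n), with M = K @ diag(S) @ Dt.
K, S, Dt = np.linalg.svd(M, full_matrices=False)

s = 2  # keep the s largest singular values
Ms = K[:, :s] @ np.diag(S[:s]) @ Dt[:s, :]

# Ms is the best rank-s approximation of M in the least-squares sense
print(np.round(Ms, 2))
```

In practice queries are folded into the same reduced space and compared against the reduced document representations, rather than against columns of the original M.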