LSU/SLIS
Search Engines and Information Retrieval (IR)
Session 11, LIS 7008 Information Technologies

Agenda
• The search process
• Information retrieval (IR)
• Recommender systems
• Multimedia information retrieval (MMIR)
• Evaluation of information retrieval systems

Questions
• What is the Web? Who defines it?
• Why does Google pay $125 million to scan books?
  – You'll get an idea by the end of this session

Web Information Retrieval
• (Diagram: browsers, search engines, and the Web (a portion))

The Memex Machine
• Vannevar Bush, 1945, "As We May Think"

DIKW Information Hierarchy
• Data → Information → Knowledge → Wisdom
  – Data to information: understanding relations
  – Information to knowledge: understanding patterns
  – Knowledge to wisdom: understanding principles
• Each level is more refined and abstract than the one below; connectedness and understanding increase as you move up
• DB | IR | KS
• http://www.systems-thinking.org/dikw/dikw.htm

DIKW Information Hierarchy
• Data
  – Symbols: exist, usable or not, meaningful or not
• Information
  – Data organized and presented in a particular manner
  – Data that has been given meaning by way of relational connection
• Knowledge
  – "Justified true belief"; the application of data and information
  – Intends to be useful
• Wisdom
  – Distilled and integrated knowledge
  – Evaluated understanding; the process by which we discern, or judge, between right and wrong, good and bad
Slide adapted from http://www.systems-thinking.org/dikw/dikw.htm and http://www.umiacs.umd.edu/~jimmylin/LBSC690-2007-Spring/content.html (session 5)

A (Facetious) Example
• Data
  – 98.6º F, 99.5º F, 100.3º F, 101º F, … (we do not know what they mean)
• Information
  – John's hourly body temperature: 98.6º F, 99.5º F, 100.3º F, 101º F, …
• Knowledge
  – If you have a temperature above 100º F, you most likely have a fever
• Wisdom
  – If you have a fever and don't feel well, go see a doctor
Slide adapted from http://www.umiacs.umd.edu/~jimmylin/LBSC690-2007-Spring/content.html (session 5)

Databases (DB) vs. Information Retrieval (IR)
• What we're retrieving
  – DB: structured data; clear semantics based on a formal model
  – IR: mostly unstructured information; free text with some metadata
• Queries we're posing
  – DB: formally (mathematically) defined and unambiguous (e.g., SQL)
  – IR: vague, imprecise information needs, often expressed in natural language
• Results we get
  – DB: exact; always correct in a formal sense
  – IR: sometimes relevant, often not
• Interaction with the system
  – DB: one-shot queries
  – IR: a process; interaction is important
• Other issues
  – DB: concurrency, recovery, and atomicity are critical; efficiency is important
  – IR: effectiveness and usability are critical
• How about searching for a book in the www.lib.lsu.edu catalog?

Information "Retrieval" Tasks
• Find something that you want or need
  – The information need may or may not be explicit
  – E.g., "cat", "movie for dog", "jaguar"
• Known-item search
  – Find the class home page
• Answer seeking
  – Is Lexington or Louisville the capital of Kentucky?
  – When did Michael Jackson die?
  – This is not database search, because the information is unstructured
• Directed exploration
  – Who makes videoconferencing systems?
  – Self-guided teaching and exploration

The Big Picture
• The four components of the information retrieval environment:
  – User (user needs)
  – Process support
  – System
  – Data/info (static)
• The user and the process are what we care about; the system and the data are what computer geeks care about!
Information Retrieval Paradigm
• (Diagram of the search cycle: query and search, browse and select, examine the document, document delivery)

IR: Supporting the Search Process
• (Diagram: the stages of search that an IR system supports)
  – Source selection (predict, nominate, choose)
  – Query formulation → query
  – Search → ranked list
  – Selection → document
  – Examination → document delivery
  – Feedback loops: query reformulation and relevance feedback; source reselection

Supporting the Search Process
• (Same diagram, with the system side added: acquisition builds the collection, indexing builds the index that search runs against, and delivery draws on a document cache)

Human-Machine Synergy
• Machines are good at:
  – Doing simple things accurately and quickly
  – Scaling to larger collections in sublinear time
• People are better at:
  – Accurately recognizing what they are looking for
  – Evaluating intangibles such as "quality"
• Both are pretty bad at:
  – Mapping consistently between words and concepts

Search Component Model
• (Diagram: documents pass through a representation function to produce document representations; an information need passes through query formulation and a representation function to produce a query representation; a comparison function matches the two and yields a retrieval status value (retrieved or not); human judgment of utility sits over the whole process)

Ways of Finding Text
• Searching content
  – Characterize documents by the words they contain
• Searching metadata (fields)
  – Using controlled or uncontrolled vocabularies
• Searching behavior
  – User-item: find similar users based on the items they search for
    • E.g., users who looked for "movies for dog"
  – Item-item: find items that cause similar reactions
    • E.g., Amazon.com recommends books based on the books you bought

Two Ways of Searching
• Free-text searching (content-based query-document matching)
  – The author writes the document using terms that convey meaning
  – The searcher constructs a query from terms that may appear in documents
  – Matching compares query terms with document terms to produce a retrieval status value
• Controlled-vocabulary searching (metadata-based query-document matching)
  – An indexer chooses appropriate concept descriptors for the document
  – The searcher constructs a query from the available concept descriptors
  – Matching compares query descriptors with document descriptors

3 Types of Information Retrieval Systems
• Exact-match retrieval
  – Boolean retrieval systems
• Ranked retrieval
  – Similarity-based retrieval
• Rating-based recommendation

"Exact Match" Retrieval
• Query: find all documents with certain characteristics, e.g.:
  – Indexed as "Presidents -- United States"
  – Containing the words "Obama" and "health"
  – Read by my boss
• Result: a set of documents is returned
  – Hopefully not too many and not too few
  – Usually listed in date or alphabetical order
• E.g., Boolean retrieval search engines (see the sketch after this group of slides)

The Perfect Query Paradox
• Every information need has a perfect document set
  – Finding that set is the goal of search
• Every document set has a perfect query
  – "AND" together every word in a document to get a query for that document
  – Repeat for each document in the set
  – "OR" together every document query to get the query for the set
• The problem isn't the system … it's the query!
  – How can a human being formulate such a complicated query?!

Queries on the Web (1999)
• Low query construction effort
  – 2.35 (often imprecise) terms per query
  – 20% of Web users use search operators
  – 22% of queries are subsequently modified
• Low browsing effort
  – Only 15% of users view more than one page of results
  – Most users look only "above the fold"
    • One study showed that 10% don't know how to scroll!
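To make the exact-match idea above concrete, here is a minimal sketch of Boolean AND retrieval over an inverted index. The three-document collection, the tokenizer, and the function names are illustrative assumptions, not part of the session materials.

```python
# A minimal sketch of exact-match (Boolean AND) retrieval over an inverted index.
# The tiny collection and all names here are made up for illustration.
from collections import defaultdict

docs = {
    "d1": "Obama signs the health care bill",
    "d2": "Presidents of the United States",
    "d3": "Health effects of coffee",
}

def tokenize(text):
    return text.lower().split()

# Inverted index: term -> set of document ids that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in tokenize(text):
        index[term].add(doc_id)

def boolean_and(query):
    """Return the set of documents containing ALL query terms."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

print(boolean_and("obama health"))  # {'d1'}: a set, not a ranked list
```

The result is an unordered set: the system either matches or it does not, which is why a weak query returns too much or too little and why the ranked retrieval approach described below is easier on untrained searchers.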
3 Types of User Needs
• Informational (30-40% of AltaVista queries)
  – What is a quark? How does a car engine work?
• Navigational
  – Find the United Airlines Web page that discusses its luggage policy
• Transactional
  – Data: What is the weather in Paris?
  – Shopping: Who sells a Vaio Z505RX?
  – Proprietary: Obtain a journal article

Ranked Retrieval
• A Boolean search engine per se is perfect
  – Searchers are not, because of the perfect query paradox
  – Imperfect queries make the results worse than an ideal Boolean search could deliver
• Put the most useful documents near the top of the list
  – Possibly useful documents go lower in the list
• Users can read down as far as they like
  – Based on what they need, the time available, ...
• Provides useful results from weak queries
  – Untrained users find exact-match (Boolean) systems harder to use because they lack query formulation skills

Similarity-Based Retrieval
• Assume "most useful" = most similar to the query
• Weight terms based on two criteria:
  – Repeated words are good cues to the meaning of a document
  – Rarely used words make searches more selective
• Compare term weights with the query
  – Add up the weights for each query term
  – Put the documents with the highest totals first

Simple Example: Counting Words
• Query: "recall and fallout measures for information retrieval"
• Documents:
  – d1: Nuclear fallout contaminated Texas.
  – d2: Information retrieval is interesting.
  – d3: Information retrieval is complicated.
• Build an index of term weights:

  Term          d1  d2  d3  Query
  complicated    0   0   1    0
  contaminated   1   0   0    0
  fallout        1   0   0    1
  information    0   1   1    1
  interesting    0   1   0    0
  nuclear        1   0   0    0
  retrieval      0   1   1    1
  Texas          1   0   0    0

• Vector of term weights for d1: (0, 1, 1, 0, 0, 1, 0, 1)
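The counting-words example above can be reproduced in a few lines. This is only a sketch of the slide's binary term vectors and its "add up the weights for each query term" rule; the stopword list and the function names are my own assumptions.

```python
# Sketch of the "Simple Example: Counting Words" slide: binary term-weight
# vectors compared with the query by summing matching weights.
docs = {
    "d1": "Nuclear fallout contaminated Texas.",
    "d2": "Information retrieval is interesting.",
    "d3": "Information retrieval is complicated.",
}
query = "recall and fallout measures for information retrieval"

STOPWORDS = {"is", "and", "for"}  # assumed: the slide ignores these function words

def terms(text):
    words = text.lower().replace(".", "").split()
    return {w for w in words if w not in STOPWORDS}

vocab = sorted(set().union(*(terms(t) for t in docs.values())))

def vector(text):
    present = terms(text)
    return [1 if term in present else 0 for term in vocab]

print(vocab)
print("d1 vector:", vector(docs["d1"]))  # [0, 1, 1, 0, 0, 1, 0, 1], as on the slide

# Score each document by the dot product of its vector with the query vector.
q_vec = vector(query)
for doc_id, text in docs.items():
    score = sum(q * d for q, d in zip(q_vec, vector(text)))
    print(doc_id, score)
# d1 scores 1 (shares "fallout"); d2 and d3 score 2 ("information", "retrieval").
```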
Discussion Point: Which Terms to Emphasize?
• Major factors
  – Uncommon terms are more selective
    • E.g., in a collection of chemistry abstracts, "Monica Lewinsky" would be a very selective term
  – Repeated terms provide evidence of meaning
    • E.g., if a document mentions "cat" many times, it is probably about cats
• Adjustments
  – Give more weight to terms in certain positions (title, first paragraph, etc.)
  – Give less weight to each term in longer documents
  – Ignore documents that try to "spam" the index
    • Invisible text, excessive use of the "meta" field, …

"Okapi" Term Weights
• E.g., cat, cat, cat, Lewinsky
• w(i,j) = [ TF(i,j) / (0.5 + 1.5 * (L(i)/L_avg) + TF(i,j)) ] * log[ (N - DF(j) + 0.5) / (DF(j) + 0.5) ]
  – The first factor is the TF component; the second is the IDF component
  – TF(i,j) is the frequency of term j in document i, DF(j) is the number of documents containing term j, N is the number of documents in the collection, L(i) is the length of document i, and L_avg is the average document length
• (Plots: the Okapi TF component as a function of raw TF for L/L_avg = 0.5, 1.0, and 2.0, and classic vs. Okapi IDF as a function of raw DF)

Index Quality of Web Search Engines
• Index quality is affected by:
  – Crawl quality: comprehensiveness, dead links, duplicate detection
  – Document analysis: frames, metadata, imperfect HTML, …
  – Document extension: anchor text, source authority, category, language, …
  – Document restriction (ephemeral text suppression): banner ads, keyword spam, …

Other Web Search Quality Factors
• Spam suppression
  – "Adversarial information retrieval": detect, isolate, and defeat spamming
  – Every source of evidence has been spammed: text, queries, links, access patterns, …
• "Family filter" accuracy
  – A family filter reduces objectionable content
  – Link analysis can be very helpful
    • E.g., few serious web pages link to pornography sites

Indexing Anchor Text
• A type of "document expansion"
  – Terms near links describe the content of the target page
• Works even when you can't index the content itself
  – Image retrieval, uncrawled links, …

Rating-Based Recommendation
• Use ratings to describe objects
  – Personal recommendations, peer review, …
• Goes beyond topicality:
  – Accuracy, coherence, depth, novelty, style, …
• Has been applied to many modalities
  – Recommending books, Usenet news, movies, music, jokes, beer, …

Using Positive Information
• (Table: letter grades, A through F, that the users Joe, Ellen, Mickey, Goofy, John, Ben, and Nathan gave to the rides Small World, Space Mtn, Mad Tea Pty, Dumbo, Speedway, and Cntry Bear; "?" marks grades to be predicted. High grades from users with similar tastes are positive evidence.)

Using Negative Information
• (Same table; low grades from users with similar tastes are also useful evidence.)

Problems with Explicit Ratings
• Cognitive load on users: people don't like to provide ratings
• Rating sparsity: a number of raters are needed before recommendations can be made
• No way to recommend new items that have not yet been rated by any user

Implicit Evidence for Ratings
• (Table of observable behaviors, by category and by the scope they apply to: segment, object, or class)
  – Examine: view, select
  – Retain: bookmark, save, purchase, subscribe, print, delete
  – Reference: cite, link, quote, cut & paste, reply, forward
  – Annotate: rate, interpret, publish, organize

Implicit Evidence for Rating: Click Streams
• Browsing histories are easily captured
  – Browsers send all links to a central site
  – Record the from and to pages and the user's cookie
  – Redirect the browser to the desired page
• A user's reading time is correlated with interest
  – Can be used to build individual profiles
  – Used to target advertising by doubleclick.com
Slide adapted from http://www.umiacs.umd.edu/~oard/teaching/690/fall05/syllabus.html (Session 12)

Estimating Authority from Links
• Google PageRank
• (Diagram: hub and authority pages in the Web link graph)
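The "Okapi" term weights slide above pairs a saturating TF component with an IDF component. The small sketch below simply evaluates that reconstructed formula on toy numbers (the collection statistics are invented); it shows why three occurrences of a common word like "cat" can still weigh less than one occurrence of a rare word like "Lewinsky".

```python
import math

def okapi_weight(tf, df, doc_len, avg_len, n_docs):
    """Okapi-style term weight, as on the slide:
    a TF component (saturating in raw TF, damped for long documents)
    times an IDF component (larger for rarer terms)."""
    tf_component = tf / (0.5 + 1.5 * (doc_len / avg_len) + tf)
    idf_component = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf_component * idf_component

# Invented statistics: 100 documents, average length 50 terms, one 60-term document.
print(okapi_weight(tf=3, df=30, doc_len=60, avg_len=50, n_docs=100))  # "cat": ~0.47
print(okapi_weight(tf=1, df=2,  doc_len=60, avg_len=50, n_docs=100))  # "Lewinsky": ~1.11
```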
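The "Estimating Authority from Links" slide names Google PageRank. The sketch below is a generic power-iteration version of the PageRank idea on a made-up four-page link graph; it is not Google's actual implementation, and the damping factor of 0.85 is the conventional textbook choice.

```python
# Generic PageRank by power iteration on a tiny, made-up link graph.
links = {          # page -> pages it links to (every page here has out-links)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
damping = 0.85
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # iterate until the ranks roughly stop changing
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

# Pages that many (or highly ranked) pages point to come out with more authority.
print(sorted(rank.items(), key=lambda item: -item[1]))
```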
Why Does Google Scan Books?
• The index is large
• The interface is intuitive

Putting It All Together (3 Types of Evidence for Finding Information)
• (Table rating free text, behavior, and metadata as evidence, from good to poor, on criteria such as topicality, quality, reliability, cost, and flexibility)

Information Retrieval Types
• (Diagram of information retrieval types, organized along dimensions such as genre and time; source: Ayse Goker)

Expanding the Search Space: Cross-Language Information Retrieval
• (Example: scanned documents; identity: Harriet; handwritten text "… Later, I learned that John had not heard …")

Document Image Retrieval: Page Layer Segmentation
• Document image generation model
  – A document consists of many layers, such as handwriting, machine-printed text, background patterns, tables, figures, noise, etc.

High-Payoff Investments (in language recognition problems)
• (Plot: searchable fraction vs. transducer capability, with MT, handwriting, speech, and OCR (Google Books) marked)
• Transducer capability = accurately recognized words / words produced

Speech Retrieval Approaches
• Controlled vocabulary indexing
  – Manually describe the speech content
• Ranked retrieval based on associated text
  – Manual or automatic transcripts, notes, anchor text
• Automatic feature-based indexing
  – Use phonetic features (e.g., phonemes)
  – E.g., "M ae n ih jh" is indexed as the overlapping sequences "maen", "aenih", "nihjh"
• Social filtering based on other users' ratings
Slide from http://www.glue.umd.edu/~oard/teaching/796/spring04/syllabus.html

BBN Radio News Retrieval
• (Example system)

Muscle Fish Audio Retrieval
• Compute 4 acoustic features for each time slice
  – Pitch, amplitude, brightness, bandwidth
• Segment at major discontinuities
  – Find the average, variance, and smoothness of each segment
• Store pointers to segments in 13 sorted lists
  – Use a commercial database for proximity matching
  – 4 features, 3 parameters for each, plus duration
  – Then rank order using statistical classification
• Display the file name and audio
Slide from http://www.glue.umd.edu/~oard/teaching/796/spring04/syllabus.html

Music
• Search by metadata (in text)
• Search by singing or humming
  – http://www.midomi.com

Image Retrieval
• Three traditional approaches
  – Controlled vocabulary indexing
  – Ranked retrieval based on associated captions
  – Social filtering based on other users' ratings
• Today's focus is on content-based retrieval
  – An analogue of content-based text retrieval
  – Webseek: http://www.ctr.columbia.edu/webseek/

Color Histogram Matching
• Represent an image as a rectangular pixel raster
  – E.g., 1024 columns and 768 rows
• Represent each pixel as a quantized color
  – E.g., 256 colors ranging from red through violet
• Count the number of pixels in each color bin
  – Produces a vector representation
• Compute color vector similarity
  – E.g., normalized inner product of two vectors (linear algebra terminology)
  – (See the sketch after this group of slides)

Color Histogram Example
• (Example images)

Image Retrieval Summary
• Query by
  – Keywords, an example image, a sketch
• Matching the query against images
  – Caption text
  – Segmentation
  – Similarity of color, texture, shape
  – Spatial arrangement (orientation, position)
  – Specialized techniques (e.g., face recognition)
• Selection
  – Thumbnails as surrogates
• Useful for your project if you use a lot of images, e.g., the Banned Book Exhibition project (Fall 2008)

Video Retrieval Approaches
• Visual content-based approach
  – A video is a sequence of images
• Semantic content-based approach
  – Keyword search
  – Browsing
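The color histogram matching slides above describe counting pixels per quantized color bin and comparing the resulting vectors with a normalized inner product. Here is a minimal sketch under those assumptions; the two tiny "images" are made-up grids whose pixels are already quantized to four color bins.

```python
import math

NUM_BINS = 4  # assumed number of quantized colors (the slide suggests 256)

# Each "image" is a rectangular raster of quantized color bin indices.
image_a = [[0, 0, 1],
           [0, 2, 1]]
image_b = [[0, 1, 1],
           [3, 2, 1]]

def color_histogram(image):
    """Count the pixels falling in each color bin: the vector representation."""
    hist = [0] * NUM_BINS
    for row in image:
        for color in row:
            hist[color] += 1
    return hist

def normalized_inner_product(u, v):
    """Similarity of two histogram vectors (cosine of the angle between them)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(normalized_inner_product(color_histogram(image_a),
                               color_histogram(image_b)))  # ~0.77
```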
Evaluation of IR Systems
• What can be measured that reflects the searcher's ability to use a system? (Cleverdon, 1966)
  – Coverage of information
  – Form of presentation
  – Effort required / ease of use
  – Time and space efficiency
  – Recall
  – Precision
  – (recall and precision together constitute effectiveness)

Evaluating IR Systems: 2 Strategies
• User-centered strategy
  – Given a group of users and at least 2 retrieval systems
  – Have each user try the same task on both systems
  – Measure which system works "best"
• System-centered strategy
  – Given a test collection (documents, queries, and relevance judgments)
  – Try several variations of the retrieval system
  – Measure which variation ranks more good documents near the top

Which is the Best Rank Order?
• (Diagram: six ranked lists, A through F, with the relevant documents marked at different positions)
• A user may stop at any point; compute precision at each relevant document, then take the average (average precision)

Precision and Recall
• Precision
  – How much of what was found is relevant?
  – Often of interest, particularly for interactive searching
• Recall
  – How much of what is relevant was found?
  – Particularly important for law, patents, and medicine

Measures of Effectiveness
• Precision = |Ret ∩ Rel| / |Ret|
• Recall = |Ret ∩ Rel| / |Rel|
• (Diagram: the retrieved and relevant sets, with Ret ∩ Rel as their overlap)
Slide adapted from http://www.umiacs.umd.edu/~oard/teaching/690/fall05/syllabus.html (Session 12)

Effectiveness: Set-Based Measures

                 Relevant   Not relevant
  Retrieved         A            B
  Not retrieved     C            D

• Collection size = A + B + C + D
• Relevant = A + C
• Retrieved = A + B
• Precision = A / (A + B)
• Recall = A / (A + C)
• Miss = C / (A + C)
• False alarm (fallout) = B / (B + D)
• When is precision important? When is recall important?
Slide adapted from http://www.umiacs.umd.edu/~jimmylin/LBSC690-2007-Spring/content.html (Session 11)

Precision-Recall Curves
• (Plot: a precision-recall curve, with recall from 0 to 1 on the x-axis and precision from 0 to 1 on the y-axis)
Source: Ellen Voorhees, NIST

User Studies
• The goal is to account for interface issues
  – By studying the interface component
  – By studying the complete system
• Formative evaluation
  – Provides a basis for system development
  – Users on the development team give feedback instantly
• Summative evaluation
  – Designed to assess the performance of the system
Slide adapted from http://www.umiacs.umd.edu/~jimmylin/LBSC690-2007-Spring/content.html (Session 11)

Quantitative User Studies
• Select independent variable(s)
  – E.g., what information to display in the selection interface: title, snippet
• Select dependent variable(s)
  – E.g., time to find a known relevant document
• Run subjects in different orders
  – Averages out learning and fatigue effects
• Compute statistical significance
  – Null hypothesis: the independent variable has no effect
  – Rejected if p < 0.05
Slide adapted from http://www.umiacs.umd.edu/~jimmylin/LBSC690-2007-Spring/content.html (Session 11)

Qualitative User Studies
• Observe user behavior
  – Instrumented software, eye trackers, etc.
  – Face and keyboard cameras
  – Think-aloud protocols
  – Interviews and focus groups
• Organize the data
  – For example, group it into overlapping categories
• Look for patterns and themes
• Develop a "grounded theory"
Slide adapted from http://www.umiacs.umd.edu/~jimmylin/LBSC690-2007-Spring/content.html (Session 11)

Questionnaires
• Demographic data
  – For example, computer experience
  – Basis for interpreting results
• Subjective self-assessment
  – Which system did they think was more effective?
  – Often at variance with objective results!
• Preference
  – Which interface did they prefer? Why?
Slide adapted from http://www.umiacs.umd.edu/~jimmylin/LBSC690-2007-Spring/content.html (Session 11)
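The set-based measures and average-precision slides above reduce to a few lines of arithmetic. The sketch below works through one made-up ranked list and set of relevance judgments; the document identifiers and the collection size are invented for illustration.

```python
# Set-based effectiveness measures and average precision on made-up judgments.
relevant = {"d1", "d3", "d7", "d9"}            # documents judged relevant
retrieved = ["d3", "d5", "d1", "d8", "d9"]     # system output, best-ranked first
collection_size = 20

a = len(set(retrieved) & relevant)             # relevant and retrieved
b = len(retrieved) - a                         # retrieved but not relevant
c = len(relevant) - a                          # relevant but not retrieved
d = collection_size - a - b - c                # neither

print("precision   =", a / (a + b))            # 0.60
print("recall      =", a / (a + c))            # 0.75
print("miss        =", c / (a + c))            # 0.25
print("false alarm =", b / (b + d))            # 0.125

# Average precision: precision at each relevant document in the ranked list,
# averaged over all relevant documents in the collection.
precision_at_hits = []
hits = 0
for position, doc in enumerate(retrieved, start=1):
    if doc in relevant:
        hits += 1
        precision_at_hits.append(hits / position)
print("average precision =", sum(precision_at_hits) / len(relevant))  # ~0.57
```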
Affective Evaluation of Systems
• Measure "stickiness" through frequency of use
  – A non-comparative, long-term study
• Key factors (from cognitive psychology):
  – Worst experience
  – Best experience
  – Most recent experience
• Highly variable effectiveness is undesirable
  – Bad experiences are particularly memorable
Slide adapted from http://www.umiacs.umd.edu/~oard/teaching/690/fall05/syllabus.html (Session 12)

By now you should know…
• Why information retrieval is hard
• Why information retrieval is more than just querying a search engine
• The difference between Boolean and ranked retrieval (and their advantages and disadvantages)
• The basics of evaluating information retrieval systems
• Roughly how IR systems work
• Roughly how Web search engines work

Example IR Interfaces
• Google: keyword in context
• Microsoft Live: query refinement suggestions
• Exalead: faceted refinement
• Vivisimo/Clusty: clustered results
• Kartoo: cluster visualization
• WebBrain: structure visualization
• Grokker: "map view"
• PubMed: related-article search

Hands On: Try Some Search Engines
• Web pages (using spatial layout)
  – http://kartoo.com/
• Multimedia (based on metadata)
  – http://singingfish.com
• Movies (based on recommendations)
  – http://www.movielens.umn.edu
• Grey literature (based on citations)
  – http://citeseer.ist.psu.edu/
• Images (based on image similarity)
  – http://elib.cs.berkeley.edu/photos/blobworld/
  – Link rot, or is the system down?
• Continued on the next slide
Slide from http://www.umiacs.umd.edu/~oard/teaching/690/fall05/syllabus.html (Session 12)

Hands On: Just Play with Them
• Query formulation
  – Is there an advanced search interface? Help? A query refinement prompt?
• Presentation of result sets
  – What information is presented to you, and in what way? One dimension, two or more dimensions, categories?
• Ranking criteria
  – By similarity score? By date? In alphabetical order? It's OK if you cannot tell.

Summary
• Search is a process engaged in by people
• It is a co-design problem
• Human-machine synergy is the key
• Content and behavior offer useful evidence
• Evaluation must consider many factors