Information Retrieval January 14, 2005 Handout #2 (C) 2003, The University of Michigan
Course Information
• Instructor: Dragomir R. Radev ([email protected])
• Office: 3080, West Hall Connector
• Phone: (734) 615-5225
• Office hours: M 11-12 & Th 12-1 or via email
• Course page: http://tangra.si.umich.edu/~radev/650/
• Class meets on Fridays, 2:10-4:55 PM in 409 West Hall

Evaluation

Relevance
• Difficult to change: fuzzy, inconsistent
• Methods: exhaustive, sampling, pooling, search-based

Contingency table

                retrieved     not retrieved
relevant        w             x                n1 = w + x
not relevant    y             z
                n2 = w + y                     N

Precision and Recall
Recall: $\frac{w}{w+x}$
Precision: $\frac{w}{w+y}$

Exercise
Go to Google (www.google.com) and search for documents on Tolkien's "Lord of the Rings". Try different ways of phrasing the query: e.g., Tolkien, "JRR Melville", +"JRR Tolkien" +"Lord of the Rings", etc. For each query, compute the precision (P) based on the first 10 documents returned by Google.
Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like. Try different information needs. Later, try different queries.

n     Doc. no   Relevant?   Recall   Precision
1     588       x           0.2      1.00
2     589       x           0.4      1.00
3     576                   0.4      0.67
4     590       x           0.6      0.75
5     986                   0.6      0.60
6     592       x           0.8      0.67
7     984                   0.8      0.57
8     988                   0.8      0.50
9     578                   0.8      0.44
10    985                   0.8      0.40
11    103                   0.8      0.36
12    591                   0.8      0.33
13    772       x           1.0      0.38
14    990                   1.0      0.36
[From Salton's book]

P/R graph
[Figure: precision (vertical axis, 0 to 1) plotted against recall (horizontal axis, 0 to 1)]
Interpolated average precision (e.g., 11pt)
Interpolation – what is precision at recall=0.5?

Issues
• Why not use accuracy A=(w+z)/N?
• Average precision
• Average P at given "document cutoff values"
• Report when P=R
• F measure: F = (β² + 1)PR / (β²P + R)
• F1 measure: F1 = 2/(1/R+1/P): harmonic mean of P and R
• When do F and F1 report the wrong results?

Kappa
$\kappa = \frac{P(A) - P(E)}{1 - P(E)}$
• N: number of items (index i)
• n: number of categories (index j)
• k: number of annotators
• m_ij: number of annotators who assign item i to category j
$P(A) = \frac{1}{Nk(k-1)} \sum_{i=1}^{N} \sum_{j=1}^{n} m_{ij}^{2} - \frac{1}{k-1}$
$P(E) = \sum_{j=1}^{n} \left( \frac{\sum_{i=1}^{N} m_{ij}}{Nk} \right)^{2}$

Kappa example (from Manning, Schuetze, Raghavan)

        J1+    J1-
J2+     300    10
J2-     20     70

Kappa (cont'd)
• P(A) = 370/400 = 0.925
• P(-) = (10+20+70+70)/800 = 0.2125
• P(+) = (10+20+300+300)/800 = 0.7875
• P(E) = 0.2125 * 0.2125 + 0.7875 * 0.7875 = 0.665
• K = (0.925-0.665)/(1-0.665) = 0.776
• Kappa higher than 0.67 is tentatively acceptable; higher than 0.8 is good

Relevance collections
• TREC ad hoc collections, 2-6 GB
• TREC Web collections, 2-100GB

Sample TREC query
<top>
<num> Number: 305
<title> Most Dangerous Vehicles
<desc> Description:
Which are the most crashworthy, and least crashworthy, passenger vehicles?
<narr> Narrative:
A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example.
</top>

LA031689-0177  FT922-1008     LA090190-0126  LA101190-0218  LA082690-0158
LA112590-0109  FT944-136      LA020590-0119  FT944-5300     LA052190-0048
LA051689-0139  FT944-9371     LA032390-0172  LA042790-0172  LA021790-0136
LA092289-0167  LA111189-0013  LA120189-0179  LA020490-0021  LA122989-0063
LA091389-0119  LA072189-0048  FT944-15615    LA091589-0101  LA021289-0208

<DOCNO> LA031689-0177 </DOCNO>
<DOCID> 31701 </DOCID>
<DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE>
<SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION>
<LENGTH><P>586 words </P></LENGTH>
<HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE>
<BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE>
<TEXT>
<P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P>
<P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P>
<P>Several Fatalities </P>
<P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P>
<P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles.
There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P>
...
</TEXT>
<GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. </P></GRAPHIC>
<SUBJECT>
<P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P>
</SUBJECT>
</DOC>

TREC (cont'd)
• http://trec.nist.gov/tracks.html
• http://trec.nist.gov/presentations/presentations.html

Word distribution models

Shakespeare
• Romeo and Juliet:
And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; You, 289; Thou, 277; Me, 262; Not, 257; With, 234; It, 224; For, 223; This, 215; Be, 207; But, 181; Thy, 167; What, 163; O, 160; As, 156; Her, 150; Will, 147; So, 145; Thee, 139; Love, 135; His, 128; Have, 127; He, 120; Romeo, 115; By, 114; She, 114; Shall, 107; Your, 103; No, 102; Come, 96; Him, 96; All, 92; Do, 89; From, 86; Then, 83; Good, 82; Now, 82; Here, 80; If, 80; An, 78; Go, 76; On, 76; I'll, 71; Death, 69; Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63; Which, 60;
…
A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; Abound'st, 1; Abroach, 1; Absolved, 1; Abuse, 1; Abused, 1; Abuses, 1; Accents, 1; Access, 1; Accident, 1; Accidents, 1; According, 1; Accursed, 1; Accustom'd, 1; Ache, 1; Aches, 1; Aching, 1; Acknowledge, 1; Acquaint, 1; Acquaintance, 1; Acted, 1; Acting, 1; Action, 1; Acts, 1; Adam, 1; Add, 1; Added, 1; Adding, 1; Addle, 1; Adjacent, 1; Admired, 1; Ado, 1; Advance, 1;
Adversary, 1; Adversity's, 1; Advise, 1; Afeard, 1; Affecting, 1; Afflicted, 1; Affliction, 1; Affords, 1; Affray, 1; Affright, 1; Afire, 1; Agate-stone, 1; Agile, 1; Agree, 1; Agrees, 1; Aim'd, 1; Alderman, 1; All-cheering, 1; All-seeing, 1; Alla, 1; Alliance, 1; Alligator, 1; Allow, 1; Ally, 1; Although, 1;
http://www.mta75.org/curriculum/english/Shakes/indexx.html

The BNC (Adam Kilgarriff)

Rank   Frequency   Word   Part of speech
1      6187267     the    det
2      4239632     be     v
3      3093444     of     prep
4      2687863     and    conj
5      2186369     a      det
6      1924315     in     prep
7      1620850     to     infinitive-marker
8      1375636     have   v
9      1090186     it     pron
10     1039323     to     prep
11     887877      for    prep
12     884599      i      pron
13     760399      that   conj
14     695498      you    pron
15     681255      he     pron
16     680739      on     prep
17     675027      with   prep
18     559596      do     v
19     534162      at     prep
20     517171      by     prep

Kilgarriff, A. Putting Frequencies in the Dictionary. International Journal of Lexicography 10 (2) 1997. Pp 135--155

Stop lists
• 250-300 most common words in English account for 50% or more of a given text.
• Example: "the" and "of" represent 10% of tokens. "and", "to", "a", and "in" - another 10%. Next 12 words - another 10%.
• Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%).
• Token/type ratio: 2256/859 = 2.63

Zipf's law
Rank × Frequency ≈ Constant

Rank   Term   Freq.    Z          Rank   Term   Freq.    Z
1      the    69,971   0.070      6      in     21,341   0.128
2      of     36,411   0.073      7      that   10,595   0.074
3      and    28,852   0.086      8      is     10,099   0.081
4      to     26,149   0.104      9      was    9,816    0.088
5      a      23,237   0.116      10     he     9,543    0.095

Zipf's law is fairly general!
• Frequency of accesses to web pages
  – in particular the access counts on the Wikipedia page, with s approximately equal to 0.3
  – page access counts on Polish Wikipedia (data for late July 2003) approximately obey Zipf's law with s about 0.5
• Words in the English language
  – for instance, in Shakespeare's play Hamlet with s approximately 0.5
• Sizes of settlements
• Income distributions amongst individuals
• Size of earthquakes
• Notes in musical performances
http://en.wikipedia.org/wiki/Zipf's_law

Zipf's law (cont'd)
• Limitations:
  – Low and high frequencies
  – Lack of convergence
• Power law with coefficient c = -1
  – $y = k x^{c}$
• Li (1992) – typing words one letter at a time, including spaces

Heaps' law
• Size of vocabulary: $V(n) = K n^{\beta}$
• In English, K is between 10 and 100, β is between 0.4 and 0.6.
[Figure: V(n) plotted against n]
http://en.wikipedia.org/wiki/Heaps%27_law

Heaps' law (cont'd)
• Related to Zipf's law: generative models
• Zipf's and Heaps' law coefficients change with language
Alexander Gelbukh, Grigori Sidorov. Zipf and Heaps Laws' Coefficients Depend on Language. Proc. CICLing-2001, Conference on Intelligent Text Processing and Computational Linguistics, February 18–24, 2001, Mexico City. Lecture Notes in Computer Science N 2004, ISSN 0302-9743, ISBN 3-540-41687-0, Springer-Verlag, pp. 332–335.

Indexing

Methods
• Manual: e.g., Library of Congress subject headings, MeSH
• Automatic

LOC subject headings
A -- GENERAL WORKS
B -- PHILOSOPHY. PSYCHOLOGY. RELIGION
C -- AUXILIARY SCIENCES OF HISTORY
D -- HISTORY (GENERAL) AND HISTORY OF EUROPE
E -- HISTORY: AMERICA
F -- HISTORY: AMERICA
G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION
H -- SOCIAL SCIENCES
J -- POLITICAL SCIENCE
K -- LAW
L -- EDUCATION
M -- MUSIC AND BOOKS ON MUSIC
N -- FINE ARTS
P -- LANGUAGE AND LITERATURE
Q -- SCIENCE
R -- MEDICINE
S -- AGRICULTURE
T -- TECHNOLOGY
U -- MILITARY SCIENCE
V -- NAVAL SCIENCE
Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
http://www.loc.gov/catdir/cpso/lcco/lcco.html

Medicine
CLASS R - MEDICINE
Subclass R
R5-920          Medicine (General)
R5-130.5        General works
R131-687        History of medicine. Medical expeditions
R690-697        Medicine as a profession. Physicians
R702-703        Medicine and the humanities. Medicine and disease in relation to history,
R711-713.97     Directories
R722-722.32     Missionary medicine. Medical missionaries
R723-726        Medical philosophy. Medical ethics
R726.5-726.8    Medicine and disease in relation to psychology. Terminal care. Dying
R727-727.5      Medical personnel and the public. Physician and the public
R728-733        Practice of medicine. Medical practice economics
R735-854        Medical education. Medical schools. Research
R855-855.5      Medical technology
R856-857        Biomedical engineering. Electronics. Instrumentation
R858-859.7      Computer applications to medicine. Medical informatics
R864            Medical records
R895-920        Medical physics. Medical radiology. Nuclear medicine

Finding the most frequent terms in a document
• Typically stop words: the, and, in
• Not content-bearing
• Terms vs. words
• Luhn's method

Luhn's method
[Figure: word frequency curve; axes: FREQUENCY (vertical) and WORDS (horizontal)]

Computing term salience
• Term frequency (TF)
• Document frequency (DF)
• Inverse document frequency (IDF)
$IDF(w) = -\log \frac{DF(w)}{N}$

Scripts to compute tf and idf
cd /clair4/class/ir-w03/tf-idf
./tf.pl 053.txt | sort -nr +1 | more
./tfs.pl 053.txt | sort -nr +1 | more
./stem.pl reasonableness
./build-df.pl
./idf.pl | sort -n +2 | more

Applications of TFIDF
• Cosine similarity
• Indexing
• Clustering

Variants of TF*IDF
• E.g., Okapi (Robertson)
• TF/(k+TF)
• k is from 1 to 2

Vector-based matching
• The cosine measure
$\mathrm{sim}(D,C) = \frac{\sum_{k} d_k \cdot c_k \cdot \mathrm{idf}(k)}{\sqrt{\sum_{k} (d_k)^{2}} \cdot \sqrt{\sum_{k} (c_k)^{2}}}$

IDF: Inverse document frequency
TF * IDF is used for automated indexing and for topic discrimination:
N: number of documents
d_k: number of documents containing term k
f_ik: absolute frequency of term k in document i
w_ik: weight of term k in document i
idf_k = log2(N/d_k) + 1 = log2 N - log2 d_k + 1

Asian and European news

622.941   deng           97.487   nato
306.835   china          92.151   albright
196.725   beijing        74.652   belgrade
153.608   chinese        46.657   enlargement
152.113   xiaoping       34.778   alliance
124.591   jiang          34.778   french
108.777   communist      33.803   opposition
102.894   body           32.571   russia
85.173    party          14.095   government
71.898    died           9.389    told
68.820    leader         9.154    would
43.402    state          8.459    their
38.166    people         6.059    which

Other topics

120.385   shuttle        74.652   compuserve
99.487    space          65.321   massey
90.128    telescope      55.989   salizzoni
70.224    hubble         29.996   bob
59.992    rocket         27.994   online
50.160    astronauts     27.198   executive
49.722    discovery      15.890   interim
47.782    canaveral      15.271   chief
47.782    cape           11.647   service
40.889    mission        11.174   second
35.778    florida        6.781    world
27.063    center         6.315    president

Software
• KEA: http://www.nzdl.org/Kea/
• Example:
  – Paper: "Protocols for secure, atomic transaction execution in electronic commerce"
  – Author: anonymity, atomicity, auction, electronic commerce, privacy, real-time, security, transaction
  – Kea: atomicity, auction, customer, electronic commerce, intruder, merchant, protocol, security, third party, transaction
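
The salient-term lists on the preceding slides come from weighting each term by TF * IDF. As a rough illustration only, here is a minimal Python sketch of that weighting, using the slide formula idf_k = log2(N/d_k) + 1. It is not the course's tf.pl / idf.pl scripts and not Kea's algorithm; the function names, the whitespace tokenizer, and the four toy documents are all assumptions made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def idf_table(docs):
    """idf_k = log2(N / d_k) + 1, where d_k = number of documents containing term k."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it occurs there.
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log2(n_docs / df) + 1 for term, df in doc_freq.items()}


def top_terms(doc, idf, k=3):
    """Rank the terms of one document by tf * idf, highest weight first."""
    tf = Counter(tokenize(doc))
    weights = {term: count * idf[term] for term, count in tf.items()}
    # Break ties alphabetically so the ranking is deterministic.
    return sorted(weights.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical toy collection, invented for illustration only.
docs = [
    "the shuttle carried the telescope into space",
    "the alliance discussed the enlargement of the alliance",
    "the telescope observed the distant galaxy",
    "the market closed higher on the news",
]
idf = idf_table(docs)
for term, weight in top_terms(docs[1], idf):
    print(f"{weight:.3f}  {term}")
```

In this toy collection "alliance" ranks first for the second document (tf = 2, idf = 3). Note that with the "+ 1" in the formula, a word that occurs in every document still gets idf = 1 rather than 0, so frequent stop words such as "the" are damped but not eliminated; a stop list would still be needed to remove them.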