Information Retrieval January 14, 2005 Handout #2 (C) 2003, The University of Michigan


Course Information
• Instructor: Dragomir R. Radev ([email protected])
• Office: 3080, West Hall Connector
• Phone: (734) 615-5225
• Office hours: M 11-12 & Th 12-1 or via email
• Course page: http://tangra.si.umich.edu/~radev/650/
• Class meets on Fridays, 2:10-4:55 PM in 409 West Hall
Evaluation
Relevance
• Difficult to change: fuzzy, inconsistent
• Methods: exhaustive, sampling, pooling, search-based
Contingency table
                retrieved     not retrieved
relevant        w             x                n1 = w + x
not relevant    y             z
                n2 = w + y                     N
Precision and Recall
Recall: R = w / (w + x)
Precision: P = w / (w + y)
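A minimal sketch of the two definitions, using the cell names w, x, y, z from the contingency table. The class name and the counts are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass
class Contingency:
    """Cells of the contingency table, named as on the slide."""
    w: int  # relevant and retrieved
    x: int  # relevant but not retrieved
    y: int  # not relevant but retrieved
    z: int  # not relevant and not retrieved

    def recall(self) -> float:
        # Share of the relevant documents that were retrieved.
        return self.w / (self.w + self.x)

    def precision(self) -> float:
        # Share of the retrieved documents that are relevant.
        return self.w / (self.w + self.y)


# Hypothetical counts, chosen only for illustration.
table = Contingency(w=30, x=20, y=10, z=940)
print(f"recall = {table.recall():.2f}, precision = {table.precision():.2f}")
```

Note that z, usually by far the largest cell, plays no role in either measure.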
Exercise
Go to Google (www.google.com) and search for documents on
Tolkien’s “Lord of the Rings”. Try different ways of phrasing
the query: e.g., Tolkien, “JRR Tolkien”, +“JRR Tolkien”
+“Lord of the Rings”, etc. For each query, compute the precision
(P) based on the first 10 documents returned by Google.
Note! Before starting the exercise, have a clear idea of what a
relevant document for your query should look like. Try
different information needs.
Later, try different queries.
n    Doc. no   Relevant?   Recall   Precision
1    588       x           0.2      1.00
2    589       x           0.4      1.00
3    576                   0.4      0.67
4    590       x           0.6      0.75
5    986                   0.6      0.60
6    592       x           0.8      0.67
7    984                   0.8      0.57
8    988                   0.8      0.50
9    578                   0.8      0.44
10   985                   0.8      0.40
11   103                   0.8      0.36
12   591                   0.8      0.33
13   772       x           1.0      0.38
14   990                   1.0      0.36
[From Salton’s book]
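A short sketch of how the Recall and Precision columns of this ranked list can be computed. The set of relevant documents (588, 589, 590, 592, 772) is read off from the ranks at which recall increases; the function name is illustrative.

```python
# Ranked list and relevance judgments from the 14-document example.
ranked_docs = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
relevant = {588, 589, 590, 592, 772}


def recall_precision_at_each_rank(ranking, relevant_docs):
    """Return one (rank, doc, is_relevant, recall, precision) tuple per rank."""
    rows = []
    hits = 0  # relevant documents seen so far
    for rank, doc in enumerate(ranking, start=1):
        is_rel = doc in relevant_docs
        if is_rel:
            hits += 1
        rows.append((rank, doc, is_rel, hits / len(relevant_docs), hits / rank))
    return rows


rows = recall_precision_at_each_rank(ranked_docs, relevant)
for rank, doc, is_rel, r, p in rows:
    print(f"{rank:2d}  {doc:4d}  {'x' if is_rel else ' '}  {r:.1f}  {p:.2f}")
```

Recall never decreases down the ranking, while precision drops at every non-relevant document.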
P/R graph
[Figure: precision (vertical axis, 0 to 1) plotted against recall (horizontal axis, 0 to 1)]
Interpolated average precision (e.g., 11pt)
Interpolation – what is precision at recall=0.5?
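One way to answer the interpolation question, sketched under the usual convention that interpolated precision at recall level r is the highest precision observed at any recall of at least r. The slides do not spell out the rule, so treat the convention and the function names as assumptions. The (recall, precision) points come from the 14-document example.

```python
def interpolated_precision(points, recall_level):
    """Highest precision at any recall >= recall_level (0.0 if there is none)."""
    candidates = [p for r, p in points if r >= recall_level]
    return max(candidates, default=0.0)


def eleven_point_average(points):
    """Mean interpolated precision over recall levels 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(points, lvl) for lvl in levels) / len(levels)


# (recall, precision) at the ranks where a relevant document was found
# (ranks 1, 2, 4, 6 and 13), kept unrounded.
pr_points = [(0.2, 1 / 1), (0.4, 2 / 2), (0.6, 3 / 4), (0.8, 4 / 6), (1.0, 5 / 13)]

p_at_half = interpolated_precision(pr_points, 0.5)
avg_11pt = eleven_point_average(pr_points)
print(p_at_half, round(avg_11pt, 3))
```

Only the ranks with a relevant document are needed, because precision at a non-relevant rank is always lower than at the preceding relevant rank with the same recall.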
Issues
• Why not use accuracy A=(w+z)/N?
• Average precision
• Average P at given “document cutoff values”
• Report when P=R
• F measure: F = (b^2 + 1)PR / (b^2 P + R)
• F1 measure: F1 = 2/(1/R + 1/P): harmonic mean of P and R
• When do F and F1 report the wrong results?
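A small sketch of the F measure and of the accuracy question above. The collection sizes are hypothetical; the point is only that accuracy is dominated by z, the large count of non-relevant documents that were not retrieved.

```python
def f_measure(precision, recall, b=1.0):
    """F = (b^2 + 1) P R / (b^2 P + R); b = 1 gives the harmonic mean F1."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid 0/0 when nothing relevant was retrieved
    return (b * b + 1) * precision * recall / (b * b * precision + recall)


def accuracy(w, x, y, z):
    """A = (w + z) / N with N = w + x + y + z."""
    return (w + z) / (w + x + y + z)


# Hypothetical collection: 10 relevant documents out of 1000, nothing retrieved.
empty_result_accuracy = accuracy(w=0, x=10, y=0, z=990)
f1 = f_measure(0.75, 0.6)
print(empty_result_accuracy, round(f1, 3))
```

A system that retrieves nothing scores very high on accuracy here, while its recall is zero.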
Kappa
$$\kappa = \frac{P(A) - P(E)}{1 - P(E)}$$
• N: number of items (index i)
• n: number of categories (index j)
• k: number of annotators
• m_ij: number of annotators who assign item i to category j
$$P(E) = \sum_{j=1}^{n} \left( \frac{\sum_{i=1}^{N} m_{ij}}{Nk} \right)^2$$
$$P(A) = \frac{1}{Nk(k-1)} \sum_{i=1}^{N} \sum_{j=1}^{n} m_{ij}^2 - \frac{1}{k-1}$$
Kappa example (from Manning, Schuetze, Raghavan)
        J1+    J1-
J2+     300    10
J2-     20     70
Kappa (cont’d)
• P(A) = 370/400 = 0.925
• P(-) = (10+20+70+70)/800 = 0.2125
• P(+) = (10+20+300+300)/800 = 0.7875
• P(E) = 0.2125 * 0.2125 + 0.7875 * 0.7875 = 0.665
• K = (0.925-0.665)/(1-0.665) = 0.776
• Kappa higher than 0.67 is tentatively acceptable; higher than 0.8 is good
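The calculation can be checked with a short sketch of the general kappa formulas, where m[i][j] is taken to be the number of annotators who put item i into category j. The function name and the way the 400 judged items are encoded are choices made for this illustration.

```python
def kappa(m, k):
    """Kappa for an N x n matrix m of per-item category counts and k annotators.

    Returns (kappa, P(A), P(E)).
    """
    N = len(m)
    n = len(m[0])
    # Chance agreement: squared pooled proportion of each category.
    p_e = sum((sum(row[j] for row in m) / (N * k)) ** 2 for j in range(n))
    # Observed agreement.
    p_a = sum(v * v for row in m for v in row) / (N * k * (k - 1)) - 1 / (k - 1)
    return (p_a - p_e) / (1 - p_e), p_a, p_e


# Two judges, categories (+, -): 300 items judged + by both,
# 70 judged - by both, 10 + 20 = 30 items on which they disagree.
items = [[2, 0]] * 300 + [[0, 2]] * 70 + [[1, 1]] * 30
k_value, p_a, p_e = kappa(items, k=2)
print(round(p_a, 3), round(p_e, 3), round(k_value, 3))
```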
Relevance collections
• TREC ad hoc collections, 2-6 GB
• TREC Web collections, 2-100GB
Sample TREC query
<top>
<num> Number: 305
<title> Most Dangerous Vehicles
<desc> Description:
Which are the most crashworthy, and least crashworthy,
passenger vehicles?
<narr> Narrative:
A relevant document will contain information on the
crashworthiness of a given vehicle or vehicles that can be
used to draw a comparison with other vehicles. The
document will have to describe/compare vehicles, not
drivers. For instance, it should be expected that vehicles
preferred by 16-25 year-olds would be involved in more
crashes, because that age group is involved in more crashes.
I would view number of fatalities per 100 crashes to be more
revealing of a vehicle's crashworthiness than the number of
crashes per 100,000 miles, for example.
</top>
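A hedged sketch of reading such a topic into fields. It assumes only what the sample shows: fields start with the tags <num>, <title>, <desc> and <narr>, the tags are not closed, and each field runs until the next tag or </top>. The function name is made up for this example, and the narrative is shortened.

```python
import re

# A field starts at one of the four tags and ends right before the next
# tag or the closing </top>.
FIELD_RE = re.compile(
    r"<(num|title|desc|narr)>(.*?)(?=<(?:num|title|desc|narr)>|</top>)", re.S
)


def parse_topic(text):
    """Return a dict mapping field name to whitespace-normalized content."""
    fields = {}
    for name, body in FIELD_RE.findall(text):
        # Drop the leading "Number:", "Description:", "Narrative:" labels.
        body = re.sub(r"^\s*(Number|Description|Narrative):", "", body)
        fields[name] = " ".join(body.split())
    return fields


# Topic 305 from the slide, with the narrative shortened.
topic_text = """<top>
<num> Number: 305
<title> Most Dangerous Vehicles
<desc> Description:
Which are the most crashworthy, and least crashworthy,
passenger vehicles?
<narr> Narrative:
A relevant document will contain information on the
crashworthiness of a given vehicle or vehicles.
</top>"""

topic = parse_topic(topic_text)
print(topic["num"], "|", topic["title"])
```

Experiments often use only the title field as a short query and the description or narrative as longer variants.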
LA031689-0177
FT922-1008
LA090190-0126
LA101190-0218
LA082690-0158
LA112590-0109
FT944-136
LA020590-0119
FT944-5300
LA052190-0048
LA051689-0139
FT944-9371
LA032390-0172
LA042790-0172
LA021790-0136
LA092289-0167
LA111189-0013
LA120189-0179
LA020490-0021
LA122989-0063
LA091389-0119
LA072189-0048
FT944-15615
LA091589-0101
LA021289-0208
<DOC>
<DOCNO> LA031689-0177 </DOCNO>
<DOCID> 31701 </DOCID>
<DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE>
<SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION>
<LENGTH><P>586 words </P></LENGTH>
<HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE>
<BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE>
<TEXT>
<P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over
accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P>
<P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the
Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after
Consumer Reports magazine charged that the vehicle had basic design flaws. </P>
<P>Several Fatalities </P>
<P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs,
particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation
conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P>
<P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle
roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving
the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the
accident report, NHTSA declined to investigate the Samurai. </P>
...
</TEXT>
<GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher
number of single-vehicle, first event roll-overs," a federal official
said. </P></GRAPHIC>
<SUBJECT>
<P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS;
RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P>
</SUBJECT>
</DOC>
TREC (cont’d)
• http://trec.nist.gov/tracks.html
• http://trec.nist.gov/presentations/presentations.html
Word distribution models
Shakespeare
• Romeo and Juliet:
And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; You, 289; Thou, 277; Me,
262; Not, 257; With, 234; It, 224; For, 223; This, 215; Be, 207; But, 181; Thy, 167; What, 163; O, 160; As, 156; Her,
150; Will, 147; So, 145; Thee, 139; Love, 135; His, 128; Have, 127; He, 120; Romeo, 115; By, 114; She, 114; Shall,
107; Your, 103; No, 102; Come, 96; Him, 96; All, 92; Do, 89; From, 86; Then, 83; Good, 82; Now, 82; Here, 80; If,
80; An, 78; Go, 76; On, 76; I'll, 71; Death, 69; Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There,
64; Hath, 63; Which, 60;
…
A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; Abound'st, 1; Abroach, 1;
Absolved, 1; Abuse, 1; Abused, 1; Abuses, 1; Accents, 1; Access, 1; Accident, 1; Accidents, 1; According, 1;
Accursed, 1; Accustom'd, 1; Ache, 1; Aches, 1; Aching, 1; Acknowledge, 1; Acquaint, 1; Acquaintance, 1; Acted, 1;
Acting, 1; Action, 1; Acts, 1; Adam, 1; Add, 1; Added, 1; Adding, 1; Addle, 1; Adjacent, 1; Admired, 1; Ado, 1;
Advance, 1; Adversary, 1; Adversity's, 1; Advise, 1; Afeard, 1; Affecting, 1; Afflicted, 1; Affliction, 1; Affords, 1;
Affray, 1; Affright, 1; Afire, 1; Agate-stone, 1; Agile, 1; Agree, 1; Agrees, 1; Aim'd, 1; Alderman, 1; All-cheering, 1;
All-seeing, 1; Alla, 1; Alliance, 1; Alligator, 1; Allow, 1; Ally, 1; Although, 1;
http://www.mta75.org/curriculum/english/Shakes/indexx.html
The BNC (Adam Kilgarriff)
Rank Frequency Word POS
1 6187267 the det
2 4239632 be v
3 3093444 of prep
4 2687863 and conj
5 2186369 a det
6 1924315 in prep
7 1620850 to infinitive-marker
8 1375636 have v
9 1090186 it pron
10 1039323 to prep
11 887877 for prep
12 884599 i pron
13 760399 that conj
14 695498 you pron
15 681255 he pron
16 680739 on prep
17 675027 with prep
18 559596 do v
19 534162 at prep
20 517171 by prep
Kilgarriff, A. Putting Frequencies in the Dictionary. International Journal of Lexicography 10 (2) 1997. Pp 135--155
Stop lists
• 250-300 most common words in English
account for 50% or more of a given text.
• Example: “the” and “of” represent 10% of
tokens. “and”, “to”, “a”, and “in” - another
10%. Next 12 words - another 10%.
• Moby Dick Ch.1: 859 unique words (types),
2256 word occurrences (tokens). Top 65
types cover 1132 tokens (> 50%).
• Token/type ratio: 2256/859 = 2.63
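The type and token counts above can be reproduced for any text with a few lines. This is a minimal sketch with a deliberately naive tokenizer (lowercased runs of letters and apostrophes) and a made-up sample sentence; a real run would read a chapter from a file, and the counts depend on the tokenizer.

```python
import re
from collections import Counter


def token_stats(text, top_k):
    """Return (types, tokens, token/type ratio, share of tokens covered by the top_k types)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(top_k))
    return len(counts), len(tokens), len(tokens) / len(counts), covered / len(tokens)


# A tiny made-up sample, used only to show the mechanics.
sample = "the cat and the dog and the bird saw the cat"
types, tokens, ratio, coverage = token_stats(sample, top_k=2)
print(types, tokens, round(ratio, 2), round(coverage, 2))
```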
Zipf’s law
Rank × Frequency ≈ Constant
Rank  Term  Freq.   Z        Rank  Term  Freq.   Z
1     the   69,971  0.070    6     in    21,341  0.128
2     of    36,411  0.073    7     that  10,595  0.074
3     and   28,852  0.086    8     is    10,099  0.081
4     to    26,149  0.104    9     was   9,816   0.088
5     a     23,237  0.116    10    he    9,543   0.095
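A quick check of the rank × frequency claim on these ten rows, reading the frequency of “to” as 26,149. The Z column appears to be rank × frequency divided by the corpus size; the slide does not state that size, so the value of one million tokens below is an assumption and the printed ratios may differ from Z in the last digit.

```python
# (rank, term, frequency) rows from the Zipf table.
zipf_rows = [
    (1, "the", 69971), (2, "of", 36411), (3, "and", 28852), (4, "to", 26149),
    (5, "a", 23237), (6, "in", 21341), (7, "that", 10595), (8, "is", 10099),
    (9, "was", 9816), (10, "he", 9543),
]

CORPUS_SIZE = 1_000_000  # assumption: not stated on the slide

products = [rank * freq for rank, _, freq in zipf_rows]
for (rank, term, freq), prod in zip(zipf_rows, products):
    print(f"{rank:2d} {term:5s} {freq:6d} {prod / CORPUS_SIZE:.3f}")

# How much the raw frequencies vary versus how much rank x frequency varies.
freq_spread = zipf_rows[0][2] / zipf_rows[-1][2]
product_spread = max(products) / min(products)
print(f"frequency spread: {freq_spread:.2f}, rank x frequency spread: {product_spread:.2f}")
```

The frequencies differ by a factor of about seven across the ten ranks, while rank × frequency stays within a factor of two.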
Zipf's law is fairly general!
• Frequency of accesses to web pages
• in particular the access counts on the Wikipedia page,
with s approximately equal to 0.3
• page access counts on Polish Wikipedia (data for late July 2003)
approximately obey Zipf's law with s about 0.5
• Words in the English language
• for instance, in Shakespeare’s play Hamlet with s approximately 0.5
• Sizes of settlements
• Income distributions amongst individuals
• Size of earthquakes
• Notes in musical performances
http://en.wikipedia.org/wiki/Zipf's_law
Zipf’s law (cont’d)
• Limitations:
– Low and high frequencies
– Lack of convergence
• Power law with coefficient c = -1
– y = kx^c
• Li (1992): random typing, one character at a time with spaces included, also yields a Zipf-like distribution
Heaps’ law
• Size of vocabulary: V(n) = K n^β
• In English, K is between 10 and 100, β is between 0.4 and 0.6.
[Figure: vocabulary size V(n) plotted against text length n]
http://en.wikipedia.org/wiki/Heaps%27_law
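A small sketch of what the vocabulary-size formula predicts. The values K = 50 and β = 0.5 are illustrative picks from the ranges quoted for English, not fitted constants, and n is taken to be the number of tokens read so far.

```python
def heaps_vocabulary(n, K, beta):
    """Predicted number of distinct words after n tokens: V(n) = K * n**beta."""
    return K * n ** beta


# K and beta are illustrative picks from the ranges quoted for English.
K, beta = 50, 0.5
for n in (10_000, 1_000_000, 100_000_000):
    print(n, round(heaps_vocabulary(n, K, beta)))
```

With β = 0.5, a text 100 times longer has a vocabulary only 10 times larger, which is why new words keep appearing but ever more slowly.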
Heaps’ law (cont’d)
• Related to Zipf’s law: generative models
• Zipf’s and Heaps’ law coefficients change with language
Alexander Gelbukh, Grigori Sidorov. Zipf and Heaps Laws’ Coefficients Depend on Language. Proc.
CICLing-2001, Conference on Intelligent Text Processing and Computational Linguistics,
February 18–24, 2001, Mexico City. Lecture Notes in Computer Science N 2004,
ISSN 0302-9743, ISBN 3-540-41687-0, Springer-Verlag, pp. 332–335.
Indexing
Methods
• Manual: e.g., Library of Congress subject
headings, MeSH
• Automatic
LOC subject headings
A -- GENERAL WORKS
B -- PHILOSOPHY. PSYCHOLOGY. RELIGION
C -- AUXILIARY SCIENCES OF HISTORY
D -- HISTORY (GENERAL) AND HISTORY OF EUROPE
E -- HISTORY: AMERICA
F -- HISTORY: AMERICA
G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION
H -- SOCIAL SCIENCES
J -- POLITICAL SCIENCE
K -- LAW
L -- EDUCATION
M -- MUSIC AND BOOKS ON MUSIC
N -- FINE ARTS
P -- LANGUAGE AND LITERATURE
Q -- SCIENCE
R -- MEDICINE
S -- AGRICULTURE
T -- TECHNOLOGY
U -- MILITARY SCIENCE
V -- NAVAL SCIENCE
Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
http://www.loc.gov/catdir/cpso/lcco/lcco.html
Medicine
CLASS R - MEDICINE
Subclass R
R5-920        Medicine (General)
R5-130.5      General works
R131-687      History of medicine. Medical expeditions
R690-697      Medicine as a profession. Physicians
R702-703      Medicine and the humanities. Medicine and disease in relation to history,
R711-713.97   Directories
R722-722.32   Missionary medicine. Medical missionaries
R723-726      Medical philosophy. Medical ethics
R726.5-726.8  Medicine and disease in relation to psychology. Terminal care. Dying
R727-727.5    Medical personnel and the public. Physician and the public
R728-733      Practice of medicine. Medical practice economics
R735-854      Medical education. Medical schools. Research
R855-855.5    Medical technology
R856-857      Biomedical engineering. Electronics. Instrumentation
R858-859.7    Computer applications to medicine. Medical informatics
R864          Medical records
R895-920      Medical physics. Medical radiology. Nuclear medicine
Finding the most frequent terms in a document
• Typically stop words: the, and, in
• Not content-bearing
• Terms vs. words
• Luhn’s method
Luhn’s method
[Figure: frequency (vertical axis) plotted against words (horizontal axis); the plot carries the label E]
Computing term salience
• Term frequency (TF)
• Document frequency (DF)
• Inverse document frequency (IDF)
IDF(w) = -log(DF(w) / N)
Scripts to compute tf and idf
cd /clair4/class/ir-w03/tf-idf
./tf.pl 053.txt | sort -nr -k 2 | more
./tfs.pl 053.txt | sort -nr -k 2 | more
./stem.pl reasonableness
./build-df.pl
./idf.pl | sort -n -k 3 | more
Applications of TFIDF
• Cosine similarity
• Indexing
• Clustering
Variants of TF*IDF
• E.g., Okapi (Robertson)
• TF/(k+TF)
• k is from 1 to 2
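A tiny sketch of the TF/(k+TF) component, to show why it is used in place of raw TF: it grows with TF but levels off below 1. This is only the term-frequency part named on the slide, not the full Okapi weighting, and k = 1.5 is an arbitrary pick from the stated range.

```python
def saturated_tf(tf, k=1.5):
    """Term-frequency component TF / (k + TF); approaches 1 as TF grows."""
    return tf / (k + tf)


# k = 1.5 is an arbitrary pick from the 1-to-2 range given on the slide.
saturation = [(tf, saturated_tf(tf)) for tf in (1, 2, 5, 10, 100)]
for tf, value in saturation:
    print(tf, round(value, 3))
```

Going from 1 to 2 occurrences changes the score far more than going from 10 to 100.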
Vector-based matching
• The cosine measure
$$\mathrm{sim}(D, C) = \frac{\sum_k d_k \cdot c_k \cdot \mathrm{idf}(k)}{\sqrt{\sum_k (d_k)^2} \cdot \sqrt{\sum_k (c_k)^2}}$$
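A minimal sketch of the idf-weighted cosine measure, with d_k and c_k stored as dictionaries keyed by term and the usual cosine normalization by the two vector lengths. The term weights and the uniform idf values are hypothetical.

```python
import math


def idf_weighted_cosine(d, c, idf):
    """sim(D, C) = sum_k d_k c_k idf(k) / (sqrt(sum_k d_k^2) * sqrt(sum_k c_k^2)).

    d, c and idf are dicts keyed by term. Terms missing from d or c count as 0;
    a term missing from idf gets idf 1.0, an arbitrary default for this sketch.
    """
    shared = set(d) & set(c)  # only shared terms contribute to the numerator
    numerator = sum(d[k] * c[k] * idf.get(k, 1.0) for k in shared)
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    norm_c = math.sqrt(sum(v * v for v in c.values()))
    if norm_d == 0 or norm_c == 0:
        return 0.0
    return numerator / (norm_d * norm_c)


# Hypothetical term weights for a document D and a second vector C.
D = {"nato": 3.0, "alliance": 1.0}
C = {"nato": 2.0, "russia": 2.0}
unit_idf = {"nato": 1.0, "alliance": 1.0, "russia": 1.0}
similarity = idf_weighted_cosine(D, C, unit_idf)
print(round(similarity, 3))
```

With all idf values equal to 1 this reduces to the plain cosine of the angle between the two vectors.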
IDF: Inverse document frequency
TF * IDF is used for automated indexing and for topic
discrimination:
N: number of documents
d_k: number of documents containing term k
f_ik: absolute frequency of term k in document i
w_ik: weight of term k in document i
idf_k = log2(N/d_k) + 1 = log2(N) - log2(d_k) + 1
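A short sketch of the idf formula with hypothetical counts. The slide defines w_ik without giving its formula; the product f_ik * idf_k used below is the plain TF * IDF reading and is an assumption of this example.

```python
import math


def idf(N, d_k):
    """idf_k = log2(N / d_k) + 1."""
    return math.log2(N / d_k) + 1


def tf_idf_weights(term_freqs, doc_freqs, N):
    """w_ik = f_ik * idf_k for every term k of one document i (assumed weighting)."""
    return {k: f * idf(N, doc_freqs[k]) for k, f in term_freqs.items()}


# Hypothetical counts: a collection of N = 1024 documents.
N = 1024
doc_freqs = {"the": 1024, "shuttle": 16}   # d_k
term_freqs = {"the": 20, "shuttle": 4}     # f_ik for one document
weights = tf_idf_weights(term_freqs, doc_freqs, N)
print(weights)
```

A term that occurs in every document keeps idf 1, so a frequent but rare-across-documents term such as the hypothetical "shuttle" can outweigh a much more frequent stop word.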
Asian and European news
622.941  deng          97.487  nato
306.835  china         92.151  albright
196.725  beijing       74.652  belgrade
153.608  chinese       46.657  enlargement
152.113  xiaoping      34.778  alliance
124.591  jiang         34.778  french
108.777  communist     33.803  opposition
102.894  body          32.571  russia
85.173   party         14.095  government
71.898   died          9.389   told
68.820   leader        9.154   would
43.402   state         8.459   their
38.166   people        6.059   which
Other topics
120.385  shuttle       74.652  compuserve
99.487   space         65.321  massey
90.128   telescope     55.989  salizzoni
70.224   hubble        29.996  bob
59.992   rocket        27.994  online
50.160   astronauts    27.198  executive
49.722   discovery     15.890  interim
47.782   canaveral     15.271  chief
47.782   cape          11.647  service
40.889   mission       11.174  second
35.778   florida       6.781   world
27.063   center        6.315   president
Software
• KEA: http://www.nzdl.org/Kea/
• Example:
– Paper: “Protocols for secure, atomic transaction
execution in electronic commerce”
– Author: anonymity, atomicity, auction, electronic
commerce, privacy, real-time, security, transaction
– Kea: atomicity, auction, customer, electronic commerce,
intruder, merchant, protocol, security, third party,
transaction