
CS 430: Information Discovery
Lecture 2
Introduction to Text-Based Information Retrieval
Course Administration
• Please send all questions about the course to:
[email protected]
The message will be sent to:
[email protected] (Bill Arms)
[email protected] (Manpreet Singh)
[email protected] (Sid Anand)
[email protected] (Martin Guerrero)
Course Administration
Programming in Perl
Assignments 2, 3 and 4 require programs to be written in Perl.
An introduction to programming in Perl will be given at 7:30 p.m.
on Wednesdays, September 19 and October 3.
These classes are optional. There will not be regular discussion
classes on these dates.
Materials about Perl and further information about these classes
will be posted on the course web site.
Course Administration
Discussion class, Wednesday, September 4
Read and be prepared to discuss:
Harman, D., Fox, E., Baeza-Yates, R.A., Inverted files.
(Frakes and Baeza-Yates, Chapter 3)
Phillips Hall 101, 7:30 to 8:30 p.m.
Classical Information Retrieval
[Diagram: the space of information discovery, organized by media type (text; image, video, audio, etc.) and by approach: searching (statistical methods or natural language processing, CS 474), browsing with the user in the loop, linking, and catalogs/indexes (metadata). Related material is covered in CS 502.]
Recall and Precision
If information retrieval were perfect ...
Every hit would be relevant to the original query, and every
relevant item in the body of information would be found.
Precision: percentage of the hits that are relevant, the extent
to which the set of hits retrieved by a query satisfies
the requirement that generated the query.
Recall: percentage of the relevant items that are found by the
query, the extent to which the query found all the
items that satisfy the requirement.
Recall and Precision: Example
• Collection of 10,000 documents, 50 on a specific topic
• Ideal search finds these 50 documents and rejects all others
• Actual search identifies 25 documents; 20 are relevant but 5 are on other topics
• Precision: 20/25 = 0.8
• Recall: 20/50 = 0.4
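The arithmetic above can be checked with a few lines of Perl (the language used for the course assignments). This is only an illustration of the two definitions; the counts are taken from the example.

#!/usr/bin/perl
use strict;
use warnings;

# Counts from the example above.
my $relevant_in_collection = 50;   # relevant documents in the collection
my $retrieved              = 25;   # documents returned by the search
my $relevant_retrieved     = 20;   # returned documents that are relevant

my $precision = $relevant_retrieved / $retrieved;               # 20/25 = 0.8
my $recall    = $relevant_retrieved / $relevant_in_collection;  # 20/50 = 0.4

printf "Precision: %.2f\n", $precision;
printf "Recall:    %.2f\n", $recall;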
Measuring Precision and Recall
Precision is easy to measure:
• A knowledgeable person looks at each document that is identified and decides whether it is relevant.
• In the example, only the 25 documents that are found need to be examined.
Recall is difficult to measure:
• To know all relevant items, a knowledgeable person must go through the entire collection, looking at every object to decide if it fits the criteria.
• In the example, all 10,000 documents must be examined.
Relevance and Ranking
Precision and recall assume that a document is either relevant to a
query or not relevant.
Often a user will consider a document to be partially relevant.
Ranking methods: measure the degree of similarity between a
query and a document.
[Diagram: Requests and Documents linked by a similarity measure.
Similar: How similar is a document to a request?]
Documents
A textual document is a digital object consisting of a sequence of
words and other symbols, e.g., punctuation.
The individual words and other symbols are known as tokens or
terms.
A textual document can be:
• Free text, also known as unstructured text, which is a continuous sequence of tokens.
• Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup.
[Methods of markup, e.g., XML, are covered in CS 502.]
Word Frequency
Observation: Some words are more common than others.
Statistics: Most large collections of text documents have
similar statistical characteristics. These statistics:
• influence the effectiveness and efficiency of the data structures used to index documents
• are relied on by many retrieval models
The following example is taken from:
Jamie Callan, Characteristics of Text, 1997
http://hobart.cs.umass.edu/~allan/cs646-f97/char_of_text.html
Rank Frequency Distribution
For all the words in a collection of documents, for each word w:
f(w) is the frequency with which w appears
r(w) is the rank of w in order of frequency, e.g., the most commonly occurring word has rank 1
[Plot of frequency f against rank r; a word w with rank r has frequency f.]
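As an illustration, the Perl sketch below computes f(w) and r(w) for whatever text files it is given (e.g., perl rankfreq.pl *.txt). The tokenization, lower-casing and splitting on non-letters, is a simplifying assumption, not the procedure used to produce the table that follows.

#!/usr/bin/perl
use strict;
use warnings;

# Count word frequencies f(w) over all the input documents.
my %f;
while (my $line = <>) {
    # Crude tokenization: lower-case, split on anything that is not a letter.
    for my $w (split /[^a-z]+/, lc $line) {
        $f{$w}++ if length $w;
    }
}

# Rank words by decreasing frequency; the most common word has rank 1.
my @by_freq = sort { $f{$b} <=> $f{$a} } keys %f;
my $top = @by_freq < 20 ? scalar @by_freq : 20;
for my $r (1 .. $top) {
    my $w = $by_freq[$r - 1];
    printf "%4d  %-15s %8d\n", $r, $w, $f{$w};
}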
The 50 most frequent words in the example collection (word, frequency f):

word        f        word        f        word        f
the     1130021      from      96900      or        54958
of       547311      he        94585      about     53713
to       516635      million   93515      market    52110
a        464736      year      90104      they      51359
in       390819      its       86774      this      50933
and      387703      be        85588      would     50828
that     204351      was       83398      you       49281
for      199340      company   83070      which     48273
is       152483      an        76974      bank      47940
said     148302      has       74405      stock     47401
it       134323      are       74097      trade     47310
on       121173      have      73132      his       47116
by       118863      but       71887      more      46244
as       109135      will      71494      who       42142
at       101779      say       66807      one       41635
mr       101679      new       64456      their     40910
with     101210      share     63925
Zipf's Law
If the words, w, in a collection are ranked, r(w), by their frequency, f(w), they roughly fit the relation:

    r(w) * f(w) = c

Different collections have different constants c. In English text, c tends to be about n/10, where n is the total number of word occurrences in the collection.
For a weird but wonderful discussion of this and many other
examples of naturally occurring rank frequency distributions, see:
Zipf, G. K., Human Behaviour and the Principle of Least Effort.
Addison-Wesley, 1949
Values of 1000*rf/n for the 50 most frequent words:

word    1000*rf/n    word      1000*rf/n    word     1000*rf/n
the         59       from          92       or          101
of          58       he            95       about       102
to          82       million       98       market      101
a           98       year         100       they        103
in         103       its          100       this        105
and        122       be           104       would       107
that        75       was          105       you         106
for         84       company      109       which       107
is          72       an           105       bank        109
said        78       has          106       stock       110
it          78       are          109       trade       112
on          77       have         112       his         114
by          81       but          114       more        114
as          80       will         117       who         106
at          80       say          113       one         107
mr          86       new          112       their       108
with        91       share        114
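A Perl sketch of the computation behind this table, using the same crude tokenization as before: count every word occurrence, rank the words by frequency, and print 1000*r*f/n for the most frequent ones. If Zipf's Law holds, the printed values stay roughly constant.

#!/usr/bin/perl
use strict;
use warnings;

# n is the total number of word occurrences; %f holds f(w) for each word.
my %f;
my $n = 0;
while (my $line = <>) {
    for my $w (split /[^a-z]+/, lc $line) {
        next unless length $w;
        $f{$w}++;
        $n++;
    }
}

# Rank by decreasing frequency and print the normalized product 1000*r*f/n.
my @by_freq = sort { $f{$b} <=> $f{$a} } keys %f;
for my $r (1 .. 50) {
    last if $r > @by_freq;
    my $w = $by_freq[$r - 1];
    printf "%-15s %8d   1000*rf/n = %4.0f\n", $w, $f{$w}, 1000 * $r * $f{$w} / $n;
}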
Luhn's Proposal
"It is here proposed that the frequency of word occurrence in an
article furnishes a useful measurement of word significance. It is
further proposed that the relative position within a sentence of
words having given values of significance furnish a useful
measurement for determining the significance of sentences. The
significance factor of a sentence will therefore be based on a
combination of these two measurements."
Luhn, H.P., The automatic creation of literature abstracts, IBM
Journal of Research and Development, 2, 159-165 (1958)
Methods that Build on Zipf's Law
Term weighting: Give differing weights to terms based on their frequency, with the most frequent words weighted less.
Stop lists: Ignore the most frequent words (upper cut-off).
Significant words: Ignore the most frequent and the least frequent words (upper and lower cut-off).
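As a rough illustration, the Perl sketch below applies a small stop list and then keeps only "significant" terms whose collection frequencies fall between an upper and a lower cut-off. Both the stop words and the cut-off values are illustrative assumptions, not values prescribed by the lecture.

#!/usr/bin/perl
use strict;
use warnings;

# Illustrative stop list: a few of the most frequent function words.
my %stoplist;
$stoplist{$_} = 1 for qw(the of to a in and that for is it on by as at with);

# Illustrative cut-offs (assumed values): ignore terms that occur more than
# $upper times or fewer than $lower times in the collection.
my ($upper, $lower) = (1000, 3);

my %f;
while (my $line = <>) {
    for my $w (split /[^a-z]+/, lc $line) {
        $f{$w}++ if length $w && !$stoplist{$w};
    }
}

my @significant = grep { $f{$_} >= $lower && $f{$_} <= $upper } keys %f;
printf "%d distinct terms, %d significant after cut-offs\n",
       scalar keys %f, scalar @significant;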
Cut-off Levels for Significant Words
[Plot of frequency f against rank r, showing an upper cut-off and a lower cut-off; the words between the two cut-offs are the significant words, where the resolving power of significant words is greatest.]
from: Van Rijsbergen, Ch. 2
Approaches to Weighting
Boolean information retrieval:
Weight of term i in document j:
    w(i, j) = 1    if term i occurs in document j
    w(i, j) = 0    otherwise

Vector space methods:
Weight of term i in document j:
    0 < w(i, j) <= 1    if term i occurs in document j
    w(i, j) = 0         otherwise
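A small Perl sketch contrasting the two schemes for a single document. The vector-space weight shown, term frequency divided by the largest term frequency in the document, is only one possible way to choose a value between 0 and 1; the term counts are made up for the illustration.

#!/usr/bin/perl
use strict;
use warnings;

# Term frequencies for one document (illustrative counts).
my %tf = (information => 3, retrieval => 2, text => 1);
my $max_tf = (sort { $b <=> $a } values %tf)[0];

for my $term (sort keys %tf) {
    # Boolean weight: 1 if the term occurs in the document, 0 otherwise.
    my $w_bool = $tf{$term} > 0 ? 1 : 0;

    # One possible vector-space weight: term frequency normalized into (0, 1].
    my $w_vec = $tf{$term} / $max_tf;

    printf "%-12s boolean = %d   vector space = %.2f\n", $term, $w_bool, $w_vec;
}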
Functional View of Information Retrieval
Similar: mechanism for determining the
similarity of the request representation to
the information item representation.
[Diagram: Documents are converted into an Index database; Requests are matched against the index to find similar documents.]
Major Subsystems
Indexing subsystem: Receives incoming documents,
converts them to the form required for the index and adds
them to the index database.
Search subsystem: Receives incoming requests, converts
them to the form required for searching the index and searches
the database for matching documents.
The index database is the central hub of the system.
Example: Indexing Subsystem
[Flow diagram, from Frakes, page 7. *Indicates an optional operation.]

documents -> text
  assign document IDs   -> document numbers and *field numbers
  break into words      -> words
  stoplist              -> non-stoplist words
  stemming*             -> stemmed words
  term weighting*       -> terms with weights
  -> Index database
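A minimal Perl sketch of this pipeline, building an inverted index in memory for the files named on the command line. The stop list is illustrative, the suffix-stripping "stemmer" is a crude stand-in for a real stemming algorithm, and the optional term weighting step is omitted.

#!/usr/bin/perl
use strict;
use warnings;

my %stoplist;
$stoplist{$_} = 1 for qw(the of to a in and that for is it);

# Crude stemmer: strip a few common suffixes (a stand-in for a real stemmer).
sub stem {
    my $w = shift;
    $w =~ s/(ing|ed|es|s)$//;
    return $w;
}

# Index database: term => { document id => term frequency }.
my %index;

my $doc_id = 0;
for my $file (@ARGV) {
    $doc_id++;                                    # assign document IDs
    open my $fh, '<', $file or die "$file: $!";
    while (my $line = <$fh>) {
        for my $w (split /[^a-z]+/, lc $line) {   # break into words
            next unless length $w;
            next if $stoplist{$w};                # stoplist
            my $term = stem($w);                  # stemming (optional)
            next unless length $term;             # guard against over-stripping
            $index{$term}{$doc_id}++;             # add to the index database
        }
    }
    close $fh;
}

printf "%d documents indexed, %d index terms\n", $doc_id, scalar keys %index;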
Example: Search Subsystem
[Flow diagram. *Indicates an optional operation.]

query
  parse query             -> query terms
  stoplist                -> non-stoplist words
  stemming*               -> stemmed words
  Boolean operations      (against the Index database)
                          -> retrieved document set
  ranking*                -> ranked document set
  relevance judgments*    -> relevant document set
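A matching Perl sketch of the search side, assuming an index with the same shape as in the indexing sketch (term => { document id => term frequency }; the contents below are made up). The query goes through the same stoplist and stemming as the documents, the index is consulted, and documents are ranked by the number of query terms they contain; this simple count is only one possible ranking.

#!/usr/bin/perl
use strict;
use warnings;

my %stoplist;
$stoplist{$_} = 1 for qw(the of to a in and that for is it);
sub stem { my $w = shift; $w =~ s/(ing|ed|es|s)$//; return $w }

# Index database with the same shape as in the indexing sketch
# (term => { document id => term frequency }); illustrative contents only.
my %index = (
    search => { 1 => 2, 2 => 1 },
    index  => { 2 => 3 },
    weight => { 1 => 1 },
);

my $query = "searching the indexes";          # incoming request

# Parse the query: same tokenizing, stoplist and stemming as the indexer.
my @terms = map  { stem($_) }
            grep { length $_ && !$stoplist{$_} }
            split /[^a-z]+/, lc $query;

# Look up each term and score documents by the number of matching query terms.
my %score;
for my $t (@terms) {
    next unless exists $index{$t};
    $score{$_}++ for keys %{ $index{$t} };
}

# Ranked document set: highest score first.
for my $doc (sort { $score{$b} <=> $score{$a} } keys %score) {
    print "document $doc: score $score{$doc}\n";
}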