CS 430: Information Discovery
Sample Midterm Examination
Notes on the Solutions
Midterm Examination -- Question 1
1(a) Define the terms inverted file, inverted list, posting.
Inverted file: a list of the words in a set of documents and the
documents in which they appear.
Inverted list: All the entries in an inverted file that apply to a
specific word.
Posting: An entry in an inverted list.
See Lecture 3
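The three definitions can be illustrated with a minimal sketch. It assumes a posting is nothing more than a document id and uses a small hypothetical collection; the function name `build_inverted_file` is illustrative and does not come from the lecture.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Return the inverted file: each word mapped to its inverted list.

    Assumption: a posting is just a document id.
    """
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            inverted[word].add(doc_id)   # record one posting per (word, document)
    # Sort the words and each inverted list so that lookups are predictable.
    return {word: sorted(ids) for word, ids in sorted(inverted.items())}

# Hypothetical three-document collection, for illustration only.
docs = {1: "ant bee", 2: "bee cat", 3: "ant cat dog"}
inverted_file = build_inverted_file(docs)
print(inverted_file["ant"])   # the inverted list for the word "ant"
```

Each key of `inverted_file` is a word, each value is that word's inverted list, and each element of the list is one posting.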
1(b) When implementing an inverted file system, what are the
criteria that you would use to judge whether the system is suitable
for very large-scale information retrieval?
Q1 (continued)
Storage
Inverted files are big, typically 10% to 100% the size of the
collection of documents.
Update performance
It must be possible, with a reasonable amount of computation, to:
(a) Add a large batch of documents
(b) Add a single document
Retrieval performance
Retrieval must be fast enough to satisfy users and must not use
excessive resources.
from Lecture 3
Q1 (continued)
1(c) You are designing an inverted file system to be used with Boolean
queries on a very large collection of textual documents. New
documents are being continually added to the collection.
(i) What file structure(s) would you use?
(ii) How well does your design satisfy the criteria listed in Part (b)?
Q1 (continued)
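Separate inverted index from lists of postings
[Figure: an inverted index with the columns "Term" and "Pointer to list
of postings", holding the terms ant, bee, cat, dog, elk, fox, gnu, hog.
Each pointer leads to that term's list of postings in a separate
postings file.]
from Lecture 3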
Question 1 (continued)
(a) Postings file may be stored sequentially as a linked
list.
(b) Index file is best stored as a tree. Binary trees
provide fast searching but have problems with updating.
B-trees are better, with B+-trees as the best.
Note: Other answers are possible to this part of the
question.
Question 1 (continued)
1(c)(ii) How well does your design satisfy the criteria
listed in Part (b)?
• Sequential list for each term is efficient for storage
  and for processing Boolean queries. The
  disadvantage is a slow update time for long
  inverted lists.
• B-trees combine fast retrieval with moderately
  efficient updating.
• Bottom-up updating is usually fast, but may require
  recursive tree climbing to the root.
• The main weakness is poor storage utilization;
  typically buckets are only about 69% full.
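One reason a sequential list suits Boolean queries is that two sorted lists of postings can be combined in a single pass. The sketch below shows this for AND. It assumes postings are document ids held in ascending order; the function name `intersect` and the sample lists are illustrative.

```python
def intersect(postings_a, postings_b):
    """AND two sorted lists of postings in one sequential pass."""
    i = j = 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])   # document contains both terms
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1                         # advance the list that is behind
        else:
            j += 1
    return result

# Hypothetical lists of postings for two terms.
cat_and_dog = intersect([1, 3, 5, 8], [2, 3, 8, 9])
print(cat_and_dog)
```

Each list is read once from start to end, so the cost grows with the combined length of the two lists.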
Midterm Examination -- Question 2
2(b) You have the collection of documents that contain the following
index terms:
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta
(i) Use an incidence matrix of terms to calculate a similarity matrix
for these four documents, with no term weighting.
Incidence array
      alpha    bravo  charlie    delta     echo  foxtrot     golf    terms
D1        1        1        1        1        1        1        1        7
D2        1        0        0        1        0        0        1        3
D3        0        1        1        0        1        1        0        4
D4        1        0        0        1        0        1        1        4
Document similarity matrix
        D1      D2      D3      D4
D1            0.65    0.76    0.76
D2    0.65            0.00    0.87
D3    0.76    0.00            0.25
D4    0.76    0.87    0.25
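The values in this matrix are consistent with the cosine measure applied to the rows of the incidence array. The sketch below assumes that cosine is the intended measure and recomputes the similarities for D1 to D4; the helper names are illustrative.

```python
import math

TERMS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]
DOCS = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

def incidence_vector(text):
    """1 if the term occurs in the document, 0 otherwise (no term weighting)."""
    present = set(text.split())
    return [1 if term in present else 0 for term in TERMS]

def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    length_u = math.sqrt(sum(a * a for a in u))
    length_v = math.sqrt(sum(b * b for b in v))
    return dot / (length_u * length_v)

vectors = {name: incidence_vector(text) for name, text in DOCS.items()}
similarity = {
    (a, b): round(cosine(vectors[a], vectors[b]), 2)
    for a in vectors for b in vectors if a != b
}
for pair in [("D1", "D2"), ("D2", "D4"), ("D3", "D4")]:
    print(pair, similarity[pair])
```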
Question 2 (continued)
2(b)(ii) Use a frequency matrix of terms to calculate a similarity
matrix for these documents, with weights proportional to the term
frequency and inversely proportional to the document frequency.
Frequency Array
      alpha    bravo  charlie    delta     echo  foxtrot     golf
D1        1        1        1        1        1        1        1
D2        1        0        0        1        0        0        3
D3        0        3        1        0        1        1        0
D4        2        0        0        1        0        1        2
Inverse Document Frequency Weighting
Principle:
(a) Weight is proportional to the number of times that the term
appears in the document
(b) Weight is inversely proportional to the number of documents
that contain the term:
    w_ik = f_ik / d_k
Where: w_ik is the weight given to term k in document i
       f_ik is the frequency with which term k appears in document i
       d_k is the number of documents that contain term k
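As a minimal sketch, the weighting rule can be applied directly to the four documents of Question 2. The function name `tf_df_weights` and the use of `collections.Counter` are choices made for this illustration only.

```python
from collections import Counter

DOCS = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

def tf_df_weights(docs):
    """Return (weights, doc_freq), where weights[i][k] = f_ik / d_k."""
    freqs = {name: Counter(text.split()) for name, text in docs.items()}  # f_ik
    doc_freq = Counter()                                                  # d_k
    for counts in freqs.values():
        doc_freq.update(counts.keys())   # each document counts once per term
    weights = {
        name: {term: f / doc_freq[term] for term, f in counts.items()}
        for name, counts in freqs.items()
    }
    return weights, doc_freq

weights, doc_freq = tf_df_weights(DOCS)
print(doc_freq["alpha"], round(weights["D4"]["alpha"], 2))
```

For example, bravo occurs three times in D3 and appears in two documents, so its weight in D3 is 3 / 2 = 1.50.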
Frequency Array with Weights
      alpha    bravo  charlie    delta     echo  foxtrot     golf   length
D1     0.33     0.50     0.50     0.33     0.50     0.33     0.33     1.09
D2     0.33     0.00     0.00     0.33     0.00     0.00     1.00     1.11
D3     0.00     1.50     0.50     0.00     0.50     0.33     0.00     1.69
D4     0.67     0.00     0.00     0.33     0.00     0.33     0.67     1.05
dk        3        2        2        3        2        3        3
Document similarity matrix
        D1      D2      D3      D4
D1            0.46    0.74    0.58
D2    0.46            0.00    0.86
D3    0.74    0.00            0.06
D4    0.58    0.86    0.06
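The weighted similarities can be checked with a short sketch. It assumes the cosine measure over vectors whose entries are term frequency divided by document frequency; the variable and function names are illustrative.

```python
import math
from collections import Counter

DOCS = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

counts = {name: Counter(text.split()) for name, text in DOCS.items()}
# Number of documents that contain each term.
doc_freq = Counter(term for c in counts.values() for term in c)
# Weight of a term in a document: term frequency / document frequency.
weights = {
    name: {term: f / doc_freq[term] for term, f in c.items()}
    for name, c in counts.items()
}

def length(vec):
    """Euclidean length of a sparse weight vector."""
    return math.sqrt(sum(w * w for w in vec.values()))

def cosine(u, v):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    return dot / (length(u) * length(v))

similarity = {
    (a, b): round(cosine(weights[a], weights[b]), 2)
    for a in weights for b in weights if a != b
}
for pair in [("D1", "D4"), ("D2", "D4"), ("D3", "D4")]:
    print(pair, similarity[pair])
```

Because the measure is symmetric, the entry for (D4, D1) must equal the entry for (D1, D4).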
Question 3
3(a) Define the terms recall and precision.
3(b) Q is a query. D is a collection of 1,000,000 documents.
When the query Q is run, a set of 200 documents is returned.
(i) How in a practical experiment would you calculate
the precision?
Have an expert examine each of the 200 documents and
decide whether it is relevant. Precision is the number judged
relevant divided by 200.
(ii) How in a practical experiment would you calculate the
recall?
It is not feasible to examine 1,000,000 records. Therefore
sampling must be used ...
Question 3 (continued)
3(c) Suppose that, by some means, it is known that 100
of the documents in D are relevant to Q. Of the 200
documents returned by the search, 50 are relevant.
(i) What is the precision?
50/200 = 0.25
(ii) What is the recall?
50/100 = 0.5
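The same arithmetic, written as a small sketch. It uses the standard definitions (precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved); the function names are illustrative.

```python
def precision(relevant_retrieved, retrieved):
    """Fraction of the retrieved documents that are relevant."""
    return relevant_retrieved / retrieved

def recall(relevant_retrieved, relevant_in_collection):
    """Fraction of the relevant documents in the collection that were retrieved."""
    return relevant_retrieved / relevant_in_collection

# Figures from Question 3(c): 200 returned, 50 of them relevant,
# 100 relevant documents in the whole collection.
p = precision(50, 200)
r = recall(50, 100)
print(p, r)
```

Note that the size of the collection, 1,000,000 documents, does not enter either calculation.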
3(d) Explain in general terms the method used by TREC
to estimate the recall.
Question 3 (continued)
For each query, a pool of potentially relevant documents is
assembled, using the top 100 ranked documents from each
participant.
The human expert who set the query looks at every document
in the pool and determines whether it is relevant.
Documents outside the pool are not examined.
[In TREC-8:
7,100 documents in the pool
1,736 unique documents (eliminating duplicates)
94 judged relevant]
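The pooling idea can be sketched as follows. This is an illustration under stated assumptions, not TREC's actual software: the runs, document ids and relevance judgments are hypothetical, and the names `build_pool` and `pooled_recall` are invented for the example.

```python
def build_pool(runs, depth=100):
    """Union of the top `depth` documents from each participant's ranking.

    Taking a set union removes documents submitted by more than one participant.
    """
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool

def pooled_recall(retrieved, judged_relevant):
    """Recall estimated against the relevant documents found in the pool.

    Documents outside the pool are never judged, so they count as not relevant.
    """
    if not judged_relevant:
        return 0.0
    return len(set(retrieved) & judged_relevant) / len(judged_relevant)

# Hypothetical runs, pooled to depth 2 to keep the example small.
runs = {"run_a": ["d1", "d2", "d3"], "run_b": ["d2", "d4", "d5"]}
pool = build_pool(runs, depth=2)
judged_relevant = {"d2", "d4"}   # stands in for the assessor's decisions
print(sorted(pool), pooled_recall(runs["run_a"], judged_relevant))
```

Because relevant documents outside the pool are never found, recall measured this way is an estimate rather than an exact value.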
Midterm Examination -- Question 4
4(a) What is the Dublin Core principle of dumbing-down? Are
there any fields in this record that do not satisfy the principle?
"The theory behind this principle is that consumers of metadata
should be able to strip off qualifiers and return to the base form of a
property. ... this principle makes it possible for client applications to
ignore qualifiers in the context of more coarse-grained, cross-domain searches."
Lagoze 2001
Question 4 (continued)
Dumbing-down failures:
Description.note    Title from home page as viewed on Nov. 1, 2000.
Description         Title from home page as viewed on Nov. 1, 2000.
    which is not a description of the object
Publisher.place     Nashville, Tenn. :
Publisher           Nashville, Tenn. :
    which is not the publisher of the object
Correct dumbing-down:
Subject.class.LCC   E840.8.G65
Subject             E840.8.G65
    which is a subject code
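Dumbing-down can be sketched as stripping everything after the first dot in a field name. The sketch assumes that qualifiers are written as dot-separated suffixes, as in the field names above; the function name `dumb_down` is illustrative.

```python
def dumb_down(record):
    """Strip qualifiers from field names, e.g. 'Publisher.place' -> 'Publisher'.

    Values are collected in lists because several qualified fields
    may share the same base element.
    """
    simple = {}
    for field, value in record.items():
        base = field.split(".", 1)[0]          # keep only the base element
        simple.setdefault(base, []).append(value)
    return simple

# The three qualified fields discussed above.
record = {
    "Description.note": "Title from home page as viewed on Nov. 1, 2000.",
    "Publisher.place": "Nashville, Tenn. :",
    "Subject.class.LCC": "E840.8.G65",
}
simple_record = dumb_down(record)
print(simple_record["Publisher"])
```

The code applies the rule mechanically. Whether the result still makes sense, for instance a place name appearing as a Publisher, is exactly what the principle asks of the person who creates the record.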
Question 4 (continued)
4(b) The metadata in the fields Publisher and Publisher place
end in punctuation marks. Can you suggest any reasons for
doing so?
This is a historic curiosity. It comes from the concept that the
metadata will be printed, so that the metadata is stored in a
printable format.
Publisher           Gore/Lieberman,
Publisher.place     Nashville, Tenn. :
are intended to be combined with a date as follows:
    Nashville, Tenn. : Gore/Lieberman, 2001
Question 4 (continued)
4(c) This record has no Creator field. It has a
Contributor.nameCorporate field with value "Gore/Lieberman,
Inc." Do you consider that this is correct use of Dublin Core?
What would you put in the Creator and Contributor fields? Why?
Question 4 (continued)
Specification of Dublin Core:
A. All fields are optional. It is not necessary to have a Creator.
B. Definitions of fields
Creator: The person or organization primarily responsible for the
intellectual content of the resource.
Contributor: A person or organization not specified in a creator
element who has made significant intellectual contributions to the
resource but whose contribution is secondary to any person or
organization specified in a creator element.
Gore/Lieberman, Inc. is the corporate author of this web site
and is therefore the Creator.