
Discussion Class 1
Inverted Files
Discussion Classes
Format:
• Question
• Ask a member of the class to answer
• Provide opportunity for others to comment
When answering:
• Give your name. Make sure that the TA hears it.
• Stand up
• Speak clearly so that all the class can hear
Question 1: Terminology
(a) What is a keyword? How is it used?
(b) What is a controlled vocabulary? How might it
be used?
Question 2: Files
The book shows an inverted file implemented as
three files:
• Index file
• Postings file
• Documents file
(a) What is each used for?
(b) Why are they kept separate?
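As a starting point for discussion, the three-file layout can be sketched with in-memory stand-ins. This is a minimal illustration, not the book's implementation: the names documents, postings, index and lookup, the sample texts, and the (offset, count) layout of index entries are all assumptions made for this example.

```python
# Hypothetical in-memory stand-ins for the three files.
# Documents file: the stored texts, addressed by document id.
documents = [
    "inverted files support fast retrieval",      # document 0
    "postings list the documents for each term",  # document 1
]

# Collect, for each term, the ids of the documents that contain it.
term_to_docs = {}
for doc_id, text in enumerate(documents):
    for term in set(text.split()):
        term_to_docs.setdefault(term, []).append(doc_id)

# Postings file: one flat array of document ids, grouped by term.
# Index file: each term mapped to (offset, count) into the postings array.
postings = []
index = {}
for term in sorted(term_to_docs):
    doc_ids = term_to_docs[term]
    index[term] = (len(postings), len(doc_ids))
    postings.extend(doc_ids)


def lookup(term):
    """Return the texts of all documents that contain term."""
    if term not in index:
        return []
    offset, count = index[term]
    return [documents[d] for d in postings[offset:offset + count]]
```

A lookup touches the index first, then a slice of the postings, and only then the documents themselves, which is one way to see how the three files play different roles.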
Question 3: Lexicographic Indexes
(a) What is a "lexicographic index"?
(b) Why are lexicographic indexes useful in
information retrieval?
(c) Give an example of an indexing system that is
not lexicographic.
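One property worth considering for part (b) is that a vocabulary kept in sorted order supports prefix and range lookups by binary search. The sketch below assumes a small hypothetical term list and uses Python's standard bisect module; the function name prefix_matches and the "\uffff" upper-bound trick are choices made for this example only.

```python
import bisect

# Hypothetical vocabulary, kept in lexicographic (sorted) order.
terms = sorted(["index", "inverted", "invert", "information", "file", "postings"])


def prefix_matches(sorted_terms, prefix):
    """Return all terms that start with prefix, using two binary searches."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    # "\uffff" sorts after any character expected in a term, so this
    # finds the first position past every term that begins with prefix.
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]
```

For example, prefix_matches(terms, "inv") returns the adjacent entries "invert" and "inverted" without scanning the whole vocabulary.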
Question 4: Building an Inverted File
The first stage in building an inverted file is to
create a list of words and their locations in the text.
(a) Before this list can be built, what decisions must
be made?
(b) What steps are involved in creating this list?
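The first stage can be sketched as follows. The decisions built into this example (lower-casing, splitting on non-alphanumeric characters, a tiny stoplist, and recording token positions rather than byte offsets) are assumptions chosen for illustration; the book may make different choices, and the names STOPWORDS and word_locations are hypothetical.

```python
import re

STOPWORDS = {"the", "a", "of", "is", "to"}  # hypothetical stoplist


def word_locations(documents):
    """Return (word, doc_id, position) tuples in order of occurrence."""
    entries = []
    for doc_id, text in enumerate(documents):
        # Decision: fold case and treat runs of letters/digits as words.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(tokens):
            # Decision: stopwords are dropped but still count as positions.
            if word not in STOPWORDS:
                entries.append((word, doc_id, position))
    return entries
```

Calling word_locations(["The index is sorted", "Sorted postings"]) yields the unsorted list that the second stage takes as input.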
Question 5: Sorting an Inverted Index
The second stage in building an inverted file is to
sort the list of words and their locations in the text.
The book describes a two-step algorithm by Harman
and Candela for this purpose.
(a) For what circumstances is this algorithm
intended?
(b) What are the two steps?
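For comparison, the straightforward approach to the second stage is a plain in-memory sort followed by grouping. The sketch below is that baseline only, not the Harman and Candela algorithm; the sample entries and the function name invert are assumptions for this example.

```python
from itertools import groupby

# Hypothetical output of the first stage: (word, doc_id, position) tuples.
entries = [("sorted", 0, 3), ("index", 0, 1), ("sorted", 1, 0), ("postings", 1, 1)]


def invert(entries):
    """Sort entries by word, then document, then position, and group by word."""
    ordered = sorted(entries)
    return {
        word: [(doc_id, position) for _, doc_id, position in group]
        for word, group in groupby(ordered, key=lambda entry: entry[0])
    }
```

This baseline holds every entry in memory at once, which is a useful contrast when discussing the circumstances asked about in part (a).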
Question 6: Sorting an Inverted Index
In the first step of the algorithm developed by
Harman and Candela:
(a) What data structure is used for the index file?
Why is this appropriate?
(b) What data structure is used for the postings file?
Why is this appropriate?
(c) Which files would be stored in memory and
which on disk?
Question 7: Footnote
The first sentence of Section 3.4.2 reads, "The second
technique to produce a sorted array inverted file is a fast
inversion algorithm called FAST-INV (Copyright
©Edward A. Fox, Whay C. Lee, Virginia Tech)."
What is surprising about this sentence?