CS 430: Information Discovery
Lecture 3
Inverted Files and Boolean Operations
Course Administration
• Assignment 1 will be posted during the next couple of
days. It is due on Friday, September 21 at 5 p.m.
Inverted File (Basic Definition)
Inverted file: a list of the words in a set of documents and the
documents in which they appear.
Word      Documents
abacus    3, 19, 22
actor     2, 19, 29
aspen     5
atoll     11, 34
Stop words are removed
before building the index.
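The definition above can be sketched in a few lines of Python. This is only an illustration: the stop word list, the whitespace tokenization, and the toy documents are assumptions, not part of the lecture.

```python
from collections import defaultdict

# Hypothetical stop word list; real systems use a much longer one.
STOP_WORDS = {"the", "a", "an", "and", "of", "has"}

def build_inverted_file(docs):
    """Map each word to the sorted list of document numbers containing it.

    docs: dict of document number -> text.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:   # stop words are never indexed
                index[word].add(doc_id)
    # Sort the words and each list of documents so the result is predictable.
    return {word: sorted(ids) for word, ids in sorted(index.items())}

# Toy documents, invented for illustration.
docs = {
    3: "the abacus",
    19: "an actor has an abacus",
    22: "abacus",
    2: "the actor",
}
inverted = build_inverted_file(docs)
print(inverted)
```

Using a set per word means a document is listed once even if the word occurs in it several times.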
Inverted List
Inverted list: All the entries in an inverted file that apply to a
specific word, e.g.
abacus    3, 19, 22
Posting: Entry in an inverted list, e.g., the postings for
"abacus" are documents 3, 19, 22.
Keywords and Controlled Vocabulary
Keyword:
A term that is used to describe the subject matter in a
document. It is sometimes called an index term.
Keywords can be extracted automatically from a document
or assigned by a human cataloguer or indexer.
Controlled vocabulary:
A list of words that can be used as keywords, e.g., in a
medical system, a list of medical terms.
Inverted file (more complete definition):
A list of the keywords that apply to a set of documents and
the documents in which they appear.
Enhancements to Inverted Files
Location: The inverted file holds information about the
location of each term within the document.
Uses:
• adjacency and near operators
• user interface design -- highlight location of search term
Frequency: The inverted file includes the number of postings
for each term.
Uses:
• term weighting
• query processing optimization
• user interface design
Inverted File (Enhanced)
Word      Postings   Document   Location
abacus    4          3          94
                     19         7
                     19         212
                     22         56
actor     3          2          66
                     19         213
                     29         45
aspen     1          5          43
atoll     3          11         3
                     11         70
                     34         40
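An enhanced inverted file of this kind could be built roughly as follows. This is a sketch under stated assumptions: locations are counted as word offsets starting at 1, stop words are not removed, and the two toy documents are invented.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to a sorted list of (document, location) pairs.

    Locations are word offsets counted from 1 within each document.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        words = docs[doc_id].lower().split()
        for location, word in enumerate(words, start=1):
            index[word].append((doc_id, location))
    return dict(index)

def posting_count(index, word):
    """Number of postings for a word (the Postings column)."""
    return len(index.get(word, []))

# Toy documents, invented for illustration.
docs = {
    1: "abacus actor abacus",
    2: "actor",
}
index = build_positional_index(docs)
print(index["abacus"], posting_count(index, "abacus"))
```

Because documents are visited in ascending order and locations increase within a document, each list comes out already sorted.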
Example: Boolean Queries
Boolean query: two or more search terms, related by
logical operators, e.g.,
and
or
not
Examples:
abacus and actor
abacus or actor
(abacus and actor) or (abacus and atoll)
not actor
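As a rough illustration, the example queries map directly onto set operations. The postings are those of the basic inverted file example; treating the collection as exactly the documents that appear in those postings is an assumption made so that "not" has something to complement against.

```python
# Postings from the basic inverted file example.
postings = {
    "abacus": {3, 19, 22},
    "actor": {2, 19, 29},
    "aspen": {5},
    "atoll": {11, 34},
}
# Assumption: the collection contains only the documents listed above.
all_docs = set().union(*postings.values())

abacus, actor, atoll = postings["abacus"], postings["actor"], postings["atoll"]

both = abacus & actor                            # abacus and actor
either = abacus | actor                          # abacus or actor
combined = (abacus & actor) | (abacus & atoll)   # (abacus and actor) or (abacus and atoll)
not_actor = all_docs - actor                     # not actor

print(sorted(both), sorted(either), sorted(combined), sorted(not_actor))
```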
Boolean Diagram
[Diagram: sets A and B, with regions labeled "A and B", "A or B", and "not (A or B)"]
Evaluating a Boolean Query
Example: abacus and actor
Postings for abacus:  3, 19, 22
Postings for actor:   2, 19, 29
To evaluate the and operator, merge the two inverted lists
with a logical AND operation.
Document 19 is the only document that contains both
terms, "abacus" and "actor".
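The merge can be sketched as a single pass over two sorted lists. The function name `intersect` is chosen here for illustration, and the sketch assumes both lists are in ascending order of document number.

```python
def intersect(list_a, list_b):
    """AND two posting lists that are sorted in ascending order."""
    result = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            # Document appears in both lists: keep it and advance both.
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1   # advance the list with the smaller document number
        else:
            j += 1
    return result

merged = intersect([3, 19, 22], [2, 19, 29])
print(merged)  # [19]
```

Each list is read once from start to end, which is why this approach suits lists held sequentially on disk.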
Adjacent and Near Operators
abacus adj actor
Terms abacus and actor are adjacent to each other, as in the string
"abacus actor".
abacus near 4 actor
Terms abacus and actor are within 4 words of each other, as in the string
"the actor has an abacus".
Some systems support other operators, such as with (two terms in the
same sentence) or same (two terms in the same paragraph).
Evaluating an Adjacency Operation
Example: abacus adj actor
Postings for abacus        Postings for actor
Document  Location         Document  Location
3         94               2         66
19        7                19        213
19        212              29        45
22        56
Document 19, locations 212 and 213, is the only
occurrence of the terms "abacus" and "actor" adjacent.
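One way to evaluate adj and near from postings with locations is sketched below. The function `near` and its parameters are hypothetical. The sketch assumes adj means the second term immediately follows the first, and it compares locations pairwise within a document rather than merging sorted lists as a production system would.

```python
def near(postings_a, postings_b, max_distance=1, ordered=True):
    """Return (document, location_a, location_b) matches.

    postings_a, postings_b: lists of (document, location) pairs.
    max_distance=1 with ordered=True models adj; larger values model near.
    """
    # Group the second term's locations by document for quick lookup.
    locations_b = {}
    for doc, loc in postings_b:
        locations_b.setdefault(doc, []).append(loc)

    matches = []
    for doc, loc_a in postings_a:
        for loc_b in locations_b.get(doc, []):
            gap = loc_b - loc_a
            if ordered:
                ok = 1 <= gap <= max_distance        # b must follow a
            else:
                ok = 1 <= abs(gap) <= max_distance   # either order
            if ok:
                matches.append((doc, loc_a, loc_b))
    return matches

abacus = [(3, 94), (19, 7), (19, 212), (22, 56)]
actor = [(2, 66), (19, 213), (29, 45)]
adjacent = near(abacus, actor)
print(adjacent)  # [(19, 212, 213)]
```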
Evaluation of Boolean Operators
Precedence of operators must be defined:
adj, near    (high)
and, not
or           (low)
Example:
A and B or C and B
is evaluated as
(A and B) or (C and B)
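A minimal sketch of how this precedence might be applied is shown below. It handles only "and" and "or" with no parentheses, and the function `evaluate` and the postings for A, B and C are invented for illustration.

```python
def evaluate(query, postings):
    """Evaluate a query of terms joined by 'and' / 'or' (no parentheses).

    'and' binds more tightly than 'or', so the query is split on 'or'
    first and each piece is then evaluated as a chain of 'and's.
    """
    result = set()
    for or_part in query.split(" or "):
        terms = or_part.split(" and ")
        docs = set(postings.get(terms[0].strip(), set()))
        for term in terms[1:]:
            docs &= postings.get(term.strip(), set())
        result |= docs
    return result

# Hypothetical postings for terms A, B and C.
postings = {"A": {1, 2}, "B": {2, 3}, "C": {3, 4}}
answer = evaluate("A and B or C and B", postings)
print(sorted(answer))  # [2, 3]

# A different grouping, A and (B or C) and B, gives a different answer.
other_grouping = postings["A"] & (postings["B"] | postings["C"]) & postings["B"]
print(sorted(other_grouping))  # [2]
```

The two results differ, which is why the precedence has to be fixed by the system rather than left to the reader of the query.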
Sizes of Inverted Files
Set   Records   Unique Terms
A     2,653     5,123
B     38,304    c. 25,000
Set A has an average of 14 postings per term and a
maximum of over 2,000 postings per term.
Set B has an average of 88 postings per record.
Examples from Harman and Candela, 1990
Representation of Inverted Files
Index (vocabulary) file: Stores list of terms
(keywords). Designed for rapid searching and
processing range queries. Often held in memory.
Postings file: Stores a list of postings for each
term. Designed for rapid merging of lists. Each
list may be stored sequentially.
Document file: [Repositories for the storage of
document collections are covered in CS 502.]
Organization of Inverted Files
[Diagram with three parts: Index (vocabulary) file, holding terms (ant, bee, cat, dog, elk, fox, gnu, hog), each with a pointer to postings; Postings file, holding the inverted lists; Documents file.]
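The relationship between the index file and the postings file can be sketched in memory as follows. Representing a pointer as an (offset, length) pair into one flat list is an assumption for illustration, as are the document numbers.

```python
# Postings file: all inverted lists stored one after another.
postings_file = []
# Index (vocabulary) file: term -> (offset, length) into postings_file.
vocabulary = {}

def add_term(term, doc_ids):
    """Append a term's inverted list and record a pointer to it."""
    vocabulary[term] = (len(postings_file), len(doc_ids))
    postings_file.extend(sorted(doc_ids))

def lookup(term):
    """Follow the pointer from the vocabulary to the inverted list."""
    if term not in vocabulary:
        return []
    offset, length = vocabulary[term]
    return postings_file[offset:offset + length]

# Hypothetical postings for three of the terms in the diagram.
add_term("ant", [4, 9])
add_term("bee", [1])
add_term("cat", [2, 4, 7])
print(vocabulary["cat"], lookup("cat"))
```

Keeping each list contiguous means a lookup needs one pointer from the small vocabulary and then one sequential read of the postings.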
Decisions in Building Inverted Files
• Underlying character set, e.g., printable ASCII,
Unicode, UTF-8.
• Whether to use a controlled vocabulary. If so, what
words to include.
• List of stop words.
• Rules to decide the beginning and end of words, e.g.,
spaces or punctuation.
• Character sequences not to be indexed, e.g.,
sequences of numbers.
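To make these decisions concrete, here is a small sketch of a tokenizer that fixes one choice for each. Every choice in it (ASCII letters and digits as word characters, the stop word list, dropping all-digit tokens) is a hypothetical example, not a recommendation from the lecture.

```python
import re

# Hypothetical choices for illustration: a word is a run of ASCII
# letters or digits, and tokens made only of digits are not indexed.
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+")
STOP_WORDS = {"the", "a", "an", "and", "of", "in"}

def index_terms(text):
    """Return the terms of a text that would be entered in the index."""
    terms = []
    for token in WORD_PATTERN.findall(text):
        token = token.lower()
        if token.isdigit():        # skip sequences of numbers
            continue
        if token in STOP_WORDS:    # skip stop words
            continue
        terms.append(token)
    return terms

terms = index_terms("The actor's abacus, made in 1990.")
print(terms)  # ['actor', 's', 'abacus', 'made']
```

The stray term "s" shows how a word-boundary rule has side effects: treating the apostrophe as punctuation splits "actor's" into two tokens.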
Efficiency Criteria
Storage
Inverted files are big, typically 10% to 100% the size of the
collection of documents.
Update performance
It must be possible, with a reasonable amount of computation, to:
(a) Add a large batch of documents
(b) Add a single document
Retrieval performance
Retrieval must be fast enough to satisfy users and must not use
excessive resources.
Index File
If an index is held on disk, search time is dominated by the
number of disk accesses.
Suppose that an index has 1,000,000 distinct terms.
Each index entry consists of the term and a pointer to the
inverted list, averaging 100 characters per entry.
At one byte per character, 1,000,000 entries of 100 characters give
an index of about 100 megabytes, which can easily be held in
memory.
Postings File
Since inverted lists may be very long, it is important to
match postings efficiently.
Usually, the inverted lists will be held on disk. Therefore,
algorithms for matching postings use sequential file
processing.
For efficient matching, the inverted lists should all be sorted
in the same sequence, usually alphabetic order (a
"lexicographic index").
Merging inverted lists is the most computationally
intensive task in many information retrieval systems.
Efficiency and Query Languages
Some query options may require huge computation, e.g.,
Regular expressions
  If inverted files are stored in alphabetical order,
  comp* can be processed efficiently
  *comp cannot be processed efficiently
Boolean terms
  If A and B are search terms
  A or B can be processed by comparing two moderate sized lists
  (not A) or (not B) requires two very large lists
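The difference between comp* and *comp can be sketched against a sorted vocabulary. The vocabulary below is invented, and the two function names are hypothetical.

```python
from bisect import bisect_left

# Hypothetical vocabulary, kept in alphabetical order.
vocabulary = sorted(
    ["compact", "compute", "computer", "cone", "decomp", "recomp", "zebra"]
)

def prefix_matches(prefix):
    """Terms matching prefix*: binary search to the first candidate,
    then scan forward only while terms still start with the prefix."""
    start = bisect_left(vocabulary, prefix)
    matches = []
    for term in vocabulary[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches

def suffix_matches(suffix):
    """Terms matching *suffix: alphabetical order does not help,
    so every term in the vocabulary has to be examined."""
    return [term for term in vocabulary if term.endswith(suffix)]

print(prefix_matches("comp"))
print(suffix_matches("comp"))
```

Terms sharing a prefix sit next to each other in alphabetical order, so the prefix query touches only the matching run, while the suffix query reads the whole vocabulary.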
Index File Structures: Linear Index
[Diagram: a linear index listing terms in order (ant, bee, cat, dog, elk, fox, gnu, hog), each with a pointer to its list of postings (the inverted lists).]
Linear Index
Advantages
Can be searched quickly, e.g., by binary search, O(log n)
Good for sequential processing, e.g., comp*
Convenient for batch updating
Economical use of storage
Disadvantages
Index must be rebuilt if an extra term is added
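A binary search over a linear index might look like the sketch below. The terms are those of the diagram, while the pointer values and the function name `find_pointer` are made up for illustration.

```python
from bisect import bisect_left

# Linear index: terms in sorted order, each paired with a pointer.
# The pointer values here are invented for illustration.
terms = ["ant", "bee", "cat", "dog", "elk", "fox", "gnu", "hog"]
pointers = [0, 40, 52, 90, 131, 150, 177, 203]

def find_pointer(term):
    """Binary search for a term, O(log n); returns None if absent."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return pointers[i]
    return None

print(find_pointer("elk"), find_pointer("cow"))
```

The search works only because the terms stay in sorted order, which is also why adding one term in the middle disturbs every entry after it.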