File organization
File Organizations (Indexes)
• Choices for accessing data during query evaluation
• Scan the entire collection
  – Typical in early (batch) retrieval systems
  – Computational and I/O costs are O(characters in collection)
  – Practical only for “small” text collections
  – Large memory systems make scanning feasible
• Use indexes for direct access
  – Evaluation time O(query term occurrences in collection)
  – Practical for “large” collections
  – Many opportunities for optimization
• Hybrids: use a small index, then scan a subset of the collection
Indexes
• What should the index contain?
• Database systems index primary and secondary keys
  – This is the hybrid approach
  – The index provides fast access to a subset of database records
  – Scan the subset to find the solution set
• IR problem: cannot predict the keys that people will use in queries
  – Every word in a document is a potential search term
• IR solution: index by all keys (words), i.e., full-text indexes
Indexes
• The index is accessed by the atoms of a query language; the atoms are called “features” or “keys” or “terms”
• Most common feature types:
  – Words in text, punctuation
  – Manually assigned terms (controlled and uncontrolled vocabulary)
  – Document structure (sentence and paragraph boundaries)
  – Inter- or intra-document links (e.g., citations)
• Composed features
  – Feature sequences (phrases, names, dates, monetary amounts)
  – Feature sets (e.g., synonym classes)
• Indexing and retrieval models drive the choices
  – Must be able to construct all components of those models
• Indexing choices (there is no “right” answer); see the sketch below
  – Tokenization, case folding (e.g., New vs. new, Apple vs. apple), stopwords (e.g., the, a, its), morphology (e.g., computer, computers, computing, computed)
• Index granularity has a large impact on speed and effectiveness
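Since these choices recur in every indexer, here is a minimal tokenization sketch using the slide’s own stopword and case-folding examples; the function name and regex are illustrative assumptions, and morphology/stemming is left out as one more pipeline stage:

```python
import re

STOPWORDS = {"the", "a", "its"}  # the slide's example stopwords

def tokenize(text):
    # One possible set of choices: split on letter runs, case-fold, drop stopwords
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

# tokenize("New Apple computers") -> ["new", "apple", "computers"]
```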
Index Contents
• The contents depend upon the retrieval model
• Feature presence/absence
  – Boolean
  – Statistical (tf, df, ctf, doclen, maxtf)
  – Often about 10% of the size of the raw data, compressed
• Positional
  – Feature location within the document
  – Granularities include word, sentence, paragraph, etc.
  – Coarse granularities are less precise, but take less space
  – Word-level granularity is about 20-30% of the size of the raw data, compressed
Indexes: Implementation
• Common implementations of indexes
  – Bitmaps (no positional data indexed)
  – Signature files (no positional data indexed)
  – Inverted files
• Common index components
  – Dictionary (lexicon)
  – Postings
    • document ids
    • word positions
Indexes: Bitmaps
• Bag-of-words index only
• For each term, allocate a vector with one bit per document
• If the feature is present in document n, set the nth bit to 1, otherwise 0
• Boolean operations are very fast
• Space efficient for common terms (why?)
• Space inefficient for rare terms (why?)
• Good compression with run-length encoding (why?)
• Not widely used
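A minimal bitmap-index sketch (class and method names are assumptions, not from the slides); Python’s arbitrary-precision ints stand in for the per-term bit vectors, which makes Boolean operations one-liners:

```python
class BitmapIndex:
    """One bit per document for each term; bit n = 1 iff the term occurs in doc n."""
    def __init__(self, num_docs):
        self.num_docs = num_docs
        self.bits = {}  # term -> int used as a bit vector

    def add(self, term, doc_id):
        self.bits[term] = self.bits.get(term, 0) | (1 << doc_id)

    def boolean_and(self, t1, t2):
        # Boolean ops are single machine-word operations per 64 documents
        return self.bits.get(t1, 0) & self.bits.get(t2, 0)

    def to_doc_ids(self, vector):
        return [d for d in range(self.num_docs) if (vector >> d) & 1]
```

For a rare term the vector is almost entirely zeros, hence the space inefficiency; for a common term nearly every bit carries information, and the long runs of identical bits in sparse vectors are exactly what run-length encoding compresses well.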
Indexes: Signature Files
• Bag-of-words only
• For each term, allocate a fixed-size s-bit vector (signature)
• Define hash functions:
  – multiple functions: word → 1..s [each selects a bit to set]
• Each term has an s-bit signature
  – may not be unique!
• OR the term signatures to form the document signature
• Long documents are a problem (why?)
  – usually segment them into smaller pieces / blocks
Signature File Example
Indexes: Signature Files
• At query time:
  – Look up the signature for the query term (how?)
  – If all corresponding 1-bits are “on” in the document signature, the document probably contains that term
• Vary s to control P(false positive)
  – Note the space tradeoff
• The optimal s changes as the collection grows (why?)
• Widely studied, but not widely used
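A sketch of the scheme under stated assumptions (k = 3 hash functions, s = 64 bits; all names are illustrative). The query-time lookup simply hashes the query term the same way document terms were hashed, which answers the “how?” above:

```python
import hashlib

S = 64  # signature width in bits; the optimal s grows with the collection

def term_signature(term, k=3, s=S):
    # k hash functions, each mapping the word to one of the s bit positions
    sig = 0
    for i in range(k):
        h = int(hashlib.sha1(f"{i}:{term}".encode()).hexdigest(), 16)
        sig |= 1 << (h % s)
    return sig

def doc_signature(terms):
    sig = 0
    for t in terms:
        sig |= term_signature(t)  # OR the term signatures together
    return sig

def maybe_contains(doc_sig, term):
    t = term_signature(term)
    return doc_sig & t == t  # all 1-bits on => *probably* present (false positives possible)
```

Long documents OR many term signatures together, so the document signature fills with 1s and almost every probe matches; that is why long documents are segmented into blocks.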
Indexes: Inverted Lists
• Inverted lists are currently the most common indexing technique
• Source file: the collection, organized by document
• Inverted file: the collection, organized by term
  – one record per term, listing the locations where the term occurs
• During evaluation, traverse the lists for each query term
  – OR: the union of the component lists
  – AND: the intersection of the component lists
  – Proximity: an intersection of the component lists
  – SUM: the union of the component lists; each entry has a score
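A sketch of the AND and OR traversals over sorted postings lists (function names are assumptions); proximity would use the same two-pointer walk as AND, with an extra check on word positions:

```python
def postings_and(p1, p2):
    # Two-pointer intersection of sorted doc-id lists
    out, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def postings_or(p1, p2):
    # Union; a real system would merge the lists without materializing sets
    return sorted(set(p1) | set(p2))
```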
Inverted Files
[figure: example inverted file]
Word-Level Inverted File
[figure: example word-level inverted file, with word positions per document]
How big is the index?
• For an n-word collection:
• Lexicon
  – Heaps’ Law: V = O(n^β), 0.4 < β < 0.6
  – TREC-2: 1 GB text, 5 MB lexicon
• Postings
  – at most one per occurrence of a word in the text: O(n)
Inverted Search Algorithm
1. Find the query elements (terms) in the lexicon
2. Retrieve the postings for each lexicon entry
3. Manipulate the postings according to the retrieval model
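The three steps might look like this with an on-disk postings file; the (offset, count) lexicon layout and 4-byte doc ids are assumptions made for the sketch:

```python
import struct

def search(query_terms, lexicon, postings_file, combine):
    # lexicon: term -> (byte offset, number of postings); postings: 4-byte doc ids
    lists = []
    for term in query_terms:                    # 1. find the terms in the lexicon
        if term not in lexicon:
            continue
        offset, n = lexicon[term]
        postings_file.seek(offset)              # 2. retrieve the postings
        lists.append(list(struct.unpack(f"<{n}I", postings_file.read(4 * n))))
    return combine(lists)                       # 3. apply the retrieval model
```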
Word-Level Inverted File
[figure: lexicon and postings for the porridge/pot example collection]
Query:
1. porridge & pot (BOOL)
2. “porridge pot” (BOOL)
3. porridge pot (VSM)
Lexicon Data Structures
According to Heaps’ Law, the size of the lexicon may be very large.
• Hash table
  – O(1) lookup, with constant-time h() and collision handling
  – Supports exact-match lookup
  – May be complex to expand
• B-Tree
  – On-disk storage with fast retrieval and good caching behavior
  – Supports exact-match and range-based lookup
  – O(log n) lookups to find a list
  – Usually easy to expand
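An in-memory stand-in for the B-tree side (a sorted array with binary search), just to show exact-match and range lookup side by side; all names are illustrative:

```python
import bisect

class SortedLexicon:
    def __init__(self, term_offset_pairs):
        pairs = sorted(term_offset_pairs)
        self.terms = [t for t, _ in pairs]
        self.offsets = [o for _, o in pairs]

    def lookup(self, term):
        # Exact match in O(log n)
        i = bisect.bisect_left(self.terms, term)
        if i < len(self.terms) and self.terms[i] == term:
            return self.offsets[i]
        return None

    def range(self, lo, hi):
        # Range lookup (e.g., for prefix queries); a hash table cannot do this
        i = bisect.bisect_left(self.terms, lo)
        j = bisect.bisect_right(self.terms, hi)
        return list(zip(self.terms[i:j], self.offsets[i:j]))
```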
In-memory Inversion Algorithm
1. Create an empty lexicon
2. For each document d in the collection:
   1. Read the document, parse it into terms
   2. For each indexing term t:
      1. f_{d,t} = frequency of t in d
      2. If t is not in the lexicon, insert it
      3. Append <d, f_{d,t}> to the postings list for t
3. Output each postings list into the inverted file:
   1. For each term, start a new file entry
   2. Append each <d, f_{d,t}> to the entry
   3. Compress the entry
   4. Write the entry out to the file
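The algorithm above as a short Python sketch; the collection is assumed to be (doc_id, text) pairs, and the compress/write steps are folded into returning the lists:

```python
from collections import Counter, defaultdict

def invert_in_memory(collection):
    postings = defaultdict(list)                # 1. empty lexicon
    for d, text in collection:                  # 2. for each document d
        freqs = Counter(text.lower().split())   #    parse into terms, compute f_{d,t}
        for t, f in sorted(freqs.items()):
            postings[t].append((d, f))          #    append <d, f_{d,t}> to t's list
    return dict(postings)                       # 3. output each postings list
```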
Complexity of In-memory Inv.
• Time: O(n) for an n-byte text
• Space
  – Lexicon: space for the unique words + offsets
  – Postings: 10 bytes per entry
    • document number: 4 bytes
    • frequency count: 2 bytes (allows counts up to 65,535)
    • “next” pointer: 4 bytes
• Is this affordable?
  – For a 5 GB collection at 10 bytes/entry, 400M entries need 4 GB of main memory
Idea 1: Partition the text
• Invert a chunk of the text at a time
• Then merge the sub-indexes into one complete index (a multi-way merge)
[figure: chunks are inverted separately, then multi-way merged into the main inverted file]
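A sketch of the multi-way merge, assuming each sub-index is a list of (term, postings) pairs sorted by term; because chunks are processed in document order, postings for a shared term concatenate in doc-id order:

```python
import heapq

def merge_subindexes(subindexes):
    merged = {}
    # heapq.merge streams the sorted sub-indexes without loading them all at once
    for term, postings in heapq.merge(*subindexes, key=lambda e: e[0]):
        merged.setdefault(term, []).extend(postings)
    return merged
```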
Idea 2: Sort-based Inversion
Invert in two passes:
1. Output records <t, d, f_{d,t}> to a temporary file
2. Sort the records using external merge sort:
   1. read a chunk of the temp file
   2. sort it using Quicksort
   3. write it back into the same place
   4. then merge-sort the chunks in place
3. Read the sorted file, and write the inverted file
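A compressed sketch of the passes; an in-memory sort stands in for the external merge sort over the temp file:

```python
def sort_based_invert(collection):
    records = []
    for d, text in collection:                     # pass 1: emit <t, d, f_{d,t}> records
        freqs = {}
        for t in text.lower().split():
            freqs[t] = freqs.get(t, 0) + 1
        records.extend((t, d, f) for t, f in freqs.items())
    records.sort()                                 # pass 2: sort by (term, doc)
    inverted = {}
    for t, d, f in records:                        # pass 3: write the inverted file
        inverted.setdefault(t, []).append((d, f))
    return inverted
```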
Access Optimizations
• Skip lists:
  – A table of contents for the inverted list
  – Embedded pointers that jump ahead n documents (why is this useful?)
• Separating presence information from location information
  – Many operators need only presence information
  – Location information takes substantial space (I/O)
  – If split:
    • reduced I/O for presence operators
    • increased I/O for location operators (or a larger index)
  – Common in CD-ROM implementations
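A sketch of why skip pointers help: when ANDing a rare term against a frequent one, most postings in the long list can be jumped over. A skip distance of about √(list length) is a common textbook choice, assumed here:

```python
import math

def and_with_skips(long_list, short_list):
    skip = max(1, int(math.sqrt(len(long_list))))
    out, i, j = [], 0, 0
    while i < len(long_list) and j < len(short_list):
        if long_list[i] == short_list[j]:
            out.append(long_list[i]); i += 1; j += 1
        elif long_list[i] < short_list[j]:
            # follow skip pointers while the skipped-to entry is still too small
            while i + skip < len(long_list) and long_list[i + skip] < short_list[j]:
                i += skip
            i += 1
        else:
            j += 1
    return out
```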
Inverted file compression
• Inverted lists are usually compressed
• Inverted files with word locations are about the size of the raw data
• The distribution of the numbers is skewed
  – Most numbers are small (e.g., word locations, term frequencies)
• The distribution can easily be made more skewed
  – Delta encoding: 5, 8, 10, 17 → 5, 3, 2, 7
• Simple compression techniques are often the best choice
  – Simple algorithms are nearly as effective as complex algorithms
  – Simple algorithms are much faster than complex algorithms
  – Goal: time saved by reduced I/O > time required to uncompress
Compress → Huffman encoding
Inverted file compression
• The longest lists, which take up the most space, contain the most frequent (probable) words
• Compressing the longest lists would save the most space
• The longest lists should compress easily because they contain the least information (why?)
• Algorithms (a gamma-code sketch follows this list):
  – Delta encoding
  – Variable-length encoding
  – Unary codes
  – Gamma codes
  – Delta codes
  – Variable-Byte Code
  – Golomb code
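As one concrete example from the list, a sketch of Elias gamma coding: ⌊log₂ x⌋ zeros, then x in binary. Bit strings are used for clarity; a real coder packs the bits:

```python
def gamma_encode(x):
    # Elias gamma: floor(log2 x) zeros, then the binary form of x (x >= 1)
    b = x.bit_length() - 1
    return "0" * b + bin(x)[2:]

def gamma_decode(bits, pos=0):
    b = 0
    while bits[pos] == "0":   # count the leading zeros
        b += 1; pos += 1
    x = int(bits[pos:pos + b + 1], 2)
    return x, pos + b + 1

# gamma_encode(9) == "0001001"; small gaps get short codes, which suits skewed lists
```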
Inverted List Indexes: Compression
Delta Encoding ("Storing Gaps")
• Reduces the range of the numbers
  – keep d in ascending order
  – 3, 5, 20, 21, 23, 76, 77, 78
  – becomes: 3, 2, 15, 1, 2, 53, 1, 1
• Produces a more skewed distribution
• Increases the probability of smaller numbers
• Stemming also increases the probability of smaller numbers (why?)
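The slide’s own example, as code (function names are illustrative):

```python
def encode_gaps(doc_ids):
    # 3, 5, 20, 21, 23, 76, 77, 78 -> 3, 2, 15, 1, 2, 53, 1, 1
    prev, gaps = 0, []
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def decode_gaps(gaps):
    total, doc_ids = 0, []
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids
```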
Variable-Byte Code
• Binary, but using the minimum number of bytes
• 7 bits store the value; 1 bit flags whether another byte follows
  – 0 ≤ x < 128: 1 byte
  – 128 ≤ x < 16,384: 2 bytes
  – 16,384 ≤ x < 2,097,152: 3 bytes
• Integral byte sizes for easy coding
• Very effective for medium-sized numbers
• A little wasteful for very small numbers
• Best trade-off between smallest index size and fastest query time
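A sketch using one common convention (low 7 bits per byte, high bit set on all but the last byte, least-significant group first); other byte orders exist, so this layout is an assumption:

```python
def vbyte_encode(x):
    out = bytearray()
    while x >= 128:
        out.append((x & 0x7F) | 0x80)  # 7 payload bits + continuation flag
        x >>= 7
    out.append(x)                      # final byte: continuation flag clear
    return bytes(out)

def vbyte_decode(data):
    x, shift = 0, 0
    for b in data:
        x |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return x

# vbyte_encode(130) == b'\x82\x01'; 1 byte for x < 128, 2 bytes up to 16,383, etc.
```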
Index/IR toolkits
• The Lemur Toolkit for Language Modeling and Information Retrieval
  – http://www.lemurproject.org/
• Nutch: Java-based indexing and search technology, providing web search application software
  – http://lucene.apache.org/nutch/
Summary
• Common implementations of indexes
  – Bitmaps, signature files, inverted files
• Common index components
  – Lexicon & postings
• Inverted search algorithm
• Inversion algorithms
• Inverted file access optimizations
  – compression
The End
Vector Space Model
• Document d and query q are represented as two m-dimensional vectors in the vector space; each dimension is weighted by TF·IDF, and similarity is measured by the cosine of the angle between the vectors (using the raw tf and idf formulas):

$$\cos(Q, D_d) = \frac{1}{W_q W_d} \sum_{t \in Q} w_{d,t} \cdot w_{q,t} = \frac{1}{W_q W_d} \sum_{t \in Q} f_{d,t} \cdot f_{q,t} \cdot \left(\log_2 \frac{N}{df_t}\right)^2 = \frac{1}{W_q W_d} \sum_{t \in Q} f_{d,t} \cdot idf_t \cdot f_{q,t} \cdot idf_t$$
Vocabulary Growth (Heaps’ Law)
• How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
  – The vocabulary has no upper bound, due to proper names, typos, etc.
  – New words occur less frequently as the vocabulary grows
• If V is the size of the vocabulary and n is the length of the corpus in words:
  – V = K·n^β (0 < β < 1)
• Typical constants:
  – K ≈ 10–100
  – β ≈ 0.4–0.6 (so V grows approximately as the square root of n)
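Plugging in mid-range constants from the slide (an illustrative choice, not a fitted model):

```python
def heaps_vocab(n, K=30, beta=0.5):
    # V = K * n**beta; K and beta picked from the typical ranges above
    return K * n ** beta

# For a 1-billion-word corpus: heaps_vocab(10**9) ~= 30 * 31623 ~= 949,000 unique words
```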
Query Answer
• 1. porridge & pot (BOOL)
  – d2
• 2. “porridge pot” (BOOL)
  – null
• 3. porridge pot (VSM)
  – d2 > d1 > d5
  – worked example below
3. porridge pot (VSM)

Ranking over the query terms only, dropping the constant query weight:

$$\cos(Q, D_d) \propto \frac{1}{W_d} \sum_{t \in Q} f_{d,t} \cdot f_{q,t} \cdot \log_2\frac{N}{df_t}, \qquad W_d = \sqrt{\sum_{t \in D} \left(f_{d,t} \cdot \log_2\frac{N}{df_t}\right)^2}$$

N = 6; f(d1) denotes the term frequency within d1.

Term     | df | f(d1) | f(d2) | f(d5)
porridge | 2  | 2     | 1     | 0
pot      | 2  | 0     | 1     | 1

Below, since the df values of the two terms are equal, the log₂(N/df_t) factor is omitted throughout.

Document vectors:
D1 = <1,0,1,0,0,0,0,0,2,2,0,0,0>
D2 = <0,0,0,1,0,0,0,0,1,1,1,0,1>
D5 = <0,0,0,1,1,1,0,0,0,0,1,1,1>

Document lengths:
|W_d1| = √10
|W_d2| = √5
|W_d5| = √6

Sim(q, d1) = 2/√10
Sim(q, d2) = 2/√5
Sim(q, d5) = 1/√6
⇒ d2 > d1 > d5
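The worked example above, reproduced in code; the dictionaries mirror the table, and the idf factor is dropped because both terms have df = 2:

```python
import math

tf = {  # query-term frequencies from the table
    "d1": {"porridge": 2, "pot": 0},
    "d2": {"porridge": 1, "pot": 1},
    "d5": {"porridge": 0, "pot": 1},
}
lengths = {"d1": math.sqrt(10), "d2": math.sqrt(5), "d5": math.sqrt(6)}  # full-vector |W_d|
query = {"porridge": 1, "pot": 1}

scores = {d: sum(f * query[t] for t, f in terms.items()) / lengths[d]
          for d, terms in tf.items()}
print(sorted(scores, key=scores.get, reverse=True))  # ['d2', 'd1', 'd5']
```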