Chap 14 Ranking Algorithms
Advisor: Dr. 黃三益
Students: 吳金山, 鄭菲菲
Outline
Introduction
Ranking models
Selecting ranking techniques
Data structures and algorithms
The creation of an inverted file
Searching the inverted file
Stemmed and unstemmed query terms
A Boolean system with ranking
Pruning
Introduction
Boolean systems
Providing powerful on-line search capabilities for librarians and other trained intermediaries
Providing very poor service for end-users who use the system infrequently
The ranking approach
Inputting a natural language query without Boolean syntax
Producing a list of ranked records that "answer" the query
More oriented toward end-users
Introduction (cont.)
Natural language/ranking approach is more effective for end-users
The results are ranked based on the co-occurrence of query terms, modified by statistical term-weighting
Eliminating the often-wrong Boolean syntax used by end-users
Providing some results even if a query term is incorrect
Figure 14.1 Statistical ranking

Query: "human factors in information retrieval systems" (Qry. Vtr.)
Rec1: human, factors, information, retrieval (Rec1. Vtr.)
Rec2: human, factors, help, systems (Rec2. Vtr.)
Rec3: factors, operation, systems (Rec3. Vtr.)

Term         Qry. Vtr.  Rec1. Vtr.  Rec2. Vtr.  Rec3. Vtr.
Factors      1          1           1           1
Information  1          1           0           0
Help         0          0           1           0
Human        1          1           1           0
Operation    0          0           0           1
Retrieval    1          1           0           0
Systems      1          0           1           1
Figure 14.1 Statistical ranking (cont.)

Simple match (binary vectors):
Query (1 1 0 1 0 1 1) x Rec1 (1 1 0 1 0 1 0) -> (1 1 0 1 0 1 0) = 4
Query (1 1 0 1 0 1 1) x Rec2 (1 0 1 1 0 0 1) -> (1 0 0 1 0 0 1) = 3
Query (1 1 0 1 0 1 1) x Rec3 (1 0 0 0 1 0 1) -> (1 0 0 0 0 0 1) = 2

Weighted match:
Query (1 1 0 1 0 1 1) x Rec1 (2 3 0 5 0 3 0) -> (2 3 0 5 0 3 0) = 13
Query (1 1 0 1 0 1 1) x Rec2 (2 0 4 5 0 0 1) -> (2 0 0 5 0 0 1) = 8
Query (1 1 0 1 0 1 1) x Rec3 (2 0 0 0 2 0 1) -> (2 0 0 0 0 0 1) = 3
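Both scores are just inner products between the query vector and each record vector. A minimal Python sketch of the computation, using the vectors from the figure:

# Term order: factors, information, help, human, operation, retrieval, systems
query = [1, 1, 0, 1, 0, 1, 1]  # binary query vector
records = {
    "Rec1": ([1, 1, 0, 1, 0, 1, 0], [2, 3, 0, 5, 0, 3, 0]),
    "Rec2": ([1, 0, 1, 1, 0, 0, 1], [2, 0, 4, 5, 0, 0, 1]),
    "Rec3": ([1, 0, 0, 0, 1, 0, 1], [2, 0, 0, 0, 2, 0, 1]),
}
for name, (binary, weighted) in records.items():
    simple = sum(q * d for q, d in zip(query, binary))    # co-occurrence count
    scored = sum(q * d for q, d in zip(query, weighted))  # weighted match
    print(name, simple, scored)  # Rec1 4 13, Rec2 3 8, Rec3 2 3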
Ranking models
Two types of ranking models
Ranking the query against individual documents
  Vector space model
  Probabilistic model
Ranking the query against entire sets of related documents
Ranking models (cont.)
Vector space model
Using the cosine correlation to compute similarity (sketched below)
Early experiments
  SMART system (overlap similarity function)
  Results
    Within-document frequency weighting > no term weighting
    Cosine correlation with frequency term weighting > overlap similarity function
Salton & Yang (1973)
  Relying on term importance within an entire collection
  Results
    Significant performance improvement using within-document frequency weighting + the inverse document frequency (IDF)
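The slide does not give code for the cosine correlation, so here is a minimal sketch of it; the record vector reuses the within-document frequency weights from Figure 14.1:

import math

def cosine(q, d):
    # Cosine correlation: dot product normalized by the lengths of both vectors
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(qi * qi for qi in q)) * math.sqrt(sum(di * di for di in d))
    return dot / norm if norm else 0.0

query = [1, 1, 0, 1, 0, 1, 1]  # binary query vector from Figure 14.1
rec1 = [2, 3, 0, 5, 0, 3, 0]   # frequency-weighted record vector
print(cosine(query, rec1))     # higher values mean greater similarity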
Ranking models (cont.)
Probabilistic model
Terms appearing in previously retrieved relevant documents were given a higher weight
Croft and Harper (1979)
  Probabilistic indexing without any relevance information
  Assuming all query terms have equal probability
  Deriving a term-weighting formula:

similarity_{jk} = \sum_{i=1}^{Q} \left( C + \log \frac{N - n_i}{n_i} \right)

where Q is the number of terms the query and record j have in common, C is a constant, N is the number of records in the collection, and n_i is the number of records containing term i.
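A sketch of this formula in Python; the value C = 1.0 is an arbitrary illustrative choice, and the n_i values below are hypothetical document frequencies:

import math

def croft_harper(matching_doc_freqs, N, C=1.0):
    # Sum C + log((N - n_i) / n_i) over the query terms matching the record
    return sum(C + math.log((N - n_i) / n_i) for n_i in matching_doc_freqs)

# e.g., three matching terms in a 1000-record collection
print(croft_harper([10, 50, 200], N=1000))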
Ranking models (cont.)
Probabilistic model
Croft (1983)
  Incorporating within-document frequency weights
  Using a tuning factor K:

similarity_{jk} = \sum_{i=1}^{Q} (C + IDF_i) \cdot f_{ij}

f_{ij} = K + (1 - K) \cdot \frac{freq_{ij}}{maxfreq_j}

where freq_{ij} is the frequency of term i in record j, maxfreq_j is the maximum term frequency in record j, and K tunes the contribution of within-document frequency.

Result
  Significant improvement over both the IDF weighting alone and the combination weighting
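A sketch of the combined weighting; the IDF form log(N / n_i) and the values of C and K are illustrative assumptions, not fixed by the slide:

import math

def croft_similarity(term_stats, max_freq, N, C=1.0, K=0.3):
    # term_stats: (freq_ij, n_i) pairs for the query terms matching record j
    score = 0.0
    for freq_ij, n_i in term_stats:
        idf = math.log(N / n_i)                  # assumed IDF form
        f_ij = K + (1 - K) * freq_ij / max_freq  # normalized within-document weight
        score += (C + idf) * f_ij
    return score

print(croft_similarity([(3, 10), (1, 200)], max_freq=5, N=1000))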
Other experiments involving ranking
Direct comparison of similarity measures and term-weighting schemes
4 types of term frequency weightings (Sparck Jones, 1973):
  Term frequency within a document
  Term frequency within a collection
  Term postings within a document (a binary measure)
  Term postings within a collection
Indexing was taken from manually extracted keywords
Results
  Using the term frequency (or postings) within a collection always improved performance
  Using the term frequency (or postings) within a document improved performance only for some collections
Other experiments involving ranking (cont.)
Harman (1986)
Four term-weighting factors:
  (a) The number of matches between a document and a query
  (b) The distribution of a term within the document collection (IDF and noise measures)
  (c) The frequency of a term within a document
  (d) The length of the document
Results
  Used alone, the distribution of the term within the collection (b) performed about twice as well as the within-document frequency (c)
  Combining the within-document frequency with either the IDF or the noise measure performed about twice as well as using the IDF or noise measure alone
Other experiments involving ranking (cont.)
Ranking based on document structure
  Not only using weights based on term importance both within an entire collection and within a given document (Bernstein and Williamson, 1984)
  But also using the structural position of the term (summary versus text paragraphs)
  In SIBRIS, increasing term-weights for terms in titles of documents and decreasing term-weights for terms added to a query from a thesaurus
Selecting ranking techniques
Using term-weighting based on the distribution of a term within a collection always improves performance
Adding the within-document frequency to the IDF weight often provides even more improvement; several methods exist for combining the within-document frequency with the IDF measure
Adding additional weight for document structure, e.g., higher weightings for terms appearing in the title or abstract vs. those appearing only in the text
Relevance weighting (Chap. 11)
The creation of an inverted file
Implications of ranking for inverted file structures
  Only the record id has to be stored (smaller index)
  Strategies that increase recall at the expense of precision can be used
The inverted file is usually split into two pieces for searching (sketched below):
  The dictionary, containing each term along with statistics about that term (such as number of postings and IDF) and a pointer to the location of that term's postings
  The postings file, containing the record ids and the weights for all occurrences of the term
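A minimal sketch of the two-piece structure; the field names and the log(N / n_i) IDF form are illustrative assumptions, and an in-memory reference stands in for the pointer into the postings file:

import math

def build_inverted_file(docs):
    # docs: {record id: list of terms in that record}
    postings = {}  # term -> [(record id, within-record frequency), ...]
    for rid, terms in docs.items():
        for term in set(terms):
            postings.setdefault(term, []).append((rid, terms.count(term)))
    N = len(docs)
    dictionary = {term: {"num_postings": len(plist),
                         "idf": math.log(N / len(plist)),  # assumed IDF form
                         "postings": plist}                # stands in for a file pointer
                  for term, plist in postings.items()}
    return dictionary

docs = {1: ["human", "factors", "information", "retrieval"],
        2: ["human", "factors", "help", "systems"],
        3: ["factors", "operation", "systems"]}
inv = build_inverted_file(docs)
print(inv["factors"]["num_postings"])  # 3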
The creation of an inverted file (cont.)
4 major options for storing weights in the postings file
Store the raw frequency
  Slowest search, but most flexible
Store a normalized frequency
  Not suitable for use with the cosine similarity function
  Updating would not change the postings
The creation of an inverted file (cont.)
Store the completely weighted term
  Any of the combination weighting schemes is suitable
  Disadvantage: updating requires changing all postings
Store no weights at all
  If no within-record weighting is used, the postings records do not have to store weights
Searching the inverted file
Figure 14.4 Flowchart of the search engine (bracketed items are the data passed between steps):

query -> parser -> dictionary lookup -> [dictionary entry] -> get weights -> [record numbers on a per-term basis] -> accumulator -> [record numbers, total weights] -> sort by weight -> ranked record numbers
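A sketch of this flow, assuming the illustrative dictionary layout from the inverted-file sketch above (term -> IDF and postings):

def search(query_terms, dictionary):
    accumulators = {}  # record id -> total weight
    for term in query_terms:
        entry = dictionary.get(term)  # dictionary lookup
        if entry is None:
            continue                  # unknown terms contribute nothing
        for rid, weight in entry["postings"]:  # get weights per record
            accumulators[rid] = accumulators.get(rid, 0.0) + weight * entry["idf"]
    # sort by total weight to produce the ranked record numbers
    return sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)

inv = {"human":   {"idf": 0.41, "postings": [(1, 1), (2, 1)]},
       "factors": {"idf": 0.00, "postings": [(1, 1), (2, 1), (3, 1)]},
       "systems": {"idf": 0.41, "postings": [(2, 1), (3, 1)]}}
print(search(["human", "factors", "systems"], inv))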
Searching the inverted file (cont.)
Inefficiencies of this technique
The I/O needs to be minimized: do a single read for all the postings of a given term, then separate the buffer into record ids and weights
Time savings can be gained at the expense of some memory space: direct access to memory rather than through hashing
A final major bottleneck can be the sort step of the "accumulators" for large data sets; a fast sort of thousands of records is very time consuming
Stemmed and unstemmed query terms
If query terms are automatically stemmed in a ranking system, users generally get better results (Frakes, 1984; Candela, 1990)
In some cases, however, a stem is produced that leads to improper results
The original record terms are not stored in the inverted file; only their stems are used
Stemmed and unstemmed query terms (cont.)
Harman & Candela (1990)
Two separate inverted files could be created and stored: one of stemmed terms (for normal queries) and one of unstemmed terms (for "don't stem" queries)
Hybrid inverted file (sketched below)
  Saves no space in the dictionary part
  Saves considerable storage compared with keeping two versions of the postings
  At the expense of some additional search time
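One way to picture the hybrid file (a hypothetical layout, not necessarily Harman & Candela's actual format): postings are stored once, for the unstemmed terms, and a stem's dictionary entry points at all of the unstemmed variants that share it, so the dictionary grows but the postings are not duplicated:

# Hypothetical hybrid layout: postings stored once, keyed by unstemmed term
postings = {"computing": [(1, 2), (4, 1)],   # (record id, weight)
            "computer":  [(2, 3)]}
# Dictionary holds both stemmed and unstemmed entries (no dictionary savings);
# the stem entry merely lists the unstemmed terms whose postings it shares
dictionary = {"computing": ["computing"],
              "computer":  ["computer"],
              "comput":    ["computing", "computer"]}  # stem entry

def lookup(term):
    # Stemmed queries pay a little extra search time to merge shared postings
    return [p for t in dictionary.get(term, []) for p in postings[t]]

print(lookup("comput"))     # merged postings for the stem
print(lookup("computing"))  # exact, unstemmed postings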
A Boolean system with ranking
SIRE system
Full Boolean capability + a variation of the basic search process
Accepts queries that are either Boolean logic strings or natural language queries (implicit OR)
Major modification to the basic search process: merging the postings from the query terms before ranking is done
Performance
  Faster response time for Boolean queries
  No increase in response time for natural language queries
Pruning
A major time bottleneck in the basic search process is the sort of the accumulators for large data sets
Changed search algorithm with pruning:
1. Sort all query terms (stems) by decreasing IDF value
2. Do a binary search for the first term (i.e., the one with the highest IDF) and get the address of the postings list for that term
3. Read the entire postings file for that term into a buffer and add the term weights for each record id into the contents of the unique accumulator for that record id
Pruning (cont.)
4. Check the IDF of the next query term. If its IDF >= 1/3 of the maximum IDF of any term in the data set, repeat steps 2, 3, and 4; otherwise repeat steps 2, 3, and 4, but do not add weights to zero-weight accumulators
5. Sort the accumulators with nonzero weights to produce the final ranked record list
6. If a query has only high-frequency terms, then pruning cannot be done
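A sketch of steps 1-5 under the same illustrative dictionary layout as the earlier search sketch; the max_idf / 3 comparison implements the step-4 threshold:

def pruned_search(query_terms, dictionary, max_idf):
    # Step 1: sort query terms by decreasing IDF
    terms = sorted((t for t in query_terms if t in dictionary),
                   key=lambda t: dictionary[t]["idf"], reverse=True)
    accumulators = {}
    for term in terms:
        entry = dictionary[term]
        # Step 4: low-IDF (high-frequency) terms may not open new accumulators
        prune = entry["idf"] < max_idf / 3
        for rid, weight in entry["postings"]:
            if prune and rid not in accumulators:
                continue
            accumulators[rid] = accumulators.get(rid, 0.0) + weight * entry["idf"]
    # Step 5: sort the nonzero accumulators for the final ranked list
    return sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)

inv = {"human": {"idf": 2.0, "postings": [(1, 3), (2, 1)]},
       "the":   {"idf": 0.1, "postings": [(1, 9), (2, 7), (3, 8)]}}
print(pruned_search(["the", "human"], inv, max_idf=2.0))  # record 3 is never ranked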
Thanks