Transcript Document
Web search engines
Rooted in Information Retrieval (IR) systems
• Prepare a keyword index for the corpus
• Respond to keyword queries with a ranked list of
documents.
ARCHIE
• Earliest application of rudimentary IR systems to
the Internet
• Title search across sites serving files over FTP
Boolean queries: Examples
Simple queries involving relationships
between terms and documents
• Documents containing the word Java
• Documents containing the word Java but not
the word coffee
Proximity queries
• Documents containing the phrase Java beans
or the term API
• Documents where Java and island occur in
the same sentence
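The first three example queries can be sketched with plain set operations. This is only an illustration: the toy corpus, the document IDs, and the helper names `containing` and `phrase` are assumptions, not part of the slides.

```python
import re

# Toy corpus; the documents and their IDs are made up for illustration.
docs = {
    1: "Java is an island of Indonesia.",
    2: "Java beans make strong coffee.",
    3: "The Java API documents each class.",
}

def tokens(text):
    """Lowercased word tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Map each term to the set of documents that contain it.
index = {}
for did, text in docs.items():
    for term in tokens(text):
        index.setdefault(term, set()).add(did)

def containing(term):
    """Documents containing the given word."""
    return index.get(term.lower(), set())

def phrase(first, second):
    """Documents in which `second` immediately follows `first`."""
    hits = set()
    for did in containing(first) & containing(second):
        toks = tokens(docs[did])
        if any(a == first and b == second for a, b in zip(toks, toks[1:])):
            hits.add(did)
    return hits

java = containing("java")                                   # word Java
java_not_coffee = java - containing("coffee")               # Java but not coffee
beans_or_api = phrase("java", "beans") | containing("api")  # phrase or term
```

Negation becomes set difference and disjunction becomes set union, which is why a term-to-documents map is enough for simple Boolean queries.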
Mining the Web
Chakrabarti and Ramakrishnan
2
Document preprocessing
Tokenization
• Filtering away tags
• Tokens regarded as nonempty sequence of
characters excluding spaces and
punctuations.
• Token represented by a suitable integer, tid,
typically 32 bits
• Optional: stemming/conflation of words
• Result: document (did) transformed into a
sequence of integers (tid, pos)
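A minimal sketch of this preprocessing step, assuming a regular expression is good enough to strip tags and that tids are handed out in order of first appearance (both are simplifications chosen for the example; stemming is omitted):

```python
import re

tids = {}  # token string -> integer token ID (tid), assigned on first sight

def tokenize(html):
    """Turn a document into a list of (tid, pos) pairs."""
    text = re.sub(r"<[^>]*>", " ", html)                # filter away tags
    words = re.findall(r"[A-Za-z0-9]+", text.lower())   # drop spaces, punctuation
    return [(tids.setdefault(w, len(tids)), pos) for pos, w in enumerate(words)]

postings = tokenize("<p>My care is loss of care.</p>")
```

The repeated word "care" receives the same tid at both positions, so the document becomes a sequence of small integers rather than strings.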
Storing tokens
Straightforward implementation using a
relational database
• Example figure
• Space scales to almost 10 times
Accesses to table show common pattern
• Reduce the storage by mapping tids to a
lexicographically sorted buffer of (did, pos)
tuples.
• Indexing = transposing document-term matrix
Two variants of the inverted index data structure, usually stored on disk. The simpler
version in the middle does not store term offset information; the version to the right stores
term offsets. The mapping from terms to documents and positions (written as
“document/position”) may be implemented using a B-tree or a hash-table.
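The two variants in the figure can be sketched in memory as follows. The two toy documents are assumptions for the example, and Python dictionaries stand in for the B-tree or hash-table that an on-disk index would use.

```python
from collections import defaultdict

# Toy documents keyed by document ID (did); contents are illustrative only.
docs = {1: "my care is loss of care", 2: "by old care done"}

doc_index = defaultdict(set)    # term -> {did}          (no offsets)
pos_index = defaultdict(list)   # term -> [(did, pos)]   (with offsets)

for did in sorted(docs):
    for pos, term in enumerate(docs[did].split()):
        doc_index[term].add(did)
        pos_index[term].append((did, pos))
```

Because documents are visited in did order and positions in increasing order, each postings list in `pos_index` comes out lexicographically sorted by (did, pos).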
Storage
For dynamic corpora
• Berkeley DB2 storage manager
• Can frequently add, modify and delete
documents
For static collections
• Index compression techniques (to be
discussed)
Stopwords
Function words and connectives
Appear in a large number of documents and are of little
use in pinpointing documents
Indexing stopwords
• Stopwords not indexed
For reducing index space and improving performance
• Replace stopwords with a placeholder (to remember
the offset)
Issues
• Queries containing only stopwords ruled out
• Polysemous words that are stopwords in one sense
but not in others
E.g.: can as a verb vs. can as a noun
Stemming
Conflating words to help match a query term with a
morphological variant in the corpus.
Remove inflections that convey parts of speech, tense
and number
E.g.: university and universal both stem to universe.
Techniques
• morphological analysis (e.g., Porter's algorithm)
• dictionary lookup (e.g., WordNet).
Stemming may increase recall but at the price of
precision
• Abbreviations, polysemy and names coined in the technical and
commercial sectors
• E.g.: Stemming “ides” to “IDE”, “SOCKS” to “sock”, “gated” to
“gate”, may be bad !
Batch indexing and updates
Incremental indexing
• Time-consuming due to random disk IO
• High level of disk block fragmentation
Simple sort-merges
• To replace the indexed update of variable-length postings
For a dynamic collection
• Single document-level change may need to
update hundreds to thousands of records.
• Solution: create an additional “stop-press”
index.
Maintaining indices over dynamic collections.
Stop-press index
Collection of documents in flux
• Model document modification as deletion followed by insertion
• Documents in flux represented by a signed record (d,t,s)
• “s” specifies if “d” has been deleted or inserted.
Getting the final answer to a query
• Main index returns a document set D0.
• Stop-press index returns two document sets
D+ : documents not yet indexed in D0 matching the query
D- : documents matching the query removed from the collection
since D0 was constructed.
Stop-press index getting too large
• Rebuild the main index
signed (d, t, s) records are sorted in (t, d, s) order and merge-purged into the master (t, d) records
• Stop-press index can be emptied out.
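One possible way to combine the three sets for a single-term query is sketched below. The rule used here, add D+ to D0 and then remove D-, and the convention that the latest signed record for a (d, t) pair wins, are assumptions made for this example rather than details stated on the slide.

```python
def query(main_index, stop_press, term):
    """Combine main-index and stop-press answers for a single-term query.

    main_index: dict mapping term -> set of document IDs (gives D0)
    stop_press: list of signed (d, t, s) records in arrival order,
                with s equal to "+" (inserted) or "-" (deleted)
    """
    d0 = set(main_index.get(term, ()))
    latest = {}
    for d, t, s in stop_press:
        if t == term:
            latest[d] = s  # assumption: a later record overrides an earlier one
    d_plus = {d for d, s in latest.items() if s == "+"}
    d_minus = {d for d, s in latest.items() if s == "-"}
    return (d0 | d_plus) - d_minus

main_index = {"java": {1, 2, 3}}
stop_press = [(2, "java", "-"), (7, "java", "+"),
              (3, "java", "-"), (3, "java", "+")]  # document 3 was modified
result = query(main_index, stop_press, "java")
```

Document 3 shows the "deletion followed by insertion" model: its delete record is superseded by the later insert, so it stays in the answer.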
Relevance ranking
Keyword queries
• In natural language
• Not precise, unlike SQL
Boolean decision for response unacceptable
• Solution
Rate each document for how likely it is to satisfy the user's
information need
Sort in decreasing order of the score
Present results in a ranked list.
No algorithmic way of ensuring that the ranking
strategy always favors the information need
• Query: only a part of the user's information need
Responding to queries
Set-valued response
• Response set may be very large
(E.g., by recent estimates, over 12 million Web
pages contain the word java.)
Demanding selective query from user
Guessing user's information need and
ranking responses
Evaluating rankings
Evaluating procedure
Given benchmark
• Corpus of n documents D
• A set of queries Q
• For each query $q \in Q$, an exhaustive set of
relevant documents $D_q \subseteq D$ identified
manually
Query submitted to system
• Ranked list of documents $(d_1, d_2, \ldots, d_n)$
retrieved
• Compute a 0/1 relevance list $(r_1, r_2, \ldots, r_n)$
$r_i = 1$ iff $d_i \in D_q$
$r_i = 0$ otherwise.
Recall and precision
Recall at rank $k \ge 1$
• Fraction of all relevant documents included in
$(d_1, d_2, \ldots, d_k)$
• $\mathrm{recall}(k) = \frac{1}{|D_q|} \sum_{1 \le i \le k} r_i$
Precision at rank k
• Fraction of the top k responses that are
actually relevant.
• $\mathrm{precision}(k) = \frac{1}{k} \sum_{1 \le i \le k} r_i$
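Both measures are short sums over the 0/1 relevance list. The sketch below assumes the list `r` is ordered by rank and that the number of relevant documents |Dq| is supplied separately; the example values are hypothetical.

```python
def recall_at(r, k, num_relevant):
    """Fraction of all relevant documents found in the top k."""
    return sum(r[:k]) / num_relevant

def precision_at(r, k):
    """Fraction of the top k responses that are relevant."""
    return sum(r[:k]) / k

# Hypothetical relevance list for a query with |Dq| = 2 relevant documents.
r = [1, 0, 1, 0, 0]
recall_3 = recall_at(r, 3, 2)       # both relevant documents are in the top 3
precision_3 = precision_at(r, 3)    # two of the top three are relevant
```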
Other measures
Average precision
• Sum of precision at each relevant hit position in the
response list, divided by the total number of relevant
documents
• $\mathrm{avg.precision} = \frac{1}{|D_q|} \sum_{1 \le k \le |D|} r_k \cdot \mathrm{precision}(k)$
• avg.precision = 1 iff engine retrieves all relevant
documents and ranks them ahead of any irrelevant
document
Interpolated precision
• To combine precision values from multiple queries
• Gives precision-vs.-recall curve for the benchmark.
For each query, take the maximum precision obtained for the
query for any recall greater than or equal to a given recall level $\rho$
average them together for all queries
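A small sketch of both measures, under the assumption that each query is described by its ranked 0/1 relevance list and its number of relevant documents. The function names and the choice to return 0 when no rank reaches the requested recall level are choices made for this example.

```python
def precision_at(r, k):
    return sum(r[:k]) / k

def average_precision(r, num_relevant):
    """Sum of precision at each relevant hit, divided by |Dq|."""
    hits = [precision_at(r, k) for k in range(1, len(r) + 1) if r[k - 1]]
    return sum(hits) / num_relevant

def interpolated_precision(r, num_relevant, rho):
    """Maximum precision over all ranks whose recall is at least rho."""
    candidates = [precision_at(r, k) for k in range(1, len(r) + 1)
                  if sum(r[:k]) / num_relevant >= rho]
    return max(candidates, default=0.0)

def benchmark_curve(queries, levels):
    """Average interpolated precision over queries at each recall level.

    queries: list of (relevance_list, num_relevant) pairs, one per query.
    """
    return [sum(interpolated_precision(r, n, rho) for r, n in queries)
            / len(queries) for rho in levels]

curve = benchmark_curve([([1, 0, 1, 0, 0], 2), ([0, 1, 0, 0, 0], 1)],
                        [0.5, 1.0])
```

Taking the maximum over all ranks with recall at least rho is what makes the interpolated curve non-increasing in recall.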
Precision-Recall tradeoff
Interpolated precision cannot increase with
recall
• Interpolated precision at recall level 0 may be less
than 1
At level k = 0
• Precision (by convention) = 1, Recall = 0
Inspecting more documents
• Can increase recall
• Precision may decrease
We will start encountering more and more irrelevant
documents
Search engine with a good ranking function will
generally show a negative relation between
recall and precision.
• The higher the curve, the better the engine
Precision and interpolated precision plotted against recall for
the given relevance vector. Missing $r_k$ are zeroes.
The vector space model
Documents represented as vectors in a
multi-dimensional Euclidean space
• Each axis = a term (token)
Coordinate of document d in direction of
term t determined by:
• Term frequency TF(d,t)
number of times term t occurs in document d,
scaled in a variety of ways to normalize document
length
• Inverse document frequency IDF(t)
to scale down the coordinates of terms that occur
in many documents
Term frequency
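With n(d, t) the number of times term t occurs in document d:
$\mathrm{TF}(d,t) = \frac{n(d,t)}{\sum_\tau n(d,\tau)}$
$\mathrm{TF}(d,t) = \frac{n(d,t)}{\max_\tau n(d,\tau)}$
Cornell SMART system uses a smoothed
version
$\mathrm{TF}(d,t) = 0$ if $n(d,t) = 0$
$\mathrm{TF}(d,t) = 1 + \log(1 + n(d,t))$ otherwise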
Inverse document frequency
Given
• D is the document collection and Dt is the set
of documents containing t
Formulae
• mostly dampened functions of $\frac{|D|}{|D_t|}$
• SMART
$\mathrm{IDF}(t) = \log\left(\frac{1 + |D|}{|D_t|}\right)$
Vector space model
Coordinate of document d in axis t
• $d_t = \mathrm{TF}(d,t)\,\mathrm{IDF}(t)$
• Transformed to $\vec{d}$ in the TFIDF-space
Query q
• Interpreted as a document
• Transformed to $\vec{q}$ in the same TFIDF-space
as $\vec{d}$
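A sketch of the transformation, assuming the smoothed TF form 1 + log(1 + n(d,t)) and the IDF form log((1 + |D|) / |Dt|), with natural logarithms. The three-document corpus is made up, and dropping query terms that never occur in the corpus is a simplification chosen for the example.

```python
import math
from collections import Counter

# Toy corpus: document ID -> token list (illustrative only).
corpus = {
    "d1": "java island coffee".split(),
    "d2": "java api java beans".split(),
    "d3": "coffee beans".split(),
}

# Document frequency |Dt|: number of documents containing each term.
df = Counter(t for toks in corpus.values() for t in set(toks))

def tf(count):
    return 0.0 if count == 0 else 1.0 + math.log(1.0 + count)

def idf(term):
    return math.log((1 + len(corpus)) / df[term])

def vector(tokens):
    """Sparse TFIDF vector (term -> coordinate) for a document or a query."""
    return {t: tf(n) * idf(t) for t, n in Counter(tokens).items() if df[t] > 0}

d2 = vector(corpus["d2"])
q = vector("java api".split())   # the query is treated as a short document
```

Here "api" occurs in one document while "java" occurs in two, so the "api" axis gets the larger IDF factor.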
Measures of proximity
Distance measure
• Magnitude of the vector difference $|\vec{d} - \vec{q}|$
• Document vectors must be normalized to unit
length
Else shorter documents dominate (since queries
are short)
Cosine similarity
• cosine of the angle between $\vec{d}$ and $\vec{q}$
Shorter documents are penalized
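Cosine similarity over sparse vectors can be sketched as below. Representing a vector as a dictionary from term to weight, and returning 0 for an all-zero vector, are assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

same_direction = cosine({"java": 1.0, "api": 1.0}, {"java": 2.0, "api": 2.0})
no_overlap = cosine({"java": 1.0}, {"coffee": 3.0})
```

Dividing by both norms means a document that repeats the same term mix twice scores the same as the original, unlike the raw vector difference.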
Relevance feedback
Users learning how to modify queries
• Response list must have at least some relevant
documents
• Relevance feedback
`correcting' the ranks to the user's taste
automates the query refinement process
Rocchio's method
• Folding-in user feedback
• To query vector $\vec{q}$
Add a weighted sum of vectors for relevant documents D+
Subtract a weighted sum of the irrelevant documents D-
• $\vec{q}\,' = \alpha \vec{q} + \beta \sum_{\vec{d} \in D_+} \vec{d} - \gamma \sum_{\vec{d} \in D_-} \vec{d}$
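A sketch of the query update on sparse vectors, writing the three weights as `alpha` (original query), `beta` (relevant documents D+) and `gamma` (irrelevant documents D-). The default weight values below are placeholders for the example, not recommended settings.

```python
def rocchio(q, relevant, irrelevant, alpha=1.0, beta=0.5, gamma=0.25):
    """Return alpha*q + beta*sum(D+) - gamma*sum(D-) as a sparse vector.

    q: dict term -> weight; relevant / irrelevant: lists of such dicts.
    """
    new_q = {t: alpha * w for t, w in q.items()}
    for sign, docs, weight in ((+1, relevant, beta), (-1, irrelevant, gamma)):
        for d in docs:
            for t, w in d.items():
                new_q[t] = new_q.get(t, 0.0) + sign * weight * w
    return new_q

q_new = rocchio(
    {"java": 1.0},
    relevant=[{"java": 1.0, "island": 2.0}],
    irrelevant=[{"coffee": 1.0}],
)
```

The refined query gains weight on "island", a term the user never typed, and a negative weight on "coffee".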
Relevance feedback (contd.)
Pseudo-relevance feedback
• D+ and D- generated automatically
E.g.: Cornell SMART system
top 10 documents reported by the first round of
query execution are included in D+
• $\gamma$ (the weight on D-) typically set to 0; D- not used
Not a commonly available feature
• Web users want instant gratification
• System complexity
Executing the second round query slower and
expensive for major search engines
Bayesian Inferencing
Bayesian inference network for relevance ranking. A
document is relevant to the extent that setting its
corresponding belief node to true lets us assign a high
degree of belief in the node corresponding to the query.
Manual specification of
mappings between terms
to approximate concepts.
Bayesian Inferencing (contd.)
Four layers
1. Document layer
2. Representation layer
3. Query concept layer
4. Query
Each node is associated with a random
Boolean variable, reflecting belief
Directed arcs signify that the belief of a
node is a function of the belief of its
immediate parents (and so on..)
Bayesian Inferencing systems
Layers 2 & 3 are the same for basic vector-space IR
systems
Verity's Search97
• Allows administrators and users to define
hierarchies of concepts in files
Estimation of relevance of a document d
w.r.t. the query q
• Set the belief of the corresponding node to 1
• Set all other document beliefs to 0
• Compute the belief of the query
• Rank documents in decreasing order of belief
that they induce in the query
Other issues
Spamming
• Adding popular query terms to a page unrelated to
those terms
• E.g.: Adding “Hawaii vacation rental” to a page about
“Internet gambling”
• Little setback due to hyperlink-based ranking
Titles, headings, meta tags and anchor-text
• TFIDF framework treats all terms the same
• Meta search engines:
Assign weightage to text occurring in tags, meta-tags
• Using anchor-text on pages u which link to v
Anchor-text on u offers valuable editorial judgment about v as
well.
Other issues (contd..)
Including phrases to rank complex queries
• Operators to specify word inclusions and
exclusions
• With operators and phrases
queries/documents can no longer be treated
as ordinary points in vector space
Dictionary of phrases
• Could be cataloged manually
• Could be derived from the corpus itself using
statistical techniques
• Two separate indices:
one for single terms and another for phrases
Corpus derived phrase dictionary
Two terms $t_1$ and $t_2$
Null hypothesis = occurrences
of $t_1$ and $t_2$ are independent
To the extent the pair violates
the null hypothesis, it is likely
to be a phrase
• Measuring violation
with likelihood ratio of
the hypothesis
• Pick phrases that
violate the null
hypothesis with large
confidence
Contingency table built from
statistics
$k_{00} = k(\bar{t}_1, \bar{t}_2)$   $k_{01} = k(\bar{t}_1, t_2)$
$k_{10} = k(t_1, \bar{t}_2)$   $k_{11} = k(t_1, t_2)$
Corpus derived phrase dictionary
Hypotheses
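• Unconstrained (alternative) hypothesis
$H(p_{00}, p_{01}, p_{10}, p_{11}; k_{00}, k_{01}, k_{10}, k_{11}) = p_{00}^{k_{00}} p_{01}^{k_{01}} p_{10}^{k_{10}} p_{11}^{k_{11}}$
• Null hypothesis (occurrences of $t_1$ and $t_2$ independent)
$H(p_1, p_2; k_{00}, k_{01}, k_{10}, k_{11}) = ((1 - p_1)(1 - p_2))^{k_{00}} ((1 - p_1) p_2)^{k_{01}} (p_1 (1 - p_2))^{k_{10}} (p_1 p_2)^{k_{11}}$
• Likelihood ratio
$\lambda = \frac{\max_{p \in \Pi_0} H(p; k)}{\max_{p \in \Pi} H(p; k)}$
where $\Pi_0$ is the parameter space of the null hypothesis and $\Pi$ is the full parameter space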
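The ratio can be computed in log space by plugging in maximum-likelihood estimates. The sketch assumes the table convention in which k11 counts co-occurrences of both terms, k10 and k01 count one term without the other, and k00 counts neither; the function name and the example counts are hypothetical.

```python
import math

def log_likelihood_ratio(k00, k01, k10, k11):
    """log(lambda): log-likelihood under independence minus the
    log-likelihood of the unconstrained model. Assumes a nonempty table."""
    n = k00 + k01 + k10 + k11
    p1 = (k10 + k11) / n   # estimated probability of seeing the first term
    p2 = (k01 + k11) / n   # estimated probability of seeing the second term

    def klogp(k, p):
        # a cell with zero count contributes nothing to the log-likelihood
        return k * math.log(p) if k > 0 else 0.0

    log_null = (klogp(k00, (1 - p1) * (1 - p2)) + klogp(k01, (1 - p1) * p2)
                + klogp(k10, p1 * (1 - p2)) + klogp(k11, p1 * p2))
    log_full = sum(klogp(k, k / n) for k in (k00, k01, k10, k11))
    return log_null - log_full

independent = log_likelihood_ratio(4, 2, 2, 1)     # counts factorize exactly
associated = log_likelihood_ratio(50, 0, 0, 50)    # terms always co-occur
```

A value near zero means the independence model explains the counts as well as the unconstrained one; a strongly negative value marks the pair as a phrase candidate.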
Approximate string matching
Non-uniformity of word spellings
• dialects of English
• transliteration from other languages
Two ways to reduce this problem.
1. Aggressive conflation mechanism to
collapse variant spellings into the same
token
2. Decompose terms into a sequence of q-grams or sequences of q characters
Approximate string matching
1. Aggressive conflation mechanism to collapse
variant spellings into the same token
• E.g.: Soundex : takes phonetics and pronunciation details
into account
• used with great success in indexing and searching last
names in census and telephone directory data.
2. Decompose terms into a sequence of q-grams
or sequences of q characters
• Check for similarity in the q-grams ($2 \le q \le 4$)
• Looking up the inverted index : a two-stage affair:
• Smaller index of q-grams consulted to expand each query
term into a set of slightly distorted query terms
• These terms are submitted to the regular index
• Used by Google for spelling correction
• Idea also adopted for eliminating near-duplicate pages
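The first stage of the two-stage lookup can be sketched as follows. The vocabulary, the choice q = 3, and the overlap threshold are assumptions for illustration; the slides do not say how similarity between q-gram sets is scored.

```python
def qgrams(term, q=3):
    """Set of all substrings of length q in the term."""
    return {term[i:i + q] for i in range(len(term) - q + 1)}

# Hypothetical vocabulary of indexed terms with variant spellings.
vocabulary = ["ramakrishnan", "ramakrishna", "chakrabarti", "chakraborty"]

# Smaller index: q-gram -> vocabulary terms containing it.
qgram_index = {}
for word in vocabulary:
    for g in qgrams(word):
        qgram_index.setdefault(g, set()).add(word)

def expand(query_term, min_overlap=0.5):
    """Vocabulary terms sharing at least min_overlap of the query's q-grams."""
    grams = qgrams(query_term)
    counts = {}
    for g in grams:
        for w in qgram_index.get(g, ()):
            counts[w] = counts.get(w, 0) + 1
    return {w for w, c in counts.items() if c / len(grams) >= min_overlap}

variants = expand("chakrabarty")
```

The expanded set would then be submitted to the regular term index as the second stage.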
Meta-search systems
• Take the search engine to the document
• Forward queries to many geographically distributed
repositories
• Each has its own search service
• Consolidate their responses.
• Advantages
• Perform non-trivial query rewriting
• Suit a single user query to many search engines with
different query syntax
• Surprisingly small overlap between crawls
• Consolidating responses
• Function goes beyond just eliminating duplicates
• Search services do not provide standard ranks which
can be combined meaningfully
Similarity search
• Cluster hypothesis
• Documents similar to relevant documents are
also likely to be relevant
• Handling “find similar” queries
• Replication or duplication of pages
• Mirroring of sites
Document similarity
• Jaccard coefficient of similarity between
documents $d_1$ and $d_2$
• T(d) = set of tokens in document d
• $r'(d_1, d_2) = \frac{|T(d_1) \cap T(d_2)|}{|T(d_1) \cup T(d_2)|}$
• Symmetric, reflexive, not a metric
• Forgives any number of occurrences and any
permutations of the terms.
• $1 - r'(d_1, d_2)$ is a metric
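The coefficient is a one-line set computation. Returning 1.0 when both token sets are empty is a convention chosen for this sketch.

```python
def jaccard(tokens1, tokens2):
    """Jaccard coefficient of the token sets of two documents."""
    t1, t2 = set(tokens1), set(tokens2)
    if not t1 and not t2:
        return 1.0  # convention chosen here for two empty documents
    return len(t1 & t2) / len(t1 | t2)

half = jaccard("a b c".split(), "b c d".split())
reordered = jaccard("a b b c".split(), "c b a".split())
```

The second call illustrates that repeated occurrences and word order do not affect the score.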
Estimating Jaccard coefficient with
random permutations
1. Generate a set of m random
permutations $\pi$
2. for each $\pi$ do
3. compute $\pi(d_1)$ and $\pi(d_2)$
4. check if $\min \pi(T(d_1)) = \min \pi(T(d_2))$
5. end for
6. if equality was observed in k cases,
estimate $r'(d_1, d_2) = k/m$
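The estimator can be sketched directly from the steps above. Here each permutation is drawn by shuffling ranks over the union of the two token sets, which is an assumption that only works for small in-memory examples; both sets are assumed nonempty.

```python
import random

def estimate_jaccard(t1, t2, m=200, seed=0):
    """Estimate the Jaccard coefficient of two nonempty token sets
    as the fraction of m random permutations with equal minima."""
    rng = random.Random(seed)
    universe = sorted(t1 | t2)
    k = 0
    for _ in range(m):
        ranks = list(range(len(universe)))
        rng.shuffle(ranks)
        pi = dict(zip(universe, ranks))   # one random permutation
        if min(pi[t] for t in t1) == min(pi[t] for t in t2):
            k += 1
    return k / m

estimate = estimate_jaccard({1, 2, 3}, {2, 3, 4}, m=2000)  # true value is 0.5
```

The minima agree exactly when the lowest-ranked element of the union lies in the intersection, which happens with probability equal to the Jaccard coefficient.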
Fast similarity search with random
permutations
1. for each random permutation $\pi$ do
2. create a file $f_\pi$
3. for each document d do
4. write out $\langle s = \min \pi(T(d)), d \rangle$ to $f_\pi$
5. end for
6. sort $f_\pi$ using key s--this results in contiguous blocks with fixed
s containing all associated ds
7. create a file $g_\pi$
8. for each pair $(d_1, d_2)$ within a run of $f_\pi$ having a given s do
9. write out a document-pair record $(d_1, d_2)$ to $g_\pi$
10. end for
11. sort $g_\pi$ on key $(d_1, d_2)$
12. end for
13. merge $g_\pi$ for all $\pi$ in $(d_1, d_2)$ order, counting the number of $(d_1, d_2)$
entries
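An in-memory sketch of the same procedure, in which dictionaries and a counter stand in for the sorted files and the final merge. The function name, the number of permutations, and the example token sets are assumptions; each document is assumed to have a nonempty token set.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations

def similar_pairs(token_sets, m=100, seed=0):
    """Estimate the Jaccard coefficient for every document pair that
    shares a minimum under at least one of m random permutations."""
    rng = random.Random(seed)
    universe = sorted(set().union(*token_sets.values()))
    pair_counts = Counter()                 # plays the role of the merged g files
    for _ in range(m):
        ranks = list(range(len(universe)))
        rng.shuffle(ranks)
        pi = dict(zip(universe, ranks))
        runs = defaultdict(list)            # s -> documents with that minimum
        for d, toks in token_sets.items():
            runs[min(pi[t] for t in toks)].append(d)
        for docs in runs.values():          # all pairs within one run
            for pair in combinations(sorted(docs), 2):
                pair_counts[pair] += 1
    return {pair: k / m for pair, k in pair_counts.items()}

pairs = similar_pairs({"a": {1, 2, 3}, "b": {1, 2, 3}, "c": {7, 8, 9}})
```

Only pairs that collide at least once are ever written out, so disjoint documents never generate a record.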
Eliminating near-duplicates via shingling
• “Find-similar” algorithm reports all duplicate/near-duplicate pages
• Eliminating duplicates
• Maintain a checksum with every page in the corpus
• Eliminating near-duplicates
• Represent each document as a set T(d) of q-grams (shingles)
• Find Jaccard similarity $r'(d_1, d_2)$ between $d_1$ and $d_2$
• Eliminate the pair from step 9 if it has similarity above a
threshold
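A sketch of near-duplicate detection with shingles. It assumes word-level shingles (runs of q consecutive words) and an arbitrary threshold of 0.5; both are choices for the example, and character-level q-grams would work the same way.

```python
def shingles(text, q=2):
    """Set of q-word shingles of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + q]) for i in range(len(words) - q + 1)}

def shingle_similarity(text1, text2, q=2):
    """Jaccard similarity of the two shingle sets (0.0 if both are empty)."""
    s1, s2 = shingles(text1, q), shingles(text2, q)
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def near_duplicates(text1, text2, threshold=0.5, q=2):
    return shingle_similarity(text1, text2, q) > threshold

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"
similarity = shingle_similarity(a, b)
```

Changing one word alters only the shingles that contain it, so lightly edited copies keep a high similarity.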
Detecting locally similar sub-graphs of the
Web
• Similarity search and duplicate elimination on the
graph structure of the web
• To improve quality of hyperlink-assisted ranking
• Detecting mirrored sites
• Approach 1 [Bottom-up Approach]
1. Start process with textual duplicate detection
• cleaned URLs are listed and sorted to find duplicates/near-duplicates
• each set of equivalent URLs is assigned a unique token ID
• each page is stripped of all text, and represented as a sequence
of outlink IDs
2. Continue using link sequence representation
3. Until no further collapse of multiple URLs are possible
• Approach 2 [Bottom-up Approach]
1. identify single nodes which are near duplicates (using text-shingling)
2. extend single-node mirrors to two-node mirrors
3. continue on to larger and larger graphs which are likely mirrors of
one another
Detecting mirrored sites (contd.)
• Approach 3 [Step before fetching all pages]
• Uses regularity in URL strings to identify host-pairs which are
mirrors
• Preprocessing
• Hosts are represented as sets of positional bigrams
• Convert host and path to all lowercase characters
• Let any punctuation or digit sequence be a token separator
• Tokenize the URL into a sequence of tokens (e.g.,
www6.infoseek.com gives www, infoseek, com)
• Eliminate stop terms such as htm, html, txt, main, index, home,
bin, cgi
• Form positional bigrams from the token sequence
• Two hosts are said to be mirrors if
• A large fraction of paths are valid on both web sites
• These common paths link to pages that are near-duplicates.
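The preprocessing steps can be sketched as below. The slide does not define a positional bigram precisely; this sketch assumes it is a pair of adjacent tokens tagged with the position of the first token, and it treats every non-letter character as a separator.

```python
import re

STOP_TERMS = {"htm", "html", "txt", "main", "index", "home", "bin", "cgi"}

def url_tokens(url):
    """Lowercase the URL, split on punctuation or digit runs, drop stop terms."""
    tokens = re.split(r"[^a-z]+", url.lower())
    return [t for t in tokens if t and t not in STOP_TERMS]

def positional_bigrams(tokens):
    """Adjacent token pairs tagged with their position (assumed definition)."""
    return {(i, a, b) for i, (a, b) in enumerate(zip(tokens, tokens[1:]))}

host_tokens = url_tokens("www6.infoseek.com")
bigrams = positional_bigrams(host_tokens)
```

Two hosts whose bigram sets overlap heavily would then be candidate mirrors, to be confirmed by the path and near-duplicate checks listed above.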