Search Engines & Question Answering
Giuseppe Attardi
Università di Pisa
Topics
Web Search
– Search engines
– Architecture
– Crawling: parallel/distributed, focused
– Link analysis (Google PageRank)
– Scaling
Top Online Activities
– Email: 96%
– Web Search: 88%
– Product Info. Search: 72%
Source: Jupiter Communications, 2000
Pew Study (US users July 2002)
Total Internet users = 111 M
Do a search on any given day = 33 M
Have used Internet to search = 85%
http://www.pewinternet.org/reports/toc.asp?Report=64
Search on the Web
Corpus: the publicly accessible Web: static + dynamic
Goal: retrieve high quality results relevant to the user’s need
– (not docs!)
Need
– Informational – want to learn about something (~40%): Low hemoglobin
– Navigational – want to go to that page (~25%): United Airlines
– Transactional – want to do something (web-mediated) (~35%)
  • Access a service: Tampere weather
  • Downloads: Mars surface images
  • Shop: Nikon CoolPix
– Gray areas: Car rental Finland
  • Find a good hub
  • Exploratory search “see what’s there”
Results
Static pages (documents)
– text, mp3, images, video, ...
Dynamic pages = generated on request
– database access
– “the invisible web”
– proprietary content, etc.
Terminology
URL = Uniform Resource Locator
http://www.cism.it/cism/hotels_2001.htm
– Access method: http
– Host name: www.cism.it
– Page name: cism/hotels_2001.htm
Scale
Immense amount of content
– 2-10 B static pages, doubling every 8-12 months
– Lexicon size: 10s-100s of millions of words
Authors galore (1 in 4 hosts runs a web server)
http://www.netcraft.com/Survey
Diversity
Languages/Encodings
– Hundreds (thousands?) of languages; W3C encodings: 55 (Jul 01) [W3C01]
– Home pages (1997): English 82%, next 15: 13% [Babe97]
– Google (mid 2001): English: 53%, JGCFSKRIP: 30%
Document & query topic
Popular Query Topics (from 1 million Google queries, Apr 2000)

Arts          14.6%    Arts: Music               6.1%
Computers     13.8%    Regional: North America   5.3%
Regional      10.3%    Adult: Image Galleries    4.4%
Society        8.7%    Computers: Software       3.4%
Adult          8.0%    Computers: Internet       3.2%
Recreation     7.3%    Business: Industries      2.3%
Business       7.2%    Regional: Europe          1.8%
…              …       …                         …
Rate of change
[Cho00] 720K pages from 270 popular sites sampled daily from Feb 17 – Jun 14, 1999
Mathematically, what does this seem to be?
Web idiosyncrasies
Distributed authorship
– Millions of people creating pages with their own style, grammar, vocabulary, opinions, facts, falsehoods …
– Not all have the purest motives in providing high-quality information; commercial motives drive “spamming” – 100s of millions of pages
– The open web is largely a marketing tool
  • IBM’s home page does not contain the word “computer”
Other characteristics
Significant duplication
– Syntactic: 30%-40% (near) duplicates [Brod97, Shiv99b]
– Semantic: ???
High linkage
– ~8 links/page on average
Complex graph topology
– Not a small world; bow-tie structure [Brod00]
More on these corpus characteristics later
– how do we measure them?
Web search users
Ill-defined queries
– Short (AV 2001: 2.54 terms avg, 80% < 3 words)
– Imprecise terms
– Sub-optimal syntax (80% of queries without operator)
– Low effort
Wide variance in
– Needs
– Expectations
– Knowledge
– Bandwidth
Specific behavior
– 85% look over one result screen only (mostly above the fold)
– 78% of queries are not modified (one query/session)
– Follow links – “the scent of information” ...
Evolution of search engines
First generation – use only “on page” text data (1995-1997: AV, Excite, Lycos, etc.)
– Word frequency, language
Second generation – use off-page, web-specific data (from 1998; made popular by Google, but everyone does it now)
– Link (or connectivity) analysis
– Click-through data (what results people click on)
– Anchor-text (how people refer to this page)
Third generation – answer “the need behind the query” (still experimental)
– Semantic analysis – what is this about?
– Focus on user need, rather than on query
– Context determination
– Helping the user
– Integration of search and text analysis
Third generation search engine: answering “the need behind the query”
Query language determination
– Different ranking (if the query is Japanese, do not return English)
Hard & soft matches
– Personalities (triggered on names)
– Cities (travel info, maps)
– Medical info (triggered on names and/or results)
– Stock quotes, news (triggered on stock symbol)
– Company info, …
Integration of Search and Text Analysis
Answering “the need behind the query”
Context determination
– spatial (user location/target location)
– query stream (previous queries)
– personal (user profile)
– explicit (vertical search, family friendly)
– implicit (use AltaVista from AltaVista France)
Context use
– Result restriction
– Ranking modulation
The spatial context - geo-search
Two aspects
– Geo-coding: encode geographic coordinates to make search effective
– Geo-parsing: the process of identifying geographic context
Geo-coding
– Geometrical hierarchy (squares)
– Natural hierarchy (country, state, county, city, zip-codes, etc.)
Geo-parsing
– Pages (infer from phone numbers, zip codes, etc.); about 10% feasible
– Queries (use a dictionary of place names; a toy sketch follows)
– Users
  • From IP data
– Mobile phones
  • In its infancy, many issues (display size, privacy, etc.)
Query examples: AV “barry bonds”, Lycos “palo alto”
Geo-search example – Northern Light (now Divine Inc)
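A toy illustration of query geo-parsing against a dictionary of place names; the entries and coordinates below are invented:

```python
# Dictionary of place names -> (latitude, longitude); invented values.
PLACES = {
    "palo alto": (37.44, -122.14),
    "tampere": (61.50, 23.76),
}

def geo_parse(query: str):
    """Split a query into topic terms and a recognized place, if any."""
    q = query.lower()
    for place, coords in PLACES.items():
        if place in q:
            topic = q.replace(place, " ").split()
            return topic, place, coords
    return q.split(), None, None

print(geo_parse("hotels palo alto"))
# (['hotels'], 'palo alto', (37.44, -122.14))
```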
Helping the user
– UI
– spell checking
– query refinement
– query suggestion
– context transfer …
Context sensitive spell check
Search Engine Architecture
[Architecture diagram: crawlers, steered by crawl control, feed the page repository/document store; the indexer builds the text and structure indexes; link analysis and ranking support the query engine, which answers queries and returns results with extracted snippets.]
Components:
– Crawler
– Crawl control
– Indexes – text, structure, utility
– Page repository
– Indexer
– Collection analysis module
– Query engine
– Ranking module
Repository
“Hidden Treasures”
Storage
The page repository is a scalable storage system for web pages
– Allows the Crawler to store pages
– Allows the Indexer and Collection Analysis to retrieve them
Similar to other data storage systems – DB or file systems
Does not have to provide some of the other systems’ features: transactions, logging, directory
Storage Issues
Scalability and seamless load distribution
Dual access modes
– Random access (used by the query engine for cached pages)
– Streaming access (used by the Indexer and Collection Analysis)
Large bulk updates – reclaim old space, avoid access/update conflicts
Obsolete pages – remove pages no longer on the web
Designing a Distributed Web Repository
The repository is designed to work over a cluster of interconnected nodes
Design issues:
– Page distribution across nodes
– Physical organization within a node
– Update strategy
Page Distribution
How to choose a node to store a page?
– Uniform distribution – any page can be sent to any node
– Hash distribution policy – hash the page ID space into the node ID space (a sketch follows)
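A minimal sketch of the hash distribution policy, using the URL as the page ID (an assumption) and MD5 so the assignment is stable across runs:

```python
import hashlib

NUM_NODES = 4  # size of the node ID space (example value)

def node_for_page(url: str) -> int:
    """Hash the page ID space into the node ID space."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

print(node_for_page("http://www.cism.it/cism/hotels_2001.htm"))
```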
Organization Within a Node
Several operations required
– Add / remove a page
– High speed streaming
– Random page access
Hashed organization
– Treat each disk as a hash bucket
– Assign pages according to their IDs
Log organization
– Treat the disk as one file, and add each page at the end
– Support random access using a B-tree (a sketch follows)
Hybrid
– Hash map a page to an extent and use log structure within an extent
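A sketch of the log organization: pages are appended to a single file, with an in-memory dict standing in for the B-tree that supports random access:

```python
class LogRepository:
    """Append-only page store plus an index for random access."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # page_id -> (offset, length); B-tree stand-in

    def add(self, page_id, content: bytes):
        with open(self.path, "ab") as f:   # add the page at the end
            offset = f.tell()
            f.write(content)
        self.index[page_id] = (offset, len(content))

    def get(self, page_id) -> bytes:       # random page access
        offset, length = self.index[page_id]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)
```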
Distribution Performance

                        Log   Hashed   Hashed Log
Streaming performance   ++    -        +
Random access           +-    ++       +-
Page addition           ++    -        +
Update Strategies
Updates are generated by the crawler
Several characteristics:
– The time at which the crawl occurs and the repository receives the information
– Whether the crawl’s information replaces the entire database or modifies parts of it
Batch vs. Steady
Batch mode
– Periodically executed
– Allocated a certain amount of time
Steady mode
– Runs all the time
– Always sends results back to the repository
Partial vs. Complete Crawls
A batch mode crawler can
– Do a complete crawl every run, and replace the entire collection
– Recrawl only a specific subset, and apply updates to the existing collection – a partial crawl
The repository can implement
– In-place update
  • Quickly refresh pages
– Shadowing: updates applied as a separate stage
  • Avoids refresh-access conflicts
Partial vs. Complete Crawls
Shadowing resolves the conflicts between updates and reads for queries
Batch mode fits well with shadowing
A steady crawler fits well with in-place updates
Indexing
The Indexer Module
Creates two indexes:
– Text (content) index: uses “traditional” indexing methods like inverted indexing
– Structure (links) index: uses a directed graph of pages and links; sometimes also creates an inverted graph
The Link Analysis Module
Uses the two basic indexes created by the indexer module to assemble “utility indexes”
– e.g., a site index
Inverted Index
A set of inverted lists, one per index term (word)
Inverted list of a term: a sorted list of the locations in which the term appears
Posting: a pair (w, l) where w is a word and l is one of its locations
Lexicon: holds all the index’s terms, with statistics about each term (not the postings); a sketch follows
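A minimal in-memory sketch of these definitions (tokenization and the lexicon statistics are simplified):

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns (inverted lists, lexicon)."""
    inverted = defaultdict(list)          # term -> sorted posting list
    for doc_id in sorted(docs):
        for pos, word in enumerate(docs[doc_id].lower().split()):
            inverted[word].append((doc_id, pos))  # one posting (w, l)
    lexicon = {term: {"doc_freq": len({d for d, _ in posts}),
                      "total_occurrences": len(posts)}
               for term, posts in inverted.items()}
    return inverted, lexicon

inv, lex = build_index({1: "web search engines", 2: "search the web"})
print(inv["search"])   # [(1, 1), (2, 0)]
print(lex["search"])   # {'doc_freq': 2, 'total_occurrences': 2}
```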
Challenges
Index build must be:
– Fast
– Economical (unlike traditional index building)
Incremental indexing must be supported
Storage: compression vs. speed
Index Partitioning
Distributed text indexing can be done with:
Local inverted files (IFL)
– Each node contains disjoint, random pages
– The query is broadcast to all nodes
– The result is the join of the per-node answers
Global inverted files (IFG)
– Each node is responsible only for a subset of the terms in the collection
– The query is sent only to the appropriate nodes
(a sketch of both follows)
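A sketch of how query processing differs in the two schemes, assuming hypothetical node objects with search() and postings() methods:

```python
def query_local(nodes, terms):
    """IFL: broadcast the query to every node and join the answers."""
    results = set()
    for node in nodes:                  # each node indexes its own pages
        results |= node.search(terms)   # assumed per-node search API
    return results

def query_global(term_to_node, terms):
    """IFG: route each term to the single node responsible for it."""
    postings = [term_to_node[t].postings(t) for t in terms]
    return set.intersection(*postings)  # pages containing all terms
```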
Indexing, Conclusion
Web page indexing is complicated by its scale (millions of pages, hundreds of gigabytes)
Challenges: incremental indexing and personalization
Scaling
Google (Nov 2002):
– Number of pages: 3 billion
– Refresh interval: 1 month (1200 pages/sec)
– Queries/day: 150 million = 1700 q/s
– Avg page size: 10 KB
– Avg query size: 40 B
– Avg result size: 5 KB
– Avg links/page: 8
Size of Dataset
Total raw HTML data size: 3 G pages × 10 KB = 30 TB
Inverted index ≈ corpus = 30 TB
Using 3:1 compression: 20 TB of data on disk
Single copy of index:
– Index: (10 TB) / (100 GB per disk) = 100 disks
– Documents: (10 TB) / (100 GB per disk) = 100 disks
Query Load
1700 queries/sec
Rule of thumb: 20 q/s per CPU
– 85 clusters to answer queries
Cluster: 100 machines
Total = 85 × 100 = 8500 machines
Document servers
– Snippet generation: 1000 snippets/s
– (1700 q/s × 10 snippets / 1000 snippets/s) × 100 = 1700 machines
Limits
Redirector: 4000 req/sec
Bandwidth: 1100 req/sec
Server: 22 q/s each
Cluster: 50 nodes = 1100 q/s =
95 million q/day
Scaling the Index
[Diagram: queries arrive at a hardware-based load balancer and are spread over replicated Google Web Servers, which consult the spell checker, the ad server, the index servers, and the document servers.]
Pooled Shard Architecture
[Diagram: queries flow from the Web Server (1 Gb/s link) through an index load balancer (100 Mb/s) into the index server network of index servers 1…K; each shard 1…N has its own pool of index-server replicas behind an intermediate load balancer.]
Replicated Index Architecture
[Diagram: queries flow from the Web Server (1 Gb/s link) through an index load balancer (100 Mb/s) to index servers 1…M; each index server holds a full copy of the index, i.e. all shards S1…SN.]
Index Replication
100 Mb/s bandwidth
20 TB × 8 bits / 100 Mb/s: replicating 20 TB requires about one full day
Ranking
First generation ranking
Extended Boolean model
– Matches: exact, prefix, phrase, …
– Operators: AND, OR, AND NOT, NEAR, …
– Fields: TITLE:, URL:, HOST:, …
– AND is somewhat easier to implement, maybe preferable as default for short queries
Ranking (a toy scorer follows)
– TF-like factors: TF, explicit keywords, words in title, explicit emphasis (headers), etc.
– IDF factors: IDF, total word count in corpus, frequency in query log, frequency in language
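A toy scorer combining a few of the factors above; the term frequencies and the ×2 title boost are invented for illustration:

```python
import math

def first_gen_score(query_terms, doc_tokens, title_tokens, df, n_docs):
    """TF x IDF scoring with a boost for words in the title."""
    score = 0.0
    for t in query_terms:
        tf = doc_tokens.count(t)               # TF-like factor
        if tf == 0:
            continue
        idf = math.log(n_docs / df.get(t, 1))  # IDF-like factor
        boost = 2.0 if t in title_tokens else 1.0
        score += tf * idf * boost
    return score

df = {"web": 900, "ranking": 120}              # term -> document frequency
doc = "ranking pages on the web by ranking signals".split()
print(first_gen_score(["web", "ranking"], doc, ["ranking"], df, 1000))
```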
Second generation search engine
Ranking – use off-page, web-specific data
– Link (or connectivity) analysis
– Click-through data (what results people click on)
– Anchor-text (how people refer to this page)
Crawling
– Algorithms to create the best possible corpus
Connectivity analysis
Idea: mine the hyperlink information in the Web
Assumptions:
– Links often connect related pages
– A link between pages is a recommendation: “people vote with their links”
Citation Analysis
Citation frequency
Co-citation coupling frequency
– Co-citations with a given author measure “impact”
– Co-citation analysis [Mcca90]
Bibliographic coupling frequency
– Articles that cite the same articles are related
Citation indexing
– Who is a given author cited by? (Garfield [Garf72])
Pinski and Narin
Query-independent ordering
First generation: using link counts as simple measures of popularity
Two basic suggestions:
– Undirected popularity: each page gets a score = the number of in-links plus the number of out-links (e.g., 3 + 2 = 5)
– Directed popularity: score of a page = number of its in-links (e.g., 3)
(both are shown in code below)
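The two heuristics in code, with the link graph represented as an adjacency dict (a representation assumed for the example):

```python
def in_links(graph, page):
    """graph: dict mapping each page to the set of pages it links to."""
    return sum(1 for outs in graph.values() if page in outs)

def undirected_popularity(graph, page):
    return in_links(graph, page) + len(graph.get(page, set()))

def directed_popularity(graph, page):
    return in_links(graph, page)

g = {"a": {"b"}, "b": {"a", "c"}, "c": {"a"}}
print(undirected_popularity(g, "a"))  # 2 in-links + 1 out-link = 3
print(directed_popularity(g, "a"))    # 2 in-links
```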
Query processing
First retrieve all pages meeting the text query (say, “venture capital”)
Order these by their link popularity (either variant on the previous page)
Spamming simple popularity
Exercise: How do you spam each of the following heuristics so your page gets a high score?
– Each page gets a score = the number of in-links plus the number of out-links
– Score of a page = number of its in-links
PageRank scoring
Imagine a browser doing a random walk on web pages:
– Start at a random page
– At each step, go out of the current page along one of the links on that page, equiprobably (e.g., from a page with three out-links, each is followed with probability 1/3)
“In the steady state” each page has a long-term visit rate – use this as the page’s score
Not quite enough
The web is full of dead-ends
– A random walk can get stuck in dead-ends
– It then makes no sense to talk about long-term visit rates
Teleporting
At each step, with probability 10%, jump to a random web page
With the remaining probability (90%), go out on a random link
– If no out-link, stay put in this case
Result of teleporting
Now we cannot get stuck locally
There is a long-term rate at which any page is visited (not obvious; we will show this)
How do we compute this visit rate?
PageRank
Tries to capture the notion of the “importance of a page”
Uses backlinks for ranking
Avoids trivial spamming: distributes a page’s “voting power” among the pages it links to
An “important” page linking to a page will raise its rank more than a “not important” one
Simple PageRank
Given by:
$$r(i) = \sum_{j \in B(i)} \frac{r(j)}{N(j)}$$
where
– B(i): the set of pages linking to i
– N(j): the number of outgoing links from j
Well defined if the link graph is strongly connected
Based on the “random surfer model”: the rank of a page equals the probability of being at that page
Computation of PageRank (1)
$$r = A^{t} r, \qquad r = [r(1), r(2), \ldots, r(m)]$$
$$a_{i,j} = \begin{cases} 1/N(i) & \text{if } i \text{ points to } j \\ 0 & \text{otherwise} \end{cases}$$
Computation of PageRank (2)
Given a matrix A, an eigenvalue c and the corresponding eigenvector v are defined by Av = cv
Hence r is an eigenvector of $A^t$ for eigenvalue 1
If G is strongly connected then r is unique
Computation of PageRank (3)
Simple PageRank can be computed by:
1. $s \leftarrow$ any random vector
2. $r \leftarrow A^t s$
3. if $\|r - s\| < \epsilon$ goto 5
4. $s \leftarrow r$; goto 2
5. $r$ is the PageRank vector
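A small NumPy sketch of this iteration, run on an invented, strongly connected 3-page graph:

```python
import numpy as np

def simple_pagerank(A, eps=1e-8):
    """Power iteration for simple PageRank.

    A[i, j] = 1/N(i) if page i points to page j, else 0,
    as defined two slides back.
    """
    n = A.shape[0]
    s = np.full(n, 1.0 / n)        # any starting vector
    while True:
        r = A.T @ s                # r <- A^t s
        if np.linalg.norm(r - s) < eps:
            return r               # converged: r is the PageRank vector
        s = r

# Example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0 (strongly connected)
A = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
print(simple_pagerank(A))          # ~[0.4, 0.2, 0.4]
```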
PageRank Example
Practical PageRank: Problem
The Web is not a strongly connected graph. It contains:
– “Rank sinks”: clusters of pages without outgoing links; pages outside the cluster will be ranked 0
– “Rank leaks”: single pages without outgoing links; all pages will be ranked 0
Practical PageRank: Solution
Remove all rank leaks
Add a decay factor d to Simple PageRank (one common form is given below)
Based on the “bored surfer model”
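The slide leaves the damped formula implicit; a common normalized form (an assumption here, with n the number of pages and d typically around 0.85) is:

$$r(i) = \frac{1-d}{n} + d \sum_{j \in B(i)} \frac{r(j)}{N(j)}$$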
HITS: Hypertext Induced Topic Search
A query-dependent technique
Produces two scores:
– Authority: a page most likely to be relevant to a given query
– Hub: points to many authorities
Contains two parts:
– Identifying the focused subgraph
– Link analysis
HITS: Identifying The Focused Subgraph
Subgraph creation from a t-sized page set:
1. R ← t initial pages
2. S ← R
3. for each page p ∈ R
   (a) include in S all the pages that p points to
   (b) include in S (up to a maximum of d) all pages that point to p
4. S holds the focused subgraph
(d reduces the influence of extremely popular pages like yahoo.com; a sketch follows)
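A sketch of the expansion step, assuming hypothetical out_links()/in_links() helpers backed by the link index:

```python
def focused_subgraph(root_pages, d, out_links, in_links):
    """Expand the root set R into the focused subgraph S.

    out_links/in_links are assumed helpers returning lists of pages.
    d caps how many in-linking pages are added per root page.
    """
    S = set(root_pages)
    for p in root_pages:
        S.update(out_links(p))        # all pages that p points to
        S.update(in_links(p)[:d])     # up to d pages pointing to p
    return S
```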
HITS: Link Analysis
Calculates authority & hub scores ($a_i$ and $h_i$) for each page in S:
1. Initialize $a_i, h_i$ ($1 \le i \le n$) arbitrarily
2. Repeat until convergence:
   (a) $a_i \leftarrow \sum_{j \in B(i)} h_j$ (for all pages)
   (b) $h_i \leftarrow \sum_{j \in F(i)} a_j$ (for all pages)
HITS: Link Analysis Computation
The computation can be expressed with eigenvectors:
$$a = A^{tr} h \;\Rightarrow\; a = A^{tr} A\, a$$
$$h = A\, a \;\Rightarrow\; h = A A^{tr} h$$
where
– a: vector of authorities’ scores
– h: vector of hubs’ scores
– A: adjacency matrix, in which $a_{i,j} = 1$ if i points to j
(a runnable sketch follows)
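A NumPy sketch of the iteration; the per-round normalization is added here (the slides leave it implicit) so the scores converge:

```python
import numpy as np

def hits(A, iters=50):
    """A[i, j] = 1 if page i points to page j.

    Returns (authorities, hubs), renormalized every round.
    """
    n = A.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iters):
        a = A.T @ h            # a_i = sum of h_j over j in B(i)
        h = A @ a              # h_i = sum of a_j over j in F(i)
        a /= np.linalg.norm(a)
        h /= np.linalg.norm(h)
    return a, h

# Pages 0 and 1 both point to 2: 2 is the authority, 0 and 1 are hubs
A = np.array([[0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
a, h = hits(A)
print(a.round(2), h.round(2))  # a ~ [0, 0, 1], h ~ [0.71, 0.71, 0]
```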
Markov chains
A Markov chain consists of n states, plus an n×n transition probability matrix P
At each step, we are in exactly one of the states
For $1 \le i, j \le n$, the matrix entry $P_{ij}$ tells us the probability of j being the next state, given we are currently in state i
$P_{ii} > 0$ is OK (self-loops are allowed)
[Diagram: states i and j connected by an arc labeled $P_{ij}$]
Markov chains
Clearly, for all i, $\sum_{j=1}^{n} P_{ij} = 1$
Markov chains are abstractions of random walks
Exercise: represent the teleporting random walk from 3 slides ago as a Markov chain (one way to build the matrix for an arbitrary graph is sketched below)
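A sketch of building the teleporting walk’s transition matrix from a 0/1 adjacency matrix; the 10%/90% split and the stay-put rule for dead-ends follow the earlier slide, while the matrix representation is an assumption:

```python
import numpy as np

def teleport_matrix(A, jump=0.10):
    """Transition matrix P of the teleporting random walk.

    A is a 0/1 adjacency matrix: A[i, j] = 1 if page i links to j.
    """
    n = A.shape[0]
    P = np.zeros((n, n))
    for i in range(n):
        out = A[i].sum()
        if out > 0:
            P[i] = (1 - jump) * A[i] / out  # follow a random out-link
        else:
            P[i, i] = 1 - jump              # dead-end: stay put
        P[i] += jump / n                    # uniform random jump
    return P
```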
Ergodic Markov chains
A Markov chain is ergodic if
– you have a path from any state to any other
– you can be in any state at every time step, with non-zero probability
Counterexample: a chain that alternates deterministically between two states is not ergodic (even/odd steps)
Ergodic Markov chains
For any ergodic Markov chain, there is a unique long-term visit rate for each state
– Steady-state distribution
Over a long time-period, we visit each state in proportion to this rate
It doesn’t matter where we start
Probability vectors
A probability (row) vector x = (x_1, …, x_n) tells us where the walk is at any point
E.g., (0 0 … 1 … 0 0 0), with the 1 in position i of n, means we’re in state i
More generally, the vector x = (x_1, …, x_n) means the walk is in state i with probability $x_i$, so $\sum_{i=1}^{n} x_i = 1$
Change in probability vector
If the probability vector is x = (x_1, …, x_n) at this step, what is it at the next step?
Recall that row i of the transition probability matrix P tells us where we go next from state i
So from x, our next state is distributed as xP
Computing the visit rate
The steady state looks like a vector of probabilities a = (a_1, …, a_n):
– a_i is the probability that we are in state i
Example: two states with $P(1\to1) = 1/4$, $P(1\to2) = 3/4$, $P(2\to1) = 1/4$, $P(2\to2) = 3/4$
For this example, $a_1 = 1/4$ and $a_2 = 3/4$
How do we compute this vector?
Let a = (a_1, …, a_n) denote the row vector of steady-state probabilities
If our current position is described by a, then the next step is distributed as aP
But a is the steady state, so a = aP
Solving this matrix equation gives us a
– So a is the (left) eigenvector for P
– (Corresponds to the “principal” eigenvector of P with the largest eigenvalue)
One way of computing a
Recall, regardless of where we start, we eventually reach the steady state a
Start with any distribution (say x = (1 0 … 0))
After one step, we’re at xP; after two steps at xP², then xP³, and so on
“Eventually” means: for “large” k, xPᵏ = a
Algorithm: multiply x by increasing powers of P until the product looks stable (a worked check follows)
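A minimal check of the two-state example with this algorithm:

```python
import numpy as np

# Transition matrix of the two-state example: P[i][j] = Prob(i -> j)
P = np.array([[0.25, 0.75],
              [0.25, 0.75]])

x = np.array([1.0, 0.0])   # start anywhere, e.g. in state 1
for _ in range(50):        # multiply by increasing powers of P
    x = x @ P
print(x)                   # [0.25, 0.75] -- the steady state a
```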
Lempel: SALSA
By applying the ergodic theorem, Lempel proved that:
– a_i is proportional to the number of incoming links
PageRank summary
Preprocessing:
– Given the graph of links, build matrix P
– From it compute a
– The entry a_i is a number between 0 and 1: the PageRank of page i
Query processing:
– Retrieve pages meeting the query
– Rank them by their PageRank
– Order is query-independent
The reality
PageRank is used in Google, but so are many other clever heuristics