Search Engines & Question Answering

Giuseppe Attardi
Università di Pisa
Topics

Web Search
– Search engines
– Architecture
– Crawling: parallel/distributed, focused
– Link analysis (Google PageRank)
– Scaling
Top Online Activities (Source: Jupiter Communications, 2000)
– Email: 96%
– Web Search: 88%
– Product Info. Search: 72%

Pew Study (US users, July 2002)
– Total Internet users = 111 M
– Do a search on any given day = 33 M
– Have used Internet to search = 85%
http://www.pewinternet.org/reports/toc.asp?Report=64
Search on the Web

Corpus: the publicly accessible Web: static + dynamic
Goal: retrieve high-quality results relevant to the user’s need (not docs!)

Need
– Informational – want to learn about something (~40%)
  • e.g. Low hemoglobin
– Navigational – want to go to that page (~25%)
  • e.g. United Airlines
– Transactional – want to do something (web-mediated) (~35%)
  • Access a service: Tampere weather
  • Downloads: Mars surface images
  • Shop: Nikon CoolPix
– Gray areas
  • Car rental Finland
  • Find a good hub
  • Exploratory search “see what’s there”
Results

Static pages (documents)
– text, mp3, images, video, ...

Dynamic pages = generated on request
– data base access
– “the invisible web”
– proprietary content, etc.
Terminology

URL = Uniform Resource Locator, e.g.
http://www.cism.it/cism/hotels_2001.htm
– Access method: http
– Host name: www.cism.it
– Page name: /cism/hotels_2001.htm
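As an aside (not from the slides), the same split can be done programmatically; a minimal sketch using Python’s standard library:

from urllib.parse import urlparse

# Split the example URL from the slide into its components.
parts = urlparse("http://www.cism.it/cism/hotels_2001.htm")
print(parts.scheme)   # 'http'                   -> access method
print(parts.netloc)   # 'www.cism.it'            -> host name
print(parts.path)     # '/cism/hotels_2001.htm'  -> page name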
Scale

Immense amount of content
– 2-10 B static pages, doubling every 8-12 months
– Lexicon size: 10s-100s of millions of words

Authors galore (1 in 4 hosts runs a web server)
http://www.netcraft.com/Survey
Diversity

Languages/Encodings
– Hundreds (thousands?) of languages; W3C encodings: 55 (Jul 01) [W3C01]
– Home pages (1997): English 82%, next 15 languages: 13% [Babe97]
– Google (mid 2001): English: 53%, JGCFSKRIP: 30%

Document & query topic
Popular Query Topics (from 1 million Google queries, Apr 2000)

Arts          14.6%    Arts: Music               6.1%
Computers     13.8%    Regional: North America   5.3%
Regional      10.3%    Adult: Image Galleries    4.4%
Society        8.7%    Computers: Software       3.4%
Adult          8.0%    Computers: Internet       3.2%
Recreation     7.3%    Business: Industries      2.3%
Business       7.2%    Regional: Europe          1.8%
…                      …
Rate of change

[Cho00] 720K pages from 270 popular sites sampled daily from Feb 17 – Jun 14, 1999
Mathematically, what does this seem to be?
Web idiosyncrasies

Distributed authorship
– Millions of people creating pages with their own style, grammar, vocabulary, opinions, facts, falsehoods …
– Not all have the purest motives in providing high-quality information: commercial motives drive “spamming” – 100s of millions of pages
– The open web is largely a marketing tool
  • IBM’s home page does not contain the word “computer”
Other characteristics

Significant duplication
– Syntactic: 30%-40% (near) duplicates [Brod97, Shiv99b]
– Semantic: ???

High linkage
– ~8 links/page on average

Complex graph topology
– Not a small world; bow-tie structure [Brod00]

More on these corpus characteristics later
– how do we measure them?
Web search users

Ill-defined queries
– Short (AV 2001: 2.54 terms avg, 80% < 3 words)
– Imprecise terms
– Sub-optimal syntax (80% of queries without operators)
– Low effort

Wide variance in
– Needs
– Expectations
– Knowledge
– Bandwidth

Specific behavior
– 85% look over one result screen only (mostly above the fold)
– 78% of queries are not modified (one query/session)
– Follow links – “the scent of information” ...
Evolution of search engines

First generation – use only “on page”, text data (1995-1997: AV, Excite, Lycos, etc.)
– Word frequency, language

Second generation – use off-page, web-specific data (from 1998; made popular by Google, but now used by everyone)
– Link (or connectivity) analysis
– Click-through data (what results people click on)
– Anchor-text (how people refer to this page)

Third generation – answer “the need behind the query” (still experimental)
– Semantic analysis – what is this about?
– Focus on user need, rather than on query
– Context determination
– Helping the user
– Integration of search and text analysis
Third generation search engine:
answering “the need behind the query”

Query language determination
Different ranking
– (if the query is in Japanese, do not return English pages)
Hard & soft matches
– Personalities (triggered on names)
– Cities (travel info, maps)
– Medical info (triggered on names and/or results)
– Stock quotes, news (triggered on stock symbol)
– Company info, …
Integration of Search and Text Analysis
Answering “the need behind the query”:
Context determination

Context determination
– spatial (user location / target location)
– query stream (previous queries)
– personal (user profile)
– explicit (vertical search, family friendly)
– implicit (use AltaVista from AltaVista France)

Context use
– Result restriction
– Ranking modulation
The spatial context – geo-search

Two aspects
– Geo-coding: encode geographic coordinates to make search effective
– Geo-parsing: the process of identifying geographic context

Geo-coding
– Geometrical hierarchy (squares)
– Natural hierarchy (country, state, county, city, zip codes, etc.)

Geo-parsing
– Pages (infer from phone nos., zip codes, etc.); about 10% feasible
– Queries (use a dictionary of place names)
– Users
  • from IP data
– Mobile phones
  • in its infancy, many issues (display size, privacy, etc.)

Query examples: AV “barry bonds”; Lycos “palo alto”
Geo-search example – Northern Light (now Divine Inc.)
Helping the user
– UI
– spell checking
– query refinement
– query suggestion
– context transfer …

Context-sensitive spell check
Search Engine Architecture

[Architecture diagram: Crawlers, driven by Crawl Control, feed the Page Repository / Document Store; the Indexer and Link Analysis modules build the Text and Structure indexes and ranking information; the Query Engine, with Ranking and Snippet Extraction, turns Queries into Results]
Search engine components:
– Crawler
– Crawl control
– Indexes – text, structure, utility
– Page repository
– Indexer
– Collection analysis module
– Query engine
– Ranking module

Repository: “Hidden Treasures”
Storage

The page repository is a scalable storage system for web pages
– Allows the Crawler to store pages
– Allows the Indexer and Collection Analysis to retrieve them
– Similar to other data storage systems – DB or file systems
– Does not have to provide some of the other systems’ features: transactions, logging, directory
Storage Issues

– Scalability and seamless load distribution
– Dual access modes
  • Random access (used by the query engine for cached pages)
  • Streaming access (used by the Indexer and Collection Analysis)
– Large bulk updates – reclaim old space, avoid access/update conflicts
– Obsolete pages – remove pages no longer on the web
Designing a Distributed Web Repository

The repository is designed to work over a cluster of interconnected nodes
– Page distribution across nodes
– Physical organization within a node
– Update strategy
Page Distribution

How to choose a node on which to store a page
– Uniform distribution – any page can be sent to any node
– Hash distribution policy – hash the page ID space into the node ID space (see the sketch below)
Organization Within a Node

Several operations required
– Add / remove a page
– High-speed streaming
– Random page access

Hashed organization
– Treat each disk as a hash bucket
– Assign according to a page’s ID

Log organization
– Treat the disk as one file, and add each page at the end
– Support random access using a B-tree

Hybrid
– Hash map a page to an extent and use log structure within an extent
Distribution Performance

                           Log    Hashed    Hashed Log
Streaming performance      ++     -         +
Random access performance  +-     ++        +-
Page addition              ++     -         +
Update Strategies

Updates are generated by the crawler
Several characteristics:
– Time at which the crawl occurs and the repository receives the information
– Whether the crawl’s information replaces the entire database or modifies parts of it
Batch vs. Steady

Batch mode
– Periodically executed
– Allocated a certain amount of time

Steady mode
– Runs all the time
– Always sends results back to the repository
Partial vs. Complete Crawls

A batch-mode crawler can
– Do a complete crawl every run, and replace the entire collection
– Recrawl only a specific subset, and apply updates to the existing collection – a partial crawl

The repository can implement
– In-place updates
  • Quickly refresh pages
– Shadowing: apply updates in a separate stage
  • Avoids refresh/access conflicts
Partial vs. Complete Crawls (cont.)

– Shadowing resolves the conflicts between updates and reads for queries
– Batch mode fits well with shadowing
– A steady crawler fits well with in-place updates
Indexing

The Indexer Module
Creates two indexes:
– Text (content) index: uses “traditional” indexing methods such as inverted indexing
– Structure (links) index: uses a directed graph of pages and links; sometimes also creates an inverted graph

The Link Analysis Module
Uses the two basic indexes created by the indexer module in order to assemble “utility indexes”, e.g. a site index.
Inverted Index

– A set of inverted lists, one per index term (word)
– Inverted list of a term: a sorted list of locations in which the term appears
– Posting: a pair (w, l) where w is a word and l is one of its locations
– Lexicon: holds all the index’s terms, with statistics about each term (not the postings)
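A minimal sketch of such an index in Python, assuming whitespace tokenization and word positions as locations (both assumptions made for illustration):

from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping doc_id -> text.
    Returns (lexicon, index): index[term] is a sorted posting list of
    (doc_id, position) pairs; lexicon[term] holds simple statistics."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))      # one posting (w, l)
    lexicon = {term: {"df": len({d for d, _ in post}), "cf": len(post)}
               for term, post in index.items()}
    return lexicon, {t: sorted(p) for t, p in index.items()}

lexicon, index = build_inverted_index({
    1: "web search engines index the web",
    2: "question answering over web pages",
})
print(index["web"])      # [(1, 0), (1, 5), (2, 3)]
print(lexicon["web"])    # {'df': 2, 'cf': 3}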
Challenges

Index builds must be:
– Fast
– Economical (unlike traditional index building)

Incremental indexing must be supported
Storage: compression vs. speed
Index Partitioning

Distributed text indexing can be done by (both routing schemes are sketched below):
– Local inverted file (IFL)
  • Each node contains disjoint, randomly assigned pages
  • A query is broadcast to all nodes
  • The result is the union of the per-node answers
– Global inverted file (IFG)
  • Each node is responsible only for a subset of the terms in the collection
  • A query is sent only to the appropriate nodes
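A rough sketch of how query routing differs under the two schemes; the toy shard contents and the term-ownership map are hypothetical stand-ins for real per-node indexes:

# Toy data: two nodes, each indexing a disjoint set of documents (IFL),
# plus a term -> owning-node map and per-node term indexes (IFG).
local_shards = [{"web": [1, 2]}, {"web": [5], "pisa": [6]}]
term_owner   = {"web": 0, "pisa": 1}
node_indexes = [{"web": [1, 2, 5]}, {"pisa": [6]}]

def query_local(term, shards):
    """IFL: broadcast the query to every node and union the per-node answers."""
    results = set()
    for shard in shards:
        results |= set(shard.get(term, []))
    return results

def query_global(term, owner, node_indexes):
    """IFG: send the query only to the node responsible for the term."""
    node = owner.get(term)
    return set(node_indexes[node].get(term, [])) if node is not None else set()

print(query_local("web", local_shards))               # {1, 2, 5}
print(query_global("web", term_owner, node_indexes))  # {1, 2, 5}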
Indexing: Conclusion

Web page indexing is complicated by its scale (millions of pages, hundreds of gigabytes)
Challenges: incremental indexing and personalization
Scaling

Google (Nov 2002):
– Number of pages: 3 billion
– Refresh interval: 1 month (≈1200 pages/sec)
– Queries/day: 150 million ≈ 1700 q/s
– Avg page size: 10 KB
– Avg query size: 40 B
– Avg result size: 5 KB
– Avg links/page: 8
Size of Dataset

– Total raw HTML data size: 3 G pages x 10 KB = 30 TB
– Inverted index ≈ corpus = 30 TB
– Using 3:1 compression: (30 + 30) / 3 = 20 TB of data on disk

Single copy of index
– Index: (10 TB) / (100 GB per disk) = 100 disks
– Documents: (10 TB) / (100 GB per disk) = 100 disks
Query Load

– 1700 queries/sec
– Rule of thumb: 20 q/s per CPU
  • 85 clusters (index replicas) needed to answer queries
– Cluster: 100 machines (one full index copy)
– Total = 85 x 100 = 8500 machines
– Document servers
  • Snippet generation: 1000 snippets/s per server
  • (1700 q/s x 10 snippets / 1000 per server) x 100 = 1700 servers
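The slide’s arithmetic, reproduced as a quick sanity check using only the numbers given above (the 10 snippets per query is read as the number of results shown per query):

# Reproducing the slide's back-of-the-envelope numbers.
queries_per_sec = 150_000_000 / 86_400     # ~1736, rounded to 1700 on the slide
q_per_cpu = 20                             # rule of thumb
replicas = 1700 // q_per_cpu               # 85 index replicas needed for the query rate
machines_per_cluster = 100                 # one full index copy spread over 100 machines
print(replicas * machines_per_cluster)     # 8500 index-serving machines

snippets_per_query = 10                    # results shown per query (assumption)
snippet_rate = 1000                        # snippets/s per document server
print(int(1700 * snippets_per_query / snippet_rate * 100))   # 1700 document servers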
Limits

– Redirector: 4000 req/sec
– Bandwidth: 1100 req/sec
– Server: 22 q/s each
– Cluster: 50 nodes = 1100 q/s = 95 million q/day
Scaling the Index

[Diagram: queries arrive at a hardware-based load balancer, which spreads them over Google Web Servers; each web server consults the spell checker, the ad server, the index servers and the document servers]
Pooled Shard Architecture

[Diagram: the web server (1 Gb/s uplink) sends each query through an index load balancer to the index server network (Index Server 1 … Index Server K, 100 Mb/s links); each index server forwards the query, via one intermediate load balancer per shard, to a pool of shard index servers (pool for shard 1 … pool for shard N)]
Replicated Index Architecture

[Diagram: the web server (1 Gb/s uplink) sends each query through an index load balancer (100 Mb/s links) to one of M index servers; each index server holds a full copy of the index (shards S1 … SN of Full Index 1 … Full Index M)]
Index Replication

– 100 Mb/s bandwidth
– 20 TB x 8 bits / 100 Mb/s
– Replicating 20 TB requires one full day
Ranking

First generation ranking

Extended Boolean model
– Matches: exact, prefix, phrase, …
– Operators: AND, OR, AND NOT, NEAR, …
– Fields: TITLE:, URL:, HOST:, …
– AND is somewhat easier to implement, maybe preferable as default for short queries

Ranking
– TF-like factors: TF, explicit keywords, words in title, explicit emphasis (headers), etc.
– IDF factors: IDF, total word count in corpus, frequency in query log, frequency in language
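To make the TF/IDF-style factors concrete, here is a minimal scoring sketch; the log-scaled weighting is a common textbook variant, not necessarily what any particular first-generation engine used:

import math
from collections import Counter

def tf_idf_score(query_terms, doc_tokens, doc_freq, num_docs):
    """Score one document: sum over the query terms of tf * idf,
    with a log-scaled idf (a common textbook variant)."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        if tf[term] and doc_freq.get(term, 0):
            score += tf[term] * math.log(num_docs / doc_freq[term])
    return score

# Toy usage: 1000-document corpus; "pisa" is rarer than "web", so it weighs more
print(tf_idf_score(["web", "pisa"],
                   "web search in pisa pisa".split(),
                   doc_freq={"web": 500, "pisa": 10}, num_docs=1000))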
Second generation search engine

Ranking – use off-page, web-specific data
– Link (or connectivity) analysis
– Click-through data (what results people click on)
– Anchor-text (how people refer to this page)

Crawling
– Algorithms to create the best possible corpus
Connectivity analysis

Idea: mine hyperlink information in the Web
Assumptions:
– Links often connect related pages
– A link between pages is a recommendation: “people vote with their links”
Citation Analysis

– Citation frequency
– Co-citation coupling frequency
  • Co-citation with a given author measures “impact”
  • Co-citation analysis [Mcca90]
– Bibliographic coupling frequency
  • Articles that cite the same articles are related
– Citation indexing
  • Who is a given author cited by? (Garfield [Garf72])
– Pinski and Narin
Query-independent ordering

First generation: using link counts as simple measures of popularity
Two basic suggestions (both sketched in code below):
– Undirected popularity:
  • Each page gets a score = the number of in-links plus the number of out-links (e.g. 3 + 2 = 5)
– Directed popularity:
  • Score of a page = number of its in-links (e.g. 3)
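A minimal sketch computing both popularity scores on a toy link graph (the graph is made up for illustration):

# links[p] = set of pages that p points to (toy graph, illustrative only)
links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}, "D": {"C"}}

in_links = {p: 0 for p in links}
for p, outs in links.items():
    for q in outs:
        in_links[q] += 1

directed   = {p: in_links[p] for p in links}                 # in-links only
undirected = {p: in_links[p] + len(links[p]) for p in links} # in- plus out-links
print(directed)    # C has 3 in-links
print(undirected)  # C scores 3 + 1 = 4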
Query processing

– First retrieve all pages meeting the text query (say, venture capital)
– Order these by their link popularity (either variant on the previous slide)
Spamming simple popularity

Exercise: how do you spam each of the following heuristics so your page gets a high score?
– Each page gets a score = the number of in-links plus the number of out-links
– Score of a page = number of its in-links
PageRank scoring

Imagine a browser doing a random walk on web pages:
– Start at a random page
– At each step, go out of the current page along one of the links on that page, equiprobably (e.g. with three out-links, each is followed with probability 1/3)

“In the steady state” each page has a long-term visit rate – use this as the page’s score
Not quite enough

The web is full of dead ends
– The random walk can get stuck in dead ends
– Then it makes no sense to talk about long-term visit rates
Teleporting

– At each step, with probability 10%, jump to a random web page
– With the remaining probability (90%), go out on a random link
  • If there is no out-link, stay put in this case
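A small Monte Carlo sketch of this teleporting walk, following the rule above (stay put on a dead end unless a teleport fires); the toy graph and step count are illustrative:

import random
from collections import Counter

links = {"A": ["B"], "B": ["A", "C"], "C": []}   # toy graph; C is a dead end
pages = list(links)

def visit_rates(steps=100_000, teleport=0.10):
    visits, page = Counter(), random.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if random.random() < teleport:
            page = random.choice(pages)           # teleport to a random page
        elif links[page]:
            page = random.choice(links[page])     # follow a random out-link, equiprobably
        # else: no out-link -> stay put, as on the slide
    return {p: visits[p] / steps for p in pages}

print(visit_rates())   # approximate long-term visit rates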
Result of teleporting

– Now the walk cannot get stuck locally
– There is a long-term rate at which any page is visited (not obvious, we will show this)
– How do we compute this visit rate?
PageRank

– Tries to capture the notion of “importance of a page”
– Uses backlinks for ranking
– Avoids trivial spamming: distributes a page’s “voting power” among the pages it links to
– An “important” page linking to a page will raise its rank more than a “not important” one
Simple PageRank

Given by:
  r(i) = Σ_{j ∈ B(i)} r(j) / N(j)
where
– B(i): set of pages linking to i
– N(j): number of outgoing links from j

Well defined if the link graph is strongly connected
Based on the “Random Surfer Model”: the rank of a page equals the probability of being at that page
Computation of PageRank (1)

  r = Aᵗ r,  where r = [r(1), r(2), …, r(m)]
  A_{i,j} = 1/N(i)  if i points to j
          = 0       otherwise
Computation of PageRank (2)

– Given a matrix A, an eigenvalue c and the corresponding eigenvector v are defined by Av = cv
– Hence r is an eigenvector of Aᵗ for eigenvalue 1
– If G is strongly connected then r is unique
Computation of PageRank (3)

Simple PageRank can be computed by:
1. s ← any random vector
2. r ← Aᵗ s
3. if ||r − s|| < ε goto 5
4. s ← r; goto 2
5. r is the PageRank vector

PageRank Example
[Figure omitted: worked example on a small link graph; a code sketch of the same computation follows]
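A sketch of the five-step procedure above in Python; the toy (strongly connected) graph and the tolerance are illustrative choices:

def simple_pagerank(links, eps=1e-8):
    """Power iteration r <- A^t r for simple PageRank of a strongly
    connected link graph. links[i] = pages that i points to."""
    pages = list(links)
    r = {p: 1.0 / len(pages) for p in pages}       # step 1: any starting vector
    while True:
        s = r
        r = {p: 0.0 for p in pages}
        for j, outs in links.items():              # step 2: r <- A^t s,
            for i in outs:                         #   i.e. r(i) = sum of s(j)/N(j)
                r[i] += s[j] / len(outs)           #   over pages j linking to i
        if sum(abs(r[p] - s[p]) for p in pages) < eps:
            return r                               # step 5: r is the PageRank vector

# Toy strongly connected graph (illustrative only)
print(simple_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))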
Practical PageRank: Problem

The web is not a strongly connected graph. It contains:
– “Rank sinks”: clusters of pages with no links out of the cluster; pages outside the cluster end up ranked 0
– “Rank leaks”: individual pages without outgoing links; in the limit all pages end up ranked 0
Practical PageRank: Solution

– Remove all rank leaks
– Add a decay factor d to Simple PageRank:
  r(i) = d · Σ_{j ∈ B(i)} r(j) / N(j) + (1 − d) / m   (m = number of pages)
– Based on the “Bored Surfer Model”
HITS: Hypertext Induced Topic Search

– A query-dependent technique
– Produces two scores per page:
  • Authority: a page most likely to be relevant to the given query
  • Hub: a page that points to many authorities
– Consists of two parts:
  • Identifying the focused subgraph
  • Link analysis
HITS: Identifying the Focused Subgraph

Subgraph creation from a set of t initial pages:
1. R ← t initial pages
2. S ← R
3. for each page p ∈ R
   (a) include in S all the pages that p points to
   (b) include in S (up to a maximum of d) all pages that point to p
4. S holds the focused subgraph
(d reduces the influence of extremely popular pages like yahoo.com)
HITS: Link Analysis

Calculates authority & hub scores (ai and hi) for each page in S:
1. Initialize ai, hi (1 ≤ i ≤ n) arbitrarily
2. Repeat until convergence:
   (a) ai ← Σ_{j ∈ B(i)} hj   (for all pages)
   (b) hi ← Σ_{j ∈ F(i)} aj   (for all pages)
where B(i) is the set of pages pointing to i and F(i) the set of pages i points to
HITS: Link Analysis Computation

The computation can also be done via eigenvectors (an iterative sketch follows):
  a = Aᵗ h  ⟹  a = Aᵗ A a
  h = A a   ⟹  h = A Aᵗ h
where
– a: vector of authority scores
– h: vector of hub scores
– A: adjacency matrix, in which A_{i,j} = 1 if i points to j
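A sketch of the iterative computation on a toy focused subgraph; the per-step normalization is the usual one from Kleinberg’s HITS, and the graph is made up for illustration:

def hits(links, iterations=50):
    """Iterative HITS on a focused subgraph.
    links[p] = pages that p points to, i.e. F(p); B(p) is derived from it."""
    pages = list(links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a_i <- sum of h_j over pages j pointing to i
        auth = {i: sum(hub[j] for j in pages if i in links[j]) for i in pages}
        # h_i <- sum of a_j over pages j that i points to
        hub = {i: sum(auth[j] for j in links[i]) for i in pages}
        # Normalize so the scores stay bounded
        na, nh = sum(auth.values()) or 1.0, sum(hub.values()) or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return auth, hub

auth, hub = hits({"A": ["B", "C"], "B": ["C"], "C": [], "D": ["C"]})
print(max(auth, key=auth.get), max(hub, key=hub.get))   # C is the top authority, A the top hub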
Markov chains

– A Markov chain consists of n states, plus an n×n transition probability matrix P
– At each step, we are in exactly one of the states
– For 1 ≤ i, j ≤ n, the matrix entry Pij tells us the probability of j being the next state, given we are currently in state i
– Pii > 0 is OK
Markov chains

– Clearly, for all i, Σ_{j=1..n} Pij = 1
– Markov chains are abstractions of random walks
– Exercise: represent the teleporting random walk from 3 slides ago as a Markov chain [example graph omitted]
Ergodic Markov chains

A Markov chain is ergodic if
– you have a path from any state to any other
– you can be in any state at every time step, with non-zero probability
(Counterexample: a chain that strictly alternates between two states is not ergodic – even/odd steps)
Ergodic Markov chains

– For any ergodic Markov chain, there is a unique long-term visit rate for each state
  • the steady-state distribution
– Over a long time period, we visit each state in proportion to this rate
– It doesn’t matter where we start
Probability vectors

– A probability (row) vector x = (x1, …, xn) tells us where the walk is at any point
– E.g., (0 0 0 … 1 … 0 0 0), with the 1 in position i, means we’re in state i
– More generally, the vector x = (x1, …, xn) means the walk is in state i with probability xi, where Σ_{i=1..n} xi = 1
Change in probability vector

– If the probability vector is x = (x1, …, xn) at this step, what is it at the next step?
– Recall that row i of the transition probability matrix P tells us where we go next from state i
– So from x, our next state is distributed as xP
Computing the visit rate

The steady state looks like a vector of probabilities a = (a1, …, an):
– ai is the probability that we are in state i

Example: two states, where from either state we go to state 1 with probability 1/4 and to state 2 with probability 3/4
For this example, a1 = 1/4 and a2 = 3/4
How do we compute this vector?

– Let a = (a1, …, an) denote the row vector of steady-state probabilities
– If our current position is described by a, then the next step is distributed as aP
– But a is the steady state, so a = aP
– Solving this matrix equation gives us a
  • So a is the (left) eigenvector of P
  • (It corresponds to the principal eigenvector of P, the one with the largest eigenvalue)
One way of computing a

– Recall that, regardless of where we start, we eventually reach the steady state a
– Start with any distribution (say x = (1 0 … 0))
– After one step, we’re at xP; after two steps at xP², then xP³, and so on
– “Eventually” means: for “large” k, xPᵏ ≈ a
– Algorithm: multiply x by increasing powers of P until the product looks stable (sketched below on the earlier two-state example)
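A sketch of this on the two-state chain from the earlier example (transition matrix as reconstructed there); it converges to a = (1/4, 3/4):

# Two-state chain: from either state, go to state 1 with prob. 1/4
# and to state 2 with prob. 3/4.
P = [[0.25, 0.75],
     [0.25, 0.75]]

def steady_state(P, x, steps=50):
    """Repeatedly form xP until the distribution stabilizes."""
    for _ in range(steps):
        x = [sum(x[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]
    return x

print(steady_state(P, [1.0, 0.0]))   # -> approximately [0.25, 0.75]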
Lempel: SALSA

By applying the ergodic theorem, Lempel proved that:
– ai is proportional to the number of incoming links
PageRank summary

Preprocessing:
– Given the graph of links, build the matrix P
– From it, compute a
– The entry ai is a number between 0 and 1: the PageRank of page i

Query processing:
– Retrieve the pages meeting the query
– Rank them by their PageRank
– The order is query-independent
The reality

PageRank is used in Google, but so are many other clever heuristics