Search Engines
CS 186 Guest Lecture
Prof. Marti Hearst
SIMS
Web Search Questions
• How do search engines differ from DBMSs?
• What do people search for?
• How do search engines work?
• Interfaces
• Ranking
• Architecture
Web Search vs DBMS?
A Comparison
• Web Search
  • Imprecise
  • Ranked results
  • "Satisficing" results
  • Unedited content
  • Keyword queries
  • Mainly read-only
  • Inverted index
• DBMS
  • Precise
  • Usually unordered
  • Complete results
  • Controlled content
  • SQL
  • Reads and writes
  • B-trees
What Do People Search for on the Web?
• Genealogy/Public Figure: 12%
• Computer related: 12%
• Business: 12%
• Entertainment: 8%
• Medical: 8%
• Politics & Government: 7%
• News: 7%
• Hobbies: 6%
• General info/surfing: 6%
• Science: 6%
• Travel: 5%
• Arts/education/shopping/images: 14%
Something is missing…
Study by Spink et al., Oct 98: survey on Excite, 13 questions, data for 316 surveys.
www.shef.ac.uk/~is/publications/infres/paper53.html
What Do People Search for on the Web?
50,000 queries from Excite, 1997
Most frequent terms:
• 4660 sex
• 3129 yahoo
• 2191 internal site admin check from kho
• 1520 chat
• 1498 porn
• 1315 horoscopes
• 1284 pokemon
• 1283 SiteScope test
• 1223 hotmail
• 1163 games
• 1151 mp3
• 1140 weather
• 1127 www.yahoo.com
• 1110 maps
• 1036 yahoo.com
• 983 ebay
• 980 recipes
Why do these differ?
• Self-reporting survey
• The nature of language
• Only a few ways to say certain things
• Many different ways to express most concepts
• UFO, Flying Saucer, Space Ship, Satellite
• How many ways are there to talk about history?
Intranet Queries (Aug 2000)
• 3351 bearfacts
• 3349 telebears
• 1909 extension
• 1874 schedule+of+classes
• 1780 bearlink
• 1737 bear+facts
• 1468 decal
• 1443 infobears
• 1227 calendar
• 989 career+center
• 974 campus+map
• 920 academic+calendar
• 840 map
• 773 bookstore
• 741 class+pass
• 738 housing
• 721 tele-bears
• 716 directory
• 667 schedule
• 627 recipes
• 602 transcripts
• 582 tuition
• 577 seti
• 563 registrar
• 550 info+bears
• 543 class+schedule
• 470 financial+aid
Intranet Queries
• Summary of sample data from 3 weeks of UCB queries
• 13.2% Telebears/BearFacts/InfoBears/BearLink
(12297)
• 6.7% Schedule of classes or final exams (6222)
• 5.4% Summer Session (5041)
• 3.2% Extension (2932)
• 3.1% Academic Calendar (2846)
• 2.4% Directories (2202)
• 1.7% Career Center (1588)
• 1.7% Housing (1583)
• 1.5% Map (1393)
• Average query length over last 4 months: 1.8 words
• This suggests what is difficult to find from the home page
Different kinds of users; different kinds of data
• Legal and news collection:
• professional searchers
• paying (by the query or by the minute)
• Online bibliographic catalogs (melvyl)
• scholars searching scholarly literature
• Web
• Every type of person with every type of goal
• No “driving school” for searching
Different kinds of information needs; different kinds
of queries
• Example: Search on “Mazda”
– What does this mean on the web?
– What does this mean on a news collection?
• Example: “Mazda transmissions”
• Example: “Manufacture of Mazda
transmissions in the post-cold war world”
Web Queries
• Web queries are SHORT
• ~2.4 words on average (Aug 2000)
• Has increased, was 1.7 (~1997)
• User Expectations
• Many say “the first item shown should be what
I want to see”!
• This works if the user has the most
popular/common notion in mind
Recent statistics from Inktomi, August 2000, for one client, one week
• Total # queries: 1315040
• Number of repeated queries: 771085
• Number of queries with repeated words: 12301
• Average words/query: 2.39
• Query type: All words: 0.3036; Any words: 0.6886; Some words: 0.0078
• Boolean: 0.0015 (0.9777 AND / 0.0252 OR / 0.0054 NOT)
• Phrase searches: 0.198
• URL searches: 0.066
• URL searches w/ http: 0.000
• Email searches: 0.001
• Wildcards: 0.0011 (0.7042 '?'s)
  • Fraction '?' at end of query: 0.6753
  • Interrogatives when '?' at end: 0.8456
How to Optimize for Short Queries?
• Find good starting places
• User still has to search at the site itself
• Dialogues
• Build upon a series of short queries
• Not well understood how to do this for the general case
• Question Answering
• AskJeeves – hand edited
• Automated approaches are under development
• Very simple
• Or domain-specific
How to Find Good Starting Points?
• Manually compiled lists
• Directories
• e.g., Yahoo, Looksmart, Open directory
• Page “popularity”
• Frequently visited pages (in general)
• Frequently visited pages as a result of a query
• Link “co-citation”,
• which sites are linked to by other sites?
• Number of pages in the site
• Not currently used (as far as I know)
Directories vs. Search Engines
An IMPORTANT Distinction
• Directories
  • Hand-selected sites
  • Search over the contents of the descriptions of the pages
  • Organized in advance into categories
• Search Engines
  • All pages in all sites
  • Search over the contents of the pages themselves
  • Organized after the query by relevance rankings or other scores
Link Analysis for Starting Points
• Assumptions:
• If the pages pointing to this page are good, then
this is also a good page.
• The words on the links pointing to this page are
useful indicators of what this page is about.
• References: Page et al. 98, Kleinberg 98
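As a rough illustration of the link-analysis idea, here is a minimal PageRank-style iteration in Python. This is a generic sketch in the spirit of Page et al. 98, not the exact algorithm of any engine; the toy graph, damping factor, and iteration count are illustrative assumptions.

```python
def link_scores(out_links, damping=0.85, iterations=20):
    """Minimal PageRank-style iteration (illustrative sketch only).
    Pages that are pointed to by high-scoring pages accumulate score."""
    pages = list(out_links)
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for p, targets in out_links.items():
            if not targets:            # dangling page: spread its score evenly
                for q in pages:
                    new[q] += damping * score[p] / len(pages)
            else:
                for q in targets:
                    new[q] += damping * score[p] / len(targets)
        score = new
    return score

# Hypothetical toy web: the heavily linked-to pages end up with the top scores.
toy_web = {"toyota.com": ["fanclub.org"],
           "fanclub.org": ["toyota.com"],
           "blog1": ["toyota.com"], "blog2": ["toyota.com"], "blog3": ["fanclub.org"]}
print(sorted(link_scores(toy_web).items(), key=lambda kv: -kv[1]))
```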
Co-Citation Analysis
• Has been around since the 50’s. (Small, Garfield, White &
McCain)
• Used to identify core sets of
• authors, journals, articles for particular fields
• Not for general search
• Main Idea:
• Find pairs of papers that cite the same third papers
• Look for commonalities
• A nice demonstration by Eugene Garfield at:
–
http://165.123.33.33/eugene_garfield/papers/mapsciworld.html
Link Analysis for Starting Points
• Why does this work?
• The official Toyota site will be linked to by lots of other
official (or high-quality) sites
• The best Toyota fan-club site probably also has many
links pointing to it
• Lower-quality sites do not have as many high-quality sites linking to them
Co-citation analysis (From Garfield 98)
Link Analysis for Starting Points
• Does this really work?
• Actually, there have been no rigorous evaluations
• Seems to work for the primary sites; not clear if it works for
the relevant secondary sites
• One (small) study suggests that sites with many pages are
often the same as those with good link co-citation scores.
(Terveen & Hill, SIGIR 2000)
What is Really Being Used?
• Today's search engines combine these methods in various ways
• Integration of Directories
• Today most web search engines integrate categories into the
results listings
• Lycos, MSN, Google
• Link analysis
• Google uses it; others are using it or will soon
• Words on the links seem to be especially useful
• Page popularity
• Many use DirectHit’s popularity rankings
Ranking Algorithms

The problem of ranking
Query: cat dog fish orangutang
(Figure: candidate documents: one containing "cat cat cat / dog dog dog / fish fish fish",
several containing only "cat cat cat", one containing only "orangutang", and one
containing only "fish".)
Which is the best match?
Assigning Weights to Terms
• Binary Weights
• Raw term frequency
• tf x idf
• Recall the Zipf distribution
• Want to weight terms highly if they are
• frequent in relevant documents … BUT
• infrequent in the collection as a whole
• Automatically derived thesaurus terms
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector

docs   t1  t2  t3
D1      1   0   1
D2      1   0   0
D3      0   1   1
D4      1   0   0
D5      1   1   1
D6      1   1   0
D7      0   1   0
D8      0   1   0
D9      0   0   1
D10     0   1   1
D11     1   0   1
Raw Term Weights
• The frequency of occurrence for the term in each document is included in the vector

docs   t1  t2  t3
D1      2   0   3
D2      1   0   0
D3      0   4   7
D4      3   0   0
D5      1   6   3
D6      3   5   0
D7      0   8   0
D8      0  10   0
D9      0   0   1
D10     0   3   5
D11     4   0   1
Assigning Weights
• Goal: give more weight to terms that are
• Common in THIS document
• Uncommon in the collection as a whole
• The tf x idf measure:
• term frequency (tf)
• inverse document frequency (idf)
Document Vectors
• Documents are represented as “bags of
words”
• Represented as vectors when used
computationally
• A vector is like an array of floating point values
• Each vector holds a place for every term in the
collection
• Therefore, most vectors are sparse
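As a concrete illustration of the sparse "bag of words" representation, here is a minimal sketch in Python. The whitespace tokenization is an assumption; the sample sentence is Doc 1 from the inverted-file example later in the lecture.

```python
from collections import Counter

def bag_of_words(text):
    """Represent a document as a sparse vector: term -> raw count.
    Terms absent from the document are simply not stored."""
    tokens = text.lower().split()
    return Counter(tokens)

doc = "now is the time for all good men to come to the aid of their country"
print(bag_of_words(doc))   # e.g. {'to': 2, 'the': 2, 'now': 1, ...}
```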
Document Vectors
One location for each word.
(Figure: a term-by-document count matrix over the terms nova, galaxy, heat,
h'wood, film, role, diet, and fur for documents A through I. A blank cell means
0 occurrences. For example, "nova" occurs 10 times in text A, "galaxy" occurs
5 times in text A, and "heat" occurs 3 times in text A.)

Document Vectors
One location for each word.
(Same figure, annotating text I: "film" occurs 5 times, "diet" occurs 1 time, and
"fur" occurs 3 times in text I.)

Document Vectors
Document ids
(Same figure, with the row labels A through I highlighted as document ids.)
Vector Space Model
• Documents are represented as vectors in term
space
• Terms are usually stems
• Documents represented by binary vectors of terms
• Queries represented the same as documents
• Query and Document weights are based on length
and direction of their vector
• A vector distance measure between the query and
documents is used to rank retrieved documents
Documents in 3D Space
Assumption: Documents that are “close together”
in space are similar in meaning.
tf x idf

w_ik = tf_ik * log(N / n_k)

where:
• T_k = term k in document D_i
• tf_ik = frequency of term T_k in document D_i
• idf_k = inverse document frequency of term T_k in collection C
• N = total number of documents in collection C
• n_k = number of documents in C that contain T_k
• idf_k = log(N / n_k)
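To make the weighting concrete, here is a small sketch of tf x idf in Python. It is an illustrative implementation of the formula above; the toy collection is an assumption, not data from the lecture.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute w_ik = tf_ik * log(N / n_k) for every term in every document.
    docs: list of token lists; returns one {term: weight} dict per document."""
    N = len(docs)
    tf = [Counter(doc) for doc in docs]                       # tf_ik
    df = Counter(term for doc_tf in tf for term in doc_tf)    # n_k
    return [{term: count * math.log(N / df[term])
             for term, count in doc_tf.items()}
            for doc_tf in tf]

# Hypothetical toy collection: a term that appears in every document gets weight 0;
# a term concentrated in one document gets a high weight.
docs = [["cat", "cat", "cat", "dog"],
        ["dog", "fish", "fish"],
        ["cat", "orangutang"]]
for i, weights in enumerate(tfidf_weights(docs)):
    print(f"D{i+1}", {t: round(w, 2) for t, w in weights.items()})
```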
Computing Similarity Scores
(Figure: the query and two documents plotted as vectors in 2D term space,
with axes from 0.2 to 1.0.)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
Q  = (0.4, 0.8)
cos(angle 1) = 0.74   (angle between Q and D1)
cos(angle 2) = 0.98   (angle between Q and D2)
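A minimal sketch of the cosine computation behind this figure, using the vectors from the slide; the helper function itself is an assumed generic implementation, not taken from the lecture.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine(Q, D1), 2))   # 0.73 (the slide rounds this to 0.74)
print(round(cosine(Q, D2), 2))   # 0.98: D2 is nearly parallel to the query, so it ranks first
```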
The results of ranking
Query: cat dog fish orangutang
(Figure: the same candidate documents as in the earlier "problem of ranking" slide,
now returned in ranked order.)
What does vector space ranking do?
High-Precision Ranking
Proximity search can help get high-precision results when the query has more than one term
• Hearst '96 paper:
• Combine Boolean and passage-level proximity
• Shows significant improvements when retrieving top 5, 10, 20, 30 documents
• Results reproduced by Mitra et al. 98
• Google uses something similar
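As a rough illustration of passage-level proximity, here is a generic sketch of the idea; it is not the specific method from Hearst '96 or Google's implementation, and the window size and position-list format are assumptions.

```python
def terms_within_window(positions, window=50):
    """True if every query term occurs within `window` word positions of some
    occurrence of the first term.  `positions` maps each query term to the
    sorted list of its positions in one document (as a positional index stores them)."""
    if any(not plist for plist in positions.values()):
        return False
    first, *rest = positions.values()
    for anchor in first:
        if all(any(abs(p - anchor) <= window for p in plist) for plist in rest):
            return True
    return False

# Hypothetical positional postings for one document:
doc_positions = {"mazda": [3, 120], "transmissions": [8, 410]}
print(terms_within_window(doc_positions))   # True: positions 3 and 8 fall in one window
```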
What is Really Being Used?
• Lots of variation here
• Pretty messy in many cases
• Details usually proprietary and fluctuating
• Combining subsets of:
  • Term frequencies
  • Term proximities
  • Term position (title, top of page, etc.)
  • Term characteristics (boldface, capitalized, etc.)
  • Link analysis information
  • Category information
  • Popularity information
Web Spam
• Email Spam:
• Undesired content
• Web Spam:
• Content disguised as something it is not, in order to:
• Be retrieved more often than it otherwise would be
• Be retrieved in contexts in which it otherwise would not be retrieved
Web Spam
• What are the types of Web spam?
• Add extra terms to get a higher ranking
• Repeat “cars” thousands of times
• Add irrelevant terms to get more hits
• Put a dictionary in the comments field
• Put extra terms in the same color as the background of the web
page
• Add irrelevant terms to get different types of hits
• Put “sex” in the title field in sites that are selling cars
• Add irrelevant links to boost your link analysis ranking
• There is a constant “arms race” between web
search companies and spammers
Inverted Index
• This is the primary data structure for text indexes
• Main Idea:
• Invert documents into a big index
• Basic steps:
• Make a “dictionary” of all the tokens in the collection
• For each token, list all the docs it occurs in.
• Do a few things to reduce redundancy in the data
structure
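A minimal sketch of those steps in Python. This is an illustrative build, not any engine's production code; the whitespace tokenizer is the simplest possible assumption.

```python
from collections import defaultdict, Counter

def build_inverted_index(docs):
    """docs: {doc_id: text}.  Returns
    dictionary: term -> number of documents containing it, and
    postings:   term -> list of (doc_id, within-document frequency)."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())          # tokenize, then count per document
        for term, freq in counts.items():
            postings[term].append((doc_id, freq))
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, postings

# The two example documents used later in the lecture:
docs = {1: "now is the time for all good men to come to the aid of their country",
        2: "it was a dark and stormy night in the country manor the time was past midnight"}
dictionary, postings = build_inverted_index(docs)
print(postings["country"])   # [(1, 1), (2, 1)]
print(postings["the"])       # [(1, 2), (2, 2)]
```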
Inverted indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
• document ID
• frequency of term in doc (optional)
• position of term in doc (optional)
• These lists can be used to solve Boolean queries
• Also used for statistical ranking algorithms
Inverted Indexes
An Inverted File is a vector file "inverted" so that rows become columns and
columns become rows.

docs   t1  t2  t3
D1      1   0   1
D2      1   0   0
D3      0   1   1
D4      1   0   0
D5      1   1   1
D6      1   1   0
D7      0   1   0
D8      0   1   0
D9      0   0   1
D10     0   1   1

Terms  D1  D2  D3  D4  D5  D6  D7  …
t1      1   1   0   1   1   1   0
t2      0   0   1   0   1   1   1
t3      1   0   1   0   1   0   0
How Are Inverted Files Created
• Documents are parsed to extract tokens. These are saved with the Document ID.

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term      Doc #
now        1
is         1
the        1
time       1
for        1
all        1
good       1
men        1
to         1
come       1
to         1
the        1
aid        1
of         1
their      1
country    1
it         2
was        2
a          2
dark       2
and        2
stormy     2
night      2
in         2
the        2
country    2
manor      2
the        2
time       2
was        2
past       2
midnight   2
How Inverted Files are Created
After all documents have been parsed, the inverted file is sorted alphabetically:

Term      Doc #
a          2
aid        1
all        1
and        2
come       1
country    1
country    2
dark       2
for        1
good       1
in         2
is         1
it         2
manor      2
men        1
midnight   2
night      2
now        1
of         1
past       2
stormy     2
the        1
the        1
the        2
the        2
their      1
time       1
time       2
to         1
to         1
was        2
was        2
How Inverted Files are Created
• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled.

Term      Doc #   Freq
a          2       1
aid        1       1
all        1       1
and        2       1
come       1       1
country    1       1
country    2       1
dark       2       1
for        1       1
good       1       1
in         2       1
is         1       1
it         2       1
manor      2       1
men        1       1
midnight   2       1
night      2       1
now        1       1
of         1       1
past       2       1
stormy     2       1
the        1       2
the        2       2
their      1       1
time       1       1
time       2       1
to         1       2
was        2       2
How Inverted Files are Created
The file is then split into a Dictionary and a Postings file:

Dictionary
Term      N docs  Tot Freq
a          1       1
aid        1       1
all        1       1
and        1       1
come       1       1
country    2       2
dark       1       1
for        1       1
good       1       1
in         1       1
is         1       1
it         1       1
manor      1       1
men        1       1
midnight   1       1
night      1       1
now        1       1
of         1       1
past       1       1
stormy     1       1
the        2       4
their      1       1
time       2       2
to         1       2
was        1       2

Postings
Doc #  Freq   (term)
2      1      a
1      1      aid
1      1      all
2      1      and
1      1      come
1      1      country
2      1      country
2      1      dark
1      1      for
1      1      good
2      1      in
1      1      is
2      1      it
2      1      manor
1      1      men
2      1      midnight
2      1      night
1      1      now
1      1      of
2      1      past
2      1      stormy
1      2      the
2      2      the
1      1      their
1      1      time
2      1      time
1      2      to
2      2      was
Inverted indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
• document ID
• frequency of term in doc (optional)
• position of term in doc (optional)
• These lists can be used to solve Boolean queries:
• country -> d1, d2
• manor -> d2
• country AND manor -> d2
• Also used for statistical ranking algorithms
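A minimal sketch of answering such a Boolean AND query from postings lists. Illustrative only; the tiny postings dictionary below is a hand-written slice of what the build sketch shown earlier produces for Doc 1 and Doc 2.

```python
def boolean_and(postings, term1, term2):
    """Return the doc ids present in the postings lists of both terms."""
    docs1 = {doc_id for doc_id, _freq in postings.get(term1, [])}
    docs2 = {doc_id for doc_id, _freq in postings.get(term2, [])}
    return sorted(docs1 & docs2)

# Postings for the two example documents (term -> list of (doc id, frequency)):
postings = {"country": [(1, 1), (2, 1)],
            "manor":   [(2, 1)],
            "the":     [(1, 2), (2, 2)]}
print(boolean_and(postings, "country", "manor"))   # [2]
```

Production engines typically keep each postings list sorted by doc id and intersect the lists with a merge scan rather than building sets, but the result is the same.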
How Inverted Files are Used
(The Dictionary and Postings tables from the earlier slides, used to answer a query.)

Query on "time" AND "dark":
• 2 docs with "time" in dictionary -> IDs 1 and 2 from postings file
• 1 doc with "dark" in dictionary -> ID 2 from postings file
• Therefore, only doc 2 satisfies the query.
Web Search Architecture
Web Search Architecture
• Preprocessing
• Collection gathering phase
• Web crawling
• Collection indexing phase
• Online
• Query servers
An Example Search System:
Cha-Cha
• A system for searching complex intranets
• Places retrieval results in context
• Important design goals:
• Users at any level of computer expertise
• Browsers at any version level
• Computers of any speed
How Cha-Cha Works
• Crawl the Intranet
• Compute the shortest hyperlink path from a
certain root page to every web page
• Index and compute metadata for the pages
Cha-Cha System Architecture
(Diagram: crawl the web -> store the documents.)

Cha-Cha System Architecture
(Diagram: crawl the web -> store the documents -> create a keyword index and
create files of metadata, using Cheshire II.)
Cheshire II
Creating a Keyword Index
• For each document
• Tokenize the document
• Break it up into tokens: words, stems, punctuation
• There are many variations on this
• Record which tokens occurred in this document
• Called an Inverted Index
• Dictionary: a record of all the tokens in the collection and their
overall frequency
• Postings File: a list recording for each token, which document
it occurs in and how often it occurs
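As a small illustration of one such variation, here is a sketch of a tokenizer with a crude suffix-stripping "stemmer". The regular expression and the suffix rules are assumptions for illustration, not Cha-Cha's actual tokenizer; real systems typically use something like the Porter stemmer.

```python
import re

def tokenize(text):
    """Lowercase, split on runs of letters, and apply a crude suffix-stripping stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens]

def naive_stem(token):
    # Extremely simplified stemming, for illustration only.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

print(tokenize("Searching the pages of complex intranets"))
# ['search', 'the', 'page', 'of', 'complex', 'intranet']
```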
Responding to the User Query
• User searches on “pam samuelson”
• Search Engine looks up documents indexed
with one or both terms in its inverted index
• Search Engine looks up titles and shortest
paths in the metadata index
• User Interface combines the information
and presents the results as HTML
Standard Web Search Engine Architecture
(Diagram: crawl the web -> check for duplicates, store the documents -> create an
inverted index over the DocIds -> search engine servers use the inverted index to
answer the user query and show results to the user.)
Inverted Indexes for Web Search Engines
• Inverted indexes for word lists
• Some systems partition the indexes across
different machines; each machine handles
different parts of the data
• Other systems duplicate the data across
many machines; queries are distributed
among the machines
• Most do a combination of these
In this example, the data
for the pages is
partitioned across
machines. Additionally,
each partition is allocated
multiple machines to
handle the queries.
Each row can handle 120
queries per second
Each column can handle
7M pages
To handle more queries,
add another row.
From description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
Cascading Allocation of CPUs
• A variation on this that produces a cost savings:
• Put high-quality/common pages on many
machines
• Put lower quality/less common pages on fewer
machines
• Query goes to high quality machines first
• If no hits found there, go to other machines
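A rough sketch of that fallback strategy. Illustrative only; the Tier class, its search interface, and the toy page data are assumptions about how such a cascade might be wired, not a description of any real engine.

```python
class Tier:
    """One set of machines holding a slice of the pages (assumed interface)."""
    def __init__(self, name, pages):
        self.name, self.pages = name, pages
    def search(self, query):
        return [p for p in self.pages if query in p]

def cascaded_search(query, tiers, min_hits=1):
    """Try each tier in order; stop at the first tier that returns enough hits.
    tiers[0] holds the high-quality/common pages, later tiers the rest."""
    for tier in tiers:
        hits = tier.search(query)
        if len(hits) >= min_hits:
            return tier.name, hits
    return None, []

tiers = [Tier("high-quality", ["official toyota site", "popular news page"]),
         Tier("everything else", ["obscure toyota fan page"])]
print(cascaded_search("toyota", tiers))   # answered by the high-quality tier
print(cascaded_search("fan", tiers))      # falls through to the second tier
```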
Web Crawlers
• How do the web search engines get all of
the items they index?
• Main idea:
• Start with known sites
• Record information for these sites
• Follow the links from each site
• Record information found at new sites
• Repeat
Web Crawlers
• How do the web search engines get all of the items
they index?
• More precisely:
• Put a set of known sites on a queue
• Repeat the following until the queue is empty:
• Take the first page off of the queue
• If this page has not yet been processed:
– Record the information found on this page
• Positions of words, links going out, etc
– Add each link on the current page to the queue
– Record that this page has been processed
• In what order should the links be followed?
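The same loop as a minimal Python sketch, using a FIFO queue (i.e. breadth-first order, one answer to the question above). Illustrative only: real crawlers add politeness delays, robots.txt checks, parallelism, and error handling, and fetch_links here is a placeholder for downloading a page and extracting its outgoing links.

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=1000):
    """Breadth-first crawl: fetch_links(url) -> list of outgoing links.
    Returns the set of pages visited."""
    queue = deque(seed_urls)          # known sites to start from
    processed = set()
    while queue and len(processed) < max_pages:
        url = queue.popleft()
        if url in processed:
            continue
        links = fetch_links(url)      # record information found on this page
        processed.add(url)
        queue.extend(links)           # add each link on the current page to the queue
    return processed

# Hypothetical in-memory "web" standing in for real HTTP fetching:
toy_web = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}
print(crawl(["a"], lambda url: toy_web.get(url, [])))
```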
Page Visit Order
Animated examples of breadth-first vs depth-first search on trees:
http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html
Structure to be traversed
Web Crawling Issues
• "Keep-out" signs
  • A file called robots.txt tells the crawler which directories are off limits
• Freshness
  • Figure out which pages change often
  • Recrawl these often
• Duplicates, virtual hosts, etc.
  • Convert page contents with a hash function
  • Compare new pages to the hash table
• Lots of problems
  • Server unavailable
  • Incorrect html
  • Missing links
  • Infinite loops
• Web crawling is difficult to do robustly!
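A tiny sketch of the duplicate check mentioned above. Illustrative only; the hash choice is an assumption, and real engines use content fingerprints or shingles rather than an exact hash of the raw text, so near-duplicates are caught too.

```python
import hashlib

seen_hashes = set()

def is_duplicate(page_text):
    """Hash the page contents and check against pages already crawled."""
    digest = hashlib.sha1(page_text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False

print(is_duplicate("same page body"))   # False: first time seen
print(is_duplicate("same page body"))   # True: exact duplicate (e.g. a mirror or virtual host)
```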
Commercial Issues
• General internet search is often commercially
driven
• Commercial sector sometimes hides things – harder to
track than research
• On the other hand, most CTOs for search engine
companies used to be researchers, and so help us out
• Commercial search engine information changes
monthly
• Sometimes motivations are commercial rather than
technical
For More Information
• IS202: Information Organization and Retrieval,
  http://www.sims.berkeley.edu/courses/is202/f00/Assignments.html
• Modern Information Retrieval, Baeza-Yates and Ribeiro, Addison Wesley, 1999.
  http://www.sims.berkeley.edu/~hearst/irbook
• Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web
  Search Engine, in the Proceedings of WWW7 / Computer Networks 30(1-7): 107-117, 1998.
  http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm
• Jurgen Koenemann and Nicholas J. Belkin, A Case for Interaction: A Study of
  Interactive Information Retrieval Behavior and Effectiveness, in the Proceedings
  of ACM/CHI, Vancouver, 1996.
• Marti Hearst, Improving Full-Text Precision on Short Queries using Simple
  Constraints, Proceedings of the Fifth Annual Symposium on Document Analysis and
  Information Retrieval (SDAIR), Las Vegas, NV, April 1996.
  http://www.sims.berkeley.edu/~hearst/publications.shtml