Electronic Publishing and Scholarly Communication: Technology
Principles of IR
Hacettepe University
Department of Information Management
DOK 324: Principles of IR
Search engines
Some Slides taken from: Ray Larson
The beginnings - Yahoo
“DOK 324 PIR”
Yet Another Hierarchical Officious Oracle
David Filo and Jerry Yang, Stanford University,
spring 1994
– keep track of their personal interests on the Internet
– later converted into an accessible database
fall 1994 - 1 million hits, 100,000 unique visitors
March 1995 - moved into business
Today…’also a search engine’
– But focused on offering other services
– The search technology is actually licensed from Google
Y.T.
The current favourite - Google
Indexes
– 3.5 billion web pages (1.6 billion)
– 35 million non-HTML files (22 million)
– 700 million newsgroup messages (650 million)
– 250 million images
Serves 200 million queries / day (150 million)
Note: the figures from last year are in brackets
Google’s ‘life of a query’
3-tier system
• Front-end
• Database
• Processing
Why is it good? - technical reasons!
Powerful – cluster of 10,000 Linux servers
PageRank technology
– A link from Page A to Page B is a "vote" by Page A for Page B
  • The more links refer to page B, the higher page B will score
– The score of page A will be used when voting for page B
  • The more important page A is, the higher page B will score
Hypertext-Matching Analysis: analyse page content in terms of headings, fonts, position, neighbours
– Differentiate between ‘title’ text and ‘small-print’ text
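The "voting" idea behind PageRank can be sketched as a small iterative computation. This is only an illustration, not Google's actual implementation: the function name `pagerank`, the damping factor 0.85, the iteration count and the three-page link graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages so that links from high-scoring pages count for more.

    `links` maps each page to the list of pages it links to; every
    linked page is assumed to appear as a key as well.
    """
    pages = sorted(links)
    n = len(pages)
    score = {page: 1.0 / n for page in pages}  # start with equal scores

    for _ in range(iterations):
        # Every page keeps a small baseline score (assumed damping model).
        new_score = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its current score among the pages it votes for.
                share = damping * score[page] / len(outgoing)
                for target in outgoing:
                    new_score[target] += share
            else:
                # A page without outgoing links spreads its score evenly.
                for target in pages:
                    new_score[target] += damping * score[page] / n
        score = new_score
    return score


# Hypothetical graph: A and C both link to B; B links to C.
links = {"A": ["B"], "B": ["C"], "C": ["B"]}
scores = pagerank(links)
```

In this toy graph B collects votes from two pages and ends up with the highest score, C benefits from being linked by the important page B, and A, which nobody links to, keeps only the baseline score.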
What can go wrong?
Victim of its own success
– Google becomes the web directory – information that cannot be found in it may be regarded as nonexistent
– Sued for: rank errors, addresses dropped from database
The attraction of money
– ‘bid-for-placing’ web searches – rank websites based on
how much they have paid
– Google is, after all, a business company
Search engines
Web Crawling
Web Search Engines and Algorithms
Standard Web Search Engine Architecture
[Diagram, indexing path: crawl the web → check for duplicates, store the documents (DocIds) → create an inverted index → inverted index]
[Diagram, query path: user query → search engine servers (using the inverted index) → show results to user]
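The inverted index at the centre of this architecture can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: documents are split on whitespace, matching is case-insensitive, and a query returns the DocIds that contain every query word. The names `build_inverted_index` and `search` and the three sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of DocIds of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the DocIds of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical stored documents, keyed by DocId.
documents = {
    1: "web crawling and indexing",
    2: "inverted index for web search",
    3: "search engine servers",
}
index = build_inverted_index(documents)
web_search_hits = search(index, "web search")  # only document 2 has both words
```

The index is built once, after crawling; each query is then answered by intersecting small sets of DocIds rather than by scanning the stored documents.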
Web Crawling
How do the web search engines get all of the items they index?
Main idea:
– Start with known sites
– Record information for these sites
– Follow the links from each site
– Record information found at new sites
– Repeat
Web Crawlers
How do the web search engines get all of the items
they index?
More precisely:
– Put a set of known sites on a queue
– Repeat the following until the queue is empty:
  • Take the first page off of the queue
  • If this page has not yet been processed:
    – Record the information found on this page
      • Positions of words, links going out, etc
    – Add each link on the current page to the queue
    – Record that this page has been processed
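The queue-based procedure above can be sketched as follows. To keep the example self-contained, no network access is performed: the "web" is a hypothetical in-memory dictionary `TOY_WEB`, and `fetch_links` stands in for downloading a page and extracting its links. Both names, and the `.example` addresses, are assumptions for illustration.

```python
from collections import deque


def crawl(seed_urls, fetch_links):
    """Visit pages reachable from the seeds, processing each page once."""
    queue = deque(seed_urls)      # put a set of known sites on a queue
    processed = set()
    visit_order = []

    while queue:                  # repeat until the queue is empty
        url = queue.popleft()     # take the first page off of the queue
        if url in processed:
            continue              # already handled via another link
        outgoing = fetch_links(url)
        visit_order.append(url)   # "record the information found on this page"
        queue.extend(outgoing)    # add each link on the current page to the queue
        processed.add(url)        # record that this page has been processed
    return visit_order


# Hypothetical toy web: page -> links going out of that page.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["a.example", "d.example"],
    "c.example": ["d.example"],
    "d.example": [],
}
visited = crawl(["a.example"], lambda url: TOY_WEB.get(url, []))
```

Because links are taken from the front of the queue in the order they were added, this sketch visits pages breadth-first; the `processed` set is what keeps the link from b.example back to a.example from causing an endless loop.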
In what order should the links be followed?
Page Visit Order
Animated examples of breadth-first vs depth-first search on trees:
– http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html
[Figure: structure to be traversed]
Breadth-first search
[Animation: visible only in presentation mode]
Depth-first search
[Animation: visible only in presentation mode]
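Since the animations do not survive outside presentation mode, the two visit orders can be compared on a small tree in code. This is a sketch only; the six-node tree `TREE` and the function names are made up for the example.

```python
from collections import deque

# Hypothetical tree: A has children B and C; B has D and E; C has F.
TREE = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}


def breadth_first(graph, start):
    """Visit all pages at one link distance before going one level deeper."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()          # oldest discovered page first
        order.append(node)
        for child in graph[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


def depth_first(graph, start):
    """Follow one chain of links as far as possible before backing up."""
    order, seen, stack = [], set(), [start]
    while stack:
        node = stack.pop()              # most recently discovered page first
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Reversed so that the leftmost child is explored first.
        stack.extend(reversed(graph[node]))
    return order


bfs_order = breadth_first(TREE, "A")    # A, B, C, D, E, F
dfs_order = depth_first(TREE, "A")      # A, B, D, E, C, F
```

The only structural difference is the container: a queue gives breadth-first order, a stack gives depth-first order. The `seen` sets also make both functions safe on graphs with cycles, not just trees.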
Sites Are Complex Graphs, Not Just Trees
[Figure: pages grouped into sites (Site 1, Site 2, Site 3, Site 5, Site 6) and linked to one another as a graph]
Web Crawling Issues
Keep out signs
– A file called robots.txt tells the crawler which directories are off limits
Freshness
– Figure out which pages change often
– Recrawl these often
Duplicates, virtual hosts, etc
– Convert page contents with a hash function
– Compare new pages to the hash table
Lots of problems
– Server unavailable
– Incorrect HTML
– Missing links
– Infinite loops
Web crawling is difficult to do robustly!
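Two of these issues, the "keep out signs" and duplicate detection, can be sketched with Python's standard library. The example is illustrative only: the robots.txt rules, the crawler name `toy-crawler`, the `example.com` URLs and the helper names are assumptions, and hashing normalized text as done here catches only exact duplicates, not near duplicates.

```python
import hashlib
from urllib.robotparser import RobotFileParser

# --- Keep out signs: honour robots.txt -------------------------------------
# Rules are parsed from an in-memory list here; a real crawler would
# download /robots.txt from each site before fetching its pages.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])
may_fetch_private = robots.can_fetch("toy-crawler", "http://example.com/private/data.html")
may_fetch_public = robots.can_fetch("toy-crawler", "http://example.com/index.html")

# --- Duplicates: compare content hashes ------------------------------------
seen_fingerprints = set()   # plays the role of the "hash table"


def content_fingerprint(page_text):
    """Hash the page contents after collapsing whitespace and case."""
    normalized = " ".join(page_text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_duplicate(page_text):
    """Return True if a page with the same fingerprint was already seen."""
    fingerprint = content_fingerprint(page_text)
    if fingerprint in seen_fingerprints:
        return True
    seen_fingerprints.add(fingerprint)
    return False


first_copy = is_duplicate("Welcome to   our site")     # new content
second_copy = is_duplicate("welcome to our site")      # same content, other host
other_page = is_duplicate("A completely different page")
```

Storing only fingerprints keeps the comparison cheap: a new page is checked against the set in constant time, whichever virtual host it was fetched from.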
Searching the Web
Web Directories versus Search Engines
Some statistics about Web searching
Challenges for Web Searching
Search Engines
– Crawling
– Indexing
– Querying
Directories vs. Search Engines
Directories
– Hand-selected sites
– Search over the contents of the descriptions of the pages
– Organized in advance into categories
Search Engines
– All pages in all sites
– Search over the contents of the pages themselves
– Organized after the query by relevance rankings or other scores
Search Engines vs. Internal Engines
Not long ago HotBot, GoTo, Yahoo and Microsoft were all powered by Inktomi
Today Google is the search engine behind many other search services (such as Yahoo up until very recently and AOL’s search service)
Statistics from Inktomi
Statistics from Inktomi, August 2000, for one client, one week
– Total # queries: 1315040
– Number of repeated queries: 771085
– Number of queries with repeated words: 12301
– Average words/query: 2.39
– Query type: All words: 0.3036; Any words: 0.6886; Some words: 0.0078
– Boolean: 0.0015 (0.9777 AND / 0.0252 OR / 0.0054 NOT)
– Phrase searches: 0.198
– URL searches: 0.066
– URL searches w/http: 0.000
– email searches: 0.001
– Wildcards: 0.0011 (0.7042 '?'s)
  • frac '?' at end of query: 0.6753
  • interrogatives when '?' at end: 0.8456
  • composed of: who: 0.0783; what: 0.2835; when: 0.0139; why: 0.0052; how: 0.2174; where: 0.1826; where-MIS: 0.0000; can, etc.: 0.0139; do(es)/did: 0.0
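The raw counts at the top of this list are easier to read as shares of the total. The short calculation below assumes that "repeated queries" counts queries that were issued more than once during the week, which the slide does not spell out; the variable names are chosen only for the example.

```python
# Counts taken from the Inktomi statistics above.
total_queries = 1315040
repeated_queries = 771085
queries_with_repeated_words = 12301

# Shares of all queries (assumed interpretation of the counts).
repeated_share = repeated_queries / total_queries
repeated_words_share = queries_with_repeated_words / total_queries

print(f"repeated queries:         {repeated_share:.1%}")
print(f"queries w/ repeated words: {repeated_words_share:.2%}")
```

Under that reading, well over half of the week's queries (roughly 59%) were repeats, while fewer than 1% contained a repeated word.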
What Do People Search for on the Web?
Topics
– Genealogy/Public Figure: 12%
– Computer related: 12%
– Business: 12%
– Entertainment: 8%
– Medical: 8%
– Politics & Government: 7%
– News: 7%
– Hobbies: 6%
– General info/surfing: 6%
– Science: 6%
– Travel: 5%
– Arts/education/shopping/images: 14%
(from Spink et al. 98 study)
Challenges for Web Searching: Data
Distributed data
Volatile data/“Freshness”: 40% of the web changes every month
Exponential growth
Unstructured and redundant data: 30% of web pages are near duplicates
Unedited data
Multiple formats
Commercial biases
Hidden data
Challenges for Web Searching: Users
Users unfamiliar with search engine interfaces (e.g., Does the query “apples oranges” mean the same thing on all of the search engines?)
Users unfamiliar with the logical view of the data (e.g., Is a search for “Oranges” the same thing as a search for “oranges”?)
Many different kinds of users
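Both questions on this slide come down to choices a search engine makes silently on the user's behalf. The sketch below makes the two choices explicit as parameters; the function `matches`, its `mode` and `case_sensitive` parameters, and the sample document are hypothetical and do not describe how any particular engine behaves.

```python
def matches(document, query, mode="all", case_sensitive=False):
    """Decide whether a document matches a multi-word query.

    mode="all": every query word must occur (an implicit AND).
    mode="any": one query word is enough (an implicit OR).
    """
    if not case_sensitive:
        document, query = document.lower(), query.lower()
    document_words = set(document.split())
    hits = [term in document_words for term in query.split()]
    return all(hits) if mode == "all" else any(hits)


doc = "Oranges are grown in Florida"

# "apples oranges": same query, different answers under AND and OR.
and_result = matches(doc, "apples oranges", mode="all")   # False
or_result = matches(doc, "apples oranges", mode="any")    # True

# "Oranges" vs "oranges": differs only if the engine is case-sensitive.
lower_strict = matches(doc, "oranges", case_sensitive=True)   # False
upper_strict = matches(doc, "Oranges", case_sensitive=True)   # True
lower_relaxed = matches(doc, "oranges")                       # True
```

A user who moves between engines with different defaults gets different result sets for the same query without any visible indication of why.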
Web Search Queries
Web search queries are SHORT
– ~2.4 words on average
User Expectations
– Many say “the first item shown should be what I want to see”!
– This works if the user has the most popular/common notion in mind