
A glance at the world of search engines

July 2005

Matias Cuenca-Acuna Research Scientist Teoma Search Development

The numbers behind search engines

• We are the 8th most visited site in the world!
  – People don't use bookmarks
• We serve 4B queries per month
  – Assume you get 0.00001 of them and can sell something to 0.01 of the users that visit your site
  – You would close 400 operations per month (arithmetic sketched below)
    • Not bad for doing nothing
• We reach 25% of the active US audience
  – Imagine how effective all the search engines combined are at delivering a message
  – In the US, search engines reach 126 million people
  – The largest audience ever for a TV show in the US was 177 million
• We push the envelope in several areas of CS
  – Huge-scale distributed systems (thousands of machines)
  – Terabytes of data
  – Integrating distributed and scalable systems, fault tolerance, information retrieval, natural language understanding, etc.
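A quick check of the arithmetic behind the "400 operations" bullet above; the query volume and conversion rates are the slide's hypothetical figures, not real data:

```python
# Hypothetical conversion funnel from the slide: 4B queries/month,
# 0.00001 of them reach your site, 0.01 of those visitors buy something.
queries_per_month = 4_000_000_000
visits = queries_per_month * 0.00001   # 40,000 visits per month
sales = visits * 0.01                  # 400 sales per month
print(f"{visits:,.0f} visits -> {sales:,.0f} sales per month")
```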

How do search engines work?

Crawler + Indexing + Ranking Algorithm = Search Results

Overview of the talk

• Crawling
• Content control & cleaning
• Indexing
• Ranking

Crawling

How much do search engines index?

Search Engine   Self-Reported Size (Billions)   Est. Size (Billions)   Coverage of Indexed Web   Coverage of Total Web
Google          8.1                             8.0                    76.2%                     69.6%
Yahoo           4.2 (est.)                      6.6                    69.3%                     57.4%
Ask             2.5                             5.3                    57.6%                     46.1%
MSN             5.0                             5.1                    61.9%                     44.3%
Indexed Web     -                               9.4                    -                         -
Total Web       -                               11.5                   -                         -

Source [Gulli & Signorini 2005]

Goals

• Coverage
  – Download a complete copy of the web
• Freshness
  – Keep the copy in sync with the source pages
• Politeness
  – Do it without disrupting the web, obeying webmasters' constraints

The coverage challenge

• The distribution of pages to servers follows a power law
  – i.e. the majority of the content is on a minority of the servers
• Suppose you want to crawl 100% of wikipedia.org
  – They have over 11M pages
  – The average page size is 10KB
    • They host 112GB of content
  – Suppose you want to stay fresh and finish them in 15 days
    • This content is highly dynamic
  – We would have to download 7GB per day!!! (see the back-of-the-envelope sketch below)
    • This is not acceptable for a non-profit organization
• The deep web
  – It is estimated that there are 500B pages that are not accessible to search engines
  – Scripting languages, dynamic pages, forms, session IDs, non-HTML pages, etc.
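A back-of-the-envelope sketch of the Wikipedia numbers above; the page count, average page size, and refresh window are the slide's assumptions, not measured values:

```python
# Crawl budget for wikipedia.org under the slide's assumptions:
# 11M pages, ~10KB average page, full refresh every 15 days.
pages = 11_000_000
avg_page_bytes = 10_000       # ~10 KB
refresh_days = 15

total_gb = pages * avg_page_bytes / 1e9
per_day_gb = total_gb / refresh_days
print(f"total copy: {total_gb:.0f} GB")          # ~110 GB (slide rounds to 112 GB)
print(f"per day   : {per_day_gb:.1f} GB/day")    # ~7.3 GB/day (slide: ~7 GB)
```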

The freshness challenge

• In their work, [Cho & Garcia-Molina 2003] find that:
  – After tracking 154 sites for one year, only 24% of the initial links are still available; on average, 25% new links were created every week
  – 5% new shingles are downloaded each week vs. 8% new pages detected
    • New pages borrow old content; the creation of pages is the largest source of change (not the modification of existing pages)
  – After a year, 50% of the content of the web and 80% of its links are replaced
• namestead.com records 250K new .com domains per day

Source [Brewington & Cybenko 2000]

The freshness challenge

• The promise of adaptive crawling [Brewington & Cybenko 2000]
  – Page updates can be modeled as a Poisson process
    • The distribution of update intervals is exponential
  – Still, if we need 8ms to pull a URL record from disk, crawling 11B pages would take 2.7 years!!
• Help from the webmasters
  – Conditional GETs (sketched below)
  – Site maps
  – RSS
  – Although it is in their best interest to be indexed, not all of them can be trusted
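As a small illustration of the first item, here is a minimal conditional-GET sketch using only Python's standard library; the URL and the "last crawl" timestamp are placeholders, and a real crawler would also track ETag and Last-Modified values per page:

```python
import time
import urllib.request
from email.utils import formatdate
from urllib.error import HTTPError

url = "https://example.com/some-page.html"      # placeholder URL
last_crawl = time.time() - 7 * 86400            # pretend we fetched it a week ago

# Ask the server to send the page only if it changed since our last fetch;
# a 304 reply costs almost no bandwidth.
req = urllib.request.Request(
    url, headers={"If-Modified-Since": formatdate(last_crawl, usegmt=True)}
)
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = resp.read()                      # 200: page changed, re-crawl it
        print(f"changed, fetched {len(body)} bytes")
except HTTPError as err:
    if err.code == 304:
        print("not modified since last crawl, keep cached copy")
    else:
        raise
```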

The politeness challenge

• In order to preserve bandwidth, webmasters put many constraints on us (see the robots.txt sketch after this list)
  – Follow the Robots.txt standard
  – 'Crawl-Delay' directive to specify a minimum download interval
  – NOARCHIVE: don't cache this page
  – NOINDEX: don't index this page
  – NOFOLLOW: index but don't follow links
  – Use compression to save bandwidth
    • Up to 75% savings with gzip
• Getting out of crawling traps
  – Calendars
    • We don't need pages for the year 3001
  – Session IDs on URLs
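A minimal sketch of honoring robots.txt and Crawl-Delay with Python's standard urllib.robotparser; the robots.txt content and the user-agent name are made up for illustration:

```python
from urllib import robotparser

# Hypothetical robots.txt a crawler might receive from a site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Check each URL before fetching, and respect the requested delay.
for path in ("/index.html", "/private/report.html"):
    allowed = rp.can_fetch("ExampleBot", path)
    print(f"{path}: {'fetch' if allowed else 'skip (disallowed)'}")

print("minimum delay between requests:", rp.crawl_delay("ExampleBot"), "seconds")
```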

Crawling in a nutshell

• Timing, timing, timing
• It's a huge scheduling problem
  – It's like trying to keep 11B customers stoked
• Requires massive bandwidth and storage
  – 11B pages x 10KB = 112TB!!

Content control & cleaning

Goals

• "Bad pages" can hurt relevancy
  – Porn
  – Spam
    • Pages whose sole purpose is to manipulate ranking
  – Duplicates (a shingling sketch follows below)
• Must recognize and remove them
  – Each is a time-consuming, formidable task
  – Not an exact science, constantly evolving
• Must do a good job
  – Otherwise we get angry phone calls from parents, or get sued
  – Not a laughing matter, as you might have thought
• Requires a mixture of algorithmic and editorial solutions
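The talk does not say which duplicate-detection method Teoma uses; one standard approach (and the source of the "shingles" mentioned on the freshness slide) is to compare documents by their word shingles, sketched here on two made-up snippets:

```python
def shingles(text: str, k: int = 4) -> set:
    """All k-word shingles (contiguous word windows) of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets; values near 1 mean near-duplicates."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

doc1 = "search engines crawl index and rank billions of pages on the web"
doc2 = "search engines crawl index and rank billions of web pages every day"
print(f"similarity: {jaccard(shingles(doc1), shingles(doc2)):.2f}")
```

At web scale the shingle sets are hashed and sampled (e.g. with min-hashing) rather than compared pairwise.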

A case of automatic spam content

A sample duplicate

Indexing

Purpose

• Create an efficient data structure for answering queries
• Forward indexing
  – Creating a page-to-word index
  – Recognize as many page properties as possible
    • Last modified, language, etc.
• Inverted indexing (sketched below)
  – A transposition of the forward index
    • Containing equivalent information
  – Creating a word-to-page index
    • Also keeps information about individual pages
      – Last modified time
      – Language
      – Others
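A minimal sketch of the forward-to-inverted transposition described above; the toy pages are made up, and a production index would also store positions, page properties, and compressed posting lists:

```python
from collections import defaultdict

# Toy forward index: page -> words appearing on that page.
forward_index = {
    "page1": ["search", "engines", "crawl", "the", "web"],
    "page2": ["ranking", "search", "results"],
    "page3": ["web", "crawl", "politeness"],
}

# Transpose it into an inverted index: word -> set of pages.
inverted_index = defaultdict(set)
for page, words in forward_index.items():
    for word in words:
        inverted_index[word].add(page)

# A conjunctive query is answered by intersecting posting lists.
def query(*terms):
    postings = [inverted_index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(query("crawl", "web"))   # ['page1', 'page3']
```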

Ranking

(i.e. the secret sauce)

Goals

• Return relevant results
  – How do we pick the best 10 pages out of billions?
• Understand the intent of the user
  – A query is more than just words
  – What is the meaning behind it?
• Structured data (Smart Answers)
  – Sometimes the best answer is not on the web

General Link Analysis Methods

[Diagram: three example pages, A ("Dogs"), B ("Laps"), and C ("Ears"); page A links to pages B and C]

A link from page A to page B (or C) is a vote or recommendation by the author of page A for page B (or C). The greater the number of such votes a page receives, the more authoritative it is considered. First proposed by Kleinberg (known as the HITS algorithm); a minimal sketch follows below.
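The slide credits Kleinberg's HITS; as a rough illustration (not Teoma's actual implementation), here is a minimal hub/authority iteration on a tiny made-up graph that extends the A/B/C example with one extra page D:

```python
import math

# Toy link graph: page -> pages it links to (A links to B and C, as in
# the diagram; D is an extra made-up page that also links to C).
links = {"A": ["B", "C"], "B": [], "C": [], "D": ["C"]}
pages = sorted(links)

hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):  # power iteration until the scores stabilise
    # Authority: sum of hub scores of the pages linking to you.
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # Hub: sum of authority scores of the pages you link to.
    hub = {p: sum(auth[t] for t in links[p]) for p in pages}
    # Normalise so the scores stay bounded.
    a_norm = math.sqrt(sum(v * v for v in auth.values()))
    h_norm = math.sqrt(sum(v * v for v in hub.values()))
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print("authorities:", {p: round(v, 2) for p, v in auth.items()})  # C scores highest
print("hubs:       ", {p: round(v, 2) for p, v in hub.items()})   # A scores highest
```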

Highlights of the technologies behind Jeeves

• Teoma search technology
  – Finding authoritative pages in a community of pages in real time
  – Only engine that solves this computational issue
• DirectHit feedback technology
• Natural language question answering
  – Answering queries that are posed as questions

Understanding Teoma's view of the Web: Local Subject Communities of Sites


Teoma’s Approach at a Glance

1. Search the index to collect & calculate global information
2. Break the index into communities
3. Collect & calculate local subject-specific information (hubs and authorities within each local subject community)
4. Apply all pertinent global and local information

Key challenges

• Solving the problem in real time
  – 200 ms or less to do this computation for each query, millions of times per day
• Identifying the communities
  – Link structure on the web is noisy
  – Hubs link to multiple topic areas

Jeeves in action

Jeeves in action 2

What happens when you don’t understand the query

Conclusion

• The key factors for a search engine are
  – Coverage
    • How much of the web do we index?
  – Freshness
    • What is the average staleness of the database?
  – Ranking
    • Given 1M pages, how do we select the 10 most relevant?
  – Query understanding
    • What is the intent of the user?
• Simple problems become very challenging
  – 1000's of machines
  – TB's of data
  – Millions of daily queries requiring sub-second response time

A sample problem

• Assume we have a 2TB file with URLs
  – 6B URLs coming from 100M hosts
  – Stored on a 10MB/sec disk (2.3 days needed to read it)
  – Assume 300 bytes max URL size and 100 bytes max host name size
• How would you find the top 10 largest hosts? (one possible approach is sketched below)
• What if we had more than one machine to do it?
  – The file is still on one disk
• What if we know that the page-to-host distribution follows a power law?
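One straightforward single-machine approach (a sketch, not necessarily the intended answer): stream the file once, count URLs per host, and take the top 10 at the end. The file name is a placeholder, and this assumes the machine has enough RAM for a counter over ~100M distinct hosts; if not, hash each host into a handful of buckets and process one bucket per pass (or one bucket per machine), then merge the per-bucket top 10s.

```python
import heapq
from collections import Counter
from urllib.parse import urlsplit

def top_hosts(url_file: str, k: int = 10):
    """Stream a file of URLs (one per line) and return the k hosts
    that contribute the most URLs. The scan is disk-bound: at
    10 MB/sec, reading 2TB already takes ~2.3 days."""
    counts = Counter()
    with open(url_file, encoding="utf-8", errors="replace") as f:
        for line in f:
            host = urlsplit(line.strip()).netloc
            if host:
                counts[host] += 1
    # nlargest avoids sorting all ~100M hosts just to report 10.
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

# Usage (placeholder path):
# for host, n in top_hosts("urls.txt"):
#     print(host, n)
```

The power-law hint means a small number of hosts account for most of the URLs, which is what makes sampling or heavy-hitter style approximations attractive.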

Questions?