
How Does Google Work? The
Technology behind Google's Great
Results
Emre Altug Yavuz
Ph.D. candidate
Data Communications Lab.
Electrical & Computer Engineering
University of British Columbia (UBC)
Vancouver, BC, CANADA
2004 © Emre A. Yavuz. EECE, UBC
What is Google ?
A fully automated search engine, which employs
robots known as “spiders” to crawl the web
frequently and find sites for inclusion in the Google
database or index.
Some Google Factoids
• Named for the mathematical term “googol”, or 10^100, the
number represented by the numeral 1 followed by 100
zeros.
• Global unique users per month: 81.9 million.
• Selected by Yahoo (2000) and AOL (2002) as search
engine partner.
• Indexes the largest number of Internet-accessible documents.
• Designed to scale well to extremely large data sets
• Efficient usage of storage space to store the index.
• Optimized data structures for fast and efficient access.
Who invented it, when and why ?
• In the early 90s, search engines started springing out of
academic projects.
• Low-quality results and poorly designed search engines
paved the way for the birth of Google.
• Designed and created by Sergey Brin and Larry Page.
• On September 7, 1998, Google Inc. opened its doors in
a garage in Menlo Park, California.
How does Google Work ?
• When you perform a Google search, you are not actually
searching the web, but rather an index of a copy of the
web stored on Google’s servers.
• The index is compiled from all the pages that have been
returned by a multitude of spiders – called GoogleBot – that
crawl the web.
• When a user types in a query, the search terms are looked up
in the index and the results are then returned from a separate
set of document servers along with advertisements.
• All of these pieces are assembled, with the help of Google’s
PageRank technology, into the page of search results.
What is PageRank ?
• A method of measuring a page’s “importance”.
• An application of academic citation analysis to the web.
• Extends the idea of counting citations, or backlinks, to a
given page: links from different pages are not counted equally,
and each link is normalized by the number of links on its page.
• If pages t1 to tn link to page A, the PageRank of page A is
given as follows:
PR(A) = (1-d) + d · (PR(t1)/C(t1) + … + PR(tn)/C(tn))
where C(A) is defined as the number of links going out of page A
and d is a damping factor between 0 and 1.
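The formula above can be evaluated by simple repeated substitution. The following is a minimal sketch, not Google's implementation: the function name `pagerank`, the three-page example graph, the starting value of 1.0 per page, and the fixed iteration count are all assumptions made for illustration. The value d = 0.85 is the one suggested by Brin and Page in the paper listed in the references.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(t)/C(t)) over all pages t linking to A.

    `links` maps each page to the collection of pages it links to.
    """
    # Use sets so that a repeated link on one page is counted once.
    outgoing = {page: set(targets) for page, targets in links.items()}
    pages = set(outgoing) | {t for targets in outgoing.values() for t in targets}
    out_count = {p: len(outgoing.get(p, ())) for p in pages}

    pr = {p: 1.0 for p in pages}  # assumed starting value for every page
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(t)/C(t) over every page t that links to `page`.
            incoming = sum(
                pr[src] / out_count[src]
                for src, targets in outgoing.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
example_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(example_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this example page C collects links from both A and B, so it ends up with the highest score, and B, which receives only half of A's vote, ends up with the lowest.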
How to tell what the PageRank of a page is
• Download a toolbar from http://toolbar.google.com.
• Once installed, there will be a bar graph at the top of the
browser showing a version of PageRank for the page
being browsed.
• Hold the mouse over the bar to see a number from 0 to 10.
• The number only gives a rough idea and is not very accurate.
If the page is not in the index but a nearby page is, the
toolbar sometimes guesses. It is just a representation of the
actual PageRank.
• Whilst PageRank is linear, Google uses a non-linear graph
to portray it.
How significant is PageRank ?
• The significance of any factor in search engine algorithms
depends on the quality of the information it supplies.
• A factor’s importance is known as its weight.
• Originally, when the Meta keyword tag was new, it could be
used as an indicator of what the page was about.
• However, its weight quickly approached zero because
Webmasters abused it heavily to manipulate results.
• Even though PageRank is harder to manipulate, doing so is
not impossible.
Is PageRank enough to determine the
quality of a page (1)?
“People only link to pages they think are good.” However,
there may be other reasons like:
• Reciprocal links – “Link to me and I’ll link you.”
• Link requirements – “Using our script requires you to put
a link to our website.” or “We’ll give you an award in return
for a link to our website.”
• Friends and family – “This is my friend Pete’s site”
• Free Page Add-ons – “This counter is provided by …”
Is PageRank enough to determine the
quality of a page (2)?
• If a Webmaster picks the outbound links by searching on
Google, then PageRank itself will influence the number of
links to a page, in a circular way.
• Thus the links will no longer be based solely on human
judgement: a page gains links not only because it is a good
page, but also because its PageRank is already high.
Therefore, PageRank is not enough to produce
high precision results.
Other System Features
• Title tag – the most important factor, since most engines
and directories place a high level of importance on it.
• Proximity of search terms – how often do they appear?
How close together are they?
• Text characteristics – font size and type; search terms in
a larger or bolder font are weighted higher than others.
• Anchor text – Anchors often provide more accurate
descriptions of web pages than the pages themselves.
They may exist for documents which cannot be indexed
by a text-based search engine – images, programs,
databases etc.
The difference between PageRank
and other factors
• Title tag – Can only be listed once.
• Keywords in body text – Each successive repetition is less
important. Proximity is important.
• Anchor text – Highly weighted, but like keywords in
body text, there is a cutoff point where further anchor text
is no longer worthwhile.
• PageRank – Potentially infinite. You are always
capable of increasing your PageRank significantly, but it
takes work.
How does Google rank pages ?
• Find all pages matching the keywords of the search.
• Rank them using “on the page factors”, such as whether
keywords are bolded or set in a relatively larger font.
• Factor in the inbound anchor text.
• Adjust the results by PageRank scores.
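The four steps above can be sketched as a toy scoring function. This is only an illustration under stated assumptions: the page records, the field names (`body`, `inbound_anchors`, `pagerank`), the weights, and the choice to multiply the text score by PageRank are all hypothetical, and the sketch counts plain keyword occurrences rather than bold or font-size signals. How Google actually combines these factors is not given in these slides.

```python
def rank(query, pages, w_on_page=1.0, w_anchor=2.0):
    """Return URLs ordered by a toy score; weights are assumed, not Google's."""
    terms = query.lower().split()
    scored = []
    for url, page in pages.items():
        body_words = page["body"].lower().split()
        # Step 1: keep only pages that match every keyword of the search.
        if not all(term in body_words for term in terms):
            continue
        # Step 2: "on the page" factor, here just keyword occurrences.
        on_page = sum(body_words.count(term) for term in terms)
        # Step 3: occurrences of the keywords in inbound anchor text.
        anchor_words = " ".join(page["inbound_anchors"]).lower().split()
        anchor = sum(anchor_words.count(term) for term in terms)
        # Step 4: adjust the text score by the page's PageRank.
        score = (w_on_page * on_page + w_anchor * anchor) * page["pagerank"]
        scored.append((score, url))
    return [url for score, url in sorted(scored, reverse=True)]


# Hypothetical mini-index of three pages.
sample_pages = {
    "a.example": {
        "body": "google search engine",
        "inbound_anchors": ["search engine"],
        "pagerank": 2.0,
    },
    "b.example": {
        "body": "search search engine tips",
        "inbound_anchors": [],
        "pagerank": 1.0,
    },
    "c.example": {
        "body": "cooking recipes",
        "inbound_anchors": ["search engine"],
        "pagerank": 5.0,
    },
}
print(rank("search engine", sample_pages))
```

Note that c.example is dropped in the first step even though it has the highest PageRank, because its body text does not match the query.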
System Anatomy (1)
• Most of Google is implemented in C or C++ for efficiency
and can run on either Solaris or Linux.
• URLserver sends list of URLs to be fetched to the
crawlers.
• The fetched web pages are sent to the storeserver to be
compressed and stored into a repository.
• Every webpage has an associated ID number called a
docID.
• The indexer reads the repository, uncompresses the
documents and parses them to be converted into a set of
word occurrences called hits.
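The store-and-retrieve part of this pipeline can be sketched in a few lines. This is a minimal in-memory illustration, assuming a hypothetical `Repository` class that hands out sequential docIDs and compresses pages with zlib; the real storeserver and repository formats are not described in these slides.

```python
import zlib


class Repository:
    """Toy stand-in for the storeserver plus repository (names are assumed)."""

    def __init__(self):
        self._doc_ids = {}  # URL -> docID
        self._pages = {}    # docID -> compressed page contents

    def store(self, url, html):
        """Compress a fetched page and file it under the URL's docID."""
        # A URL seen for the first time gets the next unused docID.
        doc_id = self._doc_ids.setdefault(url, len(self._doc_ids))
        self._pages[doc_id] = zlib.compress(html.encode("utf-8"))
        return doc_id

    def load(self, doc_id):
        """Uncompress a stored page, as the indexer would before parsing."""
        return zlib.decompress(self._pages[doc_id]).decode("utf-8")


repo = Repository()
doc_id = repo.store("http://www.example.com/", "<html><b>Hello</b> web</html>")
print(doc_id, repo.load(doc_id))
```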
[Figure: High Level Google Architecture]
System Anatomy (2)
• The hits record the word, its position, font size and
capitalization.
• The indexer distributes these hits into a set of barrels. It
also parses out all the links in every webpage and stores
important information about them in an anchors file.
• The URLresolver reads the anchors file and converts
relative URLs into absolute URLs and then into docIDs.
• The sorter takes the barrels, which are sorted by docID, and
re-sorts them by wordID to generate the inverted index. It
also produces a list of wordIDs and offsets into the inverted
index.
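The step from docID-sorted barrels to an inverted index can be sketched as follows. This is a simplified illustration, not the real data layout: the `Hit` record, the function names, the whitespace tokenizer, the fixed font size, and the use of the word itself in place of a numeric wordID are all assumptions.

```python
from collections import defaultdict, namedtuple

# One word occurrence: the word, its position, font size and capitalization.
Hit = namedtuple("Hit", ["word", "position", "font_size", "capitalized"])


def parse_hits(text, font_size=1):
    """Turn a document's text into hits (font size is a placeholder value)."""
    return [
        Hit(word.lower(), position, font_size, word[:1].isupper())
        for position, word in enumerate(text.split())
    ]


def build_forward_barrel(docs):
    """Forward barrel: (docID, hits) pairs sorted by docID."""
    return [(doc_id, parse_hits(text)) for doc_id, text in sorted(docs.items())]


def invert(forward_barrel):
    """Re-sort by word: each word maps to its (docID, position) occurrences."""
    inverted = defaultdict(list)
    for doc_id, hits in forward_barrel:
        for hit in hits:
            inverted[hit.word].append((doc_id, hit.position))
    return dict(sorted(inverted.items()))


docs = {2: "Google ranks pages", 1: "pages link to pages"}
index = invert(build_forward_barrel(docs))
print(index["pages"])  # → [(1, 0), (1, 3), (2, 2)]
```

Because the forward barrel is already ordered by docID, each word's list in the inverted index comes out ordered by docID as well, which is what makes later lookups cheap.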
System Anatomy (3)
• A program called DumpLexicon takes this list together
with the lexicon produced by the indexer and generates a
new lexicon.
• The searcher is run by a webserver and uses the lexicon
together with the inverted index and the PageRank to
answer queries.
How does Google make money ?
• Initially, sold targeted banner advertisements and
provided search services to other websites including
Yahoo.
• Later, launched AdWords – a system for automatically
selling and displaying advertisements alongside search
results. The ads are also ranked according to their
popularity.
• Using the base created by AdWords, launched a context
targeted advertisement system – AdSense.
• Google “next generation corporate software” – released
on 2nd of June 04, query and document update software.
How do you maximize your place on
Google ? (1)
• Make sure that all your pages are indexed in the first
place.
• Pay a great deal of attention to your webpage titles.
• Have keywords well-represented in the body of the
webpage.
• Add content to your pages and to your website; Google
likes sites with lots of content.
• Use keywords as hyperlink names.
How do you maximize your place on
Google ? (2)
• Have a good system of navigation between your
webpages, because PageRank gets passed among the internal
links of a website.
• Get external links to as many pages on your site as you
can. Each external link will add to the PageRank not only
of the page that is linked, but also of every webpage on
your site, if you have good site navigation.
• Do not submit a redirection web page. Most search
engines will skip your web site completely in that case.
• Try to avoid using frames in your web site.
References
• “The Anatomy of a Large Scale Hypertextual Web Search
Engine”, Sergey Brin and Lawrence Page.
• “PageRank Uncovered”, Chris Ridings and Mike Shishigin.
• “Google! Everything you always wanted to know, but didn’t
have time to find out”, Judy Broom, Betsy Chessler and
Katherine Foster.
• And not surprisingly http://www.google.com
THANKS
Questions ?
Some Features of Google (1)
• daterange: limits your search to a particular date or range
of dates that a page was indexed by Google.
• It only works with Julian dates, so you’ll need to find a Julian
date converter online. The Julian date must be an integer
(no decimals).
• Usage: daterange:start-stop
e.g. stjohns daterange:2452401-2452766
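Instead of an online converter, the integer Julian date can be computed directly. The sketch below relies on the fact that Python's `date.toordinal()` differs from the Julian Day Number by a constant; the helper names `julian_day` and `daterange_query` are made up for this example and are not part of any Google tool.

```python
from datetime import date

# Constant difference between the Julian Day Number and Python's
# proleptic Gregorian ordinal (date(1, 1, 1).toordinal() == 1).
JDN_OFFSET = 1721425


def julian_day(day):
    """Return the integer Julian Day Number for a calendar date."""
    return day.toordinal() + JDN_OFFSET


def daterange_query(terms, start, stop):
    """Build a query string that uses the daterange: operator."""
    return f"{terms} daterange:{julian_day(start)}-{julian_day(stop)}"


print(daterange_query("stjohns", date(2002, 5, 6), date(2003, 5, 6)))
# → stjohns daterange:2452401-2452766
```

The example query on the slide therefore covers pages indexed between May 6, 2002 and May 6, 2003.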
Some Features of Google (2)
• filetype: restricts your results to files ending in ".doc"
(or .xls, .ppt, etc.), and shows you only files created
with the corresponding program.
• The “dot” in the file extension – .doc – is optional.
• filetype:extension
e.g. stjohns -filetype:pdf
Some Features of Google (3)
• inanchor: restricts the results to text in a page’s link
anchors.
• inanchor:terms
e.g. stjohns -inanchor:"ubc"
• intext: ignores link text, URLs, and titles, and only
searches body text. It helps you find query words that are
too common in URLs and links.
• intext:terms
e.g. stjohns -intext:"ubc.ca"
Some Features of Google (4)
• intitle: restricts the results to documents containing a
particular word in their title.
• inurl: restricts the results to documents containing a
particular word in their URL.
• site: restricts the results to websites in a given domain.
• cache: shows the version of a web page that Google has
in its cache.
Some Features of Google (5)
• link: restricts the results to those web pages that
have links to the specified URL.
• related: lists web pages that are "similar" to a
specified web page.
• info: presents some information that Google has
about a particular web page.
Some Features of Google (6)
• There are actually three different Google phonebook
operators.
• Using phonebook: searches the entire Google
phonebook.
• Using rphonebook: searches residential listings only.
• Using bphonebook: searches business listings only.
Some Features of Google (7)
• If you begin a query with stocks: Google will treat
the rest of the query terms as stock ticker symbols,
and will link to a Yahoo Finance page showing stock
information for those symbols.
• If you begin a query with define: Google will display
definitions for the word or phrase that follows, if
definitions are available.