IST 511 Information Management: Information
and Technology
Link analysis, web graphs, search & crawling
Dr. C. Lee Giles
David Reese Professor, College of Information Sciences
and Technology
Professor of Computer Science and Engineering
Professor of Supply Chain and Information Systems
The Pennsylvania State University, University Park, PA,
USA
[email protected]
http://clgiles.ist.psu.edu
Special thanks to C. Manning, P. Raghavan, H. Schütze
Last Time
What is text
– Definitions
– Why important
– What is text processing
What is information retrieval?
– Definition of a document
– Text and documents as vectors
– Relevance (performance) – recall/precision
– Inverted index
– Similarity/ranking
– ACM SIGIR (list serve)
Impact on information science
– Foundations of SEARCH ENGINES
Today
Web as a graph
What is link analysis
– Definitions
– Why important
– How are links used – ranking
IR vs search engines
– How are search engines related to information retrieval?
– How is information gathered
Impact and importance of search engines
Impact on information science
Tomorrow
Topics used in IST
Machine learning
Probabilistic reasoning
Digital libraries
Others?
Theories in Information Sciences
Enumerate some of these theories in this course.
Issues:
– Unified theory?
– Domain of applicability
– Conflicts
Theories here are
– Mostly algorithmic
– Some quantitative
– Some qualitative
Quality of theories
– Occam’s razor
– Subsumption of other theories (IR vs search)
Theories of link analysis
– Mathematical - networks
A Typical Web Search Engine
(Diagram of components: Web, Crawler, Indexer, Index, Query Engine, Interface, Users)
Search engines
The history of the Web
Vannevar Bush – “As we may think” (1945)
The “MEMEX”: a photo-electrical-mechanical device
that stores documents and images and allows users to
create and follow links between them
The history of the Web
Tim Berners-Lee
1980 – CERN: Writes a notebook program,
“Enquire Within Upon Everything”,
that allows links to be made between
arbitrary nodes
1989 – CERN: Circulates the document
“Information management: a proposal”
1990 – CERN: The first Web browser, and the first
Web server-client communication
1994 : The creation of the WWW consortium (W3C)
The history of the Web
Hypertext 1991: Tim Berners-Lee’s paper on the WWW
was accepted only as a poster
Web Today
The Web consists of trillions of pages
It is considered one of the biggest information
revolutions in recent human history
– Comparable to the invention of the printing press
Web page
The Web graph
A graph G = (V, E) is defined by
– a set V of vertices (nodes)
– a set E of edges (links) = pairs of nodes
The Web page graph (directed)
– V is the set of static public pages
– E is the set of static hyperlinks
Many more graphs can be defined
– The host graph
– The co-citation graph
– etc
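To make the definition concrete, here is a minimal Python sketch of the Web page graph G = (V, E) as a directed adjacency list; the page URLs are hypothetical placeholders.

# A minimal sketch: the Web page graph as a directed adjacency list.
web_graph = {
    "http://a.example/index.html": ["http://b.example/page1.html", "http://c.example/"],
    "http://b.example/page1.html": ["http://c.example/"],
    "http://c.example/": ["http://a.example/index.html"],
}

# V = set of vertices (pages); E = set of edges (hyperlinks) = ordered pairs of pages
V = set(web_graph) | {dst for targets in web_graph.values() for dst in targets}
E = {(src, dst) for src, targets in web_graph.items() for dst in targets}
print(len(V), "pages,", len(E), "links")   # 3 pages, 4 links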
Which pages do we care for?
We want to avoid “dynamic” pages
– catalogs
– pages generated by queries
– pages generated by cgi-scripts (the nostradamus effect)
We are only interested in “static” web pages
The Static Public Web
Static
– not the result of cgi-bin scripts
– no “?” in the URL
– doesn’t change very often
– etc.
Public
– no password required
– no robots.txt exclusion
– no “noindex” meta tag
– etc.
These rules can still be fooled:
– “Dynamic pages” appear static
• browseable catalogs (hierarchy built from a DB)
– Spider traps – infinite URL descent
• www.x.com/home/home/home/…./home/home.html
– Spammer games
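As an illustration, a rough Python sketch of the URL-level heuristics above (the “no ‘?’ in the URL” rule and a crude spider-trap check); the repeat threshold is an assumption, and real crawlers combine many more signals.

from urllib.parse import urlparse

def looks_static(url, max_repeats=3):
    # Rule from the slide: "no '?' in the URL" (query-generated pages).
    if "?" in url:
        return False
    # Crude spider-trap check: the same path segment repeated many times,
    # e.g. /home/home/home/.../home.html (the threshold is an assumption).
    segments = [s for s in urlparse(url).path.split("/") if s]
    for seg in set(segments):
        if segments.count(seg) > max_repeats:
            return False
    return True

print(looks_static("http://www.x.com/catalog?item=42"))                      # False
print(looks_static("http://www.x.com/home/home/home/home/home/home.html"))   # False
print(looks_static("http://www.x.com/about.html"))                           # True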
Why do we care about the Web graph?
Is it the largest human artifact ever created?
Exploit the Web structure for
– crawlers
– search and link analysis ranking
– spam detection
– community discovery
– classification/organization
Predict the Web future
– mathematical models
– algorithm analysis
– sociological understanding
– new business opportunities
– new politics
Know what we don’t know
– Do search engines really know what’s out there?
What is the size of the Web?
Surprisingly hard to answer
Naïve solution: keep crawling until the whole graph has been
explored
This is extremely simple but wrong: crawling is
complicated because the web is complicated
– spamming
– duplicates
– mirrors
Simple example of a complication: Soft 404
– When a page does not exist, the server is supposed to return an
error code = “404”
– Many servers do not return an error code, but keep the visitor on
site, or simply send them to the home page
A sampling approach
Sample pages uniformly at random
Compute the percentage of the pages that belong to a
search engine repository (search engine coverage)
Estimate the size of the Web
Problems:
– how do you sample a page uniformly at random?
– how do you test if a page is indexed by a search
engine?
Measuring the Web
It is clear that the Web that we see is what the web
crawler discovers
We need large crawls in order to make meaningful
measurements
The measurements are still biased by
– the crawling policy
– size limitations of the crawl
– Perturbations of the "natural" process of birth and
death of nodes and links
Measures on the Web graph
[BKMRRSTW00]
Degree distributions
The global picture
– what does the Web look like from afar?
Reachability
Connected components
Community structure
The finer picture
In & out degree distribution
Power-law distributions with exponents 2.1 (in-degree) and 2.7 (out-degree)
The good news
The fact that the exponent is greater than 2 implies
that the expected value of the degree is a constant
(not growing with n)
Therefore, the expected number of edges is linear in
the number of nodes n
This is good news, since we cannot handle anything
more than linear
Connected components – definitions
Weakly connected components (WCC)
– Set of nodes such that from any node one can reach any other node via an
undirected path
Strongly connected components (SCC)
– Set of nodes such that from any node one can reach any other node via a
directed path.
(Diagram: an example weakly connected component and strongly connected component)
The bow-tie structure of the Web
The Size of the Web
Lawrence and Giles 1999 – 800 million
Over 11.5 billion in 2005 (Google indexed over 8 billion
at the time)
About 600 billion in 2010, approaching 1 trillion
Coverage – about 40% in 1999
Overlap - low
The deep (or hidden or invisible) web contains 400-550
times more information.
Estimate: (query “the” into Google)
How about “dynamic pages”?
Search Engines as Information Retrieval (IR)
Search engines are examples of an automated IR system
applied to the web
– Automated systems SCALE!
IR is an old field that goes back to the late 40’s and 50’s.
– Vannevar Bush
Web search is an extended aspect of the field of information
retrieval where the IR system is applied to the web.
Levels of search engines
0th - Library catalog
– Based on human created metadata
1st - Altavista
– First large comprehensive database
– Word based index and ranking
2nd - Google
– High relevance
– Link (connectivity) based importance
3rd - ?
Brief History of Search Engines
Working predecessors - not vaporware
– Archie (1990) first search engine for ftp files
– Veronica and Jughead (1991) first for gopher space
First gen SE’s - something to find
– Yahoo! (www.yahoo.com) - (1994-) directory service and search engine.
– Infoseek – (1994-2001) search engine.
Second gen SE’s - big and fast
– AltaVista – (1995-) search engine, acquired by Overture in 2003.
– Inktomi – (1995-) search engine infrastructure, acquired by Yahoo! in 2003.
– Overture – (1997-) pay-per-click search engine, acquired by Yahoo! in 2003.
– AlltheWeb – (1999-) search engine, acquired by Overture in 2003.
Third gen SE’s - ranking by importance (web links - graphs)
– Google (www.google.com) – (1998-) – search engine.
– Ask (www.ask.com) - (1996-) Originally Teoma. Combined with Ask Jeeves
Q&A and search engine; acquired by IAC/InterActiveCorp in 2005.
– MSN Search (now Bing) (www.msn.com) – (2004-) Microsoft Network’s
search engine.
Motivation for Link Analysis
Early search engines mainly compared the content similarity
of the query and the indexed pages, i.e.,
– they used information retrieval methods: cosine, TF-IDF, ...
From mid 90’s, it became clear that content similarity
alone was no longer sufficient.
– The number of pages grew rapidly in the mid-late 1990’s.
• Try the query “classification methods”; Google estimates millions of relevant pages.
• How to choose only 30-40 pages and rank them suitably to present to the
user?
– Content similarity is easily spammed.
• A page owner can repeat some words and add many related words to boost
the rankings of their pages and/or to make the pages relevant to a large
number of queries.
Early hyperlinks
Starting mid 90’s, researchers began to work on the
problem, resorting to hyperlinks.
– In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a
hyperlink based search patent. The method uses words in
anchor text of hyperlinks.
Web pages on the other hand are connected through
hyperlinks, which carry important information.
– Some hyperlinks: organize information at the same site.
– Other hyperlinks: point to pages from other Web sites. Such
out-going hyperlinks often indicate an implicit conveyance
of authority to the pages being pointed to.
Those pages that are pointed to by many other pages are
likely to contain authoritative information.
Hyperlink algorithms
During 1997-1998, the two most influential hyperlink-based search
algorithms, PageRank and HITS, were reported.
Both algorithms are related to social networks. They exploit the
hyperlinks of the Web to rank pages according to their levels of
“prestige” or “authority”.
– HITS: Jon Kleinberg (Cornell University), at the Ninth Annual ACM-SIAM
Symposium on Discrete Algorithms, January 1998
– PageRank: Sergey Brin and Larry Page, PhD students from Stanford
University, at Seventh International World Wide Web Conference (WWW7)
in April, 1998.
PageRank powers the Google search engine.
Impact of “Stanford University” in web search
– Google: Sergey Brin and Larry Page (PhD candidates in CS)
– Yahoo!: Jerry Yang and David Filo (PhD candidates in EE)
– HP, Sun, Cisco, …
Other uses
Apart from search ranking, hyperlinks are also useful for
finding Web communities.
– A Web community is a cluster of densely linked pages
representing a group of people with a special interest.
Beyond explicit hyperlinks on the Web, links in other
contexts are useful too, e.g.,
– for discovering communities of named entities (e.g., people
and organizations) in free text documents, and
– for analyzing social phenomena in emails.
The Web as a Directed Graph
(Diagram: a hyperlink with anchor text pointing from Page A to Page B)
Assumption 1: A hyperlink between pages denotes
author perceived relevance (quality signal)
Assumption 2: The anchor of the hyperlink
describes the target page (textual context)
Anchor Text Indexing
Extract anchor text (between <a> and </a>) of each link
followed.
Anchor text is usually descriptive of the document to which it
points.
Add anchor text to the content of the destination page to
provide additional relevant keyword indices.
Used by Google:
– <a href=“http://www.microsoft.com”>Evil Empire</a>
– <a href=“http://www.ibm.com”>IBM</a>
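A minimal Python sketch of this idea, using only the standard library: anchor text found on a source page is credited to the index entry of the destination page. The HTML snippet and the index structure are hypothetical.

from html.parser import HTMLParser
from collections import defaultdict

class AnchorExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.href = None
        self.text = []
        self.anchors = []          # list of (target_url, anchor_text)
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")
            self.text = []
    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)
    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            self.anchors.append((self.href, "".join(self.text).strip()))
            self.href = None

index = defaultdict(list)          # destination URL -> extra keyword terms

page_html = '<a href="http://www.ibm.com">Big Blue</a> announced record profits'
parser = AnchorExtractor()
parser.feed(page_html)
for target, anchor_text in parser.anchors:
    index[target].extend(anchor_text.lower().split())

print(dict(index))   # {'http://www.ibm.com': ['big', 'blue']}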
Anchor text
Indexing anchor text
When indexing a document D, include anchor text from links
pointing to D.
(Diagram: three different pages link to www.ibm.com with anchor text such as “Armonk, NY-based computer giant IBM announced today”, “Joe’s computer hardware links: Compaq, HP, IBM”, and “Big Blue today announced record profits for the quarter”)
Indexing anchor text
Can sometimes have unexpected side effects - e.g., french
military victories
Helps when descriptive text in destination page is embedded
in image logos rather than in accessible text.
Many times anchor text is not useful:
– “click here”
Increases content more for popular pages with many incoming links, increasing recall of these pages.
May even give higher weights to tokens from anchor text.
Concepts of Relevance
IR vs Web
Document measures
Relevance, as conventionally defined, is binary (relevant
or not relevant). It is usually estimated by the similarity
between the terms in the query and each document.
Importance measures documents by their likelihood of
being useful to a variety of users. It is usually estimated
by some measure of popularity.
Web search engines rank documents by combination of
relevance and importance. The goal is to present the user
with the most important of the relevant documents.
Ranking Options
1. Paid advertisers
2. Manually created classification
3. Vector space ranking with corrections for document length
4. Extra weighting for specific fields, e.g., title, anchors, etc.
5. Popularity or importance, e.g., PageRank
Not all these factors are made public.
History of link analysis
Bibliometrics
– Citation analysis since the 1960’s
– Citation links to and from documents
• Basis of pagerank idea
Bibliometrics
Techniques that use citation analysis to measure the
similarity of journal articles or their importance
Bibliographic coupling: two papers that cite many of the
same papers
Co-citation: two papers that were cited by many of the
same papers
Impact factor (of a journal): frequency with which the
average article in a journal has been cited in a particular
year or period
Citation frequency
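As a concrete illustration, a tiny Python sketch of the first two measures on a hypothetical citation graph (paper names are placeholders).

# cites[p] is the set of papers that paper p cites (toy example).
cites = {
    "p1": {"a", "b", "c"},
    "p2": {"b", "c", "d"},
    "p3": {"a"},
}

def bibliographic_coupling(x, y):
    # number of references the two papers share
    return len(cites[x] & cites[y])

def cocitation(x, y):
    # number of papers that cite both x and y
    return sum(1 for refs in cites.values() if x in refs and y in refs)

print(bibliographic_coupling("p1", "p2"))   # 2 (shared references: b, c)
print(cocitation("a", "b"))                 # 1 (only p1 cites both)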
Citation Graph
(Diagram: a paper cites earlier papers and is cited by later papers, illustrating bibliographic coupling and co-citation)
Note that journal citations nearly always refer to earlier work.
Graphical Analysis of Hyperlinks on
the Web
(Diagram: one page links out to many other pages — a hub; many pages link in to another page — an authority)
Bibliometrics: Citation Analysis
Many standard documents include bibliographies (or
references), explicit citations to other previously
published documents.
Using citations as links, standard corpora can be
viewed as a graph.
The structure of this graph, independent of content,
can provide interesting information about the
similarity of documents and the structure of
information.
Impact of paper!
Citations vs. Links
Web links are a bit different from citations:
– Many links are navigational.
– Many pages with high in-degree are portals, not content providers.
– Not all links are endorsements.
– Company websites don’t point to their competitors.
– Citations to relevant literature are enforced by peer review.
– in-degree vs out-degree
Many types of link analysis methods
– Global vs local; HITS vs pagerank
Authorities
Authorities are pages that are recognized as providing
significant, trustworthy, and useful information on a topic.
In-degree (number of pointers to a page) is one simple
measure of authority.
However in-degree treats all links as equal.
Should links from pages that are themselves authoritative
count more?
Hubs
Hubs are index pages that provide lots of useful links to
relevant content pages (topic authorities).
Ex: a course home page that links to relevant content pages
HITS
Algorithm developed by Kleinberg in 1998.
IBM search engine project
Attempts to computationally determine hubs and authorities
on a particular topic through analysis of a relevant
subgraph of the web.
Based on mutually recursive facts:
– Hubs point to lots of authorities.
– Authorities are pointed to by lots of hubs.
Hubs and Authorities
Together they tend to form a bipartite graph:
(Diagram: hubs on one side pointing to authorities on the other)
HITS Algorithm
Computes hubs and authorities for a particular topic specified
by a normal query.
– Thus query dependent ranking
First determines a set of relevant pages for the query called
the base set S.
– Base set must be determined beforehand
Analyze the link structure of the web subgraph defined by S
to find authority and hub pages in this set.
Kleinberg, IBM ‘99
Constructing a Base Subgraph
For a specific query Q, let the set of documents
returned by a standard search engine be called the
root set R.
Initialize S to R.
Add to S all pages pointed to by any page in R.
Add to S all pages that point to any page in R.
(Diagram: the root set R nested inside the larger base set S)
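A small Python sketch of the base-set construction just described, assuming toy in-link and out-link maps for the crawled subgraph (all page names are hypothetical).

# out_links maps each page to the pages it points to (toy example).
out_links = {"r1": ["p1", "p2"], "r2": ["p2"], "p3": ["r1"], "p4": ["r2", "p1"]}
in_links = {}
for src, targets in out_links.items():
    for dst in targets:
        in_links.setdefault(dst, []).append(src)

R = {"r1", "r2"}                        # root set returned by a standard engine
S = set(R)
for page in R:
    S.update(out_links.get(page, []))   # pages pointed to by pages in R
    S.update(in_links.get(page, []))    # pages that point to pages in R

print(sorted(S))   # ['p1', 'p2', 'p3', 'p4', 'r1', 'r2']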
Iterative Algorithm
Use an iterative algorithm to slowly converge on a
mutually reinforcing set of hubs and authorities.
Maintain for each page p ∈ S:
– Authority score: a_p (vector a)
– Hub score: h_p (vector h)
Initialize all a_p = h_p = 1
Maintain normalized scores:
\sum_{p \in S} a_p^2 = 1 \qquad \sum_{p \in S} h_p^2 = 1
Convergence
Algorithm converges to a fixed point if iterated
indefinitely.
Define A to be the adjacency matrix for the subgraph
defined by S.
– A_ij = 1 for i ∈ S, j ∈ S iff i → j
Authority vector, a, converges to the principal
eigenvector of A^T A
Hub vector, h, converges to the principal eigenvector
of A A^T
In practice, 20 iterations produce fairly stable
results.
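A compact Python/NumPy sketch of the HITS iteration on a toy adjacency matrix, using the standard mutually reinforcing updates (an authority sums the hub scores of its in-links, a hub sums the authority scores of its out-links) with L2 normalization each round. The adjacency matrix is illustrative only.

import numpy as np

A = np.array([[0, 1, 1, 0],      # A[i, j] = 1 iff page i links to page j
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

n = A.shape[0]
a = np.ones(n)                   # authority scores
h = np.ones(n)                   # hub scores

for _ in range(20):              # ~20 iterations is usually stable
    a = A.T @ h                  # authority: sum of hub scores of in-links
    h = A @ a                    # hub: sum of authority scores of out-links
    a /= np.linalg.norm(a)       # keep the sum of squares equal to 1
    h /= np.linalg.norm(h)

print("authorities:", np.round(a, 3))
print("hubs:       ", np.round(h, 3))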
HITS Results
An ambiguous query can result in the principal
eigenvector only covering one of the possible
meanings.
Non-principal eigenvectors may contain hubs &
authorities for other meanings.
Example: “jaguar”:
– Atari video game (principal eigenvector)
– NFL Football team (2nd non-princ. eigenvector)
– Automobile (3rd non-princ. eigenvector)
Reportedly used by Ask.com
Google Background
“Our main goal is to improve the quality of web search
engines”
“Google” comes from “googol” = 10^100
Originally part of the Stanford digital library project known
as WebBase, commercialized in 1999
Initial Design Goals
Deliver results that have very high precision even at
the expense of recall
Make search engine technology transparent, i.e.
advertising shouldn’t bias results
Bring search engine technology into academic realm
in order to support novel research activities on
large web data sets
Make system easy to use for most people, e.g. users
shouldn’t have to specify more than a couple
words
PageRank in Words
Intuition:
Imagine a web surfer doing a simple random walk on
the entire web for an infinite number of steps.
Occasionally, the surfer will get bored and instead of
following a link pointing outward from the current
page will jump to another random page.
At some point, the percentage of time spent at each
page will converge to a fixed value.
This value is known as the PageRank of the page.
PageRank
Link-analysis method used by Google (Brin & Page, 1998).
Does not attempt to capture the distinction between hubs and
authorities.
Ranks pages just by authority.
Query independent
Applied to the entire web rather than a local neighborhood of
pages surrounding the results of a query.
Initial PageRank Idea
Just measuring in-degree (citation count) doesn’t
account for the authority of the source of a link.
Initial page rank equation for page p:
R(p) = c \sum_{q: q \to p} \frac{R(q)}{N_q}
– Nq is the total number of out-links from page q.
– A page, q, “gives” an equal fraction of its authority to
all the pages it points to (e.g. p).
– c is a normalizing constant set so that the rank of all
pages always sums to 1.
Initial PageRank Idea (cont.)
Can view it as a process of PageRank “flowing” from pages
to the pages they cite.
(Diagram: example PageRank values such as .1, .09, .08, .05, .03 flowing along links between pages)
Initial Algorithm
Iterate rank-flowing process until convergence:
Let S be the total set of pages.
Initialize pS: R(p) = 1/|S|
Until ranks do not change (much) (convergence)
For each pS:
R( q )
R( p)  
q:q  p N q
c  1 /  R( p)
pS
For each pS: R(p) = cR´(p) (normalize)
Sample Stable Fixed Point
(Diagram: a small example graph whose PageRank values of 0.4 and 0.2 are unchanged by the rank-flow update)
Problem with Initial Idea
A group of pages that only point to themselves but are
pointed to by other pages act as a “rank sink” and absorb
all the rank in the system.
Rank flows into
cycle and can’t get out
Rank Source
Introduce a “rank source” E that continually replenishes the
rank of each page, p, by a fixed amount E(p).
R(p) = c \left( \sum_{q: q \to p} \frac{R(q)}{N_q} + E(p) \right)
PageRank Algorithm
Let S be the total set of pages.
Let pS: E(p) = /|S| (for some 0<<1, e.g. 0.15)
Initialize pS: R(p) = 1/|S|
Until ranks do not change (much) (convergence)
For each pS:
R( q )
R( p)  
 E ( p)
q:q  p N q
c  1 /  R( p)
pS
For each pS: R(p) = cR´(p) (normalize)
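A Python sketch of this iteration on a toy link graph, with E(p) = α/|S| and renormalization each round. The link structure and the value of α are illustrative only.

# out_links maps each page to the pages it links to (all pages here
# have at least one out-link, so dangling pages are not handled).
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
S = sorted(set(out_links) | {p for t in out_links.values() for p in t})

alpha = 0.15
E = {p: alpha / len(S) for p in S}           # rank source
R = {p: 1.0 / len(S) for p in S}             # initial ranks

for _ in range(50):                          # iterate until (roughly) stable
    R_new = {p: E[p] for p in S}
    for q, targets in out_links.items():
        share = R[q] / len(targets)          # q gives equal shares to its out-links
        for p in targets:
            R_new[p] += share
    c = 1.0 / sum(R_new.values())            # normalize so ranks sum to 1
    R = {p: c * r for p, r in R_new.items()}

for p in S:
    print(p, round(R[p], 3))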
Random Surfer Model
PageRank can be seen as modeling a “random surfer”
that starts on a random page and then at each point:
– With probability E(p) randomly jumps to page p.
– Otherwise, randomly follows a link on the current page.
R(p) models the probability that this random surfer
will be on page p at any given time.
“E jumps” are needed to prevent the random surfer
from getting “trapped” in web sinks with no
outgoing links.
Justifications for using PageRank
Attempts to model user behavior
Captures the notion that the more a page is pointed to by
“important” pages, the more it is worth looking at
Takes into account global structure of web
Speed of Convergence
Early experiments on Google used 322 million links.
PageRank algorithm converged (within small tolerance) in
about 52 iterations.
Number of iterations required for convergence is empirically
O(log n) (where n is the number of links).
Therefore calculation is quite efficient.
Google Ranking
Complete Google ranking includes (based on university
publications prior to commercialization):
– Vector-space similarity component.
– Keyword proximity component.
– HTML-tag weight component (e.g., title preference).
– PageRank component.
Details of current commercial ranking functions are trade
secrets.
– Pagerank becomes Googlerank!
Google PageRank-Biased Crawling
Use PageRank to direct (focus) a crawler on “important”
pages.
Compute page-rank using the current set of crawled pages.
Order the crawler’s search queue based on current estimated
PageRank.
Information Retrieval Using PageRank
Simple Method
Consider all hits (i.e., all document vectors that share at least
one term with the query vector) as equal.
Display the hits ranked by PageRank.
Final rank = R(p) x Similarity
The disadvantage of this method is that it takes little account of
how closely a document matches the query (all hits are treated as
equal).
Combining Term Weighting with
Reference Pattern Ranking
Combined Method
1. Find all documents that share a term with the query vector.
2. The similarity, using conventional term weighting, between the
query and document j is sj.
3. The rank of document j using PageRank or other reference
pattern ranking is pj.
4. Calculate a combined rank c_j = λ·s_j + (1 − λ)·p_j, where λ is a
constant.
5. Display the hits ranked by cj.
This method is used in several commercial systems, but the details
have not been published.
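A tiny Python sketch of the combined rank, with hypothetical similarity and PageRank scores and an assumed λ = 0.7; both score sets are assumed to be scaled to comparable ranges.

lam = 0.7                                                 # weighting constant (an assumption)
similarity = {"doc1": 0.80, "doc2": 0.55, "doc3": 0.60}   # term-weighting score s_j
pagerank   = {"doc1": 0.10, "doc2": 0.70, "doc3": 0.20}   # reference-pattern score p_j

combined = {d: lam * similarity[d] + (1 - lam) * pagerank[d] for d in similarity}
for doc, score in sorted(combined.items(), key=lambda kv: kv[1], reverse=True):
    print(doc, round(score, 3))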
Link Analysis Conclusions
Link analysis uses information about the structure of the web
graph to aid search.
It is one of the major innovations in web search.
It is the primary reason for Google’s success.
Still lots of research regarding improvements
Limits of Link Analysis
Stability
– Adding even a small number of nodes/edges to the
graph has a significant impact
Topic drift
– A top authority may be a hub of pages on a different
topic resulting in increased rank of the authority page
Content evolution
– Adding/removing links/content can affect the intuitive
authority rank of a page requiring recalculation of page
ranks
Ranking: query (in)dependence
Query independent ranking
– Important pages; no need for queries
– Trusted pages?
– Pagerank can do this
Query dependent ranking
– Combine importance with query evaluation
– HITS is query based.
Google Indexing the Web Goals:
Precision
Short queries applied to very large numbers of
items lead to large numbers of hits.
• Goal is that the first 10-100 hits presented should
satisfy the user's information need
-- requires ranking hits in an order that fits the user's
requirements
• Recall is not an important criterion
Completeness of index is not an important factor.
• Comprehensive crawling is unnecessary
Foundations of search engines
Internet
– What apps such as the WWW and email actually travel on
Web
– Link structure that is traversed by users: humans and bots
Protocols and web languages
– http, html: what controls access and generates content
Browsers
– Allows users to view and interact with a url and search engine
Crawlers (bots, spiders, harvesters)
– Harvests web content by following links
Indexers
– Builds a searchable structure of what a crawler finds
What is web search?
Revolution in information and knowledge access
– Access to “heterogeneous”, distributed information
• Heterogeneous in creation
• Heterogeneous in motives
• Heterogeneous in accuracy …
Multi-billion dollar business
Source of new opportunities in marketing and sales
Strains the boundaries of trademark and intellectual property
laws
A source of unending technical challenges
“Flattens” the world
www.internetworldstats.com/stats.htm
Impact of search engines
Unbelievable access to information
– Implications are only just being understood
– Democratization of humankind’s knowledge
The online world
– I “googled” him just to see …
– Search is a crucial part of many people’s everyday existence and the 2nd most
popular online activity after email.
– Social interactions - blogs
The death of anonymity/privacy
– Nearly everyone is searchable
• Choicepoint
• MySpace
• FaceBook
• LinkedIn
Digital divide
A Typical Web Search Engine
(Diagram of components: Web, Web bot/Crawler, Filter, Indexer, Index, Query Engine, Interface, Users — “automate all that’s possible”)
Crawler Agent/Algorithm
• Without web crawlers, there would be
no search engines
• A crawler or bot is a program that
traverses web pages, downloads them
for indexing and follows (or harvests)
the hyperlinks on the downloaded pages.
• A crawler will typically start from a
multitude of web pages and aims to
cover as much of the indexable web as
possible.
Crawler actions can be specified in advance:
The standard algorithm used is breadth-first search.
Focused crawlers use a best-first strategy.
“Ethical” crawlers obey robots.txt files placed on web
sites, which limit crawlable content.
Crawled pages are stored for possible indexing.
Crawlers require many resources:
– storage
– bandwidth
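For the robots.txt rule, a minimal Python sketch using the standard library’s urllib.robotparser; the site URL and the user-agent name are placeholders.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                                    # fetches and parses robots.txt

if rp.can_fetch("MyCrawlerBot", "http://www.example.com/some/page.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")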
Crawlers vs Browsers vs Scrapers
• Crawlers automatically harvest all files on the web
• Browsers are manual crawlers
• Scrapers automatically harvest the visual files for a
web site and are limited crawlers
• wget linux command - simple crawler
Crawling the web
(Diagram: seed pages feed the URL frontier; URLs are crawled and parsed, moving pages from the unseen Web into the crawled Web)
The next node to crawl
• Can include multiple pages from the
same host
• Must avoid trying to fetch them all at
the same time
• Must try to keep all crawling threads
busy
Web Crawler
• A crawler is a program that picks up a page and follows all
the links on that page
• Many names; all mean the same
• Crawler = Spider = Bot = Harvester
• Usual types of crawler:
– Breadth First
– Depth First
– Combinations of the above
Breadth First Crawlers
Use breadth-first search (BFS) algorithm
• Get all links from the starting page, and add them to a
queue
• Pick the 1st link from the queue, get all links on the page
and add to the queue
• Repeat the above step until the queue is empty
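A sketch of the breadth-first strategy in Python over an in-memory toy link graph; a real crawler would fetch pages over HTTP and parse out their links.

from collections import deque

links = {"seed": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": [], "d": ["a"]}

def bfs_crawl(start):
    visited = {start}
    queue = deque([start])            # FIFO queue drives breadth-first order
    order = []
    while queue:
        page = queue.popleft()        # pick the 1st link from the queue
        order.append(page)
        for url in links.get(page, []):
            if url not in visited:    # add unseen links to the queue
                visited.add(url)
                queue.append(url)
    return order

print(bfs_crawl("seed"))   # ['seed', 'a', 'b', 'c', 'd']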
Search Strategies BF
Breadth-first Search
Depth First Crawlers
Use depth first search (DFS) algorithm
• Get the 1st link not visited from the start page
• Visit the link and get its 1st non-visited link
• Repeat the above step until there are no non-visited links
• Go to the next non-visited link at the previous level and repeat the
2nd step
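The depth-first variant is the same sketch with a LIFO stack in place of the FIFO queue, so one chain of links is followed as deep as possible before backtracking; the toy link graph is again hypothetical.

links = {"seed": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": [], "d": ["a"]}

def dfs_crawl(start):
    visited = set()
    stack = [start]                   # LIFO stack drives depth-first order
    order = []
    while stack:
        page = stack.pop()
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        # push in reverse so the 1st non-visited link is explored first
        for url in reversed(links.get(page, [])):
            if url not in visited:
                stack.append(url)
    return order

print(dfs_crawl("seed"))   # ['seed', 'a', 'c', 'b', 'd']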
Search Strategies DF
Depth-first Search
Sorting and Ranking Hits
When a user submits a query to a
search system, the system returns a
set of hits. With a large collection of
documents, the set of hits may be very
large.
The value to the user depends on the
order in which the hits are presented.
Three main methods:
• Sorting the hits, e.g., by date
• Ranking the hits by similarity
between query and document
• Ranking the hits by the
importance of the documents
Number of search engine queries - US
About 500M per day; now over a billion a day
ComScore global share
Result evaluation
recall becomes useless
precision measured over top-10/20 results
Shift of interest from “relevance” to
“authoritativeness/reputation” or importance
ranking becomes critical
What’s up and coming?
More personal search
Social search
Mobile search
Specialty search
Freshness search
Next generation search?
Will anyone replace Google?
“Search as a problem is only 5% solved” – Udi Manber
(first at Yahoo, then Amazon, now Google)
What we covered
Search engines types and business models
Ranking by importance
– Link analysis
• Citations
• Pagerank (query independent; must be combined with a lexical
similarity)
• Hubs/authorities (query dependent)
Google
– Role in search
– Architecture
Covered the basics of modern search engines
Search engines are constantly evolving
Economic Impact of Google
Economic impact of Google
– In 2011, Google's search and advertising tools helped provide $80
billion of economic activity for 1.8 million businesses, website
publishers and non-profits across the U.S. (1/2 % of US GDP)
Economic value of better search?
Highest capitalization of any tech firm (more valuable than
Microsoft)
Are Google rankings fair?
Are any search engines rankings fair?
Try out certain queries
– “Jew” USA
– “Jew” Germany
– “tiananmen square”
• Google.com
• Google.cn
• Baidu.com
What is the role of web search in society
– Human flesh search engine
– What is the responsibility of search engines?
Social implications of public search
engines
What is the role of web search in society
– Do search engines make us dumber or smarter?
– Human flesh search engine
– What is the responsibility of search engines?
The googlization of everything
Should governments control search engines
– Positions
• Search is too important to be left to the private sector
• Search will only work from 8 to 5 on weekdays and will be off
on holidays.
– New Chinese government search engine
• Panguso
Google of the future
What next? GoogleFace or Faceoogle
What we covered
• Web as a graph
• Search engines vs classic IR
• Link analysis
• Crawlers
Crawlers
• Foundational species; search engines would
not exist without them
• Good and bad crawlers out there; run 24/7
• Must constantly evolve to find new sources
and content
Basic assumptions of
Web Information Retrieval
(Search engines)
Corpus: constantly changing; created by amateurs and
professionals
Goal: Retrieve summaries of relevant information quickly
with links to the original site
– High precision! Recall not important
– Crawling important
Searcher: amateurs; no professional training and little or no
concern about query quality
The coarse-level dynamics
(Diagram: content creators, content aggregators, and content consumers connected by subscription, feeds, crawls, transaction, advertisement, and editorial relationships)
Propositions
• Web is possibly the largest structure built by
humanity
• Search engines have truly changed information
and knowledge acquisition
– Search engines make the web scale
• Nothing would be found
– Democratization of knowledge
• Everyone can play
– Loss of privacy
– Making us dumber or smarter
Questions
• Impact of link analysis on information science
• Impact of search engines on information
science
• What next?