CSM06 Information Retrieval

Lecture 4: Web IR part 1
Dr Andrew Salway
Lecture 4: OVERVIEW
• Previously we looked at IR techniques that
indexed a document based on the words that
occur in the document
• Some of these techniques are applied in web
search engines (but VSM may not be
appropriate). However, web IR can also exploit
a distinctive feature of information on the web –
hypertext link structure
• Use of anchor text for indexing web
pages
• The PageRank algorithm based on link
structure analysis
• Other techniques for ranking web pages
Challenges for IR on the Web
• High volume of information
• Heterogeneous information
(multimedia and multilingual)
• Diverse users - hence diverse
information needs, and many
inexperienced users
• Average query length 2-5 words
• Poorly structured and low quality
information
Scale
• Projection of worldwide Internet population in
2005 = 1.07 billion users,
www.clickz.com/stats/web_worldwide/
• Early in 2005 Google claimed to index over 8
billion web pages; Yahoo recently claimed 19
billion; Google now claims to index 3 times
more than its nearest competitor
http://select.nytimes.com/gst/abstract.html?res=F30610F93E540C748EDDA00894DD404482
• Given the low overlap in search engine results
for a given query, it is likely that the total
number of webpages is much greater than that
indexed by any single web search engine
Requirements of Web Search
Engine Users?
• Fast response time
• Some relevant results in first page;
maybe less concern with getting all
relevant results
• Good coverage of web, at least of
‘important sites’
• Up-to-date links
• Simple and intuitive to use – making
queries and understanding results
NB. Some of these requirements contrast with
those of expert researchers using specialist
information retrieval systems
User Goals (Information Needs)
• Queries are used to express a user’s
goal (or information need), but note
that the same query might be used
for quite different goals
(Rose and Levinson 2004)
User Goals: Rose and Levinson’s
classification (2004)
1. Navigational – wanting a specific known
website
2. Informational – “my goal is to learn
something by reading or viewing web
pages” – e.g. closed and open-ended
questions, advice
3. Resource – “my goal is to obtain a
resource (not information) available on
web pages” – e.g. download music,
interact with online shopping service
NOTE: prior to the web, most IR was concerned
only with Informational queries
User Goals: Rose and Levinson’s
classification (2004)
• The more a search engine understands
about a user’s goal, the better the results it
can provide
• User goals may be deduced not only from
the query, but also from
• The results returned by the search
engine
• Results clicked on by the user
• Further searches / actions by the user
Opportunity…
• Web search engines can
exploit the fact that
information on the web is in
the form of hypertext…
Hypertext
• The web is, in some senses
at least, hypertextual, i.e. it
can be viewed as networks
of nodes (e.g. pages) and
links (between pages)
Hypertext
• Links suggest relatedness of topic,
and perhaps also a
recommendation
• Topological information about
the hypertext graph gained by
link structure analysis can be
exploited for ranking
Use of Anchor Text
(Brin and Page 1998)
• Words in the anchor text can be used to
index the webpage being linked to – the text
in an anchor may give a good description of
the page it points to, e.g.
<a href="www.bio.com/beckhambio.html">A
Biography of David Beckham</a>
• The words in the anchor text might be a
better indicator of what the webpage is
about than the words in the webpage
• Anchor text is also good for resources like
images that cannot be analysed as
keywords
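The idea of indexing a page by the anchor text of links that point to it can be sketched as below. This is only an illustration using Python's standard `html.parser`; the class `AnchorCollector` and the function `index_by_anchor_text` are hypothetical names, and the one-page input simply reuses the Beckham example above. It is not a description of how any particular search engine builds its index.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def index_by_anchor_text(pages):
    """Map each lower-cased anchor word to the set of URLs it was used to describe."""
    index = defaultdict(set)
    for html in pages:
        collector = AnchorCollector()
        collector.feed(html)
        collector.close()
        for href, text in collector.links:
            for word in text.lower().split():
                index[word].add(href)   # the word indexes the TARGET page
    return index


# Assumed input: a single page containing the example link from the slide.
pages = ['<p><a href="www.bio.com/beckhambio.html">A Biography of David Beckham</a></p>']
index = index_by_anchor_text(pages)
print(sorted(index["beckham"]))
```

Note that the target page is indexed under "beckham" without its own content ever being read, which is why the same approach also works for images and other non-text resources.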
PageRank (Brin and Page 1998)
• “Google makes use of both link structure
and anchor text”
• “The citation (link) graph of the web is an
important resource that has largely gone
unused in existing web search engines”
• PageRank is “an objective measure of [a web
page’s] citation importance that corresponds
well with people’s subjective idea of
importance”
Calculating PageRank
PR(A) = (1-d) + d*(PR(T1)/C(T1) + … +
PR(Tn)/C(Tn))
PR(A) = PageRank of webpage A
C(A) = the number of links out of webpage A
T1…Tn = the webpages that point to webpage A
d = a damping factor set between 0 and 1
In reality, the calculation of PageRank is iterative
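The calculation is iterative because each page's PageRank depends on the PageRank of the pages that link to it. A minimal sketch of that iteration, applying the formula above directly, might look as follows. The function name `pagerank`, the three-page graph, the starting value of 1.0 and the fixed number of iterations are all assumptions made for illustration; d = 0.85 is the value Brin and Page (1998) say they usually use. The sketch assumes every page has at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively apply PR(A) = (1-d) + d*(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)).

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}          # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T)/C(T) over every page T that points to this page.
            incoming = sum(
                pr[t] / len(links[t])
                for t in pages
                if page in links.get(t, ())
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this example C ends up ranked highest because it receives links from both other pages, and A benefits from being the only page C links to.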
Web-adjacency Analysis (a similar
idea to PageRank)
• Kleinberg and colleagues proposed a
method for identifying authoritative webpages
– Identify set of relevant pages (as normal)
– Identify those with a large in-degree, i.e. lots
of pages point to them (cf. ‘impact’)
– Ensure that the authorities selected are
referred to by a number of the same hubs,
i.e. those with a large out-degree
Web-adjacency Analysis
• “Hubs and authorities exhibit what could be
called a mutually reinforcing relationship”
(Kleinberg 1998)
• Computing authority and hub values for
web-pages is an iterative process over a
graph, where each node is a web-page
– Two weights are given to each node relating
to in-degree and out-degree: total in-degree
weights and total out-degree weights are kept
constant
– Weights are modified each iteration
depending on weights of connected nodes
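The mutually reinforcing update can be sketched as below. This is an illustrative version only: the function name `hits` and the small graph are hypothetical, and the normalisation step (scaling so that the squared weights sum to 1, as in Kleinberg's paper) is one way of keeping the totals constant between iterations.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) weights for a graph given as page -> linked pages."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority weight is the sum of the hub weights of pages linking to it.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        # A page's hub weight is the sum of the authority weights of pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Normalise both sets of weights so their squares sum to 1.
        for weights in (auth, hub):
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return hub, auth


# Hypothetical graph: three hub-like pages pointing at two candidate authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a2"]}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Here a2 receives the largest authority weight because all three hubs point to it, while h1 and h2 score higher as hubs than h3 because they point to both authorities.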
Some other Factors used to rank
Web Pages (Hock 2001)
• Popularity of the Page: measured
either by how many other web-pages
link to it, or by how many people
have clicked on it when they had the
same query
• Frequency of search terms: need
to consider length of the document,
and web-page authors’ attempts to
affect ranking by deliberate repetition
• Number of query terms matched:
but remember many queries are only
one or two words
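Two of these factors can be illustrated with a short sketch. The function names and the `cap` parameter are hypothetical: dividing by document length is one simple way to account for length, and capping each term's contribution is one simple way to blunt deliberate repetition. Hock (2001) does not prescribe either.

```python
def query_terms_matched(query, document):
    """Count how many distinct query terms occur in the document."""
    words = set(document.lower().split())
    return sum(1 for term in set(query.lower().split()) if term in words)


def frequency_score(query, document, cap=0.1):
    """Sum each query term's relative frequency, limited to `cap` per term."""
    words = document.lower().split()
    if not words:
        return 0.0
    score = 0.0
    for term in set(query.lower().split()):
        relative = words.count(term) / len(words)   # normalise by document length
        score += min(relative, cap)                 # repetition beyond the cap earns nothing
    return score


# Hypothetical documents: one stuffed with repeated terms, one ordinary.
stuffed = "cheap flights " * 20
ordinary = "compare cheap flights to rome and book online today now"
print(frequency_score("cheap flights", stuffed), frequency_score("cheap flights", ordinary))
print(query_terms_matched("cheap flights", ordinary))
```

With the cap in place the stuffed page scores no higher than the ordinary one, even though it repeats the query terms twenty times.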
Other Factors (continued…)
• Rarity of terms: rank pages
containing rare search terms more
highly (cf. TFIDF)
• Weighting by Field: give high
ranking to pages including search
terms in important fields, e.g. Title
• Proximity of Terms: rank pages
more highly if search terms occur
near one another
• Order of Query Terms: give priority
to pages containing the search term
entered first
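As a rough illustration of how rarity, field weighting and proximity might be combined into one score, consider the sketch below. Everything in it is an assumption made for the example: the function names, the `title_boost` value, the smoothed inverse document frequency formula, and the way the parts are added together. Order of query terms is not modelled.

```python
import math


def idf(term, documents):
    """Rarity: terms found in fewer documents receive a higher weight."""
    containing = sum(1 for doc in documents if term in doc["body"].lower().split())
    return math.log((1 + len(documents)) / (1 + containing)) + 1.0


def min_distance(words, a, b):
    """Smallest gap in word positions between terms a and b, or None if either is absent."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


def score(query, doc, documents, title_boost=2.0):
    terms = query.lower().split()
    body = doc["body"].lower().split()
    title = doc["title"].lower().split()
    total = 0.0
    for term in terms:
        if term in body or term in title:
            weight = idf(term, documents)        # rarity of the term
            if term in title:
                weight *= title_boost            # weighting by field
            total += weight
    for a, b in zip(terms, terms[1:]):
        gap = min_distance(body, a, b)
        if gap:                                  # skips None and a zero gap
            total += 1.0 / gap                   # proximity: closer terms add more
    return total


# Hypothetical three-document collection.
documents = [
    {"title": "David Beckham biography", "body": "a biography of david beckham the footballer"},
    {"title": "Football news", "body": "david scored while beckham watched and a biography was mentioned"},
    {"title": "Recipes", "body": "a recipe for soup"},
]
scores = [score("david beckham", doc, documents) for doc in documents]
print(scores)
```

The first document scores highest because both query terms appear in its title and sit next to each other in its body.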
Set Reading for Lecture 4
• Brin and Page (1998), “The Anatomy of
a Large-Scale Hypertextual Web Search
Engine”. SECTIONS 1 and 2. Explains
Google’s use of anchor text and
PageRank.
www-db.stanford.edu/~backrub/google.html
• Hock (2001), The extreme searcher's
guide to web search engines, pages 25-31.
Gives an overview of some factors
used by web search engines to rank
webpages. AVAILABLE in Main Library
collection and in Library Article
Collection.
Exercise
• Explore the idea of PageRank using an online
PageRank calculator, e.g.
www.markhorrell.com/seo/pagerank.shtml
OR
www.webworkshop.net/pagerank_calculator.php3
Further Reading
Rose and Levinson (2004), “Understanding User Goals in Web Search”, 13th
International WWW Conference, 2004.
www.sims.berkeley.edu/courses/is141/f05/readings/rose_www04.pdf
Page, Brin, Motwani and Winograd (1999), “The PageRank Citation Ranking:
Bringing Order to the Web.” http://dbpubs.stanford.edu:8090/pub/1999-66
Belew (2000), Finding Out About, pages 195-199 for an overview of Kleinberg’s
work on web-adjacency analysis and authorities and hubs.
Kleinberg (1998), ‘Authoritative Sources in a Hyperlinked Environment’, Journal
of the ACM. http://citeseer.nj.nec.com/87928.html
Kobayashi and Takeda (2000), “Information Retrieval on the Web”, ACM
Computing Surveys 32(2), pp. 144-173. AVAILABLE IN LIBRARY /
ARTICLE COLLECTION. **This comprehensive article reviews a lot of the
ideas covered so far in this module and discusses them in the context of
Web IR. NOTE, it is already a little out of date in places because of the
rapid changes of the Web.
Lecture 4: LEARNING OUTCOMES
After this lecture you should be able to:
• Explain how the challenges of web IR are
different from those facing the developers
of traditional IR systems
• Explain how web search engines can
exploit the hypertext structure of the web
to index and rank web pages, e.g. using
Anchor Text, and PageRank
• Explain how PageRank is calculated
• Discuss and critique a range of factors
used by web search engines to rank web
pages
Reading ahead for LECTURE 5
If you want to read about next week’s lecture topics,
see:
Dean and Henzinger (1999), ‘Finding Related
Pages in the World Wide Web’. Pages 1-10.
http://citeseer.ist.psu.edu/dean99finding.html
Agichtein, Lawrence and Gravano (2001), ‘Learning
Search Engine Specific Query Transformations
for Question Answering’, Procs. 10th International
WWW Conference. **Section 1 and Section 3**
www.cs.columbia.edu/~eugene/papers/www10.pdf
Oppenheim, Morris and McKnight (2000), ‘The
Evaluation of WWW Search Engines’, Journal of
Documentation, 56(2). Pages 194-205. In
Library Article Collection.