Recent Results in Automatic Web Resource Discovery

Soumen Chakrabarti
Presentation by Cui Tao
7/17/2015

Introduction

Classical IR:
  Indexing a collection of documents
  Answering queries by returning a ranked list of relevant documents
Problems in retrieving online documents:
  Ambiguity
  Context sensitivity
  Synonymy
  Polysemy
  Large number of relevant Web pages

Introduction
Directory-based topic browsing:
  Tree-like structure
  Mostly maintained by human experts
  Advantages: exemplary, influential
  Disadvantages: slow, subjective and noisy

Introduction

Standard crawlers and search engines:
  1997: covered 35-40% of 340 million Web pages
  1999: covered 18% of 800 million Web pages
  Cannot be used for maintaining generic portals and automatic resource discovery

Introduction

Focused crawler:
  Selectively seeks out pages that are relevant to a pre-defined set of topics
  Preferred by experts and researchers
  Two modules:
    Classifier: analyzes the text in, and the links around, a given Web page and automatically assigns it to suitable directories in a Web catalog
    Distiller: identifies the centrality of crawled pages to determine visit priorities

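The crawl loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy in-memory Web `TOY_WEB`, the keyword-based `relevance` function and the name `focused_crawl` are all hypothetical, a trained classifier would replace the keyword count, and the distiller is reduced to using the parent page's relevance as the visit priority.

```python
import heapq

# Hypothetical toy Web: url -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("cycling bike routes", ["a", "b"]),
    "a": ("bike repair and cycling gear", ["c"]),
    "b": ("stock market news", ["d"]),
    "c": ("mountain bike trails", []),
    "d": ("bond yields and stock prices", []),
}

# Assumption: a fixed keyword set stands in for a trained topic classifier.
TOPIC_TERMS = {"cycling", "bike"}


def relevance(text):
    """Stand-in classifier: fraction of words that are topic terms."""
    words = text.split()
    return sum(w in TOPIC_TERMS for w in words) / len(words) if words else 0.0


def focused_crawl(seed, threshold=0.2, max_pages=10):
    """Visit pages in order of the relevance of the page that linked to them."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = TOY_WEB[url]
        score = relevance(text)
        visited.append((url, score))
        if score < threshold:
            continue  # off-topic page: do not follow its links
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited


crawled = [url for url, _ in focused_crawl("seed")]
```

In this toy run the off-topic page "b" is fetched but not expanded, so the page "d" behind it is never visited.
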
Distillation techniques

Google:
  Simulates a random walk on the Web
  Pages are ranked by pre-computed popularity and visitation rate
  Fast

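The random-walk ranking above is commonly described as PageRank. A minimal power-iteration sketch follows; the function name `pagerank`, the damping factor of 0.85, the fixed iteration count and the three-page graph are assumptions made for illustration, not details from the talk.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate the visitation rate of a random surfer on a link graph.

    links maps each page to the list of pages it links to.
    """
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new_rank[v] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                for v in nodes:
                    new_rank[v] += damping * rank[u] / n
        rank = new_rank
    return rank


ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

Because the scores depend only on the link graph, they can be computed once ahead of time, which is why query-time ranking is fast.
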
Distillation techniques

HITS (Hyperlink Induced Topic Search):
  Depends on a search engine
  Combines two scores:
    Authorities: pages with useful information about a topic
    Hubs: pages that contain many links to pages with useful information on the topic
  Query dependent and slow
  May lead to topic contamination or drift

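The mutual reinforcement between hubs and authorities can be sketched as an iteration over the link graph. The function name `hits`, the fixed iteration count and the example graph are illustrative assumptions; in practice the graph would be built from the results a search engine returns for the query.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) scores for a graph given as page -> out-links."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # A page's authority score is the sum of the hub scores pointing to it.
        auth = {u: 0.0 for u in nodes}
        for u, outs in links.items():
            for v in outs:
                auth[v] += hub[u]
        # A page's hub score is the sum of the authority scores it points to.
        hub = {u: sum(auth[v] for v in links.get(u, [])) for u in nodes}
        # Normalize both vectors to unit length so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for u in scores:
                scores[u] /= norm
    return hub, auth


hub_scores, auth_scores = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
```
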
Distillation techniques

ARC and CLEVER:
  ARC (Automatic Resource Compiler): part of CLEVER
  Root set is expanded by 2 links instead of 1 link (including all pages that are at link-distance two or less from at least one page in the root set)
  Weights are assigned to hyperlinks based on the match between the query and the text surrounding the hyperlink in the source document

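The two ideas above, a radius-2 expansion of the root set and query-sensitive link weights, might be sketched as follows. The names `expand_root_set` and `anchor_weight` are hypothetical, links are followed in both directions as an assumption, and the weight "1 plus the number of query terms near the link" is one simple choice used here for illustration rather than a statement of the exact formula in ARC.

```python
from collections import deque


def expand_root_set(links, root_set, radius=2):
    """Collect all pages within `radius` links of any root page.

    Assumption: links are followed in both directions.
    """
    neighbors = {}
    for u, outs in links.items():
        for v in outs:
            neighbors.setdefault(u, set()).add(v)
            neighbors.setdefault(v, set()).add(u)
    dist = {u: 0 for u in root_set}
    queue = deque(root_set)
    while queue:  # breadth-first search, cut off at the given radius
        u = queue.popleft()
        if dist[u] == radius:
            continue
        for v in neighbors.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def anchor_weight(query, anchor_window):
    """Illustrative weight: 1 plus the count of query terms near the link."""
    terms = set(query.lower().split())
    return 1 + sum(word in terms for word in anchor_window.lower().split())


graph = {"r": ["x"], "x": ["y"], "y": ["z"], "w": ["r"]}
base_set = expand_root_set(graph, ["r"])
weight = anchor_weight("power supplies", "cheap power supplies here")
```

Here "z" is three links away from the root page "r" and is therefore left out of the expanded set.
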
Distillation techniques

Outlier filtering:
  Computes relevance weights for pages using the Vector Space Model
  All pages whose weights are below a threshold are pruned
  Effectively prunes away outlier nodes in the neighborhood, thus avoiding contamination

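A minimal sketch of this pruning step, assuming raw term-frequency vectors and cosine similarity against a short topic description. The names `tf_vector`, `cosine` and `prune_outliers`, the threshold of 0.2 and the sample pages are illustrative assumptions; a real system would likely use TF-IDF weights and a tuned threshold.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector of a text (a very simple Vector Space Model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_outliers(pages, topic_text, threshold=0.2):
    """Keep only pages whose relevance weight reaches the threshold."""
    topic = tf_vector(topic_text)
    return {
        url: text
        for url, text in pages.items()
        if cosine(tf_vector(text), topic) >= threshold
    }


pages = {"p1": "bike cycling routes", "p2": "stock market news"}
kept = prune_outliers(pages, "cycling bike")
```
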
Topic distillation vs. Resource discovery

Topic distillation:
  Depends on large, comprehensive Web crawls and indices (post-processing)
  Can it be used to generate a Web taxonomy?
    Set a keyword query for each node in the taxonomy
    Run a distillation program
    Simple, but has some problems

Topic distillation vs. Resource discovery

Problems:
  Constructing the query involves trial, error and complicated thought
    Query: “North American telecommunication companies”
    Query: +"power suppl*" "switch* mode" smps multiprocessor* "uninterrupt* power suppl*" ups -parcel
    The Yahoo! node /Business&Economy/Companies/Electronics/PowerSupplies
  To match the directory-based browsing quality of:
    Yahoo!: 7.03 terms and 4.34 operators
    Alta Vista: 2.35 terms and 0.41 operators

Topic distillation vs. Resource discovery

Problems:
  Contamination
    Stop-sites: not automatic
    Term weighting
    Edge weighting: no precise algorithm to set the weights
  Topic distillation by itself is not enough for resource discovery

Hypertext classification: learning from examples

Adding example pages and their distance-1 neighbors to the graph to be distilled improves the result
The contents of the given examples and their neighbors provide a way to compute the decision boundary of the classifier
NN, Bayesian and support vector classifiers

Hypertext classification

Link-based features: important
  Circular topic influence
    The topic of a page influences its text and the topics of its neighboring pages
    Knowledge of the linked vicinity’s topics provides clues to the test document’s topic
    Bibliometric; more general than the simple linear endorsement model used in topic distillation

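One way to picture link-based features is to add the known topics of neighboring pages as extra tokens next to a page's own words, and to train an ordinary text classifier on the combined tokens. The sketch below uses a small multinomial naive Bayes classifier with add-one smoothing. The names `featurize` and `NaiveBayes`, the `NBR:` token prefix and the training examples are hypothetical, and this is not claimed to be the classifier used in the work presented.

```python
import math
from collections import Counter, defaultdict


def featurize(text, neighbor_topics):
    """Own words plus one 'NBR:<topic>' token per known neighbor topic."""
    return text.lower().split() + ["NBR:" + t for t in neighbor_topics]


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        for tokens, label in examples:
            self.class_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.class_counts[label] / total)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


model = NaiveBayes()
model.train([
    (featurize("bike trails", ["cycling"]), "cycling"),
    (featurize("cycling gear", ["cycling"]), "cycling"),
    (featurize("stock news", ["finance"]), "finance"),
    (featurize("bond prices", ["finance"]), "finance"),
])
# The page text is uninformative; only the neighbor topic gives a clue.
guess = model.predict(featurize("home page", ["cycling"]))
```

In the usage example the test page's own words were never seen in training, so the prediction rests entirely on the topic of its linked neighbor.
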
Putting it together for resource discovery

Conclusion



Emphasized the importance of scalable automatic resource discovery
Argued that common search engines are not adequate for resource discovery
Introduced the recently invented focused crawling system

Future Work

How can the training examples be derived automatically?
How can the output of a focused crawler be personalized for users?