Recent Results in Automatic Web Resource Discovery
Soumen Chakrabarti
Presentation by Cui Tao
7/17/2015
Introduction
Classical IR:
Indexing a collection of documents
Answering queries by returning a ranked list of
relevant documents
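The two classical IR steps above (index, then rank) can be illustrated with a minimal sketch. It assumes whitespace tokenization and a plain TF-IDF score; the documents, the function names `build_index` and `search`, and the scoring formula are illustrative choices, not taken from the talk.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to {doc_id: term frequency} (an inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def search(index, n_docs, query):
    """Return doc ids ranked by summed TF-IDF over the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term occurs in no document
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical three-document collection.
docs = {
    "d1": "web crawler fetches web pages",
    "d2": "search engine ranks pages",
    "d3": "focused crawler seeks relevant pages",
}
index = build_index(docs)
ranking = search(index, len(docs), "web crawler")
```

Here `ranking` lists only documents that contain at least one query term, with `d1` ahead of `d3` because it matches both terms.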
Problems in retrieving online documents:
Ambiguity
Context sensitivity
Synonymy
Polysemy
Large amount of relevant Web pages
Introduction
Directory-based topic browsing:
Tree-like structure
Mostly maintained by human experts
Advantages: exemplary, influential
Disadvantages: slow, subjective and noisy
Introduction
Standard crawlers and search engines:
1997: covered 35-40% of 340 million Web
pages
1999: covered 18% of 800 million Web
pages
Cannot be used for maintaining generic
portals and automatic resource discovery
Introduction
Focused crawler:
Can selectively seek out pages that are
relevant to a predefined set of topics
Preferred by experts and researchers
Two modules:
Classifier: analyzes the text in and links around a
given web page and automatically assigns it to
suitable directories in a web catalog
Distiller: identifies the centrality of crawled pages
to determine visit priorities
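A rough sketch of how a classifier can steer the crawl frontier is given below. It is only an illustration under loud assumptions: the toy `WEB` dictionary stands in for real fetching, `relevance` is a keyword-overlap stand-in for a trained classifier, and the distiller is not modeled (a link's priority is simply the relevance of the page that cites it).

```python
import heapq

# Hypothetical toy web: url -> (page text, outgoing links).
WEB = {
    "seed": ("cycling routes and bike gear", ["bikes", "news"]),
    "bikes": ("mountain bike reviews cycling tips", ["trails"]),
    "news": ("stock market headlines", ["finance"]),
    "trails": ("cycling trails for mountain bike riders", []),
    "finance": ("bond yields and stock prices", []),
}
TOPIC_TERMS = {"cycling", "bike", "mountain", "trails"}

def relevance(text):
    """Stand-in classifier: fraction of words that are topic terms."""
    words = text.split()
    return sum(w in TOPIC_TERMS for w in words) / len(words)

def focused_crawl(seed, budget):
    """Visit up to `budget` pages, following the most promising links first."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        text, links = WEB[url]
        visited.append(url)
        score = relevance(text)
        for link in links:
            if link not in seen:
                seen.add(link)
                # Assumption: a link inherits the relevance of its source page.
                heapq.heappush(frontier, (-score, link))
    return visited

order = focused_crawl("seed", budget=3)
```

With a budget of three pages the crawl stays on the cycling pages and never reaches the finance branch.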
Distillation techniques
Google:
Simulates a random walk on the Web
Pages are ranked by precomputed popularity and
visitation rate
Fast at query time
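The random-walk idea can be sketched as a PageRank-style power iteration. The damping factor 0.85, the iteration count, the handling of dangling pages, and the example graph are all assumptions made for illustration; the slide gives no parameters.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate the stationary visitation rate of a random surfer.

    `links` maps every page to its list of out-links; every link target
    must also appear as a key.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            # A page without out-links is treated as linking to every page.
            targets = links[u] or nodes
            share = damping * rank[u] / len(targets)
            for v in targets:
                new_rank[v] += share
        rank = new_rank
    return rank

# Hypothetical four-page graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

Because the scores are computed once, ahead of any query, looking them up at query time is cheap.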
Distillation techniques
HITS (Hyperlink-Induced Topic Search):
Depends on a search engine
Combines two scores:
Authorities: identify pages with useful information
about a topic
Hubs: identify pages that contain many links to
pages with useful information on the topic
Query-dependent and slow
May lead to topic contamination or drift
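The mutual reinforcement between hubs and authorities can be sketched as the following iteration. The graph, the fixed iteration count, and the L2 normalization are illustrative assumptions; a real run would start from the neighborhood of a search engine's result set.

```python
import math

def hits(links, iterations=50):
    """Return (authority, hub) score dicts for a small link graph."""
    nodes = set(links) | {v for targets in links.values() for v in targets}
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {u: 0.0 for u in nodes}
        for u, targets in links.items():
            for v in targets:
                auth[v] += hub[u]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {u: sum(auth[v] for v in links.get(u, [])) for u in nodes}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for u in scores:
                scores[u] /= norm
    return auth, hub

# Hypothetical graph: three hub-like pages pointing at three candidates.
graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2", "a3"], "h3": ["a1"]}
auth, hub = hits(graph)
```

In this graph `a1` ends up as the top authority and `h2` as the top hub.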
Distillation techniques
ARC and CLEVER:
ARC (Automatic Resource Compiler): part of
CLEVER
Root set is expanded by 2 links instead of 1 link
(including all pages at link distance two or
less from at least one page in the root set)
Assigns weights to the hyperlinks based on the match
between the query and the text surrounding the
hyperlink in the source document
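One simple way to turn the anchor-text match into a link weight is sketched below: the weight is 1 plus the number of distinct query terms found in the text window around the link, and the weight scales the hub score passed to the target. The window contents, the function names, and this exact formula are assumptions for illustration.

```python
def edge_weight(query, anchor_window):
    """1 plus the number of distinct query terms in the anchor window."""
    query_terms = set(query.lower().split())
    window_terms = set(anchor_window.lower().split())
    return 1 + len(query_terms & window_terms)

def weighted_authority(edges, hub, query):
    """One authority update in which each link is scaled by its weight."""
    auth = {}
    for source, target, window in edges:
        weight = edge_weight(query, window)
        auth[target] = auth.get(target, 0.0) + hub[source] * weight
    return auth

# Hypothetical links: (source page, target page, text around the link).
edges = [
    ("p1", "q1", "best mountain bike reviews here"),
    ("p1", "q2", "our sponsors"),
    ("p2", "q1", "mountain trails"),
]
auth = weighted_authority(edges, {"p1": 1.0, "p2": 1.0}, "mountain bike")
```

Links whose surrounding text says nothing about the query still count, but only with the base weight of 1.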
Distillation techniques
Outlier filtering:
Computes relevance weights for pages using the
Vector Space Model
All pages whose weights are below a
threshold are pruned
Effectively prunes away outlier nodes in the
neighborhood, thus avoiding contamination
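The pruning step can be sketched with raw term-count vectors and cosine similarity. The topic description, the example pages, and the threshold of 0.3 are assumptions chosen for illustration; the slide does not say how the reference vector or the threshold are chosen.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def prune_outliers(pages, topic, threshold):
    """Keep only the pages whose relevance weight reaches the threshold."""
    topic_vec = Counter(topic.lower().split())
    return {
        url
        for url, text in pages.items()
        if cosine(Counter(text.lower().split()), topic_vec) >= threshold
    }

# Hypothetical neighborhood of candidate pages.
pages = {
    "p1": "mountain bike trails",
    "p2": "bike repair shop",
    "p3": "cheap flights and hotels",
}
kept = prune_outliers(pages, "mountain bike", threshold=0.3)
```

The off-topic page `p3` shares no terms with the topic, so it is dropped before distillation runs.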
Topic distillation vs. Resource discovery
Topic distillation:
Depends on large, comprehensive Web crawls
and indices (post-processing)
Can be used to generate a Web taxonomy?
Set a keyword query for each node in the
taxonomy
Run a distillation program
Simple, but has some problems
Topic distillation vs. Resource discovery
Problems:
Constructing the query: involves trial and error and
complicated thought
Query: “North American telecommunication companies”
Query: +"power suppl*" "switch* mode" smps multiprocessor* "uninterrupt* power suppl*" ups -parcel
The Yahoo! node /Business&Economy/Companies/Electronics/PowerSupplies
To match the directory-based browsing quality of:
Yahoo!: 7.03 terms and 4.34 operators
Alta Vista: 2.35 terms and 0.41 operators
Topic distillation vs. Resource discovery
Problems:
Contamination
Stop-sites: not automatic
Term weighting
Edge weighting: no precise algorithm to set the
weights
Topic distillation by itself is not enough for
resource discovery
Hypertext classification: learning
from examples
Adding example pages and their distance-1
neighbors to the graph to be distilled
improves the result
The contents of the given examples and their
neighbors provide a way to compute the
classification decision boundary
NN, Bayesian and support vector
classifiers
Hypertext classification
Link-based features: important
Circular topic influence
The topic of a page influences its text and its
neighbor pages' topics
Knowledge of the linked vicinity’s topic provides
clues for the test document’s topic
Bibliometric, more general than the simple linear
endorsement model used in topic distillation
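The circular influence between a page's topic and its neighbors' topics can be sketched as a simple iterative relabeling. This is a deliberately simplified illustration, not the classifier from the talk: it assumes linked pages tend to share a topic, multiplies each page's text-based class probability by its neighbors' current beliefs, and renormalizes. All names and numbers are hypothetical.

```python
def relabel(text_probs, neighbors, rounds=10):
    """Iteratively mix each page's text evidence with its neighbors' beliefs."""
    beliefs = {page: dict(probs) for page, probs in text_probs.items()}
    for _ in range(rounds):
        updated = {}
        for page, probs in text_probs.items():
            scores = {}
            for label, p_text in probs.items():
                support = 1.0
                for nb in neighbors.get(page, []):
                    # Assumption: a neighbor's belief in a label supports
                    # the same label on this page.
                    support *= beliefs[nb][label]
                scores[label] = p_text * support
            total = sum(scores.values())
            if total > 0:
                updated[page] = {l: s / total for l, s in scores.items()}
            else:
                updated[page] = dict(probs)  # fall back to text evidence
        beliefs = updated
    return beliefs

# Page "x" has ambiguous text but links to two pages that look like cycling.
text_probs = {
    "x": {"cycling": 0.5, "finance": 0.5},
    "a": {"cycling": 0.9, "finance": 0.1},
    "b": {"cycling": 0.8, "finance": 0.2},
}
neighbors = {"x": ["a", "b"], "a": ["x"], "b": ["x"]}
beliefs = relabel(text_probs, neighbors)
```

The text of `x` alone cannot decide its topic; the linked vicinity tips it toward cycling.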
Putting it together for resource
discovery
Conclusion
Emphasized the importance of scalable
automatic resource discovery
Argued that common search engines are
not adequate for resource discovery
Introduced the recently invented focused
crawling system
Future Work
How to derive the training examples
automatically?
How to personalize the output of the focused
crawler for users?