CRAWLING THE HIDDEN WEB
Authors: S. Raghavan & H. Garcia-Molina
Presenter: Nga Chung
OUTLINE

- Introduction
- Challenges
- Approach
- Experimental Results
- Contributions
- Pros and Cons
- Related Work

INTRODUCTION

Hidden Web
- Content stored in databases that can only be retrieved through user queries, e.g., medical research databases, flight schedules, product listings, news archives
- Social media blog posts and comments

So why should we care?
- The scale of the web (55-60 billion pages) does not include the deep web or pages behind security walls [2]
- A 2001 estimate put the Hidden Web at 500 times the size of the publicly indexed web
- Mike Bergman: "The long-term impact of Deep Web search had more to do with transforming business than with satisfying the whims of Web surfers." [5]

CHALLENGES

From a search engine's perspective
- Locate the hidden databases
- Identify which databases to search for a given user query

From a crawler's perspective
- Interact with a search form
  - Search interfaces can be form-based, facet/guided navigation, or free-text, and are intended for human users [3]
- Know what keywords to put into the form fields
- Filter the search results returned from search queries
- Define metrics to measure the crawler's performance

HIDDEN WEB EXPOSER (HIWE) ARCHITECTURE

[Architecture diagram: a Crawl Manager works from a URL List; a Parser and Form Analyzer process downloaded forms; an LVS Manager maintains the Label Value Set (LVS) table in the Task-Specific Database, drawing on external Data Sources; the Form Processor performs form submission to the WWW, and the Response Analyzer examines each response, providing feedback to the LVS.]
FORM ANALYSIS

How does a crawler interact with a search form?
- The crawler builds an "internal form representation":

  F = ({E1, E2, …, En}, S, M)

  where {E1, …, En} is the set of n form elements, S is submission information (e.g., the submission URL), and M is meta-information (e.g., the URL of the form page, the web site hosting the form, and links to the form).
- Label(Ei) is descriptive text describing the field, e.g., "Date"
- Domain(Ei) is the set of possible values for the field, which can be finite (select box) or infinite (text box)
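A minimal Python sketch of this internal form representation; the class and field names are illustrative, not from the paper:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class FormElement:
    label: str                          # Label(E), e.g. "Date"
    domain: Optional[List[str]] = None  # Domain(E): value list if finite, None if infinite (text box)

@dataclass
class FormRepresentation:
    elements: List[FormElement]  # {E1, ..., En}: the n form elements
    submission: Dict[str, str]   # S: submission information, e.g. {"url": ..., "method": ...}
    meta: Dict[str, str]         # M: meta-information, e.g. form page URL, hosting site
```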

FORM ANALYSIS

Example:
- Label(E1) = "Make"; Domain(E1) = {Acura, Lexus, …} (finite)
- Label(E5) = "Your ZIP"; Domain(E5) = {s | s is a text string} (infinite)
TASK SPECIFIC DATABASE

How does a crawler know what keywords to put into the form fields?
- The crawler has a "task-specific database"
  - For instance, if the task is to search archives pertaining to the automobile industry, the database will contain lists of all car makes and models.
- The database has a Label Value Set (LVS) table, in which each row contains
  - L, a label, e.g., "Car Make"
  - V = {v1, …, vn}, a graded set of values, e.g., {Toyota, Honda, Mercedes-Benz, …}
  - A membership function MV that assigns a weight to each member of the set V
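A minimal sketch of an LVS table as a Python dict; the layout and the example weights are illustrative:

```python
# Each row maps a label L to a graded value set V, with weights M_V(v) in [0, 1].
lvs_table = {
    "Car Make": {"Toyota": 1.0, "Honda": 1.0, "Mercedes-Benz": 0.9},
    "Date":     {"2009": 1.0, "2010": 1.0},
}

def membership(label, value):
    """M_V(v): the weight of value v in the set stored under label L (0.0 if absent)."""
    return lvs_table.get(label, {}).get(value, 0.0)
```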

TASK SPECIFIC DATABASE

The LVS table can be populated through
- Explicit initialization by human intervention
- Built-in entries for commonly used categories, e.g., dates
- Querying external data sources, e.g., the Open Directory Project hierarchy (Categories: Regional: North America: United States)
- The crawler's encounters with forms that have finite-domain fields
TASK SPECIFIC DATABASE

Computing the weights M(v):
- Case 1: Precomputed
- Case 2: Computed by the respective data source's wrapper
- Case 3: Computed from crawling experience, as follows (sketched in code below):
  - Try to extract the form element's label.
  - If a label was extracted, look it up in the LVS table: if found, replace the entry (L, V) with (L, V ∪ Domain(E)); if not found, add a new entry to the LVS.
  - If no label could be extracted, find the entry that most closely resembles Domain(E) and add Domain(E) to that set.
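A minimal sketch of the Case 3 update, reusing the dict-based LVS layout from earlier; the "closely resembles" test is approximated here by value-set overlap, and the default weight is an assumption:

```python
def update_lvs(lvs, extracted_label, domain_values, default_weight=0.5):
    """Fold one finite-domain form element into the LVS table (Case 3)."""
    if extracted_label is not None:
        if extracted_label in lvs:
            # Found: replace (L, V) with (L, V U Domain(E)).
            for v in domain_values:
                lvs[extracted_label].setdefault(v, default_weight)
        else:
            # Not found: add a new entry to the LVS.
            lvs[extracted_label] = {v: default_weight for v in domain_values}
    else:
        # No label extracted: pick the entry whose value set most closely
        # resembles Domain(E) (approximated here by overlap) and merge into it.
        best = max(lvs, key=lambda l: len(set(lvs[l]) & set(domain_values)), default=None)
        if best is not None:
            for v in domain_values:
                lvs[best].setdefault(v, default_weight)
```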
MATCHING FUNCTION

A "matching function" maps values from the database to form fields, e.g., E1 = Car Make ← v1 = Toyota, and E2 = Car Model ← v2 = Prius.

Step 1: Label matching
- Normalize the form label, then use a string matching algorithm to compute the minimum edit distance between the form label and all LVS labels (a sketch follows).
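A minimal sketch of Step 1, assuming a plain Levenshtein distance; the paper only says "a string matching algorithm," so the normalization and scoring here are illustrative:

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def match_label(form_label, lvs_labels):
    """Return the LVS label closest (by edit distance) to the normalized form label."""
    norm = form_label.strip().lower()
    return min(lvs_labels, key=lambda l: edit_distance(norm, l.strip().lower()), default=None)
```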
MATCHING FUNCTION

Step 2: Value assignment
- Take all possible combinations of value assignments, rank them, and choose the best set to use for form submission.
- There are three ranking functions:
  - Fuzzy conjunction: $\rho_{fuz}([E_1 \leftarrow v_1, \ldots, E_n \leftarrow v_n]) = \min_i M_{V_i}(v_i)$
  - Average: $\rho_{avg}([E_1 \leftarrow v_1, \ldots, E_n \leftarrow v_n]) = \frac{1}{n} \sum_{i=1}^{n} M_{V_i}(v_i)$
  - Probabilistic: $\rho_{prob}([E_1 \leftarrow v_1, \ldots, E_n \leftarrow v_n]) = 1 - \prod_{i=1}^{n} \left(1 - M_{V_i}(v_i)\right)$

Example: a form with two fields, car make and year
- (Jaguar, 2009), where MV1(Jaguar) = 0.5 and MV2(2009) = 1
  - ρfuz = 0.5
  - ρavg = ½ (0.5 + 1) = 0.75
  - ρprob = 1 − [(1 − 0.5) × (1 − 1)] = 1
- (Toyota, 2010), where MV1(Toyota) = 1 and MV2(2010) = 1
  - ρfuz = 1
  - ρavg = ½ (1 + 1) = 1
  - ρprob = 1 − [(1 − 1) × (1 − 1)] = 1
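The three ranking functions are short enough to write down directly; a sketch, with the weights passed in as a list of MVi(vi) values:

```python
from math import prod

def rho_fuz(weights):
    # Fuzzy conjunction: min_i M_Vi(v_i)
    return min(weights)

def rho_avg(weights):
    # Average: (1/n) * sum_i M_Vi(v_i)
    return sum(weights) / len(weights)

def rho_prob(weights):
    # Probabilistic: 1 - prod_i (1 - M_Vi(v_i))
    return 1 - prod(1 - w for w in weights)

# The slide's worked example: (Jaguar, 2009) with weights 0.5 and 1
assert rho_fuz([0.5, 1]) == 0.5
assert rho_avg([0.5, 1]) == 0.75
assert rho_prob([0.5, 1]) == 1.0
```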
LAYOUT-BASED INFORMATION EXTRACTION (LITE)

Label extraction method:
1. Prune the form page using a custom layout engine
2. Identify the pieces of text (candidates) physically closest to the form element
3. Rank the candidates based on position, font size, etc.
4. Choose the highest-ranked candidate as the label

Results:

Method              Accuracy
LITE                93%
Textual analysis    72%
Common form layout  83%
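A toy sketch of the candidate-ranking idea, assuming each candidate carries a pixel position and font size from the layout engine; the scoring weights are invented for illustration:

```python
def choose_label(element_xy, candidates):
    """Pick a label for a form element from nearby text pieces: closer text
    scores better, with a small bonus for larger fonts (weights are made up)."""
    def score(c):
        dx, dy = c["x"] - element_xy[0], c["y"] - element_xy[1]
        distance = (dx * dx + dy * dy) ** 0.5
        return distance - 0.5 * c["font_size"]  # lower score wins
    return min(candidates, key=score)["text"]

# Example: two candidates near a field at (100, 200)
print(choose_label((100, 200), [
    {"text": "Car Make", "x": 40,  "y": 200, "font_size": 12},
    {"text": "Search",   "x": 100, "y": 400, "font_size": 10},
]))  # -> "Car Make"
```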
RESPONSE ANALYSIS

How does the crawler determine whether a response page contains results or an error message?
- Identify the significant portion of the response page by removing the header, footer, etc., and keeping the content in the middle of the page
- Check whether that content matches predefined error messages, e.g., "No results," "No matches"
- Store a hash of the significant portion and assume that a hash which occurs very often is the hash of an error page
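A minimal sketch of this heuristic, assuming the significant portion has already been isolated; the repetition threshold is an assumption, since the slide does not give one:

```python
import hashlib
from collections import Counter

ERROR_MESSAGES = ("No results", "No matches")
ERROR_HASH_THRESHOLD = 5  # assumed cutoff; not specified in the source

seen_hashes = Counter()

def looks_like_error(significant_content):
    """Flag a response page whose significant portion matches a known error
    message, or whose hash recurs so often it is probably an error page."""
    if any(msg.lower() in significant_content.lower() for msg in ERROR_MESSAGES):
        return True
    h = hashlib.sha1(significant_content.encode("utf-8")).hexdigest()
    seen_hashes[h] += 1
    return seen_hashes[h] >= ERROR_HASH_THRESHOLD
```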

METRICS

How do we measure the efficiency of the hidden web crawler?
- Define submission efficiency (SE):
  - Ntotal = total number of forms submitted
  - Nsuccess = number of submissions that resulted in a response page containing search results
  - Nvalid = number of semantically correct submissions (e.g., inputting "Orange" for a form element labeled "Vegetable" is semantically incorrect)

$SE_{strict} = \frac{N_{success}}{N_{total}} \qquad SE_{lenient} = \frac{N_{valid}}{N_{total}}$
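As a sanity check, plugging the ρfuz run's numbers from the results section into the strict metric:

$SE_{strict} = \frac{N_{success}}{N_{total}} = \frac{2853}{3214} \approx 88.8\%$
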
EXPERIMENT

- Task: a market analyst interested in building an archive of information about the semiconductor industry over the past 10 years
- LVS table populated from online sources such as Semiconductor Research Corporation and Lycos Companies Online

Parameter                                   Value
Number of sites visited                     50
Number of forms encountered                 218
Number of forms chosen for submission       94
Label matching threshold                    0.75
Minimum form size                           3
Value assignment ranking function           ρfuz
Minimum acceptable value assignment rank    0.6
EXPERIMENTAL RESULTS – RANKING FUNCTION

The crawler was executed three times, once with each ranking function.

Task 1 performance with different ranking functions:

Ranking function    Ntotal    Nsuccess    SEstrict
ρfuz                3214      2853        88.8%
ρavg                3760      3126        83.1%
ρprob               4316      2810        65.1%

- ρfuz and ρavg achieve submission efficiency above 80%
- ρfuz does better, but fewer forms are submitted compared to ρavg
EXPERIMENTAL RESULTS – MINIMUM FORM SIZE

Effect of minimum form size: the crawler performs better on larger forms.

Task 1 performance with different minimum form sizes:

Minimum form size    Ntotal    Nsuccess    SEstrict
2                    3735      2950        78.9%
3                    3214      2853        88.77%
4                    2800      2491        88.96%
5                    1560      1404        90%
CONTRIBUTIONS

- Introduces HiWE, one of the first publicly available techniques for crawling the hidden web
- Introduces LITE, a technique to extract form data by incorporating the physical layout of the HTML page
  - Techniques prior to this were based on pattern recognition of the underlying HTML
PROS

- Defines a clear performance metric with which to analyze the crawler's efficiency
- Points out known limitations of the technique, from which future work can proceed
- Directs readers to a technical report that provides a more detailed explanation of the HiWE implementation

CONS

- Not an automatic approach; requires human intervention
- Task-specific
  - Requires creating an LVS table per task
- The technique has many limitations
  - Can only retrieve search results from HTML-based forms
  - Cannot support forms driven by JavaScript events, e.g., onclick, onselect
- No mention of whether forms submitted through HTTP POST were stored/indexed
RELATED WORK

- USC ISI, extracting data from the Web (1999-2001) [7, 8]
  - Describes the relevant information on a web page with a formal grammar and automatically adapts to web page changes
- Research at UCLA (2005) [4]
  - Adaptive approach: automatically generates queries by examining the results of previous queries
- Google's Deep-Web Crawler (2008) [1]
  - Selects only a small number of input combinations that provide good coverage of the content in the underlying database, and adds the resulting HTML pages to the search engine index
- DeepPeep [6]
  - Tracks 45,000 forms across 7 domains and allows users to search for these forms
Q&A
REFERENCES
[1] J. Madhavan, D. Ko, Ł. Kot, V. Ganapathy, A. Rasmussen, & A. Halevy, “Google’s Deep-Web Crawl,” Proceedings of the VLDB Endowment, 2008. Available: http://www.cs.cornell.edu/~lucja/Publications/I03.pdf. [Accessed June 13, 2010]
[2] C. Mattmann, “Characterizing the Web,” Available:
http://sunset.usc.edu/classes/cs572_2010/Characterizing_the_Web.ppt. [Accessed
May 19, 2010]
[3] C. Mattmann, “Query Models,” Available:
http://sunset.usc.edu/classes/cs572_2010/Query_Models.ppt. [Accessed June 10,
2010]
[4] A. Ntoulas, P. Zerfos, & J. Cho, “Downloading Textual Hidden Web Content by
Keyword Queries,” Proceedings of the Joint Conference on Digital Libraries, June
2005. Available: http://oak.cs.ucla.edu/~cho/papers/ntoulas-hidden.pdf.
[Accessed June 13, 2010]
[5] A. Wright, “Exploring a ‘Deep Web’ That Google Can’t Grasp,” The New York
Times, February 22, 2009. Available:
http://www.nytimes.com/2009/02/23/technology/internet/23search.html?_r=1&th
&emc=th. [Accessed June 1, 2010]
[6] DeepPeep beta, Available: http://www.deeppeep.org/index.jsp
[7] C. A. Knoblock, K. Lerman, S. Minton, & I. Muslea, “Accurately and Reliably
Extracting Data from the Web: A Machine Learning Approach,” IEEE Data
Engineering Bulletin, 1999. Available: http://www.isi.edu/~muslea/PS/deb-2k.pdf.
[Accessed June 28, 2010]
[8] C. A. Knoblock, S. Minton, & I. Muslea, “Hierarchical Wrapper Induction for Semistructured Information Sources,” Journal of Autonomous Agents and Multi-Agent Systems, 2001. Available: http://www.isi.edu/~muslea/PS/jaamas-2k.pdf. [Accessed June 28, 2010]