
Deep web
Jianguo Lu
7/20/2015
What is the deep web
• Also called the hidden web or the invisible web
  – In contrast to the surface web
• Content is dynamically generated by a search interface, e.g., http://www.osti.gov/fedsearch. The search interface can be
  – an HTML form
  – a web service
• Content in general is stored in a database
• Usually not indexed by a search engine
  – That is why the surface web is sometimes defined as the part of the web accessible through a search engine
Deep web vs. surface web
• Bergman, Michael K. (August 2001). "The Deep Web: Surfacing
Hidden Value". The Journal of Electronic Publishing 7 (1).
How large is the deep web
• http://www.juanico.co.il/Main%20frame%20-%20English/Issues/Information%20systems.htm
Deep and surface web may overlap
• Some content hidden behind an HTML form or web service can also be available in normal HTML pages
• Some search engines try to index part of the deep web
  – Google is also crawling the deep web
  – Madhavan, Jayant; David Ko, Łucja Kot, Vignesh Ganapathy, Alex Rasmussen, Alon Halevy (2008). Google's Deep-Web Crawl. VLDB.
  – Only a very small portion of the deep web is indexed
Why is there a deep web
• Not everything can be on the surface web, for many reasons…
• Some pages are generated on the fly
  – There are pages that are generated by a specific request, e.g.,
    – books in a library,
    – historical weather data,
    – newspaper archives,
    – all the accounts/members of Flickr/Twitter/Facebook-style web sites
  – There would be too many items if they were all represented as web pages
  – It is easier to keep them in a database than to provide them as static web pages
  – Some pages are the result of integration from various databases
• Content is not restricted to text or HTML. It can be images, PDF, software, music, books, etc. E.g.,
  – all the paintings in a museum,
  – books in a library
• May be password protected
• But still, we wish the content were searchable…
Deep web crawling
• Crawl and index the deep web so that hidden data can be surfaced
• Unlike the surface web, there are no hyperlinks to follow
• Two tasks
  – Find deep web data sources, i.e., HTML forms and web services
    – Accessing the deep web: A survey. B. He, M. Patel, Z. Zhang, K. C.-C. Chang. Communications of the ACM, 2007.
  – Given a data source, download the data from this data source
• We focus on the second task
Crawling a deep web data source
• The only interface is an HTML form or a web service
• If the data is hidden by an HTML form:
  – Fill the form
  – Select and send appropriate queries (see the sketch after this slide)
  – Extract the relevant data from the returned HTML page
  – Ntoulas, Alexandros; Petros Zerfos, and Junghoo Cho (2005). Downloading Hidden Web Content. UCLA Computer Science.
  – Yan Wang, Jianguo Lu, Jessica Chen: Crawling Deep Web Using a New Set Covering Algorithm. ADMA 2009: 326-337.
  – Jianguo Lu, Yan Wang, Jie Liang, Jessica Chen, Jiming Liu: An Approach to Deep Web Crawling by Sampling. Web Intelligence 2008: 718-724.
• If the data is hidden by a web service:
  – Select and send appropriate queries
  – Form filling and data extraction are not needed
• The topic also attracts public interest
  – Wright, Alex (2009-02-22). "Exploring a 'Deep Web' That Google Can't Grasp". New York Times.
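Below is a minimal sketch of the query-sending step for a form-based source, using only the standard library. The endpoint URL and the parameter name are hypothetical; a real crawler reads them from the target page's form element.

import urllib.parse
import urllib.request

def send_query(term):
    """Send one keyword query to a form-based deep web source and
    return the raw HTML of the result page."""
    base = "http://example.org/search"                      # hypothetical endpoint
    url = base + "?" + urllib.parse.urlencode({"q": term})  # hypothetical parameter name
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

The returned HTML then goes through data extraction (for an HTML form source) or is consumed directly (for a web service).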
Deep web crawling is not a trivial task
• It is not easy to obtain all the data
• Query quota
• Return limit
• More importantly, high overlap among the query results
Example: querying with filetype:xsd, n=212000, using random dictionary terms (newsgroup_dict_10000_random.txt):

Query       matches   total   distinct   new
maven       176       176     176        176
disarm      1         185     177        1
sudanese    34        219     211        34
profession  59        278     270        59
compete     2         280     272        2
…
windsor     101       56939   16199
bosch       8         56947   16203      4
cliff       32        56979   16203      0
pursuit     14        56993   16204      1
konstantin  3         56996   16204      0

The overlapping rate is 56996/16204 ≈ 3.5, while the percentage of documents retrieved is only 16204/212000 ≈ 0.07.
The problem
• Minimize the cost while retrieving most of the data
  – Some people try to minimize the number of queries, while we minimize the total number of documents retrieved
• Minimize the OR (Overlapping Rate) while reaching a high Hit Rate (HR); both are written out below
  – S(qj, DB): the set of results of query qj on database DB.
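In this notation the two quantities are (a reconstruction from the definitions OR = n/M and P = M/N on the Notations slide):

HR = |S(q1, DB) ∪ … ∪ S(qt, DB)| / |DB|
OR = (|S(q1, DB)| + … + |S(qt, DB)|) / |S(q1, DB) ∪ … ∪ S(qt, DB)|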
Random queries
• What is the cost if random queries are sent?
• The answer depends on the assumptions about the data source
• In the table below, models in the lower rows are more difficult to crawl

Model   All matched documents returned?   Equal probability of being matched?
M0      yes                               yes
M0r     no                                yes
Mh      yes                               no
Mhr     no                                no
Notations
• N: the actual number of documents in a data source;
• t: the number of queries that are sent to a data source;
• mj: the number of matched documents for query j, 1 <= j <= t;
  – n = Σj mj is the sample size, i.e., the total number of matched documents;
• uj: the number of new documents retrieved by query j, 1 <= j <= t;
• Mi = Σ_{j<i} uj: the total number of unique documents that are retrieved before query i;
  – Note that M1 = 0 and M2 = m1. Let M = M_{t+1} denote the total number of distinct documents that are retrieved by all the queries in the estimation process;
• di: the number of duplicate documents retrieved by query i; di + ui = mi;
• k: the maximal number of returns from a ranked data source, even if there are mj > k matches;
• OR = n/M: the Overlapping Rate up to the t-th query, i.e., the ratio between the sample size and the number of distinct documents;
• P = M/N: the percentage of the documents that have been sampled, i.e., the ratio between the number of distinct documents and the actual size.
Example of crawling process
• Suppose N = 600 and the return limit is 30; ri = min(mi, 30) denotes the number of documents actually returned for query i (the sketch below reproduces this accounting):

query  mi  ri  ui  di  Mi  ni  OR     P
q1     3   3   3   0   0   3   1      0.005
q2     26  26  25  1   3   29  1.036  0.047
q3     37  30  25  5   28  59  1.113  0.088
q4     23  23  16  7   53  82  1.188  0.115

[Figure: P and OR plotted over the course of the crawl]
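A minimal sketch of how the columns above are computed while replaying a crawl log; the per-query result sets here are hypothetical stand-ins, not the actual data behind the slide.

# Track mi, ri, ui, di, Mi, ni, OR and P while replaying a crawl log.
N, LIMIT = 600, 30
log = {  # query -> ids of matched documents (hypothetical)
    "q1": set(range(0, 3)),
    "q2": set(range(1, 27)),
}

seen, n_returned = set(), 0
for q, matched in log.items():
    m = len(matched)
    returned = set(list(matched)[:LIMIT])   # return limit: at most LIMIT docs
    r = len(returned)
    u = len(returned - seen)                # new documents
    d = r - u                               # duplicates
    M_before = len(seen)                    # Mi: distinct docs before this query
    seen |= returned
    n_returned += r                         # ni: cumulative returned docs
    OR = n_returned / len(seen)             # overlapping rate
    P = len(seen) / N                       # fraction of the source retrieved
    print(q, m, r, u, d, M_before, n_returned, round(OR, 3), round(P, 3))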
Model M0
• Assumptions
  – All the matched documents are returned
  – Each document has equal probability of being matched
• Result
  P ≈ 1 - OR^(-2.1)
  – Jianguo Lu, Dingding Li: Estimating Deep Web Data Source Size by Capture-Recapture Method. Information Retrieval, Springer.
Model M0
• The more accurate formula for the relationship between P and OR is
  OR = -ln(1 - P) / P
  (the sketch after the table reproduces it)
• Conclusion: under model M0 it is not difficult to crawl a data source at all
• In most cases OR will be higher than what is calculated by the above formula
  – Because M0 is the simplest model

P      OR
0.1    1.053605
0.2    1.115718
0.3    1.188916
0.4    1.277064
0.5    1.386294
0.6    1.527151
0.7    1.719961
0.8    2.011797
0.9    2.558428
0.95   3.153402
0.99   4.651687
0.999  6.91467
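A short sketch that reproduces the table from the closed-form relation and cross-checks it against a direct simulation of M0 (uniform sampling with replacement); the simulation parameters are arbitrary.

import math, random

def or_of_p(p):
    """OR = -ln(1-P)/P for model M0 (uniform, all matches returned)."""
    return -math.log(1.0 - p) / p

for p in (0.1, 0.2, 0.5, 0.9, 0.99):
    print(p, round(or_of_p(p), 6))

# Cross-check by simulation: draw documents uniformly with replacement.
N = 100_000                      # arbitrary source size
random.seed(0)
seen, n = set(), 0
while len(seen) / N < 0.5:       # crawl until P = 0.5
    seen.add(random.randrange(N))
    n += 1
print("simulated OR at P=0.5:", round(n / len(seen), 3))  # ~1.386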
Model M0 vs Mh
• The blue line is drawn using the equation P = 1 - OR^(-2.1)
• Several real data sources show a different trend
• Why?
Model Mh
• Assumptions:
  – Each document has unequal probability of being matched by a query
  – All matched documents are returned
• h means heterogeneity in capture probability
  – Originally developed in ecology to estimate populations of wild animals
  – Process: capture a group of animals, mark and release them; capture another group of animals, mark and release them again; … …
• Mh was first proposed for the capture-recapture method

[Figure: capture frequency of newsgroup documents by queries.
(A) scatter plot when documents are selected by queries; in total 13,600 documents are retrieved.
(B) the first 100 captures in (A).
(C) histogram of (A).
(D) log-log plot of (C).]
Model Mh
• The empirical result is
  P ≈ 1 - OR^(-1.1)
• Obtained by linear regression (see the sketch below)

P         OR
0         1
0.533484  2
0.701347  3
0.782362  4
0.829732  5
0.860674  6
0.882404  7
0.898468  8
0.910806  9
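The exponent 1.1 can be recovered by least squares on log(1 - P) versus log(OR), since P = 1 - OR^(-α) implies ln(1 - P) = -α ln(OR). A sketch using the table above:

import math

# (P, OR) pairs from the Mh table above
data = [(0.533484, 2), (0.701347, 3), (0.782362, 4), (0.829732, 5),
        (0.860674, 6), (0.882404, 7), (0.898468, 8), (0.910806, 9)]

# P = 1 - OR**(-alpha)  =>  ln(1-P) = -alpha * ln(OR)
# Least-squares fit through the origin: alpha = -sum(x*y) / sum(x*x)
xs = [math.log(orr) for _, orr in data]
ys = [math.log(1.0 - p) for p, _ in data]
alpha = -sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(round(alpha, 3))  # ~1.1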
File size distributions
[Figure: file size distributions]
Measuring heterogeneity
• Coefficient of Variation (CV): the standard deviation of the capture probabilities divided by their mean (see the sketch below)
• Assume that the documents in the data source have different but fixed probabilities of being captured, i.e.,
  p = {p1, p2, …, pn}, Σ pj = 1.

[Figure: scatter plots for various CVs. 200 random numbers within the range of 1 and 20,000 are generated from a Pareto distribution.]
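A sketch of how such capture probabilities can be generated and their CV measured; the Pareto shape parameter is an arbitrary choice, not the value used for the slide's figure.

import random, statistics

random.seed(0)
# Draw 200 Pareto-distributed values in roughly the range [1, 20000],
# then normalize them into capture probabilities that sum to 1.
shape = 1.5                                    # arbitrary Pareto shape
raw = [min(random.paretovariate(shape), 20000) for _ in range(200)]
total = sum(raw)
p = [x / total for x in raw]

# Coefficient of variation of the capture probabilities
cv = statistics.pstdev(p) / statistics.fmean(p)
print(round(cv, 3))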
Measuring heterogeneity
• In general, the relationship takes the form
  P ≈ 1 - OR^(-α)
  where α depends on the heterogeneity of the data source (α = 2.1 under M0; α ≈ 1.1 for the heterogeneous sources above)

[Figure: relationship between CV (γ) and α]
Model M0r
• Assumptions
  – Only the top k documents are returned
  – Each document has equal probability of being matched
  – Documents have a static ranking
Model M0r
• When k and m are fixed for every query:
  P ≈ (k/m) · (1 - OR^(-2.1))
• Not a practical assumption
Model Mhr
• Assumptions
  – Only the top k documents are returned
  – Documents have unequal probability of being matched
  – Documents have a static ranking
• When k and m are fixed, we have
  P ≈ (k/m) · (1 - OR^(-1.1))
Evolution of the models
• Comparison of models M0, Mh, M0r, and Mhr (a simplified simulation follows). 1000 documents are sorted according to their file size in decreasing order. 600 documents are selected in the four models, including the duplicates. k = 10, m = 20.
• Subplot M0 shows that all the documents are retrieved uniformly.
• Subplot Mh shows that large documents are preferred, but most of the documents can eventually be sampled.
• Subplot M0r exhibits a clear cut around the 500th document. Beyond this line almost no documents are retrieved.
• Mhr is the compound of M0r and Mh.
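A compact simulation in the spirit of that comparison; the weighting and ranking rules are simplified stand-ins (match probability proportional to file size, static ranking by size), chosen to reproduce the qualitative behavior rather than the exact experiment.

import random

random.seed(0)
N, DRAWS, K, M = 1000, 600, 10, 20
# Document "sizes", sorted decreasing so that id order equals rank order.
size = sorted((random.paretovariate(1.0) for _ in range(N)), reverse=True)

def crawl(weighted, ranked):
    """Return the set of document ids retrieved under one model."""
    seen = set()
    for _ in range(DRAWS // K):             # each query returns K documents
        pool = random.choices(range(N), weights=size if weighted else None,
                              k=M)          # M matched documents per query
        if ranked:                          # static ranking: keep the top K
            pool = sorted(pool)[:K]         # smaller id = larger document
        else:
            pool = pool[:K]
        seen.update(pool)
    return seen

for name, (w, r) in {"M0": (0, 0), "Mh": (1, 0),
                     "M0r": (0, 1), "Mhr": (1, 1)}.items():
    print(name, len(crawl(w, r)), "distinct docs")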
Selecting queries
• We have learnt the cost when random queries are issued
• Can we select the queries so as to reduce the cost?
• Which models can we apply this to?
  – Mh or Mhr?
Select queries
• Incremental approach
  – Method (see the sketch after this list):
    1. Send a query to download the matched documents;
    2. While (not most of the documents downloaded):
       analyze the downloaded documents to select the next most appropriate query;
       send the query to download documents;
  – Alexandros Ntoulas, Petros Zerfos, and Junghoo Cho: Downloading Textual Hidden Web Content through Keyword Queries. JCDL 2005: 100-109.
  – Disadvantages
    – Need to download many (almost all) documents
    – The crawler may only need to know the URL, not the entire document
• Sampling based approach
  – Jianguo Lu, Yan Wang, Jie Liang, Jessica Chen, Jiming Liu: An Approach to Deep Web Crawling by Sampling. Web Intelligence 2008: 718-724.
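A minimal sketch of the incremental loop above; send_query is the hypothetical HTTP helper sketched earlier, parse_docs is an assumed extractor returning (url, text) pairs, and the next-query heuristic (most frequent unseen term in the downloaded text) is one simple choice among many.

import re
from collections import Counter

def incremental_crawl(seed_term, target_fraction, estimated_size,
                      send_query, parse_docs):
    """Incremental deep web crawling: pick each next query from the
    documents downloaded so far."""
    downloaded = {}                       # url -> text
    issued = {seed_term}
    query = seed_term
    while len(downloaded) / estimated_size < target_fraction:
        for url, text in parse_docs(send_query(query)):
            downloaded[url] = text
        # Heuristic: most frequent term in the sample not yet issued
        counts = Counter(w for t in downloaded.values()
                         for w in re.findall(r"[a-z]+", t.lower()))
        candidates = [w for w, _ in counts.most_common() if w not in issued]
        if not candidates:
            break                         # vocabulary exhausted
        query = candidates[0]
        issued.add(query)
    return downloaded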
Sampling based approach
• The queries are selected from a sample set of documents
• In contrast to the incremental approach
• Steps (see the sketch below)
  – Send a few random queries to TotalDB;
  – Obtain the matched documents and construct the SampleDB;
  – Analyse all the documents in SampleDB and construct the QueryPool;
  – Use set covering algorithms to select the Queries;
  – Send the Queries to TotalDB to retrieve the documents.
• Open questions
  – Can the queries cover most of the data source?
  – Can a low OR in SampleDB be projected to TotalDB?
  – Does SampleDB need to be very large?
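A sketch of that pipeline; select_cover stands for any set covering algorithm (greedy and weighted greedy versions are sketched in the set covering part below), send_query and parse_docs are the hypothetical helpers from earlier, and the tokenization is a simplification.

import re

def sampling_crawl(random_terms, send_query, parse_docs, select_cover):
    """Sampling-based crawling: pick the whole query set from a sample."""
    # Steps 1-2: build SampleDB from a few random queries against TotalDB.
    sample_db = {}                               # url -> text
    for term in random_terms:
        sample_db.update(parse_docs(send_query(term)))
    # Step 3: QueryPool maps each candidate term to the sample docs it covers.
    pool = {}                                    # term -> set of urls
    for url, text in sample_db.items():
        for w in set(re.findall(r"[a-z]+", text.lower())):
            pool.setdefault(w, set()).add(url)
    # Step 4: set covering over the sample picks the final queries.
    queries = select_cover(pool, set(sample_db))
    # Step 5: send the selected queries to TotalDB.
    results = {}
    for q in queries:
        results.update(parse_docs(send_query(q)))
    return results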
Hypothesis 1: vocabulary learnt from sample can cover
most of the documents in TotalDB
• Impact of sample size on HR. The queries are selected from SampleDB and cover above 99% of the documents in SampleDB. The HR in the plot is obtained when those queries are sent to TotalDB. The relative query pool size is 20.
Hypothesis 2: low OR in SampleDB can be projected to TotalDB
• Comparison of our method on the four corpora with queries selected randomly from the sample.
• The sample size is 3000; the relative query pool size is 20.
• Our method achieves a much smaller OR when HR is high.
Hypothesis 3: neither the sample size nor the query pool size needs to be very large
Hypothesis 3 (continued)
Set covering problem
• Given a universe U and a family of subsets S = {S1, S2, …, Sn} of U.
• A cover is a subfamily of S whose union is U.
• Let J = {1, 2, …, n}. J* ⊆ J is a cover if
  ∪_{j ∈ J*} Sj = U
• Set covering decision problem: the input is a pair (S, U) and an integer k; the question is whether there is a set covering of size k or less.
• Set covering optimization problem: the input is a pair (S, U), and the task is to find a set covering which uses the fewest sets.
• The decision version of set covering is NP-complete, and the optimization version is NP-hard.
Set covering example
[Figure: three terms t1, t2, t3 covering three documents d1, d2, d3]

Suppose each row represents a term and each column represents a document. If cell (i, j) is 1, term i can retrieve document j, i.e., term i covers document j.
Set covering algorithms
• The optimal solution is hard to obtain within polynomial time
• Various approximation algorithms have been developed
  – Greedy
    – A classical algorithm
  – Weighted greedy
    – Developed for our particular application
    – Yan Wang, Jianguo Lu, Jessica Chen: Crawling Deep Web Using a New Set Covering Algorithm. ADMA 2009: 326-337.
  – Genetic algorithm
  – Clustering
  – …
Greedy algorithm
• In each round, greedy selects the query that retrieves the largest number of new elements (see the sketch below)
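A minimal greedy set cover in the document's setting (terms covering documents); it matches the select_cover interface assumed in the sampling sketch above.

def greedy_cover(pool, universe):
    """Greedy set cover: repeatedly pick the term whose result set
    contains the most still-uncovered documents."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        term, covered = max(pool.items(),
                            key=lambda kv: len(kv[1] & uncovered))
        if not covered & uncovered:
            break                 # remaining documents are uncoverable
        chosen.append(term)
        uncovered -= covered
    return chosen

# The example from the weighted greedy slides: 5 queries over 9 documents.
pool = {
    "q1": {"d3", "d5", "d6", "d8"},
    "q2": {"d6", "d8"},
    "q3": {"d1", "d2", "d3", "d4", "d9"},
    "q4": {"d2", "d6", "d7", "d9"},
    "q5": {"d3", "d4", "d5", "d8", "d9"},
}
# Ties are broken by iteration order, so this run picks q3 first and
# prints ['q3', 'q1', 'q4'], unlike the slide's run, which picked q5 first.
print(greedy_cover(pool, {f"d{i}" for i in range(1, 10)}))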
Greedy algorithm may not be able to find the optimal
solution
• There can be two solutions
  – If the first set selected is t1, then the solution is {t1, t2} and the cost is 4
  – If the first selection is t2, then the solution is {t2, t3} and the cost is 3
Weighted greedy algorithm
• Example used in the following slides: five queries q1–q5 covering nine documents d1–d9.

[Figure: bipartite graph of queries q1–q5 and documents d1–d9]

     d1 d2 d3 d4 d5 d6 d7 d8 d9
q1   0  0  1  0  1  1  0  1  0
q2   0  0  0  0  0  1  0  1  0
q3   1  1  1  1  0  0  0  0  1
q4   0  1  0  0  0  1  1  0  1
q5   0  0  1  1  1  0  0  1  1
One solution obtained by greedy algorithm
• Greedy: in each round select the query with the most new documents; df is the number of documents a query matches in TotalDB, i.e., its cost.

[Figure: the remaining graph after each selection]

Round 1:
     d1 d2 d3 d4 d5 d6 d7 d8 d9  df  new  new/df
q1   0  0  1  0  1  1  0  1  0   4   4    1
q2   0  0  0  0  0  1  0  1  0   2   2    1
q3   1  1  1  1  0  0  0  0  1   5   5    1
q4   0  1  0  0  0  1  1  0  1   4   4    1
q5   0  0  1  1  1  0  0  1  1   5   5    1
q5 is selected (5 new documents).

Round 2 (documents covered by q5 are removed):
     d1 d2 d3 d4 d5 d6 d7 d8 d9  df  new  new/df
q1   0  0  0  0  0  1  0  0  0   4   1    0.25
q2   0  0  0  0  0  1  0  0  0   2   1    0.5
q3   1  1  0  0  0  0  0  0  0   5   2    0.4
q4   0  1  0  0  0  1  1  0  0   4   3    0.75
q5   0  0  0  0  0  0  0  0  0   5   0    0
q4 is selected (3 new documents).

Round 3 (documents covered by q4 are removed):
     d1 d2 d3 d4 d5 d6 d7 d8 d9  df  new  new/df
q1   0  0  0  0  0  0  0  0  0   4   0    0
q2   0  0  0  0  0  0  0  0  0   2   0    0
q3   1  0  0  0  0  0  0  0  0   5   1    0.2
q4   0  0  0  0  0  0  0  0  0   4   0    0
q5   0  0  0  0  0  0  0  0  0   5   0    0
q3 is selected.

Total cost is 5 + 4 + 5 = 14.
Solution obtained by weighted greedy algorithm
• Weighted greedy: each document dj is weighted by the inverse of the number of queries that cover it (d1 appears only in q3, so its weight is 1; d3 appears in q1, q3, and q5, so its weight is 1/3 ≈ 0.3). In each round the query with the largest weight/df ratio is selected (a code sketch follows the example).

Round 1:
     d1  d2   d3   d4   d5   d6   d7  d8   d9   weight  w/df
q1   0   0    0.3  0    0.5  0.3  0   0.3  0    1.5     0.375
q2   0   0    0    0    0    0.3  0   0.3  0    0.667   0.3333
q3   1   0.5  0.3  0.5  0    0    0   0    0.3  2.667   0.5333
q4   0   0.5  0    0    0    0.3  1   0    0.3  2.167   0.5417
q5   0   0    0.3  0.5  0.5  0    0   0.3  0.3  2       0.4
q4 is selected (largest w/df).

Round 2 (documents covered by q4 are removed):
     d1  d2  d3   d4   d5   d6  d7  d8   d9  weight  w/df
q1   0   0   0.3  0    0.5  0   0   0.3  0   1.167   0.2917
q2   0   0   0    0    0    0   0   0.3  0   0.333   0.1667
q3   1   0   0.3  0.5  0    0   0   0    0   1.833   0.3667
q4   0   0   0    0    0    0   0   0    0   0       0
q5   0   0   0.3  0.5  0.5  0   0   0.3  0   1.667   0.3333
q3 is selected.

Round 3 (documents covered by q3 are removed):
     d1  d2  d3  d4  d5   d6  d7  d8   d9  weight  w/df
q1   0   0   0   0   0.5  0   0   0.3  0   0.833   0.2083
q2   0   0   0   0   0    0   0   0.3  0   0.333   0.1667
q3   0   0   0   0   0    0   0   0    0   0       0
q4   0   0   0   0   0    0   0   0    0   0       0
q5   0   0   0   0   0.5  0   0   0.3  0   0.833   0.1667
q1 is selected.

Total cost is 4 + 5 + 4 = 13, one document fewer than the greedy solution.
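A sketch of weighted greedy matching the tables above; the document weights (inverse of the number of covering queries) and the weight/df criterion follow the worked example.

def weighted_greedy_cover(pool):
    """Weighted greedy set cover: documents covered by few queries weigh
    more; pick the query maximizing (sum of weights of its new docs) / df."""
    df = {t: len(docs) for t, docs in pool.items()}   # cost of each query
    weight = {}                                       # doc -> 1 / #covering queries
    for docs in pool.values():
        for d in docs:
            weight[d] = weight.get(d, 0) + 1
    weight = {d: 1.0 / c for d, c in weight.items()}

    uncovered = set(weight)
    chosen = []
    while uncovered:
        term = max(pool, key=lambda t:
                   sum(weight[d] for d in pool[t] & uncovered) / df[t])
        if not pool[term] & uncovered:
            break
        chosen.append(term)
        uncovered -= pool[term]
    return chosen

pool = {
    "q1": {"d3", "d5", "d6", "d8"},
    "q2": {"d6", "d8"},
    "q3": {"d1", "d2", "d3", "d4", "d9"},
    "q4": {"d2", "d6", "d7", "d9"},
    "q5": {"d3", "d4", "d5", "d8", "d9"},
}
print(weighted_greedy_cover(pool))   # ['q4', 'q3', 'q1'], total df 13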
Review
• Deep web crawling
• Random queries and models
• Sampling based crawling
• Set covering algorithms
  – Greedy
  – Weighted greedy
  – Clustering
  – Genetic
  – Currently these are for model Mh
  – What is the solution for model Mhr?
• For model Mhr, we need to predict the term frequencies of the terms in TotalDB