Transcript: Database Discovery Presentation

Database Selection Using Actual Physical and
Acquired Logical Collection Resources in a Massive
Domain-specific Operational Environment
Jack G. Conrad, Xi S. Guo, Peter Jackson, Monem Meziou
Research & Development
Thomson Legal & Regulatory – West Group
St. Paul, Minnesota 55123 USA
{Jack.Conrad,Peter.Jackson}@WestGroup.com
Growth of Online Databases:
Westlaw and Westnews

[Chart: No. of Databases by year, 1976–2000, for Westlaw and Westnews, growing from near zero in 1976 to over 15,000 by 2000.]
Outline

• Terminology (background to the vocabulary, incl. that used in the title)
• Overview (of our operational environment and overall problem space)
• Research Contributions (Novelty of Investigation: aspects of the problem that haven't been explored before, esp. w.r.t. scale and production systems)
• Corpora Statistics (we'll look at the data sets used, namely those listed for the next item)
• Experimental Set-up
  • Phase 1: Actual Physical Resources
  • Phase 2: Acquired Logical Resources
• Performance Evaluation (we'll compare the effectiveness of each approach on each data set)
• Conclusions (I'll share what conclusions we're able to draw)
• Future Work (and discuss new directions this work may be taking)
Terminology

• Database Selection
  • Given O(10K) DBs composed of textual documents, we need to effectively and efficiently help users narrow their information search and home in on the most relevant materials available in the system.
• Actual Physical Resources
  • There exist O(1K) underlying physical DBs, organized around internal criteria such as publication year, h/w system, etc., that can be leveraged to reduce the dimensionality of the problem.
  • We have access to the complete term distributions associated with these DBs.
• Acquired Logical Resources
  • We can re-architect the underlying DBs along domain- and user-centric content types (e.g., Region, Topic, Doc-type, etc.).
  • We can then profile those DBs, characterizing these "logical" DBs with different sampling techniques (random or query-based sampling); a minimal profiling sketch follows below. We wanted to convince ourselves that we could first get reasonable results at this level.
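To make the profiling step concrete, here is a minimal sketch of building a collection profile from a random or a query-based document sample. This is not the production West Group code; the document representation, the `search` interface, and the sample sizes are illustrative assumptions.

```python
import random
from collections import Counter

def random_sample_profile(documents, sample_size=500):
    """Profile a logical DB from a uniform random sample of its docs.

    `documents` is assumed to be a list of docs, each a list of tokens.
    Returns (df, cf): document and collection term frequencies
    estimated from the sample.
    """
    sample = random.sample(documents, min(sample_size, len(documents)))
    df, cf = Counter(), Counter()
    for tokens in sample:
        cf.update(tokens)        # total occurrences across sampled docs
        df.update(set(tokens))   # no. of sampled docs containing the term
    return df, cf

def query_based_profile(search, probe_queries, per_query=10, max_docs=500):
    """Profile a DB reachable only through its search interface.

    `search(query, k)` is a hypothetical callable yielding up to k
    (doc_id, tokens) pairs; documents retrieved for a set of probe
    queries stand in for a random sample of the collection.
    """
    seen = set()
    df, cf = Counter(), Counter()
    for q in probe_queries:
        for doc_id, tokens in search(q, per_query):
            if doc_id in seen:
                continue
            seen.add(doc_id)
            cf.update(tokens)
            df.update(set(tokens))
            if len(seen) >= max_docs:
                return df, cf
    return df, cf
```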
Overview

Operational Environment (an overview of Westlaw's operational environment)
• Over 15,000 databases, each consisting of 1,000s of docs
• Over one million U.S. attorneys
• Thousands of other users in the UK, Canada, Australia, ...
• O(100K) queries submitted to the Westlaw system each day (several hundred thousand queries are submitted ...)

Motivations for Re-architecting the System
• Showcasing 1000s of DBs is typically a competitive advantage, but we require our users to submit a DB ID
• A segment of today's users prefer global search environments
• Simplified activity of narrowing scope for online research
  • User- and domain-centric rather than hardware- or maintenance-centric
  • Primarily concentrating on areas of law and business
• Toolkit approach to DBs and DB Selection tools
  • Diverse mechanisms for focusing on relevant information, each mechanism optimized for a particular level of granularity
Contributions of Research

• Represent O(10,000) DBs
• DBs can contain O(100,000) documents
• Collection sizes vary by several orders of magnitude
• Documents can appear in more than one DB
• DBs cumulatively in the TB, not GB, range (the work reported here involves between 2 and 3 TB)
• Docs represent a real, not simulated, domain
• Implemented in an actual production environment
Westlaw Architectural Issues:
Physical vs. "Logical" Databases

O(1000) physical DBs vs. O(100) logical DBs, an order-of-magnitude difference.

Traditionally, data for the Westlaw system were physically stored in silos dictated by internal considerations, that is, those that facilitated storage and maintenance (publication year, aggregate content type, or source), e.g., Fed. Case_Law (2002/03), State Statutes (2002/04), Local WestNews (2002/05), Int'l Analytical (2002/06), Regulatory (2002/07). These were not categories of data that made sense to system users in the legal domain, such as legal jurisdiction (region), legal practice area, or document type (e.g., congressional legislation, treatises, jury verdicts, etc.).

Our primary objective in re-architecting the WL repository was to achieve such logical data sets.

[Diagram: the three columns labeled in red (Doc-Type, Jurisdiction, Legal Practice Area) represent the three primary bases for segmentation; the rows labeled in blue are the residual sub-groupings resulting from this strategy.]
Corpora Statistics

                           Physical Databases       Logical Databases
                           (Phase 1)                (Phase 2)
Number of Collections      1000                     128
Collections Profiled       100 (~40% of WL)         128 (~90% of WL)
Collection Information     Standard                 Via sampling
Docs / Profile             All (each doc            500 / 1000 (Callan
                           participates in          found 300 docs
                           the profile)             sufficed)
Average Docs / Collection  298,935                  378,468
Average Tokens / Profile   97,299 (basically        22,296 / 47,450
                           the entire dict.)        (roughly 25% & 50%
                                                    of the complete dict.)
Alternative Scoring Models

CORI approach:
• Scoring: CORI 1-2-3
  • tf-idf based, representing df-icf
  • absent terms given a default belief prob.
• Engine: WIN (Bayesian Inference Network)
• Data: Collection Profiles
  • Complete Term Distr. (Phase 1)
  • Random & Query-based sample Term Distr. (Phase 2)

Language Model approach:
• Scoring: Language Model
  • occurrence based, via df + cf
  • smoothing techniques used on absent terms
• Engine: Statistical (Term / Concept Probabilities)
• Data: Collection Profiles
  • Complete Term Distr. (Phase 1)
  • Random & Query-based sample Term Distr. (Phase 2)
tf * idf Scoring -- Cori_Net3

The belief p(w_i | c_j) in collection c_j due to observing term w_i is determined by

    $p(w_i \mid c_j) = d_b + (1 - d_b) \cdot T \cdot I$

where d_b is the minimum (default) belief component when term w_i occurs in collection c_j. This is similar to Cori_Net2, but normalized without layered variables.

    $T = d_t + (1 - d_t) \cdot \frac{df}{df + K}$

Typically this tf-type expression is normalized by df_max, but here we introduce K, a normalization factor based on the collection's word count cw relative to the mean collection word count, inspired by experiments in document retrieval. Our K is different from anything Callan or others have used; they have a set of parameters that are successively wrapped around each other.

    $I = \frac{\log\left(\frac{|C| + 0.5}{cf}\right)}{\log\left(|C| + 1.0\right)}$

This is the collection-retrieval equivalent of normalized inverse document frequency (idf); I is of course between 0 and 1. A small scoring sketch follows below.
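As a concrete illustration, here is a minimal sketch of this style of collection scoring. It is not the exact Cori_Net3 formula: the talk does not fully specify K, so the simple size normalization below, the default-belief constants, and all function names are assumptions.

```python
import math

def cori_belief(df, cf, cw, avg_cw, num_collections,
                d_b=0.4, d_t=0.4, k=200.0):
    """Belief p(w|c) that collection c satisfies a query term w.

    df  : no. of documents in collection c containing the term
    cf  : no. of collections containing the term
    cw  : total word count of collection c
    avg_cw : mean word count across all collections
    num_collections : |C|, the number of collections being ranked

    d_b, d_t, and k are assumed constants; K below is a simple
    size normalization (the talk's actual K differs).
    """
    K = k * cw / avg_cw                      # assumed form of K
    T = d_t + (1.0 - d_t) * df / (df + K)    # tf-like component
    # collection-level analogue of normalized idf, in [0, 1]
    I = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)
    return d_b + (1.0 - d_b) * T * I         # absent terms fall back to d_b * ...

def score_collection(query_terms, profile_df, cw, avg_cw,
                     num_collections, cf_table):
    """Average the per-term beliefs over the query terms.

    `profile_df` maps term -> df within this collection;
    `cf_table` maps term -> no. of collections containing the term.
    """
    beliefs = [
        cori_belief(profile_df.get(t, 0), cf_table.get(t, 1),
                    cw, avg_cw, num_collections)
        for t in query_terms
    ]
    return sum(beliefs) / len(beliefs) if beliefs else 0.0
```

Note how a term absent from the profile (df = 0) still contributes the default-belief component, matching the "absent terms given default belief prob." behavior described earlier.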
Language Modeling

An LM based only on a profile doc may face sparse-data problems when the probability of a word w given a profile 'doc' is 0 (an unobserved event). So it may be useful to extend the original document model with a db model.

• Weighted Sum Approach (Additive Model)

    $P_{sum}(w \mid d) = \lambda \cdot P_{doc}(w \mid d) + (1 - \lambda) \cdot P_{db}(w)$

An additive model can help by leveraging extra evidence from the complete collection of profiles. By summing in the contribution of a word at the db level, we can mitigate the uncertainty associated with sparse data in a non-additive model.

• Query Treated as a Sequence of Terms (Independent Events)

    $P_{sequence}(Q \mid d) = \prod_{i=1}^{m} P(w_i \mid d)$

By treating the query as a sequence of terms, each term is viewed as a separate event, and the query represents the joined event (this permits duplicate terms and phrasal expressions). A minimal sketch of this scheme follows below.
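Here is a minimal sketch of the two-part scheme just described: linear interpolation smoothing plus a product over query terms. The log-space accumulation, the probability floor, and the interpolation weight are my own assumptions, not values from the talk.

```python
import math

def smoothed_term_prob(w, doc_tf, doc_len, db_cf, db_len, lam=0.7):
    """P_sum(w|d): interpolate the profile-doc model with a db-wide model.

    doc_tf : term -> count within this collection's profile 'doc'
    doc_len: total tokens in the profile doc
    db_cf  : term -> count across all profiles (the db model)
    db_len : total tokens across all profiles
    lam    : interpolation weight lambda (assumed value)
    """
    p_doc = doc_tf.get(w, 0) / doc_len   # zero for unobserved terms
    p_db = db_cf.get(w, 0) / db_len      # back-off evidence from all profiles
    return lam * p_doc + (1.0 - lam) * p_db

def query_log_likelihood(query_terms, doc_tf, doc_len, db_cf, db_len):
    """log P_sequence(Q|d): sum of log term probabilities.

    Working in log space avoids underflow from multiplying many small
    probabilities; duplicate query terms simply contribute twice. The
    floor guards against terms absent from every profile.
    """
    return sum(
        math.log(max(smoothed_term_prob(w, doc_tf, doc_len, db_cf, db_len),
                     1e-12))
        for w in query_terms
    )
```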
Test Queries and Relevance Judgments

• Actual user submissions to the DBS application
  • Phase 1 (Physical Collections): 250 queries; mean length 8.0 terms
  • Phase 2 (Logical Collections): 100 queries; mean length 8.8 terms

Why did we use a different query set for Phase 2? We wanted queries that were less general and more specific, with fewer positive relevance judgments per query.

• Complete Relevance Judgments
  • Provided by domain experts before the experiments were run
  • Followed training exercises to establish consistency
• Mean Positive Relevance Judgments per Query
  • Phase 1 (Physical Collections): 17.0
  • Phase 2 (Logical Collections): 9.1
Retrieval Experiments

It's important to point out that our initial experiments were at the database level. Some of the variables we examined are indicated here.

• Test Parameters:
  • 100 physical DBs vs. 128 logical DBs
  • For logical DB profiles: Query-based vs. Random sampling
  • Queries with phrasal concepts vs. terms only
  • Stemmed vs. unstemmed terms
  • Scaling vs. none (i.e., global frequency reduction)
  • Minimum term frequency thresholds (inspired by speech recognition experiments on noise)
• Performance Metrics (we'll see examples of these next; a small sketch of the second metric follows below):
  • Standard precision at 11-point recall
  • Precision at N-database cut-offs
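For clarity, here is a minimal sketch of the precision-at-N-databases cut-off metric; the ranking and judgment representations are assumptions.

```python
def precision_at_n(ranked_db_ids, relevant_db_ids, n):
    """Fraction of the top-n ranked databases judged relevant for a query.

    ranked_db_ids   : database IDs in ranked order for one query
    relevant_db_ids : set of IDs with positive relevance judgments
    """
    top = ranked_db_ids[:n]
    if not top:
        return 0.0
    return sum(1 for db in top if db in relevant_db_ids) / len(top)

# Example using the deck's Sample Results query (shown later),
# where ranks 1, 2, 3, and 5 were judged relevant:
ranking = ["WUSTATES", "STSOEAST", "WSE", "AGADM2", "WZPUB"]
relevant = {"WUSTATES", "STSOEAST", "WSE", "WZPUB"}
print(precision_at_n(ranking, relevant, 5))  # 0.8
```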
DBS -- Phase 1: 100 Physical Collections
Cori_Net2 vs. LM (250 Queries)

[Plot: Precision (Percent) vs. Recall (Percent) [11-point]; two curves, Cori2_all and LM_all.]

This essentially represents the best from both methods for this Phase, with performance averaged over 250 queries. We see LM clearly outperforms CORI by > 10% at the first recall points, a result consistent with recent results in the document retrieval domain.
DBS -- Phase 2: 128 Logical Collections
Cori_Net2 vs. LM (100 Queries)

[Plot: Precision (Percent) vs. Recall (Percent) [11-point]; three curves, Baseline_11pt, Cori2_0.6_300_stem, and LM_Rand_500_1.]

When we move to the logical collections, we see a reversal in this relative performance. We include the baseline in this case because it's relatively closer to that of the two techniques. The average precision of the two may be similar, but CORI is significantly better than the other LM results here (Rand_1000 and QBS 500+1000).
DBS -- Phase 2: 128 Logical Collections
Enhanced Cori_Net3 (100 Queries)

[Plot: Precision (Percent) vs. Recall (Percent) [11-point]; four curves, baseline_11pt, Cori3_400_1.0, Cori3_400_1.0_Lex, and Cori3_400_1.0_Lex+.]

In this final plot, we explore a special post-process lexical analysis of queries for jurisdictionally relevant content: when no such context is found in a query, jurisdictionally biased collections are down-weighted. For results marked Lex, the process is applied only to queries with no jurisdictional clues. For results marked Lex+, the reranking is applied to all queries, but the DBs that match the lexical clues are left in their original ranks. (A sketch of this reranking step follows below.)
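A minimal sketch of the down-weighting step just described, covering the Lex variant; the clue lexicon, penalty factor, and data shapes are illustrative assumptions, not the system's implementation.

```python
# Hypothetical jurisdiction clue lexicon; the real system's lexicon of
# jurisdictional terms is not given in the presentation.
JURISDICTION_CLUES = {"california", "texas", "ninth circuit", "south carolina"}

def lex_rerank(query, scored_dbs, jurisdictional_dbs, penalty=0.5):
    """Down-weight jurisdictionally biased DBs when a query carries no
    jurisdictional context (the 'Lex' variant described above).

    scored_dbs        : list of (db_id, score) pairs from CORI
    jurisdictional_dbs: set of db_ids biased toward one jurisdiction
    penalty           : assumed multiplicative down-weight
    """
    q = query.lower()
    if any(clue in q for clue in JURISDICTION_CLUES):
        # Query names a jurisdiction: leave the CORI ranking as-is.
        return sorted(scored_dbs, key=lambda p: p[1], reverse=True)
    adjusted = [
        (db, score * penalty if db in jurisdictional_dbs else score)
        for db, score in scored_dbs
    ]
    return sorted(adjusted, key=lambda p: p[1], reverse=True)
```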
Performance Evaluation

• WIN using CORI scoring
  • Works better for Logical collections than Physical collections
  • Best results from randomly sampled DBs
• Language Modeling with basic smoothing
  • Performs best for Physical collections; less well for Logical
  • Top results from randomly sampled DBs

Precision at initial recall point:

                CORI     LM
Physical DBs    70%      80%
Logical DBs     85+%     70%

These results don't agree with Callan's, but he was operating in a non-cooperating environment.

• Jurisdictional Lexical Analysis contributes > 10% to average precision. As we saw, adding our post-process lexical analysis increased precision by over 10% at the top recall points.
Document-level Relevance

We took 25% of our Phase 2 queries and ran them against the top 5 CORI-ranked DBs, then evaluated the top 20 documents (2,500 docs total). This is what resulted; the "On Point" category surpasses the next three categories combined.

Relevance Category    Quantity   Percentage   % Relevant (Cumulative)
On Point              1415       56.60%       56.60%
Relevant              439        17.56%       74.16%
Marginally Relevant   199        7.96%        82.12%
Not Relevant          447        17.88%       --
Combined              2500       100.00%      82.12%
Conclusions

• WIN using CORI scoring is more effective than our current LM for environments that harness database profiling via sampling
• Language Modeling is more sensitive to sparse-data issues
• Post-process Lexical Analysis contributes significantly to performance
• Random-sampling Profile Creation outperforms Query-based sampling in the WL environment
Future Work

• Document Clustering (may show promise for domains in which we know much less about the preexisting doc structure)
  • Basis for new categories of databases
• Language Modeling (competing with the high performance achieved by CORI)
  • Harness robust smoothing techniques (simple, linear, smallest binomial, finite element, b-spline)
  • Measure contribution to logical DB performance
• Actual document-level relevance
  • Expand set of relevance judgments
  • Assess doc scores based on both DB + doc beliefs
• Bi-modal User Analysis
  • Complete automation vs. user interaction in DBS
Related Work

• L. Gravano, et al., Stanford (VLDB 1995)
  • Presented the GlOSS system to assist in the DB selection task
  • Used 'Goodness' as a measure of effectiveness
• J. French, et al., U. Virginia (SIGIR 1998)
  • Developed metrics to evaluate DB selection systems
  • Began to compare the effectiveness of different methods
• J. Callan, et al., UMass. (SIGIR '95, '99; CIKM 2000)
  • Developed the Collection Retrieval Inference Net (CORI)
  • Showed CORI was more effective than GlOSS, CVV, and others
DBS -- Phase 2: Enhanced Cori_Net3
128 Collections (100 Queries): Precision at N DBs

[Plot: Precision (Percent) vs. At N Databases Retrieved (N = 1 to 19); five curves, Baseline_P_Ndocs, Cori2_0.6_300_P_Ndocs, Cori3_400_1.0_P_Ndocs, Cori3_400_1.0_Lex_P_Ndocs, and Cori3_400_1.0_Lex+_P_Ndocs.]
Background

• Exponential growth of data sets on the Web and in commercial enterprises
• Limited means of narrowing the scope of searches to relevant databases
• Application challenges in large domain-specific operational environments
• Need effective approaches that scale and deliver in focused production systems
Sample Results

Query Number: 2
SOUTH CAROLINA APPELATE COURT RULES EXEMPTIONS

Rank  Database ID     Judgment      Description
 1    WUSTATES        Relevant      State Cases (Unreported)
 2    STSOEAST        Relevant      Statutes -- South East
 3    WSE             Relevant      Southeast Reporter (GA, NC, SC, VA, WV)
 4    AGADM2          Not Relevant  Attorney General Opinions
 5    WZPUB           Relevant      Source & Publication Authority Database
 6    ALR5            Relevant      American Law Reports
 7    STATENET-TRAC   Relevant      State Bills Tracking
 8    CBCTEXT         Not Relevant  Texts - Text & Treatises I (Clark Boardman Callahan)
 9    WLD3            Not Relevant  West Legal Directory
10    STULA           Not Relevant  Uniform Laws Annotated
11    WSCT            Not Relevant  Supreme Court Cases
12    STRULES         Relevant      Court State Rules