
Compressed Data Structures for
Annotated Web Search
Soumen Chakrabarti
Sasidhar Kasturi
Bharath Balakrishnan
Ganesh Ramakrishnan
Rohit Saraf
http://soumen.in/doc/CSAW/
Searching the annotated Web
▪ Search engines increasingly supplement “ten blue links” using a Web of objects
▪ From object catalogs like
• WordNet: basic types and common entities
• Wikipedia: millions of entities
• Freebase: tens of millions of entities
• Product catalogs, LinkedIn, IMDB, Zagat …
▪ Several new capabilities required
• Recognizing and disambiguating entity mentions
• Indexing these mentions along with text
• Query execution and entity ranking
Lemmas and entities
▪ In (Web) text, noisy and ambiguous lemmas are used to mention entities
▪ Lemma = word or phrase
▪ The lemma-to-entity relation is many-to-many
▪ Goal: given a mention in context, find the correct entity in the catalog, if any
▪ A lemma is also called a “leaf” because we use a trie to detect mention phrases
[Figure: many-to-many map from lemmas to entities. Lemmas such as “Michael Jordan”, “Jordan”, “Big Apple”, and “New York” map to entities such as the basketball player, the Berkeley professor, a country, a river, New York City (“the city that never sleeps”), and a state in the USA]
Features for disambiguation
[Figure: two example contexts for the lemma “Jordan” and the features extracted from each, drawn from millions of possible features. “After the UNC workshop, Jordan gave a tutorial on nonparametric Bayesian methods.” yields features such as after, UNC, workshop, tutorial, nonparametric, and Bayesian. “After a three-season career at UNC, Jordan emerged as a league star with his leaping ability and slam dunks.” yields features such as season, league, leap, slam, and dunk. Each context becomes a feature vector x]
Inferring the correct entity
▪ Each lemma is associated with a set of candidate entities
▪ For each lemma ℓ and each candidate entity e, learn a weight vector w(ℓ,e) in the same space as the feature vectors
▪ When deployed to resolve an ambiguity about lemma ℓ with context feature vector x, choose the e that maximizes w(ℓ,e) · x (a linear model; one dot product per candidate; sketch below)
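A minimal sketch of this scoring step in Java (illustrative names; dense float arrays stand in for w(ℓ,e) and x):

import java.util.Map;

// Minimal sketch: score each candidate entity with one dot product and keep the best.
class EntityScorer {
    // weights maps candidate entity ID -> learned vector w(l, e); x is the context feature vector.
    static int bestEntity(Map<Integer, float[]> weights, float[] x) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Integer, float[]> cand : weights.entrySet()) {
            double score = 0;
            float[] w = cand.getValue();
            for (int i = 0; i < x.length; i++) score += w[i] * x[i];   // w(l, e) . x
            if (score > bestScore) { bestScore = score; best = cand.getKey(); }
        }
        return best;   // entity with the largest linear score
    }
}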
The ℓ, f, e → w map
▪ An uncompressed key plus value takes 12 + 4 bytes = 128 bits per entry
▪ ~500M entries → 8GB just for the map
▪ No primitive type can hold the keys
▪ With Java overheads, easily 20GB RAM
• From ~2M to ~100M entities?
▪ Total marginal entropy: 33.6 bits per entry
▪ From 128 down to 33.6 bits and beyond?
▪ Must compress keys and values
▪ And exploit correlations between them
Lossy encoding: signed hash
[Figure: to insert a key (ℓ, f, e) with value w, hash function #1 chooses a bucket and hash function #2 chooses a sign ±1; ±w is accumulated into that hash bucket]
▪ No need to remember ℓ, f, e
▪ w cannot be easily compressed (all buckets have the same size for easy hash indexing)
▪ The sign hash ensures expected values are preserved (see the sketch below)
▪ Value distortion and its effect on disambiguation accuracy
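A minimal sketch of such a signed-hash map, assuming two illustrative hash mixers rather than the paper's actual hash functions, with the (ℓ, f, e) key already packed into a long:

// Sign-hash ("feature hashing") sketch: keys are never stored, only B float buckets.
class SignHashMap {
    private final float[] buckets;
    SignHashMap(int b) { buckets = new float[b]; }

    // Two cheap mixes of the same packed key; illustrative, not the paper's hashes.
    private int bucket(long key) { return (int) Long.remainderUnsigned(key * 0x9E3779B97F4A7C15L, buckets.length); }
    private int sign(long key)   { return ((Long.hashCode(key * 0xC2B2AE3D27D4EB4FL) & 1) == 0) ? 1 : -1; }

    // Insert: accumulate +/- w into the chosen bucket; (l, f, e) itself is forgotten.
    void add(long key, float w) { buckets[bucket(key)] += sign(key) * w; }

    // Read: multiplying by the same sign makes the expected value equal the true sum of w's.
    float get(long key) { return sign(key) * buckets[bucket(key)]; }
}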
“Training through the collisions”
▪ Linear multiclass SVM
• Each class e has a model vector w_e
• From a spot, generate feature vector x
• Predicted class (entity) is argmax_e w_e · x
▪ Map model and feature vectors into a sign-hash space with B buckets; the predicted class is then the argmax of the hashed dot product (sketched below)
▪ The hash map loses information; SVM training compensates for it
▪ Essential: training must go through the collisions
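A sketch of prediction over the hashed model, reusing the SignHashMap above; packKey is a hypothetical helper, not the paper's key layout:

// Sketch of prediction through the lossy sign-hash map.
class SignHashPredictor {
    static int predict(SignHashMap model, long l, long[] contextFeatures, long[] candidates) {
        long best = -1;
        float bestScore = Float.NEGATIVE_INFINITY;
        for (long e : candidates) {
            float score = 0;
            for (long f : contextFeatures) score += model.get(packKey(l, f, e));  // hashed w for (l, f, e)
            if (score > bestScore) { bestScore = score; best = e; }
        }
        return (int) best;
    }

    // Hypothetical packing of (l, f, e) into one long; the real system's key layout may differ.
    static long packKey(long l, long f, long e) { return (l * 1_000_003L + f) * 1_000_003L + e; }
}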
Lossless (ℓ, f ) → {e → w} organization
▪ When scanning documents for disambiguation, we first encounter a lemma ℓ and then features f from the context around it
▪ Initialize a score accumulator for each candidate entity e
▪ For each feature f in the context
• Probe the data structure with (ℓ, f )
• Retrieve the sparse map {e → w}
• For each entry in the map, update the entity scores
▪ Choose the top candidate entity (sketch below)
[Figure: the “LFE map” (LFEM): the block for lemma ℓ1 holds f1 → {e → w}, f2 → {e → w}, f3 → {e → w}, f4 → {e → w}]
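A minimal sketch of this loop, assuming a hypothetical Lfem.probe(ℓ, f) that returns the sparse {e → w} map, or null when (ℓ, f) is absent:

import java.util.HashMap;
import java.util.Map;

// Sketch of the LFEM scoring loop; Lfem.probe is a hypothetical lookup API.
class LfemScorer {
    interface Lfem { Map<Integer, Float> probe(int l, int f); }   // sparse {e -> w}, or null

    static int disambiguate(Lfem lfem, int l, int[] contextFeatures) {
        Map<Integer, Double> score = new HashMap<>();             // candidate entity -> accumulated score
        for (int f : contextFeatures) {
            Map<Integer, Float> posting = lfem.probe(l, f);       // one probe per (l, f) in the context
            if (posting == null) continue;
            for (Map.Entry<Integer, Float> ew : posting.entrySet())
                score.merge(ew.getKey(), (double) ew.getValue(), Double::sum);
        }
        return score.entrySet().stream()                          // choose the top candidate entity
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey).orElse(-1);               // -1 when nothing matched
    }
}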
Short entity IDs
[Figure: short entity IDs with respect to the lemma “Michael Jordan”; candidate entities are sorted by decreasing occurrence frequency in the reference corpus]

Short ID   Candidate entity
0          Basketball player
1          CBS, PepsiCo, Westinghouse exec
2          Machine learning researcher
3          Mycologist
4          Racing driver
5          Goalkeeper

▪ Millions of entities globally, but few for a given lemma
▪ Use variable-length integer codes
▪ The most frequent short ID gets the shortest code (sketch below)
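A small sketch of the per-lemma short-ID assignment, assuming a hypothetical global frequency table built from the reference corpus:

import java.util.Comparator;
import java.util.List;

// Sketch: give a lemma's candidates short IDs 0, 1, 2, ... in decreasing corpus frequency,
// so that the short IDs used most often get the shortest variable-length codes.
class ShortIdAssigner {
    static int[] shortIds(List<Integer> candidateEntities, long[] corpusFrequency) {
        return candidateEntities.stream()
                .sorted(Comparator.comparingLong((Integer e) -> -corpusFrequency[e]))
                .mapToInt(Integer::intValue)
                .toArray();   // position in this array = short ID, value = global entity ID
    }
}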
Encoding of (ℓ, f ) → {e → w}
[Figure: an index points to the start of the compressed segment for each lemma ID ℓ1, ℓ2, …; within a segment, records for f1, f2, … store entities as short IDs]
• We used the γ code; others may be better (sketch below)
• For adjacent short IDs, we spend only one bit
• Irregular record sizes → must read from the beginning to decompress
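A minimal Elias γ coder over a BitSet, as a generic illustration rather than the paper's exact bit layout; it shows why a gap of 1 between adjacent short IDs costs a single bit:

import java.util.BitSet;

// Minimal Elias gamma coder: x >= 1 is written as floor(log2 x) zeros followed by x in binary.
class GammaCoder {
    final BitSet bits = new BitSet();
    int len = 0;                                        // number of bits written so far

    void encode(int x) {
        int n = 31 - Integer.numberOfLeadingZeros(x);   // floor(log2 x)
        len += n;                                       // n zero bits (BitSet defaults to 0)
        for (int i = n; i >= 0; i--, len++)
            if (((x >>> i) & 1) == 1) bits.set(len);    // binary of x, leading 1 first
    }

    // Decode the value starting at bit position pos[0]; advances pos[0] past it.
    int decode(int[] pos) {
        int n = 0;
        while (!bits.get(pos[0])) { n++; pos[0]++; }    // count leading zeros
        int x = 0;
        for (int i = 0; i <= n; i++, pos[0]++) x = (x << 1) | (bits.get(pos[0]) ? 1 : 0);
        return x;
    }
}

For example, encoding the ID gaps 1, 1, 3 writes the bits 1, 1, 011: five bits in total, with each unit gap costing one bit.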
Random access on (ℓ, f )
▪ We already support random access on ℓ
▪ The number of distinct ℓ is on the order of 10 million
▪ Cannot afford the time to decompress from the beginning of an ℓ block
▪ Cannot afford a (full) index array for (ℓ, f )
▪ Within each ℓ block, allocate sync points
▪ An old technique in IR indexing
▪ New issues:
• Outer allocation of the total sync budget among ℓ blocks
• Tuning syncs to the measured (ℓ, f ) probe distribution (inner allocation; sketch below)
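A sketch of a probe through sync points, assuming a hypothetical sync directory per ℓ block (feature ID and reader position per sync) with sequential decoding in between:

// Sketch of a probe using sync points; SegmentReader and LemmaBlock are hypothetical stand-ins
// for the compressed block layout, which the paper specifies in more detail.
class SyncProbe {
    interface SegmentReader {
        int skipToNextFeature();        // returns Integer.MAX_VALUE at the end of the block
        float[] decodeEntityWeights();  // decodes the current record's {e -> w}
    }
    static class LemmaBlock {
        int[] syncFeature;              // feature ID at each sync point, sorted
        SegmentReader readerAt(int sync) { return null; }  // seek to that sync's bit offset
    }

    static float[] probe(LemmaBlock block, int f) {
        // Binary search for the last sync point whose feature ID is <= f.
        int lo = 0, hi = block.syncFeature.length - 1, s = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (block.syncFeature[mid] <= f) { s = mid; lo = mid + 1; } else hi = mid - 1;
        }
        // Decode sequentially from the sync point until f is reached or passed.
        SegmentReader r = block.readerAt(s);
        for (int cur = block.syncFeature[s]; cur <= f; cur = r.skipToNextFeature())
            if (cur == f) return r.decodeEntityWeights();
        return null;                    // (l, f) not present in this block
    }
}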
Inner sync point allocation policies
▪ Say Kℓ sync points are budgeted to lemma ℓ
▪ To which features can we seek?
▪ For the others, sequential decode
▪ DynProg: optimal expected probe time via a dynamic program
▪ Freq: allocate syncs at the f with the largest probe probability p(f | ℓ)
▪ Equi: measure off segments with about an equal number of bits (sketch below)
▪ EquiAndFreq: split the budget between the two
[Figure: records f1 … f5 within an ℓ block; each policy places its sync points at different records]
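A sketch of the Equi policy, assuming the compressed size in bits of every record in the lemma block is known:

import java.util.ArrayList;
import java.util.List;

// Sketch of the Equi inner policy: walk a lemma block's records and drop a sync point
// roughly every (total bits / K) bits, so synced segments hold about equal numbers of bits.
class EquiAllocator {
    static List<Integer> equiSyncs(long[] recordBits, int k) {
        long total = 0;
        for (long b : recordBits) total += b;
        long stride = Math.max(1, total / k);    // target bit distance between syncs
        List<Integer> syncs = new ArrayList<>();
        long sinceLast = stride;                 // forces a sync at the first record
        for (int i = 0; i < recordBits.length && syncs.size() < k; i++) {
            if (sinceLast >= stride) { syncs.add(i); sinceLast = 0; }
            sinceLast += recordBits[i];
        }
        return syncs;                            // record indices chosen as sync points
    }
}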
Outer allocation policies
▪ Given an overall budget K, how many syncs Kℓ does leaf ℓ get?
• Hit probability pℓ, bits in leaf segment bℓ
▪ An analytical expression for the effect of inner allocation can be intractable
▪ Hit: Kℓ ∝ pℓ
▪ HitBit: Kℓ ∝ pℓ bℓ
▪ SqrtHitBit: assuming an equispaced inner allocation, allocate Kℓ ∝ √(pℓ bℓ) (sketch below)
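A sketch of the proportional outer policies, assuming per-leaf hit probabilities pℓ and segment sizes bℓ have been measured:

// Sketch of the proportional outer policies; weights are normalized so allocations sum to
// roughly the budget K (rounding is glossed over here).
class OuterAllocator {
    static int[] allocate(double[] hitProb, double[] segmentBits, int budget, String policy) {
        int n = hitProb.length;
        double[] weight = new double[n];
        double total = 0;
        for (int l = 0; l < n; l++) {
            switch (policy) {
                case "Hit":        weight[l] = hitProb[l]; break;
                case "HitBit":     weight[l] = hitProb[l] * segmentBits[l]; break;
                case "SqrtHitBit": weight[l] = Math.sqrt(hitProb[l] * segmentBits[l]); break;
            }
            total += weight[l];
        }
        int[] syncs = new int[n];
        for (int l = 0; l < n; l++)
            syncs[l] = (int) Math.round(budget * weight[l] / total);
        return syncs;                 // K_l for each leaf l
    }
}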
Experiments
[Figure: experimental pipeline connecting the corpus, a sampler (“reference”), spotters over train and test folds producing train and test contexts and their (ℓ, f ) workloads, a disambiguation trainer and cross-validator producing the ℓ, f → (e, w) model (“payload”), a smoother producing a smoothed (ℓ, f ) distribution, the compressor that builds the L-F-E map, the annotator producing entity and type annotations, and the indexer that builds the annotation index]
▪ 500 million pages, mostly English, spam-free
▪ Catalog has about two million lemmas and entities
▪ Our best policy compresses the LFEM down to only 18 bits/entry, compared to 33.6 bits/entry marginal entropy and 128 bits/entry of raw data
Inner policies compared
[Figure: two plots of lookup cost (µs; lower is better) for DynProg, Equi, EquiAndFreq, and Freq. Left: cost against the sync budget (5×10^5 to 3×10^6). Right: cost against the number of sampled documents (2M to 10M)]
▪ Equi is close to the optimal DynProg but fast to compute
▪ Freq is surprisingly bad: the long tail
▪ Blending Equi and Freq is worse than Equi alone
▪ The relative order is stable as the sample size increases: the long tail again
Diagnosis: Freq vs. Equi
[Figure: cumulative seek cost against feature ID (0 to 8×10^6) under Freq (left, peaks near 20000) and Equi (right, peaks near 400); note the different scales]
▪ Plots show the cumulative seek cost starting at a sync
• It collapses back to zero at the next sync
▪ The features with the largest frequencies are not evenly placed
▪ Tail features in between lead to steep seek costs
▪ Equi never lets the seek cost get out of hand
▪ (How about permuting features? See the paper)
Outer policies compared
[Figure: probe cost (µs) against the sync budget (5×10^5 to 3×10^6) for Bit, HitBit, and SqrtHitBit]
▪ Inner policy set to the best (DynProg)
▪ SqrtHitBit is better than Bit, which is better than HitBit
▪ Not surprising, given that DynProg behaves closer to Equi than to Freq
SignHash, no training through collisions
▪ Build w from separate lossless training
▪ Read back the distorted w from SignHash
▪ Most model values are severely distorted
▪ Give lossless and SignHash the same RAM
▪ Most keys collide
▪ Completely unacceptable accuracy (random guessing is far better)
[Figure: left, relative rank distortion of model values for bucket counts from 4×10^8 to 2.1×10^9; right, disambiguation accuracy and collision rate against the number of hash buckets (400 to 1900 million)]
SignHash, training through collisions
▪ Used PEGASOS stochastic gradient descent for training
▪ 77% of spots have the label “NA” (no annotation)
▪ 23% error by choosing NA for all spots
▪ 11% error via lossless LFEM
▪ SignHash given the same RAM as LFEM
▪ 18% error via SignHash
▪ Much better than no training
▪ But a lot worse than lossless LFEM
▪ Surprising, given that LFEM currently uses plain old naïve Bayes
Comparison with other systems
[Figure: per-page annotation time (ms) against the number of spots per page, on log-log axes, for Spotlight, WMiner, Zemanta, and LFEM]

System      ms/spot
Spotlight   158
WMiner      21
Zemanta     9.5
LFEM        0.6
LFEM-sync   42

▪ Downloaded software or network services
▪ Regression removes per-page and per-token overhead
▪ LFEM wins, largely because of syncs
▪ LFEM RAM << downloaded software
Conclusion
▪ Compressed in-memory multilevel maps for disambiguation
▪ Random access via tuned sync allocation
▪ >20 GB down to 1.15 GB
▪ Faster than public disambiguation systems
▪ Annotate 500M pages with 2M Wikipedia entities, plus indexing, on 408 cores in ~18 hours
▪ Sparse models for better storage?
▪ Also in the paper: design of the compressed annotation index posting lists