Transcript mg2009 7776

Overcoming the L_1 Non-Embeddability Barrier
Robert Krauthgamer (Weizmann Institute)
Joint work with Alexandr Andoni and Piotr Indyk (MIT)
Algorithms on Metric Spaces


Fix a metric M
Hamming distance
Ulam metric
Earthmover distance
ED(x,y) = minimum number of edit operations that transform x into y.
edit operation = insert/delete/substitute a character
Example: ED(0101010, 1010101) = 2
…

Fix a computational problem
Compute distance between x,y
Nearest Neighbor Search: preprocess n strings so that, given a query string, one can find the closest string to it.
…

Solve problem under M
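The definition of ED can be checked on small inputs with the textbook dynamic program. This is only a minimal sketch for illustration; the function name `edit_distance` is an assumption and does not come from the talk.

```python
def edit_distance(x, y):
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

print(edit_distance("0101010", "1010101"))  # 2, matching the slide's example
```

The quadratic running time is fine for toy strings; the talk is about avoiding exactly this kind of pairwise computation at scale.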
Motivation for Nearest Neighbor

Many applications:



Image search (Euclidean dist, Earth-mover dist)
Processing of genetic information, text processing (edit dist.)
many others…
Generic Search Engine
A General Tool: Embeddings

An embedding of M into a host metric (H, d_H) is a map f : M → H that
preserves distances approximately.
It has distortion A ≥ 1 if for all x, y ∈ M,
d_M(x,y) ≤ d_H(f(x),f(y)) ≤ A·d_M(x,y)

Why?
If H is “easy” (= can efficiently solve computational problems like NNS),
then we get good algorithms for the original space M!
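For a finite point set, the distortion of a given map can be computed directly from the definition. The sketch below is an illustration under assumptions: the helper name `distortion` is hypothetical, f is assumed injective, and f is allowed to be rescaled so that the left inequality holds, which makes A the ratio of the largest to the smallest expansion over all pairs.

```python
from itertools import combinations

def distortion(points, d_M, d_H, f):
    """Smallest A with d_M <= c * d_H(f(.), f(.)) <= A * d_M for some scaling c > 0."""
    ratios = [d_H(f(x), f(y)) / d_M(x, y) for x, y in combinations(points, 2)]
    return max(ratios) / min(ratios)

def absdiff(a, b):
    return abs(a - b)

# Toy example: three points on the line, mapped by x -> x*x.
# Pairwise expansions are 1, 2 and 3, so the distortion is 3.
print(distortion([0, 1, 2], absdiff, absdiff, lambda x: x * x))  # 3.0
```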
Host space?
Popular target metric: ℓ1
ℓ1 = real space with d_1(x,y) = ∑_i |x_i - y_i|

Have efficient algorithms:
Distance estimation: O(d) for d-dimensional space (often less)
NNS: c-approx with O(n^(1/c)) query time and O(n^(1+1/c)) space [IM98]

Powerful enough for some things…
Metric | References (upper bound; lower bound) | Upper bound | Lower bound
Edit distance over {0,1}^d | [OR05]; [KN05, KR06, AK07] | 2^Õ(√log d) | Ω(log d)
Ulam (= edit distance over permutations) | [CK06]; [AK07] | O(log d) | Ω̃(log d)
Block edit distance over {0,1}^d | [MS00, CM07]; [Cor03] | Õ(log d) | 4/3
Earthmover distance in ℝ^2 (sets of size s) | [Cha02, IT03]; [NS07] | O(log s) | Ω(log^(1/2) s)
Earthmover distance in {0,1}^d (sets of size s) | [AIK08]; [KN05] | O(log s · log d) | Ω(log s)
Below logarithmic?

Cannot work with ℓ1

Other possibilities?

(ℓ2)^p is bigger and algorithmically tractable,
but not rich enough (often same lower bounds)
(ℓ2)^p = real space with dist_{2^p}(x,y) = ||x-y||_2^p

ℓ∞ is rich (includes all metrics),
but usually not computationally efficient (high dimension)
ℓ∞ = real space with dist_∞(x,y) = max_i |x_i - y_i|

And that’s roughly it… (at least for efficient NNS)
Meet our new host

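Iterated product space:
$\mathrm{P}_{2^2,\infty,1} = \bigoplus_{(\ell_2)^2}^{\gamma} \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$

$\ell_1^{\alpha}$ with distance $d_1$:
$x = (x_1, \ldots, x_\alpha) \in \mathbb{R}^{\alpha}$
$d_1(x,y) = \sum_{i=1}^{\alpha} |x_i - y_i|$

$\bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$ with distance $d_{\infty,1}$:
$x = (x_1, \ldots, x_\beta) \in \ell_1^{\alpha} \times \ell_1^{\alpha} \times \cdots \times \ell_1^{\alpha}$
$d_{\infty,1}(x,y) = \max_{i=1}^{\beta} d_1(x_i, y_i)$

$\bigoplus_{(\ell_2)^2}^{\gamma} \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$ with distance $d_{2^2,\infty,1}$:
$x = (x_1, \ldots, x_\gamma) \in \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha} \times \cdots \times \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$
$d_{2^2,\infty,1}(x,y) = \sum_{i=1}^{\gamma} \left( d_{\infty,1}(x_i, y_i) \right)^2$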
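The three nested distances translate almost literally into code. The following is a small sketch, assuming a point is stored as a list of γ blocks, each block a list of β rows, each row a list of α reals; the function names are illustrative only.

```python
def d1(x, y):
    """l1 distance between two vectors of alpha reals."""
    return sum(abs(a - b) for a, b in zip(x, y))

def d_inf_1(x, y):
    """l_infinity product: maximum of the l1 distances over the beta rows."""
    return max(d1(xi, yi) for xi, yi in zip(x, y))

def d_22_inf_1(x, y):
    """(l2)^2 product: sum of squared d_inf_1 distances over the gamma blocks."""
    return sum(d_inf_1(xi, yi) ** 2 for xi, yi in zip(x, y))

# gamma = beta = alpha = 2
x = [[[1, 0], [0, 0]], [[0, 0], [2, 2]]]
y = [[[0, 0], [0, 0]], [[0, 0], [0, 0]]]
print(d_22_inf_1(x, y))  # block maxima are 1 and 4, so 1**2 + 4**2 = 17
```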
Why $\mathrm{P}_{2^2,\infty,1} = \bigoplus_{(\ell_2)^2}^{\gamma} \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$?

Because we can…

Rich:
Theorem 1. Ulam embeds into $\mathrm{P}_{2^2,\infty,1}$ with O(1) distortion
Dimensions (γ,β,α) = (d, log d, d)

Algorithmically tractable:
Theorem 2. $\mathrm{P}_{2^2,\infty,1}$ admits NNS on n points with
O(log log n) approximation
O(n^ε) query time and O(n^(1+ε)) space

In fact, there is more for Ulam…
Our Algorithms for Ulam

Ulam = edit on strings where each symbol appears at most once
Example: ED(1234567, 7123456) = 2
A classical distance between rankings
Exhibits hardness of misalignments (as in general edit)
All lower bounds same as for general edit (up to Θ̃() )
Distortion of embedding into ℓ1 (and (ℓ2)^p, etc.): Θ̃(log d)
If we ever hope for approximation << log d for NNS under general edit,
first we have to get it under Ulam!

Our approach implies new algorithms for Ulam:
1. NNS with O(log log n) approx, O(n^ε) query time
Can improve to O(log log d) approx
2. Sketching with O(1)-approx in log^O(1) d space
3. Distance estimation with O(1)-approx in time
[BEKMRRS03]: when ED ≈ d, approx d^ε in O(d^(1-2ε)) time
Theorem 1

Theorem 1. Can embed Ulam into $\mathrm{P}_{2^2,\infty,1} = \bigoplus_{(\ell_2)^2}^{\gamma} \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$ with O(1) distortion
Dimensions (γ,β,α) = (d, log d, d)

Proof
“Geometrization” of Ulam characterizations
Previously studied in the context of testing monotonicity (sortedness):
Sublinear algorithms [EKKRV98, ACCL04]
Data-stream algorithms [GJKK07, GG07, EH08]
Thm 1: Characterizing Ulam

Consider permutations x,y over [d]


Idea:



Assume for now: x = identity permutation
Count # chars in y to delete to obtain increasing sequence (≈ Ulam(x,y))
Call them faulty characters
Issues:


Ambiguity…
How do we count them?
Examples:
x = 123456789, y = 234657891
x = 123456789, y = 341256789
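With x the identity, the number of characters to delete from y to leave an increasing sequence is d minus the length of a longest increasing subsequence. The sketch below uses the standard patience-sorting method; the function name is an assumption for illustration.

```python
from bisect import bisect_left

def deletions_to_sorted(y):
    """Number of symbols to delete from y so that the rest is increasing."""
    tails = []  # tails[i] = smallest last element of an increasing subsequence of length i + 1
    for a in y:
        i = bisect_left(tails, a)
        if i == len(tails):
            tails.append(a)
        else:
            tails[i] = a
    return len(y) - len(tails)

print(deletions_to_sorted([2, 3, 4, 6, 5, 7, 8, 9, 1]))  # 2
print(deletions_to_sorted([3, 4, 1, 2, 5, 6, 7, 8, 9]))  # 2
```

In the second example either {3,4} or {1,2} can be deleted, which illustrates the ambiguity issue: the count is well defined, but the set of faulty characters is not.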
Thm 1: Characterization – inversions

Definition: chars a<b form an inversion if b precedes a in y

How to identify a faulty char?
Has an inversion?
Doesn’t work: all chars might have an inversion
Has many inversions?
Still can miss “faulty” chars
Has many inversions locally?
Same problem
Check if either is true!

Examples:
x = 123456789, y = 234567891
x = 123456789, y = 213456798
x = 123456789, y = 567981234
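To see why "has an inversion" is too weak a test, one can count, for every character, how many inversions it takes part in. This is a brute-force sketch with x assumed to be the identity; the name `inversion_counts` is hypothetical.

```python
def inversion_counts(y):
    """For each symbol of y, the number of inversions it participates in."""
    counts = {a: 0 for a in y}
    for i, b in enumerate(y):
        for a in y[i + 1:]:
            if a < b:  # b precedes a although a < b: an inversion
                counts[a] += 1
                counts[b] += 1
    return counts

counts = inversion_counts([2, 3, 4, 5, 6, 7, 8, 9, 1])
print(counts[1])                                  # 8
print(all(c >= 1 for c in counts.values()))       # True
```

In y = 234567891 every character is inverted with respect to 1, yet deleting the single character 1 already leaves a sorted sequence.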
Thm 1: Characterization – faulty chars

Definition 1: a is faulty if there exists K>0 s.t.
a is inverted w.r.t. a majority of the K symbols preceding a in y
(ok to consider K=2^k)

Lemma [ACCL04, GJKK07]: # faulty chars = Θ(Ulam(x,y)).

Example: x = 123456789, y = 234567891
4 characters preceding 1 (all inversions with 1)
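Definition 1 can be turned into a direct check. The sketch below makes a few assumptions that are not stated on the slide: x is the identity, only window sizes K = 2^k are tried, windows longer than the number of preceding symbols are skipped, and "majority" means strictly more than K/2.

```python
def faulty_chars(y):
    """Symbols a of y inverted w.r.t. a majority of the K symbols preceding a, for some K = 2^k."""
    faulty = set()
    for pos, a in enumerate(y):
        K = 1
        while K <= pos:  # assumption: only windows that fit before a
            window = y[pos - K:pos]
            inverted = sum(1 for b in window if b > a)  # b precedes a and b > a
            if 2 * inverted > K:
                faulty.add(a)
                break
            K *= 2
    return faulty

print(faulty_chars([2, 3, 4, 5, 6, 7, 8, 9, 1]))  # {1}
```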
Thm 1: Characterization → Embedding

To get embedding, need:
1. Symmetrization (neither string is identity)
2. Deal with “exists”, “majority”…?

To resolve (1), use instead X[a;K] …
Example: x = 123456789 with X[5;4], y = 123467895 with Y[5;4]

Definition 2: a is faulty if there exists K=2^k such that
|X[a;2^k] Δ Y[a;2^k]| > 2^k (symmetric difference)
Equivalently: $\left\| 1_{X[a;2^k]} - 1_{Y[a;2^k]} \right\|_1 > 2^k$
E.g. $1_{X[5;2^2]} = (1,1,1,1,0,0,0,0,0)$
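A sketch of the symmetric test in Definition 2. It assumes, consistent with the indicator example for X[5;4], that X[a;K] is the set of (up to) K symbols immediately preceding a in x, and that k ranges over 1, ..., ⌈log2 d⌉; both the helper names and the truncation at the start of the string are illustrative choices.

```python
import math

def preceding(perm, a, K):
    """Set of (up to) K symbols immediately preceding symbol a in perm."""
    pos = perm.index(a)
    return set(perm[max(0, pos - K):pos])

def is_faulty(x, y, a):
    """True if |X[a;2^k] symmetric-difference Y[a;2^k]| > 2^k for some k."""
    d = len(x)
    for k in range(1, math.ceil(math.log2(d)) + 1):
        K = 2 ** k
        if len(preceding(x, a, K) ^ preceding(y, a, K)) > K:
            return True
    return False

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]
print(preceding(x, 5, 4))   # {1, 2, 3, 4}
print(is_faulty(x, y, 5))   # True
```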
Thm 1: Embedding – final step

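Example: x = 123456789 with X[5;2^2], y = 123467895 with Y[5;2^2]

We have
$\mathrm{Ulam}(x,y) \approx \sum_{a=1}^{d} \max_{k=1 \ldots \log d} \left[ \left\| 1_{X[a;2^k]} - 1_{Y[a;2^k]} \right\|_1 > 2^k \right]$
(the bracket equals 1 iff true)

Replace by weight?
$\mathrm{Ulam}(x,y) \approx \sum_{a=1}^{d} \left( \max_{k=1 \ldots \log d} \frac{\left\| 1_{X[a;2^k]} - 1_{Y[a;2^k]} \right\|_1}{2 \cdot 2^k} \right)^2$

Final embedding:
$f(x) = \left( \left( \frac{1}{2 \cdot 2^k} 1_{X[a;2^k]} \right)_{k=1 \ldots \log d} \right)_{a=1 \ldots d} \in \bigoplus_{(\ell_2)^2}^{d} \bigoplus_{\ell_\infty}^{\log d} \ell_1^{d}$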
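The final embedding can be written out explicitly for small permutations. This is a sketch under assumptions: symbols are 1..d, X[a;K] is taken to be the (up to) K symbols immediately preceding a, k runs over 1, ..., ⌈log2 d⌉, and all function names are illustrative.

```python
import math

def preceding(perm, a, K):
    """Set of (up to) K symbols immediately preceding symbol a in perm."""
    pos = perm.index(a)
    return set(perm[max(0, pos - K):pos])

def embed(perm):
    """f(perm): for each symbol a and scale k, the indicator of X[a;2^k] scaled by 1/(2*2^k)."""
    d = len(perm)
    levels = max(1, math.ceil(math.log2(d)))
    image = []
    for a in range(1, d + 1):                     # outer (l2)^2 product, d blocks
        block = []
        for k in range(1, levels + 1):            # middle l_infinity product
            S = preceding(perm, a, 2 ** k)
            scale = 1.0 / (2 * 2 ** k)
            block.append([scale if s in S else 0.0 for s in range(1, d + 1)])  # inner l1^d
        image.append(block)
    return image

def product_distance(u, v):
    """Sum over blocks of (max over rows of the l1 distance) squared."""
    total = 0.0
    for bu, bv in zip(u, v):
        inner = max(sum(abs(p - q) for p, q in zip(ru, rv)) for ru, rv in zip(bu, bv))
        total += inner ** 2
    return total

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]
print(product_distance(embed(x), embed(y)))
```

The printed value is the quantity that the O(1)-distortion claim compares against Ulam(x,y); this toy code makes no attempt to check the constants.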
Theorem 2

Theorem 2. $\mathrm{P}_{2^2,\infty,1} = \bigoplus_{(\ell_2)^2}^{\gamma} \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$ admits NNS on n points with
O(log log n) approximation
O(n^ε) query time and O(n^(1+ε)) space for any small ε
(ignoring (αβγ)^O(1))

A rather general approach
“LSH” on ℓ1-products of general metric spaces
Of course, cannot do, but can reduce to ℓ∞-products
Thm 2: Proof

Let’s start from basics: $\ell_1^{\alpha}$
[IM98]: c-approx with O(n^(1/c)) query time and O(n^(1+1/c)) space
(ignoring α^O(1))

Ok, what about $\bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$?
[I02] Suppose: NNS for M with
• c_M-approx
• Q_M query time
• S_M space.
Then: NNS for $\bigoplus_{\ell_\infty}^{\beta} M$ with
• O(c_M · log log n)-approx
• Õ(Q_M) query time
• O(S_M · n^(1+ε)) space.
Thm 2: What about (ℓ2)^2-product?

Enough to consider $\bigoplus_{\ell_1}^{\gamma} M$
(for us, M is the ℓ∞-product)

Off-the-shelf?
[I04]: gives space ~n or >log n approximation
We reduce to multiple NNS queries under $\bigoplus_{\ell_\infty}^{\gamma} M$

Instructive to first look at NNS for standard ℓ1 …
Thm 2: Review of NNS for ℓ1


LSH family: collection H of hash functions such that:
For random h ∈ H (parameter w > 0)
Pr[h(q)=h(p)] ≈ 1 - ||q-p||_1 / w

Query just uses primitive:
“return all points p such that h(q)=h(p)”

Can obtain H by imposing a randomly-shifted grid of side-length w
Then for h defined by r_i ∈ [0, w] at random, primitive becomes:
“return all p s.t. |q_i-p_i| < r_i for all i ∈ [d]”
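A randomly shifted grid is easy to simulate. The sketch below is a toy illustration and not the data structure from the talk: the side-length is called `w` here as an assumed name, each hash function shifts every coordinate by a uniform offset in [0, w] and returns the grid cell, and `collision_rate` estimates Pr[h(q)=h(p)] empirically.

```python
import math
import random

def make_grid_hash(dim, w, rng):
    """One hash function: the cell of a grid of side-length w with a random shift."""
    shifts = [rng.uniform(0.0, w) for _ in range(dim)]
    def h(p):
        return tuple(math.floor((p[i] + shifts[i]) / w) for i in range(dim))
    return h

def collision_rate(q, p, w, trials=2000, seed=0):
    """Fraction of random grids in which q and p fall into the same cell."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_grid_hash(len(q), w, rng)
        hits += h(q) == h(p)
    return hits / trials

# In one dimension the collision probability is exactly 1 - |q - p| / w = 0.75 here.
print(collision_rate((0.0,), (1.0,), w=4.0))
```

In higher dimensions the exact collision probability is the product of the per-coordinate terms, which is close to 1 - ||q-p||_1 / w when the distance is small relative to w.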
Thm 2: LSH for ℓ1-product


Intuition: abstract LSH!
Recall we had: for r_i random from [0, w], point p returned if for all i: |q_i-p_i| < r_i
Equivalently: $\max_i \frac{1}{r_i} |q_i - p_i| < 1$, an ℓ∞ product of ℝ!

For ℓ1: “return all p s.t. |q_i-p_i| < r_i for all i ∈ [d]”
For $\bigoplus_{\ell_1}^{\gamma} M$: “return all points p such that max_i d_M(q_i,p_i)/r_i < 1”
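The abstract primitive only needs a distance function for M. The following brute-force scan is a sketch of what the primitive returns, not of how the talk answers it efficiently; the names `primitive` and `l1`, and the choice of M = ℓ1 in the example, are assumptions for illustration.

```python
def l1(u, v):
    """l1 distance, used here as a stand-in for the metric d_M."""
    return sum(abs(a - b) for a, b in zip(u, v))

def primitive(points, q, r, d_M=l1):
    """Return all p with max_i d_M(q_i, p_i) / r_i < 1 (linear scan over points)."""
    return [p for p in points
            if max(d_M(qi, pi) / ri for qi, pi, ri in zip(q, p, r)) < 1]

q = ((0, 0), (0, 0))          # a query with two coordinates, each a point of M
p1 = ((1, 0), (0, 1))         # both coordinates at distance 1 from q
p2 = ((3, 0), (0, 0))         # first coordinate at distance 3 from q
print(primitive([p1, p2], q, r=(2.0, 2.0)))  # only p1 passes the thresholds
```

The condition is a weighted ℓ∞-product query over M, which is what lets the ℓ∞-product NNS result be applied.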
Thm 2: Final
Thus, sufficient to solve primitive:
For $\bigoplus_{\ell_1}^{\gamma} M$: “return all points p such that max_i d_M(q_i,p_i)/r_i < 1”
(in fact, for k independent choices of (r_1,…,r_d))

We reduced NNS over $\bigoplus_{\ell_1}^{\gamma} M$
to several instances of NNS over $\bigoplus_{\ell_\infty}^{\gamma k} M$
(with appropriately scaled coordinates)

Approximation is O(1)·O(log log n)
Done!
Take-home message:

Can embed combinatorial metrics into iterated product spaces
$\bigoplus_{(\ell_2)^2}^{\gamma} \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$
Works for Ulam (= edit on non-repetitive strings)
Approach bypasses non-embeddability results into usual-suspect spaces like ℓ1, (ℓ2)^2 …

Open:
Embeddings for edit over {0,1}^d, EMD, other metrics?
Understanding product spaces?
[Jayram-Woodruff]: sketching