LEARNING THE CROWD KERNEL
Adam Tauman Kalai - MSR
Joint work with Serge Belongie (UCSD), Ce Liu (MSR), Ohad Shamir (MSR), and Omer Tamuz (Weizmann/MSR)
[Chart: large datasets vs. # problem domains]
Each domain requires expertise in the form of specific features/kernels.
1. INPUT
Database of n objects, say images.
2. CROWD QUERIES
Adaptively chosen / randomly chosen.
3. OUTPUT
Embedding in ℝ^d.
NEAREST NEIGHBORS
Works on any image set.
Ideal system: all nearest neighbors in O(n log n) comparisons?
That's n noisy sorting problems with O(log n) comparisons/item.
LURE OF ADAPTIVITY
Toy example: complete binary trees with n leaves, depth O(log n).
Avg. cost is Θ(n) from random queries.
Avg. cost is Θ(log n) from adaptive queries.
[Example tree: Tie store → Scarves, Tie clips, Bow ties, Neck ties]
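The adaptivity gap on this toy example is easy to simulate. Below is a minimal sketch (my own construction, not from the talk): leaves of a depth-6 complete binary tree, a simulated crowd that answers which of two leaves is closer under the tree metric, and query counts for adaptive descent versus uniformly random comparisons. Adaptive localization uses exactly log2(n) queries.

```python
import random

DEPTH = 6                  # complete binary tree with n = 2^DEPTH leaves
N = 2 ** DEPTH

def lca_depth(x, y):
    """Length of the common prefix of the DEPTH-bit labels of leaves x and y
    (= depth of their lowest common ancestor)."""
    d = 0
    for i in range(DEPTH - 1, -1, -1):
        if (x >> i) & 1 != (y >> i) & 1:
            break
        d += 1
    return d

def answer(x, a, b):
    """Simulated crowd answer: +1 if x is more similar to a, -1 if to b, 0 on a tie."""
    da, db = lca_depth(x, a), lca_depth(x, b)
    return (da > db) - (da < db)

def adaptive_queries(x):
    """Locate x by descending the tree: one comparison per level, log2(n) total."""
    lo, hi, q = 0, N, 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        q += 1
        if answer(x, lo, mid) > 0:   # compare reps of the left and right halves
            hi = mid
        else:
            lo = mid
    return q

def random_queries(x, rng):
    """Locate x by filtering candidates with uniformly random comparisons."""
    cands, q = set(range(N)), 0
    while len(cands) > 1:
        a, b = rng.sample(range(N), 2)
        q += 1
        ans = answer(x, a, b)
        cands = {c for c in cands if answer(c, a, b) == ans}
    return q

rng = random.Random(0)
avg_random = sum(random_queries(rng.randrange(N), rng) for _ in range(10)) / 10
```

Random comparisons rarely probe the deep splits of the tree (distinguishing a leaf from its sibling needs the pair to hit that sibling), which is where the Θ(n) versus Θ(log n) gap comes from.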
EMBEDDINGS AND KERNELS
Embedding of the n objects into ℝ^d for some d ≤ n.
Kernel K ∈ ℝ^{n×n}, K_ij = x_i · x_j.
[Matrix of pairwise inner products x_i · x_j]
K ⪰ 0. Assume ||x_i||² = 1.
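A quick numerical check of this setup, as a sketch with synthetic unit-norm embeddings (the data is invented, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                         # 5 objects embedded in R^3
X /= np.linalg.norm(X, axis=1, keepdims=True)       # enforce ||x_i||^2 = 1

K = X @ X.T                                         # kernel: K_ij = x_i . x_j

assert np.allclose(np.diag(K), 1.0)                 # unit diagonal
assert np.linalg.eigvalsh(K).min() > -1e-9          # K is PSD (up to rounding)
```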
EMBEDDINGS AND KERNELS
Embedding of the n objects into ℝ^d for some d ≤ n.
Kernel K ∈ ℝ^{n×n}, K_ij = x_i · x_j. K ⪰ 0.
The convex set {K ⪰ 0 ∧ K_ii = 1} generalizes. [Srebro & Shraibman, COLT '05]
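One standard way to work inside the constraint set {K ⪰ 0 ∧ K_ii = 1} is to alternate projections between the PSD cone and the unit-diagonal affine set. The sketch below is one such scheme under my own assumptions; the talk does not spell out this exact procedure.

```python
import numpy as np

def project_psd(M):
    """Nearest PSD matrix: zero out negative eigenvalues."""
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.clip(w, 0.0, None)) @ V.T

def project_constraints(M, iters=200):
    """Approximate projection onto {K PSD and K_ii = 1} by alternating
    projections; ends with the PSD step so the result is exactly PSD."""
    K = M.copy()
    for _ in range(iters):
        np.fill_diagonal(K, 1.0)   # project onto the unit-diagonal affine set
        K = project_psd(K)         # project onto the PSD cone
    return K

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
K = project_constraints((M + M.T) / 2)
```

The intersection is nonempty (the identity matrix is in it), and both sets are convex, so alternating projections converge to a feasible point.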
ADAPTIVE ALGORITHM
Loop: Turk random triples → fit K to all data so far → Turk "most informative" triples.
Maximum-likelihood fit to logistic or relative model using gradient descent.
Which triples are most informative? Those which the current model says are 50/50?
Like labeling examples "closest to the margin" in active learning.
ADAPTIVE ALGORITHM
[Same loop: Turk random triples → fit K to all data so far → Turk "most informative" triples.]
Instead, we use a probabilistic model + information gain to decide how informative a triple is.
ROADMAP
1. Fitting K to data using a model
2. Adaptively choosing triples
3. Two different models
   a) Convex logistic model
   b) Relative model
4. Performance evaluation
   a) 20 Questions metric
   b) Using learned kernel
5. Related work
FIT K TO DATA
Log-likelihood of the observed triples:
Σ_{(a,b,c) observed} log f(K_ab, K_ac)
p_abc = Probability that a random turker reports "a is more similar to b than to c" = f(K_ab, K_ac).
Find the max-likelihood K ⪰ 0 with K_ii = 1.
Equivalently, minimize log-loss.
Done by gradient-projection descent.
Regularization parameter μ chosen based upon an independent hold-out set.
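The gradient-projection descent can be sketched as follows. This is a minimal illustration under assumed specifics: a logistic answer model p = sigmoid(K_ab − K_ac), an invented toy set of triples, and a fixed step size; the talk's actual fitting code is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def project(K):
    """One sweep toward {K PSD, K_ii = 1}: fix the diagonal, then clip
    negative eigenvalues so the result is PSD."""
    np.fill_diagonal(K, 1.0)
    w, V = np.linalg.eigh((K + K.T) / 2)
    return (V * np.clip(w, 0.0, None)) @ V.T

def neg_log_lik(K, triples):
    """Log-loss of triples (a,b,c) = 'a more similar to b than to c'."""
    return -sum(float(np.log(sigmoid(K[a, b] - K[a, c]))) for a, b, c in triples)

def gradient(K, triples):
    G = np.zeros_like(K)
    for a, b, c in triples:
        g = sigmoid(K[a, b] - K[a, c]) - 1.0   # d(-log p)/dK_ab; opposite sign for K_ac
        G[a, b] += g; G[b, a] += g
        G[a, c] -= g; G[c, a] -= g
    return G

# toy run on invented triples
triples = [(0, 1, 2)] * 5 + [(1, 0, 3)] * 5
K = np.eye(4)
before = neg_log_lik(K, triples)
for _ in range(100):
    K = project(K - 0.05 * gradient(K, triples))
after = neg_log_lik(K, triples)
```

Each step takes a gradient step on the log-loss and then projects back onto the feasible set of kernels.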
ADAPTIVELY CHOOSING TRIPLES
First, 10 random triples are chosen per object.
Then, at each subsequent round, for each object a:
• Pick a triple comparing a to two other objects
• Get the posterior distribution over where a is embedded.
• Choose the triple that results in the largest expected decrease in entropy (greatest mutual information).
ADAPTIVELY CHOOSING TRIPLES
First, 10 random triples are chosen per object.
Then, at each subsequent round, for each object a:
• Pick a triple comparing a to two other objects
• Prior is:
  • Fix the embedding x_j of all j ≠ a.
  • Suppose x_a is equal to a uniformly random x_j.
• Posterior is the prior updated by a's triples (a data-driven prior/posterior).
• Choose the triple that results in the largest expected decrease in entropy (greatest mutual information).
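The expected entropy decrease of a candidate triple can be computed from the posterior by Bayes' rule. The sketch below is my own reconstruction: it assumes a discrete posterior over candidate embedding points for object a and a logistic answer model (both illustrative, not the talk's exact choices).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    p = np.clip(p, 1e-12, None)
    return float(-np.sum(p * np.log(p)))

def info_gain(posterior, X, b, c):
    """Expected entropy decrease (mutual information) from asking
    'is a more similar to b or to c?', where `posterior` is a discrete
    distribution over candidate embedding points X[j] for object a."""
    p_b_given_j = sigmoid(X @ X[b] - X @ X[c])   # answer model at each candidate
    p_b = float(posterior @ p_b_given_j)          # marginal P(answer = 'b')
    post_if_b = posterior * p_b_given_j / p_b     # posterior after answer 'b'
    post_if_c = posterior * (1.0 - p_b_given_j) / (1.0 - p_b)
    return entropy(posterior) - (p_b * entropy(post_if_b)
                                 + (1.0 - p_b) * entropy(post_if_c))

# pick the most informative pair (b, c) for object a = 0
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
posterior = np.full(6, 1.0 / 6.0)
pairs = [(b, c) for b in range(1, 6) for c in range(1, 6) if b != c]
best = max(pairs, key=lambda bc: info_gain(posterior, X, *bc))
```

Mutual information is never negative, so every candidate triple has nonnegative expected gain; the adaptive algorithm simply asks the maximizer.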
FIT K TO DATA
p_abc = Probability that a random turker reports "a is more similar to b than to c" = f(K_ab, K_ac).
Logistic model:
f(K_ab, K_ac) = 1 / (1 + e^{−(K_ab − K_ac)})
Note that K_ab − K_ac = x_a · (x_b − x_c).
FIT K TO DATA
p_abc = Probability that a random turker reports "a is more similar to b than to c" = f(K_ab, K_ac).
New relative model:
f(K_ab, K_ac) = ||x_a − x_c||² / (||x_a − x_b||² + ||x_a − x_c||²)
             = (2 − 2K_ac) / (4 − 2K_ab − 2K_ac)
Assume K ⪰ μI.
FIT K TO DATA
p_abc = Probability that a random turker reports "a is more similar to b than to c" = f_μ(K_ab, K_ac).
New relative model:
f_μ(K_ab, K_ac) = (μ + ||x_a − x_c||²) / (2μ + ||x_a − x_b||² + ||x_a − x_c||²)
Assume K ⪰ 0 (instead of K ⪰ μI).
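A quick check that the distance form and the kernel form of the relative model agree for unit-norm points (using ||x_i − x_j||² = 2 − 2K_ij); the test vectors and μ value are illustrative:

```python
import numpy as np

def f_mu_dist(xa, xb, xc, mu):
    """P('a more similar to b than to c') from squared distances."""
    dab = float(np.sum((xa - xb) ** 2))
    dac = float(np.sum((xa - xc) ** 2))
    return (mu + dac) / (2 * mu + dab + dac)

def f_mu_kernel(Kab, Kac, mu):
    """Same probability from kernel entries, using ||xi - xj||^2 = 2 - 2*K_ij
    when all points have unit norm."""
    return (mu + 2 - 2 * Kac) / (2 * mu + 4 - 2 * Kab - 2 * Kac)

rng = np.random.default_rng(0)
xa, xb, xc = rng.normal(size=(3, 4))
xa, xb, xc = (v / np.linalg.norm(v) for v in (xa, xb, xc))
p1 = f_mu_dist(xa, xb, xc, mu=0.05)
p2 = f_mu_kernel(float(xa @ xb), float(xa @ xc), mu=0.05)
```

Note the direction: as b moves closer to a (dab → 0), the probability of answering "b" rises above 1/2, as it should.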
RELATIVE MODEL FITTING
Fitting the best relative model is not a convex optimization. :(
However, say that the true probability distribution fits our model, i.e., there exists a K* ⪰ 0 (with K*_ii = 1) such that:
p_abc = (μ + 2 − 2K*_ac) / (2μ + 4 − 2K*_ab − 2K*_ac)
Theorem: For any dist. D over a, b, c ≤ n, (given sufficient data) with high probability stochastic gradient descent will find K ⪰ 0 (with K_ii = 1) satisfying
E_{abc∼D} | (μ + 2 − 2K*_ac)/(2μ + 4 − 2K*_ab − 2K*_ac) − (μ + 2 − 2K_ac)/(2μ + 4 − 2K_ab − 2K_ac) | ≤ ε.
GOOD QUESTION?
Say we have "figured out" that it's a small white dog, and we want to know how furry it is.
RELATED WORK
Multidimensional scaling and matrix completion […]
• How similar are A and B? How does person A rate item B?
• Triple-based multidimensional scaling (nonadaptive) [Agarwal, Wills, Cayton, Lanckriet, Kriegman, Belongie '07]
Crowd-based visual search
• Visipedia [Welinder, Branson, Belongie, Perona '10] (requires domain-specific features)
Content-based image retrieval […]
Collaborative filtering […]
PERFORMANCE EVALUATION
20 Questions metric
• A random object is chosen secretly.
• The system asks 20 questions and then ranks objects in terms of likelihood.
Dataset: 75 ties + 75 tiles + 75 flags.
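The ranking step of the 20 Questions metric can be sketched as follows. This is my own reconstruction under the relative model; the function name, μ value, and the truthful-answer simulation are all illustrative.

```python
import numpy as np

def rank_by_likelihood(answers, X, mu=0.05):
    """Rank objects by the log-likelihood that each would have produced the
    observed answers; each answer (b, c) means 'more similar to b than to c'."""
    scores = np.zeros(X.shape[0])
    for b, c in answers:
        d_b = np.sum((X - X[b]) ** 2, axis=1)   # squared distance of every object to b
        d_c = np.sum((X - X[c]) ** 2, axis=1)
        p = (mu + d_c) / (2 * mu + d_b + d_c)   # relative model
        scores += np.log(p)
    return np.argsort(-scores)                  # most likely object first

# simulate 20 truthful answers about a hidden target object
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
target = 3
answers = []
for _ in range(20):
    b, c = rng.choice(8, size=2, replace=False)
    db = float(np.sum((X[target] - X[b]) ** 2))
    dc = float(np.sum((X[target] - X[c]) ** 2))
    answers.append((b, c) if db <= dc else (c, b))
ranking = rank_by_likelihood(answers, X)
```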
PERFORMANCE EVALUATION: USING KERNEL
THE SYSTEM
• Takes any set of images as input.
• Farms out adaptive queries to turkers, in rounds.
• Fits the data to an embedding in d dimensions.
• Creates a visual search browser.
All for about 15¢ per object. For n = 500, this costs about $75.
A naive approach on n = 500 objects (number of triples × what we pay per similarity assessment) would cost tens of thousands of dollars.
Turker feedback:
• "Cool game...really enjoyed it...I was getting good towards the end...lol"
• "fun, getting better as i do more"
• "Thanks for the bonuses, I feel Good Now!"
• "This was interesting, but difficult. I was always happy when it was a girl!"
CONCLUSIONS AND FUTURE WORK
Approximating the crowd kernel works across domains.
Adaptivity helps save $$ and improve performance.
Future work
• Approximate by "machine features" + humans: K_ab = f(a) · f(b) + g(a) · g(b)
• Interactive ML for solving AI-hard classification problems
THANK YOU!