
LEARNING THE
CROWD KERNEL
Adam Tauman Kalai - MSR
Joint work with Serge Belongie (UCSD), Ce Liu (MSR),
Ohad Shamir (MSR), and Omer Tamuz (Weizmann/MSR)
Large datasets, many problem domains.
Each domain requires expertise in the form of specific features or a kernel.
1. INPUT
Database of n objects, say images.
2. CROWD QUERIES
Adaptively chosen or randomly chosen.
3. OUTPUT
Embedding in ℝ^d.
NEAREST NEIGHBORS
Works on any image set.
Ideal system: all nearest neighbors in O(n log n) comparisons?
That's n noisy sorting problems with O(log n) comparisons/item.
LURE OF ADAPTIVITY
Toy example: complete binary trees with n leaves, depth Θ(log n).
Avg. cost is Θ(n) from random queries.
Avg. cost is Θ(log n) from adaptive queries.
(Figure: a "tie store" taxonomy tree with leaves Scarves, Tie clips, Bow ties, Neck ties.)
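A toy sketch of that gap (Python; my illustration, with a range-membership oracle standing in for the crowd): adaptively descending the tree pins down a hidden leaf in Θ(log n) comparisons.

```python
# Toy illustration (mine, not from the talk): adaptively descending a
# complete binary tree locates a hidden leaf in Theta(log n) comparisons.

def in_left_half(hidden, lo, mid, hi):
    """Oracle standing in for the crowd: is the hidden leaf closer to the
    left subtree [lo, mid) than to the right subtree [mid, hi)?"""
    return lo <= hidden < mid

def adaptive_locate(n, hidden):
    lo, hi, queries = 0, n, 0
    while hi - lo > 1:                 # one comparison per tree level
        mid = (lo + hi) // 2
        queries += 1
        lo, hi = (lo, mid) if in_left_half(hidden, lo, mid, hi) else (mid, hi)
    return lo, queries

print(adaptive_locate(1024, 700))      # (700, 10): log2(1024) comparisons
```

Random queries, by contrast, mostly compare leaves in unrelated subtrees and reveal little about the hidden leaf, hence the Θ(n) average cost on the slide.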
EMBEDDINGS AND KERNELS
Embedding of the n objects into ℝ^d for some d ≤ n.
Kernel K ∈ ℝ^{n×n}, K_ij = x_i ⋅ x_j:

        x₁        x₂        x₃        x₄
x₁   x₁⋅x₁   x₁⋅x₂   x₁⋅x₃   x₁⋅x₄
x₂   x₂⋅x₁   x₂⋅x₂   x₂⋅x₃   x₂⋅x₄
x₃   x₃⋅x₁   x₃⋅x₂   x₃⋅x₃   x₃⋅x₄
x₄   x₄⋅x₁   x₄⋅x₂   x₄⋅x₃   x₄⋅x₄

K ≽ 0. Assume ‖x_i‖² = 1, so the diagonal entries are all 1.
The convex set {K ≽ 0 ∧ K_ii = 1} generalizes. [Srebro & Shraibman, COLT '05]
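As a minimal illustration of this correspondence (NumPy, with made-up data): the Gram matrix of a unit-norm embedding is exactly such a K, PSD with unit diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # assume ||x_i|| = 1

K = X @ X.T                                     # K_ij = x_i . x_j

assert np.allclose(np.diag(K), 1.0)             # K_ii = 1
assert np.linalg.eigvalsh(K).min() >= -1e-9     # K is PSD
```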
ADAPTIVE ALGORITHM
Loop: Turk random triples → fit K to all data so far → Turk "most informative triples" → repeat.
The fit is a maximum-likelihood fit to a logistic or relative model, using gradient descent.
Which triples are most informative? Those the current model says are 50/50,
like labeling examples "closest to the margin" in active learning?
Instead, we use a probabilistic model + information gain to decide how informative a triple is.
ROADMAP
1. Fitting 𝑲 to data using a model
2. Adaptively choosing triples
3. Two different models
a) Convex logistic model
b) Relative model
4. Performance evaluation
a) 20 Questions metric
b) Using learned kernel
5. Related work
FIT K TO DATA
Log-loss on the observed triples, e.g.
log(1/p₄₃₂) + log(1/p₂₄₃) + log(1/p₄₁₂),
minimized over the kernel entries K₁₂, K₁₃, K₁₄, K₂₃, K₂₄, K₃₄.
p_abc = probability that a random turker reports
"a is more similar to b than to c"
      = f_λ(K_ab, K_ac).
Find the max-likelihood K ≽ 0 with K_ii = 1;
equivalently, minimize log-loss.
Done by gradient-projection descent.
Regularization parameter λ chosen based on an independent hold-out set.
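A sketch of that fit in Python/NumPy (the function and its projection step are my simplification, not the paper's exact procedure): batch gradient descent on the log-loss, followed by an approximate projection onto {K ≽ 0, K_ii = 1} via eigenvalue clipping and a diagonal rescale. The model f and its gradient are pluggable; the logistic and relative choices appear on the next slides.

```python
import numpy as np

def fit_kernel(triples, f, grad_f, n, lr=0.1, iters=500):
    """Sketch of the max-likelihood fit by gradient-projection descent:
    minimize the log-loss  sum_{(a,b,c)} log(1 / p_abc)  over observed
    triples, where p_abc = f(K_ab, K_ac) and grad_f returns
    (dp/dK_ab, dp/dK_ac). The projection (clip negative eigenvalues,
    rescale the diagonal to 1) approximates {K >= 0, K_ii = 1}."""
    K = np.eye(n)
    for _ in range(iters):
        G = np.zeros((n, n))
        for a, b, c in triples:   # answer recorded: "a is closer to b than to c"
            p = f(K[a, b], K[a, c])
            gab, gac = grad_f(K[a, b], K[a, c])
            G[a, b] = G[b, a] = G[a, b] - gab / p   # d log-loss / d K_ab
            G[a, c] = G[c, a] = G[a, c] - gac / p
        K -= lr * G
        w, V = np.linalg.eigh(K)                    # project onto PSD cone
        K = (V * np.clip(w, 0, None)) @ V.T
        diag = np.sqrt(np.clip(np.diag(K), 1e-12, None))
        K /= np.outer(diag, diag)                   # restore unit diagonal
    return K
```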
ADAPTIVELY CHOOSING TRIPLES
First, k = 10 random triples are chosen per object.
Then, at each round t = 11, 12, …, 40, for each object i:
• Pick a triple comparing i to two other objects.
• Prior: fix the embedding x_j of all j ≠ i, and suppose x_i is equal to a uniformly random x_j.
• Posterior: the prior updated by i's triples so far (a data-driven prior/posterior).
• Choose the triple that gives the largest expected decrease in entropy
  (greatest mutual information), as sketched below.
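A sketch of this selection rule (the helper `most_informative_triple` is mine, and a logistic likelihood stands in for the talk's model):

```python
import numpy as np
from itertools import combinations

def entropy(p):
    p = p[p > 1e-12]
    return float(-np.sum(p * np.log(p)))

def most_informative_triple(i, X, answers, lam=1.0):
    """Prior: x_i equals a uniformly random x_j, j != i. Posterior: prior
    reweighted by past answers (b, c), each meaning "i is closer to b than
    to c". Returns the triple whose answer maximally shrinks the expected
    posterior entropy (greatest information gain)."""
    n = len(X)
    cand = [j for j in range(n) if j != i]
    logp = np.zeros(len(cand))
    for k, j in enumerate(cand):      # likelihood of past answers if x_i = x_j
        for b, c in answers:
            p = 1.0 / (1.0 + np.exp(-(X[j] @ X[b] - X[j] @ X[c]) / lam))
            logp[k] += np.log(max(p, 1e-12))
    post = np.exp(logp - logp.max())
    post /= post.sum()
    h0 = entropy(post)
    best, best_gain = None, -np.inf
    for b, c in combinations(cand, 2):
        pb_j = 1.0 / (1.0 + np.exp(-(X[cand] @ X[b] - X[cand] @ X[c]) / lam))
        pb = float(post @ pb_j)       # predictive prob of the answer "b"
        post_b = post * pb_j; post_b /= post_b.sum()
        post_c = post * (1 - pb_j); post_c /= post_c.sum()
        gain = h0 - pb * entropy(post_b) - (1 - pb) * entropy(post_c)
        if gain > best_gain:
            best, best_gain = (i, b, c), gain
    return best
```

For example, `most_informative_triple(0, X, answers=[(1, 2)])` with the X from the earlier sketch returns the next triple to send to the Turk.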
FIT K TO DATA
p_abc = probability that a random turker reports
"a is more similar to b than to c"
      = f_λ(K_ab, K_ac).
Logistic model:
f_λ(K_ab, K_ac) = σ((K_ab − K_ac)/λ) = σ(x_a ⋅ (x_b − x_c)/λ).
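In code (a small sketch; plugs into the `fit_kernel` sketch above, e.g. `fit_kernel(triples, f_logistic, grad_logistic, n)`):

```python
import numpy as np

def f_logistic(kab, kac, lam=1.0):
    """p_abc = sigma((K_ab - K_ac) / lam)."""
    return 1.0 / (1.0 + np.exp(-(kab - kac) / lam))

def grad_logistic(kab, kac, lam=1.0):
    """(dp/dK_ab, dp/dK_ac), using sigma' = sigma * (1 - sigma)."""
    p = f_logistic(kab, kac, lam)
    return p * (1.0 - p) / lam, -p * (1.0 - p) / lam
```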
FIT K TO DATA
p_abc = probability that a random turker reports
"a is more similar to b than to c"
      = f(K_ab, K_ac).
New relative model:
f(K_ab, K_ac) = ‖x_a − x_c‖² / (‖x_a − x_b‖² + ‖x_a − x_c‖²)
              = (2 − 2K_ac) / (4 − 2K_ab − 2K_ac).
FIT K TO DATA
p_abc = probability that a random turker reports
"a is more similar to b than to c"
      = f_λ(K_ab, K_ac).
New relative model, regularized:
f_λ(K_ab, K_ac) = (λ + ‖x_a − x_c‖²) / (2λ + ‖x_a − x_b‖² + ‖x_a − x_c‖²).
This amounts to assuming K ≽ λI rather than just K ≽ 0.
(Figure: triangle of points a, b, c, with perturbed versions b′, c′.)
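And the relative model in the same pluggable form (a sketch, using ‖x_a − x_b‖² = 2 − 2K_ab for unit-norm x's; plugs in as `fit_kernel(triples, f_relative, grad_relative, n)`):

```python
def f_relative(kab, kac, lam=1.0):
    """p_abc = (lam + 2 - 2 K_ac) / (2 lam + 4 - 2 K_ab - 2 K_ac)."""
    num = lam + 2.0 - 2.0 * kac
    den = 2.0 * lam + 4.0 - 2.0 * kab - 2.0 * kac
    return num / den

def grad_relative(kab, kac, lam=1.0):
    """(dp/dK_ab, dp/dK_ac) for p = num / den above."""
    num = lam + 2.0 - 2.0 * kac
    den = 2.0 * lam + 4.0 - 2.0 * kab - 2.0 * kac
    return 2.0 * num / den**2, (2.0 * num - 2.0 * den) / den**2
```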
RELATIVE MODEL FITTING
Fitting the best relative model is not a convex optimization. :(
However, say that the true probability distribution fits our
model, i.e., there exists a K* ≽ 0 (with K*_ii = 1) such that:
p_abc = (λ + 2 − 2K*_ac) / (2λ + 4 − 2K*_ab − 2K*_ac).
Theorem: For any distribution ρ over a, b, c ≤ n, (given sufficient
data) with high probability stochastic gradient descent will
find K ≽ 0 (with K_ii = 1) satisfying
E_{abc∼ρ} | (λ + 2 − 2K_ac)/(2λ + 4 − 2K_ab − 2K_ac)
          − (λ + 2 − 2K*_ac)/(2λ + 4 − 2K*_ab − 2K*_ac) | ≤ ε.
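A sketch of that SGD (my derivation: writing p = N/D with N = λ + 2 − 2K_ac and D = 2λ + 4 − 2K_ab − 2K_ac, the per-triple log-loss −log p = log D − log N has gradient −2/D w.r.t. K_ab and 2(D − N)/(N·D) w.r.t. K_ac; the projection step is the same approximation as in the earlier fitting sketch):

```python
import numpy as np

def sgd_fit_relative(triples, n, lam=1.0, lr=0.05, epochs=50, seed=0):
    """Sketch of SGD on the relative-model log-loss, with an approximate
    projection onto {K >= 0, K_ii = 1} (eigenvalue clip + diagonal rescale)
    after each epoch. Not the paper's exact procedure."""
    rng = np.random.default_rng(seed)
    K = np.eye(n)
    data = np.asarray(triples)
    for _ in range(epochs):
        for a, b, c in rng.permutation(data):    # answer: "a closer to b than c"
            N = lam + 2.0 - 2.0 * K[a, c]
            D = 2.0 * lam + 4.0 - 2.0 * K[a, b] - 2.0 * K[a, c]
            K[a, b] = K[b, a] = K[a, b] + lr * (2.0 / D)
            K[a, c] = K[c, a] = K[a, c] - lr * (2.0 * (D - N) / (N * D))
        w, V = np.linalg.eigh(K)
        K = (V * np.clip(w, 0, None)) @ V.T
        diag = np.sqrt(np.clip(np.diag(K), 1e-12, None))
        K /= np.outer(diag, diag)
    return K
```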
GOOD QUESTION?
Say we have β€œfigured out” it’s a small white dog and
we want to know how furry it is.
RELATED WORK
Multidimensional scaling and matrix completion […]
β€’ How similar are A and B? How does person A rate item B?
β€’ Triple-based multidimensional scaling (nonadaptive)
[AgarwalWillsCaytonLanckrietKriegmanBelongie β€˜07]
Crowd-based visual search
β€’ Visipedia [WelinderBransonBelongiePerona ’10]
(requires domain-specific features)
Content-based image retrieval […]
Collaborative filtering […]
PERFORMANCE EVALUATION
20 Questions metric:
• A random object is chosen secretly.
• The system asks 20 questions and then ranks objects in terms of likelihood.
Dataset: 75 ties + 75 tiles + 75 flags.
PERFORMANCE EVALUATION: USING KERNEL
THE SYSTEM
• Takes any set of images as input.
• Farms out adaptive queries to turkers, in rounds.
• Fits the data to an embedding in d dimensions.
• Creates a visual search browser.
All for about 15¢ per object. For n = 500, it costs about $75.
A naive approach on n = 500 objects would cost
(n³/2) × $0.005 ≈ $300,000,
where $0.005 is what we pay per similarity assessment.
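Working that estimate through: n³/2 = 500³/2 = 6.25 × 10⁷ triples, and 6.25 × 10⁷ × $0.005 = $312,500, i.e. roughly $300,000, versus about $75 for the adaptive system.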
Turker feedback:
"Cool game...really enjoyed it...I was getting good towards the end...lol"
"fun, getting better as i do more"
"Thanks for the bonuses, I feel Good Now!"
"This was interesting, but difficult. I was always happy when it was a girl!"
CONCLUSIONS AND FUTURE WORK
Approximating the crowd kernel works across domains.
Adaptivity helps save $$ and improve performance.
Future work:
• Approximate by "machine features" + humans:
  K_ab = φ(a) ⋅ φ(b) + h(a) ⋅ h(b)
• Interactive ML for solving AI-hard classification problems.
THANK YOU!