Author-Topic Modeling from Large Document Collections

The Wisdom of Crowds in the Recollection of Order Information
Mark Steyvers, Michael Lee, Brent Miller & Pernille Hemmer
MADLAB: The Memory and Decisions Laboratory
University of California, Irvine
More information about our lab: http://psiexp.ss.uci.edu/research/madlab.htm
Do the experiments yourself at http://psiexp.ss.uci.edu/
Wisdom of Crowds and Rank Aggregation
• Wisdom of crowds phenomenon: aggregating over individuals in a
group often leads to an estimate that is better than any of the
individual estimates (e.g. Surowiecki, 2004)
• Approach: develop unsupervised Bayesian models for rank aggregation that take individual differences into account
• Individual differences: each individual is in one of two states: the Thurstonian state (z=1) or a guessing state (z=0) in which there are no differences between items
[Figure: Thurstonian Model (v1). Samples x for items A, B, C along the latent ground truth scale determine an ordering y, for example A < B < C or A < C < B.]
[Figure: Thurstonian model (z = 1) and guessing model (z = 0). Samples x1 to x4 yield the orderings y2: A < C < B, y3: C < B < A, y4: C < A < B.]
[Figure: rank aggregation example. Item B = James Madison; sample individual ordering: ADBC.]

Example tasks (axis labels on the poster: First, Last, Largest; the number in parentheses is printed next to each item on the poster)

Presidents: John Adams (2), Thomas Jefferson (3), James Monroe (5), Andrew Jackson (4), Theodore Roosevelt (6), Woodrow Wilson (7)

Country Landmass: Canada (4), China (2), United States (3), Brazil (7), Australia (5), India (6), Kazakhstan (10), Sudan (9)

Ten Amendments: Freedom of speech & religion (1), Right to bear arms (2), No quartering of soldiers (4), No unreasonable searches (3), Due process (5), Trial by Jury (6), Civil Trial by Jury (7), No cruel punishment (8), Right to non-specified rights (10), Power for the States & People (9)

Ten Commandments: Worship any other God (1), Make a graven image (7), Take the Lord's name in vain (2), Break the Sabbath (3), Dishonor your parents (4), Murder (6), Commit adultery (8), Steal (5), Bear false witness (9), Covet (10)

All 44 US presidents: George Washington (1), John Adams (2), Thomas Jefferson (3), James Madison (4), James Monroe (6), John Quincy Adams (5), Andrew Jackson (7), Martin Van Buren (8), William Henry Harrison (21), John Tyler (10), James Knox Polk (18), Zachary Taylor (16), Millard Fillmore (11), Franklin Pierce (19), James Buchanan (13), Abraham Lincoln (9), Andrew Johnson (12), Ulysses S. Grant (17), Rutherford B. Hayes (20), James Garfield (22), Chester Arthur (15), Grover Cleveland 1 (23), Benjamin Harrison (14), Grover Cleveland 2 (25), William McKinley (24), Theodore Roosevelt (29), William Howard Taft (27), Woodrow Wilson (30), Warren Harding (26), Calvin Coolidge (28), Herbert Hoover (31), Franklin D. Roosevelt (32), Harry S. Truman (33), Dwight Eisenhower (34), John F. Kennedy (37), Lyndon B. Johnson (36), Richard Nixon (39), Gerald Ford (35), James Carter (38), Ronald Reagan (40), George H.W. Bush (41), William Clinton (42), George W. Bush (43), Barack Obama (44)
Experiment to collect human ordering data
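Mallows Model
• Distance-based model that assumes that observed orderings close to the group ordering are more likely than those far away. The probability of any observed order, given the group order, is:

$$p(y_j \mid \omega, \theta) = \frac{1}{\psi(\theta)} \, e^{-\theta \, d(y_j, \omega)}$$

where y_j is the ordering by individual j, ω is the group ordering, θ is a scaling parameter, d is the Kendall tau distance function, and ψ(θ) is the normalization constant.
[Figure panel title: Ten Commandments]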
• Performance was measured using Kendall’s Tau: The number of
adjacent pair-wise swaps between recalled and true order.
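The swap-counting definition above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `adjacent_swaps_to_sort` and the example orderings are hypothetical. It counts how many adjacent swaps a bubble sort needs to turn a recalled order into the true order, which equals the number of item pairs the two orders disagree on.

```python
def adjacent_swaps_to_sort(recalled, true_order):
    """Number of adjacent pair-wise swaps needed to turn `recalled` into `true_order`."""
    # Map each item to its position in the true order.
    pos = {item: i for i, item in enumerate(true_order)}
    ranks = [pos[item] for item in recalled]
    swaps = 0
    changed = True
    # Bubble sort: every swap of two adjacent, out-of-order items counts once.
    while changed:
        changed = False
        for i in range(len(ranks) - 1):
            if ranks[i] > ranks[i + 1]:
                ranks[i], ranks[i + 1] = ranks[i + 1], ranks[i]
                swaps += 1
                changed = True
    return swaps


# Hypothetical example: two independent adjacent swaps (B/A and E/D).
distance = adjacent_swaps_to_sort("BACED", "ABCDE")
```

A perfectly recalled order has distance 0, and a fully reversed order of n items has the maximum distance n(n-1)/2.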
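[Figure: Kendall tau example. A recalled order is compared with the True Order (A, B, C, D, ...); the panel shown has a distance of 2.]
[Figure: raw orderings of items A to J from 78 individuals. The letter rows and the two numeric rows below are reproduced as extracted.]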
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA CAH
B B B C B B B C B B B C B B B B B B B B B B B B B C CD B B B B C C C C B B C E E F B B B C C C B C C C E B CH B B B C B C C E E C E G J C
C C C B C C C B C C C B C C C CDD C C C C C C E B B C C C CD B B B E CD B CD C C C C BD E C B B F B I B B CD GF C E F D C F G J GD
DD EDD E EDD E E EDD E E C ED E E E E E CDD BD E E C E E E B E C E B BD E E J F B B E F F E C GE E G J C EH BH I B BDA I I
E E D E E DD E E DDD E E DD E C I DDDD F D E E E E D GE D G I GG J GF C BDDDD E D I E GD F C J C J E F B J I E G J J C E D J
F GF F F F GF F F GF F H F GF F E F F H I D I GH F J J D J F DDDH E HD GGH J F G I J H J H BH E GDD G I J I F D B I H J B C E
GF GGHH F GGGF G J GH F I I G J J GG J G I F I I G I F I I G I F GD I F H J H G J J GF H E I I DH J I CH GD J GCD I F I H G
HHHH GGHH J I I H GF I J GGF H I J H GF J J H F F J I J F F F I F J H J E I I E I GF GD J H GF D F H I J I E H J H GE I D B F
J I I I J I I J I J H J I J J H J H J I GF F I H F I GG I F GH J HH J I F GH I GF I E F I D GD G J H I I E HDD GGB F H GH F F A
I J J J I J J I HH J I H I G I H J H GH I J H J H G J HHHH GH J J DH I J I J F GHHHH J I I J D J F GF F E H F D I J F D BH E B
1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 4 4 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 8 8 8 8 8 8 9 9 9 9 9 9 10 10 10 10 10 11 12 12 13 13 13 13 14 14 14 14 15 17 18 19 26 28
1 2 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[Figure: panel title Ten Amendments. Legend: Thurstonian Model, Perturbation, Borda count, Individuals. Individuals are marked zj = 1 or zj = 0; the x-axis runs from 0 to 45. Annotations: "error bars = median and minimum sigma", R=0.941.]
[Equation annotations: group ordering, scaling parameter, Kendall tau distance function d(y_j, ω).]
• Two-state model: an individual produces an ordering according to either a
Mallows model (z=1) or a guessing process (z=0). We estimate the latent
assignment z for each individual. This approach is related to Klementiev, Roth et
al. (2009).
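As a rough illustration of the two-state idea, the sketch below computes the posterior probability that one ordering came from the Mallows state rather than the guessing state. It assumes the usual Mallows form, p(y | ω, θ) proportional to exp(-θ d(y, ω)), a uniform guessing distribution over all permutations, and known values for ω, θ and the prior; all of these are assumptions made for the example, and the poster's models infer such quantities rather than fixing them. The function names are hypothetical, and the normalization constant is computed by brute-force enumeration, so this only suits short lists.

```python
import math
from itertools import permutations


def kendall_tau_distance(y, omega):
    """Count item pairs that y and omega place in opposite relative order."""
    pos = {item: i for i, item in enumerate(omega)}
    ranks = [pos[item] for item in y]
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )


def mallows_probability(y, omega, theta):
    """Mallows probability of y, normalized by enumerating all permutations."""
    psi = sum(
        math.exp(-theta * kendall_tau_distance(p, omega))
        for p in permutations(omega)
    )
    return math.exp(-theta * kendall_tau_distance(y, omega)) / psi


def posterior_mallows_state(y, omega, theta, prior_z1=0.5):
    """P(z = 1 | y): Mallows state versus uniform guessing (assumed prior)."""
    like_mallows = mallows_probability(y, omega, theta)
    like_guess = 1.0 / math.factorial(len(omega))
    numerator = prior_z1 * like_mallows
    return numerator / (numerator + (1.0 - prior_z1) * like_guess)


omega = ("A", "B", "C", "D", "E")  # hypothetical group ordering
close = posterior_mallows_state(("A", "B", "C", "D", "E"), omega, theta=1.0)
far = posterior_mallows_state(("E", "D", "C", "B", "A"), omega, theta=1.0)
```

Orderings close to the group ordering get a high posterior for z = 1, while orderings far from it are better explained by guessing.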
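[Figure: Example Raw Data, second panel. The letter rows and the two numeric rows below are reproduced as extracted.]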
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA B BAAA BAA BAAAAADA BA B EAD C B E C I J J H
B B B B B B B B B B B B B B B B B B B B B C E B B B B B B B B B B B B B B C F A C B B GA B BA B B F B F F BA CA F H E I J H B E GG J
C C C C C C CDD C C C C C C CDD CDD B B CDDD E C CDD F F C C F B B CD C C C C CD F D E C F E BD E GE C C I GH G I A B I I
DD E DD E E C CDDDD E F F C F E C CD C GF F F F D E CH CDDH C F D E AH I B F H C CH I B J C C I I F I GE HA C B GHHD G
E F D E F D F E F E E E F F DD F C F F H ED F C E ED F DH CD CH F ED CH F F F D J I HH I D I DD E F F BHADAD I J H G I H E
F E F F E F D F E F GH E D E E E E GE E F F DH C C CH GF E E E I E D I E DHD E F DD F D C F D C BA E H J CD F B F AA J D F E F
GGGH GGGH G I H F I GGHH GDH F H GE E GH G I J E F I H F D I E H I E I D E E F I I F C E E I GC CDD B J F HD F F F E CD
I HH I I I H GHH I GG I I GGH I I G I I H G I GH GF I I H I E GHH I F GE G I GE J E E HHH G I J DH J H I C E F DD CA B C
H I I GHH I I I GF J HHH I I I H G I J H I I H I I E I GGGGG I GGGG J GHHH GE GGGGGHHH G I F I B GB E C E B C F B
J J J J J J J J J J J I J J J J J J J J J G J J J J J J J H J J J J J J J J J J I J J J I J G J J J J I J J G J E G J G J J G I A J DAA
1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 7 8 8 8 8 9 9 9 10 10 10 11 11 11 12 13 14 14 14 16 18 20 22 24 26 26 33 37 42
1 5 1 1 1 1 1 1 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
zj 1
0
normalization
constant
D
Example Raw Data
3
0
= 1+1
D
y j  rank (x j )
Experiment with
26 individuals
ordering all 44 US
presidents
4
8
=1
C
5

ordering by
individual
Ordering by Individual
A
B
E
C
D
Number of Individuals
6

250

individual
200
distance
to ground
truth
250

200
150
150
100
100
50
50
0
0.1
0.2

0.3

0.4
0
1
10
20
30
Individuals
inferred noise level for
each individual
Strong wisdom of crowds effect across tasks
Conclusion
True ordering
Mean Kendall tau averaged over all 17 tasks
A = Oregon
B = Utah
C = Nebraska
D = Iowa
E = Alabama
F = Ohio
G = Virginia
H = Delaware
I = Connecticut
J = Maine
A = George Washington
B = John Adams
C = Thomas Jefferson
D = James Monroe
E = Andrew Jackson
F = Theodore Roosevelt
G = Woodrow Wilson
H = Franklin D. Roosevelt
I = Harry S. Truman
J = Dwight D. Eisenhower
                         Humans       Thurstonian Model   Mallows Model       Borda Counts        Mode
Problem                  PC    τ      C    τ     Rank     C    τ     Rank     C    τ     Rank     C    τ     Rank
books                    .000  12.3   0    5     91       0    5     91       0    7     82       0    12    40
city population europe   .000  16.9   0    11    81       0    12    77       0    11    81       0    17    42
city population us       .000  15.9   0    7     96       0    7     96       0    12    67       0    16    45
city population world    .000  19.3   0    16    73       0    16    73       0    15    77       0    19    44
country landmass         .000  10.9   0    5     95       0    5     95       0    5     95       0    7     76
country population       .000  14.6   0    12    74       0    11    82       0    11    82       0    15    53
hardness                 .000  15.3   0    14    64       0    14    64       0    11    91       0    15    46
holidays                 .051  8.9    0    4     78       0    5     77       0    4     78       1    0     100
movies releasedate       .013  7.3    0    2     95       0    2     95       0    2     95       0    2     95
oscar bestmovies         .013  11.2   0    4     90       0    4     90       0    3     97       0    3     97
oscar movies             .000  11.9   0    1     100      0    1     100      0    2     96       0    2     96
presidents               .064  7.5    0    2     87       0    1     94       0    3     79       1    0     100
rivers                   .000  16.1   0    13    77       0    14    67       0    11    91       0    16    42
states westeast          .026  8.2    0    2     88       0    2     88       0    3     78       0    1     97
superbowl                .000  18.6   0    16    65       0    15    71       0    10    96       0    19    40
ten amendments           .013  14.0   0    2     97       0    3     96       0    5     90       0    4     95
ten commandments         .000  16.8   0    8     90       0    7     91       0    12    74       0    17    51
AVERAGE                  .011  13.3   .00  7.29  84.8     .00  7.29  85.1     .00  7.47  85.3     .12  9.67  68.2
BEST INDIVIDUAL          0     7.8
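The Borda Counts column refers to a simple, model-free aggregation baseline. The sketch below shows one common form of the Borda count, in which each item is scored by the sum of its positions across individuals and items are then sorted by score; the exact variant used for the poster's results is not specified, so this is an assumption. The function name `borda_aggregate` is hypothetical, and the five input orderings are the sample orderings of items A to D shown in the poster's rank aggregation figure.

```python
from collections import defaultdict


def borda_aggregate(orderings):
    """Aggregate orderings by summed position; lower totals come first."""
    totals = defaultdict(int)
    for ordering in orderings:
        for position, item in enumerate(ordering):
            totals[item] += position
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(totals, key=lambda item: (totals[item], item))


# Sample individual orderings from the rank aggregation figure.
individual_orderings = ["ADBC", "ACBD", "BADC", "ABDC", "DABC"]
group_ordering = borda_aggregate(individual_orderings)
```

Unlike the Thurstonian and Mallows models, this baseline weights every individual equally, so it cannot discount individuals who are guessing.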
• Using unsupervised Bayesian models for rank data, we can
aggregate orderings across individuals such that the aggregated
ordering better approximates ground truth than any individual in the
crowd: a strong wisdom of crowds effect
[Figure: Mean τ by method (y-axis ticks up to 25). Legend: Thurstonian Model v2, Thurstonian Model v1, Perturbation Model, Mallows Model, Borda count, Individuals.]
• We tested 78 individuals on their ability to reconstruct from memory
the order of items in 17 different tasks
• Example tasks: the order of US presidents, the order of countries by
landmass, the order of the ten commandments and the ten
amendments.
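Generative Model

$$x_{ij} \mid \mu, \sigma_j \sim \mathrm{N}(\mu_i, \sigma_j)$$

$$\sigma_j \sim \mathrm{Gamma}(\lambda, 1/\lambda)$$

[Graphical model: nodes μ, σ_j, x_j and y_j, with a plate over j individuals.]
[Figure labels, rank aggregation example: A. George Washington; C. Andrew Jackson; sample individual ordering: ACBD.]
[Figure labels, example tasks: Presidents: George Washington (1), Harry S. Truman (8), Franklin D. Roosevelt (9), Dwight D. Eisenhower (10). Country Landmass: Russia (1), Argentina (8). Ten Amendments.]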
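Incorporate individual differences
• Assumption: each individual has a unique variance (the same for all
items) but shares the same set of item means with the group. This
model can represent varying degrees of "expertise"
[Figure: individuals with different variances over items A, B, C; example ordering y1: A < B < C.]
[Figure labels, rank aggregation example: sample individual ordering: BADC; ground truth: ????]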
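The assumption above (shared item means, one noise level per individual) can be simulated directly. This is a minimal sketch under stated assumptions: the item means, the two noise levels, and all function names are hypothetical, and the code only generates orderings; it does not perform the Bayesian inference used on the poster. An individual with a small noise level reproduces the true order almost every time, while one with a large noise level produces orderings that are far from it on average.

```python
import random


def sample_ordering(item_means, sigma, rng):
    """Draw one normal sample per item and order the items by their samples."""
    samples = {item: rng.gauss(mean, sigma) for item, mean in item_means.items()}
    return sorted(samples, key=samples.get)


def inversions(ordering, true_order):
    """Kendall tau distance: pairs of items in opposite relative order."""
    pos = {item: i for i, item in enumerate(true_order)}
    ranks = [pos[item] for item in ordering]
    return sum(
        ranks[i] > ranks[j]
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
    )


def mean_distance(item_means, sigma, n_samples, rng):
    """Average distance to the true order over repeated simulated recalls."""
    true_order = sorted(item_means, key=item_means.get)
    total = sum(
        inversions(sample_ordering(item_means, sigma, rng), true_order)
        for _ in range(n_samples)
    )
    return total / n_samples


rng = random.Random(0)
# Hypothetical item positions on the latent interval scale.
item_means = {"A": 0.0, "B": 1.0, "C": 2.0, "D": 3.0, "E": 4.0}
expert = mean_distance(item_means, sigma=0.1, n_samples=200, rng=rng)
novice = mean_distance(item_means, sigma=5.0, n_samples=200, rng=rng)
```

Comparing `expert` and `novice` shows how a single per-individual parameter can stand in for differing degrees of expertise.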
• Items are represented by coordinates on an interval scale.
• Normal distributions represent uncertainty about item
position. To order items, each individual draws one
sample from each normal distribution and orders the
items according to the samples. Means and standard
deviations are shared among all individuals
• Goal: apply this idea to human ordering / ranking data: how can we
aggregate the recollected orderings across individuals to best
approximate some underlying ground truth?
Thurstonian Model v2: allowing Partial Knowledge
[Figure: Thurstonian model (z = 1) and guessing model (z = 0) over items A, B, C.]
[Figure labels, rank aggregation example: sample individual orderings: ABDC, DABC.]
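A two-state generator along these lines might look as follows. This is a sketch under assumptions, not the poster's implementation: the probability of being in the Thurstonian state, the item means, the noise level, and the function name `sample_two_state_ordering` are all hypothetical. In the Thurstonian state (z = 1) the ordering comes from sorting normal samples; in the guessing state (z = 0) it is a uniformly random permutation, so item identity carries no information.

```python
import random


def sample_two_state_ordering(item_means, sigma, p_thurstonian, rng):
    """Return (z, ordering) for one simulated individual."""
    items = list(item_means)
    z = 1 if rng.random() < p_thurstonian else 0
    if z == 1:
        # Thurstonian state: one normal sample per item, then sort by sample.
        samples = {item: rng.gauss(item_means[item], sigma) for item in items}
        ordering = sorted(items, key=samples.get)
    else:
        # Guessing state: every permutation is equally likely.
        ordering = items[:]
        rng.shuffle(ordering)
    return z, ordering


rng = random.Random(0)
item_means = {"A": 0.0, "B": 1.0, "C": 2.0, "D": 3.0}  # hypothetical positions
crowd = [
    sample_two_state_ordering(item_means, sigma=0.5, p_thurstonian=0.7, rng=rng)
    for _ in range(10)
]
```

In the models on the poster the state z is latent and must be estimated from the orderings alone; here it is returned only so the simulated data can be inspected.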
• It is important to incorporate individual differences: some
individuals are more expert than others. Models can estimate
expertise levels in an unsupervised fashion: individuals near the
consensus orderings are likely to be more expert (if individuals
performed the task independently)
[Figure axis: Individuals, 1 to 80.]