Semantic association in humans and machines


Probabilistic Topic Models
Mark Steyvers
Department of Cognitive Sciences
University of California, Irvine
Joint work with:
Tom Griffiths, UC Berkeley
Padhraic Smyth, UC Irvine
Dave Newman, UC Irvine
Chaitanya Chemudugunta, UC Irvine
Problems of Interest
• Information retrieval
– Finding relevant documents in databases
– How can we go beyond keyword matches?
• Psychology
– Retrieving information from episodic/semantic memory
– How to access memory at multiple levels of abstraction?
Overview
I. Probabilistic Topic Models
II. Topic models for association
III. Topic models for recall
IV. Topic models for information retrieval
V. Conclusion
Probabilistic Topic Models
• Originated in domain of statistical machine learning
• Performs unsupervised extraction of topics from large text
collections
• Text documents:
– scientific articles
– web pages
– email
– list of words in memory experiment
Probabilistic Topic Models
• Each topic is a probability distribution over words
• Each document is modeled as a mixture of topics
• We do not observe these distributions but we can infer them
statistically
The Generative Model
1. For each document, choose a mixture of topics
2. For every word slot, sample a topic [1..T] from the mixture
3. Sample a word from the sampled topic
(Figure: TOPIC MIXTURE → TOPIC … TOPIC → WORD … WORD)
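The three steps above can be sketched in Python. The vocabulary and the two topic distributions below are made-up toy numbers in the spirit of the toy example that follows, not estimates from any corpus:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and two topics: each topic is a probability
# distribution over the whole vocabulary (made-up numbers).
vocab = ["HEART", "LOVE", "SOUL", "MYSTERY", "SCIENTIFIC", "KNOWLEDGE", "RESEARCH"]
topics = np.array([
    [0.30, 0.30, 0.20, 0.15, 0.02, 0.02, 0.01],  # topic 1: "love" words
    [0.01, 0.02, 0.02, 0.15, 0.30, 0.30, 0.20],  # topic 2: "science" words
])

def generate_document(n_words, alpha=1.0):
    # 1. for this document, choose a mixture of topics
    theta = rng.dirichlet(alpha * np.ones(len(topics)))
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)      # 2. sample a topic for the slot
        w = rng.choice(len(vocab), p=topics[z])   # 3. sample a word from the topic
        words.append(vocab[w])
    return theta, words

theta, doc = generate_document(8)
print(theta, doc)
```

Running this repeatedly produces documents like the three toy examples below: mostly "love" words, mostly "science" words, or a mix, depending on the sampled mixture.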
Toy Example
(Figure: bar chart of the topic mixture over topic 1 and topic 2)
Topic 1
HEART
LOVE
SOUL
MYSTERY
SCIENTIFIC
KNOWLEDGE
RESEARCH
Topic 2
HEART
LOVE
SOUL
MYSTERY
SCIENTIFIC
KNOWLEDGE
RESEARCH
Document: HEART, LOVE, LOVE, SOUL, HEART, ….
Word probabilities
for each topic
Toy Example
(Figure: bar chart of the topic mixture over topic 1 and topic 2)
Topic 1
HEART
LOVE
SOUL
MYSTERY
SCIENTIFIC
KNOWLEDGE
RESEARCH
Topic 2
HEART
LOVE
SOUL
MYSTERY
SCIENTIFIC
KNOWLEDGE
RESEARCH
Word probabilities
for each topic
Document: SCIENTIFIC, KNOWLEDGE, SCIENTIFIC, RESEARCH, ….
Toy Example
(Figure: bar chart of the topic mixture over topic 1 and topic 2)
Topic 1
HEART
LOVE
SOUL
MYSTERY
SCIENTIFIC
KNOWLEDGE
RESEARCH
Topic 2
HEART
LOVE
SOUL
MYSTERY
SCIENTIFIC
KNOWLEDGE
RESEARCH
Document: LOVE, SCIENTIFIC, SOUL, LOVE, KNOWLEDGE, ….
Word probabilities
for each topic
Applying model to large corpus of text
• Simultaneously infer the topics and topic mixtures that best
“reconstructs” a corpus of text
• Bayesian methods (MCMC)
• Unsupervised!!
Example topics from TASA: an educational corpus
• 37K docs, 26K word vocabulary
• 300 topics, e.g.:
TOPIC 77
MUSIC
DANCE
SONG
PLAY
SING
SINGING
BAND
PLAYED
SANG
SONGS
DANCING
PIANO
PLAYING
RHYTHM
TOPIC 82
LITERATURE
POEM
POETRY
POET
PLAYS
POEMS
PLAY
LITERARY
WRITERS
DRAMA
WROTE
POETS
WRITER
SHAKESPEARE
(Figure: bar chart of P( w | z ) for each topic)
TOPIC 137
RIVER
LAND
RIVERS
VALLEY
BUILT
WATER
FLOOD
WATERS
NILE
FLOWS
RICH
FLOW
DAM
BANKS
TOPIC 254
READ
BOOK
BOOKS
READING
LIBRARY
WROTE
WRITE
FIND
WRITTEN
PAGES
WORDS
PAGE
AUTHOR
TITLE
(Figure: bar charts of P( w | z ) for each topic)
Topics from psych review abstracts
• All 1281 abstracts since 1967
• 50 topics – examples:
SIMILARITY
CATEGORY
CATEGORIES
RELATIONS
DIMENSIONS
FEATURES
STRUCTURE
SIMILAR
REPRESENTATION
OBJECTS
STIMULUS
CONDITIONING
LEARNING
RESPONSE
STIMULI
RESPONSES
AVOIDANCE
REINFORCEMENT
CLASSICAL
DISCRIMINATION
MEMORY
RETRIEVAL
RECALL
ITEMS
INFORMATION
TERM
RECOGNITION
ITEMS
LIST
ASSOCIATIVE
GROUP
INDIVIDUAL
GROUPS
OUTCOMES
INDIVIDUALS
DIFFERENCES
INTERACTION
...
EMOTIONAL
EMOTION
BASIC
EMOTIONS
AFFECT
STATES
EXPERIENCES
AFFECTIVE
AFFECTS
RESEARCH
Application: Research Browser
http://yarra.calit2.uci.edu/calit2/
• Automatically analyzes
research papers by UC San
Diego and UC Irvine
researchers related to
California Institute for
Telecommunications and
Information Technology
(CalIT2)
• Finds topically related
researchers
Notes
• Determining number of topics
• Word order is ignored (bag-of-words assumption)
• Function words (the, a, etc) are typically removed
Latent Semantic Analysis
(Landauer & Dumais, 1997)
(Figure: semantic space with words JOY, LOVE, MYSTERY, RESEARCH, MATHEMATICS as points)
• Learned from large text collections
• Spatial representation:
– Each word is a single point in semantic space
– Nearby words are similar in meaning
• Problems:
– Static representation for ambiguous words
– Dimensions are uninterpretable
– Little flexibility to expand
Word Association
CUE: PLAY

HUMANS (responses):
Word       P( word )
FUN        .141
BALL       .134
GAME       .074
WORK       .067
GROUND     .060
MATE       .027
CHILD      .020
ENJOY      .020
WIN        .020
ACTOR      .013
FIGHT      .013
HORSE      .013
KID        .013
MUSIC      .013

TOPICS (T=500):
Word       P( word )
BALL       .041
GAME       .039
CHILDREN   .019
ROLE       .014
GAMES      .014
MUSIC      .009
BASEBALL   .009
HIT        .008
FUN        .008
TEAM       .008
IMPORTANT  .006
BAT        .006
RUN        .006
STAGE      .005

(Nelson, McEvoy, & Schreiber USF word association norms)
Topic model for Word Association
(Figure: the cue PLAY activates a sports topic containing BAT, BALL, BASEBALL, GAME, and a theater topic containing STAGE, THEATER)
• Topics are latent factors that group words
• Word association modeled as probabilistic “spreading of activation”
– Cue activates topics
– Topics activate other words
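A minimal sketch of this two-step activation, assuming a hypothetical two-topic model with made-up P( w | z ) values (the real model uses T=500 topics estimated from TASA):

```python
import numpy as np

# Hypothetical topic-word probabilities P(w|z) for two topics
# over a small vocabulary (invented numbers, for illustration only).
vocab = ["PLAY", "BALL", "GAME", "BAT", "BASEBALL", "STAGE", "THEATER"]
phi = np.array([
    [0.20, 0.25, 0.25, 0.10, 0.20, 0.00, 0.00],  # a sports topic
    [0.30, 0.00, 0.00, 0.00, 0.00, 0.35, 0.35],  # a theater topic
])
p_z = np.array([0.5, 0.5])  # uniform prior over topics

def associates(cue):
    i = vocab.index(cue)
    # cue activates topics: P(z|cue) ∝ P(cue|z) P(z)
    p_z_given_cue = phi[:, i] * p_z
    p_z_given_cue /= p_z_given_cue.sum()
    # topics activate other words: P(w|cue) = sum_z P(w|z) P(z|cue)
    p_w = phi.T @ p_z_given_cue
    p_w[i] = 0.0  # exclude the cue itself
    order = np.argsort(-p_w)
    return [(vocab[j], p_w[j]) for j in order if p_w[j] > 0]

print(associates("PLAY"))
```

With these toy numbers the cue PLAY spreads activation into both topics, so sports words and theater words both appear among its predicted associates.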
Model predictions from TASA corpus

HUMANS:
Word       P( word )
FUN        .141
BALL       .134
GAME       .074
WORK       .067
GROUND     .060
MATE       .027
CHILD      .020
ENJOY      .020
WIN        .020
ACTOR      .013
FIGHT      .013
HORSE      .013
KID        .013
MUSIC      .013

TOPICS (T=500):
Word       P( word )
BALL       .041
GAME       .039
CHILDREN   .019
ROLE       .014
GAMES      .014
MUSIC      .009
BASEBALL   .009
HIT        .008
FUN        .008
TEAM       .008
IMPORTANT  .006
BAT        .006
RUN        .006
STAGE      .005

LSA: (column garbled in transcript; the first human associate, FUN, appears at rank 9)
Median rank of first associate
(Figure: median rank (0-40) of the first human associate as a function of the number of topics/dimensions (300-1700); panels compare the TOPICS model and LSA under cosine and inner-product measures)
What about collocations?
• Why are these words related?
– PLAY - GROUND
– DOW - JONES
– BUMBLE - BEE
• Suggests at least two routes for association:
– Semantic
– Collocation
→ Integrate collocations into topic model
Collocation Topic Model
• If x=0, sample a word from the topic
• If x=1, sample a word from the distribution based on previous word
(Figure: TOPIC MIXTURE → TOPIC … TOPIC → WORD … WORD, with a switch variable X at each word)
Collocation Topic Model
Example: “DOW JONES RISES”
• JONES is more likely explained as a word following DOW (x=1) than as a word sampled from a topic
• RISES is sampled from a topic (x=0)
• Result: DOW_JONES recognized as collocation
(Figure: TOPIC MIXTURE → DOW, JONES (x=1), RISES (x=0))
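The x=0/x=1 decision for JONES can be illustrated as a posterior computation over the two routes; every probability below is invented for illustration:

```python
# Hypothetical parameters for the two routes (invented numbers):
p_jones_after_dow = 0.8      # P(JONES | prev word = DOW), collocation route
p_jones_from_topic = 0.001   # P(JONES | topic), topic route
p_x1 = 0.3                   # prior probability of the collocation route (x=1)

# Posterior probability that JONES was generated by the collocation route:
num = p_x1 * p_jones_after_dow
den = num + (1 - p_x1) * p_jones_from_topic
p_x1_posterior = num / den
print(p_x1_posterior)
```

Because the bigram probability dwarfs the topic probability, the posterior on x=1 is close to 1, which is why DOW_JONES ends up treated as a collocation.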
Examples Topics from New York Times
Terrorism
Wall Street Firms
Stock Market
Bankruptcy
SEPT_11
WAR
SECURITY
IRAQ
TERRORISM
NATION
KILLED
AFGHANISTAN
ATTACKS
OSAMA_BIN_LADEN
AMERICAN
ATTACK
NEW_YORK_REGION
NEW
MILITARY
NEW_YORK
WORLD
NATIONAL
QAEDA
TERRORIST_ATTACKS
WALL_STREET
ANALYSTS
INVESTORS
FIRM
GOLDMAN_SACHS
FIRMS
INVESTMENT
MERRILL_LYNCH
COMPANIES
SECURITIES
RESEARCH
STOCK
BUSINESS
ANALYST
WALL_STREET_FIRMS
SALOMON_SMITH_BARNEY
CLIENTS
INVESTMENT_BANKING
INVESTMENT_BANKERS
INVESTMENT_BANKS
WEEK
DOW_JONES
POINTS
10_YR_TREASURY_YIELD
PERCENT
CLOSE
NASDAQ_COMPOSITE
STANDARD_POOR
CHANGE
FRIDAY
DOW_INDUSTRIALS
GRAPH_TRACKS
EXPECTED
BILLION
NASDAQ_COMPOSITE_INDEX
EST_02
PHOTO_YESTERDAY
YEN
10
500_STOCK_INDEX
BANKRUPTCY
CREDITORS
BANKRUPTCY_PROTECTION
ASSETS
COMPANY
FILED
BANKRUPTCY_FILING
ENRON
BANKRUPTCY_COURT
KMART
CHAPTER_11
FILING
COOPER
BILLIONS
COMPANIES
BANKRUPTCY_PROCEEDINGS
DEBTS
RESTRUCTURING
CASE
GROUP
Experiment
• Study the following words for a memory test:
TOMATOES
CABBAGE
CARROTS
HAMMER
LETTUCE
SPINACH
SQUASH
BEANS
CORN
PEAS
Recall words.....
Result: outlier word “HAMMER” better remembered
Hunt & Lamb (2001, Exp. 1)

OUTLIER LIST: PEAS, CARROTS, BEANS, SPINACH, LETTUCE, HAMMER, TOMATOES, CORN, CABBAGE, SQUASH
CONTROL LIST: SAW, SCREW, CHISEL, DRILL, SANDPAPER, HAMMER, NAILS, BENCH, RULER, ANVIL

(Figure: DATA: probability of recall (0-1.0) for the target (HAMMER) and the background items, for the outlier list vs. the pure list)
Semantic Isolation/ Von Restorff Effects
• Verbal explanations:
– Attention, surprise, distinctiveness
• Our approach:
– Multiple ways to encode each word
• Isolated item stored verbatim – direct access route
• Related items stored by topics – gist route
Dual route topic model
(Figure: ENCODING and RETRIEVAL. The study list is encoded either through Topics (a distributed representation) or through a Word distribution (a localist representation); retrieval produces a reconstructed study list)
• Encoding involves choosing routes to store words; each word is represented by one route
• Recall involves a reconstructive process from stored information
(Figure: ENCODING: for study words PEAS, CARROTS, BEANS, SPINACH, LETTUCE, HAMMER, TOMATOES, CORN, CABBAGE, SQUASH, the topic probability (0-0.2) concentrates on a VEGETABLES topic rather than FURNITURE or TOOLS; the route probabilities (0-1.0) show HAMMER stored by the verbatim route and the other words by the topic route. RETRIEVAL: the verbatim word probability (0-0.02) is highest for HAMMER; the topic-route retrieval probability (0-0.15) is spread over the vegetable words)
Model Predictions
(Figure: DATA: probability of recall (0-1.0) for target vs. background items; PREDICTED: model probability of retrieval (0-0.05) for target vs. background; both shown for the outlier list and the pure list)
Interpretation of Von Restorff effects
• Contextually unique words are stored in verbatim route
• Contextually expected items stored by topics
• Reconstruction is better for words stored in verbatim route
Robinson & Roediger (1997): False memory effects

Study lists with 3, 6, or 9 associates of the nonstudied lure ANGER:
3 associates: MAD, FEAR, HATE, SMOOTH, NAVY, HEAT, SALAD, TUNE, COURTS, CANDY, PALACE, PLUSH, TOOTH, BLIND, WINTER
6 associates: MAD, FEAR, HATE, RAGE, TEMPER, FURY, SALAD, TUNE, COURTS, CANDY, PALACE, PLUSH, TOOTH, BLIND, WINTER
9 associates: MAD, FEAR, HATE, RAGE, TEMPER, FURY, WRATH, HAPPY, FIGHT, CANDY, PALACE, PLUSH, TOOTH, BLIND, WINTER

(lure = ANGER)

(Figure: DATA: probability of recall (0-1.0) vs. number of associates studied (3-15), for studied associates and the nonstudied lure; PREDICTED: model probability of retrieval (0-0.03) for studied items and the lure)
Relation to dual retrieval models
• Familiarity / Recollection distinction
– e.g., Atkinson & Juola (1972); Mandler (1980)
– Not clear how these retrieval processes relate to encoding routes
• Gist / Verbatim distinction
– e.g., Reyna & Brainerd (1995)
– Maps onto the topic and verbatim routes
• Differences:
– Our approach specifies both encoding and retrieval representations and processes
– Routes are not independent
– Model explains performance for actual word lists
Example: finding Dave Huber’s website

Works well when keywords are distinctive and match exactly:
Query: Dave Huber → Document: Dave Huber
Query: UCSD → Document: UCSD

Problem: related but non-matching keywords:
Query: Dave Huber → Document: Dave Huber
Query: Mathematical modeling → Document: ??

Challenge: how to match at both the word level and the conceptual level
Dual route model for information retrieval
• Encode documents with two routes:
– contextually unique words → verbatim route
– thematic words → topics route
Example encoding
Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99, 22-44.
alcove attention learning covering map is a
connectionist model of category learning that
incorporates an exemplar based representation d . l .
medin and m . m . schaffer 1978 r . m . nosofsky 1986
with error driven learning m . a . gluck and g . h . bower
1988 d . e . rumelhart et al 1986 . alcove selectively
attends to relevant stimulus dimensions is sensitive to
correlated dimensions can account for a form of base
rate neglect does not suffer catastrophic forgetting and
can exhibit 3 stage u shaped learning of high frequency
exceptions to rules whereas such effects are not easily
accounted for by models using other combinations of
representation and learning method .
Contextually unique words:
ALCOVE, SCHAFFER, MEDIN,
NOSOFSKY
Topic 1 (p=0.21): learning
phenomena acquisition learn
acquired ...
Topic 22 (p=0.17): similarity
objects object space category
dimensional categories spatial
Topic 61 (p=0.08): representations
representation order alternative 1st
higher 2nd descriptions problem
form
Retrieval Experiments
• Corpora:
– NIPS conference papers
• For each candidate document, calculate how likely the query was “generated” from the model’s encoding:

P( Query | Doc ) = ∏_{q ∈ Query} [ P( x=1 | Doc ) P( q | verbatim ) + P( x=0 | Doc ) P( q | topics ) ]
Retrieval Results with Short Queries
• NIPS corpus
• Queries consist of 2-4 words
• precision = hits / retrieved; recall = hits / relevant
(Figure: precision-recall curves for the dual route model, standard topics, and LSA)
Conclusions (1)
1) Memory as a reconstructive process
– Encoding
– Retrieval
2) Information retrieval as a reconstructive process
– What candidate document best “reconstructs” the
query?
Conclusions (2)
• Modeling multiple levels of organization in human
memory
• Information retrieval
– Matching of query to documents should operate at
multiple levels
Software
Public-domain MATLAB toolbox for topic modeling on the Web:
http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm
Retrieval Results with Long Queries
• Cranfield corpus
• Queries consist of sentences
• precision = hits / retrieved; recall = hits / relevant
(Figure: precision-recall curves for the dual route model, standard topics, and LSA)
Precision / Recall
(Figure: Venn diagram of all docs, the retrieved set, the relevant set, and their overlap (hits))

precision = hits / retrieved = hits / ( hits + false alarms )
recall = hits / relevant = hits / ( hits + misses ) = “hit rate”
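These definitions in code, over sets of document ids:

```python
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant             # relevant docs we retrieved
    fa = retrieved - relevant               # false alarms
    misses = relevant - retrieved           # relevant docs we missed
    precision = len(hits) / len(retrieved)  # hits / (hits + fa)
    recall = len(hits) / len(relevant)      # hits / (hits + misses), "hit rate"
    return precision, recall

p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)  # 0.5 0.4
```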
Gibbs Sampler Stability
Comparing topics from two runs
(Figure: KL distance matrix between topics from Run 1 and re-ordered topics from Run 2, roughly 100 topics per run)

BEST KL = .46
Run 1: ACCOUNT .041, ACCOUNTS .026, CASH .025, BALANCE .023, AMOUNT .022, COLUMN .021, BUSINESS .017, TOTAL .016, JOURNAL .016, CHECK .016, RECORDED .016, CREDIT .015, RECORD .015, GENERAL .014, PERIOD .014
Run 2: ACCOUNT .042, ACCOUNTS .028, BALANCE .025, CASH .025, AMOUNT .022, COLUMN .021, BUSINESS .018, CHECK .017, RECORD .015, TOTAL .015, ACCOUNTING .015, JOURNAL .015, PERIOD .015, CREDIT .014, RECORDED .013

WORST KL = 9.40
Run 1: MONEY .094, GOLD .044, POOR .034, FOUND .023, RICH .021, SILVER .020, HARD .019, DOLLARS .018, GIVE .016, WORTH .016, BUY .015, WORKED .014, LOST .013, SOON .013, PAY .013
Run 2: MONEY .086, PAY .033, BANK .027, INCOME .027, INTEREST .022, TAX .021, PAID .016, TAXES .016, BANKS .015, INSURANCE .015, AMOUNT .011, CREDIT .010, DOLLARS .010, COST .008, FUNDS .008
Extensions of Topic Models
• Combining topics and syntax
• Topic hierarchies
• Topic segmentation
– no need for document boundaries
• Modeling authors as well as documents
– who wrote this part of the paper?
Words can have high probability in
multiple topics
(Based on TASA corpus)
PRINTING
PAPER
PRINT
PRINTED
TYPE
PROCESS
INK
PRESS
IMAGE
PRINTER
PRINTS
PRINTERS
COPY
COPIES
FORM
OFFSET
GRAPHIC
SURFACE
PRODUCED
CHARACTERS
PLAY
PLAYS
STAGE
AUDIENCE
THEATER
ACTORS
DRAMA
SHAKESPEARE
ACTOR
THEATRE
PLAYWRIGHT
PERFORMANCE
DRAMATIC
COSTUMES
COMEDY
TRAGEDY
CHARACTERS
SCENES
OPERA
PERFORMED
TEAM
GAME
BASKETBALL
PLAYERS
PLAYER
PLAY
PLAYING
SOCCER
PLAYED
BALL
TEAMS
BASKET
FOOTBALL
SCORE
COURT
GAMES
TRY
COACH
GYM
SHOT
JUDGE
TRIAL
COURT
CASE
JURY
ACCUSED
GUILTY
DEFENDANT
JUSTICE
EVIDENCE
WITNESSES
CRIME
LAWYER
WITNESS
ATTORNEY
HEARING
INNOCENT
DEFENSE
CHARGE
CRIMINAL
HYPOTHESIS
EXPERIMENT
SCIENTIFIC
OBSERVATIONS
SCIENTISTS
EXPERIMENTS
SCIENTIST
EXPERIMENTAL
TEST
METHOD
HYPOTHESES
TESTED
EVIDENCE
BASED
OBSERVATION
SCIENCE
FACTS
DATA
RESULTS
EXPLANATION
STUDY
TEST
STUDYING
HOMEWORK
NEED
CLASS
MATH
TRY
TEACHER
WRITE
PLAN
ARITHMETIC
ASSIGNMENT
PLACE
STUDIED
CAREFULLY
DECIDE
IMPORTANT
NOTEBOOK
REVIEW
Problems with Spatial Representations

Violation of triangle inequality: for any points A, B, C in a metric space, AC ≤ AB + BC.

Can find associations that violate this:
SOCCER - FIELD - MAGNETIC
(SOCCER and FIELD are associated, FIELD and MAGNETIC are associated, but SOCCER and MAGNETIC are not)
No Triangle Inequality with Topics
(Figure: SOCCER and FIELD appear in TOPIC 1; FIELD and MAGNETIC appear in TOPIC 2)
Topic structure easily explains violations of the triangle inequality
Small-World Structure of Associations
(Steyvers & Tenenbaum, 2005)
(Figure: association network linking BAT, BALL, BASEBALL, GAME, PLAY, STAGE, THEATER)
• Properties:
1) Short path lengths
2) Clustering
3) Power-law degree distributions
• Small-world graphs arise elsewhere: internet, social relations, biology
Power law degree distributions in Semantic Networks
(Figure: log-log plots of P( k ) vs. k for the undirected associative network, Roget's Thesaurus, and WordNet)
Power law degree distribution → some words are “hubs” in a semantic network
Creating Association Networks
• TOPICS MODEL: connect i to j when P( w=j | w=i ) > threshold
• LSA: for each word, generate K associates by picking K nearest neighbors in semantic space
(Figure: P( k ) vs. k = #incoming links; the topics model yields a power law with exponent g = -2.05; LSA shown for d = 50, 200, 400 dimensions)
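A sketch of the thresholding construction for the topics model, using a random sparse topic-word matrix in place of one estimated from text (the sizes, Dirichlet parameter, and threshold are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical topic-word matrix phi: T topics x W words (random, sparse).
T, W = 50, 500
phi = rng.dirichlet(0.05 * np.ones(W), size=T)
p_z = np.full(T, 1.0 / T)  # uniform topic prior

# P(w=j | w=i) = sum_z P(j|z) P(z|i), with P(z|i) ∝ P(i|z) P(z)
p_z_given_w = phi * p_z[:, None]                  # T x W, unnormalized
p_z_given_w /= p_z_given_w.sum(axis=0, keepdims=True)
p_j_given_i = p_z_given_w.T @ phi                 # W x W: row i is P(. | w=i)

# Connect i -> j when P(w=j | w=i) exceeds a threshold; then look at
# the in-degree distribution of the resulting directed graph.
np.fill_diagonal(p_j_given_i, 0.0)
adj = p_j_given_i > 0.01
in_degree = adj.sum(axis=0)
print(in_degree.max(), in_degree.mean())
```

With sparse topics, a few words sit in many topics and collect many incoming links (the "hubs"), while most words collect few; plotting the in-degree histogram on log-log axes is how the power-law claim above would be checked.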
Associations in free recall
STUDY THESE WORDS:
Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket,
Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy
RECALL WORDS .....
FALSE RECALL: “Sleep” 61%
Recall as a reconstructive process
• Reconstruct study list based on the stored “gist”
• The gist can be represented by a distribution over topics
• Under a single-topic assumption:

P( w_{n+1} | w ) = Σ_z P( w_{n+1} | z ) P( z | w )

where w_{n+1} is the retrieved word and w is the study list.
Predictions for the “Sleep” list
(Figure: P( w_{n+1} | w ), from 0 to 0.2, for the STUDY LIST words BED, REST, TIRED, AWAKE, WAKE, NAP, DREAM, YAWN, DROWSY, BLANKET, SNORE, SLUMBER, PEACE, DOZE and for the top 8 EXTRA LIST words SLEEP, NIGHT, ASLEEP, MORNING, HOURS, SLEEPY, EYES, AWAKENED)
Hidden Markov Topic Model
(Griffiths, Steyvers, Blei, & Tenenbaum, 2004)
• Syntactic dependencies → short-range dependencies
• Semantic dependencies → long-range dependencies
(Figure: graphical model with topic mixture θ, topic assignments z1…z4, words w1…w4, and HMM states s1…s4)
• Semantic state: generate words from topic model
• Syntactic states: generate words from HMM
Transition between semantic state and syntactic states
(Figure: three-state HMM. The semantic state emits from topic z=1 (weight 0.4: HEART, LOVE, SOUL, TEARS, JOY, each 0.2) or topic z=2 (weight 0.6: SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, MATHEMATICS, each 0.2). One syntactic state emits OF 0.6, FOR 0.3, BETWEEN 0.1; another emits THE 0.6, A 0.3, MANY 0.1. Arrows carry transition probabilities 0.8, 0.7, 0.3, 0.1, 0.2, 0.9)
Combining topics and syntax
(Figure: the same HMM generating a sentence word by word: THE (syntactic state x=3), LOVE (semantic state x=1, topic z=1), OF (syntactic state x=2), RESEARCH (semantic state x=1, topic z=2), producing “THE LOVE OF RESEARCH …”)
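The generative walk can be sketched with the toy emission tables from the figure; the transition matrix below is an assumption chosen so that determiners tend to precede content words, not the talk's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(2)

# States: 1 = semantic (emit from a topic), 2 = "of/for" syntax, 3 = "the/a" syntax.
topics = {1: ["HEART", "LOVE", "SOUL", "TEARS", "JOY"],
          2: ["SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]}
theta = 0.4  # document mixture weight on topic z=1 (so 0.6 on z=2)
syntax = {2: (["OF", "FOR", "BETWEEN"], [0.6, 0.3, 0.1]),
          3: (["THE", "A", "MANY"], [0.6, 0.3, 0.1])}
# Assumed transition rows P(next state | state); not the slide's numbers.
trans = {1: [0.1, 0.5, 0.4], 2: [0.0, 0.1, 0.9], 3: [0.9, 0.1, 0.0]}

def generate(n_words, state=3):
    words = []
    for _ in range(n_words):
        if state == 1:  # semantic state: pick a topic, then a word (uniform here)
            z = 1 if rng.random() < theta else 2
            words.append(rng.choice(topics[z]))
        else:           # syntactic state: emit from the HMM word distribution
            ws, ps = syntax[state]
            words.append(rng.choice(ws, p=ps))
        state = rng.choice([1, 2, 3], p=trans[state])
    return words

print(" ".join(generate(6)))
```

Runs of this sketch produce strings like "THE LOVE OF RESEARCH", with function words supplied by the syntactic states and content words by the topics.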
Semantic topics
(each line lists one topic’s most probable words)

MAP, NORTH, EARTH, SOUTH, POLE, MAPS, EQUATOR, WEST, LINES, EAST, AUSTRALIA, GLOBE, POLES, HEMISPHERE, LATITUDE, PLACES, LAND, WORLD, COMPASS, CONTINENTS

FOOD, FOODS, BODY, NUTRIENTS, DIET, FAT, SUGAR, ENERGY, MILK, EATING, FRUITS, VEGETABLES, WEIGHT, FATS, NEEDS, CARBOHYDRATES, VITAMINS, CALORIES, PROTEIN, MINERALS

GOLD, IRON, SILVER, COPPER, METAL, METALS, STEEL, CLAY, LEAD, ADAM, ORE, ALUMINUM, MINERAL, MINE, STONE, MINERALS, POT, MINING, MINERS, TIN

CELLS, CELL, ORGANISMS, ALGAE, BACTERIA, MICROSCOPE, MEMBRANE, ORGANISM, FOOD, LIVING, FUNGI, MOLD, MATERIALS, NUCLEUS, CELLED, STRUCTURES, MATERIAL, STRUCTURE, GREEN, MOLDS

BEHAVIOR, SELF, INDIVIDUAL, PERSONALITY, RESPONSE, SOCIAL, EMOTIONAL, LEARNING, FEELINGS, PSYCHOLOGISTS, INDIVIDUALS, PSYCHOLOGICAL, EXPERIENCES, ENVIRONMENT, HUMAN, RESPONSES, BEHAVIORS, ATTITUDES, PSYCHOLOGY, PERSON

DOCTOR, PATIENT, HEALTH, HOSPITAL, MEDICAL, CARE, PATIENTS, NURSE, DOCTORS, MEDICINE, NURSING, TREATMENT, NURSES, PHYSICIAN, HOSPITALS, DR, SICK, ASSISTANT, EMERGENCY, PRACTICE

BOOK, BOOKS, READING, INFORMATION, LIBRARY, REPORT, PAGE, TITLE, SUBJECT, PAGES, GUIDE, WORDS, MATERIAL, ARTICLE, ARTICLES, WORD, FACTS, AUTHOR, REFERENCE, NOTE

PLANTS, PLANT, LEAVES, SEEDS, SOIL, ROOTS, FLOWERS, WATER, FOOD, GREEN, SEED, STEMS, FLOWER, STEM, LEAF, ANIMALS, ROOT, POLLEN, GROWING, GROW
Syntactic classes
(each line lists one syntactic class)

SAID, ASKED, THOUGHT, TOLD, SAYS, MEANS, CALLED, CRIED, SHOWS, ANSWERED, TELLS, REPLIED, SHOUTED, EXPLAINED, LAUGHED, MEANT, WROTE, SHOWED, BELIEVED, WHISPERED

THE, HIS, THEIR, YOUR, HER, ITS, MY, OUR, THIS, THESE, A, AN, THAT, NEW, THOSE, EACH, MR, ANY, MRS, ALL

MORE, SUCH, LESS, MUCH, KNOWN, JUST, BETTER, RATHER, GREATER, HIGHER, LARGER, LONGER, FASTER, EXACTLY, SMALLER, SOMETHING, BIGGER, FEWER, LOWER, ALMOST

ON, AT, INTO, FROM, WITH, THROUGH, OVER, AROUND, AGAINST, ACROSS, UPON, TOWARD, UNDER, ALONG, NEAR, BEHIND, OFF, ABOVE, DOWN, BEFORE

GOOD, SMALL, NEW, IMPORTANT, GREAT, LITTLE, LARGE, *, BIG, LONG, HIGH, DIFFERENT, SPECIAL, OLD, STRONG, YOUNG, COMMON, WHITE, SINGLE, CERTAIN

ONE, SOME, MANY, TWO, EACH, ALL, MOST, ANY, THREE, THIS, EVERY, SEVERAL, FOUR, FIVE, BOTH, TEN, SIX, MUCH, TWENTY, EIGHT

HE, YOU, THEY, I, SHE, WE, IT, PEOPLE, EVERYONE, OTHERS, SCIENTISTS, SOMEONE, WHO, NOBODY, ONE, SOMETHING, ANYONE, EVERYBODY, SOME, THEN

BE, MAKE, GET, HAVE, GO, TAKE, DO, FIND, USE, SEE, HELP, KEEP, GIVE, LOOK, COME, WORK, MOVE, LIVE, EAT, BECOME
NIPS Semantics
(each line lists one semantic topic from NIPS)

IMAGE, IMAGES, OBJECT, OBJECTS, FEATURE, RECOGNITION, VIEWS, #, PIXEL, VISUAL

DATA, GAUSSIAN, MIXTURE, LIKELIHOOD, POSTERIOR, PRIOR, DISTRIBUTION, EM, BAYESIAN, PARAMETERS

STATE, POLICY, VALUE, FUNCTION, ACTION, REINFORCEMENT, LEARNING, CLASSES, OPTIMAL, *

MEMBRANE, SYNAPTIC, CELL, *, CURRENT, DENDRITIC, POTENTIAL, NEURON, CONDUCTANCE, CHANNELS

EXPERTS, EXPERT, GATING, HME, ARCHITECTURE, MIXTURE, LEARNING, MIXTURES, FUNCTION, GATE

KERNEL, SUPPORT, VECTOR, SVM, KERNELS, #, SPACE, FUNCTION, MACHINES, SET

NETWORK, NEURAL, NETWORKS, OUTPUT, INPUT, TRAINING, INPUTS, WEIGHTS, #, OUTPUTS
NIPS Syntax
(each line lists one syntactic class from NIPS)

IN, WITH, FOR, ON, FROM, AT, USING, INTO, OVER, WITHIN

IS, WAS, HAS, BECOMES, DENOTES, BEING, REMAINS, REPRESENTS, EXISTS, SEEMS

SEE, SHOW, NOTE, CONSIDER, ASSUME, PRESENT, NEED, PROPOSE, DESCRIBE, SUGGEST

USED, TRAINED, OBTAINED, DESCRIBED, GIVEN, FOUND, PRESENTED, DEFINED, GENERATED, SHOWN

MODEL, ALGORITHM, SYSTEM, CASE, PROBLEM, NETWORK, METHOD, APPROACH, PAPER, PROCESS

HOWEVER, ALSO, THEN, THUS, THEREFORE, FIRST, HERE, NOW, HENCE, FINALLY

#, *, I, X, T, N, C, F, P
Random sentence generation
LANGUAGE:
[S] RESEARCHERS GIVE THE SPEECH
[S] THE SOUND FEEL NO LISTENERS
[S] WHICH WAS TO BE MEANING
[S] HER VOCABULARIES STOPPED WORDS
[S] HE EXPRESSLY WANTED THAT BETTER VOWEL
Topic Hierarchies
• In regular topic model, no relations between topics
• Nested Chinese Restaurant Process
– Blei, Griffiths, Jordan, Tenenbaum (2004)
– Learn hierarchical structure, as well as topics within structure
(Figure: tree of topics 1-7, with topic 1 at the root and topics 4-7 as leaves)
Example: Psych Review Abstracts
(Figure: learned topic hierarchy. Root: THE, OF, AND, TO, IN, A, IS. Second level: A, MODEL, MEMORY, FOR, MODELS, TASK, INFORMATION, RESULTS, ACCOUNT. Lower-level topics include: RESPONSE, STIMULUS, REINFORCEMENT, RECOGNITION, STIMULI, RECALL, CHOICE, CONDITIONING; SPEECH, READING, WORDS, MOVEMENT, MOTOR, VISUAL, WORD, SEMANTIC; ACTION, SOCIAL, SELF, EXPERIENCE, EMOTION, GOALS, EMOTIONAL, THINKING; SELF, SOCIAL, PSYCHOLOGY, RESEARCH, RISK, STRATEGIES, INTERPERSONAL, PERSONALITY, SAMPLING; GROUP, IQ, INTELLIGENCE, SOCIAL, RATIONAL, INDIVIDUAL, GROUPS, MEMBERS; SEX, EMOTIONS, GENDER, EMOTION, STRESS, WOMEN, HEALTH, HANDEDNESS; MOTION, VISUAL, SURFACE, BINOCULAR, RIVALRY, CONTOUR, DIRECTION, CONTOURS, SURFACES; DRUG, FOOD, BRAIN, AROUSAL, ACTIVATION, AFFECTIVE, HUNGER, EXTINCTION, PAIN; and further leaf topics whose words include REASONING, INFERENCE, JUDGMENT, PROBABILITIES, STATISTICAL; IMAGE, COLOR, MONOCULAR, LIGHTNESS, GIBSON, ORIENTATION, HOLOGRAPHIC; CONDITIONING, STRESS, EMOTIONAL, BEHAVIORAL, FEAR, STIMULATION, TOLERANCE, RESPONSES; ATTITUDE, CONSISTENCY, SITUATIONAL)
Generative Process
(Figure: the same topic hierarchy; each document chooses a path from the root to a leaf and mixes the topics along that path)
Other Slides
Markov chain Monte Carlo
• Sample from a Markov chain constructed to converge to the
target distribution
• Allows sampling from unnormalized posterior
• Can compute approximate statistics from intractable
distributions
• Gibbs sampling is one such method: construct the Markov chain from the conditional distributions
Enron email: two example topics (T=100)

TOPIC 10                          TOPIC 32
WORD            PROB.             WORD            PROB.
BUSH            0.0227            ANDERSEN        0.0241
LAY             0.0193            FIRM            0.0134
MR              0.0183            ACCOUNTING      0.0119
WHITE           0.0153            SEC             0.0065
ENRON           0.0150            SETTLEMENT      0.0062
HOUSE           0.0148            AUDIT           0.0054
PRESIDENT       0.0131            CORPORATE       0.0053
ADMINISTRATION  0.0115            FINANCIAL       0.0052
COMPANY         0.0090            JUSTICE         0.0052
ENERGY          0.0085            INFORMATION     0.0050

SENDER (TOPIC 10)         PROB.   SENDER (TOPIC 32)         PROB.
NELSON, KIMBERLY (ETS)    0.3608  HILTABRAND, LESLIE        0.1359
PALMER, SARAH             0.0997  WELLS, TORI L.            0.0865
DENNE, KAREN              0.0541  DUPREE, DIANNA            0.0825
HOTTE, STEVE              0.0340  ARMSTRONG, JULIE          0.0316
DUPREE, DIANNA            0.0282  DENNE, KAREN              0.0208
ARMSTRONG, JULIE          0.0222  SULLIVAN, LORA            0.0072
LOKEY, TEB                0.0194  [email protected]    0.0026
SULLIVAN, LORA            0.0073  WILSON, DANNY             0.0016
VILLARREAL, LILLIAN       0.0040  HU, SYLVIA                0.0013
BAGOT, NANCY              0.0026  MATHEWS, LEENA            0.0012
Enron email: two work-unrelated topics

TOPIC 38                TOPIC 25
WORD       PROB.        WORD       PROB.
TRAVEL     0.0161       NEWS       0.0245
ROUNDTRIP  0.0124       MAIL       0.0182
SAVE       0.0118       NYTIMES    0.0149
DEALS      0.0097       YORK       0.0128
HOTEL      0.0095       PAGE       0.0095
BOOK       0.0094       TIMES      0.0090
SALE       0.0089       HEADLINES  0.0079
FARES      0.0083       BUSH       0.0077
TRIP       0.0072       DELIVERY   0.0070
CITIES     0.0070       HTML       0.0068

SENDER (TOPIC 38)                   PROB.
TRAVELOCITY MEMBER SERVICES         0.0763
BESTFARES.COM HOT DEALS             0.0502
<[email protected]>      0.0315
LISTS.COOLVACATIONS.COM             0.0151
CHEAP TICKETS                       0.0111
EXPEDIA FARE TRACKER                0.0106
TRAVELOCITY.COM                     0.0096
[email protected]   0.0088
[email protected]             0.0066
LASTMINUTE.COM                      0.0051

SENDER (TOPIC 25)                   PROB.
THE NEW YORK TIMES DIRECT           0.3438
<[email protected]>          0.0104
THE ECONOMIST                       0.0029
@TIMES - INSIDE NYTIMES.COM         0.0015
[email protected]                 0.0011
AMAZON.COM DELIVERS BESTSELLERS     0.0009
NYTIMES.COM                         0.0009
HYATT, JERRY                        0.0008
NEWSLETTER_TEXT                     0.0008
CHRIS LONG                          0.0007
Using Topic Models for
Information Retrieval
Clusters v. Topics
One Cluster
Hidden Markov Models in Molecular Biology: New
Algorithms and Applications
Pierre Baldi, Yves Chauvin, Tim Hunkapiller, Marcella A. McClure
Hidden Markov Models (HMMs) can be applied to several important
problems in molecular biology. We introduce a new convergent learning
algorithm for HMMs that, unlike the classical Baum-Welch algorithm is
smooth and can be applied on-line or in batch mode, with or without the
usual Viterbi most likely path approximation. Left-right HMMs with
insertion and deletion states are then trained to represent several protein
families including immunoglobulins and kinases. In all cases, the models
derived capture all the important statistical properties of the families and
can be used efficiently in a number of important tasks such as multiple
alignment, motif detection, and classification.
Multiple Topics
[cluster 88]
[topic 10]
model data models time neural
figure state learning set parameters
network probability number
networks training function system
algorithm hidden markov
state hmm markov sequence models hidden states
probabilities sequences parameters transition probability
training hmms hybrid model likelihood modeling
[topic 37]
genetic structure chain protein population region
algorithms human mouse selection fitness proteins
search evolution generation function sequence sequences
genes
LSA
C = U D V^T
where C is the words × documents matrix, U is words × dims, D is dims × dims, and V^T is dims × documents.

Topic model
C = F Q
where C is the normalized co-occurrence matrix (words × documents), F holds the mixture components (words × topics), and Q holds the mixture weights (topics × documents).
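The two factorizations can be put side by side on a small random count matrix; the topic factors here are random Dirichlet draws, not fitted to the data, and serve only to show the shapes and constraints:

```python
import numpy as np

rng = np.random.default_rng(3)

# A small word-document count matrix C (words x documents).
C = rng.poisson(1.0, size=(12, 8)).astype(float)

# LSA: truncated SVD, C ≈ U D V^T. Entries of U can be negative,
# which is why the dimensions are hard to interpret.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 3
C_lsa = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Topic model: C = F Q with non-negative factors whose columns
# are probability distributions (each column sums to 1).
F = rng.dirichlet(np.ones(12), size=k).T   # words x topics, columns = P(w|z)
Q = rng.dirichlet(np.ones(k), size=8).T    # topics x documents, columns = P(z|d)
C_topic = F @ Q                            # words x documents; columns sum to 1

print(C_lsa.shape, np.allclose(C_topic.sum(axis=0), 1.0))
```

The non-negativity and sum-to-one constraints are what make F readable as topics, in contrast to the signed, orthogonal SVD basis.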
Documents as Topic Mixtures: a Geometric Interpretation
(Figure: the simplex P(word1) + P(word2) + P(word3) = 1; topic 1 and topic 2 are points on the simplex, and each observed document lies between them)
Generating Artificial Data: a visual example
• word → pixel; document → image
• 10 “topics” with 5 x 5 pixels
• sample each pixel from a mixture of topics
(Figure: a subset of generated documents, and the inferred topics after 0, 1, 5, 10, 50, and 100 iterations)
Interpretable decomposition
(Figure: basis images recovered by the topic model vs. SVD)
• SVD gives a basis for the data, but not an interpretable one
• The true basis is not orthogonal, so rotation does no good
Level 1
Dept of
Health and
Human
Services
(11609)
Level 2
National
Institutes of
Health
(11609)
Level 3
National Cancer Institute
(7574)
National Institute of Mental
Health (4035)
Behavioral and cognitive
sciences (BCS) (1469)
Social,
Behavioral,
and
Economic
Sciences
(SBE)
(4584)
National
Science
Foundation
(10580)
International science and
engineering (INT) (formerly
International cooperative
scientific activities) (1299)
Social and economic
sciences (SES) (1816)
Biological infrastructure
(BIR/DBI) (1061)
Environmental biology (DEB)
(1609)
Biological
Sciences
(BIO)
(5996)
Integrative biology and
neuroscience (IBN/BNS)
(1673)
Molecular and cellular
biosciences (MCB/DCB)
(1534)
Plant genome research (119)
Level 4
Cancer biology, detection and diagnosis
AIDS Research
Cancer Research Centers
Cancer causation
Cancer prevention and control
Cancer treatment
NCI research manpower development
AIDS Research
Extramural research
Intramural research
Archaeology, archeometry, and ...
Behavioral and cognitive sciences - Other
Child learning and development
Cultural anthropology
Environmental social and behavioral science
Geography and regional science
Human cognition and perception
Instrumentation
Linguistics
Physical anthropology
Social psychology
Africa, Near East, and South Asia
Americas
Central and Eastern Europe
East Asia and Pacific
International activities - Other
Japan and Korea
Western Europe
Decision, risk, and management science
Methodology, measures, and statistics
Economics
Ethics and values studies
Innovation and organizational change
Law and social science
Political science
Research on science and technology
Science and technology studies
Social and economic sciences - Other
Sociology
Transformations to quality organizations
Biological infrastructure - Other
Human resources
Instrumentation
Research resources
Ecological studies
Environmental biology - Other
Systematic & population biology
Developmental mechanisms
Integrative biology and neuroscience - Other
Neuroscience
Neuroscience
Physiology and ethology
Biochemical and biomolecular processes
Biomolecular structure & function
Cell biology
Genetics
Molecular and cellular biosciences - Other
Plant genome research project
Similarity between 55 funding programs
(Figure: pairwise similarity matrix over the funding programs)
Summary
INPUT: word-document counts (word order is irrelevant)
OUTPUT:
– topic assignments to each word: P( z_i )
– likely words in each topic: P( w | z )
– likely topics in each document (“gist”): P( z | d )
Gibbs Sampling

Let n_td be the count of topic t assigned to document d, and n_wt the count of word w assigned to topic t (both excluding the current word i). The probability that word i is assigned to topic t is

p( z_i = t | z_{-i} )  ∝  ( n_{td}^{-i} + α ) / ( Σ_{t'} n_{t'd}^{-i} + Tα )  ·  ( n_{wt}^{-i} + β ) / ( Σ_{w'} n_{w't}^{-i} + Wβ )
Prior Distributions

• Dirichlet priors encourage sparsity on topic mixtures and topics

θ ~ Dirichlet( α )   (topic mixture per document)
φ ~ Dirichlet( β )   (word distribution per topic)

[Figure: probability simplices over (Topic 1, Topic 2, Topic 3) and (Word 1, Word 2, Word 3); darker colors indicate lower probability]
Example Topics extracted from
NIH/NSF grants
Topic: brain, fmri, imaging, functional, mri, subjects, magnetic, resonance, neuroimaging, structural
Topic: schizophrenia, patients, deficits, schizophrenic, psychosis, subjects, psychotic, dysfunction, abnormalities, clinical
Topic: memory, working, memories, tasks, retrieval, encoding, cognitive, processing, recognition, performance
Topic: disease, ad, alzheimer, diabetes, cardiovascular, insulin, vascular, blood, clinical, individuals
Important point: these distributions are learned in a completely
automated “unsupervised” fashion from the data
Extracting Topics from Email
Enron Email
250,000 emails
28,000 authors
1999-2002
Topic: TEXANS, WIN, FOOTBALL, FANTASY, SPORTSLINE, PLAY, TEAM, GAME, SPORTS, GAMES
Topic: FERC, MARKET, ISO, COMMISSION, ORDER, FILING, COMMENTS, PRICE, CALIFORNIA, FILED
Topic: GOD, LIFE, MAN, PEOPLE, CHRIST, FAITH, LORD, JESUS, SPIRITUAL, VISIT
Topic: POWER, CALIFORNIA, ELECTRICITY, UTILITIES, PRICES, MARKET, PRICE, UTILITY, CUSTOMERS, ELECTRIC
Topic: TRAVEL, ROUNDTRIP, SAVE, DEALS, HOTEL, BOOK, SALE, FARES, TRIP, CITIES
Topic: STATE, PLAN, CALIFORNIA, DAVIS, RATE, BANKRUPTCY, SOCAL, POWER, BONDS, MOU
Topic trends from New York Times
(330,000 articles, 2000-2002)

Tour-de-France: TOUR, RIDER, LANCE_ARMSTRONG, TEAM, BIKE, RACE, FRANCE
Quarterly Earnings: COMPANY, QUARTER, PERCENT, ANALYST, SHARE, SALES, EARNING
Anthrax: ANTHRAX, LETTER, MAIL, WORKER, OFFICE, SPORES, POSTAL, BUILDING

[Figure: topic frequency over time, Jan 2000 - Jan 2003, one panel per topic]
Topic trends in NIPS conference
Neural networks on the decline:
LAYER, NET, NEURAL, LAYERS, NETS, ARCHITECTURE, NUMBER, FEEDFORWARD, SINGLE

... while SVMs become more popular:
KERNEL, SUPPORT, VECTOR, MARGIN, SVM, KERNELS, SPACE, DATA, MACHINES

[Figure: proportion of words per year, 1986-2000, for each of the two topics]
Finding Funding Overlap
• Analyze 22,000+ grants active during 2002
• 55 funding programs from NSF and NIH
– Focus on behavioral sciences
• Questions of interest:
– What are the areas of overlap between funding
programs?
– What topics are funded?
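One way to measure overlap between two funding programs is to compare their topic distributions, e.g., the mean P(z|d) over each program's grants. The slides do not specify a distance measure, so this sketch uses symmetrized KL divergence as one reasonable choice, with made-up three-topic distributions:

```python
import math

def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between two topic distributions."""
    def kl(a, b):
        return sum(x * math.log((x + eps) / (y + eps)) for x, y in zip(a, b))
    return 0.5 * (kl(p, q) + kl(q, p))

# program -> mean P(z|d) over its grants (toy 3-topic example, assumed values)
programs = {
    "Neuroscience": [0.7, 0.2, 0.1],
    "Cell biology": [0.6, 0.3, 0.1],
    "Economics":    [0.1, 0.1, 0.8],
}
pairs = sorted((sym_kl(p, q), a, b)
               for a, p in programs.items()
               for b, q in programs.items() if a < b)
# pairs[0] is the most similar (lowest-divergence) pair of programs
```

With these toy distributions the two biology-flavored programs come out closest, which is the kind of structure the 2D visualization below the heading displays.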
2D visualization of funding programs – nearby programs support similar topics

[Figure: 2D map of the 55 funding programs, grouped into clusters. NSF-SBE international activities (INT: Japan and Korea; Africa, Near East, and South Asia; Central and Eastern Europe; East Asia and Pacific; Americas; Western Europe; International activities - Other). NSF-BIO (DEB: Ecological studies, Systematic & population biology, Environmental biology - Other; MCB: Molecular and cellular biosciences - Other, Biomolecular structure & function, Cell biology, Genetics, Biochemical and biomolecular processes; BIR: Human resources, Research resources, Instrumentation, Biological infrastructure - Other; IBN: Physiology and ethology, Developmental mechanisms, Neuroscience, Integrative biology and neuroscience - Other; PGR: Plant genome research project). NSF-SBE social and behavioral sciences (BCS: Archaeology, archeometry, and ...; Geography and regional science; Environmental social and behavioral science; Physical anthropology; Cultural anthropology; Linguistics; Human cognition and perception; Child learning and development; Social psychology; Instrumentation; Behavioral and cognitive sciences - Other; SES: Science and technology studies; Ethics and values studies; Research on science and technology; Methodology, measures, and statistics; Political science; Sociology; Law and social science; Decision, risk, and management science; Economics; Innovation and organizational change; Transformations to quality organizations; Social and economic sciences - Other). NIH (NIMH: Extramural research, Intramural research, AIDS Research; NCI: Cancer prevention and control, Research manpower development, Cancer Research Centers, Cancer biology, detection and diagnosis, Cancer causation, Cancer treatment, AIDS Research)]
Funding Amounts per Topic
• We have $ funding per grant
• We have distribution of topics for each grant
• Solve for the $ amount per topic
→ What are the expensive topics?
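The dollars-per-topic computation described above amounts to splitting each grant's funding across topics in proportion to its topic mixture P(z|d). A minimal sketch, with made-up grant amounts and mixtures:

```python
def dollars_per_topic(funding, mixtures, T):
    """Total $ per topic: each grant's amount weighted by its P(z|d)."""
    totals = [0.0] * T
    for amount, theta in zip(funding, mixtures):
        for t in range(T):
            totals[t] += amount * theta[t]
    return totals

funding = [100_000, 50_000]           # $ per grant (illustrative)
mixtures = [[0.8, 0.2], [0.1, 0.9]]   # P(z|d) per grant (illustrative)
totals = dollars_per_topic(funding, mixtures, T=2)
# topic 0 receives 100000*0.8 + 50000*0.1 = 85000 dollars
```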
High $$$ topics

Funding %   Interpretation
3.47        research center
2.87        cancer control
2.26        mental health services
2.01        clinical treatment
1.87        cancer
1.73        gene sequencing
1.61        risk factors
1.56        children/parents
1.51        tumors
1.48        training program
1.47        immunology
1.43        disorders
1.40        patient treatment
Low $$$ topics

Funding %   Interpretation
0.60        conference/meetings
0.56        theory
0.55        public policy
0.55        collaborative projects
0.55        marine environment
0.55        decision making
0.55        ecological diversity
0.53        sexual behavior
0.52        markets
0.51        science/technology
0.49        computer systems
0.45        language
0.44        archaeology
Basic Problems for Semantic Memory

• Prediction: P( w2 | w1 )
  – what fact, concept, or word is next?
• Gist extraction: P( z | w )
  – what is this set of words about?
• Disambiguation: P( z | w, context )
  – what is the sense of this word?
Topics from an educational corpus
• 37K docs, 26K word vocabulary
• 1700 topics, e.g.:
Topic: PRINTING, PAPER, PRINT, PRINTED, TYPE, PROCESS, INK, PRESS, IMAGE, PRINTER, PRINTS, PRINTERS, COPY, COPIES, FORM, OFFSET, GRAPHIC, SURFACE, PRODUCED, CHARACTERS
Topic: PLAY, PLAYS, STAGE, AUDIENCE, THEATER, ACTORS, DRAMA, SHAKESPEARE, ACTOR, THEATRE, PLAYWRIGHT, PERFORMANCE, DRAMATIC, COSTUMES, COMEDY, TRAGEDY, CHARACTERS, SCENES, OPERA, PERFORMED
Topic: TEAM, GAME, BASKETBALL, PLAYERS, PLAYER, PLAY, PLAYING, SOCCER, PLAYED, BALL, TEAMS, BASKET, FOOTBALL, SCORE, COURT, GAMES, TRY, COACH, GYM, SHOT
Topic: JUDGE, TRIAL, COURT, CASE, JURY, ACCUSED, GUILTY, DEFENDANT, JUSTICE, EVIDENCE, WITNESSES, CRIME, LAWYER, WITNESS, ATTORNEY, HEARING, INNOCENT, DEFENSE, CHARGE, CRIMINAL
Topic: HYPOTHESIS, EXPERIMENT, SCIENTIFIC, OBSERVATIONS, SCIENTISTS, EXPERIMENTS, SCIENTIST, EXPERIMENTAL, TEST, METHOD, HYPOTHESES, TESTED, EVIDENCE, BASED, OBSERVATION, SCIENCE, FACTS, DATA, RESULTS, EXPLANATION
Topic: STUDY, TEST, STUDYING, HOMEWORK, NEED, CLASS, MATH, TRY, TEACHER, WRITE, PLAN, ARITHMETIC, ASSIGNMENT, PLACE, STUDIED, CAREFULLY, DECIDE, IMPORTANT, NOTEBOOK, REVIEW
Same words in different topics represent
different contextual usages
(the six topics above repeated: note that PLAY appears in both the theater and sports topics, COURT in the sports and law topics, TEST in the science and school topics, and CHARACTERS in the printing and theater topics)
Three documents with the word “play”
(bracketed numbers indicate topic assignments)

A Play[082] is written[082] to be performed[082] on a stage[082] before a live[093] audience[082] or before motion[270] picture[004] or television[004] cameras[004] (for later[054] viewing[004] by large[202] audiences[082]). A Play[082] is written[082] because playwrights[082] have something ...

He was listening[077] to music[077] coming[009] from a passing[043] riverboat. The music[077] had already captured[006] his heart[157] as well as his ear[119]. It was jazz[077]. Bix Beiderbecke had already had music[077] lessons[077]. He wanted[268] to play[077] the cornet. And he wanted[268] to play[077] jazz[077]...

Jim[296] plays[166] the game[166]. Jim[296] likes[081] the game[166] for one. The game[166] book[254] helps[081] Jim[296]. Don[180] comes[040] into the house[038]. Don[180] and Jim[296] read[254] the game[166] book[254]. The boys[020] see a game[166] for two. The two boys[020] play[166] the game[166]....
Example of generating words

Topics:
  Topic 1: MONEY, LOAN, BANK
  Topic 2: RIVER, STREAM, BANK

Documents and topic assignments (bracketed numbers):

Mixture 1.0 on topic 1:
  MONEY[1] BANK[1] BANK[1] LOAN[1] BANK[1] MONEY[1] BANK[1] MONEY[1] BANK[1] LOAN[1] LOAN[1] BANK[1] MONEY[1] ....

Mixture .6 on topic 1, .4 on topic 2:
  RIVER[2] MONEY[1] BANK[2] STREAM[2] BANK[2] BANK[1] MONEY[1] RIVER[2] MONEY[1] BANK[2] LOAN[1] MONEY[1] ....

Mixture 1.0 on topic 2:
  RIVER[2] BANK[2] STREAM[2] BANK[2] RIVER[2] BANK[2] ....
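The two-step generative process on this toy example can be sketched directly. The word probabilities within each topic are assumptions (the slide shows only which words belong to each topic):

```python
import random

# assumed P(w|z) for the two toy topics
topics = {
    1: {"MONEY": 0.4, "LOAN": 0.2, "BANK": 0.4},
    2: {"RIVER": 0.4, "STREAM": 0.2, "BANK": 0.4},
}

def generate(mixture, n_words, rng):
    """For each word slot: sample a topic from the document's mixture,
    then sample a word from that topic's distribution."""
    doc = []
    for _ in range(n_words):
        t = rng.choices(list(mixture), weights=list(mixture.values()))[0]
        dist = topics[t]
        w = rng.choices(list(dist), weights=list(dist.values()))[0]
        doc.append((w, t))   # keep the topic assignment alongside the word
    return doc

rng = random.Random(0)
doc1 = generate({1: 1.0}, 10, rng)          # only MONEY/LOAN/BANK can appear
doc2 = generate({1: 0.6, 2: 0.4}, 10, rng)  # mixed document
```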
Statistical Inference

Given only the observed documents, the topics, the topic mixtures, and the per-word topic assignments (marked "?") must all be inferred:

  MONEY? BANK? BANK? LOAN? BANK? MONEY? BANK? MONEY? BANK? LOAN? LOAN? BANK? MONEY? ....
  RIVER? MONEY? BANK? STREAM? BANK? BANK? MONEY? RIVER? MONEY? BANK? LOAN? MONEY? ....
  RIVER? BANK? STREAM? BANK? RIVER? BANK? ....
Two approaches to semantic representation

Semantic networks: How are these learned?
Semantic Spaces: Can be learned (LSA), but is this representation flexible enough?

[Figure: a semantic network linking BAT, BALL, GAME, FUN, PLAY, THEATER, STAGE, LOAN, CASH, MONEY, BANK, RIVER, STREAM]
Latent Semantic Analysis
(Landauer & Dumais, 1997)

word-document counts  --SVD-->  high-dimensional space

• Each word is a single point in semantic space -- nearby words co-occur often with other words
• Words with multiple meanings have only one representation and have to be near all words related to their different senses; this leads to distortions in similarity

[Figure: example words BALL, GAME, PLAY, THEATER, STAGE placed in the space]
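The LSA pipeline sketched above (word-document counts, then SVD) fits in a few lines. The toy counts below are assumptions chosen so that PLAY co-occurs with both the sports and the theater words, illustrating that each word gets only a single vector:

```python
import numpy as np

words = ["BALL", "GAME", "PLAY", "THEATER", "STAGE"]
# columns: two sports documents, then two theater documents (toy counts)
X = np.array([[4, 3, 0, 0],   # BALL
              [5, 4, 0, 0],   # GAME
              [3, 2, 2, 3],   # PLAY occurs in both kinds of document
              [0, 0, 4, 5],   # THEATER
              [0, 0, 3, 4]])  # STAGE
U, s, Vt = np.linalg.svd(X, full_matrices=False)
vecs = U[:, :2] * s[:2]       # 2-dimensional word vectors

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

idx = {w: k for k, w in enumerate(words)}
# PLAY's single vector must sit between the sports and theater clusters
sim_game = cos(vecs[idx["PLAY"]], vecs[idx["GAME"]])
sim_stage = cos(vecs[idx["PLAY"]], vecs[idx["STAGE"]])
```

Because PLAY has one vector pulled toward both clusters, it cannot be maximally close to either sense at once: that is the similarity distortion the slide describes.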