
Probabilistic Models
of Relational Data
Daphne Koller
Stanford University
Joint work with:
Ben Taskar
Pieter Abbeel
Lise Getoor
Eran Segal
Nir Friedman
Avi Pfeffer
Ming-Fai Wong

Why Relational?
- The real world is composed of objects that have properties and are related to each other
- Natural language is all about objects and how they relate to each other
  "George got an A in Geography 101"

Attribute-Based Worlds
"Smart students get A's in easy classes":
Smart_Jane & easy_CS101 → GetA_Jane_CS101
Smart_Jane & easy_Geo101 → GetA_Jane_Geo101
Smart_Mike & easy_Geo101 → GetA_Mike_Geo101
Smart_Rick & easy_CS221 → GetA_Rick_CS221

World = assignment of values to attributes / truth values to propositional symbols

Object-Relational Worlds
∀x,y (Smart(x) & Easy(y) & Take(x,y) → Grade(A,x,y))

World = relational interpretation:
- Objects in the domain
- Properties of these objects
- Relations (links) between objects

Why Probabilities?
- All universals are false (almost): "Smart students get A's in easy classes" (a single C student falsifies it)
- True universals are rarely useful: "Smart students get either A, B, C, D, or F"

"The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful … Therefore the true logic for this world is the calculus of probabilities …"
James Clerk Maxwell

Probable Worlds
Probabilistic semantics:
- A set of possible worlds
- Each world associated with a probability

[Figure: the twelve possible worlds over course difficulty (easy/hard), student intelligence (smart/weak), and grade (A/B/C), each carrying a probability. The epistemic state is probabilistic rather than categorical.]

Representation: Design Axes

World state   Categorical                               Probabilistic
Attributes    Propositional logic, CSPs                 Bayesian nets, Markov nets
Sequences     Automata, Grammars                        n-gram models, HMMs, Prob. CFGs
Objects       First-order logic, Relational databases   (the subject of this talk)

Outline
- Bayesian Networks: representation & semantics; reasoning
- Probabilistic Relational Models
- Collective Classification
- Undirected Discriminative Models
- Collective Classification Revisited
- PRMs for NLP

Bayesian Networks
nodes = variables; edges = direct influence

[Figure: the student network with nodes Difficulty, Intelligence, Grade, Letter, SAT; the CPD P(G|D,I) gives a distribution over grades A/B/C for each combination of easy/hard and low/high.]

Graph structure encodes independence assumptions:
Letter is conditionally independent of Intelligence given Grade.

BN Semantics
local independencies in BN structure + conditional probability models = full joint distribution over domain

P(d, i, g, l, s) = P(d) P(i) P(g | d, i) P(l | g) P(s | i)

Compact & natural representation:
- if nodes have ≤ k parents, 2^k · n parameters suffice vs. 2^n for the full joint
- parameters are natural and easy to elicit

Reasoning Using BNs
"Probability theory is nothing but common sense reduced to calculation."
Pierre Simon Laplace

The full joint distribution specifies the answer to any query:
P(variable | evidence about others)

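A minimal Python sketch of how the factored joint answers such queries; the CPD numbers are invented for illustration, not taken from the talk:

    import itertools

    # Hypothetical CPDs for the student network (all numbers illustrative).
    P_d = {'easy': 0.6, 'hard': 0.4}
    P_i = {'low': 0.7, 'high': 0.3}
    P_g = {  # P(g | d, i)
        ('easy', 'low'):  {'A': 0.3,  'B': 0.4,  'C': 0.3},
        ('easy', 'high'): {'A': 0.9,  'B': 0.08, 'C': 0.02},
        ('hard', 'low'):  {'A': 0.05, 'B': 0.25, 'C': 0.7},
        ('hard', 'high'): {'A': 0.5,  'B': 0.3,  'C': 0.2},
    }
    P_l = {'A': {'strong': 0.9, 'weak': 0.1},
           'B': {'strong': 0.6, 'weak': 0.4},
           'C': {'strong': 0.1, 'weak': 0.9}}
    P_s = {'low':  {'good': 0.05, 'bad': 0.95},
           'high': {'good': 0.8,  'bad': 0.2}}

    def joint(d, i, g, l, s):
        # P(d,i,g,l,s) = P(d) P(i) P(g|d,i) P(l|g) P(s|i)
        return P_d[d] * P_i[i] * P_g[(d, i)][g] * P_l[g][l] * P_s[i][s]

    # Any query is a ratio of sums over the joint, e.g. P(i = high | l = strong):
    num = den = 0.0
    for d, i, g, l, s in itertools.product(P_d, P_i, 'ABC',
                                           ('strong', 'weak'), ('good', 'bad')):
        p = joint(d, i, g, l, s)
        if l == 'strong':
            den += p
            if i == 'high':
                num += p
    print('P(i=high | l=strong) =', num / den)
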
BN Inference
- BN inference is NP-hard
- But inference can exploit graph structure:
  - Graph separation implies conditional independence
  - Do separate inference in the parts; combine the results over the interface
  - Complexity: exponential in the largest separator
- Structured BNs allow effective inference
- Exact inference in dense BNs is intractable

Approximate BN Inference
- Belief propagation is an iterative message-passing algorithm for approximate inference in BNs
- Each iteration (until "convergence"): nodes pass "beliefs" as messages to neighboring nodes
- Pros:
  - Linear time per iteration
  - Works very well in practice, even for dense networks
- Cons:
  - Limited theoretical guarantees
  - Might not converge

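A sketch of loopy belief propagation on a small pairwise model with a cycle; the graph, potentials, and evidence are all invented for illustration:

    import numpy as np

    # Binary nodes; each edge carries an illustrative compatibility matrix.
    nodes = ['A', 'B', 'C', 'D']
    edges = {('A', 'B'): np.array([[1.0, 0.5], [0.5, 1.0]]),
             ('B', 'C'): np.array([[1.0, 0.2], [0.2, 1.0]]),
             ('C', 'D'): np.array([[1.0, 0.5], [0.5, 1.0]]),
             ('D', 'A'): np.array([[1.0, 0.2], [0.2, 1.0]])}  # closes a loop
    unary = {n: np.ones(2) for n in nodes}
    unary['A'] = np.array([0.9, 0.1])            # evidence on A

    # msgs[(i, j)] = message from i to j, initialised uniform.
    msgs = {}
    for (i, j) in edges:
        msgs[(i, j)] = np.ones(2)
        msgs[(j, i)] = np.ones(2)

    def neighbors(n):
        return [j for (i, j) in msgs if i == n]

    for _ in range(50):                          # iterate "until convergence"
        new = {}
        for (i, j) in msgs:
            pot = edges[(i, j)] if (i, j) in edges else edges[(j, i)].T
            b = unary[i].copy()                  # product of incoming messages,
            for k in neighbors(i):               # excluding the one from j
                if k != j:
                    b *= msgs[(k, i)]
            m = pot.T @ b                        # sum-product update
            new[(i, j)] = m / m.sum()            # normalise for stability
        msgs = new

    for n in nodes:                              # final approximate beliefs
        b = unary[n].copy()
        for k in neighbors(n):
            b *= msgs[(k, n)]
        print(n, b / b.sum())
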
Outline
- Bayesian Networks
- Probabilistic Relational Models: language & semantics; web of influence
- Collective Classification
- Undirected Discriminative Models
- Collective Classification Revisited
- PRMs for NLP

Bayesian Networks: Problem
- Bayesian nets use a propositional representation
- The real world has objects, related to each other

[Figure: ground variables Intell_Jane, Intell_George, Diffic_CS101, Diffic_Geo101, Grade_Jane_CS101 (A), Grade_George_CS101 (C), Grade_George_Geo101. These "instances" are not independent.]

Probabilistic Relational Models
Combine advantages of relational logic & BNs:
- Natural domain modeling: objects, properties, relations
- Generalization over a variety of situations
- Compact, natural probability models

Integrate uncertainty with the relational model:
- Properties of domain entities can depend on properties of related entities
- Uncertainty over the relational structure of the domain

St. Nordaf University
[Figure: a concrete relational world. Prof. Smith and Prof. Jones (Teaching-ability) teach Geo101 and CS101 (Difficulty); George and Jane (Intelligence) are registered in these courses, and each Registration carries a Grade and a Satisfaction.]

Relational Schema
Specifies the types of objects in the domain, the attributes of each type of object, and the types of relations between objects.
- Professor: Teaching-Ability
- Student: Intelligence
- Course: Difficulty
- Registration: Grade, Satisfaction
- Relations: Professor Teaches Course; Student Takes Registration; Registration In Course

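One way to picture this schema in code; a hypothetical rendering in Python dataclasses, not the notation of any particular PRM toolkit:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Professor:
        teaching_ability: Optional[str] = None   # attribute

    @dataclass
    class Student:
        intelligence: Optional[str] = None

    @dataclass
    class Course:
        difficulty: Optional[str] = None
        taught_by: Optional[Professor] = None    # Teach relation

    @dataclass
    class Registration:                          # Take (student) + In (course)
        student: Student
        course: Course
        grade: Optional[str] = None
        satisfaction: Optional[str] = None

    # e.g. Registration(Student('high'), Course('easy', Professor('high')), 'A')
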
Probabilistic Relational Models
- Universals: probabilistic patterns hold for all objects in a class
- Locality: represent direct probabilistic dependencies; links define potential interactions

[Figure: the template dependency. Reg.Grade depends on Student.Intelligence and Course.Difficulty through a single shared CPD over A/B/C for each combination of easy/hard and low/high.]
[K. & Pfeffer; Poole; Ngo & Haddawy]

PRM Semantics
Instantiated PRM → BN
- variables: the attributes of all objects
- dependencies: determined by the links & the PRM

[Figure: the ground BN for St. Nordaf: Teaching-ability of Profs. Smith and Jones, Difficulty of Geo101 and CS101, Intelligence of George and Jane, and the Grade/Satisfac of each registration.]

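A sketch of the grounding step: given objects and links, emit one ground variable per attribute and wire parents according to the template. The objects, link structure, and the choice of Satisfaction depending on Grade are illustrative assumptions:

    # Ground a toy PRM into BN nodes and edges.
    # Template: Reg.Grade depends on the student's Intelligence and the
    # course's Difficulty; Satisfaction depends on Grade (one plausible template).
    students = ['George', 'Jane']
    courses = {'Geo101': 'Smith', 'CS101': 'Jones'}      # course -> professor
    regs = [('George', 'Geo101'), ('George', 'CS101'), ('Jane', 'CS101')]

    nodes, edges = [], []
    nodes += [f'Intelligence({s})' for s in students]
    nodes += [f'Difficulty({c})' for c in courses]
    nodes += [f'Teaching-ability({p})' for p in set(courses.values())]
    for s, c in regs:
        g, sat = f'Grade({s},{c})', f'Satisfac({s},{c})'
        nodes += [g, sat]
        edges += [(f'Intelligence({s})', g), (f'Difficulty({c})', g), (g, sat)]

    print(len(nodes), 'ground variables;', len(edges), 'dependencies')
    for e in edges:
        print(*e, sep=' -> ')
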
The Web of Influence
[Figure: observed grades (e.g., a C in CS101, an A in Geo101) propagate through the ground network, shifting beliefs about course difficulty (easy/hard) and student intelligence (low/high) away from 50/50.]

Outline
- Bayesian Networks
- Probabilistic Relational Models
- Collective Classification & Clustering: learning models from data; collective classification of webpages
- Undirected Discriminative Models
- Collective Classification Revisited
- PRMs for NLP

Learning PRMs
[Figure: a relational database (Student, Course, Reg tables) plus expert knowledge feed the learner, which outputs a PRM.]
[Friedman, Getoor, K., Pfeffer]

Learning PRMs
Parameter estimation:
- Probabilistic model with shared parameters: the grades of all students share the same model
- Can use standard techniques for maximum-likelihood or Bayesian parameter estimation

P̂(Reg.Grade = A | Student.Intell = hi, Course.Diff = lo) =
  #(Reg.Grade = A, Student.Intell = hi, Course.Diff = lo) / #(Reg.Grade = *, Student.Intell = hi, Course.Diff = lo)

Structure learning:
- Define a scoring function over structures
- Use combinatorial search to find a high-scoring structure

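Maximum-likelihood estimation here is just counting over the joined tables; a tiny sketch with invented records:

    # (grade, intelligence, difficulty) per registration, after joining
    # Reg with Student and Course; the data are made up for illustration.
    rows = [('A', 'hi', 'lo'), ('A', 'hi', 'lo'), ('B', 'hi', 'lo'),
            ('C', 'lo', 'hi'), ('B', 'lo', 'lo'), ('A', 'hi', 'hi')]

    num = sum(1 for g, i, d in rows if (g, i, d) == ('A', 'hi', 'lo'))
    den = sum(1 for g, i, d in rows if (i, d) == ('hi', 'lo'))
    print('P^(Grade=A | Intell=hi, Diff=lo) =', num / den)   # 2/3 here
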
Web → KB
[Figure: web pages mapped to a knowledge base of typed objects and links: Tom Mitchell (Professor), the WebKB Project, and Sean Slattery (Student), connected by Member, Project-of, and Advisor-of links.]
[Craven et al.]

Web Classification Experiments
WebKB dataset:
- Four CS department websites
- Bag of words on each page
- Links between pages
- Anchor text for links

Experimental setup:
- Train on three universities, test on the fourth
- Repeat for all four combinations

Standard Classification
Naïve Bayes: predict Page.Category from Word1 … WordN (bag of words such as "department", "extract", "information", "computer", "science", "machine", "learning", …).
Categories: faculty, course, project, student, other.
[Chart: test error of the words-only baseline.]

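The words-only baseline is a standard bag-of-words Naïve Bayes classifier; a sketch with scikit-learn, using toy stand-ins for the WebKB pages:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Invented mini-pages; labels are the page categories.
    pages = ['machine learning course syllabus homework',
             'professor research publications department',
             'student homepage advisor research project']
    labels = ['course', 'faculty', 'student']

    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(pages, labels)
    print(clf.predict(['research project homepage']))
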
Exploiting Links
Add the words on incoming links (e.g., "working with Tom Mitchell") as extra evidence: Word1 … WordN plus LinkWord1 … LinkWordN.
[Chart: error of words only vs. words + link words.]

Collective Classification
Model the Category of each link's From-Page and To-Page jointly, together with the link's Exists variable and the word evidence. Classify all pages collectively, maximizing the joint label probability.
Approximate inference: belief propagation.
[Getoor, Segal, Taskar, Koller]
[Chart: error of words only vs. link words vs. collective.]

Learning w. Missing Data: EM [Dempster et al. 77]
[Figure: EM alternates between inferring the hidden Course.Difficulty (easy/hard) and Student.Intelligence (low/high) of all objects and re-estimating P(Registration.Grade | Course.Difficulty, Student.Intelligence).]

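A minimal EM sketch under simplifying assumptions not in the talk: difficulties and grades observed, only Student.Intelligence hidden, with invented data and an asymmetric initialisation to break symmetry:

    from collections import defaultdict

    # data[student] = list of (course difficulty, grade); records invented.
    data = {'George': [('lo', 'A'), ('hi', 'C')],
            'Jane':   [('lo', 'A'), ('hi', 'A')]}
    I, G = ['lo', 'hi'], ['A', 'B', 'C']

    prior = {'lo': 0.5, 'hi': 0.5}
    # Slightly asymmetric initial CPD P(g | d, i); uniform init would stall EM.
    cpd = {(d, i): ({'A': 0.4, 'B': 0.3, 'C': 0.3} if i == 'hi'
                    else {'A': 0.2, 'B': 0.3, 'C': 0.5})
           for d in ['lo', 'hi'] for i in I}

    for _ in range(20):
        # E-step: posterior over each student's intelligence given their grades.
        post = {}
        for s, recs in data.items():
            w = {i: prior[i] for i in I}
            for d, g in recs:
                for i in I:
                    w[i] *= cpd[(d, i)][g]
            z = sum(w.values())
            post[s] = {i: w[i] / z for i in I}
        # M-step: re-estimate parameters from expected counts.
        prior = {i: sum(post[s][i] for s in data) / len(data) for i in I}
        counts = defaultdict(float)
        for s, recs in data.items():
            for d, g in recs:
                for i in I:
                    counts[(d, i, g)] += post[s][i]
        for (d, i) in cpd:
            tot = sum(counts[(d, i, g)] for g in G)
            if tot > 0:
                cpd[(d, i)] = {g: counts[(d, i, g)] / tot for g in G}

    print({s: round(post[s]['hi'], 2) for s in data})
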
Discovering Hidden Types
Internet Movie Database
http://www.imdb.com
Discovering Hidden Types
[Figure: a model with hidden Type attributes for Actor, Director, and Movie; Movie.Type influences Genres, Year, MPAA Rating, Rating, and #Votes.]
[Taskar, Segal, Koller]

Discovering Hidden Types
Clusters discovered from the IMDB data (examples):

Movies:
- Wizard of Oz, Cinderella, Sound of Music, The Love Bug, Pollyanna, The Parent Trap, Mary Poppins, Swiss Family Robinson, …
- Terminator 2, Batman, Batman Forever, GoldenEye, Starship Troopers, Mission: Impossible, Hunt for Red October, …

Actors:
- Sylvester Stallone, Bruce Willis, Harrison Ford, Steven Seagal, Kurt Russell, Kevin Costner, Jean-Claude Van Damme, Arnold Schwarzenegger, …
- Anthony Hopkins, Robert De Niro, Tommy Lee Jones, Harvey Keitel, Morgan Freeman, Gary Oldman, …

Directors:
- Alfred Hitchcock, Stanley Kubrick, David Lean, Milos Forman, Terry Gilliam, Francis Coppola, …
- Steven Spielberg, Tim Burton, Tony Scott, James Cameron, John McTiernan, Joel Schumacher, …

Outline
- Bayesian Networks
- Probabilistic Relational Models
- Collective Classification & Clustering
- Undirected Discriminative Models: Markov networks; relational Markov networks
- Collective Classification Revisited
- PRMs for NLP

Directed Models: Limitations
- The acyclicity constraint limits expressive power: we cannot state patterns such as "two objects linked to by a student are probably not both professors"
- Acyclicity forces modeling of all potential links: network size is O(N²) and inference is quadratic
- Generative training: trained to fit all of the data, not to maximize accuracy

Solution: Undirected Models
- Allow arbitrary patterns over sets of objects & links
- Influence flows over existing links, exploiting link-graph sparsity: network size is O(N)
- Allow discriminative training: maximize P(labels | observations) [Lafferty, McCallum, Pereira]

Markov Networks
[Figure: a network over Alice, Betty, Chris, Dave, Eve; the clique potential Φ(A,B,C) assigns a compatibility value to each truth assignment FFF … TTT.]

P(A, B, C, D, E) = (1/Z) Φ(A,B,C) Φ(C,D) Φ(D,E) Φ(E,A)

Graph structure encodes independence assumptions:
Chris is conditionally independent of Eve given Alice & Dave.

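In an undirected model the product of potentials is unnormalised, so the partition function Z must be computed. A brute-force sketch for the five-person network, with invented potentials:

    import itertools

    # Illustrative potentials over boolean variables.
    def phi3(a, b, c):   # compatibility of Alice, Betty, Chris
        return 2.0 if a == b == c else 1.0

    def phi2(x, y):      # shared pairwise potential for the remaining edges
        return 1.5 if x == y else 1.0

    def unnorm(a, b, c, d, e):
        # phi(A,B,C) * phi(C,D) * phi(D,E) * phi(E,A)
        return phi3(a, b, c) * phi2(c, d) * phi2(d, e) * phi2(e, a)

    Z = sum(unnorm(*w) for w in itertools.product([False, True], repeat=5))

    def P(a, b, c, d, e):
        return unnorm(a, b, c, d, e) / Z

    print('Z =', Z, ' P(all True) =', P(True, True, True, True, True))
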
Relational Markov Networks
- Universals: probabilistic patterns hold for all groups of objects
- Locality: represent local probabilistic dependencies; sets of links give us the possible interactions

[Figure: a template potential over the grades of two students in the same study group for the same course: Φ(Reg1.Grade, Reg2.Grade) over the pairs AA, AB, …, CC.]
[Taskar, Abbeel, Koller '02]

RMN Semantics
Instantiated RMN → MN
- variables: the attributes of all objects
- dependencies: determined by the links & the RMN

[Figure: the ground Markov network over the Intelligence and Grades of George, Jane, and Jill in Geo101 and CS101, with the template potential instantiated once for the Geo study group and once for the CS study group.]

Outline
- Bayesian Networks
- Probabilistic Relational Models
- Collective Classification & Clustering
- Undirected Discriminative Models
- Collective Classification Revisited: discriminative training of RMNs; webpage classification; link prediction
- PRMs for NLP

Learning RMNs
- Parameter estimation is not closed form
- But the problem is convex, so there is a unique global maximum
- Maximize L = log P(Grades, Intelligence | Difficulty)

∂L/∂w_AA = #(Grade=A, Grade=A) − E[#(Grade=A, Grade=A) | Diffic]

That is, the gradient for each entry of the template potential is the empirical count of that grade pair minus its expected count under the current model.

[Figure: the ground network with the template potential Φ(Reg1.Grade, Reg2.Grade) over AA … CC, instantiated across registrations.]

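A brute-force sketch of this "empirical minus expected counts" gradient for a single template potential over grade pairs; the observed pairs and learning rate are invented:

    import itertools
    import math

    G = ['A', 'B', 'C']
    pairs = [('A', 'A'), ('A', 'B'), ('A', 'A')]   # observed grade pairs (invented)
    w = {p: 0.0 for p in itertools.product(G, G)}  # log-potential weights

    def dist(w):
        # Model distribution over grade pairs for one clique: exp(w) / Z.
        scores = {p: math.exp(w[p]) for p in w}
        Z = sum(scores.values())
        return {p: scores[p] / Z for p in scores}

    for step in range(200):                        # gradient ascent on log-likelihood
        model = dist(w)
        for p in w:
            empirical = sum(1 for q in pairs if q == p) / len(pairs)
            w[p] += 0.5 * (empirical - model[p])   # dL/dw_p = #p - E[#p]

    print('learned P(A,A) =', round(dist(w)[('A', 'A')], 2))   # -> 0.67 here
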
Flat Models
Logistic regression: model P(Category | Words) directly from Word1 … WordN and LinkWord1 … LinkWordN.
[Chart: error of Naïve Bayes vs. logistic regression vs. SVM.]

Exploiting Links
Model the From-Page and To-Page categories jointly with the page words and link words.
42.1% relative reduction in error relative to the generative approach.
[Chart: error of PRM vs. logistic regression vs. RMN with links.]

More Complex Structure
[Figure: a richer template relating Faculty, Students, and Courses pages through page sections (words W1 … Wn) and the links they contain.]

Collective Classification: Results
35.4% relative reduction in error relative to the strong flat approach.
[Chart: error of Logistic vs. Links vs. Section vs. Link+Section models.]

Scalability
WebKB data set size: 1300 entities, 180K attributes, 5800 links

Network size / school:
- Directed model: 200,000 variables, 360,000 edges
- Undirected model: 40,000 variables, 44,000 edges

                     Training      Classification
Directed models      3 sec         180 sec
Undirected models    20 minutes    15-20 sec

The difference in training time decreases substantially when some training data is unobserved or when we want to model with hidden variables.

Predicting Relationships
[Figure: Tom Mitchell (Professor) and Sean Slattery (Student) are both Members of the WebKB Project; is Mitchell the Advisor-of Slattery?]
- Even more interesting are the relationships between objects
- e.g., verbs are almost always relationships

Flat Model
Predict each link's relationship type Rel ∈ {NONE, advisor, instructor, TA, member, project-of} from the From-Page words, the To-Page words, the link words (LinkWord1 … LinkWordN), and the link Type.

Collective Classification: Links (Link Model)
Predict every link's Rel jointly with the Categories of its From-Page and To-Page, together with the word evidence.

Triad Model
[Figure: template potentials over triads: a Professor who is Advisor of a Student, where both are Members of the same Group; and a Professor who is Advisor of a Student, where the professor is Instructor and the student is TA of the same Course.]

WebKB++
- Four new department web sites: Berkeley, CMU, MIT, Stanford
- Labeled page type (8 types): faculty, student, research scientist, staff, research group, research project, course, organization
- Labeled hyperlinks and virtual links (6 types): advisor, instructor, TA, member, project-of, NONE
- Data set size: 11K pages, 110K links, 2 million words

Link Prediction: Results
72.9% relative reduction in error relative to the strong flat approach.
- Error measured over links predicted to be present
- The link-presence cutoff is at the precision/recall break-even point (30% for all models)
[Chart: error of Flat vs. Labels vs. Triad models.]

Summary
- PRMs inherit key advantages of probabilistic graphical models: coherent probabilistic semantics; exploiting the structure of local interactions
- Relational models are inherently more expressive
- "Web of influence": use all available information to reach powerful conclusions
- Exploit both relational information and the power of probabilistic reasoning

Outline
- Bayesian Networks
- Probabilistic Relational Models
- Collective Classification & Clustering
- Undirected Discriminative Models
- Collective Classification Revisited
- PRMs for NLP, or "Why Should I Care?"*
  - Word-Sense Disambiguation
  - Relation Extraction
  - Natural Language Understanding (?)

* An outsider's perspective

Word Sense Disambiguation
"Her advisor gave her feedback about the draft."
[Figure: candidate senses for the ambiguous words, e.g., financial, physical, electrical, academic, figurative; criticism; wind; paper.]
- Neighboring words alone may not provide enough information to disambiguate
- We can gain insight by considering the compatibility between the senses of related words

Collective Disambiguation
"Her advisor gave her feedback about the draft."
- Objects: words in the text
- Attributes: sense, gender, number, part of speech, …
- Links:
  - Grammatical relations (subject-object, modifier, …)
  - Close semantic relations (is-a, cause-of, …)
  - Same word in different sentences (one sense per discourse)
- Compatibility parameters:
  - Learned from tagged data
  - Based on prior knowledge (e.g., WordNet, FrameNet)

Open questions: Can we infer grammatical structure and disambiguate word senses simultaneously rather than sequentially? Can we integrate inter-word relationships directly into our probabilistic model?

Relation Extraction
"ACME's board of directors began a search for a new CEO after the departure of current CEO, James Jackson, following allegations of creative accounting practices at ACME. [6/01] … In an attempt to improve the company's image, ACME is considering former judge Mary Miller for the job. [7/01] … As her first act in her new position, Miller announced that ACME will be doing a stock buyback. [9/01] …"

[Figure: the extracted relational graph linking Jackson and Miller to the CEO-of role at ACME, and Miller to the announcement she made.]

Understanding Language
"Professor Sarah met Jane. She explained the hole in her proof."
[Figure: most likely interpretation: "she" is Professor Sarah; "her proof" is student Jane's proof (Theorem: P=NP; Proof: N=1).]

Resolving Ambiguity
"Professor Sarah met Jane. She explained the hole in her proof."
- Professors often meet with students, so Jane is probably a student (attribute values)
- Professors like to explain, so "she" is probably Professor Sarah (link types, object identity)

Probabilistic reasoning about objects, their attributes, and the relationships between them.
[Goldman & Charniak; Pasula & Russell]

Acquiring Semantic Models
Statistical NLP reveals patterns:
[Figure: verbs co-occurring with "teacher" and their frequencies, e.g., be 24%, hire 3%, pay 1.5%, fire 1.4%, serenade 0.3%, train, …]
- Standard models learn patterns at the word level
- But word patterns are only implicit surrogates for underlying semantic patterns:
  - "Teacher" objects tend to participate in certain relationships
  - That pattern can be used for objects not explicitly labeled as a teacher

Competing or Complementary Approaches?
[Figure: desiderata: semantic understanding, scaling up (via learning), robustness to noise & ambiguity. Logical and statistical approaches each cover part of this space; PRMs aim to combine them.]

Statistics: from Words to Semantics
- Represent statistical patterns at the semantic level: what types of objects participate in what types of relationships
- Learn statistical models of semantics from text
- Reason using the models to obtain a global semantic understanding of the text

Georgia O'Keeffe, "Ladder to the Moon"