Learning with Bayesian Networks


Learning with Bayesian Networks
David Heckerman
Presented by Colin Rickert
Introduction to Bayesian Networks
• Bayesian networks are an advanced form of general Bayesian probability.
• A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest [1].
• The model has several advantages for data analysis over rule-based decision trees [1].
Outline
1. Bayesian vs. classical probability methods
2. Advantages of Bayesian techniques
3. The coin toss prediction model from a Bayesian perspective
4. Constructing a Bayesian network with prior knowledge
5. Optimizing a Bayesian network with observed knowledge (data)
6. Exam questions
Bayesian vs. the Classical Approach
• The Bayesian probability of an event x represents a person's degree of belief or confidence in that event's occurrence, based on prior and observed facts.
• The classical probability of an event refers to its true or actual probability and is not concerned with observed behavior.
Bayesian vs. the Classical Approach
• The Bayesian approach restricts its prediction to the next occurrence (N+1) of an event, given the N previously observed occurrences.
• The classical approach predicts the likelihood of any given event regardless of the number of occurrences observed.
Example
• Imagine a coin with irregular surfaces, such that the probability of landing heads or tails is not equal.
• The classical approach would be to analyze the surfaces to create a physical model of how the coin is likely to land on any given throw.
• The Bayesian approach simply restricts attention to predicting the next toss based on previous tosses.
Advantages of Bayesian Techniques
How do Bayesian techniques compare to other learning models?
1. Bayesian networks can readily handle incomplete data sets.
2. Bayesian networks allow one to learn about causal relationships.
3. Bayesian networks readily facilitate the use of prior knowledge.
4. Bayesian methods provide an efficient way to prevent the overfitting of data (there is no need for pre-processing).
Handling of Incomplete Data
• Imagine a data sample in which two attribute values are strongly anti-correlated.
• With decision trees, both values must be present to avoid confusing the learning model.
• Bayesian networks need only one of the values to be present and can infer the absence of the other:
 Imagine two variables, one for gun owner and the other for peace activist.
 The data should indicate that you do not need to check both values (see the sketch below).
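The gun-owner / peace-activist point can be made concrete with a minimal sketch. The joint probabilities below are invented purely for illustration; the point is only that conditioning on the one observed attribute lets the model fill in a strong expectation for the missing one:

```python
# Minimal sketch: inferring a missing attribute from an anti-correlated one.
# The joint distribution over (gun_owner, peace_activist) is invented.
joint = {
    (True,  True):  0.02,   # gun owner AND peace activist: rare
    (True,  False): 0.38,
    (False, True):  0.48,
    (False, False): 0.12,
}

def p_activist_given_gun(gun: bool) -> float:
    """P(peace_activist = True | gun_owner = gun), by conditioning the joint."""
    num = joint[(gun, True)]
    den = joint[(gun, True)] + joint[(gun, False)]
    return num / den

print(p_activist_given_gun(True))   # ~0.05: the missing value is nearly determined
print(p_activist_given_gun(False))  # ~0.80
```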
Learning about Causal Relationships

• We can use observed knowledge to determine the validity of the acyclic graph that represents the Bayesian network.
• For instance, is running a cause of knee damage?
 Prior knowledge may indicate that this is the case.
 Observed knowledge may strengthen or weaken this argument.
Use of Prior Knowledge and Observed Behavior

• Constructing prior knowledge is relatively straightforward: add "causal" edges between any two factors that are believed to be correlated.
• Causal networks represent prior knowledge, whereas the weights of the directed edges can be updated in a posterior manner based on new data.
Avoidance of Overfitting Data

• Contradictions do not need to be removed from the data.
• Data can be "smoothed" so that all available data can be used.
The “Irregular” Coin Toss from a Bayesian Perspective

• Start with the set of probabilities Θ = {θ1,…,θn} for our hypothesis.
• For the coin toss we have only one θ, representing our belief that we will toss “heads” (and 1 − θ for tails).
• Predict the outcome of the next flip (flip N+1) based on the previous N flips:
 D = {X1 = x1,…, XN = xN} for tosses 1,…,N
 We want to know the probability that XN+1 = xN+1 = heads.
 ξ represents the information we have observed thus far (i.e., ξ = {D}).
Bayesian Probabilities
• Posterior probability, p(θ|D,ξ): the probability of a particular value of θ given that D has been observed (our final value of θ). In this case ξ = {D}.
• Prior probability, p(θ|ξ): the probability of a particular value of θ given no observed data (our previous “belief”).
• Observed probability or “likelihood”, p(D|θ,ξ): the likelihood of the sequence of coin tosses D being observed given that θ is a particular value. In this case ξ = {θ}.
• p(D|ξ): the raw probability of D.
Bayesian Formulas for Weighted Coin Toss (Irregular Coin)

p(θ|D,ξ) = p(θ|ξ) p(D|θ,ξ) / p(D|ξ)

where

p(D|ξ) = ∫ p(D|θ,ξ) p(θ|ξ) dθ

*We only need to calculate p(D|θ,ξ) and p(θ|ξ); the rest can be derived.
Integration
To find the probability that XN+1 = heads, we must integrate over all possible values of θ to find the average value of θ, which yields:

p(XN+1 = heads | D,ξ) = ∫ θ · p(θ|D,ξ) dθ = E[θ | D,ξ]
Expansion of Terms
1. Expand the observed probability p(D|θ,ξ): for a sequence with h heads and t tails,
   p(D|θ,ξ) = θ^h (1 − θ)^t
2. Expand the prior probability p(θ|ξ) as a Beta distribution with hyperparameters αh and αt:
   p(θ|ξ) = Beta(θ | αh, αt) ∝ θ^(αh−1) (1 − θ)^(αt−1)
*The “Beta” function yields a bell-shaped curve that integrates to one, a typical probability distribution. It can be viewed as our expectation of the shape of the curve.
Beta Function and Integration
• Combining the product of both functions yields another Beta distribution:
  p(θ|D,ξ) ∝ θ^(αh+h−1) (1 − θ)^(αt+t−1) = Beta(θ | αh + h, αt + t)
• Integrating gives the desired result:
  p(XN+1 = heads | D,ξ) = (αh + h) / (αh + αt + N)
Key Points
• Multiply the result of the Beta function (prior probability) by the result of the coin toss function for θ (observed probability). The result is our confidence in this value of θ.
• Integrating the product of the two with respect to θ over all values 0 < θ < 1 is necessary to yield the average value of θ that best fits the observed facts plus the prior knowledge (see the sketch below).
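A minimal sketch of the whole coin-toss procedure in code, using the closed-form result above; the prior hyperparameters and the toss data are invented for illustration:

```python
def predict_heads(alpha_h: float, alpha_t: float, tosses: list[str]) -> float:
    """Posterior-mean prediction p(X_{N+1} = heads | D, xi) for a
    Beta(alpha_h, alpha_t) prior. The Beta prior is conjugate to the
    likelihood theta^h * (1 - theta)^t, so integrating over theta gives
    the closed form (alpha_h + h) / (alpha_h + alpha_t + N)."""
    h = tosses.count("H")
    n = len(tosses)
    return (alpha_h + h) / (alpha_h + alpha_t + n)

# Invented data: 10 tosses of the irregular coin, 7 of them heads.
D = ["H", "H", "T", "H", "H", "H", "T", "H", "T", "H"]
print(predict_heads(1, 1, D))     # weak uniform prior  -> (1+7)/(2+10) ~ 0.67
print(predict_heads(50, 50, D))   # strong fair-coin prior pulls back toward 0.5
```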
Bayesian Networks
1. Construct prior knowledge from a graph of causal relationships among variables.
2. Update the weights of the edges to reflect confidence in each causal link based on observed data (i.e., posterior knowledge).
Example Network
• Consider a credit fraud network designed to determine the probability of credit fraud based on certain events.
• Variables include:
 Fraud (f): whether or not fraud occurred
 Gas (g): whether gas was purchased within the last 24 hours
 Jewelry (j): whether jewelry was purchased within the last 24 hours
 Age (a): age of the card holder
 Sex (s): sex of the card holder
• The task of determining which variables to include is not trivial and involves decision analysis.
Construct Graph Based on Prior Knowledge

• If we examine all possible causal networks, there are n! possibilities to consider (one for each ordering of the variables) when trying to find the best one.
• The search space can be reduced with domain knowledge:
 If A is thought to be a “cause” of B, then add the edge A -> B.
 For all pairs of nodes that do not have a causal link, we can check for conditional independence between those nodes.
Example
Using the above graph of expected causes, we can check the following conditional independence statements against initial sample data (see the sketch below):

p(a|f) = p(a)
p(s|f,a) = p(s)
p(g|f,a,s) = p(g|f)
p(j|f,a,s,g) = p(j|f,a,s)
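These independencies are what allow the joint distribution to factor as p(f,a,s,g,j) = p(f) p(a) p(s) p(g|f) p(j|f,a,s). A minimal sketch of using that factorization for inference, with invented probability tables (a real network would learn these from data):

```python
# Invented (hypothetical) parameters for the fraud network.
p_f = {True: 0.001, False: 0.999}                    # p(fraud)
p_a = {"<30": 0.25, "30-50": 0.40, ">50": 0.35}      # p(age)
p_s = {"male": 0.5, "female": 0.5}                   # p(sex)
p_g = {True: 0.2, False: 0.01}                       # p(gas=True | fraud)

def p_j(f: bool, a: str, s: str) -> float:
    """Invented p(jewelry=True | fraud, age, sex)."""
    return 0.05 if f else (0.0002 if (a == "<30" and s == "female") else 0.0001)

def joint(f, a, s, g, j):
    """p(f,a,s,g,j) via the factorization implied by the independencies."""
    pg = p_g[f] if g else 1 - p_g[f]
    pj = p_j(f, a, s) if j else 1 - p_j(f, a, s)
    return p_f[f] * p_a[a] * p_s[s] * pg * pj

def p_fraud_given(a, s, g, j):
    """Posterior p(fraud | a, s, g, j) by enumerating the two fraud states."""
    num = joint(True, a, s, g, j)
    return num / (num + joint(False, a, s, g, j))

print(p_fraud_given("<30", "male", True, True))  # gas + jewelry raise suspicion
```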
Construction of “Posterior” Knowledge Based on Observed Data

• For every node i, we construct the vector of probabilities θij = {θij1,…,θijn}, where each θij is represented as a row entry in a table of all possible configurations j of the parent nodes 1,…,n.
• The entries in this table are the weights that represent the degree of confidence that nodes 1,…,n influence node i (though we don't know these values yet).
Determining Table Values for θij

• How do we determine the values for θij?
• Perform multivariate integration to find the average θij for all i and j, in a manner similar to the coin toss integration (see the sketch below):
 Count all instances m (out of n cases) that satisfy a configuration (i, j, k); the observed probability for θijk then becomes θijk^m (1 − θijk)^(n−m).
 Integrate over all vectors θij to find the average value of each θijk.
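A minimal sketch of this counting-and-averaging step for a single table, assuming (as in the coin toss) a Beta prior on each row so the integral again has a closed form; the node names and data are invented:

```python
from collections import Counter

# Invented complete data: (fraud, gas) cases for one child node (gas)
# with one parent (fraud), so j ranges over the parent's two values.
records = [
    (True, True), (True, False), (False, False),
    (False, False), (False, True), (False, False),
]

def estimate_cpt(records, alpha: float = 1.0) -> dict:
    """Posterior-mean estimate of theta_j = p(gas=True | fraud=j) for each
    parent configuration j, with a Beta(alpha, alpha) prior per row.
    Mirrors the coin-toss result: (alpha + m) / (2*alpha + n)."""
    m, n = Counter(), Counter()
    for fraud, gas in records:
        n[fraud] += 1
        m[fraud] += int(gas)
    return {j: (alpha + m[j]) / (2 * alpha + n[j]) for j in n}

print(estimate_cpt(records))
# {True: (1+1)/(2+2) = 0.5, False: (1+1)/(2+4) ~ 0.33}
```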
Question 1: What is Bayesian Probability?

• A person's degree of belief in a certain event.
• E.g., your own degree of certainty that a tossed coin will land “heads”.
Question 2: What are the advantages and disadvantages of the Bayesian and classical approaches to probability?

• Bayesian probability:
 + Reflects an expert's knowledge
 + Complies with the rules of probability
 - Arbitrary (subjective)
• Classical probability:
 + Objective and unbiased
 - Generally not available
Question 3: Mention at least 3 advantages of Bayesian analysis

• Handle incomplete data sets
• Learn about causal relationships
• Combine domain knowledge and data
• Avoid overfitting
Conclusion
• Bayesian networks can be used to express expert knowledge about a problem domain even when a precise model does not exist.