Simple Bayesian Supervised Models
Saskia Klein & Steffen Bollmann
Content
Recap from last week
Bayesian Linear Regression
What is linear regression?
Application of the Bayesian Theory on Linear Regression
Example
Comparison to Conventional Linear Regression
Bayesian Logistic Regression
Naive Bayes classifier
Source:
Bishop (ch. 3,4); Barber (ch. 10)
Maximum a posteriori (MAP) estimation
• The Bayesian approach to estimating the parameters of a distribution from a set of observations is to maximize the posterior distribution:
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$
• This makes it possible to account for prior information.
Conjugate prior
• In general, for a given probability distribution p(x|η), we can seek a
prior p(η) that is conjugate to the likelihood function, so that the
posterior distribution has the same functional form as the prior.
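• For any member of the exponential family, $p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x})\, g(\boldsymbol{\eta}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\}$, there exists a conjugate prior that can be written in the form
$$p(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) = f(\boldsymbol{\chi}, \nu)\, g(\boldsymbol{\eta})^{\nu} \exp\{\nu\, \boldsymbol{\eta}^T \boldsymbol{\chi}\}$$
where $f(\boldsymbol{\chi}, \nu)$ is a normalization coefficient (Bishop ch. 2).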
• Important conjugate pairs include:
Binomial – Beta
Multinomial – Dirichlet
Gaussian – Gaussian (for mean)
Gaussian – Gamma (for precision)
Exponential – Gamma
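The Binomial – Beta pair can be sketched in a few lines. The prior parameters and the observed counts below are made-up values for illustration; the point is only that the posterior stays a Beta distribution and that its mode gives the MAP estimate.

```python
def beta_binomial_update(a, b, successes, failures):
    """Posterior Beta(a', b') after binomial observations under a Beta(a, b) prior."""
    return a + successes, b + failures


def beta_mode(a, b):
    """Mode of Beta(a, b), defined for a > 1 and b > 1 (this is the MAP estimate)."""
    return (a - 1) / (a + b - 2)


# Illustrative (assumed) numbers: Beta(2, 2) prior, 7 successes, 3 failures.
a_post, b_post = beta_binomial_update(2, 2, successes=7, failures=3)
mu_map = beta_mode(a_post, b_post)   # posterior mode, pulled towards the prior
mu_ml = 7 / (7 + 3)                  # maximum-likelihood estimate, for comparison
print(a_post, b_post, mu_map, mu_ml)
```

With these numbers the MAP estimate lies between the prior mode (0.5) and the maximum-likelihood estimate (0.7).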
Linear Regression
goal: predict the value of a target variable 𝑡 given the
value of a D-dimensional vector 𝐱 of input variables
𝐱→𝑡
linear regression models: linear functions of the adjustable
parameters
for example:
$$t = 0.5\, x_1 + 0.3\, x_2 - 0.2\, x_3 + 0.1\, x_4$$
Linear Regression
Training
$\{\mathbf{x}_n\}$ … training data set comprising $N$ observations, where $n = 1, \dots, N$
$\{t_n\}$ … corresponding target values
compute the weights $\mathbf{w}$
Prediction
goal: predict the value of $t$ for a new value of $\mathbf{x}$
= model the predictive distribution $p(t \mid \mathbf{x})$ and make predictions of $t$ in such a way as to minimize the expected value of a loss function
Examples of linear regression models
simplest linear regression model:
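$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_D x_D$$
$$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j x_j = \mathbf{w}^T \mathbf{x}$$
(with the convention $x_0 = 1$, so that $M = D + 1$)
linear function of the weights/parameters $\mathbf{w}$ and the data $\mathbf{x}$
linear regression models using basis functions $\boldsymbol{\phi}$:
$$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$$
$$\mathbf{w} = (w_0, \dots, w_{M-1})^T, \qquad \boldsymbol{\phi} = (\phi_0, \dots, \phi_{M-1})^T$$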
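A minimal sketch of a basis-function model, assuming a scalar input and a polynomial basis (both are choices made here for illustration, not prescribed by the slides). The model stays linear in the weights even though it is nonlinear in the input.

```python
def polynomial_basis(x, M):
    """phi_j(x) = x**j for j = 0..M-1; phi_0 = 1 plays the role of the bias term."""
    return [x ** j for j in range(M)]


def predict(x, w):
    """y(x, w) = w^T phi(x): linear in w, possibly nonlinear in x."""
    phi = polynomial_basis(x, len(w))
    return sum(w_j * phi_j for w_j, phi_j in zip(w, phi))


w = [1.0, -2.0, 0.5]      # assumed weights, M = 3
print(predict(2.0, w))    # 1 - 2*2 + 0.5*4 = -1.0
```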
Bayesian Linear Regression
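model: $t = y(\mathbf{x}, \mathbf{w}) + \epsilon$
$t$ … target variable
$y$ … model
$\mathbf{x}$ … data
$\mathbf{w}$ … weights/parameters
$\epsilon$ … additive Gaussian noise, $p(\epsilon) = \mathcal{N}(\epsilon \mid 0, \beta^{-1})$, with zero mean and precision (inverse variance) $\beta$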
Bayesian Linear Regression - Likelihood
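likelihood function:
$$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$$
observation of $N$ training inputs $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and target values $\mathbf{t} = \{t_1, \dots, t_N\}$ (independently drawn from the distribution):
$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$$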
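In practice the product over the training points is evaluated as a sum of log densities. The following is a sketch under assumed inputs: the straight-line basis, the data, and the value of the precision are all made up for illustration.

```python
import math


def log_likelihood(xs, ts, w, beta, basis):
    """ln p(t | X, w, beta) = sum_n ln N(t_n | w^T phi(x_n), 1/beta)."""
    total = 0.0
    for x, t in zip(xs, ts):
        y = sum(w_j * phi_j for w_j, phi_j in zip(w, basis(x)))
        total += 0.5 * math.log(beta / (2 * math.pi)) - 0.5 * beta * (t - y) ** 2
    return total


def line_basis(x):
    """Assumed basis for this example: phi(x) = (1, x)."""
    return [1.0, x]


xs = [0.0, 1.0, 2.0]    # made-up inputs
ts = [0.1, 0.9, 2.1]    # made-up targets, roughly t = x
good = log_likelihood(xs, ts, [0.0, 1.0], beta=25.0, basis=line_basis)
bad = log_likelihood(xs, ts, [1.0, -1.0], beta=25.0, basis=line_basis)
print(good, bad)        # weights that fit the data get the higher log-likelihood
```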
Bayesian Linear Regression - Prior
prior probability distribution over the model
parameters 𝐰
conjugate prior: Gaussian distribution
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$$
with mean $\mathbf{m}_0$ and covariance $\mathbf{S}_0$
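Bayesian Linear Regression - Posterior Distribution
due to the conjugate prior, the posterior will also be Gaussian
$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$$
$$\mathbf{m}_N = \mathbf{S}_N \left( \mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \boldsymbol{\Phi}^T \mathbf{t} \right)$$
$$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^T \boldsymbol{\Phi}$$
where $\boldsymbol{\Phi}$ is the $N \times M$ design matrix with elements $\Phi_{nj} = \phi_j(\mathbf{x}_n)$
the mode of a Gaussian coincides with its mean, so
$$\mathbf{w}_{MAP} = \mathbf{m}_N$$
(derivation: Bishop p.112)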
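The posterior update can be sketched for the smallest interesting case, a straight-line model with basis (1, x), so that all matrices are 2 x 2 and can be inverted by hand. The data, the prior, and the noise precision are assumed values; for larger models a linear-algebra library would be the usual choice.

```python
def inv2(A):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def posterior(xs, ts, m0, S0, beta):
    """Return (m_N, S_N) for the assumed basis phi(x) = (1, x)."""
    Phi = [[1.0, x] for x in xs]                      # design matrix, N x 2
    S0_inv = inv2(S0)
    # S_N^-1 = S_0^-1 + beta * Phi^T Phi
    SN_inv = [[S0_inv[i][j] + beta * sum(row[i] * row[j] for row in Phi)
               for j in range(2)] for i in range(2)]
    SN = inv2(SN_inv)
    # m_N = S_N (S_0^-1 m_0 + beta * Phi^T t)
    rhs = [sum(S0_inv[i][k] * m0[k] for k in range(2))
           + beta * sum(row[i] * t for row, t in zip(Phi, ts))
           for i in range(2)]
    mN = [sum(SN[i][k] * rhs[k] for k in range(2)) for i in range(2)]
    return mN, SN


# Made-up data close to t = 1 + 2x, with a broad zero-mean prior.
xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.1, 2.9, 5.2, 6.8]
mN, SN = posterior(xs, ts, m0=[0.0, 0.0],
                   S0=[[100.0, 0.0], [0.0, 100.0]], beta=25.0)
print(mN)   # close to the least-squares line because the prior is broad
```

Making the prior covariance smaller pulls the posterior mean towards the prior mean; making it larger moves the result towards the least-squares solution.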
Example Linear Regression
MATLAB
Predictive Distribution
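making predictions of $t$ for new values of $\mathbf{x}$
predictive distribution:
$$p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid \mathbf{m}_N^T \boldsymbol{\phi}(\mathbf{x}), \sigma_N^2(\mathbf{x}))$$
where $\alpha$ denotes the precision of a zero-mean isotropic Gaussian prior, $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I})$, as in Bishop ch. 3
variance of the distribution:
$$\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x})$$
first term represents the noise in the data
second term reflects the uncertainty associated with the parameters $\mathbf{w}$
optimal prediction (under a squared-error loss), for a new value of $\mathbf{x}$, would be the conditional mean of the target variable:
$$\mathbb{E}[t \mid \mathbf{x}] = \int t \, p(t \mid \mathbf{x}) \, dt = y(\mathbf{x}, \mathbf{w})$$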
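The two variance terms can be made concrete with a small sketch. The posterior mean and covariance used here are assumed illustrative values for a straight-line model with basis (1, x); they are not results from the slides.

```python
import math


def predictive(x, mN, SN, beta):
    """Mean and variance of the predictive distribution for phi(x) = (1, x)."""
    phi = [1.0, x]
    mean = sum(m * p for m, p in zip(mN, phi))            # m_N^T phi(x)
    param_var = sum(phi[i] * SN[i][j] * phi[j]
                    for i in range(2) for j in range(2))  # phi^T S_N phi
    return mean, 1.0 / beta + param_var                   # noise + parameter term


# Assumed posterior (illustrative values only).
mN = [1.0, 2.0]
SN = [[0.04, -0.01], [-0.01, 0.01]]
beta = 25.0
for x in (0.0, 5.0):
    mean, var = predictive(x, mN, SN, beta)
    print(x, mean, math.sqrt(var))
```

The variance never drops below 1/beta, the noise floor, and in this example it grows as x moves away from the region where the weights are well determined.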
Common Problem in Linear Regression: Overfitting/model complexity
Least Squares approach (maximizing the likelihood):
point estimate of the weights
Regularization: the form of the regularization term and the value of its coefficient need to be chosen
Cross-Validation: requires large datasets and high computational power
Bayesian approach:
distribution over the weights
needs a good prior
model comparison: computationally demanding, but validation data not required
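The two approaches meet in one special case. Assuming a zero-mean isotropic Gaussian prior with precision $\alpha$ (an assumption made here, following Bishop ch. 3, and not stated on this slide), the log posterior over the weights is

```latex
\ln p(\mathbf{w} \mid \mathbf{t})
  = -\frac{\beta}{2} \sum_{n=1}^{N}
      \left( t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) \right)^2
    - \frac{\alpha}{2}\, \mathbf{w}^T \mathbf{w} + \text{const}
```

so maximizing the posterior is equivalent to minimizing a sum-of-squares error with a quadratic regularizer whose coefficient is $\lambda = \alpha / \beta$.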
From Regression to Classification
for regression problems:
target variable $t$ was a real number (or a vector of real numbers) whose value we wish to predict
in case of classification:
target values represent class labels
two-class problem: $t \in \{0, 1\}$
$K > 2$ classes (1-of-K coding): $\mathbf{t} = (0, 1, 0, 0, 0)^T$ → class 2
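The 1-of-K coding can be sketched as follows; the helper names are hypothetical and classes are numbered from 1, as on the slide.

```python
def one_hot(k, K):
    """1-of-K target vector for class k (classes numbered 1..K)."""
    if not 1 <= k <= K:
        raise ValueError("class index out of range")
    return [1 if j == k else 0 for j in range(1, K + 1)]


def decode(t):
    """Recover the class number from a 1-of-K vector."""
    return t.index(1) + 1


t = one_hot(2, 5)
print(t, decode(t))   # [0, 1, 0, 0, 0] 2
```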
Classification
goal: take an input vector 𝐱 and assign it to one of 𝐾
discrete classes 𝐶𝑘
decision boundary
Bayesian Logistic Regression
model the class-conditional densities $p(\mathbf{x} \mid C_k)$ and the prior probabilities $p(C_k)$ and apply Bayes' theorem:
$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})}$$
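A sketch of this computation for a one-dimensional input, assuming Gaussian class-conditional densities with made-up means, variances, and priors. The evidence is obtained by summing the numerator over all classes.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def class_posteriors(x, class_params, priors):
    """p(C_k | x) from class-conditional densities p(x | C_k) and priors p(C_k)."""
    joint = [gaussian_pdf(x, mean, var) * prior
             for (mean, var), prior in zip(class_params, priors)]
    evidence = sum(joint)                 # p(x) = sum_k p(x | C_k) p(C_k)
    return [j / evidence for j in joint]


# Assumed example: two classes with unit-variance Gaussian densities.
params = [(0.0, 1.0), (2.0, 1.0)]
priors = [0.5, 0.5]
print(class_posteriors(1.0, params, priors))   # midpoint: both classes equally likely
print(class_posteriors(2.5, params, priors))   # the second class is favoured
```

The input at which the posteriors are equal marks the decision boundary between the two classes.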
Bayesian Logistic Regression
exact Bayesian inference for logistic regression is
intractable
Laplace approximation
aims to find a Gaussian approximation to a probability density
defined over a set of continuous variables
posterior distribution is approximated around 𝐰𝑀𝐴𝑃
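A minimal sketch of the Laplace approximation for logistic regression with a single weight and no bias, assuming a Gaussian prior on that weight and made-up data. The mode is found with Newton's method, and the variance of the approximating Gaussian is the inverse of the negative second derivative of the log posterior at the mode.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def laplace_approximation(xs, ts, m0=0.0, s0_sq=1.0, iters=50):
    """Gaussian approximation N(w | w_map, var) to the posterior over one weight."""
    w = m0
    for _ in range(iters):
        ys = [sigmoid(w * x) for x in xs]
        # First and second derivative of the log posterior with respect to w.
        grad = -(w - m0) / s0_sq + sum((t - y) * x for x, t, y in zip(xs, ts, ys))
        hess = -1.0 / s0_sq - sum(y * (1.0 - y) * x * x for x, y in zip(xs, ys))
        w -= grad / hess                  # Newton step towards the posterior mode
    ys = [sigmoid(w * x) for x in xs]
    hess = -1.0 / s0_sq - sum(y * (1.0 - y) * x * x for x, y in zip(xs, ys))
    return w, -1.0 / hess                 # variance = inverse negative Hessian


# Made-up, not perfectly separable data with labels in {0, 1}.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ts = [0, 0, 1, 0, 1, 1]
w_map, var = laplace_approximation(xs, ts)
print(w_map, var)
```

With several weights the same idea applies, with the gradient becoming a vector and the second derivative a Hessian matrix.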
Example
Barber: DemosExercises\demoBayesLogRegression.m
Naive Bayes classifier
Why naive?
strong independence assumptions
assumes that the presence/absence of a feature is unrelated to the presence/absence of any other feature, given the class variable
ignores relations between features and assumes that all features contribute independently to the probability of a class
[http://en.wikipedia.org/wiki/Naive_Bayes_classifier]
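The independence assumption can be sketched for binary features. The toy data and the add-one smoothing of the counts are assumptions made for this example; the slides do not specify an estimation scheme.

```python
import math


def train_naive_bayes(X, y):
    """Estimate p(C) and p(x_i = 1 | C) for binary features with add-one smoothing."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # (count + 1) / (n + 2) keeps every probability strictly between 0 and 1.
        theta = [(sum(col) + 1) / (len(rows) + 2) for col in zip(*rows)]
        model[c] = (prior, theta)
    return model


def predict(model, x):
    """Pick the class maximizing ln p(C) + sum_i ln p(x_i | C)."""
    def log_joint(c):
        prior, theta = model[c]
        return math.log(prior) + sum(
            math.log(p if xi else 1.0 - p) for xi, p in zip(x, theta))
    return max(model, key=log_joint)


# Made-up toy data: two binary features, two classes.
X = [[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0]]
y = ["a", "a", "a", "b", "b", "b"]
model = train_naive_bayes(X, y)
print(predict(model, [1, 1]), predict(model, [0, 0]))   # a b
```

Because the features are treated as independent given the class, the class-conditional likelihood is a product of per-feature terms, which becomes the sum of logarithms in `predict`.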
Thank you for your attention