#### Transcript Simple Bayesian Supervised Models

```Simple Bayesian Supervised Models
1
Content

• Recap from last week
• Bayesian Linear Regression
    What is linear regression?
    Application of Bayesian theory to linear regression
    Example
    Comparison to conventional linear regression
• Bayesian Logistic Regression
• Naive Bayes classifier

Source: Bishop (ch. 3, 4); Barber (ch. 10)
2
Maximum a posteriori (MAP) estimation
• The Bayesian approach to estimating the parameters of a
distribution from a set of observations is to maximize the
posterior distribution:
posterior = likelihood × prior / evidence
• This makes it possible to take prior information into account.
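In formulas, a sketch of the MAP estimate under assumed notation that is not on the original slide ($\mathbf{w}$ for the parameters, $\mathcal{D}$ for the observations):
$p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}$
$\mathbf{w}_{MAP} = \arg\max_{\mathbf{w}} p(\mathbf{w} \mid \mathcal{D}) = \arg\max_{\mathbf{w}} \left[ \ln p(\mathcal{D} \mid \mathbf{w}) + \ln p(\mathbf{w}) \right]$
The evidence $p(\mathcal{D})$ does not depend on $\mathbf{w}$, so it can be dropped from the maximization.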
Conjugate prior
• In general, for a given probability distribution p(x|η), we can seek a
prior p(η) that is conjugate to the likelihood function, so that the
posterior distribution has the same functional form as the prior.
• For any member of the exponential family, there exists a conjugate
prior that can be written in the form
• Important conjugate pairs include:
Binomial – Beta
Multinomial – Dirichlet
Gaussian – Gaussian (for mean)
Gaussian – Gamma (for precision)
Exponential – Gamma
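A sketch of the general conjugate form, assuming the exponential-family notation of Bishop (ch. 2), which may differ from the notation on the original slide:
$p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x})\, g(\boldsymbol{\eta}) \exp\{\boldsymbol{\eta}^T \mathbf{u}(\mathbf{x})\}$
$p(\boldsymbol{\eta} \mid \boldsymbol{\chi}, \nu) = f(\boldsymbol{\chi}, \nu)\, g(\boldsymbol{\eta})^{\nu} \exp\{\nu\, \boldsymbol{\eta}^T \boldsymbol{\chi}\}$
Worked example for the Binomial – Beta pair: with prior $\mathrm{Beta}(\mu \mid a, b)$ and $m$ successes in $N$ trials, the posterior is $\mathrm{Beta}(\mu \mid a + m,\; b + N - m)$.
For instance, $a = b = 2$ and 7 successes in 10 trials give $\mathrm{Beta}(\mu \mid 9, 5)$, with posterior mean $9/14 \approx 0.64$.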
Linear Regression


goal: predict the value of a target variable 𝑡 given the
value of a D-dimensional vector 𝐱 of input variables: 𝐱 → 𝑡
linear regression models: linear functions of the adjustable
parameters
for example:
$t = 0.5\,x_1 + 0.3\,x_2 - 0.2\,x_3 + 0.1\,x_4$
5
Linear Regression

Training
$\{\mathbf{x}_n\}$ … training data set comprising 𝑁 observations, where
$n = 1, \dots, N$
$\{t_n\}$ … corresponding target values
compute the weights

Prediction
goal: predict the value of 𝑡 for a new value of 𝐱,
i.e. model the predictive distribution $p(t \mid \mathbf{x})$
and make predictions of 𝑡 in such a way as to minimize the
expected value of a loss function
6
Examples of linear regression models

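simplest linear regression model:
$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_D x_D$
$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j x_j = \mathbf{w}^T \mathbf{x}$ (taking $x_0 = 1$ and $M = D + 1$)
linear function of the weights/parameters 𝐰 and the data 𝐱

linear regression models using basis functions 𝛟:
$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$
$\mathbf{w} = (w_0, \dots, w_{M-1})^T$
$\boldsymbol{\phi} = (\phi_0, \dots, \phi_{M-1})^T$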
7
Bayesian Linear Regression

model: $t = y(\mathbf{x}, \mathbf{w}) + \epsilon$

𝑡 … target variable
𝑦 … model
𝐱 … data
𝐰 … weights/parameters
𝜖 … additive Gaussian noise: $p(\epsilon) = \mathcal{N}(\epsilon \mid 0, \beta^{-1})$, with zero
mean and precision (inverse variance) 𝛽
8
Bayesian Linear Regression - Likelihood

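likelihood function:
$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$

observation of N training inputs $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$
and target values $\mathbf{t} = \{t_1, \dots, t_N\}$
(independently drawn from the distribution):
$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$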
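Taking the logarithm of this likelihood shows the link to least squares. A sketch, assuming Bishop's notation ($E_D$ for the sum-of-squares error and $\boldsymbol{\Phi}$ for the $N \times M$ design matrix with $\Phi_{nj} = \phi_j(\mathbf{x}_n)$, neither of which appears on the slide):
$\ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(\mathbf{w})$
$E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) \right)^2$
$\mathbf{w}_{ML} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}$ (when $\boldsymbol{\Phi}^T \boldsymbol{\Phi}$ is invertible)
Maximizing the likelihood with respect to $\mathbf{w}$ is therefore equivalent to minimizing the sum-of-squares error.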
10
Bayesian Linear Regression - Prior



prior probability distribution over the model
parameters 𝐰
conjugate prior: Gaussian distribution
$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$
with mean $\mathbf{m}_0$ and covariance $\mathbf{S}_0$
13
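Bayesian Linear Regression – Posterior Distribution

due to the conjugate prior, the posterior will also be
Gaussian:
$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$
$\mathbf{m}_N = \mathbf{S}_N \left( \mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \boldsymbol{\Phi}^T \mathbf{t} \right)$
$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^T \boldsymbol{\Phi}$
where $\boldsymbol{\Phi}$ is the $N \times M$ design matrix with elements $\Phi_{nj} = \phi_j(\mathbf{x}_n)$
the mode of a Gaussian coincides with its mean, so $\mathbf{w}_{MAP} = \mathbf{m}_N$
(derivation: Bishop p.112)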
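A commonly used special case, sketched here under the assumption of a zero-mean isotropic prior $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I})$ as in Bishop (ch. 3); $\alpha$ denotes the prior precision and $\boldsymbol{\Phi}$ the design matrix with rows $\boldsymbol{\phi}(\mathbf{x}_n)^T$:
$\mathbf{m}_N = \beta\, \mathbf{S}_N \boldsymbol{\Phi}^T \mathbf{t}$
$\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta\, \boldsymbol{\Phi}^T \boldsymbol{\Phi}$
$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) \right)^2 - \frac{\alpha}{2} \mathbf{w}^T \mathbf{w} + \text{const}$
Maximizing this posterior is equivalent to least squares with a quadratic regularizer of coefficient $\lambda = \alpha / \beta$.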
15
Example Linear Regression

matlab
16
Predictive Distribution







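making predictions of 𝑡 for new values of 𝐱
predictive distribution:
$p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid \mathbf{m}_N^T \boldsymbol{\phi}(\mathbf{x}), \sigma_N^2(\mathbf{x}))$
variance of the distribution:
$\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x})$
first term represents the noise in the data
second term reflects the uncertainty associated with the
parameters 𝐰
optimal prediction, for a new value of 𝐱, would be the
conditional mean of the target variable:
$\mathbb{E}[t \mid \mathbf{x}] = \int t\, p(t \mid \mathbf{x})\, dt = y(\mathbf{x}, \mathbf{w})$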
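The predictive distribution comes from averaging the likelihood over the posterior of the weights. A sketch of that step, assuming $\alpha$ denotes the precision of a zero-mean isotropic Gaussian prior as in Bishop (ch. 3):
$p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w}$
Both factors are Gaussian in $\mathbf{w}$, so the result is Gaussian as well, with mean $\mathbf{m}_N^T \boldsymbol{\phi}(\mathbf{x})$ and variance $\frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x})$.
As more data points are observed, the second variance term shrinks and the predictive variance approaches the noise level $1/\beta$.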
17
Common Problem in Linear Regression:
Overfitting/model complexity

Least squares approach (maximizing the likelihood):
    point estimate of the weights
    Regularization: the regularization term and its coefficient
    need to be chosen
    Cross-validation: requires large datasets and high
    computational power

Bayesian approach:
    distribution over the weights
    requires a good prior
    model comparison: computationally demanding, but validation
    data not required
18
From Regression to Classification

for regression problems:
    the target variable 𝑡 was a vector of real numbers whose values
    we wish to predict
in case of classification:
    target values represent class labels
    two-class problem: $t \in \{0, 1\}$
    $K > 2$ classes (1-of-K coding), e.g. $K = 5$: $\mathbf{t} = (0, 1, 0, 0, 0)^T$ → class 2
19
Classification

goal: take an input vector 𝐱 and assign it to one of 𝐾
discrete classes 𝐶𝑘
decision boundary
20
Bayesian Logistic Regression

model the class-conditional densities $p(\mathbf{x} \mid C_k)$ and the
prior probabilities $p(C_k)$ and apply Bayes' theorem:
$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})}$
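For two classes, this posterior can be rewritten as a logistic sigmoid, which is where logistic regression gets its name. A sketch, with $a$ and $\sigma$ as assumed symbols following Bishop (ch. 4):
$p(C_1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_1)\, p(C_1) + p(\mathbf{x} \mid C_2)\, p(C_2)} = \sigma(a)$
$a = \ln \frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_2)\, p(C_2)}, \qquad \sigma(a) = \frac{1}{1 + \exp(-a)}$
Logistic regression models this posterior directly as $p(C_1 \mid \boldsymbol{\phi}) = \sigma(\mathbf{w}^T \boldsymbol{\phi})$.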
21
Bayesian Logistic Regression

exact Bayesian inference for logistic regression is
intractable

Laplace approximation


aims to find a Gaussian approximation to a probability density
defined over a set of continuous variables
posterior distribution is approximated around 𝐰𝑀𝐴𝑃
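A sketch of the resulting approximation, assuming a Gaussian prior $\mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$ and the notation of Bishop (ch. 4), with $y_n = \sigma(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n))$ evaluated at $\mathbf{w}_{MAP}$:
$q(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{w}_{MAP}, \mathbf{S}_N)$
$\mathbf{S}_N^{-1} = -\nabla\nabla \ln p(\mathbf{w} \mid \mathbf{t}) = \mathbf{S}_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \boldsymbol{\phi}(\mathbf{x}_n) \boldsymbol{\phi}(\mathbf{x}_n)^T$
The covariance is the inverse of the negative Hessian of the log posterior at its mode.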
22
Example

Barber: DemosExercises\demoBayesLogRegression.m
23
Naive Bayes classifier

Why naive?



strong independence assumptions
assumes that the presence/absence of one feature is
unrelated to the presence/absence of any other feature, given
the class variable
ignores relations between features and assumes that all features
contribute independently to the class probability
[http://en.wikipedia.org/wiki/Naive_Bayes_classifier]
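In formulas, a sketch of the naive Bayes assumption for a feature vector $\mathbf{x} = (x_1, \dots, x_D)^T$ (symbols assumed here, chosen to match the classification slides):
$p(\mathbf{x} \mid C_k) = \prod_{i=1}^{D} p(x_i \mid C_k)$
$p(C_k \mid \mathbf{x}) \propto p(C_k) \prod_{i=1}^{D} p(x_i \mid C_k)$
The predicted class is the $C_k$ that maximizes this product.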