A ((very) brief) introduction to minimum description length

A(n) (extremely) brief/crude introduction to the minimum description length principle
jdu
2006-04
1
Outline
• Conceptual/non-technical introduction
• Probabilities and Codelengths
• Crude MDL
• Refined MDL
• Other topics
2
Outline
• Conceptual/non-technical introduction
• Probabilities and Codelengths
• Crude MDL
• Refined MDL
• Other topics
3
Introduction
• Example: data compression
– Description methods
4
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
Introduction
• Example: regression
– Model selection and overfitting
– Complexity of the model vs. Goodness of fit
5
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
Introduction
• Models vs. Hypotheses
6
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
Introduction
• Crude 2-part version of MDL
7
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
Outline
• Conceptual/non-technical introduction
• Probabilities and Codelengths
• Crude MDL
• Refined MDL
• Other topics
8
Probabilities and Codelengths
• Let X be a finite or countable set
– A code C for X:
• a 1-to-1 mapping from X to $\bigcup_{n>0}\{0,1\}^n$
• L_C(x): the number of bits needed to encode x using C
– P: a probability distribution defined on X
• P(x): the probability of x
• A sequence of (usually i.i.d.) observations x_1, x_2, …, x_n, written x^n
9
Probabilities and Codelengths
• Prefix codes, as examples of uniquely decodable codes
– no code word is a prefix of any other

symbol   code word
a        0
b        111
c        1011
d        1010
r        110
!        100
10
Source: http://www.cs.princeton.edu/courses/archive/spring04/cos126/
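As a quick check of unique decodability, here is a minimal Python sketch (mine, not from the slides) that encodes and decodes with the prefix code in the table above; the test string "abracadabra!" is just an illustrative choice. Because no code word is a prefix of another, the decoder never needs to backtrack.

```python
# Prefix code from the slide (the classic "abracadabra!" example).
CODE = {'a': '0', 'b': '111', 'c': '1011', 'd': '1010', 'r': '110', '!': '100'}
DECODE = {w: s for s, w in CODE.items()}

def encode(text):
    return ''.join(CODE[ch] for ch in text)

def decode(bits):
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in DECODE:          # prefix property: the first match is the only match
            out.append(DECODE[buf])
            buf = ''
    return ''.join(out)

msg = 'abracadabra!'
assert decode(encode(msg)) == msg
print(encode(msg), len(encode(msg)), 'bits')
```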
Probabilities and Codelengths
• Expected codelength of a code C:
$E_P[L_C(X)] = \sum_{x \in X} P(x) L_C(x)$
– Lower bound: the entropy of P,
$H(P) = -\sum_{x \in X} P(x) \log_2 P(x)$
• Optimal code
– a code with minimum expected codelength over all uniquely decodable codes
– How to design one, given P?
• Huffman coding
11
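A short sketch comparing the expected codelength of the prefix code above with the entropy lower bound. The probabilities P are hypothetical (not given on the slide); they roughly match the symbol frequencies in "abracadabra!".

```python
import math

# Hypothetical probabilities for the six symbols (not from the slides).
P = {'a': 5/12, 'b': 2/12, 'r': 2/12, 'c': 1/12, 'd': 1/12, '!': 1/12}
L = {'a': 1, 'b': 3, 'r': 3, 'c': 4, 'd': 4, '!': 3}   # codelengths of the prefix code

expected_len = sum(P[x] * L[x] for x in P)              # E_P[L_C(X)]
entropy = -sum(P[x] * math.log2(P[x]) for x in P)       # H(P), the lower bound

print(f"expected codelength = {expected_len:.3f} bits")
print(f"entropy lower bound = {entropy:.3f} bits")
assert expected_len >= entropy - 1e-12
```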
Probabilities and Codelengths
• Huffman coding
12
Source: http://star.itc.it/caprile/teaching/algebra-superiore-2001/
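The Huffman construction referenced on this slide can be sketched in a few lines; this is a standard greedy implementation (repeatedly merge the two least probable nodes), not the one from the cited course page, and it reuses the illustrative probabilities from above.

```python
import heapq
from itertools import count

def huffman(probabilities):
    """Build a Huffman code by repeatedly merging the two least probable nodes."""
    tiebreak = count()                      # avoids ever comparing the code dictionaries
    heap = [(p, next(tiebreak), {sym: ''}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c1.items()}
        merged.update({s: '1' + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

code = huffman({'a': 5/12, 'b': 2/12, 'r': 2/12, 'c': 1/12, 'd': 1/12, '!': 1/12})
print(code)   # the most probable symbol 'a' gets the shortest code word
```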
Probabilities and Codelengths
• How to design a code for {1, 2, …, M}?
– Assume a uniform distribution: probability 1/M for each number
– ~ log M bits per number
13
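A minimal sketch of the uniform (fixed-length) code just described; writing m−1 in ⌈log₂ M⌉ bits is one natural concrete choice, my own.

```python
import math

def uniform_encode(m, M):
    """Encode m in {1, ..., M} with a fixed-length code of ceil(log2 M) bits."""
    width = math.ceil(math.log2(M))
    return format(m - 1, f'0{width}b')

print(uniform_encode(5, 12))   # 12 outcomes -> 4 bits per outcome
```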
Probabilities and Codelengths
• How to design a code for all the positive integers?
– For each k:
• first write ⌈log k⌉ 0s
• followed by a 1
• then encode k using the uniform code for {1, …, 2^⌈log k⌉}
• in total, ~ 2 log k + 1 bits
– Can be refined…
14
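A sketch of the universal code for the positive integers outlined above: the length of k in unary, then k itself with the uniform code. The exact bit layout below is my reconstruction of that scheme; it comes to 2⌈log k⌉ + 1 bits, matching the ~2 log k + 1 on the slide.

```python
import math

def encode_integer(k):
    """Write ceil(log2 k) zeros, then a 1, then k with the uniform code on {1, ..., 2^ceil(log2 k)}."""
    assert k >= 1
    width = math.ceil(math.log2(k)) if k > 1 else 0
    payload = format(k - 1, f'0{width}b') if width else ''
    return '0' * width + '1' + payload

def decode_integer(bits):
    width = bits.index('1')                     # leading zeros give the payload width
    payload = bits[width + 1: 2 * width + 1]
    return (int(payload, 2) if payload else 0) + 1

for k in (1, 2, 5, 1000):
    code = encode_integer(k)
    assert decode_integer(code) == k
    print(k, code, f'{len(code)} bits')
```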
Probabilities and Codelengths
• Let P be a probability distribution over X. Then there exists a code C for X such that (ignoring integer rounding):
$L_C(x) = -\log P(x)$
• Let C be a uniquely decodable code over X. Then there exists a probability distribution P such that:
$L_C(x) = -\log P(x)$
$L_C(x^n) = -\log P(x^n)$
15
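The second direction of the correspondence can be checked numerically: for any prefix code, setting P(x) = 2^(−L_C(x)) gives a (sub-)probability distribution, because prefix codes satisfy the Kraft inequality. A small sketch using the codelengths read off the table on slide 10:

```python
lengths = {'a': 1, 'b': 3, 'r': 3, 'c': 4, 'd': 4, '!': 3}   # L_C(x) for the prefix code

kraft_sum = sum(2 ** -L for L in lengths.values())
print('Kraft sum =', kraft_sum)            # <= 1 for any uniquely decodable code

# The induced (sub-)distribution, with -log2 P(x) equal to the codelength L_C(x):
P = {x: 2 ** -L for x, L in lengths.items()}
print(P)
```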
Probabilities and Codelengths
• Codelength revisited
16
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
Outline
• Conceptual/non-technical introduction
• Probabilities and Codelengths
• Crude MDL
• Refined MDL
• Other topics
17
Crude MDL
• Preliminary: k-th order Markov chain on X={0,1}
– A sequence: X_1, X_2, …, X_N
– Special case: 0-th order: Bernoulli model (biased coin)
• Maximum Likelihood estimator
18
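For the 0-th order (Bernoulli) special case, the ML estimator is simply the fraction of 1s; a minimal sketch (my own illustration, since the estimator's formula did not survive the transcript) also returns the resulting codelength −log₂ P(xⁿ | θ̂):

```python
import math

def bernoulli_mle(x):
    """ML estimate of P(X=1) and the codelength -log2 P(x^n | theta_hat) in bits."""
    n1, n = sum(x), len(x)
    theta = n1 / n
    # -log2 likelihood; terms with zero count contribute nothing (0 * log 0 = 0)
    bits = -sum(c * math.log2(p) for c, p in ((n1, theta), (n - n1, 1 - theta)) if c)
    return theta, bits

data = [1, 0, 1, 1, 0, 1, 1, 1]
print(bernoulli_mle(data))    # theta_hat = 0.75, codelength in bits
```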
Crude MDL
• Preliminary: k-th order Markov chain on X={0,1}
– Special case: first order Markov chain B(1)
• MLE
19
Crude MDL
• Preliminary: k-th order Markov chain on X={0,1}
– 2^k parameters, one per length-k context:
• theta[1|000…000] = n[1|000…000] / n[000…000]
• theta[1|000…001]
• …
• theta[1|111…110]
• theta[1|111…111]
– Log likelihood function: …
– MLE: …
20
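A sketch of the ML estimates theta[1|context] = n[1|context] / n[context] for a general k-th order chain on {0,1}. Conditioning on the first k symbols instead of modelling them is a simplification I make here; the slide does not specify the edge handling.

```python
from collections import Counter

def markov_mle(x, k):
    """theta[1|context] = n[1|context] / n[context] for a k-th order chain on X = {0,1}."""
    ctx_counts, one_counts = Counter(), Counter()
    for i in range(k, len(x)):
        ctx = tuple(x[i - k:i])                # the k previous symbols
        ctx_counts[ctx] += 1
        one_counts[ctx] += x[i]                # count transitions context -> 1
    return {ctx: one_counts[ctx] / ctx_counts[ctx] for ctx in ctx_counts}

x = [0, 1, 1, 0, 1, 1, 0, 1, 1, 0]
print(markov_mle(x, k=1))                      # {(0,): 1.0, (1,): 0.5}
print(markov_mle(x, k=0))                      # {(): 0.6}  -- the Bernoulli special case
```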
Crude MDL
• Question: Given data D = x^n, find the Markov chain that best explains D.
– We do not want to restrict ourselves to chains of a fixed order
• How to avoid overfitting?
• Obviously, an (n-1)-th order Markov model would always fit the data best
21
Crude MDL
• Two-part MDL revisited
22
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
Crude MDL
• Description length of data given hypothesis
23
Crude MDL
• Description length of hypothesis
– The code should not change with the sample size n.
– Different codes will lead to preferences for different hypotheses
– How to design a code that leads to good inferences at small, practically relevant sample sizes?
24
Crude MDL
• An “intuitive” and “reasonable” code for a k-th order Markov chain
– First describe k using 2 log k + 1 bits
– Then describe the d = 2^k parameters
• Assume n is given in advance
• For each theta in the MLE {theta[1|000…000], …, theta[1|111…111]}, the best precision we can achieve by counting is 1/(n+1)
• Describe each theta with log(n+1) bits
– L(H) = 2 log k + 1 + d log(n+1)
– L(H) + L(D|H) = 2 log k + 1 + d log(n+1) − log P(D | k, theta)
– For a given k, only the MLE theta needs to be considered
25
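Putting the slide's formula to work: a self-contained sketch that picks the Markov order k by minimizing L(H) + L(D|H) = 2 log k + 1 + 2^k log(n+1) − log P(D | k, θ̂). Charging 1 bit for k = 0 (where 2 log k is undefined) is my own convention, not from the slides.

```python
import math
from collections import Counter

def markov_codelength(x, k):
    """-log2 P(data | k, theta_hat): the ML codelength for a k-th order chain on {0,1}."""
    ctx, ones = Counter(), Counter()
    for i in range(k, len(x)):
        c = tuple(x[i - k:i])
        ctx[c] += 1
        ones[c] += x[i]
    bits = 0.0
    for c, n_c in ctx.items():
        for count in (ones[c], n_c - ones[c]):
            if count:
                bits -= count * math.log2(count / n_c)
    return bits

def two_part_mdl(x, max_k=5):
    """Crude two-part MDL: pick the Markov order k minimizing L(H) + L(D|H)."""
    n = len(x)
    scores = {}
    for k in range(max_k + 1):
        d = 2 ** k                                      # number of parameters
        L_k = 2 * math.log2(k) + 1 if k > 0 else 1      # ~2 log k + 1 bits (k=0: my convention)
        L_H = L_k + d * math.log2(n + 1)                # order + parameters at precision 1/(n+1)
        scores[k] = L_H + markov_codelength(x, k)       # L(H) + L(D|H)
    return min(scores, key=scores.get), scores

x = [0, 1] * 50                                         # strongly first-order data
best_k, scores = two_part_mdl(x)
print(best_k)                                           # expect 1
```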
Crude MDL
• Good news
– We have found a principled manner to encode data D using H
• Bad news
– We have not found clear guidelines for designing codes for H
26
Outline
• Conceptual/non-technical introduction
• Probabilities and Codelengths
• Crude MDL
• Refined MDL
• Other topics
27
Refined MDL
• Universal codes and universal distributions
– The maximum likelihood code depends on the data
• How to describe the data in an unambiguous manner?
– Design a code such that for every possible observation, its codelength corresponds to its maximum likelihood codelength? Impossible (it would violate the Kraft inequality)
28
Refined MDL
• Worst-case regret
• Optimal universal model
29
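The formulas for "worst-case regret" and "optimal universal model" were evidently images and did not survive the transcript; the standard definitions in the sense of Grünwald et al. (2005), which the slide presumably showed, are:

```latex
% Regret of a distribution \bar{P} on a particular sequence x^n, relative to model M:
\mathrm{REG}(\bar{P}, x^n) \;=\; -\log \bar{P}(x^n) \;-\; \min_{P \in M}\bigl(-\log P(x^n)\bigr)

% Worst-case regret over all sequences of length n:
\mathrm{REG}_{\max}(\bar{P}, M, n) \;=\; \max_{x^n} \mathrm{REG}(\bar{P}, x^n)

% The optimal universal model (relative to M, for sample size n) minimizes it:
\bar{P}^{*} \;=\; \arg\min_{\bar{P}} \; \max_{x^n}
\Bigl[-\log \bar{P}(x^n) \;+\; \log P\bigl(x^n \mid \hat{\theta}(x^n)\bigr)\Bigr]
```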
Refined MDL
• Normalized maximum likelihood (NML)
• Minimizing −log NML
30
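For the Bernoulli model, NML can be computed exactly by brute-force normalization over counts. A small sketch (my own illustration) of P_nml(x^n) = P(x^n | θ̂(x^n)) / Σ_{y^n} P(y^n | θ̂(y^n)); the log of the denominator is the complexity term discussed on the next slide.

```python
import math
from math import comb

def bernoulli_max_lik(n1, n):
    """max_theta P(a sequence with n1 ones | theta) = (n1/n)^n1 * ((n-n1)/n)^(n-n1)."""
    p = n1 / n
    return (p ** n1) * ((1 - p) ** (n - n1))

def bernoulli_nml(x):
    """NML probability of x, plus the model complexity log2 of the normalizer."""
    n, n1 = len(x), sum(x)
    normalizer = sum(comb(n, k) * bernoulli_max_lik(k, n) for k in range(n + 1))
    return bernoulli_max_lik(n1, n) / normalizer, math.log2(normalizer)

x = [1, 0, 1, 1, 0, 1, 1, 1]
nml, complexity = bernoulli_nml(x)
print(f"-log2 NML = {-math.log2(nml):.3f} bits  (complexity term = {complexity:.3f} bits)")
```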
Refined MDL
• Complexity of a model
– The more sequences that can be fit well by an element of M, the larger M's complexity
– Would it lead to a “right” balance between complexity and fit?
• Hopefully…
31
Refined MDL
• General refined MDL
32
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
Outline
• Conceptual/non-technical introduction
• Probabilities and Codelengths
• Crude MDL
• Refined MDL
• Other topics
33
Other topics
• Mixture code
• Resolvability
• …
34
References
• Barron, A., Rissanen, J. & Yu, B. (1998), 'The minimum description length principle in coding and modeling', IEEE Transactions on Information Theory 44(6), 2743-2760.
• Grünwald, P. D., Myung, I. J. & Pitt, M. A. (2005), Advances in Minimum Description Length: Theory and Applications (Neural Information Processing), The MIT Press.
• Hall, P. & Hannan, E. J. (1988), 'On stochastic complexity and nonparametric density estimation', Biometrika 75(4), 705-714.
35