Using CTW in Dasher - University of Cambridge


Using CTW as a language
modeler in Dasher
Phil Cowans, Martijn van Veen
25-04-2007
Inference Group
Department of Physics
University of Cambridge
Language Modelling
• Goal is to produce a generative model
over strings
• Typically sequential predictions
• Finite context models (standard forms of both are sketched below)
2/24
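
For reference, a sketch of the standard factorizations the two bullets above refer to (standard notation, not copied from the slide):

P(x_1, \ldots, x_N) = \prod_{i=1}^{N} P(x_i \mid x_1, \ldots, x_{i-1})

and, for a finite context model of maximum depth D,

P(x_i \mid x_1, \ldots, x_{i-1}) \approx P(x_i \mid x_{i-D}, \ldots, x_{i-1})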
Dasher: Language Model
• Conditional probability for each alphabet symbol, given the
previous symbols
• Similar to compression methods
• Requirements:
– Sequential
– Fast
– Adaptive
• Model is trained
• Better compression -> faster text input
3/24
Basic Language Model
• Independent distributions for each context
• Use Dirichlet prior (predictive form sketched below)
• Makes poor use of data
– intuitively, we expect similarities between similar contexts
4/24
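
A minimal sketch of the Dirichlet-prior predictive distribution referred to above, assuming a symmetric Dirichlet prior with parameter \alpha, counts c(x \mid s) of symbol x observed in context s, and alphabet A (notation mine, not from the slide):

P(x \mid s) = \frac{c(x \mid s) + \alpha}{\sum_{x'} c(x' \mid s) + |A| \, \alpha}

Because every context s keeps its own independent counts, contexts sharing a long suffix learn nothing from one another, which is the poor use of data noted above.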
Basic Language Model
5/24
Prediction By Partial Match
• Associate a generative distribution with
each leaf in the context tree
• Share information between nodes using a
hierarchical Dirichlet (or Pitman-Yor) prior
• In practice, use a fast but generally good approximation (sketched below)
6/24
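
A minimal Python sketch of a fast PPM-style approximation with escape-based backoff, using the alpha and beta values listed on the 'Experimental Parameters' slide at the end; the exact smoothing and exclusion rules in Dasher's PPM may differ.

from collections import defaultdict

ALPHA, BETA = 0.49, 0.77  # values from the Experimental Parameters slide

class PPMNode:
    """Counts of symbols observed in one context."""
    def __init__(self):
        self.counts = defaultdict(int)

def ppm_predict(nodes, context, alphabet):
    """Blend counts from the longest matching context down to the empty
    context. `nodes` maps context tuples to PPMNode; counts are assumed
    to contain only symbols from `alphabet`. Exclusions are omitted."""
    probs = {a: 0.0 for a in alphabet}
    remaining = 1.0  # escape mass still to be distributed
    for length in range(len(context), -1, -1):
        node = nodes.get(tuple(context[len(context) - length:]))
        if node is None or not node.counts:
            continue  # escape straight to a shorter context
        total = sum(node.counts.values())
        distinct = len(node.counts)
        for sym, c in node.counts.items():
            probs[sym] += remaining * (c - BETA) / (total + ALPHA)
        # mass passed on to the next shorter context
        remaining *= (distinct * BETA + ALPHA) / (total + ALPHA)
    for a in alphabet:  # leftover mass spread uniformly
        probs[a] += remaining / len(alphabet)
    return probs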
Hierarchical Dirichlet Model
7/24
Context Tree Weighting
• Combine nodes in the context tree
• Tree structure treated as a random
variable
• Contexts associated with each leaf have
the same generative distribution
• Contexts associated with different leaves
are independent
• Dirichlet prior on generative distributions
8/24
CTW: Tree model
• Source structure is captured in the tree model; the parameters at the leaves are memoryless
9/24
Tree Partitions
10/24
Recursive Definition
• At each node, weight two hypotheses (recursion sketched below):
– the children share one distribution
– the children are distributed independently
11/24
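
The standard CTW weighting recursion referred to above mixes, at each node s, the hypothesis that all children share one distribution with the hypothesis that they are independent (standard form, not copied from the slide):

P_w^s = \tfrac{1}{2} P_e^s + \tfrac{1}{2} \prod_{b \in A} P_w^{bs}   (internal nodes)
P_w^s = P_e^s   (nodes at the maximum depth D)

where P_e^s is the zero-order estimate (e.g. a KT / Dirichlet estimator) computed from the counts seen in context s, and bs is the child of s reached by extending the context with symbol b.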
Experimental Results [256]
12/24
Experimental Results [128]
13/24
Experimental Results [27]
14/24
Observations So Far
• No clear overall winner without
modification.
• PPM does better with small alphabets?
• PPM initially learns faster?
• CTW is more forgiving with redundant
symbols?
15/24
CTW for text
Properties of text generating sources:
• Large alphabet, but in any given context only a
small subset is used
– Wastes code space: many probabilities should be zero
– Solutions:
• Adjust the zero-order estimator to decrease the probability of unlikely events (sketched below)
• Binary decomposition
• Only locally stationary
– Limit the counts to increase adaptivity (Bell, Cleary & Witten 1989)
16/24
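
A minimal Python sketch of the zero-order estimator adjustment mentioned above, in which a small Dirichlet parameter (the Experimental Parameters slide lists alpha = 1/128 for CTW) keeps the probability of symbols never seen in a context low; this illustrates the idea, not the exact estimator used.

def zero_order_prob(counts, symbol, alphabet_size, alpha=1.0 / 128):
    """Dirichlet-smoothed zero-order estimate of P(symbol | context).

    A small alpha assigns little code space to symbols that have never
    occurred in this context, which addresses the 'large alphabet, only
    a small subset used' problem described above."""
    total = sum(counts.values())
    return (counts.get(symbol, 0) + alpha) / (total + alphabet_size * alpha)

With alpha = 1/2 this reduces to the KT estimator; pushing alpha towards zero moves probability mass away from unseen symbols.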
Binary Decomposition
• Decomposition tree (a sketch of the decomposition follows)
17/24
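
A minimal Python sketch of the decomposition idea, assuming each byte is split into 8 binary decisions so every bit is predicted by its own binary model, conditioned on the text context and the bits already decoded; the actual decomposition tree may be shaped differently (e.g. tuned to the alphabet).

def byte_to_decisions(byte_value):
    """Split one byte into 8 binary decisions, most significant bit first.
    Each decision is keyed by the prefix of bits already seen, so a
    separate two-symbol model can sit at every node of the
    decomposition tree."""
    decisions = []
    prefix = ()
    for shift in range(7, -1, -1):
        bit = (byte_value >> shift) & 1
        decisions.append((prefix, bit))  # (decomposition-tree node, outcome)
        prefix += (bit,)
    return decisions

# P(byte | context) is then the product of the 8 conditional bit
# probabilities, each supplied by the binary model at that node, e.g.:
#   p = 1.0
#   for node, bit in byte_to_decisions(ord('e')):
#       p *= binary_models[node].prob(bit, context)  # hypothetical binary CTW models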
Binary Decomposition
• Results found by Aberg and Shtarkov:
– All tests with the full ASCII alphabet; values in bits per character
Input file | PPM-D (byte predictions) | CTW-D (byte predictions) | CTW-KT (bit predictions) | CTW/PPM-D (byte predictions)
Paper 1    | 2.351 | 2.904 | 2.322 | 2.287
Paper 2    | 2.322 | 2.719 | 2.249 | 2.235
Book 1     | 2.291 | 2.490 | 2.184 | 2.192
Book 2     | 1.969 | 2.265 | 1.910 | 1.896
News       | 2.379 | 2.877 | 2.379 | 2.322
18/24
Count halving
• If one count reaches its maximum, divide both counts by 2 (sketched below)
– Forgets older input data, increasing adaptivity
• In Dasher: predict user input with a model based on training text
– Adaptivity is even more important
19/24
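
A minimal Python sketch of count halving for a binary (zero/one) counter, assuming a fixed cap; halving both counts keeps their ratio roughly constant while letting newer data dominate the estimate.

MAX_COUNT = 255  # assumed cap for illustration; the value used in practice may differ

def update_counts(zeros, ones, bit):
    """Increment the count for the observed bit; if either count reaches
    the cap, halve both (rounding up so neither drops to zero)."""
    if bit == 0:
        zeros += 1
    else:
        ones += 1
    if zeros >= MAX_COUNT or ones >= MAX_COUNT:
        zeros = (zeros + 1) // 2
        ones = (ones + 1) // 2
    return zeros, ones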
Count halving: Results
20/24
Count halving: Results
21/24
Results: Enron
22/24
Combining PPM and CTW
• Select the locally best model, or weight the models together (weighting sketched below)
• More alpha parameters for PPM, learned from data
• PPM-like sharing, with a prior over context trees, as in CTW
23/24
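
A minimal Python sketch of weighting the two models together as a Bayesian mixture, assuming each model exposes a hypothetical predict(context) method returning a distribution over symbols; the weights track how well each model has predicted the text so far. Selecting the locally best model would instead use only the model with the largest current weight.

def mixture_predict(models, weights, context):
    """Mix the models' predictions using normalised weights."""
    total = sum(weights)
    mixed = {}
    for model, w in zip(models, weights):
        for sym, p in model.predict(context).items():
            mixed[sym] = mixed.get(sym, 0.0) + (w / total) * p
    return mixed

def update_weights(models, weights, context, observed):
    """Reweight each model by the probability it assigned to the symbol
    that was actually observed (a small floor avoids zero weights)."""
    return [w * max(model.predict(context).get(observed, 0.0), 1e-12)
            for model, w in zip(models, weights)]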
Conclusions
• PPM and CTW have different strengths, so it makes sense to try combining them
• Decomposition and count scaling may give clues for improving PPM
• Look at performance on out-of-domain text in more detail
24/24
Experimental Parameters
• Context depth: 5
• Smoothing: 5%
• PPM – alpha: 0.49, beta: 0.77
• CTW – w: 0.05, alpha: 1/128
25/24
Comparing language models
• PPM
– Quickly learns repeating strings
• CTW
– Works on a set of all possible tree models
– Not sensitive to parameter D, max. model depth
– Easy to increase adaptivity
– The weight factor (escape probability) is strictly defined
26/24