STATISTICAL LANGUAGE
MODELS FOR CROATIAN
WEATHER-DOMAIN CORPUS
Lucia Načinović, Sanda Martinčić-Ipšić and Ivo Ipšić
Department of Informatics, University of Rijeka
lnacinovic, smarti, ivoi @inf.uniri.hr
Introduction
• Statistical language modelling estimates the
regularities in natural languages
– the probabilities of word sequences which are usually
derived from large collections of text material
• Employed in:
– Speech recognition
– Optical character recognition
– Handwriting recognition
– Machine translation
– Spelling correction
– ...
N-gram language models
• The most widely-used LMs
– Based on the probability of a word wn given the
preceding sequence of words w1 ... wn-1
– Bigram models (2-grams)
• determine the probability of a word given the
previous word
– Trigram models (3-grams)
• determine the probability of a word given the
previous two words
Language model perplexity
• The most common metric for evaluating a
language model is the probability that the
model assigns to test data, or measures
derived from it:
– cross-entropy
– perplexity
Cross-entropy
• The cross-entropy of a model p(T) on data T:

  H_p(T) = -\frac{1}{W_T} \log_2 p(T)

• W_T - the length of the text T measured in words
Perplexity
• The reciprocal value of the average probability
assigned by the model to each word in the test
set T
• The perplexity PP_p(T) of a model is related to
cross-entropy by the equation

  PP_p(T) = 2^{H_p(T)}

• Lower cross-entropies and perplexities are better
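A minimal sketch of how the two formulas above can be computed, assuming we already have the per-word probabilities the model assigns to the test text; the function name and toy numbers are illustrative only, not from the paper:

```python
import math

def cross_entropy_and_perplexity(word_probs):
    """Compute H_p(T) = -(1/W_T) * log2 p(T) and PP_p(T) = 2^{H_p(T)}.

    word_probs: the probabilities the model assigns to each word of the
    test text T, so p(T) is their product and W_T = len(word_probs).
    """
    w_t = len(word_probs)                              # length of T in words
    log2_p_t = sum(math.log2(p) for p in word_probs)   # log2 p(T)
    h = -log2_p_t / w_t                                # cross-entropy, bits per word
    pp = 2 ** h                                        # perplexity
    return h, pp

# toy check: a model assigning 0.25 to each of four test words
print(cross_entropy_and_perplexity([0.25, 0.25, 0.25, 0.25]))  # (2.0, 4.0)
```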
Smoothing
• Data sparsity problem
– N-gram models - trained from finite corpus
– some perfectly acceptable N-grams are missing:
probability=0
• Solution – smoothing techniques
– adjust the maximum likelihood estimate of probabilities
to produce more accurate probabilities
– adjust low probabilities such as zero probabilities
upward, and high probabilities downward
Smoothing techniques used
in our research
• Additive smoothing
• Absolute discounting
• Witten-Bell technique
• Kneser-Ney technique
Additive smoothing
• one of the simplest types of smoothing
• we add a factor δ (0 < δ ≤ 1) to every count
• Formula for additive smoothing:

  p_{add}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\delta + c(w_{i-n+1}^{i})}{\delta \, |V| + \sum_{w_i} c(w_{i-n+1}^{i})}

• V - the vocabulary (set of all words considered)
• c - the number of occurrences
• values of the δ parameter used in our research:
  0.1, 0.5 and 1
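A small Python sketch of the add-δ formula above for the bigram case (n = 2); the toy corpus, vocabulary and function name are our own illustration, not the actual weather corpus:

```python
from collections import Counter

def additive_bigram_prob(w_prev, w, bigram_counts, vocab, delta=0.5):
    """p_add(w | w_prev) = (delta + c(w_prev w)) / (delta*|V| + sum_w' c(w_prev w'))."""
    context_total = sum(c for (p, _), c in bigram_counts.items() if p == w_prev)
    return (delta + bigram_counts[(w_prev, w)]) / (delta * len(vocab) + context_total)

# toy data (illustrative words only)
words = "jako nevrijeme jako more jako nevrijeme".split()
bigram_counts = Counter(zip(words, words[1:]))
vocab = set(words)

print(additive_bigram_prob("jako", "nevrijeme", bigram_counts, vocab))  # seen bigram
print(additive_bigram_prob("jako", "jako", bigram_counts, vocab))       # unseen bigram, still > 0
```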
Absolute discounting
• When there is little data for directly estimating an
n-gram probability, useful information can be
provided by the corresponding (n-1)-gram
• Absolute discounting - the higher-order
distribution is created by subtracting a fixed
discount D from each non-zero count:
  p_{abs}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D,\ 0\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + (1 - \lambda_{w_{i-n+1}^{i-1}}) \, p_{abs}(w_i \mid w_{i-n+2}^{i-1})
• Values of D used in our research: 0.3, 0.5, 1
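A bigram-level sketch of the formula above: the probability mass (1 − λ) freed by discounting is redistributed over a lower-order model. Using the plain unigram distribution as that lower-order model and these particular function names is our assumption for illustration:

```python
from collections import Counter

def absolute_discount_bigram(w_prev, w, bigram_counts, unigram_probs, D=0.5):
    """p_abs(w | w_prev) = max(c(w_prev w) - D, 0) / c(w_prev *)
                           + (1 - lambda_{w_prev}) * p_lower(w),
    where (1 - lambda_{w_prev}) = D * T(w_prev) / c(w_prev *) is the mass
    freed by discounting each of the T(w_prev) seen bigram types."""
    context_total = sum(c for (p, _), c in bigram_counts.items() if p == w_prev)
    seen_types = sum(1 for (p, _) in bigram_counts if p == w_prev)
    discounted = max(bigram_counts[(w_prev, w)] - D, 0) / context_total
    freed_mass = D * seen_types / context_total
    return discounted + freed_mass * unigram_probs.get(w, 0.0)

words = "jako nevrijeme jako more jako nevrijeme".split()  # toy data
bigram_counts = Counter(zip(words, words[1:]))
unigram_probs = {w: c / len(words) for w, c in Counter(words).items()}
print(absolute_discount_bigram("jako", "nevrijeme", bigram_counts, unigram_probs, D=0.5))
```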
Witten-Bell technique
• The number of different words seen in the corpus is
used to help determine the probability of words that
never occur in the corpus
• Example for bigrams - the total probability of all words
never seen after w_x, where T(w_x) is the number of distinct
word types observed after w_x and N(w_x) the number of word
tokens observed after w_x:

  \sum_{i:\, c(w_x w_i) = 0} p(w_i \mid w_x) = \frac{T(w_x)}{N(w_x) + T(w_x)}
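A bigram sketch in the Jurafsky & Martin style: the reserved mass T(w_x)/(N(w_x)+T(w_x)) is spread evenly over the Z(w_x) vocabulary words never seen after w_x. The even split and the names below are assumptions made for illustration, not necessarily the exact variant used in the paper:

```python
from collections import Counter

def witten_bell_bigram(w_prev, w, bigram_counts, vocab):
    """Seen bigram:   c(w_prev w) / (N(w_prev) + T(w_prev))
    Unseen bigram: T(w_prev) / (Z(w_prev) * (N(w_prev) + T(w_prev)))
    N = tokens seen after w_prev, T = distinct types seen after w_prev,
    Z = |V| - T = vocabulary words never seen after w_prev."""
    followers = {b: c for (p, b), c in bigram_counts.items() if p == w_prev}
    N = sum(followers.values())
    T = len(followers)
    Z = len(vocab) - T
    if w in followers:
        return followers[w] / (N + T)
    return T / (Z * (N + T))

words = "jako nevrijeme jako more jako nevrijeme".split()  # toy data
bigram_counts = Counter(zip(words, words[1:]))
vocab = set(words)
print(witten_bell_bigram("jako", "more", bigram_counts, vocab))   # seen after "jako"
print(witten_bell_bigram("jako", "jako", bigram_counts, vocab))   # never seen after "jako"
```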
Kneser-Ney technique
• An extension of absolute discounting
• the lower-order distribution that one
combines with a higher-order distribution
is built in a novel manner:
– it is taken into consideration only when few or
no counts are present in the higher-order
distribution
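A minimal sketch of interpolated Kneser-Ney for bigrams: the higher-order part is the same absolute discounting as before, but the lower-order distribution uses continuation counts (in how many distinct contexts a word appears) instead of raw frequencies. The discount value and names are illustrative; the exact variant used in the paper may differ:

```python
from collections import Counter

def kneser_ney_bigram(w_prev, w, bigram_counts, D=0.75):
    """Interpolated Kneser-Ney, bigram case:
    p_KN(w | w_prev) = max(c(w_prev w) - D, 0) / c(w_prev *)
                       + (D * T(w_prev) / c(w_prev *)) * p_cont(w),
    where p_cont(w) = |{w' : c(w' w) > 0}| / |{distinct bigram types}|."""
    context_total = sum(c for (p, _), c in bigram_counts.items() if p == w_prev)
    seen_types = sum(1 for (p, _) in bigram_counts if p == w_prev)
    p_cont = sum(1 for (_, b) in bigram_counts if b == w) / len(bigram_counts)
    discounted = max(bigram_counts[(w_prev, w)] - D, 0) / context_total
    return discounted + (D * seen_types / context_total) * p_cont

words = "jako nevrijeme jako more jako nevrijeme".split()  # toy data
bigram_counts = Counter(zip(words, words[1:]))
print(kneser_ney_bigram("jako", "nevrijeme", bigram_counts))
```

The continuation count is what keeps a word that is frequent but appears only after one particular context from receiving too much back-off probability.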
Smoothing implementation
• 2-gram, 3-gram and 4-gram language
models were built
• Corpus: 290 480 words
– 2 398 1-grams,
– 18 694 2-grams,
– 23 021 3-grams and
– 29 736 4-grams
• On each of these models four different smoothing
techniques were applied
Corpus
• Major part developed from 2002 until 2005
and some parts added later
• Includes the vocabulary related to
weather, bio and maritime forecast, river
water levels and weather reports
• Divided into 10 parts
– 9/10 used for building language models
– 1/10 used for evaluating those models in
terms of their estimated perplexities
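One straightforward way to realise the 9/10 vs. 1/10 split in code; the actual partitioning used for the experiments is not specified, so this is only an illustrative sketch:

```python
def split_corpus(sentences, held_out_part=0):
    """Split the corpus into 10 parts: nine tenths for LM training,
    one tenth held out for perplexity evaluation."""
    folds = [sentences[i::10] for i in range(10)]   # 10 interleaved parts
    test = folds[held_out_part]
    train = [s for i, f in enumerate(folds) if i != held_out_part for s in f]
    return train, test
```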
Results given by the perplexities of the LMs
         | Without   |    Additive smoothing    |   Absolute discounting  | Witten- | Kneser-
         | smoothing | δ=0.1  | δ=0.5  | δ=1    | D=0.3 | D=0.5 | D=1     | Bell    | Ney
2-gram   | 19.87     | 28.8   | 51.6   | 73.5   | 19.61 | 19.64 | 21.6    | 19.75   | 18.96
3-gram   | 8.45      | 30.04  | 86.9   | 144.2  | 8.17  | 8.22  | 9.30    | 8.25    | 7.63
4-gram   | 6.04      | 42.9   | 142.6  | 239.87 | 5.64  | 5.71  | 6.76    | 5.76    | 5.24
Conclusion
• In this paper we described the process of
language model building from the Croatian
weather-domain corpus
• We built models of different order:
– 2-grams
– 3-grams
– 4-grams
Conclusion
• We applied four different smoothing techniques:
– additive smoothing
– absolute discounting
– Witten-Bell technique
– Kneser-Ney technique
• We estimated and compared perplexities of
those models
• Kneser-Ney smoothing technique gives the best
results
Further work
• Prepare a more balanced corpus of Croatian
text and thus build a more complete
language model
• Other LMs
– class-based
• Other smoothing techniques
STATISTICAL LANGUAGE
MODELS FOR CROATIAN
WEATHER-DOMAIN CORPUS
Lucia Načinović, Sanda Martinčić-Ipšić and Ivo Ipšić
Department of Informatics, University of Rijeka
lnacinovic, smarti, ivoi @inf.uniri.hr
References
• Chen, Stanley F.; Goodman, Joshua. An empirical study of smoothing techniques for
  language modelling. Cambridge, MA: Computer Science Group, Harvard University, 1998
• Chou, Wu; Juang, Biing-Hwang. Pattern recognition in speech and language
  processing. CRC Press, 2003
• Jelinek, Frederick. Statistical Methods for Speech Recognition. Cambridge, MA: The
  MIT Press, 1998
• Jurafsky, Daniel; Martin, James H. Speech and Language Processing: An Introduction
  to Natural Language Processing, Computational Linguistics, and Speech Recognition.
  Upper Saddle River, New Jersey: Prentice Hall, 2000
• Manning, Christopher D.; Schütze, Hinrich. Foundations of Statistical Natural
  Language Processing. Cambridge, MA: The MIT Press, 1999
• Martinčić-Ipšić, Sanda. Raspoznavanje i sinteza hrvatskoga govora konteksno ovisnim
  skrivenim Markovljevim modelima, doktorska disertacija. Zagreb, FER, 2007
• Milharčič, Grega; Žibert, Janez; Mihelič, France. Statistical Language Modeling of
  SiBN Broadcast News Text Corpus. // Proceedings of 5th Slovenian and 1st
  International Language Technologies Conference 2006 / Erjavec, T.; Žganec Gros, J.
  (ed.). Ljubljana, Jožef Stefan Institute, 2006
• Stolcke, Andreas. SRILM – An Extensible Language Modeling Toolkit. // Proceedings
  Intl. Conf. on Spoken Language Processing. Denver, 2002, vol. 2, pp. 901-904
SRILM toolkit
• The models were built and evaluated using the
SRILM toolkit
• http://www.speech.sri.com/projects/srilm/
• ngram-count -text TRAINDATA -lm LM
• ngram -lm LM -ppl TESTDATA
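The two commands above can also be driven from a small script. This sketch only wraps the exact calls shown on the slide and assumes the SRILM binaries are on the PATH; the file names are placeholders:

```python
import subprocess

def build_and_evaluate(train_path, test_path, lm_path="weather.lm"):
    """Build an n-gram LM with ngram-count, then report test-set perplexity with ngram."""
    subprocess.run(["ngram-count", "-text", train_path, "-lm", lm_path], check=True)
    result = subprocess.run(["ngram", "-lm", lm_path, "-ppl", test_path],
                            capture_output=True, text=True, check=True)
    return result.stdout  # SRILM prints the perplexity summary to stdout

# e.g. build_and_evaluate("train.txt", "test.txt")
```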
Language model
• Speech recognition – converting an
acoustic signal into a sequence of words
• Through language modelling, the word sequences
of the language are statistically modelled
• A language model estimates the probability Pr(W)
for all possible word strings W = (w1, w2, ..., wi)
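For completeness, Pr(W) is normally factored with the chain rule and then approximated by limiting the history to n − 1 words, which leads to the n-gram models discussed earlier; this standard decomposition is implied rather than shown on the slide:

```latex
\Pr(W) = \prod_{k=1}^{i} \Pr(w_k \mid w_1, \ldots, w_{k-1})
       \approx \prod_{k=1}^{i} \Pr(w_k \mid w_{k-n+1}, \ldots, w_{k-1})
```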
[Figure: System diagram of a generic speech recognizer based on statistical models]
• Bigram language models (2-grams)
– Central goal: to determine the probability of a word
given the previous word
• Trigram language models (3-grams)
– Central goal: to determine the probability of a word
given the previous two words
The simplest way to approximate this probability is to
compute:
  p_{ML}(w_i \mid w_{i-2} w_{i-1}) = \frac{c(w_{i-2} w_{i-1} w_i)}{c(w_{i-2} w_{i-1})}

– This value is called the maximum likelihood (ML) estimate
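A direct transcription of the ML estimate into code, with a tiny made-up word sequence to show how an unseen trigram or history immediately yields a zero probability, which is the sparsity problem smoothing addresses:

```python
from collections import Counter

def ml_trigram_prob(w1, w2, w3, trigram_counts, bigram_counts):
    """p_ML(w3 | w1 w2) = c(w1 w2 w3) / c(w1 w2); 0.0 if the history is unseen."""
    history = bigram_counts[(w1, w2)]
    return trigram_counts[(w1, w2, w3)] / history if history else 0.0

words = "umjereno jako more umjereno jako nevrijeme".split()  # toy data
trigram_counts = Counter(zip(words, words[1:], words[2:]))
bigram_counts = Counter(zip(words, words[1:]))
print(ml_trigram_prob("umjereno", "jako", "more", trigram_counts, bigram_counts))  # 0.5
```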
• Linear interpolation - a simple method for
combining the information from lower-order n-gram
models in estimating higher-order probabilities
• A general class of interpolated models is
described by Jelinek and Mercer:
  p_{interp}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}} \, p_{ML}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}}) \, p_{interp}(w_i \mid w_{i-n+2}^{i-1})

• The nth-order smoothed model is defined
recursively as a linear interpolation between the
nth-order maximum likelihood model and the
(n-1)-th-order smoothed model
• Given fixed p_{ML}, it is possible to search
efficiently for the \lambda_{w_{i-n+1}^{i-1}} factors that maximize the
probability of some data using the Baum–Welch
algorithm
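A one-line sketch of the interpolation step for a trigram. A single fixed λ is used here for simplicity, whereas in the formula above the weight depends on the history and would be tuned (e.g. with Baum–Welch / EM) on held-out data:

```python
def interpolated_trigram_prob(p_ml_trigram, p_interp_bigram, lam=0.7):
    """p_interp(w | u v) = lam * p_ML(w | u v) + (1 - lam) * p_interp(w | v)."""
    return lam * p_ml_trigram + (1 - lam) * p_interp_bigram

# e.g. an unseen trigram (ML estimate 0.0) still gets mass from the bigram model
print(interpolated_trigram_prob(0.0, 0.12))  # 0.036
```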
• In absolute discounting smoothing, instead of
multiplying the higher-order maximum-likelihood
distribution by a factor \lambda_{w_{i-n+1}^{i-1}}, the
higher-order distribution is created by
subtracting a fixed discount D from each
non-zero count:

  p_{abs}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D,\ 0\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + (1 - \lambda_{w_{i-n+1}^{i-1}}) \, p_{abs}(w_i \mid w_{i-n+2}^{i-1})
• Values of D used in research: 0.3, 0.5, 1