
A Random Text Model for the Generation of Statistical Language Invariants
Chris Biemann
University of Leipzig, Germany
HLT-NAACL 2007, Rochester, NY, USA
Monday, April 23, 2007
Outline
• Previous random text models
• Large-scale measures for text
• A novel random text model
• Comparison to natural language text
Necessary property: Zipf's Law
• Zipf: Ordering the words in a corpus by descending frequency, the relation between the frequency of a word and its rank r is given by f(r) ~ r^(-z), where z is the exponent of the power law and corresponds to the slope of the curve in a log-log plot. For word frequencies in NL, z ≈ 1.
• Zipf-Mandelbrot: f(r) ~ (r + c1)^(-(1+c2)): approximates the lower frequencies at very high ranks.
[Figure: rank-frequency plot (log-log): spoken English, power law z=1.4, Zipf-Mandelbrot with c1=10, c2=0.4.]
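To make the first measure concrete, here is a minimal sketch (my own illustration, not from the talk) of how the rank-frequency curve is read off a token stream; Zipf's law predicts f(r) ~ r^(-z) with z ≈ 1:

```python
from collections import Counter

def rank_frequency(tokens):
    """Word frequencies sorted in descending order; ranks start at 1."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))

# Toy example: plotting rank vs. frequency on log-log axes
# makes z readable as the (negative) slope.
tokens = "the cat sat on the mat and the dog sat too".split()
for rank, freq in rank_frequency(tokens):
    print(rank, freq)
```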
Previous Random Text Models
B. B. Mandelbrot (1953)
• Sometimes called the “monkey at the typewriter”
• With a probability w, a word separator is generated at each step,
• with probability (1-w)/N, a letter from an alphabet of size N is generated
H. A. Simon (1955)
• No alphabet of single letters
• at each time step, a previously unseen new word is added to the stream with probability α, whereas with probability (1-α), the next word is chosen amongst the words at previous positions.
• results in a frequency distribution that follows a power law with exponent z = (1-α).
• Modified by Zanette and Montemurro (2002):
– sublinear vocabulary growth for higher exponents
– Zipf-Mandelbrot law via a maximum probability threshold
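For concreteness, minimal sketches of both generators as just described (my own illustration; function names and parameter defaults are assumptions):

```python
import random
import string

def mandelbrot_stream(steps, alphabet=string.ascii_lowercase, w=0.2):
    """Monkey at the typewriter: at each step, emit a word separator
    with probability w, otherwise a uniformly chosen letter."""
    chars = [" " if random.random() < w else random.choice(alphabet)
             for _ in range(steps)]
    return "".join(chars).split()

def simon_stream(steps, alpha=0.1):
    """Simon's model: with probability alpha add a previously unseen
    word; otherwise repeat a word from a previous position, which
    favours words in proportion to their frequency so far."""
    stream = []
    for t in range(steps):
        if random.random() < alpha or not stream:
            stream.append(f"w{t}")  # previously unseen word
        else:
            stream.append(random.choice(stream))
    return stream
```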
Critique on Previous Models
• Mandelbrot: All words of the same length are equiprobable, as all letters are equiprobable.
→ Ferrer i Cancho and Solé (2002): Initialisation with letter probabilities obtained from natural language text solves this problem, but where do these letter frequencies come from?
• Simon: No concept of "letter" at all.
• Both:
– no concept of sentence
– no word order restrictions: Simon = bag of words; Mandelbrot does not take the generated stream into account at all
Large-scale Measures for Text
• Zipf's law and lexical spectrum: the rank-frequency plot should follow a power law with z ≈ 1; the frequency spectrum (probability of frequencies) should follow a power law with z ≈ 2 (Pareto distribution)
• Word length: should be distributed as in natural language text, according to a variant of the gamma distribution (Sigurd et al. 2004)
• Sentence length: should also be distributed as in NL, following the same kind of gamma distribution (both length measures are sketched below)
• Significant neighbour-based co-occurrence graph: should be similar to NL in terms of degree distribution and connectivity
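A minimal sketch (my own, not from the talk) of how the two length distributions can be collected, together with the gamma-distribution variant they are compared against:

```python
from collections import Counter

def length_distributions(sentences):
    """Count word lengths (in letters) and sentence lengths (in words)
    over a corpus given as lists of word lists."""
    word_len, sent_len = Counter(), Counter()
    for sentence in sentences:
        sent_len[len(sentence)] += 1
        for word in sentence:
            word_len[len(word)] += 1
    return word_len, sent_len

def gamma_variant(x, a, b):
    """Gamma-distribution variant f(x) ~ x^a * b^x (Sigurd et al. 2004);
    the word-length comparison later in the talk fits a=1.5, b=0.45."""
    return x ** a * b ** x
```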
A Novel Random Text Model
Two parts:
• Word Generator
• Sentence Generator
Both follow the principle of beaten tracks:
• Memorize what has been generated before
• Generate with higher probability if generated before more
often
Inspired by Small World network generation, especially
(Kumar et al. 1999).
Word Generator
• Initialisation:
– Letter graph of N letters.
– Vertices are connected to themselves with weight 1.
• Choice:
– When generating a word, the generator chooses a letter x according to its probability P(x), which is computed as the normalized weight sum of its outgoing edges:
P(x) = weightsum(x) / Σ_{v ∈ V} weightsum(v), with weightsum(y) = Σ_{u ∈ neigh(y)} weight(y, u)
• Parameter:
– At every position, the word ends with probability w ∈ (0,1) or generates the next letter according to the letter production probability given above.
• Update:
– For every letter bigram, the weight of the directed edge between the
preceding and current letter in the letter graph is increased by one.
• Effect: self-reinforcement of letter probabilities:
– the more often a letter is generated, the higher its weight sum will be in
subsequent steps,
– leading to an increased generation probability.
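Putting initialisation, choice, parameter, and update together, a minimal sketch of the word generator (my own Python rendering; class and method names are assumptions, and I draw the first letter of a word the same way as all later ones):

```python
import random
from collections import defaultdict

class WordGenerator:
    """Letters are chosen in proportion to the summed weights of their
    outgoing edges in a letter graph; every generated letter bigram
    reinforces its edge, implementing the beaten-tracks principle."""

    def __init__(self, alphabet, w):
        self.alphabet = list(alphabet)
        self.w = w  # word-end probability
        self.weight = defaultdict(int)
        for x in self.alphabet:
            self.weight[(x, x)] = 1  # initialisation: self-loops of weight 1

    def _weightsum(self, y):
        return sum(self.weight[(y, u)] for u in self.alphabet)

    def _next_letter(self):
        sums = [self._weightsum(x) for x in self.alphabet]
        return random.choices(self.alphabet, weights=sums)[0]

    def generate_word(self):
        letters = [self._next_letter()]
        while random.random() >= self.w:  # continue with probability 1-w
            letters.append(self._next_letter())
        for a, b in zip(letters, letters[1:]):  # update: reinforce bigram edges
            self.weight[(a, b)] += 1
        return "".join(letters)
```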
Word Generator Example
[Figure: word generator example over the letters A, B, C. The small numbers next to the edges are edge weights; the letter probabilities for the next step are P(A)=0.4, P(B)=0.4, P(C)=0.2.]
Measures on the Word Generator
[Figure: rank-frequency plot (word generator w=0.2, power law z=1, Mandelbrot model) and lexical spectrum plot P(frequency) (word generator w=0.2, power law z=2, Mandelbrot model), both on log-log axes.]
• The word generator fulfills these measures much better than the Mandelbrot model.
• For the other measures, we need something extra...
Sentence Generator I
• Initialisation:
– Word graph is initialized with a begin-of-sentence (BOS) and an
end-of-sentence (EOS) symbol, with an edge of weight 1 from BOS
to EOS.
• Word graph (directed):
– Vertices correspond to words
– Edge weights correspond to the number of times two words were generated in sequence
• Generation:
– A random walk on the directed edges starts at the BOS vertex.
– With probability (1-s), i.e. unless a new word is generated (see next slide), an existing edge is followed from the current vertex to the next vertex.
– The probability of choosing endpoint X among the endpoints of all outgoing edges of the current vertex C is given by
P(word = X) = weight(C, X) / Σ_{N ∈ neigh(C)} weight(C, N)
Sentence Generator II
• Parameter:
– With probability s ∈ (0,1), a new word is generated by the word generator model.
– The next word after a newly generated word is chosen from the word graph in proportion to its weighted indegree: the probability of choosing an existing vertex E as successor of a newly generated word N is given by
P(word = E) = indgw(E) / Σ_{v ∈ V} indgw(v), with indgw(X) = Σ_{v ∈ V} weight(v, X)
• Update:
– For each sequence of two words generated, the weight of the
directed edge between them is increased by 1
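Combining the random-walk step, the new-word step, and the update, a minimal sketch of the sentence generator (again my own rendering with hypothetical names; I assume edge weights are reinforced immediately as the walk proceeds):

```python
import random
from collections import defaultdict

BOS, EOS = "<BOS>", "<EOS>"

class SentenceGenerator:
    """Random walk over a growing word graph: with probability s a new
    word from the word generator is inserted, otherwise an existing
    edge is followed; every traversed edge gains weight 1."""

    def __init__(self, word_generator, s):
        self.wg = word_generator
        self.s = s  # new-word probability
        self.out = defaultdict(lambda: defaultdict(int))
        self.out[BOS][EOS] = 1  # initialisation: edge of weight 1, BOS -> EOS
        self.indegree = defaultdict(int)
        self.indegree[EOS] = 1

    def _follow_edge(self, current):
        successors = list(self.out[current])
        weights = [self.out[current][x] for x in successors]
        return random.choices(successors, weights=weights)[0]

    def _pick_by_indegree(self):
        # successor of a newly generated word: weighted indegree
        vertices = list(self.indegree)
        weights = [self.indegree[v] for v in vertices]
        return random.choices(vertices, weights=weights)[0]

    def generate_sentence(self):
        sentence, current = [], BOS
        while True:
            if random.random() < self.s:
                new = self.wg.generate_word()  # new word enters the graph
                self.out[current][new] += 1
                self.indegree[new] += 1
                sentence.append(new)
                current = new
                nxt = self._pick_by_indegree()
            else:
                nxt = self._follow_edge(current)
            self.out[current][nxt] += 1  # update: reinforce traversed edge
            self.indegree[nxt] += 1
            if nxt == EOS:
                return sentence  # may be empty; the talk omits empty output
            sentence.append(nxt)
            current = nxt
```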
Sentence Generator Example
[Figure: sentence generator example. In the last step, the second CA was generated as a new word by the word generator. Empty sentences are generated frequently; these are omitted in the output.]
Comparison to Natural Language
• Corpus for comparison: the first 1 million words of the BNC, spoken English.
• 26 letters, uppercase, punctuation removed → same in the word generator
• 125,395 sentences → set s=0.08, remove the first 50K sentences
• Average sentence length: 7.975 words
• Average word length: 3.502 letters → w=0.4
Sample sentences from the corpus:
OOH
OOH
ERM
WOULD LIKE A CUP OF THIS ER
MM
SORRY NOW THAT S
NO NO I DID NT
I KNEW THESE PEWS WERE HARD
OOH I DID NT REALISE THEY WERE THAT BAD
I FEEL SORRY FOR MY POOR CONGREGATION
Word Frequency
[Figure: rank-frequency plot (log-log): sentence generator vs. English, with power law z=1.5.]
• Zipf-Mandelbrot distribution
• Smooth curve
• Similar to English
Word Length
[Figure: word length distribution (log-log): sentence generator vs. English, with gamma-distribution fit.]
• More 1-letter words in the sentence generator
• Longer words in the sentence generator
• Curve is similar
• Gamma distribution here: f(x) ~ x^1.5 · 0.45^x
Sentence Length
[Figure: sentence length distribution (log-log): sentence generator vs. English.]
• Longer sentences in English
• More 2-word sentences in English
• Curve is similar
Neighbor-based Co-occurrence Graph
[Figure: degree distribution of the neighbour-based co-occurrence graph (log-log): sentence generator, English, word generator, power law z=2.]
                     English sample   sentence gen.   word gen.   random graph (ER)
# of vertices        7154             15258           3498        10000
avg. shortest path   2.933            3.147           3.601       4.964
avg. degree          9.445            6.307           3.069       7
clustering coeff.    0.2724           0.1497          0.0719      6.89E-4
z                    1.966            2.036           2.007       -
• Min. co-occurrence frequency = 2, min. log-likelihood ratio = 3.84 (the significance filter is sketched below)
• The NB-graph is a small world
• Qualitatively, English and the sentence generator are similar
• The word generator shows far fewer co-occurrences
• Factor of 2 between English and the sentence generator in clustering coefficient and number of vertices
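The significance filter can be sketched as follows (my reconstruction: Dunning's log-likelihood ratio G² over neighbour pairs, with 3.84 the 5% critical value of χ² with one degree of freedom; function names are mine):

```python
import math
from collections import Counter

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) of a 2x2 contingency table:
    k11 = co-occurrences of the pair, k12/k21 = one word without
    the other, k22 = neither word."""
    n = k11 + k12 + k21 + k22
    total = 0.0
    for obs, exp in (
        (k11, (k11 + k12) * (k11 + k21) / n),
        (k12, (k11 + k12) * (k12 + k22) / n),
        (k21, (k21 + k22) * (k11 + k21) / n),
        (k22, (k21 + k22) * (k12 + k22) / n),
    ):
        if obs > 0:
            total += obs * math.log(obs / exp)
    return 2 * total

def neighbour_cooc_graph(sentences, min_freq=2, min_llr=3.84):
    """Edges between immediate neighbours co-occurring at least
    min_freq times with a significant log-likelihood ratio."""
    pair, left, right, n = Counter(), Counter(), Counter(), 0
    for sent in sentences:
        for a, b in zip(sent, sent[1:]):
            pair[(a, b)] += 1
            left[a] += 1
            right[b] += 1
            n += 1
    edges = set()
    for (a, b), k11 in pair.items():
        k12 = left[a] - k11
        k21 = right[b] - k11
        k22 = n - left[a] - right[b] + k11
        if k11 >= min_freq and llr(k11, k12, k21, k22) >= min_llr:
            edges.add((a, b))
    return edges
```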
Formation of Sentences
• The word graph grows and, at every time step, contains the full vocabulary generated so far.
• Random walks starting from BOS always end in EOS.
• Sentence length slowly increases: the random walk has more possibilities before finally arriving at the EOS vertex.
• Sentence length is influenced by both parameters of the model (see the usage sketch below):
– the word end probability w in the word generator
– the new word probability s in the sentence generator
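For illustration, how the two sketches above combine, with the parameters matched to the BNC comparison earlier (w=0.4, s=0.08); this is a usage example, not the talk's actual implementation:

```python
import string

wg = WordGenerator(string.ascii_uppercase, w=0.4)
sg = SentenceGenerator(wg, s=0.08)

for _ in range(10):
    words = sg.generate_sentence()
    if words:  # empty sentences occur frequently and are omitted
        print(" ".join(words), ".")
```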
[Figure: sentence length growth, average sentence length vs. text interval (log-log), for w=0.4 s=0.08; w=0.4 s=0.1; w=0.17 s=0.22; w=0.3 s=0.09; reference curve x^0.25.]
Conclusion
Novel random text model:
• obeys Zipf's law
• obeys the word length distribution
• obeys the sentence length distribution
• shows similar neighbour-based co-occurrence data
First model that:
• produces a smooth lexical spectrum without initial letter probabilities
• incorporates the notion of a sentence
• models word order restrictions
Sentence generator at work
Beginning: Q . U . RFXFJF . G . G . U . R . U . RFXFJF .
XXF . RFXFJF . U . QYVHA . RFXFJF . R TCW . CV . Z U
. G . XXF . RFXFJF . M XXF . Q . G . RFXFJF . U .
RFXFJF . RFXFJF . Z U . G . RFXFJF . RFXFJF . M XXF
. R . Z U .
Later: X YYOXO QO OEPUQFC T TYUP QYFA FN XX TVVJ U OCUI
X HPTXVYPF . FVFRIK . Y TXYP VYFI QC TPS Q UYYLPCQXC
. G QQE YQFC XQXA Z JYQPX. QRXQY VCJ XJ YAC VN PV
VVQF C XJN JFEQ QYVHA. U VIJ Q YT JU OF DJWI QYM U
YQVCP QOTE OD XWY AGFVFV U XA YQYF AVYPO CDQQ TY NTO
FYF QHT T YPXRQ R GQFRVQ . MUHVJ Q VAVF YPF QPXPCY Q
YYFRQQ. JP VGOHYY F FPYF OM SFXNJJ A VQA OGMR L QY .
FYC T PNXTQ . R TMQCQ B QQTF J PVX YT DTYO RXJYYCGFJ
CYFOFUMOCTM PQRYQQYC AHXZQJQ JTW O JJ VX QFYQ YTXJTY
YTYYFXK . RFXFJF JY XY RVV J YURQ CM QOXGQ QFMVGPQ.
OY FDXFOXC. N OYCT . L MMYMT CY YAQ XAA J YHYJ MPQ
XAQ UYBX RW XXF O UU COF XXF CQPQ VYYY XJ YACYTF FN
. TA KV XJP O EGV J HQY KMQ U .
Questions?
Danke sdf sehr gf thank fdgf you g fd tusen sd ee takk erte
dank we u trew wel wwd muchas werwe ewr gracias
werwe rew merci mille werew re ew ee ew grazie d fsd ffs
df d fds spassiva fs fdsa rtre trerere rteetr trpemma eedm