Modelling Polish Intonation - uni

Modelling Polish Intonation for
Speech Synthesis
Dominika Oliver
23 May 2002
Plan
 Aims & Objectives
 Reasons
 Methodology
Building TTS systems

Basic building blocks:

pre-processing: analysis of raw and labelled text into
identifiable words.



Text normalisation (abbreviations, dates, money, time
indications, addresses, telephone numbers, bank accounts, etc.)
tokenization, mapping tokens to words, resolving mark-up
languages
linguistic module: from words to segments:


Orthographic to phonetic conversion of words (morphological
analysis, g2p, syllabification, stress assignment)
Sentence analysis (resolve pronunciation ambiguities,
syntactic, lexical and semantic analysis)
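
A minimal sketch of the pre-processing block (tokenization, then mapping tokens to speakable words), assuming a toy abbreviation table and digit-by-digit reading of numbers. The names ABBREVIATIONS, DIGITS, tokenize and normalise are illustrative only and do not come from any particular TTS system.

```python
import re

# Hypothetical abbreviation table; a real system would use a large lexicon.
ABBREVIATIONS = {"dr": "doctor", "tel": "telephone", "etc": "et cetera"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def tokenize(text):
    """Split raw text into number, word and punctuation tokens."""
    return re.findall(r"\d+|\w+|[^\w\s]", text)


def normalise(tokens):
    """Map tokens to words: expand abbreviations, spell out digits."""
    words = []
    for tok in tokens:
        low = tok.lower()
        if tok.isdecimal():
            # Read numbers digit by digit (a simplifying assumption).
            words.extend(DIGITS[int(d)] for d in tok)
        elif low in ABBREVIATIONS:
            words.append(ABBREVIATIONS[low])
        elif any(ch.isalnum() for ch in tok):
            words.append(low)
        # Pure punctuation tokens are dropped in this sketch.
    return words


print(normalise(tokenize("Dr Smith, tel. 555")))
```

A real front end would also decide how a number should be read (as a date, an amount of money, a telephone number), which this sketch does not attempt.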
Building TTS systems (cont.)

phonetic module
 F0 and durations (and anything else appropriate for
waveform synthesis)
 Prosodic modelling (generation of intonation contour by
intonation model, prosodic phrase, accent and F0
prediction)

acoustic module (Waveform synthesis)
 Conversion into digital speech signal
 From segments, F0 and duration to a waveform.
 There are many techniques to do this, concatenative
synthesis (diphone, unit selection), formant synthesis
and articulatory synthesis.
Terminology

Stress
- lexically specified distinction between strong and
weak syllables, a stressed syllable louder and
longer than an unstressed one

Tone
- lexically specified pitch movement, property of a
syllable

Accent
- post-lexical pitch movement, linked to a stressed
syllable

Pitch accent
- lexical pitch movement, property of a
word
Intonation in TTS
 Intonation prediction can be split into two tasks

Prediction of accents (and/or tones): this is done on a
per-syllable basis, identifying which syllables are to be
accented as well as what type of accent is required (if
appropriate for the theory).

Realization of F0 contour: given the accents/tones
generate an F0 contour.
Why is it important?

Rendering natural-sounding speech from raw text involves many
tasks, one of which is generating natural-sounding
intonation.

A number of intonation theories have been utilised in
various systems to try to do this task.

As the quality of speech synthesis improves, a greater demand
is put on the intonation system to produce more varied
intonation tunes.
Models of Intonation

Linear or Tone sequence models - generate values from left to right
as a sequence of values or movements.





British school - based on auditory analysis
Pierrehumbert 1980 - predominantly acoustic analysis
Dutch school - 't Hart, Collier and Cohen 1990 - perceptual data
Tilt - Taylor 1998 - phonetic
Superpositional or hierarchical models - generate a contour by
modelling factors separately (phone, syllable, word, phrase, sentence)
and then combining the partial models.

Fujisaki 1983, Grønnum 1992, Möbius et al. 1993
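
As an illustration of the superpositional idea, the sketch below adds a phrase component and an accent component to a base value in the log-F0 domain, following the general form of Fujisaki's command-response model. The parameter defaults (alpha, beta, gamma) and the command values in the example are illustrative assumptions, not values from the studies cited above.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase component (zero before onset)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent component, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0_contour(times, base_f0, phrase_cmds, accent_cmds):
    """Superpose base, phrase and accent components in log F0.

    phrase_cmds: list of (onset_time, magnitude)
    accent_cmds: list of (onset_time, offset_time, amplitude)
    """
    contour = []
    for t in times:
        log_f0 = math.log(base_f0)
        for t0, magnitude in phrase_cmds:
            log_f0 += magnitude * phrase_response(t - t0)
        for t1, t2, amplitude in accent_cmds:
            log_f0 += amplitude * (accent_response(t - t1)
                                   - accent_response(t - t2))
        contour.append(math.exp(log_f0))
    return contour


# One phrase command at 0.2 s and one accent command from 0.4 s to 0.7 s.
times = [0.0, 0.5, 1.0]
print(f0_contour(times, 100.0, [(0.2, 0.5)], [(0.4, 0.7, 0.4)]))
```

Each component is modelled separately and the partial contours are simply summed, which is the defining property of this class of models.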
Techniques of intonation modelling: using
Tilt & ToBI

Tilt and ToBI typify two major classes of intonation systems.

Tilt comes from a data-driven approach attempting to form an
abstraction of the natural contour directly from the waveform.

ToBI takes a more linguistic or phonological approach
specifying a small set of discrete labels which identify the
intonational space of accents and tones.

Also prosodic labelling systems
ToBI (Pierrehumbert, 1980)







Autosegmental-metrical approach, pitch movements are
decomposed into pitch levels.
Intonation phrases are modelled as sequences of (H) high and (L)
low pitch levels.
ToBI offers a well-defined intonation phonology for labelled speech.
Most widely available standard labelling system.
The ToBI labelling system itself does not define a mechanism to go
from the labels to an F0 contour, or the reverse. However, there are
both hand-written rule systems (e.g. Anderson, Pierrehumbert and
Liberman 1984) and statistically trained methods (e.g. Black and
Hunt 1996) which do this task.
Machine readable.
Increase in descriptive power: transcriptions can be compared
across dialects and languages (ToBI for English, GToBI for
German, SCToBI for Serbo-Croatian, ToDI for Dutch, etc.).
Tilt (Taylor 1998)

Tilt is a phonetic model of
intonation that represents
intonation as a sequence of
continuously parameterised
events (pitch accents or
boundary tones).

These parameters are called
tilt parameters, determined
directly from F0 contour.

They are: duration, amplitude
and tilt.

Imposes no categorial
classification on events.
Tilt (cont.)




Duration is a sum of the rise and fall durations.
Amplitude is the sum of the magnitudes of the rise and fall
amplitudes.
Tilt parameter: expresses the overall shape of the event, computed as
the difference of the rise and fall amplitudes divided by their sum.
The tilt parameter ranges from -1 to 1: -1 is a pure fall, 1 a pure
rise, and 0 equal portions of rise and fall.
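
The three definitions above can be written out directly. This is a sketch of the simplified amplitude-based definition given on this slide; the function name tilt_parameters and its argument names are illustrative, and amplitudes are assumed to be F0 excursions in Hz with durations in seconds.

```python
def tilt_parameters(rise_amp, fall_amp, rise_dur, fall_dur):
    """Return (duration, amplitude, tilt) for one intonational event.

    rise_amp, fall_amp: F0 excursions of the rise and the fall
                        (only their magnitudes are used)
    rise_dur, fall_dur: durations of the rise and the fall
    """
    a_rise, a_fall = abs(rise_amp), abs(fall_amp)
    duration = rise_dur + fall_dur
    amplitude = a_rise + a_fall
    if amplitude == 0:
        raise ValueError("event has neither a rise nor a fall")
    tilt = (a_rise - a_fall) / amplitude
    return duration, amplitude, tilt


print(tilt_parameters(20.0, 0.0, 0.1, 0.0))     # pure rise
print(tilt_parameters(0.0, -20.0, 0.0, 0.1))    # pure fall
print(tilt_parameters(10.0, -10.0, 0.1, 0.1))   # equal rise and fall
```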
Examples of intonation control
 Information provided by intonation:


Focus or given/new information
Emotions, word emphasis, syntactic disambiguation
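examples from Mary TTS (DFKI)
Gehen wir nach Hause !/? (Let's go home! / Are we going home?)
Der Zug fährt nach Frankfurt, oder? (The train goes to Frankfurt, doesn't it?)
Ist die Nummer 180? Nein, die Nummer ist 100 80. (Is the number 180? No, the number is 100 80.)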
Prosodic Labelling Systems
ToBI (Tones and Break Indices)

ToBI is an intonational labelling standard for speech
databases that is based in part on Janet Pierrehumbert's
thesis (Pierrehumbert 1980).






Made on the basis of a speech wave and F0 trace.
The labelling scheme consists of:
(1) words spoken (Orthographic tier)
(2) the degree of juncture between words (Break-index tier)
(3) intonation (Tone tier)
(4) comments (Miscellaneous tier)
Prosodic Labelling Systems
ToBI (cont.)

discrete intonation accent types: H*, H+!H*, L*, L*+H and
L+H*.

phrase accent type: H- and L-

boundary tones: L-L%, L-H%, H-L% and H-H%

break levels: 0, 1, 3, and 4 (2 reserved for special cases)
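
A minimal sketch of how the four tiers and the label inventory above might be held in a program and checked for labelling errors. The class LabelledWord, the function find_invalid and the example labelling are invented for illustration; they are not part of the ToBI standard or of any labelling tool.

```python
from dataclasses import dataclass, field

# Label inventory as listed above (H+!H* in its standard ToBI spelling).
PITCH_ACCENTS = {"H*", "H+!H*", "L*", "L*+H", "L+H*"}
PHRASE_ACCENTS = {"H-", "L-"}
BOUNDARY_TONES = {"L-L%", "L-H%", "H-L%", "H-H%"}
BREAK_INDICES = {0, 1, 2, 3, 4}  # 2 is reserved for special cases

TONE_LABELS = PITCH_ACCENTS | PHRASE_ACCENTS | BOUNDARY_TONES


@dataclass
class LabelledWord:
    """One word with its entries on the four tiers."""
    orthographic: str                           # orthographic tier
    break_index: int                            # break-index tier
    tones: list = field(default_factory=list)   # tone tier
    misc: str = ""                              # miscellaneous tier


def find_invalid(words):
    """Return (word, offending value) pairs outside the inventory."""
    problems = []
    for w in words:
        if w.break_index not in BREAK_INDICES:
            problems.append((w.orthographic, w.break_index))
        for tone in w.tones:
            if tone not in TONE_LABELS:
                problems.append((w.orthographic, tone))
    return problems


# Invented labelling, for illustration only.
utterance = [
    LabelledWord("Marianna", 1, ["H*"]),
    LabelledWord("made", 1),
    LabelledWord("the", 0),
    LabelledWord("marmalade", 4, ["H*", "L-L%"]),
]
print(find_invalid(utterance))
```

Because the label set is small and discrete, a check like this is straightforward, which is part of what makes the transcriptions machine readable.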
Prosodic Labelling Systems (cont.)
 Tilt

A Tilt labelling for an utterance consists of an assignment
of one of four basic intonational events:




pitch accents,
boundary tones,
connections,
silence (labelled a, b, c, sil).
Prosodic Labelling Systems (cont.)
Polish synthesis (examples)

What is available :
 Festival (University of Edinburgh, CSTR)

Realspeak (Scansoft)

Spiker (IVO Software)

SynTalk (Neurosoft)
Polish intonation model

British school (Jassem 1984, Demenko, 1999)




The description of accent and intonation at the linguistic level is
based on the main features of a British-English system developed
essentially by O’Connor and Arnold (1973) and Jassem (1984).
An intonational phrase is defined in terms of a sequence of
(optional) pre-nuclear, (constitutive) nuclear, and (optional) post-nuclear accents.
[prehead [ head [[ nucleus ] tail]]] (O'Connor & Arnold)
[anacrusis][[prenuclear intonation[nuclear intonation]]] (Jassem)
e.g.
To jest naj'lepsza 'pora "dnia. (This is the best time of day.)
To jest naj'lepsza po"radnia. (This is the best clinic.)
"Co mówiłeś? (What did you say?)
Intro - Polish intonation structure

A Polish phrase includes only one ictic accent, which is also
referred to as the nuclear accent.

Pre-ictic accents are referred to as pre-nuclear accents and post-ictic accents as post-nuclear accents.

The pre-nuclear and the nuclear accents are mainly determined
by specific pitch relations, whilst the post-nuclear accent (if
any) is essentially durational.
Intro - Polish intonation structure (cont.)

2 classes of pre-nuclear accents: H (high) and L (low)

9 classes of nuclear accents: HL, ML, xL, HM, LM, LH, MH, MM, and
LHL have been distinguished, where H is High, M Medium, L
Low and xL extra-Low relative to the particular speaker’s
average and mean-Low pitch; e.g., LH means “rising from Low
to High”.

e.g. ``Znowu ten wariat. (HL)
,, Znowu ten wariat? (LH)
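
The accent labels can be read mechanically as sequences of pitch levels. The sketch below decodes a label into pitch movements, so that LH comes out as "rising from Low to High". The function names are illustrative, and treating xL as a single two-character level symbol is an assumption about the notation.

```python
# Pitch levels used in the labels; xL is treated as one symbol.
LEVEL_NAMES = {"H": "High", "M": "Medium", "L": "Low", "xL": "extra-Low"}
LEVEL_RANK = {"xL": 0, "L": 1, "M": 2, "H": 3}


def parse_levels(label):
    """Split a label such as 'LHL' or 'xL' into pitch-level symbols."""
    levels = []
    i = 0
    while i < len(label):
        if label.startswith("xL", i):
            levels.append("xL")
            i += 2
        elif label[i] in LEVEL_NAMES:
            levels.append(label[i])
            i += 1
        else:
            raise ValueError(f"unknown pitch level in {label!r}")
    return levels


def describe(label):
    """Return the movements between successive levels of a label."""
    levels = parse_levels(label)
    moves = []
    for start, end in zip(levels, levels[1:]):
        if LEVEL_RANK[end] > LEVEL_RANK[start]:
            direction = "rising"
        elif LEVEL_RANK[end] < LEVEL_RANK[start]:
            direction = "falling"
        else:
            direction = "level"
        moves.append((direction, LEVEL_NAMES[start], LEVEL_NAMES[end]))
    return moves


print(describe("HL"))   # statement example above
print(describe("LH"))   # question example above
print(describe("LHL"))
```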
Platform

Festival is a speech synthesis application developed at the
Centre for Speech Technology Research (CSTR) at the
University of Edinburgh.
 Multilingual text to speech
 (English, Spanish, German, Welsh, Catalan, Polish)
 Allows addition of new languages
 Synthesis research and development environment
 Tools for development - support for extracting
information from speech databases, in a way suitable
for building models. (Models for accent prediction, F0
generation, duration, vowel reduction, homograph
disambiguation, phrase break assignment and unit
selection)
 Free software
Platform (cont.) - direct route from research
to use



Multi-lingual text to speech: for those who have little
interest in the internal workings of the system, and just
want speech output.
Synthesis for language system: for applications that
generate text from known forms. In this type of system
perhaps telephone numbers, addresses, etc. can be
explicitly marked, language type, even intonational forms
can be specified. This form of access requires more
knowledge about the synthesis internals but still not its low
level details.
Synthesis development environment: In this mode,
new synthesis modules, intonation, waveform synthesizers,
etc. can be developed and compared in a software
environment that provides the right basic tools so that
development may concentrate on the theory not the
implementation.
Intonation in Festival

Task :


Prediction of accents & realisation of F0 contour
Method :

Statistical and rule based


Tilt
ToBI
Intonation in Festival (cont.)

ToBI: Accents and boundary types are predicted by a CART
(classification and regression tree), and the F0 contour is then
generated by a statistically trained method.

Three F0 values are predicted for each syllable, at the start,
mid vowel and end. They are predicted using linear regression
based on a number of features including ToBI accent type,
phrase position, syllable position with contexts.

Although a three-point prediction system cannot capture all the
variability in natural intonation, by experiment it has been found
to be sufficient to produce reasonable F0 contours (Black
1998).
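
A minimal sketch of the three-point idea: one linear model per target point (start, mid vowel, end), each a weighted sum of syllable features. The feature names and all intercepts and weights below are invented for illustration; they are not trained values and are not Festival's actual models.

```python
# Hypothetical linear models, one per F0 target point (values in Hz).
MODELS = {
    "start": {"intercept": 110.0,
              "weights": {"accented": 10.0, "phrase_final": -15.0}},
    "mid":   {"intercept": 115.0,
              "weights": {"accented": 25.0, "phrase_final": -20.0}},
    "end":   {"intercept": 110.0,
              "weights": {"accented": 12.0, "phrase_final": -30.0}},
}


def predict_targets(features, models=MODELS):
    """Predict start, mid and end F0 for one syllable by linear regression.

    features: dict mapping feature name to a numeric value;
              features missing from the dict count as 0.
    """
    targets = {}
    for point, model in models.items():
        value = model["intercept"]
        for name, weight in model["weights"].items():
            value += weight * features.get(name, 0.0)
        targets[point] = value
    return targets


print(predict_targets({"accented": 1}))
print(predict_targets({"phrase_final": 1}))
```

In a trained system the weights would be estimated from a labelled speech database, and the predicted points would then be interpolated to form the contour.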
Intonation in Festival (cont.)

The Tilt Intonation Theory takes a bottom-up approach. Its
intention is to build a parameterization of the F0 contour that
is abstract enough to be predictable in a text-to-speech
system.

It has been shown that a good representation of a natural F0
contour can be made automatically from the raw signal
(though it is better if the accents and boundaries are hand
labelled). Dusterhoff 1997 further shows how that
parameterization can be predicted from text.
Future work : pilot study

Immediate Plans
ToBI description of Polish Intonation Phrase (Polish Intonation
database (Karpiński 2000))

Future Work


Synthesis assessment : visually impaired
Potential Applications

free Polish-English talking dictionary (EU project)