Introduction - University of Malta

Corpora and Statistical Methods
Albert Gatt
Course goals
 Introduce the field of statistical natural language processing
(statistical NLP).
 Describe the main directions, problems, and algorithms in
the field.
 Discuss the theoretical foundations.
 Involve students in hands-on experiments with real
problems.
CSA5011 -- Corpora and Statistical Methods
A general introduction
Language
 We can define a language formally as:
 a set of symbols (“alphabet”)
 a set of rules to combine those symbols
 This mathematical definition covers many classes of
languages, not just human language.
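The formal definition above can be made concrete with a very small example. The sketch below assumes a hypothetical toy language (not from the course material) whose alphabet is {a, b} and whose single rule is "n copies of a followed by n copies of b"; the name in_language is illustrative only.

```python
# Hypothetical toy language: alphabet {a, b}, one combination rule: a^n b^n (n >= 1).
ALPHABET = {"a", "b"}


def in_language(string):
    """Return True if string is n copies of 'a' followed by n copies of 'b'."""
    # Reject the empty string and anything using symbols outside the alphabet.
    if not string or any(symbol not in ALPHABET for symbol in string):
        return False
    half, remainder = divmod(len(string), 2)
    # The rule: even length, first half all 'a', second half all 'b'.
    return remainder == 0 and string == "a" * half + "b" * half


print(in_language("aabb"))  # True
print(in_language("abab"))  # False
```

The same two ingredients, symbols and rules for combining them, are all the definition requires, which is why it covers programming languages as well as human ones.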
Java: An artificial (formal) language
 fixed set of basic symbols:
 public, static, for, while, {, }…
 fixed syntax for symbol combination
public static void main(String[] args) {
    for (int i = 0; i < args.length; i++) {
        …
    }
}
Natural language
 Often much more complicated than an artificial language.
 NB: Some theorists view NL as a special kind of formal language as
well (Montague…).
 It does conform to the formal definition:
 there are symbols
 there are modes of combination
 However, there are many levels at which these symbols and
rules are defined.
Levels of analysis in Natural language (I)
 Acoustic properties (phonetics)
 defines a basic set of sounds in terms of their features
 studies the combination of these phonemes
 Higher-order acoustic features (phonology)
 how phonemes combine into larger units, with suprasegmental features such as intonation.
Levels of analysis in Natural language (II)
 Word formation (morphology)
 combines morphemes into words
 Combination into longer units in a structure-dependent way
(syntax)
 “legal” word combinations in a language
 recursive phrasal combination
 Interpretation (semantics):
 of words (lexical semantics)
 of longer units (sentential/propositional semantics)
 Interpretation in context (pragmatics)
Natural Language Processing
 Studies language at all its levels.
 phonology, morphology, syntax, semantics…
 focuses on process (Sparck-Jones 2007)
 computational methods to understand and generate human language
 Often, the distinction between NLP and computational
linguistics is fuzzy
Kindred disciplines: Linguistics
 Theoretical linguistics tends to be less process-oriented than
NLP
 Q: how can we characterise knowledge that native speakers have of
their language?
 this leads to declarative models of speakers’ knowledge of language
 tends to say less about how speakers process language in real time
 NB: This depends on the theoretical orientation!
 NLP has strong ties to theoretical linguistics
 it has also been an important contributor: process models can serve as
tests for declarative models
Kindred disciplines: Psycholinguistics
 Like NLP, psycholinguistics tends to be strongly process-oriented
 studies the online processes of language understanding and
language production
 NLP has benefited from such models.
 NLP has also been a contributor:
 it is increasingly common to test psycholinguistic theories by
building computational models.
Paradigms in NLP (I)
 Knowledge-based:
 system is based on a priori rules and constraints
 e.g. a syntactic parser might have hand-crafted rules such as:
NP → Det AdjP N
AdjP → A+
 Problem: it is extremely difficult to hand-code all the relevant
knowledge.
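To see what hand-crafted rules look like in practice, and why coverage is a problem, here is a minimal sketch that encodes the two rules from the slide as a regular expression over category labels. The function name is_np and the encoding as a space-separated string are assumptions made for illustration; a real knowledge-based parser would be far richer.

```python
import re

# Hand-written rules from the slide, applied to a sequence of category labels:
#   NP   -> Det AdjP N
#   AdjP -> A+
# "Det", then one or more " A", then " N".
NP_PATTERN = re.compile(r"Det( A)+ N")


def is_np(categories):
    """Return True if the category sequence is licensed by NP -> Det AdjP N."""
    return NP_PATTERN.fullmatch(" ".join(categories)) is not None


print(is_np(["Det", "A", "N"]))       # True:  "the tall woman"
print(is_np(["Det", "A", "A", "N"]))  # True:  "the tall strange boy"
print(is_np(["Det", "N"]))            # False: "the woman" is not covered
```

The last call shows the difficulty: as written, the rules miss a perfectly ordinary noun phrase, and every such gap has to be found and patched by hand.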
Paradigms in NLP (II)
 Statistical:
 starting point is a large repository of text or speech (a corpus)
 corpus is often annotated with relevant information, e.g.:
 parsed corpora (syntax)
 tagged corpora (part-of-speech)
 word-sense annotated corpora (semantics)
 tries to learn a model from the data
 tries to generalise this model to new data
The paradigms: a bird’s-eye view
 We find similar “divisions” within mainstream linguistics:
 generative linguistics tends to formulate generalisations about “internalised
speaker knowledge of language” (competence, I-Language…)
 corpus linguistics tends to formulate generalisations based on patterns
observed in corpora
 The two paradigms are viewed as having roots in different traditions:
 rationalist tradition (Plato, Descartes…)
 empiricist tradition (Locke…)
The idea of “linguistic knowledge”
 Traditional linguistic theory (since the 1950s) introduced a
dichotomy:
 competence: a person’s knowledge of language, formalised as a
set of rules
 performance: actual production and perception of language in
concrete situations
 Much of linguistic theory has focused on characterising
competence.
The idea of “linguistic knowledge”
 The use of data (corpora) involves an increased focus on
“performance”.
 The idea is that exposure to the regularities found in such data is a crucial part of human language learning.
An initial example
 Suppose you’re a linguist interested in the syntax of verb
phrases.
 Some verbs are transitive, some intransitive
 I ate the meat pie (transitive)
 I swam (intransitive)
 What about:
 quiver
 quake
Most traditional grammars characterise these as intransitive.
 Corpus data suggests they have transitive uses:
 the insect quivered its wings
 it quaked his bowels (with fear)
Example II: lexical semantics
 Quasi-synonymous lexical items exhibit subtle differences in
context.
 strong
 powerful
 A fine-grained theory of lexical semantics would benefit
from data about these contextual cues to meaning.
Example II continued
 Some differences between strong and powerful (source: British
National Corpus):
 strong: wind, feeling, accent, flavour
 powerful: tool, weapon, punch, engine
 The differences are subtle, but examining their collocates
helps.
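Examining collocates amounts to counting which words occur next to the target word. The sketch below does this for the word immediately to the right of the target. The sentences are invented for illustration and are not taken from the British National Corpus; right_collocates is a hypothetical helper name.

```python
from collections import Counter

# Invented toy sentences for illustration only; these are NOT BNC data.
TOY_CORPUS = [
    "a strong wind blew all night",
    "she has a strong accent",
    "a powerful engine drives the car",
    "he built a powerful tool",
    "the strong wind damaged the roof",
]


def right_collocates(target, sentences):
    """Count the words that immediately follow `target` in the sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Walk over adjacent word pairs (bigrams).
        for current, following in zip(tokens, tokens[1:]):
            if current == target:
                counts[following] += 1
    return counts


print(right_collocates("strong", TOY_CORPUS).most_common())
print(right_collocates("powerful", TOY_CORPUS).most_common())
```

On a real corpus one would normally go beyond raw counts and use an association measure, since frequent words co-occur with everything.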
Statistical approaches to language

 Do not rely on categorical judgements of grammaticality etc. Examples:
1. Degrees of grammaticality: people often do not have categorical judgements of acceptability.
2. Category blending: We live nearer town than you thought.
 Is near an adjective or a preposition?
3. Syntactic ambiguity: She killed the man with the gun.
 What is the most likely parse?
Statistical NLP vs. Corpus Linguistics (I)
 Corpus linguistics became popular with the arrival of large, machine-readable corpora.
 generally viewed as a methodology
 tests hypotheses empirically on data
 aim is to refine a theory of language, or discover novel generalisations
 Statistical NLP shares these aims; however:
 it is often corpus-driven rather than corpus-based
 the “theory” or “model” learned is often not a priori given
Statistical NLP vs. Corpus Linguistics (II)
 The term “corpus” may mean different things to different people:
 To a corpus linguist, a corpus is a balanced, representative sample of a
particular language variety (e.g. The British National Corpus)
 Representativeness allows generalisations to be made more rigorously.
 In statistical NLP, there has traditionally been less emphasis on these
properties.
 emphasis on algorithms for learning language models
 we frequently find the tacit assumption that the algorithm can be applied to
any set of data, given the right annotations
Some applications of Statistical NLP
Language Technology
[Figure: diagram relating Speech, Text, Structure and Meaning through Speech Recognition, Natural Language analysis and understanding, Natural Language Generation, Speech Synthesis, and Machine translation/summarisation]
A (very) rough division of NLP tasks
 understanding: typically take as input free text or speech, and
conduct some structural or semantic analysis
 POS Tagging, parsing, semantic role labelling, sentiment/opinion mining,
named entity recognition…
 generation: typically take textual or non-linguistic input,
outputting some text/speech
 automatic weather reporting, summarisation, machine translation
 How effective are statistical NLP tools at carrying out these and other tasks?
 Are statistical techniques actually useful for learning things about language?
Example 1: Semantics
 Example of an automatically acquired thesaurus of similar words for “goat” (www.sketchengine.co.uk):
sheep      0.359
cow        0.345
pig        0.331
rabbit     0.305
cattle     0.304
deer       0.289
lamb       0.286
donkey     0.276
poultry    0.262
boar       0.261
camel      0.259
elephant   0.258
calf       0.258
pony       0.255
 Data: 1.5 bn words obtained from the web.
 How does this work?
Example 1: Semantics (cont/d)
 Corpus-based lexical semantic acquisition typically uses
vector-space models.
 represent a word as a vector containing information about the contexts in which it is likely to occur
 some models also include grammatical relations (subject-of,
object-of etc)
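A minimal sketch of the vector-space idea follows, assuming the simplest possible notion of context (the other words in the same sentence) and cosine similarity as the comparison measure. The toy sentences are invented, and the helper names context_vector and cosine are illustrative; this is not how any particular thesaurus tool is implemented.

```python
import math
from collections import Counter

# Invented toy sentences; a real system would use a very large corpus.
TOY_CORPUS = [
    "the goat eats grass",
    "the sheep eats grass",
    "the goat gives milk",
    "the sheep gives wool",
    "the car needs fuel",
]


def context_vector(target, sentences):
    """Count the words that co-occur with `target` in the same sentence."""
    vector = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if target in tokens:
            vector.update(token for token in tokens if token != target)
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


goat, sheep, car = (context_vector(w, TOY_CORPUS) for w in ("goat", "sheep", "car"))
print(cosine(goat, sheep))  # 0.875
print(cosine(goat, sheep) > cosine(goat, car))  # True
```

Words that appear in similar contexts end up with similar vectors, so "sheep" scores closer to "goat" than "car" does. Adding grammatical relations would mean counting pairs such as (subject-of, eats) instead of bare words.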
Example 2: POS Tagging
<tok pos="at">The</tok>
<tok pos="jj">tall</tok>
<tok pos="nn">woman</tok>
<tok pos="cc">and</tok>
<tok pos="at">the</tok>
<tok pos="jj">strange</tok>
<tok pos="nn">boy</tok>
<tok pos="vbd">thought</tok>
<tok pos="jj">statistical</tok>
<tok pos="nn">NLP</tok>
<tok pos="bedz">was</tok>
<tok pos="jj">pointless</tok>
<tok pos=".">.</tok>
“The tall woman and the strange boy thought statistical NLP was pointless.”
 Output from a statistical POS Tagger, trained on the Brown Corpus (LingPipe demo library)
 Uses of POS Tagging:
 pre-parsing
 corpus analysis for linguistics
 …
Example 3: parsing
 Parsed using the Stanford Parser.
 Based on a probabilistic context-free grammar of English
 trained on a treebank
 CFG rules with probabilities
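In a probabilistic context-free grammar, the probability of a parse tree is the product of the probabilities of the rules used to build it. The sketch below applies this to the two readings of "She killed the man with the gun". The rule probabilities are invented for illustration (a real grammar estimates them from a treebank), word-level probabilities are ignored, and this is not a description of the Stanford Parser's internals.

```python
# Invented rule probabilities; a real PCFG estimates them from a treebank.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V", "NP", "PP")): 0.3,
    ("NP", ("Pro",)): 0.3,
    ("NP", ("Det", "N")): 0.5,
    ("NP", ("Det", "N", "PP")): 0.2,
    ("PP", ("P", "NP")): 1.0,
}
PRETERMINALS = {"Pro", "V", "Det", "N", "P"}


def tree_prob(tree):
    """Probability of a tree given as nested tuples: (label, child, child, ...)."""
    label, *children = tree
    if label in PRETERMINALS:
        return 1.0  # word-level (lexical) probabilities are ignored in this sketch
    rhs = tuple(child[0] for child in children)
    prob = RULE_PROB[(label, rhs)]
    for child in children:
        prob *= tree_prob(child)
    return prob


she = ("NP", ("Pro", "She"))
the_gun = ("NP", ("Det", "the"), ("N", "gun"))
with_gun = ("PP", ("P", "with"), the_gun)

# Reading 1: the PP attaches to the verb phrase (the gun is the instrument).
vp_attach = ("S", she, ("VP", ("V", "killed"),
                        ("NP", ("Det", "the"), ("N", "man")), with_gun))
# Reading 2: the PP attaches to the noun phrase (the man has the gun).
np_attach = ("S", she, ("VP", ("V", "killed"),
                        ("NP", ("Det", "the"), ("N", "man"), with_gun)))

print(tree_prob(vp_attach), tree_prob(np_attach))
```

With these made-up numbers the verb-attachment reading comes out slightly more probable; with probabilities learned from a treebank, the ranking would reflect what is actually frequent in the training data.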
Example 4: Machine translation
 Input: (Maltese translation of example sentence)
 Translated using Maltese-English Google Translate.
 Output: The wife and son long strange nonetheless feels that the statistical NLP is without purpose.
 Obvious shortcomings, but robust, i.e. some output returned, even if garbled.
 Based on automatic alignment between parallel text corpora.
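One classic way to learn word alignments from parallel text is IBM Model 1, trained with expectation-maximisation. The slide does not say which alignment method is used, so the sketch below is a swapped-in textbook technique rather than a description of Google Translate. The three-sentence German-English corpus is invented, and the NULL word of the full model is left out for brevity.

```python
from collections import defaultdict

# Invented toy parallel corpus (source: German, target: English).
PARALLEL = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]


def train_model1(pairs, iterations=10):
    """Estimate t(target word | source word) with EM (IBM Model 1, no NULL word)."""
    target_vocab = {word for _, target in pairs for word in target}
    # Start from a uniform translation table.
    t = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each target word over the source words of its sentence.
        for source, target in pairs:
            for e in target:
                norm = sum(t[(e, f)] for f in source)
                for f in source:
                    fraction = t[(e, f)] / norm
                    count[(e, f)] += fraction
                    total[f] += fraction
        # M-step: re-estimate the translation table from the expected counts.
        for (e, f), value in count.items():
            t[(e, f)] = value / total[f]
    return dict(t)


table = train_model1(PARALLEL)
for pair in [("house", "haus"), ("the", "haus"), ("book", "buch"), ("the", "das")]:
    print(pair, round(table[pair], 3))
```

Nothing tells the model that "haus" means "house"; it works this out because "the" is better explained by "das", which also appears in another sentence pair.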
Example 5: Generation/Summarisation
[…] No laboratories offering molecular genetic testing for prenatal diagnosis of 3-M syndrome are listed in the GeneTests Laboratory Directory. However, prenatal testing may be available for families in which the disease-causing mutations have been identified […]
 Automatically generated article about 3-M syndrome (Sauper and Barzilay 2009)
 Now on Wikipedia!!! (http://en.wikipedia.org/wiki/3M_syndrome)
 Summarised from multiple documents drawn from the web.
 Uses automatically acquired templates from human-authored texts to ensure coherence.
Features of Statistical NLP systems
 Robustness: typically, don’t break down with new or
unknown input (although they may output garbage)
 Portability: statistical learning algorithms can in principle be
ported to new domains (given data)
 Sensitivity to training data: if (say) a POS tagger is trained on
medical text, its performance will decline on a new genre
(e.g. news).
Some important concepts
 All the systems surveyed rely on regularities in large
repositories of training data, expressed as probabilities.
 In practice, we distinguish between:
 training/development data: for learning a model and fine-tuning
 test data: for evaluation on unseen but compatible data
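The distinction between training, development and test data can be sketched as a simple random split. The 80/10/10 proportions, the fixed seed and the function name split_corpus are assumptions chosen for illustration; the course material does not prescribe them.

```python
import random


def split_corpus(items, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle items and split them into training, development and test portions."""
    shuffled = list(items)
    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_dev = int(len(shuffled) * dev_fraction)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


sentences = [f"sentence {i}" for i in range(100)]
train, dev, test = split_corpus(sentences)
print(len(train), len(dev), len(test))  # 80 10 10
```

The key point is that the three portions never overlap: the model is learned on the training data, tuned on the development data, and evaluated only once on test data it has not seen.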
References
 Sparck-Jones, K. (2007). Computational Linguistics: What about the linguistics? Computational Linguistics 33 (3): 437–441.
 McEnery, T., Xiao, R. & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. London: Routledge. (Contains an interesting discussion of corpus-based vs. corpus-driven approaches.)