Introduction - University of Malta
Corpora and Statistical Methods
Albert Gatt
Course goals
Introduce the field of statistical natural language processing
(statistical NLP).
Describe the main directions, problems, and algorithms in
the field.
Discuss the theoretical foundations.
Involve students in hands-on experiments with real
problems.
CSA5011 -- Corpora and Statistical Methods
A general introduction
Language
We can define a language formally as:
a set of symbols (“alphabet”)
a set of rules to combine those symbols
This mathematical definition covers many classes of
languages, not just human language.
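As a minimal sketch of this definition, the snippet below spells out a toy formal language: an alphabet of two symbols and one recursive combination rule, S -> a S b | (empty). The language itself and the name in_language are illustrative assumptions, not part of the course material.

```python
# Alphabet: the set of symbols the language is built from.
ALPHABET = {"a", "b"}


def in_language(s: str) -> bool:
    """Return True if s can be derived with the rule S -> a S b | (empty)."""
    # Every symbol must come from the alphabet.
    if any(ch not in ALPHABET for ch in s):
        return False
    # Rule S -> (empty): the empty string belongs to the language.
    if s == "":
        return True
    # Rule S -> a S b: peel one 'a' off the front and one 'b' off the back.
    if s[0] == "a" and s[-1] == "b":
        return in_language(s[1:-1])
    return False


print(in_language("aabb"))  # derivable: S -> a S b -> a a S b b -> aabb
print(in_language("abab"))  # uses only legal symbols, but no derivation exists
```

The second call shows that using symbols from the alphabet is not enough: the rules decide which combinations belong to the language.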
Java: An artificial (formal) language
fixed set of basic symbols:
public, static, for, while, {, }…
fixed syntax for symbol combination
public static void main(String[] args) {
    for (int i = 0; i < args.length; i++) {
        …
    }
}
Natural language
Often much more complicated than an artificial language.
NB: Some theorists view NL as a special kind of formal language as
well (Montague…).
It does conform to the formal definition:
there are symbols
there are modes of combination
However, there are many levels at which these symbols and
rules are defined.
Levels of analysis in Natural language (I)
Acoustic properties (phonetics)
defines a basic set of sounds in terms of their features
studies the combination of these sounds (phonemes)
Higher-order acoustic features (phonology)
how phonemes combine into larger units, with
suprasegmental features such as intonation.
Levels of analysis in Natural language (II)
Word formation (morphology)
combines morphemes into words
Combination into longer units in a structure-dependent way
(syntax)
“legal” word combinations in a language
recursive phrasal combination
Interpretation (semantics):
of words (lexical semantics)
of longer units (sentential/propositional semantics)
Interpretation in context (pragmatics)
Natural Language Processing
Studies language at all its levels.
phonology, morphology, syntax, semantics…
focuses on process (Sparck-Jones 2007)
computational methods to understand and generate human
language
Often, the distinction between NLP and computational
linguistics is fuzzy
Kindred disciplines: Linguistics
Theoretical linguistics tends to be less process-oriented than
NLP
Q: how can we characterise knowledge that native speakers have of
their language?
this leads to declarative models of speaker’s knowledge of language
tends to say less about how speakers process language in real time
NB: This depends on the theoretical orientation!
NLP has strong ties to theoretical linguistics
it has also been an important contributor: process models can serve as
tests for declarative models
Kindred disciplines: Psycholinguistics
Like NLP, psycholinguistics tends to be strongly process-
oriented
studies the online processes of language understanding and
language production
NLP has benefited from such models.
NLP has also been a contributor:
it is increasingly common to test psycholinguistic theories by
building computational models.
Paradigms in NLP (I)
Knowledge-based:
system is based on a priori rules and constraints
e.g. a syntactic parser might have hand-crafted rules such as:
NP → Det AdjP N
AdjP → A+
Problem: it is extremely difficult to hand-code all the relevant
knowledge.
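To make the knowledge-based approach concrete, here is a minimal sketch that encodes the two hand-crafted rules as a regular expression over part-of-speech tags. The tag names follow the rules on the slide; the function name is_np and the idea of matching tag sequences rather than words are assumptions made for illustration.

```python
import re

# The two hand-written rules, compiled into one pattern over a tag sequence:
#   NP   -> Det AdjP N
#   AdjP -> A+        (one or more adjectives)
NP_PATTERN = re.compile(r"Det( A)+ N")


def is_np(tags):
    """Return True if the whole tag sequence is licensed by the two rules."""
    return NP_PATTERN.fullmatch(" ".join(tags)) is not None


print(is_np(["Det", "A", "N"]))       # e.g. "the tall woman"
print(is_np(["Det", "A", "A", "N"]))  # e.g. "the tall strange boy"
print(is_np(["Det", "N"]))            # e.g. "the boy"
```

The last call is rejected because these two rules demand at least one adjective, so a phrase as ordinary as "the boy" needs yet another hand-written rule. This is the coverage problem in miniature.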
Paradigms in NLP (II)
Statistical:
starting point is a large repository of text or speech (a corpus)
corpus is often annotated with relevant information, e.g.:
parsed corpora (syntax)
tagged corpora (part-of-speech)
word-sense annotated corpora (semantics)
tries to learn a model from the data
tries to generalise this model to new data
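The learn-then-generalise cycle can be sketched with a very simple model: assign each word the tag it most often carries in an annotated corpus. The toy corpus below is invented for illustration, and the most-frequent-tag approach is a common baseline chosen here for brevity, not the method of any particular system mentioned in these slides.

```python
from collections import Counter, defaultdict

# Toy part-of-speech tagged corpus (invented), using Brown-style tags.
TAGGED = [
    ("the", "at"), ("tall", "jj"), ("woman", "nn"),
    ("the", "at"), ("strange", "jj"), ("boy", "nn"),
    ("the", "at"), ("woman", "nn"), ("thought", "vbd"),
    ("a", "at"), ("strange", "jj"), ("thought", "nn"),
    ("she", "pps"), ("thought", "vbd"),
]


def train(tagged):
    """Learn a model: map each word to its most frequent tag in the corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged:
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(model, words, default="nn"):
    """Apply the model to new text; unseen words get a default tag."""
    return [(w, model.get(w, default)) for w in words]


model = train(TAGGED)
print(tag(model, ["the", "strange", "woman", "thought"]))
print(tag(model, ["the", "girl"]))  # "girl" was never seen in training
```

Note that "thought" is ambiguous in the corpus (verb twice, noun once); the model resolves this by frequency alone, and the unseen word "girl" falls back to the default tag.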
The paradigms: a bird’s-eye view
We find similar “divisions” within mainstream linguistics:
generative linguistics tends to formulate generalisations about “internalised
speaker knowledge of language” (competence, I-Language…)
corpus linguistics tends to formulate generalisations based on patterns
observed in corpora
The two paradigms are viewed as having roots in different traditions:
rationalist tradition (Plato, Descartes…)
empiricist tradition (Locke…)
The idea of “linguistic knowledge”
Traditional linguistic theory (since the 1950s) introduced a
dichotomy:
competence: a person’s knowledge of language, formalised as a
set of rules
performance: actual production and perception of language in
concrete situations
Much of linguistic theory has focused on characterising
competence.
The idea of “linguistic knowledge”
The use of data (corpora) involves an increased focus on
“performance”, i.e. on the regularities found in actual language use.
The idea is that exposure to such regularities is a crucial part
of human language learning.
An initial example
Suppose you’re a linguist interested in the syntax of verb
phrases.
Some verbs are transitive, some intransitive
I ate the meat pie (transitive)
I swam (intransitive)
What about:
quiver
quake
Most traditional grammars characterise these as intransitive
Corpus data suggests they have transitive uses:
the insect quivered its wings
it quaked his bowels (with fear)
Example II: lexical semantics
Quasi-synonymous lexical items exhibit subtle differences in
context.
strong
powerful
A fine-grained theory of lexical semantics would benefit
from data about these contextual cues to meaning.
Example II continued
Some differences between strong and powerful (source: British
National Corpus):
strong: wind, feeling, accent, flavour
powerful: tool, weapon, punch, engine
The differences are subtle, but examining their collocates
helps.
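Examining collocates amounts to counting which words occur next to the word of interest. The sketch below counts the word immediately to the right of a node word; the sentences are invented for illustration (a real study would query a corpus such as the BNC), and right_collocates is a hypothetical helper name.

```python
from collections import Counter

# Invented toy sentences standing in for corpus data.
SENTENCES = [
    "a strong wind blew all night",
    "she has a strong accent",
    "a powerful engine needs a powerful tool",
    "the strong wind damaged the powerful engine",
]


def right_collocates(node, sentences):
    """Count the words that immediately follow `node` in the sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # Pair each token with its right-hand neighbour.
        for current, following in zip(tokens, tokens[1:]):
            if current == node:
                counts[following] += 1
    return counts


print(right_collocates("strong", SENTENCES))
print(right_collocates("powerful", SENTENCES))
```

Raw adjacency counts are the simplest possible measure; they are enough to show that the two adjectives keep different company even in a handful of sentences.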
Statistical approaches to language
Do not rely on categorical judgements of grammaticality
etc. Examples:
1. Degrees of grammaticality: people often do not have
categorical judgements of acceptability.
2. Category blending: We live nearer town than you thought.
Is near an adjective or a preposition?
3. Syntactic ambiguity: She killed the man with the gun.
What is the most likely parse?
Statistical NLP vs. Corpus Linguistics (I)
Corpus linguistics became popular with the arrival of large, machine-
readable corpora.
generally viewed as a methodology
tests hypotheses empirically on data
aim is to refine a theory of language, or discover novel generalisations
Statistical NLP shares these aims; however:
it is often corpus-driven rather than corpus-based
the “theory” or “model” learned is often not a priori given
Statistical NLP vs. Corpus Linguistics (II)
The term “corpus” may mean different things to different people:
To a corpus linguist, a corpus is a balanced, representative sample of a
particular language variety (e.g. The British National Corpus)
Representativeness allows generalisations to be made more rigorously.
In statistical NLP, there has traditionally been less emphasis on these
properties.
emphasis on algorithms for learning language models
we frequently find the tacit assumption that the algorithm can be applied to
any set of data, given the right annotations
Some applications of Statistical NLP
Language Technology
[Diagram relating speech, text, structure and meaning through the following components: speech recognition; natural language analysis and understanding; natural language generation; speech synthesis; machine translation and summarisation.]
A (very) rough division of NLP tasks
understanding: typically take as input free text or speech, and
conduct some structural or semantic analysis
POS Tagging, parsing, semantic role labelling, sentiment/opinion mining,
named entity recognition…
generation: typically take textual or non-linguistic input,
outputting some text/speech
automatic weather reporting, summarisation, machine translation
How effective are statistical NLP tools at carrying out these and other tasks?
Are statistical techniques actually useful to learn things about language?
Example 1: Semantics
Example of an automatically acquired thesaurus of similar
words (www.sketchengine.co.uk).
Data: 1.5 bn words obtained from the web.
Words most similar to “goat”, with similarity scores:
sheep 0.359
cow 0.345
pig 0.331
rabbit 0.305
cattle 0.304
deer 0.289
lamb 0.286
donkey 0.276
poultry 0.262
boar 0.261
camel 0.259
elephant 0.258
calf 0.258
pony 0.255
How does this work?
Example 1: Semantics (cont’d)
Corpus-based lexical semantic acquisition typically uses
vector-space models.
represent a word as a vector containing information about the
contexts in which it is likely to occur
some models also include grammatical relations (subject-of,
object-of etc)
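A minimal sketch of the vector-space idea: each word is represented by counts of the words that occur around it, and two words are compared by the cosine of the angle between their vectors. The sentences, the window size and the function names are assumptions made for illustration; this is not a description of how the Sketch Engine thesaurus is computed.

```python
import math
from collections import Counter

# Invented toy corpus.
SENTENCES = [
    "the farmer milked the goat",
    "the farmer milked the cow",
    "the farmer sheared the sheep",
    "the mechanic repaired the engine",
]


def context_vector(target, sentences, window=4):
    """Count the words occurring within `window` tokens of each `target`."""
    vec = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                # Add every neighbour in the window except the target itself.
                vec.update(t for j, t in enumerate(tokens[lo:hi], start=lo)
                           if j != i)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


goat = context_vector("goat", SENTENCES)
for other in ["cow", "sheep", "engine"]:
    print(other, round(cosine(goat, context_vector(other, SENTENCES)), 3))
```

In this toy data "goat" comes out closer to "cow" and "sheep" than to "engine" because they share more context words. Raw counts let frequent words such as "the" dominate, which is one reason practical models usually weight or filter the context counts.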
CSA5011 -- Corpora and Statistical Methods
Example 2: POS Tagging
<tok pos="at">The</tok>
<tok pos="jj">tall</tok>
<tok pos="nn">woman</tok>
<tok pos="cc">and</tok>
<tok pos="at">the</tok>
<tok pos="jj">strange</tok>
<tok pos="nn">boy</tok>
<tok pos="vbd">thought</tok>
<tok pos="jj">statistical</tok>
<tok pos="nn">NLP</tok>
<tok pos="bedz">was</tok>
<tok pos="jj">pointless</tok>
<tok pos=".">.</tok>
“The tall woman and the strange boy thought
statistical NLP was pointless.”
Output from a statistical POS tagger, trained on the Brown
Corpus (LingPipe demo library)
Uses of POS Tagging:
pre-parsing
corpus analysis for linguistics
…
Example 3: parsing
Parsed using the Stanford Parser.
Based on a probabilistic context-free grammar of English
trained on a treebank
CFG rules with probabilities
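In a probabilistic context-free grammar, the probability of a parse tree is the product of the probabilities of the rules used to build it, and the parser prefers the most probable tree. The sketch below scores two competing analyses of a verb phrase with the shape of "killed the man with the gun". The rule probabilities are invented for illustration, lexical rules are left out (they are identical in both trees), and nothing here reflects the Stanford Parser's actual grammar or implementation.

```python
# Invented rule probabilities; for each left-hand side they sum to 1.
RULES = {
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V", "NP", "PP")): 0.3,
    ("NP", ("Det", "N")): 0.8,
    ("NP", ("NP", "PP")): 0.2,
    ("PP", ("P", "NP")): 1.0,
}


def label(node):
    """A node is either a tag string or a tuple (label, child, child, ...)."""
    return node if isinstance(node, str) else node[0]


def tree_prob(node):
    """Multiply the probabilities of all rules used in the tree."""
    if isinstance(node, str):  # bare tag: lexical rules are ignored here
        return 1.0
    children = node[1:]
    p = RULES[(node[0], tuple(label(c) for c in children))]
    for child in children:
        p *= tree_prob(child)
    return p


SIMPLE_NP = ("NP", "Det", "N")
PP = ("PP", "P", SIMPLE_NP)
# Reading 1: the PP attaches to the verb (the gun is the instrument).
VP_ATTACH = ("VP", "V", SIMPLE_NP, PP)
# Reading 2: the PP attaches to the noun (the man has the gun).
NP_ATTACH = ("VP", "V", ("NP", SIMPLE_NP, PP))

print(tree_prob(VP_ATTACH), tree_prob(NP_ATTACH))
```

With these made-up numbers the verb-attachment reading scores higher; with probabilities estimated from a treebank, the preference would depend on the training data.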
Example 4: Machine translation
Input:
(Maltese translation of example sentence)
Output:
The wife and son long strange nonetheless feels that the
statistical NLP is without purpose.
Translated using Maltese-English Google Translate.
Obvious shortcomings, but robust, i.e. some output
returned, even if garbled.
Based on automatic alignment between parallel text corpora.
Example 5: Generation/Summarisation
[…] No laboratories offering
molecular genetic testing for
prenatal diagnosis of 3-M
syndrome are listed in the
GeneTests Laboratory
Directory. However, prenatal
testing may be available for
families in which the disease-causing mutations have been
identified […]
Automatically generated article
about 3-M syndrome (Sauper and
Barzilay 2009)
Now on Wikipedia!!!
(http://en.wikipedia.org/wiki/3M_syndrome)
Summarised from multiple
documents drawn from the web.
Uses automatically acquired
templates from human-authored
texts to ensure coherence.
Features of Statistical NLP systems
Robustness: typically, don’t break down with new or
unknown input (although they may output garbage)
Portability: statistical learning algorithms can in principle be
ported to new domains (given data)
Sensitivity to training data: if (say) a POS tagger is trained on
medical text, its performance will decline on a new genre
(e.g. news).
Some important concepts
All the systems surveyed rely on regularities in large
repositories of training data, expressed as probabilities.
In practice, we distinguish between:
training/development data: for learning a model and fine-tuning
test data: for evaluation on unseen but compatible data
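The distinction can be sketched as a simple partition of the available data. The 80/10/10 proportions, the fixed random seed and the function name split are assumptions chosen for illustration; actual splits depend on the corpus and the task.

```python
import random


def split(items, train=0.8, dev=0.1, seed=0):
    """Shuffle the items and cut them into training, development and test sets."""
    items = list(items)
    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_dev = int(len(items) * dev)
    return (items[:n_train],
            items[n_train:n_train + n_dev],
            items[n_train + n_dev:])


sentences = [f"sentence {i}" for i in range(100)]
train_set, dev_set, test_set = split(sentences)
print(len(train_set), len(dev_set), len(test_set))
```

The three parts never overlap, so the test set stays unseen during learning and fine-tuning, which is what makes the final evaluation meaningful.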
References
Sparck-Jones, K. (2007). Computational Linguistics: What about the linguistics? Computational Linguistics 33 (3): 437–441.
McEnery, T., Xiao, R. & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. London: Routledge. (Contains an interesting discussion of corpus-based vs. corpus-driven approaches.)