Towards an NLP `module’ The role of an utterance-level interface Modular architecture Language independent application Meaning representation Language module Utterance-level interface text or speech.
Towards an NLP `module’
The role of an utterance-level interface
Modular architecture
Language independent application
Meaning representation
Language module
Utterance-level
interface
text or speech
Desiderata for NLP module
1. Application- and domain-independent
2. Bidirectional processing
3. No grammar-specific information should be needed in the application
4. Architecture should support multiple languages
5. Practical
6. Coverage: all well-formed input should be accepted, robust to speaker errors
Why?
developers could build `intelligent’
responsive applications without being
NLP experts themselves
less time-consuming and expensive
than doing the NLP for each application
domain
multilingual applications
support further research
LinGO/DELPH-IN
Software and `lingware’ for application- and
domain-independent NLP
Linguistically-motivated (HPSG), deep
processing
Multiple languages
Analysis and generation
Informal collaboration since c.1995
NLP research and development, theoretical
research and teaching
www.delph-in.net
What’s different?
Open Source, integrated systems
Data-driven techniques combined with
linguistic expertise
Testing
empirical basis, evaluation
linguistic motivation
No toy systems!
large scale grammars
maintainable software
development and runtime tools
Progress
1. Application- and domain-independent: reasonable (lexicons, text structure)
2. Bidirectional processing: yes
3. No grammar-specifics in applications: yes
4. Multiple languages: English, Japanese, German, Norwegian, Korean, Greek, Italian, French: plus grammar sharing via the Matrix
5. Practical: efficiency OK for some applications and improving, interfaces?
6. Coverage and robustness: 80%+ coverage on English, good parse selection, not robust
Integrating deep and shallow
processing
Shallow processing: speed and robustness, but lacks
precision
Pairwise integration of systems is time-consuming,
brittle
Common semantic representation language: shallow
processing underspecified
Demonstrated effectiveness on IE (Deep Thought)
Requires that systems share tokenization
(undesirable and impractical) or that output can be
precisely aligned with original document
Markup complicates this
Utterance-level interface
text or speech
complex cases
text structure (e.g., headings, lists)
non-text (e.g., formulae, dates, graphics)
segmentation (esp., Japanese, Chinese)
speech lattices
integration of multiple analyzers
Utterance interface
Standard interface language
allow for ambiguity at all levels
XML
collaborating with ISO working group (MAF)
processors deliver standoff annotations to original
text
Plan to develop finite-state preprocessors for
some text types, allow for others
Plan to experiment with speech lattices
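To make the idea of standoff annotation concrete, here is a minimal sketch in Python. It leaves the original text untouched and emits XML that points back into it by character offset, with several competing analyses allowed per token. The element and attribute names (`annotations`, `token`, `analysis`, `from`, `to`, `form`) are hypothetical and chosen only for illustration; they are not the MAF schema or any DELPH-IN format.

```python
import xml.etree.ElementTree as ET


def standoff_xml(text, spans):
    """Serialize (start, end, analyses) triples as standoff XML.

    Each token element stores only character offsets into `text`;
    ambiguity is kept by listing every candidate analysis under it.
    Element and attribute names are illustrative assumptions.
    """
    root = ET.Element("annotations", {"length": str(len(text))})
    for i, (start, end, analyses) in enumerate(spans):
        token = ET.SubElement(
            root, "token", {"id": f"t{i}", "from": str(start), "to": str(end)}
        )
        for form in analyses:
            ET.SubElement(token, "analysis", {"form": form})
    return ET.tostring(root, encoding="unicode")
```

Because the annotations carry offsets and not copies of the text, any consumer can recover the exact source substring with `text[start:end]`, whatever normalization the analyzer applied.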
Assumptions about
tokenization
tokenization: input data is transformed to form
suitable for morph processing or lexical lookup:
What’s in those 234 dogs’ bowls, Fred?
what ’s in those <num 234> dogs ’s bowls , Fred ?
tokenization is therefore pre-lexical and cannot
depend on lexical lookup
normalization (case, numbers, dates, formulae) as
well as segmentation
used to be common to strip punctuation, but large-coverage systems utilize it
in generation: go from tokens to final output
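The assumptions above can be sketched as a small pre-lexical tokenizer in Python. It relies only on regular expressions (no lexicon), wraps numbers as `<num …>`, splits off the clitic, and keeps punctuation as tokens. The function name, the naive lowercasing of the first token, and the rule that treats an apostrophe after a word-final "s" as a possessive are all assumptions made for this sketch; the last one is exactly the kind of guess that goes wrong when the apostrophe is really a closing quote.

```python
import re

# Alternatives are tried in order: numbers, clitic 's, plural possessive,
# alphabetic words, then any other single non-space character.
TOKEN_RE = re.compile(
    r"""
    (?P<num>\d+)
  | (?P<clitic>['’]s\b)
  | (?P<poss>(?<=s)['’](?=\s|$))
  | (?P<word>[A-Za-z]+)
  | (?P<punct>[^\sA-Za-z\d])
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return (form, start, end) triples; offsets index the original text."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind, surface = match.lastgroup, match.group()
        if kind == "num":
            form = f"<num {surface}>"   # normalize numbers
        elif kind in ("clitic", "poss"):
            form = "’s"                 # normalize both to the clitic form
        else:
            form = surface
        if not tokens:
            form = form.lower()         # naive sentence-initial case folding
        tokens.append((form, match.start(), match.end()))
    return tokens
```

On the slide's example sentence this sketch produces the same token string as shown above. Each token also keeps its character span, which is what a generator would need to go back from tokens to the final output.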
Tokenization ambiguity
Unusual to find cases where humans have
any difficulty: problem arises because we
need a pipelined system
Some examples:
`I washed the dogs’ bowls’, I said. (first ’ could be end of quote)
The ’keeper’s reputations are on the line. (first ’ actually indicating abbreviation for goalkeeper but could be start of quote in text where ’ is not distinct from `)
I want a laptop-with a case. (common in email not to have spaces round dash)
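One way to cope with such cases in a pipeline is to pass every plausible reading downstream without committing to one. The Python sketch below does this for the dash example only. The function name is hypothetical, and it offers just two readings (leave the token whole, or split at every hyphen), which is a simplification.

```python
import re


def hyphen_alternatives(token):
    """Return the candidate tokenizations of a token with internal hyphens.

    Tokens without an internal hyphen have a single reading. Otherwise
    two readings are offered: the token unchanged, and the token split
    at every hyphen with the hyphens kept as separate tokens.
    """
    if not re.fullmatch(r"[A-Za-z]+(?:-[A-Za-z]+)+", token):
        return [[token]]
    # A capturing group makes re.split keep the separators.
    return [[token], re.split(r"(-)", token)]
```

A later stage with access to a lexicon can then choose between `laptop-with` as one unit and `laptop`, `-`, `with` as three.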
Modularity problems
lexicon developers may assume
particular tokenization: e.g., hyphen
removal
different systems tokenize differently:
big problem for system integration
DELPH-IN - `characterization’ – record
original string character positions in token
and all subsequent units
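A short Python sketch of why recording character positions helps integration: two analyzers that tokenize differently can still be aligned through overlapping spans in the original string. Both tokenizers here are hypothetical stand-ins, not DELPH-IN components.

```python
import re


def whitespace_tokens(text):
    """Tokenizer A (hypothetical): split on whitespace only."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def split_punct_tokens(text):
    """Tokenizer B (hypothetical): words and single punctuation marks."""
    pattern = r"[A-Za-z]+|[^\sA-Za-z]"
    return [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]


def align(tokens_a, tokens_b):
    """Pair up token indices whose character spans overlap."""
    pairs = []
    for i, (_, a_start, a_end) in enumerate(tokens_a):
        for j, (_, b_start, b_end) in enumerate(tokens_b):
            if a_start < b_end and b_start < a_end:
                pairs.append((i, j))
    return pairs
```

For "laptop-with a case", tokenizer A yields three tokens and tokenizer B five, yet the alignment shows that A's first token covers B's first three. Neither system has to adopt the other's tokenization.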
Speech output
Speech output from a transcribing
recognizer is treated as a lattice of
tokens
may actually require retokenization
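As an illustration of what a lattice of tokens looks like as a data structure, the Python sketch below stores it as labelled edges between numbered nodes and enumerates every path from start to end. The word hypotheses in the usage note are invented for the example, and the code assumes the lattice is acyclic.

```python
from collections import defaultdict


def lattice_paths(edges, start, end):
    """List every token sequence from `start` to `end`.

    `edges` is an iterable of (source_node, target_node, token) triples.
    The lattice is assumed to be a directed acyclic graph.
    """
    outgoing = defaultdict(list)
    for src, dst, token in edges:
        outgoing[src].append((dst, token))

    def walk(node, prefix):
        if node == end:
            yield list(prefix)
            return
        for dst, token in outgoing[node]:
            prefix.append(token)
            yield from walk(dst, prefix)
            prefix.pop()

    return list(walk(start, []))
```

With the invented edges `(0, 2, "recognize")`, `(0, 1, "wreck")`, `(1, 2, "a")`, `(2, 4, "speech")`, `(2, 3, "nice")`, `(3, 4, "beach")`, the function returns four competing token sequences. A parser that consumes the lattice directly would avoid listing them all, since the number of paths can grow quickly.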
Non-white space languages
Segmentation in Japanese (e.g.,
Chasen) is (in effect) accompanied by
lexical lookup / morphological analysis
definitely do not want to assume this for
English – for some forms of processing we
may not have a lexicon.