
Towards an NLP `module’: the role of an utterance-level interface

Modular architecture (layers, top to bottom):
- language-independent application
- meaning representation
- language module
- utterance-level interface
- text or speech
Desiderata for NLP module
1. Application- and domain-independent
2. Bidirectional processing
3. No grammar-specific information should be needed in the application
4. Architecture should support multiple languages
5. Practical
6. Coverage: all well-formed input should be accepted; robust to speaker errors
Why?
- developers could build `intelligent’, responsive applications without being NLP experts themselves
- less time-consuming and expensive than doing the NLP for each application domain
- multilingual applications
- support further research

LinGO/DELPH-IN
- Software and `lingware’ for application- and domain-independent NLP
- Linguistically motivated (HPSG), deep processing
- Multiple languages
- Analysis and generation
- Informal collaboration since c. 1995
- NLP research and development, theoretical research and teaching
- www.delph-in.net
What’s different?
- Open Source, integrated systems
- Data-driven techniques combined with linguistic expertise
- Testing:
  - empirical basis, evaluation
  - linguistic motivation
- No toy systems!
  - large-scale grammars
  - maintainable software
  - development and runtime tools
Progress
1. Application- and domain-independent: reasonable (lexicons, text structure)
2. Bidirectional processing: yes
3. No grammar-specifics in applications: yes
4. Multiple languages: English, Japanese, German, Norwegian, Korean, Greek, Italian, French; plus grammar sharing via the Matrix
5. Practical: efficiency OK for some applications and improving; interfaces?
6. Coverage and robustness: 80%+ coverage on English, good parse selection, not robust
Integrating deep and shallow processing
- Shallow processing: speed and robustness, but lacks precision
- Pairwise integration of systems is time-consuming and brittle
- Common semantic representation language: shallow processing is underspecified
- Demonstrated effectiveness on IE (Deep Thought)
- Requires that systems share tokenization (undesirable and impractical) or that output can be precisely aligned with the original document
  - markup complicates this
Utterance-level interface
- text or speech
- complex cases:
  - text structure (e.g., headings, lists)
  - non-text (e.g., formulae, dates, graphics)
  - segmentation (esp. Japanese, Chinese)
  - speech lattices
- integration of multiple analyzers
Utterance interface
- Standard interface language:
  - allow for ambiguity at all levels
  - XML
  - collaborating with ISO working group (MAF)
  - processors deliver standoff annotations to the original text
- Plan to develop finite-state preprocessors for some text types, allow for others
- Plan to experiment with speech lattices
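As an illustration of standoff delivery, here is a minimal Python sketch (not the actual XML interchange format, which is the subject of the MAF collaboration): annotations never modify the source text, and each one points back into it by character offsets. The names `Annotation` and `layer` are illustrative assumptions, not part of any DELPH-IN interface.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """One standoff annotation: a labelled value over a character span
    of the original document. Field names are illustrative only."""
    cfrom: int   # offset of the first character in the original text
    cto: int     # offset one past the last character
    layer: str   # e.g. "token", "ne" (named entity)
    value: str

text = "Fred arrived on 3 May"
annos = [
    Annotation(0, 4, "token", "Fred"),
    Annotation(0, 4, "ne", "person"),          # a second layer over the same span
    Annotation(16, 21, "token", "<date 3-May>"),  # normalized token form
]

# Any consumer can recover the original span the annotation refers to:
assert text[annos[0].cfrom:annos[0].cto] == "Fred"
```

Because every analyzer refers to the same immutable text, multiple processors can annotate independently, and competing (ambiguous) analyses over the same span coexist without conflict.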
Assumptions about tokenization
- Tokenization: input data is transformed into a form suitable for morphological processing or lexical lookup:

    What’s in those 234 dogs’ bowls, Fred?
    what ’s in those <num 234> dogs ’s bowls , Fred ?

- Tokenization is therefore pre-lexical and cannot depend on lexical lookup
- Normalization (case, numbers, dates, formulae) as well as segmentation
- It used to be common to strip punctuation, but large-coverage systems utilize it
- In generation: go from tokens to final output
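The transformation above can be sketched as a small rule-based, pre-lexical tokenizer. This is a minimal illustration, not the DELPH-IN preprocessor: the sentence-initial lowercasing and the possessive rewriting are naive heuristics, and a real system would keep such decisions ambiguous for later components rather than committing here.

```python
import re

def tokenize(sentence: str) -> list[str]:
    """Pre-lexical tokenization sketch: no lexicon is consulted.
    - peels trailing punctuation into separate tokens
    - splits the clitic 's; rewrites a bare possessive (dogs') as 's
    - wraps digit strings in a <num ...> normalization token
    - lowercases the sentence-initial word (a crude heuristic)
    """
    tokens = []
    for raw in sentence.split():
        # separate trailing punctuation (, . ? !) from the word proper
        word, punct = re.match(r"(.*?)([,.?!]*)$", raw).groups()
        if re.fullmatch(r"\d+", word):
            tokens.append(f"<num {word}>")          # number normalization
        elif word.endswith("'s"):
            tokens += [word[:-2], "'s"]             # clitic split
        elif word.endswith("'"):
            tokens += [word[:-1], "'s"]             # dogs' -> dogs 's
        elif word:
            tokens.append(word)
        tokens.extend(punct)                        # one token per mark
    if tokens:
        tokens[0] = tokens[0].lower()
    return tokens

print(" ".join(tokenize("What's in those 234 dogs' bowls, Fred?")))
```

Note that nothing here depends on lexical lookup, which is exactly the constraint the slide states: tokenization must be decidable before any lexicon is available.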
Tokenization ambiguity
- It is unusual to find cases where humans have any difficulty: the problem arises because we need a pipelined system
- Some examples:
  - `I washed the dogs’ bowls’, I said. (the first ’ could be the end of the quote)
  - The ’keeper’s reputations are on the line. (the first ’ actually indicates an abbreviation for goalkeeper, but could be the start of a quote in text where ’ is not distinct from `)
  - I want a laptop-with a case. (common in email not to have spaces round a dash)
Modularity problems
- Lexicon developers may assume a particular tokenization: e.g., hyphen removal
- Different systems tokenize differently: a big problem for system integration
- DELPH-IN `characterization’: record the original string character positions in tokens and all subsequent units
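A minimal sketch of characterization, with hypothetical names (`Token`, `characterize`, `split_hyphens`): every token records the character span of the original string it was derived from, and later rewrites such as hyphen removal propagate those spans, so output from differently-tokenizing systems can always be re-aligned with the source document.

```python
import re
from dataclasses import dataclass

@dataclass
class Token:
    form: str    # possibly rewritten surface form
    cfrom: int   # offset of the first character in the original string
    cto: int     # offset one past the last character

def characterize(text: str) -> list[Token]:
    """Initial whitespace tokens, each recording its source span."""
    return [Token(m.group(), m.start(), m.end())
            for m in re.finditer(r"\S+", text)]

def split_hyphens(tokens: list[Token]) -> list[Token]:
    """A later rewrite (hyphen removal): the new tokens keep pointing at
    the positions they came from, so characterization survives."""
    out = []
    for t in tokens:
        pos = t.cfrom
        for part in re.split(r"(-)", t.form):
            if part and part != "-":
                out.append(Token(part, pos, pos + len(part)))
            pos += len(part)
    return out

toks = split_hyphens(characterize("a laptop-with a case"))
```

Because each `Token` carries `cfrom`/`cto` into every subsequent unit, a consumer that tokenized the text differently can still locate `with` at characters 9–13 of the original string.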
Speech output
- Speech output from a transcribing recognizer is treated as a lattice of tokens
- May actually require retokenization
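A token lattice can be represented as a directed acyclic graph whose edges carry competing token hypotheses over the same stretch of input. The class below is an illustrative sketch (the names are assumptions, not a DELPH-IN interface), using the classic `night rate’ / `nitrate’ recognizer confusion, in which the competing hypotheses even disagree on the number of tokens — which is why downstream components may need to retokenize.

```python
from collections import defaultdict

class TokenLattice:
    """Illustrative token lattice: nodes are points in the input signal,
    edges carry alternative token hypotheses between them."""

    def __init__(self):
        self.edges = defaultdict(list)  # start node -> [(token, end node)]

    def add(self, start, token, end):
        self.edges[start].append((token, end))

    def paths(self, start, final):
        """Enumerate every token sequence from start to final."""
        if start == final:
            yield []
            return
        for token, nxt in self.edges[start]:
            for rest in self.paths(nxt, final):
                yield [token] + rest

# Competing hypotheses over the same stretch of speech:
lat = TokenLattice()
lat.add(0, "nitrate", 2)   # one token spanning the whole stretch
lat.add(0, "night", 1)     # ... versus two shorter tokens
lat.add(1, "rate", 2)

for hypothesis in lat.paths(0, 2):
    print(hypothesis)
```

Keeping both readings in the lattice, rather than forcing an early choice, matches the interface requirement above that ambiguity be representable at all levels.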
Non-whitespace languages
- Segmentation in Japanese (e.g., ChaSen) is, in effect, accompanied by lexical lookup / morphological analysis
- We definitely do not want to assume this for English: for some forms of processing we may not have a lexicon