Towards an NLP `module': the role of an utterance-level interface

Modular architecture (diagram): language-independent application - meaning representation - language module - utterance-level interface - text or speech
Desiderata for an NLP module
1. Application- and domain-independent
2. Bidirectional processing
3. No grammar-specific information should be needed in the application
4. Architecture should support multiple languages
5. Practical
6. Coverage: all well-formed input should be accepted; robust to speaker errors

Why?
- Developers could build `intelligent', responsive applications without being NLP experts themselves
- Less time-consuming and expensive than doing the NLP separately for each application domain
- Multilingual applications
- Supports further research

LinGO/DELPH-IN
- Software and `lingware' for application- and domain-independent NLP
- Linguistically motivated (HPSG), deep processing
- Multiple languages
- Analysis and generation
- Informal collaboration since c. 1995
- NLP research and development, theoretical research and teaching
- www.delph-in.net

What's different?
- Open Source, integrated systems
- Data-driven techniques combined with linguistic expertise
- Testing: empirical basis, evaluation, linguistic motivation
- No toy systems: large-scale grammars, maintainable software, development and runtime tools

Progress
1. Application- and domain-independent: reasonable (lexicons, text structure)
2. Bidirectional processing: yes
3. No grammar-specifics in applications: yes
4. Multiple languages: English, Japanese, German, Norwegian, Korean, Greek, Italian, French; plus grammar sharing via the Matrix
5. Practical: efficiency OK for some applications and improving; interfaces?
6. Coverage and robustness: 80%+ coverage on English, good parse selection, not robust

Integrating deep and shallow processing
- Shallow processing: speed and robustness, but lacks precision
- Pairwise integration of systems is time-consuming and brittle
- Common semantic representation language: shallow processing underspecified
- Demonstrated effectiveness on IE (Deep Thought)
- Requires that systems share tokenization (undesirable and impractical) or that output can be precisely aligned with the original document; markup complicates this

Utterance-level interface: text or speech
Complex cases:
- text structure (e.g., headings, lists)
- non-text (e.g., formulae, dates, graphics)
- segmentation (esp. Japanese, Chinese)
- speech lattices
- integration of multiple analyzers

Utterance interface
- Standard interface language, allowing for ambiguity at all levels
- XML: collaborating with the ISO working group (MAF)
- Processors deliver standoff annotations to the original text
- Plan to develop finite-state preprocessors for some text types, allowing for others
- Plan to experiment with speech lattices

Assumptions about tokenization
- Tokenization: input data is transformed to a form suitable for morphological processing or lexical lookup:
    What's in those 234 dogs' bowls, Fred?
    what 's in those <num 234> dogs 's bowls , Fred ?
- Tokenization is therefore pre-lexical and cannot depend on lexical lookup
- Includes normalization (case, numbers, dates, formulae) as well as segmentation
- It used to be common to strip punctuation, but large-coverage systems utilize it
- In generation: go from tokens to final output

Tokenization ambiguity
- Unusual to find cases where humans have any difficulty: the problem arises because we need a pipelined system
- Some examples:
    `I washed the dogs' bowls', I said. (first ' could be end of quote)
    The 'keeper's reputations are on the line. (first ' actually indicating an abbreviation for `goalkeeper', but could be start of quote in text where ' is not distinct from `)
    I want a laptop-with a case. (common in email not to have spaces round the dash)

Modularity problems
- Lexicon developers may assume a particular tokenization: e.g., hyphen removal
- Different systems tokenize differently: a big problem for system integration
- DELPH-IN `characterization': record original string character positions in tokens and in all subsequent units

Speech output
- Speech output from a transcribing recognizer is treated as a lattice of tokens
- May actually require retokenization

Non-white-space languages
- Segmentation in Japanese (e.g., ChaSen) is in effect accompanied by lexical lookup / morphological analysis
- We definitely do not want to assume this for English: for some forms of processing we may not have a lexicon.
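The pre-lexical tokenization described under `Assumptions about tokenization' can be sketched in code. This is a minimal, illustrative sketch (not the DELPH-IN preprocessor, whose names and behavior differ): it splits off sentence punctuation and clitics, normalizes bare possessives to 's and numbers to a <num ...> form, and downcases the sentence-initial token, reproducing the slide's example transformation.

```python
def tokenize(text):
    """Pre-lexical tokenization sketch (illustrative only):
    split punctuation and clitics, normalize possessives and
    numbers, downcase the first token."""
    tokens = []
    for chunk in text.split():
        trailing = []
        # peel sentence punctuation off the end, one mark at a time
        while chunk and chunk[-1] in ",?!.;:":
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk.isdigit():
            parts = [f"<num {chunk}>"]      # number normalization
        elif chunk.endswith("'s"):
            parts = [chunk[:-2], "'s"]      # clitic: What's -> What 's
        elif chunk.endswith("'"):
            parts = [chunk[:-1], "'s"]      # possessive: dogs' -> dogs 's
        else:
            parts = [chunk]
        tokens.extend(p for p in parts if p)
        tokens.extend(reversed(trailing))
    if tokens:
        tokens[0] = tokens[0].lower()       # crude case normalization
    return tokens
```

Note that nothing here consults a lexicon: the rules are purely string-based, which is exactly what makes the pipeline pre-lexical and also what makes the ambiguous examples above (possessive vs. quote) unresolvable at this stage.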
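The `characterization' idea from the modularity slide (recording original string character positions in tokens and all subsequent units) can be sketched as standoff tokens. The attribute names cfrom/cto echo DELPH-IN usage; the class and function names here are otherwise illustrative assumptions.

```python
import re
from dataclasses import dataclass

@dataclass
class Token:
    form: str   # token form (may later be normalized)
    cfrom: int  # start offset in the original string
    cto: int    # end offset in the original string (exclusive)

def characterize(text):
    """Whitespace tokens carrying standoff character spans."""
    return [Token(m.group(), m.start(), m.end())
            for m in re.finditer(r"\S+", text)]
```

Because cfrom/cto always point back into the original document, components that tokenize differently can still align their annotations, which is the point of characterization as a fix for the system-integration problem.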
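Both tokenization ambiguity and speech-recognizer output can be held in a token lattice rather than a single token sequence. A minimal sketch, with all names assumed: edges are (cfrom, cto, form) triples over character positions, and each path from start to end is one candidate tokenization.

```python
from collections import defaultdict

def paths(edges, start, end):
    """Enumerate token sequences along all paths through a lattice.
    edges: iterable of (cfrom, cto, form) triples (illustrative)."""
    out = defaultdict(list)
    for cfrom, cto, form in edges:
        out[cfrom].append((cto, form))

    def walk(pos):
        if pos == end:
            yield []
        for cto, form in out[pos]:
            for rest in walk(cto):
                yield [form] + rest

    return list(walk(start))

# Two readings of "laptop-with": one hyphenated token, or three tokens
lattice = [(0, 11, "laptop-with"), (0, 6, "laptop"),
           (6, 7, "-"), (7, 11, "with")]
```

Here paths(lattice, 0, 11) yields both readings, so the decision between them can be deferred to a later component instead of being forced in the pipeline's first stage.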