Generation
Aims of this talk
- Discuss MRS and LKB generation
- Describe larger research programme: modular generation
- Mention some interactions with other work in progress:
  - RMRS
  - SEM-I

Outline of talk
- Towards modular generation
- Why MRS?
- MRS and chart generation
- Data-driven techniques
- SEM-I and documentation

Modular architecture
Language-independent component
  → meaning representation
Language-dependent realization
  → string or speech output
Desiderata for a portable realization module
- Application independent
- Any well-formed input should be accepted
- No grammar-specific/conventional information should be essential in the input
- Output should be idiomatic

Architecture (preview)
External LF
  ↓ (SEM-I)
Internal LF
  ↓ specialization modules
Chart generator (+ control modules)
  ↓
String
Why MRS?
- Flat structures
  - independence of syntax: conventional LFs partially mirror tree structure
  - manipulation of individual components: can ignore scope structure etc.
  - lexicalised generation
  - composition by accumulation of EPs: robust composition
- Underspecification
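Since the rest of the talk relies on the idea of a flat bag of EPs, here is a minimal Python sketch (predicate names illustrative, matching the RMRS example later in the talk) of an MRS for "every cat chases some dog", showing that components can be manipulated individually while ignoring scope:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EP:
    """One elementary predication: a labelled relation over variables."""
    label: str
    pred: str
    args: tuple  # (role, variable) pairs

# "Every cat chases some dog" as a flat bag of EPs
mrs = [
    EP("lb1", "every_q", (("ARG0", "x"), ("RSTR", "h9"), ("BODY", "h6"))),
    EP("lb2", "cat_n",   (("ARG0", "x"),)),
    EP("lb3", "chase_v", (("ARG0", "e"), ("ARG1", "x"), ("ARG2", "y"))),
    EP("lb4", "some_q",  (("ARG0", "y"), ("RSTR", "h8"), ("BODY", "h7"))),
    EP("lb5", "dog_n_1", (("ARG0", "y"),)),
]

# Flatness makes individual components easy to inspect, ignoring scope:
verbs = [ep for ep in mrs if ep.pred.endswith("_v")]
```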
An excursion: Robust MRS
- Deep Thought: integration of deep and shallow processing via compatible semantics
  - all components construct RMRSs
- Principled way of building robustness into deep processing
- Requirements for consistency etc. help human users too
Extreme flattening of deep output
[figure: scoped logical forms for "every cat chases some dog", flattened to the RMRS below]
lb1:every_q(x), RSTR(lb1,h9), BODY(lb1,h6), lb2:cat_n(x), lb5:dog_n_1(y),
lb4:some_q(y), RSTR(lb4,h8), BODY(lb4,h7), lb3:chase_v(e), ARG1(lb3,x),
ARG2(lb3,y), h9 qeq lb2, h8 qeq lb5
Extreme Underspecification
- Factorize deep representation to minimal units
- Only represent what you know

Robust MRS
- Separating relations
- Separate arguments
- Explicit equalities
- Conventions for predicate names and sense distinctions
- Hierarchy of sorts on variables
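The factorization can be sketched as follows (illustrative data, not a real parser's output; representing facts as tuples is a simplification): relations, arguments, and equalities are separate minimal units, so a shallow component need only assert what it knows.

```python
# RMRS factorization sketch: each relation and each argument is a
# separate fact, so a shallow component can state a relation without
# committing to its arguments.
relations  = [("lb3", "chase_v", "e")]        # label, predicate, ARG0 only
arguments  = [("ARG1", "lb3", "x"),           # arguments stated separately
              ("ARG2", "lb3", "y")]
equalities = [("x", "x2")]                    # explicit variable equalities

def args_of(label):
    """Gather the separately-stated arguments of one relation."""
    return {role: val for role, lb, val in arguments if lb == label}
```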
Chart generation with the LKB
1. Determine lexical signs from MRS
2. Determine possible rules contributing EPs (`construction semantics': compound rule etc.)
3. Instantiate signs (lexical and rule) according to variable equivalences
4. Apply lexical rules
5. Instantiate chart
6. Generate by parsing without string position
7. Check output against input
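Steps 5–7 can be illustrated with a toy chart generator (the mini-grammar is hypothetical): edges record which input EPs they cover, only edges covering disjoint EP sets may combine, and a result counts as complete only when it consumes the whole input bag.

```python
def chart_generate(lexical_edges, combine, n_eps):
    """Toy generation by parsing without string positions.
    Edges are (phrase, frozenset-of-covered-EP-indices) pairs."""
    chart = list(lexical_edges)
    agenda = list(lexical_edges)
    results = set()
    while agenda:
        edge = agenda.pop()
        for other in list(chart):
            if edge[1] & other[1]:
                continue  # edges must consume disjoint EPs
            for new in combine(edge, other):
                if new not in chart:
                    chart.append(new)
                    agenda.append(new)
                    if len(new[1]) == n_eps:  # root condition: all EPs used
                        results.add(new[0])
    return results

# Hypothetical mini-grammar for "Kim likes Sandy" (EPs 0, 1, 2)
def combine(a, b):
    out = []
    for left, right in ((a, b), (b, a)):
        if left[0].endswith("likes") and right[0] == "Sandy":
            out.append((left[0] + " " + right[0], left[1] | right[1]))
        if left[0] == "Kim" and right[0].startswith("likes "):
            out.append((left[0] + " " + right[0], left[1] | right[1]))
    return out

lex = [("Kim", frozenset({0})), ("likes", frozenset({1})), ("Sandy", frozenset({2}))]
realizations = chart_generate(lex, combine, 3)  # → {'Kim likes Sandy'}
```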
Lexical lookup for generation
- _like_v_1(e,x,y) – returns the lexical entry for sense 1 of the verb like
- temp_loc_rel(e,x,y) – returns multiple lexical entries
- multiple relations in one lexical entry: e.g., who, where
- entries with null semantics: heuristics
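These lookup cases can be sketched with a predicate-indexed lexicon (all entries and predicate names hypothetical):

```python
from collections import defaultdict

# Hypothetical lexicon: one predicate can license several entries
# (temp_loc_rel → on/at), and one entry can carry several relations
# (where ≈ which_q_rel + place_n_rel).
LEXICON = [
    ("like_v1", ("_like_v_1",)),
    ("on_p",    ("temp_loc_rel",)),
    ("at_p",    ("temp_loc_rel",)),
    ("where",   ("which_q_rel", "place_n_rel")),
]

index = defaultdict(list)
for name, preds in LEXICON:
    for p in preds:
        index[p].append((name, preds))

def lookup(input_preds):
    """Entries all of whose relations appear in the input bag of EPs."""
    return {name for p in input_preds for name, preds in index[p]
            if all(q in input_preds for q in preds)}
```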
Instantiation of entries
_like_v_1(e,x,y) & named(x,"Kim") & named(y,"Sandy")
- find locations corresponding to `x' in all FSs
- replace all `x's with a constant
- repeat for `y' etc.
- Also applies to rules contributing construction semantics
- `Skolemization' (misleading name ...)
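A minimal sketch of this `Skolemization' step, with feature structures simplified to nested dicts and hypothetical constant names:

```python
def skolemize(fs, bindings):
    """Replace every variable in a feature structure with its constant,
    so unification can only equate what the input MRS equates."""
    if isinstance(fs, dict):
        return {feat: skolemize(val, bindings) for feat, val in fs.items()}
    return bindings.get(fs, fs)

# _like_v_1(e,x,y) & named(x,"Kim") & named(y,"Sandy"): bind x and y
entry = {"PRED": "_like_v_1",
         "ARG1": {"INDEX": "x"},
         "ARG2": {"INDEX": "y"}}
bindings = {"x": "SK-x-Kim", "y": "SK-y-Sandy"}   # hypothetical constants
instantiated = skolemize(entry, bindings)
```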
Lexical rule application
- Lexical rules that contribute EPs are only used if the EP is in the input
- Inflectional rules will only apply if the variable has the correct sort
- Lexical rule application does morphological generation (e.g., liked, bought)
Chart generation proper
- Possible lexical signs added to a chart structure
- Currently no indexing of chart edges
  - chart generation can use semantic indices, but current results suggest this doesn't help
- Rules applied as for chart parsing: edges checked for compatibility with input semantics (bag of EPs)
Root conditions
- Complete structures must consume all the EPs in the input MRS
- Should check for compatibility of scopes:
  - precise qeq matching is (probably) too strict
  - requiring exactly the same scopes is (probably) unrealistic and too slow

Generation failures due to MRS issues
- Well-formedness check prior to input to generator (optional)
- Lexical lookup failure: predicate doesn't match entry, wrong arity, wrong variable types
- Unwanted instantiations of variables
- Missing EPs in input: syntax (e.g., no noun), lexical selection
- Too many EPs in input: e.g., two verbs and no coordination
Improving generation via corpus-based techniques
- CONTROL: e.g. intersective modifier order
  - logical representation does not determine order
  - wet(x) & weather(x) & cold(x)
- UNDERSPECIFIED INPUT: e.g.,
  - determiners: none/a/the
  - prepositions: in/on/at
Constraining generation for idiomatic output
- Intersective modifier order: e.g., adjectives, prepositional phrases
- Logical representation does not determine order:
  - wet(x) & weather(x) & cold(x)
Adjective ordering
- Constraints / preferences:
  - big red car
  - * red big car
  - cold wet weather
  - wet cold weather (OK, but dispreferred)
- Difficult to encode in a symbolic grammar
Corpus-derived adjective ordering
- n-grams perform poorly
- Thater: direct evidence plus clustering
- positional probability
- Malouf (2000): memory-based learning plus positional probability: 92% on BNC
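The positional-probability idea can be sketched with toy counts (hypothetical data, much simpler than Malouf's memory-based learner): count how often one adjective precedes another, then pick the order that respects the most observed precedences.

```python
from collections import Counter
from itertools import permutations

# Toy precedence counts (standing in for corpus statistics)
observed = [("big", "red"), ("big", "old"), ("old", "red"),
            ("cold", "wet"), ("big", "red")]
precedes = Counter(observed)

def order_score(seq):
    """How many observed pairwise precedences this order respects."""
    return sum(precedes[(a, b)]
               for i, a in enumerate(seq) for b in seq[i + 1:])

def best_order(adjs):
    return max(permutations(adjs), key=order_score)
```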
Underspecified input to generation
We bought a car on Friday
Accept:
  pron(x) & a_quant(y,h1,h2) & car(y) & buy(epast,x,y) & on(e,z) & named(z,Friday)
and:
  pron(x) & general_q(y,h1,h2) & car(y) & buy(epast,x,y) & temp_loc(e,z) & named(z,Friday)
And maybe:
  pron(x1pl) & car(y) & buy(epast,x,y) & temp_loc(e,z) & named(z,Friday)
Guess the determiner
- We went climbing in _ Andes
- _ president of _ United States
- I tore _ pyjamas
- I tore _ duvet
- George doesn't like _ vegetables
- We bought _ new car yesterday
Determining determiners
- Determiners are partly conventionalized, often predictable from local context
- Applications: translation from Japanese etc., speech prosthesis
- More `meaning-rich' determiners are assumed to be specified in the input
- Minnen et al.: 85% on WSJ (using TiMBL)
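A toy rule-based guesser along these lines (illustrative rules only, not Minnen et al.'s memory-based learner):

```python
def guess_determiner(noun, plural=False,
                     conventional_the=frozenset({"Andes", "United States"})):
    """Toy determiner guesser (hypothetical rules): conventionalized 'the'
    for known names, bare for plurals, 'a' for singular count nouns."""
    if noun in conventional_the:
        return "the"
    if plural:
        return ""   # bare plural, e.g. "George doesn't like vegetables"
    return "a"      # e.g. "We bought a new car yesterday"
```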
Preposition guessing
- Choice between temporal in/on/at:
  - in the morning
  - in July
  - on Wednesday
  - on Wednesday morning
  - at three o'clock
  - at New Year
- ERG uses hand-coded rules and lexical categories
- Machine learning approach gives very high precision and recall on WSJ, good results on a balanced corpus (Lin Mei, 2004, Cambridge MPhil thesis)
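The hand-coded style of rule can be sketched as follows (categories and rules are illustrative, not the ERG's actual implementation):

```python
# Toy temporal in/on/at chooser over simple lexical categories
DAYS = {"Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"}
MONTHS = {"January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"}
DAY_PARTS = {"morning", "afternoon", "evening"}

def temporal_preposition(np):
    words = np.split()
    if "o'clock" in words or np == "New Year":
        return "at"   # clock times and fixed-point expressions
    if words[0] in DAYS:
        return "on"   # days, including "Wednesday morning"
    if words[-1] in DAY_PARTS or words[-1] in MONTHS:
        return "in"   # parts of the day and months
    return "on"       # hypothetical fallback
```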
SEM-I: semantic interface
- Meta-level: manually specified `grammar' relations (constructions and closed-class)
- Object-level: linked to the lexical database for deep grammars
- Definitional: e.g. lemma+POS+sense
- Linked test suites, examples, documentation
SEM-I development
- SEM-I eventually forms the `API': stable, changes negotiated
- SEM-I vs Verbmobil SEMDB:
  - technical limitations of SEMDB
  - too painful!
- `Munging' rules: external vs internal
- SEM-I development must be incremental
Role of SEM-I in architecture
- Offline:
  - definition of `correct' (R)MRS for developers
  - documentation
  - checking of test-suites
- Online:
  - in unifier/selector: reject invalid RMRSs
  - patching up input to generation
Goal: semi-automated documentation
[diagram: [incr tsdb()], the Lex DB and the ERG feed the object-level SEM-I, with documentation strings and a semantic test-suite; examples are auto-generated (semi-automatically, on demand); the meta-level SEM-I autogenerates an appendix of the documentation]
Robust generation
- SEM-I an important preliminary:
  - check whether generator input is semantically compatible with the grammars
- Eventually: hierarchy of relations outside the grammars, allowing underspecification
- `Fill-in' of underspecified RMRS:
  - exploit work on determiner guessing etc.
Architecture (again)
External LF
  ↓ (SEM-I)
Internal LF
  ↓ specialization modules
Chart generator (+ control modules)
  ↓
String
Interface
- External representation:
  - public, documented
  - reasonably stable
- Internal representation:
  - syntax/semantics interface
  - convenient for analysis
- External/internal conversion via the SEM-I
Guaranteed generation?
- Given a well-formed input MRS/RMRS, with elementary predications found in the SEM-I (and dependencies), can we generate a string? With input fix-up? Negotiation?
- Semantically bleached lexical items: which, one, piece, do, make
- Defective paradigms, negative polarity, anti-collocations etc.?
Next stages
- SEM-I development
- Documentation and test suite integration
- Generation from RMRSs produced by a shallower parser (or deep/shallow combination)
- Partially fixed text in generation (cogeneration)
- Further statistical modules: e.g., locational prepositions, other modifiers
- More underspecification
- Gradually increase flexibility of the interface to generation