Transcript Faceted Metadata for Information Architecture and Search
Unambiguous + Unlimited = Unsupervised or Using the Web for Natural Language Processing Problems
Marti Hearst School of Information, UC Berkeley This research supported in part by NSF DBI-0317510
Natural Language Processing
The ultimate goal: write programs that read and understand stories and conversations.
This is too hard! Instead we tackle sub-problems.
There have been notable successes lately: Machine translation is vastly improved Decent speech recognition in limited circumstances Text categorization works with some accuracy PARC, Aug 3, 2006
Automatic Help Desk Translation at MS
PARC, Aug 3, 2006
Why is text analysis difficult?
One reason: enormous vocabulary size.
The average English speaker’s vocabulary is around 50,000 words, Many of these can be combined with many others, And they mean different things when they do!
PARC, Aug 3, 2006
How can a machine understand these?
Decorate the cake with the frosting.
Decorate the cake with the kids.
Throw out the cake with the frosting.
Get the sock from the cat with the gloves.
Get the glove from the cat with the socks.
It’s in the plastic water bottle.
It’s in the plastic bag dispenser.
PARC, Aug 3, 2006
How to tackle this problem?
The field was stuck for quite some time.
CYC: hand-enter all semantic concepts and relations A new approach started around 1990 How to do it: Get large text collections Compute statistics over the words in those collections Many different algorithms for doing this.
PARC, Aug 3, 2006
Size Matters
Recent realization: bigger is better than smarter!
Banko and Brill ’01: “Scaling to Very, Very Large Corpora for Natural Language Disambiguation”, ACL PARC, Aug 3, 2006
Example Problem
Grammar checker example: Which word to use?
I am in my third year as the principal of Anamosa High School. School-principal transfers caused some upset.
This is a simple formulation of the quantum mechanical uncertainty principle. Power without principle is barren, but principle without power is futile. (Tony Blair)
PARC, Aug 3, 2006
Using Very, Very Large Corpora
Keep track of which words are the neighbors of each spelling in well-edited text, e.g.: Principal: “high school” Principle: “rule” At grammar-check time, choose the spelling best predicted by the surrounding words.
Surprising results: Log-linear improvement even to a billion words!
Getting more data is better than fine-tuning algorithms!
PARC, Aug 3, 2006
The Effects of LARGE Datasets
From Banko & Brill ‘01 PARC, Aug 3, 2006
How to Extend this Idea?
This is an exciting result … BUT relies on having huge amounts of text that has been appropriately annotated!
PARC, Aug 3, 2006
How to Avoid Labeling?
“Web as a baseline” (Lapata & Keller 04,05) Main idea: apply web-determined counts to every problem imaginable.
Example: for t in {
f(w1, t, w2)
The largest count wins
Web as a Baseline
Works very well in some cases machine translation candidate selection article generation noun compound interpretation noun compound bracketing adjective ordering But lacking in others spelling correction countability detection prepositional phrase attachment How to push this idea further?
Significantly better
than the best supervised algorithm.
Not significantly different
from the best supervised.
PARC, Aug 3, 2006
Using Unambiguous Cases
The trick: look for unambiguous cases to start Use these to improve the results beyond what co occurrence statistics indicate.
An Early Example: Hindle and Rooth, “Structural Ambiguity and Lexical Relations”, ACL ’90, Comp Ling’93 Problem: Prepositional Phrase attachment
I eat /v spaghetti /n1 with /p a
fork
/n2 .
I eat /v spaghetti /n1 with /p
sauce
/n2 .
quadruple: (v, n1, p, n2) Question: does n2 attach to v or to n1 ?
PARC, Aug 3, 2006
Using Unambiguous Cases
How to do this with unlabeled data?
First try: Parse some text into phrase structure Then compute certain co-occurrences f(v, n1, p) f(n1, p) f(v, n1) Problem: results not accurate enough The trick: look for unambiguous cases: Spaghetti with sauce is delicious . (pre-verbal) I eat it with a fork . (object of preposition can’t attach to a pronoun) Use these to improve the results beyond what co occurrence statistics indicate.
PARC, Aug 3, 2006
Unambiguous + Unlimited = Unsupervised
Apply the Unambiguous Case Idea to the Very, Very Large Corpora idea The potential of these approaches are not fully realized Our work: Structural Ambiguity Decisions (work with Preslav Nakov) PP-attachment Noun compound bracketing Coordination grouping Semantic Relation Acquisition Hypernym (ISA) relations Verbal relations between nouns PARC, Aug 3, 2006
Structural Ambiguity Problems
Apply the U + U = U idea to structural ambiguity Noun compound bracketing Prepositional Phrase attachment Noun Phrase coordination Motivation: BioText project
In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF).
Open-labeled long-term study of the subcutaneous sumatriptan efficacy and tolerability in acute migraine treatment.
•
BimL protein interact with Bcl-2 or Bcl-XL, or Bcl-w proteins (Immuno precipitation (anti-Bcl-2 OR Bcl-XL or Bcl-w)) followed by Western blot (anti-EEtag) using extracts human 293T cells co-transfected with EE tagged BimL and (bcl-2 or bcl-XL or bcl-w) plasmids)
PARC, Aug 3, 2006
Applying U + U = U to Structural Ambiguity
We introduce the use of (nearly) unambiguous features: surface features paraphrases Combined with very, very large corpora Achieve state-of-the-art results without labeled examples.
PARC, Aug 3, 2006
Noun Compound Bracketing
(a) (b) [ [ liver cell ] antibody ] [ liver [cell line] ] (
left
bracketing) (
right
bracketing) In (a), the
antibody
targets the
liver cell
.
In (b), the
cell line
is derived from the
liver
.
PARC, Aug 3, 2006
Dependency Model
right
bracketing: [w 1 [w 2
w
3 ] ]
w
2
w
3 is a compound (modified by w 1 )
home health care
w
1 and w 2 independently modify w 3
adult male rat left
bracketing : [ [w 1
w
2 ]w 3 ] only 1 modificational choice possible
law enforcement officer w
1
w
1
w
2
w
2
w
3
w
3 PARC, Aug 3, 2006
Related Work
Marcus(1980), Pustejosky&al.(1993), Resnik(1993) adjacency model: Pr(w 1 |w 2 ) vs. Pr(w 2 |w 3 ) Lauer (1995) dependency model: Pr(w 1 |w 2 ) vs. Pr(w 1 |w 3 ) Keller & Lapata (2004): use the Web unigrams and bigrams Girju & al. (2005) supervised model bracketing in context requires WordNet senses to be given
Our approach:
• Web as data • 2 , n-grams • paraphrases • surface features PARC, Aug 3, 2006
Computing Bigram Statistics
Dependency Model, Frequencies Compare #(w 1 ,w 2 ) to #(w 1 ,w 3 ) Dependency model, Probabilities
right
Pr(
left
) = Pr(w 1
w
2 |w 2 )Pr(w 2
w
3 |w 3 ) Pr(
right
) = Pr(w 1
w
3 |w 3 )Pr(w 2
w
3 |w 3 )
w
1
w
2
left
So we compare Pr(w 1
w
2 |w 2 ) to Pr(w 1
w
3 |w 3 )
w
3 PARC, Aug 3, 2006
Probabilities: Estimation
Using page hits as a proxy for n-gram counts Pr(w 1
w
2 |w 2 ) = #(w 1 ,
w
2 ) / #(w 2 ) #(w 2 ) word frequency; query for “w 2 ” #(w 1 ,
w
2 ) bigram frequency; query for “w 1
w
2 ” smoothed by 0.5
PARC, Aug 3, 2006
Association Models:
2
(Chi Squared)
A = #(w
i
,
w j
) B = #(w
i
) – #(w
i
,
w j
) C = #(w
j
) – #(w
i
,
w j
) D = N – (A+B+C) N = 8 trillion (= A+B+C+D)
8 billion Web pages x 1,000 words
PARC, Aug 3, 2006
Web-derived Surface Features
Authors often disambiguate noun compounds using surface markers , e.g.
: amino-acid sequence brain stem’s cell
left
brain’s stem cell
right left
The enormous size of the Web makes these frequent enough to be useful.
PARC, Aug 3, 2006
Web-derived Surface Features: Dash (hyphen) Left dash
cell-cycle analysis
left
Right dash
donor T-cell
right fiber optics-system
should be
left
..
Double dash
T-cell-depletion
unusable…
PARC, Aug 3, 2006
Web-derived Surface Features: Possessive Marker Attached to the first word
brain’s stem cell
right
Attached to the second word
brain stem’s cell
left
Combined features
brain’s stem-cell
right
PARC, Aug 3, 2006
Web-derived Surface Features: Capitalization don’t-care – lowercase – uppercase
Plasmodium vivax Malaria
left
plasmodium vivax Malaria
left
lowercase – uppercase – don’t-care
brain Stem cell
right
brain Stem Cell
right
Disable this on: Roman digits Single-letter words: e.g. vitamin D deficiency PARC, Aug 3, 2006
Web-derived Surface Features: Embedded Slash Left embedded slash
leukemia/lymphoma cell
right
PARC, Aug 3, 2006
Web-derived Surface Features: Parentheses Single-word
growth factor (beta) (brain) stem cell
left right
Two-word
(growth factor) beta brain (stem cell)
left right
PARC, Aug 3, 2006
Web-derived Surface Features: Comma, dot, semi-colon Following the first word home. health care
right adult, male rat
right
Following the second word
health care, provider lung cancer: patients
left left
PARC, Aug 3, 2006
Web-derived Surface Features: Dash to External Word External word to the left
mouse
-brain stem cell
right
External word to the right
tumor necrosis factor alpha
left
PARC, Aug 3, 2006
Web-derived Surface Features: Problems & Solutions Problem: search engines ignore punctuation in queries
“brain-stem cell” does not work
Solution:
query for “brain stem cell” obtain 1,000 document summaries scan for the features in these summaries PARC, Aug 3, 2006
Other Web-derived Features: Abbreviation After the second word
tumor necrosis factor (NF)
right
After the third word
tumor necrosis (TN) factor
right
We query for, e.g., “
tumor necrosis tn factor
” Problems: Roman digits: IV, VI States: CA Short words: me PARC, Aug 3, 2006
Other Web-derived Features: Concatenation Consider health care reform healthcare : 79,500,000 carereform : 269 healthreform: 812 Adjacency model
healthcare vs. carereform
Dependency model
healthcare vs. healthreform
Triples
“healthcare reform ” vs. “ health carereform ”
PARC, Aug 3, 2006
Other Web-derived Features: Reorder Reorders for “health care reform” “
care reform
health”
right
“reform
health care
”
left
PARC, Aug 3, 2006
Other Web-derived Features: Internal Inflection Variability Vary inflection of second word
tyrosine kinase activation tyrosine kinase
s
activation
PARC, Aug 3, 2006
Other Web-derived Features: Switch The First Two Words Predict
right
, if we can reorder
adult male rat male adult rat
as PARC, Aug 3, 2006
Paraphrases
The semantics of a noun compound is often made overt by a paraphrase (Warren,1978) Prepositional
stem cells
in
the brain
cells
from
the brain stem right
right
Verbal
virus
causing
human immunodeficiency
left
Copula
office building
that is
a skyscraper
right
PARC, Aug 3, 2006
Paraphrases
prepositional paraphrases: We use: ~150 prepositions verbal paraphrases: We use: associated with, caused by, contained in, derived
from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for.
copula paraphrases: We use: is/was and that/which/who optional elements: articles: a, an, the quantifiers: some, every, etc. pronouns: this, these, etc. PARC, Aug 3, 2006
Evaluation: Datasets
Lauer Set
244 noun compounds (NCs) from Grolier’s encyclopedia inter-annotator agreement: 81.5%
Biomedical Set
430 NCs from MEDLINE inter-annotator agreement: 88% ( =.606) PARC, Aug 3, 2006
Evaluation: Experiments
Exact phrase queries Limited to English
Inflections:
Lauer Set: Carroll’s morphological tools Biomedical Set: UMLS Specialist Lexicon PARC, Aug 3, 2006
Co-occurrence Statistics
Lauer set Bio set PARC, Aug 3, 2006
Paraphrase and Surface Features Performance
Lauer Set
Biomedical Set
PARC, Aug 3, 2006
Individual Surface Features Performance: Bio
PARC, Aug 3, 2006
Individual Surface Features Performance: Bio
PARC, Aug 3, 2006
Results Lauer
PARC, Aug 3, 2006
Results: Comparing with Others
PARC, Aug 3, 2006
Results Bio
PARC, Aug 3, 2006
Summary: Results for Noun Compound Bracketing
Introduced search engine statistics that go beyond the n-gram (applicable to other tasks) surface features paraphrases Obtained new state-of-the-art results on NC bracketing more robust than Lauer (1995) more accurate than Keller&Lapata (2004) PARC, Aug 3, 2006
Prepositional Phrase Attachment
(a) Peter spent millions of dollars.
(b) Peter spent time (
noun attach
) with his family. (
verb attach
) quadruple: (v, n1, p, n2) (a) (b) (spent, millions, of, dollars) (spent, time, with, family) PARC, Aug 3, 2006
Noun Phrase Coordination
(Modified) real sentence:
The Department of Chronic Diseases
and
Health Promotion leads
and
strengthens global efforts to prevent
and
control chronic diseases
or
disabilities
and
to promote health
and
quality of life .
PARC, Aug 3, 2006
NC coordination: ellipsis
Ellipsis
car and truck production
means
car production and truck production
No ellipsis
president and chief executive
All-way coordination
Securities and Exchange Commission
PARC, Aug 3, 2006
Results
428 examples from Penn TB PARC, Aug 3, 2006
Semantic Relation Detection
Goal: automatically augment a lexical database Many potential relation types: ISA (hypernymy/hyponymy) Part-Of (meronymy) Idea: find unambiguous contexts which (nearly) always indicate the relation of interest PARC, Aug 3, 2006
Lexico-Syntactic Patterns
PARC, Aug 3, 2006
Lexico-Syntactic Patterns
PARC, Aug 3, 2006
Adding a New Relation
PARC, Aug 3, 2006
Semantic Relation Detection
Lexico-syntactic Patterns: Should occur frequently in text Should (nearly) always suggest the relation of interest Should be recognizable with little pre-encoded knowledge.
These patterns have been used extensively by other researchers.
PARC, Aug 3, 2006
Semantic Relation Detection
What relationship holds between two nouns?
olive oil – oil comes from olives machine oil – oil used on machines Assigning the meaning relations between these terms has been seen as a very difficult solution Our solution: Use clever queries against the web to figure out the relations.
PARC, Aug 3, 2006
Queries for Semantic Relations
Convert the noun-noun compound into a query of the form: noun2 that * noun1 “oil that * olive(s)” This returns search result snippets containing interesting verbs.
In this case: Come from Be obtained from Be extracted from Made from … PARC, Aug 3, 2006
Queries for Semantic Relations
More examples: Migraine drug -> treat, be used for, reduce, prevent Wrinkle drug -> treat, be used for, reduce, smooth Printer tray -> hold, come with, be folded, fit under, be inserted into Student protest -> be led by, be sponsored by, pit, be, be organized by PARC, Aug 3, 2006
Conclusions
The enormous size of the web opens new opportunities for text analysis There are many words, but they are more likely to appear together in a huge dataset This allows us to do word-specific analysis Unambiguous + Unlimited = Unsupervised We’ve applied it to structural and semantic language problems.
These are stepping stones towards sophisticated language understanding.
PARC, Aug 3, 2006
Thank you!
http://biotext.berkeley.edu
Supported in part by NSF DBI-0317510
Using n-grams to make predictions
Say trying to distinguish: [home health] care home [health care] Main idea: compare these co-occurrence probabilities “home health” vs “health care” PARC, Aug 3, 2006
Using n-grams to make predictions
Use search engines page hits as a proxy for n-gram counts compare Pr(w 1
w
2 |w 2 ) to Pr(w 1
w
3 |w 3 ) Pr(w 1
w
2 |w 2 ) = #(w 1 ,
w
2 ) / #(w 2 ) #(w 2 ) word frequency; query for “w 2 ” #(w 1 ,
w
2 ) bigram frequency; query for “w 1
w
2 ” PARC, Aug 3, 2006
Probabilities: Why? (1)
Why should we use: (a) Pr(w 1
w
2 |w 2 ), (b) Pr(w 2
w
1 |w 1 ) ?
rather than Keller&Lapata (2004) calculate: AltaVista queries:
(a): 70.49%
(b): 68.85% British National Corpus: (a): 63.11%
(b): 65.57%
PARC, Aug 3, 2006
Probabilities: Why? (2)
Why should we use: (a) Pr(w 1
w
2 |w 2 ), (b) Pr(w 2
w
1 |w 1 ) ?
rather than Maybe to introduce a bracketing prior.
Just like Lauer (1995) did.
But otherwise, no reason to prefer either one.
Do we need probabilities? ( association is OK) Do we need a directed model? ( symmetry is OK) PARC, Aug 3, 2006
Adjacency & Dependency (2)
right
bracketing: [w 1 [w 2
w
3 ] ]
w
2
w
3 is a compound (modified by w 1 )
w
1 and w 2 independently modify w 3
adjacency model
Is w 2
w
3 a compound?
(vs. w 1
w
2 being a compound)
dependency model
Does w 1 (vs. w 1 modify w 3 ? modifying w 2 )
w
1
w
1
w
1
w
2
w
2
w
2
w
3
w
3
w
3 PARC, Aug 3, 2006
Paraphrases: pattern (1) (1)v n1
p n2
v
n2
n1 ( noun )
Can we turn “ n1 p n2 ” into a noun compound “ n2 n1 ”?
meet/v demands/n1 from/p customers/n2
meet/v the customer/n2 demands/n1
gave/v an apple/n1 to/p him/n2 gave/v him/n2 an apple/n1
Solution: Problem: ditransitive verbs like give no determiner before n1 determiner before n2 is required the preposition cannot be to PARC, Aug 3, 2006
Paraphrases: pattern (2) (2)v n1
p n2
v
p n2
n1 ( verb )
If “ p n2 ” is an indirect object of v , then it could be switched with the direct object n1 .
had/v a program/n1 had/v in/p place/n2 in/p place/n2 a program/n1
Determiner before n1 is required to prevent “ n2 n1 ” from forming a noun compound.
PARC, Aug 3, 2006
Paraphrases: pattern (3) (3)v n1
p n2 p n2
* v n1 ( verb )
“*” indicates a wildcard position (up to three intervening words are allowed) Looks for appositions, where the PP has moved in front of the verb, e.g.
I gave/v an apple/n1 to/p him/n2
to/p him/n2 I gave/v an apple/n1
PARC, Aug 3, 2006
Paraphrases: pattern (4) (4)v n1 p n2
n1 p n2 v ( noun )
Looks for appositions, where “ n1 p n2 ” has moved in front of v
shaken/v confidence/n1 in/p markets/n2
confidence/n1 in/p markets/n2 shaken/v
PARC, Aug 3, 2006
Paraphrases: pattern (5) (5)v n1 p n2
v PRONOUN p n2 ( verb )
n1 verb (Hindle&Rooth, 93) Pattern (5) substitutes n1 (
him
or
her
), e.g.
with a dative pronoun
put/v a client/n1 put/v
him
at/p odds/n2 at/p odds/n2
PARC, Aug 3, 2006
Paraphrases: pattern (6)
(6) v
n1 p n2
BE
n1 p n2 ( noun )
BE
Pattern (6) substitutes
v
or
are
), e.g.
with a form of to be (
eat/v
spaghetti/n1 with/p sauce/n2
is
spaghetti/n1 with/p sauce/n2
is
PARC, Aug 3, 2006