Faceted Metadata for Information Architecture and Search



Unambiguous + Unlimited = Unsupervised or Using the Web for Natural Language Processing Problems

Marti Hearst, School of Information, UC Berkeley. This research supported in part by NSF DBI-0317510.

Natural Language Processing

   The ultimate goal: write programs that read and understand stories and conversations.

This is too hard! Instead we tackle sub-problems.

There have been notable successes lately:
- Machine translation is vastly improved
- Decent speech recognition in limited circumstances
- Text categorization works with some accuracy

Automatic Help Desk Translation at MS


Why is text analysis difficult?

 One reason: enormous vocabulary size.

The average English speaker's vocabulary is around 50,000 words. Many of these can be combined with many others, and they mean different things when they do!


How can a machine understand these?

   Decorate the cake with the frosting.

Decorate the cake with the kids.

Throw out the cake with the frosting.

  Get the sock from the cat with the gloves.

Get the glove from the cat with the socks.

  It’s in the plastic water bottle.

It’s in the plastic bag dispenser.


How to tackle this problem?

   The field was stuck for quite some time.

CYC: hand-enter all semantic concepts and relations.
A new approach started around 1990. How to do it:
- Get large text collections
- Compute statistics over the words in those collections
- Many different algorithms exist for doing this


Size Matters

 Recent realization: bigger is better than smarter!

Banko and Brill '01: "Scaling to Very, Very Large Corpora for Natural Language Disambiguation", ACL.

Example Problem

Grammar checker example: which word to use, principal or principle? Solution: look at which words surround each use:

I am in my third year as the principal of Anamosa High School. School-principal transfers caused some upset.

This is a simple formulation of the quantum mechanical uncertainty principle. Power without principle is barren, but principle without power is futile. (Tony Blair)


Using Very, Very Large Corpora

Keep track of which words are the neighbors of each spelling in well-edited text, e.g.:
- Principal: "high school"
- Principle: "rule"
At grammar-check time, choose the spelling best predicted by the surrounding words.

 Surprising results:   Log-linear improvement even to a billion words!

Getting more data is better than fine-tuning algorithms!

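A minimal sketch of this idea (not the talk's implementation): assume we already have neighbor-word counts gathered from well-edited text, and pick whichever spelling is better predicted by the observed context words. All counts below are made up.

```python
from collections import defaultdict

# Toy neighbor counts gathered from well-edited text (hypothetical numbers).
# neighbor_counts[spelling][context_word] = how often they co-occur.
neighbor_counts = {
    "principal": defaultdict(int, {"high": 900, "school": 1200, "rule": 10}),
    "principle": defaultdict(int, {"high": 30, "school": 40, "rule": 800}),
}

def choose_spelling(context_words, candidates=("principal", "principle")):
    """Pick the candidate spelling best predicted by the surrounding words."""
    def score(spelling):
        return sum(neighbor_counts[spelling][w] for w in context_words)
    return max(candidates, key=score)

print(choose_spelling(["high", "school"]))   # -> principal
print(choose_spelling(["rule"]))             # -> principle
```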

The Effects of LARGE Datasets

(Figure from Banko & Brill '01: accuracy continues to improve as the training corpus grows.)

How to Extend this Idea?

  This is an exciting result … BUT relies on having huge amounts of text that has been appropriately annotated!


How to Avoid Labeling?

 “Web as a baseline” (Lapata & Keller 04,05)  Main idea: apply web-determined counts to every problem imaginable.

Example: for each candidate t, compute the web count f(w1, t, w2); the candidate with the largest count wins.
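A minimal sketch of this count-and-compare step. The hit counts below are a toy stand-in for search-engine queries; the phrasing of the queries is illustrative only.

```python
# Toy stand-in for search-engine hit counts of exact phrases (made-up numbers).
toy_hits = {
    '"cells from the brain"': 5_200,
    '"cells in the brain"': 48_000,
    '"cells of the brain"': 31_000,
}

def web_count(query):
    """In the real setting this would be a search-engine hit count; here it is a toy lookup."""
    return toy_hits.get(query, 0)

def best_candidate(w1, w2, candidates):
    """Web-as-baseline selection: the candidate t with the largest count f(w1, t, w2) wins."""
    return max(candidates, key=lambda t: web_count(f'"{w1} {t} the {w2}"'))

print(best_candidate("cells", "brain", ["from", "in", "of"]))  # -> "in"
```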

Web as a Baseline

Works very well in some cases:
- machine translation candidate selection
- article generation
- noun compound interpretation
- noun compound bracketing
- adjective ordering
But lacking in others:
- spelling correction
- countability detection
- prepositional phrase attachment
How to push this idea further?

(On the slide, tasks were marked either "significantly better than the best supervised algorithm" or "not significantly different from the best supervised.")


Using Unambiguous Cases

The trick: look for unambiguous cases to start. Use these to improve the results beyond what co-occurrence statistics indicate.

An early example: Hindle and Rooth, "Structural Ambiguity and Lexical Relations", ACL '90, Comp. Ling. '93. Problem: prepositional phrase attachment.

I eat/v spaghetti/n1 with/p a fork/n2.
I eat/v spaghetti/n1 with/p sauce/n2.

Quadruple: (v, n1, p, n2). Question: does n2 attach to v or to n1?


Using Unambiguous Cases

    How to do this with unlabeled data?

First try:
- Parse some text into phrase structure
- Then compute certain co-occurrences: f(v, n1, p), f(n1, p), f(v, n1)
- Problem: results not accurate enough
The trick: look for unambiguous cases:
- "Spaghetti with sauce is delicious." (pre-verbal, so the PP must attach to the noun)
- "I eat it with a fork." (the PP cannot attach to a pronoun object, so it must attach to the verb)
Use these to improve the results beyond what co-occurrence statistics indicate.

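A minimal sketch of the unambiguous-cases idea, not the Hindle & Rooth implementation: harvest attachment counts only from configurations that are unambiguous (pre-verbal PPs, pronoun objects), then use those counts to score ambiguous (v, n1, p) triples. All counts are illustrative.

```python
from collections import Counter

# Counts harvested only from unambiguous configurations (toy numbers):
# pre-verbal PPs -> noun attachment; pronoun objects -> verb attachment.
noun_attach = Counter({("spaghetti", "with"): 40, ("demands", "from"): 15})
verb_attach = Counter({("eat", "with"): 120, ("put", "at"): 60})

def attach(v, n1, p):
    """Prefer the attachment whose unambiguously-observed count is higher."""
    return "verb" if verb_attach[(v, p)] >= noun_attach[(n1, p)] else "noun"

print(attach("eat", "spaghetti", "with"))  # -> "verb" (eat ... with a fork)
```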

Unambiguous + Unlimited = Unsupervised

Apply the Unambiguous Case idea to the Very, Very Large Corpora idea. The potential of these approaches is not fully realized.
Our work:
- Structural Ambiguity Decisions (work with Preslav Nakov): PP-attachment, noun compound bracketing, coordination grouping
- Semantic Relation Acquisition: hypernym (ISA) relations, verbal relations between nouns

Structural Ambiguity Problems

Apply the U + U = U idea to structural ambiguity:
- Noun compound bracketing
- Prepositional phrase attachment
- Noun phrase coordination
Motivation: BioText project

In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF).

Open-labeled long-term study of the subcutaneous sumatriptan efficacy and tolerability in acute migraine treatment.

BimL protein interact with Bcl-2 or Bcl-XL, or Bcl-w proteins (Immuno precipitation (anti-Bcl-2 OR Bcl-XL or Bcl-w)) followed by Western blot (anti-EEtag) using extracts human 293T cells co-transfected with EE tagged BimL and (bcl-2 or bcl-XL or bcl-w) plasmids)


Applying U + U = U to Structural Ambiguity

We introduce the use of (nearly) unambiguous features:
- surface features
- paraphrases
Combined with very, very large corpora, these achieve state-of-the-art results without labeled examples.


Noun Compound Bracketing

(a) [ [ liver cell ] antibody ]  (left bracketing)
(b) [ liver [ cell line ] ]  (right bracketing)

In (a), the antibody targets the liver cell.
In (b), the cell line is derived from the liver.

Dependency Model

 

Right bracketing: [w1 [w2 w3]]
- w2 w3 is a compound (modified by w1): "home health care"
- or w1 and w2 independently modify w3: "adult male rat"
Left bracketing: [[w1 w2] w3]
- only one modificational choice possible: "law enforcement officer"

Related Work

- Marcus (1980), Pustejovsky et al. (1993), Resnik (1993): adjacency model: Pr(w1|w2) vs. Pr(w2|w3)
- Lauer (1995): dependency model: Pr(w1|w2) vs. Pr(w1|w3)
- Keller & Lapata (2004): use the Web; unigrams and bigrams
- Girju et al. (2005): supervised model; bracketing in context; requires WordNet senses to be given

Our approach:

• Web as data
• χ² and n-grams
• paraphrases
• surface features

Computing Bigram Statistics

Dependency model, frequencies:
- Compare #(w1, w2) to #(w1, w3)
Dependency model, probabilities:
- Pr(left) = Pr(w1 → w2 | w2) · Pr(w2 → w3 | w3)
- Pr(right) = Pr(w1 → w3 | w3) · Pr(w2 → w3 | w3)
Since Pr(w2 → w3 | w3) is a common factor, we compare Pr(w1 → w2 | w2) to Pr(w1 → w3 | w3).

Probabilities: Estimation

Using page hits as a proxy for n-gram counts:
- Pr(w1 → w2 | w2) = #(w1, w2) / #(w2)
- #(w2): word frequency; query for "w2"
- #(w1, w2): bigram frequency; query for "w1 w2"
- Counts are smoothed by 0.5.

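A minimal sketch of this estimation, with a toy lookup standing in for search-engine page hits; the exact way the 0.5 smoothing is applied is an assumption, and all counts are made up.

```python
# Toy page-hit counts standing in for search-engine queries (made-up numbers).
toy_hits = {"health": 9_000_000, "care": 12_000_000,
            "home health": 120_000, "home care": 80_000}

def hits(query):
    return toy_hits.get(query, 0)

def prob(w_left, w_head):
    """Pr(w_left -> w_head | w_head) = #(w_left, w_head) / #(w_head).
    Smoothing by 0.5 applied to both counts (illustrative choice)."""
    return (hits(f"{w_left} {w_head}") + 0.5) / (hits(w_head) + 0.5)

def bracket(w1, w2, w3):
    """Dependency model: compare Pr(w1 -> w2 | w2) with Pr(w1 -> w3 | w3)."""
    return "left" if prob(w1, w2) >= prob(w1, w3) else "right"

print(bracket("home", "health", "care"))  # -> "left": [[home health] care]
```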

Association Models: χ² (Chi-Squared)

- A = #(wi, wj)
- B = #(wi) − #(wi, wj)
- C = #(wj) − #(wi, wj)
- D = N − (A + B + C)
- N = 8 trillion (= A + B + C + D), estimated as 8 billion Web pages × 1,000 words each

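A minimal sketch of computing the 2×2 χ² association score from those four cells (the standard 2×2 formula; the counts below are illustrative, not from the talk).

```python
def chi_squared(a, b, c, d):
    """Standard 2x2 chi-squared statistic over cells A, B, C, D."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

N = 8_000_000_000 * 1_000                        # 8 billion pages x 1,000 words each
cooc, wi, wj = 120_000, 9_000_000, 12_000_000    # toy counts: #(wi,wj), #(wi), #(wj)
A = cooc
B = wi - cooc
C = wj - cooc
D = N - (A + B + C)
print(chi_squared(A, B, C, D))
```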

Web-derived Surface Features

Authors often disambiguate noun compounds using surface markers, e.g.:
- amino-acid sequence → left
- brain stem's cell → left
- brain's stem cell → right
The enormous size of the Web makes these markers frequent enough to be useful.


Web-derived Surface Features: Dash (Hyphen)

- Left dash: cell-cycle analysis → left
- Right dash: donor T-cell → right (but fiber optics-system should be left...)
- Double dash: T-cell-depletion → unusable

Web-derived Surface Features: Possessive Marker

- Attached to the first word: brain's stem cell → right
- Attached to the second word: brain stem's cell → left
- Combined features: brain's stem-cell → right


Web-derived Surface Features: Capitalization

- don't-care – lowercase – uppercase: Plasmodium vivax Malaria → left; plasmodium vivax Malaria → left
- lowercase – uppercase – don't-care: brain Stem cell → right; brain Stem Cell → right
- Disable this on Roman digits and single-letter words, e.g. vitamin D deficiency

Web-derived Surface Features: Embedded Slash

- Left embedded slash: leukemia/lymphoma cell → right


Web-derived Surface Features: Parentheses

- Single-word: growth factor (beta) → left; (brain) stem cell → right
- Two-word: (growth factor) beta → left; brain (stem cell) → right


Web-derived Surface Features: Comma, Dot, Semicolon

- Following the first word: home. health care → right; adult, male rat → right
- Following the second word: health care, provider → left; lung cancer: patients → left


Web-derived Surface Features: Dash to External Word

- External word to the left: mouse-brain stem cell → right
- External word to the right: tumor necrosis factor-alpha → left


Web-derived Surface Features: Problems & Solutions

- Problem: search engines ignore punctuation in queries, so "brain-stem cell" does not work as a query.
- Solution: query for "brain stem cell", obtain 1,000 document summaries, and scan for the features in those summaries.
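A minimal sketch of that scan, assuming a hypothetical get_snippets helper in place of a real search API; the regexes cover only the hyphen features and are illustrative.

```python
import re

def get_snippets(query, n=1000):
    """Hypothetical stand-in for fetching n result summaries from a search engine."""
    return []  # wire up a real search API here

def count_dash_features(w1, w2, w3):
    """Scan result snippets for left-dash ("w1-w2 w3") vs right-dash ("w1 w2-w3") markers."""
    left_pat = re.compile(rf"\b{w1}-{w2}\s+{w3}\b", re.IGNORECASE)
    right_pat = re.compile(rf"\b{w1}\s+{w2}-{w3}\b", re.IGNORECASE)
    left = right = 0
    for snippet in get_snippets(f'"{w1} {w2} {w3}"'):
        left += len(left_pat.findall(snippet))
        right += len(right_pat.findall(snippet))
    return left, right

# e.g. count_dash_features("brain", "stem", "cell")
```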

Other Web-derived Features: Abbreviation

- Abbreviation after the second word: tumor necrosis (TN) factor → left
- Abbreviation after the third word: tumor necrosis factor (NF) → right
We query for, e.g., "tumor necrosis tn factor". Problems: Roman digits (IV, VI), US states (CA), short words (me).

Other Web-derived Features: Concatenation

Consider "health care reform":
- healthcare: 79,500,000
- carereform: 269
- healthreform: 812
Adjacency model: healthcare vs. carereform
Dependency model: healthcare vs. healthreform
Triples: "healthcare reform" vs. "health carereform"


Other Web-derived Features: Reorder

Reorders for "health care reform":
- "care reform health" → right
- "reform health care" → left


Other Web-derived Features: Internal Inflection Variability

Vary the inflection of the second word:
- tyrosine kinase activation
- tyrosine kinases activation


Other Web-derived Features: Switch the First Two Words

Predict right if we can reorder "adult male rat" as "male adult rat".

Paraphrases

The semantics of a noun compound is often made overt by a paraphrase (Warren, 1978):
- Prepositional: "stem cells in the brain" → right; "cells from the brain stem" → left
- Verbal: "virus causing human immunodeficiency" → left
- Copula: "office building that is a skyscraper" → right

Paraphrases

- Prepositional paraphrases: we use ~150 prepositions
- Verbal paraphrases: we use associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to, and used by/in/for
- Copula paraphrases: we use is/was and that/which/who
- Optional elements: articles (a, an, the), quantifiers (some, every, etc.), pronouns (this, these, etc.)
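A minimal sketch of generating prepositional paraphrase queries for a three-word compound; the prepositions and optional articles shown are a small illustrative subset, not the full ~150-preposition list used in the work.

```python
from itertools import product

def paraphrase_queries(w1, w2, w3, preps=("of", "in", "from", "for")):
    """Generate left- vs right-bracketing prepositional paraphrases for (w1 w2 w3).

    Left  [[w1 w2] w3]: "w3 <prep> (the) w1 w2"   e.g. "cell from the brain stem"
    Right [w1 [w2 w3]]: "w2 w3 <prep> (the) w1"   e.g. "stem cell in the brain"
    """
    left, right = [], []
    for p, art in product(preps, ("", "the ")):
        left.append(f'"{w3} {p} {art}{w1} {w2}"')
        right.append(f'"{w2} {w3} {p} {art}{w1}"')
    return left, right

left_q, right_q = paraphrase_queries("brain", "stem", "cell")
# Query each phrase against a search engine; the bracketing whose
# paraphrases collectively get more hits wins.
```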

Evaluation: Datasets

Lauer set:
- 244 noun compounds (NCs)
- from Grolier's encyclopedia
- inter-annotator agreement: 81.5%

Biomedical set:
- 430 NCs
- from MEDLINE
- inter-annotator agreement: 88% (κ = .606)

Evaluation: Experiments

 

- Exact phrase queries
- Limited to English
- Inflections:
  - Lauer set: Carroll's morphological tools
  - Biomedical set: UMLS Specialist Lexicon

Co-occurrence Statistics

(Results for the Lauer set and the Bio set shown on the slide.)

Paraphrase and Surface Features Performance

(Performance results for the Lauer set and the Biomedical set shown on the slide.)

Individual Surface Features Performance: Bio

(Results figure shown on the slide.)

Results Lauer


Results: Comparing with Others


Results Bio


Summary: Results for Noun Compound Bracketing

- Introduced search engine statistics that go beyond the n-gram (applicable to other tasks): surface features, paraphrases
- Obtained new state-of-the-art results on NC bracketing: more robust than Lauer (1995), more accurate than Keller & Lapata (2004)

Prepositional Phrase Attachment

(a) Peter spent millions of dollars. (noun attach)
(b) Peter spent time with his family. (verb attach)

Quadruple: (v, n1, p, n2)
(a) (spent, millions, of, dollars)
(b) (spent, time, with, family)

Noun Phrase Coordination

(Modified) real sentence: "The Department of Chronic Diseases and Health Promotion leads and strengthens global efforts to prevent and control chronic diseases or disabilities and to promote health and quality of life."

NC coordination: ellipsis

- Ellipsis: "car and truck production" means "car production and truck production"
- No ellipsis: "president and chief executive"
- All-way coordination: "Securities and Exchange Commission"

Results

428 examples from the Penn Treebank.

Semantic Relation Detection

Goal: automatically augment a lexical database.
Many potential relation types:
- ISA (hypernymy/hyponymy)
- Part-Of (meronymy)
Idea: find unambiguous contexts which (nearly) always indicate the relation of interest.

Lexico-Syntactic Patterns



Adding a New Relation


Semantic Relation Detection

Lexico-syntactic patterns:
- Should occur frequently in text
- Should (nearly) always suggest the relation of interest
- Should be recognizable with little pre-encoded knowledge

These patterns have been used extensively by other researchers.

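A minimal sketch of matching one classic ISA pattern, "NP such as NP (, NP)* (and|or NP)". The regex-only matching and the example sentence are illustrative; real systems use part-of-speech tags or parses rather than bare regexes.

```python
import re

# Simplified "X such as Y, Z and W" matcher (hypernym before "such as").
SUCH_AS = re.compile(r"(\w+)\s+such as\s+([\w ,]+)")

def isa_pairs(text):
    """Extract (hyponym, hypernym) pairs from 'X such as Y, Z and W' contexts."""
    pairs = []
    for hypernym, tail in SUCH_AS.findall(text):
        for hyponym in re.split(r",\s*|\s+and\s+|\s+or\s+", tail):
            hyponym = hyponym.strip()
            if hyponym:
                pairs.append((hyponym, hypernym))
    return pairs

print(isa_pairs("injuries such as bruises, wounds and broken bones"))
# -> [('bruises', 'injuries'), ('wounds', 'injuries'), ('broken bones', 'injuries')]
```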

Semantic Relation Detection

  What relationship holds between two nouns?

- olive oil: oil that comes from olives
- machine oil: oil used on machines
Assigning the meaning relations between these terms has been seen as a very difficult problem. Our solution: use clever queries against the web to figure out the relations.


Queries for Semantic Relations

Convert the noun-noun compound into a query of the form "noun2 that * noun1", e.g. "oil that * olive(s)". This returns search result snippets containing interesting verbs.

In this case: come from, be obtained from, be extracted from, made from, …
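A minimal sketch of this query step; the snippet handling and verb extraction below are rough placeholders, not the actual pipeline, which would lemmatize and part-of-speech tag the snippets.

```python
def relation_query(noun1, noun2):
    """Build the wildcard query "noun2 that * noun1(s)" for a noun-noun compound."""
    return f'"{noun2} that * {noun1}" OR "{noun2} that * {noun1}s"'

def extract_verbs(snippets, noun2):
    """Very rough verb extraction: keep the word right after 'noun2 that' in each snippet."""
    verbs = []
    for s in snippets:
        for chunk in s.lower().split(f"{noun2} that ")[1:]:
            verbs.append(chunk.split()[0])
    return verbs

print(relation_query("olive", "oil"))
# -> '"oil that * olive" OR "oil that * olives"'
print(extract_verbs(["Oil that comes from hand-picked olives ..."], "oil"))
# -> ['comes']
```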

Queries for Semantic Relations

More examples:
- migraine drug → treat, be used for, reduce, prevent
- wrinkle drug → treat, be used for, reduce, smooth
- printer tray → hold, come with, be folded, fit under, be inserted into
- student protest → be led by, be sponsored by, pit, be, be organized by

Conclusions

- The enormous size of the web opens new opportunities for text analysis
- There are many words, but they are more likely to appear together in a huge dataset
- This allows us to do word-specific analysis
- Unambiguous + Unlimited = Unsupervised
- We've applied it to structural and semantic language problems

 These are stepping stones towards sophisticated language understanding.


Thank you!

http://biotext.berkeley.edu

Supported in part by NSF DBI-0317510

Using n-grams to make predictions

Say we are trying to distinguish [home health] care from home [health care]. Main idea: compare the co-occurrence probabilities of "home health" vs. "health care".

Using n-grams to make predictions

Use search-engine page hits as a proxy for n-gram counts; compare Pr(w1 → w2 | w2) to Pr(w1 → w3 | w3), where:
- Pr(w1 → w2 | w2) = #(w1, w2) / #(w2)
- #(w2): word frequency; query for "w2"
- #(w1, w2): bigram frequency; query for "w1 w2"

Probabilities: Why? (1)

Why should we use (a) Pr(w1 → w2 | w2) rather than (b) Pr(w2 → w1 | w1)?
Keller & Lapata (2004) calculate both:
- AltaVista queries: (a) 70.49%, (b) 68.85%
- British National Corpus: (a) 63.11%, (b) 65.57%

Probabilities: Why? (2)

Why should we use (a) Pr(w1 → w2 | w2) rather than (b) Pr(w2 → w1 | w1)?
- Maybe to introduce a bracketing prior, just as Lauer (1995) did.
- But otherwise, there is no reason to prefer either one.
- Do we need probabilities? (association measures are OK)
- Do we need a directed model? (symmetry is OK)

Adjacency & Dependency (2)

  

Right bracketing: [w1 [w2 w3]]
- either w2 w3 is a compound (modified by w1)
- or w1 and w2 independently modify w3
Adjacency model: is w2 w3 a compound (vs. w1 w2 being a compound)?
Dependency model: does w1 modify w3 (vs. w1 modifying w2)?

Paraphrases: Pattern (1)

(1) v n1 p n2 → v n2 n1 (noun attachment)

Can we turn "n1 p n2" into a noun compound "n2 n1"?
- meet/v demands/n1 from/p customers/n2 → meet/v the customer/n2 demands/n1
Problem: ditransitive verbs like give:
- gave/v an apple/n1 to/p him/n2 → gave/v him/n2 an apple/n1
Solution:
- no determiner before n1
- a determiner before n2 is required
- the preposition cannot be "to"

Paraphrases: Pattern (2)

(2) v n1 p n2 → v p n2 n1 (verb attachment)

If "p n2" is an indirect object of v, it can be switched with the direct object n1:
- had/v a program/n1 in/p place/n2 → had/v in/p place/n2 a program/n1
A determiner before n1 is required to prevent "n2 n1" from forming a noun compound.

Paraphrases: Pattern (3)

(3) v n1 p n2 → p n2 * v n1 (verb attachment)

"*" indicates a wildcard position (up to three intervening words are allowed). Looks for appositions, where the PP has moved in front of the verb, e.g.:
- I gave/v an apple/n1 to/p him/n2 → to/p him/n2 I gave/v an apple/n1

Paraphrases: Pattern (4)

(4) v n1 p n2 → n1 p n2 v (noun attachment)

Looks for appositions, where "n1 p n2" has moved in front of v:
- shaken/v confidence/n1 in/p markets/n2 → confidence/n1 in/p markets/n2 shaken/v

Paraphrases: Pattern (5)

(5) v n1 p n2 → v PRONOUN p n2 (verb attachment)

Following Hindle & Rooth ('93), pattern (5) substitutes n1 with a dative pronoun (him or her), e.g.:
- put/v a client/n1 at/p odds/n2 → put/v him at/p odds/n2

Paraphrases: Pattern (6)

(6) v n1 p n2 → BE n1 p n2 (noun attachment)

Pattern (6) substitutes v with a form of to be (is or are), e.g.:
- eat/v spaghetti/n1 with/p sauce/n2 → is spaghetti/n1 with/p sauce/n2
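A minimal sketch of turning a (v, n1, p, n2) quadruple into a few of these paraphrase query strings; only patterns (1), (2), and (6) are shown, and the inserted determiners are illustrative.

```python
def pp_paraphrase_queries(v, n1, p, n2):
    """Exact-phrase queries for a few of the paraphrase patterns above.
    Pattern numbers follow the slides; determiners are illustrative choices."""
    return {
        "noun_attach_pattern1": f'"{v} the {n2} {n1}"',    # v n2 n1
        "verb_attach_pattern2": f'"{v} {p} {n2} a {n1}"',  # v p n2 n1
        "noun_attach_pattern6": f'"is {n1} {p} {n2}"',     # BE n1 p n2
    }

for name, query in pp_paraphrase_queries("eat", "spaghetti", "with", "sauce").items():
    print(name, query)
# Each query would be issued to a search engine; relative hit counts
# vote for noun vs. verb attachment.
```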