INTRODUCTION TO ARTIFICIAL INTELLIGENCE


Massimo Poesio
LECTURE 16: Unsupervised methods, IR, and lexical acquisition

FEATURE-DEFINED DATA SPACE

[Figure: unlabeled data points scattered in a feature-defined space]

UNSUPERVISED MACHINE LEARNING

• In many cases, what we want to learn is not a target function from examples to classes, but what the classes themselves are
  – I.e., learn without being told

EXAMPLE: TEXT CLASSIFICATION

• Consider clustering a large set of computer science documents
[Figure: documents grouped into regions labelled NLP, Graphics, Arch., Theory, AI]

CLUSTERING

• Partition unlabeled examples into disjoint subsets, or clusters, such that:
  – Examples within a cluster are very similar
  – Examples in different clusters are very different
• Discover new categories in an unsupervised manner (no sample category labels provided).

Deciding what a new doc is about

• Check which region the new doc falls into – can output “softer” decisions as well.
[Figure: a new document falling into the AI region among the NLP, Graphics, Arch., Theory, AI clusters]

Hierarchical Clustering

• Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.
[Figure: dendrogram with animal split into vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean)]
• Recursive application of a standard clustering algorithm can produce a hierarchical clustering.

Agglomerative vs. Divisive Clustering

• Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
• Divisive (partitional, top-down) methods separate all examples immediately into clusters.

Direct Clustering Method

• Direct clustering methods require a specification of the number of clusters, k, desired.
• A clustering evaluation function assigns a real-valued quality measure to a clustering.
• The number of clusters can be determined automatically by explicitly generating clusterings for multiple values of k and choosing the best result according to a clustering evaluation function.

Hierarchical Agglomerative Clustering (HAC)

• Assumes a similarity function for determining the similarity of two instances.
• Starts with each instance in its own cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster.
• The history of merging forms a binary tree or hierarchy.

Cluster Similarity

• Assume a similarity function that determines the similarity of two instances: sim(x, y).
  – E.g., cosine similarity of document vectors.
• How do we compute the similarity of two clusters, each possibly containing multiple instances?
  – Single link: similarity of the two most similar members.
  – Complete link: similarity of the two least similar members.
  – Group average: average similarity between members.
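A minimal sketch of HAC with the three linkage options above, using cosine similarity over instance vectors; the function names and the naive pairwise search are illustrative choices, not from the slides:

```python
import numpy as np

def cosine_sim(x, y):
    # Cosine similarity; assumes non-zero vectors
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def cluster_sim(a, b, X, linkage):
    # Pairwise similarities between all members of clusters a and b
    sims = [cosine_sim(X[i], X[j]) for i in a for j in b]
    if linkage == "single":        # similarity of the two most similar members
        return max(sims)
    if linkage == "complete":      # similarity of the two least similar members
        return min(sims)
    return sum(sims) / len(sims)   # group average

def hac(X, linkage="single"):
    """Naive agglomerative clustering; returns the history of merges (a binary tree)."""
    clusters = [[i] for i in range(len(X))]    # each instance starts in its own cluster
    history = []
    while len(clusters) > 1:
        # Find the most similar pair of clusters and merge it
        pairs = [(a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        a, b = max(pairs, key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]], X, linkage))
        history.append((clusters[a], clusters[b]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [clusters[a] + clusters[b]]
    return history
```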

Non-Hierarchical Clustering

• Typically must provide the number of desired clusters, k.
• Randomly choose k instances as seeds, one per cluster.
• Form initial clusters based on these seeds.
• Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering.
• Stop when clustering converges or after a fixed number of iterations.
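The procedure above is essentially k-means. A minimal sketch, assuming numeric feature vectors and Euclidean distance (parameter names are illustrative):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means: seed with k random instances, then reallocate until stable."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # k instances as seeds
    for _ in range(max_iters):
        # Assign each instance to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster became empty
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return labels, centroids
```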

CLUSTERING METHODS IN NLP

• Unsupervised techniques are heavily used in:
  – Text classification
  – Information retrieval
  – Lexical acquisition


Feature-based lexical semantics

• Very old idea in lexical semantics: the meaning of a word can be specified in terms of the values of certain `features' (`DECOMPOSITIONAL SEMANTICS')
  – dog: ANIMATE=+, EAT=MEAT, SOCIAL=+
  – horse: ANIMATE=+, EAT=GRASS, SOCIAL=+
  – cat: ANIMATE=+, EAT=MEAT, SOCIAL=-

FEATURE-BASED REPRESENTATIONS IN PSYCHOLOGY

• Feature-based concept representations are assumed by many cognitive psychology theories (Smith and Medin, 1981; McRae et al., 1997)
• Underpin the development of prototype theory (Rosch et al.)
• Used, e.g., to account for semantic priming (McRae et al., 1997; Plaut, 1995)
• Underlie much work on category-specific deficits (Warrington and Shallice, 1984; Caramazza and Shelton, 1998; Tyler et al., 2000; Vinson and Vigliocco, 2004)

SPEAKER-GENERATED FEATURES (VINSON AND VIGLIOCCO)

Vector-based lexical semantics

• If we think of the features as DIMENSIONS, we can view these meanings as VECTORS in a FEATURE SPACE
  – (An idea introduced by Salton in Information Retrieval, see below)

Vector-based lexical semantics

[Figure: CAT, DOG, and HORSE plotted as vectors in a feature space]

General characterization of vector-based semantics (from Charniak)

• Vectors as models of concepts
• The CLUSTERING approach to lexical semantics:
  1. Define the properties one cares about, and give values to each property (generally, numerical)
  2. Create a vector of length n for each item to be classified
  3. Viewing the n-dimensional vector as a point in n-space, cluster points that are near one another
• What changes between models:
  1. The properties used in the vector
  2. The distance metric used to decide if two points are `close'
  3. The algorithm used to cluster

Using words as features in a vector-based semantics

• The old decompositional semantics approach requires:
  i. Specifying the features
  ii. Characterizing the value of these features for each lexeme
• Simpler approach: use as features the WORDS that occur in the proximity of that word / lexical entry
  – Intuition: “You can tell a word’s meaning from the company it keeps”
• More specifically, you can use as `values' of these features:
  – The FREQUENCIES with which these words occur near the words whose meaning we are defining
  – Or perhaps the PROBABILITIES that these words occur next to each other
• Alternative: use the DOCUMENTS in which these words occur (e.g., LSA)
• Some psychological results support this view:
  – Lund, Burgess, et al. (1995, 1997): lexical associations learned this way correlate very well with priming experiments.
  – Landauer et al.: good correlation on a variety of topics, including human categorization & vocabulary tests.

Using neighboring words to specify lexical meanings

Learning the meaning of DOG from text


The lexicon we acquire

Meanings in word space

Acquiring lexical vectors from a corpus (Schuetze, 1991; Burgess and Lund, 1997)

• To construct vectors C(w) for each word w:
  1. Scan a text
  2. Whenever a word w is encountered, increment all cells of C(w) corresponding to the words v that occur in the vicinity of w, typically within a window of fixed size
• Differences among methods:
  – Size of window
  – Weighted or not
  – Whether every word in the vocabulary counts as a dimension (including function words such as the or and) or whether instead only some specially chosen words are used (typically, the m most common content words in the corpus; or perhaps modifiers only). The words chosen as dimensions are often called CONTEXT WORDS
  – Whether dimensionality reduction methods are applied
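A minimal sketch of this construction, assuming a pre-tokenized corpus, a fixed list of context words, and an unweighted window (all names are illustrative):

```python
from collections import defaultdict

def cooccurrence_vectors(tokens, context_words, window=5):
    """For each word w, count how often each context word occurs within +/- window of w."""
    context_index = {v: i for i, v in enumerate(context_words)}
    C = defaultdict(lambda: [0] * len(context_words))
    for pos, w in enumerate(tokens):
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for v in tokens[lo:pos] + tokens[pos + 1:hi]:
            if v in context_index:
                C[w][context_index[v]] += 1
    return dict(C)

# Toy example; in practice the tokens would come from a large corpus
tokens = "the dog chased the cat and the dog barked at the cat".split()
vectors = cooccurrence_vectors(tokens, ["dog", "cat", "chased", "barked"], window=2)
print(vectors["dog"])
```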

Variant: using probabilities (e.g., Dagan et al., 1997)

• E.g., for house, the context vector (using probabilities):
  – 0.001394 0.016212 0.003169 0.000734 0.001460 0.002901 0.004725 0.000598 0 0 0.008993 0.008322 0.000164 0.010771 0.012098 0.002799 0.002064 0.007697 0 0 0.001693 0.000624 0.001624 0.000458 0.002449 0.002732 0 0.008483 0.007929 0 0.001101 0.001806 0 0.005537 0.000726 0.011563 0.010487 0 0.001809 0.010601 0.000348 0.000759 0.000807 0.000302 0.002331 0.002715 0.020845 0.000860 0.000497 0.002317 0.003938 0.001505 0.035262 0.002090 0.004811 0.001248 0.000920 0.001164 0.003577 0.001337 0.000259 0.002470 0.001793 0.003582 0.005228 0.008356 0.005771 0.001810 0 0.001127 0.001225 0 0.008904 0.001544 0.003223 0
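In this variant each cell holds an estimated probability rather than a raw count; a minimal sketch of the conversion (the function name is illustrative):

```python
def to_probabilities(count_vector):
    """Normalize raw co-occurrence counts into relative frequencies (probability estimates)."""
    total = sum(count_vector)
    return [c / total for c in count_vector] if total else count_vector

print(to_probabilities([3, 1, 0, 4]))   # [0.375, 0.125, 0.0, 0.5]
```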

Variant: using modifiers to specify the meaning of words

• …. The Soviet cosmonaut …. The American astronaut …. The red American car …. The old red truck … the spacewalking cosmonaut … the full Moon …

             Soviet  American  spacewalking  red  full  old
  cosmonaut    1        0           1         0    0    0
  astronaut    0        1           0         0    0    0
  moon         0        0           0         0    1    0
  car          0        1           0         1    0    0
  truck        0        0           0         1    0    1

Another variant: word / document matrices

             d1  d2  d3  d4  d5  d6
  cosmonaut   1   0   1   0   0   0
  astronaut   0   1   0   0   0   0
  moon        1   1   0   0   0   0
  car         1   0   0   1   1   0
  truck       0   0   0   1   0   1

Measures of semantic similarity

• Euclidean distance:   $d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

• Cosine:   $\cos(\alpha) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$

• Manhattan metric:   $d = \sum_{i=1}^{n} |x_i - y_i|$
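A minimal sketch of the three measures, applied to the toy modifier vectors for car and truck from the table above (function names are illustrative):

```python
import numpy as np

def euclidean(x, y):
    return float(np.sqrt(np.sum((x - y) ** 2)))

def cosine(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def manhattan(x, y):
    return float(np.sum(np.abs(x - y)))

car   = np.array([0, 1, 0, 1, 0, 0])   # Soviet, American, spacewalking, red, full, old
truck = np.array([0, 0, 0, 1, 0, 1])
print(euclidean(car, truck), cosine(car, truck), manhattan(car, truck))
```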

SIMILARITY IN VECTOR SPACE MODELS: THE COSINE MEASURE

[Figure: document vector d_j and query vector q_k with angle θ between them]

$\cos\theta = \frac{\vec{d_j} \cdot \vec{q_k}}{|\vec{d_j}|\,|\vec{q_k}|}$

$\mathrm{sim}(\vec{q_k}, \vec{d_j}) = \frac{\sum_{i=1}^{N} w_{k,i}\, w_{j,i}}{\sqrt{\sum_{i=1}^{N} w_{k,i}^2}\,\sqrt{\sum_{i=1}^{N} w_{j,i}^2}}$

EVALUATION

• Synonymy identification
• Text coherence
• Semantic priming

SYNONYMY: THE TOEFL TEST

TOEFL TEST: RESULTS

Some psychological evidence for vector-space representations

• Burgess and Lund (1996, 1997): the clusters found with HAL correlate well with those observed using semantic priming experiments.
• Landauer, Foltz, and Laham (1997): scores overlap with those of humans on standard vocabulary and topic tests; mimic human scores on category judgments; etc.
• Evidence about `prototype theory' (Rosch et al., 1976)
  – Posner and Keele, 1968:
    • Subjects were presented with patterns of dots that had been obtained by variations from a single pattern (the `prototype')
    • Later, they recalled prototypes better than samples they had actually seen
  – Rosch et al., 1976: `basic level' categories (apple, orange, potato, carrot) have higher `cue validity' than elements higher in the hierarchy (fruit, vegetable) or lower (red delicious, cox)

The HAL model (Burgess and Lund, 1995, 1996, 1997)

• Corpus: 160 million words of articles extracted from all newsgroups containing English dialogue
• Context words: the 70,000 most frequently occurring symbols within the corpus
• Window size: 10 words to the left and to the right of the word
• Measure of similarity: cosine

HAL AND SEMANTIC PRIMING

INFORMATION RETRIEVAL

• GOAL: find the documents most relevant to a certain QUERY
• Latest development: WEB SEARCH
  – Use the Web as the collection of documents
• Related:
  – QUESTION ANSWERING
  – DOCUMENT CLASSIFICATION

DOCUMENTS AS BAGS OF WORDS

DOCUMENT:
  broad tech stock rally may signal trend - traders.
  technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums.

INDEX:
  broad, may, rally, rallied, signal, stock, stocks, tech, technology, traders, traders, trend

THE VECTOR SPACE MODEL

• Query and documents represented as vectors of index terms, assigned non-binary WEIGHTS
• Similarity calculated using vector algebra: COSINE (cf. lexical similarity models)
  – RANKED similarity
• Most popular of all models (cf. Salton and Lesk’s SMART)

TERM WEIGHTING IN VECTOR SPACE MODELS: THE TF.IDF MEASURE

$\mathrm{tfidf}_{i,k} = f_{i,k} \times \log\!\left(\frac{N}{\mathrm{df}_i}\right)$

where $f_{i,k}$ is the FREQUENCY of term i in document k, N is the total number of documents, and $\mathrm{df}_i$ is the number of documents with term i.

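A minimal sketch of tf.idf weighting over a toy collection; whitespace tokenization and the function name are illustrative simplifications:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute tfidf_{i,k} = f_{i,k} * log(N / df_i) for every term in every document."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))                 # document frequency: one count per document
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)                   # raw term frequency f_{i,k}
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = ["broad tech stock rally may signal trend",
        "technology stocks rallied on tuesday with broad gains"]
print(tfidf_vectors(docs)[0])
```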
VECTOR-SPACE MODELS WITH SYNTACTIC INFORMATION

• Pereira and Tishby, 1992: two words are similar if they occur as objects of the same verbs
  – John ate POPCORN
  – John ate BANANAS
• C(w) is the distribution of verbs for which w served as direct object
  – First approximation: just counts
  – In fact: probabilities
• Similarity: RELATIVE ENTROPY
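A minimal sketch of relative entropy (KL divergence) between two verb distributions; the toy distributions below are made up for illustration:

```python
import math

def relative_entropy(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Toy distributions over the same verb list, e.g. P(eat), P(buy), P(grow) given the object noun
popcorn = [0.7, 0.2, 0.1]
bananas = [0.6, 0.3, 0.1]
print(relative_entropy(popcorn, bananas))
```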

(SYNTACTIC) RELATION-BASED VECTOR MODELS

[Figure: dependency graph for “the red fox attacked the lazy dog”: attacked has subject fox (det the, mod red) and object dog (det the, mod lazy)]

• E.g., Grefenstette, 1994; Lin, 1998; Curran and Moens, 2002

SEXTANT (Grefenstette, 1992)

• Input: “It was concluded that the carcinoembryonic antigens represent cellular constituents which are repressed during the course of differentiation of the normal digestive system epithelium and reappear in the corresponding malignant cells by a process of derepressive dedifferentiation.”
• Extracted attributes:
  antigen carcinoembryonic-ADJ
  antigen repress-DOBJ
  antigen represent-SUBJ
  constituent cellular-ADJ
  constituent represent-DOBJ
  course repress-IOBJ
  ……..

SEXTANT: Similarity measure

• Attributes extracted for DOG:
  dog pet-DOBJ
  dog eat-SUBJ
  dog shaggy-ADJ
  dog brown-ADJ
  dog leash-NN
• Attributes extracted for CAT:
  cat pet-DOBJ
  cat pet-DOBJ
  cat hairy-ADJ
  cat leash-NN
• Jaccard:

$\mathrm{Jaccard}(A,B) = \frac{\mathrm{Count}(\text{attributes shared by } A \text{ and } B)}{\mathrm{Count}(\text{unique attributes possessed by } A \text{ or } B)}$

$= \frac{|\{\text{leash-NN, pet-DOBJ}\}|}{|\{\text{brown-ADJ, eat-SUBJ, hairy-ADJ, leash-NN, pet-DOBJ, shaggy-ADJ}\}|} = \frac{2}{6}$

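A minimal sketch of the Jaccard computation over attribute lists (names are illustrative):

```python
def jaccard(attrs_a, attrs_b):
    """Jaccard similarity: shared attributes divided by the union of unique attributes."""
    a, b = set(attrs_a), set(attrs_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

dog = ["pet-DOBJ", "eat-SUBJ", "shaggy-ADJ", "brown-ADJ", "leash-NN"]
cat = ["pet-DOBJ", "pet-DOBJ", "hairy-ADJ", "leash-NN"]
print(jaccard(dog, cat))   # 2 shared / 6 unique = 0.333...
```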
MULTIDIMENSIONAL SCALING

• Many models (including HAL) apply techniques for REDUCING the number of dimensions
• Intuition: many features express a similar property / topic


Latent Semantic Analysis (LSA) (Landauer et al., 1997)

• Goal: extract relations of expected contextual usage from passages
• Three steps:
  1. Build a word / document co-occurrence matrix
  2. `Weigh' each cell
  3. Perform a DIMENSIONALITY REDUCTION
• Argued to correlate well with humans on a number of tests

LSA: the method, 1


LSA: Singular Value Decomposition


LSA: Reconstructed matrix

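The reduction step can be illustrated with the word/document matrix from the earlier cosmonaut/astronaut slide; the sketch below performs a truncated SVD and rebuilds the matrix (the rank k=2 is an illustrative choice, not from the slides):

```python
import numpy as np

# Word/document matrix from the earlier slide (rows: terms, columns: d1..d6)
A = np.array([[1, 0, 1, 0, 0, 0],   # cosmonaut
              [0, 1, 0, 0, 0, 0],   # astronaut
              [1, 1, 0, 0, 0, 0],   # moon
              [1, 0, 0, 1, 1, 0],   # car
              [0, 0, 0, 1, 0, 1]])  # truck

U, s, Vt = np.linalg.svd(A.astype(float), full_matrices=False)   # A = U diag(s) Vt

k = 2                                          # keep only the top-k latent dimensions
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # reconstructed (smoothed) matrix
print(np.round(A_k, 2))
```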

Topic correlations in `raw' and `reconstructed' data

Some caveats

• Two senses of `similarity':
  – Schuetze: two words are similar if one can replace the other
  – Brown et al.: two words are similar if they occur in similar contexts
• What notion of `meaning' is learned here?
  – “One might consider LSA’s maximal knowledge of the world to be analogous to a well-read nun’s knowledge of sex, a level of knowledge often deemed a sufficient basis for advising the young” (Landauer et al., 1997)
• Can one do semantics with these representations?
  – Our own experience: using HAL-style vectors for resolving bridging references
  – Very limited success
  – Applying dimensionality reduction didn’t seem to help

REMAINING LECTURES

  DAY        HOUR   TOPIC
  Wed 25/11  12-14  Text classification with Artificial Neural Nets
  Tue 1/12   10-12  Lab: Supervised ML with Weka
  Fri 4/12   10-12  Unsupervised methods & their application in lexical acq and IR
  Wed 9/12   10-12  Lexical acquisition by clustering
  Thu 10/12  10-12  Psychological evidence on learning
  Fri 11/12  10-12  Psychological evidence on language processing
  Mon 14/12  10-12  Intro to NLP

REMAINING LECTURES

  DAY        HOUR   TOPIC
  Tue 15/12  10-12  Machine learning for anaphora
  Tue 15/12  14-16  Lab: Clustering
  Wed 16/12  14-16  Lab: BART
  Thu 17/12  10-12  Ling. & psychological evidence on anaphora
  Fri 18/12  10-12  Corpora for anaphora
  Mon 21/12  10-12  Lexical & commons. knowledge for anaphora
  Tue 22/12  10-12  Salience
  Tue 22/12  14-16  Discourse new detection

ACKNOWLEDGMENTS

• Some of the slides come from:
  – Ray Mooney’s UTexas AI course
  – Marco Baroni