Semantic Vectors


A Scalable Open Source Package and Online Technology Management Application

LREC Conference, 28th May, 2008

Dominic Widdows, Google, Inc.

Kathleen Ferraro, University of Pittsburgh

Natural Language Software Engineering – Three Problems

  Software is often hard to use / unreliable

t (fiddling with computers) >> t (analysing data)

  Does it scale?

Moore's Law of Data – any algorithm more costly than linear hurts more every day!

   What is it for?

Systems / components

Interesting (science) / useful (engineering)

Semantic Vector Models

    Count how many times words occur in some context

Term-document matrix: LSA (“Latent Semantic Analysis”)

Or count how many times words co-occur with one another

HAL (“Hyperspace Analogue to Language”)

  Normally we reduce dimensions somehow

SVD, NNMR, LDA.

  Many uses

IR, WSD, OL / LA, DS / TDT, OCIM, DC, ... , Acronym Resolution.
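The counting step on this slide can be illustrated with a small sketch. It builds a term-document count matrix from a toy corpus and compares terms by cosine similarity. The corpus, the function names `term_document_matrix` and `cosine`, and the use of raw counts are all assumptions made for illustration; they are not taken from the Semantic Vectors package.

```python
from collections import Counter
import math

# Toy corpus (hypothetical); each document is one "context".
docs = [
    "the doctor treated the patient",
    "the nurse helped the patient",
    "the pilot flew the plane",
]

vocab = sorted({w for d in docs for w in d.split()})

def term_document_matrix(docs, vocab):
    """One row per term, one column per document; entries are raw counts."""
    counts = [Counter(d.split()) for d in docs]
    return {t: [c[t] for c in counts] for t in vocab}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

matrix = term_document_matrix(docs, vocab)

# "doctor" and "patient" share a document; "doctor" and "plane" do not.
print(cosine(matrix["doctor"], matrix["patient"]))
print(cosine(matrix["doctor"], matrix["plane"]))
```

Terms that occur in the same documents end up with similar rows. A HAL-style model would count co-occurrences inside a sliding word window instead of whole documents, but the comparison step is the same.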

Semantic Vectors Package

http://semanticvectors.google.com/

Created by University of Pittsburgh and MAYA Design

All Java (with some Perl / Python / PHP wrappers)

Maintained by Google 20% project + other contributors

BSD license: you can use it.

Nearly 1000 downloads

Developer group, Wiki, mailing list, ...

“Child of Infomap” with lessons learned

Challenge 1: Make it Easy!

100% Java

Dependencies include Apache Lucene

Installation

User: download jarfiles, add them to your $CLASSPATH, assemble a corpus (example provided), then type “java pitt.search.semanticvectors.BuildModel”

Developer: install SVN and Ant, check out the source (Google Code helps), install JUnit for testing

 We have had no reports of difficulty yet!

Challenge 2: Make it Scale!

Dimension reduction and parallelization are key

Random Projection

Geometric alternatives: SVD (orthogonality)

Probabilistic alternatives: PLSA, LDA (generative models)

 Sparse Random Vectors, e.g.

[0,0,0,1,0,-1,0,0,0,-1,0,0,0,0,1,0,0]

[0,-1,0,0,0,1,0,0,0,0,1,0,0,0,0,-1,0]

   On average, dot products are nearly zero, so vectors are nearly orthogonal.

Approximate benefits of SVD, with none of the cost!

Believed to be trivially parallelizable and incremental (TODO)
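The near-orthogonality claim on this slide can be checked with a short sketch. It generates sparse ternary random vectors, computes the dot product of the two example vectors shown above, and then builds term vectors by summing the random vectors of the documents each term occurs in. That last step is one common way to use random projection (in the style of Random Indexing) and is an assumption here, not a description of the package's code; the names `sparse_random_vector` and `build_term_vectors` and the parameters `dim`, `nonzero`, and `seed` are hypothetical.

```python
import random

def sparse_random_vector(dim, nonzero, rng):
    """Ternary vector: `nonzero` entries are +1 or -1 (balanced), the rest 0."""
    vec = [0] * dim
    positions = rng.sample(range(dim), nonzero)
    for i, pos in enumerate(positions):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_term_vectors(docs, dim=200, nonzero=10, seed=0):
    """Sum, for each term, the random vector of every document it occurs in
    (added once per occurrence). Assumed scheme, for illustration only."""
    rng = random.Random(seed)
    doc_vectors = [sparse_random_vector(dim, nonzero, rng) for _ in docs]
    term_vectors = {}
    for doc, dvec in zip(docs, doc_vectors):
        for term in doc.split():
            tvec = term_vectors.setdefault(term, [0] * dim)
            for i, x in enumerate(dvec):
                tvec[i] += x
    return term_vectors, doc_vectors

# The two example vectors from the slide.
a = [0, 0, 0, 1, 0, -1, 0, 0, 0, -1, 0, 0, 0, 0, 1, 0, 0]
b = [0, -1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, -1, 0]
print(dot(a, b), dot(a, a))  # cross product is small next to the self product

term_vectors, doc_vectors = build_term_vectors(["cat sat", "cat ran"])
print(dot(term_vectors["sat"], term_vectors["ran"]))
```

Each document's contribution can be added independently and in any order, which is why the approach is expected to parallelize and to support incremental updates.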

Challenge 3: Make it Useful!

Hardest of the three problems

Technology Matching at UPitt

http://real.hsls.pitt.edu/

Matches technology disclosures to documents harvested from company websites

Traditionally needs much more than keywords

Does your data meet your needs?

Features and Demos ...

  Negation, Disjunction

“Quantum” / Vector Logic

  Translation

Bilingual Vector Models

  Semantic Vector Products

Direct, Tensor, Convolution, Subspace

  Clustering

kMeans

  Context Window Approach (HAL)

Thanks to Trevor Cohen, ASU, Biomedical Informatics
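The Negation item in the feature list above can be made concrete with a minimal sketch. It assumes negation means orthogonal projection: "a NOT b" is the part of a that is orthogonal to b. The function name `negate` and the toy vectors are hypothetical, and the package's own implementation may differ.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def negate(a, b):
    """Return `a NOT b`: a with its component along b removed.

    Assumes negation is projection onto the orthogonal complement of b.
    """
    bb = dot(b, b)
    if bb == 0:
        return list(a)  # nothing to remove when b is the zero vector
    scale = dot(a, b) / bb
    return [x - scale * y for x, y in zip(a, b)]

# Toy 3-dimensional vectors (hypothetical values).
query = [3.0, 4.0, 1.0]
unwanted = [1.0, 2.0, 2.0]

result = negate(query, unwanted)
print(result)
print(dot(result, unwanted))  # approximately zero, up to rounding
```

Because the result has zero similarity to the unwanted vector, anything whose vector lies close to the negated term is pushed down the ranking rather than merely filtered out by keyword.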

Mathematics, Technology, Cognition

Geometry, Probability, Logic

Intersection

“That one term should be included in another as in a whole is the same as for the other to be predicated of all of the first.”

Prior Analytics (Bk I, Ch 1)

The equations work ... does the method?

“It is the mark of an educated man to look for precision in each class of things just so far as the nature of the subject admits; it is evidently equally foolish to accept probable reasoning from a mathematician and to demand from a rhetorician scientific proofs.”

Nicomachean Ethics (Bk I, Ch 3)

What do people do?

“By nature animals are born with the faculty of sensation, and from sensation memory is produced in some of them, though not in others ... Now from memory experience is produced in humans; for the several memories of the same thing produce finally the capacity for a single experience.”

Metaphysics (Bk I, Ch 1)

Many Thanks ...

ELRA and the LREC conference

Developers of Java, Lucene, Ant, JUnit, ...

Google, University of Pittsburgh

Harris, Firth, Van Rijsbergen, Salton, McGill, Landauer, Deerwester, Berry, Dumais, Schütze, Lund, Burgess, Sahlgren, Karlgren, Kaufmann, Dorow, Cederberg, Hofmann, Kanerva, Plate, Papadimitriou, McArthur, Bruza, ...

And finally ...

http://infomap.stanford.edu/book

Introduction to Vectors, WordSpace, Quantum Logic, etc.

A few for sale here ... 150 .

Download the package ...

Google(Semantic Vectors)