The Unreasonable Effectiveness of Data
Alon Halevy, Peter Norvig, and Fernando Pereira
Kristine Monteith May 1, 2009 CS 652
Why “Unreasonable Effectiveness”?
Title taken from the article “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” ◦ Physics formulas are nice and tidy ◦ Linguistic formulas are not so simple (e.g. even an incomplete grammar of the English language runs over 1,700 pages) But what linguistics lacks in elegant formulas, it makes up for with LOTS of data ◦ In 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long
What makes a task “easy”?
Biggest successes in NLP: statistical speech recognition and statistical machine translation ◦ Harder than tasks such as document classification ◦ But there’s lots of data available that doesn’t require expensive manual annotation (e.g. European Union translators, closed captioning) ◦ Automatically discover semantic relationships from the accumulated evidence of web-based text patterns “Invariably, simple models and a lot of data trump more elaborate models based on less data”
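The slide’s claim that “simple models and a lot of data trump more elaborate models” can be illustrated with a minimal sketch (not from the paper): a bigram model trained on nothing but raw, unannotated text, which predicts the next word by lookup. The toy corpus here is invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count bigrams from raw (unannotated) text -- no grammar needed."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent continuation seen in the data."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

# Toy corpus; real systems use web-scale text.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "the dog chased the cat",
]
model = train_bigrams(corpus)
print(predict_next(model, "the"))  # "cat" (seen 3 times vs "dog" once)
```

The “model” is just the counts table; more data makes it better with no change to the code, which is the point the quote is making.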
False dichotomy
“Deep” approach ◦ Hand coded grammars and ontologies, represented by complex networks of relationships Statistical approach ◦ N-gram statistics from large corpora
Actually three problems of NLP
◦ Choosing a representation language: first-order logic, finite state machines, etc.
◦ Encoding a model in that language: manual encoding, word counts, etc.
◦ Performing inference on that model: complex inference models, Bayesian statistics
Semantic Web vs. Semantic Interpretation
“The Semantic Web is a convention for formal representation languages that lets software services interact with each other ‘without needing artificial intelligence.’” ◦ Agree on standards for representing dates, prices, locations, etc.
◦ Services can then interact with other services that use the same standard or a different one with a known translation “The problem of understanding human speech and writing—the semantic interpretation problem— is quite different from the problem of software service interoperability.”
Challenges of Building Semantic Web Services
◦ Ontology writing ◦ Difficulty of implementation ◦ Competition ◦ Inaccuracy and deception
Challenges of Achieving Accurate Semantic Interpretation
The Web has managed to get hundreds of millions of authors to share a trillion pages of content, and that content has been aggregated and indexed Still need to find the meaning of entries: ◦ Does “HP” refer to “Helmerich and Payne” or “Hewlett-Packard”? ◦ Which “Joe’s Pizza” are we talking about?
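One common data-driven approach to this kind of ambiguity (a sketch, not the paper’s method) is to compare the surrounding text against context words each sense tends to co-occur with. The context-word sets below are illustrative assumptions, not real data.

```python
# Toy disambiguation by context overlap: pick the sense whose known
# context words best match the words around the mention.
# These context sets are made up for illustration.
SENSES = {
    "Hewlett-Packard": {"printer", "laptop", "computer", "ink"},
    "Helmerich and Payne": {"drilling", "oil", "rig", "energy"},
}

def disambiguate(context):
    """Return the sense with the largest context-word overlap."""
    words = set(context.lower().split())
    return max(SENSES, key=lambda sense: len(words & SENSES[sense]))

print(disambiguate("HP announced a new laptop and printer line"))
# "Hewlett-Packard"
```

At web scale the context profiles would themselves be learned from co-occurrence counts rather than written by hand.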
Example Task: Find Synonyms for Attribute Names
Looking to recognize facts such as “Company Name” = “Company” or “Price” = “Discount” ◦ Extract 2.5 million distinct schemata from 150 million tables ◦ Examine co-occurrence of names in these schemata ◦ If A and B rarely occur together, but both often occur with C, then A and B may be synonyms
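The co-occurrence heuristic on this slide can be sketched directly: count how often attribute names appear in the same schema, then flag pairs that never co-occur but share several common neighbors. The schemata below are invented toy data, and the threshold `min_shared` is an assumed parameter.

```python
from collections import Counter
from itertools import combinations

def synonym_candidates(schemata, min_shared=2):
    """If A and B never occur in the same schema, but both often
    co-occur with the same third attribute C, they may be synonyms."""
    cooc = Counter()
    for schema in schemata:
        for a, b in combinations(sorted(schema), 2):
            cooc[(a, b)] += 1
    attrs = sorted({a for s in schemata for a in s})
    pairs = []
    for a, b in combinations(attrs, 2):
        if cooc[(a, b)] == 0:  # never seen together
            shared = sum(
                1 for c in attrs
                if c not in (a, b)
                and cooc[tuple(sorted((a, c)))] > 0
                and cooc[tuple(sorted((b, c)))] > 0
            )
            if shared >= min_shared:
                pairs.append((a, b))
    return pairs

schemata = [
    {"company", "price", "address"},
    {"company-name", "price", "address"},
    {"company", "price", "phone"},
    {"company-name", "address", "phone"},
]
print(synonym_candidates(schemata))
# [('company', 'company-name')]
```

“company” and “company-name” never share a schema (a table rarely lists the same fact under two names) yet both co-occur with “price”, “address”, and “phone”, so they surface as a synonym candidate.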
“So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do.”