The Unreasonable Effectiveness of Data
Alon Halevy, Peter Norvig, and Fernando Pereira
Kristine Monteith May 1, 2009 CS 652
Why “Unreasonable Effectiveness”?
Title taken from the article “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” ◦ Physics formulas are nice and tidy ◦ Linguistic formulas are not so simple (e.g. even an incomplete grammar of the English language runs over 1,700 pages) But what linguistics lacks in elegant formulas, it makes up for with LOTS of data ◦ In 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long
What makes a task “easy”?
Biggest successes in NLP: statistical speech recognition and statistical machine translation ◦ Harder than tasks such as document classification ◦ But there’s lots of data available that doesn’t require expensive manual annotation (e.g. European Union translators, closed captioning) ◦ Automatically discover semantic relationships from the accumulated evidence of web-based text patterns “Invariably, simple models and a lot of data trump more elaborate models based on less data”
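The slide’s claim that “simple models and a lot of data trump more elaborate models” can be illustrated with a minimal sketch (not from the paper): a bigram model trained on nothing but raw, unannotated text, which predicts the next word by lookup. The toy corpus here is invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count bigrams from raw (unannotated) text -- no grammar needed."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent continuation seen in the data."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

# Toy corpus; real systems use web-scale text.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "the dog chased the cat",
]
model = train_bigrams(corpus)
print(predict_next(model, "the"))  # "cat" (seen 3 times vs "dog" once)
```

The “model” is just the counts table; more data makes it better with no change to the code, which is the point the quote is making.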
False dichotomy
“Deep” approach ◦ Hand coded grammars and ontologies, represented by complex networks of relationships Statistical approach ◦ N-gram statistics from large corpora
Actually three problems of NLP
◦ Choosing a representation language: first-order logic, finite state machines, etc.
◦ Encoding a model in that language: manual encoding, word counts, etc.
◦ Performing inference on that model: complex inference models, Bayesian statistics
Semantic Web vs. Semantic Interpretation
“The Semantic Web is a convention for formal representation languages that lets software services interact with each other ‘without needing artificial intelligence.’” ◦ Agree on standards for representing dates, prices, locations, etc.
◦ Services can then interact with other services that use the same standard or a different one with a known translation “The problem of understanding human speech and writing—the semantic interpretation problem— is quite different from the problem of software service interoperability.”
Challenges of Building Semantic Web Services
◦ Ontology writing ◦ Difficulty of implementation ◦ Competition ◦ Inaccuracy and deception
Challenges of Achieving Accurate Semantic Interpretation
The Web has managed to get hundreds of millions of authors to share a trillion pages of content, and that content has been aggregated and indexed Still need to find the meaning of entries: ◦ Does “HP” refer to “Helmerich and Payne” or “Hewlett-Packard”? ◦ Which “Joe’s Pizza” are we talking about?
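One common data-driven approach to this kind of ambiguity (a sketch, not the paper’s method) is to compare the surrounding text against context words each sense tends to co-occur with. The context-word sets below are illustrative assumptions, not real data.

```python
# Toy disambiguation by context overlap: pick the sense whose known
# context words best match the words around the mention.
# These context sets are made up for illustration.
SENSES = {
    "Hewlett-Packard": {"printer", "laptop", "computer", "ink"},
    "Helmerich and Payne": {"drilling", "oil", "rig", "energy"},
}

def disambiguate(context):
    """Return the sense with the largest context-word overlap."""
    words = set(context.lower().split())
    return max(SENSES, key=lambda sense: len(words & SENSES[sense]))

print(disambiguate("HP announced a new laptop and printer line"))
# "Hewlett-Packard"
```

At web scale the context profiles would themselves be learned from co-occurrence counts rather than written by hand.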
Example Task: Find Synonyms for Attribute Names
Looking to recognize facts such as “Company Name” = “Company” or “Price” = “Discount” ◦ Extract 2.5 million distinct schemata from 150 million tables ◦ Examine co-occurrence of names in these schemata ◦ If A and B rarely occur together, but both often occur with C, then A and B may be synonyms
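The co-occurrence heuristic on this slide can be sketched directly: count how often attribute names appear in the same schema, then flag pairs that never co-occur but share several common neighbors. The schemata below are invented toy data, and the threshold `min_shared` is an assumed parameter.

```python
from collections import Counter
from itertools import combinations

def synonym_candidates(schemata, min_shared=2):
    """If A and B never occur in the same schema, but both often
    co-occur with the same third attribute C, they may be synonyms."""
    cooc = Counter()
    for schema in schemata:
        for a, b in combinations(sorted(schema), 2):
            cooc[(a, b)] += 1
    attrs = sorted({a for s in schemata for a in s})
    pairs = []
    for a, b in combinations(attrs, 2):
        if cooc[(a, b)] == 0:  # never seen together
            shared = sum(
                1 for c in attrs
                if c not in (a, b)
                and cooc[tuple(sorted((a, c)))] > 0
                and cooc[tuple(sorted((b, c)))] > 0
            )
            if shared >= min_shared:
                pairs.append((a, b))
    return pairs

schemata = [
    {"company", "price", "address"},
    {"company-name", "price", "address"},
    {"company", "price", "phone"},
    {"company-name", "address", "phone"},
]
print(synonym_candidates(schemata))
# [('company', 'company-name')]
```

“company” and “company-name” never share a schema (a table rarely lists the same fact under two names) yet both co-occur with “price”, “address”, and “phone”, so they surface as a synonym candidate.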
“So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do.”