Transcript Document

The Discipline and Future
of Machine Learning
Tom M. Mitchell
E. Fredkin Professor and Department Head
March 2007
The Discipline of Machine Learning
The defining question:
• How can we build computer systems that automatically
improve with experience, and what are the fundamental
laws that govern all learning processes?
A process learns with respect to <T,P,E> if it
• Improves its performance P
• at task T
• through experience E
Machine Learning - Practice
Application areas:
• Speech recognition
• Object recognition
• Mining databases
• Control learning
• Extracting facts from text

Methods:
• Reinforcement learning
• Supervised learning
• Bayesian networks
• Hidden Markov models
• Unsupervised clustering
• Explanation-based learning
• ...
Machine Learning - Theory
PAC Learning Theory (for supervised concept learning), relating:
• # examples (m)
• error rate (ε)
• representational complexity (|H|)
• failure probability (δ)

$m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$

Other theories for:
• Reinforcement skill learning
• Semi-supervised learning
• Active student querying
• ...

... also relating:
• # of mistakes during learning
• convergence rate
• asymptotic performance
• bias, variance
• VC dimension
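As a concrete reading of the PAC bound above, a short sample-size calculation (a sketch; the numbers are illustrative, not from the talk):

```python
# Sample complexity from the PAC bound above (illustrative numbers).
from math import ceil, log

def pac_sample_size(h_size: int, eps: float, delta: float) -> int:
    """Examples sufficient so that, with probability >= 1 - delta, every
    hypothesis in H consistent with the data has true error <= eps."""
    return ceil((log(h_size) + log(1.0 / delta)) / eps)

# e.g., |H| = 2**20 hypotheses, 5% target error, 1% failure probability
print(pac_sample_size(2**20, eps=0.05, delta=0.01))  # -> 370
```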
The Discipline of Machine Learning
Machine Learning:
• How can we build computer systems that automatically improve with
experience, and what are the fundamental laws that govern all
learning processes?
Computer Science:
• How can we build machines that solve problems, and which
problems are inherently tractable/intractable?
Statistics:
• What can be learned from data with a set of modeling assumptions,
while taking into account the data-collection process?
[Diagram: Machine learning at the intersection of Computer science, Statistics, Animal learning (Cognitive science, Psychology, Neuroscience), Adaptive Control Theory, Economics, Evolution, and Robotics]
ML and CS
• Machine learning is already the preferred approach to
– Speech recognition, Natural language processing
– Computer vision
– Medical outcomes analysis
– Many robot control problems
– ...
[Diagram: "ML software" as a niche within "All software"]
• The ML niche will grow
– Why?
ML and Empirical Sciences
• Empirical science is a learning process, subject to automation and to study
– improve performance P (accuracy)
– at task T (predict which gene knockouts will impact the aromatic AA pathway,
and how)
– with experience E (active experimentation)
[Figure: Which protein ORFs influence which enzymes in the AAA pathway. Functional genomic hypothesis generation and experimentation by a robot scientist, King et al., Nature, 427(6971), 247-252]
Our current state:
• The problem of tabula-rasa function approximation is
solved (in an 80-20 sense):
– Given:
• Class of hypotheses H = {h : X → Y}
• Labeled examples {<x_i, f(x_i)>}
– Determine:
• The h from H that best approximates f
• It’s time to move on
– Enrich the function approx problem definition
– Use function approx as building block
– Work on new problems
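Before moving on, a minimal sketch of that "solved" problem, as empirical risk minimization over a toy hypothesis class (data and class are illustrative):

```python
# Tabula-rasa function approximation as empirical risk minimization
# (a minimal sketch; H is a toy class of threshold functions).

examples = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]  # labeled pairs <x_i, f(x_i)>

# Hypothesis class H = {h : X -> Y}: threshold classifiers h_t(x) = [x >= t]
H = [lambda x, t=t: int(x >= t) for t in (0.5, 1.5, 2.5, 3.5)]

def empirical_error(h):
    """Fraction of labeled examples that h misclassifies."""
    return sum(h(x) != y for x, y in examples) / len(examples)

# Determine: the h from H that best approximates f on the data
best_h = min(H, key=empirical_error)
print(empirical_error(best_h))  # -> 0.0 (the t = 2.5 threshold fits perfectly)
```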
Some Current Research Questions
• When/how can unlabeled data be useful in function approximation?
• How can assumed sparsity of relevant features be exploited in high
dimensional nonparametric learning?
• How can information learned from one task be transferred to
simplify learning another?
• What algorithms can learn control strategies from delayed rewards
and other inputs?
• What are the best “active learning” strategies for different learning
problems?
• To what degree can one preserve data privacy while obtaining the
benefits of data mining?
The Future of Machine Learning
A Quick Look Back

Evolutionary and revolutionary changes. What might lead to the next revolution?

[Timeline figure, 1960-2000: Samuel's checker learner; Perceptrons; theories of perceptron capacity and learnability; Winston's symbolic concept learner; theories of grammar induction; rule learning; Version Spaces; decision tree learning; explanation-based learning; neural networks; HMMs; PAC learning; the statistical theory perspective on learning; Bayes nets; architectures for learning and problem solving; reinforcement learning; SVMs; dimensionality reduction; nonparametric methods; semi-supervised learning; transfer learning; privacy-preserving data mining; and applications from speech to robot control to large scale data mining]
1. Use Machine Learning to help
understand Human Learning
(and vice versa)
Models of Learning Processes
Machine Learning:
• # of examples
• Error rate
• Reinforcement learning
• Explanations
• Learning from examples
• Complexity of learner's representation
• Probability of success
• Prior probabilities
• Loss functions

Human Learning:
• # of examples
• Error rate
• Reinforcement learning
• Explanations
• Human supervision
– Lectures
– Questions, Homeworks
• Attention, motivation
• Skills vs. Principles
• Implicit vs. Explicit learning
• Memory, retention, forgetting
• Hebbian learning, consolidation
Reinforcement Learning
[Sutton and Barto 1981; Samuel 1957]
Observed immediate reward $r_t$; learned sum of future rewards:

$V^*(s) = E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots]$
Reinforcement Learning in ML

Example ($\gamma = 0.9$): a chain $S_0 \to S_1 \to S_2 \to S_3$ with reward 0 on each step until $r = 100$ at $S_3$ has values $V = 72, 81, 90, 100$.

$V(s_t) = E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots]$
$V(s_t) = E[r_t] + \gamma V(s_{t+1})$

To learn V, use each transition to generate a training signal:

$\text{training error} = r_t + \gamma V(s_{t+1}) - V(s_t)$
• Variants of RL have been used for a variety of practical
control learning problems
– Temporal Difference learning
– Q learning
– Learning MDPs, POMDPs
• Theoretical results too
– Assured convergence to optimal V(s) under certain conditions
– Assured convergence for Q(s,a) under certain conditions
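The training signal above is easy to see in code; a minimal TD(0) sketch (the chain and rewards mirror the γ = 0.9, r = 100 example on the earlier slide):

```python
# TD(0) value learning from the training signal above (a minimal sketch).
gamma = 0.9   # discount factor, as in the gamma = 0.9 example
alpha = 0.1   # learning rate

# Chain S0 -> S1 -> S2 -> S3 -> absorbing; reward 0 except r = 100 leaving S3.
transitions = [(0, 0.0, 1), (1, 0.0, 2), (2, 0.0, 3), (3, 100.0, 4)]
V = [0.0] * 5  # value estimates; V[4] is the absorbing state and stays 0

for _ in range(2000):                 # replay the episode until V converges
    for s, r, s_next in transitions:
        td_error = r + gamma * V[s_next] - V[s]   # the training signal
        V[s] += alpha * td_error                  # nudge V(s) toward target

print([round(v, 1) for v in V[:4]])   # -> [72.9, 81.0, 90.0, 100.0]
```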
Dopamine As Reward Signal
[Schultz et al., Science, 1997]

[Figures: dopamine neuron firing plotted over time t, tracking the temporal-difference error $r_t + \gamma V(s_{t+1}) - V(s_t)$]
RL Models for Human Learning
[Seymour et al., Nature 2004]
Human and Machine Learning
Additional overlaps:
• Learning of perceptual representations
– Dimensionality reduction methods, low level percepts
– Lewicki et al.: optimal sparse codes of natural scenes yield the Gabor
filters found in primate visual cortex; similar result for auditory cortex.
• Learning with redundant sensory input
– CoTraining methods, Sensory redundancy hypothesis in development
– De Sa & Ballard; Coen: co-clustering voice/video yields phonemes
– Mitchell & Perfetti: co-training in second language learning
• Learning and explanations
– Explanation-based learning, teaching concepts & skills, chunking
– VanLehn et al: explanation-based learning accounts for some human
learning behaviors.
– Chi: students learn best when forced to explain
– Newell; Anderson: chunking/knowledge-compilation models
2. Never-ending learning
Never-Ending Learning
Current machine learning systems:
• Learn one function
• Are shut down after they learn it
• Start from scratch when programmed to learn the next
function
Let’s study and construct learning processes that:
• Learn many different things
• Formulate their own next learning task
• Use what they have already learned to help learn the
next thing
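The contrast can be made concrete as a control loop; a skeleton sketch (the task-selection and transfer policies are precisely the open questions, so they appear as stubs, and every name here is hypothetical):

```python
# Skeleton of a never-ending learner (a sketch; nothing here is a real API).

class NeverEndingLearner:
    def __init__(self, prior_knowledge):
        self.knowledge = prior_knowledge      # everything learned so far

    def choose_next_task(self):
        """Self-reflect on current failures to formulate the next learning
        task -- an open research problem, so a stub here."""
        raise NotImplementedError

    def learn(self, task):
        # Reuse accumulated knowledge instead of starting from scratch,
        # then fold the newly learned model back into the knowledge base.
        model = task.train(prior=self.knowledge)
        self.knowledge.add(task.name, model)

    def run_forever(self):
        while True:                           # never shut down after one task
            self.learn(self.choose_next_task())
```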
Example: Never-ending learning robot
Imagine a robot with three goals: (1) avoid collisions, (2) recharge when
battery low, and (3) find and collect trash
What is stopping us from giving it some trash examples, then letting it
learn for a year?
What must it start with to formulate and solve relevant learning subtasks?
• Learn to recognize trash in scene
• Learn where to search for trash, and when
• Learn how close to get to find out whether trash is there
• Learn to manipulate trash
• Transfer what it learned about paper trash to help with bottle trash
• Discover relevant subcategories of trash (e.g., plastic versus glass
bottles), and of other objects in the environment
Core Questions for Never-Ending Learning Agent
• What function or fact to learn next?
– Self-reflection on performance, credit assignment
• What representation for this target function or fact?
– Choice of input-output representation for target function
– E.g., “classify whether it’s trash”
• How to obtain (which type of) training experience?
– Primarily self-supervised, but occasional teacher input
– E.g., “classify whether it’s trash”
• Guided by what prior knowledge?
– Transfer learning, but transfer between what?
– X → PaperTrash help learn X → PlasticTrash?
– State(t) × Action(t) → State(t+1) help learn X → PlasticTrash?
Example: Never-ending language learner
[Carlson, Cohen, Fahlman, Hong, Nyberg, Wang, ...]
Read the Web project: Create 24x7 web agent that each day:
• Extracts more facts from the web into structured database
• Learns to extract facts better than yesterday
Starting point:
• Ontology of hundreds of categories and relations
– and 6-10 training examples of each
• Never-ending learning architecture
– State-of-the-art language processing primitives
– Learning mechanisms
• Top level task:
– Populate a database of these categories and relations by reading
the web, and improve continually
Q: how can it obtain useful training
experience (i.e., self-supervise)?
A: redundancy
Bootstrapping: Learning to extract named entities
Example: is "Pittsburgh" a location?
"I arrived in Pittsburgh on Saturday."
• x1 (context): "I arrived in _________ on Saturday."
• x2 (spelling): "Pittsburgh"
Bootstrap learning to extract named entities
[Riloff and Jones, 1999], [Collins and Singer, 1999], ...
Initialization (seed entities):
Australia, Canada, China, England, France, Germany, Japan, Mexico, Switzerland, United_states

Learned extraction patterns after iterations:
• locations in ?x
• operations in ?x
• republic of ?x

Extracted entities after iterations:
South Africa, United Kingdom, Warrenton, Far_East, Oregon, Lexington, Europe, U.S._A., Eastern Canada, Blair, Southwestern_states, Texas, States, Singapore, Thailand, Maine, production_control, northern_Los, New_Zealand, eastern_Europe, Americas, Michigan, New_Hampshire, Hungary, south_america, district, Latin_America, Florida, ...
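A minimal sketch of that bootstrapping loop (the corpus and two-word context window are toy stand-ins; with no filtering, drift like "Blair" or "production_control" above is exactly what happens on real text):

```python
# Bootstrapping named-entity extraction (a minimal, unfiltered sketch).
corpus = [
    "offices in Australia and offices in Japan",
    "offices in Singapore this year",
    "republic of China borders republic of Thailand",
]
entities = {"Australia", "China"}           # seed entities
patterns = set()

for _ in range(3):                          # a few bootstrapping iterations
    # 1. Induce patterns: two-word left contexts of known entities
    for text in corpus:
        for e in list(entities):
            if e in text:
                left = text.split(e)[0].split()[-2:]
                if len(left) == 2:
                    patterns.add(" ".join(left) + " ?x")
    # 2. Apply patterns to extract new candidate entities
    for text in corpus:
        words = text.split()
        for p in patterns:
            first, second = p.split()[:2]
            for i in range(len(words) - 2):
                if words[i] == first and words[i + 1] == second:
                    entities.add(words[i + 2])

print(patterns)   # {'offices in ?x', 'republic of ?x'}
print(entities)   # grows to include Japan, Singapore, Thailand, ...
```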
Co-Training
Idea: Train Classifier1 and Classifier2 to:
1. Correctly classify labeled examples
2. Agree on classification of unlabeled examples

[Diagram: the sentence "I flew to New York today." is split into two views: Classifier1 sees the context "I flew to ____ today" and produces Answer1; Classifier2 sees the phrase "New York" and produces Answer2; training pushes the two answers to agree]
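A minimal co-training sketch under that idea (synthetic data; the two "views" stand in for the context and spelling views in the diagram):

```python
# Co-training (a minimal sketch; data and views are synthetic stand-ins).
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

def views_for(y):
    """Two redundant views of each example, each predictive of the label y."""
    x1 = y[:, None] + rng.normal(0, 0.5, (len(y), 2))  # view 1: "context"
    x2 = y[:, None] + rng.normal(0, 0.5, (len(y), 2))  # view 2: "spelling"
    return x1, x2

y_l = np.array([0, 1] * 5)              # 10 labeled examples
X1, X2 = views_for(y_l)
Y = y_l.copy()
y_u = rng.integers(0, 2, 200)           # 200 unlabeled examples; y_u is
x1_u, x2_u = views_for(y_u)             # used only for the evaluation below

c1, c2 = GaussianNB(), GaussianNB()
for _ in range(5):                      # co-training rounds
    c1.fit(X1, Y); c2.fit(X2, Y)
    # Each classifier self-labels the unlabeled examples it is most confident
    # about; both views of those examples join the shared training set.
    # (For brevity the unlabeled pool is not pruned, so picks may repeat.)
    for clf, view in ((c1, x1_u), (c2, x2_u)):
        pick = np.argsort(-clf.predict_proba(view).max(axis=1))[:10]
        X1 = np.vstack([X1, x1_u[pick]])
        X2 = np.vstack([X2, x2_u[pick]])
        Y = np.concatenate([Y, clf.predict(view[pick])])

print((c1.predict(x1_u) == y_u).mean())  # accuracy of the view-1 classifier
```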
Co-Training Theory
[Blum&Mitchell 98; Dasgupta 04, ...]
Co-training setting:
• learn $f : X \to Y$, where $X = X_1 \times X_2$
• $x$ is drawn from an unknown distribution
• $\exists\, g_1, g_2$ such that $g_1(x_1) = g_2(x_2) = f(x)$

Results relate final accuracy to the # of labeled examples, # of unlabeled examples, number of redundant inputs, and conditional dependence among inputs:
• want inputs less dependent, increased number of redundant inputs, ...
• disagreement over unlabeled examples can bound true error
Example Bootstrap learning algorithms:
• Classifying web pages [Blum & Mitchell 98; Slattery 99]
• Classifying email [Kiritchenko & Matwin 01; Chan et al. 04]
• Named entity extraction [Collins & Singer 99; Jones 05]
• Wrapper induction [Muslea et al. 01; Mohapatra et al. 04]
• Word sense disambiguation [Yarowsky 96]
• Discovering new word senses [Pantel & Lin 02]
• Synonym discovery [Lin et al. 03]
• Relation extraction [Brin et al.; Yangarber et al. 00]
• Statistical parsing [Sarkar 01]
What is the relation between "Elvis" and "January 8"?
Q: how can it choose its next learning task?
A: self-reflect on where it is failing, then
formulate a learning task to repair the failure
Some strategies for generating new tasks
• Collect more data from web
– To learn about specific entities (e.g., “Rolling Stones”)
– To learn meaning of particular language (e.g., “will attend”)
– To locate easy-to-extract facts (e.g., web pages with lists)
• Learn regularities from the populated KB
– e.g., "Most LTI office names are of the form 'NSH dddd'"
• Explore specializations of ontological categories
– What distinguishes events occurring on CMU campus from
those occurring elsewhere? Can this be predicted?
What subsets of events warrant becoming categories?
• Explore specializations of language structures
– Which ‘location’ entities share surrounding language?
e.g., “the city of ?x,” Do they share other properties?
Some Types of Knowledge to Learn
• Linguistic regularities
– {“spoon”,”fork”,”chopsticks”} occur often in “eat with my ___”
– They’re instances of ontology class “eating implement”
• HTML layout regularities
– HTML lists often contain items of the same class
• Web site regularities
– University departments often have page listing all faculty
• Regularities over extracted facts
– ‘Professors typically have more publications than their advisees’
– ‘Professors typically received their BS degree before their advisees’
• Temporal stability
– Birthdays don’t change. Stock prices do.
Research Issues
• What target knowledge representation?
• How can initial ontology be extended?
• What types of self-reflection are required?
• Can one learn language without non-linguistic knowledge?
• How can we manage mapping between text tokens and non-text entities they describe?
• What curriculum for staging the learning?
• What active learning methods?
More Revolutionary Research Directions
• Can we design new kinds of computer programming languages
with explicit learning primitives?
• Can we build robot scientists?
• What are the fundamental tradeoffs between computational
efficiency and statistical efficiency?
• How can we build systems that learn from instruction, dialogs
and problem sets, in addition to labeled examples?
• How can we unify machine learning theories and models with
those from other fields studying adaptation, e.g., adaptive control
theory, economics, evolution?
Summary
• Machine Learning research is (and should be more)
connected to understanding all learning processes
• Field is ripe for new revolutionary directions:
– Computational models for human learning
– Never-ending learners
– <your idea here>
Thank you!