
DD2447, DD3342, fall 2011
Statistical Methods in
Applied Computer Science
Stefan Arnborg, KTH
http://www.nada.kth.se/~stefan
SYLLABUS
Common statistical models and their use:
Bayesian, testing, and fiducial statistical philosophy
Hypothesis choice
Parametric inference
Non-parametric inference
Elements of regression
Clustering
Graphical statistical models
Prediction and retrodiction
Chapman-Kolmogorov formulation
Evidence theory, estimation and combination of evidence.
Support Vector Machines and Kernel methods
Vovk/Gammerman hedged prediction technology
Stochastic simulation, Markov Chain Monte Carlo.
Variational Bayes
LEARNING GOALS
After successfully taking this course, you will be able to:
-motivate the use of uncertainty management and statistical
methodology in computer science applications, as well as
the main methods in use,
-account for algorithms used in the area and use
the standard tools,
-critically evaluate the applicability of these methods in new
contexts, and design new applications of
uncertainty management,
-follow research and development in the area.
GRADING
DD2447: Bologna grades
Grades are E-A during 2009. Turning in 70% of the homeworks,
with a very short oral discussion of them, gives grade C; less gives F-D.
For higher grades, essentially all homeworks should be
turned in on time. Alternative assignments will be substituted
for those homeworks you miss.
For grade B you must pass one Master's test, for grade A you
must do two Master's tests or a project with some research
content.
DD3342: Pass/Fail
Research-level project, or a deeper study of part of the course.
Course analyses from previous years are available on the previous course pages.
Applications of Uncertainty
everywhere
Medical Imaging/Research (Schizophrenia)
Land Use Planning
Environmental Surveillance and Prediction
Finance and Stock
Marketing into Google
Robot Navigation and Tracking
Security and Military
Performance Tuning
…
Some Master’s Projects using
this syllabus (subset)
• Recommender system for Spotify
• Behavior of mobile phone users
• Recommender system for book club
• Recommender for job search site
• Computations in evolutionary genetics
• Gene hunting
• Psychiatry: genes, anatomy, personality
• Command and control: Situation awareness
• Diagnosing drilling problems
• Speech, Music, …
Aristotle: Logic
Logic as a semi-formal system was
created by Aristotle, probably inspired
by contemporary practice in mathematical
argument.
There is no record of Aristotle himself
applying logic, but the Elements of
Euclid probably derives from Aristotle's
illustrations of the logical method.
What role does logic play in Computer Science?
Visualization
• Visualize data in such a way that the
important aspects are obvious - A
good visualization strikes you as a
punch between your eyes
(Tukey, 1970)
• Pioneered by Florence Nightingale,
first female member of the
Royal Statistical Society, inventor of
pie charts and performance metrics
Probabilistic approaches
• Bayes: Probability conditioned by observation
• Cournot: An event with very small probability
will not happen.
• Vapnik-Chervonenkis: VC-dimension and PAC,
distribution-independence
• Kolmogorov/Vovk: A sequence is random if it
cannot be compressed
Peirce: Abduction and
uncertainty
Aristotle's induction, generalizing
from particulars, is considered invalid
by strict deductionists.
Peirce made the concept clear, or at
least confused on a higher level.
Abduction is verification by finding
a plausible explanation. Key process
in scientific progress.
Sherlock Holmes:
common sense inference
Techniques used by Sherlock are
modeled on Conan Doyle’s
professor in medical school,
who followed the
methodological tradition of
Hippocrates and Galen.
Abductive reasoning, first
spelled out by Peirce, is found in
217 instances in Sherlock
Holmes adventures - 30 of them
in the first novel, ‘A study in
Scarlet’.
Thomas Bayes,
amateur mathematician
If we have a probability model
of the world we know how to
compute probabilities of events.
But is it possible to learn about
the world from events we see?
Bayes’ proposal was forgotten
but rediscovered by Laplace.
Antoine Augustin Cournot (1801-1877)
Pioneer in stochastic processes, market theory
and structural post-modernism. Predicted the
demise of the academic system due to discourses
of administration and excellence (cf. Readings).
An alternative to Bayes' method -
hypothesis testing - is based on
'Cournot's Bridge':
an event with very small
probability will not happen.
Kolmogorov and randomness
Andrei Kolmogorov (1903-1987) is the
mathematician best known for shaping
probability theory into a modern axiomatized
theory. His axioms of probability tell how
probability measures are defined, also on
infinite and infinite-dimensional event spaces
and complex product spaces.
Kolmogorov complexity characterizes a
random string by the smallest size of a
description of it. Used to explain
Vovk/Gammerman scheme of hedged
prediction. Also used in MDL
(Minimum Description Length) inference.
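The idea that a string is random if it cannot be compressed can be illustrated with an off-the-shelf compressor as a crude, computable stand-in for (uncomputable) Kolmogorov complexity. This is only a sketch; the choice of zlib and the two test strings are illustrative assumptions, not part of the course material:

```python
# Kolmogorov complexity is uncomputable, but a general-purpose compressor
# gives an upper bound on description length: a regular string compresses
# well, while a coin-flip string barely compresses at all.
import random
import zlib

def compressed_size(s: str) -> int:
    # Length in bytes of the zlib-compressed encoding of s.
    return len(zlib.compress(s.encode()))

regular = "ab" * 500  # highly regular: a short description exists
random.seed(0)
noisy = "".join(random.choice("ab") for _ in range(1000))  # "random" string

print(compressed_size(regular))  # tiny: the pattern is fully exploited
print(compressed_size(noisy))    # much larger: little structure to exploit
```

The gap between the two sizes is exactly what the Vovk/Gammerman scheme and MDL exploit: structure means a short description.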
Normative claim of Bayesianism
• EVERY type of uncertainty should be
treated as probability
• This claim is controversial and not
universally accepted: Fisher (1922), Cramér,
Zadeh, Dempster, Shafer, Walley (1999) …
• Students encounter many approaches to
uncertainty management and identify
weaknesses in foundational arguments.
Foundations for Bayesian Inference
• Bayes' method, the first documented method
based on probability: the plausibility of an event
depends on observation. Bayes' rule:
f(λ | D) ∝ f(D | λ) f(λ)
• Bayes' rule is an organizing principle for uncertainty
• Parameter and observation spaces can be extremely
complex, priors and likelihoods also.
• MCMC is the current approach -- often but not always
applicable (difficult when the posterior has many local
maxima separated by low-density regions)
• Variational Bayes -- approximate the posterior by a
factorized function; the result is also approximate.
Showcase application:
PET-camera
f(λ | D) ∝ f(D | λ) f(λ)
posterior ∝ likelihood (camera geometry & noise, film) × prior (scene regularity)
-- and also any other camera or imaging device …
D: film, count by detector pair j
λ: radioactivity in voxel i
a: camera geometry
Inference about λ gives the posterior; its mean is often a good picture.
[Figures: sinogram and reconstruction; tumour; fruit fly, Drosophila family (X-ray)]
Introduction
GOMOS (Global Ozone Monitoring by Occultation of
Stars)
The Royal Statistical Society
London 10 December 2003
Markov chain Monte Carlo methods for high
dimensional inversion in remote sensing
Heikki Haario (1), Marko Laine (1), Markku Lehtinen (2),
Eero Saksman (3) and Johanna Tamminen (4)
(1) University of Helsinki, Finland
(2) University of Oulu, Sodankylä, Finland
(3) University of Jyväskylä, Finland
(4) Finnish Meteorological Institute, Helsinki, Finland
WIRED on Total Information Awareness
The WIRED article "Total Info System Totally Touchy" (Dec 2, 2002)
discusses the Total Information Awareness system.
"People have to move and plan before committing a terrorist act. Our
hypothesis is their planning process has a signature."
Jan Walker, Pentagon spokeswoman, in Wired, Dec 2, 2002.
"What's alarming is the danger of false
positives based on incorrect data,"
Herb Edelstein, in Wired, Dec 2, 2002.
Combination of evidence
f(λ | D) ∝ f(D | λ) f(λ)
f(λ | {d1, d2}) ∝ f(d1 | λ) f(d2 | λ) f(λ)
In Bayes' method, evidence enters as the likelihood of the observation.
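The combination rule f(λ | {d1, d2}) ∝ f(d1 | λ) f(d2 | λ) f(λ) says that conditionally independent pieces of evidence combine by multiplying likelihoods, and that combining them jointly or one at a time gives the same posterior. A sketch on a two-point parameter space; the diagnostic scenario and all numbers are illustrative assumptions:

```python
# Two conditionally independent observations d1, d2 about a state lambda.
lambdas = ["healthy", "ill"]
prior = {"healthy": 0.9, "ill": 0.1}
lik1 = {"healthy": 0.2, "ill": 0.8}   # f(d1 | lambda): first test positive
lik2 = {"healthy": 0.3, "ill": 0.9}   # f(d2 | lambda): second test positive

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

# Joint combination: multiply both likelihoods into the prior at once.
joint = normalize({l: lik1[l] * lik2[l] * prior[l] for l in lambdas})

# Sequential combination: update on d1, use the result as the prior for d2.
after_d1 = normalize({l: lik1[l] * prior[l] for l in lambdas})
seq = normalize({l: lik2[l] * after_d1[l] for l in lambdas})

print(joint, seq)  # the two routes agree
```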
Particle filter: general tracking
Chapman-Kolmogorov version
of Bayes' rule
f(λ_t | D_t) ∝ f(d_t | λ_t) ∫ f(λ_t | λ_{t-1}) f(λ_{t-1} | D_{t-1}) dλ_{t-1}
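A bootstrap particle filter approximates this update with samples: push particles through the dynamics f(λ_t | λ_{t-1}) (the integral), weight them by the likelihood f(d_t | λ_t), and resample. A minimal sketch; the 1-D random-walk model, noise levels, and particle count are illustrative assumptions:

```python
# Bootstrap particle filter for a 1-D state observed in Gaussian noise.
import math
import random

random.seed(1)

N = 2000                        # number of particles
PROC_STD, OBS_STD = 0.5, 1.0    # process and observation noise std devs

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def step(particles, observation):
    # Predict: sample from the dynamics f(lambda_t | lambda_{t-1}).
    predicted = [p + random.gauss(0.0, PROC_STD) for p in particles]
    # Update: weight each particle by the likelihood f(d_t | lambda_t).
    weights = [gauss_pdf(observation, p, OBS_STD) for p in predicted]
    # Resample proportionally to weight (multinomial resampling).
    return random.choices(predicted, weights=weights, k=N)

particles = [random.gauss(0.0, 5.0) for _ in range(N)]  # diffuse initial belief
true_state = 2.0
for _ in range(10):
    obs = true_state + random.gauss(0.0, OBS_STD)
    particles = step(particles, obs)

estimate = sum(particles) / N   # posterior mean estimate of the state
print(estimate)
```

With more observations the particle cloud concentrates near the true state; the resampling step is what keeps the approximation from degenerating.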
Berry and Linoff have eloquently stated their preferences with
the often quoted sentence:
"Neural networks are a good choice for most classification problems
when the results of the model are more important than understanding
how the model works".
“Neural networks typically give the right answer”
1950-1980: The age of rationality. Let us describe the world with
a mathematical model and compute the best way to manage it!!
Ed Jaynes devoted a large
part of his career to promoting
Bayesian inference.
He also championed the
use of Maximum Entropy in physics.
Outside physics, he met
resistance from people who had
already invented other methods.
Why should statistical mechanics
say anything about our daily human
world?
Robust Bayes
• Priors and likelihoods are convex sets of probability
distributions (Berger, de Finetti, Walley, ...): imprecise
probability:
f(λ | D) ∝ f(D | λ) f(λ)
F(λ | D) ∝ F(D | λ) F(λ)
• Every member of the posterior is a 'parallel combination' of
one member of the likelihood and one member of the prior.
• For decision making: Jaynes recommends using the
member of the posterior with maximum entropy (Maxent
estimate).
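The mechanics of carrying a set of priors through Bayes' rule can be shown on a two-point parameter space: every prior in the set yields its own posterior, so the posterior probability becomes an interval. A sketch; the diagnostic likelihoods and the prior interval are illustrative assumptions (here the posterior is monotone in the prior, so evaluating the interval endpoints suffices):

```python
# Robust Bayes: an imprecise prior P(ill) in [0.05, 0.30] is mapped by
# Bayes' rule into an interval of posterior probabilities.

lik = {"healthy": 0.2, "ill": 0.8}   # likelihood of the observed test result

def posterior_ill(prior_ill):
    # Standard Bayes' rule for the two-point parameter space.
    num = lik["ill"] * prior_ill
    den = num + lik["healthy"] * (1 - prior_ill)
    return num / den

lo, hi = posterior_ill(0.05), posterior_ill(0.30)
print(lo, hi)  # the posterior set, as an interval of P(ill | D)
```

Note how a modest prior interval can widen considerably after conditioning, which is one reason decision making with imprecise probability needs an extra principle such as Jaynes' Maxent choice.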
SVM and Kernel method
Based on Vapnik-Chervonenkis learning theory
Separate classes by a wide-margin hyperplane classifier,
or enclose data points between close parallel hyperplanes
for regression
Possibly after a non-linear mapping to a high-dimensional space
The only assumption is exchangeability of the data points
Given a training sequence ((xi, yi), i = 1…N),
find y(N+1) given x(N+1).
Y discrete: classification;
Y real-valued: regression.
Classify with hyperplanes
Frank Rosenblatt (1928 – 1971)
Pioneering work in classifying by
hyperplanes in high-dimensional
spaces.
Criticized by Minsky and Papert, since
real classes are normally not
linearly separable.
ANN research was taken up again in
the 1980s, with non-linear mappings
to get improved separation.
Predecessor to SVM/kernel methods
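Rosenblatt's perceptron update is a few lines: whenever a point is misclassified by the current hyperplane, nudge the hyperplane toward it; on linearly separable data this converges to a separating hyperplane (though not a wide-margin one, which is the SVM refinement). A sketch; the toy 2-D data set is an illustrative assumption:

```python
# Perceptron learning of a separating hyperplane w.x + b = 0.
# Labels are +1 / -1; the four points below are linearly separable.

data = [((2.0, 1.0), 1), ((3.0, 2.5), 1),
        ((-1.0, -1.5), -1), ((-2.0, 0.5), -1)]

w = [0.0, 0.0]
b = 0.0
for _ in range(50):                                 # epochs
    for (x1, x2), y in data:
        if y * (w[0] * x1 + w[1] * x2 + b) <= 0:    # misclassified point
            w[0] += y * x1                           # move the hyperplane
            w[1] += y * x2                           # toward the point
            b += y

errors = sum(1 for (x1, x2), y in data
             if y * (w[0] * x1 + w[1] * x2 + b) <= 0)
print(errors)  # 0: all training points correctly classified
```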
Find parallel hyperplanes
Classification
Red: true separating
plane.
Blue: wide-margin
separation in the sample.
Classify by the plane
between the blue planes.
Vovk/Gammerman Hedged
predictions
• Based on Kolmogorov complexity or a
nonconformity measure
• In classification, each prediction comes
with a confidence
• Asymptotically, misclassifications
appear independently and with
probability 1 - confidence.
• The only assumption is exchangeability
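The hedged-prediction mechanics can be sketched in a few lines: for each candidate label, add the test point with that label to the bag, score how nonconforming every example is, and turn the test point's rank into a p-value; labels whose p-value exceeds the significance level stay in the prediction set. The tiny data set and the nearest-same-label-neighbor nonconformity measure are illustrative assumptions:

```python
# Minimal conformal prediction sketch (Vovk/Gammerman style).

train = [(1.0, "a"), (1.2, "a"), (0.9, "a"),
         (5.0, "b"), (5.3, "b"), (4.8, "b")]

def nonconformity(point, label, bag):
    # Distance to the nearest other example with the same label:
    # large distance = the example does not conform to its label.
    same = [abs(point - q) for q, l in bag
            if l == label and (q, l) != (point, label)]
    return min(same) if same else float("inf")

def p_value(x, label):
    # Exchangeability: under the candidate label, the test point's
    # nonconformity rank is uniform, giving a valid p-value.
    bag = train + [(x, label)]
    scores = [nonconformity(q, l, bag) for q, l in bag]
    test_score = scores[-1]
    return sum(1 for s in scores if s >= test_score) / len(bag)

x_new = 1.1
pvals = {lab: p_value(x_new, lab) for lab in ("a", "b")}
prediction_set = [lab for lab, p in pvals.items() if p > 0.2]
print(pvals, prediction_set)  # only label "a" survives at this level
```

Lowering the significance level enlarges the prediction set; the validity guarantee is that, in the long run, the true label falls outside the set with frequency at most the chosen level.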