Transcript Overview

G. Jogesh Babu
Overview of Astrostatistics
A brief description of modern astronomy & astrophysics.
Many statistical concepts have their roots in astronomy (starting with Hipparchus in the 2nd c. BC).
Relevance of statistics in astronomy today.
State of astrostatistics today.
Methodological challenges for astrostatistics in the 2000s.
Descriptive Statistics
Introduction to the R programming language, an integrated suite of software facilities for data manipulation, calculation, and graphical display.
Descriptive statistics helps in extracting the basic features of data and provides summaries about the sample and the measures.
Commonly used techniques, such as graphical description, tabular description, and summary statistics, are illustrated through R.
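As a hedged illustration, a minimal R sketch of these techniques; the magnitude values below are invented for the example:

    mag <- c(12.3, 13.1, 11.8, 12.9, 14.2, 13.5, 12.7, 13.0)  # hypothetical data

    summary(mag)                   # five-number summary plus mean
    mean(mag); sd(mag)             # location and spread
    table(cut(mag, breaks = 3))    # simple tabular description
    hist(mag)                      # graphical description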
Exploratory Data Analysis
An approach/philosophy for data analysis that employs a variety of techniques (mostly graphical; see the R sketch after this list) to:
– maximize insight into a data set
– uncover underlying structure
– extract important variables
– detect outliers and anomalies
– formulate hypotheses worth testing
– develop parsimonious models
– provide a basis for further data collection through surveys or experiments
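A minimal sketch of such exploratory graphics in R, using the built-in iris data as a stand-in for an astronomical catalog:

    data(iris)                      # built-in data set, standing in for a catalog
    str(iris)                       # structure: variables and their types
    pairs(iris[, 1:4])              # scatterplot matrix: underlying structure
    boxplot(iris[, 1:4])            # spread, skewness, and outliers at a glance
    apply(iris[, 1:4], 2, median)   # robust per-variable summaries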
Probability Theory
Conditional probability & Bayes' theorem (Bayesian analysis)
Expectation, variance, standard deviation (unit-free estimates)
Density of a continuous random variable (as opposed to density defined in physics)
Normal (Gaussian) distribution, Chi-square distribution (not the Chi-square statistic)
Probability inequalities and the CLT (central limit theorem)
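As one hedged illustration, a short R simulation of the CLT: means of samples from a skewed exponential distribution approach a normal shape. The sample size and repetition count are arbitrary choices:

    set.seed(1)
    n <- 30                                      # size of each sample
    means <- replicate(5000, mean(rexp(n)))      # 5000 sample means from Exp(1)
    hist(means, breaks = 40, freq = FALSE)
    curve(dnorm(x, mean = 1, sd = 1 / sqrt(n)),  # normal approximation by the CLT
          add = TRUE, lwd = 2)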
Correlation & Regression
Correlation coefficient
Underlying principles of linear and multiple linear regression
Least squares estimation
Ridge regression
Principal components
Linear regression issues in astronomy
Compares different regression lines used in astronomy and illustrates them with the Faber-Jackson relation.
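A minimal least-squares sketch in R; x and y are synthetic stand-ins for, say, the log velocity dispersion and log luminosity entering a Faber-Jackson-type fit:

    set.seed(2)
    x <- rnorm(100)
    y <- 2 + 0.5 * x + rnorm(100, sd = 0.3)   # simulated linear relation

    cor(x, y)                                 # correlation coefficient
    fit <- lm(y ~ x)                          # ordinary least squares fit
    summary(fit)                              # slope, intercept, standard errors
    plot(x, y); abline(fit)                   # data with the fitted line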
Statistical Inference
While descriptive statistics provides tools to describe what the data show, statistical inference helps in reaching conclusions that extend beyond the immediate data alone.
Statistical inference helps in judging whether an observed difference between groups is dependable or might have happened by chance in a study.
Topics to be covered include:
– Point estimation
– Confidence intervals for unknown parameters
– Principles of testing of hypotheses
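A minimal inference sketch in R on two simulated groups (the group means are invented):

    set.seed(3)
    a <- rnorm(40, mean = 10.0)   # group 1 measurements (simulated)
    b <- rnorm(40, mean = 10.4)   # group 2 measurements (simulated)

    mean(a) - mean(b)             # point estimate of the difference
    t.test(a, b)                  # test of hypothesis, with a 95% confidence interval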
Maximum Likelihood Estimation
Likelihood differs from probability:
– probability refers to the occurrence of future events
– while likelihood refers to past events with known outcomes
MLE is used for fitting a mathematical model to data.
Modeling real-world data by maximizing the likelihood offers a way of tuning the free parameters of the model to provide a good fit.
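A hedged MLE sketch in R using optim() on a normal model; the log-sd parametrization is only a convenience to keep the search unconstrained:

    set.seed(4)
    x <- rnorm(200, mean = 5, sd = 2)   # simulated data with known truth

    # Negative log-likelihood of a normal model; p = (mean, log sd)
    negloglik <- function(p) -sum(dnorm(x, mean = p[1], sd = exp(p[2]), log = TRUE))

    fit <- optim(c(0, 0), negloglik)             # tune the free parameters
    c(mean = fit$par[1], sd = exp(fit$par[2]))   # MLEs, close to (5, 2) here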
MLE Contd.
Thomas Hettmansperger's lecture includes:
– The maximum likelihood method for linear regression, an alternative to the least squares method
– The Cramér-Rao inequality, which sets a lower bound on the error (variance) of an estimator of a parameter; it helps in finding the `best' estimator.
Analysis of data from two or more different populations involves mixture models.
– The likelihood calculations are difficult, so an iterative device called the EM algorithm will be introduced. Computations are illustrated in the lab; a minimal sketch follows.
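The sketch below runs EM for a two-component normal mixture with unit variances, a deliberate simplification (the general case also updates the variances); data and starting values are invented:

    set.seed(5)
    x <- c(rnorm(150, 0), rnorm(50, 4))   # data from two mixed populations
    p <- 0.5; mu <- c(-1, 1)              # crude starting values

    for (it in 1:100) {
      # E-step: posterior probability that each point came from component 2
      d1 <- (1 - p) * dnorm(x, mu[1])
      d2 <- p * dnorm(x, mu[2])
      w  <- d2 / (d1 + d2)
      # M-step: update the mixing proportion and the component means
      p  <- mean(w)
      mu <- c(sum((1 - w) * x) / sum(1 - w), sum(w * x) / sum(w))
    }
    c(p, mu)   # roughly 0.25, 0, and 4 for these simulated data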
Nonparametric Statistics
These statistical procedures make no assumptions about the probability distributions of the population.
The model structure is not specified a priori but is instead determined from the data.
As nonparametric methods make fewer assumptions, their applicability is much wider.
Procedures described include:
– Sign test
– Mann-Whitney two sample test
– Kruskal-Wallis test for comparing several samples
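A minimal sketch of these tests in R on simulated samples (distributions and rates are arbitrary):

    set.seed(6)
    a <- rexp(30); b <- rexp(30, rate = 0.7); d <- rexp(30, rate = 0.5)

    m0 <- log(2)                          # true median of Exp(1)
    binom.test(sum(a > m0), length(a))    # sign test of H0: median = m0
    wilcox.test(a, b)                     # Mann-Whitney two-sample test
    kruskal.test(list(a, b, d))           # Kruskal-Wallis for several samples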
Bayesian Inference
As evidence accumulates, the degree of belief in a hypothesis ought to change.
Bayesian inference takes prior knowledge into account.
The quality of a Bayesian analysis depends on how well one can convert the prior information into a mathematical prior probability.
Tom Loredo describes methods for parameter estimation, model assessment, etc., and illustrates them with examples from astronomy.
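As a hedged illustration of updating prior belief, a conjugate beta-binomial sketch in R; the prior and the counts are invented:

    prior_a <- 2; prior_b <- 2   # Beta(2, 2) prior: belief centered at 0.5
    k <- 7; n <- 10              # hypothetical data: 7 successes in 10 trials

    post_a <- prior_a + k        # conjugate update of the prior
    post_b <- prior_b + n - k
    curve(dbeta(x, post_a, post_b), 0, 1, lwd = 2)           # posterior
    curve(dbeta(x, prior_a, prior_b), add = TRUE, lty = 2)   # prior
    qbeta(c(0.025, 0.975), post_a, post_b)                   # 95% credible interval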
Multivariate Analysis
Analysis of data on two or more attributes (variables) that may depend on each other:
– Principal components analysis, to reduce the number of variables
– Canonical correlation
– Tests of hypotheses
– Confidence regions
– Multivariate regression
– Discriminant analysis (supervised learning)
Computational aspects are covered in the lab.
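A minimal multivariate sketch in R: principal components on the iris measurements, plus discriminant analysis assuming the MASS package (shipped with R, but not loaded by default) is available:

    data(iris)
    pc <- prcomp(iris[, 1:4], scale. = TRUE)   # principal components analysis
    summary(pc)                                # variance explained per component
    biplot(pc)                                 # data in the first two components

    library(MASS)                              # assumed to be installed
    lda(Species ~ ., data = iris)              # discriminant analysis (supervised)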
Bootstrap
How to get the most out of repeated use of the data.
The bootstrap is similar to the Monte Carlo method, but the `simulation' is carried out from the data itself.
It is a very general, mostly nonparametric procedure, and is widely applicable.
Applications to regression, cases where the procedure fails, and cases where it outperforms traditional procedures will also be discussed.
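A minimal bootstrap sketch in R, estimating the standard error of a median by resampling the data itself; the sample is simulated here:

    set.seed(7)
    x <- rexp(50)                       # one observed sample (simulated)
    meds <- replicate(2000, median(sample(x, replace = TRUE)))  # bootstrap replicates

    sd(meds)                            # bootstrap standard error of the median
    quantile(meds, c(0.025, 0.975))     # percentile confidence interval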
Goodness of Fit
Curve (model) fitting or goodness of fit using the bootstrap procedure.
Procedures like the Kolmogorov-Smirnov test do not work in the multidimensional case, or when the parameters of the curve are estimated. The bootstrap comes to the rescue.
Some of these procedures are illustrated using R in a lab session on hypothesis testing and bootstrapping.
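A hedged sketch of the estimated-parameter problem and one bootstrap remedy, using ks.test() on a fitted normal; data and replication counts are illustrative:

    set.seed(8)
    x <- rnorm(100, mean = 3, sd = 2)
    # Naive KS statistic with parameters estimated from the same data;
    # the p-value printed by ks.test() is not valid in this situation
    d_obs <- ks.test(x, "pnorm", mean(x), sd(x))$statistic

    # Parametric bootstrap: simulate from the fitted model and
    # re-estimate the parameters each time
    d_boot <- replicate(1000, {
      y <- rnorm(length(x), mean(x), sd(x))
      ks.test(y, "pnorm", mean(y), sd(y))$statistic
    })
    mean(d_boot >= d_obs)   # bootstrap p-value for the fit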
Model Selection, Evaluation, and Likelihood Ratio Tests
The model selection procedures covered include:
– Chi-square test
– Rao's score test
– Likelihood ratio test
– Cross-validation
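A minimal R sketch of a likelihood ratio test between nested linear models; the data and the irrelevant predictor are simulated:

    set.seed(9)
    x1 <- rnorm(100); x2 <- rnorm(100)
    y  <- 1 + 2 * x1 + rnorm(100)   # x2 is truly irrelevant

    m0 <- lm(y ~ x1)                # reduced model
    m1 <- lm(y ~ x1 + x2)           # full model
    lr <- 2 * as.numeric(logLik(m1) - logLik(m0))   # likelihood ratio statistic
    pchisq(lr, df = 1, lower.tail = FALSE)          # asymptotic chi-square p-value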
Time Series & Stochastic Processes
Time domain procedures
State space models
Kernel smoothing
Poisson processes
Spectral methods for inference
A brief discussion of the Kalman filter
Illustrations with examples from astronomy
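A minimal time-domain and spectral sketch in R on a simulated AR(1) series; StructTS() fits a simple state space model via the Kalman filter. The model and parameters are illustrative only:

    set.seed(10)
    x <- arima.sim(model = list(ar = 0.6), n = 500)   # simulated AR(1) series

    acf(x)                           # time-domain dependence structure
    arima(x, order = c(1, 0, 0))     # AR(1) fit in the time domain
    spectrum(x)                      # periodogram: spectral view
    StructTS(x, type = "level")      # local-level state space model (Kalman filter)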
Markov Chain Monte Carlo (MCMC)
MCMC methods are a collection of techniques that use pseudo-random (computer-simulated) values to estimate solutions to mathematical problems.
MCMC for Bayesian inference
Illustration of MCMC for the evaluation of expectations with respect to a distribution
MCMC for estimation of maxima or minima of functions
MCMC procedures are successfully used in the search for extra-solar planets
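A minimal random-walk Metropolis sketch in R, drawing from a standard normal target to estimate expectations; the target and tuning are arbitrary illustrations:

    set.seed(11)
    target <- function(t) dnorm(t)   # target density, known up to a constant

    n <- 10000
    chain <- numeric(n)              # chain starts at 0
    for (i in 2:n) {
      prop <- chain[i - 1] + rnorm(1)   # random-walk proposal
      if (runif(1) < target(prop) / target(chain[i - 1]))
        chain[i] <- prop                # accept the move
      else
        chain[i] <- chain[i - 1]        # stay put
    }
    mean(chain); var(chain)             # Monte Carlo estimates of E[X] and Var[X]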
Spatial Statistics
Spatial point processes
Intensity function
Homogeneous and inhomogeneous Poisson processes
Estimation of Ripley's K function (useful for point pattern analysis)
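A hedged sketch of estimating Ripley's K in R; it assumes the add-on spatstat package is installed (it is not part of base R):

    library(spatstat)               # assumed to be installed
    set.seed(12)
    pp <- rpoispp(lambda = 100)     # homogeneous Poisson pattern on the unit square
    K  <- Kest(pp)                  # estimate of Ripley's K function
    plot(K)                         # compare with the Poisson benchmark pi * r^2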
Cluster Analysis
Data mining techniques
Classifying data into clusters:
– k-means
– Model-based clustering
– Single linkage (friends of friends)
– Complete linkage clustering algorithm
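A minimal clustering sketch in R on the iris measurements, covering k-means and the two linkage methods named above:

    data(iris)
    X <- scale(iris[, 1:4])                # standardized measurements
    d <- dist(X)                           # pairwise distances

    km <- kmeans(X, centers = 3)           # k-means partition into 3 clusters
    table(km$cluster, iris$Species)        # compare clusters with known classes

    plot(hclust(d, method = "single"))     # single linkage (friends of friends)
    plot(hclust(d, method = "complete"))   # complete linkage dendrogram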