Transcript lecture2
EART20170 Computing, Data Analysis & Communication skills
Lecturer: Dr Paul Connolly (F18 – Sackville Building)
[email protected]
1. Data analysis (statistics)
3 lectures & practicals
statistics open-book test (2 hours)
2. Computing (Excel statistics/modelling)
2 lectures
assessed practical work
Course notes etc: http://cloudbase.phy.umist.ac.uk/people/connolly
Recommended reading: Cheeney. (1983) Statistical methods in
Geology. George, Allen & Unwin
Recap – last lecture
The four measurement scales: nominal,
ordinal, interval and ratio.
There are two types of errors: random errors
(precision) and systematic errors (accuracy).
Basic graphs: histograms, frequency
polygons, bar charts, pie charts.
Gaussian statistics describe random errors.
The central limit theorem
Central values, dispersion, symmetry
Weighted mean.
Some common problems
X = 1, 4, 6, 3, 7, 4 = [x_1, x_2, x_3, x_4, x_5, x_6]

$\sum_{i=1}^{N} x_i$ and $\sum_{i=1}^{N} (x_i - \bar{x})^2$

Use tables:

x        x - \bar{x}    (x - \bar{x})^2
1        -3.1667        10.0278
4        -0.1667         0.0278
6         1.8333         3.3611
3        -1.1667         1.3611
7         2.8333         8.0278
4        -0.1667         0.0278
Total:
25        0             22.8333

$\bar{x} = 25 / 6 = 4.1667$
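The table above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative and not part of the course notes.

```python
# Data from the slide.
xs = [1, 4, 6, 3, 7, 4]

n = len(xs)
total = sum(xs)                      # sum of x_i
mean = total / n                     # x-bar

deviations = [x - mean for x in xs]  # x_i - x-bar
squared = [d ** 2 for d in deviations]
sum_sq = sum(squared)                # sum of (x_i - x-bar)^2

# Print the same three columns as the table, then the column totals.
print(f"{'x':>4} {'x - mean':>10} {'(x - mean)^2':>14}")
for x, d, s in zip(xs, deviations, squared):
    print(f"{x:>4} {d:>10.4f} {s:>14.4f}")
print(f"{total:>4} {sum(deviations):>10.4f} {sum_sq:>14.4f}")
```

The deviations always sum to zero (up to rounding), which is a quick check that the mean was computed correctly.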
Lecture 2
Correlation between two variables
Classical linear regression
Reduced major axis regression
Propagation of errors in compound
quantities.
Correlation
Many real-life quantities depend on something else,
e.g. the dependence of rock permeability on porosity.
How can we quantify the strength and
direction of a linear relationship between X
and Y variables?
Correlation
Linear correlation (Pearson's coefficient)
$r = \frac{\sum xy - \frac{\sum x \sum y}{N}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{N}\right)\left(\sum y^2 - \frac{(\sum y)^2}{N}\right)}}$
$\sum y$ = sum of all y-values
$\sum x$ = sum of all x-values
$\sum x^2$ = sum of all $x^2$ values
$\sum y^2$ = sum of all $y^2$ values
$\sum xy$ = sum of the x times y values
Like other numerical measures, the population correlation
coefficient is denoted by $\rho$ (the Greek letter "rho") and the sample
correlation coefficient is denoted by r.
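As a sketch, Pearson's coefficient can be computed directly from the five sums defined above. The function name `pearson_r` and the data values are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient r from the column sums."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

# Hypothetical data: y increases exactly linearly with x, then decreases.
x_values = [1, 2, 3, 4, 5]
r_up = pearson_r(x_values, [2, 4, 6, 8, 10])
r_down = pearson_r(x_values, [10, 8, 6, 4, 2])
print(r_up, r_down)
```

The function divides by zero if all the x-values (or all the y-values) are identical, because r is undefined in that case.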
Correlation
Values of r
[Figure: three scatter plots of y against x. r = +1: perfect positive correlation. r = -1: perfect negative correlation. r = 0: no correlation.]
Correlation
r² is the fraction of the variation in x and y that is explained by the
linear relationship. It is often called the 'goodness of fit'.
E.g. if r = 0.97 is obtained then r² = 0.94, so 100 × 0.94 = 94% of
the total variation in x and y is explained by the linear
relationship, and the remaining 6% of the variation is due to "other"
causes.
[Figure: fraction of explained variation, r² (0.0 to 1.0), plotted against correlation coefficient, r (-1.0 to +1.0).]
Regression analysis
How can we fit an equation to a set of
numerical data x, y such that it yields the best
fit for all the data?
Classical linear regression
An approximate fit yields a straight line that
passes through the set of points in the best
possible manner without being required to
pass exactly through any of the points.
Classical linear regression
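Linear regression
$y = mx + c$
[Figure: data points scattered about the fit line, showing the gradient m, the intercept c and the deviation e_i of one data point from the line.]
Where e_i is the deviation of the data point from the fit line, c is
the intercept and m is the gradient.
Assumes that the error is present only in y.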
How do we define a good fit?
If the sum of all deviations is a minimum? $\sum e_i$
If the sum of all the absolute deviations is a minimum? $\sum |e_i|$
If the maximum deviation is a minimum? $e_{max}$
If the sum of all the squares of the deviations is a minimum? $\sum e_i^2$
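The four candidate criteria can be compared on a small made-up data set. The sketch below (hypothetical data and function names) evaluates each criterion for two trial lines. It illustrates one weakness of the first criterion: positive and negative deviations cancel, so an obviously poor line can still give a sum of zero.

```python
def deviations(xs, Ys, m, c):
    """Deviations e_i of the data Y_i from the trial line y = m*x + c."""
    return [Y - (m * x + c) for x, Y in zip(xs, Ys)]

def criteria(e):
    """The four candidate measures of fit for a list of deviations."""
    return {
        "sum": sum(e),
        "sum_abs": sum(abs(v) for v in e),
        "max_abs": max(abs(v) for v in e),
        "sum_sq": sum(v * v for v in e),
    }

# Hypothetical data lying exactly on y = x.
xs = [0.0, 1.0, 2.0]
Ys = [0.0, 1.0, 2.0]

exact = criteria(deviations(xs, Ys, m=1.0, c=0.0))  # the true line
flat = criteria(deviations(xs, Ys, m=0.0, c=1.0))   # horizontal line through the mean

print("exact:", exact)
print("flat: ", flat)
```

Both lines give a sum of deviations of zero, but only the exact line gives a sum of squares of zero.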
Classical linear regression
The best way is to minimise the sum of the squares
of the deviations. Formally this involves some
mathematics:
At each value of $x_i$:
$y_i = m x_i + c$
Therefore the deviations from the curve are:
$e_i = (Y_i - y_i)$
The sum of the squares:
$S(c, m) = \sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (Y_i - c - m x_i)^2$
Classical linear regression
How do you find the minimum of a function?
Use calculus
Differentiate and set to zero
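$\frac{\partial S(c, m)}{\partial c} = \sum_{i=1}^{N} 2 (Y_i - c - m x_i)(-1) = 0$
$\frac{\partial S(c, m)}{\partial m} = \sum_{i=1}^{N} 2 (Y_i - c - m x_i)(-x_i) = 0$
Two simultaneous equations
$c N + m \sum_{i=1}^{N} x_i = \sum_{i=1}^{N} Y_i$
$c \sum_{i=1}^{N} x_i + m \sum_{i=1}^{N} x_i^2 = \sum_{i=1}^{N} x_i Y_i$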
Classical linear regression
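Solving the two equations yields:
$c = \frac{\sum_{i=1}^{N} Y_i \sum_{i=1}^{N} x_i^2 - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} x_i Y_i}{N \sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2}$
$m = \frac{N \sum_{i=1}^{N} x_i Y_i - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} Y_i}{N \sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2}$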
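A minimal sketch of the least-squares solution, using the sums of x, Y, x² and xY. The function name `linear_regression` and the test data are illustrative only.

```python
def linear_regression(xs, Ys):
    """Least-squares gradient m and intercept c, assuming errors only in Y."""
    n = len(xs)
    sum_x = sum(xs)
    sum_Y = sum(Ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xY = sum(x * Y for x, Y in zip(xs, Ys))

    # Common denominator of both solutions.
    denom = n * sum_x2 - sum_x ** 2
    m = (n * sum_xY - sum_x * sum_Y) / denom
    c = (sum_Y * sum_x2 - sum_x * sum_xY) / denom
    return m, c

# Hypothetical data lying exactly on Y = 2x + 1.
m, c = linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
print(f"m = {m}, c = {c}")
```

The denominator is zero when every x-value is the same, in which case no unique line exists.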
Classical linear regression
x        y        xy       x^2
?        ?        ?        ?
Classical linear regression
Classical linear regression only considered
errors in the Y values of the data.
How can we consider errors in both x and y
values?
Use Reduced major axis regression
Reduced major axis regression
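[Figure: a data point offset from the fit line by dx in the x direction and dy in the y direction; the line has intercept c.]
Method to quantify a linear relationship where both
variables are dependent and have errors.
Instead of minimising $e^2 = (Y - y)^2$ we minimise
$e^2 = dy^2 + dx^2$.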
Reduced major axis regression
$m = \sqrt{\frac{\sum y^2 - \frac{(\sum y)^2}{N}}{\sum x^2 - \frac{(\sum x)^2}{N}}}$
$c = \bar{y} - m \bar{x}$
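The reduced major axis gradient is the ratio of the spread in y to the spread in x, and the line passes through the point of means. The sketch below assumes one thing not shown on the slide: the square root only gives the magnitude of m, so the sign is taken from the sign of the covariance (the same sign as r), which is the usual convention. The function name is illustrative.

```python
from math import copysign, sqrt

def rma_regression(xs, ys):
    """Reduced major axis gradient m and intercept c."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    ss_x = sum_x2 - sum_x ** 2 / n        # sum of squares about the mean of x
    ss_y = sum_y2 - sum_y ** 2 / n        # sum of squares about the mean of y
    cov = sum_xy - sum_x * sum_y / n      # used only for the sign (assumption)

    m = copysign(sqrt(ss_y / ss_x), cov)
    c = sum_y / n - m * sum_x / n         # c = y-bar - m * x-bar
    return m, c

# Hypothetical data on y = 2x + 1, and the same y-values reversed.
m_up, c_up = rma_regression([0, 1, 2, 3], [1, 3, 5, 7])
m_down, c_down = rma_regression([0, 1, 2, 3], [7, 5, 3, 1])
print(m_up, c_up, m_down, c_down)
```

For perfectly correlated data the reduced major axis line and the classical regression line coincide; they differ when the points scatter.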
Reduced major axis regression
x      y      x-x'      y-y'      (x-x')^2      (y-y')^2
?      ?      ?         ?         ?             ?
Error propagation
Every measurement of a variable has an
error.
Often the error quoted is one standard
deviation of the mean (mean ± standard
deviation)
The standard deviation of the sample is
usually our best estimate of the population
standard deviation.
Error propagation
Error propagation is a way of combining two or more
random errors together to get a third. The equations
assume that the errors are Gaussian in nature.
It can be used when you need to measure more than
one quantity to get at your final result, for example if
you wanted to predict permeability from a measured
porosity and grain size. The equations introduced
here let you propagate the uncertainties on your data
through the calculation and come up with an
uncertainty on your result.
How then do we combine variables which have
errors?
Error propagation - quoted
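Relationship        Error propagation
$z = x + y$         $\sigma_z^2 = \sigma_x^2 + \sigma_y^2$
$z = x - y$         $\sigma_z^2 = \sigma_x^2 + \sigma_y^2$
$z = xy$            $(\sigma_z / z)^2 = (\sigma_x / x)^2 + (\sigma_y / y)^2$
$z = x / y$         $(\sigma_z / z)^2 = (\sigma_x / x)^2 + (\sigma_y / y)^2$
$z = kx$            $\sigma_z = k \sigma_x$
$z = x^n$           $\sigma_z / z = n \sigma_x / x$
$z = \log_e x$      $\sigma_z = \sigma_x / x$
$z = e^x$           $\sigma_z / z = \sigma_x$
(k = constant)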
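The quoted formulae translate into short helper functions. This is a sketch only: the function names and the numbers in the usage lines are hypothetical, and the formulae assume independent Gaussian errors. Absolute values are used so that the returned error is never negative.

```python
from math import sqrt

def sigma_sum(sx, sy):
    """Error on z = x + y or z = x - y."""
    return sqrt(sx ** 2 + sy ** 2)

def sigma_product(z, x, sx, y, sy):
    """Error on z = x * y or z = x / y (z is the computed value)."""
    return abs(z) * sqrt((sx / x) ** 2 + (sy / y) ** 2)

def sigma_scaled(k, sx):
    """Error on z = k * x, with k a constant."""
    return abs(k) * sx

def sigma_power(z, n, x, sx):
    """Error on z = x ** n (z is the computed value)."""
    return abs(z * n * sx / x)

def sigma_log(x, sx):
    """Error on z = log_e(x)."""
    return abs(sx / x)

def sigma_exp(z, sx):
    """Error on z = exp(x) (z is the computed value)."""
    return abs(z) * sx

# Hypothetical measurements: x = 2.0 +/- 0.1 and y = 4.0 +/- 0.2.
x, sx, y, sy = 2.0, 0.1, 4.0, 0.2
print("x + y :", x + y, "+/-", sigma_sum(sx, sy))
print("x * y :", x * y, "+/-", sigma_product(x * y, x, sx, y, sy))
print("x ** 3:", x ** 3, "+/-", sigma_power(x ** 3, 3, x, sx))
```

Sums and differences combine absolute errors, while products, quotients and powers combine relative errors.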
Example of propagation of error
Suppose we measure the thickness of a rock bed
using a tape measure.
The tape measure is shorter than the bed thickness
so we have to do it in two steps, x and y.
We repeat the measurements 100 times and obtain
the following mean and standard deviation values for
x and y:
x=12.1±0.3 cm
y=4.2±0.2 cm
The thickness of the bed should be simply:
x+y=16.3 cm
But what about the error on the total thickness?
Example of propagation of error
It is given by propagating the individual errors as follows:
$\sigma_{x+y} = \sqrt{\sigma_x^2 + \sigma_y^2} = \sqrt{0.3^2 + 0.2^2} = 0.36 \approx 0.4$ cm
So the final answer for the total thickness of the bed is:
16.3±0.4 cm
Error propagation formulae are non-intuitive and understanding
how they are derived requires some mathematical knowledge
More complex examples
What if we have several functions of several variables?
E.g. calculating density using Archimedes' principle:
$\text{Density} = \frac{\text{wt. in air } (A)}{\text{wt. in air } (A) - \text{wt. in water } (W)}$
This equation contains two functions and two variables.
Error propagation is best done in parts, so first work out the value
and error of the denominator:
$x = A - W$
Then the value and error of:
$\text{Density} = \frac{A}{x}$
In a few weeks we will use a Monte Carlo method for solving
more complex functions.
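The two-step calculation can be sketched as follows. The weights and their errors are hypothetical values chosen only to show the arithmetic. Note that A appears in both the numerator and the denominator, so treating A and x as independent in the second step is an approximation.

```python
from math import sqrt

# Hypothetical measurements (same arbitrary weight units).
A, sigma_A = 25.0, 0.1   # weight in air
W, sigma_W = 15.0, 0.1   # weight in water

# Step 1: value and error of the denominator x = A - W (difference rule).
x = A - W
sigma_x = sqrt(sigma_A ** 2 + sigma_W ** 2)

# Step 2: value and error of density = A / x (quotient rule,
# assuming A and x are independent, which is only approximately true).
density = A / x
sigma_density = density * sqrt((sigma_A / A) ** 2 + (sigma_x / x) ** 2)

print(f"x = {x:.2f} +/- {sigma_x:.2f}")
print(f"density = {density:.2f} +/- {sigma_density:.2f}")
```

Working in parts keeps each step to a single quoted formula, which makes the calculation easier to check by hand.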
Reminder Statistics practical #2
Those not taking BIOL20451: Roscoe 3.5
1100 – 1300 Tuesday
Those taking BIOL20451: Williamson 1.12
1400 – 1600 Tuesday
Some common problems
Weighted mean
$\bar{x} = \frac{\sum f x}{\sum f}$
What does adding two variables really
mean?