Transcript: Lecture 2

EART20170 Computing, Data Analysis & Communication
Dr Paul Connolly (F18 – Sackville Building)
Skills Lecturer: [email protected]
1. Data analysis (statistics)
   – 3 lectures & practicals
   – statistics open-book test (2 hours)
2. Computing (Excel statistics/modelling)
   – 2 lectures
   – assessed practical work
Course notes etc: http://cloudbase.phy.umist.ac.uk/people/connolly
Recommended reading: Cheeney, R. F. (1983) Statistical Methods in Geology. George Allen & Unwin.
Recap – last lecture
• The four measurement scales: nominal, ordinal, interval and ratio.
• There are two types of errors: random errors (precision) and systematic errors (accuracy).
• Basic graphs: histograms, frequency polygons, bar charts, pie charts.
• Gaussian statistics describe random errors.
• The central limit theorem.
• Central values, dispersion, symmetry.
• Weighted mean.
Some common problems
X  1,4,6,3,7,4  [ x1 , x2 , x3 , x4 , x5 , x6 ]
N
x
i 1
N
i
2
(
x

x
)
 i
i 1
Use tables
xx
( x  x )2
1
-3.1667
10.0278
4
-0.1667
0.0278
6
1.8333
3.3611
3
-1.1667
1.3611
7
2.8333
8.0278
4
-0.1667
0.0278
25
0
22.8333
x

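As a quick check, a minimal Python sketch (the course itself uses Excel; this is just an illustration) that reproduces the table above:

```python
# Reproduce the worked table for X = 1, 4, 6, 3, 7, 4
x = [1, 4, 6, 3, 7, 4]
N = len(x)

mean = sum(x) / N                      # xbar = 25/6 ≈ 4.1667
deviations = [xi - mean for xi in x]   # x_i − xbar (sums to 0)
sq_devs = [d**2 for d in deviations]   # (x_i − xbar)²

print(f"sum(x) = {sum(x)}")                               # 25
print(f"mean   = {mean:.4f}")                             # 4.1667
print(f"sum of squared deviations = {sum(sq_devs):.4f}")  # 22.8333
```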
Lecture 2
• Correlation between two variables
• Classical linear regression
• Reduced major axis regression
• Propagation of errors in compound quantities
Correlation
• Many real-life quantities depend on something else, e.g. the dependence of rock permeability on porosity.
• How can we quantify the strength and direction of a linear relationship between X and Y variables?
Correlation
• Linear correlation (Pearson's coefficient):

$$ r = \frac{\sum xy - \frac{\sum x \sum y}{N}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{N}\right)\left(\sum y^2 - \frac{(\sum y)^2}{N}\right)}} $$

where
• Σy = sum of all y-values
• Σx = sum of all x-values
• Σx² = sum of all x² values
• Σy² = sum of all y² values
• Σxy = sum of the x times y values
• Like other numerical measures, the population correlation coefficient is denoted by ρ (the Greek letter "rho") and the sample correlation coefficient is denoted by r.
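A minimal Python sketch of the sums formula above (the example data are invented, not from the course):

```python
import math

def pearson_r(x, y):
    """Pearson correlation via the sums formula from the slide."""
    N = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi**2 for xi in x)
    syy = sum(yi**2 for yi in y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    num = sxy - sx * sy / N
    den = math.sqrt((sxx - sx**2 / N) * (syy - sy**2 / N))
    return num / den

# Toy porosity-vs-permeability-like data (made up for illustration)
x = [5, 10, 15, 20, 25]
y = [1.2, 2.1, 2.9, 4.2, 4.8]
print(f"r = {pearson_r(x, y):.3f}")   # close to +1 for this near-linear data
```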
Correlation
• Values of r:

[Figure: three scatter plots of y against x illustrating r = +1 (perfect positive correlation), r = −1 (perfect negative correlation) and r = 0 (no correlation).]
Correlation
• r² is the fraction of the variation in x and y that is explained by the linear relationship. It is often called the 'goodness of fit'.
• E.g. if an r = 0.97 is obtained then r² ≈ 0.94, so 100 × 0.94 = 94% of the total variation in x and y is explained by the linear relationship; the remaining 6% of the variation is due to "other" causes.
[Figure: r², the fraction of explained variation (0.0 to 1.0), plotted against the correlation coefficient r, from −1.0 to +1.0.]
Regression analysis
• How can we fit an equation to a set of numerical data (x, y) such that it yields the best fit for all the data?
Classical linear regression
• An approximate fit yields a straight line that passes through the set of points in the best possible manner without being required to pass exactly through any of the points.
Classical linear regression
y = mx + c

[Figure: scatter of data points with a fitted straight line y = mx + c; e_i marks the vertical deviation of a point from the line, c the intercept and m the gradient.]

• Here e_i is the deviation of the data point from the fitted line, c is the intercept and m is the gradient.
• The method assumes that the error is present only in y.
How do we define a good fit?
• If the sum of all deviations is a minimum? Σe_i
• If the sum of all the absolute deviations is a minimum? Σ|e_i|
• If the maximum deviation is a minimum? e_max
• If the sum of all the squares of the deviations is a minimum? Σe_i²
Classical linear regression
• The best way is to minimise the sum of the squares of the deviations. Formally this involves some mathematics.
• At each value of x_i the fitted line gives: $y_i = m x_i + c$
• Therefore the deviations of the data points $Y_i$ from the line are: $e_i = Y_i - y_i$
• The sum of the squares is:

$$ S(c, m) = \sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (Y_i - c - m x_i)^2 $$
Classical linear regression
• How do you find the minimum of a function?
• Use calculus: differentiate and set to zero.

$$ \frac{\partial S(c, m)}{\partial c} = \sum_{i=1}^{N} 2(Y_i - c - m x_i)(-1) = 0 $$

$$ \frac{\partial S(c, m)}{\partial m} = \sum_{i=1}^{N} 2(Y_i - c - m x_i)(-x_i) = 0 $$

• This gives two simultaneous equations:

$$ cN + m \sum_{i=1}^{N} x_i = \sum_{i=1}^{N} Y_i $$

$$ c \sum_{i=1}^{N} x_i + m \sum_{i=1}^{N} x_i^2 = \sum_{i=1}^{N} x_i Y_i $$
Classical linear regression
• Solving the two equations yields:

$$ c = \frac{\sum_{i=1}^{N} Y_i \sum_{i=1}^{N} x_i^2 - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} x_i Y_i}{N \sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2} $$

$$ m = \frac{N \sum_{i=1}^{N} x_i Y_i - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} Y_i}{N \sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2} $$
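A short Python sketch of these formulas (a minimal illustration with invented data; it assumes, as above, that all the error is in the Y values):

```python
def linear_regression(x, Y):
    """Classical least-squares fit Y = m*x + c using the sum formulas."""
    N = len(x)
    sx, sY = sum(x), sum(Y)
    sxx = sum(xi**2 for xi in x)
    sxY = sum(xi * Yi for xi, Yi in zip(x, Y))
    denom = N * sxx - sx**2
    m = (N * sxY - sx * sY) / denom
    c = (sY * sxx - sx * sxY) / denom
    return m, c

# Toy data lying near y = 2x + 1 (invented for illustration)
x = [0, 1, 2, 3, 4]
Y = [1.1, 2.9, 5.2, 6.8, 9.1]
m, c = linear_regression(x, Y)
print(f"m = {m:.3f}, c = {c:.3f}")   # expect m ≈ 2, c ≈ 1
```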
Classical linear regression
• Again, use a table: one column for each of x, y, xy and x², with one row per data point and a final row of sums, then substitute the column sums into the formulas for m and c.
Classical linear regression
• Classical linear regression only considers errors in the Y values of the data.
• How can we consider errors in both the x and y values?
• Use reduced major axis regression.
Reduced major axis regression
[Figure: fitted straight line with intercept c; a data point deviates from the line by dx horizontally and dy vertically.]

• A method to quantify a linear relationship where both variables are dependent and have errors.
• Instead of minimising e² = (Y − y)², we minimise e² = dy² + dx².
Reduced major axis regression
$$ m = \pm\sqrt{\frac{\sum y^2 - \frac{(\sum y)^2}{N}}{\sum x^2 - \frac{(\sum x)^2}{N}}} $$

(the sign of m is taken from the sign of the correlation between x and y)

$$ c = \bar{y} - m\bar{x} $$
Reduced major axis regression
• Use a table with columns x, y, x − x̄, y − ȳ, (x − x̄)² and (y − ȳ)², with one row per data point and a final row of sums, then substitute into the formulas for m and c.
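A minimal Python sketch of reduced major axis regression (the data are invented; taking the sign of m from the covariance is the convention spelled out in the comments):

```python
import math

def rma_regression(x, y):
    """Reduced major axis fit: |m| from the sums formula above,
    sign of m from the covariance of x and y, c = ybar - m*xbar."""
    N = len(x)
    xbar, ybar = sum(x) / N, sum(y) / N
    ssx = sum(xi**2 for xi in x) - sum(x)**2 / N
    ssy = sum(yi**2 for yi in y) - sum(y)**2 / N
    m = math.sqrt(ssy / ssx)
    # sign from the covariance (same sign as Pearson's r)
    if sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) < 0:
        m = -m
    c = ybar - m * xbar
    return m, c

# Toy data with scatter in both variables (invented for illustration)
x = [1.0, 2.1, 2.9, 4.2, 5.0]
y = [2.2, 3.9, 6.1, 8.0, 10.2]
m, c = rma_regression(x, y)
print(f"m = {m:.3f}, c = {c:.3f}")
```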
Error propagation
• Every measurement of a variable has an error.
• Often the error quoted is one standard deviation of the mean (mean ± standard deviation).
• The standard deviation of the sample is usually our best estimate of the population standard deviation.
Error propagation
• Error propagation is a way of combining two or more random errors to get a third. The equations assume that the errors are Gaussian in nature.
• It can be used when you need to measure more than one quantity to get at your final result, for example if you wanted to predict permeability from a measured porosity and grain size. The equations introduced here let you propagate the uncertainties on your data through the calculation and arrive at an uncertainty on your results.
• How then do we combine variables which have errors?
Error propagation – quoted relationships

Relationship             Error propagation formula
z = x ± y                $\sigma_z = \sqrt{\sigma_x^2 + \sigma_y^2}$
z = xy or z = x/y        $\frac{\sigma_z}{z} = \sqrt{\left(\frac{\sigma_x}{x}\right)^2 + \left(\frac{\sigma_y}{y}\right)^2}$
z = kx (k = constant)    $\sigma_z = k\,\sigma_x$
z = xⁿ                   $\frac{\sigma_z}{z} = n\,\frac{\sigma_x}{x}$
z = ln x                 $\sigma_z = \frac{\sigma_x}{x}$
z = eˣ                   $\frac{\sigma_z}{z} = \sigma_x$
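These rules translate directly into code. A minimal Python sketch (the function names are my own, not from the course), where each function returns a (value, error) pair:

```python
import math

def add_sub(x, sx, y, sy, sign=+1):
    """z = x ± y: absolute errors add in quadrature."""
    z = x + sign * y
    return z, math.sqrt(sx**2 + sy**2)

def mul_div(x, sx, y, sy, divide=False):
    """z = xy or z = x/y: relative errors add in quadrature."""
    z = x / y if divide else x * y
    return z, abs(z) * math.sqrt((sx / x)**2 + (sy / y)**2)

def scale(k, x, sx):
    """z = kx (k constant): the error scales by k."""
    return k * x, abs(k) * sx

def power(x, sx, n):
    """z = x**n: the relative error scales by n."""
    z = x**n
    return z, abs(z) * abs(n) * sx / abs(x)
```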
Example of propagation of error
• Suppose we measure the thickness of a rock bed using a tape measure.
• The tape measure is shorter than the bed thickness, so we have to do it in two steps, x and y.
• We repeat the measurements 100 times and obtain the following mean and standard deviation values for x and y:
x = 12.1 ± 0.3 cm
y = 4.2 ± 0.2 cm
• The thickness of the bed should be simply: x + y = 16.3 cm
• But what about the error on the total thickness?
Example of propagation of error
• It is given by propagating the individual errors as follows:

$$ \sigma_{x+y} = \sqrt{\sigma_x^2 + \sigma_y^2} = \sqrt{0.3^2 + 0.2^2} \approx 0.4\ \text{cm} $$

• So the final answer for the total thickness of the bed is: 16.3 ± 0.4 cm
• Error propagation formulae are non-intuitive, and understanding how they are derived requires some mathematical knowledge.
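A quick check of that arithmetic in Python:

```python
import math

thickness = 12.1 + 4.2                      # x + y = 16.3 cm
error = math.sqrt(0.3**2 + 0.2**2)          # errors add in quadrature
print(f"{thickness:.1f} ± {error:.1f} cm")  # 16.3 ± 0.4 cm
```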
More complex examples
• What if we have several functions of several variables?
• E.g. calculating density using Archimedes' Principle:

Density = wt. in air (A) / [wt. in air (A) − wt. in water (W)]

• This equation contains two functions and two variables.
• Error propagation is best done in parts, so first work out the value and error in the denominator: x = A − W
• Then work out the value and error of: Density = A / x
• In a few weeks we will use a Monte Carlo method for solving more complex functions.
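A sketch of that parts-based calculation in Python (the weights and errors below are invented for illustration):

```python
import math

# Hypothetical measurements: weight in air (A) and in water (W), with errors
A, sA = 50.0, 0.5    # arbitrary units
W, sW = 30.0, 0.4

# Step 1: denominator x = A - W (absolute errors add in quadrature)
x = A - W
sx = math.sqrt(sA**2 + sW**2)

# Step 2: density = A / x (relative errors add in quadrature;
# this treats A and x as independent, as the parts approach does)
density = A / x
sdensity = density * math.sqrt((sA / A)**2 + (sx / x)**2)

print(f"density = {density:.2f} ± {sdensity:.2f}")
```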
Reminder Statistics practical #2
• Those not taking BIOL20451: Roscoe 3.5, 1100–1300 Tuesday
• Those taking BIOL20451: Williamson 1.12, 1400–1600 Tuesday
Some common problems
• Weighted mean: $\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$
• What does adding two variables really mean?
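A one-function Python sketch of the weighted mean (the values and frequencies are invented):

```python
def weighted_mean(values, weights):
    """xbar = sum(f_i * x_i) / sum(f_i)"""
    return sum(f * x for f, x in zip(weights, values)) / sum(weights)

# e.g. values with frequencies (invented numbers)
x = [1, 4, 6, 3, 7]
f = [2, 5, 1, 3, 4]
print(weighted_mean(x, f))
```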