Data Handling II - KEATS


Drug Development Statistics & Data Management
Data Handling II: Describing and Depicting your Data
Dr Yanzhong Wang, Lecturer in Medical Statistics
Division of Health and Social Care Research, King's College London
Email: [email protected]

Types of data

• Quantitative data
  – continuous, discrete
  – distributions may be symmetric or skewed
• Qualitative (categorical) data
  – binary
  – nominal, ordinal

Skewed Distributions

[Figure: two histograms. Positively skewed data: long tail to right. Negatively skewed data: long tail to left.]

Symmetric Distribution

[Figure: density curve of a symmetric distribution]

Summary statistics

• ‘Where the data are’ - location
  – mean, median, mode, geometric mean
• Used to describe baseline data and main outcomes
• ‘How variable the data are’ - spread
  – standard deviation, variance, range, interquartile range, 95% range
• Needed (primarily) to describe baseline data in RCT and cohort study

Definition of the Mean

The mean of a sample of values is the arithmetic average and is determined by dividing the sum of the values by the number of values.
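As a minimal sketch in Python, the definition can be applied to the 40 blood glucose values (mmol/litre) listed in these slides. The variable names are illustrative only.

```python
# Blood glucose values (mmol/litre), copied from the slides.
glucose = [
    2.2, 2.9, 3.3, 3.3, 3.3, 3.4, 3.4, 3.4, 3.6, 3.6,
    3.6, 3.6, 3.7, 3.7, 3.8, 3.8, 3.8, 3.9, 4.0, 4.0,
    4.0, 4.1, 4.1, 4.1, 4.2, 4.3, 4.4, 4.4, 4.4, 4.5,
    4.6, 4.7, 4.7, 4.7, 4.8, 4.9, 4.9, 5.0, 5.1, 6.0,
]

# Mean = sum of the values divided by the number of values.
mean = sum(glucose) / len(glucose)
print(f"n = {len(glucose)}, mean = {mean:.3f} mmol/litre")
```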

Definition of the Median

The median is the middle value. It is not affected by skewness and outliers, but is theoretically less precise than the mean.

Ordered Blood Glucose Values

2.2 2.9 3.3 3.3 3.3 3.4 3.4 3.4 3.6 3.6
3.6 3.6 3.7 3.7 3.8 3.8 3.8 3.9 4.0 4.0
4.0 4.1 4.1 4.1 4.2 4.3 4.4 4.4 4.4 4.5
4.6 4.7 4.7 4.7 4.8 4.9 4.9 5.0 5.1 6.0

Middle values (20th and 21st of 40): 4.0 4.0
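A short Python sketch of the median for the same 40 values: with an even number of values, the usual convention (assumed here) is to average the two middle values. The result is cross-checked against the standard library's `statistics.median`.

```python
import statistics

# Blood glucose values (mmol/litre), copied from the slides.
glucose = [
    2.2, 2.9, 3.3, 3.3, 3.3, 3.4, 3.4, 3.4, 3.6, 3.6,
    3.6, 3.6, 3.7, 3.7, 3.8, 3.8, 3.8, 3.9, 4.0, 4.0,
    4.0, 4.1, 4.1, 4.1, 4.2, 4.3, 4.4, 4.4, 4.4, 4.5,
    4.6, 4.7, 4.7, 4.7, 4.8, 4.9, 4.9, 5.0, 5.1, 6.0,
]

ordered = sorted(glucose)
n = len(ordered)

if n % 2 == 0:
    # Even n: average the two middle values (here the 20th and 21st).
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
else:
    # Odd n: take the single middle value.
    median = ordered[n // 2]

# The library function follows the same convention.
library_median = statistics.median(glucose)
print(median, library_median)
```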

Definition of the Mode

The mode is the most frequent value.

Ordered Blood Glucose Values

2.2 2.9 3.3 3.3 3.3 3.4 3.4 3.4 3.6 3.6
3.6 3.6 3.7 3.7 3.8 3.8 3.8 3.9 4.0 4.0
4.0 4.1 4.1 4.1 4.2 4.3 4.4 4.4 4.4 4.5
4.6 4.7 4.7 4.7 4.8 4.9 4.9 5.0 5.1 6.0

Most frequent value: 3.6 (occurs four times)
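The mode can be found by counting how often each value occurs. The following is a minimal Python sketch using `collections.Counter`; it returns a list because a sample may have more than one mode.

```python
from collections import Counter

# Blood glucose values (mmol/litre), copied from the slides.
glucose = [
    2.2, 2.9, 3.3, 3.3, 3.3, 3.4, 3.4, 3.4, 3.6, 3.6,
    3.6, 3.6, 3.7, 3.7, 3.8, 3.8, 3.8, 3.9, 4.0, 4.0,
    4.0, 4.1, 4.1, 4.1, 4.2, 4.3, 4.4, 4.4, 4.4, 4.5,
    4.6, 4.7, 4.7, 4.7, 4.8, 4.9, 4.9, 5.0, 5.1, 6.0,
]

counts = Counter(glucose)               # value -> number of occurrences
highest_count = max(counts.values())    # frequency of the most common value
modes = sorted(v for v, c in counts.items() if c == highest_count)
print(modes, highest_count)
```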

Location = Central Tendency

• Mode - not necessarily central (categorical data)
• Median - only uses relative magnitudes
• Arithmetic Mean - outlier prone

[Figure: histogram of blood glucose (mmol/litre) with the mode, median and arithmetic mean marked]

Relation of mean, median and mode

• If distribution is unimodal (has only one mode) then:
• Mean = median = mode for symmetric distribution.
• Mean > median > mode for positively skewed distribution.
• Mean < median < mode for negatively skewed distribution.

Serum Triglyceride Levels from Cord Blood of 282 Babies

[Figure: histogram; x-axis: serum triglyceride levels, 0.2 to 1.7]

Log(Serum Triglyceride Levels) from Cord Blood of 282 Babies

[Figure: histogram; x-axis: log(serum triglyceride) levels, -1.9 to 0.5]

Definition of the Geometric Mean

The geometric mean of a sample of n values is determined by multiplying all the values together and taking the nth root (for only two values this is the more familiar square root).

Geometric Mean

• A common example of when the geometric mean is the correct choice of average is when averaging growth rates.

• Another Method: Take log of each value, find arithmetic mean and anti-log the result.

Exp( (log(0.15) + … + log(1.66) )/40) = 0.467
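The two routes to the geometric mean (nth root of the product, or anti-log of the mean of the logs) give the same answer. The sketch below checks this in Python; the growth factors are hypothetical illustration values, not the triglyceride data, and the function names are made up for this example.

```python
import math

def geometric_mean_product(values):
    """nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

def geometric_mean_logs(values):
    """Anti-log of the arithmetic mean of the natural logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical yearly growth factors (1.10 means +10%), for illustration only.
growth = [1.10, 1.50, 0.80]
gm_product = geometric_mean_product(growth)
gm_logs = geometric_mean_logs(growth)
print(gm_product, gm_logs)

# For only two values the geometric mean is the square root of their product.
two_values = geometric_mean_product([2.0, 8.0])
print(two_values)
```

Taking logs is usually preferred for long lists, since the product of many values can overflow or underflow.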

Serum Triglyceride Levels from Cord Blood of 282 Babies

[Figure: histogram of serum triglyceride levels with Median = 0.460, Geometric Mean = 0.467 and Mean = 0.506 marked]

Why measures of variability are important

Production of Aspirin

• New production process of 100 mg tabs
• Random sample from new process – 96 97 100 101 101 mg - mean 99 mg
• Random sample from old process – 88 93 100 104 110 mg - mean 99 mg
• Same means but new is better because less variable

Definition of Range

The range of a sample of values is the largest value minus the smallest value.

• New process: the range is 101 - 96 = 5
• Old process: the range is 110 - 88 = 22
• Range is simple ... BUT
  – Only uses min and max
  – Tends to get larger as sample size increases
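The two ranges quoted for the aspirin samples can be reproduced with a few lines of Python. This is only a sketch; `value_range` is an illustrative helper name.

```python
# Tablet weights (mg) from the two aspirin production processes in the slides.
new_process = [96, 97, 100, 101, 101]
old_process = [88, 93, 100, 104, 110]

def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

range_new = value_range(new_process)
range_old = value_range(old_process)
print(f"new process range = {range_new} mg, old process range = {range_old} mg")
```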

Definition of Inter-quartile Range

The inter-quartile range of a sample of values is the difference between the upper and lower quartiles. The lower quartile is the value which is greater than ¼ of the sample and less than ¾ of the sample. Conversely, the upper quartile is the value which is greater than ¾ of the sample and less than ¼ of the sample.

Ordered Blood Glucose Values

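1/4 of 40 = 10; 3/4 of 40 = 30

2.2 2.9 3.3 3.3 3.3 3.4 3.4 3.4 3.6 3.6
3.6 3.6 3.7 3.7 3.8 3.8 3.8 3.9 4.0 4.0
4.0 4.1 4.1 4.1 4.2 4.3 4.4 4.4 4.4 4.5
4.6 4.7 4.7 4.7 4.8 4.9 4.9 5.0 5.1 6.0

Values at the lower quartile position (10th and 11th): 3.6 3.6
Values at the upper quartile position (30th and 31st): 4.5 4.6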
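The quartile positions above can be turned into numbers with a short Python sketch. It assumes one simple convention, suited to a sample size divisible by 4: each quartile is taken midway between the k-th and (k+1)-th ordered values, with k = n/4 and k = 3n/4. Statistical packages use a variety of interpolation rules, so their quartiles may differ slightly.

```python
# Blood glucose values (mmol/litre), copied from the slides.
glucose = [
    2.2, 2.9, 3.3, 3.3, 3.3, 3.4, 3.4, 3.4, 3.6, 3.6,
    3.6, 3.6, 3.7, 3.7, 3.8, 3.8, 3.8, 3.9, 4.0, 4.0,
    4.0, 4.1, 4.1, 4.1, 4.2, 4.3, 4.4, 4.4, 4.4, 4.5,
    4.6, 4.7, 4.7, 4.7, 4.8, 4.9, 4.9, 5.0, 5.1, 6.0,
]

ordered = sorted(glucose)
n = len(ordered)

def midway(ordered_values, k):
    """Average of the k-th and (k+1)-th ordered values (k counted from 1)."""
    return (ordered_values[k - 1] + ordered_values[k]) / 2

# Assumed convention: k = n/4 for the lower quartile, k = 3n/4 for the upper.
lower_quartile = midway(ordered, n // 4)
upper_quartile = midway(ordered, 3 * n // 4)
iqr = upper_quartile - lower_quartile
print(f"lower = {lower_quartile:.2f}, upper = {upper_quartile:.2f}, IQR = {iqr:.2f}")
```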

Inter-Quartile Range

[Figure: histogram of blood glucose (mmol/litre) with the lower quartile, upper quartile and inter-quartile range marked]

Standard deviation

• Neither measure uses the numerical values - only relative magnitudes
• A measure accounting for the values is the standard deviation
• Consider the aspirin data from the new process: 96 97 100 101 101 (mean 99 mg)
• Determine deviations from mean: -3 -2 1 2 2
• Square, add, average and square-root: $\sqrt{(9 + 4 + 1 + 4 + 4)/5} = \sqrt{4.4} = 2.098$
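The same steps can be followed in Python. The slide averages the squared deviations by dividing by n (5), which corresponds to `statistics.pstdev`. Many packages instead divide by n - 1 (`statistics.stdev`), which gives a slightly larger value; the sketch shows both so the difference is visible.

```python
import math
import statistics

# Tablet weights (mg) from the new aspirin process in the slides.
new_process = [96, 97, 100, 101, 101]

mean = sum(new_process) / len(new_process)
deviations = [x - mean for x in new_process]    # deviations from the mean
squared = [d ** 2 for d in deviations]          # squared deviations

# Divide by n, as on the slide, then take the square root.
sd_n = math.sqrt(sum(squared) / len(new_process))

# Divide by n - 1 instead (the usual sample standard deviation).
sd_n_minus_1 = statistics.stdev(new_process)

print(f"sd (divide by n)     = {sd_n:.3f}")
print(f"sd (divide by n - 1) = {sd_n_minus_1:.3f}")
```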

Measures of scatter/dispersion – ‘how variable the data are’

• Range – smallest to biggest value – increases with sample size
• Standard deviation – measure of variation around the mean – affected by skewness and outliers
• Variance = square of standard deviation
• Interquartile range (IQR) – from 25th centile to 75th centile

Plotting Data

• Histograms
• Stem and Leaf Plots
• Box Plots

Stem  Leaf      #
60    0         1
58
56
54
52
50    00        2
48    000       3
46    0000      4
44    0000      4
42    00        2
40    000000    6
38    0000      4
36    000000    6
34    000       3
32    000       3
30
28    0         1
26
24
22    0         1
      ----+----+----+----+
Multiply Stem.Leaf by 10**-1

[Figure: box plot of the same data, axis from 2 to 6]

Mean and standard deviation

• Best description if distribution is reasonably symmetric (and has a single mode)
• Gives a full description if the data have a Normal distribution

[Figure: Normal density curves for mean 3, s.d. 1; mean 5, s.d. 1; and mean 5, s.d. 2; x-axis from 0 to 10]

Properties of Normal distribution

• Symmetric distribution – mean, median and mode equal
• Completely specified by mean and standard deviation
• 95% of distribution contained within mean ± 1.96 standard deviations
• 68% within mean ± 1 standard deviation
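The 95% and 68% figures can be checked numerically. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); because the proportions are the same for any mean and standard deviation, the standard Normal (mean 0, s.d. 1) is used.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(k):
    """Proportion of a Normal distribution within mean ± k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1_96 = coverage(1.96)
within_1 = coverage(1.0)
print(f"within ± 1.96 sd: {within_1_96:.4f}")
print(f"within ± 1 sd:    {within_1:.4f}")
```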

Continuous data, not Normally distributed

• If symmetric use mean and standard deviation
• If skewed use median and IQR
• Unless positively skewed, but log transformation creates symmetric distribution – use geometric mean

Nominal categorical data

• Mode.
• % in each category, especially when binary.

Wheeze in last 12 months   Frequency (n)   %
No                         1945            75.2
Yes                        642             24.8
Total                      2587            100.0
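The percentages in the wheeze table follow directly from the frequencies. A minimal Python sketch (the dictionary layout is just one convenient way to hold the counts):

```python
# Frequencies from the wheeze table in the slides.
counts = {"No": 1945, "Yes": 642}

total = sum(counts.values())

# Percentage in each category, rounded to one decimal place as in the table.
percentages = {category: round(100 * n / total, 1) for category, n in counts.items()}

# The mode of a categorical variable is the most frequent category.
mode_category = max(counts, key=counts.get)

print(total, percentages, mode_category)
```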

Ordinal categorical data

• Median and IQR if enough separate values.
• Otherwise as for nominal.

Discrete quantitative data

• As for continuous data if many values, as for ordinal data if fewer.

Difference Between Standard Deviation & Standard Error

Measure of Variability of the Sample Mean

• Range, inter-quartile range and standard deviation describe the variability of the individual values in the population (sample), not the variability of the mean.
• To understand the difference, carry out a sampling experiment using the Ritchie Index values.

Values of the Ritchie Index (Measure of Joint Stiffness) in 50 Untreated Patients

14  9  8  9  1 20  3  3  2  4
 2  3  6  1  2 11 16 24 16 21
19 22 33 12 12 12 19 10 33  2
19 40  1 20  1  2  4  7  9  4
 9  6 14  8 27 10 27  7 24 21

Mean = (14+…+21)/50 = 12.18

Location = Central Tendency

• Mode - not necessarily central (categorical data)
• Median - only uses relative magnitudes
• Arithmetic Mean - outlier prone

[Figure: histogram of the values of the Ritchie Index in bands 0 - 5, 6 - 10, ..., 36 - 40]

Sampling Experiment

• Take a random sample (10) from the 50 values
• Determine the mean of the 10 values
• Repeat 50 times
• These means show variation - HOW LARGE IS IT?
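The experiment can be simulated in Python as a rough sketch. Two assumptions are made here that the slides do not state: each sample is drawn without replacement (`random.sample`), and the seed is an arbitrary choice that only makes the run repeatable. The exact numbers will therefore differ from those in the slides.

```python
import random
import statistics

# The 50 Ritchie Index values, copied from the slides.
ritchie = [
    14, 9, 8, 9, 1, 20, 3, 3, 2, 4,
    2, 3, 6, 1, 2, 11, 16, 24, 16, 21,
    19, 22, 33, 12, 12, 12, 19, 10, 33, 2,
    19, 40, 1, 20, 1, 2, 4, 7, 9, 4,
    9, 6, 14, 8, 27, 10, 27, 7, 24, 21,
]

rng = random.Random(2024)   # arbitrary fixed seed (assumption) for repeatability
sample_size = 10
repeats = 50

# Draw a random sample of 10 values and record its mean; repeat 50 times.
sample_means = [
    statistics.mean(rng.sample(ritchie, sample_size)) for _ in range(repeats)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
sd_of_values = statistics.stdev(ritchie)

print(f"original values: mean = {statistics.mean(ritchie):.2f}, sd = {sd_of_values:.2f}")
print(f"sample means:    mean = {mean_of_means:.2f}, sd = {sd_of_means:.2f}")
```

The sample means cluster around the mean of the original values, but vary much less than the individual values do.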

Variations in Samples

[Figure: five histograms of values of the Ritchie Index, with panel means 12.18, 10.00, 13.40, 12.60 and 11.50]

Ritchie Values

Original values (mean = 12.18; sd = 9.69)

[Figure: histogram of the values of the Ritchie Index]

Ritchie Values Sampling Experiment – Sample Means

Original values (mean = 12.18; sd = 9.69)
Sample means (mean = 12.21; sd = 2.97)

[Figure: histograms of the original values and of the sample means, plotted against the values of the Ritchie Index]

Definition of the Standard Error

The standard deviation of the sampling distribution of the mean is called the standard error of the mean.

Increasing Sample Size

• n=10: sample means (mean = 12.21; sd = 2.97)
• n=15: sample means (mean = 12.37; sd = 2.43)

[Figure: histograms of the sample means for n=10 and n=15, plotted against the values of the Ritchie Index]

• Increased precision (smaller standard error)
• Less skewness

Standard error of the mean as a function of the sample size

sd = $\sigma$; se = $\sigma / \sqrt{n}$

[Figure: sd and se plotted against sample size (0 to 40)]
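A brief Python sketch of how the standard error shrinks as the sample size grows, taking the sd of the 50 Ritchie Index values (9.69) as a stand-in for σ. The chosen sample sizes are illustrative.

```python
import math

sigma = 9.69   # sd of the 50 Ritchie Index values, used here in place of sigma

def standard_error(sd, n):
    """Standard error of the mean: sd divided by the square root of n."""
    return sd / math.sqrt(n)

# Illustrative sample sizes.
se_by_n = {n: standard_error(sigma, n) for n in (1, 5, 10, 15, 20, 40)}

for n, se in se_by_n.items():
    print(f"n = {n:2d}: se = {se:.2f}")
```

For n=10 and n=15 the formula gives about 3.06 and 2.50, of the same order as the 2.97 and 2.43 observed for the sample means in the sampling experiment.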

Population of Gene Lengths, n=20,290

[Figure: histogram; x-axis: gene length (# of nucleotides), 0 to 15000]

Samples of size n=100

[Figure: histogram; x-axis: gene length (# of nucleotides), 0 to 15000]

Practical Confusion

• A mean is often reported in medical papers as 12.18 ± 1.37

What is 1.37? sd or se?
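For the 50 Ritchie Index values used in these slides, the question can be settled by computing both quantities. A short Python sketch, using the sample standard deviation (dividing by n - 1), which reproduces the sd of 9.69 quoted earlier:

```python
import math
import statistics

# The 50 Ritchie Index values, copied from the slides.
ritchie = [
    14, 9, 8, 9, 1, 20, 3, 3, 2, 4,
    2, 3, 6, 1, 2, 11, 16, 24, 16, 21,
    19, 22, 33, 12, 12, 12, 19, 10, 33, 2,
    19, 40, 1, 20, 1, 2, 4, 7, 9, 4,
    9, 6, 14, 8, 27, 10, 27, 7, 24, 21,
]

mean = statistics.mean(ritchie)
sd = statistics.stdev(ritchie)          # sample sd (divides by n - 1)
se = sd / math.sqrt(len(ritchie))       # standard error of the mean

print(f"mean = {mean:.2f}, sd = {sd:.2f}, se = {se:.2f}")
```

Here the sd is about 9.69 and the se about 1.37, so for these data "12.18 ± 1.37" is a mean with its standard error. In a published paper the reader cannot tell unless the authors say which one is reported.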

Thanks!

Tea break