Review of key biostatistical concepts relevant to EBM

Introduction to
Biostatistics
Prof Haroon Saloojee
Division of Community Paediatrics
Lecture 1
Summarising your data 1
The evidence-based clinician’s
motto
In God we trust.
All others must bring data.
Challenges
Statistical ideas can be difficult and
intimidating
Thus:


Statistical results are often “skipped-over”
when reading scientific literature
Data is often misinterpreted
Misinterpretation of Data
“Celebrating birthdays is healthy”
Statistics show that those who celebrate the most
birthdays live the longest.
You may think that:
A Bar Chart is a map of the locations of
the nearest taverns
A p-value is the result of a urinalysis
A t-test is a taste test between rooibos tea
and Five Roses tea
Course Structure
“BIO-SADISTICS”
Four 45-minute lectures
PowerPoint presentations on student web site
Some text (content) also on web page
Plus, additional internet “links”
Syllabus for the Course
SESSION 1: Summarizing your data 1
Types of data (quantitative and
categorical variables)
Describing data- average (mean,
median, and mode)
Displaying data graphically (box plots,
histograms, bar charts, pie diagrams)
Frequency distributions
SESSION 2: Summarizing your data 2
The normal distribution
Describing data – spread (range,
variance, standard deviation, z score)
Quartiles, percentiles
Standard error of the mean
Confidence intervals
SESSION 3: Sampling principles
Study Population
The sample
Random sampling
Non random sampling
Sampling bias
Sample size and power
SESSION 4: Statistical tests and
the concept of significance
Hypothesis testing
p value
Statistical versus clinical
significance
Parametric versus non-parametric
methods
Free textbook on-line
Statistics at Square One
http://bmj.bmjjournals.com/collections/statsbk/index.shtml
http://www.medstatsaag.com/mcqs.asp
Relevant topics
Handling data: 1, 4, 5, 6, 7
Sampling: 10, 11
Hypothesis testing: 17, 18
Today’s Lecture
What types of data are there?
(numerical vs. categorical variables)
Describing data - measures of central tendency
(mean, median and mode)
Summarising data graphically (histograms, box
plots, bar charts, pie diagrams)
Types of data
Variable
Categorical
Nominal
Ordinal
Numerical
Discrete
Continuous
Types of Data
Numerical data
Discrete
Examples
No. of children
No. of asthma attacks in a week
No. of rooms in home
Types of Data
Numerical data
Continuous
Any value on the continuum is possible (even fractions or
decimals)
Examples
Weight
Age
Temperature
Heart rate
Types of Data
Categorical data
Nominal
Mutually exclusive unordered categories
Examples




Sex (male, female)
Eye colour (brown, grey, green, blue)
Are you happy? (Yes, No)
Diarrhoea (Present, absent)
Can summarize in:

Tables – using counts and percentages

Bar Chart
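A table of counts and percentages for a nominal variable can be sketched in a few lines of Python. The eye-colour observations below are made up for illustration and are not data from the lecture.

```python
from collections import Counter

# Hypothetical eye-colour observations (made up for illustration)
eye_colours = ["brown", "brown", "blue", "green",
               "brown", "grey", "blue", "brown"]

counts = Counter(eye_colours)          # frequency of each category
total = sum(counts.values())           # number of observations
percentages = {colour: 100 * count / total
               for colour, count in counts.items()}

# Print a simple table, most frequent category first
for colour, count in counts.most_common():
    print(f"{colour:<6} {count:>2}  {percentages[colour]:5.1f}%")
```

The same counts are what a bar chart of this variable would display as bar heights.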
Types of Data
Categorical data
Ordinal (ordered categories)
Examples
Degree of agreement

(Strongly Agree, Agree, Disagree, Strongly disagree)
Severity of injury

Severe, Moderate, Mild
Income level

High, medium, low
PRACTICE
Discrete or Continuous ?
mg of tar in cigarettes
Continuous
number of people in a car
Discrete
high to low temperature in
any day
Continuous
weight
Continuous
time
Continuous
number of children in the
average family
Discrete
Nominal or Ordinal?
Average / above avg /
below average
Ordinal
Colours of Smarties
Nominal
Grades (A, B, C, D, F)
Ordinal
Data Summaries
It is ALWAYS a good idea to summarise
your data


You become familiar with the data and the
characteristics of the people that you are
studying
You can also identify problems or errors with
the data (data management issues).
Summarising and Describing
Continuous Data
Measures of the centre of data (central
tendency)
Mean
Median
Mode
Definitions
The arithmetic mean is what is commonly called
the average. The mean is the sum of all the
scores divided by the number of scores.
The median is the middle of a distribution: half
the scores are above the median and half are
below the median.
The mode is the most frequently occurring score
in a distribution
“It has been said that a fellow with one
leg frozen in ice and the other leg in
boiling water is comfortable…
…on average.”
J.M. Yancy
Sample Mean x̄
The Average or Arithmetic Mean
Add up data, then divide by sample size (n)
The sample size n is the number of observations
(pieces of data)
Example
Systolic blood pressures (mmHg)
X1 = 120
X2 = 80
X3 = 90
X4 = 110
X5 = 95
n=5
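As a worked illustration, the mean of the five pressures above can be computed as follows. This is a minimal sketch; the variable names are illustrative and not part of the lecture.

```python
# Systolic blood pressures (mmHg) from the example above
pressures = [120, 80, 90, 110, 95]

n = len(pressures)            # sample size
total = sum(pressures)        # sum of all observations
sample_mean = total / n       # arithmetic mean

print(f"n = {n}, sum = {total}, mean = {sample_mean} mmHg")
```

The sum is 495 and the sample size is 5, so the sample mean is 99 mmHg.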
Notation
Σ (sigma) denotes the summation of a set of values
x is the variable usually used to represent the individual
data values
n represents the number of data values in a sample
N represents the number of data values in a population
x̄ is pronounced ‘x-bar’ and denotes the mean of a set of
sample values
µ is pronounced ‘mu’ and denotes the mean of all
values in a population
Definitions
Mean
the value obtained by adding the scores and
dividing the total by the number of scores
Sample: $\bar{x} = \frac{\sum x}{n}$
Population: $\mu = \frac{\sum x}{N}$
Notes on Sample Mean
Also called sample average or arithmetic mean
Sensitive to extreme values
- One data point could make a great change in
sample mean
Why is it called the sample mean?
– To distinguish it from population mean
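The sensitivity of the mean to a single extreme value can be illustrated with a short sketch. It reuses the five blood pressures from the earlier example and adds one extreme reading of 300 mmHg, which is a made-up value chosen only to show the effect.

```python
import statistics

pressures = [120, 80, 90, 110, 95]        # values from the earlier example
with_extreme = pressures + [300]          # hypothetical extreme reading

mean_before = statistics.mean(pressures)
mean_after = statistics.mean(with_extreme)

median_before = statistics.median(pressures)
median_after = statistics.median(with_extreme)

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

One added observation moves the mean from 99 to 132.5, while the median moves only from 95 to 102.5.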
Population Versus Sample
Population - The entire group you want
information about
– For example: The blood pressure of all 20-year-old male
university students in South Africa
Sample - A part of the population from which we
actually collect information and draw
conclusions about the whole population
– For example: Sample of blood pressures (n=50) of 20-year-old male university students in South Africa
The sample mean x̄ is not the population mean µ
Population Versus Sample
We don’t know the population mean µ but
would like to know it
We draw a sample from the population
We calculate the sample mean x̄
How close is x̄ to µ?
Statistical theory will tell us how close x̄ is to µ
Statistical inference is the process of trying to
draw conclusions about the population from
the sample
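The idea of estimating µ from a sample can be sketched with a small simulation. In real studies µ is unknown; here a hypothetical population is simulated (normal, mean 120 mmHg, standard deviation 10 mmHg, both assumed values) so that the sample mean can be compared with it.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10 000 simulated systolic pressures (mmHg)
population = [random.gauss(120, 10) for _ in range(10_000)]
population_mean = statistics.mean(population)      # µ, normally unknown

# Draw a simple random sample of n = 50 without replacement
sample = random.sample(population, 50)
sample_mean = statistics.mean(sample)              # x̄, our estimate of µ

print(f"population mean = {population_mean:.1f}")
print(f"sample mean     = {sample_mean:.1f}")
print(f"difference      = {sample_mean - population_mean:+.1f}")
```

Changing the seed gives a different sample and a slightly different x̄, which is the variability that statistical inference has to account for.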
Weighted Mean
$\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$
Your grades in many courses are weighted means (averages).
In other words, some things count (are weighted) more than
others.
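A weighted mean can be sketched as below. The marks and weights are hypothetical and the function name `weighted_mean` is illustrative only.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Hypothetical course marks: assignment, test, exam
marks = [70, 80, 90]
weights = [1, 2, 2]      # the test and exam each count double

print(weighted_mean(marks, weights))
```

Here the weighted mean is (70 + 160 + 180) / 5 = 82, compared with an unweighted mean of 80, because the higher marks carry more weight.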
Geometric Means
These are
histograms
rotated 90º, and
box plots.
Note how the log
transformation
gives a
symmetric
distribution.
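The slide gives no formula, so the sketch below uses the standard definition of the geometric mean: average the logarithms of the values, then transform back with the exponential. It applies only to positive values, and the three measurements are made up for illustration.

```python
import math

# Hypothetical right-skewed measurements (made up for illustration)
values = [1, 10, 100]

arithmetic_mean = sum(values) / len(values)

# Geometric mean: average the logs, then transform back
logs = [math.log(v) for v in values]
geometric_mean = math.exp(sum(logs) / len(logs))

print(f"arithmetic mean = {arithmetic_mean:.1f}")
print(f"geometric mean  = {geometric_mean:.1f}")
```

On the log scale these three values are evenly spaced, which is why the geometric mean sits at the middle value while the arithmetic mean is pulled towards the largest one.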
Median
• 5 5 5 3 1 5 1 4 3 5 2
• (in order) 1 1 2 3 3 4 5 5 5 5 5
exact middle
MEDIAN is 4
• 1 1 3 3 4 5 5 5 5 5
no exact middle -- shared by two numbers
(4 + 5) / 2 = 4.5
MEDIAN is 4.5
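The two cases above (odd and even number of values) can be sketched as a small function. The function and variable names are illustrative.

```python
def median(values):
    """Return the middle value (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                            # odd n: exact middle
    return (ordered[middle - 1] + ordered[middle]) / 2    # even n: shared middle

odd_list = [5, 5, 5, 3, 1, 5, 1, 4, 3, 5, 2]     # 11 values
even_list = [1, 1, 3, 3, 4, 5, 5, 5, 5, 5]       # 10 values

print(median(odd_list))
print(median(even_list))
```

The first list has an exact middle value of 4; the second has none, so the two middle values 4 and 5 are averaged to give 4.5.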
Mode
The score that occurs most frequently
Bimodal
Multimodal
No Mode
The only measure of central tendency that can be used
with nominal data
Examples
a. 5 5 5 3 1 5 1 4 3 5 • Mode is 5
b. 2 2 2 3 4 5 6 6 6 7 9 • Bimodal – 2 & 6
c. 2 3 6 7 8 9 10 • No Mode
d. 2 2 3 3 3 4 • Mode is 3
e. 2 2 3 3 4 4 5 5 • No Mode
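Examples a to e can be checked with a short sketch. It assumes the convention used in these examples: when every distinct value occurs equally often, the data set is said to have no mode. The function name `modes` is illustrative.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    counts = Counter(values)
    highest = max(counts.values())
    # Convention assumed here: if every distinct value occurs
    # equally often, the data set has no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

examples = {
    "a": [5, 5, 5, 3, 1, 5, 1, 4, 3, 5],
    "b": [2, 2, 2, 3, 4, 5, 6, 6, 6, 7, 9],
    "c": [2, 3, 6, 7, 8, 9, 10],
    "d": [2, 2, 3, 3, 3, 4],
    "e": [2, 2, 3, 3, 4, 4, 5, 5],
}

for label, data in examples.items():
    print(label, modes(data) or "no mode")
```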
Shapes of the Distribution
Distribution Characteristics
Example: Height of students in the class
Example: Serum cholesterol level
Example: Birth weight of newborn babies
Some visual ways to summarize
data
Tables
Frequency table
Graphs
Histograms
Bar graphs
Box plots
Line plots
Scatter graphs
Charts
Bar chart
Pie diagram
Frequency Tables
Summarizes a variable with counts and
percentages
The variable is categorical

Note that you can take a continuous variable
and create categories with it
How do you create categories for a continuous
variable?


Choose cutoffs that are biologically meaningful
Natural breaks in the data
Example of frequency table
When raw data are arranged with frequencies, they are said to form a frequency table
for ungrouped data.
When the data are divided into groups/ classes, they are called grouped data.
The classes have to be decided according to the range of data and size of class.
The number of observations lying in a particular class is called its frequency and the
table showing classes with frequencies is called a frequency table.
The total of frequencies of a particular class and of all classes prior to that class is
called the cumulative frequency of that class.
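A grouped frequency table with cumulative frequencies can be sketched as follows. The data values and the class width of 10 are hypothetical choices made for illustration.

```python
# Hypothetical raw data (made up for illustration)
values = [3, 7, 12, 15, 18, 21, 24, 25, 33, 38]
class_width = 10
lower_bounds = range(0, 40, class_width)    # classes 0-<10, 10-<20, ...

table = []
running_total = 0
for lower in lower_bounds:
    upper = lower + class_width
    # Frequency: observations with lower <= value < upper
    frequency = sum(1 for v in values if lower <= v < upper)
    # Cumulative frequency: this class plus all classes before it
    running_total += frequency
    table.append((lower, upper, frequency, running_total))

print("class      freq  cum. freq")
for lower, upper, frequency, cum in table:
    print(f"{lower:>2} to <{upper:<3} {frequency:>4}  {cum:>9}")
```

The cumulative frequency of the last class always equals the total number of observations, which is a quick check that no value was left out.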
Graphical Summaries
Histograms

Continuous or ordinal data on horizontal axis
Bar Graphs

Nominal data
No order to horizontal axis
Box Plots

Continuous data
Histogram
A histogram is a graphic representation of the frequency distribution of a
variable. Vertical rectangles (bars) are drawn in such a way that their bases lie
on a linear scale representing different intervals, and their heights are
proportional to the frequencies of the values within each of the intervals.
Bar Chart
A bar chart is a method of presenting discrete data organized in such
a way that each observation can fall into one of mutually exclusive
categories.
The frequencies (or percentages) are listed along the Y axis and the
categories of the variable along the X axis. The heights of the bars
correspond to the frequencies. The bars should be of equal width
and they should not touch one another.
Difference between bar chart and
histogram
Bar charts for categories that are separate
Histograms for categories obtained by
dividing up continuous data.
Bars do not touch, histogram rectangles
do touch.
Line graph
If the mid-points of the top of the bars of a histogram are connected together by a
line and if the bars were omitted from the display, the resultant graph will be a line
graph (also called a frequency polygon).
Line graphs are good at showing trends over a period of time. When trends of rates
(e.g. death rate, Infant Mortality Rate, etc.) are to be displayed it is better done with
line graphs rather than histograms.
Scatter plot
Also called a scattergram. This is a method of displaying the distribution
of two variables in relation to each other. The values of one
variable are measured on the X axis and the values of the other on the
Y axis. The variables have to be on a continuous scale. Each point thus
has two values (coordinates), one from the X axis scale and one from the Y axis scale.
A wide scatter of the points denotes poor correlation between the two
variables. If the two variables are perfectly correlated, then all the points
will fall on a straight line (the regression line).
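The link between scatter and correlation can be sketched numerically. The slide does not name a specific measure; the sketch assumes Pearson's correlation coefficient r, one common choice, and uses made-up data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y_on_line = [2, 4, 6, 8, 10]     # every point lies on the line y = 2x
y_scattered = [3, 1, 4, 1, 5]    # points widely scattered

r_line = pearson_r(x, y_on_line)
r_scattered = pearson_r(x, y_scattered)

print(f"points on a line: r = {r_line:.2f}")
print(f"scattered points: r = {r_scattered:.2f}")
```

Points that all lie on one straight line give r = 1 (or -1 for a downward slope), while widely scattered points give a value much closer to 0.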
Survival curve
Pie chart
This is a circular diagram (can
be shown as 2-D or 3-D)
divided into segments, each
representing a category or
subset of data (part of the
whole). The amount for each
category is proportional to the
area of the sector (slice of the
pie). The total area of the
circle is 100% and it
represents the total population
that is being shown.
Pictures of Data
Continuous Variables
Histograms
Means and medians do not tell whole
story
Differences in spread (variability)
Differences in shape of the distribution
How to Make a Histogram
Divide range of data into intervals (bins)
of equal width
Count the number of observations in
each class
Draw the histogram
Label scales
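The steps above can be sketched in Python without any plotting library. The lengths of stay, the bin width of 3 days, and the function name `histogram_counts` are all hypothetical choices for illustration.

```python
def histogram_counts(values, start, width, n_bins):
    """Count observations in n_bins equal-width intervals [lower, lower + width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)   # which bin this value falls in
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# Hypothetical lengths of stay in days
stays = [1, 2, 2, 3, 3, 3, 4, 4, 5, 7, 8, 12]
start, width, n_bins = 0, 3, 5

counts = histogram_counts(stays, start, width, n_bins)

# Draw a text histogram with a labelled interval on each row
for i, count in enumerate(counts):
    lower = start + i * width
    print(f"{lower:>2} to <{lower + width:<2} | {'#' * count} ({count})")
```

Each row is one bin, so the empty 9 to <12 interval still appears; leaving out empty bins would hide the gap before the longest stay.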
Pictures of Data: Histograms
Box plot
Another common visual display tool is the
box plot
Gives good insight into distribution shape
in terms of skewness and outlying values
Very nice tool for easily comparing
distribution of continuous data in multiple
groups – can be plotted side by side
Box plot
A box plot provides an
excellent visual summary
of many important aspects
of a distribution. The box
stretches from the lower
hinge (defined as the 25th
percentile) to the upper
hinge (the 75th percentile)
and therefore contains the
middle half of the scores
in the distribution.
The median is shown as a
line across the box.
Therefore 1/4 of the
distribution is between this
line and the top of the box
and 1/4 of the distribution
is between this line and
the bottom of the box.
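The three values that define the box can be computed with Python's standard library. Software packages use slightly different conventions for quartiles; the sketch below assumes the "inclusive" method of `statistics.quantiles`, and the lengths of stay are made up for illustration.

```python
import statistics

# Hypothetical lengths of stay in days, including one long stay
stays = [1, 2, 3, 4, 5, 6, 7, 8, 30]

# Quartiles; the 'inclusive' method is one of several conventions
q1, med, q3 = statistics.quantiles(stays, n=4, method="inclusive")
iqr = q3 - q1    # height of the box (interquartile range)

print(f"lower hinge (25th percentile) = {q1}")
print(f"median                        = {med}")
print(f"upper hinge (75th percentile) = {q3}")
print(f"box height (IQR)              = {iqr}")
```

The single stay of 30 days lies far above the upper hinge, so it would show up as an outlying value on the plot without moving the box itself.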
Hospital Length of Stay
Box plot: Length of Stay
Misuse of graphics
" It pays to be wide awake in studying any graph. The
thing looks so simple, so frank, and so appealing that
the careless are easily fooled." - M J Moroney.
Graphs and charts are often misused. The honest
researcher must have a good handle on how graphs can
be used to deliberately mislead people so that such
misadventures can be avoided.
Common tricks used to mislead:




The problem of scaling
The Advertiser's Graph
The transformed graph
The chart with too much data
Which graph to use?
Statistical methods depend on the “form” of a set of data, which
can be assessed with some common useful graphics:
Graph Name    Y-axis          X-axis
Histogram     Count           Category
Scatterplot   Continuous      Continuous
Dot Plot      Continuous      Category
Box Plot      Percentiles     Category
Line Plot     Mean or value   Category
Example of MCQ 1
The arithmetic mean of a set of values:
a) Is a particular type of average.
b) Is a useful summary measure of location if the data are
skewed to the right.
c) Coincides with the median if the distribution of the data
is symmetrical.
d) Is always greater than the median.
e) Cannot be calculated if the data set contains both
positive and negative values
Example of MCQ 2
A histogram:
a) Can be used instead of a pie chart to display categorical
data.
b) Is similar to a bar chart but there are no gaps between the
bars.
c) Contains contiguous bars, with the height of each bar being
proportional to the frequency of the observations in the
range specified by the bar.
d) Can be used to display either a frequency or a relative
frequency distribution.
e) Is used to show the relationship between two variables.
Any questions?