mrsgreenbiology.files.wordpress.com


Topic 1 Statistical Analysis

What you need to know:

1.1.1 State that error bars are a graphical representation of the variability of data.

1.1.2 Calculate the mean and standard deviation of a set of values.

1.1.3 State that the term standard deviation is used to summarize the spread of values around the mean, and that 68% of the values fall within one standard deviation of the mean.

1.1.4 Explain how the standard deviation is useful for comparing the means and the spread of data between two or more samples.

1.1.5 Deduce the significance of the difference between two sets of data using calculated values for t and the appropriate tables.

1.1.6 Explain that the existence of a correlation does not establish that there is a causal relationship between two variables.

Many advertisements for products from cosmetics to cars use Science to sell them.

Look at these examples:

Why do people use Science in advertisements?

People trust science!

Why?

Because they believe that Scientists have carried out research and that the data they have collected can be trusted.

This is because of the rigorous nature of the Scientific Method.

TOK Link!

Can you always believe scientists?

Lies, damned lies and statistics!

What has statistics got to do with Biology?!

• As Scientists we make observations
• We come up with a hypothesis
• We carry out experimental work
• We collect data
• But, what does the data tell us?

• Data needs to be processed
• Data needs to be analysed

Assessment Statements

1.1.2 Calculate the mean and standard deviation of a set of values.

1.1.3 State that the term standard deviation is used to summarize the spread of values around the mean, and that 68% of the values fall within one standard deviation of the mean.

1.1.4 Explain how the standard deviation is useful for comparing the means and the spread of data between two or more samples.

The Mean

When you carried out experiments in the past you will probably have calculated an average or mean of your results.

The mean is a measure of the central tendency (middle value) of the data. But the mean isn’t always the best way of representing the central tendency of our data. It depends on what type of data we have.

There are three measures of central tendency

Data collected from an experiment falls into 3 types:

Data Type (Example): Central Tendency

Nominal (cats, red cars; usually frequency counts): mode

Ordinal (ranked: 1st, 2nd, or relative data): median

Interval (on a measured scale: mm, °C etc.): mean

Statistical Analysis

Measures of central tendency (of a set of data): mean, median, mode.

Task

Find the mean, median and mode of the following data points.

9, 10, 10, 10, 10, 11, 11, 12, 12, 14

Mean = 109/10 = 10.9

Mode = 10

Median: even number of values, so (10 + 11) / 2 = 10.5
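The same task can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the original slides.

```python
import statistics

# Data points from the task above
data = [9, 10, 10, 10, 10, 11, 11, 12, 12, 14]

mean = statistics.mean(data)      # sum of the values divided by how many there are
median = statistics.median(data)  # middle value; with an even count, the average of the two middle values
mode = statistics.mode(data)      # the most frequent value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

Note that the median and mode do not change if the largest value (14) is replaced by something much bigger, while the mean does. That is one reason the choice of measure depends on the data.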

Task

Look at the examples on the next three slides.

Which measure of central tendency would you use in each situation and why?

As part of a promotion for Justin Bieber’s new CD, you can win 1 million VND by guessing how many tracks will be on the album!

To help him make the best guess, Huy has written down the number of tracks on each of his other Justin Bieber CDs.

His results are as follows:

10, 14, 10, 12, 10, 10, 11, 9 and 11

What number should he guess and why?

Indonesia is a country with 237 million people. Most of the people are extremely poor, living in huts. There is a small, but growing “middle class” and there are a few extremely wealthy people. Does mean, mode or median give the best idea of the “average” income?

A fertilizer company developed a new high potassium fertilizer that increased the yield of rice. In their advertisements, what number should they use to state the percentage increase in yield - the mean, median or mode?

Let’s look at some data

Class A height/cm +/- 0.1 cm

150 162 182 165 177 175 166 163 172 185

Class B height/cm +/- 0.1 cm

166 164 163 159 162 166 165 168 167 164

Class A height/cm +/- 0.1 cm

150 162 182 165 177 175 166 163 172 178

Mean = 169

Class B height/cm +/- 0.1 cm

166 168 167 169 170 166 171 168 167 168

Mean = 168

What does this tell us?

Looking at the mean heights for the two classes one might conclude that the heights of the students in each class are pretty similar.

However, the mean doesn’t tell us anything about the spread of data.

Range and variability

Range: the difference between the largest and smallest values.

In one case the range is small and the mean lies in the middle of the frequency distribution.

In the other case the range is large and the data is much more variable.
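To see how two samples with similar means can differ in range, here is a small sketch. It assumes the two sets of class heights from the slide that lists the means (169 and 168); the helper name `data_range` is made up for this example.

```python
# Heights in cm, taken from the class data slide that gives the means
class_a = [150, 162, 182, 165, 177, 175, 166, 163, 172, 178]
class_b = [166, 168, 167, 169, 170, 166, 171, 168, 167, 168]

def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

mean_a = sum(class_a) / len(class_a)
mean_b = sum(class_b) / len(class_b)
range_a = data_range(class_a)
range_b = data_range(class_b)

print("Class A: mean", mean_a, "range", range_a)
print("Class B: mean", mean_b, "range", range_b)
```

The means differ by only 1 cm, but the range for Class A is several times larger than for Class B.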

Standard Deviation

Is a measure of how much the data varies from the mean.

Is meaningful only for data with a normal distribution.

In a normal distribution, about 68% of data lie within 1 standard deviation of the mean.

About 95% of data lie within 2 standard deviations of the mean.
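The 68% and 95% figures come from the shape of the normal curve. The sketch below checks them with `statistics.NormalDist` from Python's standard library (Python 3.8 or later); the function name `fraction_within` is an illustrative choice.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1 = fraction_within(1)
within_2 = fraction_within(2)

print(f"within 1 SD: {within_1:.1%}")
print(f"within 2 SD: {within_2:.1%}")
```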

Calculating Standard Deviation

You will be given the formula (phew!)

You are expected to be able to calculate STD using a graphic or scientific calculator.

You can also use Excel for this:

• Standard deviation calculator

1. Click on the cell where you wish the answer to appear
2. Type: =stdev(
3. Highlight the data using your mouse
4. Close brackets
5. Press enter

Terms in the formula: standard deviation, sum, mean, number of data points, data point
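The formula image itself is not in this transcript. As a sketch, the sample standard deviation (the form Excel's =stdev( computes, dividing by n - 1) can be written as follows. The symbols are assumptions chosen to match the labelled terms: s for standard deviation, Σ for sum, x̄ for mean, n for number of data points, x for a data point. The original slide may have used a different notation or divided by n.

```latex
s = \sqrt{\frac{\sum (x - \bar{x})^{2}}{n - 1}}
```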

What does it mean?

A small STD indicates that the data is clustered closely around the mean.

A large STD indicates a wider spread around the mean.

We may add STD as an error bar on a graph.
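As a quick illustration of small versus large STD, the sketch below uses `statistics.stdev` (the sample standard deviation, dividing by n - 1, like Excel's =stdev) on the class heights from the slide that lists the means. The variable names are illustrative.

```python
import statistics

# Heights in cm, from the class data slide that gives the means
class_a = [150, 162, 182, 165, 177, 175, 166, 163, 172, 178]
class_b = [166, 168, 167, 169, 170, 166, 171, 168, 167, 168]

sd_a = statistics.stdev(class_a)  # sample standard deviation
sd_b = statistics.stdev(class_b)

print(f"Class A: STD = {sd_a:.2f} cm")
print(f"Class B: STD = {sd_b:.2f} cm")
```

Class A has the larger STD, so its heights are spread more widely around the mean even though the two means are almost the same.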

Assessment Statement

1.1.1 State that error bars are a graphical representation of the variability of data.

Error bars

Error bars can be added to graphs to show the range of data or the STD.

This shows us how data is spread.

What does the size of the error bar tell us?

Bigger error bars show a greater spread of data
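When the error bar shows the STD, it runs from (mean - STD) to (mean + STD). The sketch below works out those end points for the two sets of class heights from the slide that lists the means. The helper `error_bar_limits` is a hypothetical name, and one STD either side of the mean is an assumed convention.

```python
import statistics

def error_bar_limits(values):
    """Return the (lower, upper) ends of an error bar of mean +/- 1 standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean - sd, mean + sd

# Heights in cm, from the class data slide that gives the means
class_a = [150, 162, 182, 165, 177, 175, 166, 163, 172, 178]
class_b = [166, 168, 167, 169, 170, 166, 171, 168, 167, 168]

low_a, high_a = error_bar_limits(class_a)
low_b, high_b = error_bar_limits(class_b)

# Total length of each bar: the bigger the bar, the greater the spread
bar_length_a = high_a - low_a
bar_length_b = high_b - low_b

print(f"Class A bar: {low_a:.1f} to {high_a:.1f} cm")
print(f"Class B bar: {low_b:.1f} to {high_b:.1f} cm")
```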

The blue line represents the height of year 12 students in the school. What might the red line represent?

What if distribution isn’t normal?

Then the mean does not lie at the centre of the frequency distribution, and you cannot use STD to show the spread of the data.

Assessment Statement

1.1.5 Deduce the significance of the difference between two sets of data using calculated values for t and the appropriate tables.

Student t Test

The t test tells us the probability (P) that the difference between two sets of data is due to chance alone.

If P = 1 the two sets of data are exactly the same.

If P = 0 the two sets of data are not at all the same.

The higher the value of P, the more the data overlap.

Smaller overlap = more significant results.

How do we use a t test?

A researcher wishes to learn whether the pH of soil affects seed germination of a plant found in forests near her home. She filled ten flower pots with acid soil (pH 5.5) and ten flower pots with neutral soil (pH 7.0) and planted 100 seeds in each pot. The number of seeds that germinated in each pot is shown below, with the mean for each type of soil.

Acid Soil pH 5.5

42 45 40 37 41 41 48 50 45 46 Mean = 43.5

Neutral Soil pH 7.0

43 51 56 40 32 54 51 55 50 48 Mean = 48.0

Hypothesis

The researcher is testing whether soil pH affects germination of the herb.

Her hypothesis (H1) states that the mean germination at pH 5.5 is different from the mean germination at pH 7.0.

The null hypothesis (H0) states that there is no significant difference between the two soils.

Putting the data into a programme to calculate the t value gives us an answer of 1.66
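For readers who want to see what such a programme does, here is a minimal sketch of an unpaired two-sample t calculation. It assumes the equal-variance (pooled) form of the test; the slides do not say which variant the calculator used, and the function name `pooled_t` is made up for this example. On the germination data it gives a value of about 1.67, in line with the 1.66 quoted above.

```python
import math
import statistics

# Seeds germinated per pot (out of 100), from the slides
acid = [42, 45, 40, 37, 41, 41, 48, 50, 45, 46]
neutral = [43, 51, 56, 40, 32, 54, 51, 55, 50, 48]

def pooled_t(sample_1, sample_2):
    """t statistic for an unpaired t test, assuming both samples have equal variance."""
    n1, n2 = len(sample_1), len(sample_2)
    mean_1 = statistics.mean(sample_1)
    mean_2 = statistics.mean(sample_2)
    var_1 = statistics.variance(sample_1)  # sample variance (divides by n - 1)
    var_2 = statistics.variance(sample_2)
    # Combine the two variances, weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * var_1 + (n2 - 1) * var_2) / (n1 + n2 - 2)
    # Standard error of the difference between the two means
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return abs(mean_1 - mean_2) / standard_error

t_value = pooled_t(acid, neutral)
print(f"t = {t_value:.2f}")
```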

(GraphPad QuickCalcs: t test calculator)

We can look this value up in a t table (t-test table). The t table tells you how confident you can be that your values are different.

1. Select the column with the probability that you want, e.g. 0.05 means 95% confidence.

2. Select the row for degrees of freedom. For two data sets the number of degrees of freedom is equal to (n1 + n2) - 2. In this case (10 + 10) - 2 = 18.

3. Compare the critical value in the table with your t-value.

4. The results are significant if the t-value is greater than the critical value.

So, our critical value from the t table (18 degrees of freedom, P = 0.05) is 2.10.

Our calculated t value is 1.66.

If t < critical value we accept the null hypothesis.

If t > critical value we reject the null hypothesis.

In this case 1.66 (t) < 2.10 (critical value), so we accept the null hypothesis.

The data do not show a significant effect of pH on the germination of the plant.
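The decision rule can be written out as a short sketch. The critical value here is hard-coded as 2.101, the figure a standard two-tailed t table gives for 18 degrees of freedom at P = 0.05; that number and the helper name `is_significant` are assumptions of this example rather than something taken from the slides.

```python
# Sample sizes and calculated t value from the germination example
n1, n2 = 10, 10
t_value = 1.66

# Degrees of freedom for two data sets
degrees_of_freedom = (n1 + n2) - 2

# Assumed value read from a standard two-tailed t table (df = 18, P = 0.05)
critical_value = 2.101

def is_significant(t, critical):
    """The difference is significant only if t is greater than the critical value."""
    return abs(t) > critical

significant = is_significant(t_value, critical_value)

if significant:
    print("Reject the null hypothesis: the difference is significant.")
else:
    print("Accept the null hypothesis: no significant difference.")
```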

Limitations to the t test

For the t test to be applied:

• The data must have a normal distribution
• Must have a sample size of at least 10

Assessment Statement

1.1.6 Explain that the existence of a correlation does not establish that there is a causal relationship between two variables.

Correlation:

Relationship between two quantities such that when one quantity changes the other does too

Correlation and Causation

A phrase used to emphasize that correlation between two variables does not automatically imply that one causes the other.

Correlation: The more firemen fighting a fire, the bigger the fire is going to be.

Causation: Firemen make fires bigger

Correlation: As ice cream sales increase, the rate of drowning deaths increases sharply.

Causation: Ice cream causes drowning

Correlation: Since the 1950s, both the atmospheric CO2 level and crime levels have increased sharply.

Causation: Atmospheric CO2 causes crime
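A third variable can produce a strong correlation between two things that do not cause each other. The sketch below uses invented monthly figures (not real data) in which temperature drives both ice cream sales and drownings, and computes the Pearson correlation coefficient by hand. All numbers and names are hypothetical and for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance_sum / math.sqrt(spread_x * spread_y)

# Hypothetical monthly values, invented for illustration only (January to December)
temperature = [5, 8, 12, 17, 22, 27, 30, 29, 24, 17, 10, 6]      # degrees C
ice_cream_sales = [110, 130, 190, 250, 300, 370, 400, 380, 330, 240, 160, 120]
drownings = [1, 1, 2, 3, 5, 6, 8, 7, 5, 3, 2, 1]

r_sales_drownings = pearson_r(ice_cream_sales, drownings)
r_temp_sales = pearson_r(temperature, ice_cream_sales)
r_temp_drownings = pearson_r(temperature, drownings)

print(f"ice cream vs drownings:   r = {r_sales_drownings:.2f}")
print(f"temperature vs ice cream: r = {r_temp_sales:.2f}")
print(f"temperature vs drownings: r = {r_temp_drownings:.2f}")
```

Ice cream sales and drownings are strongly correlated in this made-up data set, yet both simply follow the temperature. The correlation on its own says nothing about which variable, if any, is the cause.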

Determining Causation

Imagine you did badly on a test and guessed that the cause was not studying.

How could you prove this?

If one could rewind history, and change only one small thing, then causation could be observed.

The same student writing the same test under the same circumstances but having studied the night before.

A major goal of scientific experiments is to control variables as best as possible.

We could run an experiment on identical twins who were known to consistently get the same grades on their tests.

One twin is sent to study for six hours while the other is sent to the amusement park.

If their test scores suddenly diverged by a large degree, this would be strong evidence that studying had a causal effect on test scores.

In a controlled experiment like this, a correlation between studying and test scores would almost certainly imply causation.

Headline “Diet of fish can prevent teen violence.”

Participants were a group of 3-year-olds given an “enriched diet, exercise, and cognitive stimulation.” They were compared to a control group who did not go through this same program.

By age 23 they were 64% less likely than a control group of children not on the program to have criminal records.

Assume, of course, that the enriched diet included fish.

Note, also, that the media article does not mention what the other kids ate or did.

Does the data support the headline?

What are some “third variable” explanations?

How could you reword the headline?

Headline “Higher beer prices cut gonorrhea rates”

The research suggests “that raising the price of a six-pack of beer by 20 cents would cut gonorrhea rates by almost 9%.”

Researchers considered gonorrhea rates from 1981 to 1995 among teens and young adults in states that raised the legal drinking age or increased the state beer tax.

“Of the 36 beer tax increases that we reviewed, gonorrhea rates declined among teens aged 15 to 19 in 24 instances. Among young adults aged 20 to 24, they declined in 26 instances.”

Important side note: 1981 is also when the CDC recognized AIDS and HIV; condoms protect against both HIV and gonorrhea.

Does the data support the headline?

What are some “third variable” explanations?

How could you reword the headline?

Headline “Luckiest people” born in summer

Online public survey (40,000 people).

Those born in May were most likely to consider themselves lucky; those born in October had the most negative views of their life.

People who took part in the survey gave their birthdates and rated the degree to which they saw themselves as lucky or unlucky.

The poll found there was a summer-winter divide between people born from March to August and those born from September to February.

50% of those born in May considered themselves lucky; 43% of those born in October.

It isn’t clear when the survey took place (i.e., what month)

Does the data support the headline?

What are some “third variable” explanations?

How could you reword the headline?

See some interesting trends on this website: Gapminder World