Data and Error Treatment and some Statistics

Data
• In science, we observe and measure things
• We call these observations/measurements data (e.g. individual facts, statistics, or items of information)
• Two different types of data
– Quantitative: data that is measured
  • Quantities = numbers
– Qualitative: data that is observed
  • Qualities = words/descriptions of characteristics

Quantitative vs. Qualitative Data

Accuracy vs. Precision
• Accuracy – How close a value is to its expected or theoretical value
• Precision – How reproducible a result is

Error
• The deviation from the expected result is called the error
• This is a common obstacle in science that we seek to minimize

Introduction of Error
• Intrinsic - inherent variability in the system being measured (can be high for biological systems)
• Systematic - variability due to the measurement process
• We have statistical methods to determine how much error has occurred

Deviation
• The deviation tells us whether we are above or below our target or theoretical value
• Deviation = Experimental value – Theoretical value
– Experimental value: the value you obtain from your experiment
– Theoretical value: the expected result from your experiment

Percent Error
• To determine the accuracy of a series of measurements, scientists calculate the percent error (% Error)
• This will show how close your measurements came to the expected or theoretical value
• % Error = (Experimental value – Theoretical value) / Theoretical value x 100%
• The lower the percent error, the more accurate the measurements in the experiment

What about when we have lots of data like in our first lab?
Never fear... STATISTICS will save the day!

Statistics
• Statistics are SUMMARIES of data.
• Many studies generate large numbers of data points.
To make sense of all that data, researchers use statistics that summarize the data, providing a better understanding of overall tendencies within the distributions of scores.

Reasons for using statistics
• aid in summarization
• aid in understanding the trends in the data
• aid in extracting “information” from the data
• aid in communication

Summary Statistics
• Measures of Central Tendency
– Arithmetic mean
– Median
– Mode
• Measures of Variation or Spread of Distributions
– Range
– Standard deviation

Measures of Central Tendency
• Mode - the most frequently occurring score
• Median - the value that lies in the middle after ranking all the scores
• Mean - the average score

Measures of Central Tendency
Mean
… the most frequently used, but it is sensitive to extreme scores
e.g. 1 2 3 4 5 6 7 8 9 10    Mean = 5.5 (median = 5.5)
e.g. 1 2 3 4 5 6 7 8 9 20    Mean = 6.5 (median = 5.5)
e.g. 1 2 3 4 5 6 7 8 9 100   Mean = 14.5 (median = 5.5)

Measures of Central Tendency
Median
… is not sensitive to extreme scores
… use it when you are unable to use the mean because of extreme scores

Measures of Central Tendency
Mode
… does not involve any calculation or ordering of data
… use it when you have categories

Another tool - Histogram
• A Histogram is a variation of a bar chart in which data values are grouped together and put into different classes.
• This grouping allows you to see how frequently data in each class occurs in the data set.

A Distribution Curve
Mean: 54   Median: 56   Mode: 63

The Normal Distribution Curve
"Bell" Curve
In everyday life many variables such as height, weight, shoe size and exam marks all tend to be normally distributed, that is, they all tend to look like the following curve.
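The three example score lists above can be checked with a few lines of Python. This is only a sketch using the standard-library statistics module; the eye-color list is a made-up illustration of using the mode with categories and is not from the slides.

```python
from statistics import mean, median, mode

# The three example data sets from the slides; only the last score changes.
data_sets = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 20],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
]

# One (mean, median) pair per data set.
summaries = [(mean(d), median(d)) for d in data_sets]
for d, (m, med) in zip(data_sets, summaries):
    print(f"last score {d[-1]:>3}: mean = {m}, median = {med}")

# The mode also works on categories (hypothetical observations).
eye_colors = ["brown", "blue", "brown", "green", "brown"]
most_common = mode(eye_colors)
print("mode:", most_common)
```

The mean moves from 5.5 to 14.5 as the last score grows, while the median stays at 5.5, which is why the median is preferred when extreme scores are present.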
The Normal Distribution Curve
[Figure: normal distribution curve labeled "Mean, Median, Mode"; x-axis from 0 to 100]
• It is bell-shaped and symmetrical about the mean
• The mean, median and mode are equal
• It is a function of the mean and the standard deviation

Variation or Spread of Distributions
Measures that indicate the spread of scores:
• Range
• Standard Deviation

Variation or Spread of Distributions
Range
• Compares the minimum score with the maximum score
• Max score – Min score = Range
• Gives only a rough indication of the spread of the scores because it does not tell us much about the shape of the distribution or how much the scores vary from the mean

Variation or Spread of Distributions
Standard Deviation
• Tells us what is happening between the minimum and maximum scores
• Tells us how much the scores in the data set vary around the mean
• Useful when we need to compare groups using the same scale

Standard Deviations in a Normal Distribution

So how do you calculate a Standard Deviation?
Short answer: Let Excel do it for you
FYI Long answer: Calculating Standard Deviation (from scratch, not using Excel)
1. Compute the mean, X̄, for the data set.
2. Compute each deviation by subtracting the mean, X̄, from each value, Xi.
3. Square each individual deviation, (Xi – X̄)².
4. Add up the squared deviations.
5. Divide by one less than the sample size.
6. Take the square root.

Let's see how all this relates to a real example

A flighty example
• Suppose a biologist wants to determine the home range of a particular species of bird.
– From all members of the population of this species, the biologist collects a sample of 100 birds and attaches a radio transmitter to the leg of each bird.
– After monitoring the position of the birds for 1 month, the biologist finds that the mean home range of the sample is 23.1 km.
• Is this the exact value of the home range of the population?
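Before going further with the bird example, the six-step standard deviation calculation above can be sketched in Python. The function name sample_std_dev and the scores list are made up for illustration; the result is compared against the standard library's statistics.stdev, which also divides by one less than the sample size.

```python
import math
from statistics import stdev

def sample_std_dev(values):
    """Sample standard deviation, following the six steps in order."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n                    # 1. compute the mean
    deviations = [x - mean for x in values]   # 2. subtract the mean from each value
    squared = [d ** 2 for d in deviations]    # 3. square each deviation
    total = sum(squared)                      # 4. add up the squared deviations
    variance = total / (n - 1)                # 5. divide by one less than the sample size
    return math.sqrt(variance)                # 6. take the square root

# Hypothetical scores, not taken from the slides.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
score_range = max(scores) - min(scores)       # Max score - Min score = Range

print("range:", score_range)
print("standard deviation (by hand):", sample_std_dev(scores))
print("standard deviation (statistics.stdev):", stdev(scores))
```

The two standard deviation values should agree up to floating-point rounding, which is a quick way to confirm the hand calculation.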
Point Estimation
• The question we can ask is, “Well, how close is 23.1 km to the mean home range for the population?”
– This value (23.1) is called the point estimate; it is the value we will use to estimate the population value
– This is the “Best Guess” of the value of the population parameter, given your sample information
• The way we will answer this question is to generate a range of values within which we can be reasonably confident that the population value falls
– This process is called interval estimation
– The resulting range of values is called a confidence interval.

Returning to our flighty example…
• The biologist has a point estimate of 23.1 km, and can guess that the true population mean is probably between 20 and 26.
• While the biologist may be confident that the true home range is between 20 and 26, she may be even more confident with a range of 15-30.
• Thus, the wider the interval estimate, the greater the confidence we have that the interval contains the population mean.

Confidence Intervals
• Now, it is possible and preferable to be more quantitative about how we go about determining the interval (rather than just guessing).
• So, a confidence interval provides a range of numbers along with the percentage confidence that the parameter lies within it
– A 95% confidence interval means that 95% of similarly constructed intervals will contain the population parameter
– Note also that although there are many different CIs that we could construct, in practice the 90%, 95%, and 99% CI are used most often.

Computing Confidence Intervals: Sampling Error
• Realize that whenever we use a sample to estimate a population characteristic, we are going to have some amount of sampling error.
– Sometimes, we will overestimate the true value
– Sometimes, we will underestimate the true value
[Figure: True Value]

Standard Error
• But, we also know (from the central limit theorem) that if our sample size is sufficiently large (n > 30), the sampling distribution of the sample mean will be approximately normal.
– Thus, 95% of the sample means will fall within ± 2 standard errors (the standard deviation of the sampling distribution) of the mean of the population

What does this mean?
• When we collect a sample of data, we can be reasonably certain that the true population value falls within 2 standard errors (plus or minus) of our sample mean!
– Bottom line: If you take X̄ and add and subtract about 2 standard errors from it, you get the 95% confidence interval

Elements of a Confidence Interval
[Figure: a confidence interval, labeled with the lower confidence limit, the sample statistic (point estimate), and the upper confidence limit]

Common Levels of Confidence
• Commonly used confidence levels are 90%, 95%, and 99%

Confidence Depends on Interval (z)
Standard Error (SE) = σ / √n
Confidence interval = X̄ ± z x SE

Confidence level    z        Interval
90% CI              1.65     X̄ ± 1.65 x SE
95% CI              1.96     X̄ ± 1.96 x SE
99% CI              2.575    X̄ ± 2.575 x SE

Lower 95% confidence limit = Mean – 1.96 x SE
Upper 95% confidence limit = Mean + 1.96 x SE

Let’s return to our flighty example
• Our biologist found a point estimate of 23.1 km for the mean range size
• Suppose the biologist also knows that the standard deviation for the range of this bird population is 4.7 km
• Recall that the sample size was 100 birds
• To find the 95% CI:
– 23.1 ± 1.96 x (4.7/√100)
– 23.1 ± 0.9212
– 95% CI is 22.18 to 24.02 km

Now for a little on Correlations, a way to show relationships between experimental variables

Types of Correlation
[Figure panels: Positive correlation, Negative correlation, No correlation]

Simple linear regression describes the linear relationship between a predictor variable, plotted on the x-axis, and a response variable, plotted on the y-axis
The regression line is calculated to minimize the sum of the squared vertical differences between the points and the line

Dependent Variable (Y)
Values measured in response to changes in the independent variable

Independent Variable (X)
Values chosen by the experimenter

How well does your regression equation truly represent your set of data?
• One of the ways to determine the answer to this question is to examine the correlation coefficient, r
• Simple Linear Correlation (Pearson r)
– Shows the strength and direction of the linear relationship between two variables
– -1 ≤ r ≤ 1; the closer r is to 1 or -1, the stronger the correlation

[Figure: several sets of (x, y) points, with the correlation coefficient, r, of x and y for each set]
• r reflects the noisiness and direction of a linear relationship
• r does NOT reflect the slope of the linear relationship
• r does NOT reflect nonlinear relationships

Outliers

How well does your regression equation truly represent your set of data?
• Another way is to examine the coefficient of determination, r², i.e. the “strength” of the relationship
• Coefficient of Determination (r²)
– indicates how much of the variability of one factor can be explained by its relationship to another factor
– 0 ≤ r² ≤ 1; the higher the value, the better the fit
– For example, an r² of 0.70 implies that 70% of the variation in y is accounted for by its relationship to x.
– As a rule of thumb, an r² of 0.7 or higher is often taken to indicate a reasonable model for a relationship between variables.

Outliers
• Usually exclude data that is more than 2 SD from the trend line
[Figure panels: Analysis without the outliers; Outlier along the trend line; Outlier perpendicular to the trend line]
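Two of the calculations in these slides, the 95% confidence interval for the bird example and the Pearson correlation coefficient, can be checked with a short Python sketch. The function names and the xs/ys data are made up for illustration. The confidence interval assumes the population standard deviation is known and takes the standard error as σ/√n.

```python
import math

def confidence_interval(sample_mean, sigma, n, z=1.96):
    """Return (lower, upper) limits: mean -/+ z x SE, with SE = sigma / sqrt(n)."""
    se = sigma / math.sqrt(n)
    margin = z * se
    return sample_mean - margin, sample_mean + margin

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Bird example from the slides: mean 23.1 km, sigma 4.7 km, n = 100.
low, high = confidence_interval(23.1, 4.7, 100)
print(f"95% CI: {low:.2f} to {high:.2f} km")  # prints "95% CI: 22.18 to 24.02 km"

# Hypothetical (x, y) measurements with a strong positive linear trend.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(xs, ys)
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}")

# Making the line 10 times steeper leaves r unchanged: r is not the slope.
r_steeper = pearson_r(xs, [10 * y for y in ys])
```

Passing z=1.65 or z=2.575 to confidence_interval gives the 90% and 99% intervals from the table of confidence levels.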