#### Transcript Data Handling/Statistics - LSU Macromolecular Studies Group

Data Handling/Statistics

There is no substitute for books: you need professional help! My personal favorites, from which this lecture is drawn:
•The Cartoon Guide to Statistics, L. Gonick & W. Smith
•Data Reduction and Error Analysis for the Physical Sciences, P. R. Bevington
•Workshop Statistics, A. J. Rossman & B. L. Chance
•Numerical Recipes, W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling
•Origin 6.1 Users Manual, MicroCal Corporation

Outline
•Our motto
•What those books look like
•Stuff you need to be able to look up
  •Samples & Populations
  •Mean, Standard Deviation, Standard Error
  •Probability
  •Random Variables
  •Propagation of Errors
•Stuff you must be able to do on a daily basis
  •Plot
  •Fit
  •Interpret

Our Motto
That which can be taught can be learned.

An opposing, non-CMC IGERT viewpoint
The “progress” of civilization relies on being able to do more and more things while thinking less and less about them.

What those books look like

The Cartoon Guide to Statistics
In this example, the authors provide a step-by-step analysis of the statistics of a poll. Similar logic and style tell you how to tell two populations apart, whether your measly five replicate runs truly represent the situation, etc. The Cartoon Guide gives an enjoyable account of statistics in scientific and everyday life.

An Introduction to Error Analysis
A very readable text, but with enough math to be rigorous. The cover says it all: the book’s emphasis is how statistics and error analysis are important in the everyday. Author John Taylor is known as “Mr. Wizard” at Univ. of Colorado, for his popular science lectures aimed at youngsters.

Bevington
Bevington is really good at introducing basic concepts, along with simple code that really, really works. Our lab uses a lot of Bevington code, often translated from Fortran to Visual Basic.

“Workshop Statistics”
This book has a website full of data that it tells you how to analyze.
The test cases are often pretty interesting, too. Many little shadow boxes provide info.

“Numerical Recipes”
A more modern and thicker version of Bevington. Code comes in Fortran, C, Basic (others?). Includes advanced topics like digital filtering, but harder to read on the simpler things. With this plus Bevington and a lot of time, you can fit, smooth, filter practically anything.

Stuff you need to be able to look up

Samples vs. Populations
Sample: the world as we understand it, based on science.
Population: the world as God understands it, based on omniscience.
Statistics is not art but artifice: a bridge to help us understand phenomena, based on limited observations.

Our problem
Sitting behind the target, can we say with some specific level of confidence whether a circle drawn around this single arrow (a measurement) hits the bullseye (the population mean)? Measuring a molecular weight by one Zimm plot, can we say with any certainty that we have obtained the same answer God would have gotten?

| Quantity | Sample View | Population View |
|---|---|---|
| Data | $x_1, x_2, x_3, \ldots, x_n$ | |
| Average | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ | $\mu = E(x)$ |
| Variance | $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ | $\sigma^2$ |
| Standard deviation | $s = \sqrt{s^2}$ | $\sigma$ |
| Standard error of mean | $\mathrm{SEM} = s/\sqrt{n}$ | $\sigma(\bar{x}) = \sigma/\sqrt{n}$ |

Sample View: direct, experimental, tangible
The single most important thing about this is the reduction in the standard error of the mean (the standard deviation of the average) according to inverse root n:

$$\mathrm{SEM} = \frac{s}{\sqrt{n}} \sim \frac{1}{\sqrt{n}} \quad (\text{for large } n)$$

Three times better takes 9 times longer (or costs 9 times more, or takes 9 times more disk space). If you remembered nothing else from this lecture, it would be a success!

Population View: conceptual, layered with arcana!
The last population-view equation in the table, $\sigma(\bar{x}) = \sigma/\sqrt{n}$ (purple on the slide), is an expression of the central limit theorem. If we measure many averages, we do not always get the same average: $\bar{x}$ is itself a random variable!

“If one takes random samples of size n from a population with mean $\mu$ and standard deviation $\sigma$, then (for large n) $\bar{x}$ itself approaches a normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$” (from “Cartoon...”).
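The inverse-root-n rule can be checked numerically. The following is a minimal Python sketch, not part of the original lecture: the function names, the uniform "population", the number of trials and the seed are all illustrative assumptions. It computes the sample-view quantities from the table and then compares the spread of many averages taken with n = 9 and n = 81 points.

```python
import math
import random
import statistics


def sample_summary(xs):
    """Return (average, standard deviation s, standard error of the mean)."""
    n = len(xs)
    mean = statistics.fmean(xs)
    s = statistics.stdev(xs)      # sample standard deviation, n - 1 denominator
    sem = s / math.sqrt(n)        # SEM = s / sqrt(n)
    return mean, s, sem


def spread_of_averages(n, trials=2000, seed=0):
    """Standard deviation of many averages, each over n uniform random draws.

    The uniform population, trial count and seed are arbitrary assumptions.
    """
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        draws = [rng.random() for _ in range(n)]
        averages.append(statistics.fmean(draws))
    return statistics.stdev(averages)


mean, s, sem = sample_summary([1.0, 2.0, 3.0, 4.0, 5.0])
print(mean, s, sem)

# Nine times more data per average should shrink the spread about threefold.
ratio = spread_of_averages(9) / spread_of_averages(81)
print(ratio)
```

The printed ratio should land near 3, which is the "three times better takes 9 times longer" statement in numerical form.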
The central limit theorem means that if you want to estimate $\sigma$, which only God really knows, you should measure many averages, each involving n data points, figure their standard deviation, and multiply by $n^{1/2}$. This is hard work!

Huh?
A lot of times, $\sigma$ is approximated by s. If you wanted to estimate the population average $\mu$, the best you can do is to measure many averages and average those. A lot of times $\mu$ is approximated by $\bar{x}$.

IT’S HARD TO KNOW WHAT GOD DOES.

I think the $\sigma$ in the purple equation, $\sigma(\bar{x}) = \sigma/\sqrt{n}$, should be an s, but the equation only works in the limit of large n anyhow, so there is no difference.

You got to compromise, fool!
The t-distribution was invented by a statistician named Gosset, who was forced by his employer (the Guinness brewery!) to publish under a pseudonym. He chose “Student” and his t-distribution is known as Student’s t. The Student’s t distribution helps us assign confidence in our imperfect experiments on small samples.
Input: desired confidence level, estimate of population mean (or estimated probability), estimated error of the mean (or probability).
Output: ± something

Probability
…is another arcane concept in the “population” category: something we would like to know but cannot. As a concept, it’s wonderful. The true mean of a distribution of mass is the sum, over all masses, of the probability of that mass times the mass. The standard deviation follows a similarly simple rule. In what follows, F means a normalized frequency (think mole fraction!) and P is a probability density. P(x)dx represents the fraction of things (think molecules) with property x (think mass) between x − dx/2 and x + dx/2.

Discrete system:

$$\mu = \sum_{\text{all } x} x\,F(x) \qquad \sigma^2 = \sum_{\text{all } x} (x - \mu)^2\,F(x)$$

Continuous system:

$$\mu = \int x\,P(x)\,dx \qquad \sigma^2 = \int (x - \mu)^2\,P(x)\,dx$$

Here’s a normal probability density distribution from “Workshop…” where you use actual data to discover.
[Figure: bell-shaped normal curve; about 68% of results fall within ±σ of μ and about 95% of results within ±2σ.]

What it means
Although you don’t usually know the distribution (either $\mu$ or $\sigma$), about 68% of your measurements will fall within 1$\sigma$ of $\mu$, if the distribution is a “normal”, bell-shaped curve. t-tests allow you to kinda play this backwards: given a finite sample size, with some average, $\bar{x}$, and standard deviation, s (inferior to $\mu$ and $\sigma$, respectively), how far away do we think the true $\mu$ is?

Details
No way I could do it better than “Cartoon…” or “Workshop…” Remember…this is the part of the lecture entitled “things you must be able to look up.”

Propagation of errors
Suppose you give 30 people a ruler and ask them to measure the length and width of a room. Owing to general incompetence, otherwise known as human nature, you will get not one answer but many. Your averages will be $\bar{L}$ and $\bar{W}$, and your standard deviations $s_L$ and $s_W$. Now, you want to buy carpet, so you need the area A = L·W. What is the uncertainty in A due to the measurement errors in L and W?

Answer! There is no telling….but you have several options to estimate it.

A = L·W example
Here are your measured data:

$$L = 30 \pm 1\ \text{ft} \qquad W = 19 \pm 2\ \text{ft}$$

You can consider “most” and “least” cases:

$$A_{max} = L_{max} W_{max} = 31 \times 20\ \text{ft}^2 = 620\ \text{ft}^2$$

$$A_{min} = L_{min} W_{min} = 29 \times 17\ \text{ft}^2 = 493\ \text{ft}^2 \approx 490\ \text{ft}^2$$

$$A_{average} = \frac{620 + 493}{2}\ \text{ft}^2 \approx 557\ \text{ft}^2$$

$$\text{estimated uncertainty}: \frac{620 - 490}{2}\ \text{ft}^2 = 65\ \text{ft}^2$$

$$\text{reported area}: (560 \pm 65)\ \text{ft}^2$$

(Taken literally, W = 19 ± 2 ft gives a largest width of 21 ft and $A_{max}$ = 651 ft², which would lead to about (570 ± 80) ft²; the slide uses 20 ft.)

Another way
We can use a formula for how $\sigma$ propagates. Suppose some function y (think area) depends on two measured quantities t and s (think length & width). Then the variance in y follows this rule:

$$\sigma_y^2 = \left(\frac{\partial y}{\partial t}\right)^2 \sigma_t^2 + \left(\frac{\partial y}{\partial s}\right)^2 \sigma_s^2$$

Aren’t you glad you took partial differential equations? What??!! You didn’t? Well, sign up. PDE is the bare minimum math for scientists.

Translation in our case, where A = L·W:

$$\sigma_A^2 = \left(\frac{\partial A}{\partial L}\right)^2 \sigma_L^2 + \left(\frac{\partial A}{\partial W}\right)^2 \sigma_W^2 = W^2 \sigma_L^2 + L^2 \sigma_W^2$$

Problem: we don’t know W, L, $\sigma_L$ or $\sigma_W$! These are population numbers we could only get if we had the entire planet measure this particular room.
We therefore assume that our measurement set is large enough (n = 30) that we can use our measured averages for W and L and our measured standard deviations for $\sigma_L$ and $\sigma_W$:

$$\sigma_A^2 = (19\ \text{ft})^2 (1\ \text{ft})^2 + (30\ \text{ft})^2 (2\ \text{ft})^2 = 3961\ \text{ft}^4 \qquad \sigma_A = 63\ \text{ft}^2$$

So the value to report is (19 · 30) ft² ± 63 ft², or

$$A = (570 \pm 63)\ \text{ft}^2$$

Compare this to our empirical, most/least calculation:

$$A = (560 \pm 65)\ \text{ft}^2$$

Error propagation caveats
The equation

$$\sigma_y^2 = \left(\frac{\partial y}{\partial t}\right)^2 \sigma_t^2 + \left(\frac{\partial y}{\partial s}\right)^2 \sigma_s^2$$

assumes normal behavior. Large systematic errors (for example, 3 euroguys who report their values in metric units) are not taken into consideration properly. In many cases, there will be good knowledge a priori about the uncertainty in one or more parameters: in photon counting, if N is the number of photons detected, then $\sigma_N = N^{1/2}$. Systematic error is not included in this estimate, so photon folk are well advised to just repeat experiments to determine real standard deviations that do take systematic errors into account.

Stuff you must know how to do on a daily basis

Plot!!!
[Plot: Γ/Hz versus q²/10¹⁰ cm⁻², with linear fit; sample labeled “Larger Particle 30.9 g/ml”.]
r = 0.99987, r² = 0.9997: 99.97% of the trend can be explained by the fitted relation.

| Parameter | Value | Error |
|---|---|---|
| A | -0.00267 | 44.94619 |
| B | 2.25237E-7 | 8.46749E-10 |

| R | SD | N | P |
|---|---|---|---|
| 0.99987 | 118.8859 | 21 | <0.0001 |

Intercept = −0.003 ± 45 (i.e., zero!)

The same data
How to find this file! twilight users rcueto e739
[Plot: D_app / cm² s⁻¹ versus q²/10¹⁰ cm⁻², with linear fit; sample labeled “Larger Particle 30.9 g/ml”.]

| Parameter | Value | Error |
|---|---|---|
| A | 2.2725E-7 | 7.62107E-10 |
| B | -3.09723E-20 | 1.43575E-20 |

| R | SD | N | P |
|---|---|---|---|
| -0.44355 | 2.01583E-9 | 21 | 0.044 |

r = −0.444, r² = 0.20: only 20% of the variability in the data can be explained by the line! While Γ depended on q², D_app does not!

What does the famous “r²” really tell us?
Suppose you invented a new polymer that you hoped was more stable over time than its predecessor… So you check.

| time | 2 | 4 | 8 | 12 | 16 | 24 | 36 | 48 |
|---|---|---|---|---|---|---|---|---|
| melting point | 110.2 | 110.9 | 108.8 | 109.1 | 109.0 | 108.5 | 110.0 | 109.2 |

Question: What describes the data better:
•A simple average (meaning things aren’t really changing over time: it is stable), OR
•A trend (meaning melting point might be dropping over time)?

How well does the mean describe the data?
The deviations of the individual points from the mean are called “residuals.” The sum of the squares of all the residuals characterizes how well the data fit the mean.

$$S_t = \sum_i (T_i - T_{mean})^2 \quad (= 4.6788)$$

How much better is a fit (i.e., a regression in this case)?
The regression also has residuals. The sum of their squares is smaller than $S_t$.

$$S_r = \sum_i (T_i - T_{fit,i})^2 \quad (= 4.3079)$$

The r² value simply compares the fit to the mean, by comparing the sums of the squares:

$$r^2 = \frac{S_t - S_r}{S_t} = \frac{4.6788 - 4.3079}{4.6788} = 0.0793$$

In our case, the fit was NOT a dramatic improvement, explaining only 7.9% of the variability of the data!

Plot showing 95% confidence limits. Excel doesn’t excel at this!
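The r² bookkeeping for the melting-point example above can be redone in a few lines. This is a minimal Python sketch using only the tabulated times and melting points from the slide; the function name and the plain least-squares line are assumptions about how the slide values were obtained, not a record of the Origin or Excel calculation.

```python
def sums_of_squares(x, y):
    """Return (S_t, S_r, r_squared) for a straight-line least-squares fit."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n

    # Ordinary least-squares slope and intercept.
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    # S_t: squared residuals about the mean; S_r: about the fitted line.
    s_t = sum((yi - y_mean) ** 2 for yi in y)
    s_r = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return s_t, s_r, (s_t - s_r) / s_t


time = [2, 4, 8, 12, 16, 24, 36, 48]
melting_point = [110.2, 110.9, 108.8, 109.1, 109.0, 108.5, 110.0, 109.2]

s_t, s_r, r2 = sums_of_squares(time, melting_point)
print(s_t, s_r, r2)  # compare with the slide values 4.6788, 4.3079 and 0.0793
```

Because r² is built from S_t and S_r directly, a small r² here says only that the line does little better than the mean, not that the measurements themselves are poor.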
[Plot: Rh/nm versus c/g-ml⁻¹, with linear fit and 95% confidence limits; annotations mark the range of Rg values observed in MALLS and (3/5)^{1/2} Rh.]
[6/7/01 13:44 "/Rhapp" (2452067)] Linear Regression for BigSilk_Ravgnm: Y = A + B*X

| Parameter | Value | Error |
|---|---|---|
| A | 20.88925 | 0.19213 |
| B | 0.01762 | 0.01105 |

| R | SD | N | P |
|---|---|---|---|
| 0.62332 | 0.28434 | 6 | 0.18611 |

Interpreting data: Life on the bleeding edge of cutting technology. Or is that bleating edge?
[Plot: Rg/nm versus M, from 3E6 to 2E7; fitted values n = 0.324 +/- 0.04, df = 3.12 +/- 0.44.]
The noise level in individual runs is much less than the run-to-run variation. That’s why many runs are a good idea. More would be good here, but we are still overcoming the shock that we can do this at all!
Excel does not automatically provide estimates!

Correlation Caveat!
Correlation ≠ Cause. No, Correlation = Association.

| Country | Life Expectancy | People per TV | TV's per person |
|---|---|---|---|
| Angola | 44 | 200 | 0.0050 |
| Australia | 76.5 | 2 | 0.5000 |
| Cambodia | 49.5 | 177 | 0.0056 |
| Canada | 76.5 | 1.7 | 0.5882 |
| China | 70 | 8 | 0.1250 |
| Egypt | 60.5 | 15 | 0.0667 |
| France | 78 | 2.6 | 0.3846 |
| Haiti | 53.5 | 234 | 0.0043 |
| Iraq | 67 | 18 | 0.0556 |
| Japan | 79 | 1.8 | 0.5556 |
| Madagascar | 52.5 | 92 | 0.0109 |
| Mexico | 72 | 6.6 | 0.1515 |
| Morocco | 64.5 | 21 | 0.0476 |
| Pakistan | 56.5 | 73 | 0.0137 |
| Russia | 69 | 3.2 | 0.3125 |
| South Africa | 64 | 11 | 0.0909 |
| Sri Lanka | 71.5 | 28 | 0.0357 |
| Uganda | 51 | 191 | 0.0052 |
| United Kingdom | 76 | 3 | 0.3333 |
| United States | 75.5 | 1.3 | 0.7692 |
| Vietnam | 65 | 29 | 0.0345 |
| Yemen | 50 | 38 | 0.0263 |

[Chart: Life Expectancy versus TV's per person; fitted line y = 35.441x + 57.996, R² = 0.5782.]
58% of the variation in life expectancy is associated with TV’s. Would we save lives by sending TV’s to Uganda?

Linearize it!
[Chart: Life Expectancy versus People per TV; fitted line y = -0.1156x + 70.717, R² = 0.6461.]
Linearity is improved by plotting life expectancy vs. people per TV rather than TV’s per person.
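Both regressions quoted on these slides can be recomputed from the country table. The sketch below is a minimal Python version; the helper name `linear_fit` is hypothetical, and TV's per person is taken as 1/(people per TV) rather than from the rounded column, so the last digits may differ slightly from the chart labels.

```python
def linear_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns r^2 too."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Data from the table, in the same country order (Angola ... Yemen).
life_expectancy = [44, 76.5, 49.5, 76.5, 70, 60.5, 78, 53.5, 67, 79, 52.5,
                   72, 64.5, 56.5, 69, 64, 71.5, 51, 76, 75.5, 65, 50]
people_per_tv = [200, 2, 177, 1.7, 8, 15, 2.6, 234, 18, 1.8, 92,
                 6.6, 21, 73, 3.2, 11, 28, 191, 3, 1.3, 29, 38]
tvs_per_person = [1.0 / p for p in people_per_tv]

slope_tv, intercept_tv, r2_tv = linear_fit(tvs_per_person, life_expectancy)
slope_ppl, intercept_ppl, r2_ppl = linear_fit(people_per_tv, life_expectancy)

print("TV's per person:", slope_tv, intercept_tv, r2_tv)
print("People per TV:  ", slope_ppl, intercept_ppl, r2_ppl)
```

Neither r² says anything about cause: the same 22 countries give a different "percent explained" depending only on which way the TV variable is expressed.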
[Chart axes: Life Expectancy (0 to 90) versus People per TV (0 to 250).]

Observant scientists are adept at seeing curvature. Train your eye by looking for defects in wallpaper, door trim, lumber bought at Home Depot, etc. And try to straighten out your data, rather than let the computer fit a nonlinear form, which it is quite happy to do!

Plots are pictures of science, worth thousands of words in boring tables.
These 4 plots all have the same slopes, intercepts and r values!

From whence do those lines come? Least squares fitting.
“Linear fits”: the fitted coefficients appear only linearly in the expression, e.g., $y = a + bx + cx^2 + dx^3$. An analytical “best fit” exists!
“Nonlinear fits”: at least some of the fitted coefficients appear in transcendental arguments, e.g., $y = a + b e^{-cx} + d\cos(ex)$. Best fit found by trial & error. Beware false solutions! Try several initial guesses!

CURVE FITTING: Fit the trend or fit the points?
Earth’s mean annual temp has natural fluctuations year to year. To capture a long term trend, we don’t want to fit the points, so use a low-order polynomial regression.
BUT, the bumps and jiggles in the U.S. population data are “real.” We don’t want to lose them in a simple trend.
•REGRESSION: We lost the baby boom!
•SINGLE POLYNOMIAL: Does funny things (see 1905).
•SPLINE: YES: Lots of individual polynomials give us a smooth fit (especially good for interpolation).

All data points are not created equal.
Since that one point has so much error (or noise), should we really worry about minimizing its square? No. We should minimize “chi-squared.”

$$\chi^2 = \sum_{i=1}^{n} \frac{(y_i - y_{fit,i})^2}{\sigma_i^2}$$

Goodness of fit parameter that should be unity for a “fit within error”:

$$\chi^2_{reduced} = \frac{1}{\nu} \sum_{i=1}^{n} \frac{(y_i - y_{fit,i})^2}{\sigma_i^2}$$

$\nu$ is the # of degrees of freedom: $\nu$ = n − # of parameters fitted.

Why is a fit based on chi-squared so special?
Based on chi: these two curves fit equally well!
Based on |chi| (absolute value): these three curves fit equally well!
Based on max(chi): outliers exert too strong an influence!
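To see what minimizing chi-squared does with one noisy point, here is a minimal Python sketch of a weighted straight-line fit. It uses the standard closed-form solution with weights 1/σ²; the function names and the five data points (the last one given a deliberately large σ) are made up for illustration and do not come from the lecture.

```python
def weighted_line_fit(x, y, sigma):
    """Fit y = a + b*x by minimizing chi-squared with weights 1/sigma_i^2."""
    w = [1.0 / s ** 2 for s in sigma]
    s0 = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    delta = s0 * sxx - sx ** 2
    a = (sxx * sy - sx * sxy) / delta
    b = (s0 * sxy - sx * sy) / delta
    return a, b


def chi_squared(x, y, sigma, a, b):
    """Sum of squared residuals, each scaled by its own sigma."""
    return sum(((yi - (a + b * xi)) / si) ** 2
               for xi, yi, si in zip(x, y, sigma))


def reduced_chi_squared(chi2, n_points, n_params):
    """Chi-squared divided by the degrees of freedom, nu = n - parameters."""
    return chi2 / (n_points - n_params)


# Hypothetical data scattered about y = 1 + 2x; the last point is an outlier
# that also carries a large error bar.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 12.0]
sigma = [0.2, 0.2, 0.2, 0.2, 5.0]

a_w, b_w = weighted_line_fit(x, y, sigma)          # chi-squared fit
a_u, b_u = weighted_line_fit(x, y, [1.0] * len(x))  # all points weighted equally

chi2 = chi_squared(x, y, sigma, a_w, b_w)
print("weighted slope:", b_w, "unweighted slope:", b_u)
print("reduced chi-squared:", reduced_chi_squared(chi2, len(x), 2))
```

With equal weights the noisy point drags the slope upward; with 1/σ² weights it barely matters, which is the point of the slide.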
χ² caveats
•Chi-square lower than unity is meaningless…if you trust your σ² estimates in the first place.
•Fitting too many parameters will lower χ², but this may be just doing a better and better job of fitting the noise!
•A fit should go smoothly THROUGH the noise, not follow it!
•There is such a thing as enforcing a “parsimonious” fit by minimizing a quantity a bit more complicated than χ². This is done when you have a priori information that the fitted line must be “smooth”.

Achtung! Warning!
This lecture is an example of a very dangerous phenomenon: “what you need to know.” Before you were born, I took a statistics course somewhere in undergraduate school. Most of this stuff I learned from experience….um… experiments. A proper math course, or a course from LSU’s Department of Experimental Statistics, would firm up your knowledge greatly.

AND BUY THOSE BOOKS! YOU WILL NEED THEM!

Cool Excel/Origin Demo
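As a closing sketch tied to the propagation-of-errors example earlier in the lecture (A = L·W with L = 30 ± 1 ft and W = 19 ± 2 ft), the variance formula can be evaluated in a few lines of Python. This is not the Excel/Origin demo itself; the function name is hypothetical, and the calculation assumes independent, normally distributed errors in L and W, as the caveats slide notes.

```python
import math


def area_with_uncertainty(length, sigma_length, width, sigma_width):
    """Propagate uncertainties in length and width into the area A = L * W.

    Uses sigma_A^2 = W^2 * sigma_L^2 + L^2 * sigma_W^2, i.e. the partial
    derivatives dA/dL = W and dA/dW = L.
    """
    area = length * width
    variance = (width * sigma_length) ** 2 + (length * sigma_width) ** 2
    return area, math.sqrt(variance)


area, sigma_area = area_with_uncertainty(30.0, 1.0, 19.0, 2.0)
print(area, sigma_area)  # compare with (570 ± 63) ft² from the lecture
```

Most of the 63 ft² comes from the width term (30 ft × 2 ft), so a better width measurement would help far more than a better length measurement.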