Transcript Slide 1
Probability & Statistics
Data (Bock Ch2)

What are Data?
• Data can be numbers, record names, or other labels.
• Not all data represented by numbers are numerical data (e.g., 1=male, 2=female).
• Data are useless without their context…

The “W’s”
• To provide context we need the W’s of the data:
– Who
– What (and in what units)
– When
– Where
– Why (if possible)
– and How
• Note: the answers to “who” and “what” are essential.

Organization!
Organization is important! Here are some customer records from a company’s database:

B001OAA 24 Veterans 902 Boston Y Garbage 43 Y 440 17 15.98 Chicago 18 Kansas 10.99 B0015Y6 413 Y N 368 Fenway B001OAA 312

Without some organization, what can be made of this data?

Data Table
• The following data table clearly shows the context of the data presented:

Name          Ship To    Price   Area Code   # of Previous Purchases   Gift?   Catalog ID   Artist
Katherine H.  Ohio       10.99   440         17                        N       B0015Y6      Kansas
Samuel P.     Illinois   16.99   312         3                         Y       B002BK9      Boston
Chris G.      New York   15.98   413         0                         N       B0068ZVQ     Chicago
Monique D.    Canada     11.98   902         10                        Y       B001OAA      Garbage

• Notice that this data table tells us the What (column titles) and Who (row titles) for these data.

Data Table
• The “What”: the column titles of the table above.
• The “Who”: the rows of the table above, one per customer.

Data Table
• The “What” needs to be labeled as either Categorical or Quantitative. If it is Quantitative, the units should be included.

Who
• The Who of the data tells us the individual cases about which (or whom) we have collected data.
– Individuals who answer a survey are called respondents.
– People on whom we experiment are called subjects or participants.
– Animals, plants, and inanimate subjects are called experimental units.

Who (cont.)
• Sometimes people just refer to data values as observations and are not clear about the Who.
– But we need to know the Who of the data so we can learn what the data say.

What
• The “what” are variables: characteristics recorded about each individual.
• Each variable should have a name that identifies what has been measured.
• To understand variables, you must Think about what you want to know.

What
• Some variables have units that tell how each value has been measured and tell the scale of the measurement.

What
• A categorical variable names categories and answers questions about how cases fall into those categories.
– Categorical examples: sex, race, ethnicity
• A quantitative variable is a measured variable (with units) that answers questions about the quantity of what is being measured.
– Quantitative examples: income ($), height (inches), weight (pounds)

What
• Example: An online store that sells sports memorabilia keeps track of the addresses of all of its customers. One of the variables it keeps track of is customers’ zip codes.
• Question: Are zip codes categorical or quantitative?

What
• Although zip codes are numbers and can be put in order, there are no natural units for the variable zip code.
• Variables like “zip code” are considered categorical data.

Why
• Why we are collecting the data is important in understanding what we think about the variables and how we treat them.

Where, When, and How
• We need the Who, What, and Why to analyze data. But the more we know, the more we understand.
• When and Where give us some nice information about the context.
– Example: Values recorded at a large public university may mean something different than similar values recorded at a small private college.
(The “where” makes a difference)

Where, When, and How
• How the data are collected can make the difference between insight and nonsense.
– Example: results from voluntary Internet surveys are often useless.
• The first step of any data analysis should be to examine the W’s; this is a key part of the Think step of any analysis.
• And make sure that you know the Why, Who, and What before you proceed with your analysis.

What can go wrong?
• Don’t label a variable as categorical or quantitative without thinking about the question you want it to answer.
• Just because your variable’s values are numbers, don’t assume that it’s quantitative.
• Always be skeptical: don’t take data for granted.

Summary
• Data are information in a context.
– The W’s help with context.
– We must know the Who (cases), What (variables), and Why to be able to say anything useful about the data.
• Who: Who (or what) the data are about?
• What: The variables (and their units). Both quantitative and categorical measurements about the “who”. (What about them?)
• Why: Why were the data collected?
• When and Where: When and where were the data recorded?
• How: How were the data collected? Was it a legitimate source?

Summary (cont.)
• We treat variables (the “what”) as categorical or quantitative.
– Categorical variables identify a category for each case.
– Quantitative variables record measurements or amounts of something and must have units.
– Some variables can be treated as categorical or quantitative depending on what we want to learn from them.

Example
For the following description of data, identify the W’s, name the variables, classify each variable as categorical or quantitative, and for any quantitative variable, identify the units in which it was measured (or state that they were not provided).

According to an article in Fortune (Dec 28, 1992), 401(k) plans permit employees to shift part of their before-tax salaries into investments such as mutual funds.
Employers typically match 50% of the employees’ contribution up to about 6% of salary. One company, concerned with what it believed was a low employee participation rate in its 401(k) plan, sampled 30 other companies with similar plans and asked for their 401(k) participation rates.

Example
Who: 30 other companies
Although the interest here is in 401(k) plans, the 30 companies are the subjects being studied.

Example
What: Be sure to label each variable as quantitative or categorical.
• Employer’s contribution: quantitative (%)
• Contribution limit: quantitative (%)
• Participation rate: ???
Participation rate could be quantitative or categorical depending on how it is measured. If it is measured by number of employees or percent of employees, then it would be quantitative. If it is categorized as high, medium, or low, then participation rate would be categorical.

Example
Why: to improve participation rates
Because the company believes it has low employee participation in its 401(k) plan, we can infer that the study is being done in an attempt to improve participation rates.

Example
Where: not specified
Although the “where” is not specified here, we can assume that it is in the area that Fortune magazine serves. If the magazine were local, then we could assume that the “where” is local. If the magazine were distributed only in the U.S., then we could assume that the “where” is in the U.S. only.

Example
When: before Dec. 28, 1992
This passage seems to imply that the study of the 30 companies was part of the magazine article. If this is so, then the data must have been collected before the article was published. If the company had done the study as a result of the magazine article, then we would have said that the “when” was after Dec. 28, 1992.

Example
How: sampled other companies
How the sample was conducted is not specified. It could have been by asking each company directly, sending it a survey to fill out, or simply checking company records. We know only that the company sampled the other companies; nothing more specific about the “how” is given.
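
The earlier slides on the data table and on categorical versus quantitative variables can be made concrete with a short sketch. The Python below is an illustration only and is not part of the original slides: the names `Variable`, `VARIABLES`, `CUSTOMERS`, and `mean_of` are hypothetical, the units are assumptions (the slides do not give any for the customer table), and Area Code is treated as categorical by the same reasoning the slides apply to zip codes.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class Variable:
    """One 'What': a named variable, its type, and (if quantitative) its units."""
    name: str
    kind: str                    # "categorical" or "quantitative"
    units: Optional[str] = None  # required when kind is "quantitative"

    def __post_init__(self):
        if self.kind not in ("categorical", "quantitative"):
            raise ValueError(f"unknown kind: {self.kind!r}")
        if self.kind == "quantitative" and not self.units:
            raise ValueError(f"{self.name}: quantitative variables need units")


# The 'What' of the customer table. Units are assumed, not taken from the slides.
VARIABLES = {v.name: v for v in [
    Variable("Ship To", "categorical"),
    Variable("Price", "quantitative", units="dollars (assumed)"),
    Variable("Area Code", "categorical"),  # numbers used as labels, no natural units
    Variable("Previous Purchases", "quantitative", units="purchases"),
    Variable("Gift", "categorical"),
    Variable("Catalog ID", "categorical"),
    Variable("Artist", "categorical"),
]}

# The 'Who': one row per customer, keyed by the customer's name.
CUSTOMERS = {
    "Katherine H.": {"Ship To": "Ohio", "Price": 10.99, "Area Code": 440,
                     "Previous Purchases": 17, "Gift": "N",
                     "Catalog ID": "B0015Y6", "Artist": "Kansas"},
    "Samuel P.": {"Ship To": "Illinois", "Price": 16.99, "Area Code": 312,
                  "Previous Purchases": 3, "Gift": "Y",
                  "Catalog ID": "B002BK9", "Artist": "Boston"},
    "Chris G.": {"Ship To": "New York", "Price": 15.98, "Area Code": 413,
                 "Previous Purchases": 0, "Gift": "N",
                 "Catalog ID": "B0068ZVQ", "Artist": "Chicago"},
    "Monique D.": {"Ship To": "Canada", "Price": 11.98, "Area Code": 902,
                   "Previous Purchases": 10, "Gift": "Y",
                   "Catalog ID": "B001OAA", "Artist": "Garbage"},
}


def mean_of(customers, variable):
    """Average a variable over all cases, refusing categorical variables."""
    if variable.kind != "quantitative":
        raise TypeError(f"{variable.name} is categorical; a mean is not meaningful")
    return mean([row[variable.name] for row in customers.values()])
```

With this sketch, `mean_of(CUSTOMERS, VARIABLES["Previous Purchases"])` returns 7.5, while `mean_of(CUSTOMERS, VARIABLES["Area Code"])` raises a `TypeError` even though area codes are stored as numbers. Recording the type and units next to each variable name is one way to avoid the mistake the slides warn about, assuming a variable is quantitative just because its values are numbers.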