EXPERIMENTAL METHODS, LECTURES 1-8
Eric Bettinger, Stanford University

Where do we see randomized experiments in public policy? 2
Oportunidades/Progresa in Mexico; reducing crime; juvenile delinquency; early childhood development and Head Start; education (general IES focus, vouchers, career themes, class size); electricity pricing; automated medical response; housing assistance and housing vouchers; job training; unemployment insurance; welfare-to-work; health services; college financial aid; college work; mental health treatments.

How Have Randomized Experiments Affected Public Policy? 3
Class size and Tennessee STAR (and California class size); preschool and Perry Preschool, Head Start; reading curricula and Success for All; conditional cash transfers and Progresa, Bolsa Escola, and others; educational vouchers.

Experiments as the "Gold Standard" 4
Wide recognition as the "gold standard": the World Bank; the US government and "No Child Left Behind" legislation. Why the "gold standard"? Identifying causal impacts; eliminating selection bias; simplicity, since experiments are much easier to explain and there is no need for complex statistical modeling.

Are there limitations? 5
YES! Some potential limitations: general equilibrium, interpretation, mechanisms, fragility of design, and more. We will return to this later in the course.

Key goals this week: 6
1. Understand causal modeling
2. Understand the relationship of randomization to causal modeling
3. Gain the statistical tools to analyze randomized experiments
4. Become aware of the underlying assumptions, strengths, and weaknesses of experimental approaches
5. Become acquainted with key players in the development and implementation of randomized experiments
6. Gain the statistical tools to design randomized experiments
7. Understand other key issues in the design and implementation of random experiments

What is causality? 7
There are many potential definitions. Does the treatment have to be manipulable? Morgan and Winship: C causes E if (1) both C and E occur, and (2) if C had not occurred and everything else had been equal, E would not have occurred. Fisher: outcomes differ by treatment. Neyman: average outcomes of treatments A and B differ. Can there be effects of being female? How do we think about mechanisms and counterfactuals? Can differences in outcomes come from different dosages? What is the counterfactual in the Neyman and Fisher definitions?

Defining Counterfactuals 8
Assume that each individual in the population has a potential outcome for each potential causal state. In the two-state model, we can define Yi = Y1i if exposed to treatment and Yi = Y0i if not exposed to treatment. We could generalize this to Y2i, Y3i, ..., Yki for k different treatments. These are potential outcomes, and we assume they exist. Each individual has their own outcomes; outcomes are heterogeneous. We never observe more than one outcome for each person.

Counterfactual? 9
KEY POINT: we have to make assumptions about whether the average observed for some aggregated group is an accurate representation of the counterfactual. In essence, the strength of an identification strategy is our ability to accurately measure the counterfactual outcome.

Rubin (1986): "What 'Ifs' Have Causal Answers?" 10
Not all questions have causal answers. Would I be here in Russia if I had studied art history instead of economics? Would I be here in Russia if I were a woman rather than a man? SUTVA, the "stable unit treatment value assumption," is the key assumption for determining which questions have causal answers.

Baseline framework for understanding SUTVA 11
What are the ingredients of an experiment?
N units indexed by i = 1, ..., N; T treatments indexed by t = 1, ..., T; Y is the outcome, indexed by t and i. Two key conditions: SUTVA says that Y will be the same for any i no matter how t is assigned, and SUTVA also says that Y will be the same no matter what treatments the other units receive.

Condition 1. Same Y no matter how t is assigned. 12
Consider the following statement: if John Doe had been born a female, his life would have been different. How do you make John Doe a female? A hypothetical Y-to-X chromosome treatment at conception; massive doses of hormones in utero; an at-birth sex change. Does the form of the change make a difference?

Consider an educational example 13
Are there treatments where different versions of t matter? Consider class size. If we assign a small class to a student (t), is there more than one t in existence? Are all small classes equal? No: teachers differ, peers differ, and so on. What about educational vouchers? Vouchers are "coupons" which allow students to attend the school of their choice. Are outcomes the same no matter the voucher? We have to assume that all versions of t are the same under SUTVA.

Consider the Tennessee class size experiment (Ding and Lehrer 2011) 14
Under SUTVA, should the class size effect vary by the proportion of the school involved in the experiment?

Consider the Tennessee class size experiment 15
The problem is that it does. Why?

Condition 2. Yti cannot depend on whether i' received t0 or t1. 16
The slide's example lists assignment vectors (d1, d2, d3) for three units with unit i's outcomes: under (d1=1, d2=0, d3=0), Y1i = 3; under (d1=1, d2=1, d3=0), Y1i = 2; and Y0i = 1 whenever i is untreated. Unit i's treated outcome falls as more of the other units are treated. Are there treatments that are diluted as more people get them?

Condition 2. Outcomes do not depend on others' assignments. 17
What other examples are there? General equilibrium effects. Do outcomes change for the control group? What is the effect of taking this course in Russia versus at Stanford? Is the treatment the same? Key point: my treatment assignment cannot affect your outcome.

Why do we care about SUTVA? 18
We can identify which questions can really be answered: a causal question is answerable when SUTVA is satisfied. SUTVA can help us sort through possible mechanisms. It makes the interpretation clearer.

Consider religious schools 19
Would we expect the religious school effect to change if more students attended religious school? SUTVA does not hold if the effectiveness of religious schools depends on the number or composition of students, since the distribution of students changes. What then are the implications of the religious school literature for voucher debates?

What should we think about SUTVA? 20
This is a useful framework for isolating the precise question for which we can give a causal answer. It likely holds in small samples. With large-scale movements, we could have general equilibrium effects.

Simple Example 21
Recall that experiments include treatments, units of observation, and outcomes. Consider the following scenario: if females at firm f had been male, their starting salaries would have averaged 20% higher. How do we divide this into treatments, units, and outcomes? It is likely not possible. A useful motto: there is no causation without manipulation.

The Perry Preschool Experiment 22
Motivation for large investments in preschool: hundreds of billions of dollars in investments. Outline of the experiment: 1962; 123 black preschoolers (58 in treatment); low-income and low IQ scores (70-85, one standard deviation below the mean); randomly assigned at ages 3-4 to a high-quality preschool program or no program.
Data were collected annually from ages 3 through 11 and at ages 14, 15, 19, 27, and 40.

Overview of Perry Results 23
Source: Schweinhart (2007)

More on Perry 24
Source: Schweinhart (2007)

More Perry Results 25
Source: Schweinhart (2007)

More from Perry on Crime 26
Source: Schweinhart (2007)

Which Causal Questions Does Perry Help Resolve? 27
Is attending preschool better than not attending for low-income, low-ability, African-American students in 1962? Is attending preschool better than not attending for low-income, low-ability students? Is attending preschool better than not attending for low-income students? Is attending preschool better than not attending for all students? SUTVA may not be fully satisfied, but it helps us identify which questions our studies may resolve.

Is Perry valid today? 28
Source: Schweinhart (2007)

Let's formulate causality mathematically 29
Yi = Y1i if Di=1; Yi = Y0i if Di=0. Equivalently, Yi = Y0i + (Y1i - Y0i)Di. The most common approach is to compare students attending a "treatment" to those not attending a treatment:
E[Yi | Di=1] - E[Yi | Di=0] = E[Y1i | Di=1] - E[Y0i | Di=1] + (E[Y0i | Di=1] - E[Y0i | Di=0])
The 1st term is the average effect on the treated; the 2nd term is selection bias.

Thinking through the expectation 30
Example 1. Treatment = religious private schooling. The 1st term is the average effect on the treated: the average private school outcome of individuals in religious private schooling minus the average public school outcome those same individuals would have had. The latter term is the unobserved counterfactual; we have to find a way to estimate it. The 2nd term is selection bias: the difference between the public school outcome that private school attendees would have had and the public school outcome of public school attendees. It informs us as to how different private school attendees are from public school attendees.

Thinking through the expectation 31
Example 2. Treatment = attending preschool. The 1st term is the average effect on the treated: the average outcome of individuals attending preschool minus the average outcome that those same students would have had if they had not gone to preschool. The 2nd term is selection bias: the difference between the outcome that preschool attendees would have had without preschool and the outcome of students not attending preschool.

Use of the formulation 32
It helps us figure out what we are estimating. Later we will augment this model with the "probability of compliance." It identifies the key means we need to estimate to obtain causal estimates. It helps us analyze our approach regardless of our methodology.

What happens if we have randomization? 33
E[Y0i | Di=0] = E[Y0i | Di=1], so the selection bias is gone:
E[Yi | Di=1] - E[Yi | Di=0] = E[Y1i | Di=1] - E[Y0i | Di=1] + 0 = E[Y1i - Y0i | Di=1] = E[Y1i - Y0i]
Simple differences in averages reveal the treatment effect.
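To make the decomposition concrete, here is a minimal simulation sketch (all numbers hypothetical): units self-select on Y0, which biases the naive comparison, while random assignment recovers the average treatment effect.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
y0 = rng.normal(0, 1, N)          # potential outcome without treatment
y1 = y0 + 0.2                     # potential outcome with treatment; true effect = 0.2

# Self-selection: units with higher Y0 are more likely to take the treatment
d_sel = (y0 + rng.normal(0, 1, N) > 0.5).astype(int)
y_obs = np.where(d_sel == 1, y1, y0)
naive = y_obs[d_sel == 1].mean() - y_obs[d_sel == 0].mean()

# Randomization: D is independent of (Y0, Y1), so E[Y0|D=1] = E[Y0|D=0]
d_rand = rng.integers(0, 2, N)
y_obs_r = np.where(d_rand == 1, y1, y0)
experimental = y_obs_r[d_rand == 1].mean() - y_obs_r[d_rand == 0].mean()

print(naive)         # true effect (0.2) plus a positive selection bias term
print(experimental)  # close to 0.2: the selection bias term is gone
```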
Hypothetical Example 34
A recent study found in the overall population that students who attended schools of type X had higher test scores than other students. What do we expect the selection bias to look like? This was a school requiring motivated parents. If we randomized in the whole population, in which direction should we expect the treatment effect to go?
E[Yi | Di=1] - E[Yi | Di=0] = E[Y1i | Di=1] - E[Y0i | Di=1] + (E[Y0i | Di=1] - E[Y0i | Di=0])

Abdulkadiroglu et al (2011) 35
Examines charter schools in Boston. Charter schools are public schools which operate more like private schools. Obama has pushed for more charter schools, so it is an important research question to know whether they work. Charter schools are often oversubscribed; when oversubscribed, they use lotteries to determine who gets in. Other charters are not oversubscribed. If a school is oversubscribed, what would you expect? It probably does a pretty good job.

Abdulkadiroglu et al (2011) 36
Abdulkadiroglu et al (2011) 37

Synthesizing 38
The authors found a selection of schools of type X which ran lotteries to determine who entered the schools. In these schools, the researchers exploited the randomization. The difference between winners and losers was even higher than the previous comparison. Explain? Basically, if the observed effect is greater than the treatment effect, there are two possibilities: selection bias is negative at the no-wait-list schools (how likely is that?), or the treatment effect is much lower. The stronger the selection effects at these other schools, the lower the treatment effect.

Regression Formulation 39
Suppose we want to use regression analysis; how do we estimate treatment effects?
Yi = a + b*Treatmenti + ei
E[Yi | Di=1] = a + b + E(e | Di=1)
E[Yi | Di=0] = a + E(e | Di=0)
E[Yi | Di=1] - E[Yi | Di=0] = b + E(e | Di=1) - E(e | Di=0)
The last term, E(e | Di=1) - E(e | Di=0), is selection bias; it just reflects the correlation between e and D.

Regression Formulation 40
Yi = a + b*Treatmenti + ei. Consider the OLS estimator of b; we will call it bhat. E[bhat] = b + E[T'e/T'T]. Selection bias would suggest that e and T are correlated; think of it as an omitted variable problem. If T is randomly assigned, then E[T'e] = 0: no omitted variable can be correlated with T.

Multivariate Regression 41
What is the consequence of including more X in a regression where there is randomization? Generally, the standard errors are lower: X is correlated with Y, and once controlled for, it reduces the residual variance of Y. The estimated treatment effect should be unbiased if there is no selection bias. We will return to this.
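A short sketch of the regression formulation above (simulated data, hypothetical effect of 0.5): with a randomly assigned dummy, the OLS coefficient on the treatment reproduces the simple difference in means.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000
t = rng.integers(0, 2, N)                  # randomly assigned treatment dummy
y = 1.0 + 0.5 * t + rng.normal(0, 1, N)    # Yi = a + b*Treatment_i + e_i

# OLS of y on a constant and the treatment dummy
X = np.column_stack([np.ones(N), t])
a_hat, b_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# bhat equals the difference in group means when the regressor is a dummy
print(b_hat, y[t == 1].mean() - y[t == 0].mean())
```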
Recap up to now 42
Experiments are our key interest. Causal questions need treatments that can be manipulated, whether by the researcher or by "nature." SUTVA is essential for identifying which of the questions we are asking are causal. We need units, treatments, and outcomes. Condition 1: the treatment is constant. Condition 2: your treatment does not affect mine. Typical comparisons mix treatment effects with selection bias. Randomization removes selection bias, but there are other ways to get rid of it.

Experiments vs. Observational Studies 43
Cox and Reid: "The word experiment is used in a quite precise sense to mean an investigation where the system under study is under the control of the investigator. This means that the individuals or material investigated, the nature of the treatments or manipulations under study and the measurement procedures used are all selected, in their important features at least, by the investigator. By contrast, in an observational study some of these features, and in particular the allocation of individuals to treatment groups, is outside the investigator's control."

Definition of Experiment 44
Notice that Cox and Reid never mention "randomization" in their definition. Are there experiments which are not randomized? Not all studies lend themselves to randomization: the effect of parental death on child outcomes; the effect of divorce on children; the effect of no schooling. We will distinguish between "field experiments" and "natural experiments." Natural experiments are places where randomization occurs "by nature": variation between treatment and control takes place for arbitrary and seemingly random reasons. Next week we are going to focus there. In field experiments, the researcher controls the randomization.

Choosing the treatment 45
In designing random experiments, we start with the treatment. Two schools of thought: (1) start with theory, policy, or conceptual frameworks; or (2) identify opportunities for randomization and then identify questions which might be of interest to academics or policymakers. Angrist and Kremer illustrate these divergent approaches. There is a difference between program evaluation and research. The importance of partners: oftentimes the relationship is more important than the question; the Duflo article discusses partners before the basics of randomization.

The importance of partners 46
Who are our partners? Governments (job training, the Negative Income Tax, Progresa, generating new pilots); NGOs (which can focus on smaller populations than governments); private groups. Partners have priorities and existing beliefs; these create limitations and opportunities in our design. Partners have resources: few of us have the money or time to implement new treatments or to gather data on outcomes, and partners are key to this. Partners can also help us find populations to study.

Some examples to work through 47
Hypothesis: students' writing and self-confidence are linked; self-affirming writing experiences reinforce student confidence and subsequently student outcomes. What's the ideal experiment? Is it plausible?
Hypothesis: remediation is actually counterproductive in college. Experiment? Plausibility?
Hypothesis: deworming improves health and educational outcomes. Experiment? Plausibility?
Hypothesis: paying students for test scores improves academic performance. Experiment? Plausibility?
Hypothesis: positive (negative) verbal reinforcement improves (destroys) academic progress. Experiment? Plausibility?

Worms paper results 48
Administer medication to cure students of worms. Worms come from dirty water; they are parasites that cause health problems, and school attendance declines. The researchers randomized which students received the treatment within a community and randomized which communities received treatments. Key results? No difference within communities, but communities that received the treatment were better off.

Unit of randomization 49
Once we have the treatment, we can determine the optimal units. Why do we care about "units"? Statistical power: typically the unit of analysis is the same as the unit of randomization. Statistical modeling. Contamination. The causal question often changes with the units. "External validity": can we generalize to other populations? Mode of delivery. Political considerations.

Consider some simple cases 50
Which level (individuals, classes, schools) would be best to test the following treatments? Adoption of a new curriculum; creating a conditional cash transfer program based on schooling; preschool subsidies; giving incentives to teachers; giving incentives to students studying in an online course.

How many? 51
Once we know the level of randomization, we need to think about the ramifications in terms of the number of units.
Power calculations are our key to determining the size of the population necessary for answering our causal questions. We can derive power calculations from our statistical models.

Case 1. Individual randomization 52
Our simple model works well (simple t-stats do also):
Yi = a + b*Treatmenti + ei
Recall that the OLS estimator of b can be given by
\hat{b} = \frac{\sum_{i=1}^{N} (y_i - \bar{y})(t_i - p)}{\sum_{i=1}^{N} (t_i - p)^2}

Case 1. (cont.) 53
More importantly, we need to worry about the variance of the estimator:
V(\hat{b}) = E\left[(X'X)^{-1} X' \epsilon \epsilon' X (X'X)^{-1}\right], \qquad X'X = \begin{pmatrix} N & pN \\ pN & pN \end{pmatrix}

Case 1. (cont.) 54
(X'X)^{-1} = \frac{1}{p(1-p)N} \begin{pmatrix} p & -p \\ -p & 1 \end{pmatrix}, \qquad V(\hat{b}) = \frac{\sigma^2}{p(1-p)N}

Case 1. (cont.) 55
So our t-statistic will be
t = \frac{\hat{b} - b_0}{\sqrt{\hat{\sigma}^2 / (p(1-p)N)}}
and hence the standard error bands are
\hat{b} \pm 1.96 \sqrt{\hat{\sigma}^2 / (p(1-p)N)}

Case 1. Standard Error Bands 56
Notice that the standard error bands depend on the proportion getting the treatment, the total sample size, and the estimated variance of the error term. The standard error bands shrink as p approaches 1/2, as N increases, and as the variance of the error becomes smaller. Including additional explanatory variables reduces the error term.

Case 1. Power Calculation 57
Suppose that I think a plausible effect size is 2 percent, and I want to make sure that 2 percent is statistically significant. Given the variance of the error, a coefficient of 2 percent, and the proportion treated, I can vary N to determine the sample size needed to make 2 percent significant for a given statistical power. This exercise is called a power calculation.

Optimal Design 58
Doing these power calculations by hand is tedious. The Smith Richardson Foundation funded the creation of software to do these power calculations; it is freely available at http://www.wtgrantfdn.org/resources/research-tools, and the manual is very easy to read. All of the ingredients are simple except the standard deviation of the error term. However, we can standardize our effect sizes by dividing effects by the standard deviation of the error term; Optimal Design reports things in standard deviation units. Optimal Design also assumes that 50 percent are treated, as this maximizes power.

Optimal Design (Screenshot 1) 59
Optimal Design (Screenshot for Case 1) 60
Optimal Design (Screenshot for Case 1) 61

Case 1. Choose the 1st Option 62

Case 1. Customizing 63
We next customize the calculations for our case. For example, in Perry Preschool there were 123 observations, and effects on high school graduation were about 0.4 standard deviations. Let's check the statistical power for 3 alternatives.

Case 1. Customizing 64
I'm also going to adjust the X axis so that we can get a better picture.

Power calculations 65

So what kind of power did they have? 66
Generally, we want 70-80 percent power at a minimum. This was a risky study; I would have counseled them to get closer to 150 observations.

Other ways to view it 67
Let's focus on the lower bound of the confidence interval. We will focus on the Minimum Detectable Effect Size (MDES). We choose the MDES versus sample size option.

MDES 68
We no longer assign a treatment effect; now we assume a level of power (80 percent is the default). Effect size is on the Y axis; sample size is on the X axis.
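A quick check of the Case 1 power formula against the Perry-style numbers above; this is a sketch using the normal approximation with a standardized outcome (sigma = 1), ignoring covariates and any finite-sample t correction.

```python
from scipy.stats import norm

def power(delta, N, p=0.5, sigma=1.0, alpha=0.05):
    # Power of a two-sided test given V(bhat) = sigma^2 / (p(1-p)N)
    se = sigma / (p * (1 - p) * N) ** 0.5
    return norm.cdf(delta / se - norm.ppf(1 - alpha / 2))

print(power(0.4, 123))  # about 0.6 for Perry's 123 observations and a 0.4 SD effect
print(power(0.4, 150))  # about 0.7 with the 150 observations suggested above
```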
Case 2. Blocking (Multi-site Trials) 70
Oftentimes it might be easier to run lotteries at multiple sites. For example, we examined the charter school paper: lotteries were held at each school, so in some sense the charter paper is a combination of about 15 different lotteries or experiments. The authors could have used just one site, but they use all of the sites. Blocking can also be used to decrease the variance and the probability of a bad randomization. Blocks could be groups of homogeneous students.

Statistical Models with Blocks 71
There is a split between fields. Sociology: Yij = a + (b+uj)*Treatmentij + eij. This model allows for heterogeneity in treatment effects; it is good for trying to understand heterogeneous treatment effects (e.g., what percentage of schools have positive treatment effects). Economics: Yij = a + b*Treatmentij + uj + eij. These models typically have fixed effects for each block or site; they are good for trying to decrease the standard error.

Let's move on to more complex designs 72
Bettinger (2011) examines student incentives in Coshocton, Ohio, United States. A donor gave students up to $100 per year based on their test scores: students took five tests each year and could earn $20 per test. Coshocton had about 150 students in each of grades 3-6 each year. There were four schools in the city with two classrooms in each school. Teachers did not want to randomize within classes. Principals did not want to randomize by classroom, since many teachers "team" teach and combine classes for some subjects. The donor did not want to bring in additional schools outside of Coshocton. How do you maximize power given the political constraints?

Coshocton Randomization 73
Politically, the only option was to randomize at the grade-school level. There were four grades in each of the four schools. The district was willing to run the project for as many years as would be necessary (within reason). How would you do the power calculation? We are somewhat limited because we are going to assume that individuals are independent over time; we won't be able to control for this covariance.

75
Internal validity; unit of randomization; design variation; statistical model; verifying randomization; limits to randomization.

Key goals this week: 76
1. Understand causal modeling (yesterday)
2. Understand the relationship of randomization to causal modeling (today)
3. Gain the statistical tools to analyze randomized experiments (today)
4. Become aware of the underlying assumptions, strengths, and weaknesses of experimental approaches
5. Become acquainted with key players in the development and implementation of randomized experiments
6. Gain the statistical tools to design randomized experiments (today and tomorrow)
7. Understand other key issues in the design and implementation of random experiments

Defining causality 77
Compare outcomes of different treatments. Treatments have to be manipulable, or able to be changed. The key is to identify the counterfactual outcome: what would have happened without the treatment? The counterfactual is never observed; we need assumptions to justify why our comparison group represents the counterfactual.

Causal Questions and SUTVA 78
What are the ingredients of an experiment? N units indexed by i = 1, ..., N; T treatments indexed by t = 1, ..., T; Y is the outcome, indexed by t and i. Two key conditions: SUTVA says that Y will be the same for any i no matter how t is assigned, and SUTVA also says that Y will be the same no matter what treatments the other units receive.

SUTVA and the Question 79
SUTVA is not about the experiment; it is about the question. When we decide on a comparison, SUTVA helps us understand what causal question we are really able to answer. Remember how narrow the question for Perry Preschool could be.

What should we think about SUTVA? 80
This is a useful framework for isolating the precise question for which we can give a causal answer.
It likely holds in small samples. With large-scale movements, we could have general equilibrium effects.

Let's formulate causality mathematically 81
Yi = Y1i if Di=1; Yi = Y0i if Di=0. Equivalently, Yi = Y0i + (Y1i - Y0i)Di. The most common approach is to compare students attending a "treatment" to those not attending a treatment:
E[Yi | Di=1] - E[Yi | Di=0] = E[Y1i | Di=1] - E[Y0i | Di=1] + (E[Y0i | Di=1] - E[Y0i | Di=0])
The 1st term is the average effect on the treated; the 2nd term is selection bias.

Thinking through the expectation 82
Example 1. Treatment = religious private schooling. The 1st term is the average effect on the treated: the average private school outcome of individuals in religious private schooling minus the average public school outcome those same individuals would have had. The latter term is the unobserved counterfactual; we have to find a way to estimate it. The 2nd term is selection bias: the difference between the public school outcome that private school attendees would have had and the public school outcome of public school attendees. It informs us as to how different private school attendees are from public school attendees.

Let's formulate causality mathematically 83
Yi = Y1i if Di=1 (TREATED GROUP); Yi = Y0i if Di=0 (CONTROL GROUP). Yi = Y0i + (Y1i - Y0i)Di. In words: (average for group 1) - (average for group 2) = the average improvement for group 1 because of the treatment + the average difference between group 1 and group 2 without the experiment.

Thinking through the expectation 84
Example 2. Treatment = attending preschool. The 1st term is the average effect on the treated: the average outcome of individuals attending preschool minus the average outcome that those same students would have had if they had not gone to preschool. The 2nd term is selection bias: the difference between the outcome that preschool attendees would have had without preschool and the outcome of students not attending preschool.

Use of the formulation 85
It helps us figure out what we are estimating. Later we will augment this model with the "probability of compliance." It identifies the key means we need to estimate to obtain causal estimates. It helps us analyze our approach regardless of our methodology.

What happens if we have randomization? 86
E[Y0i | Di=0] = E[Y0i | Di=1], so the selection bias is gone:
E[Yi | Di=1] - E[Yi | Di=0] = E[Y1i | Di=1] - E[Y0i | Di=1] + 0 = E[Y1i - Y0i | Di=1] = E[Y1i - Y0i]
Simple differences in averages reveal the treatment effect.

Hypothetical Example 87
A recent study found in the overall population that students who attended schools of type X had higher test scores than other students. What do we expect the selection bias to look like? This was a school requiring motivated parents. If we randomized in the whole population, in which direction should we expect the treatment effect to go?
E[Yi | Di=1] - E[Yi | Di=0] = E[Y1i | Di=1] - E[Y0i | Di=1] + (E[Y0i | Di=1] - E[Y0i | Di=0])

Abdulkadiroglu et al (2011) 88
Examines charter schools in Boston. Charter schools are public schools which operate more like private schools. Obama has pushed for more charter schools, so it is an important research question to know whether they work. Charter schools are often oversubscribed; when oversubscribed, they use lotteries to determine who gets in. Other charters are not oversubscribed.
If a school is oversubscribed, what would you expect? It probably does a pretty good job.

Abdulkadiroglu et al (2011) 89
Abdulkadiroglu et al (2011) 90

Synthesizing 91
The authors found a selection of schools of type X which ran lotteries to determine who entered the schools. In these schools, the researchers exploited the randomization. The difference between winners and losers was even higher than the previous comparison. Explain? Basically, if the observed effect is greater than the treatment effect, there are two possibilities: selection bias is negative at the no-wait-list schools (how likely is that?), or the treatment effect is much lower. The stronger the selection effects at these other schools, the lower the treatment effect.

Regression Formulation 92
Suppose we want to use regression analysis; how do we estimate treatment effects?
Yi = a + b*Treatmenti + ei
E[Yi | Di=1] = a + b + E(e | Di=1)
E[Yi | Di=0] = a + E(e | Di=0)
E[Yi | Di=1] - E[Yi | Di=0] = b + E(e | Di=1) - E(e | Di=0)
The last term, E(e | Di=1) - E(e | Di=0), is selection bias; it just reflects the correlation between e and D.

Regression Formulation 93
Yi = a + b*Treatmenti + ei. Consider the OLS estimator of b; we will call it bhat. E[bhat] = b + E[T'e/T'T]. Selection bias would suggest that e and T are correlated; think of it as an omitted variable problem. If T is randomly assigned, then E[T'e] = 0: no omitted variable can be correlated with T.

Multivariate Regression 94
What is the consequence of including more X in a regression where there is randomization? Generally, the standard errors are lower: X is correlated with Y, and once controlled for, it reduces the residual variance of Y. The estimated treatment effect should be unbiased if there is no selection bias. We will return to this.
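A sketch of the covariate point (simulated data, hypothetical covariate x): controlling for a pre-treatment variable leaves the estimate centered on the same effect but shrinks its standard error by soaking up residual variance.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5_000
x = rng.normal(0, 1, N)                 # baseline covariate, e.g., a pre-test
t = rng.integers(0, 2, N)               # randomized, hence independent of x
y = 0.3 * t + 1.5 * x + rng.normal(0, 1, N)

def ols(X, y):
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])
    return b, np.sqrt(np.diag(s2 * XtX_inv))   # coefficients and classical SEs

b1, se1 = ols(np.column_stack([np.ones(N), t]), y)
b2, se2 = ols(np.column_stack([np.ones(N), t, x]), y)
print(b1[1], se1[1])  # unbiased but noisier
print(b2[1], se2[1])  # same effect, smaller standard error
```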
Recap up to now 95
Experiments are our key interest. Causal questions need treatments that can be manipulated, whether by the researcher or by "nature." SUTVA is essential for identifying which of the questions we are asking are causal. We need units, treatments, and outcomes. Condition 1: the treatment is constant. Condition 2: your treatment does not affect mine. Typical comparisons mix treatment effects with selection bias. Randomization removes selection bias, but there are other ways to get rid of it.

Experiments vs. Observational Studies 96
Cox and Reid: "The word experiment is used in a quite precise sense to mean an investigation where the system under study is under the control of the investigator. This means that the individuals or material investigated, the nature of the treatments or manipulations under study and the measurement procedures used are all selected, in their important features at least, by the investigator. By contrast, in an observational study some of these features, and in particular the allocation of individuals to treatment groups, is outside the investigator's control."

Definition of Experiment 97
Notice that Cox and Reid never mention "randomization" in their definition. Are there experiments which are not randomized? Not all studies lend themselves to randomization: the effect of parental death on child outcomes; the effect of divorce on children; the effect of no schooling. We will distinguish between "field experiments" and "natural experiments." Natural experiments are places where randomization occurs "by nature": variation between treatment and control takes place for arbitrary and seemingly random reasons. Next week we are going to focus there. In field experiments, the researcher controls the randomization.

Choosing the treatment 98
In designing random experiments, we start with the treatment. Two schools of thought: (1) start with theory, policy, or conceptual frameworks; or (2) identify opportunities for randomization and then identify questions which might be of interest to academics or policymakers. Angrist and Kremer illustrate these divergent approaches. There is a difference between program evaluation and research. The importance of partners: oftentimes the relationship is more important than the question; the Duflo article discusses partners before the basics of randomization.

The importance of partners 99
Who are our partners? Governments (job training, the Negative Income Tax, Progresa, generating new pilots); NGOs (which can focus on smaller populations than governments); private groups. Partners have priorities and existing beliefs; these create limitations and opportunities in our design. Partners have resources: few of us have the money or time to implement new treatments or to gather data on outcomes, and partners are key to this. Partners can also help us find populations to study.

Some examples to work through 100
Hypothesis: students' writing and self-confidence are linked; self-affirming writing experiences reinforce student confidence and subsequently student outcomes. What's the ideal experiment? Is it plausible?
Hypothesis: remediation is actually counterproductive in college. Experiment? Plausibility?
Hypothesis: deworming improves health and educational outcomes. Experiment? Plausibility?
Hypothesis: paying students for test scores improves academic performance. Experiment? Plausibility?
Hypothesis: positive (negative) verbal reinforcement improves (destroys) academic progress. Experiment? Plausibility?

Worms paper results 101
Administer medication to cure students of worms. Worms come from dirty water; they are parasites that cause health problems, and school attendance declines. Worms are highly infectious: if a peer has worms, it is easy to get them yourself. The researchers randomized which students received the treatment within a community and randomized which communities received treatments. Why would you want this type of design? Key results? No difference within communities, but communities that received the treatment were better off.

Unit of randomization 102
Once we have the treatment, we can determine the optimal units. Why do we care about "units"? Statistical power: typically the unit of analysis is the same as the unit of randomization. Statistical modeling. Contamination. The causal question often changes with the units. "External validity": can we generalize to other populations? Mode of delivery. Political considerations.

Consider some simple cases 103
Which level (individuals, classes, schools) would be best to test the following treatments? Adoption of a new curriculum; creating a conditional cash transfer program based on schooling; preschool subsidies; giving incentives to teachers; giving incentives to students studying in an online course.

How many? 104
Once we know the level of randomization, we need to think about the ramifications in terms of the number of units. Power calculations are our key to determining the size of the population necessary for answering our causal questions. We can derive power calculations from our statistical models.

Case 1. Individual randomization 105
Our simple model works well (simple t-stats do also):
Yi = a + b*Treatmenti + ei
Recall that the OLS estimator of b can be given by
\hat{b} = \frac{\sum_{i=1}^{N} (y_i - \bar{y})(t_i - p)}{\sum_{i=1}^{N} (t_i - p)^2}

Case 1. (cont.) 106
More importantly, we need to worry about the variance of the estimator:
V(\hat{b}) = E\left[(X'X)^{-1} X' \epsilon \epsilon' X (X'X)^{-1}\right], \qquad X'X = \begin{pmatrix} N & pN \\ pN & pN \end{pmatrix}
Case 1. (cont.) 107
(X'X)^{-1} = \frac{1}{p(1-p)N} \begin{pmatrix} p & -p \\ -p & 1 \end{pmatrix}, \qquad V(\hat{b}) = \frac{\sigma^2}{p(1-p)N}

Case 1. (cont.) 108
So our t-statistic will be
t = \frac{\hat{b} - b_0}{\sqrt{\hat{\sigma}^2 / (p(1-p)N)}}
and hence the standard error bands are
\hat{b} \pm 1.96 \sqrt{\hat{\sigma}^2 / (p(1-p)N)}

Case 1. Standard Error Bands 109
Notice that the standard error bands depend on the proportion getting the treatment, the total sample size, and the estimated variance of the error term. The standard error bands shrink as p approaches 1/2, as N increases, and as the variance of the error becomes smaller. Including additional explanatory variables reduces the error term.

Case 1. Power Calculation 110
Suppose that I think a plausible effect size is 2 percent, and I want to make sure that 2 percent is statistically significant. Given the variance of the error, a coefficient of 2 percent, and the proportion treated, I can vary N to determine the sample size needed to make 2 percent significant for a given statistical power. This exercise is called a power calculation.

Optimal Design 111
Doing these power calculations by hand is tedious. The Smith Richardson Foundation funded the creation of software to do these power calculations; it is freely available at http://www.wtgrantfdn.org/resources/research-tools, and the manual is very easy to read. All of the ingredients are simple except the standard deviation of the error term. However, we can standardize our effect sizes by dividing effects by the standard deviation of the error term; Optimal Design reports things in standard deviation units. Optimal Design also assumes that 50 percent are treated, as this maximizes power.

Optimal Design (Screenshot 1) 112
Optimal Design (Screenshot for Case 1) 113
Optimal Design (Screenshot for Case 1) 114

Case 1. Choose the 1st Option 115

Case 1. Customizing 116
We next customize the calculations for our case. For example, in Perry Preschool there were 123 observations, and effects on high school graduation were about 0.4 standard deviations. Let's check the statistical power for 3 alternatives.

Case 1. Customizing 117
I'm also going to adjust the X axis so that we can get a better picture.

Power calculations 118

So what kind of power did they have? 119
Generally, we want 70-80 percent power at a minimum. This was a risky study; I would have counseled them to get closer to 150 observations.

Other ways to view it 120
Let's focus on the lower bound of the confidence interval. We will focus on the Minimum Detectable Effect Size (MDES). We choose the MDES versus sample size option.

MDES 121
We no longer assign a treatment effect; now we assume a level of power (80 percent is the default). Effect size is on the Y axis; sample size is on the X axis.

Case 2. Blocking (Multi-site Trials) 123
Oftentimes it might be easier to run lotteries at multiple sites. For example, we examined the charter school paper: lotteries were held at each school, so in some sense the charter paper is a combination of about 15 different lotteries or experiments. The authors could have used just one site, but they use all of the sites. Blocking can also be used to decrease the variance and the probability of a bad randomization. Blocks could be groups of homogeneous students.

Statistical Models with Blocks 124
There is a split between fields. Sociology: Yij = a + (b+uj)*Treatmentij + eij. This model allows for heterogeneity in treatment effects; it is good for trying to understand heterogeneous treatment effects (e.g., what percentage of schools have positive treatment effects). Economics: Yij = a + b*Treatmentij + uj + eij. These models typically have fixed effects for each block or site; they are good for trying to decrease the standard error.
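A minimal sketch of the economics-style block model: regress the outcome on the treatment dummy plus a dummy for every block (fixed effects). The number of sites and the effect sizes here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
J, n = 15, 200                             # 15 lottery sites with 200 students each
site = np.repeat(np.arange(J), n)
u = rng.normal(0, 0.5, J)[site]            # site effects: some sites simply do better
t = rng.integers(0, 2, J * n)              # randomization within each site
y = 0.1 * t + u + rng.normal(0, 1, J * n)

# Treatment dummy plus one dummy per block (the dummies absorb the intercept)
dummies = (site[:, None] == np.arange(J)[None, :]).astype(float)
X = np.column_stack([t, dummies])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b[0])  # treatment effect with between-site variation absorbed
```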
Let's move on to more complex designs 125
Bettinger (2011) examines student incentives in Coshocton, Ohio, United States. A donor gave students up to $100 per year based on their test scores: students took five tests each year and could earn $20 per test. Coshocton had about 150 students in each of grades 3-6 each year. There were four schools in the city with two classrooms in each school. Teachers did not want to randomize within classes. Principals did not want to randomize by classroom, since many teachers "team" teach and combine classes for some subjects. The donor did not want to bring in additional schools outside of Coshocton. How do you maximize power given the political constraints?

Coshocton Randomization 126
Politically, the only option was to randomize at the grade-school level. There were four grades in each of the four schools. The district was willing to run the project for as many years as would be necessary (within reason). How would you do the power calculation? We are somewhat limited because we are going to assume that individuals are independent over time; we won't be able to control for this covariance.

Some examples to work through 127
Hypothesis: students' writing and self-confidence are linked; self-affirming writing experiences reinforce student confidence and subsequently student outcomes. What's the ideal experiment? Is it plausible?
Hypothesis: remediation is actually counterproductive in college. Experiment? Plausibility?
Hypothesis: deworming improves health and educational outcomes. Experiment? Plausibility?
Hypothesis: paying students for test scores improves academic performance. Experiment? Plausibility?
Hypothesis: positive (negative) verbal reinforcement improves (destroys) academic progress. Experiment? Plausibility?

Worms paper results 128
Administer medication to cure students of worms. Worms come from dirty water; they are parasites that cause health problems, and school attendance declines. Worms are highly infectious: if a peer has worms, it is easy to get them yourself. The researchers randomized which students received the treatment within a community and randomized which communities received treatments. Why would you want this type of design? Key results? No difference within communities, but communities that received the treatment were better off.

Unit of randomization 129
Once we have the treatment, we can determine the optimal units. Why do we care about "units"? Statistical power: typically the unit of analysis is the same as the unit of randomization. Statistical modeling. Contamination. The causal question often changes with the units. "External validity": can we generalize to other populations? Mode of delivery. Political considerations.

Consider some simple cases 130
Which level (individuals, classes, schools) would be best to test the following treatments? Adoption of a new curriculum; creating a conditional cash transfer program based on schooling; preschool subsidies; giving incentives to teachers; giving incentives to students studying in an online course.

How many? 131
Once we know the level of randomization, we need to think about the ramifications in terms of the number of units. Power calculations are our key to determining the size of the population necessary for answering our causal questions. We can derive power calculations from our statistical models. Let's do a brief review of power on the whiteboard.
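Since the deck defers the review to the whiteboard, here is a sketch of how it usually goes, using the Case 1 variance above (two-sided test at level alpha, normal approximation):

\hat{b} \sim N\!\left(b, \frac{\sigma^2}{p(1-p)N}\right), \qquad \text{power} = \Phi\!\left(\frac{b}{\sqrt{\sigma^2/(p(1-p)N)}} - z_{1-\alpha/2}\right)

Setting power equal to a target 1-\beta and solving for the effect gives the minimum detectable effect size:

\text{MDES} = (z_{1-\alpha/2} + z_{1-\beta})\sqrt{\frac{\sigma^2}{p(1-p)N}} \approx 2.8\,\frac{\sigma}{\sqrt{p(1-p)N}} \quad \text{at } \alpha = .05,\ 1-\beta = .8.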
Case 1. Individual randomization 132
Our simple model works well (simple t-stats do also):
Yi = a + b*Treatmenti + ei
Recall that the OLS estimator of b can be given by
\hat{b} = \frac{\sum_{i=1}^{N} (y_i - \bar{y})(t_i - p)}{\sum_{i=1}^{N} (t_i - p)^2}

Case 1. (cont.) 133
More importantly, we need to worry about the variance of the estimator:
V(\hat{b}) = E\left[(X'X)^{-1} X' \epsilon \epsilon' X (X'X)^{-1}\right], \qquad X'X = \begin{pmatrix} N & pN \\ pN & pN \end{pmatrix}

Case 1. (cont.) 134
(X'X)^{-1} = \frac{1}{p(1-p)N} \begin{pmatrix} p & -p \\ -p & 1 \end{pmatrix}, \qquad V(\hat{b}) = \frac{\sigma^2}{p(1-p)N}

Case 1. (cont.) 135
So our t-statistic will be
t = \frac{\hat{b} - b_0}{\sqrt{\hat{\sigma}^2 / (p(1-p)N)}}
and hence the standard error bands are
\hat{b} \pm 1.96 \sqrt{\hat{\sigma}^2 / (p(1-p)N)}

Case 1. Standard Error Bands 136
Notice that the standard error bands depend on the proportion getting the treatment, the total sample size, and the estimated variance of the error term. The standard error bands shrink as p approaches 1/2, as N increases, and as the variance of the error becomes smaller. Including additional explanatory variables reduces the error term.

Case 1. Power Calculation 137
Suppose that I think a plausible effect size is 2 percent, and I want to make sure that 2 percent is statistically significant. Given the variance of the error, a coefficient of 2 percent, and the proportion treated, I can vary N to determine the sample size needed to make 2 percent significant for a given statistical power. This exercise is called a power calculation.

Optimal Design 138
Doing these power calculations by hand is tedious. The Smith Richardson Foundation funded the creation of software to do these power calculations; it is freely available at http://www.wtgrantfdn.org/resources/research-tools, and the manual is very easy to read. All of the ingredients are simple except the standard deviation of the error term. However, we can standardize our effect sizes by dividing effects by the standard deviation of the error term; Optimal Design reports things in standard deviation units. Optimal Design also assumes that 50 percent are treated, as this maximizes power.

Optimal Design (Screenshot 1) 139
Optimal Design (Screenshot for Case 1) 140
Optimal Design (Screenshot for Case 1) 141

Case 1. Choose the 1st Option 142

Case 1. Customizing 143
We next customize the calculations for our case. For example, in Perry Preschool there were 123 observations, and effects on high school graduation were about 0.4 standard deviations. Let's check the statistical power for 3 alternatives.

Case 1. Customizing 144
I'm also going to adjust the X axis so that we can get a better picture.

Power calculations 145

So what kind of power did they have? 146
Generally, we want 70-80 percent power at a minimum. This was a risky study; I would have counseled them to get closer to 150 observations.

Other ways to view it 147
Let's focus on the lower bound of the confidence interval. We will focus on the Minimum Detectable Effect Size (MDES). We choose the MDES versus sample size option.

MDES 148
We no longer assign a treatment effect; now we assume a level of power (80 percent is the default). Effect size is on the Y axis; sample size is on the X axis.

Case 2. Blocking (Multi-site Trials) 150
Oftentimes it might be easier to run lotteries at multiple sites. For example, we examined the charter school paper: lotteries were held at each school, so in some sense the charter paper is a combination of about 15 different lotteries or experiments. The authors could have used just one site, but they use all of the sites.
Blocking can also be used to decrease the variance and the probability of a bad randomization. Blocks could be groups of homogeneous students.

Statistical Models with Blocks 151
There is a split between fields. Sociology: Yij = a + (b+uj)*Treatmentij + eij. This model allows for heterogeneity in treatment effects; it is good for trying to understand heterogeneous treatment effects (e.g., what percentage of schools have positive treatment effects). You could also include a site-specific error. Economics: Yij = a + b*Treatmentij + uj + eij. These models typically have fixed effects for each block or site; they are good for trying to decrease the standard error.

Variances with Blocks 152
For models with random effects (the sociology model):
V(\hat{b}) = \frac{\sigma_\delta^2}{J} + \frac{\sigma^2}{p(1-p)nJ}
There are J blocks and n people in each block; \sigma_\delta^2 is the variance of the treatment effect between sites. The equation is the same as before except for the new first term.

Implication of the new formula 153
Suppose that there is no variability between sites in the treatment effect: then the first term disappears. Otherwise, the first term is going to be much larger than the second term, because the denominator of the second term is much larger. The variance of the treatment effect across sites is hard to measure; you have to make an assumption as to what it might be.

What about the economics model? 154
The economics model basically assumes zero variability across sites. The focus is on how much of the variance the blocking variable can explain. You can estimate this by running a regression of the outcome on dummy variables for the blocks.
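A small sketch of this variance, plugging in the Inside Track-style numbers used below (17 blocks of roughly 700 people, treatment-effect SD across sites of 0.1); normalizing the residual variance to 1 is an assumption for illustration.

```python
import math

def blocked_se(J, n, sigma_delta2, sigma2=1.0, p=0.5):
    # V(bhat) = sigma_delta^2 / J + sigma^2 / (p(1-p)nJ)
    return math.sqrt(sigma_delta2 / J + sigma2 / (p * (1 - p) * n * J))

se = blocked_se(J=17, n=700, sigma_delta2=0.1 ** 2)
print(se, 2.8 * se)  # standard error and approximate MDES at 80% power
```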
An Example: Bettinger and Baker (2011) 155
A study of "coaching" in college. Students arrive on campus and receive coaches. Coaches call students and help them figure out how to do homework, study, plan, and prepare for course events. The company providing coaches, Inside Track, wanted to prove itself; it conducted random evaluations each year at each school it helped.

Bettinger and Baker (2011) cont. 156
In 2004 and 2007, Inside Track conducted 17 lotteries. Each lottery is a "block": randomization did not happen across sites but within sites. We used the economics model Yij = a + b*Treatmentij + uj + eij. Outcomes were staying in college for 1 year or 2 years.

Baseline Results with Covariates
Covariates include age, gender, ACT, HS GPA, SAT, living on campus, home residence, merit scholarship, Pell Grant, math remediation, English remediation, and controls for having missing values in any of the covariates.

                                6-month    12-month   18-month   24-month
                                retention  retention  retention  retention
Control mean                    .580       .435       .286       .242
1. Baseline
   Treatment effect (std err)   .052***    .053***    .043***    .034**
                                (.008)     (.008)     (.009)     (.008)
   Lottery controls             Yes        Yes        Yes        Yes
   N                            13,552     13,553     11,149     11,153
2. Baseline w/ covariates
   Treatment effect (std err)   .051***    .052***    .042***    .033**
                                (.008)     (.008)     (.009)     (.008)
   Lottery controls             Yes        Yes        Yes        Yes
   N                            13,552     13,553     11,149     11,153

Variation in Treatment Effect 158
The economists assume there is no difference in treatment effects across sites. Is that a good assumption?

Effects by Lottery?

Lottery        12-month     24-month
               persistence  persistence
1 (n=1583)     .078***      .020
2 (n=1629)     .057**       .039**
3 (n=1546)     .043*        .050**
4 (n=1552)     .050**       .050**
5 (n=1588)     .040         .029
6 (n=552)      .072*        --
7 (n=586)      .018         .066**
8 (n=593)      .023         -.017
9 (n=974)      .058**       --
10 (n=326)     .052         --
11 (n=479)     .091**       --
12 (n=400)     -.055        --
13 (n=300)     .162***      .054
14 (n=600)     .054         -.010
15 (n=221)     .136**       --
16 (n=176)     .062         .047
17 (n=450)     .000         .058

The standard deviation across sites in the 12-month effect is about 0.098 if we were to convert the effects to standard deviations rather than effect sizes.

Go Back to Design 160
Suppose you were starting a new project and blocking seemed appropriate. Let's go through the software.

Blocking in Optimal Design 161

Options in Optimal Design 162
In order, the options are the following: alpha = type I error (for our confidence interval); P = power; sigma2 = effect size variability (the variance of the treatment effect across sites); n = number of units in each site; B = proportion of variance explained by blocks; R2 = proportion of variance explained by other variables.

Graphs in Optimal Design 163

Design the Inside Track Evaluation 164
Suppose you were told that you had 17 blocks with about 700 people in each block. What is your minimum detectable effect size? We can assume a treatment variability across sites of 0.1. We can also run a regression of the outcome on dummy variables for each block: how much variation do the blocks pick up?

MDES for Inside Track 166

Summary on Case 2 167
Blocking can really help our standard errors, especially when there are big differences across blocks. In the sociologists' preferred model especially, variability across blocks enters our standard errors and causes them to increase. Standard errors start to be more responsive to the number of blocks than to the overall sample size.

Let's move on to more complex designs 168
Bettinger (2011) examines student incentives in Coshocton, Ohio, United States. A donor gave students up to $100 per year based on their test scores: students took five tests each year and could earn $20 per test. Coshocton had about 150 students in each of grades 3-6 each year. There were four schools in the city with two classrooms in each school. Teachers did not want to randomize within classes. Principals did not want to randomize by classroom, since many teachers "team" teach and combine classes for some subjects. The donor did not want to bring in additional schools outside of Coshocton. How do you maximize power given the political constraints?

Coshocton Randomization 169
Politically, the only option was to randomize at the grade-school level. There were four grades in each of the four schools. The district was willing to run the project for as many years as would be necessary (within reason). How would you do the power calculation? We are somewhat limited because we are going to assume that individuals are independent over time; we won't be able to control for this covariance.

Case 3. Group or Cluster Randomization 170
Some treatments are implausible at the individual level; we need to randomize at the group level (the deworming paper; the Coshocton paper). Clusters or groups all receive the treatment at the same time.

Variances with Clusters 171
Under cluster randomization:
V(\hat{b}) = \frac{\tau^2}{p(1-p)J} + \frac{\sigma^2}{p(1-p)nJ}
There are J clusters and n people in each cluster; \tau^2 is the variance between clusters.

Case 2. Blocking (Multi-site Trials) 172
Oftentimes it might be easier to run lotteries at multiple sites. For example, we examined the charter school paper.
Lotteries were held at each school, so in some sense the charter paper is a combination of about 15 different lotteries or experiments. The authors could have used just one site, but they use all of the sites. Blocking can also be used to decrease the variance and the probability of a bad randomization. Blocks could be groups of homogeneous students.

An Example: Bettinger and Baker (2011) 173
A study of "coaching" in college. Students arrive on campus and receive coaches. Coaches call students and help them figure out how to do homework, study, plan, and prepare for course events. The company providing coaches, Inside Track, wanted to prove itself; it conducted random evaluations each year at each school it helped.

Bettinger and Baker (2011) cont. 174
In 2004 and 2007, Inside Track conducted 17 lotteries. Each lottery is a "block": randomization did not happen across sites but within sites. We used the economics model Yij = a + b*Treatmentij + uj + eij. Outcomes were staying in college for 1 year or 2 years.

Baseline Results with Covariates
Covariates include age, gender, ACT, HS GPA, SAT, living on campus, home residence, merit scholarship, Pell Grant, math remediation, English remediation, and controls for having missing values in any of the covariates.

                                6-month    12-month   18-month   24-month
                                retention  retention  retention  retention
Control mean                    .580       .435       .286       .242
1. Baseline
   Treatment effect (std err)   .052***    .053***    .043***    .034**
                                (.008)     (.008)     (.009)     (.008)
   Lottery controls             Yes        Yes        Yes        Yes
   N                            13,552     13,553     11,149     11,153
2. Baseline w/ covariates
   Treatment effect (std err)   .051***    .052***    .042***    .033**
                                (.008)     (.008)     (.009)     (.008)
   Lottery controls             Yes        Yes        Yes        Yes
   N                            13,552     13,553     11,149     11,153

Variation in Treatment Effect 176
The economists assume there is no difference in treatment effects across sites. Is that a good assumption?

Effects by Lottery?

Lottery        12-month     24-month
               persistence  persistence
1 (n=1583)     .078***      .020
2 (n=1629)     .057**       .039**
3 (n=1546)     .043*        .050**
4 (n=1552)     .050**       .050**
5 (n=1588)     .040         .029
6 (n=552)      .072*        --
7 (n=586)      .018         .066**
8 (n=593)      .023         -.017
9 (n=974)      .058**       --
10 (n=326)     .052         --
11 (n=479)     .091**       --
12 (n=400)     -.055        --
13 (n=300)     .162***      .054
14 (n=600)     .054         -.010
15 (n=221)     .136**       --
16 (n=176)     .062         .047
17 (n=450)     .000         .058

The standard deviation across sites in the 12-month effect is about 0.098 if we were to convert the effects to standard deviations rather than effect sizes.

Go Back to Design 178
Suppose you were starting a new project and blocking seemed appropriate. Let's go through the software.

Blocking in Optimal Design 179

Options in Optimal Design 180
In order, the options are the following: alpha = type I error (for our confidence interval); P = power; sigma2 = effect size variability (the variance of the treatment effect across sites); n = number of units in each site; B = proportion of variance explained by blocks; R2 = proportion of variance explained by other variables.

Graphs in Optimal Design 181

Design the Inside Track Evaluation 182
Suppose you were told that you had 17 blocks with about 700 people in each block. What is your minimum detectable effect size? We can assume a treatment variability across sites of 0.1. We can also run a regression of the outcome on dummy variables for each block: how much variation do the blocks pick up?

MDES for Inside Track 184

Summary on Case 2 185
Blocking can really help our standard errors, especially when there are big differences across blocks.
In the sociologists' preferred model especially, variability across blocks enters our standard errors and causes them to increase. Standard errors start to be more responsive to the number of blocks than to the overall sample size.

Let's move on to more complex designs 186
Bettinger (2011) examines student incentives in Coshocton, Ohio, United States. A donor gave students up to $100 per year based on their test scores: students took five tests each year and could earn $20 per test. Coshocton had about 150 students in each of grades 3-6 each year. There were four schools in the city with two classrooms in each school. Teachers did not want to randomize within classes. Principals did not want to randomize by classroom, since many teachers "team" teach and combine classes for some subjects. The donor did not want to bring in additional schools outside of Coshocton. How do you maximize power given the political constraints?

Coshocton Randomization 187
Politically, the only option was to randomize at the grade-school level. There were four grades in each of the four schools. The district was willing to run the project for as many years as would be necessary (within reason). How would you do the power calculation? We are somewhat limited because we are going to assume that individuals are independent over time; we won't be able to control for this covariance.

Case 3. Group or Cluster Randomization 188
Some treatments are implausible at the individual level; we need to randomize at the group level (the deworming paper; the Coshocton paper). Clusters or groups all receive the treatment at the same time.

Variances with Clusters 189
Under cluster randomization:
V(\hat{b}) = \frac{\tau^2}{p(1-p)J} + \frac{\sigma^2}{p(1-p)nJ}
There are J clusters and n people in each cluster; \tau^2 is the variance between clusters. As in the prior case, J is more important than the overall sample size (nJ).

Let's take it to Optimal Design 190

What are the options? 191
In order, the options are the following: alpha = type I error (for our confidence interval); n = size of each cluster; P = power; rho = the intraclass correlation (the share of variance that is between clusters); R12 = proportion of variance explained by cluster-level variables.

Power in Coshocton? 192
In Coshocton, there were about 40 students in each cluster. There were 16 clusters each year for three years, for a total of 48 clusters. At the end, we convinced them to do it one more year to get 64 clusters. We assume either a 0.05 or a 0.10 correlation between sites.

Why do we care so much about power? 194
Consider a very hypothetical scenario. A "partner" wants to run a new experiment. They want to randomize across 80 students; only 40 will receive the treatment. In prior studies, the same treatment has generated only small effects, about 0.10 standard deviations. Is it worth your time to run this small experiment?

Is it worth it? 196
NO!!!! With 80 students, the minimum detectable effect size is 0.64. Given the prior studies, we could never find an effect. Suppose that the treatment generates effects of 0.50: we would not have the power to actually measure them. In the end, we would have to conclude that our estimated effect is no different from zero even though it could be very large.

One more exercise... Costs 197
Suppose that the mayor of a large city comes and asks you to conduct a randomized experiment in paying students for test scores. There are 300 schools with 4 classrooms in each school and 30 students per classroom. The expected cost per student treated will be 600 руб. Suppose we use everyone: how much power do we have?

How should we design it? 198
Randomize at the student, the class, or the school? For simplicity, let's randomize each school: 150 schools in the treatment and 150 in the control. We will have 120 children in each school, so 18,000 students in the treatment and 18,000 in the control. The total cost = 10,800,000 руб.

How should we design it? 200
Randomize at the student, the class, or the school? For simplicity, let's randomize each school: 150 schools in the treatment and 150 in the control, with 120 children in each school (18,000 treated and 18,000 control students). The total cost = 10,800,000 руб. The MDES = 0.078 standard deviations.

Just for fun... 201
How much more power would we have if we were able to randomize at the student level rather than the school? MDES = 0.03 (less than half the prior MDES).

Let's go back to our new exercise... 202
The mayor is really frustrated when you bring the cost estimate. It is too expensive! He wants the costs halved. The total cost = 10,800,000 руб.; the MDES = 0.078 standard deviations. How can we save money? (There are many possible answers.) For simplicity, let's pick three options which would reduce the costs by half. Any predictions? (1) Randomize at the school level but only choose 2 classrooms at each school. (2) Treat schools as blocks and randomly choose 1 treatment and 1 control classroom in each school. (3) Only use 150 schools.

Randomize at the school level with only two classes per school 203
Schools as blocks with randomization of 2 classrooms within the block 204
Only 150 Schools 205

Summing it up 206
Treating everyone: total cost = 10,800,000 руб.; MDES = 0.078 standard deviations.
Treating only 2 classrooms per school: total cost = 5,400,000 руб.; MDES = 0.083 standard deviations.
Schools as blocks, randomizing across two classes: total cost = 5,400,000 руб.; MDES = 0.055 standard deviations. But it possibly changes the nature of the treatment.
Only using half of the schools: total cost = 5,400,000 руб.; MDES = 0.111 standard deviations.
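A sketch reproducing these MDES figures with the cluster variance formula above. The total variance is normalized to 1, so rho is the between-school share; rho = 0.05 and the usual 2.8 multiplier for 80% power are assumptions that roughly match the slide's numbers.

```python
import math

def cluster_mdes(J, n, rho, p=0.5, m=2.8):
    # V(bhat) = rho/(p(1-p)J) + (1-rho)/(p(1-p)nJ), with total variance 1
    var = rho / (p * (1 - p) * J) + (1 - rho) / (p * (1 - p) * n * J)
    return m * math.sqrt(var)

print(cluster_mdes(J=300, n=120, rho=0.05))  # ~0.078: all 300 schools, everyone included
print(cluster_mdes(J=300, n=60,  rho=0.05))  # ~0.083: only 2 classrooms per school
print(cluster_mdes(J=150, n=120, rho=0.05))  # ~0.111: only half of the schools
```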
How should we design it? 198-200
Randomize at the student, the class, or the school? For simplicity, let's randomize whole schools: 150 schools in the treatment and 150 in the control, with 120 children in each school, so 18,000 students in the treatment and 18,000 in the control. The total cost = 10,800,000 руб. With this design, the MDES = 0.078 standard deviations.

Just for fun… 201
How much more power would we have if we could randomize at the student level rather than the school level? The MDES = 0.03, less than half of the prior MDES.

Let's go back to our new exercise… 202
The mayor is really frustrated when you bring the cost estimate (10,800,000 руб. for an MDES of 0.078 standard deviations). It is too expensive! He wants the costs cut in half. How can we save money? There are many possible answers; for simplicity, let's pick three options which would each reduce the costs by half. Any predictions?
1. Randomize at the school level but treat only 2 classrooms at each treatment school.
2. Treat schools as blocks and randomly choose 1 treatment and 1 control classroom in each school.
3. Only use 150 schools.

Randomize at school level with only two classes per school 203
Schools as blocks with randomization of 2 classrooms within the block 204
Only 150 Schools 205

Summing it up 206
Treating everyone: total cost = 10,800,000 руб.; MDES = 0.078 standard deviations.
Treating only 2 classrooms per school: total cost = 5,400,000 руб.; MDES = 0.083 standard deviations.
Schools as blocks, randomizing across two classes within each school: total cost = 5,400,000 руб.; MDES = 0.055 standard deviations. But it possibly changes the nature of the treatment.
Only using half of the schools: total cost = 5,400,000 руб.; MDES = 0.111 standard deviations.

Where are we at in the course? 207
We chose a treatment: we used theory or policy and enlisted a partner. We determined where to randomize and how many to randomize. Now we randomize.

Random Assignment 208
Use a transparent form of randomization, and have witnesses. Random numbers are preferred; with children, lotteries or flipping coins work best. Make sure that partners understand the process, and keep control. The Michigan voucher experience: instead of using a randomized wait list, an administrator chose "randomly." She chose students who lived closest to her office; unfortunately, richer students lived near her office. Even worse, we discovered it after the experiment. Colombia vouchers: one city cheated and gave the vouchers to political friends.

Randomization in Coshocton 209

Verify the Randomization 210
If it works, then there should be few differences across control and treatment groups.
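A minimal sketch of such a balance check, with hypothetical variable names (female, age, and hsgpa as pre-treatment covariates, treat for assignment, lottery for the block): regress each covariate on assignment within lotteries and look for treat coefficients near zero.

    * Balance check sketch; variable names are hypothetical
    foreach v in female age hsgpa {
        regress `v' treat i.lottery   // treat coefficient should be near zero
    }

With many covariates and blocks, expect roughly one in ten differences to be significant at the 90 percent level by chance alone, a point the per-lottery table below illustrates.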
[Figures: pre-lottery math and reading test score distributions in Coshocton (regression corrected), treatment vs. control.]

Basic Descriptive Statistics & Balance
Characteristic                 Control mean   Difference for treatment (s.e.)   Sample size   Number of lotteries
Female                         .488           .009 (.009)                       12,525        15
Missing gender                 .675           -.001 (.001)                      13,555        17
Age                            30.5           .123 (.209)                       9,569         8
Missing age                    .294           .0001 (.0010)                     13,555        17
College entrance exam (SAT)    886.3          -11.01 (16.19)                    1,857         4
Missing SAT                    .827           .001 (.002)                       13,555        17
Living on campus               .581           -.005 (.017)                      1,955         4

[Figures: distributions of age, SAT scores, and high school GPA for the treatment and control groups.]

Significant Differences by Lottery?
Lottery (n)      # Characteristics   # Significant diff. (90%)
1  (n=1,583)     2                   0
2  (n=1,629)     2                   0
3  (n=1,546)     2                   0
4  (n=1,552)     2                   0
5  (n=1,588)     2                   0
6  (n=552)       3                   0
7  (n=586)       3                   0
8  (n=593)       3                   0
9  (n=974)       9                   0
10 (n=326)       6                   0
11 (n=479)       6                   0
12 (n=400)       2                   0
13 (n=300)       1                   0
14 (n=600)       1                   0
15 (n=221)       3                   1
16 (n=176)       14                  0
17 (n=450)       12                  0

After we verify randomization. . . 218
We need to collect outcome data and estimate effects. Outcome data can be very expensive.

To cost… Colombia Vouchers (American Economic Review 2002)
Random assignment of a high school voucher in Bogotá, Colombia. Hunting and gathering for lottery participants: 16 college students from Javeriana University, phone and house visits, in the middle of Colombia's civil war; the police at times limited our data collection. In the end, response rates were near 55 percent. The cost of contacting each observation was roughly $300 (a very rough estimate, since I was not involved in the financials). Data collection took 6 months, 4 trips to Colombia, and 1 full-time manager. Publication in a premiere economics journal.

… or not to cost: Colombia Vouchers Part 2 (AER 2006)
In 2000, we learned of administrative data on the high school exit exam in Colombia. One visit arranged the entire matching. Given the universal coverage, there was 100 percent response among students whose data were valid. The cost per observation was roughly $6 (again a very rough estimate, since I was not involved in the financials). Data collection took 2 months, 1 trip to Colombia, and 1 part-time manager. Publication in a premiere economics journal.

Lessons Learned
Administrative data hold the keys to reducing the cost of RCT evaluations. The quality of the initial data collection greatly influenced the probability that we could track students. The time to complete the evaluation was greatly reduced with administrative data. The trade-off is that we have less information on the underlying mechanisms.

Threats to Validity 222
Internal validity: does the experiment give unbiased answers for the population being randomized? External validity: does the experiment help us understand other populations? Threats to internal validity: contamination between treatment and control, switching treatments, attrition, and Hawthorne or John Henry effects.

Key goals this week: 223
(A recap of the seven goals listed at the start of these lectures.)
Treatment Fidelity 224
I skipped this in Lecture 4, only mentioning it in class. Once the treatment starts, we need to verify that it is actually happening. Are people participating in the new programs? Are the individuals administering the treatment actually doing it? Examples: professional development and teachers' subsequent teaching; do coaches call students? (90 percent get one call; 63 percent get more than one call); student incentives, lightning strikes, and Kenya.

Threats to Validity 225
(A repeat of slide 222; we now take the threats to internal validity one at a time.)

Contamination 226
The treatment affects the control group. Peer effects are a frequent culprit: teachers talk to their peers, and students talk to theirs. Other spillovers can exist. Creative research designs (e.g., the deworming study) can help control for contamination.

Hawthorne or John Henry Effects 227
Hawthorne effects: the treatment group works harder than it normally would as a result of the treatment. Multiple papers provide some evidence that the class size experiment in Tennessee may have had Hawthorne effects. John Henry effects: the control group works harder than it normally would as a result of the experiment.

Attrition 228
Attrition occurs when students leave the experiment at some point after the treatment. It can be a tricky problem. Always focus on whether attrition is related to being in the control or the treatment group, and consider how you are collecting outcome data and whether attrition favors one group over the other. Example: in the coaching study, attrition occurs if students leave the university. If coaching has positive effects, then treatment students are more likely to stay at the university. If we collect data using records from the university, we are therefore more likely to find treatment students; if our outcomes come from surveys, we might still obtain outcomes for students who left.

More on Attrition 229
It is useful to picture the treatment and control groups. [Diagram: a box representing the ability distribution, with high ability at the top and low ability at the bottom; imagine one such box for each of the control and treatment groups.]

Outcome Data and Attrition? 230
[Diagram: treatment and control boxes, each split into non-attriters and attriters.] If we rely on outcome data and cannot collect data on attriters, then our randomization is compromised. For example, if attriters come from the low end of the ability distribution and attrition is heavier in the control group, then we are left with too many low-ability people in the treatment group relative to the observed control group, and our results will be biased downward.
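To see the direction of this bias, here is a small, purely illustrative simulation sketch (all names hypothetical). Low-ability control students drop out and are lost to follow-up, while the treatment keeps its low-ability students enrolled; the comparison among those still observed then understates the true effect.

    * Attrition bias simulation; illustrative only
    clear
    set seed 12345
    set obs 10000
    gen treat = runiform() < .5
    gen ability = rnormal()
    gen y = .20*treat + ability + rnormal()   // true effect is 0.20
    * half of the low-ability control students leave; treated students stay
    gen observed = treat | (ability > -1) | (runiform() < .5)
    regress y treat                 // full sample: recovers about 0.20
    regress y treat if observed     // observed sample: biased well below 0.20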
Outcome Data on Attrition 231
Choose data collection strategies that are neutral to treatment and control. Examples: survey all students using information from the initial contact lists (why can't we update contact information from the school at the end of the experiment?); rely on administrative data from the government or another collection group. Always ask whether outcomes are related to attrition. In surveys, always check whether survey response is symmetric across the treatment and control groups.

Attrition in "Downstream" Outcomes 232
Some outcomes that we wish to measure are "downstream": they occur after another outcome we care about. For example, we may care whether students take college entrance exams. In many countries only a fraction of students take these exams. We might be interested in two outcomes: did the student take the exam, and how did they score on it? We never observe the score unless the student takes the exam.

Colombia Voucher Experiment 233
In the Colombia voucher experiment, the Colombian government provided educational vouchers (or coupons) so that low-income students could attend private school instead of government schools. We want to measure the impact on the likelihood that students take the college entrance exam and on their scores on the exam. Only about 35 percent of students take the college entrance exam.

Outcomes After 3 Years (Bogotá 95 sample)
Dependent variable                     Loser's mean (1)   No ctls (2)      Basic ctls (3)   Basic + 19 barrio ctls (4)
Started 6th in private                 .877 (.328)        .063** (.017)    .057** (.017)    .058** (.017)
Started 7th in private                 .673 (.470)        .174** (.025)    .168** (.025)    .171** (.024)
Currently in private school            .539 (.499)        .160** (.028)    .153** (.027)    .156** (.027)
Highest grade completed                7.5 (.960)         .164** (.053)    .130** (.051)    .120** (.051)
Currently in school                    .831 (.375)        .019 (.022)      .007 (.020)      .007 (.020)
Finished 7th grade (excludes Bog 97)   .847 (.360)        .040** (.020)    .031 (.019)      .029 (.019)
Finished 8th grade                     .632 (.483)        .112** (.027)    .100** (.027)    .094** (.027)
Repetitions of 6th grade               .194 (.454)        -.066** (.024)   -.059** (.024)   -.059** (.024)
Sample size: 562 losers; 1,147 in the regression columns.

Long Run Concerns
Effects observed after 3 years. Half of voucher winners were no longer using the voucher after 3 years. No difference in attendance rates. Ambiguity of the repetition result. Reliance on survey data and response bias: response rates were around 55 percent but symmetric across the treatment and control groups.

Why Administrative Records?
All college entrants and most high school graduates take the ICFES college entrance exams. We use ICFES registration status and test scores as outcomes. Advantages: a long-term outcome of major significance; no need to survey, and no attrition. Disadvantages: score outcomes may be hard to interpret, and differential record-keeping by win/loss status may generate a spurious treatment effect.

Vouchers and the Probability of ICFES Match
                      Exact ID match   ID + city match   ID + 7-letter name match
A. All applicants
  Dep. var. mean      .354             .339              .331
  Voucher winner      .072 (.016)      .069 (.016)       .072 (.016)
  N                   3,542            3,542             3,542
B. Female applicants
  Dep. var. mean      .387             .372              .361
  Voucher winner      .067 (.023)      .069 (.023)       .071 (.023)
  N                   1,789            1,789             1,789
C. Male applicants
  Dep. var. mean      .320             .304              .302
  Voucher winner      .079 (.022)      .071 (.022)       .074 (.022)
  N                   1,752            1,752             1,753

Evaluation Strategy
How do we think of effects on scores, since the voucher affected registration for the exam? We look at test scores among those who were tested. These are clearly contaminated by selection bias, since vouchers affect the probability of testing. The resulting bias probably masks positive effects; we consider a number of selection corrections.
[Figure 1a: language score densities by voucher status (no correction for selection bias), voucher winners vs. voucher losers. In the annotated version of the figure, the lowest-scoring voucher winners would never have taken the exam under the assumption that the voucher effect is monotonic; these individuals do not have "twins" in the control group for test scores.]

Solution for this type of attrition 241
Bounding exercises: make some reasonable assumptions about what test scores would look like, and apply the assumptions to generate "upper" and "lower" bounds. Alternatively, we can censor the results: censoring means assigning a base value to everyone scoring below a certain value (or not taking the exam).

Bounding in Colombia 242
In Colombia, we made two assumptions. First, we assumed the voucher did not hurt you, so test scores should either be unaffected or positively affected; for the lower bound, we can then assume that there is no selection into the test. Second, to get the upper bound, we assume that the voucher had a monotonic effect: everyone's test score was pushed upward. In this case, we can eliminate the bottom 5-6 percent of winners' test scores, since these students would not have taken the exam in the absence of the voucher.

Lower Bound
[Figure 1a: language scores by voucher status, no correction for selection bias.]

Upper Bound
[Figure 2a: language score distribution by voucher status for equal proportions of winners and losers.]

Censoring
[Figure 3a: Tobit coefficients by censoring percentile in the language score distribution.]

Summary on Attrition 246
Attrition can lead to biased results when it favors one group over the other, and it can happen in surveys in the form of nonresponse. Choose outcome collection techniques that do not favor treatment over control. On "downstream" outcomes, use bounding strategies.
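Before moving on, here is a sketch of the upper-bound trimming described above, with hypothetical variable names (tested, winner, score): trim the bottom slice of winners' observed scores until the tested proportions are equal across the two groups, then compare means.

    * Upper-bound trimming sketch; variable names are hypothetical
    summarize tested if winner
    local pw = r(mean)                      // testing rate among winners
    summarize tested if !winner
    local pl = r(mean)                      // testing rate among losers
    local trim = 100*(`pw' - `pl')/`pw'     // share of winners' scores to drop
    _pctile score if winner & tested, p(`trim')
    local cut = r(r1)
    regress score winner if tested & (!winner | score >= `cut')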
Switching Treatments 247
The treatment may be desirable to the control group, and the control group may try to get the treatment through private channels. Examples: coaching (the control group might seek out their own coaches); Colombia vouchers (schools might offer scholarships to control-group students but not to treatment students); paying kids for test scores (parents may give money to kids who were not chosen).

Switching Treatments or Compliance 248
Compliance means two key things: the treatment group participates in the treatment, and the control group does not. So far, we have assumed 100% compliance; in practice, compliance is rarely 100%.

Estimation when Compliance < 100% 249
Consider the following: a researcher conducts a randomized experiment in which students watch an additional online lecture. He knows from the login tracking that 1/3 of students did not watch, and he can identify which students did not watch. He announces that he will exclude the students who did not participate. What is wrong?

Examining Compliance 250
What if compliance is related to another variable which affects outcomes? [Diagram: the ability box again, high ability at the top and low at the bottom; as before, one for each of the control and treatment groups.]

What if Ability is Related to Compliance? Bias from Excluding Non-Compliers 251-252
[Diagram: the treatment group splits into compliers and non-compliers; the control group shows no such split.] Excluding non-compliers would exclude the individuals with the lowest ability. Ability is unobserved, and we do not know which control group students would have complied.

Other Compliance Problems 253
Non-compliers in the control group may have sought out the treatment. Again, ability is unobserved, and we do not know which treatment group students would have been non-compliers without the experiment.

Summing Up Compliance 254
[Diagram: both groups split into always takers, compliers, and never takers; always takers in the control and never takers in the treatment are the non-compliers.] Comparing only the students who complied could lead to big biases. We never observe who would have been non-compliers in the absence of the experiment.

Consider the treatment effect 255
We can think about what the estimated treatment effect looks like with randomization:

  E[Yi | Di=1] - E[Yi | Di=0]

We can break this up into our three groups: always takers (AT), compliers (C), and never takers (NT):

  E[Yi | Di=1] - E[Yi | Di=0]
    = P_AT * (E[Yi | Di=1, AT] - E[Yi | Di=0, AT])
    + P_C  * (E[Yi | Di=1, C]  - E[Yi | Di=0, C])
    + P_NT * (E[Yi | Di=1, NT] - E[Yi | Di=0, NT])

Consider the treatment effect 256
So the difference equals P_AT * (effect on always takers) + P_C * (effect on compliers) + P_NT * (effect on never takers). But what is the effect on always takers? On never takers? They should be zero. Why? Their treatment receipt does not change with their assignment. Hence, the average difference between winners and losers is really just P_C * (effect on compliers).

Intention to Treat 257
The average difference in a comparison of treatment and control groups with randomization is just P_C * (effect on compliers). We call this quantity the "intention to treat" (ITT), and it is likely the relevant policy variable. If we know P_C, then we can divide the intention to treat by that probability to get the "effect of the treatment on the treated."
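A minimal sketch of that last step (hypothetical variable names: y for the outcome, assigned for random assignment, participated for actual take-up). The ITT is the coefficient from regressing the outcome on assignment; dividing it by the compliance rate, or equivalently instrumenting participation with assignment, recovers the effect of the treatment on the treated.

    * ITT and the effect on the treated; variable names are hypothetical
    regress y assigned                    // coefficient = ITT
    regress participated assigned         // coefficient = P_C (the first stage)
    ivregress 2sls y (participated = assigned)   // = ITT / P_C (Wald estimator)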
Estimating the Probability of Compliance 262
This is often called the 1st stage. We want to run the following regression:

  Participating in the treatment = a + b*(Randomly chosen) + e

This is not as straightforward as it sounds. What does "participating in the treatment" entail? For vouchers, does it mean receiving a voucher, using a voucher, or attending private school? For coaching, does it mean talking to a coach, making an action plan, or completing goals? We need to monitor treatment fidelity if we want to measure participation in the treatment.

Ethical and Practical Considerations 263
What treatments can we do? In Russia, is there a process for getting research approved? Can we withhold a treatment if we know it works? The only exceptions are if the knowledge gained justifies the risk or if we have budget constraints; with budget constraints, we still have an obligation to serve as many as our budget permits. What happens if we observe the treatment "hurting" people? Some underlying considerations: respect for persons, beneficence, justice; preserve people from harm; do not take advantage of people.

More Ethical Considerations from Gueron (2000) 264
Social experiments should: not deny people access to services to which they are entitled; not reduce service levels; address important unanswered questions; include adequate procedures to inform program participants and to assure data confidentiality; be used only if there is no less intrusive way to answer the questions adequately; and have a high probability of producing results that will be used.

Practical Considerations 265
The Judith Gueron paper on your reading list is a good practical guide to performing randomized experiments: http://www.mdrc.org/publications/45/workpaper.html. She goes through a list of considerations which she deems key in working with people in the field. If you are ever going to run a randomized experiment, please go through her considerations before you start working with potential partners.

Estimating Causal Mechanisms 266
In the case of vouchers, suppose that the "real" treatment, the reason why treatment kids do better than control kids, is attendance at a private school. Hence, we are really interested in the regression:

  test scores = a + b*(attend private) + e

Ideally, we would like to run this regression on the population, but we cannot remove the selection bias. We can estimate the voucher effect, but the voucher effect is a mix of compliance and the effect of the treatment on the treated. Can we estimate the effect of the treatment on the treated directly? The answer is instrumental variables, or IV.

Haavelmo's problem
Consider a scatter plot of price and quantity. Is it a demand or a supply curve? Given the obvious downward slope, the temptation is to say a demand curve. But this is wrong. Why?

Demand systems
These points are equilibrium points: points where supply and demand intersect. There are many possible demand curves that are consistent with these points.

Simple supply and demand
So how do we solve the problem? We need more information. Once we have the information, we can use instrumental variables to help us find the supply curve. Consider a very simple supply and demand model:

  Demand: y^d = a + b*p + w
  Supply: y^s = d + f*p + v

Simple Model
In this simple model of supply and demand, equilibrium happens when demand (y^d) equals supply (y^s).
If a positive shock hits demand (a positive draw of the demand disturbance w), it causes the demand curve to shift out and to the right. Because y and p are jointly determined, the shock also shifts the price. In OLS estimation, we assumed that our shocks are uncorrelated with our X variables. The result is that our OLS estimates will be biased in this case: they will not reveal the true slope of the demand curve. This type of problem is often called the problem of "endogenous" regressors, and it is related to omitted variable bias.

How prevalent is the problem?
If you have a model where the x variable reasonably affects y and the y variable reasonably affects x, then you likely have this problem of endogeneity. For example, suppose we are comparing test scores to grades in a class. We had in mind to see if your grade predicts your test score; however, your test scores may also predict your grades, and so we have an endogeneity problem. A more subtle example is smoking and obesity: which causes which?

So how do you fix it?
This is where extra information helps. Consider the same equation system, changed so that X now enters the demand equation but not supply:

  Demand: y^d = a + b*p + c*X + w
  Supply: y^s = d + f*p + v

X might be something like "fads." Fads increase demand directly: people want more y because X increased. These fads can only affect supply through their impact on demand.

What happens as X changes?
As X changes, it causes the demand curve to shift outward. For each shift in demand, we discover a new equilibrium point, and these equilibrium points show us the shape of the supply curve. Hence, added information about demand helps us identify the slope of the supply curve. This added information is often called "an instrument."

Mathematical solution
We could have shown the same thing mathematically. Let supply equal demand and solve the model for p and y:

  a + b*p + c*X + w = d + f*p + v
  (b - f)*p = (d - a) - c*X + (v - w)
  p = (d - a)/(b - f) - [c/(b - f)]*X + (v - w)/(b - f)

Substituting this price into the supply equation:

  y = [d + f*(d - a)/(b - f)] - [c*f/(b - f)]*X + [f*(v - w)/(b - f) + v]

Two Key Equations
The two equations above show each endogenous variable (p and y) as a function of the one exogenous variable X. These two equations are called the "reduced form."

Two Key Equations
Our previous graphical solution showed that we could solve for the coefficient f (the slope of the supply curve) by using the information in X. To see this mathematically, look at the coefficients on X in the reduced form: -c*f/(b - f) in the y equation and -c/(b - f) in the p equation. Can we use this information to solve for the slope? Absolutely: just take the ratio of the coefficients, which equals f. You will notice, however, that there is no way for us to deduce the slope of the demand curve in this case.

A few definitions
The technique we used here is called "indirect least squares," and it is a close cousin of instrumental variables. The basic model of supply and demand is called the "structural model." The model where the endogenous variables are functions of the exogenous variables is called the "reduced form."
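A simulation sketch of this logic, with arbitrary parameter values (a = 10, b = -1, c = 1, d = 2, f = 1): OLS of quantity on price mixes the two curves, while using X as an instrument recovers the supply slope.

    * Demand and supply simulation sketch; parameter values are arbitrary
    clear
    set seed 42
    set obs 5000
    gen x = rnormal()     // demand shifter (the instrument)
    gen w = rnormal()     // demand disturbance
    gen v = rnormal()     // supply disturbance
    * equilibrium price: p = (a - d + c*x + w - v)/(f - b)
    gen p = (10 - 2 + x + w - v)/2
    gen y = 2 + p + v               // quantity read off the supply curve
    regress y p                     // biased: roughly 0.33 here, not 1
    ivregress 2sls y (p = x)        // recovers the supply slope of about 1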
Another example
What if the added information is in the supply equation instead? Now Z enters the supply curve but not demand:

  Demand: y^d = a + b*p + w
  Supply: y^s = d + f*p + g*Z + v

Z might be something like the weather, which affects supply without affecting individuals' demand for y. We solve the system the same way:

  a + b*p + w = d + f*p + g*Z + v
  (b - f)*p = (d - a) + g*Z + (v - w)
  p = (d - a)/(b - f) + [g/(b - f)]*Z + (v - w)/(b - f)

Substituting this price into the demand equation:

  y = [a + b*(d - a)/(b - f)] + [b*g/(b - f)]*Z + [b*(v - w)/(b - f) + w]

The new reduced form
These two equations show each endogenous variable (p and y) as a function of the one exogenous variable Z. Now the coefficients on Z can be used to figure out the slope of the demand curve: their ratio is b.

Identification
In the very first structural model (no X or Z), we had one endogenous variable on the right-hand side and no excluded exogenous variables in either equation. This situation is called "not identified." In the second model (X included in the demand curve), each equation of the structural model had one endogenous right-hand-side variable; the demand equation had no excluded exogenous variables, while the supply equation had one excluded exogenous variable (X). In that example, the demand equation is "not identified" while the supply equation is "just identified": the supply equation is identified because we can trace out its slope using shifts in the demand equation. In the third model (Z included in the supply curve), the supply equation is no longer identified but the demand equation is "just identified."

Identification - extensions
There are some extra cases worth considering. Suppose that there had been two X variables in the demand equation. If you work this through, you would have two possible solutions for the slope; in this case, the supply curve is "over identified": there are more excluded exogenous variables (the two X variables) than included right-hand-side endogenous variables (p). If X and Z were both in the structural system, then both equations would have been just identified. Identification matters because there are simple rules for knowing whether you can trace the slope: compare the number of EXCLUDED exogenous variables to the number of included, right-hand-side endogenous variables.

Instrumental Variables
Let's go back to the simple model with X in the demand equation. We know that X can predict price, and earlier we solved for the reduced form of the price equation. Instrumental variables is a similar technique to the one we used before, except it has two distinct steps.

Instrumental Variables
First, we predict price using X: we estimate the reduced-form model, which is called the first stage, and save the predicted values of price. Second, we regress y on the PREDICTED price, not the actual price. This exercise gives us an unbiased estimate of the slope of the supply curve. Most software (e.g., Stata, SPSS) actually does this in one simultaneous step; it does it this way in order to get the correct standard errors.

Instrumental Variables
We don't always use the full reduced form as the first-stage estimate. In practice, you regress the included endogenous variable (price) on the EXCLUDED exogenous variables. The excluded exogenous variable is called the "instrument."
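The two-step procedure as a sketch, continuing the simulated data from above: the point estimate matches the one-step command, but the second-step standard errors are wrong, which is why software does both steps at once.

    * Two-step IV by hand; continues the simulated data above
    regress p x                // first stage (reduced form for price)
    predict phat, xb           // predicted price
    regress y phat             // same point estimate, but wrong standard errors
    ivregress 2sls y (p = x)   // one step, with correct standard errors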
Instruments satisfy two conditions. First, they do not enter the equation of interest directly: in our case, X was excluded from the supply equation and uncorrelated with its disturbance. Second, they are correlated with the included endogenous variable of interest: in our case, X predicts price. We can test the second condition, but the first is only an assumption.

A practical example
Suppose you were examining the voucher program that we previously talked about. Consider the following equation system:

  PrivateSchool = a + b*voucher + e
  TestScores = c + d*PrivateSchool + v

In this case, we can't run the second equation by OLS, since private schooling is endogenous (higher test scores could "cause" private schooling). However, the voucher predicts private schooling, and it satisfies the other condition that it does not affect test scores directly. Hence the voucher is an instrument for private schooling, and the test score equation is just identified (one endogenous variable, one excluded exogenous variable). If there had been additional X variables that predict test scores, we would have included them in both equations, so long as those variables were exogenous.

Other examples of IV Studies
Y                    X                   Z (instrument)
College enrollment   Financial aid       Thresholds in aid
Test scores          Class size          Thresholds of maximum class size
Health               Heart surgery       Proximity to hospital
Earnings             Years of schooling  Quarter of birth
Birth weight         Maternal smoking    State cigarette taxes
Note: see Angrist & Krueger (2001). Slide from Hedges (2008).

Let's go back to compliers 289
Just as in our simple example of instrumental variables, we can imagine a set of equations:

  Treatment = a + b*(Randomly selected) + e
  Outcome = c + d*(Treatment) + u

The variable (randomly selected) is an instrument for whether a person participates in the treatment: it is correlated with treatment but uncorrelated with the error term (u). We can then get an unbiased estimate of d, the effect of the treatment on the treated.

What if our treatment has many dimensions? 290
We talked briefly about Progressa. It provided support to families depending on their children's attendance at school, and it also depended on regular health visits. Which "treatments" might it be providing? Health care, education, and extra income.

Here is the problem. 291
In the system above, which variable represents the treatment: health, education, or income? All three treatments likely affect the outcome, and all three are likely endogenous. The instrument can help us predict that someone uses the program; however, we have four endogenous variables in this system (the outcome and the three treatments) and only one instrument (random assignment). We are not identified. Random assignment can give us unbiased estimates of the intention-to-treat parameter, but it can give us unbiased estimates of the effect of the treatment on the treated only if certain conditions are met.

Missing compliance group 292
First, recall that we had three types of people: always takers, never takers, and compliers. There is one group that we left out: defiers. Defiers are people who defy their treatment status: assigned to the treatment, they choose not to take it, even though they would have taken it had they been assigned to the control. We assume that these people don't exist. This assumption is called monotonicity: people move forward, not back.

When does IV give us causal mechanisms? 293
SUTVA has to be satisfied. Random assignment has to influence participation (i.e., some people need to comply). Random assignment cannot be related to other outcomes or treatments which may influence the main outcome of interest. And monotonicity (no defiers) must hold.
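A sketch of the voucher example above in Stata (hypothetical variable names): check the first stage, then instrument private schooling with the randomized voucher.

    * IV sketch for the voucher example; variable names are hypothetical
    regress private voucher      // first stage: voucher must predict attendance
    ivregress 2sls testscore (private = voucher)
    * with a randomly assigned binary instrument and no defiers, this
    * estimates the effect on compliers (the LATE)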
Do we need randomization? 294
We just need our instrument to be exogenous to outcomes and related to our other endogenous variable. Randomization certainly helps, but there are many other possible instruments. The key is the exclusion restriction: you can't test that the instrument is really exogenous. It is a theoretical exercise, and you have to think through stories which may or may not threaten its exogeneity.

When Randomization is not Possible…
We started the research process by imagining the perfect experiment. If it was feasible and we could find the partners, we did the experiment. If not, we became creative and tried to find the right discontinuity or the right instrument. Oftentimes, these are not enough: our data lack any exogenous variation to help us identify the causal impact of a treatment. What do we do?

Observational Studies 296
Earlier we talked about observational studies. In these studies, we observe data where some people have had the treatment and others have not. We had no control over who received the treatment, and the treatment may have been assigned nonrandomly. Raw comparisons of people in and out of the treatment are likely still biased: we have done nothing to eliminate the possibility of selection bias.

So what do we do? 297
Matching is the most popular technique, and there are multiple ways to match. We could match on a single characteristic, for example matching everyone who has the same age. We can match cities and "control" for pre-existing differences. In recent years, propensity score matching has become the most popular matching technique.

298
[Figure: number of publications in PubMed with the phrase "propensity score," by year, 1983-2006. Source: Arbogast (2005).]

Propensity Score Matching 299
Propensity score matching (PSM) attempts to match students in the treatment to other students with similar characteristics. PSM is a two-step process: first, specify the propensity score, the probability that a person gets the treatment given their observable characteristics; second, match students with similar propensities. Key idea: you and I have the same probability of being assigned the treatment given our characteristics, so we might be a good match.

Compare two schools 300
Suppose School A uses a new curriculum while School B does not. We can imagine that School A is also, on average, a school with richer students. Our propensity score model would estimate the probability that each student attends School A. There are some students at School B who have very similar characteristics to some students at School A. Our analysis will focus on the students in the treatment at School A for whom we can find a suitable matching student at School B.

Why PSM? 301
Prior to PSM, we typically matched students on a small set of characteristics (e.g., age, gender, socioeconomic status). Matching could be cumbersome, and there were many possible confounding variables. For example, suppose that I match students in the same grade, of the same gender, and at the same income level between School A and School B. What other factors could separate students at these schools? Parent education levels, the types of courses that students take, prior academic achievement, and so on.
Why PSM? 302
Rosenbaum and Rubin (1983) proposed using a single index rather than several characteristics. In our example, we can include age, gender, and socioeconomic status, but we can also control for many, many other characteristics. We start by specifying a model for the probability of treatment, for instance a linear probability model.

Binary Treatment 303
In PSM, we start with a binary treatment variable: T = 1 if the person is in the treatment and 0 if not. We want to model the probability that T = 1. Typically we use one of three models: the linear probability model (OLS), the probit (models the probability using the normal distribution), or the logit (models the probability using the logistic distribution). In each case, we allow a set of control variables to predict that T = 1.

Let's do some examples. 304
Consider a job training program. Adults who lacked skills applied to the training program, and the program initially used randomization to assign applicants. Dehejia and Wahba (2002) investigated whether a different control group could be constructed using PSM. Because randomization was initially used, they can compare PSM methods to randomization.

Randomized Sample 305

Current Population Survey 306
The CPS is a large household survey in the United States, covering 50,000 households per month. The survey inquires about labor market experiences.

Start with the Propensity Score 307
The propensity score is estimated using a logit of treatment status on: age, age^2, age^3; years of schooling and years of schooling^2; dummy variables for being married, having no degree, being Black, and being Hispanic; real earnings in 1974 and 1975; dummy variables for being unemployed in 1974 and 1975; and an interaction between years of schooling and earnings in 1974. All earnings are in 1982 US$.

Things start pretty bleakly 308
If we had stopped there… 309
Notice how different the PS are 310

So how do we refine our matches? 311
The most common techniques are the nearest neighbor match and caliper matching. Nearest neighbor match: for each person in the treatment, identify the control group person with the closest propensity score; this can be done with or without replacement. Caliper matching: choose all control group individuals whose propensity scores are within a certain range of the treated person's score, using weights to downweight cases where multiple control group people are matched; the range is the caliper, and the researcher decides how big it should be. Mahalanobis metric matching redefines the nearest neighbor match by defining distance in a more nuanced way.

Potential Problems with Matching 312
Incomplete matches: some people may not match; they may not have a nearest neighbor. The caliper may be too wide: the wider the caliper, the less precise the match, and the more likely we introduce selection bias. And some individuals are excluded at both ends of the propensity score.
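A sketch of the two steps in Stata, loosely following the Dehejia-Wahba specification above (variable names are hypothetical, and psmatch2 is a user-written command installed with "ssc install psmatch2"):

    * Step 1: estimate the propensity score; variable names are hypothetical
    logit treat c.age##c.age yrs_school married nodegree black hispanic re74 re75
    predict pscore, pr
    * Step 2: nearest-neighbor matching on the score, comparing a later outcome
    psmatch2 treat, pscore(pscore) outcome(re78) neighbor(1)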
[Figure: propensity score distributions for participants and nonparticipants; cases at both ends of the score fall outside the range of matched cases and are excluded. Source: Guo, Barth, Gibbons (2004).]

Propensity score distribution before trimming (example from the Hill pre-K study):
Comparison group (n=1,144): 25th percentile 0.30, median 0.40, 75th percentile 0.53, mean 0.39.
Treatment group (n=908): 25th percentile 0.42, median 0.52, 75th percentile 0.60, mean 0.50.
Source: Lipsey (2007)

Propensity score distribution after trimming:
Comparison group (n=908): 25th percentile 0.36, median 0.45, 75th percentile 0.56, mean 0.46.
Treatment group (n=908): unchanged.
Source: Lipsey (2007)

Estimate the treatment effect, e.g., by differences between matched strata.
[Diagram: propensity score quintiles, with each treatment group stratum matched to the corresponding control group stratum. Source: Lipsey (2007).]

Let's go back to job training 317
The sample is much more balanced and comparable to the NSW. What about treatment effects? 318

Let's go back to job training 319
Again, the sample is much more balanced and comparable to the NSW. What about treatment effects? 320

Two extensions 321
Can we refine the estimates even further? How can we assess the validity and fit of PSM?

Refinements 322
One common strategy to help reduce bias is to subclassify the propensity score. Divide the propensity score into multiple groups; quintiles and deciles are common. Within each group, we can assess balance between the control and treatment groups. For each group, we estimate a separate treatment effect (for simplicity, assume deciles are used); these are interesting in themselves (e.g., the effect of the treatment on individuals who are highly likely to participate). Then combine the group estimates by weighting each group by the percent of the treatment sample in that group. (A sketch of this estimator appears after the Project GRAD table below.)

Smoking and Health 323
Don Rubin, "Using Propensity Scores to Help Design Observational Studies: Application to Tobacco Litigation," Health Services and Outcomes Research Methodology, 2001. Smoking is not randomly assigned. The goal is to measure the impact of smoking on health outcomes; the treatment is "smoking."

Specifying the Propensity Score 324
More Variables for the PS 325

Testing Validity and Fit 326
One potential problem is misspecification of the propensity score. How robust are the results to alternative specifications?

Dehejia and Wahba 327

Testing Validity and Fit 328
One potential problem is misspecification of the propensity score. How robust are the results to alternative specifications? Examine the bias: researchers refer to "bias" in PSM as the difference in the respective propensity scores of the treatment and control groups, in standard deviation units.

Evaluation of Project GRAD 329
Table 2.8. The potential bias of propensity score matching
Sample                                                   Mean(PS): control   Mean(PS): treatment   Bias term
All schools                                              0.091               0.652                 1.832
Matched on African-American representation               0.248               0.652                 1.319
Matched on pre-PG achievement levels                     0.378               0.682                 0.994
Matched on all characteristics incl. free/reduced lunch  0.357               0.652                 0.963
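Here is the promised sketch of the subclassification estimator from the refinements slide (hypothetical names, continuing the pscore variable estimated earlier; it assumes every decile contains both treated and control observations):

    * Subclassification sketch; variable names are hypothetical
    xtile decile = pscore, nq(10)      // cut the score into deciles
    local tot = 0
    forvalues d = 1/10 {
        quietly regress y treat if decile == `d'
        local b = _b[treat]
        quietly count if treat & decile == `d'
        local nd = r(N)
        quietly count if treat
        local tot = `tot' + `b'*`nd'/r(N)   // weight by the treated share
    }
    display "Effect of the treatment on the treated = " `tot'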
Matching plus Stratification Reduces Bias 330
Table 2.8. The potential bias of propensity score matching
Sample                              Mean(PS): control   Mean(PS): treatment   Bias term
Boys, all                           0.1414              0.2477                0.7733
Boys, matched                       0.4963              0.5006                0.0311
Boys, matched, median averaged                                                0.0118
Boys, matched, quartile averaged                                              0.0225
Girls, all                          0.1391              0.2425                0.7493
Girls, matched                      0.4996              0.5029                0.0242
Girls, matched, median averaged                                               0.0070
Girls, matched, quartile averaged                                             0.0099

Testing Validity and Fit 331
One potential problem is misspecification of the propensity score. How robust are the results to alternative specifications? Examine the bias: the difference in the mean propensity scores of the treatment and control groups, in standard deviation units. Also examine the ratio of the standard deviations of the propensity score: the samples should have similar standard deviations of the PS.

Comparison of Standard Deviations 332
Table 2.9. Ratio of standard deviations in the propensity score for treatment and control groups
Sample                              SD(PS): control   SD(PS): treatment   Ratio of SDs
Boys, all                           0.1061            0.1374              0.7725
Boys, matched                       0.0337            0.0320              1.0534
Boys, matched, median averaged      0.0177            0.0153              1.1557
Boys, matched, quartile averaged    0.0127            0.0124              1.0264
Girls, all                          0.1038            0.1380              0.7522
Girls, matched                      0.0296            0.0280              1.0584
Girls, matched, median averaged     0.0163            0.0144              1.1319
Girls, matched, quartile averaged   0.0153            0.0104              1.4702

Testing Validity and Fit 333
One potential problem is misspecification of the propensity score. Examine the bias (the difference in mean propensity scores in standard deviation units) and the ratio of the standard deviations of the PS. Also examine the ratio of the standard error terms for each covariate: regress the covariate on the PS, save the residual, and compare the standard deviation of this residual for the treatment and control groups.

334
Examine where this ratio falls for all of the variables. The goal is to have most of the ratios between 0.95 and 1.05.

Table 2.10. Ratio of standard error terms (r) for control and treatment groups, for each explanatory variable in the analysis (shares of variables in each bin)
Sample          r<0.90   0.90-0.95   0.95-1.05   1.05-1.10   r>1.10
Boys, all       0.50     0           0.3125      0           0.1875
Boys, matched   0        0           0.875       0.0625      0.0625
Girls, all      0.4375   0.0625      0.3125      0           0.1875
Girls, matched  0        0           0.875       0.0625      0.0625

Assessing Bias in the Smoking Study: Unmatched 335
Smoking Study: Matched Sample 336
Using Subclassifications 337

Limitations on the Propensity Score 338
If individuals differ on unobserved characteristics, then the propensity score can induce bias. The propensity score in small samples can lead to imbalance. Including irrelevant covariates in the PS can reduce efficiency (Guo, Barth, Gibbons 2005). PSM can reduce large biases, but it cannot eliminate all bias (Bloom). PSM might help identify bias in the short run, but over time the quality of the comparisons deteriorates (Bloom). We need overlap in the two groups. Context matters: surveys that do not match the context may have strong unobservables.

Implementing in Stata 339
There is a nice introduction to PSM as Stata uses it: http://fmwww.bc.edu/RePEc/usug2001/psmatch.pdf

  psmatch treated, on(score) cal(.01)
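The psmatch routine in that introduction is an early user-written command; here is a hedged sketch of the same analysis with later tools (variable names hypothetical; psmatch2 lives on SSC, and teffects is built into modern Stata):

    * Two later alternatives to the psmatch call above; sketches only
    * (1) psmatch2 (ssc install psmatch2): caliper matching on an existing score
    psmatch2 treated, pscore(score) outcome(y) caliper(.01)
    * (2) built-in teffects: estimate the score and match in one command
    teffects psmatch (y) (treated x1 x2, logit), atet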