Transcript: Evaluation - Stanford HCI Group
Evaluation
Eyal Ophir, CS 376, 4/28/09
Readings
• Methodology Matters (McGrath, 1994)
• Practical Guide to Controlled Experiments on the Web (Kohavi et al., 2007)
Methodology Matters
Methods for Research in the Behavioral and Social Sciences
• Different methods have strengths and weaknesses
• Tradeoff between: Generalizability, Precision, Realism
• Credibility requires consistency and convergence across methods
Study Design
• Find baserates, correlations, or differences
• Randomization of selection, assignment to conditions (see the sketch below)
• Statistical significance
• Validity (internal, statistical, construct, external)
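As a minimal sketch of random assignment to conditions (the participant IDs, the two condition names, and the seed are illustrative assumptions, not details from the slides):

```python
import random

def assign_conditions(participants, conditions=("control", "treatment"), seed=None):
    """Randomly assign each participant to one condition, balanced as far as possible."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    # Deal the shuffled participants round-robin across the conditions.
    return {p: conditions[i % len(conditions)] for i, p in enumerate(shuffled)}

# Example: six hypothetical participant IDs split across two conditions.
print(assign_conditions(["p1", "p2", "p3", "p4", "p5", "p6"], seed=42))
```

Shuffling first and then dealing round-robin keeps the groups equal in size while keeping the assignment itself random.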
Measures
• Self report
• Trace measures
• Observation (by a visible or hidden observer)
• Archival records (public or private)
Manipulation
• Selection
• Direct intervention
• Induction (indirect intervention: confederates, deception)
Case Study: Multitasking UI
• Users play two simultaneous instantiations of a game
• Does making the two instantiations visually different make it easier to switch back and forth?
Case Study
• Tradeoffs: Generalizability, Precision, Realism
• Design: baserates, correlations, differences
• Random selection, assignment
• Validity: internal, statistical, construct, external
• Measures: self-report, trace measures, observation, archival records
• Manipulation: selection, intervention, induction
General Question
Has social psychology resisted formal theory, and if so, why?
Practical Guide to Controlled Experiments on the Web
Web Experiments
OEC: Overall Evaluation Criterion
Web Experiments
Hypothesis testing and sample size
• Confidence, power
• Reducing the standard error (see the sketch below):
  • Sufficiently large sample size
  • OEC with inherently low variability
  • Reduce variability by excluding irrelevant cases
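A minimal sketch of the sample-size arithmetic behind these points, using the standard two-proportion normal approximation; the baseline rate and minimum detectable difference are made-up illustrative values, not numbers from the paper:

```python
from statistics import NormalDist

def required_sample_size(baseline_rate, min_detectable_diff, alpha=0.05, power=0.80):
    """Approximate users needed per variant to detect an absolute change in a
    conversion-style OEC (two-proportion normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p = baseline_rate
    variance = 2 * p * (1 - p)                      # variance of the difference in rates
    n = ((z_alpha + z_power) ** 2) * variance / (min_detectable_diff ** 2)
    return int(n) + 1

# Illustrative numbers: 5% baseline conversion, detecting a 0.5% absolute lift.
print(required_sample_size(0.05, 0.005))   # roughly 30,000 users per variant
```

Because the required n grows with the square of 1/Δ, halving the detectable difference quadruples the sample, which is why the paper stresses large samples and low-variability OECs.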
Web Experiments
Extensions for Online Experiments
• Treatment ramp-up
• Automation
• Software migration
Web Experiments
Limitations of web experiments
• No explanation of mechanism
• Focus on short-term effects
• Primacy/newness
• Must implement treatments
Web Experiments
Implementation
• Randomization
  • Pseudorandom with caching
  • Hash and partition (see the sketch below)
• Assignment
  • Traffic splitting
  • Server-side
  • Client-side
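A minimal sketch of the hash-and-partition idea: hash a stable user ID, map it into a fixed number of buckets, and dedicate a range of buckets to the treatment. The bucket count, experiment name, and 50/50 split below are illustrative choices, not details from the paper.

```python
import hashlib

NUM_BUCKETS = 100

def bucket_for(user_id: str, experiment: str) -> int:
    """Deterministically map a user to a bucket; the same user always gets the same bucket."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def variant_for(user_id: str, experiment: str, treatment_percent: int = 50) -> str:
    """Buckets [0, treatment_percent) get the treatment; the rest get control."""
    return "treatment" if bucket_for(user_id, experiment) < treatment_percent else "control"

# The same user ID always lands in the same variant, with no server-side state to store.
print(variant_for("user-12345", "two-game-ui"))
```

Salting the hash with the experiment name keeps assignments independent across experiments, and because the split is just a threshold on the bucket, the same mechanism supports ramp-up: start treatment_percent small and raise it as confidence grows.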
Lessons Learned (i.e., tips for the researcher)
Analysis
• Mine the data
• Time matters
• Multi-factor experiments
Lessons Learned
Trust and Execution
• Run A/A tests (test your system; see the sketch below)
• Ramp-up and abort
• Correct sample size
• Assign 50% to treatment
• Beware day-of-week effects
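A minimal sketch of what an A/A test checks: both groups receive the identical experience, so a significance test should reject at roughly the nominal 5% rate. The simulated conversion rate, group sizes, and run count are illustrative, not figures from the paper.

```python
import random
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (normal approximation)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def aa_false_positive_rate(true_rate=0.05, n=2_000, runs=500, seed=0):
    """Simulate many A/A splits; a healthy system rejects at ~5% when alpha = 0.05."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(runs):
        conv_a = sum(rng.random() < true_rate for _ in range(n))
        conv_b = sum(rng.random() < true_rate for _ in range(n))
        if two_proportion_p_value(conv_a, n, conv_b, n) < 0.05:
            rejections += 1
    return rejections / runs

print(aa_false_positive_rate())  # expect a value near 0.05
```

If the observed rejection rate drifts far from 5%, the assignment or logging pipeline is suspect, which is exactly why A/A tests come before trusting A/B results.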
Lessons Learned
Culture and Business
• Agree on the OEC upfront
• Beware “harmless” features
• Weigh performance vs. maintenance cost
• Data-driven (vs. opinion-driven) culture
Extended Case Study
• Assume the game UI from the first case study was an actual gaming site
• The website is interested in promoting multiple simultaneous games between users, but users complain that it’s difficult to manage multiple games
• Design a web-based study informed by the reading to test the new design
Case Study
• OEC
• Sample size, reducing error
• Ramp-up, automation
• Mechanism explanation, short- vs. long-term effects, primacy/newness
• Randomization/assignment
• Mine the data, multi-factor experiments
• A/A tests, sample size, day-of-week effects
Data-Oriented Culture
Pros?
Cons?
How can we best use user tests to inform design and innovation?
• Trade-offs of experimentation vs. intuition
• Why the OEC? What are good measures for non-commerce sites?
Do online tests maximize all McGrath’s parameters?