Evaluation - Stanford HCI Group



Eyal Ophir CS 376 4/28/09


• Methodology Matters (McGrath, 1994)
• Practical Guide to Controlled Experiments on the Web (Kohavi et al., 2007)

Methodology Matters

• Methods for Research in the Behavioral and Social Sciences
• Different methods have strengths and weaknesses
• Tradeoff between:
  • Generalizability
  • Precision
  • Realism
• Credibility requires consistency, convergence across methods

Study Design

• Find baserates, correlations, or differences
• Randomization of selection, assignment to conditions
• Statistical significance
• Validity (internal, statistical, construct, external)


Measures:
• Self-report
• Trace measures
• Observation (by a visible or hidden observer)
• Archival records (public or private)


Manipulation:
• Selection
• Direct intervention
• Induction (indirect intervention: confederates, deception)

Case Study: Multitasking UI

• Users play two simultaneous instantiations of a game
• Does making the two instantiations visually different make it easier to switch back and forth?

Case Study

• Tradeoffs: Generalizability, Precision, Realism
• Design: baserates, correlations, differences
• Random selection, assignment
• Validity: internal, statistical, construct, external
• Measures: self-report, trace measures, observation, archival records
• Manipulation: selection, intervention, induction

General Question

 Has social psychology resisted formal theory, and if so, why?

Practical Guide to Controlled Experiments on the Web

Web Experiments

 OEC: Overall Evaluation Criterion

Web Experiments

• Hypothesis testing and sample size
  • Confidence, power
• Reducing the standard error
  • Sufficiently large sample size
  • OEC with inherently low variability
  • Reduce variability by excluding irrelevant cases
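The link between sample size and variability in the bullets above can be made concrete with the common rule of thumb n = 16σ²/Δ² per variant (roughly 80% power at 95% confidence). The sketch below is illustrative only: the function name and the numbers (a standard deviation of 30 and a minimum detectable difference of 0.1875 in the OEC) are assumptions, not values from the slides.

```python
import math


def sample_size_per_variant(sigma: float, delta: float) -> int:
    """Rule-of-thumb users needed per variant: n = 16 * sigma^2 / delta^2.

    sigma: standard deviation of the OEC (assumed known or estimated).
    delta: smallest difference in the OEC that the test should detect.
    """
    if sigma <= 0 or delta <= 0:
        raise ValueError("sigma and delta must be positive")
    return math.ceil(16 * sigma ** 2 / delta ** 2)


# Hypothetical OEC: revenue per user, sigma = 30, detect a 0.1875 change.
baseline = sample_size_per_variant(sigma=30.0, delta=0.1875)

# Same target difference, but an OEC with half the variability.
low_variance = sample_size_per_variant(sigma=15.0, delta=0.1875)

print(baseline, low_variance)
```

Because n grows with σ², halving the standard deviation of the OEC cuts the required sample to a quarter, which is why a low-variability OEC and excluding irrelevant cases both help.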

Web Experiments

• Extensions for Online Experiments
  • Treatment ramp-up
  • Automation
  • Software Migration

Web Experiments

• Limitations of web experiments
  • No explanation of mechanism
  • Focus on short term effects
  • Primacy/newness
  • Must implement treatments

Web Experiments

• Implementation
  • Randomization
    • Pseudorandom with caching
    • Hash and partition
  • Assignment
    • Traffic splitting
    • Server-side
    • Client-side
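A minimal sketch of the hash-and-partition idea, assuming a string user ID and experiment ID: hash the pair, map the hash to a bucket, and assign buckets below a threshold to treatment. The names `assign_variant` and `NUM_BUCKETS` and the choice of SHA-256 are assumptions made for illustration, not details from the reading.

```python
import hashlib

NUM_BUCKETS = 1000  # hypothetical granularity: 0.1% of traffic per bucket


def assign_variant(user_id: str, experiment_id: str,
                   treatment_fraction: float = 0.5) -> str:
    """Deterministically map a user to 'treatment' or 'control'."""
    # Including the experiment ID gives each experiment its own partition.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) % NUM_BUCKETS
    treatment_buckets = round(treatment_fraction * NUM_BUCKETS)
    return "treatment" if bucket < treatment_buckets else "control"


# The same user always gets the same answer, with no stored state.
print(assign_variant("user-42", "two-game-ui"))
print(assign_variant("user-42", "two-game-ui"))
```

Since the bucket for a user never changes, raising `treatment_fraction` during a ramp-up only adds users to treatment; nobody who already saw the treatment is moved back to control.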

Lessons Learned (i.e., tips for the researcher)

• Analysis
  • Mine the Data
  • Time matters
  • Multi-factor experiments

Lessons Learned

• Trust and Execution
  • Run A/A tests (test your system)
  • Ramp-up and abort
  • Correct sample size
  • Assign 50% to treatment
  • Beware day of week effects
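One way to see what an A/A test checks: if both groups receive the same experience, a test at 95% confidence should flag a "significant" difference only about 5% of the time. The simulation below is a sketch under assumed conditions (normally distributed OEC values, a two-sample z-test, made-up group sizes); a real A/A test would run live traffic through the actual experimentation system.

```python
import math
import random
import statistics


def z_statistic(a, b):
    """Two-sample z statistic for the difference in means."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def aa_false_positive_rate(num_experiments=1000, users_per_group=100, seed=0):
    """Share of simulated A/A experiments that look significant at 95%."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(num_experiments):
        # Both groups are drawn from the same distribution: no real effect.
        a = [rng.gauss(10.0, 2.0) for _ in range(users_per_group)]
        b = [rng.gauss(10.0, 2.0) for _ in range(users_per_group)]
        if abs(z_statistic(a, b)) > 1.96:
            false_positives += 1
    return false_positives / num_experiments


print(f"A/A false positive rate: {aa_false_positive_rate():.3f}")
```

A rate far from 5% in a real A/A test would suggest a problem in assignment, logging, or the analysis itself rather than in any feature.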

Lessons Learned

• Culture and Business
  • Agree on OEC upfront
  • Beware “harmless” features
  • Weigh performance vs. maintenance cost
  • Data-driven (vs. opinion-driven) culture

Extended Case Study

• Assume the game UI from the first case study was an actual gaming site
• The website is interested in promoting multiple simultaneous games between users, but users complain that it’s difficult to manage multiple games
• Design a web-based study informed by the reading to test the new design

Case Study

• OEC
• Sample size, reducing error
• Ramp-up, automation
• Mechanism explanation, short vs. long-term effects, primacy/newness
• Randomization/assignment
• Mine the data, multi-factor experiments
• A/A tests, sample size, day of week effects

Data-Oriented Culture

 Pros?

 Cons?

 How can we best use user tests to inform design and innovation?

• Trade-offs of experimentation vs. intuition
• Why the OEC? What are good measures for non-commerce sites?

 Do online tests maximize all McGrath’s parameters?