Evaluate—Quantitative Methods
October 4, 2007
Turn in Project Proposal
NEEDS / DESIGN / EVALUATE / IMPLEMENT

Today
– Quantitative methods
  – Scientific method
  – Aim for generalizable results
– Privacy issues when collecting data
Quantitative methods
– Reliably measure some aspect of the interface
  – Especially to measurably compare alternatives
– Approaches
  – Controlled experiments (see Doing Psychology Experiments, David W. Martin, 7th edition, 2007)
  – Collect usage data

Designing an experiment
– State the hypothesis
– Identify variables
  – Independent
  – Dependent
– Design the experimental protocol
– Apply for human subjects review
– Select the user population
– Conduct the experiment

Conducting experiment
– Run a pilot test
– Collect data from running the experiment
– Perform statistical analysis
– Interpret the data, draw conclusions

Experiment hypothesis
– Testable hypothesis
  – Precise statement of the expected outcome
  – More specifically, how you predict the dependent variable (e.g., accuracy) will depend on the independent variable(s)
– "Null" hypothesis (H0)
  – States that there will be no effect
  – e.g., "There will be no difference in performance between the two groups"
  – The data are used to try to disprove this null hypothesis

Experiment design
– Independent variables
  – Attributes we manipulate / vary across conditions
  – Levels: the values an attribute takes
– Dependent variables
  – Outcome of the experiment, the measures used to evaluate it
  – Usually measure user performance: time to completion, errors, amount of production, measures of satisfaction

Experiment design (2)
– Control variables
  – Attributes that remain the same across conditions
– Random variables
  – Attributes that are randomly sampled
  – Can be used to increase generalizability
– Avoiding confounds
  – Confounds are attributes that changed but were not accounted for
  – Confounds prevent drawing conclusions about the independent variables

Example: Person picker
– Picking from a list of names to invite to use a Facebook application
  – Bryan Tsao, Christine Robson, David Sun, John Tang, Jonathan Tong, …

Example: Variables
– Independent variables
  – Picture vs. no picture
  – Ordered horizontally or vertically
  – One column vs. two columns
– Dependent variables
  – Time to complete
  – Error rate
  – User perception
– Control variables
  – Test setting
  – List to pick from
– Random variables
  – Subject demographics
– Confound
  – Only one woman in the list
  – List mostly Asian names

Experimental design goals
– Internal validity
  – Cause and effect: changes in the independent variables cause the changes in the dependent variables
  – Eliminate confounds (turn them into independent variables or random variables)
  – Replicability of the experiment
– External validity
  – Results generalizable to other settings
  – Ecological validity—generalizable to the real world
– Confidence in results
  – Statistical power (number of subjects, at least 10)

Experimental protocol
– Defining the task(s)
– What are all the combinations of conditions? (see the enumeration sketch below)
– How often to repeat each condition combination?
– Between-subjects or within-subjects?
– Avoiding bias (instructions, ordering)
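The protocol question above about the combinations of conditions can be answered mechanically. Below is a minimal sketch, not from the lecture, that enumerates every combination of the person-picker independent variables with Python's itertools; the variable names and level labels are illustrative assumptions.

```python
# Enumerate all condition combinations for the person-picker experiment.
# Level labels are illustrative; adjust them to match your own design.
from itertools import product

orderings = ["horizontal", "vertical"]       # independent variable 1: ordering (2 levels)
layouts = ["1 column", "2 columns"]          # independent variable 2: layout (2 levels)
formats = ["picture + text", "text only"]    # independent variable 3: format (2 levels)

conditions = list(product(orderings, layouts, formats))
print(len(conditions))                       # 2 * 2 * 2 = 8 condition combinations
for ordering, layout, fmt in conditions:
    print(f"{ordering}, {layout}, {fmt}")
```

Adding a level or a factor multiplies this count, which is one way to see why experiments get expensive quickly.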
Task
– Defining a task that tests the hypothesis
  – Pictures will lead to fewer errors
  – Same time to pick users with and without pictures (H0)
  – Pictures will lead to higher satisfaction
– How do you present the task?
– Task: Users must select the following people to share the application with
  – Jonathan Tong
  – Christine Robson
  – David Sun

Motivating user tasks
– Create a scenario, a movie plot, for the task
  – Immerse the subject in a story that removes them from the "user testing" situation
  – Focus the subject on the goal; the system becomes a tool (and more open to critique)

Number of Conditions
– Consider all combinations to isolate the effects of each independent variable: (2 orderings) * (2 column layouts) * (2 formats) = 8
  – Horizontal, 1 column, picture + text
  – Horizontal, 2 columns, picture + text
  – Vertical, 1 column, picture + text
  – Vertical, 2 columns, picture + text
  – Horizontal, 1 column, text only
  – Horizontal, 2 columns, text only
  – Vertical, 1 column, text only
  – Vertical, 2 columns, text only
– Adding levels or factors grows the combinations exponentially
– This can make experiments expensive!

Reducing conditions
– Vary only one independent variable at a time
  – But you can miss interactions
– Factor the experiment into a series of steps
  – Prune branches if no significant effects are found

Choosing subjects
– Balance the sample to reflect the diversity of the target user population (random variable)
  – Novices, experts
  – Age group
  – Gender
  – Culture
– Example
  – 30 college-age subjects, normal or corrected-to-normal vision, with demographic distributions of gender and culture

Population as variable
– Population as an independent variable
  – Identifies interactions
  – Adds conditions
– Population as a controlled variable
  – Consistency across the experiment
  – Misses relevant features
– Statistical post-hoc analysis can suggest the need for further study
  – Collect all the relevant demographic info

Recruiting participants
– "Subject pools"
  – Volunteers
  – Paid participants
  – Students (e.g., psych undergrads) for course credit
  – Friends, acquaintances, family, lab members
  – "Public space" participants, e.g., observing people walking through a museum
– Must fit the user population (validity); a sketch of randomly assigning recruited participants to conditions appears below
– Motivation is a big factor: not only money but also explaining the importance of the research

Current events: Population sampling issue
– Currently, election polling is conducted on land-line phones
  – Legacy
  – Laws about manual dialing of cell phones
  – Higher refusal rates
  – Cell phone users pay for incoming calls, so pollsters have to compensate recipients
– What bias is introduced by excluding cell-phone-only users?
  – 7% of the population (2004), growing to 15% (2008)
  – Which candidate claims the polls underrepresent them?
– http://www.npr.org/templates/story/story.php?storyId=14863373
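Tying the recruiting notes above to the between-subjects design described next, here is a minimal sketch, under stated assumptions, of randomly assigning recruited participants to the two picker conditions. The participant IDs and group sizes are hypothetical placeholders, not data from the lecture.

```python
# Randomly assign 30 recruited participants to two between-subjects conditions.
# Participant IDs are hypothetical; replace them with your own recruiting list.
import random

participants = [f"P{i:02d}" for i in range(1, 31)]
conditions = ["text only", "text + pictures"]

random.shuffle(participants)                     # randomize who ends up in which group
half = len(participants) // 2
assignment = {name: conditions[0] for name in participants[:half]}
assignment.update({name: conditions[1] for name in participants[half:]})

for name in sorted(assignment):
    print(name, "->", assignment[name])
```

Shuffling before splitting keeps the two groups the same size while leaving the assignment random; a within-subjects design would instead shuffle the order of conditions for each participant.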
Between subjects design
– Different groups of subjects use different designs
  – 15 subjects use text only
  – 15 subjects use text + pictures

Within subjects design
– All subjects try all conditions (e.g., the same 15 subjects try both designs)

Within Subjects Designs
– More efficient
  – Each subject gives you more data; they complete more "blocks" or "sessions"
– More statistical "power"
  – Each person is their own control, so there are fewer confounds
– Therefore, can require fewer participants
– May mean a more complicated design to avoid "order effects"
  – A participant may learn from the first condition
  – Fatigue may make the second performance worse

Between Subjects Designs
– Fewer order effects
– Simpler design & analysis
– Easier to recruit participants (only one session, shorter time)
– Subjects can't compare across conditions
– Need more subjects
– Control more for confounds

Within Subjects: Ordering effects
– Countering order effects
  – Equivalent tasks (less sensitive to learning)
  – Randomize the order of conditions (random variable)
  – Counterbalance the ordering (ensure all orderings are covered)
  – Latin Square ordering (partial counterbalancing)

Study setting
– Lab setting
  – Complete control through isolation
  – Uniformity across subjects
– Field study
  – Ecological validity
  – Variations across subjects

Before Study
– Always pilot test first
  – Reveals unexpected problems
  – You can't change the experiment design after collecting data
– Make sure participants know you are testing the software, not them
  – (Usability testing, not user testing)
– Maintain privacy
– Explain procedures without compromising results
– Participants can quit at any time
– Administer a signed consent form

During Study
– Always follow the same steps—use a checklist
– Make sure the participant is comfortable
– The session should not be too long
– Maintain a relaxed atmosphere
– Never indicate displeasure or anger

After Study
– State how the session will help you improve the system ("debriefing")
– Show the participant how to perform failed tasks
– Don't compromise privacy (never identify people; only show videos with explicit permission)
– Data are to be stored anonymously, securely, and/or destroyed

Exercise: Quantitative test
– Pair up with someone who has a computer and has downloaded the files
– DO NOT OPEN THE FILE (yet)
– Make sure one of you has a stopwatch
  – Cell phone
  – Watch
– The computer user will run the test; the observer will time the event

Exercise: Task
– Open the file
– Find the item in the list
– Highlight that entry as shown

Example: Variables
– Independent variables
– Dependent variables
– Control variables
– Random variables
– Confound

Data Inspection
– Look at the results
– First look at each participant's data
  – Were there outliers, people who fell asleep, anyone who tried to mess up the study, etc.?
– Then look at aggregate results and descriptive statistics
– "What happened in this study?" relative to the hypothesis and goals

Descriptive Statistics
– For all variables, get a feel for the results:
  – Total scores, times, ratings, etc.
  – Minimum, maximum
  – Mean, median, ranges, etc.
– e.g., "Twenty participants completed both sessions (10 males, 10 females; mean age 22.4, range 18-37 years)."
– e.g., "The median time to complete the task in the mouse-input group was 34.5 s (min = 19.2, max = 305 s)."
– What is the difference between mean & median? Why use one or the other?
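As a concrete illustration of the descriptive statistics above, here is a minimal sketch using Python's statistics module. Only the median, minimum, and maximum echo the example quoted above; the other completion times are made-up values for illustration.

```python
# Descriptive statistics for a set of task completion times (seconds).
# 19.2, 34.5, and 305.0 echo the example above; the rest are invented.
import statistics

completion_times = [19.2, 25.0, 34.5, 60.0, 305.0]

print("n      =", len(completion_times))
print("min    =", min(completion_times))
print("max    =", max(completion_times))
print("mean   =", round(statistics.mean(completion_times), 1))   # 88.7
print("median =", statistics.median(completion_times))           # 34.5

# The single 305 s outlier pulls the mean far above the median, which is one
# reason to report the median for skewed timing data.
```

This is one answer to the mean-versus-median question on the slide: the median resists outliers, while the mean does not.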
Subgroup Stats
– Look at descriptive stats (means, medians, ranges, etc.) for any subgroups
  – e.g., "The mean error rate for the mouse-input group was 3.4%. The mean error rate for the keyboard group was 5.6%."
  – e.g., "The median completion times (in seconds) for the three groups were: novices: 4.4, moderate users: 4.6, and experts: 2.6."

Plot the Data
– Look for the trends graphically

Other Presentation Methods
– Scatter plot (figure: age vs. time in seconds)
– Box plot (figure: low, middle 50%, high, and mean marked)

Experimental Results
– How does one know if an experiment's results mean anything or confirm any beliefs?
– Example: 40 people participated; 28 preferred interface 1, 12 preferred interface 2
– What do you conclude?

Inferential (Diagnostic) Stats
– Tests to determine whether what you see in the data (e.g., differences in the means) is reliable (replicable), and whether it is likely caused by the independent variables rather than by random effects
  – e.g., a t-test to compare two means
  – e.g., ANOVA (Analysis of Variance) to compare several means
  – e.g., testing the "significance level" of a correlation between two variables

Means Not Always Perfect
– Experiment 1: Group 1 = 1, 10, 10 (mean 7); Group 2 = 3, 6, 21 (mean 10)
– Experiment 2: Group 1 = 6, 7, 8 (mean 7); Group 2 = 8, 11, 11 (mean 10)
– (These numbers are revisited in the t-test sketch below.)

Inferential Stats and the Data
– Ask diagnostic questions about the data
– Are these really different? What would that mean?

Hypothesis Testing
– Going back to the hypothesis—what do the data say?
– Translate the hypothesis into an expected difference in the measure
  – If "First name" is faster, then TimeFirst < TimeLast
  – Under the "null hypothesis" there should be no difference between the completion times: H0: TimeFirst = TimeLast

Hypothesis Testing (continued)
– "Significance level" (p):
  – The probability of getting the result you saw simply by chance, if there were no real effect
  – The cutoff or threshold level of p (the "alpha" level) is often set at 0.05, i.e., 5% of the time you would get that result just by chance
  – e.g., if your statistical t-test (testing the difference between two means) returns t = 4.5 with p = .01, the difference between the means is statistically significant

Errors
– Errors in analysis do occur
– Main types:
  – Type I / false positive: you conclude there is a difference when in fact there isn't
  – Type II / false negative: you conclude there is no difference when there is

Drawing Conclusions
– Make your conclusions based on the descriptive stats, but back them up with inferential stats
  – e.g., "The expert group performed faster than the novice group, t(1,34) = 4.6, p < .01."
– Translate the stats into words that regular people can understand
  – e.g., "Thus, those who have computer experience will be able to perform better, right from the beginning…"
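To make the inferential step concrete, here is a hedged sketch that runs an independent-samples t-test on the "Means Not Always Perfect" data above. It assumes SciPy is available; the lecture itself does not prescribe any particular tool.

```python
# Compare the two groups in each experiment with an independent-samples t-test.
# Requires SciPy (pip install scipy).
from scipy import stats

# Experiment 1: noisy data, group means 7 and 10
exp1_group1 = [1, 10, 10]
exp1_group2 = [3, 6, 21]

# Experiment 2: tight data, the same group means of 7 and 10
exp2_group1 = [6, 7, 8]
exp2_group2 = [8, 11, 11]

t1, p1 = stats.ttest_ind(exp1_group1, exp1_group2)
t2, p2 = stats.ttest_ind(exp2_group1, exp2_group2)

print(f"Experiment 1: t = {t1:.2f}, p = {p1:.2f}")
print(f"Experiment 2: t = {t2:.2f}, p = {p2:.2f}")

# The difference in means is identical in both experiments, but the tight data
# in Experiment 2 give a much larger |t| and smaller p than the noisy data in
# Experiment 1, which is exactly why means alone are not always enough.
```

With only three data points per group, even Experiment 2 may not cross the conventional 0.05 threshold; more participants raise the statistical power, echoing the earlier note about subject counts.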
Feeding Back Into Design
– Your study was designed to yield information you can use to redesign your interface
– What were the conclusions you reached?
– How can you improve on the design?
– What are the quantitative redesign benefits?
  – e.g., 2 minutes saved per transaction, a 24% increase in production, or $45,000,000 per year in increased profit
– What are the qualitative, less tangible benefits?
  – e.g., workers will be less bored and less tired, and therefore more interested, leading to better customer service

Remote usability testing
– Telephone or video communication
– Screen-sharing technology
  – Microsoft NetMeeting: https://www.microsoft.com/downloads/details.aspx?FamilyID=26c9da7c-f778-4422-a6f4-efb8abba021e&DisplayLang=en
  – VNC: http://www.realvnc.com/
– Greater flexibility in recruiting subjects and environments

Usage logging
– Embed logging mechanisms into the code (a minimal, privacy-aware logging sketch appears below, after the human subjects notes)
– Study usage in actual deployment
– Some code can even "phone home"
– Facebook usage metrics

Example: Rhythmic Work Activity
– Drawn from about 50 Awarenex (IM) users
– Based on up to 2 years of collected data
– Bi-coastal teams (3-hour time difference)
– Work-from-home team members
– Sun Microsystems Laboratories: James "Bo" Begole, Randall Smith, and Nicole Yankelovich

Activity Data Collected
– Activity information
  – Input device activity (1-minute granularity)
  – Device location (office, home, mobile)
  – Email fetching and sending
– Online calendar appointments
– Activity ≠ Availability

Computer Activity
– (Figure: actogram of an individual's computer activity, by time of day and date)

Aggregate Activity
– (Figure: aggregate computer activity)

Aggregate Activity with Appointments
– (Figure: aggregate computer activity with appointments overlaid)

Comparing Aggregates Among 3 Individuals
– (Figure: appointments and computer activity for individuals a, b, and c)

Project deployment issues
– May have to be careful about widespread deployment of the application
  – We're only looking for a usability study with 4 people
  – Widespread deployment would be cool
– BUT widespread deployment may run into provisioning issues
  – Provide feedback on server provisioning

Quantitative study of your project
– What are your measures?
  – Task measures: performance time, errors
  – Usage measures (Facebook utilities)
– Identify independent, dependent, and control variables
– Compute summary statistics
  – Discussion section

Privacy issues in collecting user data
– Collecting data involves respecting users' privacy

Informed consent
– A legal condition whereby a person can be said to have given consent based upon an appreciation and understanding of the facts and implications of an action
– EULA?
– But what about actions in public places? What about recording in public places?

Consent
– Why is it important?
  – People can be sensitive about this process and these issues
  – Errors will likely be made; the participant may feel inadequate
  – The session may be mentally or physically strenuous
– What are the potential risks (there are always risks)?
– "Vulnerable" populations need special care & consideration
  – Children; disabled; pregnant; students

Controlling data for privacy
– What data is being collected?
– How will the data be used?
– How can I delete data?
– Who will have access to the data?
– How can I review data before public presentations?
– What if I have questions afterwards? Contact info for questions
– (Diagram: what activity is observed, what data are collected, how the data will be used, who can access the data, how to delete data, and review before showing publicly)

Human subjects review, participants, & ethics
– Academic and government research must go through a human subjects review process
– Committee for Protection of Human Subjects: http://cphs.berkeley.edu/
– Reviews all research involving human (or animal) participants
– Safeguards the participants, and thereby the researcher and the university
– Not a science review (i.e., not an assessment of your research ideas); only safety & ethics
– Complete Web-based forms, submit a research summary, sample consent forms, training, etc.
– Practices in industry vary
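Picking up the usage-logging slide and the privacy controls above, here is a minimal sketch, under stated assumptions, of embedding logging into an application while keeping participants' identities out of the stored data. The file name, salt handling, and event names are illustrative assumptions, not anything prescribed by the lecture.

```python
# Log timestamped usage events under a salted hash of the participant's name,
# so the stored data never contains the raw identity.
import hashlib
import json
import time

SALT = "replace-with-a-per-study-secret"   # prevents reversing the hash by guessing names

def anonymous_id(participant_name: str) -> str:
    return hashlib.sha256((SALT + participant_name).encode()).hexdigest()[:12]

def log_event(participant_name: str, event: str, detail: str = "") -> None:
    record = {
        "who": anonymous_id(participant_name),  # hashed, not the raw name
        "when": time.time(),
        "event": event,
        "detail": detail,
    }
    with open("usage_log.jsonl", "a") as log_file:
        log_file.write(json.dumps(record) + "\n")

# Example usage during a deployed study session:
log_event("Christine Robson", "picker_opened")
log_event("Christine Robson", "person_selected", detail="row 3")
```

Storing only the hash supports the earlier points about anonymous storage; deleting a participant's data then means removing the records that carry their hashed ID.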
The participant's perspective
– User testing can be intimidating
  – Pressure to perform and to please the observer
  – Fear of embarrassment
  – Fear of critiquing (cultural)
– You must remain unbiased and inviting
– More tips in the "Conducting the Test" reading by Rubin

Ethics
– Testing can be arduous
– Each participant should consent to being in the experiment (informally or formally)
  – Know what the experiment involves, what to expect, and what the potential risks are
– Participants must be able to stop without danger or penalty
– All participants are to be treated with respect

Assignment: Storyboard + Implementation
– Create a storyboard for the main tasks of your application
– Test it with at least one non-CS160 user
– Reflect on what you learned
  – How will you change the interface?
– Implement initiation of the Facebook application and database

Next time
– Lecture on implementing—hardware, sensors
– Tom Zimmerman, guest lecture