SIG-600: Usability Test of YFG-PBI


HCI460: Week 7 Lecture
October 21, 2009
Outline
 Midterm & Project 2c: Data Collection
 Project 2d: Usability Testing Report
 Formative vs. Summative Studies
 Overview of Project 3
 How Many Participants Should I Test?
 Statistics 101
 Project 3a: Test Plan
– Group Work
– Groups Present Their Ideas to the Class
– Feedback
2
Midterm & Project 2c: Data Collection
3
Midterm & Project 2c: Data Collection
Status Check
 Midterm due by midnight but:
– “Late work will be accepted without penalty until 2pm on the day
following the due date.” (from the syllabus)
 Testing last week
– Feedback was provided to all groups on Wednesday and
individual feedback via email on Saturday.
• Feedback from the observers was very good.
– Assumption: Procedures have been modified based on
feedback.
 Data collection should be in progress at this point.
 Questions? Problems?
4
Project 2d: Usability Testing Report
5
Project 2d: Usability Testing Report
What Should We Turn In?
 Zip file with five documents in MS Word:
– Test Plan
– Screener
• Improved based on our feedback
– Moderator’s Guide
– Final Report
– Team Members’ Contributions to Project 2
• Detailed list of who did what.
 Include your group number in the file names.
 Submit to COL by midnight on the 28th (next Wednesday).
– DL students: Nov 1st (Sunday).
6
Project 2d: Usability Testing Report
When Writing Your Report, Assume That…
 This is a real work project.
– Do not think of the report as homework.
 Your audience is the stakeholders who paid for the study.
– They paid for the study so that they can improve the product.
– Make sure your report helps them do that.
 Not all stakeholders were able to observe the study, so they may not
be very familiar with:
– The product you tested
– Your methodology
– Or usability testing in general
7
Project 2d: Usability Testing Report
Report Writing Process (Formative Studies)
 STEP 1: Preliminary results / Key Findings Report
– Bulleted list including trends in findings but no
recommendations (or conservative recommendations).
– Start writing during testing and finish right after.
• Use breaks (e.g., lunch break discussion with stakeholders)
• Notetaker on the 2nd day of testing
• On the plane on the way back
– Deliver up to 24 hrs after testing
 STEP 2: Quantitative data aggregation and analysis
– Spreadsheet with accuracy data, ratings, # of participants who
did something (e.g., made a specific error, used a particular
method/path)
 STEP 3: Final report
8
Project 2d: Usability Testing Report
Quantitative Data in a Formative Study
 Common measures:
– Time on task
– Task accuracy (e.g., number of errors per task, number of
participants who completed the task successfully → be explicit
about what counts as a success)
– User ratings
 We usually do not put raw data in the main report (OK in the
appendix). Data needs to be described.
– But how do we describe quantitative data in a formative
study report?
9
Project 2d: Usability Testing Report
Quantitative Data in a Formative Study
 Example 1: Task success
– Data:
– Description in the report:
• “4 out of 5 participants completed the task successfully.”
• Can we say 80%?
10
Project 2d: Usability Testing Report
Quantitative Data in a Formative Study
 Example 2: Ease of use ratings (1 – very difficult, 5 – very easy)
– Data:
– Description in the report:
• “On average, participants rated the ease of use of the device
for task X as 2 (somewhat difficult).”
• “On average, participants rated the overall ease of use of the
device as 2 (somewhat difficult).”
11
Project 2d: Usability Testing Report
Quantitative Data in a Formative Study
 Example 3: Ease of use ratings (1 – very difficult, 10 – very easy)
– Data:
– Description in the report:
• “On average, participants rated the overall ease of use of the
device as 2.5.”
– Is this really a good description of the data?
• “All participants but one rated the overall ease of use of the
device as 1 (very difficult).”
– This is a better description.
12
Project 2d: Usability Testing Report
Quantitative Data in a Formative Study
 Example 3: Ease of use ratings (1 – very difficult, 10 – very easy)
– Data:
– Try the median instead of the average/mean.
• Median = middle value when all values are listed in
ascending order.
• “The median ease-of-use rating was 1 (very difficult).”
• But be careful: Stakeholders may not know what a median is.
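The pull of a single outlier on the mean can be seen in a quick sketch (hypothetical ratings, chosen to reproduce the 2.5 average from Example 3):

```python
from statistics import mean, median

# Hypothetical ease-of-use ratings (1 = very difficult, 10 = very easy),
# chosen so that all participants but one rate the task 1, as in Example 3.
ratings = [1, 1, 1, 1, 1, 10]

print(mean(ratings))    # 2.5 -- pulled up by the single outlier
print(median(ratings))  # 1.0 -- matches what most participants reported
```

The median reports what the typical participant did, which is why it reads better in a formative report when ratings are skewed.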
13
Project 2d: Usability Testing Report
Quantitative Data in a Formative Study
 Example 4: Time on task
– Data:
– Description in the report:
• “All participants completed the task in less than 20 seconds.”
• “On average, participants completed the task in 13 seconds.”
• “Task completion time ranged from 9 to 18 seconds, with an
average of 13 seconds.”
14
Project 2d: Usability Testing Report
Quantitative Data in a Formative Study
 Example 5: Time on task
– Data:
– Description in the report:
• “On average, participants completed the task in 50.2 seconds.”
– Not great, even if you provide the range.
• “All participants but one completed the task in less than 20 s.
One participant …. [explain what happened].” → OK
• “The median time on task was 14 seconds.” → OK
15
Project 2d: Usability Testing Report
Quantitative Data in a Formative Study
 What if participants didn’t complete the task successfully?
– Should you report their time on task and ratings?
– Why / why not?
 It is a good practice to report:
– Number of participants who succeeded / failed
– Time on task for those who succeeded
– Ratings for those who succeeded
16
Project 2d: Usability Testing Report
Quantitative Data in a Formative Study
 Do not just describe your quantitative data.
– The quantitative data on their own are not useful.
 Remember to include the “SO WHAT?”
 Example:
– Step 1: Describe your quantitative data:
• “1 of 5 participants succeeded in completing task X.”
– Step 2: Include the description in a story:
• Task X was particularly difficult. Only 1 of 5 participants was
able to complete it successfully. The main reason why most
participants failed was… . Other difficulties that participants
encountered when attempting the task were…. .
17
Project 2d: Usability Testing Report
Qualitative Data
 Identify errors and difficulties.
For example:
– Participants didn’t know why
Levels 2, 3, and 4 had heart
icons and Level 1 did not.
– None of them realized that
the icon indicated plans with
an included continuing care
option.
– Participants associated the
icon with heartworm
coverage or, in general, with
more comprehensive
coverage.
18
Project 2d: Usability Testing Report
Qualitative Data
 Identify THE SOURCE of errors
and difficulties.
– Attribute a product-related
reason for participant
difficulties.
– WHY did the error/difficulty
occur?
 You cannot make any
recommendations if you haven’t
identified the source of error.
19
Project 2d: Usability Testing Report
Qualitative Data
 What is the source of error in
our example? Why didn’t
participants know that the heart
= continuing care option?
 Possible reasons:
– No clues in the interface that
the icon indicates continuing
care option.
• Lack of visual
association between the
icon and the link.
• Misleading icon (there is
nothing about a heart
and a plus that indicates
continuing care)
20
Project 2d: Usability Testing Report
How is UT Report Different From Eval Report?
 The finding description is reversed:
– Finding in an evaluation report:
• Description of a problem with the product. (PRODUCT-ORIENTED)
• Justification: How can this affect users? (TASK/USER-ORIENTED)
– Finding in a usability testing report:
• Description of a participant difficulty/error. (TASK/USER-ORIENTED)
• Description of the source of the difficulty. (PRODUCT-ORIENTED)
 Usability testing (UT) reports can contain quantitative data.
 UT reports should describe who the study participants were.
– Participant description should be included in the Methodology
section.
– Take a look at participants’ answers to your screener and warmup questions and summarize them.
21
Project 2d: Usability Testing Report
What Should Your Report Include?
 Executive Summary
 Introduction
– Describe what you tested and why (objectives of the study).
 Methodology
– Describe the participants and procedure (including your tasks
and questions).
 Findings and Recommendations
– Include severity ratings.
– Each finding should contain the description of participant
difficulty and the description of the source of this difficulty
(source in the product design).
– Illustrate your findings. Stakeholders may not be very familiar
with the product tested.
22
Project 2d: Usability Testing Report
Grading Criteria
 EXECUTIVE SUMMARY, INTRODUCTION, AND METHODOLOGY
Criteria (each graded Yes / No, with comments):
– Executive summary summarizes the contents of the report
– Introduction appropriately describes:
• Product evaluated
• Objectives of the study
– Methodology section appropriately describes:
• Participants
• Procedure
23
Project 2d: Usability Testing Report
Grading Criteria
 FINDINGS
Criteria (each graded Yes / No, with comments):
– Report contains a sufficient number of findings
– Findings are organized in a way that makes sense and the organization is explicit
– Positive findings are included as well as usability issues
– It is clear to which part of the interface each finding corresponds
– Severity ratings are easy to understand
– Appropriate severity ratings accompany each usability issue
– Descriptions of the findings (descriptions of participants’ difficulties/errors) are appropriate, precise, and concise
– Each finding is accompanied by a description of the source of the difficulty/error. The source is product-oriented.
– Quantitative findings are properly described.
– Quantitative findings are a part of a “story.” They are accompanied by qualitative findings.
24
Project 2d: Usability Testing Report
Grading Criteria
 RECOMMENDATIONS
Criteria (each graded Yes / No, with comments):
– Recommendations accompany all usability issues
– Recommendations appropriately address the issues
– Recommendations are specific and actionable
25
Project 2d: Usability Testing Report
Grading Criteria
 QUALITY OF PRESENTATION
Criteria (each graded Yes / No, with comments):
– Report is well structured, well laid out, visually pleasing, and easy to read
– Language used throughout the report is professional
– Report is free of grammatical and spelling errors
26
Formative vs. Summative Studies
27
Formative vs. Summative Studies
Formative Studies
 Diagnostic
 Qualitative
 Study goal: Determine key
strengths and weaknesses of the
system
 Ultimate goal: Improve the user
experience
 E.g., “We want to improve the
website we are developing, so that
the users can easily find what they
are looking for.”
28
Formative vs. Summative Studies
Summative Studies
 Verification
 Quantitative
 Study goal: Determine how the system
compares to usability standards,
benchmarks or competitors
 Ultimate goal: To make sure the
product is ready for launch (can involve
making legally defensible statistical,
marketing, or regulation claims)
 E.g., “We need to be sure that the new
drug label will not increase the number
of dispensing errors.”
 Results should often be statistically
valid.
29
Overview of Project 3
30
Overview of Project 3
Summative Study: A – B Testing
 Compare a product to:
– Another version of the product or
– A competitor product
 Compare the products in terms of (pick one or more):
– Effectiveness?
– Efficiency?
– User satisfaction?
 Quantitative measures (match to objectives)
 Participants:
– Higher sample size than in our formative study
 Tasks
– Short but at least 3
31
Overview of Project 3
Example Idea: Insurance Policy Declaration
 Old page vs. new page
 Research question: Does the new page help users find information
more efficiently than the old page?
 Tasks:
– Find the name of the person who is not covered under this
policy?
– Find….
 Participant finds correct element/section, points to it, & says “done.”
 Measure:
– Time on task (measured from when the moderator says “start” to
when the participant says “done”).
 30 participants, between-subjects design
– 15 get old and 15 get new
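For the eventual analysis, the two groups’ times could be compared with a two-sample t statistic. A sketch with hypothetical data (5 per group here for brevity rather than the 15 per group called for; a real analysis would use a statistics package to get p-values and degrees of freedom):

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical time-on-task data (seconds) from a between-subjects A-B test.
old_page = [18, 16, 17, 15, 19]
new_page = [12, 11, 13, 10, 12]

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances allowed)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

print(round(welch_t(old_page, new_page), 2))  # 6.19
```

A larger absolute t value is stronger evidence that the two mean times really differ.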
32
Overview of Project 3
Example Idea: Weather Watcher Icons
 Icons in ver 5.4b vs. icons in ver 5.0
 Research question:
– Does the new set of icons help users
associate them with their functions?
33
Overview of Project 3
Example Idea: Weather Watcher Icons
 32 participants, between-subjects study design
– 16 saw ver 5.4 and 16 saw ver 5.0
34
Overview of Project 3
Example Idea: Weather Watcher Icons
 24 tasks:
 Each of the seven icons was
addressed by at least one task
and there was only one correct
answer for each task.
 Participants had to point to the
icon that they would click on to
complete each of the tasks.
 Measure: percentage of errors.
– Error for icon x = selecting
an icon other than icon x for
a task that could only be
completed by selecting icon
x.
35
Overview of Project 3
Example Idea: Weather Watcher Icons
 Individual error percentages (chart)
36
Overview of Project 3
Example Idea: Weather Watcher Icons
 Average error percentages and statistics (chart)
37
How Many Participants Should I Test?
38
How Many Participants Should I Test?
Why Is This an Important Question?
 Participants are expensive:
– Recruiting cost
– Compensation
– Time to conduct session
– Time to analyze/synthesize the data
 Goal:
– To learn what we need to learn using the minimum number of
participants
39
How Many Participants Should I Test?
So, How Many?
 It depends.
 What questions are we trying to answer?
 What type of data do we need?
 What type of usability test is it?
– Formative
– Summative
40
How Many Participants Should I Test?
Formative Testing
41
How Many Participants Should I Test?
Sample Size for Formative Testing
 In general, formative testing requires fewer
participants than summative testing.
– But how many?
 You need enough participants to find a
substantial number of usability issues.
– Ideally, you would like to find the most severe issues.
 There is a debate about the perfect
number of participants:
– 5 participants (Jakob Nielsen*)
– Definitely more than 5 (Jared Spool**, Rolf
Molich)
*Nielsen, J. & Landauer, T.K. (1993). A Mathematical Model of the Finding of Usability Problems. Proceedings
of ACM INTERCHI 1993 Conference.
**Spool, J. & Schroeder W. (2001). Testing Web Sites: Five Users Is Nowhere Near Enough. Proceedings of
ACM CHI 2001 Conference.
42
How Many Participants Should I Test?
The Magic Number 5
 Zero participants = zero insights
 One participant provides almost
a third of all there is to know
about the usability of the design.
 Second participant will have
some of the same problems
and some new ones.
 Third participant will have many of the problems you have seen and
a few new ones.
 As you add more and more participants, you learn less and less
because you will keep seeing the same things again and again.
 After the fifth participant, you are wasting your time by observing the
same findings repeatedly but not learning much new.
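The diminishing returns described above follow from the Nielsen & Landauer problem-discovery model. A minimal sketch, assuming the commonly cited per-participant discovery rate of p ≈ 0.31 (an estimate from their data, not a universal constant):

```python
# Nielsen & Landauer problem-discovery model: the proportion of usability
# problems found by n participants is 1 - (1 - p)^n, where p is the chance
# that a single participant exposes a given problem.

def proportion_found(p, n):
    return 1 - (1 - p) ** n

for n in range(1, 7):
    print(n, round(proportion_found(0.31, n), 2))
# ~31% after one participant, ~84% after five: each added participant
# contributes less and less new information.
```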
© July 17, 2015 – HCI460: Week 3
43
How Many Participants Should I Test?
Why Not 15?
 You need 15 participants to
uncover 100% of usability
problems.
– So why 5 and not 15?
 Not enough benefit for the cost incurred.
 Iterative testing is much more
beneficial:
– Test 5 > improve design > test 5
> improve design > test 5 > improve design.
– The second test will test the improvements made based on the
first test and find additional issues, etc.
44
How Many Participants Should I Test?
Sample Size Calculator for Formative Studies
 Jeff Sauro’s Sample Size Calculator for Discovering Problems in a
User Interface:
http://www.measuringusability.com/problem_discovery.php
45
How Many Participants Should I Test?
When to Test More Participants?
 If there is more than one distinct
user group (e.g., if a product will be
used by both patients and nurses).
– Test 3-4 participants from each
user group if there are two
groups.
– Test 3 participants from each
user group if there are three or
more groups.
– There will still be overlap
between the groups (we are all
human).
 However, if the user groups will be
doing tasks in different sections of
the site, you will still need 5
participants per group.
46
How Many Participants Should I Test?
When to Test More Participants?
 If the product is complex
and there are too many tasks
for one person to do.
– You can divide tasks
between participants,
e.g.:
• Test tasks 1 – 10 with
one group of 5
participants.
• Test tasks 11 – 20
with another group of
5 participants.
47
How Many Participants Should I Test?
When to Test More Participants?
 To pass a “smell check.”
– We may need more
participants to convince
stakeholders that the
problem is “really there.”
– You know your
stakeholders. They may
not want to change a
product based on what 5
people did but they will
based on what 20 people
did.
– Higher sample size gives
them more comfort and
confidence that the
people are not outliers.
48
How Many Participants Should I Test?
Summative Testing
49
How Many Participants Should I Test?
Sample Size for Summative Research
 The last test before launch
 Is the product ready for launch? (good enough, acceptable, etc.)
– Do we want to know directionally?
– Or definitively?
 We are always collecting measures (e.g., success rate, time on task)
– Directionally → descriptive statistics (means, %)
• 97% succeeded
• Sample size needs to pass a “smell check” (e.g., 30, 40)
– Definitively → inferential statistics
50
How Many Participants Should I Test?
Knowing “Definitively”
 For a “definitive” answer, sample size can be based on:
– Precision
• Think “score”
• E.g., does the 85% success rate generalize to the
population?
• Is the number itself “real?”
• More frequent in surveys than usability testing
– Power for a hypothesis test
• Think “difference”
• E.g., is the difference between success rate with Device A
(85%) and Device B (60%) “real?”
51
How Many Participants Should I Test?
Summative Testing
PRECISION TESTING
52
How Many Participants Should I Test?
Sample Size for Precision Testing
 We need sufficient sample size to be able to generalize the results
to the population.
 Sample size for precision testing depends on:
– Confidence level (usually 95% or 99%)
 Example:
– Assume your study found a score of 80 (we commonly stop at
the score).
– How confident are we that the score of 80 can generalize to the
population?
53
How Many Participants Should I Test?
Think of Data as Darts
 The actual population is shown
 This yellow region = 95% of population data
 20 darts represent the results from 20 studies—just like your one
study
How confident are you that your
study’s result is not this dart?
54
How Many Participants Should I Test?
Sample Size for Precision Testing
 We need sufficient sample size to be able to generalize the results
to the population.
 Sample size for precision testing depends on:
– Confidence level (usually 95% or 99%)
– Desired level of precision
• Acceptable sampling error (+/- 5%)
55
How Many Participants Should I Test?
Sample Size for Precision Testing
 We need sufficient sample size to be
able to generalize the results to the
population.
 Sample size for precision testing
depends on:
– Confidence level (usually 95% or
99%)
– Desired level of precision
• Acceptable sampling error
(+/- 5%)
– Size of population to which we
want to generalize the results
56
How Many Participants Should I Test?
Sample Size Calculators
 There are many sample size calculators, e.g.*:
* Free online sample size calculator from Creative Research Systems: http://www.surveysystem.com/sscalc.htm
57
How Many Participants Should I Test?
Example 1: Presidential Election Poll
 Assumptions:
– Confidence level: 95%
– Acceptable sampling error: +/- 3%
– Population of 100M voters
58
How Many Participants Should I Test?
Example 1: Presidential Election Poll
 Assumptions:
– Confidence level: 95%
– Acceptable sampling error: +/- 3%, +/- 5%
– Population of 100M voters
59
How Many Participants Should I Test?
Example 1: Presidential Election Poll
 Assumptions:
– Confidence level: 95%
– Acceptable sampling error: +/- 3%, +/- 5%, +/- 7%
– Population of 100M voters
60
How Many Participants Should I Test?
Example 2: Survey of Insulin Pump Users
 Assume population of 1M users.
– How does the sample size change?
61
How Many Participants Should I Test?
Example 2: Survey of Insulin Pump Users
 Assumptions:
– Confidence level: 95%
– Acceptable sampling error: +/- 3%, +/- 5%, +/- 7%
– Population of 1M users
 Sample size for a 100M population ≈ sample size for a 1M population
62
How Many Participants Should I Test?
Sample Size for Precision Testing
 Confidence level: 95%

  Population    +/- 3%    +/- 5%    +/- 7%
  100M            1067       384       196
  1M              1066       384       196
  100,000         1056       383       196
  10,000           964       370       192
  1,000            516       278       164
  100               92        80        66
 When generalizing a score to the population, high sample size is
needed.
 However, the more the better is not true.
– Getting 2000 participants is a waste.
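The table values can be reproduced with the standard formula for estimating a proportion, plus a finite population correction. A sketch, assuming z = 1.96 (95% confidence) and worst-case p = 0.5 (rounding conventions vary slightly between calculators):

```python
# Sample size for precision testing: n0 = z^2 * p(1-p) / e^2 for an
# infinite population, then corrected for a finite population of size N.

def sample_size(margin, population, z=1.96, p=0.5):
    n0 = z ** 2 * p * (1 - p) / margin ** 2         # infinite-population size
    return round(n0 / (1 + (n0 - 1) / population))  # finite population correction

print(sample_size(0.03, 100_000_000))  # 1067
print(sample_size(0.05, 1_000_000))    # 384
print(sample_size(0.07, 100))          # 66
```

Note how the population size barely matters until it gets small, which is why the 100M and 1M rows are nearly identical.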
63
How Many Participants Should I Test?
Summative Testing
HYPOTHESIS TESTING (comparing means)
64
How Many Participants Should I Test?
Sample Size for Comparing Two Means
 Hypothesis Testing
– E.g., accuracy of typing on Device A is significantly better than it
is on Device B.
– Inferential statistics
 Necessary sample size is derived from a calculation of power.
– Under assumed criteria, the study will have a good chance of
detecting a significant difference if the difference indeed exists.
 Sample size depends on:
– Assumed confidence level (e.g., 95%, 99%)
– Acceptable sampling error (e.g., +/- 5%)
– Expected effect size
– Power
– Statistical test (e.g., t-test,
correlation, ANOVA)
65
How Many Participants Should I Test?
It Also Depends on Stats
 Many different power analyses
depending on statistical test
 “Practical” sample size
discussion
 Tools
 So, how many participants
should we test?
66
How Many Participants Should I Test?
Hypothesis Testing: Sample Size Table*
*Cohen, J. (1992). A Power Primer. Psychological Bulletin, 112 (1), http://www.math.unm.edu/~schrader/biostat/bio2/Spr06/cohen.pdf.
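As a rough illustration of how alpha, power, and effect size drive sample size, here is a normal-approximation sketch for a two-sided, two-sample t-test; Cohen’s exact t-based table runs a participant or two higher per group:

```python
from math import ceil
from statistics import NormalDist

# Normal-approximation per-group sample size for a two-sided two-sample
# t-test: n = 2 * (z_alpha/2 + z_beta)^2 / d^2, where d is Cohen's d.

def n_per_group(effect_size, alpha=0.05, power=0.80):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = .05
    z_b = NormalDist().inv_cdf(power)          # ~0.84 for power = .80
    return ceil(2 * (z_a + z_b) ** 2 / effect_size ** 2)

print(n_per_group(0.2))  # small effect:  393 per group
print(n_per_group(0.5))  # medium effect:  63 per group
print(n_per_group(0.8))  # large effect:   25 per group
```

The practical point survives the approximation: detecting a small effect takes an order of magnitude more participants than detecting a large one.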
67
How Many Participants Should I Test?
Reality
 Usability tests do not typically require statistical significance.
 Objectives dictate type of study and reasonable sample sizes
necessary.
 Sample size used is influenced by many factors—not all of them
statistically driven.
 Power analysis provides an estimate of the sample size necessary to
detect a difference, if it does indeed exist.
 Risk of not performing power analysis?
– Too few → low power → inability to detect a difference
– Too many → waste (and possibly find differences that are not real)
 What if you find significance even with a small sample size?
– It is probably really there (at a certain p level)
68
How Many Participants Should I Test?
Summary
How many participants?
 Formative test?
– Nielsen’s 5 or more?
– Sauro’s calculator?
– How complex is the system? How many tasks?
– How many distinct user groups?
– Sample size must pass a “smell check”
 Summative test?
– Need a directional answer?
– Need a definitive answer?
• Precision testing? (generalize score to population)
• Hypothesis testing? (A different than B?)
 Project 3
69
Statistics 101: Part A
70
Statistics 101: Part A
Why Know Stats?
 Because YOU are the EXPERT
 When used appropriately, they can answer questions really well
 When used inappropriately, they are dangerous and can be a huge
time suck without benefit
 Learn when to do things and when to not
– It is about providing you with information to make decisions and
support the decisions made
71
Statistics 101: Part A
Foundation
 Statistics and experimental design go hand in hand
 In fact, it is about the scientific method
– Question: Is what I see REALLY TRUE?
– Observe something:
• Gavin comes in and eats ice cream during lecture
– Does this behavior generalize to all lecturers?
– Does this predict a good lecture?
– Statistics can separate the silly from reality
 When we observe something, people will try to generalize to the
population (global, US, mobile phone users, shoe salesmen, etc.)
– In some ways, it is about learning and ultimately about survival
72
Statistics 101: Part A
Definitions
 Definition of statistics:
– Method to handle data. A set of procedures for describing
measurements and for making inferences about what is generally
true
 Common statistical tests
– T-Test
– ANOVA
– Chi-square
– Regression / correlation
– Multivariate analyses
– Factor analysis
73
Statistics 101: Part A
Observation = Measurement (1 of 2)
 Ratio / interval scale
– Device A is 72 inches in length while Device B is 36 inches
• Can say this: One is twice as long as the other
– Differences are comparable
• Yao Ming (7’6”) is a foot taller than Kobe Bryant (6’6”)
• Kobe Bryant is a foot taller than Tom Cruise
– Doing this allows for powerful stats
 Absolute “zero”
– Ratio scales have a zero
• e.g., measuring height
– Interval scales do not have a zero
• e.g., measuring temperature
– The difference between 40F and 50F equals the difference between 90F and 100F
74
Statistics 101: Part A
Observation = Measurement (2 of 2)
 Ordinal scales
– Rank data
– Can’t say that the difference between Miss America and the
runner-up is the same as the difference between #29 and #30
• The ordering Yao Ming > Kobe Bryant > Tom Cruise holds, but
substitute other names and the gaps between ranks can differ
 Nominal scales
– Classify into groups
– Count data
 Where do Likert scales fit in?
– Ordinal (Rank) that allows ties
– Different schools of thought that say it *could* be analyzed as if it
were interval or ratio
75
Statistics 101: Part A
More Common Statistical Tests
 You are actually well aware of statistics – Descriptive statistics!
 Measures of central tendency
– Mean
– Median
– Mode
 Definitions?
– Mean = ?
• Average
– Median = ?
• The exact point that divides the distribution into two parts
such that an equal number fall above and below that point
– Mode = ?
• Most frequently occurring score
76
Statistics 101: Part A
When In Doubt, Plot
 Take the scores and plot frequency (y-axis) against score (x-axis).
– The example scores form an approximately normal distribution.
77
Statistics 101: Part A
Kurtosis
 Mesokurtic
 Leptokurtic
 Platykurtic
 Descriptive stats
– Mean = 3
– Median = 3
– Mode = 3
78
Statistics 101: Part A
Skewed Distributions
 Positive skew → tail on the right
 Negative skew → tail on the left
 Impact to measures of central tendency?
– Mode
– Median
– Mean
 “Central tendency”
79
Statistics 101: Part A
Measurement Variability
 We must first understand variability
– i.e., what goes into the score, as nothing is perfect…
– Goal is to test if A is better than B, but what creates problems
besides the typical confounds of us messing up the study!
 Consider a time on task as a metric
– Measurement error
• The tool or procedure can be an imprecise device
– Starting and stopping the stop watch
– Individual differences
• Participants are different, so some get different scores than
others on the same task. Since we are testing for differences
between A and B, this can be a problem
– Unreliable
• We are human, so if you test the same participant on
different days and you might get a different time!
80
Statistics 101: Part A
Between-Groups Designs
 Between-groups study splits sample into two or more groups
 Each group only interacts with one device
 What causes variability?
– Measurement error
• The tool or procedure can be an imprecise device
– Starting and stopping the stop watch
– Unreliable
• We are human, so if you test the same participant on
different days and you might get a different time!
– Individual differences
• Participants are different, so some get different scores than
others on the same task. Since we are testing for differences
between A and B, this can be a problem
81
Statistics 101: Part A
What About Within-Groups Designs?
 Within-Groups study has participants interact with all devices
 What causes variability?
– Measurement error
• The tool or procedure can be an imprecise device
– Starting and stopping the stop watch
– Unreliable
• We are human, so if you test the same participant on
different days and you might get a different time!
– Individual differences
• Participants are different, so some get different scores than
others on the same task. Since we are testing for differences
between A and B, this can be a problem
• No longer applies
 Thus, fewer causes of variability result in greater statistical power
82
Statistics 101: Part A
Variability is Important to Understand
 Scores are scores, but the variability is important to inferential
statistics
 Descriptive statistics numerically describe the distribution
– “Central tendency” of what was observed (fact)
 Inferential statistics
– Provides guidance to answer real world questions (infer)
– A vs. B
83
Statistics 101: Part A
Consider SAT Scores
 Your score is in the 50th percentile
– Ethan and Madeline are the smartest kids in class
 For SAT scores, you saw your score—how did they get a percentile?
– Distribution is normal
– Numerically, the distribution can be described by only:
• Mean and standard deviation
84
Statistics 101: Part A
General Thinking
 Consider this bell curve
– What regions form?
– Gavin scored the exact mean
– Bob did better
– Aga did better
85
Statistics 101: Part A
Empirical Rule
 Empirical rule = 68/95/99.7
– 68% Mean +/- 1 std dev
– 95% Mean +/- 2 std dev
– 99.7% Mean +/- 3 std dev
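The empirical rule is just the area under a normal curve within 1, 2, and 3 standard deviations of the mean, and can be checked with Python’s standard library:

```python
from statistics import NormalDist

# Area under the standard normal curve within k standard deviations.
z = NormalDist()  # standard normal: mean 0, std dev 1

for k in (1, 2, 3):
    share = z.cdf(k) - z.cdf(-k)
    print(f"within {k} std dev: {share:.1%}")  # 68.3%, 95.4%, 99.7%
```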
86
Statistics 101: Part A
Clear on Normal Curves?
 Represent a single dataset on a single measure for a single sample
 Once data are normalized, you can describe the dataset simply by
– Mean and standard deviation
• 60% success rate with a std dev of 10%
87
Statistics 101: Part A
Population of Alarm Clock Users
 This is the population of alarm clock users
– Success data is shown as Pass and Fail users
– Overall, 60% Success
88
Statistics 101: Part A
Study Results
 Completed a study with N=20
– Design A scores: 7 Pass, 3 Fail or 70% success
– Design B scores: 9 Pass, 1 Fail or 90% success
 This is really two samples of 10 drawn from populations who may
have these characteristics
89
Statistics 101: Part A
But, How Do We Know?
 Is this really two samples of 10 drawn from a population that may
have these characteristics?
90
Statistics 101: Part A
Tasting Soup
 You don’t need to drink half a pot to see if the soup is done / tasty
 As long as the soup is well mixed, you should be able to just take
one or two tastes, with “well mixed” meaning:
– Randomly sampled
– Normally distributed
– Variances are equal
– Samples are independent
 Can you say these are different?
– Missing confidence intervals?
91
Statistics 101: Part A
Confidence Intervals Matter
 A confidence interval of 95% says that the dot falls within that range
95% of the time
 With confidence intervals attached, can you say these are different?
 These lines are affected by:
– Confidence interval itself
– Variability
– Sample size
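One way to see why overlap matters: attach rough 95% confidence intervals to the two success rates from the earlier N=20 example. This is a Wald-interval sketch; with samples this small, a Wilson or exact interval would normally be preferred:

```python
from math import sqrt

# Wald 95% confidence interval for a success proportion (rough sketch).
def wald_ci(successes, n, z=1.96):
    p = successes / n
    half_width = z * sqrt(p * (1 - p) / n)
    return (p - half_width, p + half_width)

design_a = wald_ci(7, 10)  # 70% success
design_b = wald_ci(9, 10)  # 90% success
print(design_a, design_b)  # the two intervals overlap
```

Because the two intervals overlap, the 70% vs. 90% difference cannot be called “real” from this sample alone.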
92
Statistics 101: Part A
What If Variability Were Better Controlled?
 What if we could be more precise or reduce variability, as in a
within-groups design?
– The confidence level still stays the same at 95%
– But the variability is reduced
 With confidence intervals attached, can you say these are different?
 Yes! And in fact, you do not need stats
93
Statistics 101: Part A
Usually, Inferential Stats are Needed
 The world is usually dirty
– There is typically overlap
 Statistics can determine if these are significantly different
94
Project 3a: Test Plan
95
Project 3a: Test Plan
Group Work
 In your project groups, come up with a plan of action for Project 3.
– Objective (just one)
– Stimuli (A and B)
– Tasks (short but at least 3)
– Measure(s)
– Study design
• Between-subjects or within-subjects?
• All tasks with design A and then all tasks with design B or
Task 1 A – B, Task 2 A – B etc.?
• Counterbalanced or randomized version/task presentation?
 Be ready to present your ideas to the class in 20 minutes.
96
Project 3a: Test Plan
Next Steps
 Write up your idea and submit through COL
– Be concise (half a page to a page)
– Be clear
– This is not a formal Test Plan for stakeholders
• Goal: For Aga & Gavin to make sure you have a solid
research plan for Project 3
97