Evaluate—Quantitative Methods
October 4, 2007
Turn in Project Proposal

NEEDS · DESIGN · EVALUATE · IMPLEMENT
Today
 Quantitative methods
– Scientific method
– Aim for generalizable results
 Privacy issues when collecting data
Quantitative methods
 Reliably measure some aspect of an interface
– Especially to measurably compare interfaces
 Approaches
– Controlled experiments (see Doing Psychology Experiments, David W. Martin, 7th edition, 2007)
– Collect usage data
Designing an experiment
 State hypothesis
 Identify variables
– Independent
– Dependent
 Design experimental protocol
 Apply for human subjects review
 Select user population
 Conduct experiment
Conducting the experiment
 Run pilot test
 Collect data from running experiment
 Perform statistical analysis
 Interpret data, draw conclusions
Experiment hypothesis
 Testable hypothesis
– Precise statement of expected outcome
– More specifically, how you predict the dependent variable (e.g., accuracy) will depend on the independent variable(s)
 “Null” hypothesis (H0)
– States that there will be no effect
– e.g., “There will be no difference in performance between the two groups”
– Data are used to try to disprove this null hypothesis
Experiment design
 Independent variables
– Attributes we manipulate / vary across conditions
– Levels: the values of the attribute
 Dependent variables
– Outcome of experiment; measures used to evaluate
– Usually measure user performance
– Time to completion
– Errors
– Amount of production
– Measures of satisfaction
Experiment design (2)
 Control variables
– Attributes that remain the same across conditions
 Random variables
– Attributes that are randomly sampled
– Can be used to increase generalizability
 Avoiding confounds
– Confounds are attributes that changed but were not accounted for
– Confounds prevent drawing conclusions about independent variables
Example: Person picker
 Picking from a list of names to invite to use a facebook application
– Bryan Tsao
– Christine Robson
– David Sun
– John Tang
– Jonathan Tong
– …
Example: Variables
 Independent variables
– Picture vs. no picture
– Ordered horizontally or vertically
– One column vs. two columns
 Dependent variables
– Time to complete
– Error rate
– User perception
 Control variables
– Test setting
– List to pick from
 Random variables
– Subject demographics
 Confounds
– Only one woman in list
– List mostly Asians
Experimental design goals
 Internal validity
– Cause and effect: change in independent variables → change in dependent variables
– Eliminating confounds (turn them into independent variables or random variables)
– Replicability of experiment
 External validity
– Results generalizable to other settings
– Ecological validity: generalizable to the real world
 Confidence in results
– Statistical power (number of subjects, at least 10)
Experimental protocol
 Defining the task(s)
 What are all the combinations of conditions?
 How often to repeat each condition combination?
 Between-subjects or within-subjects?
 Avoiding bias (instructions, ordering)
Task
 Defining a task to test the hypothesis
– Pictures will lead to fewer errors
– Same time to pick users with and without pictures (H0)
– Pictures will lead to higher satisfaction
 How do you present the task?
 Task: Users must select the following list of people to share the application with
– Jonathan Tong
– Christine Robson
– David Sun
Motivating user tasks
 Create a scenario, a movie plot for the task
– Immerse the subject in a story that removes them from the “user testing” situation
– Focus the subject on the goal; the system becomes a tool (and more subject to critique)
Number of Conditions
 Consider all combinations to isolate the effects of each independent variable:
(2 order) × (2 columns) × (2 format) = 8
– Picture + text, horizontal, 1 column
– Picture + text, horizontal, 2 columns
– Picture + text, vertical, 1 column
– Picture + text, vertical, 2 columns
– Text only, horizontal, 1 column
– Text only, horizontal, 2 columns
– Text only, vertical, 1 column
– Text only, vertical, 2 columns
 Adding levels or factors → exponential growth in combinations
 This can make experiments expensive!
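The combinatorial blow-up above is easy to see in code. A minimal sketch of enumerating all eight conditions of the person-picker example (the condition names are taken from the slide, not from any fixed API):

```python
# Enumerate every combination of the three independent variables:
# (2 formats) x (2 orderings) x (2 column counts) = 8 conditions.
from itertools import product

formats = ["picture + text", "text only"]
orders = ["horizontal", "vertical"]
columns = ["1 column", "2 columns"]

conditions = list(product(formats, orders, columns))
for c in conditions:
    print(c)
```

Adding a third level to any one factor multiplies the count again, which is the exponential growth the slide warns about.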
Reducing conditions
 Vary only one independent variable at a time
– But can miss interactions
 Factor experiment into a series of steps
– Prune branches if no significant effects are found
Choosing subjects
 Balance sample to reflect diversity of target user population (random variable)
– Novices, experts
– Age group
– Gender
– Culture
 Example
– 30 college-age subjects, normal or corrected-to-normal vision, with demographic distributions of gender and culture
Population as variable
 Population as an independent variable
– Identifies interactions
– Adds conditions
 Population as a control variable
– Consistency across experiment
– Misses relevant features
 Statistical post-hoc analysis can suggest the need for further study
– Collect all the relevant demographic info
Recruiting participants
 “Subject pools”
– Volunteers
– Paid participants
– Students (e.g., psych undergrads) for course credit
– Friends, acquaintances, family, lab members
– “Public space” participants, e.g., observing people walking through a museum
 Must fit user population (validity)
 Motivation is a big factor: not only $$ but also explaining the importance of the research
Current events: Population sampling issue
 Currently, election polling is conducted on land-line phones
– Legacy
– Laws about manual dialing of cell phones
– Higher refusal rates
– Cell phone users pay for incoming calls → pollsters have to compensate recipients
 What bias is introduced by excluding cell-phone-only users?
– 7% of population (2004), growing to 15% (2008)
– Which candidate claims polls underrepresent?
http://www.npr.org/templates/story/story.php?storyId=14863373
Between-subjects design
 Different groups of subjects use different designs
– 15 subjects use text only
– 15 subjects use text + pictures
Within-subjects design
 All subjects try all conditions
– The same 15 subjects use both designs
Within-Subjects Designs
 More efficient
– Each subject gives you more data: they complete more “blocks” or “sessions”
 More statistical “power”
– Each person is their own control; fewer confounds
– Therefore, can require fewer participants
 May mean a more complicated design to avoid “order effects”
– Participant may learn from the first condition
– Fatigue may make the second performance worse
Between-Subjects Designs
 Fewer order effects
 Simpler design & analysis
 Easier to recruit participants (only one session, shorter time)
 Subjects can’t compare across conditions
– Need more subjects
– Must control more for confounds
Within Subjects: Ordering effects
 Countering order effects
– Equivalent tasks (less sensitive to learning)
– Randomize order of conditions (random variable)
– Counterbalance ordering (ensure all orderings are covered)
– Latin Square ordering (partial counterbalancing)
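A minimal sketch of Latin Square ordering, using a simple cyclic construction (condition names are placeholders). Each row is one participant's condition order; every condition appears exactly once in each row and once in each position across rows. Full counterbalancing would instead enumerate all n! orders (e.g., with itertools.permutations):

```python
# Cyclic Latin square: row i is the condition list rotated by i.
# This gives partial counterbalancing of position effects.
def latin_square(conditions):
    n = len(conditions)
    return [[conditions[(row + pos) % n] for pos in range(n)]
            for row in range(n)]

# Placeholder condition names, not from the lecture's example.
for order in latin_square(["A", "B", "C", "D"]):
    print(order)
```

Note that a plain cyclic square does not balance immediate carry-over effects (which condition follows which); balanced Latin squares use a different row construction for that.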
Study setting
 Lab setting
– Complete control through isolation
– Uniformity across subjects
 Field study
– Ecological validity
– Variations across subjects
Before Study
 Always pilot test first
– Reveals unexpected problems
– Can’t change experiment design after collecting data
 Make sure participants know you are testing the software, not them
– (Usability testing, not user testing)
 Maintain privacy
 Explain procedures without compromising results
 Participants can quit anytime
 Administer signed consent form
During Study
 Always follow the same steps—use a checklist
 Make sure the participant is comfortable
 Session should not be too long
 Maintain a relaxed atmosphere
 Never indicate displeasure or anger
After Study
 State how the session will help you improve the system (“debriefing”)
 Show the participant how to perform failed tasks
 Don’t compromise privacy (never identify people; only show videos with explicit permission)
 Data to be stored anonymously, securely, and/or destroyed
Exercise: Quantitative test
 Pair up with someone who has a computer and has downloaded the files
 DO NOT OPEN THE FILE (yet)
 Make sure one of you has a stopwatch
– Cell phone
– Watch
 Computer user will run the test; observer will time the event
Exercise: Task
 Open the file
 Find the item in the list
 Highlight that entry like this
Example: Variables
 Independent variables
 Dependent variables
 Control variables
 Random variables
 Confound
Data Inspection
 Look at the results
 First look at each participant’s data
– Were there outliers, people who fell asleep, anyone who tried to mess up the study, etc.?
 Then look at aggregate results and descriptive statistics
 “What happened in this study?” relative to hypothesis, goals
Descriptive Statistics
 For all variables, get a feel for the results: total scores, times, ratings, etc.
– Minimum, maximum
– Mean, median, ranges, etc.
 e.g., “Twenty participants completed both sessions (10 males, 10 females; mean age 22.4, range 18–37 years).”
 e.g., “The median time to complete the task in the mouse-input group was 34.5 s (min = 19.2, max = 305 s).”
 What is the difference between mean & median? Why use one or the other?
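The mean-vs-median question has a concrete answer with skewed timing data. A sketch with illustrative numbers (only 34.5, 19.2, and 305 come from the slide's example; the rest are made up): one stuck participant drags the mean far above the typical time, while the median barely moves.

```python
# Why report the median for timing data: a single outlier (the 305 s
# participant who got stuck) distorts the mean but not the median.
import statistics

times = [19.2, 25.0, 30.1, 34.5, 41.3, 305.0]

mean = statistics.mean(times)      # pulled far upward by the outlier
median = statistics.median(times)  # robust to the single extreme value
print(f"mean = {mean:.2f} s, median = {median:.2f} s")
```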
Subgroup Stats
 Look at descriptive stats (means, medians, ranges, etc.) for any subgroups
– e.g., “The mean error rate for the mouse-input group was 3.4%. The mean error rate for the keyboard group was 5.6%.”
– e.g., “The median completion times (in seconds) for the three groups were: novices 4.4, moderate users 4.6, and experts 2.6.”
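Computing a per-subgroup statistic needs nothing more than a dict of lists. A sketch with hypothetical raw timings, chosen so the medians reproduce the slide's example values:

```python
# Per-subgroup descriptive statistics: median completion time (seconds)
# for each experience group. The raw data are invented for illustration.
import statistics

times_by_group = {
    "novices":        [3.9, 4.4, 5.1, 4.2, 4.8],
    "moderate users": [4.6, 4.1, 5.0, 4.6, 4.7],
    "experts":        [2.2, 2.6, 3.0, 2.5, 2.9],
}

medians = {g: statistics.median(ts) for g, ts in times_by_group.items()}
print(medians)
```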
Plot the Data
 Look for the trends graphically
Other Presentation Methods
 Scatter plot (e.g., Age vs. Time in secs)
 Box plot (shows the middle 50%, the low and high extremes, and the mean)
Experimental Results
 How does one know if an experiment’s results mean anything or confirm any beliefs?
 Example: 40 people participated; 28 preferred interface 1, 12 preferred interface 2
 What do you conclude?
Inferential (Diagnostic) Stats
 Tests to determine whether what you see in the data (e.g., differences in the means) is reliable (replicable), and whether it is likely caused by the independent variables rather than by random effects
– e.g., t-test to compare two means
– e.g., ANOVA (Analysis of Variance) to compare several means
– e.g., test the “significance level” of a correlation between two variables
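The t-test and ANOVA are the standard tools here. A permutation test (not mentioned in the lecture, but implementable with the standard library alone) makes the underlying question concrete: how often would a mean difference at least this large appear if the group labels were assigned purely at random? All data below are hypothetical.

```python
# Permutation test sketch: approximate two-sided p-value for the
# difference between two group means, by repeatedly shuffling the
# pooled data and re-splitting it into two random "groups".
import random
import statistics

def permutation_test(a, b, n_iter=10000, seed=0):
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:len(a)]) -
                   statistics.mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_iter  # approximate p-value

# Hypothetical completion times (seconds) for two interface groups.
mouse = [34.5, 29.1, 41.0, 33.2, 30.8]
keyboard = [44.2, 39.9, 47.5, 41.1, 43.0]
p = permutation_test(mouse, keyboard)
print(f"p is approximately {p:.4f}")
```

A small p says the observed difference would rarely arise by chance alone, which is exactly what "statistically significant" claims.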
Means Not Always Perfect

Experiment 1:
– Group 1: 1, 10, 10 (mean: 7)
– Group 2: 3, 6, 21 (mean: 10)

Experiment 2:
– Group 1: 6, 7, 8 (mean: 7)
– Group 2: 8, 11, 11 (mean: 10)
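The point of the example: the group means match across the two experiments, but the spreads differ sharply, which is why inferential tests look at variance, not just means. A sketch (assuming the noisy triples 1, 10, 10 and 3, 6, 21 belong to Experiment 1 and the tight triples to Experiment 2):

```python
# Same group means, very different spreads: the standard deviation
# reveals what the mean alone hides.
import statistics

exp1 = {"Group 1": [1, 10, 10], "Group 2": [3, 6, 21]}
exp2 = {"Group 1": [6, 7, 8], "Group 2": [8, 11, 11]}

for label, groups in [("Experiment 1", exp1), ("Experiment 2", exp2)]:
    for g, data in groups.items():
        print(label, g,
              "mean =", statistics.mean(data),
              "stdev =", round(statistics.stdev(data), 2))
```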
Inferential Stats and the Data
 Ask diagnostic questions about the data
– Are these really different? What would that mean?
Hypothesis Testing
 Going back to the hypothesis—what do the data say?
 Translate the hypothesis into an expected difference in the measure
– If “First name” is faster, then TimeFirst < TimeLast
– Under the “null hypothesis” there should be no difference between the completion times: H0: TimeFirst = TimeLast
Hypothesis Testing
 “Significance level” (p)
– The probability of seeing an effect at least this large simply by chance, if the null hypothesis were true
– The cutoff or threshold level of p (the “alpha” level) is often set at 0.05, i.e., 5% of the time you would get the result you saw just by chance
– e.g., if your statistical t-test (testing the difference between two means) returns a t-value of t = 4.5 and a p-value of p = .01, the difference between the means is statistically significant
Errors
 Errors in analysis do occur
 Main types:
– Type I / false positive: you conclude there is a difference when in fact there isn’t
– Type II / false negative: you conclude there is no difference when there is
Drawing Conclusions
 Make your conclusions based on the descriptive stats, but back them up with inferential stats
– e.g., “The expert group performed faster than the novice group, t(1,34) = 4.6, p < .01.”
 Translate the stats into words that regular people can understand
– e.g., “Thus, those who have computer experience will be able to perform better, right from the beginning…”
Feeding Back Into Design
 Your study was designed to yield information you can use to redesign your interface
 What were the conclusions you reached?
 How can you improve on the design?
 What are the quantitative redesign benefits?
– e.g., 2 minutes saved per transaction, 24% increase in production, or $45,000,000 per year in increased profit
 What are the qualitative, less tangible benefits?
– e.g., workers will be less bored, less tired, and therefore more interested → better customer service
Remote usability testing
 Telephone or video communication
 Screen-sharing technology
– Microsoft NetMeeting
https://www.microsoft.com/downloads/details.aspx?FamilyID=26c9da7c-f778-4422-a6f4-efb8abba021e&DisplayLang=en
– VNC
http://www.realvnc.com/
 Greater flexibility in recruiting subjects, environments
Usage logging
 Embed logging mechanisms into code
 Study usage in actual deployment
 Some code can even “phone home”
 facebook usage metrics
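A minimal sketch of an embedded logging mechanism (the class and event names are hypothetical, not from any real toolkit): every UI action is timestamped and collected, so usage can be analyzed after deployment or uploaded when the code "phones home".

```python
# Collect timestamped usage events in memory; dump() serializes them
# so they can be written to disk or sent back to the researchers.
import json
import time

class UsageLog:
    def __init__(self):
        self.events = []

    def record(self, event, **details):
        # One dict per event: wall-clock time, event name, extra fields.
        self.events.append({"t": time.time(), "event": event, **details})

    def dump(self):
        return json.dumps(self.events)

log = UsageLog()
log.record("picker_opened", layout="2-column")
log.record("person_selected", name="Christine Robson", elapsed_s=3.4)
print(log.dump())
```

In a real deployment the log would be flushed periodically and, per the privacy discussion later in the lecture, anonymized and covered by informed consent.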
Example: Rhythmic Work Activity
• Drawn from about 50 Awarenex (IM) users
  • Bi-coastal teams (3-hour time difference)
  • Work-from-home team members
• Based on up to 2 years of collected data
Sun Microsystems Laboratories: James "Bo" Begole, Randall Smith, and Nicole Yankelovich
Activity Data Collected
• Activity information
  • Input device activity (1-minute granularity)
  • Device location (office, home, mobile)
  • Email fetching and sending
• Online calendar appointments
• Activity ≠ Availability
Actogram of an Individual's Computer Activity
[Figure: computer activity plotted by time of day and date]

Aggregate Activity
[Figure: aggregated computer activity]

Aggregate Activity with Appointments
[Figure: aggregated computer activity with calendar appointments overlaid]

Comparing Aggregates Among 3 Individuals
[Figure: appointments and aggregate computer activity for individuals a, b, and c]
Project deployment issues
 May have to be careful about widespread deployment of the application
– We’re only looking for a usability study with 4 people
– Widespread deployment would be cool
 BUT, widespread deployment may run into provisioning issues
– Provide feedback on server provisioning
Quantitative study of your project
 What are your measures?
– Task measures: performance time, errors
– Usage measures (facebook utilities)
 Identify independent, dependent, control variables
 Compute summary statistics
– Discussion section
Privacy issues in collecting user data
 Collecting data involves respecting users’ privacy
Informed consent
 Legal condition whereby a person can be said to have given consent based upon an appreciation and understanding of the facts and implications of an action
 EULA?
 But what about actions in public places?
 What about recording in public places?
Consent
 Why important?
– People can be sensitive about this process and these issues
– Errors will likely be made; the participant may feel inadequate
– May be mentally or physically strenuous
 What are the potential risks (there are always risks)?
 “Vulnerable” populations need special care & consideration
– Children; disabled; pregnant; students
Controlling data for privacy
 What data is being collected?
 How will the data be used?
 How can I delete data?
 Who will have access to the data?
 How can I review data before public presentations?
 What if I have questions afterwards?
[Figure: sample consent form annotated with: what activity is observed, what data are collected, how the data will be used, who can access the data, how to delete data, how to review data before it is shown publicly, and contact info for questions]
Human subjects review, participants, & ethics
 Academic and government research must go through a human subjects review process
 Committee for Protection of Human Subjects
– http://cphs.berkeley.edu/
 Reviews all research involving human (or animal) participants
 Safeguards the participants, and thereby the researcher and university
 Not a science review (i.e., not to assess your research ideas); only safety & ethics
 Complete Web-based forms, submit research summary, sample consent forms, training, etc.
 Practices in industry vary
The participant’s perspective
 User testing can be intimidating
– Pressure to perform, to please the observer
– Fear of embarrassment
– Fear of critiquing (cultural)
 You must remain unbiased and inviting
 More tips in the “Conducting the Test” reading by Rubin
Ethics
 Testing can be arduous
 Each participant should consent to be in the experiment (informally or formally)
– Know what the experiment involves, what to expect, and what the potential risks are
 Must be able to stop without danger or penalty
 All participants are to be treated with respect
Assignment: Storyboard + Implementation
 Create storyboard for the main tasks of your application
 Test with at least one non-CS160 user
 Reflect on what you learned
– How will you change the interface?
 Implement initiation of facebook application and database
Next time
 Lecture on implementing—hardware, sensors
 Tom Zimmerman, guest lecture