Toe to Toe: What’s Better? Moderated Lab Testing or Unmoderated Remote Testing?

Transcript

What’s Better?
Moderated Lab Testing or
Unmoderated Remote Testing?
Susan Fowler
FAST Consulting
718 720-1169  [email protected]
What’s in this talk
• Definitions & differences between moderated
and unmoderated tests
• What goes into a remote unmoderated test
script?
• What goes into the remote-study report?
• Comparisons between moderated and
unmoderated tests
2
Definition of terms
• Moderated: In-lab studies and studies using
online conferencing software with a moderator.
Synchronous.
• Unmoderated: Web-based studies using online
tools and no moderator. Asynchronous.
– Keynote Systems
– SurveyMonkey
– UserZoom
– Zoomerang
– WebSort.net
3
Differences
• Rewards:
– Moderated—$50 cash, gifts.
– Unmoderated—$10 online gift certificates, coupons
or credits, raffles.
• Finding participants:
– Moderated—use a marketing/recruiting company or
a corporate mail or email list.
– Unmoderated—send invitations to a corporate email
list, intercept people online, or use a pre-qualified
panel.
4
Differences
• Qualifying participants:
– Moderated—ask them; have them fill in a
questionnaire at start.
– Unmoderated—ask them in a screener section and
knock out anyone who doesn’t fit (age, geography,
disease, etc.).
5
Differences
• Test scripts:
– Moderated—the moderator has tasks he or she wants
the participant to do, and the moderator and the
notetakers track the questions and difficulties
themselves.
– Unmoderated—the script contains both the tasks and
the questions that the moderator wants to address.
6
Differences
• What you can test:
– Moderated—anything that you can bring into the
lab.
– Unmoderated—only web-based software or web
sites.
[Diagram: of web sites and web applications, desktop applications, hardware, and other applications, only the web-based products can be tested unmoderated.]
7
How Keynote Systems’ tool works
1. Client formulates research strategy & objectives.
2. A large, targeted sample of prescreened panelists is recruited.
3. Panelists access the web test from their natural home or office environment.
4. Panelists perform tasks & answer questions with the browser tool.
5. The tool captures panelists’ real-life behavior, goals, thoughts & attitudes.
6. Analyst delivers actionable insights and recommendations.
8
Creating an unmoderated test script
• Screener: “Do you meet the criteria for this test?”
• For each task: “Were you able to…?”
– Ask “scorecard” questions: satisfaction, ease of use, organization.
– Ask “What did you like?” and “What did you not like?”
– Provide a list of frustrations with an open-ended “other” option at the end.
• Wrap-up:
– Overall scorecard, “Would you return?”, “Would you recommend?”, and an email address for the gift.
9
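To make that structure concrete, here is a minimal sketch in Python (with invented field names and a hypothetical run_screener helper; none of the commercial tools is known to use this format) of a script laid out as data: a screener, per-task scorecard questions, and a wrap-up.

# Hypothetical sketch of an unmoderated test script: screener,
# per-task scorecard questions, and wrap-up questions.
TEST_SCRIPT = {
    "screener": [
        {"id": "age", "prompt": "What is your age?", "type": "number",
         "disqualify_if": lambda answer: answer < 18},
    ],
    "tasks": [
        {
            "id": "home_page",
            "instructions": "Learn what tugpegasus.org offers from the home page.",
            "questions": [
                {"id": "completed", "prompt": "Were you able to complete the task?",
                 "type": "single_select", "choices": ["Yes", "No"]},
                {"id": "ease", "prompt": "The task was easy.",
                 "type": "likert", "scale": 7},
                {"id": "liked", "prompt": "What did you like?", "type": "open_ended"},
                {"id": "frustrations", "prompt": "What frustrated you?",
                 "type": "multi_select",
                 "choices": ["Site was slow", "Too few search results",
                             "Page too cluttered", "Other (please describe)"]},
            ],
        },
    ],
    "wrap_up": [
        {"id": "return", "prompt": "Would you return to the site?", "type": "likert", "scale": 7},
        {"id": "recommend", "prompt": "Would you recommend the site?", "type": "likert", "scale": 7},
        {"id": "email", "prompt": "Email address for your gift:", "type": "open_ended"},
    ],
}

def run_screener(answers: dict) -> bool:
    """Return True if a participant passes every screener question."""
    for question in TEST_SCRIPT["screener"]:
        answer = answers.get(question["id"])
        rule = question.get("disqualify_if")
        if answer is None or (rule and rule(answer)):
            return False
    return True

print(run_screener({"age": 34}))  # True -> participant continues to the tasks

Keeping the script as plain data like this makes it straightforward to randomize, branch, and report on later.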
What a test looks like: Screen
The first few slides ask
demographic questions. They
can be used to eliminate
participants from the test.
10
What a test looks like: Task
11
For your first task, we would like your feedback on
the tugpegasus.org home page. Without clicking
anywhere, please spend as much time as you
would in real life learning about what
tugpegasus.org offers from the content on the
home page. When you have a good understanding
of what tugpegasus.org offers, please press
'Answer.'
12
What a test looks like: Task
You can create single-select
questions as well as Likert scales.
13
What a test looks like: Task
You can tie probing questions to earlier answers. The questions can be set up to respond to whether the earlier answer was negative or positive.
14
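As a rough illustration of that branching (hypothetical function and thresholds, not the tool’s actual logic), a follow-up probe can be chosen from an earlier 1–7 rating:

# Hypothetical sketch: pick a probing follow-up question based on an
# earlier Likert answer (1-7), so negative and positive answers get
# different probes.
def probing_question(earlier_rating: int) -> str:
    if earlier_rating <= 3:      # negative answer
        return "What made this task difficult or frustrating?"
    elif earlier_rating >= 6:    # positive answer
        return "What worked especially well for you?"
    else:                        # neutral answer
        return "Is there anything you would change about this page?"

print(probing_question(2))  # -> probes about the difficulty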
What a test looks like: Task
You can have multi-select questions that turn off multiple
selection if the participant picks a “none of the above” choice.
15
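A minimal sketch of that exclusivity rule, assuming a hypothetical normalize step rather than any vendor’s form logic:

# Hypothetical sketch: if "None of the above" is selected, drop every
# other selection so the answer stays internally consistent.
NONE_OF_THE_ABOVE = "None of the above"

def normalize_multi_select(selections: list[str]) -> list[str]:
    if NONE_OF_THE_ABOVE in selections:
        return [NONE_OF_THE_ABOVE]
    return selections

print(normalize_multi_select(["Too slow", "None of the above"]))  # ['None of the above']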
What a test looks like: Task
You can require participants to pick three (or any other number of) characteristics. You can also randomize the choices, as well as the order of the tasks and the questions.
16
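The randomization itself can be as simple as shuffling each list before it is shown; this is a sketch, not the tool’s implementation:

import random

# Hypothetical sketch: shuffle answer choices, tasks, or questions so
# order effects average out across participants.
def randomized(items: list) -> list:
    shuffled = items[:]          # copy so the master script stays in order
    random.shuffle(shuffled)
    return shuffled

tasks = ["Home page review", "Find a class", "Contact the organization"]
print(randomized(tasks))  # a different order for each participant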
What a test looks like: Wrap-up
The last set of questions in a study is scorecard-type questions: Did the participant think the site was easy to use, was she satisfied with the site, was it well organized?
Usability = credibility
17
What a test looks like: Wrap-up
A participant might be forced to
return to the site for business
reasons, but if he’s willing to
recommend it, then he’s probably
happy with the site.
18
What a test looks like: Wrap-up
Answers to these exit questions
often contain gems. Don’t
overlook the opportunity to ask
for last-minute thoughts.
19
Reports: Analyzing unmoderated results
• Quantitative data: Satisfaction, ease of use,
and organization scorecards, plus other Likert
results, are analyzed for statistical significance
and correlations
• Qualitative data: Lots and lots of responses to
open-ended questions
• Clickstream data: Where did the participants
actually go? First clicks, standard paths, fall-off
points
20
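As a small illustration of the quantitative side (standard library only, made-up ratings), one scorecard column can be summarized with a mean and a rough large-sample confidence interval before moving on to significance tests, correlations, and segments:

import math
import statistics

# Hypothetical sketch: summarize one 7-point scorecard question
# (e.g., satisfaction) from an unmoderated study. Ratings are invented.
satisfaction = [6, 5, 7, 4, 6, 6, 3, 5, 7, 6, 5, 4, 6, 7, 5]

n = len(satisfaction)
mean = statistics.mean(satisfaction)
stdev = statistics.stdev(satisfaction)
margin = 1.96 * stdev / math.sqrt(n)   # rough 95% CI, large-sample approximation

print(f"n={n}, mean={mean:.2f}, 95% CI ~ [{mean - margin:.2f}, {mean + margin:.2f}]")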
How do moderated and unmoderated
results compare?
• Statistical validity
• Shock value of participants’ comments
• Quality of the data
• Quantity of the data
• Missing information
• Cost
• Time
• Subjects
• Environment
• Geography
21
Comparisons: Statistical validity
• What’s the real difference between samples of
10 (moderated) and 100 (unmoderated)?
– “The smaller number is good to pick up the main
issues, but you need the larger sample to really
validate whether the smaller sample is
representative.
– “I’ve noticed the numbers swinging around as we
picked up more participants, at the level between
50 and 100 participants. At 100 or 200 participants,
the data were completely different.” Ania
Rodriguez, ex-IBM, now Keynote director
22
Comparisons: Statistical validity
• It’s just math…
[Two bar charts: the frequency of Problems 1–5 found with a sample of 10 vs. a sample of 100.]
23
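The “just math” point can be sketched with the binomial margin of error: the same observed problem rate is far less stable at n=10 than at n=100, which is why the numbers “swing around” in small samples. The 30% rate below is invented for illustration.

import math

# Hypothetical sketch: margin of error (95%, normal approximation) for a
# problem observed by 30% of participants, at two sample sizes.
def margin_of_error(p: float, n: int) -> float:
    return 1.96 * math.sqrt(p * (1 - p) / n)

for n in (10, 100):
    moe = margin_of_error(0.30, n)
    print(f"n={n:3d}: 30% +/- {moe:.0%}")
# n= 10: 30% +/- 28%  -> estimates swing wildly
# n=100: 30% +/- 9%   -> estimates settle down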
Key Customer Experience Metrics
Club Med trailed Beaches on nearly all key metrics (especially page load times).
Q85 – 88. Overall, how would you rate your experience on the Club Med site.
[Chart: Overall experience, Organization, Ease of use, Level of frustration, and Perception of page load times for Club Med vs. Beaches, split into positive (6-7), neutral (3-5), and negative (1-2) ratings; n=50 per site; bars significantly higher or lower than Club Med at 90% CI are flagged.]
“Site was slow site kept losing my information and had to be retyped.” – Club Med
“I could not get an ocean view room because the pop up window took too long to
wait for.” – Club Med
24
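The “significantly higher or lower at 90% CI” flags on a chart like this typically come from a two-proportion comparison; here is a standard-library sketch with invented percentages, not the actual Club Med or Beaches data:

import math

# Hypothetical sketch: two-proportion z-test at the 90% confidence level
# (critical value ~1.645), with n=50 respondents per site.
def significantly_different(p1: float, p2: float, n1: int = 50, n2: int = 50) -> bool:
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return abs(z) > 1.645  # two-sided test at 90% confidence

# Invented example: 44% positive ratings for one site vs. 66% for the other.
print(significantly_different(0.44, 0.66))  # True -> flag the difference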
Comparisons: Statistical validity
• What’s the real difference between samples of
10 (moderated) and 100 (unmoderated)?
– “In general, quantitative shows you where issues are
happening. For why, you need qualitative.”
– But “to convince the executive staff, you need
quantitative data.
– “We also needed the quantitative scale to see how
people were interacting with eBay Express. It was a
new interaction paradigm [faceted search]—we
needed click-through information, how deep did
people go, how many facets did people use?”
Michael Morgan, eBay usability group manager;
uses UserZoom & Keynote
25
Comparisons: Statistical validity
• “How many users are enough?”
– There is no magical number.
– Katz & Rohrer in UX (vol. 4, issue 4, 2005):
• Is the goal to assess quality? For benchmarking
and comparisons, high numbers are good.
• Or is it to address problems and reduce risk before the product is released? To improve the product, small, ongoing tests are better.
26
Comparisons: Shock value
• Are typed comments as useful as audio or video
in proving that there’s a problem?
• Ania:
– “Observing during the session is better than audio or
video. While the test is happening, the CEOs can ask
questions. They’re more engaged.”
– That being said, “You can create a powerful stop-action video using Camtasia and the clickstreams.”
27
Comparisons: Shock value
• Are typed comments as useful as audio or video
in proving that there’s a problem?
• Michael:
– “The typed comments are very useful—top of mind.
However, they’re not as engaging as video.” So, in
his reports, he combines qualitative Morae clips with
the quantitative UserZoom data.
– “We also had click mapping—heat maps and first
clicks,” and that was very useful. “On the first task,
looking for laptops, we found that people were going
to two different places.”
28
Comments are backed by heatmaps
29
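Click maps are, at bottom, aggregated coordinates; here is a minimal sketch that tallies first clicks into a coarse grid (the data format is hypothetical, not any vendor’s export):

from collections import Counter

# Hypothetical sketch: bucket participants' first-click coordinates into a
# coarse grid to see where attention concentrates (a crude "heat map").
first_clicks = [(120, 80), (130, 85), (640, 90), (125, 78), (630, 95)]  # made-up (x, y) pixels

GRID = 100  # bucket size in pixels
heat = Counter((x // GRID, y // GRID) for x, y in first_clicks)

for cell, count in heat.most_common():
    print(f"grid cell {cell}: {count} first clicks")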
Comparisons: Quality of the data
• Online and in the lab, what are the temptations
to be less than honest?
– In the lab, some participants want to please the
moderator.
– Online, some participants want to steal your money.
30
Comparisons: Quality of the data
• How do you prompt participants to explain why
they’re stuck if you can’t see them getting
stuck?
– In the task debriefing,
include a general set of
explanations from which
people can choose. For
example, “The site was
slow,” “Too few search
results,” “Page too
cluttered.”
31
Comparisons: Quality of the data
• How do you prompt participants to explain why
they’re stuck if you can’t see them get stuck?
– Let people stop doing a task, but ask them why they
quit.
32
Comparisons: Quantity of data
• What is too much data? What are the trade-offs
between depth and breadth?
– “I’ve never found that there was too much data. I
might not put everything in the report, but I can
drill in 2 or 3 months later if the client or CEO asks
for more information about something.”
– With more data, “I can also do better segments” (for example, check a subset like “all women 50 and older vs. all men 50 and older”). —Ania Rodriguez
33
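A segment check like the one Ania describes is just a filter over the response records; a rough sketch with invented fields and data:

import statistics

# Hypothetical sketch: compare a scorecard question across two segments.
responses = [
    {"gender": "F", "age": 62, "satisfaction": 6},
    {"gender": "M", "age": 55, "satisfaction": 4},
    {"gender": "F", "age": 51, "satisfaction": 7},
    {"gender": "M", "age": 68, "satisfaction": 5},
]

def segment_mean(rows, gender, min_age=50):
    scores = [r["satisfaction"] for r in rows
              if r["gender"] == gender and r["age"] >= min_age]
    return statistics.mean(scores) if scores else None

print("women 50+:", segment_mean(responses, "F"))
print("men 50+:  ", segment_mean(responses, "M"))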
Comparisons: Quantity of data
• What is too much data? What are the trade-offs
between depth and breadth?
– “You have to figure out upfront how much you want
to know. Make sure you get all the data you need for
your stakeholders.
– “You won’t necessarily present all the data to all the
audiences. Not all audiences get the same
presentation.” The nitty-gritty goes into an
appendix.
– “You also don’t want to exhaust the users by asking
for too much information.” —Michael Morgan
34
Comparisons: Missing data
• What do you lose if you can’t watch someone
interacting with the site?
– Some of the language they use to describe what they
see. “eBay talk is ‘Sell your item’ and ‘Buy it now.’
People don’t talk that way. They say, ‘purchase an
item immediately.’” —Michael Morgan
– Reality check. “The only way to get good data is to
test with 6 live users first. We find the main issues
and frustrations, and then we validate them by
running the test with 100 to 200 people.” —Ania
Rodriguez
– Body language, tone of voice, and differences
because of demographics…
35
Comparisons: Missing data
36
Comparisons: Missing data
37
Comparisons: Relative expense
• What are the relative costs of moderated
vs. unmoderated tests?
– What’s your experience?
38
Comparisons: Time
• Which type of test takes longer to set up
and analyze: moderated or unmoderated?
– What’s your experience?
39
Comparisons: Subjects
• Is it easier or harder to get qualified subjects
for unmoderated testing?
– Keynote and UserZoom offer pre-qualified panels.
– If you want to pick up people who use your site, an
invitation on the site is perfect.
– If you do permission marketing and have an email
list of customers or prospects already, you can use
that.
• How do you know if the subjects are actually
qualified?
– Ask them to answer screening questions. Hope they
don’t lie. Don’t let them retry (by setting a cookie).
40
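The “set a cookie” trick is simple to sketch with the standard library; the cookie name and helper functions below are hypothetical, not part of any testing tool:

from http.cookies import SimpleCookie

# Hypothetical sketch: block repeat attempts at the screener by checking
# for a marker cookie set on the participant's first visit.
SCREENER_COOKIE = "screener_done"

def already_screened(cookie_header: str) -> bool:
    """Return True if the participant's browser already carries the marker cookie."""
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    return SCREENER_COOKIE in cookie

def set_screened_header() -> str:
    """Build the Set-Cookie value to send after the first screener attempt."""
    cookie = SimpleCookie()
    cookie[SCREENER_COOKIE] = "1"
    cookie[SCREENER_COOKIE]["max-age"] = 60 * 60 * 24 * 90  # remember for ~90 days
    return cookie[SCREENER_COOKIE].OutputString()

print(already_screened("screener_done=1"))  # True -> skip or reject the retry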
Comparisons: Environment
• In unmoderated testing, participants use their
own computers in their own environments.
However, firewalls and job rules may make it
difficult to get business users as subjects.
• Also, is taking people out of their home or
office environments ever helpful—for example,
by eliminating interruptions and distractions?
41
Comparisons: Geography
• Remote unmoderated testing makes it
relatively easy to test in many different
locations, countries, and time zones.
• However, moderated testing in different
locations may help the design team understand
the local situation better.
42
References
Farnsworth, Carol. (Feb. 2007) “Using Quantitative/Qualitative Customer Research to Improve Web Site Effectiveness.” http://www.nycupa.org/pastevent_07_0123.html
Fogg, B. J., Cathy Soohoo, David R. Danielson, Leslie Marable, Julianne Stanford, Ellen R. Tauber. (June 2003) “Focusing on User-to-Product Relationships: How Do Users Evaluate the Credibility of Web Sites? A Study with Over 2,500 Participants.” Proceedings of the 2003 Conference on Designing for User Experiences (DUX ’03).
Fogg, B. J., Jonathan Marshall, Othman Laraki, Alex Osipovich, Chris Varma, Nicholas Fang, Jyoti Paul, Akshay Rangnekar, John Shon, Preeti Swani, Marissa Treinen. (March 2001) “What Makes Web Sites Credible? A Report on a Large Quantitative Study.” Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’01).
Katz, Michael A., Christian Rohrer. (2005) “What to Report: Deciding Whether an Issue Is Valid.” User Experience 4(4):11-13.
Tullis, T. S., Fleischman, S., McNulty, M., Cianchette, C., and Bergel, M. (July 2002) “An Empirical Comparison of Lab and Remote Usability Testing of Web Sites.” Usability Professionals’ Association Conference, Orlando, FL. (http://members.aol.com/TomTullis/prof.htm)
University of British Columbia Visual Cognition Lab. (Undated) Demos. (http://www.psych.ubc.ca/~viscoglab/demos.htm)
43
Commercial tools
Keynote Systems (online usability testing)
– Demo: “Try it now” on
http://keynote.com/products/customer_experience/web_ux_research_tools/webeffective.html
UserZoom (online usability testing)
– http://www.userzoom.com/index.asp
WebSort.net (online card sorting tool)
SurveyMonkey.com (online survey tool—basic level is free)
Zoomerang.com (online survey tool)
44
Statistics
Darrell Huff, “How to Lie With Statistics,” W. W. Norton &
Company (September 1993)
http://www.amazon.com/How-Lie-Statistics-DarrellHuff/dp/0393310728/ref=pd_bbs_sr_1/102-06635070637745?ie=UTF8&s=books&qid=1190492483&sr=1-1
Julian L. Simon, “Resampling: The New Statistics,” 2nd
ed., October 1997,
http://www.resample.com/content/text/index.shtml
Michael Starbird, “What Are the Chances? Probability
Made Clear & Meaning from Data,” The Teaching
Company, http://www.teach12.com/store/course.asp?id=1475&pc=Science%20and%20Mathematics
45
Questions?
Contact us anytime!
Susan Fowler has been an analyst for Keynote Systems,
Inc., which offers remote unmoderated user-experience
testing. She is currently a consultant at FAST Consulting
and an editorial board member of User Experience
magazine. With Victor Stanwick, she is co-author of the
Web Application Design Handbook (Morgan Kaufmann
Publishers).
718 720-1169; cell 917 734-3746
http://fast-consulting.com
[email protected]
46