Separating the Swarm Categorization Methods for User Sessions on the Web

Separating the Swarm
Categorization Methods for
User Sessions on the Web
Jeffrey Heer, Ed H. Chi
Palo Alto Research Center
2002-04-24
CHI Web Behavior Patterns
Web Analytics: What can you measure?
Want to improve site design, content, and performance
[Diagram: three areas of measurement, Infrastructure, Marketing, and Site Design, labeled with: content, page traffic, load testing, user intent, usability, user experience]
The Change in Web Sites:
What should you measure?
[Figure: Site Complexity plotted against Time. Page-based websites (Products, Management, Team) are labeled TRAFFIC; activity-based websites ("I’d like information on used cars.", "Search for a car dealer in my neighborhood.") are labeled USER EXPERIENCE.]
Motivation
What are users’ information goals?
Strategy: Use all available data to discover user goals (Content, Usage, Topology).
Understanding the composition of web user traffic.
System Description
Evaluation
Implications
Conclusion
System Description

Generate a user profile for each user session.
– How: Use access logs and site content to build a multi-featured model of user activity (multi-modal clustering).

Group user profiles into common activities
like “product browsing” and “job seeking”
– How: Apply clustering algorithms to user profiles
System Description
[Pipeline diagram: Access Logs and Web Crawl feed User Sessions and the Document Model, which combine into User Profiles and then Clustered Profiles]
Steps:
1. Process Access Logs
2. Crawl Web Site
3. Build Document Model
4. Extract User Sessions
5. Build User Profiles
6. Cluster Profiles
Document Model

Site is crawled
– Pay special attention to pages in logs.

Documents described by feature vectors:
Content: TF.IDF weighted keyword vector
URL: Tokenized and TF.IDF weighted
Inlinks: Column vectors in topology matrix
Outlinks: Row vectors in topology matrix

Vectors are concatenated to form a single multi-modal vector P_d for each document.
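As a minimal sketch of the document model, the block below builds the four feature vectors for a tiny, hypothetical three-page site and concatenates them into one vector per document. The page URLs, keywords, link matrix, and the specific TF.IDF variant (term count times log of inverse document frequency) are all assumptions made for illustration; the slides do not say which TF.IDF formula or tokenizer was used.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Return one TF.IDF vector per token list over a shared, sorted vocabulary.
    Assumed variant: tf * log(N / df)."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vectors

def url_tokens(url):
    """Split a URL path on '/' and '.' (a hypothetical tokenizer)."""
    return [tok for tok in url.replace(".", "/").split("/") if tok]

# Hypothetical three-page site.
pages = ["/products/printers.html", "/products/ink.html", "/jobs/index.html"]
content = [["printer", "laser", "buy"], ["ink", "buy"], ["jobs", "careers"]]
# Topology matrix: links[i][j] == 1 when page i links to page j.
links = [[0, 1, 1],
         [1, 0, 0],
         [0, 0, 0]]

def document_model(pages, content, links):
    """Build the per-modality feature vectors for every document."""
    content_vecs = tfidf_vectors(content)
    url_vecs = tfidf_vectors([url_tokens(u) for u in pages])
    model = []
    for d in range(len(pages)):
        model.append({
            "content": content_vecs[d],
            "url": url_vecs[d],
            "inlinks": [row[d] for row in links],  # column d of the topology matrix
            "outlinks": list(links[d]),            # row d of the topology matrix
        })
    return model

def concatenate(modalities):
    """Concatenate the modality vectors into a single multi-modal vector P_d."""
    return (modalities["content"] + modalities["url"]
            + modalities["inlinks"] + modalities["outlinks"])

model = document_model(pages, content, links)
P = [concatenate(m) for m in model]
```

Keeping the per-modality dictionary alongside the concatenated vector is a design choice of this sketch: it makes it possible to weight each modality separately later on.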
User Sessions

Sessions extracted and represented by a vector s:
– For path i = A → B → D, s_i = <1,1,0,1,0>
(For site with 5 documents <A,B,C,D,E>)

Different weightings can be employed in creating the session vector s:
Frequency: Number of times each page is accessed.
A → B → D, s = <1,1,0,1,0>
TF.IDF: hits / # paths including page
Position: Use order of pages within surfing path.
A → B → D, s = <1,2,0,3,0>
View Time: Use time spent viewing pages.
A (10s) → B (20s) → D (15s), s = <10,20,0,15,0>
User Profiles

User profiles are linear combinations of the viewed pages.
– “You are what you see.”

$$UP_i = \sum_{d=1}^{N} s_{id} P_d$$

where UP_i is the user profile for session i, s_id are the session weights, and P_d are the document vectors.
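A user profile is therefore a weighted sum of document vectors, with the session vector supplying the weights. The sketch below shows that computation; the three-dimensional document vectors are made up for illustration, and the session weights reuse the View Time example from the User Sessions slide.

```python
def user_profile(session, doc_vectors):
    """Compute UP_i as the sum over documents d of s_id * P_d."""
    dims = len(doc_vectors[0])
    profile = [0.0] * dims
    for weight, doc in zip(session, doc_vectors):
        for k in range(dims):
            profile[k] += weight * doc[k]
    return profile

# Hypothetical document vectors P_d for a 5-page site <A,B,C,D,E>.
P = [
    [1.0, 0.0, 0.0],  # A
    [0.0, 1.0, 0.0],  # B
    [0.0, 0.0, 1.0],  # C
    [1.0, 1.0, 0.0],  # D
    [0.0, 0.0, 2.0],  # E
]
s = [10, 20, 0, 15, 0]  # View Time weights for the path A, B, D

profile = user_profile(s, P)
print(profile)  # [25.0, 35.0, 0.0]
```

Pages that were never visited have weight 0, so they contribute nothing to the profile.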
Clustering

Clustering is a form of statistical analysis which organizes data into individual clusters.
– Groupings are determined by a shared similarity.
– Similarity is defined by a computable similarity metric.

$$d(UP_i, UP_j) = \sum_{m \in Modalities} w_m \cos(UP_i^m, UP_j^m)$$

Weights w_m specify the contribution of each modality.
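The similarity metric can be sketched as a weighted sum of per-modality cosine similarities. The modality names, weights, and profile values below are hypothetical, and returning 0 for a zero-length vector is an assumption of this sketch rather than something stated in the slides.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros (an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def profile_similarity(up_i, up_j, weights):
    """d(UP_i, UP_j): sum over modalities m of w_m * cos(UP_i^m, UP_j^m)."""
    return sum(w * cosine(up_i[m], up_j[m]) for m, w in weights.items())

# Hypothetical modality weights and per-modality profile vectors.
weights = {"content": 0.5, "url": 0.25, "inlinks": 0.25}
up_a = {"content": [1.0, 0.0], "url": [1.0, 1.0], "inlinks": [0.0, 2.0]}
up_b = {"content": [2.0, 0.0], "url": [1.0, 1.0], "inlinks": [3.0, 0.0]}

similarity = profile_similarity(up_a, up_b, weights)
print(round(similarity, 6))  # 0.75: content and URL agree, inlinks are orthogonal
```

Although the slide writes the metric as d, larger values mean more similar profiles. Setting a modality's weight to 0 drops it from the comparison, which is one way to express the unimodal schemes.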

Clustering proceeds by recursive bisection, using
K-Means to perform the bisections [Zhao01].
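Recursive bisection can be sketched as repeatedly splitting one cluster in two with 2-means until the desired number of clusters is reached. The version below is a simplified illustration, not the CLUTO implementation: it always splits the largest cluster, seeds each split deterministically, runs a fixed number of iterations, and uses plain cosine similarity on single vectors instead of the weighted multi-modal metric. All of these choices, and the sample profiles, are assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def bisect(vectors, iterations=10):
    """Split vectors into two groups with 2-means under cosine similarity."""
    # Assumed seeding: the first vector and the vector least similar to it.
    seed_a = vectors[0]
    seed_b = min(vectors, key=lambda v: cosine(seed_a, v))
    centers = [seed_a, seed_b]
    for _ in range(iterations):
        groups = ([], [])
        for v in vectors:
            closer = 0 if cosine(v, centers[0]) >= cosine(v, centers[1]) else 1
            groups[closer].append(v)
        if not groups[0] or not groups[1]:
            break  # degenerate split, e.g. all vectors identical
        centers = [centroid(groups[0]), centroid(groups[1])]
    return groups

def recursive_bisection(vectors, k):
    """Split the largest cluster until there are k clusters (or no split is possible)."""
    clusters = [list(vectors)]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        left, right = bisect(largest)
        if not left or not right:
            clusters.append(largest)
            break
        clusters.extend([left, right])
    return clusters

# Hypothetical user profiles forming three loose groups.
profiles = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
    [0.0, 0.3, 1.0], [0.0, 0.4, 0.9],
]
clusters = recursive_bisection(profiles, k=3)
for cluster in clusters:
    print(cluster)
```

The number of clusters k is passed in by hand here, which matches the slides' note under Future Work that choosing the number of clusters is currently semi-manual.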
[Screenshot of clustering output, annotated with: user population breakdown; keywords describing user groups; frequent documents accessed by group; detailed stats]
Clustering Results
http://www.diamondreview.com
Users reached end of tutorial, had nowhere to go.
System Evaluation
Does the system correctly infer user intentions?
[Diagram: User Intent produces Logs; the System turns Logs into User Intent Groupings, which are compared against the original User Intent]
User Study

Asked users to perform specific tasks on www.xerox.com
– Captured actions using the WebQuilt proxy logger [Hong01]
– Tasks were done at the users' leisure.

15 unique tasks:
– Tasks developed after exploring xerox.com and reading user
e-mail feedback
– 5 task groups with 3 tasks per group.
– Products, TechSupport, Supplies, Company Info, and Jobs

Participation:
– 21 users signed up, 18 completed the study, producing 104 usable sessions.
Results: 340 combinations of clustering schemes
Outlink-based schemes performed poorly (omitted).
Analysis: Modalities
[Chart: Analysis of Modalities in Unimodal Cases. Y-axis: % correctly clustered (0.0% to 100.0%). X-axis: Path Weighting Schemes (uniform, tfidf, time, position, and their combinations). Series: RAW PATH, CONTENT, URL, INLINK, OUTLINK.]
Content is King! Mean=0.96, StdDev=0.07
Linear contrast shows Content is significantly different:
(unimodal) F(1,105)=32.51, MSE=.005361, p<0.0001
(multimodal) F(1,35)=33.36, MSE=.007332, p<0.0001
Analysis: Path Weighting
View Time is best!
Paired t-test between time-based and non-time-based weightings:
n=60, t(59)=4.85, p=4.68e-6
View Time mean=89.5%, s.d.=12.7%; non-View Time mean=83.2%, s.d.=12.0%
Observation: Multi-Modal vs. Unimodal

In practice, Multi-Modal should be more robust
– Some pages don’t have much content
» Images, Audio, Video
» PDF, PS (if you don’t have necessary software)
– URL Tokens: All pages have URLs.
– Inlinks: don’t depend on any features of a page!

In our experience, Content-based Multi-Modal
Clustering retains accuracy.
Linear contrast shows no significant difference between multi-modal and unimodal schemes:
F(1,77)=1.63, MSE=.004407, p=.21
Findings




Incorporating View Time improves clustering accuracy.
Though it involves extra work, extracting Content can provide very high accuracy.
Adding other modalities makes clustering more robust.
Modalities should be chosen carefully, and tailored for each specific site.
Implications for Designers


Good design means understanding your users.
It’s possible to understand trends of user
activities accurately.
– Requires well-defined user tasks doable on the site.

Now you can design and tailor user experience.
– Address discovered usability issues.
– Update design to facilitate common tasks.
Summary: “You are what you see.”
Users follow the best Information Scent to accomplish their goals.
[Diagram: a Web site (Page Content, Topology) and Observed Usage feed InfoScent Clustering, which recovers User Information Goals]
Future Work

Determining # of clusters
– Currently done semi-manually



Model unstructured tasks more directly
Directly recommend design changes
Integrate with
– Clustering Visualization
– User Path Visualization

Lots of Commercial Interest, Licensing
Conclusion




Performed first known user study to characterize the
analytic space of session clustering techniques.
Found that session clustering can be highly accurate
with respect to user intentions.
Demonstrated our method is scalable and useful in
real-world scenarios.
This should prove to be a useful tool for web
designers and researchers!
Acknowledgements




Peter Pirolli, Stu Card, Adam Rosien, Pam Schraedley, and the UIR and Bloodhound Team at PARC.
George Karypis for CLUTO software
Participants in our user study
Office of Naval Research
Contact:
Jeff Heer ([email protected])
Ed H. Chi ([email protected])