Transcript slides

Better Logging to Improve
Interactive Data Analysis Tools
Sara Alspaugh
Archana Ganapathi
Marti Hearst
Randy Katz
09-28-2012 18:28:01.134 -0700 INFO AuditLogger - Audit:[timestamp=09-28-2012 18:28:01.134,
user=splunk-system-user, action=search, info=granted,
search_id='scheduler__nobody__testing__RMD56569fcf2f137b840_at_1348882080_101256',
search='search index=_internal metrics per_sourcetype_thruput | head 100', autojoin='1', buckets=0,
ttl=120, max_count=500000, maxtime=8640000, enable_lookups='1', extra_fields='',
apiStartTime='ZERO_TIME', apiEndTime='Fri Sep 28 18:28:00 2012',
savedsearch_name="sample scheduled search for dashboards (existing job case)"]
[Slide build: the log entry above is repeated, with the labels timestamp, user, action, and event added one at a time]
parameters
execution environment
configuration and version
stack trace
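The labeled parts of the audit entry can be pulled out mechanically. The following is a minimal sketch, assuming the entry is a flat list of key=value pairs inside `Audit:[...]` with values either quoted or running up to the next comma. The regular expression and the function name `parse_audit_entry` are illustrative assumptions, not a documented Splunk grammar, and the sample line is an abbreviated copy of the entry on the slide with quotes normalized to ASCII.

```python
import re

# Abbreviated copy of the slide's audit entry (hypothetical shortening:
# fields after ttl are dropped to keep the sample readable).
SAMPLE = (
    "09-28-2012 18:28:01.134 -0700 INFO AuditLogger - "
    "Audit:[timestamp=09-28-2012 18:28:01.134, user=splunk-system-user, "
    "action=search, info=granted, "
    "search_id='scheduler__nobody__testing__RMD56569fcf2f137b840_at_1348882080_101256', "
    "search='search index=_internal metrics per_sourcetype_thruput | head 100', "
    "autojoin='1', buckets=0, ttl=120]"
)

# A value is single-quoted, double-quoted, or runs to the next comma or "]".
FIELD = re.compile(r"""(\w+)=('[^']*'|"[^"]*"|[^,\]]*)""")


def parse_audit_entry(line):
    """Return the key=value pairs found inside Audit:[...] as a dict of strings."""
    marker = "Audit:["
    body = line[line.index(marker) + len(marker):]
    fields = {}
    for key, value in FIELD.findall(body):
        # Drop surrounding whitespace and quote characters from the value.
        fields[key] = value.strip().strip("'\"")
    return fields


fields = parse_audit_entry(SAMPLE)
print(fields["timestamp"], fields["user"], fields["action"])
```

A search string that itself contains a quote character would defeat this pattern, which is one reason a structured log format is easier to mine than free text.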
Motivation
Why do we need better logging?
Visualizing records of user activity to help optimize the user
experience using Google Analytics Goal Flow Tool
Applications of Good User Activity Records
• recommenders
• predictive interfaces
• task guidelines
• activity visualizations
• traffic analysis
• UX optimization
Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan. “Web
usage mining: discovery and applications of usage patterns from web data.” SIGKDD
Explorations Newsletter. 2000.
Examples of this in IDEA tools
• SYF: Systematic yet flexible (Perer and Shneiderman)
– social network analysis tool
– task guidelines for exploring social network data
– users can provide feedback on task usefulness
– records when users have completed tasks
Adam Perer and Ben Shneiderman. “Systematic yet flexible discovery: guiding domain experts through exploratory data analysis.” Conference on Intelligent User Interfaces (IUI). 2008.
• SeeDB (Parameswaran, Polyzotis, Garcia-Molina)
– recommend visualizations for a given SQL query
Aditya Parameswaran, Neoklis Polyzotis, and Hector Garcia-Molina. “SeeDB: visualizing database queries efficiently.” International Conference on Very Large Data Bases (VLDB).
“Understanding the domain experts’ tasks is necessary to
defining the systematic steps for guided discovery.
Although some professions such as physicians, field
biologists, and forensic scientists have specific
methodologies defined for accomplishing tasks, this is rarer
in data analysis. Interviewing analysts, reviewing
current software approaches, and tabulating
techniques common in research publications are
important ways to deduce these steps.”
Some problems with logging
• ICSE 2012 study of logging best practices
• looks at four top OSS projects, finds logging is:
– “often a subjective and arbitrary practice”
– “seldom a core feature provided by the vendors”
– “written as ‘after-thoughts’ after a failure”
– “arbitrary decisions on when, what and where to log”
Ding Yuan, Soyeon Park, and Yuanyuan Zhou. “Characterizing logging practices in
open-source software.” International Conference on Software Engineering (ICSE).
2012.
“. . . it is critical to gain access to a stream of user
actions. Unfortunately, systems and applications
have not been written with an eye to user
modeling.”
Eric Horvitz, Jack Breese, David Heckerman, David Hovel, and Koos Rommelse.
“The Lumière project: Bayesian user modeling for inferring the goals and needs of
software users.” Conference on Uncertainty in Artificial Intelligence. 1998.
Recommendations
Plan ahead to capture high-level user actions when designing
the system.
Track detailed provenance for all events.
Observe intermediate user actions that are not “submitted” to the
system.
Record the metadata and statistics of the data set being
analyzed.
Collect user goals and feedback.
Recommendation #1
Plan ahead to capture high-level user
actions when designing the system.
High-level task: clustering in Excel
Examples of this in IDEA tools
• HARVEST (Gotz and Zhou)
– visual analytics tool that incorporates action semantics, not events, as a core design element
– based on a catalogue of common analytics actions derived through a review of many analytics systems
– exposes high-level actions that retain rich semantics as the way of interacting with data
David Gotz and Michelle Zhou. “Characterizing users’ visual analytic activity for
insight provenance.” Symposium on Visual Analytics Science and Technology
(VAST). 2008.
“...work in this area has relied on either manually recorded provenance
(e.g., user notes) or automatically recorded event-based insight provenance
(e.g., clicks, drags, and key-presses), both approaches have fundamental
limitations.”
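One way to read Recommendation #1 is that the log should carry the high-level action as a first-class record, with low-level events attached to it rather than standing alone. The sketch below assumes a hypothetical in-memory `ActivityLog` class. The action name `cluster`, its parameters, and the record fields are made up for illustration and do not come from HARVEST or any other tool.

```python
import time
import uuid
from contextlib import contextmanager


class ActivityLog:
    """Collects records in memory; a real tool would persist them somewhere."""

    def __init__(self):
        self.records = []
        self._current_action = None

    @contextmanager
    def action(self, name, **params):
        """Record a high-level, semantically meaningful user action."""
        action_id = str(uuid.uuid4())
        self.records.append({"kind": "action", "id": action_id, "name": name,
                             "params": params, "time": time.time()})
        previous, self._current_action = self._current_action, action_id
        try:
            yield action_id
        finally:
            # Restore the enclosing action (or None) when the block ends.
            self._current_action = previous

    def event(self, name, **details):
        """Record a low-level UI event, linked to the enclosing action if any."""
        self.records.append({"kind": "event", "name": name, "details": details,
                             "action_id": self._current_action,
                             "time": time.time()})


log = ActivityLog()
with log.action("cluster", method="k-means", k=3, columns=["age", "income"]):
    log.event("click", target="Data tab")
    log.event("keypress", key="Enter")
log.event("click", target="Help menu")  # outside any high-level action
```

With records shaped like this, an analyst can count "cluster" actions directly instead of trying to infer them from sequences of clicks and key presses.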
Recommendation #2
Track detailed provenance for all events.
Sources of data transformation activity:
• interactively entered at search bar
• triggered by dashboard reload
• issued from external user script
Bad if the same event is logged for all three:
09-28-2012 18:28:01.134 -0700 INFO AuditLogger - Audit:[timestamp=09-28-2012 18:28:01.134, user=salspaugh, action=search, info=granted,
search_id='scheduler__nobody__testing__RMD56569fcf2f137b840_at_1348882080_101256', search='search source=*access_log* | eval http_success = if(status=200,
true, false) | timechart count by http_success', autojoin='1', buckets=0, ttl=120, max_count=500000, maxtime=8640000, enable_lookups='1', extra_fields='',
apiStartTime='ZERO_TIME', apiEndTime='Fri Sep 28 18:28:00 2012', savedsearch_name=""]
“...the log files do not differentiate between Show Me and Show Me
Alternatives. These commands are implemented with the same code and the
log entry is generated when the command is successfully executed.”
Visualization recommendation in Tableau’s Show Me.
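A small sketch of what tracking provenance could look like for the three sources of search activity listed under Recommendation #2. The `source` field, the `Source` enum, and `format_search_event` are hypothetical names introduced here. They are not part of Splunk's or Tableau's actual log formats.

```python
from enum import Enum


class Source(Enum):
    """Where a search came from (hypothetical labels for the three cases)."""
    SEARCH_BAR = "search_bar"
    DASHBOARD_RELOAD = "dashboard_reload"
    EXTERNAL_SCRIPT = "external_script"


def format_search_event(user, search, source):
    """Build a key=value log line that records the origin of the search."""
    if not isinstance(source, Source):
        raise TypeError("source must be a Source member")
    return (f"user={user}, action=search, source={source.value}, "
            f"search='{search}'")


query = "search source=*access_log* | timechart count"
for src in Source:
    # Same user and query, but each origin now yields a distinct log line.
    print(format_search_event("salspaugh", query, src))
```

Requiring the caller to pass a `Source` makes the origin an explicit decision at each call site, so two code paths that share an implementation do not silently share a log entry.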
Recommendation #3
Record the metadata and statistics of the
data set being analyzed.
[Figure: Toy Example Influence Diagram, with nodes "data" and "action"]

Toy Example Conditional Probability Table: P( action | data )

data                            scatter plot    bar chart
{categorical, categorical}      .001            .999
{categorical, quantitative}     .815            .185
{quantitative, quantitative}    .900            .100
Initial recommendation ranking
Recommendation ranking based on the data
Wolfram Predictive Interface in Mathematica
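The toy conditional probability table can be turned into a ranking directly: look up the row for the data types being analyzed and sort the actions by P( action | data ). This is a minimal sketch using the toy numbers from the slide. The assignment of each probability to the scatter plot or bar chart column is reconstructed from the slide layout and should be treated as an assumption, as should the function name `rank_actions`.

```python
# P(action | data) from the toy example; keys are sorted pairs of column types.
CPT = {
    ("categorical", "categorical"): {"scatter plot": 0.001, "bar chart": 0.999},
    ("categorical", "quantitative"): {"scatter plot": 0.815, "bar chart": 0.185},
    ("quantitative", "quantitative"): {"scatter plot": 0.900, "bar chart": 0.100},
}


def rank_actions(column_types):
    """Return the candidate actions, most probable first, for the given column types."""
    key = tuple(sorted(column_types))  # order of the two columns does not matter
    probs = CPT[key]
    return sorted(probs, key=probs.get, reverse=True)


print(rank_actions(["categorical", "categorical"]))
print(rank_actions(["quantitative", "quantitative"]))
```

The ranking is only possible if the log records the metadata of the data set (here, the column types) alongside the action taken, which is the point of Recommendation #3.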
Recommendation #4
Collect user goals and feedback.
Recommendation #5
Work towards a standard for logging data
analysis activity records.
Conclusion
• Goal: improve interactive data exploration
and analysis (IDEA): interfaces,
recommender systems, task guidelines,
predictive suggestions
• Problem: need better data to mine
• Recommendations for logging IDEA
activity
• When you build your next system for
IDEA, will you consider how you log user
activity?