Windows Crash Data Collection and Analysis


Windows Crash Analysis Results
Archana Ganapathi
Department of EECS, UC Berkeley
([email protected])
Motivation (1)
“If a problem has no solution, it may not be
a problem, but a fact, not to be solved,
but to be coped with over time.”
–Shimon Peres

Windows Crashes!!!




Determine crash causes
Document crash likelihood of SW/HW components
Evaluate product dependability
Identify underlying design principles that cause crashes
Motivation (2)

Build an Oracle for System Behavior




Determine dominant failure cause
Drive benchmarks
Evaluate our ideas via prototypes tested using benchmarks
Help us select problems to tackle
Current Data Collection Process

Dataset:
214 machines from the Berkeley EECS department, ~7 months, ~1500 crashes
Windows XP SP1
Low variability in usage profile

Collection Process:
Group policy modified to automatically direct crash reports to a local server
Easy-to-install CER (Corporate Error Reporting) software for the server
No prompt for the user to send a crash report

Frequency of collection:
Synchronized with application and system crashes on computers
Crash count per month
[Figure: number of crashes per month, mid-June through December; y-axis from 0 to 350]
Milestones:
Jun 14: 25 machines
Jun 25: 125 machines
July 9: 150 machines
Aug 3: 214 machines
Aug 24: fall semester begins
Dec 21: fall semester ends
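
A per-month tally like the one in this chart could be produced from raw crash timestamps along the following lines. This is only a minimal sketch: the timestamp format (month/day/two-digit year, hour:minute) is assumed from the deck's sample records, and the function name and sample values are hypothetical.

```python
from collections import Counter
from datetime import datetime

def crashes_per_month(timestamps):
    """Count crash reports per (year, month) from 'M/D/YY H:MM' strings."""
    counts = Counter()
    for ts in timestamps:
        when = datetime.strptime(ts, "%m/%d/%y %H:%M")
        counts[(when.year, when.month)] += 1
    # Return plain dict ordered chronologically for easy printing/plotting.
    return dict(sorted(counts.items()))

# Hypothetical input, shaped like the timestamps in the sample results.
sample = ["6/17/04 1:33", "6/17/04 1:39", "7/25/04 0:54"]
print(crashes_per_month(sample))
```

Because machines joined the study in stages, raw monthly counts are not directly comparable; dividing each month's count by the number of enrolled machines would be one way to adjust for that.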
Sample Results
Time | Machine | User | App\App version\Component\Component version
6/17/04 1:33 | machine1 | user1 | firefox.exe\0.8.0.0\msvcrt.dll\7.0.2600.1106\...
6/17/04 1:39 | machine1 | user1 | firefox.exe\0.8.0.0\msvcrt.dll\7.0.2600.1106\...
6/17/04 1:41 | machine1 | user1 | firefox.exe\0.8.0.0\msvcrt.dll\7.0.2600.1106\...
6/17/04 1:43 | machine1 | user1 | firefox.exe\0.8.0.0\msvcrt.dll\7.0.2600.1106\...
6/17/04 1:43 | machine1 | user1 | firefox.exe\0.8.0.0\msvcrt.dll\7.0.2600.1106\...
6/17/04 1:43 | machine1 | user1 | firefox.exe\0.8.0.0\msvcrt.dll\7.0.2600.1106\...
6/17/04 1:47 | machine1 | user1 | firefox.exe\0.9.0.0\msvcrt.dll\7.0.2600.1106\...
7/25/04 0:54 | machine2 | user2 | iexplore.exe\6.0.2800.1106\hungapp\0.0.0.0\...
7/25/04 0:54 | machine2 | user2 | iexplore.exe\6.0.2800.1106\MSHTML.DLL\6.0.2800.1400\...
7/25/04 0:54 | machine2 | user2 | iexplore.exe\6.0.2800.1106\jscript.dll\5.6.0.8513\...
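
Each sample record identifies the crash by a backslash-separated path of application, application version, component, and component version. A minimal sketch of splitting such a path into fields follows; the `CrashSignature` type and `parse_signature` function are hypothetical names, and lowercasing the file names (so that MSHTML.DLL and mshtml.dll compare equal) is an assumption rather than something the slides state.

```python
from typing import NamedTuple

class CrashSignature(NamedTuple):
    app: str
    app_version: str
    component: str
    component_version: str

def parse_signature(path: str) -> CrashSignature:
    """Split 'app\\app_version\\component\\component_version\\...' into fields."""
    parts = path.split("\\")
    if len(parts) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(parts)}: {path!r}")
    app, app_version, component, component_version = parts[:4]
    # Assumption: file names are case-insensitive, so normalize to lowercase.
    return CrashSignature(app.lower(), app_version, component.lower(), component_version)

sig = parse_signature(r"firefox.exe\0.8.0.0\msvcrt.dll\7.0.2600.1106\...")
print(sig.app, sig.component)

# In the sample data, a hang is recorded with the pseudo-component 'hungapp'.
hang = parse_signature(r"iexplore.exe\6.0.2800.1106\hungapp\0.0.0.0\...")
print(hang.component == "hungapp")
```

Anything after the fourth field is ignored here, since the sample records elide it.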
Crash Behavior on a Machine
[Figure: timeline of crashes on one machine, with one row each for App1, App2, and App3; x-axis: time]
Common Hypotheses

Microsoft OS is unreliable and causes most crashes
Web browsers crash more than other apps
Applications hang as frequently as they crash
Crashes occur roughly as much in 3rd party dlls as MS dlls
Application Categories
code development
custom software
database
document and presentation editing
document archive
document viewing
e-mail
graphic viewer
i/o
instant messaging
multimedia
OS
remote connection
sci computation
system mgmt/security
web browsing
Microsoft OS is unreliable and causes most crashes - FALSE
[Pie chart: share of crashes by application category]
web browsing: 36%
unknown: 16%
document preparation: 10%
email: 8%
Operating System: 5%
document viewer: 5%
scientific computation: 5%
multimedia: 4%
document archiver: 3%
instant messaging: 2%
remote connection: 2%
code development: 1%
plug and play: 1%
xml generator: 1%
other: 1%
System Crashes



XP operating system crashes very rarely!
More impactful, though: most require a reboot
9 out of ~1500 crashes isn’t enough data to study system crashes!

Use BOINC to collect system crash data
User behavior study



How regularly do people use different applications?
Do people behave differently if different things crash?
How frequently do people proactively reboot their computer for system stability?
Web browsers crash more than other apps - TRUE
[Two pie charts: usage frequency and crash frequency by application category]
Usage Frequency:
email: 24%
document preparation: 22%
web browsing: 18%
code development: 10%
document viewing: 8%
scientific computation: 7%
multimedia: 6%
system management: 4%
other: 1%
Crash Frequency:
web browsing: 36%
unknown: 16%
document preparation: 10%
email: 8%
Operating System: 5%
document viewer: 5%
scientific computation: 5%
multimedia: 4%
document archiver: 3%
instant messaging: 2%
remote connection: 2%
code development: 1%
plug and play: 1%
xml generator: 1%
other: 1%
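
One way to read the two charts together is to divide each category's share of crashes by its share of usage; a ratio above 1 means the category crashes more than its usage alone would suggest. The sketch below uses only the categories that appear under the same name in both charts, and the function name is hypothetical. Usage figures are subject to the survey limitations discussed later in the deck, so the ratios are rough at best.

```python
# Percentages copied from the usage and crash frequency charts.
usage_share = {
    "web browsing": 18, "email": 24, "document preparation": 22,
    "code development": 10, "scientific computation": 7, "multimedia": 6,
}
crash_share = {
    "web browsing": 36, "email": 8, "document preparation": 10,
    "code development": 1, "scientific computation": 5, "multimedia": 4,
}

def crash_to_usage_ratio(crash, usage):
    """Crash share divided by usage share, for categories with nonzero usage."""
    return {cat: crash[cat] / usage[cat] for cat in crash if usage.get(cat)}

ratios = crash_to_usage_ratio(crash_share, usage_share)
for cat, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cat}: {ratio:.2f}")
```

With these numbers, web browsing is the only category whose crash share exceeds its usage share (36% of crashes against 18% of usage).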
Web Browser usage vs crash
[Two pie charts: browser share of web browsing usage and of web browsing crashes]
Web Browsing Usage:
Internet Explorer: 54%
Netscape: 20%
Mozilla: 15%
Firefox: 9%
Lynx: 2%
Web Browsing Crashes:
Internet Explorer: 63%
Netscape: 24%
Firefox: 11%
Mozilla: 2%
Limitations of this analysis

No info on how long each application was used before crash
Need metrics on system and application usage
Skewed by user community’s usage patterns
User survey limitations:
Data only as accurate as user reports
Difficult to estimate time spent using each application
Hard to distinguish active usage from background processes
Applications hang as frequently as they crash - TRUE
Crash Cause by Component
[Pie chart: share of crashes by faulting component; percentages are rounded]
hungapp: 51%
unknown: 9%
ntdll.dll: 8%
msvcrt.dll: 3%
simpl_fox_gl.exe: 3%
WINWORD.EXE: 2%
user32.dll: 2%
rpcl3260.dll: 2%
exceed.exe: 2%
ray_tracing.exe: 2%
acrord32.exe: 2%
powerarc.exe: 2%
pdm.dll: 2%
netscape.exe: 1%
dreamweaver.exe: 1%
kernel32.dll: 1%
tempest.exe: 1%
mshtml.dll: 1%
gklayout.dll: 1%
sshclient.exe: 1%
mathematica.exe: 1%
comctl32.dll: 1%
excel.exe: 1%
model_ir.exe: 1%
thunderbird.exe: 1%
win-ir pro.exe: 1%
wow32.dll: 1%
xcelsius.exe: 1%
App Hang Frequency
[Pie chart: share of hangs by application; percentages are rounded]
IEXPLORE.EXE: 28%
WINWORD.EXE: 9%
OUTLOOK.EXE: 9%
matlab.exe: 9%
firefox.exe: 5%
Netscp.exe: 4%
POWERARC.EXE: 4%
POWERPNT.EXE: 3%
NOTEPAD.EXE: 2%
netscape.exe: 2%
explorer.exe: 2%
thunderbird.exe: 2%
MSIMN.EXE: 2%
Acrobat.exe: 2%
mozilla.exe: 2%
CDCopier.exe: 1%
Win-IR Pro.exe: 1%
DCPlusPlus.exe: 1%
MSACCESS.EXE: 1%
hp precisionscan pro.exe: 1%
allegro-ansi.exe: 1%
AdDestroyer.exe: 1%
winamp.exe: 1%
wmplayer.exe: 1%
Illustrator.exe: 1%
unison.win32-gtkui.exe: 1%
WeMail32.exe: 1%
wiaacmgr.exe: 1%
EXCEL.EXE: 1%
apps causing < 1% of hangs each: 7%
Crashes occur roughly as much in 3rd party dlls as MS dlls - TRUE
Component | Description | Author | Apps invoking component | % crash
ntdll.dll | NT system functions | MS | Internet Explorer, Matlab | 12.28
simpl_fox_gl.exe | User application | 3rd party | -- | 4.80
msvcrt.dll | Microsoft C runtime library | MS | Acrobat, Netscape | 4.03
pdm.dll | Scripting component functions | MS | Visual Studio, Internet Explorer | 3.07
powerarc.exe | Power Archiver | 3rd party | Power Archiver | 2.88
acrord32.exe | Acrobat Reader | 3rd party | Acrobat Reader | 2.69
ray_tracing.exe | User application | 3rd party | -- | 2.69
exceed.exe | Exceed X Windows | 3rd party | Exceed | 2.50
rpcl3260.dll | Real Player component | 3rd party | Real Player | 2.50
user32.dll | Communication, message handler, timer functions | MS | Firefox, Internet Explorer | 2.50
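
The "roughly as much" claim can be checked against the ten components listed in this table by summing the % crash column per author. The sketch below does only that arithmetic; the function name is hypothetical, and the totals cover just these top ten components, not the full data set.

```python
# (component, author, % crash) rows copied from the table above.
rows = [
    ("ntdll.dll", "MS", 12.28),
    ("simpl_fox_gl.exe", "3rd party", 4.80),
    ("msvcrt.dll", "MS", 4.03),
    ("pdm.dll", "MS", 3.07),
    ("powerarc.exe", "3rd party", 2.88),
    ("acrord32.exe", "3rd party", 2.69),
    ("ray_tracing.exe", "3rd party", 2.69),
    ("exceed.exe", "3rd party", 2.50),
    ("rpcl3260.dll", "3rd party", 2.50),
    ("user32.dll", "MS", 2.50),
]

def crash_share_by_author(rows):
    """Sum the % crash column per author, rounded to two decimals."""
    totals = {}
    for _component, author, pct in rows:
        totals[author] = totals.get(author, 0.0) + pct
    return {author: round(total, 2) for author, total in totals.items()}

print(crash_share_by_author(rows))
```

For these rows the Microsoft components sum to 21.88% and the third-party components to 18.06%, which is in line with the slide's conclusion.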
DLLs: Model citizens and bad apples


227 dlls used by 33 apps analyzed
37 of these dlls caused crashes in our data set



Worst offenders (widely used and frequent causes of crashes): ntdll.dll, kernel32.dll, msvcrt.dll
190 model citizens
Random sample of (almost) universally used dlls that have not produced crashes:
CFGMGR32.DLL
NETAPI32.DLL
RPCRT4.DLL
DBGHELP.DLL
PRINTUI.DLL
USERENV.DLL
MSIMG32.DLL
RASMAN.DLL
VERSION.DLL
MSSIGN32.DLL
REGAPI.DLL
WMI.DLL
Crash-causing Status Code
NTSTATUS Code | Description | Count
0xc0000005 | ACCESS_VIOLATION | 525
0xcfffffff | HANG | 443
0xc0000006 | IN_PAGE_ERROR | 12
0xc0000096 | STATUS_PRIVILEGED_INSTRUCTION | 9
0xD1, 0xA, 0xC1, 0x8E | DRIVER_FAULT | 9
0x80000003 | STATUS_BREAKPOINT | 5
0xc000001d | STATUS_ILLEGAL_INSTRUCTION | 4
0xc0000094 | STATUS_INTEGER_DIVIDE_BY_ZERO | 3
0xc0000091 | STATUS_FLOAT_OVERFLOW | 2
0xe06d7363, 0xeedfade | “Trappable error in external object” | 4
0xc000001e | STATUS_INVALID_LOCK_SEQUENCE | 1
0xc0000090 | STATUS_FLOAT_INVALID_OPERATION | 1
0xc015000f | STATUS_SXS_EARLY_DEACTIVATION | 1
0xc0150010 | STATUS_SXS_INVALID_DEACTIVATION | 1
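
Most codes in this table follow the 32-bit NTSTATUS layout, in which the top two bits encode severity and bit 29 marks codes not defined by Microsoft. The sketch below decodes just those two fields; the function names are hypothetical. It should not be applied to the short DRIVER_FAULT values (0xD1, 0xA, 0xC1, 0x8E), which are system bug check codes rather than NTSTATUS values, nor to 0xeedfade, which does not follow the layout.

```python
SEVERITY = {0: "success", 1: "informational", 2: "warning", 3: "error"}

def ntstatus_severity(code: int) -> str:
    """Severity from bits 31-30 of a 32-bit NTSTATUS value."""
    return SEVERITY[(code >> 30) & 0x3]

def is_customer_defined(code: int) -> bool:
    """Bit 29 is set for codes defined outside Microsoft's own code space."""
    return bool(code & 0x20000000)

for code in (0xC0000005, 0x80000003, 0xE06D7363):
    print(hex(code), ntstatus_severity(code), is_customer_defined(code))
```

Under this layout STATUS_BREAKPOINT (0x80000003) is a warning rather than an error, and 0xe06d7363 is an error-severity code with the customer bit set.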
Crash Behavior on a Machine
[Figure: timeline of crashes on one machine, with one row each for App1, App2, and App3; x-axis: time]
Parameters for clustering

Time since previous crash normalized by average time between crashes on machine
Time since previous cluster normalized by average time between clusters on machine
Size of crash cluster
Time between events in a crash cluster
Frequency of crash normalized by frequency of usage
Complexity of application (# of dlls)
Specific dlls used
Crashing dlls for app
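
Two of the temporal parameters above can be sketched for a single machine as follows: the time since the previous crash normalized by that machine's average inter-crash time, and the grouping of crashes into clusters. The slides do not say how a crash cluster is delimited, so the 10-minute gap threshold here is purely an assumption, as are the function names and the sample times.

```python
from datetime import datetime, timedelta

def normalized_gaps(times):
    """Each inter-crash gap divided by the machine's mean inter-crash gap."""
    times = sorted(times)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    if not gaps:
        return []
    mean_gap = sum(gaps) / len(gaps)
    if mean_gap == 0:
        return [0.0] * len(gaps)
    return [gap / mean_gap for gap in gaps]

def crash_clusters(times, max_gap=timedelta(minutes=10)):
    """Group crashes so consecutive events in a cluster are at most max_gap apart."""
    clusters = []
    for t in sorted(times):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters

# Hypothetical crash times for one machine.
times = [
    datetime(2004, 6, 17, 1, 33),
    datetime(2004, 6, 17, 1, 39),
    datetime(2004, 6, 17, 3, 0),
]
print(normalized_gaps(times))
print([len(cluster) for cluster in crash_clusters(times)])
```

Cluster size and the time between events in a cluster fall out of the `crash_clusters` result directly; the time between clusters could be normalized the same way as the gaps between individual crashes.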
Challenges of non-numeric data

Meaningless to instantiate “average” data point for many data types:
Strings: discrete values, and order irrelevant to domain structure
Sets (e.g. of DLL names): no natural ordering
Version numbers: order irrelevant to domain structure
Hexadecimal error codes: non-ordinal when used as crash labels

Non-numeric components require “unnatural” distance measure:
Strings: d(s1, s2) = 0 iff s1 = s2; d(s1, s2) = 1 otherwise
Sets: d(s1, s2) = |(s1 \ s2) ∪ (s2 \ s1)| (size of symmetric difference)
Version numbers: Component-wise difference scaled so difference between minor version numbers doesn’t dominate difference between major version numbers
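
The three distance measures can be sketched as below. The string and set distances follow the definitions on the slide directly. The slide does not give the exact scaling for version numbers, so the scheme here is only one possible choice and is an assumption: each field's difference d is squashed to d / (1 + d), and each successive field is weighted 1/10 of the previous one, so a difference in an earlier field always outweighs any differences in later fields. Dotted integer version strings are also assumed.

```python
def string_distance(a: str, b: str) -> int:
    """0 if the strings are identical, 1 otherwise."""
    return 0 if a == b else 1

def set_distance(a: set, b: set) -> int:
    """Size of the symmetric difference of the two sets."""
    return len(a ^ b)

def version_distance(a: str, b: str, base: float = 10.0) -> float:
    """Field-wise distance between dotted versions, earlier fields dominating."""
    pa = [int(x) for x in a.split(".")]
    pb = [int(x) for x in b.split(".")]
    # Pad the shorter version with zeros so the fields line up.
    n = max(len(pa), len(pb))
    pa += [0] * (n - len(pa))
    pb += [0] * (n - len(pb))
    total = 0.0
    for i, (x, y) in enumerate(zip(pa, pb)):
        d = abs(x - y)
        # Squash to [0, 1), then down-weight later (more minor) fields.
        total += (d / (1 + d)) / base**i
    return total

print(string_distance("ntdll.dll", "msvcrt.dll"))
print(set_distance({"ntdll.dll", "msvcrt.dll"}, {"ntdll.dll", "mshtml.dll"}))
print(version_distance("0.8.0.0", "0.9.0.0"))
```

With base 10, the smallest possible major-version difference contributes 0.5, while all later fields together contribute less than 0.12, so minor differences cannot dominate major ones.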
Sample Application Clusters
Microsoft Apps
Application
Component
iexplore.exe
mshtml.dll
explorer.exe
Netscape
Custom-written Apps
Application
Component
Application
Component
netscp.exe
gklayout.dll
dialogeditor.exe
dialogeditor.exe
shimgvw.dll
msgimap.dll
simpl_fox_gl.exe
ntdll.dll
netscp.exe
simpl_fox_gl.exe
iexplore.exe
explorer.exe
shell32.dll
ADPS_ProjectBT3.exe
ADPS_ProjectBT3.exe
netscp.exe
gkplugin.dll
mmc.Exe
comctl32.dll
unison.win32-gtkui.exe
Hungapp
OUTLOOK.EXE
FDATE.DLL
Netscp.exe
xpcom.dll
wfxctl32.exe
wfxut32i.dll
WINWORD.EXE
WINWORD.EXE
tphkmgr.exe
tphkmgr.exe
iexplore.exe
BROWSEUI.DLL
stratagus.exe
stratagus.exe
WINWORD.EXE
WINWORD.EXE
usrtogrp.exe
ntdll.dll
EXCEL.EXE
ntdll.dll
model_ir.exe
model_ir.exe
OUTLOOK.EXE
OUTLLIB.DLL
WINWORD.EXE
Mozilla Apps
Application
Component
thunderbird.exe
Hungapp
ray_tracing.exe
ray_tracing.exe
MSO.DLL
thunderbird.exe
gklayout.dll
FlexPDE4.exe
FlexPDE4.exe
notepad.exe
comctl32.dll
thunderbird.exe
Unknown
FlexPDE4.exe
Hungapp
POWERPNT.EXE
MSO.DLL
thunderbird.exe
thunderbird.exe
allegro-ansi.exe
Hungapp
POWERPNT.EXE
ole32.dll
canvas5.exe
canvas5.exe
ntvdm.exe
ntdll.dll
firefox.exe
firefox.exe
ntvdm.exe
wow32.dll
firefox.exe
Hungapp
firefox.exe
kernel32.dll
Other clusters

Alisp and Firefox (!)


Matlab and Internet Explorer


some crashes in both caused by same DLL
(but different error codes)
similar DLL sets: Matlab = IE + LAPACK?
Internet apps

E.g. IE, Netscape, Y! Messenger, Outlook,
Acrobat Reader
Clustering lessons

Trustworthy patterns require huge amounts of structured data
Tradeoff between introducing domain knowledge and biasing results to confirm hypotheses
Most crash histories machine/user-specific
Need several instances of every (application, component, error code) combo
Configuration management more important than blaming (insert victim here)?
Crash patterns are not usage patterns
Conclusions

Challenge of obtaining relevant data
Need continuous description of failure info
Hard to convince folks to let us collect data, primarily because of privacy concerns
predominant in the home user and academic communities
less so in a corporate setting - better computer etiquette
Need more volunteers to share data with us!
BOINC
Industry