Opportunities
in Statistical Software:
Phystat Workshop
Jim Linnemann
MSU
March 1, 2004
Preliminaries
• Be sure to get a parking permit from
Lorie Neuman (room 4218, X 2180)
• Wireless: Tom Rockwell can help if you
can’t get access; you should just get a
direct connection to outside world
– DHCP with an address starting with 10.
– If you need to print something, email to
• [email protected]
• Introductions
Why you?
• You—developers—can actually change things!
• I would personally like a better analysis
environment for HEP.
• I keep hearing about R from statisticians!
• I am convinced astronomers and HEP together
will get something better than either has alone.
• And maybe we will have some things that
statisticians can use, too.
– Suggested to Brad Efron using arxiv.org for statistics
• I subscribe to the “right people in a room” theory.
What Can We Accomplish?
• We won’t convince anyone to drop what they do now
and adopt product xxx instead!
• But we might benefit from seeing different
development cultures, work styles, or interesting ideas
• We might find ways to make interfaces across projects,
or identify common projects
• If this starts to look interesting, we can spend more
time on sharpening this up
• The “agenda” can be revised at any time!
Sociology
• HEP experiments: own data reduction software (C++)
– Usually develop common tools used by whole collaboration
– Use more generic software as tools, and final data analysis
– Particle Astrophysics similar, but more Fortran/C
• HEP lab-dominated in cross-experiment software
• CERN, Fermilab, SLAC, DESY, KEK, Brookhaven
– Some instances of cross-lab collaboration
– Grid computing is one of few non-lab major software projects
– Some tools are university based (specific simulations)
– Typically free to community, but not GNU…
– Smaller packages: repositories not that well developed
• Not much commercial software
– Office; Mathematica/Maple; some Mathcad/MATLAB/KaleidaGraph
• IDL much less used than in astronomy: not as image-oriented
– LaTeX; Ghostview; gnuplot-like
• Statistics: more distributed?
• Astronomy: more large software grants?
Some Possible Goals
• Repository sponsorship
• Web or Python interfaces to libraries
• Root user package repository?
• Interfaces between R and Root
– GUI for R?
– R scripting in Root? R libraries in Root?
– Handling of larger datasets in R?
HEP Small Packages
• Example: calculation of significance, limits from
observed counts, estimated background,
uncertainties, efficiencies, etc.
• Several competing procedures
– Some are published (PHYSTAT; NIM)
– Standard programs not on public, recognized web sites:
know the author, or someone in collaboration
implements and maybe posts or puts in local repository
• Programs not collected by Particle Data Group
– publishes generally-recognized methods review
Questions to see differences:
• Goals + strengths
• What would you like to add next?
• User community: Who? How many? Platforms?
• User interface: GUI, Scripting, Web, link library, code?
• Documentation: how? Quality?
• How big is developer community?
• How are contributions made/tested/integrated?
• Releases and bug tracking mechanisms
• Implementation language(s)
• Licensing/distribution
Proposed Presentations
• Rene Brun: Root; data mining in HEP
• Eric Feigelson: VOSTATS; R in astronomy?
• Luke Tierney: R (and omegastats?)
• Who?: Frustrating Examples
• Sherry Towers: TerraFerMA; classification in HEP
• Adam Lyon: Using R in HEP
• Scott Snyder: Alternative Root Interfaces
• Tim Beers: Rostat; robust legacy code
• Right Order? Space out or bunch?
• First pass quickly to survey, then reconsider?
• Discussion during presentation or after?
Other possible activities
• Discussion/panel:
– What do users want?
– How could projects reinforce one another?
– Selecting achievable goals
– What are options for Fermilab projects?
• Technical Working Group(s)
– Specifics, e.g. root/R interface (brass tacks)
– Planning of joint projects?
– Planning of further workshops?
• Developer or user oriented?
• Post Talks to web?
– Semi-private (developer use)?
– Or public, with publicity to users
Some projects that got away
Particularly Python-based
• StatPy: Tom Loredo
• Python interface to Root: Harrison Prosper
• Orange and related (Python): Aleks Jakulin
• Jas: Java analysis framework
Restaurant: Villegas 6:15pm
N. to Grand River; E 3.2 mi. past Okemos Rd, Marsh Rd
1735 W Gr River, 347-2080 (on right before Dobie)
Central Park
BMPS
Dessert: Jim & Ruth Linnemann
1217 Ascot Pl 349-6138
Continue E (right) on Grand River
Right at Ascot Place (3rd right; 2 miles or so)
Left at Cornell Rd (1 mi)
1st drive on right of Ascot
Example 1: 2 sample classification
• Plot signal efficiency vs background rejection curves (ROC)
• Selection based on a set of variables (or combinations of variables).
• Click on efficiency value to find value selection criterion in original variables.
• Superimpose curves for several candidate variable selections.
• Data:
– Look in a coordinated fashion at two separate data sets with related but nonidentical data structures
• HEP data usually tree-structured:
– many instances, each including variable number of lower-level objects
– Typically 2 or more levels down
• I might analyze these by forming a variable number of derived variables from the low level objects.
• Much of this process is algorithmic, but I wind up re-doing it by hand each time I try it.
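The efficiency vs. rejection curve in this example can be sketched in a few lines. This is a minimal illustration, assuming a single discriminating variable and a simple "value > threshold" selection; the function names `roc_points` and `cut_for_efficiency` and the toy numbers are hypothetical, not part of Root or R.

```python
def roc_points(signal, background, thresholds):
    """For each threshold t, keep events with value > t and return
    (t, signal efficiency, background rejection)."""
    points = []
    for t in thresholds:
        eff = sum(1 for x in signal if x > t) / len(signal)
        rej = sum(1 for x in background if x <= t) / len(background)
        points.append((t, eff, rej))
    return points


def cut_for_efficiency(points, target_eff):
    """Return the curve point whose signal efficiency is closest to
    target_eff, i.e. map a chosen efficiency back to a cut value."""
    return min(points, key=lambda p: abs(p[1] - target_eff))


# Toy values of one selection variable (illustrative numbers only).
signal = [0.9, 0.8, 0.75, 0.6, 0.4]
background = [0.1, 0.2, 0.3, 0.5, 0.7]

points = roc_points(signal, background, [0.0, 0.25, 0.5, 0.65, 0.85])
chosen = cut_for_efficiency(points, 0.8)
print(chosen)  # (threshold, efficiency, rejection) nearest to 80% efficiency
```

Superimposing several candidate selections would amount to calling `roc_points` once per candidate variable and plotting the resulting lists together; the "click on the curve" step corresponds to `cut_for_efficiency`.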
Ex 2: No integrated repository
• End of an analysis: sample of data events, and an expected set of possible backgrounds, each with an uncertainty.
• Want to calculate a statistical significance (or 90% CL) for these.
• Usually have to extract these numbers and then find a completely separate piece of software, either in someone's private area, or on the web, or if I'm really lucky, in a macro someone's written.
• There aren't good central mechanisms (repositories or interactive web sites) for sharing such algorithms, either.
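As an illustration of the kind of small calculation meant here, the sketch below turns an observed count and an expected background into a p-value and a Gaussian-equivalent significance. It is only one of the several competing procedures, under stated assumptions: a one-sided Poisson tail probability, with the background uncertainty (if any) handled by averaging over a Gaussian-smeared background truncated at zero. The names `poisson_tail` and `significance` and all parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist


def poisson_tail(n_obs, mean):
    """P(N >= n_obs) for a Poisson distribution with the given mean."""
    if n_obs <= 0:
        return 1.0
    term = math.exp(-mean)          # P(N = 0)
    cdf = term
    for k in range(1, n_obs):       # accumulate P(N = 1) ... P(N = n_obs - 1)
        term *= mean / k
        cdf += term
    # Direct subtraction loses precision for very small tails;
    # acceptable for a sketch.
    return max(0.0, 1.0 - cdf)


def significance(n_obs, bkg, bkg_sigma=0.0, n_toys=20000, seed=1):
    """Return (p-value, Z). With bkg_sigma > 0 the p-value is averaged
    over background means drawn from a Gaussian truncated at zero
    (an assumed treatment, not a standard prescription)."""
    if bkg_sigma == 0.0:
        p = poisson_tail(n_obs, bkg)
    else:
        rng = random.Random(seed)
        total, used = 0.0, 0
        while used < n_toys:
            b = rng.gauss(bkg, bkg_sigma)
            if b < 0.0:
                continue            # discard unphysical negative backgrounds
            total += poisson_tail(n_obs, b)
            used += 1
        p = total / n_toys
    # One-sided Gaussian equivalent; undefined if p is exactly 0 or 1.
    z = -NormalDist().inv_cdf(p) if 0.0 < p < 1.0 else float("nan")
    return p, z


p_fixed, z_fixed = significance(10, 3.0)
p_smeared, z_smeared = significance(10, 3.0, bkg_sigma=1.0, n_toys=5000)
print(p_fixed, z_fixed, p_smeared, z_smeared)
```

Comparing the two calls shows how much an assumed background uncertainty weakens the result; a shared repository would let such a routine be checked once rather than rewritten per analysis.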
Ex 3: New Statistical Methods
• While the environment I'm used to is good at exploring and fitting
large data sets, the number of statistical methods that are part of
that framework is limited.
• I'd like to be able to apply many of the tests I might find in a textbook
to comparing two distributions.
• Or I’d like to perform bootstrap calculations or “ensemble tests”
without writing from scratch a “toy Monte Carlo”: to identify the
statistical uncertainty of my fitting results with simulated experiments.
• These tests exist in R, but my data is in Root.
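The bootstrap idea mentioned above can be sketched without a hand-written toy Monte Carlo. This is a minimal example, assuming the "fit" is stood in for by any function of the sample (here the mean); `bootstrap_uncertainty` and its parameters are hypothetical names, and the generated data are purely illustrative.

```python
import random
import statistics


def bootstrap_uncertainty(data, estimator, n_resamples=1000, seed=0):
    """Estimate the statistical uncertainty of estimator(data) by
    resampling the data with replacement and taking the spread of
    the re-computed estimates."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)


# Illustrative data: 400 Gaussian values with mean 10 and width 2.
gen = random.Random(42)
data = [gen.gauss(10.0, 2.0) for _ in range(400)]

err = bootstrap_uncertainty(data, statistics.fmean)
print(err)  # should be close to 2 / sqrt(400) = 0.1
```

Replacing `statistics.fmean` with a function that performs a real fit and returns the fitted parameter gives the uncertainty of that parameter by the same mechanism.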
Root: key features
• GUI for presentation graphics and selection
(“cuts”)
• I/O for tree-structured data: scales to
petabytes
• Histogram as base metaphor (akin to vector)
• Sophisticated nonlinear fitting
• C++ at command line, macros, compiled
macros
R: key features
• Elegant data manipulation: S language:
– command prompt and macros
– interpreted, heading to byte-compilation
– GUI: only now building hooks
– most users satisfied with command line
– Standard tool of professional research statisticians
• Sophisticated graphics
– standard statistical plots not used in HEP
– missing histograms with error bars
– Links to further multidimensional graphics (Ggobi)
• Data in virtual memory
– Data frames: vectors are a basic metaphor (cf. histogram in Root)
– interfaces to databases (postgres; mysql)
– Parallel computation under development
• Broad package library, with trivial download
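The "histogram with error bars" that HEP users miss in R's standard graphics is, at its simplest, a vector of bin counts with a per-bin uncertainty. The sketch below assumes the common convention of sqrt(N) Poisson errors and equal-width bins; `histogram_with_errors` is a hypothetical helper, not a Root or R function.

```python
import math


def histogram_with_errors(values, lo, hi, n_bins):
    """Bin values into n_bins equal-width bins on [lo, hi) and return
    a list of (bin center, count, sqrt(count)) tuples."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v < hi:
            # min() guards against rounding at the upper edge
            index = min(int((v - lo) / width), n_bins - 1)
            counts[index] += 1
    return [(lo + (i + 0.5) * width, c, math.sqrt(c))
            for i, c in enumerate(counts)]


bins = histogram_with_errors([0.1, 0.2, 0.6, 0.7, 0.8, 0.9, 1.5], 0.0, 1.0, 2)
for center, count, error in bins:
    print(center, count, error)
```

The returned list is already in the "vector" form that fits R's data-frame metaphor, which is one way the two base metaphors could meet.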