Components of a Data Analysis System

Scientific Drivers in the Design of an Analysis System
Data Import
• Format
– Either widely used/accepted, or
– Can be converted easily from something widely used
– User need not know the details of the format
– Well documented (e.g., which flavor of latitude).
• Fast Access
– Disk I/O speeds do not follow Moore’s law
– Read speed is more important than write speed
– Caching
– File size is only important to keep access times low
• Content must represent the details of the data
• E2E - Full intent of the observer must be embedded
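A minimal Python sketch of the import layer these bullets describe: the user asks for scans through one call and never touches the underlying file format, while a cache keeps repeated reads fast. The class, record layout, and values are invented for illustration.

```python
class ScanReader:
    """Illustrative import layer: callers request scans by number and never
    see the on-disk format. A cache makes repeated reads cheap."""

    def __init__(self, raw_rows):
        # raw_rows stands in for decoded file records: (scan_number, sample) pairs
        self._raw = raw_rows
        self._cache = {}
        self.disk_reads = 0       # counts simulated slow reads

    def get_scan(self, scan):
        if scan not in self._cache:
            self.disk_reads += 1  # would be a real (slow) disk read
            self._cache[scan] = [v for s, v in self._raw if s == scan]
        return self._cache[scan]

reader = ScanReader([(1, 0.5), (1, 0.7), (2, 0.9)])
reader.get_scan(1)
reader.get_scan(1)  # served from the cache; disk_reads stays at 1
```
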
Data Export
• Format
– Either widely used/accepted, or
– Can be converted easily into something widely used
– User need not know the details of the format
– Well documented (e.g., which flavor of latitude).
• You can read what you write
– Import format == Export format
• Fast Access
– Disk I/O speeds do not follow Moore’s law
– Read speed is more important than write speed
• Content must represent the details of the data
• E2E - Full intent of the observer must be embedded.
• Includes user annotation/comments
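"You can read what you write" can be sketched as a round trip in which user comments travel inside the same container as the data. JSON stands in here for the real on-disk format (e.g. SDFITS); the field names are invented.

```python
import json

def export_scans(scans, comments):
    # One container carries the data and the user's annotation together.
    return json.dumps({"scans": scans, "comments": comments})

def import_scans(blob):
    doc = json.loads(blob)
    return doc["scans"], doc["comments"]

blob = export_scans({"1": [0.5, 0.7]}, ["baseline looks flat"])
scans, comments = import_scans(blob)   # identical to what was written
```
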
Data Base System
• Ability to work with more than one data set
• Data base for both export and import files
• Large data volumes
– Access using scan numbers is no longer sufficient
– Require the ability to select subsets of data via sophisticated
data-base queries
– Moderate number of columns in data base index
– ‘Index’ to data kept in memory to speed data access
– File summaries at various levels of detail
• Various levels of ‘granularity’
• Calibrated and raw data
• E2E - User can add annotation/comments
• Security – Only the observer can access data
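The query-driven selection these bullets call for might look like the following: an in-memory SQLite index with a moderate number of descriptive columns, queried instead of raw scan numbers. Column names and values are invented for the sketch.

```python
import sqlite3

# In-memory index over the data files: a few descriptive columns per scan.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE idx (scan INTEGER, source TEXT, restfreq REAL, offset INTEGER)")
db.executemany("INSERT INTO idx VALUES (?, ?, ?, ?)", [
    (1, "W3OH",   1.665e9, 0),
    (2, "W3OH",   1.667e9, 4096),
    (3, "OrionA", 1.665e9, 8192),
])

# Select a subset by source and rest frequency rather than by scan number;
# the returned byte offsets would locate the data in the file.
hits = db.execute(
    "SELECT scan, offset FROM idx WHERE source = ? AND restfreq = ?",
    ("W3OH", 1.665e9)).fetchall()
```
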
Data Archive
• Write speed more important than read speed.
• File size is very important
• Cannot anticipate types of user queries
– Large number of columns in data base index
– Very sophisticated/fast RDBMS
• Storage need not be a widely used data format
– Format can be very different from that used by
analysis system.
• Export format should be a widely used data
format
Interactive On-Line Data Analysis
• The ability to access data ASAP
– Import file updates automatically as observations
proceed (real-time “filler”).
– Index to file updates automatically
– Updates happen per ‘integration’ (spectral-line) or per N
seconds (continuum)
– Minimum integration time ~ few times the minimum time
of real-time “filler”
– Analysis system automatically is aware of updated
index.
– Read-protect online/filled data?
• User should be able to ‘see’ the data within an
‘integration’ of when it was taken (or N seconds).
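One way to sketch the real-time "filler" contract described above: the filler appends each integration and bumps an index version, and the analysis session polls that version to pick up only the new data. Class and attribute names are invented.

```python
class Filler:
    """Appends integrations as observations proceed and keeps the index current."""
    def __init__(self):
        self.integrations = []
        self.index_version = 0      # cheap for an analysis session to poll

    def add_integration(self, spectrum):
        self.integrations.append(spectrum)
        self.index_version += 1

class Session:
    """Analysis side: notices the updated index and fetches only fresh data."""
    def __init__(self, filler):
        self.filler = filler
        self.seen = 0

    def new_data(self):
        fresh = self.filler.integrations[self.seen:]
        self.seen = self.filler.index_version
        return fresh

filler = Filler()
session = Session(filler)
filler.add_integration([1.0, 2.0])
filler.add_integration([1.1, 2.1])
first = session.new_data()    # both integrations arrive
second = session.new_data()   # nothing new yet
```
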
User Interface
• Command line
– Familiar syntax better than a good syntax
– Procedural, with byte-code compiling (performance)
– History, min-match or command completion
– Useful error messages
– Interruptible
– Error trapping and exception handling
– Ability to “Undo”
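The minimum-match and "Undo" bullets can be sketched together: an abbreviation must pick out exactly one command, and each command saves state onto an undo stack before running. The two commands are invented purely to show the mechanics.

```python
class CommandLine:
    """Sketch of minimum-match command dispatch with an undo stack."""

    def __init__(self):
        self.value = 0
        self._undo = []
        self._commands = {"increment": self._increment, "invert": self._invert}

    def _increment(self):
        self.value += 1

    def _invert(self):
        self.value = -self.value

    def run(self, word):
        # Minimum match: the abbreviation must match exactly one command.
        matches = [name for name in self._commands if name.startswith(word)]
        if len(matches) != 1:
            raise ValueError(f"ambiguous or unknown command: {word!r}")
        self._undo.append(self.value)   # save state so the step can be undone
        self._commands[matches[0]]()

    def undo(self):
        if self._undo:
            self.value = self._undo.pop()

cli = CommandLine()
cli.run("inc")   # unambiguous abbreviation of 'increment'
cli.run("inv")   # 'in' alone would raise: it matches both commands
cli.undo()       # back to the state before 'invert'
```
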
User Interface
• GUI’s best for:
– Interacting with data visualizations
– Filling in forms
• data base queries
• options for data pipelines
– Browsing for data files
– Defining E2E data flow (à la LabVIEW)
Imaging Tools
• Visualization
– Shouldn’t try to recreate those things already available in
another package – export instead.
• Data Flagging – Pick a system that works
• Graphics
– Traditional capabilities (zoom in/out, scroll, print, save, …)
– Data volume requires great performance, smart libraries
(screen resolution << # data pts)
– Interactive feedback (e.g., defining baseline regions).
• Publishable plots or export into something else?
– Default plot style
– Ability to tweak everything (label formats; char sizes; add,
remove, move annotation; tick mark size; major/minor
ticks, full box; grid; multiple X and Y axes, …..)
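Because screen resolution is far below the number of data points, the plotting layer can decimate before drawing. One common trick, sketched here with invented data, keeps the per-pixel minimum and maximum so narrow spikes are not lost.

```python
def minmax_decimate(data, n_pixels):
    """Reduce 'data' to roughly n_pixels (min, max) pairs, one per screen
    column, so single-channel spikes are still drawn after decimation."""
    step = max(1, len(data) // n_pixels)
    pairs = []
    for i in range(0, len(data), step):
        chunk = data[i:i + step]
        pairs.append((min(chunk), max(chunk)))
    return pairs

spectrum = [0.0] * 100
spectrum[55] = 5.0                      # a one-channel spike
pixels = minmax_decimate(spectrum, 10)
# pixels[5] == (0.0, 5.0): the spike survives the 10x reduction
```
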
Analysis Algorithms
• Algorithms well documented
• Study what exists in other packages.
• Robustness very important but so is speed
– Provide less robust but faster alternatives
• Developers should not force an algorithm on users
• Developers should provide ‘defaults’ only
• Building blocks better than a do-all algorithm.
• Ability to use and modify ‘header’ information as well as data.
• E2E – do-alls are built out of the same building
blocks.
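The building-block idea can be sketched as small, independently callable steps that a "do-all" merely chains together; users call the same blocks directly or substitute their own. The reduction steps below are deliberately crude placeholders, not real algorithms.

```python
def subtract_baseline(spec):
    # Crude placeholder baseline: remove the mean level.
    mean = sum(spec) / len(spec)
    return [v - mean for v in spec]

def smooth(spec):
    # Three-point boxcar; endpoints passed through unchanged.
    if len(spec) < 3:
        return list(spec)
    return ([spec[0]]
            + [(spec[i-1] + spec[i] + spec[i+1]) / 3 for i in range(1, len(spec)-1)]
            + [spec[-1]])

def reduce_spectrum(spec, steps=(subtract_baseline, smooth)):
    """The 'do-all': nothing but the same blocks applied in order,
    and the user can pass a different sequence of steps."""
    for step in steps:
        spec = step(spec)
    return spec

result = reduce_spectrum([1.0, 2.0, 3.0])                  # default recipe
custom = reduce_spectrum([1.0, 2.0, 3.0], steps=(smooth,)) # user's own recipe
```
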
Documentation
• On-line and hardcopy
– Tutorials/Quick Guides
– Cookbook
• Based on observing types
– Reference Manuals
• Full, gory details
• Data Formats
• Algorithms
– Searchable by keywords
• Quick, interactive command help from within the
system.
• Never release until these are in place
User Support/Feedback
• A familiar system minimizes staff support
• Easily accessed, on-line “help desk” and
“Suggestion” box
• Automatic generation of “bug” reports
• Observers of observers
Marketing
• A familiar system already has a market
• Don’t be another cereal on the supermarket shelf
• Workshops are better than papers
• Create a User Community
• Responsive feedback from developers
• Independent Beta testers
• Reputation & first experiences are everything
User Community
• User Forums
• Newsletters
• Accept User Contributions/Additions
– Sourceforge-like system
– NRAO-seal-of-approval
• NRAO Moderator
Real-Time Data Display
• To guarantee data quality
– Product is not stored (except for hardcopy)
– Sequential processing -- different from E2E/Data pipeline
– Fast is more important than accurate
– Few bells and whistles -- must avoid the RTD black hole
– A simple display for all observation types more important than sophisticated displays for a few data types
• Display happens within an ‘integration’ of when data
were taken – tied to real time filler
• GUI based – underlying language is unimportant
• Output understandable by an operator
Real Time Data Analysis
• Pointing/Focus/Tipping/… are different from RTD
– Results should be stored (Data Base)
– Results are used by the control system (pointing/focus)
or by subsequent analysis (tipping)
– Accuracy is as important as speed
– More bells, whistles, user-options
– Sequential processing (non E2E/data pipeline)
– Only a few observation types are handled
• Analysis happens within an ‘integration’ of when
data were taken
• GUI based – underlying language is unimportant
• Output understandable by an operator
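A pointing solution of the kind described here could be as simple as a parabolic refinement of the peak sample in a cross scan; the resulting offset would be stored and handed to the control system. The function and the data values are invented for illustration.

```python
def pointing_offset(offsets, power):
    """Fit a parabola through the strongest sample and its two neighbours
    and return the refined offset of the peak (fast and accurate enough)."""
    i = power.index(max(power))
    if i == 0 or i == len(power) - 1:
        return offsets[i]                         # peak at scan edge: no refinement
    y0, y1, y2 = power[i - 1], power[i], power[i + 1]
    dx = 0.5 * (y0 - y2) / (y0 - 2 * y1 + y2)     # vertex of the parabola
    return offsets[i] + dx * (offsets[i + 1] - offsets[i])

# A slightly asymmetric scan: the true peak sits a bit past the 0-arcsec sample.
correction = pointing_offset([-2.0, -1.0, 0.0, 1.0, 2.0],
                             [1.0, 3.0, 6.0, 5.0, 1.0])
```
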
IDL Work Package
• SDFITS
– Interim solution for data import/export
– Class/IDL specific; soon Aips++/Aips/UniPOPS?
– MD/BDFITS next generation (keywords,
incompleteness of contents, versatility, …)
• IDL – Tom Bania
– Uses UniPOPS as a ‘model’ – familiar to many
– Very good reproduction
– Bania-centric – needs to be generalized
IDL Work Package
• Glen Langston
– Assess whether IDL will meet performance,
extensibility, usability, … goals.
– Generalization to other observing types.
– Real-Time data access and display
– Developed on top of and in parallel with Tom’s
work (so, implementations have diverged)
– Works well for Glen’s own experiments
IDL Work Package
• Institutionalize what Tom and Glen have done
– Code management
– Code review
– Combine Tom’s and Glen’s branches
– Generalize code
– Provide ways for Tom and Glen to contribute within the same revision-control branch.
• Develop ‘Institutionalized’ code
– Improve performance, usability, maintenance
– Add/Replace I/O components with better CS
methods.
Calibration Work Package
• User-tunable algorithms
– Options for the ‘real-time filler’ – sequential
– Options for E2E pipeline – non-sequential
– Options for interactive data reduction
• Default algorithms for all observing cases
• Extensible as new algorithms are
developed
• User-defined/tweaked algorithms
• Robust and not-so-robust algorithms
Calibration Work Package
• Opacity/atmosphere model
• Output units
• Efficiencies
– Source size
– Telescope model
• Tsys(f) estimates
• Differencing schemes
• Non-linearities/template fitting/….
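As a concrete example of one default algorithm in this family, the standard single-dish position-switched calibration, Ta = Tsys * (on - off) / off, can be applied channel by channel. A real package would offer a frequency-dependent Tsys(f) and the alternative differencing schemes listed above; this sketch uses a scalar Tsys and invented values.

```python
def calibrate(on, off, tsys):
    """Position-switched calibration, Ta = Tsys * (on - off) / off, per channel.
    'on' and 'off' are raw power spectra; tsys is a scalar here, though a
    per-channel Tsys(f) is the obvious extension."""
    return [tsys * (a - b) / b for a, b in zip(on, off)]

ta = calibrate([2.0, 3.0], [1.0, 2.0], tsys=20.0)   # antenna temperatures in K
```
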