
Analyzing Large Data Sets in Astronomy
Alex Szalay, Jim Gray
Patterns of Scientific Progress
Observational Science
Scientist gathers data by direct observation
Scientist analyzes data
Analytical Science
Scientist builds analytical model
Makes predictions
Computational Science
Scientist simulates analytical model
Validates model and makes predictions
Data Exploration Science
Data captured by instruments
Or data generated by simulator
Processed by software
Placed in a database / files
Scientist analyzes database / files
Gray and Szalay, Communications of the ACM (2002)
Living in an Exponential World
Astronomers have a few hundred TB now
1 pixel (byte) / sq arc second ~ 4 TB
Multi-spectral, temporal, … → 1 PB
They mine it looking for
new (kinds of) objects,
more of interesting ones (quasars),
density variations in 400-D space,
correlations in 400-D space
Data doubles every year, public after 1 year
So, 50% of the data is public
Same trend appears in all sciences
[Figure: data volume vs. year, 1970–2000, log scale 0.1–1000: CCDs overtaking Glass plates]
The Challenges
Data Collection
Exponential data growth:
Distributed collections
Soon Petabytes
Discovery and Analysis
New analysis paradigm:
Data federations,
Move analysis to data
Publishing
New publishing paradigm:
Scientists are publishers
and curators
Making Discoveries
Where are discoveries made?
At the edges and boundaries
Going deeper, collecting more data, using more colors….
Metcalfe’s law
Utility of a computer network grows as the
number of possible connections: O(N²)
Szalay’s data federation law
Federation of N archives has utility O(N²)
Possibilities for new discoveries grow as O(N²)
Current sky surveys have proven this
Very early discoveries from SDSS, 2MASS, DPOSS
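The O(N²) in both laws is simply the count of distinct pairs among N nodes. A minimal sketch (function name is hypothetical):

```python
def pairwise_links(n):
    # Distinct pairs among n nodes (computers, or federated archives):
    # n*(n-1)/2, which grows as O(n^2)
    return n * (n - 1) // 2

# Doubling the number of federated archives roughly quadruples
# the cross-match combinations where new discoveries can be made.
for n in (5, 10, 20):
    print(n, pairwise_links(n))  # 10, 45, 190
```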
Data Analysis Today
Download (FTP and GREP) is not adequate
You can GREP 1 MB in a second
You can GREP 1 GB in a minute
You can GREP 1 TB in 2 days
You can GREP 1 PB in 3 years
Oh, and 1 PB is ~10,000 disks
At some point we need
indices to limit search
parallel data search and analysis
This is where databases can help
Next generation technique: Data Exploration
Bring the analysis to the data!
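The GREP figures above are simple scan-time arithmetic; they imply a sequential read rate of a few to ten MB/s per disk. A sketch with an assumed 10 MB/s rate (the rate is my assumption, not the slide's):

```python
def scan_time_s(data_bytes, rate_bytes_per_s):
    # Time to sequentially scan (GREP) a data set at a fixed disk rate
    return data_bytes / rate_bytes_per_s

RATE = 10e6  # assumed ~10 MB/s single-disk scan rate, era-appropriate
for label, size in [("1 MB", 1e6), ("1 GB", 1e9), ("1 TB", 1e12), ("1 PB", 1e15)]:
    t = scan_time_s(size, RATE)
    print(f"{label}: ~{t:.0f} s (~{t / 86400:.2f} days)")
```

At 10 MB/s, 1 PB takes about 1e8 seconds, roughly 3 years, which is why indices and parallel search become unavoidable.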
Next-Generation Data Analysis
Looking for
Needles in haystacks – the Higgs particle
Haystacks: Dark matter, Dark energy
Needles are easier than haystacks
Global statistics have poor scaling
Correlation functions are N², likelihood techniques N³
As data and computers grow at the same rate,
we can only keep up with N log N
A way out?
Discard notion of optimal (data is fuzzy, answers are approximate)
Don’t assume infinite computational resources or memory
Requires combination of statistics & computer science
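One concrete form of "discard optimal, accept approximate answers": estimate an O(N²) pairwise statistic from m random pairs, paying O(m) cost for a ~1/√m statistical error. A toy 1-D sketch, all names hypothetical:

```python
import random

def mean_pair_separation_exact(xs):
    # Exact O(N^2) double loop over all distinct pairs
    n = len(xs)
    return sum(abs(xs[i] - xs[j])
               for i in range(n) for j in range(i + 1, n)) / (n * (n - 1) / 2)

def mean_pair_separation_sampled(xs, m, rng):
    # Approximate the same statistic from m random pairs: cost O(m),
    # statistical error ~ 1/sqrt(m) -- tolerable if below cosmic variance
    n = len(xs)
    total = 0.0
    for _ in range(m):
        i, j = rng.randrange(n), rng.randrange(n)
        while j == i:
            j = rng.randrange(n)
        total += abs(xs[i] - xs[j])
    return total / m

rng = random.Random(1)
pts = [rng.random() for _ in range(400)]
exact = mean_pair_separation_exact(pts)
approx = mean_pair_separation_sampled(pts, 5000, rng)
print(exact, approx)  # the two agree to a few parts in a hundred
```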
Why Is Astronomy Special?
It has no commercial value
No privacy concerns, freely share results with others
Great for experimenting with algorithms
It is real and well documented
High-dimensional (with confidence intervals)
Spatial, temporal
Diverse and distributed
Many different instruments from
many different places and
many different times
The questions are interesting
There is a lot of it (soon Petabytes)
The Virtual Observatory
Many new surveys are coming
SDSS is a dry run for the next ones
LSST will be 5TB/night
All the data will be on the Internet
ftp, web services…
Data and applications will be
associated with the projects
Distributed world wide, cross-indexed
Federation is a must
Will be the best telescope in the world
World Wide Telescope
Finds the “needle in the haystack”
Successful demonstrations in Jan’03
Boundary Conditions
Standards driven by evolving new technologies
Exchange of rich and structured data (XML…)
DB connectivity, Web Services, Grid computing
Application to astronomy domain
Data dictionaries (UCDs)
Data models
Protocols
Registries and resource/service discovery
Provenance, data quality
Dealing with the astronomy legacy
FITS data format
Software analysis systems
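FITS, the legacy format mentioned above, stores metadata as fixed 80-character ASCII header "cards". A minimal, hypothetical parser sketch to show the layout (real pipelines use a FITS library):

```python
def parse_fits_cards(header_bytes):
    # FITS headers are 80-character ASCII cards: columns 1-8 hold the
    # keyword, "= " in columns 9-10 marks a value, "/" starts a comment.
    cards = []
    for i in range(0, len(header_bytes), 80):
        card = header_bytes[i:i + 80].decode("ascii")
        keyword = card[:8].strip()
        if keyword == "END":          # END card terminates the header
            break
        if card[8:10] == "= ":
            value = card[10:].split("/")[0].strip()
            cards.append((keyword, value))
    return cards

# Two cards plus END, each padded to 80 characters
hdr = b"".join(c.encode("ascii").ljust(80) for c in [
    "SIMPLE  =                    T / conforms to FITS standard",
    "BITPIX  =                   16 / bits per pixel",
    "END",
])
print(parse_fits_cards(hdr))  # [('SIMPLE', 'T'), ('BITPIX', '16')]
```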
Short History of the VO
Driven by exponential data growth
In the US it started with SDSS + GriPhyN
In Europe it started at CDS (Strasbourg)
Continued with NVO + AVO
Now: International Virtual Observatory Alliance
Now in 14 countries
Total data holdings >200TB
Core services and standards adopted
Getting ready for first deployment (mid-2004)
Data Analysis - Optimal Statistics
Brute-force examples for optimal statistics have poor scaling
Correlation functions N², likelihood techniques N³
As data sizes grow at Moore’s law rates, computers can only
keep up with at most N log N algorithms
What goes?
Notion of optimal is in the sense of statistical errors
Assumes infinite computational resources
Assumes that the only source of error is statistical
‘Cosmic Variance’: we can only observe the Universe from one location
(finite sample size)
Solutions require combination of Statistics and CS
New algorithms: not worse than N log N
Organization & Algorithms
Use of clever data structures (trees, cubes):
Up-front creation cost, but only N log N access cost
Large speedup during the analysis
Tree-codes for correlations (A. Moore et al 2001)
Data Cubes for OLAP (all vendors)
Fast, approximate heuristic algorithms
No need to be more accurate than cosmic variance
Fast CMB analysis by Szapudi et al (2001)
N log N instead of N³ → 1 day instead of 10 million years
Take cost of computation into account
Controlled level of accuracy
Best result in a given time, given our computing resources
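A toy 1-D version of the tree-code idea: with a sorted structure, counting pairs within separation r costs O(N log N) instead of the naive O(N²) double loop. Illustrative only, and my own construction; the actual tree codes of Moore et al work in higher dimensions:

```python
import random
from bisect import bisect_right

def pair_count_sorted(xs, r):
    # Sort once (O(N log N)); for each point, binary-search the rightmost
    # neighbor within r (O(log N) each) instead of scanning all pairs
    xs = sorted(xs)
    return sum(bisect_right(xs, x + r) - (i + 1) for i, x in enumerate(xs))

def pair_count_naive(xs, r):
    # Baseline O(N^2) double loop over all distinct pairs
    return sum(abs(xs[i] - xs[j]) <= r
               for i in range(len(xs)) for j in range(i + 1, len(xs)))

rng = random.Random(0)
pts = [rng.random() for _ in range(500)]
print(pair_count_sorted(pts, 0.01) == pair_count_naive(pts, 0.01))  # True
```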
Analysis and Databases
Much statistical analysis deals with
Creating uniform samples –
data filtering
Assembling relevant subsets
Estimating completeness
censoring bad data
Counting and building histograms
Generating Monte-Carlo subsets
Likelihood calculations
Hypothesis testing
Traditionally these are performed on files
Most of these tasks are much better done inside a database
Move Mohamed to the mountain, not the mountain to Mohamed
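The "inside the database" point in miniature, here with SQLite as a stand-in engine (the table and column are hypothetical): counting and histogramming become one GROUP BY query executed where the data lives, so only the histogram, not the raw rows, leaves the server.

```python
import random
import sqlite3

# Hypothetical 'objects' table with one magnitude column
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (mag REAL)")
rng = random.Random(42)
conn.executemany("INSERT INTO objects VALUES (?)",
                 [(14.0 + 8.0 * rng.random(),) for _ in range(10000)])

# Histogram of magnitudes in 1-mag bins, computed inside the database
hist = conn.execute(
    "SELECT CAST(mag AS INTEGER) AS bin, COUNT(*) FROM objects "
    "GROUP BY bin ORDER BY bin").fetchall()
print(hist)  # eight (bin, count) rows for magnitudes 14..21
```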
Cosmic Microwave Background
Szapudi et al 2002
Data Exploration:
A New Way of Doing Science
Primary access to data is through databases
Exponential data growth – distributed data
Publication before analysis
Large data: move analysis to where data is
Distributed computing – data federation
New algorithms are needed
The Virtual Observatory is a good example
Unavoidable, emerging in all sciences!