
Astrophysics with
Terabytes of Data
Alex Szalay
The Johns Hopkins University
Jim Gray
Microsoft Research
Astronomy in an Exponential World
• Astronomers have a few hundred TB now
  – 1 pixel (byte) / sq arc second ~ 4 TB
  – Multi-spectral, temporal, … → 1 PB
• They mine it looking for
  – new (kinds of) objects or more of interesting ones (quasars)
  – density variations in 400-D space
  – correlations in 400-D space
• Data doubles every year
• Same access for everyone
[Chart: exponential growth of CCDs vs. glass, 1970–2000, log scale 0.1–1000]
Data Access is Hitting a Wall
FTP and GREP are not adequate
• You can GREP 1 MB in a second
• You can GREP 1 GB in a minute
• You can GREP 1 TB in 2 days
• You can GREP 1 PB in 3 years
• Oh!, and 1 PB ~ 4,000 disks
• You can FTP 1 MB in 1 sec
• You can FTP 1 GB / min (= 1 $/GB)
  … 2 days and 1K$
  … 3 years and 1M$
• At some point you need
  – indices to limit search
  – parallel data search and analysis
• This is where databases can help
• If there is too much data to move around,
  take the analysis to the data!
• Do all data manipulations at the database
  – Build custom procedures and functions in the database
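As a toy illustration of the last three bullets, the minimal Python sketch below builds an index to limit the search and registers a custom function inside the database engine, so the computation runs next to the data. It uses Python's built-in sqlite3 purely for illustration; the table, columns, and function are invented for the example (the SkyServer itself runs on SQL Server).

# Sketch of "take the analysis to the data": instead of pulling raw
# records out and grepping them, build an index and push a custom
# function into the database. Table/column names are made up.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY, ra REAL, dec REAL, g REAL, r REAL)")
con.executemany(
    "INSERT INTO obj VALUES (?,?,?,?,?)",
    [(i, (i * 0.01) % 360, -1.0 + i * 1e-4, 20.0 + (i % 50) * 0.1,
      19.5 + (i % 50) * 0.1) for i in range(100_000)])

# An index limits the search instead of scanning every row.
con.execute("CREATE INDEX idx_obj_g ON obj (g)")

# A custom function registered *inside* the database engine,
# so the computation runs next to the data.
def gr_color(g, r):
    return g - r
con.create_function("gr_color", 2, gr_color)

rows = con.execute(
    "SELECT id, gr_color(g, r) AS color FROM obj "
    "WHERE g BETWEEN 20.0 AND 20.5 LIMIT 5").fetchall()
print(rows)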
Next-Generation Data Analysis
• Looking for
– Needles in haystacks – the Higgs particle
– Haystacks: Dark matter, Dark energy
• Needles are easier than haystacks
• ‘Optimal’ statistics have poor scaling
  – Correlation functions are O(N²), likelihood techniques O(N³)
    (see the pair-counting sketch after this slide)
  – For large data sets the main errors are not statistical
• As data and computers grow with Moore’s Law,
  we can only keep up with N log N
• Take cost of computation into account
– Controlled level of accuracy
– Best result in a given time, given our computing resources
• Requires combination of statistics & computer science
– New algorithms
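To make the scaling argument concrete, here is a minimal sketch (not from the talk) of two-point correlation estimation by pair counting. The naive double loop over pairs is O(N²); counting pairs with a KD-tree (scipy's cKDTree.count_neighbors) is one common way to get close to N log N behavior. The simple DD/RR − 1 estimator and the toy random catalogs are illustrative assumptions.

# Sketch: two-point correlation estimate by pair counting.
# Naive pair counting is O(N^2); a KD-tree keeps the cost near N log N.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
data = rng.random((20_000, 3))       # toy "galaxy" positions in a unit box
rand = rng.random((20_000, 3))       # random catalog with the same geometry

r_bins = np.linspace(0.01, 0.1, 10)  # pair-separation bin edges

tree_d = cKDTree(data)
tree_r = cKDTree(rand)

# Cumulative pair counts within each radius (tree-based).
dd = tree_d.count_neighbors(tree_d, r_bins).astype(float)
rr = tree_r.count_neighbors(tree_r, r_bins).astype(float)

# Differential counts per bin, then the simple DD/RR - 1 estimator.
xi = np.diff(dd) / np.diff(rr) - 1.0
print(np.round(xi, 3))   # approximately 0 (up to noise) for a purely random "survey"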
Why Is Astronomy Special?
• Especially attractive for the wide public
• It has no commercial value – “worthless!” (Jim Gray)
– No privacy concerns, freely share results with others
– Great for experimenting with algorithms
• It is real and well documented
– High-dimensional (with confidence intervals)
– Spatial, temporal
• Diverse and distributed
  – Many different instruments from many different places and many different times
• The questions are interesting
• There is a lot of it (soon Petabytes)
⇒ Virtual Observatory
National Virtual Observatory
• NSF ITR project, “Building the Framework for the National
Virtual Observatory” is a collaboration of 17 funded and 3
unfunded organizations
  – Astronomy data centers
  – National observatories
  – Supercomputer centers
  – University departments
  – Computer science/information technology specialists
• PI and project director: Alex Szalay (JHU)
• CoPI: Roy Williams (Caltech/CACR)
• Natural cohesion with Grid Computing
• Several widely used applications now up and running
  – Registry, Datascope, SkyQuery, Unified data access
International Collaboration
• Similar grass-roots efforts now in 17 countries:
– USA, Canada, UK, France, Germany, Italy, Holland, Japan,
Australia, India, China, Russia, Spain, Hungary, South Korea,
ESO, Brazil
• Total awarded funding world-wide is over $60M
• Active collaboration among projects
– Standards, common demos
– International VO roadmap being developed
– Regular teleconferences over 10 time zones
• Formal collaboration:
  the International Virtual Observatory Alliance (IVOA)
Sloan Digital Sky Survey
Goal: create the most detailed map of the Northern sky
  – “The Cosmic Genome Project”
Two surveys in one
  – Photometric survey in 5 bands
  – Spectroscopic redshift survey
Automated data reduction
  – 150 man-years of development
High data volume
  – 40 TB of raw data
  – 5 TB processed catalogs
  – Data is public
Participating institutions: The University of Chicago, Princeton University,
  The Johns Hopkins University, The University of Washington,
  New Mexico State University, Fermi National Accelerator Laboratory,
  US Naval Observatory, The Japanese Participation Group,
  The Institute for Advanced Study, Max Planck Inst., Heidelberg
Funding: Sloan Foundation, NSF, DOE, NASA
The Imaging Survey
Drift scan of 10,000 square degrees
24k x 1M pixel “panoramic” images
in 5 colors – broad-band filters (u,g,r,i,z)
2.5 Terapixels of images
The Spectroscopic Survey
Expanding universe
redshift = distance
SDSS Redshift Survey
1 million galaxies
100,000 quasars
100,000 stars
Two high throughput spectrographs
spectral range 3900-9200 Å
640 spectra simultaneously
R=2000 resolution, 1.3 Å
Features
Automated reduction of spectra
Very high sampling density and completeness
The SkyServer Portal
• Sloan Digital Sky Survey: Pixels + Objects
• About 500 attributes per “object”, 400M objects
• Currently 2.4 TB fully public
• Prototype eScience lab (800 users)
  – Moving analysis to the data
  – Fast searches: color, spatial
• Visual tools
  – Join pixels with objects
• Prototype in data publishing
  – 200 million web hits in 5 years
  – 930,000 distinct users
http://skyserver.sdss.org/
[Chart: SkyServer Traffic – Web hits/mo and SQL queries/mo, monthly from 2001/7 to 2004/7, log scale 1e4 to 1e7]
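To give a flavor of what "moving the analysis to the data" looks like for a SkyServer user, here is a small Python sketch that submits a SQL query to the portal and reads back a CSV result. The table and column names (PhotoObj, objID, ra, dec, g, r) follow the public SDSS schema, but the endpoint URL and its parameters are an assumption modeled on the SkyServer batch SQL interface and may differ between data releases.

# Sketch: push a SQL color cut to the SkyServer instead of downloading pixels.
# NOTE: the endpoint below is an assumption; check the current portal for the exact URL.
import urllib.parse
import urllib.request

SKYSERVER_SQL_URL = "http://skyserver.sdss.org/public/en/tools/search/x_sql.asp"  # assumed

query = """
SELECT TOP 10 objID, ra, dec, g, r
FROM PhotoObj
WHERE g - r BETWEEN 0.3 AND 0.4
  AND g < 19
"""

params = urllib.parse.urlencode({"cmd": query, "format": "csv"})
with urllib.request.urlopen(f"{SKYSERVER_SQL_URL}?{params}") as resp:
    print(resp.read().decode())   # a small CSV result set, not terabytes of pixels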
Precision Cosmology
• Main questions
– Dark Matter, Dark Energy
• Over the last few years
  – Detected that the Universe accelerates
  – Detected the baryon bumps
  – Constrained the neutrino mass
  – Contributed substantially to constraints combined with CMB
• Surveys: ‘systems’ astronomy!!
– Measuring the parameters of the Universe
to a few percent accuracy
SDSS Power Spectrum
Main challenge: with so much data the dominant errors are systematic, not statistical!
Using large simulations to understand the significance of the detection.
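For readers outside cosmology, the quantities behind the last two slides are standard ones: the two-point correlation function ξ(r) of the galaxy density field and its Fourier transform, the power spectrum P(k); the "baryon bumps" appear as oscillatory features in P(k). The textbook definitions (not specific to the SDSS analysis) are:

\xi(r) = \langle \delta(\mathbf{x})\,\delta(\mathbf{x}+\mathbf{r}) \rangle,
\qquad
P(k) = \int \xi(r)\, e^{-i\mathbf{k}\cdot\mathbf{r}}\, d^3 r,
\qquad
\delta(\mathbf{x}) = \frac{\rho(\mathbf{x}) - \bar{\rho}}{\bar{\rho}}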
Trends
CMB Surveys
• 1990 COBE – 1,000
• 2000 Boomerang – 10,000
• 2002 CBI – 50,000
• 2003 WMAP – 1 Million
• 2008 Planck – 10 Million
Time Domain
• QUEST
• SDSS Extension survey
• Dark Energy Camera
• PanStarrs: 1 PB by 2007
• LSST: 100 PB by 2020
Angular Galaxy Surveys
• 1970 Lick – 1M
• 1990 APM – 2M
• 2005 SDSS – 200M
• 2008 VISTA – 1000M
• 2012 LSST – 3000M
Galaxy Redshift Surveys
• 1986 CfA – 3,500
• 1996 LCRS – 23,000
• 2003 2dF – 250,000
• 2005 SDSS – 750,000
Petabytes/year by the end of the decade…
Simulations
• Cosmological simulations have 10⁹ particles and
  produce over 30 TB of data (Millennium)
• Build up dark matter halos
• Track merging history of halos (a toy merger-tree walk follows this slide)
• Use it to assign star formation history
• Combination with spectral synthesis
• Realistic distribution of galaxy types
• Need more realizations (now 50)
• Hard to analyze the data afterwards → need DB
• What is the best way to compare to real data?
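As a purely illustrative sketch (the halo IDs, masses, and in-memory data structure are invented, not the Millennium database schema), this is the kind of merger-history walk that has to run for millions of halos, which is why a database rather than flat simulation dumps is needed:

# Toy merger tree: each halo points to its progenitor halos at the previous
# snapshot. Walking the tree collects the full merging history of a halo.
# All IDs and masses are invented for illustration.

progenitors = {
    "H5": ["H3", "H4"],   # halo H5 formed by the merger of H3 and H4
    "H3": ["H1"],
    "H4": ["H2"],
    "H1": [], "H2": [],
}
mass = {"H1": 1.0e11, "H2": 3.0e11, "H3": 1.2e11, "H4": 3.5e11, "H5": 5.0e11}

def merger_history(halo_id):
    """Return every progenitor of `halo_id`, depth-first."""
    history = []
    stack = list(progenitors.get(halo_id, []))
    while stack:
        h = stack.pop()
        history.append(h)
        stack.extend(progenitors.get(h, []))
    return history

for h in merger_history("H5"):
    print(h, f"{mass[h]:.2e} Msun")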
Exploration of Turbulence
For the first time, we can now “put it all together”
• Large scale range, scale-ratio O(1,000)
• Three-dimensional in space
• Time-evolution and Lagrangian approach (follow the flow)
Unique turbulence database
• We are creating a database of O(2,000) consecutive snapshots
  of a 1,024³ simulation of turbulence: close to 100 Terabytes
• Treat it as an experiment (see the spectrum sketch below)
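As one example of treating the stored snapshots as an experiment, the sketch below computes a standard diagnostic, the shell-averaged kinetic energy spectrum, from a single velocity snapshot. It is not from the talk: a tiny random 64³ field stands in for a real 1,024³ snapshot, and normalization constants are ignored.

# Sketch: shell-averaged kinetic energy spectrum of one (toy) velocity
# snapshot, the kind of diagnostic run against each stored time step.
import numpy as np

n = 64
rng = np.random.default_rng(1)
u = rng.standard_normal((3, n, n, n))        # toy 3-component velocity field

# Fourier transform each component; energy per mode (up to normalization).
uhat = np.fft.fftn(u, axes=(1, 2, 3)) / n**3
e_k = 0.5 * np.sum(np.abs(uhat) ** 2, axis=0)

# Bin modes into spherical shells of |k| to get the 1-D spectrum E(k).
k = np.fft.fftfreq(n) * n                    # integer wavenumbers
kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
k_mag = np.sqrt(kx**2 + ky**2 + kz**2).round().astype(int)

spectrum = np.bincount(k_mag.ravel(), weights=e_k.ravel())
print(spectrum[:10])                         # E(k) for the first few shells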
Wireless Sensor Networks
• Use 200 wireless (Intel) sensors, monitoring
  – Air temperature, moisture
  – Soil temperature, moisture, at least at two depths (5 cm, 20 cm)
  – Light (intensity, composition)
  – Gases (O2, CO2, CH4, …)
• Long-term continuous data
• Small (hidden) and affordable (many)
• Less disturbance
• >200 million measurements/year (see the estimate after this list)
• Collaboration with Microsoft
• Complex database of sensor data and samples, derived from astronomy
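A rough consistency check of the ">200 million measurements/year" figure, under assumed sampling parameters that are not in the talk (about ten channels per sensor, each read every five minutes):

# Rough check of the ">200 million measurements/year" figure.
# Assumptions (not from the slide): ~10 channels per sensor
# (air/soil temperature and moisture, light, gases), one reading
# of every channel every 5 minutes.
SENSORS = 200
CHANNELS_PER_SENSOR = 10          # assumed
SAMPLE_INTERVAL_MIN = 5           # assumed
MINUTES_PER_YEAR = 365 * 24 * 60

per_year = SENSORS * CHANNELS_PER_SENSOR * MINUTES_PER_YEAR // SAMPLE_INTERVAL_MIN
print(f"{per_year:,} measurements/year")   # ~210,000,000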
Summary
• Data growing exponentially
• Requires a new model
– Having more data makes it harder to extract knowledge
• Information at your fingertips
– Students see the same data as professionals
• More data coming: Petabytes/year by 2010
– Need scalable solutions
– Move analysis to the data!
• Same thing happening in all sciences
– High energy physics, genomics/proteomics,
medical imaging, oceanography, environmental science…
• Data Exploration: an emerging new branch of science
– We need multiple skills in a world of increasing specialization…