Towards an International Virtual Observatory, Garching, 2002 Analyzing Large Datasets in Astrophysics (Living in an exponential world….) Alexander Szalay The Johns Hopkins University.


Towards an International Virtual Observatory,
Garching, 2002
Analyzing Large Datasets
in Astrophysics
(Living in an exponential world….)
Alexander Szalay
The Johns Hopkins University
Outline
Collecting Data
Exponential Growth
Making Discoveries
Publishing Data
VO: How will it work?
Web Services
Atomic vs Composite services
Distributed queries with SkyQuery
Cross-Matching Algorithm
SkyNode Web Services + Portal
Statistical Analysis of large data sets
Alex Szalay, Garching 2002
The World is Exponential
Astrophysical data is growing exponentially
Doubling every year (Moore’s Law+):
both data sizes and number of data sets
Computational resources scale the same way
Constant $$$ will keep up with the data
Main problem is the software component
Currently components are not reused
Software costs are an increasingly large fraction
Aggregate costs are growing exponentially
Making Discoveries
When and where are discoveries made?
Always at the edges and boundaries
Going deeper, using more colors….
Metcalfe’s law
Utility of computer networks grows as the
number of possible connections: O(N^2)
VO: Federation of N archives
Possibilities for new discoveries grow as O(N^2)
Current sky surveys have proven this
Very early discoveries from SDSS, 2MASS, DPOSS
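As a rough illustration of the O(N^2) claim on this slide, the number of distinct archive pairs that could be cross-matched can be counted directly. The function name below is hypothetical, and counting pairs is only a stand-in for "possibilities for new discoveries".

```python
from math import comb

def pairwise_links(n_archives: int) -> int:
    """Number of distinct pairs among n_archives federated archives."""
    return comb(n_archives, 2)  # n * (n - 1) / 2, which grows as O(N^2)

# Doubling the number of archives roughly quadruples the number of pairs.
for n in (2, 5, 10, 20):
    print(n, pairwise_links(n))
```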
Publishing Data
Roles        Traditional   Emerging
Authors      Scientists    Collaborations
Publishers   Journals      Project www site
Curators     Libraries     Bigger Archives
Consumers    Scientists    Scientists
Changing Roles
Exponential growth:
Projects last at least 3-5 years
Data sent upwards only at the end of the project
Data will never be centralized
More responsibility on projects
Becoming Publishers and Curators
Larger fraction of budget spent on software
A lot of development is duplicated and wasted
More standards are needed
Easier data interchange, fewer tools
More templates are needed
Develop less software on your own
Emerging New Concepts
Standardizing distributed data
Web Services, supported on all platforms
Custom configure remote data dynamically
XML: Extensible Markup Language
SOAP: Simple Object Access Protocol
WSDL: Web Services Description Language
Standardizing distributed computing
Grid Services
Custom configure remote computing dynamically
Build your own remote computer, and discard
Virtual Data: new data sets on demand
NVO: How Will It Work?
Define commonly used 'atomic' services
Build higher level toolboxes/portals on top
We do not build 'everything for everybody'
Use the 90-10 rule:
Define the standards and interfaces
Build the framework
Build the 10% of services that are used by 90%
Let the users build the rest from the components
[Figure: # of users (0 to 1) versus # of services (0 to 0.5)]
Atomic Services
Metadata information about resources
Waveband
Sky coverage
Translation of names to universal dictionary (UCD)
Simple search patterns on the resources
Cone Search
Image mosaic
Unit conversions
Simple filtering, counting, histogramming
On-the-fly recalibrations
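A cone search, one of the atomic services listed above, returns the objects within a given angular radius of a sky position. The sketch below is a minimal in-memory version, assuming a catalog stored as a list of dictionaries with hypothetical `ra` and `dec` keys in degrees. It is not the interface of any actual VO service.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    s = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(s))))

def cone_search(catalog, ra, dec, radius_deg):
    """Return the rows of catalog lying within radius_deg of (ra, dec)."""
    return [row for row in catalog
            if angular_separation_deg(ra, dec, row["ra"], row["dec"]) <= radius_deg]

# Hypothetical toy catalog; the positions are made up for illustration.
catalog = [
    {"id": "a", "ra": 181.30, "dec": -0.76},
    {"id": "b", "ra": 181.35, "dec": -0.76},
    {"id": "c", "ra": 182.00, "dec": -0.76},
]
hits = cone_search(catalog, 181.3, -0.76, 0.1)
print([row["id"] for row in hits])
```

A real service would use a spatial index rather than scanning every row. The linear scan is only meant to make the semantics concrete.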
Higher Level Services
Built on Atomic Services
Perform more complex tasks
Examples
Automated resource discovery
Cross-identifications
Photometric redshifts
Outlier detections
Visualization facilities
Expectation:
Build custom portals in a matter of days from existing building
blocks (like today in IRAF or IDL)
SkyQuery
Distributed Query tool using a set of services
Feasibility study, built in 6 weeks from scratch
Tanu Malik (JHU CS grad student)
Tamas Budavari (JHU astro postdoc)
Implemented in C# and .NET
Won 2nd prize of Microsoft XML Contest
Allows queries like:
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o,
TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t)<3.5
AND AREA(181.3,-0.76,6.5)
AND o.type=3 AND (o.i - t.m_j)>2
Architecture
Web Page
Image cutout
SkyQuery
SkyNode (SDSS)
SkyNode (2MASS)
SkyNode (FIRST)
Cross-id Steps
Parse query
Get counts
Sort by counts
Make plan
Cross-match
Recursively, from small to large
Select necessary attributes only
Return output
Insert cutout image
Example query:
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o,
TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t)<3.5
AND AREA(181.3,-0.76,6.5)
AND (o.i - t.m_j) > 2
AND o.type=3
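The counting, sorting and small-to-large matching steps on this slide can be sketched with in-memory lists standing in for the archives. This is an illustration under stated assumptions, not the SkyQuery implementation: the function names and toy data are hypothetical, and the match tolerance is treated as a plain angular distance in arcseconds, which the slides do not specify.

```python
import math

def sep_arcsec(a, b):
    """Small-angle separation in arcseconds between two rows with ra/dec in degrees."""
    dra = (a["ra"] - b["ra"]) * math.cos(math.radians(0.5 * (a["dec"] + b["dec"])))
    ddec = a["dec"] - b["dec"]
    return 3600.0 * math.hypot(dra, ddec)

def make_plan(archives):
    """Get counts and sort by counts: the smallest archive comes first."""
    return sorted(archives, key=lambda name: len(archives[name]))

def cross_match(archives, tol_arcsec):
    """Extend partial matches one archive at a time, from small to large."""
    order = make_plan(archives)
    tuples = [{order[0]: row} for row in archives[order[0]]]
    for name in order[1:]:
        extended = []
        for partial in tuples:
            for row in archives[name]:
                # Keep the row only if it agrees with every object matched so far.
                if all(sep_arcsec(row, prev) <= tol_arcsec for prev in partial.values()):
                    extended.append({**partial, name: row})
        tuples = extended  # only surviving tuples move on to the next archive
    return order, tuples

# Hypothetical toy archives; the positions are made up for illustration.
archives = {
    "SDSS": [{"id": "s1", "ra": 181.3002, "dec": -0.7601},
             {"id": "s2", "ra": 181.5000, "dec": -0.7000},
             {"id": "s3", "ra": 182.0000, "dec": -0.5000}],
    "TWOMASS": [{"id": "t1", "ra": 181.3001, "dec": -0.7600},
                {"id": "t2", "ra": 181.5000, "dec": -0.7000}],
    "FIRST": [{"id": "f1", "ra": 181.3000, "dec": -0.7600}],
}
order, matches = cross_match(archives, tol_arcsec=3.5)
print(order, [{k: v["id"] for k, v in m.items()} for m in matches])
```

Starting from the smallest archive keeps the list of partial matches short, so less data has to travel between nodes at each step.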
Monte-Carlo Simulation
Comparing different algorithms for 3-way xid
Transmit all the data
Transmit after filtering
Recursive cross-match
Surveys
SDSS
2MASS
FIRST
Random variables:
Sky Area (0..10 sqdeg)
Selectivity of each subselect (0..1)
Efficiency of join (0.5..2)
Selectivity of common select (0..1)
[Figure: histogram of log cost; x-axis from -4 to 4, y-axis from 0 to 2000]
SkyNode
Metadata functions (SOAP)
Info, Tables, Columns, Schema, Functions, Keysearch
Query functions (SOAP)
Dataset Query(String sqlCmd)
Dataset Xmatch(Dataset input, String sqlCmd, float eps)
Database
MS SQL Server
Upload dataset
Very fast spatial search engine (HTM-based)
Crossmatch takes <3 ms/object over 15M objects in SDSS
User defined functions and stored procedures
Data Flow
query
SkyQuery
SkyNode 1
SkyNode 2
SkyNode 3
http://www.skyquery.net
Optimal Statistics
The examples for optimal statistics have poor scaling
Correlation functions N^2, likelihood techniques N^3
As data sizes grow at the pace of Moore's law, computers can
keep up only with algorithms that scale no worse than N log N
What goes?
Notion of optimal is in the sense of statistical errors
Assumes infinite computational resources
Assumes that only source of error is statistical
'Cosmic Variance': we can only observe the Universe from one
location (finite sample size)
Solutions require a combination of Statistics and CS
New algorithms: not worse than N log N
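The scaling argument on this slide can be made concrete with a small calculation. Assume, as a simplification, that both the data size N and the available computing power double every year. The wall-clock time of an algorithm with cost f(N), relative to today, is then f(N0 * 2^t) / (2^t * f(N0)) after t years. The starting size of 10^6 objects and the 10-year horizon are arbitrary choices for illustration.

```python
import math

def relative_runtime(cost, n0, years):
    """Runtime after `years`, relative to today, if data and compute both double yearly."""
    growth = 2.0 ** years
    return cost(n0 * growth) / (growth * cost(n0))

costs = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

# N log N stays within a small factor of today's runtime; N^2 and N^3 run away.
for name, cost in costs.items():
    print(f"{name}: {relative_runtime(cost, 1e6, 10):.3g}x")
```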
Clever Data Structures
Heavy use of tree structures:
Up-front cost, but only N log N
Large speedup later
Tree-codes for correlations (A. Moore et al. 2001)
Fast, approximate heuristic algorithms
No need to be more accurate than cosmic variance
Fast CMB analysis by Szapudi et al. (2001)
N log N instead of N^3 => 1 day instead of 10 million years
Take cost of computation into account
Controlled level of accuracy
Best result in a given time, given our computing resources
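The "up-front cost, large speedup later" pattern can be shown with a much simpler structure than the tree codes cited above. The sketch below counts pairs of points closer than r in one dimension, first by brute force in O(N^2) and then with a sorted array plus binary search in O(N log N). The sorted array is a stand-in chosen for brevity. It is not the algorithm of Moore et al. or Szapudi et al., and the data are random integers generated only for the comparison.

```python
import bisect
import random

def count_pairs_brute(xs, r):
    """O(N^2): test every pair directly."""
    n = len(xs)
    return sum(1 for i in range(n) for j in range(i + 1, n) if abs(xs[i] - xs[j]) <= r)

def count_pairs_sorted(xs, r):
    """O(N log N): sort once (the up-front cost), then binary-search each range."""
    s = sorted(xs)
    total = 0
    for i, x in enumerate(s):
        # Index of the first element strictly greater than x + r.
        j = bisect.bisect_right(s, x + r)
        total += j - i - 1  # neighbours after position i within distance r
    return total

rng = random.Random(0)
points = [rng.randrange(10_000) for _ in range(500)]
print(count_pairs_brute(points, 25) == count_pairs_sorted(points, 25))
```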
Angular Clustering with Photo-z
w() by Peebles and Groth:
The first example of publishing and analyzing large data
Samples based on rest-frame quantities
Strictly volume limited samples
Largest angular correlation study to date
Very clear detection of luminosity and color dependence
Results consistent with 3D clustering
T. Budavari, A. Connolly, I. Csabai, I. Szapudi, A. Szalay, S.
Dodelson, J. Frieman, R. Scranton, D. Johnston
and the SDSS Collaboration
The Samples
2800 square degrees in 10 stripes, data in custom DB
All: 50M
mr<21 : 15M
10 stripes: 10M
0.1<z<0.3
-20 > Mr
0.1<z<0.5
-21.4 > Mr
2.2M
3.1M
-20 > Mr >-21
-21 > Mr >-23
-21 > Mr >-22
1182k
931k
662k
-22 > Mr >-23
343k 254k 185k
316k
280k 326k 185k
127k
269k
The Stripes
10 stripes over the SDSS area, covering
about 2800 square degrees
About 20% lost due to bad seeing
Masks: seeing, bright stars, etc.
Images generated from query by web service
The Masks
Stripe 11 + masks
Masks are derived from the database
Search and intersect extended objects with boundaries
The Analysis
eSpICE: I. Szapudi, S. Colombi and S. Prunet
Integrated with the database by T. Budavari
Extremely fast processing (N log N)
One stripe with about 1 million galaxies is processed in 3 minutes
The usual figure was 10 minutes for 10,000 galaxies, which would mean about 70 days
Each stripe processed separately for each cut
2D angular correlation function computed
w(): average with rejection of
pixels along the scan
flat field vector causes
mock correlations
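The 70-day figure on this slide is consistent with quadratic scaling of the older method. That the method scales as N^2 is an assumption here, taken from the N^2 cost quoted for correlation functions on the Optimal Statistics slide. The check below extrapolates 10 minutes for 10,000 galaxies to 1 million galaxies and compares it with the 3 minutes quoted for the N log N code.

```python
def scaled_minutes(base_minutes, base_n, n, exponent):
    """Extrapolate a runtime assuming cost proportional to n ** exponent."""
    return base_minutes * (n / base_n) ** exponent

minutes = scaled_minutes(10, 10_000, 1_000_000, 2)  # assumed N^2 scaling
days = minutes / (60 * 24)
speedup = minutes / 3  # relative to 3 minutes per stripe

print(f"{minutes:.0f} minutes = {days:.1f} days, speedup about {speedup:.0f}x")
```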
Angular Correlations I.
Luminosity dependence: 3 cuts
-20> M > -21
-21> M > -22
-22> M > -23
Angular Correlations II.
Color Dependence
4 bins by rest-frame SED type
Summary
Exponential data growth – distributed data
Web Services – hierarchical architecture
Use the 90-10 rule (maybe 80-20)
There are clever ways to federate datasets!
Statistical analyses do not follow Moore’s law
Need to revisit optimal statistics
Give interesting new tools into the hands of
smart young people…
They will quickly turn them into cutting edge
science
Virtual Observatory
Astronomy with an
attitude…