Managing Data for the World Wide Telescope aka: The Virtual Observatory Jim Gray Alex Szalay SLAC Data Management Workshop.

Managing Data for the World Wide Telescope aka: The Virtual Observatory

Jim Gray, Alex Szalay
SLAC Data Management Workshop

The Evolution of Science

• Observational Science
– Scientist gathers data by direct observation
– Scientist analyzes data
• Analytical Science
– Scientist builds analytical model
– Makes predictions.
• Computational Science
– Simulate analytical model
– Validate model and make predictions
• Data Exploration Science
– Data captured by instruments, or data generated by simulator
– Processed by software
– Placed in a database / files
– Scientist analyzes database / files

Information Avalanche

• In science, industry, government, …
– Better observational instruments and
– better simulations are producing a data avalanche
• Examples
– BaBar: grows 1 TB/day (2/3 simulation information, 1/3 observational information)
– CERN: LHC will generate 1 GB/s, ~10 PB/y
– VLBA (NRAO) generates 1 GB/s today
– Pixar: 100 TB/movie
• New emphasis on informatics:
– Capturing, Organizing, Summarizing, Analyzing, Visualizing

Image courtesy C. Meneveau & A. Szalay @ JHU

[Images: BaBar, Stanford; P&E Gene Sequencer, from http://www.genome.uci.edu/ ; Space Telescope]

The Big Picture

[Figure: Experiments & Instruments, Other Archives, Literature, and Simulations supply facts; questions go in and answers come out.]

The Big Problems

• Data ingest
• Managing a petabyte
• Common schema
• How to organize it?
• How to reorganize it
• How to coexist with others
• Query and Vis tools
• Support/training
• Performance
– Execute queries in a minute
– Batch query scheduling

FTP - GREP

• Download (FTP and GREP) are not adequate
– You can GREP 1 MB in a second
– You can GREP 1 GB in a minute
– You can GREP 1 TB in 2 days
– You can GREP 1 PB in 3 years.
• Oh!, and 1 PB ~ 3,000 disks
• At some point we need indices to limit search, and parallel data search and analysis
• This is where databases can help
• Next generation technique: Data Exploration
– Bring the analysis to the data!

The Speed Problem

• Many users want to search the whole DB with ad hoc queries, often combinatorial
• Want ~ 1 minute response
• Brute force (parallel search):
– 1 disk = 50 MBps => ~1M disks/PB ~ 300M$/PB
• Indices (limit search, do column store)
– 1,000x less equipment: 1M$/PB
• Pre-compute answer
– No one knows how to do it for all questions.
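The brute-force estimate can be checked with a few lines of arithmetic. This is a rough sketch, assuming decimal units (1 PB = 10^15 bytes), the slide's 50 MB/s per disk and one-minute target, and a hypothetical price of $1,000 per disk (the slide states only the total). Under those assumptions it gives roughly 330,000 disks and about $330M, the same order of magnitude as the slide's rounded figures.

```python
import math

PB = 10**15   # bytes per petabyte (decimal units, an assumption)
MB = 10**6    # bytes per megabyte

def disks_for_full_scan(data_bytes, disk_bytes_per_s, target_seconds):
    """Disks that must be read in parallel to scan data_bytes within target_seconds."""
    bytes_per_disk = disk_bytes_per_s * target_seconds
    return math.ceil(data_bytes / bytes_per_disk)

# Slide figures: 50 MB/s per disk, ~1 minute response, 1 PB of data.
disks = disks_for_full_scan(PB, 50 * MB, 60)

cost_per_disk = 1_000                      # hypothetical price in dollars
brute_force_cost = disks * cost_per_disk   # full parallel scan
indexed_cost = brute_force_cost / 1_000    # slide: indices need ~1,000x less equipment

print(f"{disks:,} disks, ${brute_force_cost:,} brute force, ${indexed_cost:,.0f} indexed")
```

Changing the response-time target or the per-disk bandwidth shows how quickly the brute-force cost moves, which is the argument for indices.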

Next-Generation Data Analysis

• Looking for
– Needles in haystacks – the Higgs particle
– Haystacks: Dark matter, Dark energy
• Needles are easier than haystacks
• Global statistics have poor scaling
– Correlation functions are $N^2$, likelihood techniques $N^3$
• As data and computers grow at same rate, we can only keep up with $N \log N$
• A way out?
– Relax notion of optimal (data is fuzzy, answers are approximate)
– Don’t assume infinite computational resources or memory
• Combination of statistics & computer science
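The scaling claim can be made explicit with a short worked calculation. As a simplifying assumption (not stated on the slide), let the run time be $T = c\,f(N)/S$ for an algorithm of cost $f(N)$ on a machine of speed $S$, and let both the data size and the machine speed grow by the same factor $g$:

```latex
\frac{T_{\text{new}}}{T_{\text{old}}}
  = \frac{f(gN)/(gS)}{f(N)/S}
  = \frac{f(gN)}{g\,f(N)}
\qquad\Longrightarrow\qquad
\begin{cases}
  f(N) = N^2:        & \dfrac{g^2 N^2}{g N^2} = g \\[2ex]
  f(N) = N^3:        & \dfrac{g^3 N^3}{g N^3} = g^2 \\[2ex]
  f(N) = N \log N:   & \dfrac{gN \log(gN)}{g N \log N} = 1 + \dfrac{\log g}{\log N}
\end{cases}
```

Under this model the $N^2$ and $N^3$ run times grow with every generation of data, while the $N \log N$ run time stays close to constant once $N$ is large.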

Analysis and Databases

• Much statistical analysis deals with
– Creating uniform samples
– Data filtering
– Assembling relevant subsets
– Estimating completeness
– Censoring bad data
– Counting and building histograms
– Generating Monte-Carlo subsets
– Likelihood calculations
– Hypothesis testing
• Traditionally these are performed on files
• Most of these tasks are much better done inside a database
• Move Mohamed to the mountain, not the mountain to Mohamed.
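Two of the tasks in the list, censoring bad data and building a histogram, can be pushed into the database so that only the bin counts travel back to the analyst. The sketch below uses Python's built-in `sqlite3` as a stand-in for an archive database; the table `obj` and its columns `mag` and `flag_bad` are hypothetical names chosen for illustration, not the schema of any real survey.

```python
import sqlite3

def magnitude_histogram(rows, bin_width=1.0):
    """Censor flagged rows and histogram magnitudes inside the database.

    rows: iterable of (id, mag, flag_bad); assumes non-negative magnitudes,
    because CAST(... AS INTEGER) truncates toward zero.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY, mag REAL, flag_bad INTEGER)")
    con.executemany("INSERT INTO obj (id, mag, flag_bad) VALUES (?, ?, ?)", rows)
    cur = con.execute(
        """
        SELECT CAST(mag / ? AS INTEGER) * ? AS bin_start, COUNT(*) AS n
        FROM obj
        WHERE flag_bad = 0            -- censor bad data
        GROUP BY bin_start            -- count per magnitude bin
        ORDER BY bin_start
        """,
        (bin_width, bin_width),
    )
    result = cur.fetchall()
    con.close()
    return result

sample = [(1, 17.2, 0), (2, 17.9, 0), (3, 18.4, 0), (4, 18.5, 1), (5, 19.1, 0)]
print(magnitude_histogram(sample))
```

The client receives one row per bin rather than one row per object, which is the point of moving the analysis to the data.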

Organization & Algorithms

• Use of clever data structures (trees, cubes):
– Up-front creation cost, but only $N \log N$ access cost
– Large speedup during the analysis
– Tree-codes for correlations (A. Moore et al 2001)
– Data Cubes for OLAP (all vendors)
• Fast, approximate heuristic algorithms
– No need to be more accurate than cosmic variance
– Fast CMB analysis by Szapudi et al (2001)
  • $N \log N$ instead of $N^3$ => 1 day instead of 10 million years
• Take cost of computation into account
– Controlled level of accuracy
– Best result in a given time, given our computing resources
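The trade of an up-front build cost for cheaper access can be illustrated with a one-dimensional toy problem: counting pairs of points closer than a distance `r`. This is not the tree code of Moore et al. or the CMB method of Szapudi et al.; it is a simplified stand-in in which a sorted list plays the role of the tree, and the function names are made up for this sketch.

```python
from bisect import bisect_right

def count_pairs_brute(xs, r):
    """O(N^2): test every pair (i < j) for |x_i - x_j| <= r."""
    n = len(xs)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if abs(xs[i] - xs[j]) <= r)

def count_pairs_sorted(xs, r):
    """O(N log N): sort once (the up-front cost), then one binary search per point."""
    s = sorted(xs)
    total = 0
    for i, x in enumerate(s):
        # Index just past the last point lying within r to the right of x.
        hi = bisect_right(s, x + r)
        total += hi - i - 1   # neighbours strictly after position i
    return total

points = [5, 1, 9, 3, 4, 10]
print(count_pairs_brute(points, 2), count_pairs_sorted(points, 2))
```

Both functions return the same count, but the second never compares pairs that are far apart, which is the same idea tree codes apply in higher dimensions.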

World Wide Telescope Virtual Observatory

http://www.ivoa.net/
• Premise: Most data is (or could be) online
• The Internet is the world’s best telescope:
– It has data on every part of the sky
– In every measured spectral band: optical, x-ray, radio, ...
– As deep as the best instruments (2 years ago).
– It is up when you are up. The “seeing” is always great (no working at night, no clouds, no moons, no ...).
– It’s a smart telescope: links objects and data to literature on them.

Why Astronomy?

• Community has lots of data
• Data is real and well documented
– High-dimensional (with confidence intervals)
– Spatial, temporal
• Diverse and distributed
– Many different instruments from many different places and many different times
• Community wants to share/cross compare
– Can freely share data and algorithms.
– “DataMining, Not Data MINE!!” Mark Ellisman, UCSD
• They are well organized
• Community is small and homogeneous
• No commercial or privacy concerns
– All the problems are technical or social.

The WWT Components

• Data Sources
– Literature
– Archives
• Unified Definitions
– Units,
– Semantics/Concepts/Metrics, Representations,
– Provenance
• Object model
• Classes and methods
• Portals

Data Sources

• Literature online and cross indexed
– Simbad, ADS, NED
– http://simbad.u-strasbg.fr/Simbad , http://adswww.harvard.edu/ , http://nedwww.ipac.caltech.edu/
• Many curated archives online
– FIRST, DPOSS, 2MASS, USNO, IRAS, SDSS, VizieR, …
– Typically files with English meta-data and some programs
• Groups, Researchers, Amateurs Publish
– Datasets online in various formats
– Data publications are ephemeral (may disappear)
– Many have unknown provenance
• Documentation varies; some good and some none.

Unified Definitions

• Universal Content Definitions http://vizier.u-strasbg.fr/doc/UCD.htx
– Collated all table heads from all the literature
– 100,000 terms reduced to ~1,500
– Rough consensus that this is the right thing.
– Refinement in progress as people use UCDs
• Defines
– Units:
  • gram, radian, second, jansky, ...
– Semantic Concepts / Metrics
  • Std error, $\chi^2$ fit, magnitude, flux @ passband, velocity,

Provenance

• Most data will be derived.

• To do science, need to trace derived data back to source.

• So programs and inputs must be registered.

• Must be able to re-run them.

• Example: Space Telescope Calibrated Data
– Run on demand
– Can specify software version (to get old answers)
• Scientific Data Provenance and Curation are largely unsolved problems (some ideas but no science).
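One way to picture "programs and inputs must be registered" is a record that names the program, its version, its inputs, and its parameters, with a fingerprint that a derived dataset can carry as a pointer back to its origin. The slides call provenance largely unsolved, so this is only an illustrative sketch: `ProvenanceRecord`, `ProvenanceRegistry`, and the example names (`calibrate`, `raw/0001`) are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ProvenanceRecord:
    program: str     # name of the registered program
    version: str     # software version, so old answers can be reproduced
    inputs: tuple    # identifiers of the source datasets
    params: tuple    # (name, value) pairs passed to the program

    def fingerprint(self):
        """Stable hash of the record; identical runs get identical keys."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

class ProvenanceRegistry:
    """In-memory registry mapping fingerprints to the runs that produced data."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        key = record.fingerprint()
        self._records[key] = record
        return key   # store this key alongside the derived data

    def lookup(self, key):
        return self._records[key]

registry = ProvenanceRegistry()
run_v1 = ProvenanceRecord("calibrate", "2.0", ("raw/0001",), (("flat", "v3"),))
run_v2 = ProvenanceRecord("calibrate", "2.1", ("raw/0001",), (("flat", "v3"),))
key_v1 = registry.register(run_v1)
key_v2 = registry.register(run_v2)
```

Looking up `key_v1` returns exactly the program, version, and inputs needed to re-run the calibration and get the old answer, while `key_v2` identifies the newer run over the same raw data.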

Object Model

• General acceptance of XML
• Recent acceptance of XML Schema (XSD over DTD)
• Wait-and-See about SOAP/WSDL/…
– “Web Services are just Corba with angle brackets.”
– FTP is good enough for me.
• Personal opinion:
– Web Services are much more than “Corba + <>”
– Huge focus on interop
– Huge focus on integrated tools
• But the community says “Show me!”
– Many technologists convinced, but not yet the astronomers

[Figure: Your program, Data in your address space, Web Server, Web Service]

Classes and Methods

• First Class: VO table http://www.us-vo.org/VOTable/
– Represents an answer set in XML
  • Defined by an XML Schema (XSD)
  • Metadata (in terms of UCDs)
  • Data representation (numbers and text)
– First method
  • Cone Search: Get objects in this cone http://voservices.org/cone/

[Figure: Your program, Data in your address space, Web Service]
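The semantics of "get objects in this cone" can be sketched locally without calling any service: keep every object whose angular distance from a centre position is within a radius. The code below is an assumption-laden illustration, not the cone search service's implementation or interface; the catalog is a plain list of dictionaries with hypothetical keys `id`, `ra`, and `dec` in degrees.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # min() guards against rounding pushing sqrt(h) slightly above 1.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(catalog, ra, dec, radius_deg):
    """Return the objects within radius_deg of the position (ra, dec)."""
    return [obj for obj in catalog
            if angular_separation_deg(ra, dec, obj["ra"], obj["dec"]) <= radius_deg]

catalog = [
    {"id": "a", "ra": 180.00, "dec": 0.00},
    {"id": "b", "ra": 180.05, "dec": 0.00},
    {"id": "c", "ra": 181.00, "dec": 0.00},
    {"id": "d", "ra": 0.00, "dec": 89.99},
    {"id": "e", "ra": 180.00, "dec": 89.99},
]
equator_hits = [o["id"] for o in cone_search(catalog, 180.0, 0.0, 0.1)]
pole_hits = [o["id"] for o in cone_search(catalog, 0.0, 90.0, 0.05)]
```

Using a true great-circle distance rather than a box in RA and Dec matters near the poles, where objects `d` and `e` are 180 degrees apart in RA but only 0.02 degrees apart on the sky. A production service would use a spatial index rather than this linear scan.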

Other Classes

• Space-Time class
– http://hea-www.harvard.edu/~arots/nvometa/STCdoc.pdf
• Image Class (returns pixels)
– SdssCutout
– Simple Image Access Protocol http://bill.cacr.caltech.edu/cfdocs/usvo-pubs/files/ACF8DE.pdf
– HyperAtlas http://bill.cacr.caltech.edu/usvo-pubs/files/hyperatlas.pdf
• Spectral
– Simple Spectral Access Protocol
– 500K spectra available at http://voservices.net/wave
• Query Services
– ADQL and SkyNode http://skyservice.pha.jhu.edu/develop/vo/adql/
– And http://SkyQuery.Net
• Registry:
– see below

The Registry

• UDDI seemed inappropriate
– Complex
– Irrelevant questions
– Relevant questions missing
• Evolved Dublin Core
– Represent Datasets, Services, Portals
– Needs to be machine readable
– Federation (DNS model)
– Push & Pull: register then harvest
http://www.ivoa.net/twiki/bin/view/IVOA/IvoaResReg

Demo

• SkyServer:
– Navigator: showing cutout web service
– List: showing many calls and variant use.
• SkyQuery:
– Show integration of various archives.
– Explain spatial join xMatch operator.

SkyServer.SDSS.org

• A modern Astronomy archive
– Raw Pixel data lives in file servers
– Catalog data (derived objects) lives in Database
– Online query to any and all
• Also used for education
– 150 hours of online Astronomy
– Implicitly teaches data analysis
• Interesting things
– Spatial data search
– Client query interface via Java Applet
– Query interface via Emacs
– Popular
– Cloned by other surveys (a template design)
– Web services are core of it.

SkyQuery A Prototype WWT

• Started with SDSS data and schema
• Imported 12 other datasets into that spine schema (a day per dataset plus load time)
• Unified them with a portal
• Implicit spatial join among the datasets.
• All built on Web Services
– Pure XML
– Pure SOAP
– Used .NET toolkit
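A spatial join pairs objects from two archives that lie at (nearly) the same position on the sky. The slides do not describe how the xMatch operator decides a match, so the sketch below substitutes a plain nearest-neighbour match within a fixed tolerance; the function names, the dictionary keys, and the sample coordinates are all hypothetical.

```python
import math

def separation_arcsec(a, b):
    """Great-circle separation in arcseconds; objects carry 'ra' and 'dec' in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (a["ra"], a["dec"], b["ra"], b["dec"]))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0

def cross_match(left, right, tolerance_arcsec):
    """Pair each left object with its nearest right object, if within the tolerance."""
    matches = []
    for a in left:
        # Linear scan for the nearest neighbour; O(N*M), fine only for a toy example.
        best = min(right, key=lambda b: separation_arcsec(a, b), default=None)
        if best is not None and separation_arcsec(a, best) <= tolerance_arcsec:
            matches.append((a["id"], best["id"]))
    return matches

optical = [{"id": "s1", "ra": 10.0, "dec": 5.0}, {"id": "s2", "ra": 20.0, "dec": -3.0}]
radio = [{"id": "f1", "ra": 10.0002, "dec": 5.0001}, {"id": "f2", "ra": 50.0, "dec": 0.0}]
pairs = cross_match(optical, radio, tolerance_arcsec=2.0)
```

Here `s1` and `f1` are under an arcsecond apart and are joined, while `s2` has no counterpart. Importing every dataset into one spine schema is what lets a join like this run across archives without per-archive special cases.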

Federation: SkyQuery.Net

• Combine 4 archives initially
• Added 9 more
• Send query to portal, portal joins data from archives.
• Problem: want to do multi-step data analysis (not just single query).
• Solution: Allow personal databases on portal
• Problem: some queries are monsters
• Solution: “batch schedule” on portal server, deposits answer in personal database.

SkyQuery Structure

• Portal
– Plans Query (2 phase)
– Integrates answers
– Is a web service
• Each SkyNode publishes
– Schema Web Service
– Database Web Service

[Figure: SkyQuery Portal connected to the SDSS, FIRST, 2MASS, and INT SkyNodes and to Image Cutout]

MyDB

http://skyservice.pha.jhu.edu/devel/casjobs/
• Portal allows federation of data but…
• Intermediate results may be large.
• Intermediate results feed into next analysis step.
• Sending them back-and-forth to client is costly and sometimes infeasible.
• Solution: create a working DB for client at Portal: MyDB

MyDB

http://skyservice.pha.jhu.edu/devel/casjobs/
• Anyone can create a personal DB at SkyServer portal.
– It is about 100 MB
– It is private
• Simple queries done immediately
• Complex queries done by batch scheduler
• All queries can create/read/write MyDB tables
• Very popular with “serious” users.
• MyDB will be sharable with a group.
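The combination of immediate queries, batch-scheduled queries, and a personal results database can be sketched with in-memory SQLite databases. This is a toy model under stated assumptions, not the CasJobs design: the `Portal` class, its method names, and the `quick` flag (set by the caller here, since the slides do not say how queries are classified) are all hypothetical.

```python
import sqlite3
from collections import deque

class Portal:
    """Toy portal: one shared archive database plus one private MyDB per user."""

    def __init__(self, archive):
        self.archive = archive       # sqlite3 connection holding the shared catalog
        self.mydbs = {}              # user -> private sqlite3 connection
        self.batch_queue = deque()   # pending (user, sql, target_table) jobs

    def mydb(self, user):
        """Create the user's personal database on first use."""
        if user not in self.mydbs:
            self.mydbs[user] = sqlite3.connect(":memory:")
        return self.mydbs[user]

    def submit(self, user, sql, target_table, quick=True):
        """Run simple queries now; queue complex ones for the batch scheduler."""
        if quick:
            self._run(user, sql, target_table)
            return "done"
        self.batch_queue.append((user, sql, target_table))
        return "queued"

    def run_batch(self):
        """Drain the queue, depositing each answer in its owner's MyDB."""
        while self.batch_queue:
            self._run(*self.batch_queue.popleft())

    def _run(self, user, sql, target_table):
        cur = self.archive.execute(sql)
        cols = [d[0] for d in cur.description]   # assumes plain or aliased column names
        rows = cur.fetchall()
        db = self.mydb(user)
        # Table and column names are interpolated directly: acceptable in a sketch only.
        db.execute(f"CREATE TABLE IF NOT EXISTS {target_table} ({', '.join(cols)})")
        marks = ", ".join("?" for _ in cols)
        db.executemany(f"INSERT INTO {target_table} VALUES ({marks})", rows)
        db.commit()

archive = sqlite3.connect(":memory:")
archive.execute("CREATE TABLE obj (id INTEGER, mag REAL)")
archive.executemany("INSERT INTO obj VALUES (?, ?)", [(1, 17.2), (2, 21.5), (3, 18.0)])
portal = Portal(archive)
status_quick = portal.submit("alice", "SELECT id, mag FROM obj WHERE mag < 19", "bright")
status_batch = portal.submit("alice", "SELECT id FROM obj", "everything", quick=False)
```

After `portal.run_batch()` the queued answer appears as a table in the same personal database, so the next analysis step can read it at the portal instead of shipping intermediate results to the client and back.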

Open SkyQuery

• SkyQuery being adopted by AstroGrid as reference implementation for OGSA-DAI (Open Grid Services Architecture, Data Access and Integration).
• SkyNode basic archive object http://www.ivoa.net/twiki/bin/view/IVOA/SkyNode
• SkyQuery Language (VoQL) is evolving. http://www.ivoa.net/twiki/bin/view/IVOA/IvoaVOQL

Outline

The WWT Components

• Data Sources
– Literature
– Archives
• Unified Definitions
– Units, Representations,
– Semantics/Concepts/Metrics,
– Provenance
• Object model
• Classes and methods
• Portals

What we learned

• Astro is a community of 10,000
• Homogeneous & Cooperative
• If you can’t do it for Astro, do not bother with 3M bio-info.
• Agreement
– Takes time
– Takes endless meetings
• Big problems are non-technical
– Legacy is a big problem.
• Plumbing and tools are there. But…
– What is the object model?
– What do you want to save?
– How to document provenance?
• WWT is a poster child for the Data Grid.