Managing Data for the World Wide Telescope aka: The Virtual Observatory Jim Gray Alex Szalay SLAC Data Management Workshop.
Managing Data for the World Wide Telescope aka: The Virtual Observatory
Jim Gray, Alex Szalay
SLAC Data Management Workshop
The Evolution of Science
• Observational Science
  – Scientist gathers data by direct observation
  – Scientist analyzes data
• Analytical Science
  – Scientist builds analytical model
  – Makes predictions
• Computational Science
  – Simulates analytical model
  – Validates model and makes predictions
• Data Exploration Science
  – Data captured by instruments or generated by simulator
  – Processed by software
  – Placed in a database / files
  – Scientist analyzes database / files
Information Avalanche
• In science, industry, government, …
  – Better observational instruments and
  – better simulations are producing a data avalanche
• Examples
  – BaBar: grows 1 TB/day (2/3 simulation information, 1/3 observational information)
  – CERN: LHC will generate 1 GB/s, ~10 PB/y
  – VLBA (NRAO) generates 1 GB/s today
  – Pixar: 100 TB/movie
• New emphasis on informatics:
  – Capturing, Organizing, Summarizing, Analyzing, Visualizing
[Images: BaBar, Stanford; P&E Gene Sequencer, from http://www.genome.uci.edu/; Space Telescope. Image courtesy C. Meneveau & A. Szalay @ JHU]
The Big Picture
[Diagram: Experiments & Instruments, Other Archives, Literature, and Simulations supply facts; questions go in and answers come out]
The Big Problems
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it?
• How to reorganize it?
• How to coexist with others?
• Query and Vis tools
• Support/training
• Performance
  – Execute queries in a minute
  – Batch query scheduling
FTP - GREP
• Download (FTP and GREP) is not adequate
  – You can GREP 1 MB in a second
  – You can GREP 1 GB in a minute
  – You can GREP 1 TB in 2 days
  – You can GREP 1 PB in 3 years
• Oh, and 1 PB is ~3,000 disks
• At some point we need
  – indices to limit search
  – parallel data search and analysis
• This is where databases can help
• Next-generation technique: Data Exploration
  – Bring the analysis to the data!
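The GREP figures above are back-of-envelope numbers. A minimal Python sketch of the same arithmetic follows, assuming decimal units and a hypothetical sustained scan rate of 10 MB/s. That rate is not stated on the slide; the slide's rounded figures imply somewhat different rates at each size.

```python
# Back-of-envelope scan times for a sequential search (GREP-style).
# ASSUMPTION: decimal units and a single sustained scan rate; the
# 10 MB/s figure used in the comments is illustrative, not from the slide.
MB, GB, TB, PB = 1e6, 1e9, 1e12, 1e15
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def scan_time_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Time to read size_bytes once at a constant rate."""
    return size_bytes / rate_bytes_per_s


def implied_rate(size_bytes: float, seconds: float) -> float:
    """Scan rate (bytes/s) implied by a quoted size and elapsed time."""
    return size_bytes / seconds


# At 10 MB/s a petabyte takes on the order of three years.
years_per_pb = scan_time_seconds(PB, 10 * MB) / SECONDS_PER_YEAR

# The slide's "1 TB in 2 days" corresponds to a rate of a few MB/s.
tb_rate_mb_s = implied_rate(TB, 2 * SECONDS_PER_DAY) / MB
```

The point of the sketch is the scaling: scan time grows linearly with data size, so a search that is instant on a megabyte is impractical on a petabyte unless indices or parallelism reduce the work.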
The Speed Problem
• Many users want to search the whole DB
  – Ad hoc queries, often combinatorial
• Want ~1 minute response
• Brute force (parallel search):
  – 1 disk = 50 MBps => ~1M disks/PB ~ 300M$/PB
• Indices (limit search, do column store)
  – 1,000x less equipment: 1M$/PB
• Pre-compute answer
  – No one knows how to do it for all questions.
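The brute-force estimate can be checked with a short calculation. This is a sketch assuming decimal units, a 60-second deadline, and perfectly parallel disks. With those assumptions the arithmetic gives roughly a third of a million disks per petabyte, so the slide's ~1M disks/PB reads as an order-of-magnitude figure. The function name is hypothetical.

```python
import math


def disks_for_scan(size_bytes: float, disk_rate_bytes_per_s: float,
                   deadline_s: float) -> int:
    """Disks needed to read size_bytes within deadline_s, assuming
    every disk streams at disk_rate_bytes_per_s in perfect parallel."""
    return math.ceil(size_bytes / (disk_rate_bytes_per_s * deadline_s))


PB = 1e15
DISK_RATE = 50e6  # 50 MBps per disk, as on the slide

# Brute force: every query reads the whole petabyte.
brute_force_disks = disks_for_scan(PB, DISK_RATE, 60)

# ASSUMPTION: an index or column store lets a query touch 1/1000 of the
# data, which is how the "1,000x less equipment" bullet is modeled here.
indexed_disks = disks_for_scan(PB / 1000, DISK_RATE, 60)
```

Under these assumptions the indexed case needs a few hundred disks instead of a few hundred thousand, which is the thousandfold reduction the slide describes.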
Next-Generation Data Analysis
• Looking for
  – Needles in haystacks: the Higgs particle
  – Haystacks: dark matter, dark energy
• Needles are easier than haystacks
• Global statistics have poor scaling
  – Correlation functions are N^2, likelihood techniques N^3
• As data and computers grow at the same rate, we can only keep up with N log N
• A way out?
  – Relax notion of optimal (data is fuzzy, answers are approximate)
  – Don’t assume infinite computational resources or memory
• Combination of statistics & computer science
Analysis and Databases
• Much statistical analysis deals with
  – Creating uniform samples: data filtering
  – Assembling relevant subsets
  – Estimating completeness: censoring bad data
  – Counting and building histograms
  – Generating Monte-Carlo subsets
  – Likelihood calculations
  – Hypothesis testing
• Traditionally these are performed on files
• Most of these tasks are much better done inside a database
• Move Mohamed to the mountain, not the mountain to Mohamed.
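As an illustration of doing filtering and histogramming inside the database rather than on downloaded files, here is a minimal sketch using Python's built-in sqlite3 module. The table layout (`obj` with `id`, `mag`, `bad`) and the function name are hypothetical, and SQLite stands in for whatever database an archive actually uses.

```python
import sqlite3


def magnitude_histogram(rows, bin_width=1.0):
    """Censor bad rows and histogram magnitudes inside the database.

    rows: iterable of (obj_id, magnitude, is_bad) tuples (hypothetical schema).
    Returns a list of (bin_start, count) sorted by bin_start.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE obj (id INTEGER PRIMARY KEY, mag REAL, bad INTEGER)"
    )
    con.executemany("INSERT INTO obj VALUES (?, ?, ?)", rows)
    # The WHERE clause censors bad data; GROUP BY builds the histogram.
    # CAST truncates toward zero, which is fine for non-negative magnitudes.
    cur = con.execute(
        """
        SELECT CAST(mag / ? AS INTEGER) * ? AS bin_start, COUNT(*) AS n
        FROM obj
        WHERE bad = 0
        GROUP BY bin_start
        ORDER BY bin_start
        """,
        (bin_width, bin_width),
    )
    result = cur.fetchall()
    con.close()
    return result


sample = [(1, 14.2, 0), (2, 14.9, 0), (3, 15.1, 0), (4, 15.5, 1), (5, 16.0, 0)]
histogram = magnitude_histogram(sample)
```

Only the small (bin, count) result leaves the database; the raw rows never have to be shipped to the client.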
Organization & Algorithms
• Use of clever data structures (trees, cubes):
  – Up-front creation cost, but only N log N access cost
  – Large speedup during the analysis
  – Tree-codes for correlations (A. Moore et al 2001)
  – Data Cubes for OLAP (all vendors)
• Fast, approximate heuristic algorithms
  – No need to be more accurate than cosmic variance
  – Fast CMB analysis by Szapudi et al (2001)
    • N log N instead of N^3 => 1 day instead of 10 million years
• Take cost of computation into account
  – Controlled level of accuracy
  – Best result in a given time, given our computing resources
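To make the "up-front creation cost, cheap access" idea concrete, here is a small sketch that counts pairs of points closer than a distance r, first by brute force (N^2) and then with a sorted array and binary search (N log N). It is a one-dimensional stand-in chosen for brevity, not the tree-code of Moore et al.; function names are hypothetical.

```python
from bisect import bisect_right


def count_pairs_brute(xs, r):
    """Count unordered pairs with |xi - xj| <= r by checking all pairs: O(N^2)."""
    n = 0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            if abs(xs[i] - xs[j]) <= r:
                n += 1
    return n


def count_pairs_sorted(xs, r):
    """Same count using a sort (up-front cost) plus one binary search per
    point: O(N log N). Requires r >= 0."""
    s = sorted(xs)
    n = 0
    for i, x in enumerate(s):
        # Points after position i whose value is at most x + r.
        n += bisect_right(s, x + r) - (i + 1)
    return n


points = [0, 1, 2, 5, 6, 10]
pairs_brute = count_pairs_brute(points, 1)
pairs_sorted = count_pairs_sorted(points, 1)
```

Both functions return the same count; only the amount of work differs, and the gap widens quickly as N grows.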
World Wide Telescope Virtual Observatory
http://www.ivoa.net/
• Premise: Most data is (or could be) online
• The Internet is the world’s best telescope:
  – It has data on every part of the sky
  – In every measured spectral band: optical, x-ray, radio, …
  – As deep as the best instruments (2 years ago).
  – It is up when you are up. The “seeing” is always great (no working at night, no clouds, no moons, no …).
  – It’s a smart telescope: links objects and data to literature on them.
Why Astronomy?
• Community has lots of data
• Data is real and well documented
  – High-dimensional (with confidence intervals)
  – Spatial, temporal
• Diverse and distributed
  – Many different instruments from many different places and many different times
• Community wants to share/cross compare
  – Can freely share data and algorithms.
  – “DataMining, Not Data MINE!!” (Mark Ellisman, UCSD)
• They are well organized
• Community is small and homogeneous
• No commercial or privacy concerns
  – All the problems are technical or social.
The WWT Components
• Data Sources
  – Literature
  – Archives
• Unified Definitions
  – Units, Representations
  – Semantics/Concepts/Metrics
  – Provenance
• Object model
• Classes and methods
• Portals
Data Sources
• Literature online and cross indexed
  – Simbad: http://simbad.u-strasbg.fr/Simbad
  – ADS: http://adswww.harvard.edu/
  – NED: http://nedwww.ipac.caltech.edu/
• Many curated archives online
  – FIRST, DPOSS, 2MASS, USNO, IRAS, SDSS, VizieR, …
  – Typically files with English meta-data and some programs
• Groups, researchers, amateurs publish
  – Datasets online in various formats
  – Data publications are ephemeral (may disappear)
  – Many have unknown provenance
• Documentation varies; some good and some none.
Unified Definitions
• Unified Content Descriptors (UCDs) http://vizier.u-strasbg.fr/doc/UCD.htx
  – Collated all table heads from all the literature
  – 100,000 terms reduced to ~1,500
  – Rough consensus that this is the right thing.
  – Refinement in progress as people use UCDs
• Defines
  – Units: gram, radian, second, jansky, …
  – Semantic Concepts / Metrics: std error, Chi^2 fit, magnitude, flux @ passband, velocity, …
Provenance
• Most data will be derived.
• To do science, need to trace derived data back to source.
• So programs and inputs must be registered.
• Must be able to re-run them.
• Example: Space Telescope Calibrated Data
  – Run on demand
  – Can specify software version (to get old answers)
• Scientific Data Provenance and Curation are largely unsolved problems (some ideas but no science).
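One way to picture "programs and inputs must be registered" is a record that pins the program name, its version, its parameters, and a digest of every input, so a derived dataset can be traced back and re-run. The sketch below is an assumption for illustration only; the class, field names, and program name are hypothetical and do not describe how Space Telescope or any archive actually records provenance.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def digest(data: bytes) -> str:
    """Content hash used to identify an input without storing it here."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical registration of one derivation step."""
    program: str
    version: str
    input_digests: tuple  # digests of the inputs, in the order they were used
    parameters: tuple     # sorted (name, value) pairs

    def record_id(self) -> str:
        """Stable identifier: same program, version, inputs, and parameters
        always produce the same id."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def make_record(program, version, inputs, parameters):
    """Build a record from raw input bytes and a parameter dict."""
    return ProvenanceRecord(
        program=program,
        version=version,
        input_digests=tuple(digest(b) for b in inputs),
        parameters=tuple(sorted(parameters.items())),
    )


raw = [b"raw frame 1", b"raw frame 2"]
rec_v1 = make_record("calibrate", "1.0", raw, {"flat": "A"})
rec_v2 = make_record("calibrate", "2.0", raw, {"flat": "A"})
```

Keeping the software version in the record is what makes it possible to ask for "old answers": a different version yields a different record, even for identical inputs.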
Object Model
• General acceptance of XML
• Recent acceptance of XML Schema (XSD over DTD)
• Wait-and-see about SOAP/WSDL/…
  – “Web Services are just Corba with angle brackets.”
  – “FTP is good enough for me.”
• Personal opinion:
  – Web Services are much more than “Corba + <>”
  – Huge focus on interop
  – Huge focus on integrated tools
• But the community says “Show me!”
  – Many technologists convinced, but not yet the astronomers
[Diagram: your program, with data in your address space, calls a Web Service on a Web Server]
Classes and Methods
• First class: VOTable http://www.us-vo.org/VOTable/
  – Represents an answer set in XML
    • Defined by an XML Schema (XSD)
    • Metadata (in terms of UCDs)
    • Data representation (numbers and text)
• First method: Cone Search
  – Get objects in this cone http://voservices.org/cone/
[Diagram: your program, with data in your address space, calls a Web Service]
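The geometric test behind "get objects in this cone" can be sketched in a few lines. This is only an illustration of the selection logic on an in-memory list, assuming positions in decimal degrees; the catalog layout and function names are hypothetical, and the actual Cone Search service defines its own request and response formats.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions
    (haversine formula; inputs in decimal degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cone_search(objects, ra, dec, radius_deg):
    """Return the (obj_id, ra, dec) tuples within radius_deg of (ra, dec)."""
    return [
        obj for obj in objects
        if angular_separation_deg(obj[1], obj[2], ra, dec) <= radius_deg
    ]


# Hypothetical catalog: (obj_id, ra_deg, dec_deg)
catalog = [
    ("a", 10.0, 20.0),
    ("b", 10.1, 20.0),
    ("c", 50.0, -5.0),
    ("d", 359.9, 0.0),
]
near = cone_search(catalog, 10.0, 20.0, 0.5)
across_zero = cone_search(catalog, 0.1, 0.0, 0.5)
```

The second query shows why a proper spherical distance matters: an object at RA 359.9 is close to a cone centered at RA 0.1, which a naive difference of coordinates would miss.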
Other Classes
• Space-Time class
  – http://hea-www.harvard.edu/~arots/nvometa/STCdoc.pdf
• Image class (returns pixels)
  – SdssCutout
  – Simple Image Access Protocol http://bill.cacr.caltech.edu/cfdocs/usvo-pubs/files/ACF8DE.pdf
  – HyperAtlas http://bill.cacr.caltech.edu/usvo-pubs/files/hyperatlas.pdf
• Spectral
  – Simple Spectral Access Protocol
  – 500K spectra available at http://voservices.net/wave
• Query Services
  – ADQL and SkyNode http://skyservice.pha.jhu.edu/develop/vo/adql/
  – And http://SkyQuery.Net
• Registry:
  – see below
[Diagram: your program, with data in your address space, calls a Web Service]
The Registry
• UDDI seemed inappropriate
  – Complex
  – Irrelevant questions
  – Relevant questions missing
• Evolved Dublin Core
  – Represent Datasets, Services, Portals
  – Needs to be machine readable
  – Federation (DNS model)
  – Push & Pull: register then harvest
http://www.ivoa.net/twiki/bin/view/IVOA/IvoaResReg
Demo
• SkyServer:
  – Navigator: showing cutout web service
  – List: showing many calls and variant use.
• SkyQuery:
  – Show integration of various archives.
  – Explain spatial join xMatch operator.
SkyServer.SDSS.org
• A modern Astronomy archive
  – Raw pixel data lives in file servers
  – Catalog data (derived objects) lives in database
  – Online query to any and all
• Also used for education
  – 150 hours of online Astronomy
  – Implicitly teaches data analysis
• Interesting things
  – Spatial data search
  – Client query interface via Java Applet
  – Query interface via Emacs
  – Popular
  – Cloned by other surveys (a template design)
  – Web services are core of it.
SkyQuery: A Prototype WWT
• Started with SDSS data and schema
• Imported 12 other datasets into that spine schema (a day per dataset plus load time)
• Unified them with a portal
• Implicit spatial join among the datasets.
• All built on Web Services
  – Pure XML
  – Pure SOAP
  – Used .NET toolkit
Federation: SkyQuery.Net
• Combined 4 archives initially
• Added 9 more
• Send query to portal; portal joins data from archives.
• Problem: want to do multi-step data analysis (not just single query).
• Solution: allow personal databases on portal
• Problem: some queries are monsters
• Solution: “batch schedule” on portal server, deposit answer in personal database.
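The portal's join across archives is spatial: rows from different catalogs are paired when they lie close together on the sky. Below is a deliberately simple sketch of that idea, a nearest-neighbour match within a tolerance over two in-memory lists. It is a swapped-in simplification for illustration; the actual xMatch operator may work differently, and the archive contents and names here are hypothetical.

```python
import math


def sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine; inputs in degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cross_match(archive_a, archive_b, tolerance_deg):
    """For each object in archive_a, pair it with the closest object in
    archive_b that lies within tolerance_deg. Brute force, O(N*M)."""
    matches = []
    for id_a, ra_a, dec_a in archive_a:
        best = None  # (id_b, separation)
        for id_b, ra_b, dec_b in archive_b:
            d = sep_deg(ra_a, dec_a, ra_b, dec_b)
            if d <= tolerance_deg and (best is None or d < best[1]):
                best = (id_b, d)
        if best is not None:
            matches.append((id_a, best[0]))
    return matches


# Hypothetical rows: (obj_id, ra_deg, dec_deg)
optical = [("o1", 150.000, 2.000), ("o2", 150.500, 2.500), ("o3", 151.000, 3.000)]
radio = [("r1", 150.0002, 2.0001), ("r2", 150.5200, 2.5000), ("r3", 150.5001, 2.5001)]
pairs = cross_match(optical, radio, tolerance_deg=0.001)
```

A production join would use a spatial index rather than the double loop, in line with the earlier point that indices limit the search.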
SkyQuery Structure
• Portal
  – Plans query (2 phase)
  – Integrates answers
  – Is a web service
• Each SkyNode publishes
  – Schema Web Service
  – Database Web Service
[Diagram: SkyQuery Portal connected to the Image Cutout service and to the SDSS, 2MASS, FIRST, and INT SkyNodes]
MyDB
http://skyservice.pha.jhu.edu/devel/casjobs/
• Portal allows federation of data but…
• Intermediate results may be large.
• Intermediate results feed into next analysis step.
• Sending them back-and-forth to client is costly and sometimes infeasible.
• Solution: create a working DB for client at Portal: MyDB
MyDB
http://skyservice.pha.jhu.edu/devel/casjobs/
• Anyone can create a personal DB at SkyServer portal.
  – It is about 100 MB
  – It is private
• Simple queries done immediately
• Complex queries done by batch scheduler
• All queries can create/read/write MyDB tables
• Very popular with “serious” users.
• MyDB will be sharable by a group.
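The interplay of immediate queries, batch scheduling, and personal tables can be sketched as follows. Everything here is an assumption made for illustration: the class, the cost threshold, the single-column query, and the use of SQLite are hypothetical and are not the CasJobs implementation.

```python
import sqlite3
from collections import deque


def _mydb_table(user: str, name: str) -> str:
    """Build a per-user table name; only plain identifiers are accepted."""
    if not (user.isidentifier() and name.isidentifier()):
        raise ValueError("user and table names must be plain identifiers")
    return f"mydb_{user}_{name}"


class PortalSketch:
    """Toy portal: cheap queries run at once, expensive ones are queued,
    and every result is deposited in the user's server-side table."""

    def __init__(self, archive_rows, quick_limit=10):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE archive (id INTEGER, mag REAL)")
        self.db.executemany("INSERT INTO archive VALUES (?, ?)", archive_rows)
        self.quick_limit = quick_limit  # hypothetical cost threshold
        self.queue = deque()

    def submit(self, user, table, mag_max, estimated_cost):
        """Run now if cheap, otherwise hand the job to the batch queue."""
        job = (user, table, mag_max)
        if estimated_cost <= self.quick_limit:
            self._run(job)
            return "done"
        self.queue.append(job)
        return "queued"

    def pending(self):
        return len(self.queue)

    def run_batch(self):
        """Drain the batch queue in submission order."""
        while self.queue:
            self._run(self.queue.popleft())

    def _run(self, job):
        user, table, mag_max = job
        target = _mydb_table(user, table)
        self.db.execute(f"CREATE TABLE {target} (id INTEGER, mag REAL)")
        # The result stays on the server; nothing is shipped to the client.
        self.db.execute(
            f"INSERT INTO {target} SELECT id, mag FROM archive WHERE mag < ?",
            (mag_max,),
        )

    def read(self, user, table):
        target = _mydb_table(user, table)
        return self.db.execute(
            f"SELECT id, mag FROM {target} ORDER BY id"
        ).fetchall()


portal = PortalSketch([(1, 14.0), (2, 16.0), (3, 18.0)])
status_quick = portal.submit("alice", "bright", 15.0, estimated_cost=1)
status_big = portal.submit("alice", "everything", 20.0, estimated_cost=100)
```

A later analysis step can read from the deposited table instead of re-running the federated query, which is the reason the intermediate results stay at the portal.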
Open SkyQuery
• SkyQuery is being adopted by AstroGrid as reference implementation for OGSA-DAI (Open Grid Services Architecture, Data Access and Integration).
• SkyNode: basic archive object http://www.ivoa.net/twiki/bin/view/IVOA/SkyNode
• SkyQuery Language (VoQL) is evolving. http://www.ivoa.net/twiki/bin/view/IVOA/IvoaVOQL
Outline
The WWT Components
• Data Sources
  – Literature
  – Archives
• Unified Definitions
  – Units, Representations
  – Semantics/Concepts/Metrics
  – Provenance
• Object model
• Classes and methods
• Portals
What we learned
• Astro is a community of 10,000
  – Homogeneous & cooperative
  – If you can’t do it for Astro, do not bother with 3M bio-info.
• Agreement
  – Takes time
  – Takes endless meetings
• Big problems are non-technical
  – Legacy is a big problem.
• Plumbing and tools are there, but…
  – What is the object model?
  – What do you want to save?
  – How to document provenance?
• WWT is a poster child for the Data Grid.