Where The Rubber Meets the Sky
Giving Access to Science Data
Jim Gray, Microsoft Research, Http://research.Microsoft.com/~Gray
Alex Szalay, Johns Hopkins University

Outline
• Want to build a TerraServer for Hungary?
• My view of eScience

TerraServer / TerraService
http://TerraServer-USA.com/
http://terraService.Net/
• USGS photo of the US
• Online since June 1998
• Operated by Microsoft
• 20 TB data source
• 10 M web hits/day
• A web service
• Our laboratory
• I recommend you clone it for Hungary
• 100x less data (92k km²), very useful
  – Education, land management, science, info framework

TerraServer – Today – LOW TCO
• Storage Bricks
  – Commodity servers
  – 4 TB raw / 2 TB RAID1 SATA storage
  – Dual 2 GHz + 4 GB RAM
• Bunch
  – 3 Bricks = TerraServer data
  – Data partitioned
• Low Cost Availability
  – RAID1 Mirroring
  – Mirrored Bunches
  – Spare Brick
  – Web Application
    • Load balances mirrors
    • Uses surviving database on failure
[Diagram labels: KVM / IP; Pair & Spare]

Outline
• Want to build a TerraServer for Hungary?
• My view of eScience

New Science Paradigms
• Thousand years ago: science was empirical, describing natural phenomena
• Last few hundred years: theoretical branch, using models, generalizations
  $\left(\frac{\dot{a}}{a}\right)^2 = \frac{4\pi G\rho}{3} - K\frac{c^2}{a^2}$
• Last few decades: a computational branch, simulating complex phenomena
• Today: data exploration (eScience), unifying theory, experiment, and simulation using data management and statistics
  – Data captured by instruments or generated by simulator
  – Processed by software
  – Scientist analyzes database / files

Information Avalanche and eScience
• In science, industry, government, …
  – better observational instruments and
  – better simulations
  are producing a data avalanche
• New emphasis on informatics:
  – Capturing, Organizing, Summarizing, Analyzing, Visualizing
• Each science is objectifying itself
  – Defining core concepts
  – Integrating all data and literature online
  – Hungary could be a leader in this (you have the Martians – great tech education)
[Image labels: Image courtesy C. Meneveau & A. Szalay @ JHU; BaBar, Stanford; P&E Gene Sequencer, from http://www.genome.uci.edu/; Space Telescope]

The Big Picture
[Diagram labels: Experiments & Instruments; Other Archives; Literature; Simulations; facts; questions; answers]
The Big Problems
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it?
• How to reorganize it?
• How to coexist with others?
• Data Query and Visualization tools
• Support/training
• Performance
  – Execute queries in a minute
  – Batch (big) query scheduling

The Virtual Observatory
• Premise: most data is (or could be) online
• The Internet is the world’s best telescope:
  – It has data on every part of the sky
  – In every measured spectral band: optical, x-ray, radio, …
  – As deep as the best instruments (2 years ago)
  – It is up when you are up
  – The “seeing” is always great
  – It’s a smart telescope: links objects and data to literature
• Software is the capital expense
  – Share, standardize, reuse, …

What X-info Needs from us (cs)
(not drawn to scale)
• Scientists: Science Data & Questions
• Miners: Data Mining Algorithms
• Plumbers: Database to store data, execute queries
• Tools: Question & Answer, Visualization

Data Access Hitting a Wall
Current science practice is based on data download (FTP/GREP)
Will not scale to the datasets of tomorrow
• You can GREP 1 MB in a second
• You can GREP 1 GB in a minute
• You can GREP 1 TB in 2 days
• You can GREP 1 PB in 3 years
• You can FTP 1 MB in 1 sec
• You can FTP 1 GB / min (~$1)
• … 2 days and $1K
• … 3 years and $1M
• Oh!, and 1 PB ~5,000 disks
• At some point you need indices to limit search, and parallel data search and analysis
• This is where databases can help

Next-Generation Data Analysis
• Looking for
  – Needles in haystacks – the Higgs particle
  – Haystacks: dark matter, dark energy, turbulence, ecosystem dynamics
• Needles are easier than haystacks
• Global statistics have poor scaling
  – Correlation functions are N², likelihood techniques N³
• As data and computers grow at Moore’s Law, we can only keep up with N log N
• A way out?
  – Relax optimal notion (data is fuzzy, answers are approximate)
  – Don’t assume infinite computational resources or memory
• Requires combination of statistics & computer science

Smart Data: Unifying DB and Analysis
• There is too much data to move around
  Do data manipulations at database
  – Build custom procedures and functions into DB
  – Unify data Access & Analysis
  – Examples
    • Statistical sampling and analysis
    • Temporal and spatial indexing
    • Pixel processing
    • Automatic parallelism
    • Auto (re)organize
    • Scalable to Petabyte datasets
Move Mohamed to the mountain, not the mountain to Mohamed.

Experiment Budgets ¼…½ Software
Software for
• Instrument scheduling
• Instrument control
• Data gathering
• Data reduction
• Database
• Analysis
• Visualization
Millions of lines of code
Repeated for experiment after experiment
Not much sharing or learning
Let’s work to change this
Identify generic tools
• Workflow schedulers
• Databases and libraries
• Analysis packages
• Visualizers
• …
Simulation (computational science) is > ½ software

How to Help?
• Can’t learn the discipline before you start (takes 4 years.)
• Can’t go native – you are a CS person, not a bio, … person
• Have to learn how to communicate
  Have to learn the language
• Have to form a working relationship with domain expert(s)
• Have to find problems that leverage your skills

Working Cross-Culture: A Way to Engage With Domain Scientists
• Find someone who is desperate for help
• Communicate in terms of scenarios
• Work on a problem that gives 100x benefit
  – Weeks/task vs hours/task
• Solve 20% of the problem
  – The other 80% will take decades
• Prototype
• Go from working-to-working. Always have
  – Something to show
  – Clear next steps
  – Clear goal
• Avoid death-by-collaboration-meetings.

Working Cross-Culture -- 20 Questions: A Way to Engage With Domain Scientists
• Astronomers proposed 20 questions
• Typical of things they want to do
• Each would require a week or more in the old way (programming in tcl / C++ / FTP)
• Goal: make it easy to answer questions
• This goal motivates DB and tools design

The 20 Queries
Q1: Find all galaxies without unsaturated pixels within 1' of a given point of ra=75.327, dec=21.023.
Q2: Find all galaxies with blue surface brightness between 23 and 25 mag per square arcsecond, and -10<super galactic latitude (sgb)<10, and declination less than zero.
Q3: Find all galaxies brighter than magnitude 22, where the local extinction is >0.75.
Q4: Find galaxies with an isophotal surface brightness (SB) larger than 24 in the red band, with an ellipticity>0.5, and with the major axis of the ellipse having a declination of between 30" and 60" arc seconds.
Q5: Find all galaxies with a de Vaucouleurs profile (r¼ falloff of intensity on disk) and the photometric colors consistent with an elliptical galaxy.
Q6: Find galaxies that are blended with a star, output the deblended galaxy magnitudes.
Q7: Provide a list of star-like objects that are 1% rare.
Q8: Find all objects with unclassified spectra.
Q9: Find quasars with a line width >2000 km/s and 2.5<redshift<2.7.
Q10: Find galaxies with spectra that have an equivalent width in Hα >40Å (Hα is the main hydrogen spectral line.)
Q11: Find all elliptical galaxies with spectra that have an anomalous emission line.
Q12: Create a gridded count of galaxies with u-g>1 and r<21.5 over 60<declination<70, and 200<right ascension<210, on a grid of 2', and create a map of masks over the same grid.
Q13: Create a count of galaxies for each of the HTM triangles which satisfy a certain color cut, like 0.7u-0.5g-0.2i<1.25 && r<21.75, output it in a form adequate for visualization.
Q14: Find stars with multiple measurements that have magnitude variations >0.1. Scan for stars that have a secondary object (observed at a different time) and compare their magnitudes.
Q15: Provide a list of moving objects consistent with an asteroid.
Q16: Find all objects similar to the colors of a quasar at 5.5<redshift<6.5.
Q17: Find binary stars where at least one of them has the colors of a white dwarf.
Q18: Find all objects within 30 arcseconds of one another that have very similar colors: that is, where the color ratios u-g, g-r, r-i are less than 0.05m.
Q19: Find quasars with a broad absorption line in their spectra and at least one galaxy within 10 arcseconds. Return both the quasars and the galaxies.
Q20: For each galaxy in the BCG data set (brightest color galaxy), in 160<right ascension<170, -25<declination<35, count the galaxies within 30" of it that have a photoz within 0.05 of that galaxy.
Also some good queries at: http://www.sdss.jhu.edu/ScienceArchive/sxqt/sxQT/Example_Queries.html

http://SkyServer.sdss.org
• Solves the 20 queries
• Has 150 hours of online instruction
  – Translated to Hungarian
• Professional astronomers use it as the SDSS Science Catalog Analysis Service.
• Clone operating in Hungary.
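
To make one of the 20 queries concrete, here is a minimal Python sketch of the counting half of Q12 (the mask map is left out). The deck does not show how SkyServer expresses this query, so everything here is an illustration only: the list-of-dicts catalog, the field names `ra`, `dec`, `u`, `g`, `r`, and the function name are assumptions, and the grid is a flat 2' grid in ra/dec that ignores the narrowing of right-ascension cells toward the pole.

```python
from collections import Counter
from math import floor

CELL_DEG = 2.0 / 60.0  # grid spacing of 2 arcminutes, in degrees


def gridded_galaxy_count(galaxies):
    """Count galaxies per 2' cell for the Q12 cuts.

    `galaxies` is a hypothetical iterable of dicts with keys
    ra, dec (degrees) and u, g, r (magnitudes).
    Returns a Counter keyed by (ra_cell, dec_cell).
    """
    counts = Counter()
    for gal in galaxies:
        # Sky window from Q12: 200 < ra < 210, 60 < dec < 70.
        in_window = 200.0 < gal["ra"] < 210.0 and 60.0 < gal["dec"] < 70.0
        # Color and brightness cut from Q12: u-g > 1 and r < 21.5.
        passes_cut = (gal["u"] - gal["g"]) > 1.0 and gal["r"] < 21.5
        if in_window and passes_cut:
            cell = (
                floor((gal["ra"] - 200.0) / CELL_DEG),
                floor((gal["dec"] - 60.0) / CELL_DEG),
            )
            counts[cell] += 1
    return counts
```

The same filter-then-group shape covers several of the other queries; running it next to the data rather than on a downloaded copy is the point the deck makes about databases.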

SkyQuery (http://skyquery.net/)
• Distributed Query tool using a set of web services
• Many astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England)
• Has grown from 4 to 15 archives, now becoming international standard
• Allows queries like:
    SELECT o.objId, o.r, o.type, t.objId
    FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
    WHERE XMATCH(o,t)<3.5
      AND AREA(181.3,-0.76,6.5)
      AND o.type=3 and (o.I - t.m_j)>2

SkyQuery Structure
• Each SkyNode publishes
  – Schema Web Service
  – Database Web Service
• Portal
  – Plans Query (2 phase)
  – Integrates answers
  – Is itself a web service
[Diagram labels: SkyQuery Portal; Image Cutout; SDSS; INT; FIRST; 2MASS]

MyDB: eScience Workbench
• Prototype of bringing analysis to the data
• Everybody gets a workspace (database)
  – Executes analysis at the data
  – Store intermediate results there
  – Long queries run in batch
  – Results shared within groups
• Only fetch the final results
• Extremely successful – matches work patterns

Summary
• Computational Science
  – Simulation
  – Data Bases
  – Analysis (organization and mining)
    • needed by simulations and
    • experiments
  – Visualization
• Each Science X
  – Has a comp-X branch
  – Is getting an X-info branch
  – Objectifying that science: defining terms precisely
• This broadening is multi-disciplinary
  – Pair: good domain scientist + good computer scientist
  – Chemistry is important
• A concrete way to approach Grid-computing.

Outline
• Want to build a TerraServer for Hungary?
  – Could be done inexpensively (if you have the data)
  – Microsoft would license the software to you
• My view of eScience & Hungary
  – Hungary can’t lead in hardware
  – Hungary CAN lead in software
    • Algorithms: data mining, analysis
    • Tools: that implement the algorithms
    • Systems: learn by doing
  – Could start an industry, fits EU agenda.
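
A footnote to the scaling claim on the "Next-Generation Data Analysis" slide, that with data and computers both growing at Moore's Law only N log N methods keep up. The sketch below works through that arithmetic under a simplifying assumption that is not in the deck: data size N and machine speed both double exactly once per period. The function name, the starting size `n0`, and the cost table are hypothetical.

```python
import math


def relative_runtime(cost, doublings, n0=10**6):
    """Runtime after `doublings` periods, relative to today's runtime.

    Assumes both the data size N (starting at n0 items) and the machine
    speed double once per period; `cost(n)` is the operation count.
    """
    growth = 2 ** doublings
    # Work grows to cost(n0 * growth); the machine is `growth` times faster.
    return cost(n0 * growth) / (growth * cost(n0))


# Operation counts for the three scalings named on the slide.
COSTS = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

if __name__ == "__main__":
    for name, cost in COSTS.items():
        ratios = [round(relative_runtime(cost, k), 2) for k in (0, 5, 10)]
        print(name, ratios)
```

Under these assumptions an N² method gets slower by the same factor the data grew, an N³ method by that factor squared, and an N log N method only by the ratio of the logarithms, which is why the slide asks for approximate answers rather than optimal ones.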