Early Experience Prototyping a Science
Data Server for Environmental Data
Deb Agarwal, LBL
Catharine van Ingen, MSFT
20 September 2006
Berkeley Water Center
Outline
 Landscape
 Data archives and other sources
 Typical small group collaboration needs
 Examples using “Ameriflux”
 Science Data Server
 Goals and ideal capabilities
 Approach
 Experiences with the current system
 Next generation
 Next set of development efforts
 Research issues
 Conclusion
Unprecedented Data Availability
Typical Data Flow Today
Large Data Archives
Local measurements
Models
Ameriflux Collaboration Overview
 149 Sites across the Americas
 Each site reports a minimum
of 22 common measurements.
 Communal science – each
principal investigator acts
independently to prepare and
publish data.
 Data published to and
archived at Oak Ridge.
 Total data reported to date on
the order of 150M half-hourly
measurements.
 http://public.ornl.gov/ameriflux/
What A Tower Sees
Fc  Fstorage  NEE  Pnet  Rleaf  R wood  R roots  R microbes
1. Applications of eddy covariance measurements, Part 1: Lecture on Analyzing and
Interpreting CO2 Flux Measurements, Dennis Baldocchi, CarboEurope Summer
Course, 2006, Namur, Belgium (http://nature.berkeley.edu/biometlab/lectures/)
Example Carbon-Climate Investigations
 Net carbon exchange for the ecosystem
 Impact of climate change on the greening of
ecosystems
 Start of leaf growth
 Duration of photosynthesis
 Effects of early spring on carbon uptake
 Role of ecosystem and latitude on carbon flux
 Effect of various pollution sources on carbon in
atmosphere and carbon balance
Measurements Are Not Simple or Complete
 Gaps in the data
 Quiet nights
 Bird poop
 High winds
 …
 Difficult to make measurements
 Leaf area index
 Wood respiration
 Soil respiration
 …
 Localized measurements – tower footprint
 Local investigator knowledge important
 PIs’ science goals are not uniform across the towers
Examples of Carbon-Climate Datasets
[Diagram labels: Soils; Climate; Remote Sensing; Observatory datasets; Spatially continuous datasets; Large Data Archives; Local measurements; Scientific Data Server]
Scientific Data Server - Goals
 Act as a local repository for data and metadata assembled by a
small group of scientists from a wide variety of sources
 Simplify provenance by providing a common “safe deposit box”
for assembled data
 Interact simply with existing and emerging Internet portals for data
and metadata download, and, over time, upload
 Simplify data assembly by adding automation
 Simplify name space confusion by adding explicit decode
translation
 Support basic analyses across the entire dataset for both data
cleaning and science
 Simplify mundane data handling tasks
 Simplify quality checking and data selection by enabling data
browsing
Scientific Data Server - Non-Goals
 Replace the large Internet data source sites
 The technology developed may be applicable, but the focus is
on the group collaboration scale and usability
 Very large datasets require different operational practices
 Perform complex modeling and statistical analyses
 There are a lot of existing tools with established trust based on
long track records
 Only part of a full LIMS (laboratory information management
system)
 Develop a new standard schema or controlled vocabulary
 Other work on these is progressing independently
 Due to the heterogeneity of the data, more than one such
standard seems likely to be relevant
Scientific Data Server - Workflows
 Staging: adding data or metadata
 New downloaded or field measurements added
 New derived measurements added
 Editing: changing data or metadata
 Existing older measurements re-calibrated or re-derived
 Data cleaning or other algorithm changes
 Gap filling
 Sharing: making the latest acquired data available rapidly
 Even before all the checks have been made
 Browsing new data before more detailed analyses
 Private Analysis: Supporting individual researchers (MyDB)
 Stable location for personal calibrations, derivations, and other data transformations
 Import/Export to analysis tools and models
 Curating: data versioning and provenance
 Simple parent:child versioning to track collections of data used for specific uses
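The parent:child versioning listed under Curating could be sketched roughly as below. This is only an illustration of the idea, not the server's implementation: the `DatasetVersion` class, its fields, and the `lineage` helper are hypothetical names, and the version notes are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatasetVersion:
    """One immutable snapshot of a data collection (hypothetical structure)."""
    name: str
    note: str
    parent: Optional["DatasetVersion"] = None


def lineage(version: DatasetVersion) -> List[str]:
    """Walk parent links from a version back to the original dataset."""
    names = []
    current: Optional[DatasetVersion] = version
    while current is not None:
        names.append(current.name)
        current = current.parent
    return names


# Illustrative chain: each child records the dataset it was derived from.
original = DatasetVersion("v1", "downloaded from the archive")
recalibrated = DatasetVersion("v2", "older measurements re-calibrated", parent=original)
gap_filled = DatasetVersion("v3", "gap filling applied", parent=recalibrated)

print(lineage(gap_filled))
```

Keeping each version immutable and pointing only at its parent is one simple way to answer "which collection of data was used for this analysis?" without copying provenance notes by hand.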
Scientific Data Server - Logical Overview
[Diagram labels. Inputs: Large Data Archives; Local measurements. Scientific Data Server: Staging Databases and Cubes; Latest Dataset Database; Latest Dataset Cube; Last Known Good Dataset(s) Database; Last Known Good Dataset(s) Cubes; Older Dataset(s) Archive Database; Private Data Analysis Databases and Cubes. Data Access and Analysis Tools: Analysis Tools (Excel, Matlab, SPlus, SAS, ArcGIS); Simple web data plots and tables; BigPlot data browsing; Flat file data import/export; Computational Models]
Databases
 All descriptive metadata and data held in
relational databases
 Metadata is important too!
 While separate databases are shown,
the datasets may actually reside in a
single database
 Mapping is transparent to the
scientist
 Separate databases used for
performance
 Unified databases used for
simplicity
 New metadata and data are staged with
a temporary database
 Minimal quality checks applied
 All name and unit conversions
 Data may be exported to flat file, copied
to a private MyDb database, directly
accessed programmatically, or ?
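A staging step that applies name and unit conversions plus a minimal quality check might look something like the sketch below. Everything specific here is an assumption made for illustration: the `DECODE` table, the site-specific column names `AirTemp_F` and `Precip_in`, the `-9999` missing-value sentinel, and the `stage_row` helper. Only the common names `TA` and `PREC` come from the data type labels in the charts.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical decode table: reported name -> (common name, unit conversion).
DECODE = {
    "TA": ("TA", lambda v: v),                               # already degrees Celsius
    "AirTemp_F": ("TA", lambda v: (v - 32.0) * 5.0 / 9.0),   # Fahrenheit -> Celsius
    "Precip_in": ("PREC", lambda v: v * 25.4),               # inches -> millimetres
}

MISSING = -9999.0  # assumed missing-value sentinel, not taken from the source


def stage_row(raw: Dict[str, float]) -> Tuple[Dict[str, Optional[float]], List[str]]:
    """Decode names, convert units, and set aside names that cannot be decoded."""
    staged: Dict[str, Optional[float]] = {}
    rejected: List[str] = []
    for name, value in raw.items():
        if name not in DECODE:
            rejected.append(name)  # unknown name: needs a human to extend the table
            continue
        common, convert = DECODE[name]
        # Minimal quality check: keep missing values as None instead of converting them.
        staged[common] = None if value == MISSING else convert(value)
    return staged, rejected


staged, rejected = stage_row({"AirTemp_F": 68.0, "Precip_in": 2.0, "mystery": 1.0})
print(staged, rejected)
```

Collecting undecoded names rather than failing keeps the staging run going while still surfacing every name that needs a decision.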
[Diagram labels: Staging Database; Latest Dataset Database; Last Known Good Dataset(s) Database; Older Dataset(s) Archive Database; MyDb Analysis Database]
Data Cubes
 A data cube is a database specifically for data mining (OLAP)
 Initially developed for commercial needs like tracking sales of Oreos and milk
 Simple aggregations (sum, min, or max) can be pre-computed for speed
 Additional calculations (median) can be computed dynamically
 Both operate along dimensions such as time, site, or datumtype
 Constructed from a relational database
 A specialized query language (MDX) is used
 Client tool integration is evolving
 Excel PivotTables allow simple data viewing
 More powerful charting with Tableau or ProClarity (commercial mining tools)
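The split between pre-computed and dynamic aggregations can be illustrated with a small in-memory sketch. It is plain Python rather than an OLAP engine or MDX, the rows are made-up values rather than Ameriflux data, and `build_cube`, `roll_up`, and `median_for` are hypothetical helper names.

```python
from statistics import median

# Made-up rows: (site, year, datumtype, value). Not real measurements.
ROWS = [
    ("Tonzi Ranch", 2004, "TA", 12.0),
    ("Tonzi Ranch", 2004, "TA", 18.0),
    ("Tonzi Ranch", 2005, "TA", 15.0),
    ("Vaira Ranch", 2004, "TA", 11.0),
    ("Vaira Ranch", 2004, "TA", 13.0),
    ("Vaira Ranch", 2004, "TA", 30.0),
]


def build_cube(rows):
    """Pre-compute sum, min, max, and count for every (site, year, datumtype) cell."""
    cube = {}
    for site, year, datumtype, value in rows:
        key = (site, year, datumtype)
        if key not in cube:
            cube[key] = {"sum": value, "min": value, "max": value, "count": 1}
        else:
            cell = cube[key]
            cell["sum"] += value
            cell["min"] = min(cell["min"], value)
            cell["max"] = max(cell["max"], value)
            cell["count"] += 1
    return cube


def roll_up(cube, site=None, year=None, datumtype=None):
    """Merge pre-computed cells along every dimension left as None."""
    total = None
    for (s, y, d), cell in cube.items():
        if site not in (None, s) or year not in (None, y) or datumtype not in (None, d):
            continue
        if total is None:
            total = dict(cell)
        else:
            total["sum"] += cell["sum"]
            total["min"] = min(total["min"], cell["min"])
            total["max"] = max(total["max"], cell["max"])
            total["count"] += cell["count"]
    return total


def median_for(rows, site):
    """A median cannot be merged from cell summaries, so it is computed from the rows."""
    return median(v for s, _, _, v in rows if s == site)


cube = build_cube(ROWS)
tonzi = roll_up(cube, site="Tonzi Ranch")
print(tonzi, median_for(ROWS, "Vaira Ranch"))
```

Sum, min, max, and count merge cell by cell, which is why they can be stored ahead of time; the median needs the underlying values, which is why it is computed on request.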
[Diagram labels: Staging Data Cube; Latest Dataset Cube; Last Known Good Dataset(s) Cubes; MyDb Analysis Data Cubes]
Browsing For Data Availability
[Chart: Sites Reporting Data Colored by Year. Ameriflux Data Availability: All Data. One bar per site (Brazil -- Tapajos through USA -- WV Canaan Valley), colored by year, 1991 to 2006]
Browsing For Data Availability
[Chart: Total Data Availability by Type Colored by Site. Measurement counts from 0 to 18,000,000 for each data type (APAR, CO2, DT, FC, FH2O, FG, FPAR, GPP, H2O, H, Leafwetness, LE, NEE, O3, Other, PAR, PRESS, PREC, Rd, Rg, Rgl, RH, Rn, SCO2, Sa, SWC, SVP, TAU, TA, Tbole, Tdew, TS, U, UST, UW, VPD, WD, WS), colored by site]
Data type reporting is far from uniform across type
Browsing for Data Availability
[Chart: Total Data Availability by Site Colored by Type. Measurement counts (axis labeled up to 7,000,000) for each site, colored by data type]
Sites report more data either because of longevity or specific research interests
Browsing for Data Quality
 Real field data has both short term gaps and longer term outages
 The utility of the data depends on the nature of the science being performed
 Browsing data counts can give rapid insight into how the data can be used before more complex analyses are performed
[Chart: Average Temperature (Deg C) by Latitude. Annotation: What’s going on at higher latitudes? (It should be getting colder). Site labels with latitude: 55.86306 BOREAS NSA 1981 burn site; 55.879002 BOREAS NSA Old Black Spruce; 55.90583 BOREAS NSA 1930 burn site; 55.911671 BOREAS NSA 1963 burn site; 55.916672 BOREAS NSA 1989 burn site; 56.63583 BOREAS NSA 1998 burn site; 69.133331 AK Happy Valley; 70.281471 AK Upad; 70.496002 AK Atqasuk]
[Chart: Soil Water Content (% Volume) and Data Count by month (1 to 12). Labels: Vaira Ranch; Tonzi Ranch; 0 (cm); 20 (cm). Annotations: Measurements charted on axes are gaps. Data often missing in the winter!]
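Browsing data counts for gaps can be approximated with a few lines of code. The sketch below assumes half-hourly reporting (48 values per day, as in the Ameriflux overview) and uses synthetic timestamps; `monthly_counts` and `coverage` are hypothetical helper names.

```python
import calendar
from collections import Counter
from datetime import datetime, timedelta


def monthly_counts(timestamps):
    """Count reported measurements per (year, month)."""
    return Counter((t.year, t.month) for t in timestamps)


def coverage(counts, year, month, per_day=48):
    """Fraction of the expected half-hourly values that are actually present."""
    days = calendar.monthrange(year, month)[1]
    return counts.get((year, month), 0) / (days * per_day)


# Synthetic example: a complete January, then an outage starting after 10 February.
start = datetime(2005, 1, 1)
stamps = [start + timedelta(minutes=30 * i) for i in range(48 * (31 + 10))]

counts = monthly_counts(stamps)
for month in (1, 2, 3):
    print(month, round(coverage(counts, 2005, month), 2))
```

A month whose coverage falls well below 1.0 is a candidate gap or outage to inspect before running heavier analyses.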
Browsing for Data Quality
 Real field data has unit and time scale conversion problems
 Sometimes easy to spot in isolation
 Sometimes easier to spot when comparing to other data
 Browsing data values can give rapid insight into how the data can be used before more complex analyses are performed
[Chart: Maximum Annual Air Temperature. Annotation: Global Warming or Reporting in Fahrenheit?]
[Chart: Average Air Temperature, Two Nearby Sites. Annotations: Local time or GMT time? Odd Microclimate Effects or Error in Time Reporting?]
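Both kinds of problem can be screened for with simple heuristics before a person looks at the plots. The sketch below is one possible approach under stated assumptions: the 60 degree bound, the two-hour tolerance, the site names, and the temperature profiles are all invented for illustration.

```python
PLAUSIBLE_MAX_C = 60.0  # assumed bound: above any air temperature expected in Celsius


def suspect_fahrenheit(max_annual_temps):
    """Return sites whose maximum annual air temperature looks like Fahrenheit."""
    return sorted(site for site, t in max_annual_temps.items() if t > PLAUSIBLE_MAX_C)


def peak_hour(hourly_means):
    """Hour of day (0-23) with the highest mean air temperature."""
    return max(range(24), key=lambda h: hourly_means[h])


def suspect_time_shift(hourly_a, hourly_b, tolerance_hours=2):
    """Flag two nearby sites whose daily temperature peaks are far apart."""
    diff = abs(peak_hour(hourly_a) - peak_hour(hourly_b))
    diff = min(diff, 24 - diff)  # wrap around midnight
    return diff > tolerance_hours


# Made-up daily profile peaking at 15:00, and the same profile reported 8 hours late.
local_profile = [10 - abs(h - 15) for h in range(24)]
shifted_profile = [local_profile[(h - 8) % 24] for h in range(24)]

print(suspect_fahrenheit({"Site A": 38.5, "Site B": 101.3}))
print(suspect_time_shift(local_profile, shifted_profile))
```

A flag from either check is only a prompt to ask the local investigator; an odd microclimate can produce the same signature as a reporting error.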
Lessons Learned To Date
 Metadata is as important as data
 Comparing sites of like vegetation or climate is as important as comparing by latitude or other physical quantity
 Curate the two together
 Controlled vocabularies are hard
 Humans like making up names and have a hard time remembering 100+ names
 Assume a decode step in the staging pipeline
 There are at least three database schema families and two cube construction approaches
 Everyone has a favorite
 Each has advantages and disadvantages
 Automate the maintenance and use the right one for the right job
 Visual programming tools are great for prototyping
 But debugging and maintenance can hit a wall
 It’s easy to overbuild – use when “good enough”
 Data analysis and data cleaning are intertwined
 Data cleaning is always on-going
 Share the simple tools and visualizations
The saga continues at http://dsd.lbl.gov/BWC/amfluxblog/ and
http://research.microsoft.com/~vaningen/BWC/BWC.htm
Near Term Futures
 Improve current capabilities
 Assemble gap-filled and non-gap-filled data sets
 Implement incremental data staging to enable speedy and
simple data editing by an actual scientist (rather than a
programmer)
 Implement expanded metadata handling to enable scientist
to add site characteristics and sort sites on those expanded
definitions
 Add basic reporting capabilities for server-side browsing of
data availability to speed and simplify locating “interesting”
data
 Apply Data Server capabilities to a different set of data with
different (but related) science
 Considering either Russian River or Yosemite Valley
hydrological data
 Will be automating download from multiple different national
data sets
 Spatial (GIS) analyses more important
 Linkage with imagery data necessary for science
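Keeping gap-filled and non-gap-filled data sets side by side could work along the lines of the sketch below. The slides do not say which gap-filling method is used, so linear interpolation of short gaps is a stand-in chosen for simplicity; `fill_short_gaps`, the `max_gap` parameter, and the sample values are hypothetical.

```python
def fill_short_gaps(values, max_gap=4):
    """Linearly interpolate runs of None no longer than max_gap; flag filled points."""
    filled = list(values)          # the raw series is left untouched
    flags = [False] * len(values)  # True where a value was interpolated
    n = len(values)
    i = 0
    while i < n:
        if values[i] is not None:
            i += 1
            continue
        j = i
        while j < n and values[j] is None:
            j += 1
        gap = j - i
        # Fill only interior gaps with a measured value on both sides.
        if i > 0 and j < n and gap <= max_gap:
            left, right = values[i - 1], values[j]
            for k in range(i, j):
                frac = (k - i + 1) / (gap + 1)
                filled[k] = left + frac * (right - left)
                flags[k] = True
        i = j
    return filled, flags


raw = [10.0, None, None, 16.0, None, None, None, None, None, 20.0]
filled, flags = fill_short_gaps(raw, max_gap=2)
print(filled)
print(flags)
```

Returning a separate filled series with per-point flags means the raw data set stays available for analyses that must not use interpolated values.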
Longer Term Futures
 Handling imagery and other remote sensing information
 Curating images is different from curating time series data
 Using both together enables new science and new insights
 Graphical selection and display of data
 Support for user specified calculations within the database
 Support for direct connections to analysis and statistical packages
 Linkage with models
 Additional (emerging) data standards such as NetCDF
 Handling “just in time” data delivery and model result curation
 Data mining subscription services
 Handling of a broader array of data types
 Support for workflow tools
Conclusions
 Large data archives create the opportunity to
 Do science at the regional and global scale
 Combine data from multiple disciplines
 Perform historical trend analysis
 Small scientific collaborations need help to
 Perform analyses using more data than they can
currently manage
 Enable data handling and versioning
 Store the currently needed data and metadata
 Browse the data for science
 It’s the science, not the computer science
 Computer science research can certainly help
URLs
 Berkeley Water Center (BWC)
http://esd.lbl.gov/BWC/
 Microsoft Project at BWC
http://esd.lbl.gov/BWC/thrust_areas/mstci.html
 Ameriflux Project
http://esd.lbl.gov/BWC/thrust_areas/ameriflux.html
http://dsd.lbl.gov/BWC/amfluxblog/