Early Experience Prototyping a Science
Data Server for Environmental Data
Deb Agarwal, LBL
Catharine van Ingen, MSFT
20 September 2006
Berkeley Water Center
Outline
Landscape
Data archives and other sources
Typical small group collaboration needs
Examples using “Ameriflux”
Science Data Server
Goals and ideal capabilities
Approach
Experiences with the current system
Next generation
Next set of development efforts
Research issues
Conclusion
Unprecedented Data Availability
Typical Data Flow Today
Large Data Archives
Local measurements
Models
Ameriflux Collaboration Overview
149 Sites across the Americas
Each site reports a minimum
of 22 common measurements.
Communal science – each
principal investigator acts
independently to prepare and
publish data.
Data published to and
archived at Oak Ridge.
Total data reported to date on
the order of 150M half-hourly
measurements.
http://public.ornl.gov/ameriflux/
What A Tower Sees
[Figure: carbon flux terms seen by a tower: Fc, Fstorage, NEE, Pnet, Rleaf, Rwood, Rroots, Rmicrobes]
1. Applications of eddy covariance measurements, Part 1: Lecture on Analyzing and
Interpreting CO2 Flux Measurements, Dennis Baldocchi, CarboEurope Summer
Course, 2006, Namur, Belgium (http://nature.berkeley.edu/biometlab/lectures/)
Example Carbon-Climate Investigations
Net carbon exchange for the ecosystem
Impact of climate change on the greening of
ecosystems
Start of leaf growth
Duration of photosynthesis
Effects of early spring on carbon uptake
Role of ecosystem and latitude on carbon flux
Effect of various pollution sources on carbon in
atmosphere and carbon balance
Measurements Are Not Simple or Complete
Gaps in the data
Quiet nights
Bird poop
High winds
…
Difficult to make measurements
Leaf area index
Wood respiration
Soil respiration
…
Localized measurements – tower footprint
Local investigator knowledge important
PIs’ science goals are not uniform across the towers
Examples of Carbon-Climate Datasets
[Figure labels: Soils; Climate; Remote Sensing; Observatory datasets; Spatially continuous datasets]
[Diagram labels: Scientific Data Server; Large Data Archives; Local measurements]
Scientific Data Server - Goals
Act as a local repository for data and metadata assembled by a
small group of scientists from a wide variety of sources
Simplify provenance by providing a common “safe deposit box”
for assembled data
Interact simply with existing and emerging Internet portals for data
and metadata download, and, over time, upload
Simplify data assembly by adding automation
Simplify name space confusion by adding explicit decode
translation
Support basic analyses across the entire dataset for both data
cleaning and science
Simplify mundane data handling tasks
Simplify quality checking and data selection by enabling data
browsing
Scientific Data Server - Non-Goals
Replace the large Internet data source sites
The technology developed may be applicable, but the focus is
on the group collaboration scale and usability
Very large datasets require different operational practices
Perform complex modeling and statistical analyses
There are a lot of existing tools with established trust based on
long track records
Only part of a full LIMS (laboratory information management
system)
Develop a new standard schema or controlled vocabulary
Other work on these is progressing independently
Due to the heterogeneity of the data, more than one such
standard seems likely to be relevant
Scientific Data Server - Workflows
Staging: adding data or metadata
New downloaded or field measurements
added
New derived measurements added
Editing: changing data or metadata
Existing older measurements re-calibrated
or re-derived
Data cleaning or other algorithm changes
Gap filling
Sharing: making the latest acquired data
available rapidly
Even before all the checks have been
made
Browsing new data before more detailed
analyses
Private Analysis: Supporting individual
researchers (MyDB)
Stable location for personal calibrations,
derivations, and other data transformations
Import/Export to analysis tools and models
Curating: data versioning and provenance
Simple parent:child versioning to track
collections of data used for specific uses
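The simple parent:child versioning named under Curating could be sketched roughly as below. This is only an illustration, not the server's actual schema: the `DatasetVersion` record, its fields, and the version names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DatasetVersion:
    """One curated collection of data (hypothetical record layout)."""
    name: str
    note: str
    parent: Optional[str] = None  # name of the version this one was derived from


def lineage(versions: Dict[str, DatasetVersion], name: str) -> List[str]:
    """Walk parent links from a version back to its root ancestor."""
    chain: List[str] = []
    current: Optional[str] = name
    while current is not None:
        chain.append(current)
        current = versions[current].parent
    return chain


# Hypothetical version names, one per workflow step.
versions = {
    v.name: v
    for v in [
        DatasetVersion("download-2006-09", "data as downloaded"),
        DatasetVersion("recalibrated", "older measurements re-calibrated",
                       parent="download-2006-09"),
        DatasetVersion("gap-filled", "gap filling applied", parent="recalibrated"),
    ]
}

print(lineage(versions, "gap-filled"))
```

Recording only the parent link keeps provenance cheap to maintain while still letting a scientist trace which collection a given analysis started from.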
Scientific Data Server - Logical Overview
[Diagram labels: Large Data Archives; Local measurements; Scientific Data Server; Staging Databases and Cubes; Latest Dataset Database; Latest Dataset Cube; Last Known Good Dataset(s) Database; Last Known Good Dataset(s) Cubes; Older Dataset(s) Archive Database; Private Data Analysis Databases and Cubes; Flat file data import/export; Computational Models; Data Access and Analysis Tools (BigPlot data browsing; Simple web data plots and tables; Analysis Tools: Excel, Matlab, SPlus, SAS, ArcGIS)]
Databases
All descriptive metadata and data held in
relational databases
Metadata is important too!
While separate databases are shown,
the datasets may actually reside in a
single database
Mapping is transparent to the
scientist
Separate databases used for
performance
Unified databases used for
simplicity
New metadata and data are staged with
a temporary database
Minimal quality checks applied
All name and unit conversions
Data may be exported to flat file, copied
to a private MyDb database, directly
accessed programmatically, or ?
[Diagram labels: Staging Database; Latest Dataset Database; Last Known Good Dataset(s) Database; Older Dataset(s) Archive Database; MyDb Analysis Database]
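The staging step described above (minimal quality checks plus all name and unit conversions) might look something like the following sketch. The decode table, the alias names, the unit strings, and the `-9999` missing-value sentinel are assumptions made for illustration; only the common names TA and PREC come from the data type list in these slides.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical decode table: (reported name, reported unit) -> (common name, converter).
# The aliases and unit strings are invented for this sketch.
DECODE: Dict[Tuple[str, str], Tuple[str, Callable[[float], float]]] = {
    ("TA", "degC"): ("TA", lambda v: v),
    ("Tair", "degC"): ("TA", lambda v: v),
    ("AirTemp", "K"): ("TA", lambda v: v - 273.15),
    ("PREC", "mm"): ("PREC", lambda v: v),
    ("Precip", "cm"): ("PREC", lambda v: v * 10.0),
}

MISSING = -9999.0  # assumed missing-value sentinel


def stage(name: str, unit: str, value: float) -> Optional[Tuple[str, float]]:
    """Return (common name, converted value), or None if the row is rejected."""
    if value == MISSING:
        return None  # minimal quality check: drop missing values
    entry = DECODE.get((name, unit))
    if entry is None:
        return None  # unknown name/unit pair: leave for a person to decode
    common, convert = entry
    return common, convert(value)


print(stage("AirTemp", "K", 273.15))
print(stage("Precip", "cm", 1.5))
```

Keeping the translation in an explicit table means a newly invented column name costs one new row rather than a code change.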
Data Cubes
A data cube is a database specifically for
data mining (OLAP)
Initially developed for commercial
needs like tracking sales of Oreos
and milk
Simple aggregations (sum, min, or
max) can be pre-computed for
speed
Additional calculations (median) can
be computed dynamically
Both operate along dimensions such
as time, site, or datumtype
Constructed from a relational
database
A specialized query language (MDX)
is used
Client tool integration is evolving
Excel PivotTables allow simple data
viewing
More powerful charting with
Tableau or ProClarity (commercial
mining tools)
[Diagram labels: Staging Data Cube; Latest Dataset Cube; Last Known Good Dataset(s) Cubes; MyDb Analysis Data Cubes]
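As a rough illustration of the cube idea, the sketch below pre-computes count, sum, min, and max per (site, year, datumtype) cell, rolls cells up along the time dimension, and computes a median on demand from the base rows. It is a plain Python stand-in under stated assumptions, not an OLAP server or MDX, and the sample values are invented rather than real Ameriflux data.

```python
from statistics import median
from typing import Dict, List, Tuple

Row = Tuple[str, int, str, float]  # (site, year, datumtype, value)

# Invented sample rows for illustration only.
ROWS: List[Row] = [
    ("Tonzi Ranch", 2004, "TA", 12.0),
    ("Tonzi Ranch", 2004, "TA", 18.0),
    ("Tonzi Ranch", 2005, "TA", 15.0),
    ("Vaira Ranch", 2004, "TA", 11.0),
    ("Vaira Ranch", 2004, "TA", 13.0),
    ("Vaira Ranch", 2004, "TA", 20.0),
]


def build_cube(rows: List[Row]) -> Dict[Tuple[str, int, str], Dict[str, float]]:
    """Pre-compute simple aggregations for every (site, year, datumtype) cell."""
    cube: Dict[Tuple[str, int, str], Dict[str, float]] = {}
    for site, year, dtype, value in rows:
        cell = cube.setdefault(
            (site, year, dtype),
            {"count": 0, "sum": 0.0, "min": value, "max": value},
        )
        cell["count"] += 1
        cell["sum"] += value
        cell["min"] = min(cell["min"], value)
        cell["max"] = max(cell["max"], value)
    return cube


def roll_up(cube, site: str, dtype: str) -> Dict[str, float]:
    """Combine pre-computed cells across all years for one site and type."""
    cells = [c for (s, _, d), c in cube.items() if s == site and d == dtype]
    return {
        "count": sum(c["count"] for c in cells),
        "sum": sum(c["sum"] for c in cells),
        "min": min(c["min"] for c in cells),
        "max": max(c["max"] for c in cells),
    }


def cell_median(rows: List[Row], site: str, year: int, dtype: str) -> float:
    """Median cannot be combined from cells, so compute it from base rows."""
    return median(v for s, y, d, v in rows if (s, y, d) == (site, year, dtype))


cube = build_cube(ROWS)
print(roll_up(cube, "Tonzi Ranch", "TA"))
print(cell_median(ROWS, "Vaira Ranch", 2004, "TA"))
```

Sum, min, and max can be merged from smaller cells, which is why they pre-compute well; a median needs the underlying values and so is computed dynamically.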
Browsing For Data Availability
Sites Reporting Data Colored by Year
[Chart: Ameriflux Data Availability, All Data. One entry per site (sites in Brazil, Canada, and the USA), colored by year from 1991 to 2006]
Browsing For Data Availability
Total Data Availability by Type Colored by Site
[Chart: total data count per data type, colored by site; count axis from 0 to 18,000,000. Data types: APAR, CO2, DT, FC, FH2O, FG, FPAR, GPP, H2O, H, Leafwetness, LE, NEE, O3, Other, PAR, PRESS, PREC, Rd, Rg, Rgl, RH, Rn, SCO2, Sa, SWC, SVP, TAU, TA, Tbole, Tdew, TS, U, UST, UW, VPD, WD, WS]
Data reporting is far from uniform across data types
Browsing for Data Availability
Total Data Availability by Site Colored by Type
[Chart: total data count per site, colored by data type; count axis from 0 to 7,000,000]
Sites report more data either because of longevity or specific research interests
Browsing for Data Quality
Real field data has both short term
gaps and longer term outages
The utility of the data depends
on the nature of the science
being performed
Browsing data counts can give
rapid insight into how the data
can be used before more
complex analyses are
performed
[Chart: Average Temperature (Deg C) versus Latitude. Callout: What’s going on at higher latitudes? (It should be getting colder)]
[Chart: Soil Water Content (% Volume) at 0 cm and 20 cm; sites labeled Vaira Ranch and Tonzi Ranch. Callout: Measurements charted on axes are gaps]
[Chart: Data Count by month (1 to 12). Callout: Data often missing in the winter!]
[Legend, latitude and site: 55.86306 BOREAS NSA 1981 burn site; 55.879002 BOREAS NSA Old Black Spruce; 55.90583 BOREAS NSA 1930 burn site; 55.911671 BOREAS NSA 1963 burn site; 55.916672 BOREAS NSA 1989 burn site; 56.63583 BOREAS NSA 1998 burn site; 69.133331 AK Happy Valley; 70.281471 AK Upad; 70.496002 AK Atqasuk]
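Browsing data counts for gaps, as described on this slide, can be approximated with a few lines of code. The sketch below counts half-hourly readings per month and flags months that fall well short of the expected number; the 50% threshold and the synthetic April to October record are assumptions chosen only to mimic the "missing in the winter" pattern.

```python
from calendar import monthrange
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable, List


def monthly_counts(timestamps: Iterable[datetime]) -> Dict[int, int]:
    """Number of readings in each calendar month (1 to 12)."""
    counts = Counter(ts.month for ts in timestamps)
    return {month: counts.get(month, 0) for month in range(1, 13)}


def sparse_months(counts: Dict[int, int], year: int,
                  min_fraction: float = 0.5) -> List[int]:
    """Months with fewer than min_fraction of the expected half-hourly readings."""
    flagged = []
    for month in range(1, 13):
        expected = monthrange(year, month)[1] * 48  # 48 half-hours per day
        if counts.get(month, 0) < min_fraction * expected:
            flagged.append(month)
    return flagged


# Synthetic record: a site that only reports from April through October 2005.
stamps = []
t = datetime(2005, 4, 1)
while t < datetime(2005, 11, 1):
    stamps.append(t)
    t += timedelta(minutes=30)

counts = monthly_counts(stamps)
print(sparse_months(counts, 2005))
```

A count table like this says nothing about data values, but it quickly shows whether a site can support, for example, a winter analysis at all.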
Browsing for Data Quality
Real field data has unit and time scale
conversion problems
Sometimes easy to spot in
isolation
Sometimes easier to spot when
comparing to other data
Browsing data values can give
rapid insight into how the data can
be used before more complex
analyses are performed
[Chart: Maximum Annual Air Temperature. Callout: Global Warming or Reporting in Fahrenheit?]
[Chart: Average Air Temperature, Two Nearby Sites. Callouts: Local time or GMT time? Odd Microclimate Effects or Error in Time Reporting?]
Lessons Learned To Date
Metadata is as important as data
Comparing sites of like vegetation or climate is as important
as comparing by latitude or other physical quantity
Curate the two together
Controlled vocabularies are hard
Humans like making up names and have a hard time
remembering 100+ names
Assume a decode step in the staging pipeline
There are at least three database schema families and two cube
construction approaches
Everyone has a favorite
Each has advantages and disadvantages
Automate the maintenance and use the right one for the
right job
Visual programming tools are great for prototyping
But debugging and maintenance can hit a wall
It’s easy to overbuild – use when “good enough”
Data analysis and data cleaning are intertwined
Data cleaning is always on-going
Share the simple tools and visualizations
The saga continues at http://dsd.lbl.gov/BWC/amfluxblog/ and
http://research.microsoft.com/~vaningen/BWC/BWC.htm
Near Term Futures
Improve current capabilities
Assemble gap-filled and non-gap-filled data sets
Implement incremental data staging to enable speedy and
simple data editing by an actual scientist (rather than a
programmer)
Implement expanded metadata handling to enable scientists
to add site characteristics and sort sites on those expanded
definitions
Add basic reporting capabilities for server-side browsing of
data availability to speed and simplify locating “interesting”
data
Apply Data Server capabilities to a different set of data with
different (but related) science
Considering either Russian River or Yosemite Valley
hydrological data
Will be automating download from multiple different national
data sets
Spatial (GIS) analyses more important
Linkage with imagery data necessary for science
Longer Term Futures
Handling imagery and other remote sensing information
Curating images is different from curating time series data
Using both together enables new science and new insights
Graphical selection and display of data
Support for user specified calculations within the database
Support for direct connections to analysis and statistical
packages
Linkage with models
Additional (emerging) data standards such as NetCDF
Handling “just in time” data delivery and model result
curation
Data mining subscription services
Handling of a broader array of data types
Support for workflow tools
Conclusions
Large data archives create the opportunity to
Do science at the regional and global scale
Combine data from multiple disciplines
Perform historical trend analysis
Small scientific collaborations need help to
Perform analyses using more data than they can
currently manage
Enable data handling and versioning
Store the currently needed data and metadata
Browse the data for science
It’s the science, not the computer science
Computer science research can certainly help
URLs
Berkeley Water Center (BWC)
http://esd.lbl.gov/BWC/
Microsoft Project at BWC
http://esd.lbl.gov/BWC/thrust_areas/mstci.html
Ameriflux Project
http://esd.lbl.gov/BWC/thrust_areas/ameriflux.html
http://dsd.lbl.gov/BWC/amfluxblog/