Tracking Metadata and Lineage of the Data Processing Chain
for Mapping Snow Cover Properties with the NASA MODIS
James Frew (1), Thomas H. Painter (2), Peter Slaughter (1), Jeff Dozier (1)
(1) Donald Bren School of Environmental Science and Management, University of California, Santa Barbara
(2) National Snow and Ice Data Center, University of Colorado, Boulder
1
Outline
o Motivation
 Snow mapping product
 Implications for hydrologic modeling
o Lineage Capture
 Wrapping: the ESSW experience
 Instrumenting, overriding, monitoring: the (ongoing) ES3 experience
2
MODIS image – Sierra Nevada
[Figure: EOS Terra MODIS, 07 March 2004. MOD09 Surface Reflectance, bands at 0.555, 0.645, 0.858 µm]
3
Snow-covered area and grain size
4
Hindu Kush
2003 DOY 070
5
Colorado Rockies
CLPX
13 March 2002
6
Model structure: MODIS snow-area / albedo
[Figure: data-flow diagram centered on the "MODIS snow cover and grain size" model. Box labels: MODIS cloud mask (48 bits); MODIS 7 land bands (112 bits); MODIS view angles; MODIS quality flags; Solar zenith, azimuth; Topography; Watershed info; Basin mask; Snow fraction; albedo; Veg fraction; Soil fraction; Shade fraction; Open water fraction; RMS error; Quality flag; Processing Lineage]
7
Lineage Capture, Take 1
The ESSW experience
8
Using Existing Science Applications
o No “standard” Earth science computing environment
 commercial packages (ArcInfo, MATLAB, …)
 public packages/models (MM5, MODTRAN, …)
 locally-developed codes
 arbitrary combinations of the above
o Example: SST from AVHRR
 commercial, standalone programs
 parameters highly customized for UCSB
o How do we get these programs to
 communicate
 cooperate
with ESSW, without rewriting them?
[Figure: AVHRR SST processing chain: Receive → Ingest and Calibrate → Navigate (Manual/Automatic) → Sea Surface Temp (SST) → Rectify → SST Maps]
9
Lineage: Current Best Practice
10
Earth System Science Workbench (ESSW)
Producer and consumer issues can both be addressed by a laboratory metaphor
o Experiment
 Network of models
 … ingesting / synthesizing data
 … generating products
o Laboratory
 Experiment execution environment
– Computing + storage = accessibility + scalability
o Lab Notebook
 Persistent storage that can be queried
 Keeps track of all experiments
– Documentation + lineage = accountability
11
Wrap Your App: Scripts Talk to ESSW
o No changes, just additions
 Wrapper scripts
– Make program (groups) look like ESSW experiments
– use Perl API
 Lab Notebook daemon
– Accepts API commands
– Creates XML documents
– Sends to database
 ESSW database
– XML metadata & DTDs
– Tabular metadata: XML search terms, lineage links
[Figure: the SST processing chain (Receive, Ingest and Calibrate, Navigate (Manual/Automatic), Sea Surface Temp (SST), Rectify, SST Maps) connected to ESSW. Component labels: Perl; Perl API; XML + SQL; Lab Notebook daemon; Java; JDBC; MySQL; ESSW Database]
12
ESSW Metadata management
o Lab Notebook daemon verifies XML metadata document
o Experiment step metadata stored for product lineage tracking
o Complete metadata document stored in custom database table
 XML DTD ← 1:1 → database table
 (n+1)th column is the document itself
o Some metadata values extracted into database tables
 DTD contains column names and types for some elements
 Always save all the XML, even if you don’t know how to “columnize” all of it
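The storage rule above (extract what you can into columns, but always keep the whole document) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SQLite stands in for the MySQL database, a dict stands in for the column hints that ESSW kept in its DTDs, and the table, column, and function names are hypothetical. The sketch only checks that the XML is well-formed; it does not validate against a DTD.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical mapping from XML element paths to (column name, SQL type).
# In ESSW this information lived in the DTD; a dict stands in for it here.
COLUMNS = {
    "scene_id/satellite": ("satellite", "TEXT"),
    "scene_id/pass_date": ("pass_date", "TEXT"),
}


def create_table(conn, table):
    """Create one table per document type: n extracted columns + the document."""
    cols = ", ".join(f"{name} {sqltype}" for name, sqltype in COLUMNS.values())
    conn.execute(
        f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, {cols}, document TEXT)"
    )


def store(conn, table, xml_text):
    """Extract the known values, then save them alongside the complete XML."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    names = [name for name, _ in COLUMNS.values()]
    values = [root.findtext(path) for path in COLUMNS]
    placeholders = ", ".join("?" for _ in names)
    cur = conn.execute(
        f"INSERT INTO {table} ({', '.join(names)}, document) "
        f"VALUES ({placeholders}, ?)",
        values + [xml_text],
    )
    return cur.lastrowid
```

Elements with no entry in the mapping are never lost: they stay searchable later because the last column holds the unmodified document.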
13
Wrapper Example: Input Dataset
# SST experiment wrapper
# $L1B is the input Level 1B AVHRR image file
# $SST is the output SST image file

# run legacy command "nitpix": creates SST image from L1B image
$base_temp = 5.0;
$temp_step = 0.1;
...
system("nitpix base_temp=$base_temp temp_step=$temp_step ... $L1B $SST");

# start recording ESSW metadata
beginXMLBld($ENV{USER}, "PRODUCTION");

# get metadata for input file
$L1B_ID = findSciObjFromFile($L1B);

[Figure: AVHRR Level 1B product (avhrr_l1b) → Multi-channel sea surface temperature algorithm (avhrr_sstModel) → Sea surface temperature (SST) (avhrr_sst)]
14
Wrapper Example: Output Dataset
# create metadata for SST image
$SST_ID = createMetadata("avhrr_sst");
addValue($SST_ID, "avhrr_sst.scene_id.satellite", $satellite);
addValue($SST_ID, "avhrr_sst.scene_id.pass_date", $pass_date);
...
saveToDB($SST_ID, "avhrr_sst");
closeMetadata($SST_ID);
saveDigest($SST, $SST_ID);

[Figure: AVHRR Level 1B product (avhrr_l1b) → Multi-channel sea surface temperature algorithm (avhrr_sstModel) → Sea surface temperature (SST) (avhrr_sst)]
15
Wrapper Example: Process
# create metadata for SST experiment
$exp = createExperimentMetadata("avhrr_sstModel");
$exp_step = createExpStepMetadata($exp, "avhrr_sstExpStp");
addValue($exp_step, "avhrr_sstExpStp.base_temp", $base_temp);
addValue($exp_step, "avhrr_sstExpStp.temp_step", $temp_step);
...
saveToDB($exp_step, "avhrr_sstExpStp");
closeMetadata($exp_step);

# connect input and output images to experiment
registerExperimentInputs($exp, $L1B_ID);
registerExperimentOutputs($exp, $SST_ID);

# finish recording ESSW metadata
endXMLBld();

[Figure: AVHRR Level 1B product (avhrr_l1b) → Multi-channel sea surface temperature algorithm (avhrr_sstModel) → Sea surface temperature (SST) (avhrr_sst)]
16
Wrapper Example: Lineage Links
# connect input and output images to experiment
registerExperimentInputs($exp, $L1B_ID);
registerExperimentOutputs($exp, $SST_ID);

# finish recording ESSW metadata
endXMLBld();

[Figure: AVHRR Level 1B product (avhrr_l1b) → Multi-channel sea surface temperature algorithm (avhrr_sstModel) → Sea surface temperature (SST) (avhrr_sst)]
17
Process graph reconstructed
from ESSW database
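Reconstructing a process graph from recorded input and output links amounts to a graph walk. The Python sketch below is a minimal illustration, assuming the links have already been read out of the database as (experiment, role, object) rows. That row layout, the `rectify#1` step, and every identifier are hypothetical and do not reflect the actual ESSW schema.

```python
from collections import defaultdict

# Hypothetical rows, shaped like what registerExperimentInputs/Outputs record:
# (experiment id, role, science object id).
LINKS = [
    ("avhrr_sstModel#1", "input", "avhrr_l1b#7"),
    ("avhrr_sstModel#1", "output", "avhrr_sst#3"),
    ("rectify#1", "input", "avhrr_sst#3"),
    ("rectify#1", "output", "sst_map#9"),
]


def build_graph(links):
    """Return a dict mapping each node to the set of nodes it was derived from."""
    inputs, outputs = defaultdict(set), defaultdict(set)
    for experiment, role, obj in links:
        (inputs if role == "input" else outputs)[experiment].add(obj)
    parents = defaultdict(set)
    for experiment, outs in outputs.items():
        for out in outs:
            parents[out].add(experiment)           # product <- experiment
        parents[experiment] |= inputs[experiment]  # experiment <- its inputs
    return parents


def lineage(parents, node):
    """Collect every ancestor of node by walking the derivation links."""
    seen, stack = set(), [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

Calling `lineage(build_graph(LINKS), "sst_map#9")` walks back through both experiments to the Level 1B input, which is the question a consumer of the final product would ask.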
18
ESSW Lessons
o Providers are customers
 ESIPs aren’t much good unless scientists are happy to put information in them
o A light touch is the right touch
 Wrapping is easier for scientists and their programmers to deal with than complete re-engineering
o Scientists do write scripts, but not necessarily Perl
 Scripting (gluing stuff together) comes naturally to scientists
o Scientists don’t write DTDs
o Nobody calls metadata APIs
ESSW was automatic, but not automatic enough…
19
Lineage Capture, Take 2
The ES3 experience
20
ES3: Earth System Science Server
ESSW++ data lineage tracking
[Figure: ES3 system diagram. Labels: MODster; OpenDAP; MODIS; AVHRR; Microsoft TerraServer; Alexandria Digital Library; Corona; Watershed-scale snow product; Global-scale snow product; BUB data storage; ROCKS processing clusters]
21
From ESSW to ES3: Summary
o Perl wrappers → “Probulators”
o Perl API → web services + XML messages
o MySQL → XML database(s)
22
From Wrappers to Probulators
Wrappers: Active Lineage
o +
 Complete control over what gets recorded
 Single language/API for all wrapped events
 Not tied to execution
– You can even lie about what happened
o –
 Must explicitly script everything
 Scripts can drift from reality
– You can even lie about what happened
23
From Wrappers to Probulators
Probulators: Passive Lineage
o +
 Record what actually happened
– Not just what you think happened
– Not what didn’t happen
 Automatic: don’t have to write new scripts for everything
o –
 Different flavors for different environments
– Can’t just do everything in Perl…
24
Probulator patterns
o Instrumentation
 Insert lineage capture instructions directly into science codes
– e.g. “I just created file ‘foo’”
 Typical implementation: preprocessor/precompiler
o Overriding
 Replace standard routines/libraries with lineage-capturing versions
– e.g. open(…) → snoopy_open(…)
 Typical implementation: modify execution environment
– environment variables
– configuration files
o Passive monitoring
 Trace program execution
– e.g. “called open() with args foo, bar, …”
 Typical implementation: strace’d shell
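As an illustration of the overriding pattern, here is a minimal Python sketch that swaps the built-in `open()` for a version that records each successful call. It is not how ES3 does it: the `snoopy` context manager and the shape of the log records are assumptions made for this example, and it only sees code that goes through the built-in `open()` (calls such as `os.open()` bypass it).

```python
import builtins
import contextlib


@contextlib.contextmanager
def snoopy(log):
    """Temporarily replace the built-in open() with a lineage-capturing version."""
    real_open = builtins.open

    def snoopy_open(file, mode="r", *args, **kwargs):
        handle = real_open(file, mode, *args, **kwargs)
        # Record only after the call succeeds: log what actually happened.
        log.append({"routine": "open", "file": str(file), "mode": mode})
        return handle

    builtins.open = snoopy_open
    try:
        yield log
    finally:
        builtins.open = real_open  # always restore the original routine
```

The science code inside the `with snoopy(events):` block needs no changes, which is the point of the pattern: the capture lives in the execution environment rather than in the program.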
25
ES3 Lineage Architecture
[Figure: architecture diagram. Components: probulator 1 … probulator n; logger; log files; transmitter; ES3 core]
26
Probulating IDL: Instrumenting the code
;edit
pro modscag_cleanse,prefix=prefix,ns=ns,nl=nl
HELP, NAMES="*", OUTPUT=ES3_ENVIROMENT & ES3_LOG, $
  ENTER="modscag_cleanse", ENVIROMENT=ES3_ENVIROMENT
;
; clean up {under,over}flow of MODSCAG run
;
; Input:
;   prefix = prefix for all of the MODSCAG output filenames
;   ns = number of samples
;   nl = number of lines
; Output:
;   rewrite of the MODSCAG files
;
; t.h.painter / 1.19.2005

; open snow file
ES3_openr,1,string(prefix,'snow.pic')
snow=fltarr(ns,nl)
readu,1,snow
[ blah blah blah ]
HELP, NAMES="*", OUTPUT=ES3_ENVIROMENT & ES3_LOG, LEAVE="modscag_cleanse", $
  ENVIROMENT=ES3_ENVIROMENT
END ; modscag_cleanse
27
Probulating IDL: Results
<init time="20050522T234606Z"
      pid="31002" stime="20050522T234604Z" pstime="20050522T234256Z" ppid="30920"
      language="idl" user="haavar" hostname="spitting-duck.bren.ucsb.edu">
  <enviroment>
    <variable name="!PATH" value="/home/haavar/probulator//idl:
/home/rsi/idl_6.1/lib/hook:
[…]
  </enviroment>
  <mount-points>
    <mount share="dab15:/ed15/rsi" type="nfs">/home/rsi</mount>
  </mount-points>
</init>
<enter region="modscag_cleanse">
  <enviroment>
    <variable type="INT" name="NL" value="2"/>
    <variable type="INT" name="NS" value="2"/>
    […]
  </enviroment>
</enter>
<exec time="20050522T234610Z" routine="OPENR">
  <io>
    <file read="true">/home/haavar/painter/data/tillsnow.pic</file>
  </io>
</exec>
28
Probulating bash: Passive Monitoring
cat /etc/passwd | grep haavar | sed -n 's/\(.*:\)\{2\}\([0-9]\+\).*/\2/p'

25232 1138336174.480079 open("/etc/ld.so.cache", O_RDONLY) = 3
25232 1138336174.480215 open("/lib/libm.so.6", O_RDONLY) = 3
[…]
25234 1138336178.887267 dup2(3, 255) = 255
25234 1138336178.887912 pipe([3, 4]) = 0
25234 1138336178.888257 clone(child_stack=0, […], child_tidptr=0xb7f2e708) = 25235
25235 1138336178.889366 dup2(4, 1) = 1
25235 1138336178.889975 pipe([3, 4]) = 0
25235 1138336178.890326 clone(child_stack=0, […], child_tidptr=0xb7f2e708) = 25236
25235 1138336178.891260 pipe([4, 5]) = 0
25235 1138336178.891756 clone(child_stack=0, […], child_tidptr=0xb7f2e708) = 25237
25235 1138336178.892753 clone(child_stack=0, […], child_tidptr=0xb7f2e708) = 25238
25238 1138336178.894266 dup2(4, 0) = 0
25236 1138336178.894726 dup2(4, 1) = 1
25237 1138336178.894763 dup2(3, 0) = 0
25237 1138336178.895581 dup2(5, 1) = 1
[…]
25238 1138336178.897006 execve("/bin/sed", ["sed", "-n", "s/\\(.*:\\)\\{2\\}\\([0-9]\\+\\).*/\\2/p"], ["HOSTNAME=rubber-duck.bren.ucsb.edu", "TERM=xterm-color", […]
25236 1138336178.900117 execve("/bin/cat", ["cat", "/etc/passwd"], […]
25237 1138336178.903342 execve("/bin/grep", ["grep", "haavar"], […]
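Turning a trace like this into lineage events is mostly line parsing. The Python sketch below assumes each complete trace line has the form `PID TIMESTAMP syscall(args) = result` (the layout produced by `strace -f -ttt`), and it skips lines that are truncated or elided, as the `execve` lines on the slide are. The function names and the event dictionary layout are hypothetical, not the ES3 probulator's.

```python
import re

# pid, epoch timestamp, syscall name, raw argument text, return value
LINE = re.compile(r"^(\d+)\s+(\d+\.\d+)\s+(\w+)\((.*)\)\s+=\s+(-?\d+)")


def parse(trace_text):
    """Turn strace-style lines into dicts; lines that do not match are skipped."""
    events = []
    for line in trace_text.splitlines():
        match = LINE.match(line.strip())
        if match:
            pid, stamp, call, args, result = match.groups()
            events.append({"pid": int(pid), "time": float(stamp), "call": call,
                           "args": args, "result": int(result)})
    return events


def files_opened(events):
    """Map each pid to the files it successfully opened."""
    opened = {}
    for event in events:
        if event["call"] == "open" and event["result"] >= 0:
            path = re.match(r'"([^"]*)"', event["args"])
            if path:
                opened.setdefault(event["pid"], []).append(path.group(1))
    return opened
```

Following `clone` results and `dup2` calls in the same event list is what would let a monitor attribute pipes to the right child process; that step is left out here to keep the sketch short.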
29
Probulating bash: Results
[… <init> same as IDL …]
<exec time="20060027T042938.900117Z" routine="/bin/cat" pid="25236" ppid="25235">
<arguments>
<argument>/etc/passwd</argument>
</arguments>
<io>
<pipe read="true" id="std-in"/>
<pipe write="true" id="3"/>
<pipe write="true" id="std-err"/>
<file read="true">/etc/ld.so.cache</file>
[…]
<file read="true">/etc/passwd</file>
</io>
</exec>
<exec time="20060027T042938.903342Z" routine="/bin/grep" pid="25237" ppid="25235">
<arguments>
<argument>haavar</argument>
</arguments>
<io>
<pipe read="true" id="3"/>
<pipe write="true" id="4"/>
[…]
</io>
</exec>
30
Now What?
o Probulator reports not universally unique
 Q: How to hook separate reports together?
 A: Logger assigns UUIDs to
– Data streams
– Processes
– Jobs (workflows)
o Lineage not explicit
 Q: How to publish lineage?
 A: ES3 Core builds serialized graph
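The two answers above can be pictured together in a short Python sketch: give every job, process, and data stream a UUID, then emit the links between them as a serialized graph. Everything here is an assumption made for illustration, including the report fields (`pid`, `routine`, `reads`, `writes`), the function name, and the choice of JSON as the serialization; the slides do not specify ES3's actual formats.

```python
import json
import uuid


def serialize_lineage(job_name, reports):
    """Assign UUIDs to the job, its processes, and its data streams, then
    return the lineage graph as a JSON string."""
    ids = {}

    def uid(kind, name):
        # The same (kind, name) pair always maps to the same UUID within a job.
        key = (kind, name)
        if key not in ids:
            ids[key] = str(uuid.uuid4())
        return ids[key]

    job = uid("job", job_name)
    nodes, edges = [{"id": job, "kind": "job", "name": job_name}], []
    for report in reports:
        proc = uid("process", report["pid"])
        nodes.append({"id": proc, "kind": "process",
                      "name": report["routine"], "job": job})
        for direction in ("reads", "writes"):
            for stream in report.get(direction, []):
                if ("data", stream) not in ids:
                    nodes.append({"id": uid("data", stream),
                                  "kind": "data", "name": stream})
                data = uid("data", stream)
                # Edges point in the direction the data flows.
                src, dst = (data, proc) if direction == "reads" else (proc, data)
                edges.append({"from": src, "to": dst})
    return json.dumps({"nodes": nodes, "edges": edges}, indent=2)
```

Because a pipe written by one process and read by another resolves to the same UUID, the two separate reports end up joined through a shared data node.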
31
Thanks to:
Current
o Mike Colee
o Stephane Maritorena
o Dominic Metzger
o Karl Rittger
o Dave Siegel
Former
o Anurag Acharya
o Rajendra Bose
o Scott Denning
o Debbie Donahue
o Jim Duff
o Calin Duma
o Erik Fields
o Jim Gray
o Steve Miley
o Jordan Morris
o Mark Pelletier
o Pete Peterson
o Walter Rosenthal
o Klaus Schauser
o Håvar Valeur
32
To Probulate Further…
http://www.snow.ucsb.edu : Publications
o Bose, R. and Frew, J., 2005. Lineage retrieval for scientific data processing: a survey. ACM Computing Surveys, vol. 37, no. 1, pp. 1-28.
 doi:10.1145/1057977.1057978
o Dozier, J. and Painter, T.H., 2004. Multispectral and hyperspectral remote sensing of alpine snow properties. Annual Review of Earth and Planetary Sciences, vol. 32, pp. 465-494.
 doi:10.1146/annurev.earth.32.101802.120404
o Molotch, N.P., Painter, T.H., Bales, R.C., and Dozier, J., 2004. Incorporating remotely sensed snow albedo into spatially distributed snowmelt modeling. Geophysical Research Letters, vol. 31, L03501.
 doi:10.1029/2003GL019063
o Frew, J. and Bose, R., 2001. Earth System Science Workbench: a data management infrastructure for Earth science products. In: Kerschberg, L. and Kafatos, M. (eds.), Proceedings, 13th International Conference on Scientific and Statistical Database Management (SSDBM 2001), pp. 180-189.
 doi:10.1109/SSDM.2001.938550
33