The Semantics of Quality, Uncertainty and Bias Representations of NASA Atmospheric Remote Sensing Data and Information Products on the Web

Peter Fox 1, Gregory Leptoukh 2, Stephan Zednik 1, Chris Lynnes 2
1. Tetherless World Constellation, Rensselaer Polytechnic Institute
2. NASA Goddard Space Flight Center, Greenbelt, MD, United States
Webs of data
• Early Web – Web of pages: http://www.ted.com/index.php/talks/tim_berners_lee_on_the_next_web.html
• Semantic web started as a way to facilitate "machine accessible content"
  – Initially it was available only to those familiar with the languages and tools, e.g. your parents could not use it
• Webs of data grew out of this
  – One specific example is W3C's Linked Open Data
Semantic Web
• http://www.w3.org/2001/sw/
• "The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (RDF). See also the separate FAQ for further information."
Linked open data
• http://linkeddata.org/guides-and-tutorials
• http://tomheath.com/slides/2009-02-austin-linkeddata-tutorial.pdf
• And of course: http://logd.tw.rpi.edu/
2009-03-05 (Chris Bizer)

September 2011
"Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/"
Deep web
• Data behind web services
• Data behind query interfaces (databases or files)
Data on the internet
• http://www.dataspaceweb.org/
• http://mp-datamatters.blogspot.com/
• Data files on other protocols
  – FTP
  – RFTP
  – GridFTP
  – SABUL
  – XMPP/AMQP
  – Others…
Acronyms
AOD - Aerosol Optical Depth
MDSA - Multi-sensor Data Synergy Advisor
MISR - Multi-angle Imaging Spectro-Radiometer
MODIS - Moderate Resolution Imaging Spectro-radiometer
OWL - Web Ontology Language
PML - Proof Markup Language
REST - Representational State Transfer
UTC - Coordinated Universal Time
XML - eXtensible Markup Language
XSL - eXtensible Stylesheet Language
XSLT - XSL Transformation
Where are we with respect to the data challenge?

"The user cannot find the data; if he can find it, cannot access it; if he can access it, he doesn't know how good they are; if he finds them good, he cannot merge them with other data."

The Users View of IT, NAS 1989
Giovanni: Earth Science Data Visualization & Analysis Tool

• Developed and hosted by NASA Goddard Space Flight Center (GSFC)
• Multi-sensor and model data analysis and visualization online tool
• Supports dozens of visualization types
• Generates dataset comparisons
• ~1500 parameters
• Used by modelers, researchers, policy makers, students, teachers, etc.
The Old Way (months of pre-science effort):
Find data → Retrieve high-volume data → Learn formats and develop readers → Extract parameters → Perform spatial and other subsetting → Identify quality and other flags and constraints → Perform filtering/masking → Develop analysis and visualization → Accept/discard/get more data (satellite, model, ground-based) → Exploration and initial analysis → Use the best data for the final analysis → Derive conclusions → Write the paper → Submit the paper → DO SCIENCE

The Giovanni Way (minutes for services, days for exploration):
Web-based services read data, extract parameters, filter by quality, subset spatially, reformat, and reproject → Visualize, explore, analyze → Use the best data for the final analysis → Derive conclusions → Write the paper → Submit the paper → DO SCIENCE

Giovanni allows scientists to concentrate on the science. Web-based tools like Giovanni compress the time needed for pre-science preliminary tasks: data discovery, access, manipulation, visualization, and basic statistical analysis. Scientists have more time to do science!
Data Usage Workflow
Data Discovery → Assessment → Access → Manipulation → Visualization → Analyze
Data Usage Workflow
Data Discovery → Assessment → Access → Manipulation → Visualization → Analyze
Manipulation: Subset/Constrain, Reformat, Re-project, Filtering, Integration
Data Usage Workflow
Data Discovery → Assessment → Access → Manipulation → Visualization → Analyze
Assessment: Intended Use, Precision Requirements, Quality Assessment Requirements, Integration Planning
Manipulation: Subset/Constrain, Reformat, Re-project, Filtering, Integration
Challenge
• Giovanni streamlines data processing, performing required actions on behalf of the user
  – but automation amplifies the potential for users to generate and use results they do not fully understand
• The assessment stage is integral for the user to understand fitness-for-use of the result
  – but Giovanni did not assist in assessment
• We were challenged to instrument the system to help users understand results
Producers: Quality Control, Fitness for Purpose (Trustee)
Consumers: Quality Assessment, Fitness for Use (Trustor)
Definitions – for an atmospheric scientist
• Quality – is in the eye of the beholder – worst-case scenario… or a good challenge
• Uncertainty – has aspects of accuracy (how accurately the real-world situation is assessed; it also includes bias) and precision (down to how many digits)
Quality Control vs. Quality Assessment
• Quality Control (QC) flags in the data (assigned by the algorithm) reflect “happiness” of the retrieval algorithm, e.g., all the necessary channels indeed had data, not too many clouds, the algorithm has converged to a solution, etc.
• Quality assessment is done by analyzing the data "after the fact" through validation, intercomparison with other measurements, self-consistency, etc. It is presented as bias and uncertainty. It is rather inconsistent and is scattered across papers and validation reports.
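To make the bias/uncertainty distinction concrete, here is a minimal sketch (not MDSA or DQSS code) of how both quantities emerge from an after-the-fact validation: bias as the mean error of retrievals against a collocated reference measurement, uncertainty as the RMSE. All numbers are illustrative.

```python
# Sketch: deriving bias and uncertainty from a validation match-up.
# 'retrieved' are satellite retrievals (e.g. AOD); 'reference' are
# collocated ground-truth values (e.g. from a sun photometer network).

def assess(retrieved, reference):
    """Return (bias, rmse) of retrieved values against a reference."""
    n = len(retrieved)
    errors = [r - t for r, t in zip(retrieved, reference)]
    bias = sum(errors) / n                             # mean systematic offset
    rmse = (sum(e * e for e in errors) / n) ** 0.5     # spread ("uncertainty")
    return bias, rmse

bias, rmse = assess([0.15, 0.32, 0.51], [0.10, 0.30, 0.50])
```

A positive bias here means the retrieval systematically over-estimates the reference; RMSE folds in both the bias and the random scatter.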
Definitions – for an atmospheric scientist
• Bias has two aspects: – Systematic error resulting in the distortion of measurement data caused by prejudice or faulty measurement technique – A vested interest, or strongly held paradigm or condition that may skew the results of sampling, measuring, or reporting the findings of a quality assessment: • Psychological: for example, when data providers audit their own data, they usually have a bias to overstate its quality.
• Sampling: sampling procedures that result in a sample that is not truly representative of the population sampled. (Larry English)
Data quality needs: fitness for use
• Measuring Climate Change:
  – Model validation: gridded contiguous data with uncertainties
  – Long-term time series: bias assessment is a must, especially for sensor degradation, orbit and spatial sampling change
• Studying phenomena using multi-sensor data:
  – Cross-sensor bias is needed
• Realizing Societal Benefits through Applications:
  – Near-real-time for transport/event monitoring: in some cases, coverage and timeliness might be more important than accuracy
  – Pollution monitoring (e.g., air quality exceedance levels): accuracy
  – Educational (users generally not well-versed in the intricacies of quality; just taking all the data as usable can impair educational lessons): only the best products
Level 2 data

• Swath for MISR, orbit 192 (2001)

Level 3 data
MODIS vs. MERIS

Same parameter, same space & time – different results. Why?
A threshold used in MERIS processing effectively excludes high aerosol values.
Note: MERIS was designed primarily as an ocean-color instrument, so aerosols are “obstacles” not signal.
Spatial and temporal sampling – how to quantify it to make it useful for modelers?

MODIS Aqua AOD, July 2009 vs. MISR Terra AOD, July 2009

• Completeness: the MODIS dark-target algorithm does not work for deserts
• Representativeness: monthly aggregation is not enough for MISR, and even for MODIS
• Spatial sampling patterns differ between MODIS Aqua and MISR Terra: "pulsating" areas over ocean are oriented differently due to the different orbital direction during daytime measurement
Cognitive bias
Three projects with a data quality flavor

• Multi-sensor Data Synergy Advisor
  – Product-level quality: how closely the data represent the actual geophysical state
• Data Quality Screening Service
  – Pixel-level quality: algorithmic guess at the usability of a data point
  – Granule-level quality: statistical roll-up of pixel-level quality
• Aerosol Statistics
  – Record-level quality: how consistent and reliable the data record is across generations of measurements
Multi-Sensor Data Synergy Advisor (MDSA)
• Goal: provide science users with clear, cogent information on salient differences between data candidates for fusion, merging and intercomparison
  – Enable scientifically and statistically valid conclusions
• Develop MDSA on current missions:
  – NASA – Terra, Aqua, (maybe Aura)
• Define implications for future missions
How does MDSA work?
MDSA is a service designed to characterize the differences between two datasets and advise a user (human or machine) on the advisability of combining them.
• Works within the Giovanni online analysis tool
• Describes parameters and products
• Documents the steps leading to the final data product
• Enables better interpretation and utilization of parameter difference and correlation visualizations
• Provides clear and cogent information on salient differences between data candidates for intercomparison and fusion
• Provides information on data quality
• Provides advice on available options for further data processing and analysis
Correlation – same instrument, different satellites
Anomaly
The MODIS Level 3 data day definition leads to an artifact in correlation…

…which is caused by an overpass time difference.
Effect of the Data Day definition on Ocean Color data correlation with Aerosol data
Only half of the Data Day artifact is present because the Ocean group uses the better Data Day definition!

Correlation between MODIS-Aqua AOD (Ocean group product) and MODIS-Aqua AOD (Atmosphere group product); pixel count distribution.
Research approach
• Systematizing quality aspects
  – Working through the literature
  – Identifying aspects of quality and their dependence on measurement and environmental conditions
  – Developing data quality ontologies
  – Understanding and collecting internal and external provenance
• Developing rulesets that allow inferring which pieces of knowledge to extract and assemble
• Presenting the data quality knowledge with good visuals, statements and references
Semantic Web Basics
• The triple: { subject predicate object }
  – Interferometer is-a optical-instrument
  – Optical-instrument has focal-length
• W3C is the primary (but not sole) governing organization for the languages
  – RDF: programming environments for 14+ languages, including C, C++, Python, Java, JavaScript, Ruby, PHP, … (no Cobol or Ada yet)
  – OWL 1.0 and 2.0 – Web Ontology Language – programming support for Java
• Query, rules, inference…
• Closed World – where complete knowledge is known (encoded); AI relied on this, and the Semantic Web promotes this
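The slide's triples can be shown as simple (subject, predicate, object) tuples. A real Semantic Web stack would use RDF (e.g. via rdflib and SPARQL), but plain tuples are enough to illustrate the data model; all names here are illustrative.

```python
# The slide's triples as plain (subject, predicate, object) tuples.

triples = {
    ("Interferometer", "is-a", "OpticalInstrument"),
    ("OpticalInstrument", "has", "FocalLength"),
}

def objects(subject, predicate, store):
    """All objects matching a (subject, predicate, ?) pattern."""
    return {o for s, p, o in store if s == subject and p == predicate}

kinds = objects("Interferometer", "is-a", triples)
```

Pattern matching over triples like this is the essence of SPARQL basic graph patterns; everything else (OWL reasoning, rules) builds on the same statement model.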
Ontology Spectrum
Catalog/ID → Terms/glossary → Thesauri ("narrower term" relation) → Informal is-a → Formal is-a → Formal instance → Frames (properties) → Value Restrictions → Selected Logical Constraints (disjointness, inverse, …) → General Logical Constraints

Originally from the AAAI 1999 Ontologies Panel by Gruninger, Lehmann, McGuinness, Uschold, Welty; updated by McGuinness.
Description in: www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-abstract.html
Semantic Web Layers
http://www.w3.org/2003/Talks/1023-iswc-tbl/slide26-0.html
http://flickr.com/photos/pshab/291147522/
Working with knowledge: trade-offs among expressivity, implementability, rule execution, query, inference, and maintainability/extensibility.
Model for Quality Evidence
Data Quality Ontology Development (Quality flag)
Working together with Chris Lynnes's DQSS project, we started from the pixel-level quality view.
Data Quality Ontology Development (Bias)
http://cmapspublic3.ihmc.us:80/servlet/SBReadResourceServlet?rid=1286316097170_183793435_22228&partName=htmltext
Modeling quality (Uncertainty)
Link to other cmap presentations of the quality ontology:
http://cmapspublic3.ihmc.us:80/servlet/SBReadResourceServlet?rid=1299017667444_1897825847_19570&partName=htmltext
MDSA Aerosol Data Ontology Example
Ontology of aerosol data made with the cmap ontology editor
http://tw.rpi.edu/web/project/MDSA/DQ-ISO_mapping
Multi-Domain Knowledgebase
Provenance Domain, Data Processing Domain, Earth Science Domain
RuleSet Development
[DiffNEQCT:
  (?s rdf:type gio:RequestedService), (?s gio:input ?a),
  (?a rdf:type gio:DataSelection), (?s gio:input ?b),
  (?b rdf:type gio:DataSelection),
  (?a gio:sourceDataset ?a.ds), (?b gio:sourceDataset ?b.ds),
  (?a.ds gio:fromDeployment ?a.dply), (?b.ds gio:fromDeployment ?b.dply),
  (?a.dply rdf:type gio:SunSynchronousOrbitalDeployment),
  (?b.dply rdf:type gio:SunSynchronousOrbitalDeployment),
  (?a.dply gio:hasNominalEquatorialCrossingTime ?a.neqct),
  (?b.dply gio:hasNominalEquatorialCrossingTime ?b.neqct),
  notEqual(?a.neqct, ?b.neqct)
  ->
  (?s gio:issueAdvisory giodata:DifferentNEQCTAdvisory)
]
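The same logic the DiffNEQCT rule expresses can be sketched in plain Python: if two selected datasets come from sun-synchronous deployments whose nominal equatorial crossing times (NEQCT) differ, issue an advisory. The platform names and times below are illustrative, not Giovanni's actual metadata.

```python
# Plain-Python sketch of the DiffNEQCT rule above.

def neqct_advisories(selections):
    """selections: list of (platform, neqct) pairs from one request."""
    advisories = []
    for i, (p_a, t_a) in enumerate(selections):
        for p_b, t_b in selections[i + 1:]:
            if t_a != t_b:  # the notEqual(?a.neqct, ?b.neqct) clause
                advisories.append(
                    f"DifferentNEQCTAdvisory: {p_a} ({t_a}) vs {p_b} ({t_b})")
    return advisories

adv = neqct_advisories([("Terra", "10:30"), ("Aqua", "13:30")])
```

The advantage of the rule-based version is that the same knowledge base can fire many such advisories declaratively, without hand-coding each pairwise check.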
Advisor Knowledge Base
Advisor rules test for potential anomalies and create associations between service metadata and anomaly metadata in the Advisor KB.
Assisting in Assessment
Data Discovery → Assessment → Access → Manipulation → Visualization → Analyze → Re-Assessment
Assessment: Intended Use, Precision Requirements, Quality Assessment Requirements, Integration Planning; supported by Provenance & Lineage Visualization and the MDSA Advisory Report
Manipulation: Subset/Constrain, Reformat, Re-project, Filtering, Integration
Thus - Multi-Sensor Data Synergy Advisor
• Assemble semantic knowledge base
  – Giovanni Service Selections
  – Data Source Provenance (external provenance – low detail)
  – Giovanni Planned Operations (what the service intends to do)
• Analyze service plan
  – Are we integrating/comparing/synthesizing?
    • Are similar dimensions in data sources semantically comparable? (semantic diff)
    • How comparable? (semantic distance)
  – What data usage caveats exist for data sources?
• Advise on caveats regarding general fitness-for-use and data usage
Semantic Advisor Architecture (RPI)

… complexity
Presenting data quality to users
• Global or product-level quality information, e.g. consistency, completeness, etc., that can be presented in tabular form.
• Regional/seasonal. This is where we have tried various approaches:
  – maps with outlined regions, one map per sensor/parameter/season
  – scatter plots with error estimates, one per combination of AERONET station, parameter, and season, with different colors representing different wavelengths, etc.
Advisor Presentation Requirements
• Present metadata that can affect fitness-for-use of the result
• When comparing or integrating data sources:
  – Make obvious which properties are comparable
  – Highlight differences (that affect comparability) where present
• Present descriptive text (and if possible visuals) for any data usage caveats highlighted by the expert ruleset
• Presentation must be understandable by Earth scientists!! Oh, you laugh…
Advisory Report

• Tabular representation of the semantic equivalence of comparable data source and processing properties
• Comparable input parameters and their semantic equivalence
• Advises of and describes potential data anomalies/bias (Expert Advisories)

Advisory Report (Dimension Comparison Detail): comparable input parameters and their semantic equivalence

Advisory Report (Expert Advisories Detail): expert advisories
Quality Comparison Table for Level 3 AOD (Global example)

Quality Aspect | MODIS | MISR
Platform | Terra, Aqua | Terra
Time Range | Terra: 2/2/2000-present; Aqua: 7/2/2002-present | 2/2/2000-present
Local Revisit Time | Terra: 10:30 AM; Aqua: 1:30 PM | Terra: 10:30 AM
Completeness (Revisit) | Global coverage of the entire Earth in 1 day; coverage overlap near the poles | Global coverage of the entire Earth in 9 days; coverage in 2 days in the polar regions
Swath Width | 2330 km | 380 km
Spectral AOD | AOD over ocean for 7 wavelengths (466, 553, 660, 860, 1240, 1640, 2120 nm); AOD over land for 4 wavelengths (466, 553, 660, 2120 nm) | AOD over land and ocean for 4 wavelengths (446, 558, 672, and 866 nm)
AOD Uncertainty or Expected Error (EE) | ±0.03 ± 5% (over ocean; QAC >= 1); ±0.05 ± 20% (over land; QAC = 3) | 63% fall within 0.05 or 20% of AERONET AOD; 40% are within 0.03 or 10%
Successful Retrievals | 15% of the time | 15% of the time (slightly more because of retrieval over the glint region also)
What they really like!
Summary
• Quality is very hard to characterize; different groups will focus on different and inconsistent measures of quality
  – Modern ontology representations and reasoning to the rescue!
• Products with known quality (whether good or bad) are more valuable than products with unknown quality
  – Known quality helps you correctly assess fitness-for-use
• Harmonization of data quality is even more difficult than characterizing the quality of a single data product
Summary
• The Advisory Report is not a replacement for proper analysis planning
  – But it benefits all user types by summarizing general fitness-for-use, integrability, and data usage caveat information
  – Science user feedback has been very positive
• Provenance trace dumps are difficult to read, especially for non-software engineers
  – Science user feedback: "Too much information in the provenance lineage; I need a simplified abstraction/view"
• Transparency vs. translucency – make the important stuff stand out
Current Work
• Advisor suggestions to correct for potential anomalies
• Views/abstractions of provenance based on specific user group requirements
• Continued iteration on visualization tools based on user requirements
• Present a comparability index; research techniques to quantify comparability
Data Quality Screening Service for Remote Sensing Data
The DQSS filters out bad pixels for the user.

• Default user scenario
  – Search for data
  – Select the science team recommendation for quality screening (filtering)
  – Download screened data
• More advanced scenario
  – Search for data
  – Select custom quality screening parameters
  – Download screened data
The quality of data can vary considerably
AIRS Parameter | Best (%) | Good (%) | Do Not Use (%)
Total Precipitable Water | 38 | 38 | 24
Carbon Monoxide | 64 | 7 | 29
Surface Temperature | 5 | 44 | 51
Version 5 Level 2 Standard Retrieval Statistics
The percent of biased data in MODIS aerosols over land increases as the confidence flag decreases (Very Good → Good → Marginal). Categories: Compliant* / Biased Low / Biased High / Bad. (*Compliant data are within ±(0.05 + 0.2·AOD) of AERONET.)
Aeronet Statistics from Hyer, E., J. Reid, and J. Zhang, 2010, An over-land aerosol optical depth data set for data assimilation by filtering, correction, and aggregation of MODIS Collection 5 optical depth retrievals, Atmos. Meas. Tech. Discuss., 3, 4091 –4167.
The effect of bad-quality data is often not negligible.

Hurricane Ike, 9/10/2008: Total Column Precipitable Water (kg/m²), quality levels Best / Good / Do Not Use.
DQSS replaces bad-quality pixels with fill values
Original data array (total column precipitable water) + mask based on user criteria (quality level < 2) → good-quality data pixels retained. The output file has the same format and structure as the input file (except for extra mask and original_data fields).
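The screening step above can be sketched in a few lines: pixels whose quality level fails the user's criterion are replaced with a fill value, so the output keeps the same shape and layout as the input. The fill value and sample numbers are illustrative, not the DQSS defaults.

```python
# Sketch of the DQSS screening step on a 1-D array of pixels.

FILL = -9999.0  # illustrative fill value

def screen(data, quality, max_level=1):
    """Keep pixels with quality <= max_level; replace the rest with FILL."""
    return [v if q <= max_level else FILL
            for v, q in zip(data, quality)]

# quality levels: 0 = best, 1 = good, 2 = do not use
out = screen([23.1, 40.7, 18.9], [0, 2, 1])
```

Because the fill value merely replaces bad pixels in place, downstream readers see a file with the original structure and can still average or plot it without special handling.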
AeroStat?

• Different papers provide different views on whether MODIS and MISR measure aerosols well.
• Peer-reviewed papers usually are well behind the latest version of the data.
• It is difficult to verify the results of a published paper and resolve controversies between different groups, as it is difficult to reproduce the results – they might have dealt with different data, or used different quality controls or flags.
• It is important to have an online shareable environment where data processing and analysis can be done in a transparent way by any user, and can be shared amongst all the members of the aerosol community.
AeroStat: Online Platform for the Statistical Intercomparison of Aerosols

Explore & visualize Level 3 → Compare Level 3 → (Level 3 data are too aggregated) → Switch to high-res Level 2 → Explore & visualize Level 2 → Correct Level 2 → Compare Level 2 before and after → Merge Level 2 to a new Level 3
Monthly AOD standard deviation

Areas with high AOD standard deviation might point to areas of high uncertainty. The next slide shows these areas on a map of mean AOD.
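As a minimal illustration of the statistic behind that map, here is the per-grid-cell standard deviation over a month of daily AOD values; a large value flags a cell worth inspecting. The daily values below are made up for the example.

```python
# Sketch: standard deviation of daily AOD in one grid cell over a month,
# used as a rough proxy for uncertainty.

def stdev(values):
    """Population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    return (sum((v - mean) ** 2 for v in values) / n) ** 0.5

daily_aod = [0.20, 0.24, 0.22, 0.90]  # one cell, four days (illustrative)
sigma = stdev(daily_aod)              # a large sigma flags a suspect cell
```

Applied cell by cell over the gridded monthly record, this yields exactly the kind of standard-deviation map the slide describes.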
Data Quality Issues
• Validation of aerosol data shows that not all data pixels labeled as "bad" are actually bad when looked at from a bias perspective.
• But many pixels are biased differently due to various reasons. (From Levy et al., 2009)
Types of Bias Correction

Type of Correction | Spatial Basis | Temporal Basis | Pros | Cons
Relative (cross-sensor) linear climatological | Region | Season | Not influenced by data in other regions, good sampling | Difficult to validate
Relative (cross-sensor) non-linear climatological | Global | Full data record | Complete sampling | Difficult to validate
Anchored parameterized linear | Near AERONET stations | Full data record | Can be validated | Limited areal sampling
Anchored parameterized non-linear | Near AERONET stations | Full data record | Can be validated | Limited insight into correction
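As an illustration of the "anchored parameterized linear" row, here is a sketch of fitting a line from a sensor's AOD to a reference (e.g. AERONET near anchor stations) by ordinary least squares, then applying it over the full record. All match-up numbers are illustrative.

```python
# Sketch: anchored linear bias correction. Fit y = slope*x + intercept
# from sensor AOD (x) to reference AOD (y) at anchor sites, then apply
# the fitted line to correct the sensor's full record.

def fit_line(x, y):
    """Ordinary least-squares slope and intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    slope = num / den
    return slope, my - slope * mx

slope, intercept = fit_line([0.1, 0.3, 0.5], [0.08, 0.31, 0.54])
corrected = [slope * v + intercept for v in [0.2, 0.4]]
```

The "limited areal sampling" con in the table is visible here: the fit is only constrained where anchor-station match-ups exist, yet it gets applied everywhere.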
Quality & bias assessment using FreeMind

FreeMind allows capturing the relations between various aspects of aerosol measurements, algorithms, conditions, validation, etc. "Traditional" worksheets do not support the complex, multi-dimensional nature of the task. (From the Aerosol Parameter Ontology.)
AeroStat Ontology
Title: MODIS Terra C5 AOD vs. AERONET during Aug-Oct biomass burning in Central Brazil, South America

Collection 5 MODIS AOD at 550 nm during Aug-Oct over central South America highly over-estimates at large AOD and, in the non-burning season, under-estimates at small AOD, as compared to AERONET; good comparisons are found at moderate AOD.

Region & season characteristics: the central region of Brazil is a mix of forest, cerrado, and pasture, and is known to have low AOD most of the year except during the biomass burning season. (Stations: Alta Floresta, Mato Grosso.)

Dominating factors leading to aerosol estimate bias:

1. The large positive bias in the AOD estimate during the biomass burning season may be due to wrong assignment of aerosol absorbing characteristics. (Specific explanation: a constant single scattering albedo of ~0.91 is assigned for all seasons, while the true value is closer to ~0.92-0.93.) [Notes or exceptions: biomass burning regions in Southern Africa do not show as large a positive bias as in this case; it may be due to different optical characteristics or single scattering albedo of the smoke particles. AERONET observations of SSA confirm this.]

2. Low AOD is common in the non-burning season. In low-AOD cases, biases are highly dependent on lower boundary conditions. In general a negative bias is found due to uncertainty in the surface reflectance characterization, which dominates if the signal from atmospheric aerosol is low. (Station: Santa Cruz, central South America.)

Example: scatter plot of MODIS AOD at 550 nm vs. AERONET from ref. (Hyer et al., 2011) shows severe over-estimation of MODIS Col 5 AOD (dark target algorithm) at large AOD at 550 nm during Aug-Oct 2005-2008 over Brazil.

Constraints: only the best quality of MODIS data (Quality = 3) used; data with scattering angle > 170 deg excluded.

Symbols: red lines define the region of Expected Error (EE); green is the fitted slope.

Results: tolerance = 62% within EE; RMSE = 0.212; r² = 0.81; slope = 1.00. For low AOD (< 0.2), slope = 0.3. For high AOD (> 1.4), slope = 1.54.
Reference:
Hyer, E. J., Reid, J. S., and Zhang, J., 2011: An over-land aerosol optical depth data set for data assimilation by filtering, correction, and aggregation of MODIS Collection 5 optical depth retrievals, Atmos. Meas. Tech., 4, 379-408, doi:10.5194/amt-4-379-2011
• We are done. Bam!