Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13, November 29, 2010


Foundations VII:
Data life-cycle, Mining and
Knowledge Discovery
Deborah McGuinness and Joanne Luciano
With Peter Fox and Li Ding
CSCI-6962-01
Week 13, November 29, 2010
Contents
• Review assignment
• More advanced topics; life cycle, mining and
adding to your knowledge base
• Summary
• Next week (your presentations)
Semantic Web Methodology and
Technology Development Process
• Establish and improve a well-defined methodology vision for
Semantic Technology based application development
• Leverage controlled vocabularies, etc.
[Figure: development-process cycle. Box labels: Use Case; Small Team, mixed skills; Analysis; Develop model/ontology; Use Tools; Science/Expert Review & Iteration; Adopt Technology Approach; Leverage Technology Infrastructure; Rapid Prototype; Open World: Evolve, Iterate, Redesign, Redeploy; Evaluation]
Data->Information->Knowledge
Data Life Cycle
• Life cycle (we will define these shortly)
– Acquisition, curation, preservation
– Long term stewardship
• Data and information – we use this to get to
the discussion of knowledge
– Content; the values
– Context; the background, setting, etc.
– Structure; organization and form
• Representation/ storage
– Analog
– Digital (and born digital)
Why it is important
• 1976 NASA Viking mission to Mars (A. Hesseldahl, Saving
Dying Data, Sep. 12, 2002, Forbes. [Online]. Available:
http://www.forbes.com/2002/09/12/0912data_print.html)
• 1986 BBC Digital Domesday (A. Jesdanun, “Digital memory
threatened as file formats evolve,” Houston Chronicle, Jan.
16, 2003. [Online]. Available:
http://www.chron.com/cs/CDA/story.hts/tech/1739675)
• R. Duerr, M. A. Parsons, R. Weaver, and J. Beitler, “The
international polar year: Making data available for the long-term,”
in Proc. Fall AGU Conf., San Francisco, CA, Dec. 2004. [Online]. Available:
ftp://sidads.colorado.edu/pub/ppp/conf_ppp/Duerr/The_International_Polar_Year:_Making_Data_and_Information_Available_for_the_Long_Term.ppt
Why (cont’d)
• e-science aims to derive new knowledge from
(possibly) multiple sources of data
• The data needs to be persistent, available
and usable
• The rate of creation of knowledge
representations is increasing; they are a
representation of the known ‘facts’ based on
the data
• We studied KR creation, engineering,
evolution and iteration
• Knowledge needs a life-cycle as well
At the heart of it
• Inability to read the underlying sources,
e.g. the data formats, metadata formats,
knowledge formats, etc.
• Inability to know the inter-relations,
assumptions and missing information
• We’ll look at a (data) use case for this
shortly
• But first we will look at what, how and
who in terms of the full life cycle
What to collect?
• Documentation
– Metadata
– Provenance
• Ancillary Information
• Knowledge
Who does this?
• Roles:
– Data creator
– Data analyst
– Data manager
– Data curator
How it is done
Acquisition
Curation
Preservation
• Usually refers to the full life cycle
• Archiving is a component
• Stewardship is the act of preservation
• Intent is that ‘you can open it any time in the
future’ and that ‘it will be there’
• This involves steps that may not be
conventionally thought of
• Think 10, 20, 50, 200 years…. looking
historically gives some guide to future
considerations
Some examples and experience
• NASA
• NOAA
• Library community
• Note:
– Mostly in relation to publications, books, etc but
some for data
– Note that knowledge is in publications but the
structure form is meant for humans not
computers, despite advances in text analysis
– Very little for the type of knowledge we are
considering: in machine accessible form
Back in the day...
SEEDS Working Group on Data Lifecycle
• Second Workshop Report
o https://esdswg.eosdis.nasa.gov/documents/W2_Bothwell.pdf
o Many LTA recommendations
• Earth Sciences Data Lifecycle Report
o https://esdswg.eosdis.nasa.gov/documents/lta_prelim_rprt2.pdf
o Many lessons learned from USGS experience, plus some
recommendations
• SEEDS Final Report (2003) - Section 4
o https://esdswg.eosdis.nasa.gov/documents/FinRec.pdf
o Final recommendations vis a vis data lifecycle
MODIS Pilot Project
• GES DISC, MODAPS, NOAA/CLASS, ESDIS effort
• Transferred some MODIS Level 0 data to CLASS
Mostly Technical Issues
• Data Preservation
o Bit-level integrity
o Data readability
• Documentation
• Metadata
• Semantics
• Persistent Identifiers
• Virtual Data Products
• Lineage Persistence
• Required ancillary data
• Applicable standards
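Bit-level integrity, the first item in the list above, is commonly checked by storing a checksum for each file at archive time and recomputing it later. The following is a minimal sketch under that assumption; the function names and the manifest layout (file name mapped to SHA-256 digest) are illustrative and are not taken from the SEEDS reports.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Record one digest per file at archive time (hypothetical manifest layout)."""
    return {str(p): sha256_of(p) for p in paths}


def verify_manifest(manifest):
    """Return the files that are missing or whose digest no longer matches."""
    return [name for name, recorded in manifest.items()
            if not Path(name).exists() or sha256_of(name) != recorded]
```

An empty list from `verify_manifest` means every recorded file is still present and unchanged at the bit level. It says nothing about whether the file is still readable, which is the separate "data readability" issue.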
Mostly Non-Technical Issues
• Policy (constrained by money…)
• Front end of the lifecycle
o Long-term planning, data formats, documentation...
• Governance and policy
• Legal requirements
• Archive to archive transitions
• Money (intertwined with policy)
• Cost-benefit trades
• Long-term needs of NASA Science Programs
• User input
o Identifying likely users
• Levels of service
• Funding source and mechanism
Use case: a real live one; deals mostly
with structure and (some) content
HDF4 Format "Maps"
for Long Term Readability
C. Lynnes, GES DISC
R. Duerr and J. Crider, NSIDC
M. Yang and P. Cao, The HDF Group
HDF=Hierarchical Data Format
NSIDC=National Snow and Ice Data Center
GES=Goddard Earth Science
DISC=Data and Information Service Center
In the year 2025...
A user of HDF-4 data will run into the following likely hurdles:
• The HDF-4 API and utilities are no longer supported...
o ...now that we are at HDF-7
• The archived API binary does not work on today's OS's
o ...like Android 3.1
• The source does not compile on the current OS
o ...or is it the compiler version, gcc v. 7.x?
• The HDF spec is too complex to write a simple read
program...
o ...without re-creating much of the API
What to do?
HDF Mapping Files
Concept: create text-based "maps" of the HDF-4 file layouts
while we still have a viable HDF-4 API (i.e., now)
• XML
• Stored separately from, but close to the data files
• Includes
o internal metadata
o variable info
o chunk-level info
– byte offsets and length
– linked blocks
– compression information
Task funded by ESDIS project
• The HDF Group, NSIDC and GES DISC
Map sample (extract)
<hdf4:SDS objName="TotalCounts_A" objPath="/ascending/Data Fields"
          objID="xid-DFTAG_NDG-5">
  <hdf4:Attribute name="_FillValue" ntDesc="16-bit signed integer">
    00
  </hdf4:Attribute>
  <hdf4:Datatype dtypeClass="INT" dtypeSize="2" byteOrder="BE" />
  <hdf4:Dataspace ndims="2">
    180 360
  </hdf4:Dataspace>
  <hdf4:Datablock nblocks="1">
    <hdf4:Block offset="27266625" nbytes="20582"
                compression="coder_type=DEFLATE" />
  </hdf4:Datablock>
</hdf4:SDS>
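The point of the map is that a reader needs only the byte offsets, lengths, data type and compression information, not the HDF-4 library. A minimal sketch of such a reader follows, using the element and attribute names from the extract above. Several things are assumptions: the namespace URI is hypothetical (the extract does not show the declaration), each DEFLATE block is assumed to be a zlib-wrapped stream, only 2-D 16-bit integer data is handled, and `read_sds` is an illustrative name, not part of the prototype read programs.

```python
import struct
import xml.etree.ElementTree as ET
import zlib

# Hypothetical namespace URI: the slide extract does not show the real declaration.
NS = {"hdf4": "http://example.org/hdf4-map"}


def read_sds(map_xml, data_path, obj_name):
    """Read one 2-D, 16-bit integer SDS using only the XML map."""
    root = ET.fromstring(map_xml)
    sds = next(e for e in root.iter("{%s}SDS" % NS["hdf4"])
               if e.get("objName") == obj_name)

    dtype = sds.find("hdf4:Datatype", NS)
    if dtype.get("dtypeClass") != "INT" or dtype.get("dtypeSize") != "2":
        raise ValueError("this sketch only handles 16-bit integers")
    order = ">" if dtype.get("byteOrder") == "BE" else "<"
    nrows, ncols = (int(n) for n in sds.find("hdf4:Dataspace", NS).text.split())

    raw = b""
    with open(data_path, "rb") as f:
        for block in sds.iterfind("hdf4:Datablock/hdf4:Block", NS):
            f.seek(int(block.get("offset")))          # byte offset from the map
            chunk = f.read(int(block.get("nbytes")))  # stored (compressed) length
            if "DEFLATE" in block.get("compression", ""):
                # Assumption: the block is a zlib-wrapped DEFLATE stream.
                chunk = zlib.decompress(chunk)
            raw += chunk

    values = struct.unpack("%s%dh" % (order, nrows * ncols), raw)
    return [list(values[r * ncols:(r + 1) * ncols]) for r in range(nrows)]
```

Everything the function needs comes from the text map and the Python standard library, which is the property that makes the approach attractive for long-term readability.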
Status and Future
Status
• Map creation utility (part of HDF)
• Prototype read programs
o C
o Perl
• Paper in TGRS special issue
• Inventory of HDF-4 data products within EOSDIS
Possible Future Steps
• Revise XML schema
• Revise map utility and add to HDF baseline
• Implement map creation and storage operationally
o e.g., add to ECS or S4PA metadata files
Examples of NASA context
Contextual Information:
• Instrument/sensor characteristics including pre-flight
or pre-operational performance measurements (e.g.,
spectral response, noise characteristics, etc.)
• Instrument/sensor calibration data and method
• Processing algorithms and their scientific basis,
including complete description of any sampling or
mapping algorithm used in creation of the product
(e.g., contained in peer-reviewed papers, in some
cases supplemented by thematic information
introducing the data set or derived product)
• Complete information on any ancillary data or other
data sets used in generation or calibration of the
data set or derived product
7th Joint ESDSWG meeting, October 22, Philadelphia, PA
Data Lifecycle Workshop sponsored by the Technology Infusion Working Group
Presented by R. Duerr at the Summer Institute on Data Curation, June 2-5, 2008,
Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Contextual Information (continued):
• Processing history including versions of processing source
code corresponding to versions of the data set or derived
product held in the archive
• Quality assessment information
• Validation record, including identification of validation data sets
• Data structure and format, with definition of all parameters and
fields
• In the case of earth based data, station location and any
changes in location, instrumentation, controlling agency,
surrounding land use and other factors which could influence
the long-term record
• A bibliography of pertinent Technical Notes and articles,
including refereed publications reporting on research using the
data set
• Information received back from users of the data set or product
However…
• Even groups like NASA do not have a
governance model for this work
• Governance: definition
• Stakeholders:
– NASA for integrity of their data holdings (is it their
responsibility?)
– Public for value for and return on investment
– Scientists for future use (intended and unintended)
– Historians
NOAA
Library community
• OAIS
• OAI (PMH and ORE)
Metadata Standards - PREMIS
• Provide a core preservation metadata set
with broad applicability across the digital
preservation community
• Developed by an OCLC and RLG sponsored
international working group
– Representatives from libraries, museums,
archives, government, and the private sector.
• Based on the OAIS reference model
Metadata Standards - PREMIS
• Maintained by the Library of Congress
• Editorial board with international membership
• User community consulted on changes
through the PREMIS Implementers Group
• Version 1 was released in June 2005
• Version 2 was just released
PREMIS - Entity-Relationship Diagram
[Diagram: five linked entities: Intellectual Entities, Objects, Events, Agents, Rights]
• Intellectual Entities: “a coherent set of content that is reasonably described as a unit”. For example, a web site, data set or collection of data
• Objects: “a discrete unit of information in digital form”. For example, a data file
• Events: “an action that involves at least one object or agent known to the preservation repository”, e.g., created, archived, migrated
• Agents: “a person, organization, or software program associated with preservation events in the life of an object”, e.g., Dr. Spock donated it
• Rights: “assertions of one or more rights or permissions pertaining to an object or an agent”, e.g., copyright notice, legal statute, deposit agreement
PREMIS - Types of Objects
• Representation - “the set of files needed for a
complete and reasonable rendition of an
Intellectual Entity”
• File
• Bitstream - “contiguous or non-contiguous
data within a file that has meaningful common
properties for preservation purposes”
Metadata Standards - METS
• Metadata Encoding and Transmission
Standard
• An initiative of the Digital Library Federation
• Based on the Making of America II project
METS - What’s Its Purpose?
• Provides the means to convey the metadata
necessary for
– management of digital objects within a repository
– exchange of objects between repositories (or
between repositories and their users)
• Designed to facilitate
– shared development of information management
tools/services
– interoperable exchange of digital materials
METS - What’s its status?
• Version 1.6 was released in Sept. 2007
• Maintained by the Library of Congress
• International Editorial Board
• NISO registration as of 2006
Backup Materials - MODIS
Contextual Info
Instrument/sensor characteristics
Processing Algorithms & Scientific Basis
Ancillary Data
Processing History including Source Code
Quality Assessment Information
Validation Information
Other Factors that can Influence the Record
Bibliography
Information from users
• Data errors found
• Quality updates
• Things that need further explanation
• Metadata updates/additions?
• Community contributed metadata????
Back to why you need to…
• E-science uses data and it needs to be
around when what you create goes into
service and you go on to something else
• That’s why someone on the team must
address life-cycle (data, information and
knowledge – we’ll get to the latter shortly)
and work with other team members to
implement organizational, social and
technical solutions to the requirements
What would you need to do?
(Digital) Object Identifiers
• Object is used here so as not to pre-empt an
implementation, e.g. resource, sample, data,
catalog
• Examples:
– DOI
– URI
– XRI
Versioning
Mining
• We will start with data but the ideas apply to
information and knowledge bases as well
• Definition
• History
• Our interest
SAM: Smart Assistant for
Earth Science Data Mining
PI: Rahul Ramachandran
Co-I: Peter Fox, Chris Lynnes,
Robert Wolf, U.S. Nair
Science Motivation
• Study the impact of natural iron fertilization processes, such as
dust storms, on plankton growth and subsequent DMS
production
– Plankton plays an important role in the carbon cycle
– Plankton growth is strongly influenced by nutrient availability (Fe/Ph)
– Dust deposition is an important source of Fe over the ocean
– Satellite data is an effective tool for monitoring the effects of dust
fertilization
• Analysis entails
– Mine MODIS L1B data for dust storm events and identify the swath of
area influenced by the passage of the dust storms.
– Examine correlations between fertilization, plankton growth and DMS
production
Current Analysis Process
• MODIS aerosol products don’t provide speciation
• Locate and download all the data to their local machine
• Write code to classify and detect dust accurately [3-4
month effort]
• Write code to classify and detect other dust aerosols [3-4 month effort]
• Write code to segment the detected region in order to
account for advection effect and correlation coefficient [2
month effort]
Analysis with SAM
• Create a workflow to perform classification using many
different state of the art classifiers on distributed data
• Create a workflow to segment detected regions using
image processing services on distributed data
Bottom line:
• Scientist does not have to write all the code to perform
the analysis
• Can compose workflows that utilize distributed
data/services
• Can share the workflow with others to collaborate, reuse
and modify
Conducting Science using
Internet as the Primary Computer
Mash-ups Example: Yahoo Pipes
Data Mining in the ‘new’ Distributed
Data/Services Paradigm
Too many choices!!
• And that’s only part of the toolkit
• ADaM-IVICS toolkit has 100+ algorithms
SAM Objectives
• Improve usability of Earth Science data by
existing data mining services for research, by
incorporating semantics into the workflow
composition process.
– Semantic search capable of mapping a
conceptual task
– Assistance in mining workflow composition
– Verification that services are connected in a
semantically correct fashion
Ontology Use
Semi-automated Workflow Composition
[Figure: filtering services based on data format]
[Figure: filtering service options based on both data format and task selected]
[Figure: final workflow]
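The filtering and verification steps named above can be sketched in a few lines. This is only an illustration under simplifying assumptions: plain strings stand in for ontology concepts, and the `Service` record, function names and field names are hypothetical rather than SAM's actual interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    task: str           # conceptual task, e.g. "classification"
    input_format: str   # data format the service accepts
    output_format: str  # data format the service produces


def candidate_services(catalog, data_format, task=None):
    """Filter by data format, and also by task once the user has selected one."""
    return [s for s in catalog
            if s.input_format == data_format and (task is None or s.task == task)]


def verify_workflow(steps):
    """Return mismatch messages; an empty list means every connection is consistent."""
    return ["%s -> %s: %s != %s" % (a.name, b.name, a.output_format, b.input_format)
            for a, b in zip(steps, steps[1:])
            if a.output_format != b.input_format]
```

A real system would compare ontology classes (allowing subclass matches) instead of testing strings for equality, but the shape of the check is the same: each service's output must be acceptable as the next service's input.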
Science Motivation
• Study the impact of natural iron fertilization processes,
such as dust storms, on plankton growth and
subsequent DMS production
– Plankton plays an important role in the carbon cycle
– Plankton growth is strongly influenced by nutrient
availability (Fe/Ph)
– Dust deposition is an important source of Fe over the ocean
– Satellite data is an effective tool for monitoring the effects
of dust fertilization
Hypothesis
• In remote ocean locations there is a positive
correlation between the area averaged
atmospheric aerosol loading and oceanic
chlorophyll concentration
• There is a time lag between oceanic dust
deposition and the photosynthetic activity
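Both parts of the hypothesis can be tested with a lagged correlation: correlate the aerosol series at time t with the chlorophyll series at time t + lag, and see which lag gives the strongest relationship. The sketch below assumes two equal-length, regularly sampled, area-averaged time series; the function names are illustrative and this is not the analysis code used for the results in this deck.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # undefined (division by zero) for a constant series


def lagged_correlations(aot, chl, max_lag):
    """Correlate aot[t] with chl[t + lag] for lag = 0..max_lag (in time steps)."""
    return {lag: pearson(aot[:len(aot) - lag], chl[lag:])
            for lag in range(max_lag + 1)}
```

The lag with the largest coefficient is a candidate for the delay between dust deposition and the photosynthetic response. Correlation alone does not separate the effect of dust from other factors such as clouds or ocean currents.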
[Figure: Primary source of ocean nutrients. Labels: ocean upwelling, wind-blown dust, sediments from river, Sahara, clouds]
[Figure: Factors modulating dust-ocean photosynthetic effect. Labels: SST, chlorophyll, nutrients, dust, Sahara]
Objectives
• Use satellite data to determine, if
atmospheric dust loading and
phytoplankton photosynthetic activity are
correlated.
• Determine physical processes responsible
for observed relationship
Preliminary Results
Data and Method
• Data sets obtained from SeaWiFS and
MODIS during 2000 – 2006 are employed
• MODIS derived AOT
The areas of study
[Map: eight numbered study regions. Legend as extracted: 1 - Tropical North Atlantic Ocean, 2 - West coast of Central Africa, 3 - Patagonia]
Tropical North Atlantic Ocean: dust from Sahara Desert
[Figure: AOT and chlorophyll panels. Values shown: -0.0902, -0.328, -0.4595, -0.14019, -0.7253, -0.1095, -0.75102, -0.66448, -0.72603, -0.17504, -0.68497, -0.15874, -0.85611, -0.4467]
Arabian Sea: dust from Middle East
[Figure: AOT and chlorophyll panels. Values shown: 0.66618, 0.65211, 0.76650, 0.37991, 0.45171, 0.52250, 0.36517, 0.5618, 0.4412, 0.75071, 0.708625, 0.8495, 0.59895, 0.69797]
Summary and future work
• Dust impacts ocean photosynthetic activity: correlations are
positive in some areas and NEGATIVE in other areas,
especially in the Saharan basin
• Hypothesis for explaining observations of negative
correlation: In areas that are not nutrient limited,
dust reduces photosynthetic activity
• But also need to consider the effect of clouds,
ocean currents. Also need to isolate the effects of
dust. MODIS AOT product includes contribution
from dust, DMS, biomass burning etc.
Case for SAM
• MODIS aerosol products don’t provide speciation
• Why is performing this data analysis hard?
– Need to classify and detect dust accurately
– Need to classify and detect other aerosols (e.g., DMS) accurately
– Need to segment the detected region in order to account for
advection effects and correlation coefficient.
• What will SAM provide?
– Provide capability to create a workflow to perform classification
– Provide capability to create a workflow to segment detected
regions
Bottom line:
• Scientist does not have to write all the code to perform the
analysis
• Can compose workflows that utilize distributed data/services
• Can share the workflow with others to collaborate, reuse and
modify
Knowledge Discovery
• Has a broad meaning
– Finding ontologies
– Creating new knowledge from
• Previous knowledge
• New sources (data, information)
• Modeling
• We’ll look at a mining approach as an
example
Ingest/pipelines: problem definition
• Data is coming in faster, in greater volumes and outstripping our ability to perform
adequate quality control
• Data is being used in new ways and we frequently do not have sufficient
information on what happened to the data along the processing stages to
determine if it is suitable for a use we did not envision
• We often fail to capture, represent and propagate manually generated
information that needs to go with the data flows
• Each time we develop a new instrument, we develop a new data ingest
procedure and collect different metadata and organize it differently. It is then hard
to use with previous projects
• The task of event determination and feature classification is onerous and we
don't do it until after we get the data
20080602 Fox VSTO et al.
Use cases
• Who (person or program) added the comments
to the science data file for the best vignetted,
rectangular polarization brightness image from
January 26, 2005 1849:09UT taken by the
ACOS Mark IV polarimeter?
• What was the cloud cover and atmospheric
seeing conditions during the local morning of
January 26, 2005 at MLSO?
• Find all good images on March 21, 2008.
• Why are the quick look images from March 21,
2008, 1900UT missing?
• Why does this image look bad?
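Questions such as the first one above can be answered only if the ingest pipeline records who or what acted on each file. A minimal sketch, assuming provenance is captured as simple event records; the record fields, agent names and file names are hypothetical and do not come from the actual pipeline.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ProvenanceEvent:
    artifact: str   # file or product the event applies to
    action: str     # e.g. "acquired", "calibrated", "commented"
    agent: str      # person or program responsible
    when: datetime  # time the action took place


def who_did(events, artifact, action):
    """Return the agents that performed `action` on `artifact`, oldest first."""
    matches = [e for e in events if e.artifact == artifact and e.action == action]
    return [e.agent for e in sorted(matches, key=lambda e: e.when)]
```

An empty result is itself informative: it means the action was either never performed or never recorded, which is the gap the problem definition describes.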
Summary
• (Data) life cycle – key actions
–A
–B
• Mining (data, information and knowledge) –
key results and work in progress
–A
–B
• Facilitating new discoveries
–A
Next week
• This week's assignments:
– Reading: None
– Assignment: None
• Next class (week 14 – December 6):
– Class presentation III: Use case iteration
• Term assignment due – December 6 before class
• Office hours this week – by appointment or drop in
– Winslow 2104 (Professor McGuinness)
– Winslow 2143 (Professor Luciano)
• Questions?