Academic Basis for Data and Information Science, Data Models, Schema, Data Tools and Data as Service Paradigms Peter Fox Data Science – CSCI/ERTH/ITWS-4350/6350 Week 11, November 12,

Academic Basis for Data and
Information Science, Data
Models, Schema, Data Tools
and Data as Service
Paradigms
Peter Fox
Data Science – CSCI/ERTH/ITWS-4350/6350
Week 11, November 12, 2013
1
Contents
• Reading
• Informatics
• Data models
• Schema
• Tools
• Markup languages
• Data as service
• How are the projects going?
2
Reading?
• Introduction to Data Management
• Changing software, hardware a nightmare for
tracking scientific data (and Parts I, II and III)
• Overview of Scientific Workflow Systems, Gil
(AAAI08 Tutorial)
• Comparison of workflow software products,
Krasimira Stoilova, Todor Stoilov
• Scientific Workflow Systems for 21st Century,
New Bottle or New Wine? Yong Zhao, Ioan
Raicu, Ian Foster
3
Definitions (revisited)
• Data - are pieces of <x> that represent the
qualitative or quantitative attributes of a
variable or set of variables.
• Data (plural of "datum", which is seldom
used) - are typically the results of
measurements and can be the basis of
graphs, images, or observations of a set of
variables.
• Data - are often viewed as the lowest level of
abstraction from which information and
knowledge are derived
4
Definitions ctd.
• Information
– Representations (of facts? data?) in a form that
lends itself to human use
• Knowledge
– …. Meaning – but watch how this may become
so very important
5
Data-Information-Knowledge
Ecosystem
[Figure labels: Producers, Consumers, Experience, Context]
• Data: Creation, Gathering
• Information: Presentation, Organization
• Knowledge: Integration, Conversation
6
Mind the gap
• As we aim to use modern technology to
advance data science:
• There is often a gap between science and
the underlying infrastructure and technology
that is available
• Cyberinfrastructure is the new research
environment(s) that support advanced data
acquisition, data storage, data management,
data integration, data mining, data
visualization and other computing and
information processing services over the
Internet.
• Informatics - information science includes the
science of (data and) information, the practice
of information processing, and the engineering
of information systems. Informatics studies the
structure, behavior, and interactions of natural
and artificial systems that store, process and
communicate (data and) information. It also
develops its own conceptual and theoretical
foundations. Since computers, individuals and
organizations all process information,
informatics has computational, cognitive and
social aspects, including study of the social
impact of information technologies. (Wikipedia)
7
A moment of history
• In the late 1950’s (actually around 1957-1958)
the modern informatics term was coined
• Existed as a single field for a while, then split into
library science and computer science, which developed
as their own fields and became disconnected
• Now coming back to be relevant to science
• Informatics IS NOT just having a scientist
work with an “IT/ICT” person
8
Advertisement
• Spring 2014 – Xinformatics
• See last year:
http://tw.rpi.edu/web/course/Xinformatics/2013
9
Library science
• Curates the artifacts of knowledge
• Organizes and manages them for consumers
– Cataloging and classification
• Preservation
– ‘maintaining or restoring access to artifacts,
documents and records through the study,
diagnosis, treatment and prevention of decay and
damage’ (wikipedia)
• Digital age
– Curation and preservation
10
Cognitive Science
• Cognitive science is an interdisciplinary study
of the mind and intelligence
• It operates at the intersection of psychology,
philosophy, computer science, linguistics,
anthropology, and neuroscience.
• Of relevance for data and information science
are three significant theoretical underpinnings
– mental representation,
– the nature of expertise,
– and intuition
• Very relevant to model, data/metadata choice
11
Social Science
• Branch of humanities
• Especially as it relates to networks of
scientists
• Exploits sociology of groups, teams
• Cultural norms as well as discipline norms
– Modes of what and how rewards are given
– Between those who produce and those who
consume data (and information)
– More
12
Information theory
• Semiotics, also called semiotic studies or
semiology, is the study of sign processes
(semiosis), or signification and
communication, signs and symbols. It is usually
divided into three branches:
– Syntactics: Relation of signs to each other in
formal structures
– Semantics: Relation between signs and the
things to which they refer; their denotata
– Pragmatics: Relation of signs to their impacts on
those who use them
13
Note: we have theories for…
• Knowledge -> various forms of logic(s)
• Information (Shannon, Weaver, Peirce…)
• But not ‘Data’ (except for …)
• < reading for this week > sneak peek…
• http://tw.rpi.edu/web/node/3605 (Mealy 1967)
• http://tw.rpi.edu/web/node/3606 (Wickett et al.)
Mealy’s Introduction
• “We do not, it seems, have a very clear and
commonly agreed upon set of notions about data - either what they are, how they should be fed and
cared for, or their relation to the design of
programming languages and operating systems.
This paper sketches a theory of data which may
serve to clarify these questions. It is based on a
number of old ideas and may, as a result, seem
obvious. Be that as it may, some of these old ideas
are not common currency in our field, either
separately or in combination; it is hoped that
rehashing them in a somewhat new form may prove
to be at least suggestive.”
15
Three elements and connections
• Relations
• Data Maps
• Access Functions
• The data itself
• Procedures
• Storage and representation
• Descriptors
16
Wickett et al…
• “Heterogeneous digital data that has been produced by
different communities with varying practices and
assumptions, and that is organized according to different
representation schemes, encodings, and file formats,
presents substantial obstacles to efficient integration,
analysis, and preservation. This is a particular impediment to
data reuse and interdisciplinary science. An underlying
problem is that we have no shared formal conceptual
model of information representation that is both
accurate and sufficiently detailed to accommodate the
management and analysis of real world digital data in
varying formats. Developing such a model involves
confronting extremely challenging foundational problems in
information science.”
17
Premise
[Figure labels: Context, Experience]
• Data: Creation, Gathering
• Information: Presentation, Organization
• Knowledge: Integration, Conversation
18
1. Assume context free
• Content and Structure
• D=f(x;p)
• D=data, f=transduction function, x=thing, p=parametric
dependence (e.g. time of transduction)
• HAVE – Syntax
• DO NOT HAVE - Semantics – no meaning without context
• OR - Pragmatics – no use without meaning??
• What about - Uncertainty, quality, bias (error) – none without
context?
2. Assume minimal context
• Minimal = incomplete?
• E.g. know instrument but not when, or of what
• E.g. know what but not how
• Partial uncertainty? Conditional entropy?
• Constructive induction?
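The "partial uncertainty / conditional entropy" question can be made concrete with a small sketch. The joint distribution below is invented purely for illustration: it assumes two hypothetical instruments (the context, C) and three possible data values (D), and compares the entropy of the data alone, H(D), with the entropy that remains once the instrument is known, H(D|C).

```python
import math
from collections import defaultdict

# Hypothetical joint probabilities p(context, data value); illustrative only.
joint = {
    ("thermometer", "low"): 0.30,
    ("thermometer", "mid"): 0.15,
    ("thermometer", "high"): 0.05,
    ("radiometer", "low"): 0.05,
    ("radiometer", "mid"): 0.15,
    ("radiometer", "high"): 0.30,
}

def entropy(probs):
    """Shannon entropy, in bits, of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal(joint, axis):
    """Sum the joint distribution over the other axis (0 = context, 1 = data)."""
    out = defaultdict(float)
    for key, p in joint.items():
        out[key[axis]] += p
    return dict(out)

def conditional_entropy(joint):
    """H(D|C) computed with the chain rule: H(C, D) - H(C)."""
    return entropy(joint.values()) - entropy(marginal(joint, 0).values())

h_data = entropy(marginal(joint, 1).values())
h_data_given_context = conditional_entropy(joint)
print(f"H(D)   = {h_data:.3f} bits")
print(f"H(D|C) = {h_data_given_context:.3f} bits")
```

In this toy case, knowing the instrument lowers the remaining uncertainty about the data value but does not remove it, which is one way to read "minimal = incomplete".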
Pulling things over from Informatics
[Figure labels: Context, Experience]
• Data: Creation, Gathering
• Information: Presentation, Organization
• Knowledge: Integration, Conversation
21
Information Models
• Conceptual models, sometimes called domain
models, are typically used to explore domain
concepts
• High-level conceptual models are often created as
part of initial requirements envisioning efforts as
they are used to explore the high-level static
business or science or medical structures and
concepts.
22
(Information) Architecture
• Definition:
– “is the art of expressing a model or concept of
information used in activities that require explicit
details of complex systems” (wikipedia)
– “… I mean architect as in the creating of
systemic, structural, and orderly principles to
make something work - the thoughtful making of
either artifact, or idea, or policy that informs
because it is clear.” Wurman
23
Data Models
• Conceptual data models, sometimes called domain
models, are typically used to explore domain
concepts
• Conceptual data models are often created as the
precursor to logical data models or as alternatives
to them.
• http://en.wikipedia.org/wiki/Data_modelling
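A minimal sketch of how a conceptual model can lead to a logical and then a physical data model. The entities (Specimen, Observation), their attributes, the table layout and the sample values are all hypothetical, chosen only to illustrate the three levels; SQLite stands in for whatever storage a real project would use.

```python
import sqlite3
from dataclasses import dataclass

# Conceptual model (hypothetical): a Specimen is the subject of Observations.

# Logical model: the entities gain attributes, types and keys.
@dataclass
class Specimen:
    specimen_id: int
    material_class: str

@dataclass
class Observation:
    observation_id: int
    specimen_id: int          # foreign key to Specimen
    observed_property: str
    result: float

# Physical model: one possible realization as SQLite tables.
DDL = """
CREATE TABLE specimen (
    specimen_id    INTEGER PRIMARY KEY,
    material_class TEXT NOT NULL
);
CREATE TABLE observation (
    observation_id    INTEGER PRIMARY KEY,
    specimen_id       INTEGER NOT NULL REFERENCES specimen(specimen_id),
    observed_property TEXT NOT NULL,
    result            REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# Invented sample values, for illustration only.
s = Specimen(1, "basalt")
o = Observation(1, s.specimen_id, "SiO2 weight percent", 49.2)
conn.execute("INSERT INTO specimen VALUES (?, ?)",
             (s.specimen_id, s.material_class))
conn.execute("INSERT INTO observation VALUES (?, ?, ?, ?)",
             (o.observation_id, o.specimen_id, o.observed_property, o.result))

rows = conn.execute(
    "SELECT s.material_class, o.observed_property, o.result "
    "FROM observation o JOIN specimen s USING (specimen_id)"
).fetchall()
print(rows)
```

The same conceptual model could map to quite different logical and physical forms (files, XML, a triple store); the point is only that each level adds commitments the previous one left open.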
24
Observation and Measurement
[UML class diagram; recoverable content summarized below]
«FeatureType» OM_Observation
+ parameter: NamedValue [0..*]
+ phenomenonTime: TM_Object
+ resultTime: TM_Instant
+ validTime: TM_Period [0..1]
+ resultQuality: DQ_Element [0..*]
«DataType» NamedValue
+ name: GenericName
+ value: Any
ObservationContext
+ role: GenericName
Other classes: «metaclass» GF_FeatureType {root}, «metaclass» GF_PropertyType {root}, «FeatureType» GFI_Feature, «Type» GFI_PropertyType, «FeatureType» OM_Process, MD_Metadata {root}, «type» Any {root}
Associations and role names:
– Domain: +featureOfInterest, +propertyValueProvider [0..*]
– ProcessUsed: +procedure, +generatedObservation [0..*]
– Phenomenon: +observedProperty
– Range: +result
– Metadata: +metadata [0..1]
– ObservationContext: +relatedObservation [0..*]
– Also shown: +theGF_FeatureType, +carrierOfCharacteristics [0..*], «instanceOf»
constraints
{observedProperty shall be a phenomenon associated with the type of the feature of interest}
{procedure shall be suitable for observedProperty}
{result type shall be suitable for observedProperty}
{parameter.name shall not appear more than once}
25
Mapping model to geochemistry
26
Specimen Model
[UML class diagram; recoverable content summarized below]
«FeatureType» SF_SamplingFeature
+ parameter: NamedValue [0..*]
«FeatureType» SF_Specimen
+ materialClass: GenericName
+ samplingTime: TM_Object
+ samplingLocation: GM_Object [0..1]
+ samplingMethod: OM_Process [0..1]
+ currentLocation: Location [0..1]
+ specimenType: GenericName [0..1]
+ size: Measure [0..1]
«Union» Location
+ geometryLocation: GM_Object
+ nameLocation: EX_GeographicDescription
SamplingFeatureComplex
+ role: GenericName
PreparationStep
+ time: TM_Object
+ processOperator: CI_ResponsibleParty [0..1]
Other classes: «FeatureType» GFI_Feature, «FeatureType» OM_Process
Associations and role names:
– Intention: +sampledFeature [1..*]
– SamplingFeatureComplex: +relatedSamplingFeature [0..*]
– PreparationStep: +processingDetails [0..*]
27
Conceptual model
28
Logical model
29
Physical model
30
31
Conceptual model – shoreline photos
32
Logical model – shoreline photos
33
However as a consumer
• Do you ever really see these data models?
• What’s the most common form of making
data available to others?
• What’s the most common means? Second
most common?
34
Example XML
<?xml version="1.0" encoding="ISO-8859-1"?>
<shiporder orderid="889923"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="shiporder.xsd">
  <orderperson>John Smith</orderperson>
  <shipto>
    <name>Ola Nordmann</name>
    <address>Langgt 23</address>
    <city>4000 Stavanger</city>
    <country>Norway</country>
  </shipto>
  <item>
    <title>Empire </title>
    <note>Special Edition</note>
    <quantity>1</quantity>
    <price>10.90</price>
  </item>
  <item>
    <title>Hide your heart</title>
    <quantity>1</quantity>
    <price>9.90</price>
  </item>
</shiporder>
35
Very simple schema
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="shiporder">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="orderperson" type="xs:string"/>
      <xs:element name="shipto">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="name" type="xs:string"/>
            <xs:element name="address" type="xs:string"/>
            <xs:element name="city" type="xs:string"/>
            <xs:element name="country" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element name="item" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="title" type="xs:string"/>
            <xs:element name="note" type="xs:string" minOccurs="0"/>
            <xs:element name="quantity" type="xs:positiveInteger"/>
            <xs:element name="price" type="xs:decimal"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
    <xs:attribute name="orderid" type="xs:string" use="required"/>
  </xs:complexType>
</xs:element>
</xs:schema>
36
Markup Languages
• Reminder:
– Mixes data and metadata, and yes, information
– Tag structure does not always model the
underlying data structure
– Modeling the XML itself, i.e. the schema is
another task
– Does have the potential benefit that it is more for
use than storage
• Parsing the file:
– Incomplete versus complete tags
– Empty or optional fields
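A minimal sketch of the two parsing concerns above, using Python's standard-library ElementTree. The document is a trimmed copy of the shiporder example (an assumption made to keep the sketch short); the second item has no <note>, so the code has to treat that field as optional, while an unclosed tag is a different kind of problem: the file is not well-formed and cannot be parsed at all.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the shiporder example; the second item has no <note>.
DOC = """<shiporder orderid="889923">
  <item>
    <title>Empire </title>
    <note>Special Edition</note>
    <quantity>1</quantity>
    <price>10.90</price>
  </item>
  <item>
    <title>Hide your heart</title>
    <quantity>1</quantity>
    <price>9.90</price>
  </item>
</shiporder>"""

root = ET.fromstring(DOC)
items = []
for item in root.findall("item"):
    items.append({
        "title": item.findtext("title", default="").strip(),
        # findtext returns the default when the optional element is absent
        "note": item.findtext("note", default=None),
        "quantity": int(item.findtext("quantity")),
        "price": float(item.findtext("price")),
    })

# An incomplete (unclosed) tag is a well-formedness error, not a missing value.
try:
    ET.fromstring("<shiporder><item></shiporder>")
    well_formed = True
except ET.ParseError:
    well_formed = False

print(items)
print("incomplete document parsed:", well_formed)
```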
37
Data tools (just a few)
• Models
– http://www.datamodel.org/
– MSDN: http://msdn.microsoft.com/en-us/library/bb399249.aspx
• Schema
– The Schematron differs in basic concept from other
schema languages in that it is not based on grammars but
on finding tree patterns in the parsed document. This
approach allows many kinds of structures to be
represented which are inconvenient and difficult in
grammar-based schema languages. If you know XPath or
the XSLT expression language, you can start to use The
Schematron immediately.
– http://www.schematron.com/
38
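The "tree patterns instead of grammars" idea can be sketched in a few lines. This is not Schematron itself, only an illustration of the approach using the limited XPath subset in Python's standard-library ElementTree. The rule list is hypothetical, and the sample document is a shortened shiporder in which one quantity was deliberately changed to 0 so that a rule fails.

```python
import xml.etree.ElementTree as ET

# Shortened shiporder; the second quantity is altered to 0 for illustration.
DOC = """<shiporder orderid="889923">
  <orderperson>John Smith</orderperson>
  <item><title>Empire </title><quantity>1</quantity><price>10.90</price></item>
  <item><title>Hide your heart</title><quantity>0</quantity><price>9.90</price></item>
</shiporder>"""

# Each hypothetical rule: (context path, assertion on a matched node, message).
RULES = [
    (".", lambda e: e.get("orderid") is not None,
     "order must carry an orderid"),
    ("./item", lambda e: e.find("title") is not None,
     "item must have a title"),
    ("./item", lambda e: int(e.findtext("quantity", "0")) > 0,
     "item quantity must be positive"),
]

def check(root, rules):
    """Select nodes by path, then test an assertion on each selected node."""
    failures = []
    for path, test, message in rules:
        for node in root.findall(path):
            if not test(node):
                failures.append(message)
    return failures

failures = check(ET.fromstring(DOC), RULES)
print(failures)
```

Rules of this kind say nothing about element order or overall document grammar; they only assert conditions wherever a pattern matches, which is why they can sit alongside a grammar-based schema.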
Markup Language tools
• Any context-sensitive editor
• XMLSpy, XML Notepad, XML Editor, oXygen
39
Data as Service
• Modern internet architectures allow for
– Service oriented architectures
– Resource oriented architectures
• Why is this important for data models,
schema, etc.?
– Hides/obscures underlying model, schemas
– Service interfaces are often a poor/hybrid match
for underlying models
• UML and ISO 19xxx family of standards, e.g.
19135 are changing the landscape
• Mature in certain settings.
40
Open Geospatial Consortium
• Web Feature Service (WFS)
– http://www.opengeospatial.org/standards/wfs
– supports INSERT, UPDATE, DELETE, LOCK,
QUERY and DISCOVERY operations on
geographic features using HTTP as the
distributed computing platform
– Built on Geography Markup Language (GML)
• Tutorial
– http://docs.codehaus.org/display/MAP/WFS+Tutorial
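A minimal sketch of what a WFS QUERY looks like over HTTP, assuming the key-value-pair encoding of a WFS 1.1.0 GetFeature request. The endpoint and feature type name are hypothetical; a real server advertises its feature types through GetCapabilities.

```python
from urllib.parse import urlencode

def wfs_getfeature_url(endpoint, type_name, max_features=10):
    """Build a WFS 1.1.0 key-value-pair GetFeature request URL."""
    params = {
        "service": "WFS",
        "version": "1.1.0",
        "request": "GetFeature",
        "typeName": type_name,
        "maxFeatures": max_features,
    }
    return f"{endpoint}?{urlencode(params)}"

# Hypothetical endpoint and feature type, for illustration only.
url = wfs_getfeature_url("http://example.org/wfs", "topp:states", max_features=5)
print(url)
```

The response to such a request is a GML document, i.e. the features themselves rather than a picture of them; the consumer still never sees the provider's underlying data model.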
41
WFS examples
42
Open Geospatial Consortium
• Web Map Service (WMS)
– http://www.opengeospatial.org/standards/wms
– produces maps of spatially referenced data
dynamically from geographic information (a "map"
is a portrayal of geographic information as a
digital image file suitable for display on a
computer screen). A map is not the data itself.
WMS-produced maps are generally rendered in a
pictorial format such as PNG, GIF or JPEG, or
occasionally as vector-based graphical elements
in Scalable Vector Graphics formats.
– http://www.intl-interfaces.com/cookbook/WMS/
– http://oceanesip.jpl.nasa.gov/esipde/guide.html
43
Open Geospatial Consortium
• Web Coverage Service (WCS)
– http://www.opengeospatial.org/standards/wcs
– supports electronic interchange of geospatial
data as "coverages" – that is, digital geospatial
information representing space-varying
phenomena
44
Open Geospatial Consortium
• Sensor Observation Service (SOS)
– http://www.opengeospatial.org/standards/sos
• SWE Common
– http://www.opengeospatial.org/projects/groups/swecommonswg
– GetCapabilities
45
IVOA (www.ivoa.net)
• Simple Image Access Protocol
– http://ivoa.net/Documents/SIA/20091008/PR-SIA-1.0-20091008.pdf
– This specification defines a protocol for retrieving image
data from a variety of astronomical image repositories
through a uniform interface. The interface is meant to be
reasonably simple to implement by service providers. A
query defining a rectangular region on the sky is used to
query for candidate images.
– The service returns a list of candidate images formatted as
a VOTable. For each candidate image an access reference
URL may be used to retrieve the image. Images may be
returned in a variety of formats including FITS and various
graphics formats. Referenced images are often computed
on the fly, e.g., as cutouts from larger images.
46
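A minimal sketch of the two-step pattern described above: query a region of sky, then read access reference URLs out of the returned table. It assumes the SIA 1.0 query parameters POS, SIZE and FORMAT and the VOX:Image_AccessReference column identifier. The service URL, image URLs and the response document are hypothetical, and the response is a hand-written, heavily simplified stand-in for a real VOTable (no XML namespace, two columns only).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def sia_query_url(base_url, ra_deg, dec_deg, size_deg, image_format="image/fits"):
    """Build an SIA 1.0 style query for a region centred on (ra, dec), in degrees."""
    params = {
        "POS": f"{ra_deg},{dec_deg}",
        "SIZE": size_deg,
        "FORMAT": image_format,
    }
    return f"{base_url}?{urlencode(params)}"

# Hand-written, simplified stand-in for a VOTable response; illustrative only.
RESPONSE = """<VOTABLE><RESOURCE><TABLE>
  <FIELD name="title" ucd="VOX:Image_Title"/>
  <FIELD name="url" ucd="VOX:Image_AccessReference"/>
  <DATA><TABLEDATA>
    <TR><TD>cutout A</TD><TD>http://example.org/images/a.fits</TD></TR>
    <TR><TD>cutout B</TD><TD>http://example.org/images/b.fits</TD></TR>
  </TABLEDATA></DATA>
</TABLE></RESOURCE></VOTABLE>"""

def access_urls(votable_xml):
    """Return the access reference URL of every candidate image row."""
    table = ET.fromstring(votable_xml).find("./RESOURCE/TABLE")
    ucds = [field.get("ucd") for field in table.findall("FIELD")]
    col = ucds.index("VOX:Image_AccessReference")
    return [tr.findall("TD")[col].text
            for tr in table.findall("./DATA/TABLEDATA/TR")]

query = sia_query_url("http://example.org/sia", 180.0, -30.5, 0.2)
print(query)
print(access_urls(RESPONSE))
```

Locating the column by its identifier rather than by position is the design point: the service interface fixes the meaning of a column, not the layout of the provider's table.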
IVOA (www.ivoa.net)
• E.g. Simple Spectrum Access Protocol
– http://ivoa.net/Documents/REC/DAL/SSA-20080201.pdf
– The Simple Spectrum Access (SSA) Protocol (SSAP)
defines a uniform interface to remotely discover and access
one dimensional spectra. SSA is a member of an integrated
family of data access interfaces altogether comprising the
Data Access Layer (DAL) of the IVOA.
– SSA is based on a more general data model capable of
describing most tabular spectrophotometric data, including
time series and spectral energy distributions (SEDs) as well
as 1-D spectra; however the scope of the SSA interface as
specified in this document is limited to simple 1-D spectra,
including simple aggregations of 1-D spectra.
47
Discussion
• Theoretical concepts - do we have any hope?
• Data models – could you develop one?
• Forms of Schema?
• Service paradigms?
• Relation to data management?
48
Summary
• Informatics in relation to data science
– Discuss?
• Data models and schema and the tools that
go with them are plentiful
• Modern use of XML and specific markup
languages obscures the underlying data
structure (physical and logical) but has other
advantages
• Data as service carries this to another level
49
What is next
• Next week – watch your email.
• Next lecture (#12) – Nov. 26th.
– Webs of data, data on the web, deep Web,
data discovery, data citation
• Reading:
– See web site for this week
50
How about those projects?
51