Webs of Data and Data on
the Web, the Deep Web, Data
Discovery, Data Citation
Peter Fox
Data Science – CSCI/ERTH/ITWS-4350/6350
Week 12, November 26, 2013
1
Contents
• Review of reading assignment
• Webs of data and semantic web
• Data on the web, linked data
• Deep web
• Data discovery
• Data citation
• Summary
• Next week
2
Reading
• Mealy
• Wickett et al.
• Data Quality European Union Presentation
• ISO Technical Standards - General
Reference
3
Webs of data (science)
• Early Web - Web of pages
• http://www.ted.com/index.php/talks/tim_berners_lee_on_the_next_web.html
• Semantic web started as a way to facilitate
“machine accessible content”
– Initially was available only to those with familiarity
with the languages and tools, e.g. your parents
could not use it
• Webs of data grew out of this
– One specific example is W3C’s Linked Open
Data
4
Semantic Web
• http://www.w3.org/2001/sw/
• “The Semantic Web provides a common
framework that allows data to be shared and
reused across application, enterprise, and
community boundaries. It is a collaborative
effort led by W3C with participation from a
large number of researchers and industrial
partners. It is based on the Resource
Description Framework (RDF)...”
5
Terminology
• Semantic Web
– An extension of the current web in which
information is given well-defined meaning, better
enabling computers and people to work in
cooperation, www.semanticweb.org
– Primer: http://www.ics.forth.gr/isl/swprimer/
• Ontology (n.d.). The Free On-line Dictionary of
Computing.
http://dictionary.reference.com/browse/ontology
– An explicit formal specification of how to
represent the objects, concepts and other entities
that are assumed to exist in some area of interest
and the relationships that hold among them.
6
Semantic Web Layers
7
http://www.w3.org/2003/Talks/1023-iswc-tbl/slide26-0.html, http://flickr.com/photos/pshab/291147522/
Application Areas for SW
• Smart search
• Annotation (even simple forms), smart tagging
• Geospatial
• Implementing logic (rules), e.g. in workflows
• Data integration
• Verification …. and the list goes on
• Web services
• Web content mining with natural language parsing
• User interface development (portals)
• Semantic desktop
• Wikis - OntoWiki, SemanticMediaWiki
• Sensor Web
• Software engineering
• Explanation
8
Semantic Web Basics
• The triple: {subject-predicate-object}
Interferometer is-a optical instrument
Optical instrument has focal length
• W3C is the primary (but not sole) governing org.
– RDF
– OWL 1.0 and 2.0 - Web Ontology Language
• RDF
– programming environments for 14+ languages, including C, C++, Python, Java, JavaScript, Ruby, PHP, ... (no Cobol or Ada yet ;-( )
• OWL programming for Java
• Closed World - where complete knowledge is known (encoded); AI relied on this
• Open World - where knowledge is incomplete/evolving; SW promotes this
9
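A minimal sketch of the two example triples, in plain Python rather than an RDF library. The helper names (`superclasses`, `properties`, `ask`) and the idea of inheriting "has" through "is-a" are illustrative assumptions, not part of RDF or OWL. The `ask` function contrasts the two world assumptions: a missing statement is false in a closed world and merely unknown in an open world.

```python
# Triples as plain (subject, predicate, object) tuples; names are illustrative.
triples = {
    ("Interferometer", "is-a", "OpticalInstrument"),
    ("OpticalInstrument", "has", "focalLength"),
}


def superclasses(term, graph):
    """Follow is-a links transitively, starting from term."""
    found = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for s, p, o in graph:
            if s == current and p == "is-a" and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def properties(term, graph):
    """Properties stated for term or for any class reached through is-a."""
    classes = {term} | superclasses(term, graph)
    return {o for s, p, o in graph if p == "has" and s in classes}


def ask(statement, graph, open_world=True):
    """Closed world: absent means False. Open world: absent means unknown (None)."""
    if statement in graph:
        return True
    return None if open_world else False


inherited = properties("Interferometer", triples)
unstated = ("Interferometer", "has", "mirror")
print(inherited, ask(unstated, triples), ask(unstated, triples, open_world=False))
```

The same question gets two different answers depending on the assumption, which is the practical difference between the two bullets above.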
Ontology Spectrum
[Figure: spectrum from informal to formal: Catalog/ID; Terms/glossary; Thesauri “narrower term” relation; Informal is-a; Formal is-a; Formal instance; Frames (properties); Value Restrs.; Selected Logical Constraints (disjointness, inverse, …); General Logical constraints]
Originally from AAAI 1999- Ontologies Panel by Gruninger, Lehmann, McGuinness, Uschold, Welty;
– updated by McGuinness.
Description in: www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-abstract.html
10
Semantic Web Myths
• ‘the Semantic Web is a reincarnation of Artificial Intelligence
on the Web’ (closed world versus open world)
• ‘it relies on giant, centrally controlled ontologies for
"meaning" (as opposed to a democratic, bottom-up control of
terms)’
• ‘one has to add metadata to all Web pages, convert all
relational databases, and XML data to use the Semantic
Web’
• ‘one has to learn formal logic, knowledge representation
techniques, description logic, etc, to use it’
• ‘it is, essentially, an academic project, of no interest for
industry’
11
Integrating Multiple Data Sources
• The Semantic Web lets us merge
statements from different sources
• The RDF Graph Model allows
programs to use data uniformly
regardless of the source
• Figuring out where to find such
data is a motivator for Semantic
Web Services
[Figure: RDF graph around #Ionosphere with labels hasCoordinates, #magnetic, name, “Terrestrial Ionosphere”, hasLowerBoundaryValue, “100”, hasLowerBoundaryUnit, “km”. Different line & text colors represent different data sources.]
12
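A rough sketch of merging statements from different sources, using the labels from the figure. How the labels pair up into statements, and which source each statement comes from, are assumptions made for illustration. Because every statement has the same triple shape, merging is just a set union.

```python
# Two hypothetical sources describing the same resource.
source_a = {
    ("#Ionosphere", "name", "Terrestrial Ionosphere"),
    ("#Ionosphere", "hasCoordinates", "#magnetic"),
}
source_b = {
    ("#Ionosphere", "hasLowerBoundaryValue", "100"),
    ("#Ionosphere", "hasLowerBoundaryUnit", "km"),
}

# Merging graphs is a union of their statements.
merged = source_a | source_b


def describe(subject, graph):
    """Collect every predicate/object stated about subject, whatever the source."""
    return {p: o for s, p, o in graph if s == subject}


description = describe("#Ionosphere", merged)
print(description)
```

A program reading `description` does not need to know which source supplied which statement.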
Drill Down /Focused Perusal
• The Semantic Web uses Uniform Resource Identifiers (URIs) to name things
• These can typically be resolved to get more information about the resource
• This essentially creates a web of data analogous to the web of text created by the World Wide Web
• Ontologies are represented using the same structure as content
– We can resolve class and property URIs to learn about the ontology
[Figure labels: …#NeutralTemperature, measuredby, …#Norway, locatedIn, ...#ISR, ...#FPI, type, operatedby, ...#MilllstoneHill, …#EISCAT, Internet]
13
Statements about Statements
• The Semantic Web allows us to make statements about statements
– Timestamps
– Provenance / Lineage
– Authoritativeness / Probability / Uncertainty
– Security classification
– …
• This is an unsung virtue of the Semantic Web
[Figure labels: #Danny’s, #Aurora, hasSource, hasDateTime, hascolor, 20031031, Red]
Ontologies Workshop, APL May 26, 2006
14
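A small sketch of a statement about a statement, reusing the figure labels. The way the labels are connected here (an aurora colour observation annotated with a source and a date) is an assumption, and the `Statement` class is a hypothetical stand-in for RDF reification, not an RDF API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    subject: object   # a resource name or another Statement
    predicate: str
    obj: object


# The base statement.
observation = Statement("#Aurora", "hascolor", "Red")

# Statements whose subject is the statement above.
about_observation = {
    Statement(observation, "hasSource", "#Danny's"),
    Statement(observation, "hasDateTime", "20031031"),
}


def annotations(target, statements):
    """Return what has been said about the target statement."""
    return {s.predicate: s.obj for s in statements if s.subject == target}


provenance = annotations(observation, about_observation)
print(provenance)
```

Timestamps, provenance and uncertainty all fit this pattern: the annotated thing is itself a statement.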
‘Collecting’ the ‘data’
• Part of the (meta)data information is present in tools ... but thrown away at output. E.g., a business chart can be generated by a tool: it ‘knows’ the structure, the classification, etc. of the chart, but, usually, this information is lost. Storing it in web data would be easy!
• SW-aware tools are around (even if you do not
know it...), though more would be good:
– Photoshop CS stores metadata in RDF in, say, jpg files
(using XMP)
– RSS 1.0 feeds are generated by (almost) all blogging
systems (a huge amount of RDF data!)
15
‘Collecting’ the ‘data’
• Scraping - different tools, services, etc, come
around every day:
– get RDF data associated with images, for
example: service to get RDF from flickr images
– service to get RDF from XMP
– XSLT scripts to retrieve microformat data from
XHTML files
– scripts to convert spreadsheets to RDF – e.g. see
csv2rdf4lod and the tools, tutorials, demos at
http://logd.tw.rpi.edu
– schema.org and the datasets extension
16
‘Collecting’ the ‘data’
• SQL - A huge amount of data in Relational
Databases
– Although tools exist, it is not feasible to convert that data
into RDF
– Instead: SQL ⇋ RDF ‘bridges’ are being developed: a
query to RDF data is transformed into SQL on-the-fly
– Reading for this week: articles by Berners-Lee and by Sahoo et al.
– W3C RDB2RDF Working Group http://www.w3.org/2001/sw/rdb2rdf/
– D2RQ / D2R Server
– Commercial solutions appearing
• NoSQL
• Other ‘graph’ forms…
17
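A toy sketch of the ‘bridge’ idea: a triple pattern is translated into SQL on the fly, so the relational data is never converted to RDF in bulk. The table, the columns, the `MAPPING` dictionary and the sample rows are all made up for illustration; this is not the D2RQ mapping language or the W3C RDB2RDF approach.

```python
import sqlite3

# Hypothetical mapping: each predicate corresponds to one column of one table.
MAPPING = {
    "name": ("instrument", "name"),
    "locatedIn": ("instrument", "site"),
}


def pattern_to_sql(predicate):
    """Translate the triple pattern (?s, predicate, ?o) into a SQL query."""
    table, column = MAPPING[predicate]
    return f"SELECT id, {column} FROM {table} WHERE {column} IS NOT NULL"


def query_triples(conn, predicate):
    """Answer the pattern straight from the relational data."""
    rows = conn.execute(pattern_to_sql(predicate)).fetchall()
    return [(f"#instrument/{row_id}", predicate, value) for row_id, value in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instrument (id INTEGER PRIMARY KEY, name TEXT, site TEXT)")
conn.executemany(
    "INSERT INTO instrument (name, site) VALUES (?, ?)",
    [("Instrument A", "Site X"), ("Instrument B", None)],
)

located = query_triples(conn, "locatedIn")
print(located)
```

Only the rows needed for the pattern are read, and they are turned into triples at query time.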
More Collecting
• RDFa extends XHTML by:
– extending the link and meta elements to include child elements
– adding metadata to any element (a bit like the class attribute in microformats, but via dedicated properties)
– Used in schema.org/ datasets
• It is very similar to microformats, but with more
rigor:
– it is a general framework (instead of an ‘agreement’ on
the meaning of, say, a class attribute value)
– terminologies can be mixed more easily
• GRDDL - Gleaning Resource Descriptions from
Dialects of Languages
18
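A simplified sketch of reading metadata carried by dedicated attributes, using only the Python standard library. It looks at the RDFa attributes `about`, `property` and `content` and ignores everything else (prefixes, nesting, `typeof`, and so on), so it is not a conforming RDFa processor. The HTML snippet and the `PropertyCollector` class are made up for illustration.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collect (subject, property, value) from about/property/content attributes."""

    def __init__(self):
        super().__init__()
        self.subject = None   # most recent about= value (no nesting handled)
        self.pending = None   # property waiting for its element text
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "about" in attrs:
            self.subject = attrs["about"]
        if "property" in attrs:
            if "content" in attrs:
                # The value is given explicitly in the content attribute.
                self.triples.append((self.subject, attrs["property"], attrs["content"]))
            else:
                # The value is the text inside the element.
                self.pending = attrs["property"]

    def handle_data(self, data):
        if self.pending and data.strip():
            self.triples.append((self.subject, self.pending, data.strip()))
            self.pending = None


SNIPPET = """
<div about="#dataset1">
  <span property="dc:title">Example dataset</span>
  <meta property="dc:date" content="2013-11-26" />
</div>
"""

collector = PropertyCollector()
collector.feed(SNIPPET)
print(collector.triples)
```

The page stays readable to people while the same markup yields statements for programs.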
Linked open data
• http://linkeddata.org/guides-and-tutorials
• http://tomheath.com/slides/2009-02-austinlinkeddata-tutorial.pdf (we will look at some of
these slides now, #1-25 and 30-37)
• And of course:
– http://logd.tw.rpi.edu/
19
September 2011 - http://lod-cloud.net/
20
“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
(Class 2) Management
• Creation of logical collections
• Physical data handling
• Interoperability support
• Security support
• Data ownership
• Metadata collection, management and access.
• Persistence
• Knowledge and information discovery
• Data dissemination and publication
21
Data Management and WOD
• How is the data managed?
– Found?
– Curated?
• What about the metadata?
• What problems are introduced/ solved?
• See discussion in: Parsons and Fox (2012):
http://mp-datamatters.blogspot.com/
22
Data on the Web, Internet
• Data behind web services
• Data files on web sites
• We have covered data as service approaches
(week 11)
• Thinking you have found data when you have
really only found information and metadata
• The real difference between this topic and the
next one is:
– Access and dissemination
– Level of curation (and often description)
23
Data on the internet
• http://www.dataspaceweb.org/
• Data files on other protocols
– FTP
– RFTP
– GridFTP
– SABUL
– XMPP/AMQP
– Others…
24
Deep web
• Data behind web services
• Data behind query interfaces (databases or
files)
• Introduces a different curation problem
25
The loose definition
• Something that a crawler cannot find and/or
index
– Creates the other definition of shallow web
• Has many implications for discovery, access
and use
• Curation is more complex to satisfy this
definition, i.e. not a matter of just putting files
‘on the web’
• 50, 100, 1000 times the ‘shallow web’?
26
Managing (in) the deep web
• Sometimes, the deep web aspect of a data
source can be due to extreme obscurity,
language peculiarities, NO metadata, NO
documentation
• There are no known studies of how effective data management (what you are learning) could change the proportion of deep to shallow web
• Semantics are often put forward as a solution
http://www.mkbergman.com/458/new-currents-in-the-deep-web/
27
Internet impacts on management
• Management of data that is… on the Internet!
• Web –> ‘stateless’
• Curation, Preservation –> highly stateful (by
definition)
• You will hear terms such as digital curation
and digital preservation but what about
internet curation and internet preservation
(Internet Archive)?
28
Thus data frameworks are appearing
• Many – meaning they go beyond web sites,
they incorporate many of the data
management functions
• Initially syntactic – e.g. OPeNDAP, ADDE,
ODATA, OODT
• Application oriented – e.g. virtual
observatories
– Semantic – e.g. Virtual Solar-Terrestrial
Observatory
• ALL of these are changing the nature of data
management and role of data ‘providers’
29
30
Some Definitions
DAP = Data Access Protocol
• Model used to describe the data;
• Request syntax and semantics; and
• Response syntax and semantics.
OPeNDAP
• The software;
• Numerous reference implementations;
• Core/libraries and services (servers and clients).
OPeNDAP Inc.
• OPeNDAP is a 501(c)(3) non-profit corporation;
• Formed to maintain, evolve and promote the discipline-neutral DAP that was the DODS core infrastructure.
BOM, Melbourne, VIC
31
Considerations with regard to the development of DAP and OPeNDAP
• Many data providers
• Many data formats
• Many different client types
• Many different semantic representations of the data
• Many different security requirements
32
Broad Vision
A world in which a single data access protocol
is used for the exchange of data between
network based applications regardless of
discipline.
A layer above TCP/IP providing for syntactic and
semantic consistency not available in existing
protocols such as FTP.
33
Practical Considerations
The broad vision:
• Is syntactically achievable, but
• Was not semantically achievable, at least not fully, but perhaps in the near term.
34
The Data Access Protocol (DAP)
• The DAP has been designed to be as general as possible without being constrained to a particular discipline or world view.
• The DAP is a discipline-neutral data access protocol; it is being used in astronomy, medicine, earth science,…
• Provides transparency of data format, location, and organization
• Is metadata neutral
35
OPeNDAP V4 (Hyrax) Architecture
[Figure labels: Client, OLFS, BES]
• OPeNDAP Lightweight Front end Server (OLFS)
– Receives requests and asks the BES to fill them
– Uses Java Servlets
– Does not directly ‘touch’ data
– Multi-protocol
• Back End Server (BES)
– Reads data files, databases, etc., and returns info
– May return DAP2 objects or other data
– Does not require a web server
36
OPeNDAP Clients
[Figure: clients reaching OPeNDAP data over the Internet. Labels: netCDF Java, netCDF C, Ferret, GrADS, IDV, Web Browser, VisAD, NCL, IDL Client, Matlab Client, ncBrowse, Access, pyDAP, Excel, OPeNDAP Data Connector, ArcGIS]
37
OPeNDAP Servers
[Figure: general-purpose servers and the data they serve over the Internet. Labels: CDM, ESML, netCDF, HDF4, HDF5, DSP, Tables, SQL, FITS, CDF, Flat Binary, CEDAR, JGOFS, JDBC, FreeForm]
38
OPeNDAP Servers (specialized processing)
[Figure labels: pyDAP, ESG, FDS, GDS, DAPPER, CODAR, TDS; data labelled General, netCDF, GRIB, BUFR, CODAR; OPeNDAP; Internet]
39
Servers
• Servers may also provide other services
– Directory traversal.
– Browser-based form to build URL.
– ASCII or other representations of data.
– Metadata associated with the data.
– Server side functions.
40
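A short sketch of what a "form to build URL" produces for a DAP2 server: the dataset URL, a response suffix, and an optional constraint that subsets one variable with `[start:stride:stop]` index ranges. The server address, file name and variable name are hypothetical, and the helper functions are illustrative rather than part of any OPeNDAP client library.

```python
def hyperslab(start, stop, stride=1):
    """One DAP2 index range, written [start:stride:stop]."""
    return f"[{start}:{stride}:{stop}]"


def dap_url(dataset_url, response, variable=None, slices=()):
    """Build a DAP2 request URL for one response type and optional subset."""
    if response not in {"dds", "das", "dods", "ascii"}:
        raise ValueError(f"unsupported response type: {response}")
    url = f"{dataset_url}.{response}"
    if variable:
        # Constraint expression: variable name followed by one range per dimension.
        url += "?" + variable + "".join(hyperslab(*s) for s in slices)
    return url


# Hypothetical dataset and variable.
structure_url = dap_url("http://example.org/opendap/sst.nc", "dds")
subset_url = dap_url(
    "http://example.org/opendap/sst.nc", "ascii", "sst", [(0, 0), (10, 20, 2)]
)
print(structure_url)
print(subset_url)
```

The client asks for structure (`dds`), attributes (`das`) or values (`dods`, `ascii`) through the same URL pattern, whatever file format sits behind the server.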
Summary
[Figure labels: Discovery, Inventory, Detail; Search, Catalog, Data]
Tetherless World Constellation
41
Data discovery
• Free text search on the internet/ web
• Data portals
• What makes discovery work?
– For Deep Web?
– For Linked Data?
42
Data discovery
• What makes discovery work?
– Metadata
– Logical organization
– Attention to the fact that someone would want to
discover it
– It turns out that file types are a key enabler or
inhibitor to discovery
• What does not work?
– Result ranking using *any* conventional
algorithms
43
Smart search
• Semantically aware search, e.g.
http://noesis.itsc.uah.edu
• Faceted search, e.g.
– mspace (http://mspace.fm )
– jSpace
– Exhibit (MIT)
– S2S – e.g. International Open Government
Dataset Catalog (IOGDC; http://logd.tw.rpi.edu )
44
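A minimal sketch of faceted search over catalog metadata. The catalog records, field names and helper functions are made up for illustration and are not taken from mSpace, Exhibit or S2S. Each selection narrows the result set, and the counts for the remaining facet values are recomputed from what is left.

```python
from collections import Counter

# Hypothetical catalog records with consistent metadata fields.
CATALOG = [
    {"title": "Dataset A", "format": "netCDF", "discipline": "earth science"},
    {"title": "Dataset B", "format": "HDF5", "discipline": "earth science"},
    {"title": "Dataset C", "format": "FITS", "discipline": "astronomy"},
]


def select(records, **facets):
    """Keep the records that match every chosen facet value."""
    return [r for r in records if all(r.get(k) == v for k, v in facets.items())]


def facet_counts(records, field):
    """Count how many of the given records carry each value of a facet."""
    return Counter(r[field] for r in records)


earth = select(CATALOG, discipline="earth science")
remaining_formats = facet_counts(earth, "format")
print([r["title"] for r in earth], dict(remaining_formats))
```

This only works when the metadata fields are filled in consistently, which is the point of the discovery slides.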
NOESIS
45
Search Application integration!
Deep web dashboards…
47
http://logd.tw.rpi.edu
Intl. Open Govt. Data Cat.
Federated search
• “is the simultaneous search of multiple online
databases or web resources and is an emerging
feature of automated, web-based library and
information retrieval systems. It is also often
referred to as a portal or a federated search
engine.” (Wikipedia)
• Libraries have been doing this for a long time (Z39.50, ISO 23950)
• Key is consistent search metadata fields (keywords)
• E.g. Geospatial One Stop http://www.geodata.gov
49
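A rough sketch of the federated idea: one query is sent to several sources at once and the answers are combined. The two in-memory "sources" and their records are hypothetical stand-ins for remote catalogs; this does not implement Z39.50. The search works only because every source exposes the same `keywords` field.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sources that share one metadata field: keywords.
SOURCES = {
    "library": [{"keywords": ["ionosphere", "radar"], "title": "Record 1"}],
    "portal": [
        {"keywords": ["ionosphere"], "title": "Record 2"},
        {"keywords": ["ocean"], "title": "Record 3"},
    ],
}


def search_source(name, keyword):
    """Search a single source; a real system would make a network call here."""
    return [(name, rec["title"]) for rec in SOURCES[name] if keyword in rec["keywords"]]


def federated_search(keyword):
    """Query every source simultaneously and merge the results."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(search_source, name, keyword) for name in SOURCES]
        results = []
        for future in futures:
            results.extend(future.result())
    return sorted(results)


hits = federated_search("ionosphere")
print(hits)
```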
Data Citation
• “Sound, reproducible scholarship rests upon
a foundation of robust, accessible data. For
this to be so in practice as well as theory,
data must be accorded due importance in the
practice of scholarship and in the enduring
scholarly record. In other words, data should
be considered legitimate, citable products of
research. Data citation, like the citation of
other evidence and sources, is good research
practice.”
(http://www.force11.org/datacitation)
50
Landing page – a short form
http://data.rpi.edu/repository/handle/10833/24
Long form
http://data.rpi.edu/repository/handle/10833/24?show=full
Conneg (content negotiation)
• Many examples, but what follows is adapted from: http://www.crosscite.org/cn/
• Also see - http://labs.crossref.org/ and http://data.datacite.org/
• What is it?
– Est-ce que vous parlez français?
– Do you speak HTML or JSON or RDF?
Conneg
Application level, e.g. as JSON
Supported content types..
Example formatting..
Other coolness…
• curl -LH "Accept: application/vnd.crossref.unixref+xml;q=1, application/rdf+xml;q=0.5" http://dx.doi.org/10.1126/science.169.3946.635
• curl http://data.datacite.org/application/x-datacite+text/10.5524/100005
– Li, J; Zhang, G; Lambert, D; Wang, J; (2011): Genomic data from the Emperor penguin (Aptenodytes forsteri); GigaScience. http://dx.doi.org/10.5524/100005
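The first curl command above can be sketched with the Python standard library. The helper name `negotiated_request` is an assumption for illustration. The request is only built here, not sent; the `q` values rank the formats the client would accept, and the server picks the best one it supports.

```python
import urllib.request


def negotiated_request(doi, preferences):
    """Build a DOI request whose Accept header ranks the wanted content types."""
    accept = ", ".join(f"{mime};q={q}" for mime, q in preferences)
    return urllib.request.Request(
        f"http://dx.doi.org/{doi}", headers={"Accept": accept}
    )


request = negotiated_request(
    "10.1126/science.169.3946.635",
    [("application/vnd.crossref.unixref+xml", 1), ("application/rdf+xml", 0.5)],
)
print(request.full_url)
print(request.get_header("Accept"))

# To actually fetch the metadata (network access needed; redirects are
# followed by default, as with curl -L):
# with urllib.request.urlopen(request) as response:
#     body = response.read().decode("utf-8")
```

The same DOI therefore serves a landing page to a browser and structured metadata to a program.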
Further integration..
Additional refs.
• EPIC for identifier conventions:
http://pidconsortium.eu
• DSpace and Handle: http://tw.rpi.edu/web/project/Data.rpi.edu/Architecture (install/config notes in PDF and in Section 4.4.4, page 55 of the Handle installation manual for V 1.8)
Aren’t you happy this is the last lecture?
• Otherwise – we’d go into a
long discussion of the merit
of data citation to achieve the
business case in an earlier
slide
• It would be enlightening but
torture…

63
Summary
• Theme of data management in the chaotic
and enabling environment of the web, internet
• Emergence of frameworks that encompass
some aspects of data management
• Unlocking data in a useful way is an immense
challenge (discovery, citation?)
• Anything/ everything you can do by following
what you have learned in this course will help
64
What is next
• Dec. 3 – project presentations
• Final assignment to be handed in today!
• Reading for this week:
– Semantic Deep Web, James Geller, Soon Ae
Chun, and Yoo Jung An,
– The Deep Web (Internet Tutorials)
– Digital Image Resources on the Deep Web
– Parsons and Fox: Is Data Publication the Right
Metaphor?
• Class evaluations…
65