Transcript Document
The Next Forty Years
March 2012
Michael Lang, CEO, Revelytix

Sixty Years Ago
“Turing's Cathedral” by George Dyson
• In 1945 the DOD funds the Institute for Advanced Study in Princeton, NJ to build MANIAC (Mathematical and Numerical Integrator and Computer)
• Command and memory address are in the same bit string
• At that address was data represented in binary notation
• The only abstraction was the mapping of binary representation to decimal notation
• Between 1945 and 1970, computers were referred to as numerical computers
• So it began ...

The Last Forty Years
In 1970, E.F. Codd of the IBM Research Laboratory in San Jose, California, wrote a paper published in the Communications of the ACM, “A Relational Model of Data for Large Shared Data Banks”
• Codd wrote, “The problems treated here are those of data independence – the independence of application programs from the growth in data types and changes in data representation...”
• This problem is otherwise known as “abstraction”
• Codd’s paper set in motion the data management system architecture for the next forty years. These systems are known as relational database management systems (RDBMS)
• Computers were referred to as “Information Technology” (IT)

RDBMS
The RDBMS solves only some of Codd’s issues
• Hardware and software were insufficient to solve the whole problem at the time
• Applications continue to be severely impacted by the growth in data types or changes in data representation
• But applications are independent of the ordering of the data and the indexing schemes
• RDBMS do provide ACID guarantees for CRUD operations, which was Codd's original goal

Paradigm Shift
In 1985 the mainframe/terminal paradigm was replaced by the client/server paradigm
• Oracle, Sybase and others ported their new RDBMS to this paradigm
• Though RDBMSs had been around for ~8 years, market acceptance did not take off until they beat IMS and VSAM to the new client/server paradigm
• It’s hard to say which technology was the chicken and which was the egg

Transactional Systems
The primary early use of RDBMS technology was to create and store transactions
• RDBMS were and still are optimized for transactions; they are very good at this task
• Later, businesses wanted to analyze the collections of data being created
• Can systems optimized for transactions also be optimized for analysis? There are two large issues...

Issue #1
Systems optimized for creating data in a transactional framework require a fixed schema
• The meaning of the data elements is fixed by the schema
• There is no requirement for schema evolution in RDBMS because the primary mission is ACID/CRUD operations
• There is no way to say how data defined in one schema relates to data defined by another schema

Issue #2
The required data is typically stored in many databanks
• It needs to be moved and combined
• What assurance is there that similar data in different databanks represents the same thing?
• Analysis is not possible until the precise meaning of all required data in all databanks is known
• Data is not easily combined

Data Warehouse
We have twisted the RDBMS and the client/server paradigm into the realm of analysis through ETL and data warehousing
• All of the data is moved to the same databank
• Lots of highly custom, one-off work is done to determine the meaning of each data element and how it needs to be transformed for the new target schema (a small sketch of this follows below)
• It remains a rigid schema and a siloed server!
• We need to deal with massively distributed data
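To make Issue #2 and the one-off ETL work concrete, here is a minimal sketch, not from the presentation, in which the databanks, tables and column names are all invented. Two sources hold the same kind of fact under different names, nothing machine-readable says so, and a hand-written mapping has to carry that knowledge for every new source and every schema change.

    # Two "databanks" exporting rows about the same customers; the column
    # names differ and nothing machine-readable relates them to each other.
    orders_db = [{"cust_nm": "Acme Corp", "ord_amt": 1200.0}]
    crm_db = [{"CUSTOMER_NAME": "Acme Corp", "REGION": "US-East"}]

    # One-off ETL mapping into the warehouse's rigid target schema, written
    # and maintained by hand for exactly these two sources.
    COLUMN_MAP = {
        "cust_nm": "customer_name", "CUSTOMER_NAME": "customer_name",
        "ord_amt": "order_amount", "REGION": "region",
    }

    def to_warehouse(row):
        """Rename the columns of one source row into the target schema."""
        return {COLUMN_MAP[col]: value for col, value in row.items()}

    warehouse = [to_warehouse(row) for row in orders_db + crm_db]
    print(warehouse)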
The Last Forty Years
Siloed Information Management Systems
• All data in a single shared databank
• Rigid schemas
• Data and metadata are different types of things
• Query processor only knows about its local data expressed in a fixed schema
• Schema not fixed for NoSQL
• Excellent ACID/CRUD capability
Enterprise data management remains an elusive goal

Timeline
1970 – Codd proposes the relational paradigm
1977 – The first RDBMS arrive: Oracle, INGRES
1980 – SQL developed and several other RDBMS arrive: Sybase, SQL Server, DB2, Informix
1985 – Client/server paradigm
1990 – RDBMS mainstream
Elapsed time = twenty years

Acceptance of New Paradigm
Twenty years were required for large enterprises to accept an idea introduced in 1970. Why?
• New products had to be created
• A new networking paradigm had to fall into place
• Strategic uses of the new technology had to be articulated and translated to business uses

Paradigm Shift: DARPA and DAML
With ARPAnet and TCP/IP in place by 1990, DARPA turned its attention to the problem of understanding the meaning of the data
• Their computers could “hear” each other, but could not understand each other
• DARPA created DAML (DARPA Agent Markup Language) in 2000 to create a common language – www.daml.org

The World Wide Web Consortium
The W3C had evolved ARPAnet into a highly reliable, distributed system for managing unstructured content using TCP/HTTP/HTML
• A grand slam for distributed information management
• The system did not work for structured content, i.e. data
2004 – DARPA hands off DAML to the W3C
• The W3C evolves DAML into the RDF, OWL and SPARQL standards
• Collectively these standards comprise what most people mean by “semantic technology”

The World Wide Web
The WWW brings the next paradigm shift in information technology after client/server
• It is a highly distributed architecture, vastly more so than client/server
• Domain names
• Uniform Resource Locators (URLs)
• Uniform Resource Identifiers (URIs)
Can we build on this highly distributed infrastructure to benefit enterprise information management?

Semantic Technology
This paradigm assumes data is completely distributed, but that anyone/anything should be able to find it and use it
• RDF is the data model
• OWL is the schema model
• SPARQL is the query language
• URIs are the unique identifiers
• URLs are the locators

Description
RDF and OWL are excellent formal description languages
Anyone can say Anything about Anything, Anywhere
• Descriptions are both human and machine readable
• Locations are already described by URLs and identified by IRIs
• The meaning and location of any data can now be interpreted by computers, or humans
These technologies enable the new paradigm (a small RDF and SPARQL sketch follows below)

The Next Forty Years
The information management technology for the next forty years will all rest on precise, formalized descriptions of “things”
• Schema, data, the real world, mappings, rules, business terms, processes, logic, relationships between descriptions ...
• Descriptions provide a level of abstraction above the current information management infrastructure
• Descriptions are absolutely required to use distributed data
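As a minimal sketch of these standards in use, assuming the open-source Python rdflib library and using invented URIs and property names, the fragment below publishes a small RDF description of a thing identified by a URI and then asks a SPARQL question about it.

    from rdflib import Graph

    # A description in RDF (Turtle syntax): anyone can publish statements
    # like these, anywhere, about anything identified by a URI.
    ttl = """
    @prefix ex: <http://example.com/ns#> .

    <http://example.com/swap/123>
        a ex:InterestRateSwap ;
        ex:hasCounterparty <http://example.com/party/acme> ;
        ex:notionalAmount 10000000 .
    """

    g = Graph()
    g.parse(data=ttl, format="turtle")

    # SPARQL is the query language over that data model.
    q = """
    PREFIX ex: <http://example.com/ns#>
    SELECT ?swap ?party WHERE {
        ?swap a ex:InterestRateSwap ;
              ex:hasCounterparty ?party .
    }
    """
    for row in g.query(q):
        print(row.swap, row.party)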
The Next Forty Years
DIMS: Distributed Information Management System

The Next Forty Years
Distributed Information Management Systems
• Data, metadata and logic are completely distributed, but all machine readable
• All information is immediately accessible by computers and people
Extensibility
• Constant change is assumed
• Distributed & federated
Emergent Analytic Capability
• Reasoning

DIMS
A Distributed Information Management System is a layer above your current DBMS, just like a DBMS is a layer above a file system
• Both provide an additional level of abstraction
• Both bundle new computational capabilities into the system
• Both simplify the access to and use of data by applications and developers

Timeline
2002 – DARPA publishes work on DAML
2004 – W3C creates the RDF and OWL recommendations
2006 – The first triple stores and RDF editing tools are available; SPARQL is a recommendation
2011 – The first DIMS is available
We are just getting to the point of enterprise adoption

DIMS
[Architecture diagram: relational databases and their schemas (source ontologies) are mapped via R2RML to a domain ontology; SPARQL carries data in and out; rules (RIF) produce inferred data; SPARQL also supports data validation and analysis. A mapping sketch follows the maturity diagrams below.]

Maturity Level 1: No Agility; Does Not Scale
[Diagram: an application layer (reporting and analytics, search, data marts, business application data services, ad hoc data services, e-discovery and live data services) sits directly on a data/application layer of data warehouses, analytic and operational data, mainframe OLTP application stores, multiple vendor databases, text data and multiple file formats (COBOL/VSAM, fixed ASCII, Excel, CSV, XML).]

Maturity Level 2: Better; Data Management Still an Issue
[Diagram: a data service layer (file services for ASCII/XML/batch, SOA and web data services, pub/sub and API data services, connectivity via JDBC/ODBC/native drivers) and a virtualization layer with an optional cache database are added between the application layer and the same data/application layer; workarounds rationalize and virtualize the data.]

Maturity Level 3: Best Practice; Solid Data Management & Reduced Risk
[Diagram: a semantic/catalog layer and a semantic storage layer (RDF and XML data storage, semantic integration services, metadata services) plus semantic search (RDF search, SPARQL) are added above the data service and virtualization layers of Level 2.]
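As a rough illustration of the R2RML mappings named in the DIMS diagram above, here is a minimal sketch in which the table, columns and domain-ontology terms are invented; only the rr: vocabulary comes from the W3C R2RML standard, and this is not Revelytix's actual mapping. It shows the shape of a declaration that exposes rows of a relational table as instances of a domain ontology class, and, because the mapping is itself RDF, it can be catalogued and queried like any other description (rdflib is assumed again).

    from rdflib import Graph

    # An R2RML mapping (Turtle): each row of a relational TRADES table is
    # exposed as an RDF resource typed and described with domain-ontology terms.
    r2rml = """
    @prefix rr: <http://www.w3.org/ns/r2rml#> .
    @prefix ex: <http://example.com/ontology#> .

    <http://example.com/mappings/trades> a rr:TriplesMap ;
        rr:logicalTable [ rr:tableName "TRADES" ] ;
        rr:subjectMap [
            rr:template "http://example.com/trade/{TRADE_ID}" ;
            rr:class ex:InterestRateSwap
        ] ;
        rr:predicateObjectMap [
            rr:predicate ex:hasCounterparty ;
            rr:objectMap [ rr:column "COUNTERPARTY" ]
        ] .
    """

    # The mapping is ordinary RDF, so it parses into a graph like any other data.
    g = Graph()
    g.parse(data=r2rml, format="turtle")
    print(len(g), "triples in the mapping")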
Where Revelytix Tools Fit in a Semantic Framework
[Diagram: the Maturity Level 3 architecture, highlighting where Revelytix tools operate: SPARQL queries against the semantic/catalog layer, an RDF data store in the semantic storage layer, and semantic integration and metadata services above the data service, virtualization and data/application layers.]

Two Use Cases
Classifying swaps and aggregating risk by counterparty using the FIBO ontology
• Working with the EDM Council (EDMC) and regulators
Information provenance to infer which data sets to use for specific applications
• Working with customers to automate data discovery and access in very complex, large data centers

Financial Industry Business Ontology
An industry initiative to define financial industry terms, definitions and synonyms using semantic web principles
[Diagram: industry standards in diverse formats (FpML, FIX, ISO 20022, XBRL, MISMO) are input to FIBO models of business entities, securities, loans, derivatives and corporate actions; the models are built in the OMG with a UML tool and generated (via ODM) into semantic web ontologies in RDF/OWL, with graphical displays.]

Business and Operational Ontologies
Requirement #1: Define uniform and expressive financial data standards
Business ontology (a.k.a. “conceptual model”)
• Defines transaction types, contract types, leg roles and contract terms
• Includes only those terms which have corresponding instance data
• Model from Sparx Systems Enterprise Architect
Operational ontology (Semantic Web)
• Narrowed for operational use; the business ontology provides its source
• E.g. an IR Swap is an Agreement that has parties and swaps IR Streams

Demo Architecture: Data Set Inference
Data Set Relationships
• Version of, mirror of, index of...
Provenance
• History and origin of data
• Transformations, relocations...
Best Source Inference
• Describe activities and processes
• Describe goals
• Freshness, speed, completeness, authoritativeness...
• Infer best data source for your task (a small sketch follows below)
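As a toy sketch of best-source inference, with invented data set descriptions and property names, and with a single SPARQL query (via rdflib) standing in for a real rules engine over RIF rules, the fragment below describes two data sets and selects the authoritative, freshest one for a hypothetical risk-aggregation activity.

    from rdflib import Graph

    # Descriptions of an activity and of two candidate data sets.
    ttl = """
    @prefix ex: <http://example.com/ns#> .

    ex:riskAggregation a ex:Activity ; ex:requires ex:TradeData .

    ex:warehouseExtract a ex:DataSet ;
        ex:provides ex:TradeData ; ex:authoritative false ; ex:freshnessHours 24 .

    ex:tradeCapture a ex:DataSet ;
        ex:provides ex:TradeData ; ex:authoritative true ; ex:freshnessHours 1 .
    """

    g = Graph()
    g.parse(data=ttl, format="turtle")

    # A stand-in "rule": among data sets that provide what the activity
    # requires, keep the authoritative ones and prefer the freshest.
    q = """
    PREFIX ex: <http://example.com/ns#>
    SELECT ?dataset WHERE {
        ex:riskAggregation ex:requires ?need .
        ?dataset a ex:DataSet ;
                 ex:provides ?need ;
                 ex:authoritative true ;
                 ex:freshnessHours ?hours .
    }
    ORDER BY ASC(?hours)
    LIMIT 1
    """
    for row in g.query(q):
        print("Best source:", row.dataset)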
Why Now?
External
• Regulatory demands for robust data quality controls and proof of data reliability
Internal
• Monitoring and controlling operational risk
• Internal expectations to find more productivity and reduce expenses

Data Set Suitability
Many business activities require the use of multiple data sets
• Analytics, audits, risk, performance monitoring
Data landscapes in large enterprises are extremely complicated
• Lots of related data sets
• Poor metadata management tools
Finding the right data sets for a particular activity is difficult
• We need more description
• Data sets need to be described better
• Processes, activities and goals must be described better

Ontology Overview: Suitability for Use
The user describes an activity
• E.g. an external audit of manufacturing processes
A rules engine reads a knowledgebase of descriptions
• Data sets, activities, processes, goals, people...
The rules engine infers which data sets are best for the activity

Closing
Paradigm shifts in IT* occur over a period of 20 years and last about 40 years
• We only have two examples, a small sample
Highly distributed data is an expensive problem
• Applications take longer and longer to build
• Analysis is incomplete, because the data is incomplete
• Compliance with policies, regulations and laws is very hard to determine
*or numerical computers, depending on the era

The Shift is On (we are in the middle of an IT paradigm shift)
A Distributed Information Management System is available now
See Revelytix.com for additional information
Thank You