Boston KM Forum • How big data becomes actionable information – Tweaked version of Gilbane big data presentation • Other Gilbane Conference impressions –


Boston KM Forum
• How big data becomes actionable information
– Tweaked version of Gilbane big data presentation
• Other Gilbane Conference impressions
– And some open source/content management
market dynamics slides
• Discussion
1
Big Data 101 Agenda
• Big data in context
• Recap
• Risks
• Recommendations
2
Big Data in Context
• What is “big data”?
– Unhelpfully, both “big data” and “NoSQL,” generally
considered a key part of the big data wave, are defined
more in terms of what they aren’t than what they are
– A typical big data definition (Wikipedia):
• “[…] data sets that grow so large that they become awkward to
work with using on-hand database management tools”
– Often associated with Gartner’s volume, variety (and
complexity), and velocity model
• Also value and veracity considerations
3
Big Data in Context
• Why is big data a big deal now?
– The need to deal with really big data sources, e.g., Web
site logs, social network activities, and sensor network
feeds
– Commoditized hardware, software, and networking
• Capability and price/performance curves that continue to defy
all economic “laws”
• Cloud services with radical new capability/cost equations
– Maturation and uptake of related open source software,
especially Hadoop
• Powerful and often no- or low-cost
4
Big Data in Context
• Why is big data a big deal now (continued)?
– Market enthusiasm for “NoSQL” systems
• Which often simply means Hadoop
– Useful and often “open source”/public domain data
sources and services
– Mainstreaming of semantic tools and techniques
• Overall: many things that used to be complex,
expensive, and scarce
– Are now relatively straightforward, inexpensive, and
abundant
5
Big Data in Context
• Big data reality checks
– Most decision-makers don’t want big data per se;
instead, they probably want
• Relevant, accurate, and timely answers to big questions
– Including alerts pertaining to questions they may or may not
have asked yet
• The ability to purposefully analyze information without
having to master arcane technologies
– It’s more about the ability to formulate and ask big
questions (and to effectively analyze and act on
answers) than it is about related technologies
6
A Prime Minicomputer, c. 1982
7
Fast-Forward to 2012
8
Fast-Forward to 2012
9
Fast-Forward to 2012
10
Fast-Forward to 2012
11
Fast-Forward to 2012
12
Google BigQuery
13
Hadoop
• Hadoop is often considered central to big data
– Inspired by Google’s MapReduce architecture,
Apache Hadoop is an open source framework for
distributed processing on clusters of commodity
hardware
– From Wikipedia:
• “‘Map’ step: The master node takes the input, divides it into
smaller sub-problems, and distributes them to worker nodes
• ‘Reduce’ step: The master node then collects the answers to
all the sub-problems and combines them in some way to
form the output – the answer to the problem it was
originally trying to solve”
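The two steps quoted above can be sketched in a few lines of Python. This is a minimal single-process illustration, not Hadoop's actual API: the function names (`map_step`, `shuffle`, `reduce_step`, `word_count`) and the sample log lines are assumptions made for the example, and the grouping step between map and reduce is added here because the quoted description leaves it implicit.

```python
from collections import defaultdict


def map_step(chunk):
    """'Map': turn one input chunk into (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group values by key between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(key, values):
    """'Reduce': combine all values for one key into a single answer."""
    return key, sum(values)


def word_count(chunks):
    # On a real cluster each chunk would be mapped on a different worker
    # node; here the "workers" run one after another in a single process.
    mapped = [pair for chunk in chunks for pair in map_step(chunk)]
    return dict(reduce_step(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    # Hypothetical Web log fragments, for illustration only.
    log_lines = ["GET /home", "GET /about", "POST /home"]
    print(word_count(log_lines))
```

Because each chunk is mapped independently, the same logic can in principle be spread over many commodity machines, which is the property the slide attributes to Hadoop.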
14
• Hadoop commercial application domains (from
Wikipedia) include
– Log and/or clickstream analysis of various kinds
– Marketing analytics
– Machine learning and/or sophisticated data mining
– Image processing
– Processing of XML messages
– Web crawling and/or text processing
– General archiving, including of relational/tabular data,
e.g. for compliance
15
Hadoop
• Hadoop is popular and rapidly evolving
– Most leading information management vendors
have embraced Hadoop
– There is now a Hadoop ecosystem
16
Meanwhile, Back in the Googleplex
• Dremel, BigQuery, Spanner, and other really
big data projects
17
Meanwhile, Back in the Googleplex
18
Google Now
19
A NoSQL Taxonomy
• From the NoSQL Wikipedia article:
20
A View of the NoSQL Landscape
21
Another NoSQL Landscape View
22
NoSQL Perspectives
• The “NoSQL” meme confusingly conflates
– Document database requirements
• Best served by XML DBMS (XDBMS)
– Physical database model decisions on which only DBAs and
systems architects should focus
• And which are more complementary than competitive with DBMS
– Object databases, which have floundered for decades
• But with which some application developers are nonetheless
enamored, for minimized “impedance mismatch,” despite significant
information management compromises
– Semantic (e.g., RDF) models
• Also more complementary than competitive with RDBMS/XDBMS
• Also consider: the “traditional” DBMS players can leverage
the same underlying technology power curves
23
Modeling Abstractions
• Conceptual
– Resources: documents and links; documents focused primarily on
narrative, hierarchy, and sequence
– Relations: entities, attributes, relationships, and identifiers
• Logical
– Resources: model: hypertext; language: XQuery (ideally…)
– Relations: model: extended relational; language: SQL
• Physical
– Indexing (e.g., scalar data types, XML, and full-text), locking and
isolation levels (for transactions), federation,
replication/synchronization, in-memory databases, columnar storage,
table spaces, caching, and more
24
Data as a Service
• The (single source of) truth is out there?...
– High-quality data sources are being commoditized
– Value is shifting to the ability to discern and leverage conceptual
connections, not just to manage big databases
• Some resources and developments to explore
– Social networking graphs and activities
– Data.com (Salesforce.com)
– Data.gov
– Google Knowledge Graph
– Linked Data
– Microsoft Windows Azure Data Marketplace
– Wikidata.org
– Wolfram Alpha
25
Mainstreaming Semantics
• Tools and techniques applied in search of
more meaning, e.g.,
– Vocabulary management
– Disambiguation and auto-categorization
– Text mining and analysis
– Context and relationship analysis
• It’s still ideal to help people capture and apply
data and metadata in context
– Semantic tools/techniques are complementary
26
Mainstreaming Semantics
• The Semantic Web is still more vision than reality
– But Google, Microsoft, Yahoo, and Yandex, for
example, are improving Web searches by capturing
and applying more metadata and relationships via
schema.org schemas in Web pages
– And Google’s Knowledge Graph is about “things, not
strings,” with, as of mid-2012, “500 million objects, as
well as more than 3.5 billion facts about and
relationships between these different objects”
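To make the schema.org point concrete, the sketch below builds a small schema.org description in Python and wraps it for embedding in a Web page. It assumes the JSON-LD encoding, which is only one of the ways schema.org markup can be expressed; the presentation does not say which encoding is meant. The helper names (`person_jsonld`, `as_script_tag`) and the person and employer are hypothetical.

```python
import json


def person_jsonld(name, job_title, employer):
    """Build a schema.org Person description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": name,
        "jobTitle": job_title,
        "worksFor": {"@type": "Organization", "name": employer},
    }


def as_script_tag(data):
    """Wrap JSON-LD in the script element used to embed it in a page."""
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    # Hypothetical person and employer, for illustration only.
    print(as_script_tag(person_jsonld("Ada Example", "Analyst", "Example Corp")))
```

The typed `worksFor` relationship is what lets a search engine treat the page as describing a thing connected to other things, rather than a string of words.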
27
Recap
• Commoditization and cloud
– Very significant new opportunities
• Hadoop and related frameworks
– Complementary to RDBMS and XDBMS
• NoSQL
– Likely headed for meme-bust…
• Data services
– Game-changing potential
• Semantic tools and techniques
– Rapidly gaining momentum
28
Risks
• The potential for an ever-expanding set of information silos
– Focus on minimized redundancy and optimized integration
• GIGO (garbage in, garbage out) at super-scale
– New opportunities for unprecedented self-inflicted damage, for
organizations that don’t model or query effectively
• Cognitive overreach
– The potential for information workers to create and act on
nonsensical queries based on poorly designed and/or
misunderstood information models
• Skills gaps can create competitive disadvantages
– Modeling, query formulation, and data analysis
– Critical thinking and information literacy
29
Recommendations
• Aim high: big data is in many respects just
getting started…
– A lot of technology recycling but also
significant and disruptive innovation
• Work to build consensus among stakeholders on the opportunities and risks
• Focus on human skills – e.g., critical
thinking and information literacy
– For now, an instance of the most creative and
powerful type of semantic big data processor
we know of is between your ears
[End of tweaked Gilbane presentation]
30
Gilbane 2012 Impressions
• The big themes
– Cloud
– Social
– Mobile
– Big data
– Web
• Other recurring themes
– Open source: enterprise-ready for many domains
31
Gilbane 2012 Impressions
• Projections
– Consolidation ahead for W*M and ECM vendors
• Likely to be accelerated by market uptake of native XML
information management systems
– And rediscovery of the utility of modern DBMSs
» Along with SQL/XML (e.g., XQuery) synergy
– Cloud as accelerator
• Ridiculously low entry cost and complexity, relative to
earlier on-premises alternatives
• Tipping point with other shifts to cloud, e.g., for social,
CRM/SFA, and public data sources
32
Gilbane 2012 Impressions
• Projections
– New challenges and opportunities for IT groups
• Potential to derive unprecedented value from both
existing and new information resources
• Transition systems to “the cloud”
– With or without IT assistance…
– Blurring boundaries
• Application, document, page…
• Ability to apply and capture data and metadata in
context, e.g., activity streams
33
Gilbane 2012 Impressions
• Projections
– The next critical IT scarcity is not about technology
• It is instead the number of people who can
– Think critically and structure problems/scenarios
– Understand and apply conceptual models
– Formulate queries and objectively analyze results
» And generally get into an event/action routine, for work and
personal activities
– Growing awareness of the critical need for
information responsibility
• Producer: information quality, integrity, context…
• Consumer: information literacy; critical and purposeful
thinking
34
Reference Slides
• Content management + open source
• Hypertext
35
Open source examples
36
Open source examples
37
Open source examples
38
Open source examples
39
Hypertext
• Criteria from a 2006 Burton Group report:
– A content model based on collections of
information items and links
– Pervasive support for info item labels
– Typed and bidirectional info item relationships
– A means of creating, organizing, and sharing info
item collections
– Journaling (tracking info item changes)
– Robust access control privilege management
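Several of these criteria can be illustrated with a small data structure. The following is a rough sketch under stated assumptions, not the Burton Group's model: the class and method names (`Item`, `Collection`, `outgoing`, `incoming`) are invented for the example, and access control is left out entirely.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Item:
    """An information item with labels and a change journal."""
    title: str
    labels: set = field(default_factory=set)
    journal: list = field(default_factory=list)

    def add_label(self, label):
        self.labels.add(label)
        # Journaling: record every change with a UTC timestamp.
        self.journal.append((datetime.now(timezone.utc), f"label+{label}"))


@dataclass
class Collection:
    """A named collection of items plus typed links between them."""
    name: str
    items: dict = field(default_factory=dict)
    links: list = field(default_factory=list)  # (source, link_type, target)

    def add(self, item):
        self.items[item.title] = item

    def link(self, source, link_type, target):
        self.links.append((source, link_type, target))

    def outgoing(self, title):
        return [(t, dst) for src, t, dst in self.links if src == title]

    def incoming(self, title):
        # Bidirectional: the same link can be followed from its target.
        return [(t, src) for src, t, dst in self.links if dst == title]
```

Storing each link once as a (source, type, target) triple is what makes it both typed and traversable in either direction.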
40
Discussion
41