Boston KM Forum
• How big data becomes actionable information
  – Tweaked version of Gilbane big data presentation
• Other Gilbane Conference impressions
  – And some open source/content management market dynamics slides
• Discussion

Big Data 101

Agenda
• Big data in context
• Recap
• Risks
• Recommendations

Big Data in Context
• What is “big data”?
  – Unhelpfully, both “big data” and “NoSQL,” generally considered a key part of the big data wave, are defined more in terms of what they aren’t than what they are
  – A typical big data definition (Wikipedia):
    • “[…] data sets that grow so large that they become awkward to work with using on-hand database management tools”
  – Often associated with Gartner’s volume, variety (and complexity), and velocity model
    • Also value and veracity considerations

Big Data in Context
• Why is big data a big deal now?
  – The need to deal with really big data sources, e.g., Web site logs, social network activities, and sensor network feeds
  – Commoditized hardware, software, and networking
    • Capability and price/performance curves that continue to defy all economic “laws”
    • Cloud services with radical new capability/cost equations
  – Maturation and uptake of related open source software, especially Hadoop
    • Powerful and often no- or low-cost

Big Data in Context
• Why is big data a big deal now (continued)?
  – Market enthusiasm for “NoSQL” systems
    • Which often simply means Hadoop
  – Useful and often “open source”/public domain data sources and services
  – Mainstreaming of semantic tools and techniques
• Overall: many things that used to be complex, expensive, and scarce
  – Are now relatively straightforward, inexpensive, and abundant

Big Data in Context
• Big data reality checks
  – Most decision-makers don’t want big data per se; instead, they probably want
    • Relevant, accurate, and timely answers to big questions
      – Including alerts pertaining to questions they may or may not have asked yet
    • The ability to purposefully analyze information without having to master arcane technologies
  – It’s more about the ability to formulate and ask big questions (and to effectively analyze and act on answers) than it is about related technologies

A Prime Minicomputer, c. 1982

Fast-Forward to 2012 (5 slides)

Google BigQuery

Hadoop
• Hadoop is often considered central to big data
  – Originating with Google’s MapReduce architecture, Apache Hadoop is an open source architecture for distributed processing on networks of commodity hardware
  – From Wikipedia:
    • “‘Map’ step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes
    • ‘Reduce’ step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve”

• Hadoop commercial application domains (from Wikipedia) include
  – Log and/or clickstream analysis of various kinds
  – Marketing analytics
  – Machine learning and/or sophisticated data mining
  – Image processing
  – Processing of XML messages
  – Web crawling and/or text processing
  – General archiving, including of relational/tabular data, e.g., for compliance

Hadoop
• Hadoop is popular and rapidly evolving
  – Most leading information management vendors have embraced Hadoop
  – There is now a Hadoop ecosystem

Meanwhile, Back in the Googleplex (2 slides)
• Dremel, BigQuery, Spanner, and other really big data projects

Google Now

A NoSQL Taxonomy
• From the NoSQL Wikipedia article:

A View of the NoSQL Landscape

Another NoSQL Landscape View

NoSQL Perspectives
• The “NoSQL” meme confusingly conflates
  – Document database requirements
    • Best served by XML DBMS (XDBMS)
  – Physical database model decisions on which only DBAs and systems architects should focus
    • And which are more complementary than competitive with DBMS
  – Object databases, which have floundered for decades
    • But with which some application developers are nonetheless enamored, for minimized “impedance mismatch,” despite significant information management compromises
  – Semantic (e.g., RDF) models
    • Also more complementary than competitive with RDBMS/XDBMS
• Also consider: the “traditional” DBMS players can leverage the same underlying technology power curves

Modeling Abstractions
• Conceptual
  – Resources: Documents and links; documents focused primarily on narrative, hierarchy, and sequence
  – Relations: Entities, attributes, relationships, and identifiers
• Logical
  – Resources: Model: hypertext; Language: XQuery (ideally…)
  – Relations: Model: extended relational; Language: SQL
• Physical
  – Indexing (e.g., scalar data types, XML, and full-text), locking and isolation levels (for transactions), federation, replication/synchronization, in-memory databases, columnar storage, table spaces, caching, and more

Data as a Service
• The (single source of) truth is out there?...
  – High-quality data sources are being commoditized
  – Value is shifting to the ability to discern and leverage conceptual connections, not just to manage big databases
• Some resources and developments to explore
  – Social networking graphs and activities
  – Data.com (Salesforce.com)
  – Data.gov
  – Google Knowledge Graph
  – Linked Data
  – Microsoft Windows Azure Data Marketplace
  – Wikidata.org
  – Wolfram Alpha

Mainstreaming Semantics
• Tools and techniques applied in search of more meaning, e.g.,
  – Vocabulary management
  – Disambiguation and auto-categorization
  – Text mining and analysis
  – Context and relationship analysis
• It’s still ideal to help people capture and apply data and metadata in context
  – Semantic tools/techniques are complementary

Mainstreaming Semantics
• The Semantic Web is still more vision than reality
  – But Google, Microsoft, Yahoo, and Yandex, for example, are improving Web searches by capturing and applying more metadata and relationships via schema.org schemas in Web pages
  – And Google’s Knowledge Graph is about “things, not strings,” with, as of mid-2012, “500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects”

Recap
• Commoditization and cloud
  – Very significant new opportunities
• Hadoop and related frameworks
  – Complementary to RDBMS and XDBMS
• NoSQL
  – Likely headed for meme-bust…
• Data services
  – Game-changing potential
• Semantic tools and techniques
  – Rapidly gaining momentum

Risks
• The potential for an ever-expanding set of information silos
  – Focus on minimized redundancy and optimized integration
• GIGO (garbage in, garbage out) at super-scale
  – New opportunities for unprecedented self-inflicted damage, for organizations that don’t model or query effectively
• Cognitive overreach
  – The potential for information workers to create and act on nonsensical queries based on poorly designed and/or misunderstood information models
• Skills gaps can create competitive disadvantages
  – Modeling, query formulation, and data analysis
  – Critical thinking and information literacy

Recommendations
• Aim high: big data is in many respects just getting started…
  – A lot of technology recycling but also significant and disruptive innovation
• Work to build consensus among stakeholders on the opportunities and risks
• Focus on human skills – e.g., critical thinking and information literacy
  – For now, an instance of the most creative and powerful type of semantic big data processor we know of is between your ears

[End of tweaked Gilbane presentation]

Gilbane 2012 Impressions
• The big themes
  – Cloud
  – Social
  – Mobile
  – Big data
  – Web
• Other recurring themes
  – Open source: enterprise-ready for many domains

Gilbane 2012 Impressions
• Projections
  – Consolidation ahead for W*M and ECM vendors
    • Likely to be accelerated by market uptake of native XML information management systems
      – And rediscovery of the utility of modern DBMSs
        » Along with SQL/XML (e.g., XQuery) synergy
  – Cloud as accelerator
    • Ridiculously low entry cost and complexity, relative to earlier on-premises alternatives
    • Tipping point with other shifts to cloud, e.g., for social, CRM/SFA, and public data sources

Gilbane 2012 Impressions
• Projections
  – New challenges and opportunities for IT groups
    • Potential to derive unprecedented value from both existing and new information resources
    • Transition systems to “the cloud”
      – With or without IT assistance…
  – Blurring boundaries
    • Application, document, page…
    • Ability to apply and capture data and metadata in context, e.g., activity streams

Gilbane 2012 Impressions
• Projections
  – The next critical IT scarcity is not about technology
    • It is instead the number of people who can
      – Think critically and structure problems/scenarios
      – Understand and apply conceptual models
      – Formulate queries and objectively analyze results
        » And generally get into an event/action routine, for work and personal activities
  – Growing awareness of the critical need for information responsibility
    • Producer: information quality, integrity, context…
    • Consumer: information literacy; critical and purposeful thinking

Reference Slides
• Content management + open source
• Hypertext

Open source examples (4 slides)

Hypertext
• Criteria from a 2006 Burton Group report:
  – A content model based on collections of information items and links
  – Pervasive support for info item labels
  – Typed and bidirectional info item relationships
  – A means of creating, organizing, and sharing info item collections
  – Journaling (tracking info item changes)
  – Robust access control privilege management

Discussion
[email protected]
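
Reference sketch: the “Map” and “Reduce” steps
As a rough illustration of the “Map” and “Reduce” steps quoted from Wikipedia on the Hadoop slide, here is a minimal single-process word-count sketch in Python. It is not Hadoop code and uses no Hadoop API. The function names (map_step, reduce_step, run_job) and the sample input are hypothetical, chosen only for illustration, and a real cluster would distribute the chunks to worker nodes rather than loop over them in one process.

```python
from collections import defaultdict


def map_step(chunk):
    """'Map': turn one sub-problem (a chunk of text) into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_step(pairs):
    """'Reduce': combine the partial answers by summing the counts per word.

    Grouping by key (the "shuffle" in a real framework) is folded into
    this function to keep the sketch short.
    """
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def run_job(chunks):
    """Stand-in for the master node: fan chunks out to map_step, then reduce.

    Assumption: the chunks are processed sequentially here; on a cluster
    each chunk would be handled by a separate worker node.
    """
    mapped = []
    for chunk in chunks:
        mapped.extend(map_step(chunk))
    return reduce_step(mapped)


# Hypothetical sample input: two "sub-problems" split from one data set.
counts = run_job(["big data big questions", "big answers"])
print(counts)
```

The point of the split is that map_step needs no knowledge of the other chunks, which is what lets the work be spread across commodity hardware; only reduce_step has to see all of the partial results.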