Transcript NoSQL_AND_Big_Data - GlobalsDB
Big Data, NoSQL . . . So What?
Iran Hutchinson
Me
• I work for InterSystems, which:
– Drives the http://globalsdb.org NoSQL project.
– Has 20+ years of NoSQL production deployments
– Has 20+ years of Big Data production deployments
– Built a ~250 million Euro business on the above
• Email: [email protected]
• Twitter: #iranic
Big Data
is …
• Important data, in varying formats and volumes, generated across all areas affecting your business and generally not centrally correlated or managed.
• Examples include:
– Word Files, PowerPoint, PDFs
– Emails, Instant Messaging, Texts
– Blogs and Social Media
– Automated data from machine activities
– Stream data from financial stock markets
Some Big Data Numbers
• Source: McKinsey Global Institute
• 5 Billion mobile phones used in 2010
• 30 Billion pieces of info shared on Facebook each month
• 40% projected growth in global data generated
• 235 Terabytes collected by US Library of Congress 04/11
– 15 out of 17 sectors in US have more data stored per company than this.
Some Big Data Numbers …
• Source: McKinsey Global Institute
• $300 Billion in potential value in US Healthcare system
• €250 Billion in Europe’s public sector administration
• $600 Billion in annual consumer surplus using location data
• 60% Potential increase in retail operating margins
• 140,000 – 190,000 analytical talent positions in US
• 1.5 Million data-savvy managers needed in US
Case Study: Credit Suisse
• Key Challenges:
– Revamp order routing architecture
– Revamp order management architecture
– Serve current demand and scale to new levels
– Address downtime challenges
Case Study: Credit Suisse …
• Big Data in the form of volumes of transactions
• Leveraged Caché’s:
– In-memory architecture for performance
– On-disk resiliency for availability
– Distributed architecture for data coherency
• Can easily process 1,000,000,000 transactions
– During business hours
Case Study: European Space Agency (ESA)
• Key Challenges
– Make the largest, most precise 3-D map of our Galaxy
– Monitor 1,000,000,000 stars over 5 years, precisely charting position, movement, and brightness
– Along the way discover hundreds of thousands of new celestial objects
Case Study: ESA Continued …
• Challenge Calculation:
• Capture data for 1 Billion Celestial Objects
– 1,000,000,000 objects
– × 100 observations per object
– × 600 bytes per observation
– = 60,000,000,000,000 bytes (60TB)
• http://www.intersystems.com/cache/whitepapers/pdf/Charting_the_Galaxy.pdf
• Solution: Caché/XEP, delivering 100,000+ sustained inserts per second per server, stored as real objects with SQL access
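The challenge calculation can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the four constants come from the slide, while the function names and the "one observation equals one insert" reading of the quoted insert rate are assumptions made for illustration.

```python
# Back-of-the-envelope sizing using the figures quoted on the slide.
OBJECTS = 1_000_000_000      # celestial objects
OBS_PER_OBJECT = 100         # observations per object
BYTES_PER_OBS = 600          # bytes per observation
INSERTS_PER_SEC = 100_000    # sustained inserts per second per server


def total_bytes() -> int:
    """Raw data volume in bytes."""
    return OBJECTS * OBS_PER_OBJECT * BYTES_PER_OBS


def total_terabytes() -> float:
    """Volume in decimal terabytes (1 TB = 10**12 bytes)."""
    return total_bytes() / 10**12


def load_days_single_server() -> float:
    """Days to insert everything on one server.

    Assumption: one observation is one insert at the quoted rate.
    """
    observations = OBJECTS * OBS_PER_OBJECT
    return observations / INSERTS_PER_SEC / 86_400


if __name__ == "__main__":
    print(f"{total_bytes():,} bytes = {total_terabytes():.0f} TB")
    print(f"about {load_days_single_server():.1f} days on one server")
```

Under those assumptions the 60TB figure is reproduced, and a single server at the quoted rate would need on the order of days, not years, to absorb the raw observations.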
Enabling Technology
• Focus on Caché • A quick look at the architecture
Enabling Technology …
• Java + C database kernel run in same process
Enabling Technology …
• ECP, Distributed Computing
Enabling Technology …
• Multiple simultaneous data-to-disk writers
Who is this Guy?
• Edgar Frank “Ted” Codd
• Known for his 12 Rules (numbered 0 to 12) for Relational Data Systems
NoSQL … Breaking the Rules
• Rule 1: The Information Rule
– All information is represented in 1 and only 1 way, namely by values in column positions within rows of tables
• Rule 12: The Nonsubversion Rule
– If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert the system, i.e. to bypass relational security or integrity constraints.
Why NoSQL?
• No to ACID transactions
• No to the impedance mismatch with SQL
• Dealing with Big Data and Web Scale
• High prices from RDBMS vendors
• Use commodity hardware
• Flexible data models
• It’s a cool movement ….
Is NoSQL a new Concept?
• No
• Remember MUMPS?
– SET ^Car("Door","Color")="BLUE"
• Remember Multi-value/PICK
– MATWRITE array.variable ON file.variable,id. ….
• Ever heard of the NoSQL RDB?
– Carlo Strozzi
– http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page
CAP Theorem
• Consistency
– A service that is consistent operates fully or not at all.
• Availability
– The service is available to operate fully or not.
• Partition Tolerance
– Managing data on multiple nodes. 1 node is 1 partition, so each node either works or does not when it comes to processing data.
• Significant because you can get only 2 of these 3 …
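A toy illustration of the "2 of 3" trade-off, assuming a store with just two replicas and a simulated network partition. The class names, the "CP"/"AP" mode labels, and the write-through-replica-a behaviour are all hypothetical simplifications, not a description of any real product.

```python
class Replica:
    """One node holding its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class TwoNodeStore:
    """Toy two-replica store; mode is 'CP' or 'AP' (hypothetical labels)."""

    def __init__(self, mode):
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False  # True when a and b cannot talk

    def write(self, key, value):
        """Write via replica a; return True if the write was accepted."""
        if self.partitioned:
            if self.mode == "CP":
                # Stay consistent by refusing the write: availability is lost.
                return False
            # AP: stay available by accepting locally: replicas diverge.
            self.a.data[key] = value
            return True
        # No partition: both replicas are updated together.
        self.a.data[key] = value
        self.b.data[key] = value
        return True

    def consistent(self):
        """True when both replicas hold identical data."""
        return self.a.data == self.b.data


if __name__ == "__main__":
    for mode in ("CP", "AP"):
        store = TwoNodeStore(mode)
        store.write("x", 1)
        store.partitioned = True
        accepted = store.write("x", 2)
        print(mode, "accepted:", accepted, "consistent:", store.consistent())
```

With a partition in place, the CP store rejects the second write and stays consistent, while the AP store accepts it and the two replicas disagree until they are reconciled.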
CAP Theorem …
• Arguments and links
– http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
– http://ksat.me/a-plain-english-introduction-to-cap-theorem/
– http://voltdb.com/company/blog/clarifications-cap-theorem-and-data-related-errors
CAP Theorem …: Consistency
[Diagram: seven database nodes, DB1 through DB7]
CAP Theorem …: Consistency
[Diagram: a Hub with four Spokes, DB1 through DB4]
CAP Theorem …: Consistency
[Diagram: three database nodes, DB1, DB2, DB3]
Distributed computing
• Fallacies (Peter Deutsch)
– The network is reliable
– Latency is zero
– Bandwidth is infinite
– The network is secure
– Topology doesn’t change
– There is one administrator
– Transport cost is zero
– The network is homogeneous
• Remember JINI? (See Apache River project)
NoSQL: Which Model to Use?
[Diagram: Data, with four models: Key-Value, Graph, Document, Column]
NoSQL: Which project?
• http://nosql-database.org/ lists 122 today.
• Depends on your model selection.
• Most likely, choose a well-known project.
• Don’t forget about shared risk!
NoSQL: Querying
• Some solutions have no querying
• When available, query languages differ
• Lack of general ad hoc querying
– “no” SQL
• Have you heard of UnQL?
– http://www.unqlspec.org/display/UnQL/Home
• NOTE: Toad for Cloud
NoSQL: How to Succeed?
• Know your application
• Don’t forget the past lessons
• Consider a hybrid approach
• Fight the desire to Roll-Your-Own-DB
• Start small but significant
NoSQL: Hybrid Approach 1
• Two Systems
• NoSQL System
• SQL/RDBMS
[Diagram: NoSQL, Data Mapper / Translator, SQL/RDBMS]
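One way to picture the Data Mapper / Translator between the two systems is the minimal sketch below. It assumes the NoSQL side is a plain key-value dictionary of documents and the SQL side is an in-memory SQLite database. The table name, field names, and sample data are hypothetical.

```python
import sqlite3

# Hypothetical key-value "NoSQL" side: product id -> document (a plain dict).
documents = {
    "p1": {"item": "computer", "kind": "laptop", "quantity": 3},
    "p2": {"item": "computer", "kind": "desktop", "quantity": 5},
}


def map_to_sql(docs, conn):
    """Translate each document into one row of a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS product "
        "(id TEXT PRIMARY KEY, item TEXT, kind TEXT, quantity INTEGER)"
    )
    rows = [(key, d["item"], d["kind"], d["quantity"]) for key, d in docs.items()]
    conn.executemany("INSERT OR REPLACE INTO product VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def total_quantity(conn, item):
    """Ad hoc SQL query over the mapped data; None if no rows match."""
    cur = conn.execute(
        "SELECT SUM(quantity) FROM product WHERE item = ?", (item,)
    )
    return cur.fetchone()[0]


conn = sqlite3.connect(":memory:")
map_to_sql(documents, conn)
print(total_quantity(conn, "computer"))
```

The cost of this approach shows up even in the sketch: the mapper has to know the shape of every document, and the two copies must be kept in sync.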
NoSQL: Hybrid Approach 2
• One system does both NoSQL and SQL
[Diagram: Data, with models: Relational, Graph, Key-Value, Document, Column, ?]
GlobalsDB.org Project
• Name comes from the underlying data structure
– Multi-dimensional array
– Basis for commercial Caché data system
• Free for development and production deployment
• NoSQL DB with Java and Node.js APIs
• Code base is same as commercial product
• APIs are open sourced or being open sourced
• Database kernel is not open source
A “Global” Definition
• A Global is a persistent sparse multi-dimensional array, which consists of one or more storage elements or "nodes". Each node is identified by a node reference (which is, essentially, its logical address)
– simple == "some data"
– complex["subscript-1", "subscript-2"] == "some data"
• Example
– product[item,type,os,processor] == quantity
– product["computer","laptop","Mac","i7"] == 3
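The definition above can be mimicked in a few lines of Python. This is an in-memory stand-in only: it is not persistent and it is not the GlobalsDB API. The class name, method names, and the second sample node are hypothetical. It shows the sparse part: only nodes that were explicitly set take up storage.

```python
class Global:
    """In-memory stand-in for a sparse multi-dimensional array (not persistent)."""

    def __init__(self, name):
        self.name = name
        self._nodes = {}  # tuple of subscripts -> value; only set nodes exist

    def set(self, value, *subscripts):
        """Store a value at the node addressed by the subscripts."""
        self._nodes[tuple(subscripts)] = value

    def get(self, *subscripts, default=None):
        """Return the value at a node, or default if the node was never set."""
        return self._nodes.get(tuple(subscripts), default)

    def children(self, *prefix):
        """Sorted distinct next-level subscripts found under a prefix."""
        n = len(prefix)
        return sorted(
            {key[n] for key in self._nodes if len(key) > n and key[:n] == prefix}
        )


product = Global("product")
product.set(3, "computer", "laptop", "Mac", "i7")     # example from the slide
product.set(1, "computer", "desktop", "Linux", "i5")  # hypothetical extra node

print(product.get("computer", "laptop", "Mac", "i7"))
print(product.children("computer"))
```

Looking up a node that was never set simply returns the default, which is what makes a very large subscript space cheap to address.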
GlobalsDB Architecture
• Current Architecture
GlobalsDB, NoSQL, Big Data
• http://nosql.mypopescu.com/
• http://highscalability.com/
• http://nosqltapes.com/
• http://globalsdb.wordpress.com