Transcript Slide 1
Innovations in Database Technology
IRMAC BI/DW SIG May 28, 2009
2
Agenda
About Infobright Data Warehousing Challenge Use Cases Infobright Approach Infobright Architecture Infobright Versions & System Requirements
3
About Infobright
About Infobright
Founded Headquarters The Infobright Data Warehouse The Open Source Solution MySQL Integration
2006 Toronto, Canada; offices in Boston, MA and Warsaw, Poland Simplicity: No new schemas, no indices, no data partitioning, easy to maintain Scalability: Designed for rapidly growing volumes. Ideal for up to 30 TB Low TCO: Industry-leading compression, less storage, industry standard servers, low software costs, minimal ongoing operational expenses Community (open source) and Enterprise Editions are available Leverages MySQL connectivity to ETL and BI Provides MySQL customers with scalable, enterprise-ready data warehouse MySQL/SUN Microsystems invests in Infobright Sept 15, 2008 4
5
Data Warehousing Challenge
.
Data Warehousing Challenges
More Data, More Data Sources Real time data 01 1 Multiple databases 1010 External Sources 01 Limited Resources and Budget More Kinds of Output Needed by More Users, More Quickly
6 6
0 10 1 1 1 1 010 1
Traditional Data Warehousing
Labor intensive, heavy indexing and partitioning Hardware intensive: massive storage; big servers Expensive and complex
7
Data Warehousing – Raising The Bar
New Demands:
Larger transaction volumes driven by the internet Impact of Cloud Computing More -> Faster -> Cheaper
Data Warehousing Matures:
Near real time updates Integration with master data management Data mining using discrete business transactions Provision of data for business critical applications
Early Data Warehouse Characteristics:
Integration of internal systems Monthly and weekly loads Heavy use of aggregates
8
Use Cases
9
Use Cases
Infobright is a good fit for; • • • • Loading millions of transactions with a limited batch window Summarizing transactional data for trend analysis Extracting transactional detail based on specific constraints Ad hoc query support across many dimensional attributes Avoid using Infobright for; • • • Real-time transactional updates (operational data entry) Full data extracts (select * from …) Row based operations that need to access all columns of a table are typically better suited to row based databases
Customer Experience – Load Speed
Business Requirement
• • • Mavenir - OEM customer deploying a world wide telco application Application provides operators with access to detailed SMS traffic Needed a low cost solution with the ability to load 20K records per second • Peak of 70M messages per hour during Chinese New year
Solution
• • • • • Custom front end developed using MySQL JDBC driver Completed design, test, deployment in < 3 months with no assistance from Infobright Allowed for expansion from 7 to 90 days of online SMS history Supports plan for 70% annual growth Rollout to allow for 120 concurrent users 10
Customer Experience – Query Performance
Business Requirement
• Sulake - Online Social Networking service with 126M users across 31 countries • 990M page impressions per month • Need to quickly analyze online spend on a daily basis to enhance online experience and drive additional revenue • Existing InnoDB solution was able to process business queries in a reasonable time frame (queries taking hours to complete) • Business opportunities were being lost due to inability to analyze subscriber behavior using transactions
Solution
• • • • Customer used existing data model and deployed the application using Business Objects – Data Integrator for ETL, Web-Intelligence for BI Existing ETL workflows were converted to Infobright in less than 4 weeks without assistance Historically long running queries (hours) now running in minutes and seconds Additional benefits due to compression were a reduced need for disk storage and an overall reduction in I/O and network traffic 11
Customer Experience - TCO
Business Requirement
• A global provider of electronic trading solutions across 22 time zones and 700 financial exchanges • Wanted to expand analytical access to financial transactions to include both current (30 days) and archived transactions (4 years) • Expansion of existing Sybase solution was too costly
Solution
• • • Infobright was able to achieve performance benchmarks within the first 3 days of a proof of concept using production data • 28,000 records per second load speed • Join 100M row with a 30Mrow table -> 400k rows, returned in 185 seconds • Additional queries that did not complete using Sybase, finished in minutes using Infobright Final solution deployed using Pentaho Kettle for ETL and Crystal Reports for BI Success with modest data size (150GB) has opened opportunities for additional more detailed transactional analysis 12
Customer Experience – Query Performance and TCO
13
Business Requirement
• TradeDoubler – Based in Sweden, a global digital marketing company, serving 1600+ online advertisers across Europe and Asia. • TradeDoubler optimizes Web marketing campaigns by analyzing Web clicks, impressions and purchases. • Analyzing terabytes of data about the results of its programs is central to the company’s success. • Selected Infobright to produce analytical results rapidly, seamless interoperability with their MySQL database and low TCO
Solution
• • • • • • Deployed solution using a single, $12,500 Dell server with 8 CPU cores and 16 GB RAM Used Pentaho Kettle for ETL and Jaspersoft Server Pro Reports for BI Needed to process and analyze data 20 billion online transactions/month In POC, loaded > 3.2 billion rows at > 300,000 rows / second In production, achieved 30x data compression Extremely fast query speed. 3 queries that previously did not return, now returned within a minute
14
Infobright Approach
Introducing Infobright
Knowledge Grid
– statistics and metadata “describing” the super compressed data
Data Packs
– data stored in manageably sized, highly compressed data packs Data compressed using algorithms tailored to data type 15 1 5
Column Orientation Smarter architecture
Load data and go No indices or partitions to build and maintain
Knowledge Grid
automatically updated as data packs are created or updated Super-compact data foot print can leverage off-the shelf hardware
Column vs. Row-Oriented
EMP_ID
1 2 3
FNAME
Moe Curly Larry
LNAME
Howard Joe Fine
SALARY
10000 12000 9000
Row Oriented
(
1,Moe,Howard,10000; 2,Curly, Joe,12000; 3,Larry,Fine,9000;
)
Works well if all the columns are needed for every query.
Efficient for transactional processing if all the data for the row is available
Column Oriented
(
1,2,3; Moe,Curly,Larry; Howard,Joe,Fine; 10000,12000,9000;
)
) Works well with aggregate results (sum, count, avg. Only columns that are relevant need to be touched Consistent performance with any database design Allows for very efficient compression 16
17
Data Packs and Compression
64K 64K 64K 64K
Data Packs
Each data pack contains 65, 536 data values Compression is applied to each individual data pack The compression algorithm varies depending on data type and data distribution Patent Pending Compression Algorithms
Compression
Results vary depending on the distribution of data among data packs A typical overall compression ratio seen in the field is 10:1 Some customers have seen results have been as high as 40:1
Knowledge Grid
Data Pack Nodes (DPN)
A separate DPN is created for every data pack created in the database to store basic statistical information
Character Maps (CMAPs)
Every Data Pack that contains text creates a matrix that records the occurrence of every possible ASCII character 18
Histograms
Histograms are created for every Data Pack that contains numeric data and creates 1024 MIN-MAX intervals.
Pack-to-Pack Nodes (PPN)
PPNs track relationships between Data Packs when tables are joined. Query performance gets better as the database is used.
This metadata layer = 1% of the compressed volume
A Simple Query using the Knowledge Grid
19 1.
SELECT count(*) FROM employees WHERE salary > 50000 AND age < 65 AND job = ‘Shipping’ AND city = ‘TORONTO’; Find the Data Packs with salary > 50000 Rows 1 to 65,536 2.
Find the Data Packs that contain age < 65 65,537 to 131,072 3.
4.
Find the Data Packs that have job = ‘Shipping’ Find the Data Packs that have City = “Toronto’ 5.
Now we eliminate all rows that have been flagged as irrelevant.
6.
Finally we have identified the data pack that needs to be decompressed 131,073 to …… salary age job Only this pack will be decompressed Completely Irrelevant Suspect All values match city All packs ignored All packs ignored All packs ignored
A Join Query using the Knowledge Grid
id Car Sales sale discount prov date Sales Person id name SELECT MIN(sale), MAX(discount), name FROM carsales, salesperson WHERE carsales.id = salesperson.id
AND carsales.prov = ‘ON’ AND carsales.date = ‘2008-02-29’ GROUP BY name; 20 Pack-to-Pack carsales_id vs salesperson_id salesperson.id
0 1 0 1 1 0 carsales.id
Indicates that the Data Packs are related 1.
Eliminate the Car Sales Data Packs that are irrelevant based on constraints in the SQL 2.
Determine the related Sales Person Data Packs based on the values of carsales_id found in the relevant Car Sales Data Packs.
3.
4.
Create a Pack-to-Pack node that stores the results of the join condition between Car Sales and Sales Person.
Any subsequent queries will be able to use the PPN to resolve joins between Car Sales and Sales Person
21
Infobright Architecture
Infobright – Embedded With MySQL
MySQL/Infobright Architecture CONNECTORS: Native C API, JDBC, ODBC, .NET, PHP, Python, Perl, Ruby, VB CONNECTION POOL: Authentication, Thread Reuse, Connection Limits, Check Memory, Caches Management Services & Utilities Infobright Optimizer and Executor My SQL Optimizer Infobright Loader / Unloader MyISAM • Views • Users • Permissions • Tables Defs Knowledge Node Data Pack Node Knowledge Node Data Pack Node Knowledge Node Data Pack Node Knowledge Node Data Pack Node Knowledge Node Data Pack Node Knowledge Node Data Pack Node Knowledge Node Data Pack Node Infobright ships with the full MySQL binaries. The MySQL architecture is used to support database components such as connectors, security and memory management. 22 • •
Infobright Components
• IB Storage Engine consisting of 64Kb Data Packs, Compressor, and the Knowledge Grid IB Optimizer that uses rough set algorithms and the knowledge grid to navigate the database IB Loader supports text based and binary data formats
Optimized SQL for Infobright
MySQL
The Infobright Optimizer supports a large amount of MySQL syntax and functions. When the optimizer encounters SQL syntax that is not supported, then the query is executed using the MySQL optimizer.
Infobright Optimized SQL
• Select Statements • Comparison Operators • Logical Operators • String Comparison Functions (LIKE, ..) • Aggregate Functions • Arithmetic Operators • Data Manipulation Language (I/U/D) • • • • • • Data Definition Language (CREATE & DROP) String Functions Date/Time Functions Numeric Functions Trigonometric Functions Case Statements 23
Infobright Data Types
Most of the data types expected for a MySQL database engine are fully supported. The data types that are currently not implemented within Infobright include BLOB, ENUM, SET and Auto Increment.
24
ETL Integration
Leverage existing IT tools and resources for fast, simple deployments and low TCO Increased efficiency with popular platforms Deeper ETL Integration Jaspersoft, Talend, Pentaho Leverages end-to-end data management provided by ETL tools Improved support for Data Manipulation Language (DML) 25
Data Loading with & without custom ETL connectors
Loading Infobright tables with custom connectors:
Kettle from Pentaho
Talend ETL from Talend
Jaspersoft ETL (Talend) from Jaspersoft Two ways to invoke Infobright loader without connectors
1.
Generate a CSV or binary file and invoke the Infobright loader to load the file 2.
Named pipe technique: Create a named pipe (i.e. mkfifo /home/mysql/s_mysession1.pipe
) Launch the Infobright loader in the background to read from the pipe Launch the ETL process that writes data to the named pipe When the ETL process runs, as records are written to the named pipe, the loader reads them and writes them to an Infobright database table 26
27
Infobright Versions & System Requirements
Comparison of ICE and IEE
Features Technical Support Warranty and Indemnification INSERT/UPDATE/DELETE Infobright Loader Data Load Types MySQL Loader Platform Support
Forums and/or one-time 4-hr support pack No Available Included No Supported Up to 50 GB/hr Text only Multi-threaded, Up to 300 GB/hr Text & Binary (100% faster) No Supported 64-bit Intel and AMD RHEL 5, CentOS 5, Debian 32-bit Intel and AMD for Windows XP, Ubuntu 8.04, Fedora 9 64-bit Intel and AMD Windows Server 2003, Windows Server 2008, RHEL 5, CentOS 5, Debian, Solaris 10 28
System Requirements
29
For More Information
Data Warehouse Evangelist Bob Newell
Or join our open source community at www.infobright.org
Thank you
30
Query performance
# Query Query name
1 2 Affiliate/minor/sum(order)/year Affiliate/major/sum(order)/year 3 4 3 4 1 2 Affiliate/minor/sum(order)/month Affiliate/major/sum(order)/month Events/Cat=2/Country/sum(no of)/year Events/Cat=*/Country/sum(no of)/year Events/Cat=2/Country/sum(no of)/month Events/Cat=*/Country/sum(no of)/month
Intervall
20060101-20061231 20060101-20061231 20060101-20060131 20060101-20060131 20060101-20061231 20060101-20061231 20060101-20060131 20060101-20060131
Infobright No cache Cache Traditional DB No cache Cache
7,72 31,52 0,99 7,81 13,00,91 N/A 4,03,21 N/A 1,32 3,23 37,16 41,67 15,16 22,12 0,43 0,65 24,42 29,62 7,15 8,01 1,00,43 2,12,34 N/A N/A 8,08,13 15,08,32 10,69 18,55 N/A N/A 2,10,15 3,12,82 Time in minutes, seconds, milliseconds 31 3 1