Transcript Slide 1

Innovations in Database Technology

IRMAC BI/DW SIG May 28, 2009

2

Agenda

 About Infobright  Data Warehousing Challenge  Use Cases  Infobright Approach  Infobright Architecture  Infobright Versions & System Requirements

3

About Infobright

About Infobright

Founded Headquarters The Infobright Data Warehouse The Open Source Solution MySQL Integration

2006 Toronto, Canada; offices in Boston, MA and Warsaw, Poland Simplicity: No new schemas, no indices, no data partitioning, easy to maintain Scalability: Designed for rapidly growing volumes. Ideal for up to 30 TB Low TCO: Industry-leading compression, less storage, industry standard servers, low software costs, minimal ongoing operational expenses Community (open source) and Enterprise Editions are available    Leverages MySQL connectivity to ETL and BI Provides MySQL customers with scalable, enterprise-ready data warehouse MySQL/SUN Microsystems invests in Infobright Sept 15, 2008 4

5

Data Warehousing Challenge

.

Data Warehousing Challenges

More Data, More Data Sources Real time data 01 1 Multiple databases 1010 External Sources 01 Limited Resources and Budget More Kinds of Output Needed by More Users, More Quickly

6 6

0 10 1 1 1 1 010 1

Traditional Data Warehousing

  

Labor intensive, heavy indexing and partitioning Hardware intensive: massive storage; big servers Expensive and complex

7

Data Warehousing – Raising The Bar

New Demands:

Larger transaction volumes driven by the internet Impact of Cloud Computing More -> Faster -> Cheaper

Data Warehousing Matures:

Near real time updates Integration with master data management Data mining using discrete business transactions Provision of data for business critical applications

Early Data Warehouse Characteristics:

Integration of internal systems Monthly and weekly loads Heavy use of aggregates

8

Use Cases

9

Use Cases

Infobright is a good fit for; • • • • Loading millions of transactions with a limited batch window Summarizing transactional data for trend analysis Extracting transactional detail based on specific constraints Ad hoc query support across many dimensional attributes Avoid using Infobright for; • • • Real-time transactional updates (operational data entry) Full data extracts (select * from …) Row based operations that need to access all columns of a table are typically better suited to row based databases

Customer Experience – Load Speed

Business Requirement

• • • Mavenir - OEM customer deploying a world wide telco application Application provides operators with access to detailed SMS traffic Needed a low cost solution with the ability to load 20K records per second • Peak of 70M messages per hour during Chinese New year

Solution

• • • • • Custom front end developed using MySQL JDBC driver Completed design, test, deployment in < 3 months with no assistance from Infobright Allowed for expansion from 7 to 90 days of online SMS history Supports plan for 70% annual growth Rollout to allow for 120 concurrent users 10

Customer Experience – Query Performance

Business Requirement

• Sulake - Online Social Networking service with 126M users across 31 countries • 990M page impressions per month • Need to quickly analyze online spend on a daily basis to enhance online experience and drive additional revenue • Existing InnoDB solution was able to process business queries in a reasonable time frame (queries taking hours to complete) • Business opportunities were being lost due to inability to analyze subscriber behavior using transactions

Solution

• • • • Customer used existing data model and deployed the application using Business Objects – Data Integrator for ETL, Web-Intelligence for BI Existing ETL workflows were converted to Infobright in less than 4 weeks without assistance Historically long running queries (hours) now running in minutes and seconds Additional benefits due to compression were a reduced need for disk storage and an overall reduction in I/O and network traffic 11

Customer Experience - TCO

Business Requirement

• A global provider of electronic trading solutions across 22 time zones and 700 financial exchanges • Wanted to expand analytical access to financial transactions to include both current (30 days) and archived transactions (4 years) • Expansion of existing Sybase solution was too costly

Solution

• • • Infobright was able to achieve performance benchmarks within the first 3 days of a proof of concept using production data • 28,000 records per second load speed • Join 100M row with a 30Mrow table -> 400k rows, returned in 185 seconds • Additional queries that did not complete using Sybase, finished in minutes using Infobright Final solution deployed using Pentaho Kettle for ETL and Crystal Reports for BI Success with modest data size (150GB) has opened opportunities for additional more detailed transactional analysis 12

Customer Experience – Query Performance and TCO

13

Business Requirement

• TradeDoubler – Based in Sweden, a global digital marketing company, serving 1600+ online advertisers across Europe and Asia. • TradeDoubler optimizes Web marketing campaigns by analyzing Web clicks, impressions and purchases. • Analyzing terabytes of data about the results of its programs is central to the company’s success. • Selected Infobright to produce analytical results rapidly, seamless interoperability with their MySQL database and low TCO

Solution

• • • • • • Deployed solution using a single, $12,500 Dell server with 8 CPU cores and 16 GB RAM Used Pentaho Kettle for ETL and Jaspersoft Server Pro Reports for BI Needed to process and analyze data 20 billion online transactions/month In POC, loaded > 3.2 billion rows at > 300,000 rows / second In production, achieved 30x data compression Extremely fast query speed. 3 queries that previously did not return, now returned within a minute

14

Infobright Approach

Introducing Infobright

Knowledge Grid

– statistics and metadata “describing” the super compressed data

Data Packs

– data stored in manageably sized, highly compressed data packs Data compressed using algorithms tailored to data type 15 1 5

Column Orientation Smarter architecture

 Load data and go  No indices or partitions to build and maintain 

Knowledge Grid

automatically updated as data packs are created or updated  Super-compact data foot print can leverage off-the shelf hardware

Column vs. Row-Oriented

EMP_ID

1 2 3

FNAME

Moe Curly Larry

LNAME

Howard Joe Fine

SALARY

10000 12000 9000

Row Oriented

(

1,Moe,Howard,10000; 2,Curly, Joe,12000; 3,Larry,Fine,9000;

)

  Works well if all the columns are needed for every query.

Efficient for transactional processing if all the data for the row is available

Column Oriented

(

1,2,3; Moe,Curly,Larry; Howard,Joe,Fine; 10000,12000,9000;

)

   ) Works well with aggregate results (sum, count, avg. Only columns that are relevant need to be touched Consistent performance with any database design  Allows for very efficient compression 16

17

Data Packs and Compression

64K 64K 64K 64K

Data Packs

 Each data pack contains 65, 536 data values  Compression is applied to each individual data pack  The compression algorithm varies depending on data type and data distribution Patent Pending Compression Algorithms

Compression

 Results vary depending on the distribution of data among data packs  A typical overall compression ratio seen in the field is 10:1  Some customers have seen results have been as high as 40:1

Knowledge Grid

Data Pack Nodes (DPN)

A separate DPN is created for every data pack created in the database to store basic statistical information

Character Maps (CMAPs)

Every Data Pack that contains text creates a matrix that records the occurrence of every possible ASCII character 18

Histograms

Histograms are created for every Data Pack that contains numeric data and creates 1024 MIN-MAX intervals.

Pack-to-Pack Nodes (PPN)

PPNs track relationships between Data Packs when tables are joined. Query performance gets better as the database is used.

This metadata layer = 1% of the compressed volume

A Simple Query using the Knowledge Grid

19 1.

SELECT count(*) FROM employees WHERE salary > 50000 AND age < 65 AND job = ‘Shipping’ AND city = ‘TORONTO’; Find the Data Packs with salary > 50000 Rows 1 to 65,536 2.

Find the Data Packs that contain age < 65 65,537 to 131,072 3.

4.

Find the Data Packs that have job = ‘Shipping’ Find the Data Packs that have City = “Toronto’ 5.

Now we eliminate all rows that have been flagged as irrelevant.

6.

Finally we have identified the data pack that needs to be decompressed 131,073 to …… salary age job Only this pack will be decompressed Completely Irrelevant Suspect All values match city All packs ignored All packs ignored All packs ignored

A Join Query using the Knowledge Grid

id Car Sales sale discount prov date Sales Person id name SELECT MIN(sale), MAX(discount), name FROM carsales, salesperson WHERE carsales.id = salesperson.id

AND carsales.prov = ‘ON’ AND carsales.date = ‘2008-02-29’ GROUP BY name; 20 Pack-to-Pack carsales_id vs salesperson_id salesperson.id

0 1 0 1 1 0 carsales.id

Indicates that the Data Packs are related 1.

Eliminate the Car Sales Data Packs that are irrelevant based on constraints in the SQL 2.

Determine the related Sales Person Data Packs based on the values of carsales_id found in the relevant Car Sales Data Packs.

3.

4.

Create a Pack-to-Pack node that stores the results of the join condition between Car Sales and Sales Person.

Any subsequent queries will be able to use the PPN to resolve joins between Car Sales and Sales Person

21

Infobright Architecture

Infobright – Embedded With MySQL

MySQL/Infobright Architecture CONNECTORS: Native C API, JDBC, ODBC, .NET, PHP, Python, Perl, Ruby, VB CONNECTION POOL: Authentication, Thread Reuse, Connection Limits, Check Memory, Caches Management Services & Utilities Infobright Optimizer and Executor My SQL Optimizer Infobright Loader / Unloader MyISAM • Views • Users • Permissions • Tables Defs Knowledge Node Data Pack Node Knowledge Node Data Pack Node Knowledge Node Data Pack Node Knowledge Node Data Pack Node Knowledge Node Data Pack Node Knowledge Node Data Pack Node Knowledge Node Data Pack Node Infobright ships with the full MySQL binaries. The MySQL architecture is used to support database components such as connectors, security and memory management. 22 • •

Infobright Components

• IB Storage Engine consisting of 64Kb Data Packs, Compressor, and the Knowledge Grid IB Optimizer that uses rough set algorithms and the knowledge grid to navigate the database IB Loader supports text based and binary data formats

Optimized SQL for Infobright

MySQL

The Infobright Optimizer supports a large amount of MySQL syntax and functions. When the optimizer encounters SQL syntax that is not supported, then the query is executed using the MySQL optimizer.

Infobright Optimized SQL

• Select Statements • Comparison Operators • Logical Operators • String Comparison Functions (LIKE, ..) • Aggregate Functions • Arithmetic Operators • Data Manipulation Language (I/U/D) • • • • • • Data Definition Language (CREATE & DROP) String Functions Date/Time Functions Numeric Functions Trigonometric Functions Case Statements 23

Infobright Data Types

Most of the data types expected for a MySQL database engine are fully supported. The data types that are currently not implemented within Infobright include BLOB, ENUM, SET and Auto Increment.

24

ETL Integration

Leverage existing IT tools and resources for fast, simple deployments and low TCO  Increased efficiency with popular platforms  Deeper ETL Integration  Jaspersoft, Talend, Pentaho  Leverages end-to-end data management provided by ETL tools  Improved support for Data Manipulation Language (DML) 25

Data Loading with & without custom ETL connectors

Loading Infobright tables with custom connectors:

Kettle from Pentaho

Talend ETL from Talend

Jaspersoft ETL (Talend) from Jaspersoft Two ways to invoke Infobright loader without connectors

1.

Generate a CSV or binary file and invoke the Infobright loader to load the file 2.

Named pipe technique:  Create a named pipe (i.e. mkfifo /home/mysql/s_mysession1.pipe

)  Launch the Infobright loader in the background to read from the pipe  Launch the ETL process that writes data to the named pipe  When the ETL process runs, as records are written to the named pipe, the loader reads them and writes them to an Infobright database table 26

27

Infobright Versions & System Requirements

Comparison of ICE and IEE

Features Technical Support Warranty and Indemnification INSERT/UPDATE/DELETE Infobright Loader Data Load Types MySQL Loader Platform Support

Forums and/or one-time 4-hr support pack No Available Included No Supported Up to 50 GB/hr Text only Multi-threaded, Up to 300 GB/hr Text & Binary (100% faster) No Supported 64-bit Intel and AMD RHEL 5, CentOS 5, Debian 32-bit Intel and AMD for Windows XP, Ubuntu 8.04, Fedora 9 64-bit Intel and AMD Windows Server 2003, Windows Server 2008, RHEL 5, CentOS 5, Debian, Solaris 10 28

System Requirements

29

For More Information

Data Warehouse Evangelist Bob Newell

[email protected]

Or join our open source community at www.infobright.org

Thank you

30

Query performance

# Query Query name

1 2 Affiliate/minor/sum(order)/year Affiliate/major/sum(order)/year 3 4 3 4 1 2 Affiliate/minor/sum(order)/month Affiliate/major/sum(order)/month Events/Cat=2/Country/sum(no of)/year Events/Cat=*/Country/sum(no of)/year Events/Cat=2/Country/sum(no of)/month Events/Cat=*/Country/sum(no of)/month

Intervall

20060101-20061231 20060101-20061231 20060101-20060131 20060101-20060131 20060101-20061231 20060101-20061231 20060101-20060131 20060101-20060131

Infobright No cache Cache Traditional DB No cache Cache

7,72 31,52 0,99 7,81 13,00,91 N/A 4,03,21 N/A 1,32 3,23 37,16 41,67 15,16 22,12 0,43 0,65 24,42 29,62 7,15 8,01 1,00,43 2,12,34 N/A N/A 8,08,13 15,08,32 10,69 18,55 N/A N/A 2,10,15 3,12,82 Time in minutes, seconds, milliseconds 31 3 1