Building Scalable OLAP Applications with Mondrian and MySQL

Download Report

Transcript Building Scalable OLAP Applications with Mondrian and MySQL

Building Scalable OLAP Applications
with Mondrian and MySQL
Julian Hyde: Chief Architect, OLAP, at Pentaho
and Mondrian Project Founder
Tuesday, April 24th 2007
Agenda
Pentaho Introduction
What is OLAP?
Key features and architecture
Getting started
Schemas & queries
Tuning & scalability
Active Data Warehousing
Case Studies
Business Intelligence suite
Q&A
Pentaho Introduction
World’s most popular enterprise open source BI Suite
2 million lifetime downloads, averaging 100K / month
Founded in 2004: Pioneer in professional open source BI
Management - proven BI and open source veterans
from Business Objects, Cognos, Hyperion, JBoss, Oracle, Red Hat, SAS
Board of Directors – deep expertise and proven success in open source
Larry Augustin - founder, VA Software, helped coin the phrase “open source”
New Enterprise Associates – investors in SugarCRM, Xensource, others
Index Ventures – investors in MySQL, Zend, others
Widely recognized as the leader in open source BI
Distributed worldwide by Red Hat via the Red Hat Exchange
Embedded in next release of OpenOffice (40 million users worldwide)
What is OLAP?
View data “dimensionally”
i.e. Sales by region, by channel, by
time period
Navigate and explore
Ad Hoc analysis
“Drill-down” from year to quarter
Pivot
Select specific members for
analysis
Interact with high performance
Technology optimized for rapid
interactive response
A brief history of OLAP
2010
Open-source BI suites
2000
1990
1980
1970
Open-source OLAP
JRubik
Pentaho
JPivot
BEE
OpenI
Palo
Mondrian
MySQL
Microsoft OLAP
Oracle partitions,
Services
bitmap indexes Informix MetaCube
RDBMS support for DW
Codd’s “12
Sybase IQ
Business Objects
Rules of OLAP”
Desktop OLAP
MicroStrategy
Essbase
Cognos PowerPlay
Spreadsheets
Comshare
Mainframe OLAP
Express
APL
Mondrian features and architecture
Key Features
On-Line Analytical Processing (OLAP)
cubes
automated aggregation
speed-of-thought response times
Open Architecture
100% Java
J2EE
Supports any JDBC data source
MDX and XML/A
Analysis Viewers
Enables ad-hoc, interactive data
exploration
Ability to slice-and-dice, drill-down, and
pivot
Provides insights into problems or
successes
How Mondrian Extends MySQL for OLAP Applications
MySQL Provides
Mondrian Provides
Data storage
Dimensional view of data
SQL query execution
MDX parsing
Heavy-duty sorting, correlation,
SQL generation
aggregation
Caching
Integration point for all BI tools
Higher-level calculations
Aggregate awareness
Open Architecture
Viewers
Microsoft Excel
(via Spreadsheet
Services)
Open Standards (Java, XML,
MDX, XML/A, SQL)
Web Server
Cross Platform (Windows &
JPivot servlet
Unix/Linux)
J2EE Application Server
J2EE Architecture
Server Clustering
Fault Tolerance
Data Sources
JPivot servlet
Cube
Schema
XML
Cube
Schema
XML
Cube
Schema
XML
XML/A servlet
Mondrian
cube
cube
cube
JDBC
JDBC
JDBC
File or RDBMS Repository
JDBC
JNDI
RDBMS
RDBMS
Getting started
Getting started with Mondrian
1.
Download mondrian from http://mondrian.pentaho.org and unzip
2.
Copy mondrian.war to
3.
Create a database:
TOMCAT_HOME/webapps.
$ mysqladmin create foodmart
$ mysql
mysql> grant all privileges on *.* to 'foodmart'@'localhost' identified
by 'foodmart';
Query OK, 0 rows affected (0.00 sec)
mysql> quit
Bye
4.
Install demo dataset (300,000 rows):
$ java -cp "/mondrian/lib/mondrian.jar:/mondrian/lib/log4j1.2.9.jar:/mondrian/lib/eigenbase-xom.jar:/mondrian/lib/eigenbaseresgen.jar:/mondrian/lib/eigenbaseproperties.jar:/usr/local/mysql/mysql-connector-java-3.1.12-bin.jar"
mondrian.test.loader.MondrianFoodMartLoader
-verbose -tables -data -indexes
-jdbcDrivers=com.mysql.jdbc.Driver
-inputFile=/mondrian/demo/FoodMartCreateData.sql
-outputJdbcURL=
"jdbc:mysql://localhost/foodmart?user=foodmart&password=foodmart“
5.
Start tomcat, and hit http://localhost:8080/mondrian
demonstration
Schemas and queries
A Mondrian schema consists of…
A dimensional model (logical)
Cubes & virtual cubes
Shared & private dimensions
Calculated measures in cube and in
query language
Parent-child hierarchies
… mapped onto a star/snowflake
schema (physical)
Fact table
Dimension tables
Joined by foreign key relationships
Writing a Mondrian Schema
Regular cubes, dimensions,
hierarchies
Shared dimensions
Virtual cubes
Parent-child hierarchies
Custom readers
Access-control
<!-- Shared dimensions -->
<Dimension name="Region">
<Hierarchy hasAll="true“
allMemberName="All Regions">
<Table name="QUADRANT_ACTUALS"/>
<Level name="Region" column="REGION“
uniqueMembers="true"/>
</Hierarchy>
</Dimension>
<Dimension name="Department">
<Hierarchy hasAll="true“
allMemberName="All Departments">
<Table name="QUADRANT_ACTUALS"/>
<Level name="Department“
column="DEPARTMENT“
uniqueMembers="true"/>
</Hierarchy>
</Dimension>
(Refer to http://mondrian.pentaho.org/documentation/schema.php )
Tools
Schema
Workbench
Pentaho cube
designer
cmdrunner
MDX – Multi-Dimensional Expressions
A language for multidimensional queries
Plays the same role in Mondrian’s API as SQL does in JDBC
SQL-like syntax
SELECT {[Measures].[Unit Sales]} ON COLUMNS,
{[Store].[USA], [Store].[USA].[CA]} ON ROWS
FROM [Sales]
WHERE [Time].[1997].[Q1]
… but un-SQL-like semantics
(Refer to http://mondrian.pentaho.org/documentation/mdx.php )
Scalability & Tuning
Tuning is a Process
Some techniques:
Test performance on a realistic data set!
Use tracing to identify long-running SQL statements
mondrian.trace.level=2
Tune SQL queries
Run the same query several times, to examine cache effectiveness
When Mondrian starts, prime the cache by running queries
Get to know Mondrian’s system properties
Tuning MySQL for Mondrian
Fact table:
Index foreign keys
Try indexing multiple combinations of foreign keys
Use MyISAM or MERGE (also known as MRG_MyISAM)
Partitioned tables
Dimension tables:
Index primary and foreign keys
Use MyISAM for large dimension tables, MEMORY for smaller dimension tables
Create aggregate tables:
Contain pre-computed summaries
Prevent full table scan followed by massive GROUP BY
Parent-child hierarchies
Create closure tables
Scaling to large datasets
Tune MySQL
Partitions allow parallelism
Aggregate tables allow you to do the hard
work to load-time, not runtime
Increase cache size
Scaling to large numbers of concurrent users
Viewers
Viewers
Viewers
Tune MySQL
Increase JVM memory
Make sure that
connections are using
Web Server
WebServer
Server
Web
WebServer
Server
Web
Web Server
JPivot servlet
JPivotservlet
servlet
JPivot
JPivotservlet
servlet
JPivot
JPivot servlet
the same cache
Multiple web servers
Multiple mondrian
J2EE Application Server
J2EE
J2EE Application
Application Server
Server
J2EE Application Server
XML/A servlet
JPivot
XML/Aservlet
servlet
XML/A servlet
Mondrian
Mondrian
Mondrian
Mondrian
instances
cube
cube
cube
cube
Multiple DBMS instances
JDBC
JDBC
JDBC
JDBC
with read-only copies of
the data
RDBMS
RDBMS
RDBMS
RDBMS
RDBMS
Active Data Warehousing
Active Data Warehousing
An Active Data Warehouse:
Extends the traditional DW to support operational processing
Provides a single view of the business
Is updated in near real-time
Cost/benefit:
High cost (technical challenges)
High reward (time is money)
Technical challenges of Active Data Warehousing
DBMS has mixed work-load
Short, fast queries to query specific records
Longer queries to look at changing patterns
Emphasis on recent data
Background DML to keep database up to date
Continuous, incremental ETL
Continuous or near real-time
Incremental
Consider message-driven ETL
OLAP cache
OLAP servers use cache and/or MOLAP database for performance
Cache consistency
Mondrian and Active Data Warehouse
“ROLAP with caching”
No MOLAP data store to maintain
Disable aggregate tables
Unless your ETL process can maintain these incrementally
Either turn off cache
mondrian.rolap.star.disableCaching=true
“Pure ROLAP” mode
Results are cached for the lifetime of a single MDX statement
Or use new CacheControl API to selectively flush mondrian’s cache
Cache control
In an Active Warehouse, the most recent data are modified much more
often than historic data
year
quarter month
day
unit_sales
2007 Q1
1
1
1500
2007 Q1
1
2
1700
2007 Q1
1
3
1250
2007 Q1
1
4
1800
…
2007 Q2
4
22
1378
2007 Q2
4
23
1920
2007 Q2
4
24
525
450
350
Cache control API allows the application to selectively flush the cache
(Refer to http://mondrian.pentaho.org/documentation/cache_control.php )
Cache control: A simple example
Let’s run a simple MDX query:
SELECT
{[Time].[1997],
[Time].[1997].[Q1],
[Time].[1997].[Q2]} ON COLUMNS,
{[Customers].[USA],
[Customers].[USA].[OR],
[Customers].[USA].[WA]} ON ROWS
FROM [Sales]
Cache contents after running query
year=1997, nation=USA, measure=unit_sales
year=1997, nation=USA, state={OR, WA},
measure=unit_sales
year
nation unit_sales
1997 USA
266,773
year=1997, quarter=any, nation=USA,
measure=unit_sales
year
year
nation state unit_sales
1997 USA
OR
67,659
1997 USA
WA
124,366
year=1997, quarter=*, nation=USA,
state={OR, WA}, measure=unit_sales
quarter nation unit_sales
year
quarter nation state unit_sales
1997 Q1
USA
66,291
1997 Q1
USA
OR
19,287
1997 Q2
USA
62,610
1997 Q1
USA
WA
30,114
1997 Q2
USA
OR
15,079
1997 Q2
USA
WA
29,479
Flushing part of a segment
Cache
manager
OR
WA
Q1 19,287 30,114
Q2 15,079 29,479
year=1997, quarter=*, nation=USA,
state={OR, WA}, measure=unit_sales;
measure=unit_sales
exclude: state=WA and quarter=Q2
year
quarter nation state unit_sales
1997 Q1
CA
WA
Q1 16,890 30,114
Q2 18,052 30,079
USA
OR
19,287
year=1997,
quarter=*,
nation=USA,
1997 Q1
USA
WA 30,114
state={CA, WA}, measure=unit_sales
1997 Q2
USA
OR
15,079
1997 Q2
USA
WA
29,479
Code using CacheControl to flush part of a segment
Connection connection;
CacheControl cacheControl = connection.getCacheControl(null);
Cube salesCube = connection.getSchema().lookupCube("Sales", true);
SchemaReader schemaReader = salesCube.getSchemaReader(null);
// Create region containing [Customer].[USA].[OR].
Member memberOregon = schemaReader.getMemberByUniqueName(
new String[]{“Customer", “USA", “OR"}, true);
CacheControl.CellRegion regionOregon =
cacheControl.createMemberRegion(memberOregon, true);
// Create region containing [Customer].[USA].[OR].
Member memberTimeQ2 = schemaReader.getMemberByUniqueName(
new String[]{“Time", “1997", “Q2"}, true);
CacheControl.CellRegion regionTimeQ2 =
cacheControl.createMemberRegion(memberTimeQ2, true);
// Create region containing ([Customer].[USA].[OR], [Time].[1997].[Q2])
CacheControl.CellRegion regionOregonQ2 =
cacheControl.createCrossjoinRegion(regionOregon, regionTimeQ2);
cacheControl.flush(regionOregonQ2);
Case studies
Case Study: Frontier Airlines
Frontier Airlines
Key Challenges
“The competition is intense in the
airline industry and Frontier is
committed to staying ahead of the
curve by leveraging technology that
will help us offer the best prices and
the best flight experience…. [the
application] fits right in with our
philosophy of providing world-class
performance at a low price.”
Understanding and optimizing fares to ensure
Maximum occupancy (no empty seats)
Maximum profitability (revenue per seat)
Pentaho Solution
Pentaho Analysis (Mondrian)
Chose Open Source RDBMS and Mondrian over
Oracle
500 GB of data, 6 server cluster
Results
Comprehensive, integrated analysis to set
strategic pricing
Improved per-seat profitability (amount not
disclosed)
Why Pentaho
Rich analytical and MDX functionality
Cost of ownership
Case Study: U.S.-based, Public Software Company
OEM Example
Key Requirements
“The old solution (MicroStrategy) was
built to be a standalone application. It
expected to own the browser session,
to have separate security, and its own
user interface standards. We were
able to make Mondrian a seamless part
of the application, and it was much
easier to do so.”
Embedding analysis within a pricing planning
application
Incumbent Vendor
MicroStrategy
Decision Criteria
Embeddability
Extensibility
Cost of ownership
Best Practices
Evaluate replacement costs holistically
Treat migrations as an opportunity to improve
a deployment, not just move it
Understand and prioritize requirements –
functionality, usability, architecture, etc.
The big picture
Business Intelligence Suite
Mondrian OLAP
Analysis tools:
Pivot table
Charting
Dashboards
ETL (extract/transform/load)
Integration with operational reporting
Integration with data mining
Actions on operational data
Design/tuning tools
Pentaho Open Source BI Offerings
All available in a Free Open Source license
A Sample of Joint MySQL-Pentaho Users
“Pentaho provided a robust, open
source platform for our sales reporting
application, and the ongoing support
we needed. The experts at OpenBI
provided outstanding services and
training, and allowed us to deploy and
start generating results very quickly.”
“We selected Pentaho for its ease-ofuse. Pentaho addressed many of our
requirements -- from reporting and
analysis to dashboards, OLAP and ETL,
and offered our business users the
Excel-based access that they wanted.”
Next Steps and Resources
Contact Information
Julian Hyde, Chief Architect, [email protected]
More information http://www.pentaho.org and
http://mondrian.pentaho.org
Pentaho Community Forum http://community.pentaho.org
Go to Developer Zone
Discussions
Pentaho BI Platform including Mondrian
http://www.pentaho.org/download/latest
Mondrian OLAP Library only
http://sourceforge.net/project/showfiles.php?group_id=35302
Birds of a Feather session:
MySQL Data Warehousing and BI
Speakers:
James Dixon, Senior Architect and Chief Technology (a.k.a. "Chief Geek"),
Pentaho Corporation
Roland Bouman, Certification Developer, MySQL A.B.
Matt Casters, Data Integration Architect and Kettle Project Founder, Pentaho
Julian Hyde, OLAP Architect and Mondrian Project Founder, Pentaho
Brian Miezejewski, Principal Consultant, MySQL inc.
Track: Data Warehousing and Business Intelligence
Date: TONIGHT!!
Time: 7:30pm - 9:00pm
Location: Camino Real
Thank you for attending!