THE ROLAP DATA FLOWS SCHEMA DATA MINING FORENSICS HOLAP NETWORK SECURITY ONLINE ANALYSIS MOLAP STREAMING DATA MULTI-DIMENSIONAL HIERARCHIES CUBOID BINARY TREE FOREST SQL GROUP BY DATA CUBE RELATIONAL ALGEBRA Stephen A.
Download ReportTranscript THE ROLAP DATA FLOWS SCHEMA DATA MINING FORENSICS HOLAP NETWORK SECURITY ONLINE ANALYSIS MOLAP STREAMING DATA MULTI-DIMENSIONAL HIERARCHIES CUBOID BINARY TREE FOREST SQL GROUP BY DATA CUBE RELATIONAL ALGEBRA Stephen A.
THE ROLAP DATA FLOWS SCHEMA DATA MINING FORENSICS HOLAP NETWORK SECURITY ONLINE ANALYSIS MOLAP STREAMING DATA MULTI-DIMENSIONAL HIERARCHIES CUBOID BINARY TREE FOREST SQL GROUP BY DATA CUBE RELATIONAL ALGEBRA Stephen A. Broeker 1 Acronyms OLAP – Online Analytical Processing OLTP – Online Transaction Processing 2 Conclusion DATA VALUE HYPERCUBES exceed the performance of existing hypercubes by enabling OLAP to drill down to individual data values. Therefore, DATA VALUE HYPERCUBES extend OLAP’s ability to render valuable information and insight. 3 Vision Analyze Streaming Data Improve Network Security Freedom from figuring out how to answer routine questions in order to think about what extraordinary questions could be asked. 4 Mutually Exclusive Approaches OLAP OLTP Broad views of data: Narrow views of data: Finds patterns obscured by detail. Finds detail obscured by patterns. 5 Distinct Purposes OLAP OLTP Online Analytical Processing Online Transaction Processing Seeks detailed answers to complex questions based on large data sets. Discover information hiding in data. Operate and Control: Snapshots of operational status. The priority is depth and breadth of understanding, speed is secondary. The priority is speed & detail. Example: Find the purchase patterns for men for all dental hygiene products in all stores. Example: John Smith used a debit card to buy toothpaste from a gas station. 6 Opportunity Eliminate the mutually exclusive tradeoff between OLTP versus OLAP. Now we can have the best of both worlds. VS Today. VS Tomorrow. 7 Capability of the Data Value Hypercube Detect trends and tendencies among measures, attributes or parameters. Find a “Needle in Haystack” by drillingdown to specific details. Spot data clusters, relationships and magnitudes of size, disparity, or distribution. Detect outliers, anomalies, exceptions, and data errors. Enables the composition of totals from aggregates or data values. 8 Data Mining Data Mining uses OLAP. Example: Associations People who buy bread “also” buy ‘X’. “Also” is presented as a percentage. Purchased with Bread Concurrence Butter 90% Grape Jelly 20% Cinnamon Spice 10% Sample Size Confidence 100 Transactions 10% 1,000,000 Transactions 90% Building a Data Warehouse: $1M Building a DBMS Team: $2M Having confidence in your results: Priceless. 9 Compare Roles OLTP Relational Database DATA VALUES OPERATIONS BUSINESS INTELLIGIENCE Data Warehouse DATA AGGREGATES OLAP 10 Compare OLTP to OLAP OLAP OLTP RESPONSE TIME TO QUERIES: SPACE REQUIREMENT: DATA SOURCE: GRANULARITY: OUTPUT: SCOPE: FLUX: FIND: SLOW FAST LARGE RELATIVELY SMALL DATA CUBES RELATIONAL DATABASE GENERALIZED DETAILED DATA AGGREGATES SPECIFIC DATA VALUES ANALYZE, DECIDE, PLAN OPERATE & CONTROL BATCH OPERATIONS FREQUENT UPDATES TRENDS ANOMALIES 11 Hypercube Hypercubes are constructed so that each cell corresponds to a unique combination of database attribute values. 12 Dependencies OLAP Engines are implemented as multi-dimensional data cubes. Data cubes with many dimensions are called hypercubes. 13 Disambiguation In geometry, the tesseract is the four-dimensional analog of the cube. The tesseract is to the cube as the cube is to the square. A generalization of the cube to dimensions greater than three is called a “hypercube”. Created by Jason Hise with Maya and Macromedia Fireworks. A 3D projection of an 8-cell performing a [[SO(4)#Geometry_of_4D_rotations|single rotation]] about a plane which bisects the figure from front to back and top to bottom. Released by the author into the public domain: Jason Hise grants anyone the right to use this work for any purpose, without any conditions, unless such conditions are required by law. 14 Disambiguation In this context, hypercubes are data structures. This picture is merely an abstract visual representation of a hypercube. 15 Concepts Implementation versus Implementation Later A Balanced B-Tree Forest is the data structure that is used to represent a Hypercube. Each dimension in the Hypercube is represented by a separate B-Tree. 16 Hypercubes Have Dimensions Part Customer Supplier 17 Attributes Ordered into Hierarchies Part 4 Part 3 Part 2 Part 1 Customer A Supplier 4 Customer B Supplier 3 Customer C Supplier 2 Customer D Supplier 1 18 Multiple Attributes in a Single Dimension Dimensions are organized as hierarchies of attributes. Example, the time dimension of Year, Month, Day Drill-down is viewing data at progressively finer detail. Example: Sales per year, then month, then day. Roll-up is viewing data in progressively less detail. Example: Sales per day, then month, then year. UP DOWN 19 Attribute Complexity Attribute complexity increases in the presence of hierarchies. Example, queries that group on time. These queries, (day), (month), (year), each represents a different granularity of the time dimension. (year) <= (month) <= (day) If we have total sales grouped by month, then we can use the results to compute the total sales grouped by year. 20 Query Dependencies Hierarchies introduce query dependencies that we must account for when determining what queries to materialize. Often, hierarchies are not total orders, but partial orders on the attributes that make up a dimension. Example: Months and years cannot be divided evenly into weeks. If we group by week then we can’t determine the grouping by month or year. (month) !<= (week), (week) !<= (month), and similarly for week and year. 21 Limit of the Visual Analogy Although mathematicians can project geometric shapes having at least 10 dimensions onto a flat surface beyond 3 dimensions, the visual analogy of a hypercube as a data structure stops working, even though the logic of the analogy remains perfectly valid. The construction of 4 dimensional hypercube on a flat surface makes it obvious why we don’t go beyond 3D representations. 22 Cuboids This is a single data cell. 23 Cuboids Any subset of a hypercube is a cuboid. 24 Cuboids Slice. 25 Cuboids Dice. 26 Cuboids Also a dice. 27 Drilling down. 28 Rolling up More Detail Less Detail Narrower Summaries Broader Summaries 29 Challenges Large databases Data in a rapid and constant state of flux, i.e., streaming data. Constraints: Time, RAM, computing power Data Cube Materialization is problematic. 30 Predicted % Change to Data Warehouse Feeds Instant Messaging Content Management Voice recognition Wikis OLAP OLTP Email Document Management Multimedia UNSTRUCTURED DATA XML RSS Web Pages EDI Web Log MultiDimensional Databases +80% Legacy Databases Spreadsheets Relational Databases Main Frame Databases Taxonomies, Ontologies STRUCTURED DATA -80% 31 Existing Methods Name: Data Source: SQL: ROLAP MOLAP HOLAP Relational OLAP Multi-Dimensional OLAP Hybrid OLAP Relational Database Hypercube ROLAP for Data Values MOLAP for Data Aggregates Yes No Yes 32 Disambiguation Symbol: Name: Pi Pi Use: Most often in Classical Geometry Arithmetic, Set Theory Function: Mathematical Constant Cartesian Product. Similar to summation, as indicated by the capital letter sigma: 3.14159 33 Bottleneck Number of Cells in a Data Cube Given a database with l number of attributes, the number of cells in the corresponding fully populated data cube is 1≤ i ≤ l (ai+1) where each attribute i has ai values. +1 The additional data cell accommodates the value “all”. 34 Cuboid Example Consider a network hypercube with 3 dimensions:1. Content 2. Source IP 3. Time Stamp Limit the hypercube to one million streams. Typically there are: 201 unique Content types 100 unique Source IPs the number of 275 unique Time-Stamps possible cells is: 3 ( Li 1) = 202 x 101 x 276 ≈ 5.8 million i 1 In actuality, the network data stream has 16 dimensions. 35 Data Expansion 20 Dimensions B 15 Dimensions 10 Dimensions 5 Dimensions 5 10 15 20 Unique Data Values per Dimension Number of Cells Note: Log Scale 36