Data, Data everywhere yet ... We can’t find the data we need  data is scattered over the network  We can’t get the data we.

Download Report

Transcript Data, Data everywhere yet ... We can’t find the data we need  data is scattered over the network  We can’t get the data we.

Data, Data everywhere yet ...

We can’t find the data we need data is scattered over the network   We can’t get the data we need  need an expert to get the data  We can’t understand the data we found  available data is poorly documented  We can’t use the data we found  data needs to be transformed from one form to other 2

What is Data Warehouse?

Definition by Inmon  “A data warehouse is a subject-oriented , and non-volatile integrated, time-variant collection of data in support of management’s , decision-making process”

Data Warehouse —Subject-Oriented

Organized around major subjects, such as customer, product, sales

Data Warehouse —Integrated

Constructed by integrating multiple, heterogeneous data sources  relational databases, flat files, on-line transaction records Data cleaning and data integration techniques are applied   Ensure consistency in naming conventions, attribute measures, etc. among different data sources When data is moved to the warehouse, it is converted

Data Warehouse —Time Variant

The time horizon for the data warehouse is significantly longer than that of operational systems  Operational database: current value data  Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

Data Warehouse —Non-Volatile

Operational update of data does not occur in the data warehouse environment  Requires only two operations in data accessing:

initial loading of data

and

access of data

Data Warehouse vs. Operational DBMS

OLTP (On-Line Transaction Processing)   Major task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

OLAP (On-Line Analytical Processing)   Major task of data warehouse system Data analysis and decision making

From Tables and Spreadsheets to Data Cubes

A data warehouse is based on  multidimensional data model which views data in the form of a data cube A data cube allows data to be modeled and viewed in multiple dimensions ( such as sales )   Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) Fact table contains measures (such as dollars_sold ) and keys to each of the related dimension tables

Conceptual Modeling of Data Warehouses Modeling data warehouses: dimensions & measures  Star schema A fact table in the middle connected to a set of dimension tables  Snowflake schema A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake  Fact constellations Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Time time_key day day_of_the_week month quarter year

Example of Star Schema

Sales Fact Table Branch branch_key branch_name branch_type Time_key Item_key Branch_key Location_key Unit_sold Euros_sold Avg_sales Item item_key item_name brand type supplier_type Location location_key street city province_or_street country Measures

Time time_key day day_of_the_week month quarter year Branch branch_key branch_name branch_type

Example of Snowflake Schema

Sales Fact Table Time_key Item_key Branch_key Location_key Unit_sold Euros_sold Avg_sales Item item_key item_name brand type supplier_key Location location_key street city_key Measures Supplier supplier_key supplier_type City city_key city province_or_street country

Time time_key day day_of_the_week month quarter year Branch branch_key branch_name branch_type Measures

Example of Fact Constellation

Shipping Fact Table Sales Fact Table Time_key Item_key Branch_key Location_key Unit_sold Euros_sold Avg_sales Item item_key item_name brand type supplier_key Location location_key street city Province/street country Time_key Item_key shipper_key from_location to_location Euros_sold unit_shipped shipper shipper_key shipper_name location_key shipper_type

A Sample Data Cube

PC VCR

sum

TV 1Qtr 2Qtr

Date

3Qtr 4Qtr

sum

Total annual sales of TV in Ireland

Ireland France Germany

sum

Typical OLAP Operations Roll up (drill-up): summarize data 

by climbing up hierarchy or by dimension reduction

Drill down (roll down): reverse of roll-up 

from higher level summary to lower level summary or detailed data, or introducing new dimensions

Slice and dice 

project and select

Pivot (rotate) 

reorient the cube, visualization, 3D to series of 2D planes.

Data Warehouse Architecture

Relational Databases ERP Systems Purchased Data Extraction Cleansing Optimized Loader Data Warehouse Engine Analyze Query Legacy Data 16 Metadata Repository

Data Warehouse Architecture

Data Extraction

- Data Extraction involves gathering the data from multiple heterogeneous sources.

Data Cleaning

- Data Cleaning involves finding and correcting the errors in data.

Data Transformation

- Data Transformation involves converting data from legacy format to warehouse format.

Data Loading

- Data Loading involves sorting, summarizing, consolidating, checking integrity and building indices and partitions.

Refreshing

- Refreshing involves updating from data sources to warehouse.

Data Warehouse Models

Enterprise warehouse  collects all of the information about subjects spanning the entire organization Data Mart  a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart

Introduction to Data Mining

What Motivated Data Mining?

We are drowning in data, but starving for knowledge!

What Is Data Mining?

Data mining (knowledge discovery from data)  Extraction of interesting (implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data Alternative names  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

21

Why Data Mining?

—Potential Applications

Data analysis and decision support    Market analysis and management  Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation Risk analysis and management  Forecasting, customer retention, quality control, competitive analysis Fraud detection and detection of unusual patterns (outliers) 22

Integration of Multiple Technologies

Machine Learning Artificial Intelligence Database Management Statistics Visualization Algorithms Data Mining

23

What Can Data Mining Do?

Cluster Classify  Categorical, Regression Summarize  Summary statistics, Summary rules Link Analysis / Model Dependencies  Association rules Detect Deviations 24

Clustering Find groups of similar data items

“Group people with similar travel profiles”  George, Patricia   Jeff, Evelyn, Chris Rob Clusters 25

Classification

Find ways to separate data items into pre defined groups A bank loan officer wants to analyse the data in order to know which customer (loan applicant) are risky or which are safe.

Training Data

tool produces

Groups

classifier

26

Association Rules

Identify dependencies in the data:  X makes Y likely Indicate significance of each dependency “Find groups of items commonly purchased together”  People who purchase X are likely to purchase Y 27

Deviation Detection

Find unexpected values, Uses: Failure analysis Anomaly discovery for analysis “Find unusual occurrences in stock prices” 28

Knowledge Discovery (KDD) Process

Pattern Evaluation

 Data mining —core of knowledge discovery process

Data Mining Task-relevant Data Data Warehouse Data Cleaning Data Integration Selection Databases

Knowledge Process

1.

2.

3.

4.

5.

6.

7.

Data cleaning

data – to remove noise and inconsistent

Data integration

– to combine multiple source

Data selection

– to retrieve relevant data for analysis

Data transformation

– to transform data into appropriate form for data mining

Data mining Evaluation Knowledge presentation

Knowledge Process

Although data mining is only one step in the entire process, it is an essential one since it uncovers hidden patterns for evaluation

Knowledge Process

Based on this view, the architecture of a typical data mining system may have the following major components:      Database, data warehouse, world wide web, or other information repository Database or data warehouse server Data mining engine Pattern evaluation model User interface