Data, Data everywhere yet ... We can’t find the data we need data is scattered over the network We can’t get the data we.
Download ReportTranscript Data, Data everywhere yet ... We can’t find the data we need data is scattered over the network We can’t get the data we.
Data, Data everywhere yet ...
We can’t find the data we need data is scattered over the network We can’t get the data we need need an expert to get the data We can’t understand the data we found available data is poorly documented We can’t use the data we found data needs to be transformed from one form to other 2
What is Data Warehouse?
Definition by Inmon “A data warehouse is a subject-oriented , and non-volatile integrated, time-variant collection of data in support of management’s , decision-making process”
Data Warehouse —Subject-Oriented
Organized around major subjects, such as customer, product, sales
Data Warehouse —Integrated
Constructed by integrating multiple, heterogeneous data sources relational databases, flat files, on-line transaction records Data cleaning and data integration techniques are applied Ensure consistency in naming conventions, attribute measures, etc. among different data sources When data is moved to the warehouse, it is converted
Data Warehouse —Time Variant
The time horizon for the data warehouse is significantly longer than that of operational systems Operational database: current value data Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Data Warehouse —Non-Volatile
Operational update of data does not occur in the data warehouse environment Requires only two operations in data accessing:
initial loading of data
and
access of data
Data Warehouse vs. Operational DBMS
OLTP (On-Line Transaction Processing) Major task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (On-Line Analytical Processing) Major task of data warehouse system Data analysis and decision making
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on multidimensional data model which views data in the form of a data cube A data cube allows data to be modeled and viewed in multiple dimensions ( such as sales ) Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) Fact table contains measures (such as dollars_sold ) and keys to each of the related dimension tables
Conceptual Modeling of Data Warehouses Modeling data warehouses: dimensions & measures Star schema A fact table in the middle connected to a set of dimension tables Snowflake schema A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake Fact constellations Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
Time time_key day day_of_the_week month quarter year
Example of Star Schema
Sales Fact Table Branch branch_key branch_name branch_type Time_key Item_key Branch_key Location_key Unit_sold Euros_sold Avg_sales Item item_key item_name brand type supplier_type Location location_key street city province_or_street country Measures
Time time_key day day_of_the_week month quarter year Branch branch_key branch_name branch_type
Example of Snowflake Schema
Sales Fact Table Time_key Item_key Branch_key Location_key Unit_sold Euros_sold Avg_sales Item item_key item_name brand type supplier_key Location location_key street city_key Measures Supplier supplier_key supplier_type City city_key city province_or_street country
Time time_key day day_of_the_week month quarter year Branch branch_key branch_name branch_type Measures
Example of Fact Constellation
Shipping Fact Table Sales Fact Table Time_key Item_key Branch_key Location_key Unit_sold Euros_sold Avg_sales Item item_key item_name brand type supplier_key Location location_key street city Province/street country Time_key Item_key shipper_key from_location to_location Euros_sold unit_shipped shipper shipper_key shipper_name location_key shipper_type
A Sample Data Cube
PC VCR
sum
TV 1Qtr 2Qtr
Date
3Qtr 4Qtr
sum
Total annual sales of TV in Ireland
Ireland France Germany
sum
Typical OLAP Operations Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
from higher level summary to lower level summary or detailed data, or introducing new dimensions
Slice and dice
project and select
Pivot (rotate)
reorient the cube, visualization, 3D to series of 2D planes.
Data Warehouse Architecture
Relational Databases ERP Systems Purchased Data Extraction Cleansing Optimized Loader Data Warehouse Engine Analyze Query Legacy Data 16 Metadata Repository
Data Warehouse Architecture
Data Extraction
- Data Extraction involves gathering the data from multiple heterogeneous sources.
Data Cleaning
- Data Cleaning involves finding and correcting the errors in data.
Data Transformation
- Data Transformation involves converting data from legacy format to warehouse format.
Data Loading
- Data Loading involves sorting, summarizing, consolidating, checking integrity and building indices and partitions.
Refreshing
- Refreshing involves updating from data sources to warehouse.
Data Warehouse Models
Enterprise warehouse collects all of the information about subjects spanning the entire organization Data Mart a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart
Introduction to Data Mining
What Motivated Data Mining?
We are drowning in data, but starving for knowledge!
What Is Data Mining?
Data mining (knowledge discovery from data) Extraction of interesting (implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data Alternative names Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
21
Why Data Mining?
—Potential Applications
Data analysis and decision support Market analysis and management Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation Risk analysis and management Forecasting, customer retention, quality control, competitive analysis Fraud detection and detection of unusual patterns (outliers) 22
Integration of Multiple Technologies
Machine Learning Artificial Intelligence Database Management Statistics Visualization Algorithms Data Mining
23
What Can Data Mining Do?
Cluster Classify Categorical, Regression Summarize Summary statistics, Summary rules Link Analysis / Model Dependencies Association rules Detect Deviations 24
Clustering Find groups of similar data items
“Group people with similar travel profiles” George, Patricia Jeff, Evelyn, Chris Rob Clusters 25
Classification
Find ways to separate data items into pre defined groups A bank loan officer wants to analyse the data in order to know which customer (loan applicant) are risky or which are safe.
Training Data
tool produces
Groups
classifier
26
Association Rules
Identify dependencies in the data: X makes Y likely Indicate significance of each dependency “Find groups of items commonly purchased together” People who purchase X are likely to purchase Y 27
Deviation Detection
Find unexpected values, Uses: Failure analysis Anomaly discovery for analysis “Find unusual occurrences in stock prices” 28
Knowledge Discovery (KDD) Process
Pattern Evaluation
Data mining —core of knowledge discovery process
Data Mining Task-relevant Data Data Warehouse Data Cleaning Data Integration Selection Databases
Knowledge Process
1.
2.
3.
4.
5.
6.
7.
Data cleaning
data – to remove noise and inconsistent
Data integration
– to combine multiple source
Data selection
– to retrieve relevant data for analysis
Data transformation
– to transform data into appropriate form for data mining
Data mining Evaluation Knowledge presentation
Knowledge Process
Although data mining is only one step in the entire process, it is an essential one since it uncovers hidden patterns for evaluation
Knowledge Process
Based on this view, the architecture of a typical data mining system may have the following major components: Database, data warehouse, world wide web, or other information repository Database or data warehouse server Data mining engine Pattern evaluation model User interface