Introduction to Data Mining

Dr. Hany Saleeb

Why Data Mining? — Potential Applications

- Direct marketing
  - Identify which prospects should be included in a mailing list
- Market segmentation
  - Identify common characteristics of customers who buy the same products
- Market basket analysis
  - Identify which products are likely to be bought together
- Insurance claims analysis
  - Discover patterns of fraudulent transactions
  - Compare current transactions against those patterns

What Is Data Mining?

- Combination of AI and statistical analysis to discover information that is “hidden” in the data
  - Associations (e.g. linking the purchase of pizza with beer)
  - Sequences (e.g. tying events together: marriage and the purchase of furniture)
  - Classifications (e.g. recognizing patterns such as the attributes of employees who are most likely to quit)
  - Forecasting (e.g. predicting buying habits of customers based on past patterns)
- Expert systems or small ML/statistical programs
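
An association such as "pizza with beer" is usually quantified with support (how often the items appear together) and confidence (how often the second item appears given the first). The sketch below shows one common way to compute both; the baskets and the function names `support` and `confidence` are invented for illustration and do not come from the slides.

```python
# Hypothetical toy baskets, invented for illustration only.
baskets = [
    {"pizza", "beer"},
    {"pizza", "beer", "chips"},
    {"pizza", "soda"},
    {"beer", "chips"},
    {"pizza", "beer"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) over the baskets.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

# Support of {pizza, beer}: 3 of the 5 baskets contain both items.
pizza_beer_support = support({"pizza", "beer"}, baskets)

# Confidence of the rule pizza -> beer: 3 of the 4 pizza baskets also hold beer.
pizza_to_beer_conf = confidence({"pizza"}, {"beer"}, baskets)

print(pizza_beer_support, pizza_to_beer_conf)
```

A rule is normally reported only when both values exceed thresholds chosen by the analyst, which keeps rare or weak co-occurrences out of the result.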

What can data mining do?

- Classification
  - Classify credit applicants as low, medium, or high risk
  - Classify insurance claims as normal or suspicious
- Estimation
  - Estimate the probability of a direct mailing response
  - Estimate the lifetime value of a customer
- Prediction
  - Predict which customers will leave within six months
  - Predict the size of the balance that will be transferred by a credit card prospect
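
To make the classification task concrete, here is a minimal sketch that labels a credit applicant as low, medium, or high risk by majority vote among the most similar past applicants (k-nearest neighbors). The slides do not prescribe a method, so the technique, the two features, and all data values are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical labeled applicants: (income in $1000s, debt ratio) -> risk class.
training = [
    ((85, 0.10), "low"),
    ((90, 0.15), "low"),
    ((78, 0.20), "low"),
    ((55, 0.35), "medium"),
    ((60, 0.40), "medium"),
    ((50, 0.30), "medium"),
    ((25, 0.70), "high"),
    ((30, 0.65), "high"),
    ((20, 0.80), "high"),
]

def classify(applicant, training, k=3):
    """Label an applicant by majority vote among the k nearest training rows."""
    def distance(a, b):
        # Income is divided by 100 so both features span comparable ranges.
        return math.hypot((a[0] - b[0]) / 100.0, a[1] - b[1])

    nearest = sorted(training, key=lambda row: distance(applicant, row[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

new_risk = classify((58, 0.38), training)
print(new_risk)
```

Estimation and prediction differ mainly in the output: instead of a class label they return a number, such as a response probability or a balance amount.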

What can data mining do? (cont’d)

- Association
  - Find out which items customers are likely to buy together
  - Find out which books to recommend to Amazon.com users
- Clustering
  - Difference from classification: the classes are unknown!
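
Because clustering starts without class labels, the algorithm has to discover the groups itself. The sketch below uses k-means (Lloyd's algorithm) on one-dimensional data as one possible illustration; the spending figures, the starting centers, and the function name `kmeans_1d` are assumptions, not part of the original slides.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centers."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Assign each value to its closest current center.
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[idx].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical monthly spending figures; no class labels are supplied.
spending = [12, 15, 14, 80, 85, 90, 13, 88]
centers, groups = kmeans_1d(spending, centers=[0, 100])
print(centers, groups)
```

The two groups that emerge could later be named (for example "light" and "heavy" spenders), but those names are an interpretation added by the analyst after the fact.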

Market Analysis and Management

- Where are the data sources for analysis?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
- Determine customer purchasing patterns over time
  - Conversion of a single to a joint bank account: marriage, etc.
- Cross-market analysis
  - Associations/correlations between product sales
  - Prediction based on the association information
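
One simple way to measure a correlation between the sales of two products is the Pearson coefficient, sketched below. The weekly sales figures and product names are hypothetical, and the slides do not say which correlation measure is intended.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)

# Hypothetical weekly unit sales for three products.
sunscreen = [10, 20, 30, 40]
sunglasses = [12, 22, 29, 41]
umbrellas = [40, 30, 20, 10]

r_same_direction = pearson(sunscreen, sunglasses)   # close to +1
r_opposite = pearson(sunscreen, umbrellas)          # close to -1
print(r_same_direction, r_opposite)
```

A strong positive or negative coefficient suggests that the sales of one product can help predict the other, which is the "prediction based on the association information" step.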

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining shown at the center of the contributing disciplines: database technology, statistics, machine learning, visualization, information science, and other disciplines]

Data Mining: On What Kind of Data?

- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB and information repositories
  - Object-oriented and object-relational databases
  - Spatial databases
  - Time-series data and temporal data
  - Text databases and multimedia databases
  - Heterogeneous and legacy databases
  - WWW

Data Mining Process

[Figure: data mining process diagram. Labels: learning, collecting relevant data, model building, understanding of business, problem identification, action, business strategy and evaluation]

Requirements/challenges in Data Mining

- User interface
- Mining methodology
- Performance
- Data source
- Social and security

Requirements/challenges in Data Mining (2)

- User interface
  - Data visualization
    - Understandability and interpretation of results
    - Information representation and rendering
    - Screen real estate
  - Interactivity
    - Manipulation of mined knowledge
    - Focus and refine mining tasks
    - Focus and refine mining results

Requirements/challenges in Data Mining (3)

- Mining methodology
  - Mining different kinds of knowledge in databases
  - Interactive mining of knowledge at multiple levels of abstraction
  - Incorporation of background knowledge
  - Query languages
  - Expression and visualization of results
  - Handling noise and incomplete data
  - Pattern evaluation

Requirements/challenges in Data Mining (4)

- Performance
  - Efficiency and scalability of data mining algorithms
  - Linear algorithms needed
  - Parallel and distributed methods
  - Incremental methods
  - Divide and conquer?
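
An incremental method updates its result as each new record arrives instead of rescanning the whole data set. As a small illustration, the sketch below maintains a running mean and variance with Welford's update; the class name `RunningStats` and the transaction amounts are assumptions made for this example.

```python
class RunningStats:
    """Incrementally maintained count, mean and variance (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the statistics in constant time."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for amount in [10.0, 12.0, 14.0]:  # hypothetical transaction amounts
    stats.update(amount)

print(stats.n, stats.mean, stats.variance)
```

Each update costs the same regardless of how much data has already been seen, so the total work grows linearly with the number of records, which is the kind of scaling the performance requirements ask for.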

Requirements/challenges in Data Mining (5)

- Data source
  - Diversity of data types
    - Handling complex types of data
    - Mining information from heterogeneous databases or information repositories
    - Can we expect a DM algorithm to do well on all types of data?
  - Data glut
    - Are we collecting the right data for the right answer?
    - Distinguish between important and unimportant data

Requirements/challenges in Data Mining (6)

- Social and security
  - Social impact
    - Private and sensitive data is gathered and mined without the individual’s knowledge and/or consent
    - Appropriate use and distribution of discovered knowledge
  - Regulations
    - Need for privacy and DM policies

Data Mining Tools

DBMiner: A free tool

- DBMiner: a data mining system that originated in the Intelligent Database Systems Lab and was further developed by DBMiner Technology Inc.
- OLAM (on-line analytical mining) architecture for interactive mining of multi-level knowledge in both RDBMS and data warehouses
- Mining knowledge on Microsoft SQL Server 7.0 databases and/or data warehouses
- Multiple mining functions: discovery-driven OLAP, association, classification and clustering

Input and Output

- Input: SQL Server 7.0 data cubes, which are constructed from single or multiple relational tables, data warehouses or spreadsheets (with OLE DB and RDBMS connections)
- Multiple outputs
  - Summarization and discovery-driven OLAP: crosstabs and graphical outputs using MS Excel 2000
  - Association: rule tables, rule planes and ball graphs
  - Classification: decision trees and decision tables
  - Clustering: maps and summarization graphs
  - Others:
    - Data and cube views
    - Visualization of concept hierarchies
    - Visualization for task management
    - Visualization of 2-D and 3-D boxplots

Data Mining Tasks

- DBMiner covers the following functions
  - Discovery-driven, OLAP-based multi-dimensional analysis
  - Association and frequent pattern analysis
  - Classification (decision tree analysis)
  - Cluster analysis
  - 3-D cube viewer and analyzer
- Other functions
  - OLAP service, cube exploration, statistical analysis
  - Sequential pattern analysis
  - Visual classification

Summary

- Knowing one’s business is critical; technologies are coming together to support data mining.
- Data mining is the process and result of knowledge production, knowledge discovery and knowledge management.