
Historical Perspective

The Relational Model
• revolutionized transaction processing systems
• DBMS gave access to the data stored
• OLTP systems are good at putting data into databases

The data explosion
• increase in use of electronic data gathering devices e.g. point-of-sale, remote sensing devices etc.
• data storage became easier and cheaper with increasing computing power

Problems
• DBMS gave access to the data stored but no analysis of data
• analysis required to unearth the hidden relationships within the data i.e. for decision support
• size of databases has increased e.g. VLDBs, need automated techniques for analysis as they have grown beyond manual extraction

Obstacles
• typical scientific user knew nothing of commercial business applications
• the business database programmers knew nothing of massively parallel principles
• solution was for database software producers to create easy-to-use tools and form strategic relationships with hardware manufacturers

What is data mining?

"The non-trivial extraction of implicit, previously unknown, and potentially useful information from data"

William J. Frawley, Gregory Piatetsky-Shapiro and Christopher J. Matheus

Data mining is the analysis of data and the use of software techniques for finding patterns and regularities in sets of data.

The computer is responsible for finding the patterns by identifying the underlying rules and features in the data.

It is possible to "strike gold" in unexpected places as the data mining software extracts patterns not previously discernible or so obvious that no-one has noticed them before.

Mining analogy: large volumes of data are sifted in an attempt to find something worthwhile, just as in a mining operation large amounts of low-grade material are sifted through in order to find something of value.

Books:
• Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001, ISBN 1-55860-489-8.
• Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 1999, ISBN 1-55860-552-5.

Data Mining vs. DBMS

DBMS - queries based on the data held e.g.
• last month's sales for each product
• sales grouped by customer age etc.
• list of customers who lapsed their policy

Data Mining - infer knowledge from the data held to answer queries e.g.
• what characteristics do customers share who lapsed their policies and how do they differ from those who renewed their policies?
• why is the Cleveland division so profitable?

Characteristics of a data mining system

Large quantities of data
• volume of data so great it has to be analyzed by automated techniques e.g. POS, satellite information, credit card transactions etc.

Noisy, incomplete data
• imprecise data is characteristic of all data collection
• databases - usually contaminated by errors, cannot assume that the data they contain is entirely correct e.g. some attributes rely on subjective or measurement judgments

Complex data structure - conventional statistical analysis not possible

Heterogeneous data stored in legacy systems

Who needs data mining?

Who(ever) has information fastest and uses it wins

Don Keough, former president of Coca-Cola

Data Mining Applications

Medicine - drug side effects, hospital cost analysis, genetic sequence analysis, prediction etc.

Finance - stock market prediction, credit assessment, fraud detection etc.

Marketing/sales - product analysis, buying patterns, sales prediction, target mailing, identifying "unusual behavior" etc.

Knowledge Acquisition
• expert systems are models of real world processes
• much of the information is available straight from the process e.g. in production systems, data is collected for monitoring the system
• knowledge can be extracted using data mining tools
• experts can verify the knowledge

Engineering - automotive diagnostic expert systems, fault detection etc.

Data Mining Goals

Classification

DM system learns from examples or the data how to partition or classify the data i.e. it formulates classification rules

Example - customer database in a bank

Question - is a new customer applying for a loan a good investment or not?

Typical rule formulated: if STATUS = married and INCOME > 10000 and HOUSE_OWNER = yes then INVESTMENT_TYPE = good
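
The rule above can be written directly as a small function. This is only a sketch: the function name, the argument types, and the "unknown" result for customers the rule does not cover are assumptions, since the slide gives a single rule and says nothing about the other cases.

```python
def investment_type(status: str, income: float, house_owner: bool) -> str:
    """Apply the single example rule from the bank customer database.

    Returns "good" when the rule fires. The slide does not say what to
    conclude otherwise, so "unknown" is an assumed placeholder.
    """
    if status == "married" and income > 10000 and house_owner:
        return "good"
    return "unknown"


# A hypothetical new loan applicant.
result = investment_type(status="married", income=25000, house_owner=True)
```

In a real data mining system a rule like this would be learned from the training examples rather than written by hand; the function only shows what applying the learned rule looks like.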

Association

• rules that associate one attribute of a relation to another
• set oriented approaches are the most efficient means of discovering such rules

Example - supermarket database
• 72% of all the records that contain items A and B also contain item C
• the specific percentage of occurrences, 72, is the confidence factor of the rule
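
The confidence factor is the fraction of records containing the antecedent items (A and B) that also contain the consequent item (C). A minimal sketch of that calculation follows; the basket data and the function name are made up for illustration and give 75%, not the 72% quoted in the slide.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Records that contain every antecedent item.
    matching = [t for t in transactions if antecedent <= set(t)]
    if not matching:
        return 0.0
    # Of those, the records that also contain every consequent item.
    both = sum(1 for t in matching if consequent <= set(t))
    return both / len(matching)


# Hypothetical supermarket baskets.
baskets = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B"},
]

rule_confidence = confidence(baskets, {"A", "B"}, {"C"})
```

Four baskets contain both A and B, and three of those also contain C, so the rule "A and B implies C" has a confidence of 3/4 on this data.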

Sequence/Temporal

• sequential pattern functions analyze collections of related records and detect frequently occurring patterns over a period of time
• difference between sequence rules and other rules is the temporal factor

Example - retailer's database
• can be used to discover the set of purchases that frequently precedes the purchase of a microwave oven
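
The microwave oven example can be sketched by counting, per customer, which items were bought before the first purchase of the target item. This is a simplification under stated assumptions: it counts single items rather than sets of purchases, and the purchase histories and function name are invented for illustration.

```python
from collections import Counter


def items_preceding(histories, target):
    """Count, across customers, the items bought before the first `target` purchase."""
    counts = Counter()
    for purchases in histories:  # one time-ordered list per customer
        if target not in purchases:
            continue
        first = purchases.index(target)
        # Count each preceding item once per customer.
        counts.update(set(purchases[:first]))
    return counts


# Hypothetical time-ordered purchase histories, one list per customer.
histories = [
    ["kettle", "toaster", "microwave oven"],
    ["toaster", "microwave oven", "blender"],
    ["kettle", "blender"],
    ["toaster", "kettle", "microwave oven"],
]

preceding = items_preceding(histories, "microwave oven")
```

The order of the records is what matters here: the blender in the second history is ignored because it was bought after the microwave oven, which is the temporal factor that separates sequence rules from association rules.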

Data Mining and Machine Learning
• Data Mining (DM) or Knowledge Discovery in Databases (KDD) is about finding understandable knowledge
• Machine Learning (ML) is concerned with improving performance of an agent
  • training a neural network to balance a pole is part of ML, but not of KDD
• efficiency of the algorithm and scalability is more important in DM or KDD
  • DM is concerned with very large, real-world databases
  • ML typically looks at smaller data sets
• ML has laboratory type examples for the training set
• DM deals with "real world" data. Real world data tend to have problems such as:
  • missing values
  • dynamic data
  • noise

Statistical Data Analysis
• ill-suited for nominal and structured data types
• completely data driven - incorporation of domain knowledge not possible
• interpretation of results is difficult and daunting
• requires expert user guidance

Stages of the Data Mining Process

Data pre-processing
• heterogeneity resolution
• data cleansing
• data warehousing

Applying Data Mining Tools: extraction of patterns from the pre-processed data

Interpretation and evaluation: the user bias can direct DM tools to areas of interest
• attributes of interest in databases
• goal of discovery
• domain knowledge
• prior knowledge or belief about the domain

Techniques

Machine Learning methods

Statistics: can be used in several data mining stages
• data cleansing i.e. the removal of erroneous or irrelevant data
• EDA, exploratory data analysis e.g. frequency counts, histograms etc.
• data selection - sampling facilities reduce the scale of computation
• attribute re-definition
• data analysis - measures of association and relationships between attributes, interestingness of rules, classification etc.
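
Two of the statistical steps listed above, data cleansing and frequency counts for EDA, can be sketched with the standard library. The records, field names, and the choice to treat None or an empty string as erroneous are all assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical insurance records; None or "" marks a missing value.
records = [
    {"age_band": "18-30", "lapsed": "yes"},
    {"age_band": "31-50", "lapsed": "no"},
    {"age_band": None, "lapsed": "no"},
    {"age_band": "18-30", "lapsed": "no"},
    {"age_band": "31-50", "lapsed": ""},
]


def cleanse(rows, required):
    """Data cleansing: drop rows where any required attribute is missing."""
    return [
        row for row in rows
        if all(row.get(key) not in (None, "") for key in required)
    ]


clean = cleanse(records, ["age_band", "lapsed"])

# EDA: frequency count of one attribute over the cleansed rows.
age_counts = Counter(row["age_band"] for row in clean)
```

Dropping incomplete rows is only one cleansing policy; depending on the application, missing values might instead be filled in or flagged.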

Visualization: enhances EDA, makes patterns more visible

Clustering (Cluster Analysis)
• clustering and segmentation is basically partitioning the database so that each partition or group is similar according to some criteria or metric
• clustering according to similarity is a concept which appears in many disciplines e.g. in chemistry the clustering of molecules
• data mining applications make use of clustering according to similarity e.g. to segment a client/customer base
• it provides sub-groups of a population for further analysis or action - very important when dealing with very large databases
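
As an illustration of partitioning by similarity, the sketch below segments customers by a single numeric attribute. The slides do not name a clustering algorithm; k-means is used here only as one familiar similarity-based method, and the spend figures, starting centers, and function name are assumptions.

```python
def k_means_1d(values, centers, iterations=10):
    """Very small 1-D k-means: assign each value to its nearest center, then re-average."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            # The similarity metric is the absolute distance to each center.
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical monthly spend per customer.
spend = [12, 15, 14, 90, 95, 88]
centers, clusters = k_means_1d(spend, centers=[0.0, 100.0])
```

On this data the customers fall into a low-spend group and a high-spend group, each of which could then be passed on for further analysis or action.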

Knowledge Representation Methods

Neural Networks

• a trained neural network can be thought of as an "expert" in the category of information it has been given to analyze
• provides projections given new situations of interest and answers "what if" questions
• problems include:
  • the resulting network is viewed as a black box
  • no explanation of the results is given i.e. difficult for the user to interpret the results
  • difficult to incorporate user intervention
  • slow to train due to their iterative nature

Decision trees

• used to represent knowledge
• built using a training set of data and can then be used to classify new objects
• problems are:
  • opaque structure - difficult to understand
  • missing data can cause performance problems
  • they become cumbersome for large data sets

Rules

• probably the most common form of representation
• tend to be simple and intuitive
• unstructured and less rigid
• problems are:
  • difficult to maintain
  • inadequate to represent many types of knowledge
• example format: if X then Y

Related Technologies: Data Warehousing

Definition

A data warehouse can be defined as any centralized data repository which can be queried for business benefit

Warehousing makes it possible to:
• extract archived operational data
• overcome inconsistencies between different legacy data formats
• integrate data throughout an enterprise, regardless of location, format, or communication requirements
• incorporate additional or expert information

Characteristics of a data warehouse
• subject-oriented - data organized by subject instead of application e.g. an insurance company would organize their data by customer, premium, and claim, instead of by different products (auto, life, etc.)
  • contains only the information necessary for decision support processing
• integrated - encoding of data is often inconsistent e.g. gender might be coded as "m" and "f" or 0 and 1, but when data are moved from the operational environment into the data warehouse they assume a consistent coding convention
• time-variant - the data warehouse is a place for storing data that are five to 10 years old, or older
  • this data is used for comparisons, trends, and forecasting
  • these data are not updated
• non-volatile
  • data are not updated or changed in any way once they enter the data warehouse
  • data are only loaded and accessed
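
The "integrated" characteristic, where inconsistent operational codings are mapped to one convention on the way into the warehouse, might look like the sketch below. The source system names, the target codes "M"/"F"/"UNKNOWN", and the reading of 0 as male and 1 as female are all assumptions; the slides only say that such codings differ between systems.

```python
# Hypothetical per-source mappings from operational codes to the warehouse convention.
SOURCE_MAPPINGS = {
    "policy_system": {"m": "M", "f": "F"},
    "claims_system": {0: "M", 1: "F"},  # assumed meaning of 0 and 1
}


def to_warehouse_gender(source, raw_code):
    """Translate one operational gender code into the warehouse coding."""
    try:
        return SOURCE_MAPPINGS[source][raw_code]
    except KeyError:
        # Unrecognized source or code: keep the row but flag the value.
        return "UNKNOWN"


loaded = [
    to_warehouse_gender("policy_system", "f"),
    to_warehouse_gender("claims_system", 0),
    to_warehouse_gender("claims_system", 7),
]
```

Keeping one mapping per source system reflects the point that the same raw code can mean different things in different legacy formats.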

Data Warehousing Processes
• insulate data - i.e. the current operational information
  • preserves the security and integrity of mission-critical OLTP applications
  • gives access to the broadest possible base of data
• retrieve data - from a variety of heterogeneous operational databases
  • data is transformed and delivered to the data warehouse/store based on a selected model (or mapping definition)
  • metadata - information describing the model and definition of the source data elements
• data cleansing - removal of certain aspects of operational data, such as low-level transaction information, which slow down the query times
• transfer - processed data transferred to the data warehouse, a large database on a high performance box

Criteria for a data warehouse

Load Performance
• require incremental loading of new data on a periodic basis
• must not artificially constrain the volume of data

Load Processing
• data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and metadata update

Data Quality Management
• ensure local consistency, global consistency, and referential integrity despite "dirty" sources and massive database size

Query Performance
• must not be slowed or inhibited by the performance of the data warehouse RDBMS

Terabyte Scalability
• data warehouse sizes are growing at astonishing rates so RDBMS must not have any architectural limitations
• it must support modular and parallel management

Mass User Scalability
• access to warehouse data must not be limited to the elite few
• has to support hundreds, even thousands, of concurrent users while maintaining acceptable query performance

Networked Data Warehouse
• data warehouses rarely exist in isolation
• users must be able to look at and work with multiple warehouses from a single client workstation

Warehouse Administration
• large scale and time-cyclic nature of the data warehouse demands administrative ease and flexibility

The RDBMS must Integrate Dimensional Analysis
• dimensional support must be inherent in the warehouse RDBMS to provide the highest performance for relational OLAP tools

Advanced Query Functionality
• end users require advanced analytic calculations, sequential and comparative analysis, and consistent access to detailed and summarized data

Data warehousing vs. OLTP

OLTP systems designed to maximize transaction capacity but they:
• cannot be repositories of facts and historical data for business analysis
• cannot quickly answer ad hoc queries
• rapid retrieval is almost impossible
• data is inconsistent and changing, duplicate entries exist, entries can be missing
• OLTP offers large amounts of raw data which is not easily understood

Typical OLTP query is a simple aggregation e.g.
• what is the current account balance for this customer?

Data warehouses are interested in query processing as opposed to transaction processing

Typical business analysis query e.g.
• which product line sells best in middle-America and how does this correlate to demographic data?

OLAP (On-line Analytical Processing)
• problem is how to process larger and larger databases
• OLAP involves many data items (many thousands or even millions) which are involved in complex relationships
• fast response is crucial in OLAP

Difference between OLAP and OLTP
• OLTP servers handle mission-critical production data accessed through simple queries
• OLAP servers handle management-critical data accessed through an iterative analytical investigation

OLAP operations
• Consolidation - involves the aggregation of data i.e. simple roll-ups or complex expressions involving inter-related data e.g. sales offices can be rolled-up to districts and districts rolled-up to regions
• Drill-Down - can go in the reverse direction i.e. automatically display detail data which comprises consolidated data
• "Slicing and Dicing" - ability to look at the database from different viewpoints e.g. one slice of the sales database might show all sales of product type within regions; another slice might show all sales by sales channel within each product type
  • often performed along a time axis in order to analyze trends and find patterns
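
The three OLAP operations can be sketched over a tiny in-memory sales table. The rows, column names, and helper functions are hypothetical; a real OLAP server would work against a dimensional store rather than Python lists.

```python
from collections import defaultdict

# Hypothetical sales facts with an office -> district -> region hierarchy.
sales = [
    {"region": "North", "district": "N1", "office": "A", "channel": "retail", "amount": 100},
    {"region": "North", "district": "N1", "office": "B", "channel": "online", "amount": 50},
    {"region": "North", "district": "N2", "office": "C", "channel": "retail", "amount": 70},
    {"region": "South", "district": "S1", "office": "D", "channel": "online", "amount": 30},
    {"region": "South", "district": "S1", "office": "E", "channel": "retail", "amount": 20},
]


def roll_up(rows, level):
    """Consolidation: total the amounts at the given level of the hierarchy."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[level]] += row["amount"]
    return dict(totals)


def slice_by(rows, **fixed):
    """Slicing: keep only the rows matching the fixed attribute values."""
    return [row for row in rows if all(row[k] == v for k, v in fixed.items())]


by_district = roll_up(sales, "district")  # offices rolled up to districts
by_region = roll_up(sales, "region")      # districts rolled up to regions

# Drill-down: show the district detail that makes up the North total.
north_detail = roll_up(slice_by(sales, region="North"), "district")

# Slice: retail channel only, viewed by region.
retail_by_region = roll_up(slice_by(sales, channel="retail"), "region")
```

Drill-down is expressed here as a slice followed by a roll-up at a finer level, which mirrors the description of it as consolidation in the reverse direction.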