Transcript Chapter 1

Chapter 1
Why & What is Data Mining?
Based on
Data Mining Techniques (2nd Ed.), Berry and Linoff, 2004, Wiley.
Slides by Prof. Norman of the National University in La Jolla, CA.
Adapted by Peter Auer.
What, Who
Data Mining – Definition & Goal
• Definition
– DM is the exploration and analysis of large quantities
of data in order to discover meaningful patterns and
rules.
• Goal
– To allow an “enterprise”* to IMPROVE its ______
through better understanding of its ______ .
– Potential for Competitive Advantage.
* Synonyms include: corporation, firm, non-profit organization, government agency
2
Foundations of Data Mining
• Data mining is the process of using “raw” data to infer
important “business” relationships.
• Despite a consensus on the value of data mining, a
great deal of confusion exists about what it is.
• Data Mining is a collection of powerful techniques
intended for analyzing large amounts of data.
• There is no single data mining approach, but rather a set
of techniques that can be used standalone or in
combination with each other.
3
How
Customer Relationship Management (CRM)
4
Customer Relationship
Management (CRM)
In order to form a learning relationship with its
customers, an enterprise (firm) must be able to:
1. Notice – what its customers are doing
2. Remember – what it and its customers have
done over time
3. Learn – from what it has remembered
4. Act On – what it has learned to make
customers more profitable
5
How
Based on “Transaction” Data
6
Definitions of a Data Warehouse
1. “A subject-oriented, integrated, time-variant and
non-volatile collection of data in support of
management's decision making process”
- W.H. Inmon
2. “A copy of transaction data, specifically
structured for query and analysis”
- Ralph Kimball
7
Data Warehouse
• For organizational learning to take place, data
from many sources must be gathered together
and organized in a consistent and useful way –
hence, Data Warehousing (DW)
• DW allows an organization (enterprise) to
remember what it has noticed about its data
• Data Mining techniques make use of the data in
a DW
8
Data Warehouse
[Figure: the enterprise “database” (customers, orders, transactions, vendors, etc.) is copied, organized, and summarized into the data warehouse, which supports data mining]
9
Data Warehouse
• Data, data, data…everywhere!
• Information…that’s another story!
• Especially, the right information @ the right time!
• Data warehousing’s goal is to make the right
information available @ the right time
• Data warehousing is a data store (e.g., a
database of some sort) and a process for
bringing together disparate data from throughout
an organization for decision-support purposes
10
Data warehousing
• Data warehouses are natural allies for
data mining (work together well)
• Data mining can help fulfill part of the
goal of data warehouses – the right
information @ the right time
• Relational database management systems
(RDBMS), such as Oracle, DB2, Sybase,
Informix, Focus, SQL Server, etc. are often
used for data warehousing
11
Data of Different Kinds
12
Transaction (Operational) Data
• Operational (production) systems create massive
numbers of transactions, such as sales, purchases,
deposits, withdrawals, returns, refunds, phone calls, toll
roads, web site “hits”, etc…
• Transactions are the base level of data – the raw
material for understanding customer behavior
• Unfortunately, operational systems change due to
changing business needs
• Fortunately, operational systems can usually be changed
to support changing business needs
• Data warehousing strategies need to be aware of
operational system changes
13
Operational Summary Data
Summaries are for a
specific time period
and utilize the
transaction data for
that time period
Other Examples???
14
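As a rough illustration of the slide above, the sketch below rolls transaction records up into one summary per month. The records, the field layout, and the function name `summarize_by_month` are invented for the example and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical transaction data: (date "YYYY-MM-DD", customer id, amount).
TRANSACTIONS = [
    ("2004-01-05", "C1", 20.0),
    ("2004-01-17", "C2", 35.5),
    ("2004-02-02", "C1", 12.5),
    ("2004-02-20", "C1", 40.0),
]


def summarize_by_month(transactions):
    """Return {month: {"count": n, "total": amount}} for each YYYY-MM period."""
    summary = defaultdict(lambda: {"count": 0, "total": 0.0})
    for date, _customer, amount in transactions:
        month = date[:7]  # the time period this summary row covers
        summary[month]["count"] += 1
        summary[month]["total"] += amount
    return dict(summary)


monthly = summarize_by_month(TRANSACTIONS)
```

Each summary row uses only the transactions that fall inside its own period, which is the property the slide describes.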
Database Schema
• Database schema defines the structure of data,
not the values of the data (e.g., first name, last
name = structure; Ron Norman = values of the
data)
• In RDBMS:
– Columns = fields = attributes (A,B,C)
– Rows = records = tuples (1-7)
15
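The difference between structure and values can be seen in a few lines of Python using the standard `sqlite3` module. This is only a sketch: the table name `customer` and its two columns are assumptions chosen to match the slide's first name / last name example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The schema: structure only (column names and types), no values yet.
conn.execute("CREATE TABLE customer (first_name TEXT, last_name TEXT)")
# The values: one row (record, tuple) stored under that structure.
conn.execute("INSERT INTO customer VALUES (?, ?)", ("Ron", "Norman"))

# Column names come from the schema ...
columns = [row[1] for row in conn.execute("PRAGMA table_info(customer)")]
# ... while the rows hold the data values.
rows = conn.execute("SELECT first_name, last_name FROM customer").fetchall()
conn.close()
```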
Metadata
• General definition: Data about data !!!
– Examples:
• A library’s card catalog (metadata) describes publications (data)
• A file system maintains permissions (metadata) about files (data)
• A form of system documentation including:
– Values legally allowed in a field (e.g., AZ, CA, OR, UT, WA, etc.)
– Description of the contents of each field (e.g., start date)
– Date when data were loaded
– Indication of currency of the data (last updated)
– Mappings between systems (e.g., A.this = B.that)
• Invaluable; without it, you have to research to find this information
16
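One way to picture metadata as system documentation is a small dictionary that describes each field and is then used to check incoming records. The field names, descriptions, and the helper `invalid_fields` below are hypothetical; only the list of state codes is taken from the slide.

```python
# Hypothetical metadata for two fields; names and descriptions are illustrative.
FIELD_METADATA = {
    "state": {
        "description": "Two-letter state code of the customer",
        "allowed_values": {"AZ", "CA", "OR", "UT", "WA"},
    },
    "start_date": {
        "description": "Date the customer relationship began",
        "allowed_values": None,  # None means any value is accepted here
    },
}


def invalid_fields(record, metadata=FIELD_METADATA):
    """Return the names of fields whose value is not legally allowed."""
    bad = []
    for field, value in record.items():
        allowed = metadata.get(field, {}).get("allowed_values")
        if allowed is not None and value not in allowed:
            bad.append(field)
    return bad
```

For example, `invalid_fields({"state": "ZZ"})` flags the `state` field, because `ZZ` is not among the values the metadata allows.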
Business Rules
• Highest level of abstraction from operational
(transaction) data
• Describes why relationships exist and how they are
applied
• Examples:
– Need to have 3 forms of ID for credit
– Only allow a maximum daily withdrawal of $200
– After the 3rd log-in attempt, lock the log-in screen
– Accept no bills larger than $20
– Others???
17
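Two of the example rules above can be written as simple checks. This is a minimal sketch with made-up function names; it also assumes that "after the 3rd log-in attempt" means three failed attempts, which the slide does not spell out.

```python
MAX_DAILY_WITHDRAWAL = 200  # dollars, from the slide's example rule
MAX_LOGIN_ATTEMPTS = 3      # assumption: counts failed attempts


def withdrawal_allowed(withdrawn_today, requested):
    """Allow a withdrawal only if the daily total stays within the limit."""
    return withdrawn_today + requested <= MAX_DAILY_WITHDRAWAL


def screen_locked(failed_attempts):
    """Lock the log-in screen once the 3rd attempt has failed."""
    return failed_attempts >= MAX_LOGIN_ATTEMPTS
```

The constants carry the business rule, while the transaction data (amounts withdrawn, attempts made) are only inputs to it.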
General Architecture for Data Warehousing
• End users (business)
• Metadata repository
• Central repository
• Extraction, (Clean),
Transformation, &
Load (ETL)
• Source systems
18
OLAP – Online Analytical Processing
• A definition:
• Data representation is in the form of a CUBE
• OLAP goes beyond SQL with its analysis
capabilities
• Key feature of OLAP: Relevant multi-dimensional
views such as products, time, geography
19
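The multi-dimensional views mentioned above can be approximated by summing a measure over whichever dimensions are not of interest. The fact rows, the sales measure, and the function `roll_up` are assumptions made for this sketch; real OLAP tools offer far more than this.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, month, region, sales amount).
FACTS = [
    ("widget", "Jan", "West", 100),
    ("widget", "Jan", "East", 50),
    ("widget", "Feb", "West", 70),
    ("gadget", "Jan", "West", 30),
]
DIMENSIONS = ("product", "time", "geography")


def roll_up(facts, keep):
    """Sum sales over every dimension not named in `keep`."""
    idx = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in idx)
        totals[key] += row[3]  # the sales amount is the last column
    return dict(totals)


by_product = roll_up(FACTS, ["product"])
by_product_time = roll_up(FACTS, ["product", "time"])
```

Choosing a different `keep` list gives a different view of the same cube, for example sales by product only or by product and month.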
OLAP Overview
• Interactive, exploratory analysis of
multidimensional data to discover patterns
[Figure: data cube with axes gender, age, and accidents]
20
Data Mining versus OLAP
• OLAP - Online Analytical
Processing
– Provides you with a very
good view of what is
happening, but cannot
predict what will happen
in the future or explain why it is
happening
21
Results of Data Mining Include:
• Forecasting what may happen in the future
• Classifying people or things into groups by
recognizing patterns
• Clustering people or things into groups
based on their attributes
• Associating what events are likely to occur
together
• Sequencing what events are likely to lead
to later events
22
Data Mining Flavors
• Directed – Attempts to explain or
categorize some particular target field
such as income or response.
• Undirected – Attempts to find patterns or
similarities among groups of records
without the use of a particular target field
or collection of predefined classes.
23
Data Mining Tasks
• Classification – example: Jr, Sr
• Estimation – example: household income
• Prediction – example: predict credit card
balance transfer average amount
• Affinity Grouping – Example: people who buy
X, often buy Y also
• Clustering – similar to classification but no
predefined classes
• Description and Profiling – behavior begets an
explanation such as “More guys prefer In-n-Out
Burger than do gals.”
24
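The affinity grouping example ("people who buy X, often buy Y also") can be sketched by counting how often items appear together. The baskets and item names below are placeholders, and the measure computed by `confidence` is one common way to quantify "often"; the slides do not name a specific measure.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets; item names are placeholders.
BASKETS = [
    {"X", "Y", "Z"},
    {"X", "Y"},
    {"X", "Z"},
    {"Y", "Z"},
    {"X", "Y"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def confidence(baskets, x, y):
    """Fraction of baskets containing x that also contain y.

    Assumes at least one basket contains x.
    """
    with_x = [b for b in baskets if x in b]
    return sum(1 for b in with_x if y in b) / len(with_x)
```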
Automatic Cluster Detection
• DM techniques used to find patterns in
data
– Not always easy to identify
• No observable pattern
• Too many patterns
• Automatic Cluster Detection is useful to
find “better behaved” clusters of data
within a larger dataset; seeing the forest
without getting lost in the trees
26
Automatic Cluster Detection
• K-Means clustering algorithm depends on a geometric
interpretation of the data
• Other automatic cluster detection (ACD) algorithms
include:
– Gaussian mixture models
– Agglomerative clustering
– Divisive clustering
– Self-organizing maps (SOM) – Ch. 7 – Neural Nets
• ACD is a tool used primarily for undirected data mining
– No preclassified training data set
– No distinction between independent and dependent variables
• ACD is rarely used in isolation – other methods follow up
27
Clustering Examples
• “Star Power” ~ 1910
Hertzsprung-Russell
• Group of Teens
• 1990s US Army – women’s uniforms:
– 100 measurements for each of 3,000 women
– Using the K-means algorithm, reduced to a handful
28
K-means Clustering
• “K” – circa 1967 – this algorithm looks for a fixed
number of clusters which are defined in terms of
proximity of data points to each other
• How K-means works (see next slide figures):
– Algorithm selects K data points randomly
– Assigns each of the remaining data points to one of K
clusters (via perpendicular bisector)
– Calculate the centroids of each cluster (uses
averages in each cluster to do this)
29
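The steps listed above can be sketched in plain Python. In the usual form of the algorithm, the assignment and centroid steps are repeated until the assignments stop changing, so the sketch includes that loop. The function names, the optional `seeds` parameter, and the `max_iter` cap are choices made for this example, not part of the slides. For simplicity it assigns every point, including the seeds, to its nearest center.

```python
import random


def closest(point, centers):
    """Index of the center nearest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    return dists.index(min(dists))


def centroid(points):
    """Average of the points, coordinate by coordinate."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, seeds=None, max_iter=100):
    """Return (centers, labels) after iterating assignment and averaging."""
    # Step 1: pick K data points at random unless seeds are supplied.
    centers = list(seeds) if seeds is not None else random.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center.
        new_labels = [closest(p, centers) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Step 3: move each center to the centroid of its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = centroid(members)
    return centers, labels
```

Passing explicit `seeds` makes a run repeatable; with random seeds, different runs may settle on different clusters.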
K-means Clustering
30
K-means Clustering
• Resulting clusters
describe underlying
structure in the data,
however, there is no
one right description
of that structure (Ex:
Figure 11.6 – playing
cards K=2, K=4)
31
Similarity & Difference
• Automatic Cluster Detection is quite
simple for a software program to
accomplish – data points, clusters mapped
in space
• However, business data points are not
about points in space but about
purchases, phone calls, airplane trips, car
registrations, etc. which have no obvious
connection to the dots in a cluster diagram
32
Similarity & Difference
• Clustering business data requires some notion of natural
association – records (data) in a given cluster are more
similar to each other than to those in another cluster
• For DM software, this concept of association must be
translated into some sort of numeric measure of the
degree of similarity
• Most common translation is to convert data values (e.g.,
gender, age, product, etc.) into numeric values so they can be
treated as points in space
• If two points are close in geometric sense then they
represent similar data in the database
33
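A minimal sketch of this translation, assuming records that hold only a gender and an age: each record becomes a point, and the distance between two points serves as the numeric measure of similarity. The gender codes and the age scaling are arbitrary illustrative choices, and coding a category as a number is a deliberate simplification.

```python
import math

# Hypothetical encoding choices; real projects need more careful scaling.
GENDER_CODES = {"F": 0.0, "M": 1.0}
MAX_AGE = 100.0  # used to scale age into roughly the 0..1 range


def to_point(record):
    """Map a (gender, age) record to a point in 2-D space."""
    gender, age = record
    return (GENDER_CODES[gender], age / MAX_AGE)


def distance(rec_a, rec_b):
    """Euclidean distance between two encoded records."""
    return math.dist(to_point(rec_a), to_point(rec_b))
```

With this encoding, two women aged 30 and 35 are much closer together than a woman and a man who are both 30, which shows how strongly the choice of encoding shapes the resulting clusters.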
Similarity & Difference
• Business variable (fields) types:
– Categorical (e.g., mint, cherry, chocolate)
– Ranks (e.g., freshman, soph, etc.)
– Intervals (e.g., 56 degrees, 72 degrees, etc.)
– True measures – interval variables that measure from a
meaningful zero point
• Fahrenheit, Celsius not good examples
• Age, weight, height, length, tenure are good
• From a geometric standpoint, the above variable types go from
least effective to most effective (top to bottom)
• Finally, there are dozens/hundreds of published
techniques for measuring the similarity of two data
records
34
Evaluating Clusters
• What does it mean to say that a cluster is
“good”?
– Clusters should have members that have a
high degree of similarity
– Standard way to measure within-cluster
similarity is variance* – the cluster with the lowest
variance is considered best
– Cluster size is also important, so an alternate
approach is to use average variance**
* The sum of the squared differences of each element from the mean
** The total variance divided by the size of the cluster
35
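The two footnoted measures can be computed directly from their definitions on the slide. The function names and the two sample clusters are invented for this sketch, and "difference from the mean" is taken here to be Euclidean distance from the cluster's mean point.

```python
def total_variance(cluster):
    """Sum of squared distances of each point from the cluster mean."""
    n = len(cluster)
    mean = [sum(coords) / n for coords in zip(*cluster)]
    return sum(
        sum((x - m) ** 2 for x, m in zip(point, mean)) for point in cluster
    )


def average_variance(cluster):
    """Total variance divided by the number of points in the cluster."""
    return total_variance(cluster) / len(cluster)


# Two hypothetical clusters: one tight, one spread out.
tight = [(1.0, 1.0), (1.0, 3.0)]
loose = [(0.0, 0.0), (0.0, 10.0)]
```

By either measure the `tight` cluster scores lower than the `loose` one, so it would be considered the better cluster.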
Evaluating Clusters
• Finally, if detection identifies good clusters
along with weak ones it could be useful to
set the good ones aside (for further study)
and run the analysis again to see if
improved clusters are revealed from only
the weaker ones
36
But…
• Finding patterns is not enough
• Business (individuals) must:
– Respond to the pattern(s) by taking action
– Turning:
• Data into Information
• Information into Action
• Action into Value
37
Data Mining’s Business Cycle
1. Identify the business opportunity*
2. Mining data to transform it into
actionable information
3. Acting on the information
4. Measuring the results
38
1. Identify the Business Opportunity
• Many business processes are good candidates:
– New product introduction
– Direct marketing campaign
– Evaluating the results of a test market
• Measurements from past DM efforts:
– What types of customers responded to our last
campaign?
– Where do the best customers live?
– What products should be promoted with our XYZ
product?
39
2. Mining data to transform it into actionable information
• Success is making business sense of the data
• Numerous data “issues”:
– Bad data formats (alpha vs numeric, missing, null,
bogus data)
– Confusing data fields (synonyms and differences)
– Lack of functionality (“I wish I could…”)
– Legal ramifications (privacy, etc.)
– Organizational factors (unwilling to change “our ways”)
– Lack of timeliness
40
3. Acting on the Information
• This is the purpose of Data Mining – with the
hope of adding value
• What type of action?
– Interactions with customers, prospects, suppliers
– Modifying service procedures
– Adjusting inventory levels
– Consolidating
– Expanding
– Etc…
41
4. Measuring the Results
• Assesses the impact of the action taken
• Often overlooked, ignored, skipped
• Planning for the measurement should begin when
analyzing the business opportunity, not after it is “all over”
• Assessment questions (examples):
– Did this ____ campaign do what we hoped?
– Did some offers work better than others?
– Did these customers purchase additional products?
– Tons of others…
42
What Does All of This Mean?
• On a regular basis, data miners utilize their data
warehouses to give guidance for and/or answer a
limitless variety of questions.
• Nothing is free, however, and the benefits do come with
a cost.
• The value of a data warehouse and subsequent data
mining comes from the new and changed business
processes it enables, and from the resulting competitive advantage.
• There are limitations, though: a data warehouse cannot
correct problems with its data, although it may help to
identify them more clearly.
43