
Introduction to Business
Intelligence (CIT625)
Data Mining Techniques and
Applications
Data Mining Techniques
• Data mining techniques are the methods
used to analyse a set (or subset) of data.
• To follow them, you need to understand the
four classes of task involved in data mining:
– Classification - Arranges the data into
predefined groups. For example an email
program might attempt to classify an email as
legitimate or spam. Common algorithms
include Nearest neighbor, Naive Bayes
classifier and Neural network.
Data Mining Techniques
– Clustering - Is like classification but the
groups are not predefined, so the algorithm
will try to group similar items together.
– Regression - Attempts to find a function which
models the data with the least error. Methods
range from ordinary least squares to Genetic
Programming.
Data Mining Techniques
– Association rule learning (mining) - Searches
for relationships between variables. For
example, a supermarket might gather data on
what each customer buys. Using association
rule learning, the supermarket can work out
what products are frequently bought together,
which is useful for marketing purposes. This is
sometimes referred to as "market basket
analysis".
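Of the four tasks above, regression is the simplest to sketch in a few lines. The example below is only a minimal illustration, assuming ordinary least squares for a straight line rather than genetic programming. The function name fit_line and the spend/units figures are hypothetical and do not come from the slides.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the line that minimises the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-style and variance-style sums (unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend against units sold.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(spend, units)
print(f"units = {slope:.2f} * spend + {intercept:.2f}")
```

Because the made-up points lie exactly on a line, the fitted function reproduces them with zero error; real data would leave a residual.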
Applications of Techniques
• Data mining is used for a variety of
purposes in both the private and public
sectors.
• These techniques can be applied in
companies with a strong consumer focus:
retail, financial, communication, and
marketing organisations, as shown
below.
Data Mining Techniques
• The ultimate goal of data mining is prediction.
• Predictive data mining is the most common
type of data mining and one that has the
most direct business applications.
• The process of data mining consists of three
stages:
– The initial exploration
– Model building or pattern identification with
validation/verification
– Deployment (i.e., the application of the model to
new data in order to generate predictions).
More on Classification
• Classification can provide valuable support
for informed decision making in an
organisation.
• For example, a model may classify each
person as a potential buyer or non-buyer
based on personal information such as
income, occupation, lifestyle, and credit rating.
Classification
Table 1. Vertebrate Data Set
Classification
• In the above slide, the table shows a
sample data set used for classifying
vertebrates into one of the following
categories: mammal, bird, fish, reptile, or
amphibian.
• The attribute set includes properties of a
vertebrate such as its body temperature,
skin cover, method of reproduction, ability
to fly and ability to live in water.
What is Classification
• Classification can be described as a task
of assigning objects to one of several
predefined categories.
[Figure: Input attribute set (x) -> Classification model -> Output class label (y)]
The diagram shows classification as the task of mapping an input
attribute set x into its class label y.
Simple Definition
• Classification is the task of learning a
target function f that maps each attribute
set x into one of the pre-defined class
labels y.
• The target function is also known
informally as a classification model.
Usefulness of Classification
Model
• A classification model is useful for the
following purposes:
– It may serve as an explanatory tool to
distinguish between objects of different
classes (Descriptive Modeling).
– It may also be used to predict the class label
of unknown records (Predictive Modeling).
Consider the table below:
Usefulness of Classification
Model
• A classification model can be treated as a
black box that automatically assigns a
class label when presented with the
attribute set of an unknown record.
• For example, suppose you are given the
characteristics of a creature known as a gila
monster.
Usefulness of Classification
Model
• By building a classification model from the
data set shown in Table 1, you may use
the model to determine the class to which
the creature belongs.
• Classification models are most suited for
predicting or describing data sets with
binary or nominal target attributes.
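The black-box idea can be sketched with a nearest-neighbour classifier. This is only an illustration: Table 1 is not reproduced in this transcript, so the five training rows and the gila monster's attribute values below are assumptions, and the names mismatches and classify are hypothetical.

```python
# Hypothetical training rows: (attribute set x, class label y).
# These are NOT the contents of Table 1, which is missing from the transcript.
training = [
    ({"body_temp": "warm", "skin": "hair", "gives_birth": True,
      "flies": False, "aquatic": False}, "mammal"),
    ({"body_temp": "warm", "skin": "feathers", "gives_birth": False,
      "flies": True, "aquatic": False}, "bird"),
    ({"body_temp": "cold", "skin": "scales", "gives_birth": False,
      "flies": False, "aquatic": True}, "fish"),
    ({"body_temp": "cold", "skin": "scales", "gives_birth": False,
      "flies": False, "aquatic": False}, "reptile"),
    ({"body_temp": "cold", "skin": "none", "gives_birth": False,
      "flies": False, "aquatic": True}, "amphibian"),
]


def mismatches(a, b):
    """Count the attributes on which two records disagree."""
    return sum(a[key] != b[key] for key in a)


def classify(record, rows):
    """Return the label of the training row that differs on the fewest attributes."""
    nearest = min(rows, key=lambda row: mismatches(record, row[0]))
    return nearest[1]


# Assumed attribute values for the unknown record.
gila_monster = {"body_temp": "cold", "skin": "scales", "gives_birth": False,
                "flies": False, "aquatic": False}

print(classify(gila_monster, training))
```

With these assumed values the unknown record matches the reptile row on every attribute, so the sketch labels it "reptile". All attributes here are binary or nominal, which is why a simple mismatch count is enough.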
Classification Technique
• A classification technique is a systematic
approach for building classification models
from an input data set.
• Examples of classification techniques
include:
– Decision Tree Classifiers
– Rule-Based Classifiers
– Neural Networks
– Support Vector Machines
– Naïve Bayes Classifiers
– Nearest-Neighbor Classifiers
Classification Technique
• Each technique employs a learning
algorithm to identify a model that best fits
the relationship between the attribute set
and class label of the input data (produces
outputs consistent with the class labels of
the input data).
Classification Technique
• A good classification model must predict
correctly the class labels of records it has
never seen before.
• Building models with good generalization
capability, i.e., models that accurately
predict the class labels of previously
unseen records, is therefore a key
objective of the learning algorithm.
General Approach to Solve a
Classification Problem
• A general strategy for solving a classification
problem is as follows:
– First, the input data is divided into two disjoint
sets, known as the training set and test set,
respectively.
• The training set will be used for building a
classification model.
• The induced model is later applied to the test
set to predict the class label of each test
record.
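The split-then-evaluate strategy above can be sketched as follows. This is a minimal illustration, not a prescribed method: it assumes a 1-nearest-neighbour model on a single income attribute, and the records, the 30% test fraction, and the function names are all hypothetical.

```python
import random


def split_records(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and cut it into disjoint training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def predict(training_set, income):
    """1-nearest-neighbour: copy the label of the training record with the closest income."""
    nearest = min(training_set, key=lambda rec: abs(rec[0] - income))
    return nearest[1]


def accuracy(training_set, test_set):
    """Fraction of test records whose predicted label equals the true label."""
    correct = sum(predict(training_set, income) == label
                  for income, label in test_set)
    return correct / len(test_set)


# Hypothetical (income in thousands, class label) records.
records = [
    (20, "non-buyer"), (25, "non-buyer"), (30, "non-buyer"),
    (35, "non-buyer"), (45, "non-buyer"),
    (55, "buyer"), (60, "buyer"), (70, "buyer"),
    (80, "buyer"), (90, "buyer"),
]

training_set, test_set = split_records(records)
print(len(training_set), len(test_set), accuracy(training_set, test_set))
```

The test records play no part in building the model, so the accuracy printed at the end estimates how well the model generalises to records it has never seen.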
Classification Applications
• Classification is often used to detect fraud
and to assess risk in finance and banking.
• Homeland security departments use
classification to identify terrorist activities,
such as money transfers and
communications, and to identify and track
individual terrorists, for example
through travel and immigration records.
Association
• Association analysis can be used to
promote or improve a marketing strategy by
analysing frequent itemsets.
• As the marketing manager of Company X,
for instance, you would like to determine
which items are frequently purchased
together within the same transactions.
Application of Association
• An example of such a rule, mined from the
Company X transactional database, is
buys(X, “computer”) => buys(X, “software”)
[support = 1%, confidence = 50%], where X
is a variable representing a customer.
• A confidence, or certainty, of 50%
means that if a customer buys a computer,
there is a 50% chance that she will buy
software as well.
Application of Association
• A 1% support means that 1% of all of the
transactions under analysis showed that
computer and software were purchased
together.
• This association rule involves a single
attribute or predicate (i.e., buys) that
repeats. Association rules that contain a
single predicate are referred to as
single-dimensional association rules.
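Support and confidence as defined above can be computed directly from a list of transactions. The sketch below is illustrative only: the four transactions are invented (so the support is not the 1% quoted on the slide), and the function names support and confidence are hypothetical.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical transactional database: one set of items per transaction.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software", "printer"},
]

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

Here one transaction in four contains both items (support 25%), and one of the two computer purchases also includes software (confidence 50%).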
Application of Association
• In addition to the marketing application,
the same sort of question has the following
uses:
• Baskets = documents; items = words.
Words appearing frequently together in
documents may represent phrases or
linked concepts. This can be used for
intelligence gathering.
Application of Association
• Baskets = sentences, items =
documents. Two documents with
many of the same sentences could
represent plagiarism or mirror sites on
the Web.
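The "baskets = sentences, items = documents" view can be sketched as below: each sentence is a basket holding the documents it appears in, and document pairs that share many baskets are flagged. The file names, sentences, and the function name shared_sentence_counts are hypothetical, and exact string matching of sentences is a simplifying assumption.

```python
from collections import Counter
from itertools import combinations


def shared_sentence_counts(documents):
    """Map each pair of document names to the number of sentences they share."""
    # Build the baskets: sentence -> set of documents containing it.
    baskets = {}
    for name, sentences in documents.items():
        for sentence in set(sentences):
            baskets.setdefault(sentence, set()).add(name)

    # Count how often each pair of documents appears in the same basket.
    pair_counts = Counter()
    for docs in baskets.values():
        for pair in combinations(sorted(docs), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical documents, already split into sentences.
documents = {
    "a.txt": ["Data mining finds patterns.", "Clusters group items.", "Unique to A."],
    "b.txt": ["Data mining finds patterns.", "Clusters group items.", "Unique to B."],
    "c.txt": ["Something else entirely."],
}

counts = shared_sentence_counts(documents)
print(counts.most_common(1))
```

A pair with a high count relative to document length would be a candidate for closer inspection; the threshold is left to the analyst.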
Clustering
• Data clustering is a method of grouping
objects that are similar in their
characteristics.
• The criterion for measuring similarity
depends on the implementation.
Clustering
• Clustering is often confused with
classification, but the two differ.
• In classification the objects are assigned to
predefined classes, whereas in clustering
the classes themselves must also be
discovered.
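The difference can be made concrete with a small clustering sketch in which no class labels are supplied. It assumes a one-dimensional k-means procedure with absolute difference as the similarity criterion; the values in [0, 1], the two starting centres, and the name kmeans_1d are hypothetical.

```python
def kmeans_1d(values, centres, iterations=10):
    """Group numbers around `centres`, moving each centre to the mean of its group."""
    centres = list(centres)
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for v in values:
            # Assign the value to the closest current centre.
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # Move each centre to its group's mean; keep it in place if the group is empty.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres, groups


# Hypothetical unlabelled objects, each described by one value in [0, 1].
values = [0.05, 0.1, 0.15, 0.8, 0.85, 0.9]

centres, groups = kmeans_1d(values, centres=[0.0, 1.0])
print(centres)
print(groups)
```

The algorithm is never told what the groups mean; it only discovers that the values fall into a low cluster and a high cluster.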
Clustering
• In database systems, data clustering also refers
to a technique in which information that is
logically similar is physically stored together.
• To increase efficiency in database systems, the
number of disk accesses must be minimized.
• In clustering, objects with similar properties are
placed in one class of objects, and a single
access to the disk makes the entire class
available.
Clustering
• By definition a cluster is an ordered list of
objects, which have some common
characteristics. The objects belong to an
interval [a, b], in our case [0, 1] [1].
What can be Clustered?
• Images (astronomical data)
• Patterns (e.g. Robot vision data)
• Shopping Items
• Feet (i.e. anatomical data)
• Words
• Documents, etc.
Application of Clustering
Similarity searching in Medical Image
Database
• This is a major application of the clustering
technique. To detect diseases such as
tumours, scanned images or X-rays are
compared with existing ones and the
dissimilarities are identified.
Application of Clustering
• This technique supports the development
of population segmentation models, such
as demographic-based customer
segmentation.
• For example, the buying habits of multiple
population segments might be compared
to determine which segments to target for
a new sales campaign.
Application of Clustering
• For example, a company that sells a variety of
products may need to review the sales of all
of its products in order to see which products
are selling strongly and which are
lagging.
• This is done by data mining techniques. If
the system clusters the products with low
sales, only that cluster has to be checked,
rather than comparing the sales figures of
all the products. Clustering therefore
facilitates the mining process.
Applications Miscellaneous
• With data mining, a retailer could use
point-of-sale (PoS) records of customer
purchases to send targeted promotions
based on an individual's purchase history.
By mining demographic data from
comment or warranty cards, the retailer
could develop products and promotions to
appeal to specific customer segments.
Applications Miscellaneous
• The National Basketball Association (NBA)
is exploring a data mining application that
can be used in conjunction with image
recordings of basketball games.
– The “Advanced Scout” software analyzes the
movements of players to help coaches
orchestrate plays and strategies.
Applications Miscellaneous
• For example, an analysis of the play-by-play
sheet of the game played between the New York
Knicks and the Cleveland Cavaliers on January
6, 1995 reveals that when Mark Price played the
Guard position, John Williams attempted four
jump shots and made each one! “Advanced
Scout” not only finds this pattern, but explains
that it is interesting because it differs
considerably from the average shooting
percentage of 49.30% for the Cavaliers during
that game.
How does Data Mining Work?
• Data mining software analyzes
relationships and patterns in stored
transaction data based on open-ended
user queries.
• Four types of relationships are sought
using several types of available analytical
software:
How does Data Mining Work?
– Classes: Stored data is used to locate data in
predetermined groups. For example, a
restaurant chain could mine customer
purchase data to determine when customers
visit and what they typically order. This
information could be used to increase traffic
by having daily specials.
– Clusters: Data items are grouped according
to logical relationships or consumer
preferences. For example, data can be mined
to identify market segments or consumer
affinities.
How does Data Mining Work?
– Associations: Data can be mined to identify
associations. Products that are frequently
bought together, such as bread and cheese,
are an example of associative mining.
– Sequential patterns: Data is mined to
anticipate behavior patterns and trends. For
example, an outdoor equipment retailer could
predict the likelihood of a backpack being
purchased based on a consumer's purchase
of sleeping bags and hiking shoes.
How does Data Mining Work?
• Data mining consists of five major
elements:
– Extract, transform, and load transaction data
onto the data warehouse system.
– Store and manage the data in a
multidimensional database system.
– Provide data access to business analysts and
information technology professionals.
– Analyze the data by application software.
– Present the data in a useful format, such as a
graph or table.
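The five elements above can be walked through in miniature. This is a rough sketch under stated assumptions: an in-memory SQLite table stands in for the data warehouse (it is relational, not a multidimensional database system), and the raw rows, table name, and column names are hypothetical.

```python
import sqlite3

# Hypothetical raw transaction data: (day, item, amount as text).
raw_rows = [
    ("2024-01-05", " Bread", "2.50"),
    ("2024-01-05", "cheese", "4.00"),
    ("2024-01-06", "bread ", "2.50"),
]

# 1. Extract, transform (tidy item names, parse amounts), and load.
clean_rows = [(day, item.strip().lower(), float(amount))
              for day, item, amount in raw_rows]

# 2. Store and manage the data (an in-memory database as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, item TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean_rows)

# 3 and 4. Give an analyst access through a query and analyse the data.
totals = conn.execute(
    "SELECT item, SUM(amount) FROM sales GROUP BY item ORDER BY item"
).fetchall()
conn.close()

# 5. Present the result in a useful format, here a small table.
for item, total in totals:
    print(f"{item:<10}{total:>8.2f}")
```

In a production setting each step would be handled by separate tooling; the sketch only shows how the steps hand data to one another.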