Classification Techniques - The Institute of Finance

Classification Techniques
http://www.ifm.ac.tz/staff/bajuna
What is Classification
• Classification is the task of assigning objects to
their respective categories.
– Examples include classifying email messages as
spam or non-spam based upon the message
header and content, and classifying galaxies based
upon their respective shapes.
What is Classification
• Classification can provide valuable support for
informed decision making in an organisation.
• For example, suppose a mobile phone company
would like to promote a new cell-phone product
to the public. Instead of mass mailing the
promotional catalog to everyone, the company
may be able to reduce the campaign cost by
targeting only a small segment of the population.
• It may classify each person as a potential
buyer or non-buyer based on their personal
information such as income, occupation,
lifestyle, and credit ratings.
Discrete Data
• Discrete Data – A set of data is said to be
discrete if the values / observations belonging
to it are distinct and separate, i.e. they can be
counted (1,2,3,....). Examples might include
the number of kittens in a litter; the number
of patients in a doctor's surgery; the number of
flaws in one metre of cloth; gender (male,
female); blood group (O, A, B, AB).
Discrete Data
• Any data measurements that are not
quantified on an infinitely divisible numeric
scale. This includes items like counts,
proportions, ratios, or percentages of a
characteristic (e.g. sex, loan forms,
department attendance) that have
measurements like pass or fail, leak or no
leak, small, medium, or large, or go/no-go
tests. (SixSigma.com Dictionary)
Continuous Data
• Continuous/Variable Data – A set of data is
said to be continuous if the values /
observations belonging to it may take on any
value within a finite or infinite interval. You
can count, order and measure continuous
data. For example height, weight,
temperature, the amount of sugar in an
orange, the time required to run a mile.
Continuous Data
• Variable data types have real numbers in the
measurement, like 2.34, 2.55, etc. (i.e. data
that can be measured on a continuous scale).
Categorical Data
• Categorical Data – A set of data is said to be
categorical if the values or observations
belonging to it can be sorted according to
category. Each value is chosen from a set of
non-overlapping categories. For example,
shoes in a cupboard can be sorted according
to colour: the characteristic 'colour' can have
non-overlapping categories 'black', 'brown',
'red' and 'other'. People have the
characteristic of 'gender' with categories
'male' and 'female'.
Nominal Data
• Nominal Data – A set of data is said to be
nominal if the values / observations belonging
to it can be assigned a code in the form of a
number where the numbers are simply labels.
You can count but not order or measure
nominal data. For example, in a data set males
could be coded as 0, females as 1; marital
status of an individual could be coded as Y if
married, N if single.
Ordinal Data
• Ordinal Data – A set of data is said to be
ordinal if the values / observations belonging
to it can be ranked (put in order) or have a
rating scale attached. You can count and order,
but not measure, ordinal data.
Ordinal Data
• The categories for an ordinal set of data have
a natural order. For example, suppose a group
of people were asked to taste varieties of
biscuit and classify each biscuit on a rating
scale of 1 to 5, representing strongly dislike,
dislike, neutral, like, strongly like. A rating of 5
indicates more enjoyment than a rating of 4,
so such data are ordinal.
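The difference between nominal and ordinal data can be sketched in a few lines of Python. This is only an illustration: the sample values in `gender_codes` and `ratings` are made up, and the names are not part of the original slides.

```python
from collections import Counter

# Nominal: the codes are labels only. Counting is meaningful, ordering is not.
# Hypothetical sample using the coding from the text (0 = male, 1 = female).
gender_codes = [0, 1, 1, 0, 1]
gender_counts = Counter(gender_codes)

# Ordinal: the 1-to-5 biscuit rating scale has a natural order.
RATING_LABELS = {
    1: "strongly dislike",
    2: "dislike",
    3: "neutral",
    4: "like",
    5: "strongly like",
}
ratings = [4, 2, 5, 3, 4]  # hypothetical responses

rating_counts = Counter(ratings)     # counting also works for ordinal data
ranked = sorted(ratings)             # ordering is meaningful for ordinal data
best = RATING_LABELS[max(ratings)]   # label of the highest rating given

print(gender_counts[1])  # how many records are coded as female
print(ranked)
print(best)
```

Sorting `gender_codes` would run without error, but the result would carry no meaning, because 0 and 1 are only labels.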
Preliminaries
• The input data for a classification task is given in
the form of a collection of records.
• Each record, also known as an instance or
example, is characterised by a tuple (x,y),
where x is the attribute set and y is the class
label.
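As a rough sketch, a record (x, y) could be represented in Python as shown here. The `Record` type, the attribute names, and the values are assumptions chosen to echo the buyer / non-buyer example from the earlier slides; they are not taken from a real data set.

```python
from typing import NamedTuple


class Record(NamedTuple):
    x: dict  # attribute set: attribute name -> value
    y: str   # class label


# Hypothetical records for the buyer / non-buyer example
records = [
    Record(x={"income": 52000.0, "occupation": "engineer"}, y="buyer"),
    Record(x={"income": 18000.0, "occupation": "student"}, y="non-buyer"),
]

# Separate the attribute sets from the class labels
attribute_sets = [r.x for r in records]
class_labels = [r.y for r in records]

print(class_labels)
```

Here `income` is a continuous attribute and `occupation` a discrete one, while the class label takes one of a fixed set of values.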
Preliminaries
Table 1. Vertebrate Data Set
Preliminaries
• Table 1 on the previous slide shows a sample
data set used for classifying vertebrates into
one of the following categories: mammal,
bird, fish, reptile, or amphibian.
• The attribute set includes properties of a
vertebrate such as its body temperature, skin
cover, method of reproduction, ability to fly
and ability to live in water.
Preliminaries
• The attribute set may contain both discrete and
continuous features; in Table 1, however, the
attribute set contains mostly discrete values.
• The class label, on the other hand, must be a
discrete attribute.
• This is a key characteristic that distinguishes
classification from another predictive modeling
task known as regression, where y is a continuous
attribute.
What is Classification
• Classification can be described as a task of
assigning objects to one of several predefined
categories.
[Figure: Input: attribute set (x) → Classification model → Output: class label (y)]
The diagram shows classification as the task of mapping an input attribute
set x into its class label y.
Simple Definition
• Classification is the task of learning a target
function f that maps each attribute set x into
one of the pre-defined class labels y.
• The target function is also known informally as
a classification model.
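To make the shape of the target function concrete, the sketch below writes f as a Python callable that takes an attribute set and returns one of the predefined labels. The rules are hand-written placeholders for illustration only; they are not learned from Table 1, and the attribute names and values are assumptions.

```python
from typing import Callable, Mapping

# The predefined class labels from the vertebrate example
CLASS_LABELS = {"mammal", "bird", "fish", "reptile", "amphibian"}

AttributeSet = Mapping[str, object]
ClassificationModel = Callable[[AttributeSet], str]


def f(x: AttributeSet) -> str:
    """Hand-written stand-in for a learned target function (illustrative rules)."""
    if x["skin_cover"] == "feathers":
        return "bird"
    if x["body_temperature"] == "warm-blooded":
        return "mammal"
    if x["skin_cover"] == "scales":
        return "fish" if x["lives_in_water"] else "reptile"
    return "amphibian"


# Hypothetical attribute set of an unlabeled record
unknown = {
    "body_temperature": "cold-blooded",
    "skin_cover": "scales",
    "lives_in_water": False,
}
label = f(unknown)
print(label)
```

In practice the body of `f` would be produced by a learning algorithm rather than written by hand; only the input and output types stay the same.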
Usefulness of Classification Model
• A classification model is useful for the
following purposes:
– It may serve as an explanatory tool to distinguish
between objects of different classes (Descriptive
Modeling).
– It may also be used to predict the class label of
unknown records (Predictive Modeling). Consider
the table below:
Usefulness of Classification Model
• A classification model can be treated as a
black box that automatically assigns a class
label when presented with the attribute set of
an unknown record.
• For example, you may be given the characteristics
of a creature known as a Gila monster.
Usefulness of Classification Model
• By building a classification model from the
data set shown in Table 1, you may use
the model to determine the class to which the
creature belongs.
• Classification models are most suited for
predicting or describing data sets with binary
or nominal target attributes.
Classification & Prediction
• Classification:
– Predicts categorical class labels
– Classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying new
data
• Prediction:
– Models continuous-valued functions, i.e., predicts
unknown or missing values
• Typical Applications
– Credit approval
– Target marketing
– Medical diagnosis
– Treatment effectiveness analysis
Classification Techniques
Classification Technique
• A classification technique is a systematic
approach for building classification models
from an input data set.
• Examples of classification techniques include:
– Decision Tree Classifiers
– Rule-Based Classifiers
– Neural Networks
– Support Vector Machines
– Naïve Bayes Classifiers
– Nearest-Neighbor Classifiers
Classification Technique
• Each technique employs a learning algorithm
to identify a model that best fits the
relationship between the attribute set and
class label of the input data (produces outputs
consistent with the class labels of the input
data).
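As one possible illustration of a learning algorithm, here is a minimal nearest-neighbour classifier for discrete attributes. It is a sketch under stated assumptions, not a reference implementation: the class name, the Hamming distance choice, and the training records are all hypothetical.

```python
def hamming(a, b):
    """Number of attribute positions where two tuples differ."""
    return sum(1 for u, v in zip(a, b) if u != v)


class NearestNeighbourClassifier:
    def fit(self, X, y):
        # "Learning" here simply memorises the training records.
        self.X_ = list(X)
        self.y_ = list(y)
        return self

    def predict(self, x):
        # Return the label of the closest stored record (first one wins ties).
        distances = [hamming(x, xi) for xi in self.X_]
        return self.y_[distances.index(min(distances))]


# Hypothetical training records: (income band, occupation, credit rating)
X_train = [
    ("high", "engineer", "good"),
    ("low", "student", "poor"),
    ("high", "teacher", "good"),
    ("low", "clerk", "poor"),
]
y_train = ["buyer", "non-buyer", "buyer", "non-buyer"]

model = NearestNeighbourClassifier().fit(X_train, y_train)

# The model reproduces the labels of the records it was built from
fitted = [model.predict(x) for x in X_train]
print(fitted == y_train)

# It can also label a record it has not seen
new_label = model.predict(("high", "clerk", "good"))
print(new_label)
```

Reproducing the training labels is the "consistent with the input data" property; whether the label for the new record is right is a separate question.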
Classification Technique
• A good classification model must predict
correctly the class labels of records it has
never seen before.
• Building models with good generalization
capability, i.e., models that accurately predict
the class labels of previously unseen records,
is therefore a key objective of the learning
algorithm.
General Approach to Solve a
Classification Problem
• A general strategy for solving a classification
problem is as follows:
– First, the input data is divided into two disjoint sets,
known as the training set and test set, respectively.
• The training set will be used for building a
classification model.
• The induced model is later applied to the test set
to predict the class label of each test record.
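The steps above can be sketched in plain Python. The helper names `train_test_split` and `induce_majority_model`, the 30% test fraction, and the toy records are assumptions made for illustration; the "model" is deliberately trivial so that the focus stays on the split itself.

```python
import random
from collections import Counter


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into two disjoint lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def induce_majority_model(train):
    """Toy learner: always predict the most common training label."""
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: majority


# Hypothetical records (x, y): a one-attribute x and a binary class label
records = [((i,), "buyer" if i % 3 else "non-buyer") for i in range(10)]

train, test = train_test_split(records)
model = induce_majority_model(train)           # built from the training set only
predictions = [model(x) for x, _ in test]      # applied to the test set

print(len(train), len(test))
print(predictions)
```

Because no test record is used while building the model, comparing `predictions` with the true test labels gives an estimate of performance on unseen records.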
Why are we dividing the data into two sets?
• This strategy of dividing the data into
independent training and test sets allows us to
obtain an unbiased estimate of the
performance of a model on previously unseen
records.
• The figure on the next slide depicts this approach.
General Approach to Solve a
Classification Problem
Performance Measurement of Model
• Evaluation of the performance of a
classification model is based upon the number
of test records predicted correctly and
wrongly by the model.
• The counts are tabulated in a table known as a
confusion matrix.
Performance Measurement of Model
• Table 2 depicts the confusion matrix for a
binary classification problem.
Performance Measurement of Model
• Each entry fij in this table denotes the number
of records from class i predicted to be of class
j.
• For instance, f01 is the number of records
from class 0 wrongly predicted as class 1.
• Based on the entries in the confusion matrix,
the total number of correct predictions made
by the model is (f11 + f00) and the total
number of wrong predictions is (f10 + f01).
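As a small worked example, the entries of a binary confusion matrix can be counted as shown here. The `actual` and `predicted` lists are made-up values, and the nested dictionary `f[i][j]` is just one convenient way to hold the counts.

```python
def confusion_matrix(actual, predicted):
    """Return counts f[i][j]: records of actual class i predicted as class j."""
    f = {0: {0: 0, 1: 0}, 1: {0: 0, 1: 0}}
    for i, j in zip(actual, predicted):
        f[i][j] += 1
    return f


# Hypothetical test-set labels and model predictions
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

f = confusion_matrix(actual, predicted)

correct = f[1][1] + f[0][0]  # f11 + f00
wrong = f[1][0] + f[0][1]    # f10 + f01

print(f)
print(correct, wrong)
```

For these eight records the model gets six right and two wrong: one class 1 record predicted as class 0 (f10) and one class 0 record predicted as class 1 (f01).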
Performance Measurement of Model
• Although a confusion matrix provides the
information needed to determine how well
a classification model performs, it is useful to
summarize this information in a single
number.
• This would make it more convenient to
compare the performance of different
classification models.
Performance Measurement of Model
• There are several performance metrics
available for doing this. One of the most
popular metrics is model accuracy, which is
defined as:
• Accuracy:
$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}$$
Performance Measurement of Model
• Equivalently, the performance of a model can
be expressed in terms of its error rate given by
the following equation:
• Error rate:
$$\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}$$
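Both metrics follow directly from the four confusion matrix entries. The sketch below uses hypothetical counts (40, 10, 5, 45) to show the arithmetic and to check that accuracy and error rate add up to 1.

```python
def accuracy(f11, f10, f01, f00):
    """Fraction of predictions that are correct."""
    return (f11 + f00) / (f11 + f10 + f01 + f00)


def error_rate(f11, f10, f01, f00):
    """Fraction of predictions that are wrong."""
    return (f10 + f01) / (f11 + f10 + f01 + f00)


# Hypothetical confusion matrix entries for 100 test records
counts = dict(f11=40, f10=10, f01=5, f00=45)

acc = accuracy(**counts)    # (40 + 45) / 100
err = error_rate(**counts)  # (10 + 5) / 100

print(acc, err)
```

Since every prediction is either correct or wrong, the error rate is always 1 minus the accuracy, so reporting either one is enough to compare models.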
Decision Trees