#### Transcript: Naive Bayes Classifiers

**Naive Bayes Classifiers, an Overview**

### By Roozmehr Safi

### What is a Naive Bayes Classifier (NBC)?

• NBC is a probabilistic classification method.

• Classification (a.k.a. discrimination, or supervised learning) is the task of assigning new cases to one of the pre-defined classes, given a sample of cases for which the true classes are known.

• **NBC** is one of the oldest and simplest classification methods.

**Some NBC Applications**

• Credit scoring
• Marketing applications
• Employee selection
• Image processing
• Speech recognition
• Search engines…

### How does NBC Work?

• NBC applies Bayes’ theorem with (naive) independence assumptions among the features.

• A more descriptive term for the underlying model would be "independent feature model".

### How does NBC Work, Cont'd.

Let X1,…, Xm denote our features (height, weight, foot size…), let Y be the class number (1 for men, 2 for women), and let C be the number of classes (2). The problem consists of classifying the case (x1,…, xm) into the class c that maximizes P(Y=c | X1=x1,…, Xm=xm) over c = 1,…, C. Applying Bayes’ rule gives:

• P(Y=c | X1=x1,…, Xm=xm) = P(X1=x1,…, Xm=xm | Y=c) P(Y=c) / P(X1=x1,…, Xm=xm)

Under NB’s assumption of conditional independence, P(X1=x1,…, Xm=xm | Y=c) is replaced by the product P(X1=x1 | Y=c) × ⋯ × P(Xm=xm | Y=c).

• And since the denominator P(X1=x1,…, Xm=xm) does not depend on c, NB reduces the original problem to finding the class c that maximizes P(Y=c) × P(X1=x1 | Y=c) × ⋯ × P(Xm=xm | Y=c).

### An example:

• P(Observed Height | Male) = a
• P(Observed Weight | Male) = b
• P(Observed Foot size | Male) = c
• P(Male | observed case) ≈ P(Male) × a × b × c

• P(Observed Height | Female) = d
• P(Observed Weight | Female) = e
• P(Observed Foot size | Female) = f
• P(Female | observed case) ≈ P(Female) × d × e × f

Pick the class whose score is larger.
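The comparison above can be sketched in a few lines of Python. This is a minimal illustration, not the presentation's actual code: the training numbers are made up for demonstration, and normal (Gaussian) class-conditional densities are assumed for each continuous feature.

```python
import math

# Hypothetical toy training data: (height cm, weight kg, foot size cm)
train = {
    "male":   [(182, 82, 30), (180, 86, 28), (170, 77, 30), (180, 75, 25)],
    "female": [(152, 45, 15), (168, 68, 20), (165, 59, 19), (175, 68, 23)],
}

def mean_var(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance
    return m, v

# Per class and per feature: estimate only a mean and a variance
params = {c: [mean_var([row[j] for row in rows]) for j in range(3)]
          for c, rows in train.items()}
total = sum(len(rows) for rows in train.values())
priors = {c: len(rows) / total for c, rows in train.items()}

def normal_pdf(x, m, v):
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def score(c, case):
    # P(class) x product of per-feature conditional densities
    # (the shared denominator is dropped, as it does not affect the argmax)
    s = priors[c]
    for x, (m, v) in zip(case, params[c]):
        s *= normal_pdf(x, m, v)
    return s

case = (177, 70, 26)  # a new observation to classify
scores = {c: score(c, case) for c in train}
print(max(scores, key=scores.get))  # pick the class with the larger score
```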

### NBC Advantages

• Despite its unrealistic assumption of independence, NBC is remarkably successful in practice, even when independence is violated.

• Due to its simple structure, NBC is appealing when the set of variables is large.

• NBC requires only a small amount of training data:
  – It only needs to estimate the means and variances of the variables.
  – There is no need to form a covariance matrix.
  – It is computationally inexpensive.
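The saving from skipping the covariance matrix can be made concrete with a quick parameter count, a small illustration rather than anything from the presentation: with m features and C classes, NB estimates one mean and one variance per feature per class, while a full multivariate Gaussian per class would also need a symmetric m × m covariance matrix.

```python
def nb_param_count(m, C):
    # Naive Bayes: one mean and one variance per feature, per class
    return 2 * m * C

def full_gaussian_param_count(m, C):
    # Full Gaussian per class: m means plus m*(m+1)/2 covariance entries
    return C * (m + m * (m + 1) // 2)

for m in (3, 10, 100):
    print(m, nb_param_count(m, 2), full_gaussian_param_count(m, 2))
```

The gap grows quadratically: at 100 features and 2 classes, NB needs 400 estimates versus 10,300 for the full model.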

### A Demonstration

**Data: **From an online B2B exchange (1,220 cases).

**Purpose: **To distinguish cheaters from good sellers.

**Predictors:**

• Member type: enterprise, personal, other
• Years since joined: 1 to 10 years
• Number of months since last membership renewal
• Membership renewal duration
• Type of service bought: standard, limited edition…
• Whether the member has a registered company
• Whether the company page is decorated
• Number of days the member logged in during the past 60 days
• Industry: production, distribution, investment…

**Target: **To predict whether a seller is likely to cheat buyers, based on data from old sellers.

### Issues involved: Probability distribution

• With discrete (categorical) features, the probabilities can be estimated using frequency counts.

• With continuous features, one can assume a certain parametric form for the probability distribution.

• There is evidence that discretizing the data before applying NB is effective.

• Equal Frequency Discretization (EFD) divides the sorted values of a continuous variable into *k* equally populated bins.
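EFD is simple to implement; here is a minimal sketch, using made-up values for a feature like "years since joined" rather than the actual demonstration data:

```python
def equal_frequency_bins(values, k):
    """Equal Frequency Discretization: sort the values and split them
    into k bins with (as near as possible) equal numbers of cases."""
    xs = sorted(values)
    n = len(xs)
    # integer arithmetic spreads any remainder across the bins
    return [xs[(i * n) // k:((i + 1) * n) // k] for i in range(k)]

# Hypothetical continuous feature values
data = [1, 7, 3, 9, 2, 8, 5, 4, 6, 10, 2, 7]
for b in equal_frequency_bins(data, 3):
    print(b)
```

Each resulting bin can then be treated as one category, so the probabilities are again estimated by frequency counts.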

### Issues involved: Zero probabilities

• The case where a class and a feature value never occur together in the training set creates a problem: assigning a probability of zero to one of the terms causes the whole product to evaluate to zero.

• The zero probability can be replaced by a small constant, such as 0.5/n, where *n* is the number of observations in the training set.
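The 0.5/n correction amounts to a one-line fallback in the frequency estimate. A minimal sketch (the function name and example counts are illustrative, not from the presentation):

```python
def conditional_prob(count, class_count, n):
    """Frequency estimate of P(X = x | Y = c).

    count       -- times feature value x co-occurs with class c
    class_count -- total cases of class c in the training set
    n           -- total number of observations in the training set
    """
    if count == 0:
        # Unseen (value, class) pair: use a small constant instead of 0,
        # so one missing combination cannot zero out the whole product.
        return 0.5 / n
    return count / class_count

n = 1000
print(conditional_prob(0, 400, n))   # unseen pair
print(conditional_prob(40, 400, n))  # seen pair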

### Issues involved: Missing values

• In some applications, values are missing not at random and can be meaningful; missing values are therefore treated as a separate category.

• If one does not want to treat missing values as a separate category, they should be handled prior to applying this macro, either by imputing the missing values or by excluding the cases where they are present.
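Treating missing values as their own category fits naturally into the frequency-count estimates. A small sketch, using an invented sample of the "member type" predictor with the label "MISSING" chosen arbitrarily:

```python
from collections import Counter

# Hypothetical categorical feature with some values missing (None)
member_type = ["enterprise", "personal", None, "other", None, "enterprise"]

# Map missing values to their own category before counting
counts = Counter("MISSING" if v is None else v for v in member_type)
total = sum(counts.values())
probs = {cat: c / total for cat, c in counts.items()}
print(probs)
```

The missing-value category then gets its own conditional probabilities per class, just like any other level of the feature.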