2020/4/30

KDD’99 Knowledge Discovery Contest

Advisor: Dr. Hsu
Graduate: Yu-Wei Su
Intelligent Database System Lab, IDSL

Outline

- Motivation
- Objective
- Contest target
- KDD’99 Competition: Knowledge Discovery Contest
- Knowledge Discovery in a Charitable Organization’s Donor Database
- Profiling Your Customers Using Bayesian Networks
- Opinion

Motivation

- Sending direct mail to every customer is inefficient and costly
- Use unsupervised clustering methods instead of the supervised classification methods used in the 1998 competition

Objective

- Discovering higher-level knowledge from data
- Maximizing the profit of the predictive model

Contest target

- The database is the same one used in the 1998 competition
- The data come from an American charity’s June ’97 renewal campaign
- Techniques included
  - Unsupervised clustering
  - Knowledge-driven segmentation
  - Association rule discovery
  - Causal modeling

Contest target (cont’d)

- Each team had to build a profit-maximizing predictive model for the ’98 competition task

KDD’99 Competition: Knowledge Discovery Contest

- Introduction
- Exploratory data analysis
- A two-stage prediction model
- Understanding the model
- Conclusion

Introduction

- The paper discusses SAS Institute’s findings
- It expands on the ’98 KDD Cup competition
  - Reveals unusual data anomalies
  - A two-stage prediction model yields results superior to those of ’98
  - Uses a decision tree to better understand the model
  - Applies a confidence interval to judge model performance

Introduction (cont’d)

- The models were built using a 95,412-case training data set with known response
- To judge model efficacy, expected gift amounts were calculated for a validation data set with concealed response

Exploratory data analysis

- Elements of a successful statistical prediction model
  - Problem-specific knowledge
  - Historical data
  - Analytical savvy

Exploratory data analysis

- Four anomalies are immediately apparent

A two-stage prediction model

- Goal: accurately estimate the probability distribution of the gift amount for a potential donor
- Prediction can be done in two ways
  - Directly estimating the expected gift
  - Separately estimating the donation probability and the expected gift amount, then multiplying them

A two-stage prediction model (cont’d)

- The task was done using two multi-layer perceptron (MLP) neural networks
- Gift amount model
  - Built first, using only the cases where a gift occurred (5% of the data)
  - Inputs reflect historical patterns in the gift amount (fitting a class-probability decision tree)

A two-stage prediction model (cont’d)

- Gift probability model
  - Uses all the cases
  - Inputs reflect recency, frequency, amount (RFA) and demographic data, plus the patterns noted in the exploratory data analysis

A two-stage prediction model (cont’d)

- The first-stage MLP
  - Input layer with five inputs, fully connected to 20 hidden units, which are then fully connected to a target unit
  - 4,843 cases with TARGET_B=1
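
The architecture described above (five inputs, 20 hidden units, one target unit) can be sketched as a plain forward pass. This is only an illustration: the tanh hidden activation, the linear output, the random initial weights, and the function names are assumptions, not details given on the slide.

```python
import math
import random


def init_mlp(n_inputs, n_hidden, seed=0):
    """Create random placeholder weights for an n_inputs -> n_hidden -> 1 network."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                for _ in range(n_hidden)]
    b_hidden = [0.0] * n_hidden
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b_out = 0.0
    return w_hidden, b_hidden, w_out, b_out


def forward(params, x):
    """Fully connected forward pass: tanh hidden layer (assumed), linear output."""
    w_hidden, b_hidden, w_out, b_out = params
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


def n_parameters(n_inputs, n_hidden):
    """Weights plus biases for one hidden layer and a single target unit."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1)


params = init_mlp(n_inputs=5, n_hidden=20)
y = forward(params, [0.1, -0.2, 0.3, 0.0, 0.5])  # untrained output, illustrative only
```

With one hidden layer, a five-input network of this shape has `n_parameters(5, 20)` weights and biases to fit from the 4,843 donor cases.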

A two-stage prediction model (cont’d)

- The second-stage MLP
  - Input layer with eight inputs, fully connected to 20 hidden units, which are then fully connected to a target unit
  - The expected gift amount for each case was calculated as the product of the outputs of the first- and second-stage models
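
The product step can be sketched as follows. The case identifiers and the numbers are made up for illustration, and `expected_gift` is a hypothetical helper name; only the multiplication itself comes from the slide.

```python
def expected_gift(p_donate, amount_if_donate):
    """Expected gift = P(gift) * E[amount | gift]; a non-donor contributes 0."""
    if not 0.0 <= p_donate <= 1.0:
        raise ValueError("p_donate must be a probability")
    return p_donate * amount_if_donate


# (case id, predicted donation probability, predicted gift amount if a gift occurs)
cases = [("A", 0.05, 15.0), ("B", 0.10, 8.0), ("C", 0.02, 50.0)]

scores = {cid: expected_gift(p, amount) for cid, p, amount in cases}

# Rank cases from highest to lowest expected gift.
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy data the rare but large donor "C" ranks first, which is the kind of case a probability-only model would push to the bottom of the list.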

A two-stage prediction model (cont’d)

- Net revenue of the model on the validation data is $14,877.77
- This is $165.53 more than the gold-medal winner at KDD-98

Understanding the model

Brief summary

- A smaller mailing size leads to smaller variability in the expected total gift, because the variance sum has fewer terms
- Further development would incorporate both profit maximization and risk minimization as determinants of the optimum mailing depth
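
The variance argument can be illustrated with a small sketch. It assumes that responses are independent across cases and that each case gives a fixed amount if it donates; both are simplifications made here, not statements from the paper. Under those assumptions the variance of the total gift is the sum of the per-case variances, so a shorter mailing list adds fewer non-negative terms.

```python
def total_gift_variance(cases):
    """Variance of the total gift for independent cases.

    Each case is (p, a): donation probability p and a fixed gift amount a,
    so the per-case variance is a^2 * p * (1 - p).
    """
    return sum(a * a * p * (1.0 - p) for p, a in cases)


# Hypothetical cases, already sorted by priority for mailing.
cases = [(0.10, 20.0), (0.08, 15.0), (0.05, 12.0), (0.03, 10.0)]

# Variance of the total gift when mailing only the top k cases.
variances = [total_gift_variance(cases[:k]) for k in range(1, len(cases) + 1)]
```

Because every term is non-negative, `variances` never decreases as the mailing depth `k` grows.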

Knowledge Discovery in a Charitable Organization’s Donor Database

- Introduction
- Main results
- Detailed results
- Comments

Introduction

- The data set contains about 95,000 customers, with an average net donation of slightly over 11 cents per customer and a total net donation of around $10,500 under the “mail to all” policy
- The task was approached with standard 2-class knowledge discovery and with Value Weighted Analysis (VWA)

Main results

- Maximal net profit of $15,515 when checked against the evaluation data set, compared with the KDD Cup ’98 best result of $14,712 net profit
- Built a “white-box” model comprising 11 customer segments that brings a combined net donation of $13,397 on the evaluation data set

Main results (cont’d)

- Donation segments that are highly profitable and actionable
- Approximately 14,000 people who
  - Live in an area where over 5% of renters pay over $400 per month
  - Have donated over $100 in the past
  - Have an average donation of over $12
- They account for $8,200 net donation in the training set

Main results (cont’d)

- Identifying donors is a different task from maximizing donations
- The variability of profit results
  - A profit difference of less than $500 cannot be considered significant
  - A difference of $2,000 in profit is not significant across different data sets

Main results (cont’d)

- The main discovery and modeling approach was a 1-stage, 2-class model based on VWA

Detailed results

- The most significant variables for predicting a customer’s donation behavior are the previous donation behavior summaries
- The NK phenomenon
- US census data turns out to be quite strongly connected to the donation performance of the population

Detailed results (cont’d)

- Five models were chosen in the end
- Two “white-box” models, one relatively simple model based on 40 variables, and two candidates for the “best overall” model

Detailed results (cont’d)

- Building the white-box model
  - The total net donation of the 11 segments is $13,397 for 55,086 customers
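
A quick arithmetic check puts the white-box figure next to the “mail to all” average of slightly over 11 cents quoted in the Introduction of this paper. The comparison assumes both figures are net of mailing cost and comparable across data sets, which the slides do not state explicitly; the variable names are illustrative.

```python
white_box_net_donation = 13397.0   # dollars, total over the 11 segments
white_box_customers = 55086        # customers covered by the 11 segments

# Average net donation per customer inside the white-box segments.
per_customer = white_box_net_donation / white_box_customers

# "Mail to all" average from the Introduction slide ("slightly over 11 cents").
mail_to_all_avg = 0.11

# Ratio of the two averages; slightly overstated because 0.11 is a lower bound.
lift = per_customer / mail_to_all_avg
```

The segments average roughly 24 cents per customer, about twice the “mail to all” figure.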

Detailed results (cont’d)

- Best single model
  - Selected 31 original variables, plus 9 additional demographic summary variables

[Figure: optimal point and predicted point]

Detailed results (cont’d)

- Improving prediction by averaging
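
The slide does not say which models were averaged or how they were weighted. As a minimal sketch, assuming a simple unweighted mean of per-customer scores from several models (the function name and the numbers are hypothetical):

```python
def average_predictions(model_scores):
    """Unweighted mean of per-customer scores across models.

    model_scores: list of score lists, one per model, aligned by customer.
    """
    if not model_scores:
        raise ValueError("need at least one model")
    n_customers = len(model_scores[0])
    if any(len(scores) != n_customers for scores in model_scores):
        raise ValueError("all models must score the same customers")
    n_models = len(model_scores)
    return [sum(column) / n_models for column in zip(*model_scores)]


# Made-up expected-donation scores from two models for three customers.
scores_a = [0.50, 1.20, 0.10]
scores_b = [0.70, 0.80, 0.30]

averaged = average_predictions([scores_a, scores_b])
```

Averaging tends to damp the errors of any single model, which matters here because the net profit has a very large variance.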

Brief summary

- A good leave-out test set performance can hardly be considered a reliable indication of good future performance
- The net profit has a very large variance

Profiling Your Customers Using Bayesian Networks

- Introduction
- Data manipulation and preprocessing
- Result
- Conclusion

Introduction

- Two causal models were built to understand the characteristics of respondents to direct-mail fund-raising campaigns
- The first model (response-net) captures the dependency of the probability of response to the mailing campaign on the independent variables
  - Data set of 96,376 lapsed donors

Introduction (cont’d)

- The second network (donation-net) models the dependency of the dollar amount of the gift
  - The 5% who responded to the ’97 mailing campaign

Data manipulation and preprocessing

- Removed redundant variables, variables with more than 99% missing values, and variables with only one state
- All continuous variables were discretized into four bins of equal length
- In the end, 30 variables were available
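
The filtering and discretization steps above can be sketched as follows. The function names, the use of `None` for missing values, and the sample ages are assumptions for illustration; the paper's actual tooling is not described on the slide.

```python
def keep_variable(values, max_missing=0.99):
    """Drop a variable if over 99% of its values are missing or it has one state."""
    missing_share = sum(v is None for v in values) / len(values)
    states = {v for v in values if v is not None}
    return missing_share <= max_missing and len(states) > 1


def equal_width_bins(values, n_bins=4):
    """Map each value to a bin index 0..n_bins-1 using bins of equal length."""
    present = [v for v in values if v is not None]
    if not present:
        return [None] * len(values)
    lo, hi = min(present), max(present)
    width = (hi - lo) / n_bins
    binned = []
    for v in values:
        if v is None:
            binned.append(None)          # keep missing values as missing
        elif width == 0:
            binned.append(0)             # constant variable: single bin
        else:
            # The maximum falls exactly on the upper edge; clamp it into the last bin.
            binned.append(min(int((v - lo) / width), n_bins - 1))
    return binned


ages = [18, 25, 40, 61, 79, 98]
age_bins = equal_width_bins(ages)  # bin edges at 18, 38, 58, 78, 98
```

Equal-length bins are simple to compute but can leave some bins nearly empty when a variable is skewed, as donation amounts often are.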

Data manipulation and preprocessing (cont’d)

- These variables can be divided into three groups
  - Variables with personal information about the donors
  - Variables describing the donors’ neighborhoods, such as socio-economic and urbanicity indicators
  - Variables about the donors’ history and promotion history file

Result: profiling respondents

Result: profiling donors

Result

- Profit prediction
  - Both models can be used to predict the expected profit
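
One way the two networks could be combined into a profit estimate is sketched below. It assumes response-net supplies a response probability and donation-net an expected gift given a response. The mailing cost is a hypothetical parameter with an illustrative value, since the slide does not give one.

```python
def expected_profit(p_response, expected_donation, mailing_cost):
    """Expected profit of mailing one donor: P(response) * E[gift | response] - cost."""
    return p_response * expected_donation - mailing_cost


def should_mail(p_response, expected_donation, mailing_cost):
    """Mail only when the expected profit is positive."""
    return expected_profit(p_response, expected_donation, mailing_cost) > 0.0


MAILING_COST = 0.50  # illustrative value, not taken from the slides

likely_donor = expected_profit(0.05, 20.0, MAILING_COST)
unlikely_donor = expected_profit(0.02, 15.0, MAILING_COST)
```

Under these made-up numbers the first donor is worth mailing and the second is not, even though both have low response probabilities.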

Brief summary

- The paper shows an application of Bayesian methods to a knowledge discovery task
- Maintaining a high response rate to direct fund raising requires continuously updating the donor database

Opinion

- These papers stay at too high a conceptual level
- A simple methodology can achieve great results