2020/4/30
KDD’99 Knowledge Discovery Contest
Advisor: Dr. Hsu
Graduate: Yu-Wei Su
Intelligent Database System Lab, IDSL
Outline
Motivation
Objective
Contest target
KDD’99 Competition: Knowledge Discovery Contest
Knowledge Discovery in a Charitable Organization’s Donor Database
Profiling Your Customers Using Bayesian Networks
Opinion
Motivation
Sending direct mail to all customers is inefficient and costly
Use unsupervised clustering methods instead of the supervised classification methods of the 1998 competition
Objective
Discovering higher-level knowledge from data
Maximizing the profit of the predictive model
Contest target
The database is the same one used in the 1998 competition
The data come from an American charity’s June ’97 renewal campaign
Techniques included:
  Unsupervised clustering
  Knowledge-driven segmentation
  Association rule discovery
  Causal modeling
Contest target (cont’d)
Each team had to build a profit-maximizing predictive model for the ’98 competition task
KDD’99 Competition: Knowledge Discovery Contest
Introduction
Exploratory data analysis
A two-stage prediction model
Understanding the model
Conclusion
Introduction
The paper discusses SAS Institute’s findings
It expands on the ’98 KDD Cup competition
It reveals unusual data anomalies
A two-stage prediction model yields results superior to those in ’98
A decision tree is used to better understand the model
A confidence interval is applied to judge model performance
Introduction (cont’d)
The models were built using a 95412-case training data set with known responses
To judge model efficacy, expected gift amounts were calculated for a validation data set with concealed responses
Exploratory data analysis
Elements of a successful statistical prediction model:
  Problem-specific knowledge
  Historical data
  Analytical savvy
Exploratory data analysis
Four anomalies are immediately apparent
Exploratory data analysis (cont’d)
Exploratory data analysis (cont’d)
A two-stage prediction model
Goal: accurately estimate the probability distribution of the gift amount for a potential donor
Prediction can be done in two ways:
  Directly estimating the expected gift
  Separately estimating the donation probability and the expected gift amount, then multiplying them
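The second route rests on a simple identity, written here as a sketch. The symbols $G$ (gift amount) and $D$ (donation indicator, 1 if a gift occurs) are notation introduced for illustration and do not appear in the slides; a non-donor is assumed to contribute a gift of zero.

```latex
\[
E[G] \;=\; P(D = 1)\, E[G \mid D = 1] \;+\; P(D = 0) \cdot 0
     \;=\; P(D = 1)\, E[G \mid D = 1]
\]
```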
A two-stage prediction model (cont’d)
The task was done using two multi-layer perceptron (MLP) neural networks
Gift amount model
  First, uses the cases where a gift occurred (5% of the data)
  Inputs reflect historical patterns in the gift amount (fitting a class-probability decision tree)
A two-stage prediction model (cont’d)
Gift probability model
  Uses all the cases
  Inputs reflect recency, frequency, amount (RFA) and demographic data, and the patterns noted in the exploratory data analysis
A two-stage prediction model (cont’d)
The first-stage MLP
  Input layer with five inputs fully connected to 20 hidden units, which are in turn fully connected to a target unit
  4843 cases with TARGET_B=1
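The 5-20-1 layout described above can be sketched as a plain forward pass. This is only an illustration of the layer shapes: the tanh hidden activation, the linear target unit, the bias terms, and the random untrained weights are all assumptions, since the slides do not state them, and the function names are hypothetical.

```python
import math
import random


def init_mlp(n_inputs=5, n_hidden=20, seed=0):
    """Create small random weights for a one-hidden-layer MLP (illustrative only)."""
    rng = random.Random(seed)
    hidden_w = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
                for _ in range(n_hidden)]
    hidden_b = [0.0] * n_hidden
    out_w = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    out_b = 0.0
    return hidden_w, hidden_b, out_w, out_b


def forward(params, x):
    """Forward pass: inputs -> tanh hidden units (assumed) -> one linear target unit."""
    hidden_w, hidden_b, out_w, out_b = params
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(hidden_w, hidden_b)]
    return sum(w * h for w, h in zip(out_w, hidden)) + out_b


def count_parameters(params):
    """Count weights and biases in both fully connected layers."""
    hidden_w, hidden_b, out_w, _ = params
    return sum(len(row) for row in hidden_w) + len(hidden_b) + len(out_w) + 1


params = init_mlp()                                   # five inputs, 20 hidden units
prediction = forward(params, [1.0, 0.5, -0.2, 0.0, 2.0])  # hypothetical input case
n_params = count_parameters(params)
```

With bias terms included, such a network has 5 x 20 + 20 weights and biases in the hidden layer and 20 + 1 in the output layer.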
A two-stage prediction model (cont’d)
A two-stage prediction model (cont’d)
The second-stage MLP
  Input layer with eight inputs fully connected to 20 hidden units, which are in turn fully connected to a target unit
The expected gift amount for each case was calculated as the product of the first- and second-stage model outputs
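The product of the two stage outputs can be sketched as follows. The case scores, the function names, and the mailing cost of 0.68 are hypothetical placeholders, and the rule "mail when the expected gift exceeds the mailing cost" is one common way to use such scores, not something stated in the slides.

```python
def expected_gift(p_donate, amount_if_donate):
    """Expected gift = P(donation) * E[gift amount | donation]."""
    return p_donate * amount_if_donate


def select_for_mailing(cases, mailing_cost):
    """Return ids of cases whose expected gift exceeds the cost of one mailing."""
    return [case_id for case_id, p, amount in cases
            if expected_gift(p, amount) > mailing_cost]


# Hypothetical scores: (case id, donation probability, predicted amount if a gift occurs)
cases = [
    ("a", 0.05, 10.0),
    ("b", 0.08, 15.0),
    ("c", 0.02, 50.0),
]
selected = select_for_mailing(cases, mailing_cost=0.68)  # cost value is an assumption
```

Case "c" shows why the two stages are multiplied: a low response probability can still be worth a mailing when the predicted amount is large.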
A two-stage prediction model (cont’d)
Net revenue of the model on the validation data is $14877.77
This is $165.53 more than the gold-medal winner at KDD-98
Understanding the model
Brief summary
A smaller mailing size leads to smaller variability in the expected total gift, because the variance sum has fewer terms
Further development would incorporate both profit maximization and risk minimization as determinants of the optimum mailing depth
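The first point can be made concrete under one assumption that the slides do not state: that the gifts $G_i$ of the mailed customers are independent. Writing $M$ for the set of mailed customers (notation introduced here for illustration):

```latex
\[
\operatorname{Var}\Big(\sum_{i \in M} G_i\Big) \;=\; \sum_{i \in M} \operatorname{Var}(G_i)
\]
```

Each term on the right is non-negative, so removing customers from $M$ can only lower the total variance.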
Knowledge Discovery in a Charitable Organization’s Donor Database
Introduction
Main results
Detailed results
Comments
Introduction
The data set contains about 95000 customers, with an average net donation of slightly over 11 cents per customer and a total net donation of around $10500 under the “mail to all” policy
The task was approached with standard 2-class knowledge discovery combined with Value Weighted Analysis (VWA)
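As a rough consistency check on these figures, the total under "mail to all" is simply the number of customers times the average net donation. The variable names are illustrative, and 0.11 is a rounded-down stand-in for "slightly over 11 cents".

```python
n_customers = 95_000        # "about 95000 customers"
avg_net_donation = 0.11     # dollars; rounded down from "slightly over 11 cents"

# Lower bound on the total net donation when every customer is mailed
total_net_donation = n_customers * avg_net_donation
```

The product is about $10450, and because the true average is slightly above 11 cents, a total of around $10500 is consistent with it.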
Main results
Maximal net profit of $15515 when checked against the evaluation data set, compared to the KDD Cup ’98 best result of $14712 net profit
Built a “white-box” model comprising 11 customer segments, which brings a combined net donation of $13397 on the evaluation data set
Main results (cont’d)
Donation segments that are highly profitable and actionable
Approximately 14000 people who:
  Live in an area where over 5% of renters pay over $400 per month
  Have donated over $100 in the past
  Have an average donation of over $12
They account for $8200 of net donation in the training set
Main results (cont’d)
Main results (cont’d)
Identifying donors is a different task from maximizing donations
Variability of the profit results:
  A profit difference of less than $500 cannot be considered significant
  A difference of $2000 in profit is not significant across different data sets
Main results (cont’d)
The main discovery and modeling approach was a 1-stage, 2-class model based on VWA
Detailed results
The most significant variables for predicting a customer’s donation behavior are the summaries of previous donation behavior
The NK phenomenon
US census data turn out to be quite strongly connected to the donation performance of the population
Detailed results (cont’d)
Five models were chosen in the end:
  Two “white-box” models
  One relatively simple model, based on 40 variables
  Two candidates for the “best overall” model
Detailed results (cont’d)
Building the white-box model
Total net donation of the 11 segments is $13397 for 55086 customers
Detailed results (cont’d)
Detailed results (cont’d)
Best single model
Selected 31 original variables, plus 9 additional demographic summary variables
Figure labels: Optimal point, Predict point
Detailed results (cont’d)
Detailed results (cont’d)
Improving prediction by averaging
Brief summary
Good performance on a leave-out test set can hardly be considered a reliable indication of good future performance
The net profit has a very large variance
Profiling Your Customers Using Bayesian Networks
Introduction
Data manipulation and preprocessing
Result
Conclusion
Introduction
Build two causal models to understand the characteristics of respondents to direct-mail fund-raising campaigns
The first model (response-net) captures the dependency of the probability of response to the mailing campaign on the independent variables
  Data set of 96376 lapsed donors
Introduction (cont’d)
The second network (donation-net) models the dependency of the dollar amount of the gift
  Uses the 5% of cases that responded to the ’97 mailing campaign
Data manipulation and preprocessing
Removed redundant variables, variables with more than 99% missing values, and variables with only one state
All continuous variables were discretized into four bins of equal length
30 variables were available in the end
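Discretizing into four bins of equal length can be sketched as below. This is a minimal illustration rather than the authors' procedure: the function name, the sample values, and the choices to clamp the maximum into the last bin and to map a constant column to bin 0 are assumptions.

```python
def equal_width_bins(values, n_bins=4):
    """Map each value to a bin index 0..n_bins-1 using bins of equal length."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # A constant column has only one state; put everything in bin 0.
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


values = [0, 1, 2, 3, 4, 8]        # hypothetical continuous variable
bins = equal_width_bins(values)    # range 0..8 split into four bins of length 2
```

Equal-length bins are simple to compute, but a skewed variable can leave some bins nearly empty, as the gap between 4 and 8 in the sample values suggests.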
Data manipulation and preprocessing (cont’d)
These variables can be divided into three groups:
  Variables with personal information about the donors
  Variables describing the donors’ neighborhoods, such as socio-economic and urbanicity indicators
  Variables from the donors’ history and promotion history file
Result: profiling respondents
Result: profiling donors
Result
Profit prediction
Both models can be used to predict the expected profit
Brief summary
Showed an application of Bayesian methods to a knowledge discovery task
The way to maintain a high response rate in direct fund raising is to continuously update the donor database
Opinion
The papers focus on overly high-level concepts
A simple methodology can achieve great results