A Predictive Model of Inquiry to Enrollment

Cullen F. Goenner, PhD
Department of Economics
University of North Dakota
[email protected]
www.business.und.edu/goenner
Kenton Pauls
Director of Enrollment Services
University of North Dakota
[email protected]
Issues Facing Enrollment Managers

Finding new “markets”
Need to attract a particular type of student
Increasing tuition
Declining population (ND)
Increasing competition
Diversity/Quality
Data-driven analysis
Accountability
Questions we will answer today

What is predictive modeling?

How does one build a predictive model?

How can predictive modeling be used by institutions of higher education to improve enrollment?
What is Predictive Modeling?

Predictive modeling uses statistical/econometric methods to quantitatively predict the future behavior of individuals.

Steps include:

Collect data on the subject of interest
Build the model based on data analysis
Make predictions out of sample
Validate/test the model
College Choice
3-stage process - Hossler and Gallagher (1987)

Predisposition/aspiration for higher education
Encouragement, coursework, and interest

Search of potential schools
Counselors, campus contacts, program availability

Selection
SES, ability, fit, geography
Factors Influencing Choice
Economic perspective:
Education is an investment in human capital
Cost vs. benefit calculus

Psychological perspective:
Need of self to find a sense of belonging and fulfillment of needs.

Sociological perspective:
Social interaction dictated by societal/family norms.
Existing Empirical Work
Search Choice

Applications:
DesJardins, Dundar, and Hendel (1999)
Weiler (1994)

Interest: SAT scores sent
Toutkoushian (2001)
Existing Models of Enrollment Choice



Model a student’s binary choice to enroll at a particular college while controlling for the student’s characteristics.
Logistic models used
Conditional on students having:

Applied
Bruggink and Gambhir (1996)
Thomas, Dawes, and Reznik (2001)

Admitted
DesJardins (2002)
Leppel (1993)
Our Predictive Model



Builds on the models of DesJardins (2002) and Thomas, Dawes, and Reznik (2001)
Focus here is on predicting the enrollment of students who inquired of our institution.
“Inquiry model” is relevant because:

It is a time of information exchange and opinion formation
It allows for early intervention in a student’s decision-making process (target marketing)
Inquiry Model Challenges

Data collection

Data are already collected on those who apply or are admitted, but typically not on inquiries.

Quality of data

Applicants provide detailed data describing themselves (demographic data, test scores, HSGPA, etc.), which are not available for most student inquiries.
Types of Inquiries We Recorded







Return of information card
Attendance of college fair
Campus visit
Contact via e-mail
Contact via phone
Referral from faculty, coach, or alumni
ACT automatically submitted
How these data were captured



Enrollment Services Prospective Student Network relational database (ESPSN)
Customized system
SQL 2000/Visual Basic
Information Collected From Information Request Card

Name
High School attended
Interested Major (if any)
Address

Lacks the demographic data typical of application records and used in most predictive models.
Geodemography

Process of attaching demographic characteristics to geographic characteristics.

Notion is that “Birds of a Feather Flock Together”, i.e. individuals living in the same neighborhood will tend to have similar behavior patterns.

Ex: Neighborhoods homogeneous in terms of household income, occupations, family size, and purchases.
Implementation

US Census data aggregated to zip code level

“Geodemographic” variables considered for our model specification:

College age demographic
Population
Average Income
White demographic
Median age
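As a rough illustration of how zip-level census aggregates might be attached to an inquiry record, here is a minimal Python sketch. The field names `avginc` and `medage` follow the predictor names used later in the slides; the function `attach_geodemographics`, the lookup table, and all values are hypothetical and are not taken from the ESPSN system.

```python
# Hypothetical zip-level census aggregates; the values are made up for illustration.
CENSUS_BY_ZIP = {
    "58201": {"avginc": 41000.0, "medage": 27.5},
    "55401": {"avginc": 62000.0, "medage": 33.1},
}

def attach_geodemographics(inquiry, census_by_zip):
    """Return a copy of the inquiry record with zip-level census fields added.

    Fields are set to None when the zip code has no census entry.
    """
    fields = census_by_zip.get(inquiry["zip"])
    enriched = dict(inquiry)
    for name in ("avginc", "medage"):
        enriched[name] = fields[name] if fields else None
    return enriched

inquiry = {"name": "Example Student", "zip": "58201", "major": "aviation"}
enriched = attach_geodemographics(inquiry, CENSUS_BY_ZIP)
unknown = attach_geodemographics({"name": "Other", "zip": "00000"}, CENSUS_BY_ZIP)
```

The only key the sketch relies on is the address zip code, which is one of the few fields an information request card provides.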
Building the model

Binary choice model: model whether students who inquire of UND enroll or do not enroll.

15,827 students made inquiries for Fall 2003 enrollment. Of these students, 2,067 actually enrolled.

Logistic regression model used.
Candidate Control Variables





Type and Frequency of Contact
Geographic
Academic
Geodemographic
Interaction Effects
Contact Variables
Predictor   Description
contacts    Number of inquiries
autoact     1 if automatically submitted ACT score; 0 otherwise
visit       Number of campus visits
referral    1 if referred by faculty, coach, alumni; 0 otherwise
www         1 if inquiry made by internet; 0 otherwise
phone       1 if inquiry made by phone; 0 otherwise
Geographic Variables
Predictor   Description
distance    Distance in miles from our institution
hystate     Resident of MN or ND
hyschool    Historically high yield school
compete     Distance in miles to closest regional competitor
dist1       Distance between 100-300 miles
dist2       Distance between 300-500 miles
dist3       Distance between 500-1000 miles
dist4       Distance greater than 1000 miles
Academic/Geodemographic
Predictor   Description
acadint     1 if academic interest expressed; 0 otherwise
aviation    1 if academic interest is aviation; 0 otherwise
colldemo    % of population who completed some college
totalpop    Total population of zip code
medage      Median age of zip code
whitedem    % of population white (Non-Hispanic)
avginc      Average income in dollars of zip code
Interaction Terms
vismile     # of visits x distance
avitmile    Aviation x distance
aviatinc    Aviation x average income
incmile1    Average income x Distance 1
incmile2    Average income x Distance 2
incmile3    Average income x Distance 3
incmile4    Average income x Distance 4
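The interaction terms can be sketched as simple products of their component variables. The Python below is illustrative only: it assumes that "Distance 1" through "Distance 4" refer to the `dist1` to `dist4` band indicators, and it assumes a particular treatment of band boundaries, which the slides do not specify. The function names are hypothetical.

```python
def distance_band_dummies(distance_miles):
    """Return dist1..dist4 indicator variables for the distance bands.

    Assumption: lower bounds are inclusive, and exactly 1000 miles falls in
    dist3 because dist4 is described as "greater than 1000 miles".
    """
    return {
        "dist1": int(100 <= distance_miles < 300),
        "dist2": int(300 <= distance_miles < 500),
        "dist3": int(500 <= distance_miles <= 1000),
        "dist4": int(distance_miles > 1000),
    }

def interaction_terms(visits, distance_miles, aviation, avginc):
    """Build the interaction terms as products of their component variables."""
    bands = distance_band_dummies(distance_miles)
    terms = {
        "vismile": visits * distance_miles,
        "avitmile": aviation * distance_miles,
        "aviatinc": aviation * avginc,
    }
    for i in range(1, 5):
        terms[f"incmile{i}"] = avginc * bands[f"dist{i}"]
    return terms

example = interaction_terms(visits=2, distance_miles=350.0, aviation=1, avginc=50000.0)
```

For a student 350 miles away, only `incmile2` is nonzero among the income-by-distance terms, since that student falls in the 300-500 mile band.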
Model Specification

Researchers typically assume their model specification is the true model that generates the data.

It is difficult to justify a priori the choice of variables to include in the model, given that each is, by design, theoretically relevant.

With k candidate variables there are 2^k different linear models one could consider.

Consider the case in which several models {M1, …, MK} are theoretically possible.
Basing inference on the results of a single model is risky.
Bayesian model averaging (BMA) allows us to account for this type of uncertainty.
BMA
The posterior distribution of the parameters given the data, in the presence of model uncertainty, is a weighted average of the posterior distributions under each of the K models, with weights equal to the posterior model probabilities P(Mk|D).

\[
P(\beta \mid D) = \sum_{k=1}^{K} P(\beta \mid M_k, D)\, P(M_k \mid D) \tag{1}
\]
Posterior Model Probability is

\[
P(M_k \mid D) = \frac{P(D \mid M_k)\, P(M_k)}{\sum_{l=1}^{K} P(D \mid M_l)\, P(M_l)} \tag{2}
\]

where P(D|Mk) is the likelihood and P(Mk) is the prior probability that model Mk is the true model, given that one of the K models is the true model.
Posterior Model Probability
Assuming a non-informative prior (P(M1) = … = P(MK) = 1/K),

\[
P(M_k \mid D) \approx \frac{\exp\left(-\tfrac{1}{2}\,\mathrm{BIC}_k\right)}{\sum_{l=1}^{K} \exp\left(-\tfrac{1}{2}\,\mathrm{BIC}_l\right)} \tag{3}
\]
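A minimal Python sketch of equation (3), assuming the convention in which a smaller BIC indicates a better model, so that each model's weight is exp(-BIC/2). The function name and the BIC values are hypothetical; this is not the S-PLUS code used for the results.

```python
import math

def posterior_model_probabilities(bics):
    """Approximate posterior model probabilities from BIC values.

    Computes exp(-BIC_k / 2) / sum_l exp(-BIC_l / 2) under equal prior
    model probabilities. Subtracting the smallest BIC first leaves the
    ratios unchanged and avoids underflow for large BIC values.
    """
    best = min(bics)
    weights = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical BIC values for three candidate models (illustration only).
pmps = posterior_model_probabilities([1000.0, 1002.0, 1010.0])
```

A BIC difference of 2 between two models corresponds to a posterior odds ratio of exp(1), roughly 2.7 to 1 in favor of the model with the lower BIC.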
The posterior mean and variance summarize the effects of the parameters on the dependent variable. Raftery (1995) reports

\[
E(\beta_1 \mid D, \beta_1 \neq 0) = \sum_{A_1} \hat{\beta}_1(k)\, P(M_k \mid D) \tag{9}
\]

\[
\mathrm{Var}(\beta_1 \mid D, \beta_1 \neq 0) = \sum_{A_1} \left[ \mathrm{Var}(k) + \hat{\beta}_1(k)^2 \right] P(M_k \mid D) - E(\beta_1 \mid D, \beta_1 \neq 0)^2
\]

where β̂1(k) and Var(k) are the maximum likelihood estimates under model k, and the summation is over the set A1 of models that include β1.
BMA Implementation

S-PLUS function bic.logit performs BMA on logistic regression models.

30 regressors imply a summation in equation (1) over more than 1 billion models.

To manage the summation we use Occam’s window.
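The "1 billion" figure follows from counting every subset of the 30 candidate regressors, since each regressor is either in or out of a model. A short Python check (the function name is illustrative):

```python
def count_candidate_models(k):
    """Number of distinct subsets of k candidate regressors (2 to the power k)."""
    return 2 ** k

n_models = count_candidate_models(30)
print(f"{n_models:,}")  # prints 1,073,741,824
```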
Occam’s Window
Exclude models that predict the data sufficiently less well than the best model. Predictive ability is judged by the posterior model probability (PMP) of each model. Models in A' are included:

\[
A' = \left\{ M_k : \frac{\max_l \mathrm{PMP}_l}{\mathrm{PMP}_k} \leq C \right\}
\]
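As a sketch of the Occam's window rule, the Python below keeps a model only if the best model's PMP is at most C times its own PMP, then rescales the surviving PMPs to sum to one. The PMP values and the cutoff C = 20 are hypothetical; the slides do not state the value of C that was used, and the renormalization step is an assumption about how the retained models are weighted.

```python
def occams_window(pmps, c):
    """Return indices of models whose PMP is within a factor c of the best PMP."""
    best = max(pmps)
    return [k for k, p in enumerate(pmps) if p > 0 and best / p <= c]

def renormalize(pmps, kept):
    """Rescale the PMPs of the retained models so they sum to one."""
    total = sum(pmps[k] for k in kept)
    return {k: pmps[k] / total for k in kept}

# Hypothetical PMPs for five models and a hypothetical cutoff C = 20.
pmps = [0.50, 0.30, 0.15, 0.04, 0.01]
kept = occams_window(pmps, c=20)
weights = renormalize(pmps, kept)
```

Here the last model is dropped because the best model is 50 times more probable, which exceeds the cutoff.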
Results



26 models supported by the data
Model with the highest PMP receives 21% of the total.
Variables that receive strong support for inclusion:

Geographic: Distance, HY State, HY School, Competitor distance
Geodemog: College Age, Average Income
Contacts: Number, Campus visit, Referral
Table 3: Results of BMA Applied to Prediction of Enrollment

Predictor       Mean β|D     Std Error β|D   Pr(β≠0|D) (%)
Contact
contacts        0.1969       0.0299          100
autoact         0.0191       0.0690          8.3
visit           1.3386       0.0827          100
referral        1.7240       0.0745          100
www             0.0147       0.0665          5.6
phone           0.1650       0.1901          47.5
Geographic
distance        -0.0040      0.0004          100
hystate         0.7726       0.1213          100
hyschool        0.9491       0.0819          100
compete         0.0033       0.0004          100
dist1           0            0               0
dist2           0.0155       0.0723          5.2
dist3           0            0               0
dist4           0            0               0
Geodemographic
colldemo        2.8015       0.5395          100
totalpop        2.80E-07     1.34E-06        4.9
medage          0            0               0
whitedem        0.0578       0.2178          7.9
avginc          8.59E-06     1.49E-06        100
Academic
acadint         0.2725       0.0803          97.8
aviation        0.1871       0.2478          38.1
Interaction
vismile         0.0016       0.0002          100
avitmile        0.0005       0.0004          63.1
aviatinc        0            0               0
incmile1        0            0               0
incmile2        4.10E-07     1.59E-06        7.4
incmile3        0            0               0
incmile4        0            0               0
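One way to read the posterior means in Table 3 is as logit coefficients, in which case exponentiating a coefficient gives the multiplicative change in the odds of enrolling for a one-unit change in that predictor. The Python sketch below takes that reading as an assumption and ignores the interaction terms (for example `vismile`), so the numbers are only a rough guide. The function name `odds_ratio` is hypothetical.

```python
import math

# Posterior means for selected predictors, as read from Table 3.
POSTERIOR_MEANS = {
    "visit": 1.3386,
    "referral": 1.7240,
    "hystate": 0.7726,
    "distance": -0.0040,
}

def odds_ratio(coefficient, change=1.0):
    """Multiplicative change in the odds of enrolling for a given change in x."""
    return math.exp(coefficient * change)

odds_ratios = {name: odds_ratio(b) for name, b in POSTERIOR_MEANS.items()}
# Effect of an extra 100 miles, holding everything else fixed.
per_100_miles = odds_ratio(POSTERIOR_MEANS["distance"], change=100)
```

Under these assumptions, each additional campus visit multiplies the odds of enrollment by a bit less than 4, while each additional 100 miles of distance multiplies them by about two thirds.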
Out of Sample Predictive Performance

Split the data into two equal parts:

First part used to build/estimate the model
Second part used to test the model’s predictions.

Outcome (enrollment) is binary, while our model generates a probability estimate.
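A minimal sketch of a half-and-half split in Python. The slides do not say how records were assigned to each half; random assignment with a fixed seed is an assumption made here for reproducibility, and `split_in_half` is a hypothetical helper.

```python
import random

def split_in_half(records, seed=0):
    """Randomly split records into an estimation half and a test half."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]

# Stand-in record ids; 15,827 is the number of inquiries reported earlier.
estimation, holdout = split_in_half(range(15827))
```

With an odd number of records the two halves differ in size by one; here the test half has 7,914 records.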
What is a successful prediction?


Greene (2001) - No “correct” choice for probability cutoff. Typical value is 0.5.
Tradeoff in cutoff choice:

A lower cutoff increases the share of actual enrollees who are predicted to enroll (sensitivity) at the expense of more inquiries that are predicted to enroll but do not (false positive rate).
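The cutoff tradeoff can be seen on a toy example. The Python below uses made-up scores and outcomes, not the UND data, and the helper `rates_at_cutoff` is hypothetical; it treats a score at or above the cutoff as a prediction to enroll.

```python
def rates_at_cutoff(scores, outcomes, cutoff):
    """Return (sensitivity, false positive rate) when score >= cutoff predicts enroll."""
    tp = sum(1 for s, y in zip(scores, outcomes) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, outcomes) if s >= cutoff and y == 0)
    positives = sum(outcomes)
    negatives = len(outcomes) - positives
    return tp / positives, fp / negatives

# Made-up predicted probabilities and outcomes (1 = enrolled), illustration only.
scores = [0.9, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1, 0.05, 0.02]
outcomes = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

high = rates_at_cutoff(scores, outcomes, cutoff=0.5)
low = rates_at_cutoff(scores, outcomes, cutoff=0.25)
```

Lowering the cutoff from 0.5 to 0.25 in this toy data catches all four enrollees instead of two, but also doubles the number of non-enrollees flagged.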
Predictive Performance: Classification

                          Actual Outcome
Prediction                Enrolled      Did not Enroll   TOTAL
Predicted to Enroll       370 (36%)     194 (2.8%)       564
Predicted not to Enroll   657 (64%)     6693 (97%)       7350
TOTAL                     1027          6887             7914
Predictive performance

89% of observations correctly classified

Specificity: 97%
Sensitivity: 36%

ROC curve describes the relation between sensitivity and 1 - specificity (false positive rate)

Area under ROC curve = 0.87
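The summary rates can be recomputed from the four cell counts of the classification table. The Python below is a small check using those counts; the function name is illustrative.

```python
# Counts from the classification table for the test half of the data.
TP, FP = 370, 194    # predicted to enroll: enrolled / did not enroll
FN, TN = 657, 6693   # predicted not to enroll: enrolled / did not enroll

def classification_rates(tp, fp, fn, tn):
    """Return accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

rates = classification_rates(TP, FP, FN, TN)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

Rounded to whole percentages, these reproduce the 89%, 36%, and 97% figures.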
Another Predictive Performance Method

MODEL SCORE RANGES

Score    Total      Total     Total not  Percent of  Accumulating   Accumulating count of   Accumulated percent
range    estimates  enrolled  enrolled   enrolled    % of enrolled  records within ranges   of total estimates
Total    15,412     1,893     13,519
1.00     24         20        4          1%                         24
0.90     221        158       63         8%          9%             245
0.80     278        202       76         11%         20%            523
0.70     217        136       81         7%          27%            740
0.60     217        140       77         7%          35%            957
0.50     319        153       166        8%          43%            1,276
0.40     434        209       225        11%         54%            1,710
0.30     592        225       367        12%         66%            2,302
0.20     1,048      247       801        13%         79%            3,350                   22%
0.10     3,231      292       2,939      15%
0.00     8,831      111       8,720      6%



79% of enrolled found within 22% of entire population (scores >= 0.2)
Focused efforts without compromising enrollment numbers
Efficiency implications
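The 79% and 22% figures can be recovered by accumulating the per-range counts from the highest score range downward. The Python sketch below uses the counts from the score-range table; it assumes each range label is the lower bound of its range, which is consistent with the "scores >= 0.2" note, and the names are illustrative.

```python
# Per-range counts from the score-range table, ordered from highest to lowest score.
SCORE_FLOORS = [1.00, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.00]
ESTIMATES = [24, 221, 278, 217, 217, 319, 434, 592, 1048, 3231, 8831]
ENROLLED = [20, 158, 202, 136, 140, 153, 209, 225, 247, 292, 111]

def cumulative_gains(estimates, enrolled):
    """Return per-range (share of all records, share of all enrolled), accumulated."""
    total_records, total_enrolled = sum(estimates), sum(enrolled)
    gains, seen_records, seen_enrolled = [], 0, 0
    for n, e in zip(estimates, enrolled):
        seen_records += n
        seen_enrolled += e
        gains.append((seen_records / total_records, seen_enrolled / total_enrolled))
    return gains

gains = cumulative_gains(ESTIMATES, ENROLLED)
records_share, enrolled_share = gains[SCORE_FLOORS.index(0.20)]
```

At the 0.20 row, `records_share` rounds to 22% and `enrolled_share` rounds to 79%.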
Practical Applications



Effective regional market segmentation
Targeted tele-counseling efforts
Special projects
Regional Market Segmenting

Target Marketing and Segmentation


Prospect names purchased based on zip code.
Establish a predictive “score” for all zip codes in the US based on census-level data
What the data indicated (WA)
Where enrolled students came from (WA)

83% of enrolled WA students fell within top-scoring zips over three years
Direct Mail Name Purchases

Prior years: very open search criteria
MN, CO, SD, MT

This year: much more restrictive, to get deeper into broader markets
Only key zips
CO, WA, OR, AZ, IL, MN, etc.

WA Search Names - 2003
WA Search Names - 2004
Targeted Tele-Counseling Efforts




Student calling program
Top 20% of all model scores identified
Fluid number excluding applicants
Prompt student to take action
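A possible way to build the calling list is sketched below in Python. It assumes the top 20% threshold is taken over all scored prospects and that applicants are removed afterward, which would make the list size "fluid" as students apply; the slides do not spell out this order. The function name, ids, and scores are hypothetical.

```python
def top_scoring_non_applicants(scores, applicant_ids, fraction=0.20):
    """Return ids in the top `fraction` of model scores, minus existing applicants.

    scores maps a prospect id to a model score. The top group is chosen over
    all prospects first; applicants are then dropped from the call list.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_top = int(len(ranked) * fraction)
    return [pid for pid in ranked[:n_top] if pid not in applicant_ids]

# Made-up scores for ten prospects; "p02" has already applied.
scores = {f"p{i:02d}": s for i, s in enumerate(
    [0.05, 0.91, 0.84, 0.10, 0.33, 0.02, 0.47, 0.15, 0.08, 0.61], start=1)}
call_list = top_scoring_non_applicants(scores, applicant_ids={"p02"})
```

In this toy data the two highest scorers are `p02` and `p03`; since `p02` has already applied, only `p03` remains on the list.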
Special Projects



Limited funds but targeted initiatives
Focus on as many of the top-scoring students as possible
Postcards, brochures, etc.
Possible Future Research


Cluster analysis for better market segmentation
Study of marginal effects
Thank You!
Questions?