Transcript Time Series Data Processes
Time Series Data Processes
by Tai Yu April 15, 2013
WHAT ARE YOU DOING???
• • • • • • • • • • • • • • • • • • • • • • • •
61 %macro
get_data(mn_yr0 ,mn_yr1 ,mn_yr2 ,mn_yr3 ,mn_yr4 ,mn_yr5, mn_yr6);
62 63 data
asof_&mn_yr6. asof_&mn_yr5. asof_&mn_yr4. asof_&mn_yr3. asof_&mn_yr2. asof_&mn_yr1.;
64 65 66
set wfpf_delinq_data; if daysdelq<1 THEN DPD=0; else if daysdelq>180 THEN DPD=180; else DPD=daysdelq;
67 68 69 70 71
if "28& mn_yr1 .:00:00:00"dt<=ASOF_DT<"01& mn_yr0 .:00:00:00"dt then do; dpd_1 =dpd; output asof_& mn_yr1 .; end; else if "28& mn_yr2 .:00:00:00"dt<=ASOF_DT<"01& mn_yr1 .:00:00:00"dt then do; dpd_2 =dpd; output asof_& mn_yr2 .; end; else if "28& mn_yr3 .:00:00:00"dt<=ASOF_DT<"01& mn_yr2 .:00:00:00"dt then do; dpd_3 =dpd; output asof_& mn_yr3 .; end; else if "28& mn_yr4 .:00:00:00"dt<=ASOF_DT<"01& mn_yr3 .:00:00:00"dt then do; dpd_4 =dpd; output asof_& mn_yr4 .; end; else if "28& mn_yr5 .:00:00:00"dt<=ASOF_DT<"01& mn_yr4 :00:00:00"dt
72 73
then do; dpd_5 =dpd; output asof_& mn_yr5 .; end; else if "28& mn_yr6 .:00:00:00"dt<=ASOF_DT<"01& mn_yr5 .:00:00:00"dt then do; dpd_6 =dpd; output asof_& mn_yr6 .; end;
74 75 run;
2
• • • • • • • • • • • • • • • • • • • WHAT ARE YOU DOING???
84 data 85 data 86 data
_asof_& mn_yr1 .; set asof_& mn_yr1 .; drop dpd_2--dpd_6 ;
run
; _asof_& mn_yr2 .; set asof_& mn_yr2 .; drop dpd_1 dpd_3--dpd_6 ;
run
; _asof_& mn_yr3 .; set asof_& mn_yr3 .; drop dpd_1--dpd_2 dpd_4--dpd_6 ;
run
;
87 data
_asof_& mn_yr4 .; set asof_& mn_yr4 .; drop dpd_1--dpd_3 dpd_5--dpd_6 ;
run
;
88 data
_asof_& mn_yr5 .; set asof_& mn_yr5 .; drop dpd_1--dpd_4 dpd_6 ;
run
;
89 data
_asof_& mn_yr6 .; set asof_& mn_yr6 .; drop dpd_1--dpd_5 ;
run
;
90 91 data 92
Cohort_12mn_dpd_& merge _asof_& mn_yr6 mn_yr1 .; . _asof_& mn_yr5 . _asof_& mn_yr4 . _asof_& mn_yr3 . _asof_& mn_yr2 . _asof_& mn_yr1 . (in=a);
93
by acct_id;
94
if a;
95 run; 96 97 %mend; 98 99 get_data
(Sep2006, Aug2006,Jul2006,Jun2006,May2006,Apr2006, Mar2006,); 3
WHAT ARE YOU DOING???
• •
61 %macro 62
get_data(mn_yr0 ,mn_yr1 ,mn_yr2 ,mn_yr3 ,mn_yr4 ,mn_yr5 , mn_yr6 , mn_yr6, mn_yr7 ,mn_yr8 ,mn_yr9 ,mn_yr10,mn_yr11 , mn_yr12); • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •
63 data
asof_&mn_yr12. asof_&mn_yr11. asof_&mn_yr10. asof_&mn_yr9. asof_&mn_yr8. asof_&mn_yr7.
64 65 66
asof_&mn_yr6. asof_&mn_yr5. asof_&mn_yr4. asof_&mn_yr3. asof_&mn_yr2. asof_&mn_yr1. ; set wfpf_delinq_data;
67 68 69 70
if daysdelq<1 THEN DPD=0; else if daysdelq>180 THEN DPD=180; else DPD=daysdelq; if "28& mn_yr1 .:00:00:00"dt<=ASOF_DT<"01& asof_& then do; dpd_1 =dpd; output asof_& mn_yr1 .; end; mn_yr0.
:00:00:00"dt else if "28& mn_yr2 .:00:00:00"dt<=ASOF_DT<"01& mn_yr1 .:00:00:00"dt
71 72 73 74
then do; dpd_2 =dpd; output asof_& mn_yr2 .; end; else if "28& mn_yr3 .:00:00:00"dt<=ASOF_DT<"01& mn_yr2 .:00:00:00"dt then do; dpd_3 =dpd; output asof_& mn_yr3 .; end; else if "28& mn_yr4 .:00:00:00"dt<=ASOF_DT<"01& mn_yr3 .:00:00:00"dt then do; dpd_4 =dpd; output asof_& mn_yr4 .; end; else if "28& mn_yr5 .:00:00:00"dt<=ASOF_DT<"01& mn_yr4 :00:00:00"dt then do; dpd_5 =dpd; output asof_& mn_yr5 .; end; else if "28& mn_yr6 .:00:00:00"dt<=ASOF_DT<"01& mn_yr5 .:00:00:00"dt
75 76 77 78
then do; dpd_6 =dpd; output asof_& mn_yr6 .; end; else if "28& mn_yr7 .:00:00:00"dt<=ASOF_DT<"01& mn_yr6 .:00:00:00"dt then do; dpd_7 =dpd; output asof_& mn_yr7 .; end; else if "28& mn_yr8 .:00:00:00"dt<=ASOF_DT<"01& mn_yr7 .:00:00:00"dt then do; dpd_8 =dpd; output asof_& mn_yr8 .; end; else if "28& mn_yr9 .:00:00:00"dt<=ASOF_DT<"01& mn_yr8 .:00:00:00"dt then do; dpd_9 =dpd; output asof_& mn_yr9 .; end; else if "28& mn_yr10 .:00:00:00"dt<=ASOF_DT<"01& mn_yr9 .:00:00:00"dt then do; dpd_10 =dpd; output asof_& mn_yr10 .; end; else if "28& mn_yr11 .:00:00:00"dt<=ASOF_DT<"01& mn_yr10 .:00:00:00"dt
79 80
then do; dpd_11 =dpd; output asof_& mn_yr11 .;end; else if "28& mn_yr12 .:00:00:00"dt<=ASOF_DT<"01& mn_yr11 .:00:00:00"dt then do; dpd_12 =dpd; output asof_& mn_yr12 .; end;
81 run;
4
WHAT ARE YOU DOING???
• • • • • • • • • • • • • • • • • • • • • • • • • • •
82 83 data
_asof_& mn_yr1 .; set asof_& mn_yr1 .; drop dpd_2--dpd_12 ;
run
;
84 data 85 data
_asof_& mn_yr2 .; set asof_& mn_yr2 .; drop dpd_1 dpd_3--dpd_12 ;
run
; _asof_& mn_yr3 .; set asof_& mn_yr3 .; drop dpd_1--dpd_2 dpd_4--dpd_12 ;
run
;
86 data
_asof_& mn_yr4 .; set asof_& mn_yr4 .; drop dpd_1--dpd_3 dpd_5--dpd_12 ;
run
;
87 data
_asof_& mn_yr5 .; set asof_& mn_yr5 .; drop dpd_1--dpd_4 dpd_6--dpd_12 ;
run
;
88 data
_asof_& mn_yr6 .; set asof_& mn_yr6 .; drop dpd_1--dpd_5 dpd_7--dpd_12 ;
run
;
89 data 90 data
_asof_& _asof_& mn_yr7 mn_yr8 .; set asof_& .; set asof_& mn_yr7 mn_yr8 .; drop .; drop dpd_1--dpd_6 dpd_8--dpd_12 dpd_1--dpd_7 dpd_9--dpd_12 ; ;
run run
; ;
91 data
_asof_& mn_yr9 .; set asof_& mn_yr9 .; drop dpd_1--dpd_8 dpd_10--dpd_12 ;
run
;
92 data 93 data 94 data 95
_asof_& mn_yr10 .; set asof_& mn_yr10 .; drop dpd_1--dpd_9 dpd_11--dpd_12 ;
run
; _asof_& mn_yr11 .; set asof_& mn_yr11 .; drop dpd_1--dpd_10 dpd_12 ;
run
; _asof_& mn_yr12 .; set asof_& mn_yr12 .; drop dpd_1--dpd_11 ;
run
;
96 data
Cohort_12mn_dpd_& mn_yr1 .;
97 98 99 merge
_asof_& mn_yr12 . _asof_& mn_yr11 . _asof_& mn_yr10 . _asof_& mn_yr9 . _asof_& mn_yr8 . _asof_& mn_yr7 . _asof_& mn_yr6 . _asof_& mn_yr5 . _asof_& mn_yr4 . _asof_& mn_yr3 . _asof_& mn_yr2 . _asof_& mn_yr1 .(
in=a
);
100 101 by if
a; acct_id;
102 run; 103 104 %mend; 105 106 get_data
(Sep2006, Aug2006,Jul2006,Jun2006,May2006,Apr2006, Mar2006, Feb2006,Jan2006,Dec2005,Nov2005,Oct2005,Sep2005); 5
What is Time Series Data?
• Definition of Time Series: A time series is a collection of observations of well-defined data items obtained through repeated measurements over time. (by Australian Bureau of Statistics)
An ordered sequence of values of a variable at equally spaced time intervals
. (by Engineering Statistics Handbook) 6
What is Time Series Data?
• For example, the monthly delinquent status of a customer over 12-month period
Obs 8 9 10 11 12 1 2 3 4 5 6 7 id
106794 106794 106794 106794 106794 106794 106794 106794 106794 106794 106794 106794
asof_dt dpd
08/31/06 07/31/06 06/30/06 05/31/06 04/30/06 03/31/06 02/28/06 01/31/06 12/31/05 11/30/05 10/31/05 09/30/05 108 77 46 16 15 16 13 16 16 15 0 0 7
Stages of Time Series Analysis
1. Analyze data to obtain an understanding of the underlying drivers that produced the collected data. 2. Develop model(s) to forecast possible outcomes through the collected data. 3. Compare monitoring results with predicted outcomes to make appropriate control process modification(s). 8
Applications of Time Series Data
Stock Market Inventory Sales Workload 9
A Typical Time Series Data Process
Transpose 12 monthly delinquent status observations of account
Obs 8 9 10 11 12 1 2 3 4 5 6 7 acct_id
106794 106794 106794 106794 106794 106794 106794 106794 106794 106794 106794 106794
asof_dt
08/31/06 07/31/06 06/30/06 05/31/06 04/30/06 03/31/06 02/28/06 01/31/06 12/31/05 11/30/05 10/31/05 09/30/05
dpd
108 77 46 16 15 16 13 16 16 15 0 0 to a single account observation with 12-month delinquent status
Obs 1 acct_id
106794
perf_obs_dt dpd_1 dpd_2 dpd_3 dpd_4 dpd_5 dpd_6 dpd_7 dpd_8 dpd_9 dpd_10 dpd_11 dpd_12
08/31/06 108 77 46 16 15 16 13 16 16 15 0 0 10
• • • • • • • • • • • • • • • • • •
Time Series Data by SAS Procedure
001 002 003 004 005 006 007
%macro
DLQ_12_Month(perf_obs_date); data perf_12_month; set acct_basel_dpd; format perf_obs_dt date9.; 008 perf_obs_dt = intnx("MONTH",
"&perf_obs_date."d
,
0
,'END'); 009 010 dlq_status_month = intck ("MONTH", datepart(asof_dt)), perf_obs_dt ) +
1
; 011 012 if
1
<= dlq_status_month <=
12
; 013 run; 014 015 proc sort data = perf_12_month; 016 017 run; by acct_id perf_obs_dt dlq_status_month; 018 11
Time Series Data by SAS Procedure
After executing line 003 to line 018, the data set is now with two additional variables: Performance Observed Date (perf_obs_dt) Delinquent Status Month (dlq_status_month)
obs acct_id asof_dt
1 106794 08/31/2006 2 106794 07/31/2006 3 106794 06/30/2006 4 106794 05/31/2006 5 106794 04/30/2006 6 106794 03/31/2006 7 106794 02/28/2006 8 106794 01/31/2006 9 106794 12/31/2005 10 106794 11/30/2005 11 106794 10/31/2005 12 106794 09/30/2005
basel_dpd perf_obs_dt dlq_status_month
108 77 46 16 15 16 13 16 16 15 0 0 08/31/2006 08/31/2006 08/31/2006 08/31/2006 08/31/2006 08/31/2006 08/31/2006 08/31/2006 08/31/2006 08/31/2006 08/31/2006 08/31/2006 1 2 3 4 5 6 7 8 9 10 11 12 12
• • • • • • • • • • • • • • • •
Time Series Data by SAS Procedure
018 019 proc transpose data = perf_12_month 020 out = Cohort_12mn_dpd_&perf_obs_date 021 (drop = _name_ where=(dpd1 ^=
.
)) 022 prefix = dpd 023 ; 024 by acct_id perf_obs_dt; 025 026 027 028 029 run; 030 031 032 033 % id
%mend
; dlq_status_month; var basel_dpd;
DLQ_12_Month
(01AUG2006); PROC TRANSPOSE: 1. Transposes the variable basel_dpd by acct_id perf_obs_dt 2. Creates new variables dpd_1 to dpd_12 by PREFIX and ID options 13
Time Series Data by SAS Procedure
After executing line 019 to line 029, the new data set is a single account observation with 12-month delinquent status
Obs 1 acct_id
106794
perf_obs_dt dpd_1
08/31/06 108
dpd_2 dpd_3 dpd_4 dpd_5 dpd_6 dpd_7 dpd_8 dpd_9 dpd_10 dpd_11 dpd_12
77 46 16 15 16 13 16 16 15 0 0 14
• • • • • • • • • • • • • • • • • • • • • • •
Time Series Data by SAS Data Step
019 data Cohort_12mn_dpd_&perf_obs_date (keep = acct_id perf_obs_dt dpd1 - dpd12); 020 set perf_12_month; 021 by acct_id perf_obs_dt dlq_status_month; 022 array dpd[
12
] dpd1 - dpd12; 023 retain dpd1 - dpd12 j
0
; 024 025 if first.acct_id and first.perf_obs_dt 026 then do; 027 do i =
1
to
12
; 028 dpd[i] =
0
; 029 end; 030 j =
0
; 031 end; 032 033 j = j +
1
; 034 035 dpd[j] = basel_dpd; 036 037 if last.acct_id and last.perf_obs_dt ; 038 run; 039 040
%mend
; 041 %
DLQ_12_Month
(01AUG2006); 15
Time Series Data by SAS Data Step
SAS Data Step: 1. Declares ARRAY DPD to create new variable DPD_1 to DPD_12 2. Declares RETAIN to pass the values of variable DPD_1 to DPD_12 from one observation to the next observation 3. Initiates the values of variable DPD_1 to DPD_12 to 0s when the first account id and the first observation date are detected and neutralizes index J 4. Assigns the values of variable DPD_1 to DPD_12 by index J 5. Outputs the values of variable DPD_1 to DPD_12 to new data set only when the last account id and the last observation date is detected 16
• • • • • • • • • • • • • • • • •
Time Series Data by SAS Function
019 data Cohort_12mn_dpd_&perf_obs_date 020 (keep = acct_id perf_obs_dt dpd1 - dpd12); 021 set perf_12_month; 022 by acct_id perf_obs_dt dlq_status_month; 023 array dpd[12] dpd1 - dpd12;; 024 025 %do i = 1 %to 12; 026 %let j = %eval(12 - &i); 027 dpd[&i] = lag &j(basel_dpd); 028 %end; 029 030 if dlq_status_month =
12
; 031 run; 032 033
%mend
; 034 035 %
DLQ_12_Month
(01AUG2006); 17
Time Series Data by SAS Function
• SAS LAG Function: 1. Stores a value in a queue and returns a value stored previously in that queue. 2. Each occurrence of a LAGn function in a program generates its own queue of values.
3. When an occurrence of LAGn is executed, the value at the top of its queue is removed and returned, the remaining values are shifted upwards, and the new value of the argument is placed at the bottom of the queue. observation of the prior execution.
obs acct_id asof_dt
1 106794 08/31/2006
basel_dpd perf_obs_dt dlq_status_month
108 08/31/2006 1 2 106794 07/31/2006 3 106794 06/30/2006 77 46 08/31/2006 08/31/2006 2 3 4 106794 05/31/2006 5 106794 04/30/2006 6 106794 03/31/2006 16 15 16 08/31/2006 08/31/2006 08/31/2006 4 5 6 7 106794 02/28/2006 8 106794 01/31/2006 9 106794 12/31/2005 10 106794 11/30/2005 11 106794 10/31/2005 12 106794 09/30/2005 13 16 16 15 0 0 08/31/2006 08/31/2006 08/31/2006 08/31/2006 08/31/2006 08/31/2006 7 8 9 10 11 12
DPD LAG
DPD_
1
DPD_ DPD_
3
DPD_ DPD_
5
DPD_
2 4 6
DPD_
7
LAG
11
LAG
10
LAG
9
LAG
8
LAG
7
LAG
6
LAG
5
DPD_
8
LAG
4
DPD_
9
LAG
3
DPD_
10
LAG
2
DPD_
11
LAG
1
DPD_
12
LAG
0
18
• • • • • • • • • • • • • • • • • • •
Time Series Data by SAS Function
019 data Cohort_12mn_dpd_&perf_obs_date 020 (keep = acct_id perf_obs_dt dpd1 - dpd12); 021 set perf_12_month; 022 by acct_id perf_obs_dt dlq_status_month; 023 array dpd[
12
] dpd1 - dpd12;; 024 025 if dlq_status_month =
12
; 026 027 %do i = 1 %to 12; 028 %let j = %eval(12 - &i); 029 dpd[&i] = lag&j(basel_dpd); 030 %end; 031 032
*
033 run; if dlq_status_month =
12
; 034 035
%mend
; 036 037 %
DLQ_12_Month
(01AUG2006); 19
Time Series Data by SAS Function
SAS LAG Function: 1. The queue for each occurrence of LAGn is initialized with n missing values. 2. Missing values are returned for the first n executions of each occurrence of LAGn, after which the lagged values of the argument begin to appear.
3. Stores values at the bottom of the queue and returns values from the top of the queue occurs only when the function is executed. An occurrence of the LAGn function that is executed conditionally will store and return values only from the observations for which the condition is satisfied.
Obs 1 acct_id
106794
perf_obs_dt dpd1
08/31/06 .
.
dpd2
.
dpd3
.
dpd4
.
dpd5
.
dpd6
.
dpd7
.
dpd8
.
dpd9
.
dpd10
.
dpd11 dpd12
0 20
Time Series Data by SAS Function
Special Case: When not all time series is populated. The sub-setting IF statement (if dlq_status_month = 12;) will never be satisfied. SAS returns no observation to the output dataset.
obs acct_id
1 2 3 4 5 6 7 8 9 10679 10679 10679
asof_dt
08/31/2006 07/31/2006 06/30/2006 10679 10679 10679 05/31/2006 04/30/2006 03/31/2006 10679 10679 10679 02/28/2006 01/31/2006 12/31/2005
basel_dpd
10 7 4 6 5 6 3 6 6
dlq_status_month
1 2 3 4 5 6 7 8
9
21
• • • • • • • • • • • • • • • • • • • •
Time Series Data by SAS Function
019 proc sort data = perf_12_month; 020 by acct_id load_dt descending dlq_status_month; 021 run; 022 023 data Cohort_12mn_dpd_&perf_obs_date (keep = acct_id load_dt dpd1 - dpd12); 024 set perf_12_month; 025 by acct_id load_dt descending dlq_status_month; 026 027 array dpd[
12] dpd1 - dpd12;;
028 029 %do i =
1 %to 12;
030 %let j = %eval(&i - 1); 031 dpd[&i] = lag&j(basel_dpd); 032 %end; 033 034 if dlq_status_month =
1 ;
035 run;
036 %mend;
037 038 %
DLQ_12_Month(01AUG2006);
22
Time Series Data by SAS Function
Special Case: When not all time series is populated. By 1. Sorting the variable dlq_status_month in descending order 2. Conditioning the sub setting IF statement is “True” when the beginning of the time series date period (if dlq_status_month = 1) is reached.
Obs 1 acct_id load_dt
10679 31-Aug-06
dpd1
10
dpd2
7
dpd3
4
dpd4
6
dpd5
5
dpd6
6
dpd7
3
dpd8
6
dpd9
6
dpd10
.
dpd11
.
dpd12
.
23
Weakness and Strength of Each Approach
Approach Strength Weakness =================================================== DATA STEP LAG FUNCTION Flexible Manipulating Initialization Self Explanatory Conditional Execution 24