Using Business Taxation Data as Auxiliary Variables and as Substitution Variables in the Australian Bureau of Statistics Frank Yu, Robert Clark and Gabriele B.

Using Business Taxation Data as
Auxiliary Variables and as
Substitution Variables in the
Australian Bureau of Statistics
Frank Yu, Robert Clark and Gabriele B. Durant
Outline of talk
Use of tax data in ABS
Using tax data as auxiliary variables
example: subannual surveys
Using tax data as variables of interest
missing taxation data
example: annual surveys
Dealing with missing tax data:
Missing at Random
Common Error Measurement model
Conclusion
Use of tax data
construct and maintain population frame
as auxiliary variables for estimation
substitute survey data to reduce provider burden
as source for imputing missing/invalid survey data
provide independent estimates for validation of
outputs
Data supplied by Australian
Taxation Office
Australian Business Register information
businesses identified by name, address
industry, payees
Business Activity Statement data - GST and PAYG data
available (90%) 6 months after reference quarter
turnover, wage and salaries, capital and non-capital expenses
Income Tax data
available (70 to 80%) 18 months after reference quarter
detailed expenses and revenue and balance sheet
Use of tax data for frame creation
ABS Maintained Population
ABS MP
complex units
ATO maintained population
from Australian Business Register
ATO MP
simple units: ABN = statistical unit
Use of tax data for frame
construction
construction: units from ABR
industry, sector
number of payees
multistate indicators
maintenance:
births and cancellation
tax roles: e.g. employing vs non-employing units
long term non-remitters excluded
stratification: single/multiple states, industry
Frame auxiliary variables (xi's)
derived size benchmarks:
from BAS, based on wage and salaries data
used as stratification variables
BAS turnover
BAS wages
need imputation (derived from average of quarterly
data)
lag reference quarter by 2 quarters
Survey data vs tax data
[Table: star ratings (* to ***) comparing the sample survey, BAS data and BIT data on concept, accuracy, timeliness, detailed domain and richness of data items]
Use of tax data as auxiliary variables
Survey                          Variables of interest    Auxiliary variables for estimation
Retail Trade                    Sales                    BAS turnover
Economic Activity Survey        financial variables      BIT variables
Annual Integrated Collection    same as EAS              BAS variables
tax data as auxiliary variables
sample s: xi and yi observed
U\s: only xi observed
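Generalised Regression Estimation
\[ \hat{Y}_{GREG} = \hat{Y}_{HT} + (X - \hat{X}_{HT})\,\hat{B} \]
where
\[ \hat{Y}_{HT} = \sum_{i \in s} Y_i / \pi_i, \qquad \hat{X}_{HT} = \sum_{i \in s} X_i / \pi_i, \]
\[ \hat{B} = \Big( \sum_{i \in s} X_i' X_i / \pi_i \Big)^{-1} \Big( \sum_{i \in s} X_i' Y_i / \pi_i \Big) \]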
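The GREG formula can be sketched numerically as follows. This is a minimal illustration, assuming a linear working model with an intercept; the function name `greg_total` and the toy population are hypothetical and are not taken from ABS systems.

```python
import numpy as np

def greg_total(y_s, X_s, pi_s, X_pop_total):
    """GREG estimate of the population total of y.

    y_s:         survey values for the sampled units, shape (n,)
    X_s:         auxiliary (e.g. tax) values for the sampled units, shape (n, p)
    pi_s:        inclusion probabilities of the sampled units, shape (n,)
    X_pop_total: known population totals of the auxiliaries, shape (p,)
    """
    y_s = np.asarray(y_s, dtype=float)
    X_s = np.asarray(X_s, dtype=float)
    w = 1.0 / np.asarray(pi_s, dtype=float)   # design weights 1 / pi_i
    y_ht = np.sum(w * y_s)                    # Horvitz-Thompson total of y
    x_ht = X_s.T @ w                          # Horvitz-Thompson totals of x
    # Design-weighted least-squares coefficient B
    B = np.linalg.solve(X_s.T @ (w[:, None] * X_s), X_s.T @ (w * y_s))
    return y_ht + (np.asarray(X_pop_total, dtype=float) - x_ht) @ B

# Toy population (hypothetical): y is exactly linear in x, so GREG recovers
# the true total, while the Horvitz-Thompson estimate from this sample does not.
x_pop = np.arange(1.0, 51.0)
X_pop = np.column_stack([np.ones_like(x_pop), x_pop])
y_pop = 3.0 + 2.0 * x_pop
sample = np.arange(0, 50, 5)                  # 10 units, each with pi_i = 0.2
pi = np.full(sample.size, 0.2)

greg = greg_total(y_pop[sample], X_pop[sample], pi, X_pop.sum(axis=0))
ht = np.sum(y_pop[sample] / pi)
```

In this toy case the regression adjustment removes the whole sampling error because the residuals are zero; with real data the gain depends on how well the tax variables predict the survey variable.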
Advantages and disadvantages
Advantages
provide efficiency
approximately unbiased
does not require X's to be measuring the right concepts
does not require X's to be current
Disadvantages
does not model Y directly e.g. zero units
influential points
efficiency in estimating levels not equal to efficiency for estimating change
Issue: inactive/out of scope units
Solution: apply GREG to positive units only
efficiency for estimating level does not
necessarily translate to efficiency for
estimating change
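\[ \mathrm{Var}(\hat{Y}_{2,GREG} - \hat{Y}_{1,GREG}) \le \mathrm{Var}(\hat{Y}_{2,HT} - \hat{Y}_{1,HT}) \quad \text{iff} \quad \rho_{res} \ge 1 - \frac{1 - \rho_Y}{1 - r_{XY}} \]
where $\rho_{res}$ is the lag 1 autocorrelation of residuals,
$\rho_Y$ is the lag 1 autocorrelation of Y's, and
$r_{XY}$ is the correlation between Y and X's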
Data Substitution Approach: Use tax as the variable of interest
Assumes tax data are better
respondents more serious about getting it right
more time to provide information
audited accounts (for BIT) for tax purposes
Detailed breakdown
Missing tax data
require matching to frame
missingness is nonignorable
  inactive units
  late units have more expenses
Examples: Economic Activity Survey (annual) 1990s to 05/06
estimation of totals for broad items for microbusinesses
augmenting sample for simple businesses
estimation of detailed items
tax data as substitution variables
tax data to replace broad level income and expenses items
detailed items imputed by pro-rating broad tax data based on splits observed in surveys
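Pro-rating a broad tax item into detailed items can be sketched as below. This is only an illustration: the function `pro_rate`, the item names and the amounts are hypothetical, and the splits are assumed to come from survey respondents judged comparable to the unit being imputed.

```python
def pro_rate(broad_tax_value, survey_splits):
    """Split a broad tax item into detailed items using survey proportions.

    survey_splits maps a detailed item name to the amount observed in the
    survey for comparable businesses (hypothetical input layout).
    """
    total = sum(survey_splits.values())
    if total <= 0:
        raise ValueError("survey splits must sum to a positive amount")
    return {item: broad_tax_value * amount / total
            for item, amount in survey_splits.items()}

# Hypothetical example: surveyed businesses split expenses 60/30/10.
splits = {"wages": 600.0, "rent": 300.0, "other": 100.0}
detailed = pro_rate(50_000.0, splits)
```

By construction the detailed imputes add back to the broad tax value, so the broad-level estimates are unchanged by the disaggregation.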
Examples: Annual Integrated Collection (06/7 onwards)
AIC - core survey estimates: estimation of totals for survey variables for small and large businesses; tax data as auxiliary variables for generalised regression estimation
AIC complementary estimates: estimation of totals for broad items for microbusinesses; tax data as substitution variables
AIC complementary estimates: estimation of detailed state/industry classes; tax data as substitution variables
AIC complementary estimates: estimation of detailed economic; tax data as substitution variables, disaggregated by model estimation of pro-rating factors
Notation
U: population
ri = 1: Y available
ri = 0: Y not available
Use MAR model conditional on frame variables only
Frame variables Xi are available for every unit in U; Y is the tax data of interest
ri = 1 (Y available): fit model Y = f(x)
ri = 0 (Y not available): under MAR, impute Ŷ = f(x)
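Imputation under MAR given frame variables can be sketched as follows, assuming for illustration that f is a straight line fitted by least squares on the tax reporters. The function name `impute_mar_frame` and the data are hypothetical; `NaN` marks missing tax data.

```python
import numpy as np

def impute_mar_frame(x, y, r):
    """Fit Y = a + b*x on reporters (r == 1) and impute Y where r == 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.asarray(r, dtype=bool)
    design = np.column_stack([np.ones(int(r.sum())), x[r]])
    coef, *_ = np.linalg.lstsq(design, y[r], rcond=None)
    y_completed = y.copy()
    y_completed[~r] = coef[0] + coef[1] * x[~r]   # model-based impute
    return y_completed, coef

# Hypothetical frame: x is a frame size measure, Y follows 5 + 2x exactly.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
r = np.array([1, 1, 1, 0, 1, 0])
y = np.array([25.0, 45.0, 65.0, np.nan, 105.0, np.nan])

y_completed, coef = impute_mar_frame(x, y, r)
```

The imputes are only as good as the assumption that reporters and nonreporters with the same x follow the same model.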
But for non-ignorable missingness
ri = 1 (Y available): model Y = f(x) is fitted on tax reporters only
ri = 0 (Y not available): the reporters' model need not hold for nonreporters, so the impute Ŷ = f(x) can be biased
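A small deterministic example shows the problem. It assumes, purely for illustration, that nonreporters have higher values of Y than reporters with the same x (in the spirit of "late units have more expenses"); the numbers are invented.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
r = np.array([1, 1, 1, 1, 0, 0], dtype=bool)

# Hypothetical truth: reporters follow Y = 2x, nonreporters follow Y = 3x.
y_true = np.where(r, 2.0 * x, 3.0 * x)

# Ratio model fitted on reporters only, then applied to nonreporters.
ratio = y_true[r].sum() / x[r].sum()
y_imputed = np.where(r, y_true, ratio * x)

# Amount by which the imputed total falls short of the true total.
shortfall = y_true.sum() - y_imputed.sum()
```

The frame variable x carries no information about the difference between the two groups, which is why an additional data source on the nonreporters is needed.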
Use a sample to inform about the nonreporters based on their survey response.
Notation: Use Y to represent tax variables and Y* for survey variables (a surrogate of Y)
In U: Xi available for all units; Y available if ri = 1, not available if ri = 0
In the sample s: Y* available for both ri = 1 and ri = 0
Imputing tax data from survey data
In the sample s, ri = 1 (Y and Y* available): fit model Y = f(Y*, x)
In the sample s, ri = 0 (Y not available, Y* available): impute Ŷ = f(Y*, x)
Models for Y
Missing at Random: Y independent of r given x and Y*: $r \perp Y \mid x, Y^*$
Common measurement error: given Y, the distribution of Y* is independent of r: $r \perp Y^* \mid x, Y$
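Written as statements about conditional densities, with $f$ denoting a conditional density, the two assumptions read:

```latex
\begin{align*}
\text{MAR:}\quad & f(Y_i \mid x_i, Y_i^*, r_i = 0) = f(Y_i \mid x_i, Y_i^*, r_i = 1) \\
\text{CME:}\quad & f(Y_i^* \mid x_i, Y_i, r_i = 0) = f(Y_i^* \mid x_i, Y_i, r_i = 1)
\end{align*}
```

Under MAR it is the model for Y given Y* that carries over from tax reporters to nonreporters; under CME it is the measurement model for Y* given Y that carries over.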
Use MAR model: missing at random given X and Y*, $r \perp Y \mid x, Y^*$
In the sample s, ri = 1 (Y and Y* available): fit model Y = f(Y*, x)
In the sample s, ri = 0 (Y not available, Y* available): under MAR, impute Ŷ = f(Y*, x)
Imputation using MAR model
1. Using data on Y and Y* observed from the units in the sample where both survey and tax data are reported, model Y as a function of Y*.
2. Use this model to impute Yi for tax nonreporters in the sample (assuming Y* is known for them).
3. For units not in the sample, if their tax data is missing, impute using the distribution
\[
\begin{aligned}
f(Y_i \mid r_i = 0, x_i) &= \int f(Y_i \mid r_i = 0, x_i, Y_i^*)\, f(Y_i^* \mid r_i = 0, x_i)\, dY_i^* \\
&= \int f(Y_i \mid r_i = 1, x_i, Y_i^*)\, f(Y_i^* \mid r_i = 0, x_i)\, dY_i^*
\end{aligned}
\]
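The three steps can be sketched as below under linear working models. This is an assumption made for illustration, not the model used in the talk: with linear models the integral in step 3 has mean b0 + b1 E(Y* | r = 0, x) + b2 x, so the sketch imputes that mean rather than drawing from the distribution. The function names and data are hypothetical; `NaN` marks unobserved values.

```python
import numpy as np

def ols(design, target):
    """Least-squares coefficients for target ~ design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

def mar_impute(x, y_star, y, r, in_sample):
    """Impute tax data Y for nonreporters (r == 0) under MAR."""
    x, y_star, y = (np.asarray(a, dtype=float) for a in (x, y_star, y))
    r = np.asarray(r, dtype=bool)
    s = np.asarray(in_sample, dtype=bool)
    one = np.ones_like(x)

    # Step 1: model Y on (Y*, x) using sampled tax reporters.
    fit = s & r
    b = ols(np.column_stack([one, y_star, x])[fit], y[fit])

    # Step 2: impute Y for sampled nonreporters from their own Y*.
    y_hat = y.copy()
    m2 = s & ~r
    y_hat[m2] = b[0] + b[1] * y_star[m2] + b[2] * x[m2]

    # Step 3: for non-sampled nonreporters, replace Y* by its predicted
    # mean given x, estimated from the sampled nonreporters.
    g = ols(np.column_stack([one, x])[m2], y_star[m2])
    m3 = ~s & ~r
    y_hat[m3] = b[0] + b[1] * (g[0] + g[1] * x[m3]) + b[2] * x[m3]
    return y_hat

nan = np.nan
x         = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
y_star    = np.array([2.0, 3.0, 7.0, 5.0, 10.0, 12.0, 14.0, nan, nan])
y         = np.array([5.5, 8.0, 16.5, 13.0, nan, nan, nan, nan, 40.0])
r         = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1])
in_sample = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0])

y_hat = mar_impute(x, y_star, y, r, in_sample)
```

Step 3 relies on the sampled nonreporters being representative of all nonreporters with the same x, which holds when sample selection depends only on frame variables.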
Use CME model: $r \perp Y^* \mid x, Y$
In the sample s, ri = 1 (Y and Y* available): fit model Y* = f(Y, x), then invert to get Ŷ = g(Y*)
In the sample s, ri = 0 (Y not available, Y* available): under CME, impute Ŷ = g(Y*)
For i in U\s with ri = 0: impute Ŷ = h(X)
Imputation using CME model
Under CME, $r \perp Y^* \mid x, Y$, so
\[ f(Y_i^* \mid Y_i, x_i, r_i = 0) = f(Y_i^* \mid Y_i, x_i, r_i = 1). \]
A typical model can be:
\[ Y_i^* = \alpha + \beta Y_i + \varepsilon_i \quad \text{where } E(\varepsilon_i \mid Y_i, r_i) = 0. \]
This model motivates an unbiased impute:
\[ \hat{Y}_i = \beta^{-1} (Y_i^* - \alpha) \]
We also want to model Yi in terms of Xi when Y* and Y are both not observed (i.e. for $i \notin s$ and ri = 0):
\[ E(Y_i \mid x_i, r_i = 0) = \gamma_0' x_i \quad \text{giving an impute } \gamma_0' x_i \]
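The unbiasedness of the inverted impute follows in one line. The derivation below assumes the linear measurement model $Y_i^* = \alpha + \beta Y_i + \varepsilon_i$ with $\beta \neq 0$, $E(\varepsilon_i \mid Y_i, r_i) = 0$, and known $\alpha$ and $\beta$; with estimated coefficients it holds only approximately.

```latex
\begin{align*}
E\!\left[ \frac{Y_i^* - \alpha}{\beta} \,\Big|\, Y_i, r_i \right]
  &= \frac{\alpha + \beta Y_i + E(\varepsilon_i \mid Y_i, r_i) - \alpha}{\beta}
   = Y_i .
\end{align*}
```

Because the expectation conditions on $r_i$, the result applies to tax nonreporters ($r_i = 0$) as well as to reporters, which is what the CME assumption provides.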
Modelling survey data (Y*) and tax data (Y) - invert this to predict Y from Y*
Model: survey data Y* (EAS 05/06) as a function of frame variable X (tax_turn_0405) for tax nonrespondents (i.e. r = 0)
Empirical Best Linear Unbiased Predictor (EBLUP) of Yi
BLUP impute:
EBLUP impute
CME imputation process
1. Use units in the sample where both tax and survey variables are observed to model the survey variable (Y*) as a function of tax and frame data (Y, X). Under CME this model applies to r = 0 too.
2. Use units in the sample where survey data are observed (i in s) but tax data are not (ri = 0) to model the survey variable (Y*) as a function of frame data (x).
3. Combine to give an impute for Y for tax nonrespondents (r = 0).
Combine to get EBLUP
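The process can be sketched as below with linear working models. This uses a simple moment-based inversion, not the EBLUP referred to in the slides, whose formula is not reproduced in the transcript. The function name `cme_impute`, the linear forms and the data are assumptions for illustration; `NaN` marks unobserved values.

```python
import numpy as np

def cme_impute(x, y_star, y, r, in_sample):
    """Impute tax data Y for nonreporters (r == 0) under CME."""
    x, y_star, y = (np.asarray(v, dtype=float) for v in (x, y_star, y))
    r = np.asarray(r, dtype=bool)
    s = np.asarray(in_sample, dtype=bool)
    one = np.ones_like(x)

    # Step 1: measurement model Y* = a + b*Y + c*x on sampled reporters.
    fit = s & r
    (a, b, c), *_ = np.linalg.lstsq(
        np.column_stack([one, y, x])[fit], y_star[fit], rcond=None)

    # Step 2: model Y* = c0 + d0*x on sampled nonreporters.
    m2 = s & ~r
    (c0, d0), *_ = np.linalg.lstsq(
        np.column_stack([one, x])[m2], y_star[m2], rcond=None)

    y_hat = y.copy()
    # Step 3a: sampled nonreporters, invert the measurement model.
    y_hat[m2] = (y_star[m2] - a - c * x[m2]) / b
    # Step 3b: non-sampled nonreporters, invert the predicted mean of Y*.
    m3 = ~s & ~r
    y_hat[m3] = (c0 + d0 * x[m3] - a - c * x[m3]) / b
    return y_hat

nan = np.nan
x         = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y         = np.array([10.0, 30.0, 20.0, 50.0, nan, nan, nan, nan])
y_star    = np.array([8.0, 19.0, 15.0, 31.0, 37.0, 48.0, 44.0, nan])
r         = np.array([1, 1, 1, 1, 0, 0, 0, 0])
in_sample = np.array([1, 1, 1, 1, 1, 1, 1, 0])

y_hat = cme_impute(x, y_star, y, r, in_sample)
```

Inverting the measurement model amplifies survey noise by a factor of 1/b, which is one reason a shrinkage predictor such as an EBLUP may be preferred in practice.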
Further work
domain estimation for CME/MAR
variance estimation
discriminating between CME and MAR based on
data
Conclusion
GREG is useful for estimation of survey data but efficiency gain is limited.
There is increasing interest in using tax data directly on its own to produce economic statistics.
Non-ignorable missingness becomes a key issue with tax data.
Survey data could be useful to help impute the tax data.