Using Business Taxation Data as Auxiliary Variables and as Substitution Variables in the Australian Bureau of Statistics
Frank Yu, Robert Clark and Gabriele B. Durant

Outline of talk
- Use of tax data in ABS
- Using tax data as auxiliary variables
  - example: subannual surveys
- Using tax data as variables of interest
  - missing taxation data
  - example: annual surveys
- Dealing with missing tax data
  - Missing at Random
  - Common Measurement Error model
- Conclusion

Use of tax data
- construct and maintain population frame
- as auxiliary variables for estimation
- substitute survey data to reduce provider burden
- as source for imputing missing/invalid survey data
- provide independent estimates for validation of outputs

Data supplied by Australian Taxation Office
- Australian Business Register information
  - businesses identified by name, address
  - industry, payees
- Business Activity Statement data: GST and PAYG
  - data available (90%) 6 months after reference quarter
  - turnover, wages and salaries, capital and non-capital expenses
- Income Tax data
  - available (70 to 80%) 18 months after reference quarter
  - detailed expenses and revenue and balance sheet

Use of tax data for frame creation
- ABS Maintained Population (ABS MP): complex units
- ATO Maintained Population (ATO MP), from Australian Business Register: simple units, ABN = statistical unit

Use of tax data for frame construction
- construction: units from ABR
  - industry, sector
  - number of payees
  - multistate indicators
- maintenance
  - births and cancellation
  - tax roles, e.g.
    employing vs non-employing units
  - long term non-remitters excluded
- stratification: single/multiple states, industry

Frame auxiliary variables ($x_i$'s)
- derived size benchmarks: from BAS, based on wages and salaries data
  - used as stratification variables
- BAS turnover, BAS wages
  - need imputation (derived from average of quarterly data)
  - lag reference quarter by 2 quarters

Survey data vs tax data (star ratings for Sample Survey, BAS data, BIT data)
- concept, accuracy, timeliness, detailed domain: ** * *** * * ** ** ** * *** * ***
- richness of data items: *** * **

Use of tax data as auxiliary variables
  Survey                       | Variables of interest | Auxiliary variables for estimation
  Retail Trade                 | Sales                 | BAS turnover
  Economic Activity Survey     | financial variables   | BIT variables
  Annual Integrated Collection | same as EAS           | BAS variables

Tax data as auxiliary variables
- units in the sample s: $x_i$ and $y_i$ observed
- units outside the sample (U\s): $x_i$ only

Generalised Regression Estimation
  $\hat{Y}_{GREG} = \hat{Y}_{HT} + (X - \hat{X}_{HT})\,\hat{B}$
  where
  $\hat{Y}_{HT} = \sum_{i \in s} Y_i / \pi_i$
  $\hat{X}_{HT} = \sum_{i \in s} X_i / \pi_i$
  $\hat{B} = \left( \sum_{i \in s} X_i' X_i / \pi_i \right)^{-1} \left( \sum_{i \in s} X_i' Y_i / \pi_i \right)$

Advantages and disadvantages
Advantages
- provide efficiency
- approximately unbiased
- does not require X's to be measuring the right concepts
- does not require X's to be current
Disadvantages
- does not model Y directly, e.g.
  zero units, influential points
- efficiency in estimating levels not equal to efficiency for estimating change
Issue: inactive/out of scope units
Solution: apply GREG to positive units only

Efficiency for estimating level does not necessarily translate to efficiency for estimating change
  $\mathrm{Var}(\hat{Y}_{2,GREG} - \hat{Y}_{1,GREG}) \le \mathrm{Var}(\hat{Y}_{2,HT} - \hat{Y}_{1,HT})$ iff $\dfrac{1 - \rho_{res}}{1 - \rho_Y} \le \dfrac{1}{1 - r_{XY}^2}$
  where $\rho_{res}$ is the lag 1 autocorrelation of residuals, $\rho_Y$ is the lag 1 autocorrelation of the Y's, and $r_{XY}$ is the correlation between Y and the X's

Data Substitution
- Approach: use tax data as the variable of interest
- Assumes tax data are better
  - respondents more serious about getting it right
  - more time to provide information
  - audited accounts (for BIT) for tax purposes
- Detailed breakdown
- Missing tax data
  - require matching to frame
  - missingness is nonignorable
    - inactive units
    - late units have more expenses

Examples: Economic Activity Survey (annual), 1990s to 05/06
- estimation of totals for broad items for microbusinesses
- augmenting sample for simple businesses
- estimation of detailed items
- tax data as substitution variables
- tax data to replace broad level income and expenses items
- detailed items imputed by pro-rating broad tax data based on splits observed in surveys

Examples: Annual Integrated Collection (06/07 onwards)
- AIC core survey
  - estimation of totals: estimates for survey variables for small and large businesses
  - tax data as auxiliary variables for generalised regression estimation
- AIC complementary estimates
  - estimation of totals for broad items for microbusinesses
  - estimation of detailed state/industry classes
  - tax data as substitution variables
- AIC complementary
  - estimation of detailed economic
  - tax data as substitution variables, disaggregated by model
  - estimation of pro-rating factors
  - tax data as substitution variables

Notation
- population U
- Y available: $r_i = 1$
- Y not available: $r_i = 0$

Use MAR model on frame only
- frame variables $X_i$ known for every unit in U
- tax data of interest Y
- $r_i = 1$: Y available; model: Y = f(x) for $r_i = 1$
- $r_i = 0$: Y not available
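
The frame-only MAR idea (fit Y = f(x) on the units that reported tax data, then apply the same fit to the units that did not) can be sketched as below. This is only an illustration: the talk does not specify the form of f, so a straight line is assumed here, and the function names and all numbers are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def impute_from_frame(frame_x, tax_y):
    """Fill missing tax values (None) using a line fitted on reporters.

    frame_x: frame variable x_i for every unit in U.
    tax_y:   tax value Y_i, or None where r_i = 0.
    """
    # Units with r_i = 1 are the only ones that inform the model.
    reporters = [(x, y) for x, y in zip(frame_x, tax_y) if y is not None]
    a, b = fit_line([x for x, _ in reporters], [y for _, y in reporters])
    # Keep reported values; replace missing ones with the fitted value.
    return [y if y is not None else a + b * x for x, y in zip(frame_x, tax_y)]


# Made-up data (assumption): None marks a tax non-reporter (r_i = 0).
frame_x = [10.0, 20.0, 30.0, 40.0, 50.0]
tax_y = [25.0, 45.0, None, 85.0, None]

completed = impute_from_frame(frame_x, tax_y)
estimated_total = sum(completed)
```

The sketch is only valid if reporters and non-reporters share the same relationship between Y and x, which is exactly the MAR assumption.
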
Use MAR model conditional on frame variables only
- frame variables $X_i$ known for every unit in U
- $r_i = 1$: Y available; model: Y = f(x) for $r_i = 1$
- $r_i = 0$: Y not available; under MAR, impute $\hat{Y} = f(x)$ for $r_i = 0$

But for non-ignorable missingness
- model: Y = f(x) for $r_i = 1$; the impute $\hat{Y} = f(x)$ for $r_i = 0$ is no longer supported by MAR
- Use a sample to inform about the nonreporters based on their survey response.

Notation: use Y to represent tax variables and Y* for survey variables (a surrogate of Y)
- frame variables $X_i$ known for every unit in U
- $r_i = 1$: Y available
- $r_i = 0$: Y not available
- units in the sample s: Y* available

Imputing tax data from survey data
- in the sample s, $r_i = 1$: Y and Y* available; model: Y = f(Y*, x)
- in the sample s, $r_i = 0$: Y not available, Y* available; impute $\hat{Y} = f(Y^*, x)$

Models for Y
- Missing at Random (MAR): Y independent of r given x and Y*: $r \perp Y \mid x, Y^*$
- Common measurement error (CME): given Y, the distribution of Y* is independent of r: $r \perp Y^* \mid x, Y$

Use MAR model: missing at random given X and Y*, $r \perp Y \mid x, Y^*$
- in the sample s, $r_i = 1$: Y and Y* available; model: Y = f(Y*, x) for $r_i = 1$
- in the sample s, $r_i = 0$: Y not available, Y* available; impute $\hat{Y}$ for $r_i = 0$

Imputation using MAR model
1. Using data on Y and Y* observed from the units in the sample where both survey and tax data are reported, model Y as a function of Y*.
2. Use this model to impute $Y_i$ for tax non-reporters in the sample (assuming Y* is known for them).
3. For units not in the sample, if their tax data are missing, impute using the distribution
  $f(Y_i \mid r_i = 0, x_i) = \int f(Y_i \mid r_i = 0, x_i, Y_i^*)\, f(Y_i^* \mid r_i = 0, x_i)\, dY_i^* = \int f(Y_i \mid r_i = 1, x_i, Y_i^*)\, f(Y_i^* \mid r_i = 0, x_i)\, dY_i^*$

Use CME model: $r \perp Y^* \mid x, Y$
- in the sample s, $r_i = 1$: Y and Y* available; model: Y* = f(Y, x) for $r_i = 1$
- in the sample s, $r_i = 0$: Y not available, Y* available; invert to get $\hat{Y} = g(Y^*)$
- for i in U\s with $r_i = 0$: impute $\hat{Y} = h(X)$

Imputation using CME model
  $r \perp Y^* \mid x, Y$ implies $f(Y_i^* \mid Y_i, x_i, r_i = 0) = f(Y_i^* \mid Y_i, x_i, r_i = 1)$.
  A typical model can be: $Y_i^* = \gamma Y_i + \varepsilon_i$, where $E(\varepsilon_i \mid Y_i, r_i) = 0$.
  This model motivates an unbiased impute: $\hat{Y}_i = Y_i^* / \gamma$.
  We also want to model $Y_i$ in terms of $X_i$ when Y* and Y are both not observed (i.e. for $i \notin s$ and $r_i = 0$):
  $E(Y_i \mid x_i, r_i = 0) = \beta_0 x_i$, giving an impute $\hat{\beta}_0 x_i$.

Modelling survey data (Y*) and tax data (Y)
- invert this to predict Y from Y*

Model: survey data Y* (EAS 05/06) as a function of frame variable X (tax_turn_0405) for tax nonrespondents (i.e. r = 0)

Empirical Best Linear Unbiased Predictor (EBLUP) of $Y_i$
- BLUP impute
- EBLUP impute

CME imputation process
- Use units in the sample where tax and survey variables are observed, and model the survey variable (Y*) as a function of tax and frame data (Y, X). Under CME this model applies to r = 0 too.
- Use units in the sample where survey data are observed (i in s) but tax data are not ($r_i = 0$) to model the survey variable (Y*) as a function of frame data (x).
- Combine to give an impute for Y for tax nonrespondents (r = 0): combine to get EBLUP.

Further work
- domain estimation for CME/MAR
- variance estimation
- discriminating between CME and MAR based on data

Conclusion
- GREG is useful for estimation of survey data but efficiency gain is limited.
- There is increasing interest in using tax data directly on its own to produce economic statistics.
- Non-ignorable missingness becomes a key issue with tax data.
- Survey data could be useful to help impute the tax data.
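
The three steps of the CME imputation process can be sketched as follows. This is a simplified plug-in illustration and not the EBLUP from the talk: it assumes both models are straight lines through the origin, it leaves the frame variable out of the first model, and the function names and all data values are hypothetical.

```python
def slope_through_origin(x, y):
    """Least squares slope for the no-intercept model y = b * x."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def cme_impute(units):
    """Impute missing tax values Y under a simple linear CME model.

    Each unit is a dict with keys:
      "x":      frame variable (always present)
      "y":      tax value, or None if r_i = 0
      "y_star": survey value, or None if the unit is not in the sample
    """
    both = [u for u in units if u["y"] is not None and u["y_star"] is not None]
    survey_only = [u for u in units if u["y"] is None and u["y_star"] is not None]

    # Step 1: model Y* as a function of Y where both are observed (r_i = 1).
    gamma = slope_through_origin([u["y"] for u in both],
                                 [u["y_star"] for u in both])
    # Step 2: model Y* as a function of x for sampled tax non-reporters.
    delta = slope_through_origin([u["x"] for u in survey_only],
                                 [u["y_star"] for u in survey_only])

    imputed = []
    for u in units:
        if u["y"] is not None:
            imputed.append(u["y"])                   # tax value reported
        elif u["y_star"] is not None:
            imputed.append(u["y_star"] / gamma)      # invert the step 1 model
        else:
            imputed.append(delta * u["x"] / gamma)   # step 3: combine both fits
    return imputed


# Made-up data (assumption): two units with both values, two sampled tax
# non-reporters, one unsampled non-reporter and one unsampled reporter.
units = [
    {"x": 90.0, "y": 100.0, "y_star": 110.0},
    {"x": 150.0, "y": 200.0, "y_star": 220.0},
    {"x": 50.0, "y": None, "y_star": 66.0},
    {"x": 80.0, "y": None, "y_star": 105.6},
    {"x": 100.0, "y": None, "y_star": None},
    {"x": 70.0, "y": 75.0, "y_star": None},
]
imputed_y = cme_impute(units)
```

Step 1 is fitted on tax reporters only; applying its inverse to non-reporters relies on the CME assumption that the relationship between Y* and Y does not depend on r.
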