3.1 Introduction
The BBDESIGN module implements the weighted finite population Bayesian Bootstrap approach for generating synthetic populations from complex survey data. The primary goal is to incorporate weighting, clustering and stratification in a nonparametric approach that generates the non-sampled portion of the population from the posterior predictive distribution, conditional on the observed data and the design information. BBDESIGN assumes a two-stage stratified cluster sampling design with unequal probability of selection at either or both stages. The approach first generates a Bayesian Bootstrap sample of non-sampled clusters and then uses a weighted Polya Urn model to draw the non-sampled elements within the sampled and non-sampled clusters in each stratum. Details of the procedure are described in Dong, Elliott and Raghunathan (2014a, 2014b) and Zhou, Elliott and Raghunathan (2015, 2016a, 2016b).
Once several synthetic populations are generated, the population quantity of interest can be computed from each synthetic population, and these estimates can be combined using simple rules to form a single inference. If the observed data contain missing values, the synthetic populations are also generated with missing values, which can then be multiply imputed using the IMPUTE module. The combining rules, which differ from the standard multiple imputation combining rules for missing data, are discussed in the above references and will be illustrated through examples in later chapters.
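The unit-level weighted Polya Urn step mentioned above can be illustrated with a short Python sketch. This is not BBDESIGN's implementation. It assumes one common form of the draw: with weights scaled to sum to the population size N and n sampled units, unit i is selected at each draw with probability proportional to w_i - 1 + l_i (N - n)/n, where l_i is the number of times unit i has already been selected. The function name, its arguments and the example data are hypothetical, and clusters and strata are ignored.

```python
import random

def weighted_polya_urn(values, weights, pop_size, rng):
    """Draw pop_size - n non-sampled units from n sampled units.

    Assumption: weights are scaled to sum to pop_size and each weight is >= 1.
    """
    n = len(values)
    n_draws = pop_size - n
    step = n_draws / n      # (N - n) / n, added for each earlier selection
    counts = [0] * n        # l_i: times unit i has been selected so far
    draws = []
    for _ in range(n_draws):
        # Unnormalized selection weights: w_i - 1 + l_i * (N - n) / n
        urn = [w - 1 + c * step for w, c in zip(weights, counts)]
        i = rng.choices(range(n), weights=urn, k=1)[0]
        counts[i] += 1
        draws.append(values[i])
    return draws

# Hypothetical sample of 4 units whose weights sum to a population of 10.
rng = random.Random(2024)   # fixed seed so the draws can be reproduced
sample = ["a", "b", "c", "d"]
weights = [1.0, 2.0, 3.0, 4.0]
non_sampled = weighted_polya_urn(sample, weights, pop_size=10, rng=rng)
synthetic_population = sample + non_sampled
```

In this example unit "a" has weight 1, so it represents only itself and is never drawn into the non-sampled portion, while units with larger weights are replicated more often.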
3.2 BBDESIGN Statements
DATAIN filename;
This required statement identifies the location and name of the input data set. For example, in the SAS environment, the filename can be expressed as 'libname.sasdata'. In other environments, read the data set and include the name of the data set in the filename. For example,
DATAIN Mylib1.Mydata;
indicates that the SAS data file Mydata is located in the library Mylib1. Mylib1 is the name assigned to a directory with the SAS libname statement.
DATAOUT outfile ;
This statement identifies the location and name of the output data set containing the synthesized data. If more than one synthetic data set is generated, the output data set will be a concatenation of the multiple synthesized data sets. The system variable IMPL, automatically added to the output file, can be used to distinguish each implicate.
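As an illustration of working with the concatenated output, the following Python sketch splits records by the implicate indicator and computes one quantity per implicate. It assumes the output has already been read into a list of dictionaries. The column names "IMPL" and "y", the function name and the data values are hypothetical; check the exact name of the system variable in your own output file.

```python
from collections import defaultdict

def split_by_implicate(rows, implicate_key="IMPL"):
    """Group concatenated output records by their implicate indicator."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[implicate_key]].append(row)
    return dict(groups)

# Hypothetical concatenated output containing two implicates.
rows = [
    {"IMPL": 1, "y": 2.0},
    {"IMPL": 1, "y": 4.0},
    {"IMPL": 2, "y": 1.0},
    {"IMPL": 2, "y": 5.0},
    {"IMPL": 2, "y": 6.0},
]
by_implicate = split_by_implicate(rows)

# Compute the quantity of interest (here a mean) separately for each implicate.
means = {
    k: sum(r["y"] for r in group) / len(group)
    for k, group in by_implicate.items()
}
```

The per-implicate estimates would then be combined using the rules described in the references; those rules are not reproduced here.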
Additional Statements Include:
IMPLICATE number;
where number is the number of implicate data sets.
STRATUM var;
where var is the name of the variable in the data set defining the stratum.
CLUSTER var;
where var is the name of the variable in the data set defining the clusters within each stratum.
WEIGHT var;
where var is the name of the variable in the data set defining the unit level weight.
POPSIZE number;
where number is the number of observations included in each generated synthetic population. The default is 10 times the original sample size.
CSAMPLES number;
where number is the number of Bayesian bootstrap samples to be drawn from the sampled clusters in each stratum. The default is 5.
WSAMPLES number;
where number is the number of times the weighted Polya Urn model is used to generate replicates of non-sampled units to be added to each Bayesian Bootstrap sample of clusters. The default is 5.
Together, CSAMPLES and WSAMPLES determine the number of synthetic populations generated, which is the product of the two values. The default is 25 (5 × 5). As with any bootstrap-based analysis, 250 to 500 synthetic populations may be needed to obtain reliable point and interval estimates.
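The arithmetic can be checked with a small Python sketch. The helper names are hypothetical and the code only multiplies the two settings. It does not call BBDESIGN.

```python
import math

def n_synthetic_populations(csamples=5, wsamples=5):
    """Number of synthetic populations: CSAMPLES times WSAMPLES."""
    return csamples * wsamples

def wsamples_needed(target, csamples):
    """Smallest WSAMPLES value that reaches at least `target` populations."""
    return math.ceil(target / csamples)

default_total = n_synthetic_populations()        # 5 * 5 with the defaults
w_for_250 = wsamples_needed(250, csamples=20)    # WSAMPLES needed with CSAMPLES 20
total_for_250 = n_synthetic_populations(20, w_for_250)
```

With the defaults this gives 25 populations. With CSAMPLES set to 20, a WSAMPLES value of 13 gives 260 populations, which falls in the suggested range of 250 to 500.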
ID var;
where var is the name of the variable containing a unique subject identifier. If this keyword is absent, an id variable called OBS will be created in the output data set.
VAR varlist;
where varlist is a list of variables to be transferred to the output data set (of synthetic populations). If this keyword is not specified, all variables will be transferred to the output data set.
PRINT options;
can be used to print information used in the process for creating synthetic populations.
SEED number;
where number is used to initialize the random number sequence for obtaining the draws. Specifying a seed makes it possible to reproduce the same random number sequence, and therefore the same synthetic populations, at a later time.