A comparison of spatial smoothing methods for small area estimation with sampling weights
Introduction
Small area estimation (SAE) is used in many fields, including education, epidemiology, and global, environmental and public health. The surveys carried out to inform SAE are often complex, with non-random sampling used out of necessity (i.e., for logistical reasons) or to ensure that certain populations of interest are well represented. In addition, post-stratification may be used to reweight the observations in order to recover known population totals. This approach can account for non-response within the strata used in the post-stratification.
There are two approaches to modeling complex survey data that we shall consider in this paper. In the first, design-based, approach, weighted estimators are considered, with inference based on the (randomization) distribution of the samples that could have been collected, i.e., the distribution of the individuals that could appear in the sample. In contrast, a model-based approach assumes a hypothetical infinite population from which the responses are drawn. While appealing from a conceptual point of view (since standard statistical modeling machinery can be leaned upon), the model-based approach is difficult to implement because, if the sampling mechanism is informative, it must be modeled at least to some extent. For example, if non-random sampling is based on particular inclusion variables (e.g., race or geographical area), then these variables must be included in the model if they are associated with the outcome of interest. Similarly, variables that affect the probabilities of non-response must also be included in the model, again if they are related to the outcome. The alternative is to assume that the variables on which sampling is based, and on which non-response depends, are unrelated to the outcome of interest, which is a risky assumption. Another impediment to the model-based approach is that the key variables required for inclusion may be unavailable in public-use databases. Even if they are available, the sampling scheme may be highly complex, requiring a model with a large number of parameters that is therefore difficult to fit. Gelman (2007b) describes the issues, and the accompanying discussion (Bell and Cohen, 2007; Breidt and Opsomer, 2007; Little, 2007; Lohr, 2007; Pfefferman, 2007; Gelman, 2007a) gives a range of perspectives on the use of weighted estimators, regression modeling, or a combination of the two.
In this paper we will consider SAE in the situation in which either the variables upon which sampling was based are unavailable or the scheme is so complex that a simpler approach is desired. SAE has seen a great deal of research interest, with Rao (2003) being a classic text. In the related field of disease mapping, the use of spatial modeling is commonplace (Wakefield et al., 2000), but in this context the data usually consist of a complete enumeration of disease cases in an area, so that no weighting scheme needs to be considered. It is the existence of the weights that causes a major difficulty when one wishes to use spatial smoothing in SAE, and consequently there are relatively few instances of approaches that use spatial smoothing within a model that acknowledges the sampling scheme. In Chen et al. (submitted for publication) a new method of incorporating the weights within a spatial hierarchical model was introduced, and various random effects models were compared via simulation. In this paper we compare the method with a number of other suggested methods for weighting.
As a motivating example, we examine data from the Behavioral Risk Factor Surveillance System (BRFSS). This survey is carried out at the state level in the United States and is the largest telephone-based survey in the world. In the BRFSS survey, interviewees (who are 18 years or older) are asked a series of questions on their health behaviors and provide general demographic information, such as age, race, gender and the zip code in which they live. In this paper we focus on the survey conducted in Washington State in 2006, and on the variable “Adults who are current smokers”, as calculated by the Centers for Disease Control (CDC). For this variable, 19,502 interviewees responded “No”, 3,733 responded “Yes” and 132 were classified as “don’t know/refuse/missing”; we remove the latter from the analysis. The response variable is therefore a binary indicator, and our objective is to estimate the number of individuals aged 18 or older who are current smokers in each of 498 zip codes in Washington State. We also utilize population estimates from 2006. Table 1 summarizes the population and survey data. The number of sampled individuals per zip code varies widely, with a median of 30, a minimum of 1 and a maximum of 384; this spread is apparent in Fig. 1. Fig. 2 maps, by zip code, the observed number of smokers in the sample (top) and the sample sizes (bottom); the spatial variability in each map is evident.
We now describe in greater detail the complex survey scheme that was used by BRFSS in 2006. In this year, the BRFSS survey used land-lines only and employed a disproportionate stratified random sampling scheme, with stratification by county and “phone likelihood”. Under this scheme, in each county, blocks of 100 telephone numbers were classified, based on previous surveys, into strata that are either “likely” or “unlikely” to yield residential numbers. Telephone numbers in the “likely” strata are then sampled at a higher rate than those in the “unlikely” strata. Once a person is reached at a phone number, the number of eligible adults (aged 18 or over) in the household is determined, and one of these is randomly selected for interview. The sample weight, Sample Wt, is then calculated as the product of four terms,
$$\text{Sample Wt} = \text{Strat Wt} \times \frac{1}{\text{No Telephones}} \times \text{No Adults} \times \text{Post Strat Wt}, \qquad (1)$$
where Strat Wt is the inverse probability of a “likely” or “unlikely” stratum being selected in a particular county, No Telephones represents the number of residential telephones in the respondent’s household, No Adults is the number of adults in the household, and Post Strat Wt is the post-stratification correction factor. The latter is given by the number of people in strata defined by gender and age, using the 7 age groups 18–24, 25–34, 35–44, 45–54, 55–64, 65–74, 75+. The raw data on which we will base estimation are the respondent’s outcome, with an accompanying weight, and the population information. Crucially, we will also examine the possibility of leveraging geographic information to smooth rates across zip codes.
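The weight construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and field names are hypothetical, the numerical values are invented, and the telephone count is assumed to enter as a reciprocal (a household with more land-lines is more likely to be reached, so each of its lines carries less weight), while the number of adults enters directly because one adult is selected at random within the household.

```python
from dataclasses import dataclass


@dataclass
class Respondent:
    """Hypothetical container for the four ingredients of the sample weight."""
    strat_wt: float       # inverse selection probability of the phone-likelihood stratum
    n_telephones: int     # residential telephones in the household
    n_adults: int         # eligible adults (18+) in the household
    post_strat_wt: float  # post-stratification correction (gender by age group)


def sample_weight(r: Respondent) -> float:
    """Product of the four terms; the 1/n_telephones form is an assumption."""
    if r.n_telephones < 1 or r.n_adults < 1:
        raise ValueError("household needs at least one telephone and one adult")
    return r.strat_wt * (1.0 / r.n_telephones) * r.n_adults * r.post_strat_wt


# Invented example: stratum weight 50, two land-lines, three adults, correction 1.2
example = Respondent(strat_wt=50.0, n_telephones=2, n_adults=3, post_strat_wt=1.2)
print(sample_weight(example))
```

Keeping the four factors separate, rather than storing only the final weight, makes it easier to check how much of the variation in the weights comes from the design and how much from post-stratification.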
The structure of the paper is as follows. In Section 2 we describe a number of approaches to formulating hierarchical models that incorporate weighting and in Section 3 a number of these methods are compared via a simulation study. In Section 4 we return to the BRFSS data and the paper concludes with a discussion in Section 5.
Notation and the Horvitz–Thompson estimator
We first establish our notation. We will focus on binary outcomes, and let $Y_{ik}$ represent the binary indicator for the event of interest on the $k$th individual, in the $i$th area, $k = 1, \dots, N_i$. Common small area characteristics of interest include the true total count, $T_i = \sum_{k=1}^{N_i} Y_{ik}$, or the true proportion, $P_i = T_i / N_i$, in area $i$. In common with the majority of the survey sampling literature we will denote population values with upper case letters and sampled values with lower case letters.
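As a rough illustration of the Horvitz–Thompson idea for a single area, the sketch below estimates the area total as the weighted sum of the sampled binary outcomes and the area proportion as that total divided by the population size. It assumes that each weight is the inverse of the respondent's inclusion probability; the function names and the toy data are hypothetical. When the population size is not supplied, the sum of the weights is used in its place, which gives a ratio (Hájek-type) estimator rather than the strict Horvitz–Thompson form.

```python
from typing import Optional, Sequence


def ht_total(y: Sequence[int], w: Sequence[float]) -> float:
    """Weighted sum of sampled binary outcomes: estimate of the area total."""
    if len(y) != len(w):
        raise ValueError("y and w must have the same length")
    return sum(wk * yk for yk, wk in zip(y, w))


def ht_proportion(y: Sequence[int], w: Sequence[float],
                  pop_size: Optional[float] = None) -> float:
    """Estimated total divided by the population size.

    If pop_size is None, the sum of the weights stands in for it
    (a ratio estimator; this fallback is an assumption of the sketch).
    """
    denom = pop_size if pop_size is not None else sum(w)
    if denom <= 0:
        raise ValueError("denominator must be positive")
    return ht_total(y, w) / denom


# Toy area with four respondents (1 = current smoker) and invented weights
y = [1, 0, 1, 0]
w = [10.0, 20.0, 30.0, 40.0]
print(ht_total(y, w))               # estimated number of smokers in the area
print(ht_proportion(y, w))          # ratio form, weights summed as denominator
print(ht_proportion(y, w, 120.0))   # known population size of 120
```

In areas with very few respondents both estimates can be highly variable, which is the motivation for smoothing them within a hierarchical model.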
Simulation study
We now present a simulation study to compare five of the estimators described in the previous section. The estimators we compare are the naive binomial (6), the logit normal (7), pseudo-likelihood (8), the arcsin square root transform (10) and the numerator and denominator effective sample size adjusted binomial (11). In each case we consider two random effects models: independent random effects only, and the convolution model with both independent and spatial ICAR random effects. We also
BRFSS example
We apply the sample weighted Bayesian hierarchical models we described in Section 2 to the Washington State 2006 BRFSS data introduced in Section 1. Sampling weights are taken to be the final weights used in the BRFSS survey, as in (1). These weights range between 1.2 and 4675 across zip codes. The effective sample sizes and number of observations used in the effective sample size approach are calculated using the design-based Horvitz–Thompson variance estimator. Fig. 3 gives the effective
Discussion
In this paper we have considered random effects models that account for the sampling weights that are common in SAE. The simulations of Section 3 clearly illustrate the benefits of hierarchical modeling, namely large reductions in the variance of parameter estimation when compared with non-hierarchical approaches. These simulations also show that non-response and selection bias can be reduced via the incorporation of the weights. Further simulations are required to characterize situations in
Acknowledgments
The first author was supported by a seed grant from the Center for Statistics and the Social Sciences. The second author was supported by grant R01 AI029168 from the National Institutes of Health.
References (32)
The transformation of Poisson, binomial and negative-binomial data. Biometrika (1948).
General multi-level modeling with sampling weights. Comm. Statist. Theory Methods (2006).
et al. Comment on “Struggles with survey weighting and regression modeling”. Statist. Sci. (2007).
et al. On conditional and intrinsic auto-regressions. Biometrika (1995).
et al. Bayesian image restoration with two applications in spatial statistics. Ann. Inst. Statist. Math. (1991).
On the variances of asymptotically normal estimators from complex surveys. Internat. Statist. Rev. (1983).
et al. Comment on “Struggles with survey weighting and regression modeling”. Statist. Sci. (2007).
et al. A comparison of Bayesian and likelihood-based methods for fitting multilevel models. Bayesian Anal. (2006).
et al. A comparison of Bayesian and likelihood-based methods for fitting multilevel models (rejoinder). Bayesian Anal. (2006).
Chen, C., Wakefield, J., Lumley, T., 2013. The use of sample weights in Bayesian hierarchical models for small area...