Elsevier

Spatial Statistics

Volume 8, May 2014, Pages 69-85
A comparison of spatial smoothing methods for small area estimation with sampling weights

https://doi.org/10.1016/j.spasta.2013.12.001

Abstract

Small area estimation (SAE) is an important endeavor in many fields and is used for resource allocation by both public health and government organizations. Often, complex surveys are carried out within areas, in which case it is common for the data to consist only of the response of interest and an associated sampling weight, reflecting the design. While it is appealing to use spatial smoothing models, and many approaches have been suggested for this endeavor, it is rare for spatial models to incorporate the weighting scheme, leaving the analysis potentially subject to bias. To examine the properties of various approaches to estimation we carry out a simulation study, looking at bias due to both non-response and non-random sampling. We also carry out SAE of smoking prevalence in Washington State, at the zip code level, using data from the 2006 Behavioral Risk Factor Surveillance System. The computation times for the methods we compare are short, and all approaches are implemented in R using currently available packages.

Introduction

Small area estimation (SAE) is used in many fields including education and epidemiology, and global, environmental and public health. Often the surveys carried out to inform SAE are complex in nature, with non-random sampling being carried out for reasons of necessity (i.e., logistical reasons) or to ensure that certain populations of interest are well represented. In addition, post-stratification may be used to reweight the observations in order to recover known population totals. This approach can account for non-response within the strata used in the post-stratification.

There are two approaches to modeling complex survey data that we shall consider in this paper. In the first, the design-based approach, weighted estimators are considered, with inference carried out based on the (randomization) distribution of the samples that could have been collected, i.e., the distribution of the individuals that could appear in the sample. In contrast, a model-based approach assumes a hypothetical infinite population from which the responses are drawn. While appealing from a conceptual point of view (since standard statistical modeling machinery can be leaned upon), the modeling approach is difficult to implement since one must model the sampling mechanism, if informative, at least to some extent. For example, if non-random sampling is based on particular inclusion variables (e.g., race or geographical area) then these variables must be included in the model if they are associated with the outcome of interest. Similarly, variables that affect the probabilities of non-response must also be included in the model, again if they are related to the outcome. The alternative is to assume that the variables upon which sampling is based, and upon which non-response depends, are unrelated to the outcome of interest, which is a dangerous assumption. Another impediment to the model-based approach is that the key variables that are required for inclusion may be unavailable in public-use databases. Even if they are available, the sampling scheme may be highly complex, requiring a model that has a large number of parameters and is therefore difficult to fit. Gelman (2007b) describes the issues, and the accompanying discussion (Bell and Cohen, 2007, Breidt and Opsomer, 2007, Little, 2007, Lohr, 2007, Pfeffermann, 2007, Gelman, 2007a) gives a range of perspectives on the use of weighted estimators, regression modeling, or a combination of the two.

In this paper we will consider SAE in the situation in which either the variables upon which sampling was based are unavailable or the scheme is so complex that a simpler approach is desired. SAE has seen a great deal of research interest, with Rao (2003) being a classic text. In the related field of disease mapping, the use of spatial modeling is commonplace (Wakefield et al., 2000), but in this context the data usually consist of a complete enumeration of disease cases in an area, so that no weighting scheme needs to be considered. It is the existence of the weights that causes a major difficulty when one wishes to use spatial smoothing in SAE, and consequently there are relatively few instances of approaches that use spatial smoothing within a model that acknowledges the sampling scheme. In Chen et al. (submitted for publication) a new method of incorporating the weights within a spatial hierarchical model was introduced, and various random effects models were compared via simulation. In this paper we compare the method with a number of other suggested methods for weighting.

As a motivating example, we examine data from the Behavioral Risk Factor Surveillance System (BRFSS). This survey is carried out at the state level in the United States and is the largest telephone-based survey in the world. In the BRFSS survey, interviewees (who are 18 years or older) are asked a series of questions on their health behaviors and provide general demographic information, such as age, race, gender and the zip code in which they live. In this paper we focus on the survey conducted in Washington State in 2006, and on the Centers for Disease Control (CDC) calculated variable “Adults who are current smokers”. With respect to this question, 19,502 responded “No”, 3,733 responded “Yes” and 132 were classified as “don’t know/refuse/missing”. In the analysis, we remove these latter values. The response variable is therefore a binary indicator and our objective is to estimate the number of individuals who are 18 or older and who are current smokers, in each of 498 zip codes in Washington State. We also utilize population estimates from 2006. Table 1 summarizes the population and survey data. So far as the survey is concerned, the number of samples per zip code shows large variability, with a median of 30 and minimum and maximum values of 1 and 384. The spread is apparent in Fig. 1. Fig. 2 maps, by zip code, the observed number of smokers in the sample (top) and the sample sizes (bottom), and the spatial variability in each map is evident.

We now describe in greater detail the complex survey scheme that was used by BRFSS in 2006. In this year, the BRFSS survey used land-lines only, and utilized a disproportionate stratified random sample scheme with stratification by county and “phone likelihood”. Under this scheme in each county, based on previous surveys, blocks of 100 telephone numbers were classified into strata that are either “likely” or “unlikely” to yield residential numbers. Telephone numbers in the “likely” strata are then sampled at a higher rate than their “unlikely” counterparts. Once a person is reached at a phone number the number of eligible adults (aged 18 or over) is determined, and one of these is randomly selected for interview. The sample weight, Sample Wt, is then calculated as the product of four terms

$$\text{Sample Wt} = \text{Strat Wt} \times \frac{1}{\text{No Telephones}} \times \text{No Adults} \times \text{Post Strat Wt} \tag{1}$$

where Strat Wt is the inverse probability of a “likely” or “unlikely” stratum being selected in a particular county, No Telephones represents the number of residential telephones in the respondent’s household, No Adults is the number of adults in the household, and Post Strat Wt is the post-stratification correction factor. The latter is given by the number of people in strata defined by gender and age, using the 7 age groups 18–24, 25–34, 35–44, 45–54, 55–64, 65–74, 75+. The raw data on which we will base estimation are the respondent’s outcome, with an accompanying weight, and the population information. Crucially, we will also examine the possibility of leveraging geographic information to smooth rates across zip codes.
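The four-term product that defines the sample weight can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `sample_weight`, its argument names and the example values are hypothetical and are not taken from the BRFSS documentation.

```python
def sample_weight(strat_wt, n_telephones, n_adults, post_strat_wt):
    """Product of the four terms that make up the sample weight.

    strat_wt      : inverse probability of the stratum being selected
    n_telephones  : number of residential telephones in the household
    n_adults      : number of adults in the household
    post_strat_wt : post-stratification correction factor
    """
    if n_telephones < 1 or n_adults < 1:
        raise ValueError("a sampled household needs at least one phone and one adult")
    return strat_wt * (1.0 / n_telephones) * n_adults * post_strat_wt


# Hypothetical respondent: two phone lines, three adults in the household.
w = sample_weight(strat_wt=10.0, n_telephones=2, n_adults=3, post_strat_wt=1.5)
print(w)
```

A household with more telephones is more likely to be reached, so its weight is divided by the number of lines; a respondent chosen from among several adults represents all of them, so the weight is multiplied by the number of adults.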

The structure of the paper is as follows. In Section 2 we describe a number of approaches to formulating hierarchical models that incorporate weighting, and in Section 3 a number of these methods are compared via a simulation study. In Section 4 we return to the BRFSS data and the paper concludes with a discussion in Section 5.

Notation and the Horvitz–Thompson estimator

We first establish our notation. We will focus on binary outcomes, and let $Y_{ik}$ represent the binary indicator for the event of interest on the $k$th individual, $k=1,\dots,N_i$, in the $i$th area, $i=1,\dots,I$. Common small area characteristics of interest include the true total count, $T_i=\sum_{k=1}^{N_i} Y_{ik}$, or the true proportion, $P_i=T_i/N_i$, in area $i$, $i=1,\dots,I$. In common with the majority of the survey sampling literature we will denote population values with upper case letters and sampled values with lower case letters.
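To make the notation concrete, the following Python sketch computes weighted estimates of $T_i$ and $P_i$ for a single area from sampled responses and their weights. It assumes the textbook form of the Horvitz–Thompson estimator of a total (a weighted sum with weights equal to inverse inclusion probabilities); the exact expressions used in the paper are not shown in this excerpt. The function names and the toy data are hypothetical, and the ratio (Hájek) form is included only for comparison.

```python
def ht_total(y, w):
    """Weighted sum of sampled binary responses y with design weights w."""
    if len(y) != len(w):
        raise ValueError("y and w must have the same length")
    return sum(wk * yk for yk, wk in zip(y, w))


def ht_proportion(y, w, n_pop):
    """Estimated proportion using the known area population size n_pop."""
    return ht_total(y, w) / n_pop


def hajek_proportion(y, w):
    """Ratio form: the population size is replaced by the sum of the weights."""
    return ht_total(y, w) / sum(w)


# Toy sample of four respondents in one area (values are illustrative only).
y = [1, 0, 1, 0]
w = [10.0, 20.0, 30.0, 40.0]

print(ht_total(y, w))               # estimated total count
print(ht_proportion(y, w, 120))     # proportion with N_i = 120 assumed known
print(hajek_proportion(y, w))       # proportion with N_i estimated by sum of weights
```

The two proportion estimates differ whenever the weights do not sum to the known population size, which is one reason post-stratification to known totals is attractive.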

Simulation study

We now present a simulation study to compare five of the estimators described in the previous section. The estimators we compare are the naive binomial (6), the logit normal (7), pseudo-likelihood (8), the arcsin square root transform (10) and the numerator and denominator effective sample size adjusted binomial (11). In each case we consider two random effects models: independent random effects only, and the convolution model with both independent and spatial ICAR random effects. We also

BRFSS example

We apply the sample weighted Bayesian hierarchical models we described in Section 2 to the Washington State 2006 BRFSS data introduced in Section 1. Sampling weights are taken to be the final weights used in the BRFSS survey, as in (1). These weights range between 1.2 and 4675 across zip codes. The effective sample sizes and number of observations used in the effective sample size approach are calculated using the design-based Horvitz–Thompson variance estimator. Fig. 3 gives the effective
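As a rough illustration of the effective sample size calculation mentioned above, the sketch below uses one common definition, assumed here because the formula is not shown in this excerpt: the effective sample size is the binomial sample size whose variance, $\hat p(1-\hat p)/n^\ast$, matches the design-based variance estimate, and the effective number of cases is $n^\ast \hat p$. The function names and input values are hypothetical.

```python
def effective_sample_size(p_hat, var_hat):
    """Binomial sample size whose variance p(1-p)/n equals var_hat (assumed definition)."""
    if not 0.0 < p_hat < 1.0:
        raise ValueError("p_hat must lie strictly between 0 and 1")
    if var_hat <= 0.0:
        raise ValueError("var_hat must be positive")
    return p_hat * (1.0 - p_hat) / var_hat


def effective_counts(p_hat, var_hat):
    """Return (effective number of cases, effective sample size)."""
    n_eff = effective_sample_size(p_hat, var_hat)
    return n_eff * p_hat, n_eff


# Hypothetical area: weighted prevalence 0.2 with design-based variance 0.0016.
y_eff, n_eff = effective_counts(p_hat=0.2, var_hat=0.0016)
print(round(y_eff, 6), round(n_eff, 6))
```

Under this definition an area whose design-based variance exceeds the simple binomial variance receives an effective sample size smaller than its raw sample size, so it is smoothed more heavily in a hierarchical model.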

Discussion

In this paper we have considered random effects models that account for the sampling weights that are common in SAE. The simulations of Section 3 clearly illustrate the benefits of hierarchical modeling, namely large reductions in the variance of parameter estimation when compared with non-hierarchical approaches. These simulations also show that non-response and selection bias can be reduced via the incorporation of the weights. Further simulations are required to characterize situations in

Acknowledgments

The first author was supported by a seed grant from the Center for Statistics and the Social Sciences. The second author was supported by grant R01 AI029168 from the National Institutes of Health.

References

  • F. Anscombe. The transformation of Poisson, binomial and negative-binomial data. Biometrika (1948)
  • T. Asparouhov. General multi-level modeling with sampling weights. Comm. Statist. Theory Methods (2006)
  • R. Bell et al. Comment on “Struggles with survey weighting and regression modeling”. Statist. Sci. (2007)
  • J. Besag et al. On conditional and intrinsic auto-regressions. Biometrika (1995)
  • J. Besag et al. Bayesian image restoration with two applications in spatial statistics. Ann. Inst. Statist. Math. (1991)
  • D. Binder. On the variances of asymptotically normal estimators from complex surveys. Internat. Statist. Rev. (1983)
  • F. Breidt et al. Comment on “Struggles with survey weighting and regression modeling”. Statist. Sci. (2007)
  • W. Browne et al. A comparison of Bayesian and likelihood-based methods for fitting multilevel models. Bayesian Anal. (2006)
  • W. Browne et al. A comparison of Bayesian and likelihood-based methods for fitting multilevel models (rejoinder). Bayesian Anal. (2006)
  • Chen, C., Wakefield, J., Lumley, T., 2013. The use of sample weights in Bayesian hierarchical models for small area...
  • P. Congdon et al. Estimating small area diabetes prevalence in the US using the behavioral risk factor surveillance system. J. Data Sci. (2010)
  • Y. Fong et al. Bayesian inference for generalized linear mixed models. Biostatistics (2010)
  • A. Gelman. Prior distributions for variance parameters in hierarchical models. Bayesian Anal. (2006)
  • A. Gelman. Rejoinder to “Struggles with survey weighting and regression modeling”. Statist. Sci. (2007)
  • A. Gelman. Struggles with survey weighting and regression modeling. Statist. Sci. (2007)
  • D. Horvitz et al. A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc. (1952)