Elsevier

Spatial Statistics

Volume 18, Part B, November 2016, Pages 455-473
Model-based inference for small area estimation with sampling weights

https://doi.org/10.1016/j.spasta.2016.09.004

Abstract

Obtaining reliable estimates of health outcomes for areas or domains where few or no samples are available is the goal of small area estimation (SAE). Often, we rely on health surveys to obtain information about health outcomes. Such surveys are often characterised by a complex design, with stratification and unequal sampling weights as common features. Hierarchical Bayesian models are well recognised in SAE as a spatial smoothing method, but often ignore the sampling weights that reflect the complex sampling design. In this paper, we focus on data obtained from a health survey where the sampling weights of the sampled individuals are the only information available about the design. We develop a predictive model-based approach to estimate the prevalence of a binary outcome for both the sampled and non-sampled individuals, using hierarchical Bayesian models that take into account the sampling weights. A simulation study is carried out to compare the performance of our proposed method with other established methods. The results indicate that our proposed method achieves large reductions in mean squared error when compared with standard approaches. When there is a relationship between the responses and the sampling weights, it performs as well as or better than more elaborate methods. The proposed method is applied to estimate asthma prevalence across districts.

Introduction

In public health we are often interested in whether there are disparities in illness, behavioural risk factors or health conditions across areas. An increasing amount of information on individuals is collected in this respect. Bayesian methods in disease mapping based on census or population registry data are well developed and used in a fairly standard manner (see e.g., Elliott et al., 2001; Waller and Gotway, 2004; Lawson, 2013 for a review of the methods). Such population registry or census data contain information pertaining to each member of the population of an area. Historically, the focus was on the construction of cancer atlases and on mapping rare diseases based on registry data (see e.g., Kemp et al., 1985; Mason, 1995).

Since it is nearly always impossible to measure the health outcome of interest in every individual in the population, a survey is used to record information from a random sample of individuals from the population (Cochran, 1977). Such surveys are often characterized by a complex design, with stratification, clustering and unequal sampling weights as common features. Policy makers are often interested in a specific characteristic, such as the total number of diseased cases or the prevalence, per area. In small area estimation (SAE) one investigates how to obtain these area-specific characteristics from survey data that cover more than just the area of interest, using spatial smoothing methods.

In SAE, one needs to choose whether to base inference on design-based, model-based or design-based model-assisted approaches. In design-based inference the values of the health outcomes are assumed fixed, and inference is based on the randomization distribution of the sample inclusion indicators. Often a model is used in the construction of a design-based estimator (the design-based model-assisted approach). A popular design-based estimator is the Horvitz–Thompson (HT) estimator (1952), which, like its extensions, weights each sampled individual by the associated sampling weight. These estimators play a dominant role in sample surveys. However, they often fail in SAE because the sample size per area can be very small or even zero, which inflates the mean squared error tremendously. This makes design-based estimators unreliable or infeasible to use (Rao, 2011). Additionally, because of the spatial nature of the problem, understanding the geographical distribution of the health outcome is important. Model-based approaches that perform spatial smoothing, both those based on empirical and those based on hierarchical Bayesian methodology, have been shown to be more relevant in the handling of spatially correlated health survey data. In model-based approaches one conditions on the selected sample, and inference is based on the underlying model of the health outcome. Examples include Fay and Herriot (1979), who proposed a linear empirical Bayes model to estimate income for small areas, and Datta and Ghosh (1991), who considered a hierarchical Bayesian formulation instead. A number of extensions have been made; see Rao (2003) and Jiang and Lahiri (2006) for an overview. For binary data, MacGibbon and Tomberlin (1989) developed an empirical Bayes model using a logistic regression model with fixed and random effects. Stroud (1994), Ghosh et al. (1998) and Farrell (2000) described hierarchical Bayesian approaches to estimate small area proportions.
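To make the design-based estimators discussed above concrete, the following is a minimal sketch of a weighted direct estimator of area prevalence. It is an illustration under stated assumptions, not the implementation used in the paper: the function name `direct_prevalence` and the toy data are hypothetical, and the sampling weights are assumed to be inverse inclusion probabilities. When known population sizes are supplied, the HT form divides the weighted total by the population size; otherwise the ratio (Hájek) form divides by the sum of the weights.

```python
from collections import defaultdict


def direct_prevalence(areas, y, w, N=None):
    """Weighted direct prevalence estimate per area (illustrative sketch).

    areas : area label of each sampled individual
    y     : binary outcomes (0/1) of the sampled individuals
    w     : sampling weights, assumed to be inverse inclusion probabilities
    N     : optional dict of known population sizes per area; if given, the
            HT form sum(w*y)/N_k is used, otherwise the ratio (Hajek) form
            sum(w*y)/sum(w).
    Areas with no sampled individuals do not appear in the result, which is
    exactly the situation in which direct estimators break down.
    """
    weighted_cases = defaultdict(float)
    weight_total = defaultdict(float)
    for k, yi, wi in zip(areas, y, w):
        weighted_cases[k] += wi * yi
        weight_total[k] += wi

    estimates = {}
    for k in weighted_cases:
        denominator = N[k] if N is not None else weight_total[k]
        estimates[k] = weighted_cases[k] / denominator
    return estimates


# Hypothetical toy data: area "C" exists in the population but was not sampled.
toy_areas = ["A", "A", "B"]
toy_y = [1, 0, 1]
toy_w = [10.0, 30.0, 5.0]
toy_N = {"A": 50, "B": 5, "C": 20}

ht_estimates = direct_prevalence(toy_areas, toy_y, toy_w, N=toy_N)
hajek_estimates = direct_prevalence(toy_areas, toy_y, toy_w)
```

In this toy example area "B" is estimated from a single respondent and area "C" receives no estimate at all, which illustrates why direct estimators become unreliable or infeasible for small areas.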

While model-based SAE is conceptually appealing, complex survey designs and their accompanying survey weights make practical implementation difficult. Relatively few approaches acknowledge the survey sampling mechanism and account for it in the model. Kott (1989) and Prasad and Rao (1999) described design-consistent model-based estimators. Kott (1989) proposed an estimator which is a weighted combination of the HT estimator and the sample means of the different areas. Prasad and Rao (1999) proposed a pseudo-empirical best linear unbiased prediction estimator for the small area mean based on area-level data. You and Rao (2002, 2003) used unit-level data instead. Malec et al. (1997) described a hierarchical Bayesian model for binary survey data. They examined the use of sampling weights as a linear covariate in the model, after the inclusion of several post-stratification variables. Chen et al. (2014) proposed the use of a weight-adjusted Bayesian estimator that takes into account the effective sample size. Mercer et al. (2014) described a simulation study in which several methods for spatial smoothing in SAE, taking into account the sampling weights, are compared.

In this article, we describe a spatial predictive model-based approach to SAE for a binary health outcome in a complex survey with given sampling weights. We assume that the sampling weights on the sampled individuals are the only information available about the survey design. The goal is to estimate the prevalence of the health outcome for all small areas in the spatial domain. A hierarchical Bayesian model is used in which the health outcomes are regressed on the sampling weights. A non-parametric regression on the weights is used to minimise possible bias of the regression function. Additionally, both unstructured and structured spatial random effects are introduced to model the geographical distribution of the health outcomes. The population distribution of the sampling weights is unknown as well, hence we must model the weights themselves to be able to perform predictions. Our proposed method extends ideas described in Si et al. (2015) that are useful for surveys outside the SAE context. We use integrated nested Laplace approximations in R for model estimation (Rue et al., 2009). The methods described in this article add a hierarchical Bayesian model-based prediction approach for data with associated sampling weights to the SAE literature.
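The model structure described in the paragraph above can be written schematically as follows. This is a sketch based only on that verbal description, not the exact specification of the paper: the symbols $\beta_0$, $f$, $w_{ik}$, $u_k$, $v_k$, $\sigma_u^2$ and $\sigma_v^2$ are notation introduced here for illustration, and the intrinsic conditional autoregressive form shown for the structured effect (with $\partial k$ the set of neighbours of area $k$ and $m_k$ their number) is one common choice assumed here, not one confirmed by the text.

```latex
\begin{align*}
y_{ik} \mid p_{ik} &\sim \text{Bernoulli}(p_{ik}), \\
\operatorname{logit}(p_{ik}) &= \beta_0 + f(w_{ik}) + u_k + v_k, \\
u_k &\sim \mathcal{N}(0, \sigma_u^2)
  && \text{(unstructured area effect)}, \\
v_k \mid v_{-k} &\sim \mathcal{N}\!\left(\frac{1}{m_k} \sum_{j \in \partial k} v_j,\; \frac{\sigma_v^2}{m_k}\right)
  && \text{(structured spatial effect, assumed form)}.
\end{align*}
```

Here $f$ stands for the non-parametric regression function of the sampling weight $w_{ik}$ of individual $i$ in area $k$; its prior is left unspecified in this sketch.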

The structure of the paper is as follows. In Section 2 we introduce notation and describe the traditional design-based approach to perform SAE from a health survey. Several model-based approaches summarized in Mercer et al. (2014) that are used here for comparison purposes in the simulation study are also described in Section 2. We describe our proposed model-based approach in Section 3, and provide some details on the implementation of the models in standard software. A simulation study comparing our methods to other design- and model-based methods is provided in Section 4. In Section 5, we analyse the 2001 Belgian Health Interview Survey to estimate asthma prevalence across districts. We conclude the paper with a discussion in Section 6.

Section snippets

Notation

Let $Y_{ik}$ be a binary health outcome for individual $i$ in small area $k$ ($i = 1, \ldots, N_k$ and $k = 1, \ldots, K$), with $N_k$ the population size in area $k$. We assume that $N_k$ is known for each area. A sample of size $n_k$ is drawn from each area $k$, where some of the $n_k$ could be zero. Denote the sampled values by $y_{ik}$. Let $N = \sum_{k=1}^{K} N_k$ and $n = \sum_{k=1}^{K} n_k$ represent the total population and sample size, respectively. We shall focus on estimating the true prevalence, $P_k$, in each area $k$, namely $P_k = \frac{1}{N_k} \sum_{i=1}^{N_k} Y_{ik}$. Let $R_{ik}$ denote the binary

Proposed methods

In this section, we propose a hierarchical model for the observed outcomes $y_{ik}$ (Section 3.1), and explain how to use this model to make predictions $\hat{y}_{ik}$ for non-sampled individuals in order to obtain an estimator of $P_k$ (Section 3.2).
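The prediction step outlined above can be sketched as follows: the observed outcomes of the sampled individuals are combined with model predictions for the non-sampled individuals and divided by the known population size of the area. This is a minimal illustration of that idea, not the estimator as specified in the paper; the function name `predictive_prevalence` and the numbers are hypothetical, and the predictions are assumed to be given (for example as posterior mean probabilities), which glosses over the fact that the weights of non-sampled individuals must themselves be modelled.

```python
def predictive_prevalence(y_sampled, y_hat_nonsampled, N_k):
    """Predictive estimate of the prevalence in one area (illustrative sketch).

    y_sampled        : observed binary outcomes of the n_k sampled individuals
    y_hat_nonsampled : predictions for the N_k - n_k non-sampled individuals,
                       assumed here to be predicted probabilities in [0, 1]
    N_k              : known population size of the area
    """
    if len(y_sampled) + len(y_hat_nonsampled) != N_k:
        raise ValueError("sampled and non-sampled individuals must add up to N_k")
    return (sum(y_sampled) + sum(y_hat_nonsampled)) / N_k


# Hypothetical area with 10 inhabitants, 3 of whom were sampled.
example_estimate = predictive_prevalence([1, 0, 0], [0.1] * 7, 10)

# An area without any sampled individuals relies on predictions only.
unsampled_area_estimate = predictive_prevalence([], [0.2] * 5, 5)
```

Because the estimate falls back on predictions alone when no one was sampled, this form remains defined for areas with a sample size of zero, unlike a direct weighted estimator.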

Simulation setup

In this section we describe the setup of the simulation study to evaluate the performance of the different small area estimators described in this article. As geography, we took the administrative district division of Belgium (see Fig. 1 and Section 5). The total region consists of 43 districts. Population sizes stratified by five-year age groups and gender (yielding a total of $J = 36$ strata) are available for each district. The total population size is around ten million. Let xa denote the

Application to Belgian Health Interview Survey

Next, we focus our attention on empirical data measuring the prevalence of asthma across the 43 districts shown in Fig. 1 using the 2001 Belgian Health Interview Survey (HIS). Data were collected in response to the question “Have you experienced asthma in the previous year?”. In total, 12,003 individuals responded to this question. The number of respondents per district varied between 50 and 2949, and 4 districts were not selected in the survey. In total 612 (5.1%) individuals responded

Discussion

We have presented a predictive model-based approach for obtaining small area estimates from a health survey in which the survey weights of the sampled individuals are the only information available on the survey design. Our approach uses a hierarchical Bayesian model in which the health outcomes are regressed via a non-parametric function on the normalized survey weights to obtain predictions of the outcome for the non-sampled individuals. The hierarchical model accounts for the spatial

Acknowledgments

Support from a doctoral grant of Hasselt University is acknowledged (BOF11D04FAEC to YV). Support from the National Institutes of Health is acknowledged [award number R01CA172805 to CF]. Support from the University of Antwerp scientific chair in Evidence-Based Vaccinology, financed in 2009–2015 by a gift from Pfizer, is acknowledged [to NH]. Support from the IAP Research Network P7/06 of the Belgian State (Belgian Science Policy) is gratefully acknowledged (FEDRA P7/06). This research is

References (50)

  • S. Demarest et al., Health Interview Survey 2001: Protocol for the Sampling Design (2001)
  • L.E. Eberly et al., Identifiability and convergence issues for Markov chain Monte Carlo fitting of spatial models, Stat. Med. (2000)
  • P.H.C. Eilers et al., Flexible smoothing with B-splines and penalties (with discussion), Statist. Sci. (1996)
  • P.J. Farrell, Bayesian inference for small area proportions, Indian J. Stat. (2000)
  • R.E. Fay et al., Estimates of income for small places: An application of James–Stein procedures to census data, J. Amer. Statist. Assoc. (1979)
  • Y. Fong et al., Bayesian inference for generalized linear mixed models, Biostatistics (2010)
  • M. Ghosh et al., Generalized linear models for small-area estimation, J. Amer. Statist. Assoc. (1998)
  • M.E. Gonzalez, Use and evaluation of synthetic estimators
  • T. Hastie et al., The Elements of Statistical Learning (2001)
  • D.G. Horvitz et al., A generalization of sampling without replacement from a finite universe, J. Amer. Statist. Assoc. (1952)
  • J. Jiang et al., Mixed model prediction and small area estimation, TEST (2006)
  • Kemp, I., Boyle, P., Smans, M., Muir, C.S., 1985. Atlas of cancer in Scotland, 1975–1980: Incidence and epidemiological...
  • P. Kott, Robust small domain estimation using random effects modelling, Surv. Methodol. (1989)

Cited by (36)

    • A new small area estimation algorithm to balance between statistical precision and scale

      2021, International Journal of Applied Earth Observation and Geoinformation
      Citation Excerpt :

      For some specific domains, field sample size might therefore be too small and the uncertainty too large to meet the precision requirements. In such cases, as well as in areas of interest without any field plots, model-based approaches represent alternatives (Vandendijck et al., 2016; Magnussen, 2015). Those so-called indirect estimators (Rao and Molina, 2015) take advantage of sample plots and auxiliary data available outside of the area of interest.

    • A roadmap for disclosure avoidance in the survey of income and program participation

      2024, A Roadmap for Disclosure Avoidance in the Survey of Income and Program Participation