Model-based inference for small area estimation with sampling weights
Introduction
In public health we are often interested in whether there are disparities in illness, behavioural risk factors or health conditions across areas. An increasing amount of information on individuals is collected for this purpose. Bayesian methods for disease mapping based on census or population registry data are well developed and used in a fairly standard manner (see e.g., Elliott et al., 2001; Waller and Gotway, 2004; Lawson, 2013 for a review of the methods). Such population registry or census data contain information on each member of the population of an area. Historically, the focus was on the construction of cancer atlases and on mapping rare diseases based on registry data (see e.g., Kemp et al., 1985; Mason, 1995).
Since it is nearly always impossible to measure the health outcome of interest in every individual in the population, a survey is used to record information from a random sample of individuals from the population (Cochran, 1977). Such surveys are often characterized by a complex design, with stratification, clustering and unequal sampling weights as common features. Policy makers are often interested in a specific characteristic per area, such as the total number of diseased cases or the prevalence. Small area estimation (SAE) investigates how to obtain these area-specific characteristics from survey data that cover more than the area of interest alone, using spatial smoothing methods.
In SAE, one needs to choose whether to base inference on design-based, model-based or design-based model-assisted approaches. In design-based inference the values of the health outcomes are assumed fixed, and inference is based on the randomization distribution of the sample inclusion indicators. Often a model is used in the construction of a design-based estimator (known as design-based model-assisted approaches). A popular design-based estimator is the Horvitz–Thompson (HT) estimator (Horvitz and Thompson, 1952) and its extensions, which weight each sampled individual by the associated sampling weight. These estimators play a dominant role in sample surveys. However, they often fail in SAE because the sample size per area can be very small or even zero, which inflates the mean squared error substantially. This makes design-based estimators unreliable or infeasible to use (Rao, 2011). Additionally, because of the spatial nature of the problem, understanding the geographical distribution of the health outcome is important. Model-based approaches that perform spatial smoothing, both those based on empirical and those based on hierarchical Bayesian methodology, have been shown to be more relevant in the handling of spatially correlated health survey data. In model-based approaches one conditions on the selected sample, and the inference is based on the underlying model of the health outcome. Examples include Fay and Herriot (1979), who proposed a linear empirical Bayes model to estimate the income for small areas, and Datta and Ghosh (1991), who considered a hierarchical Bayesian formulation instead. A number of extensions have been made; see Rao (2003) and Jiang and Lahiri (2006) for an overview. For binary data, MacGibbon and Tomberlin (1989) developed an empirical Bayes model using a logistic regression model with fixed and random effects. Stroud (1994), Ghosh et al. (1998) and Farrell (2000) described hierarchical Bayesian approaches to estimate small area proportions.
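As a minimal sketch of the design-based estimators discussed above, the following Python snippet computes a weighted prevalence estimate per area from sampled binary outcomes and their sampling weights. It uses the weight-normalised (Hájek-type) ratio form rather than the plain HT total; the function name, the toy data and that choice are illustrative assumptions, not taken from the source.

```python
from collections import defaultdict


def weighted_prevalence(areas, outcomes, weights):
    """Hajek-type (weight-normalised) prevalence estimate per area.

    Hypothetical helper for illustration: `areas`, `outcomes` (0/1) and
    `weights` (sampling weights) are parallel sequences over the
    sampled individuals.
    """
    weighted_cases = defaultdict(float)  # sum of w * y per area
    weight_totals = defaultdict(float)   # sum of w per area
    for area, y, w in zip(areas, outcomes, weights):
        weighted_cases[area] += w * y
        weight_totals[area] += w
    # Only areas with at least one sampled individual get an estimate.
    return {a: weighted_cases[a] / weight_totals[a] for a in weight_totals}


# Toy data (made up): area "A" has three respondents, "B" has one,
# and a third area "C" has none, so it receives no direct estimate.
areas = ["A", "A", "A", "B"]
outcomes = [1, 0, 0, 1]
weights = [10.0, 30.0, 60.0, 50.0]

estimates = weighted_prevalence(areas, outcomes, weights)
print(estimates)
```

With a single respondent, the estimate for area B can only be 0 or 1, and the unsampled area C has no estimate at all. Both cases illustrate why direct estimators become unreliable when area sample sizes are small.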
While model-based SAE is conceptually appealing, complex survey designs with their accompanying survey weights complicate its practical implementation. Only relatively few approaches acknowledge the survey sampling mechanism and account for it in the model. Kott (1989) and Prasad and Rao (1999) each described a design-consistent model-based estimator. Kott (1989) proposed an estimator which is a weighted combination of the HT estimator and the sample means of the different areas. Prasad and Rao (1999) proposed a pseudo-empirical best linear unbiased prediction estimator for the small area mean based on area level data. You and Rao (2002, 2003) used unit level data instead. Malec et al. (1997) described a hierarchical Bayesian model for binary survey data. They examined the use of sampling weights as a linear covariate in the model, after the inclusion of several post-stratification variables. Chen et al. (2014) proposed the use of a weight-adjusted Bayesian estimator that takes into account the effective sample size. Mercer et al. (2014) described a simulation study in which several methods for spatial smoothing in SAE, taking into account the sampling weights, are compared.
In this article, we describe a spatial predictive model-based approach to SAE for a binary health outcome in a complex survey with given sampling weights. We assume that the sampling weights on the sampled individuals are the only information available about the survey design. The goal is to estimate the prevalence of the health outcome for all small areas in the spatial domain. A hierarchical Bayesian model is used in which the health outcomes are regressed on the sampling weights. A non-parametric regression on the weights is used to minimise possible bias of the regression function. Additionally, both unstructured and structured spatial random effects are introduced to model the geographical distribution of the health outcomes. The population distribution of the sampling weights is unknown as well, hence we must model the weights themselves to be able to perform predictions. Our proposed method extends ideas described in Si et al. (2015) that are useful for surveys outside the SAE context. We use integrated nested Laplace approximations in R for model estimation (Rue et al., 2009). The methods described in this article add a hierarchical Bayesian model-based prediction approach for data with associated sampling weights to the SAE literature.
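One generic way to write a model of this kind is given below. It is a sketch with assumed symbols and an assumed logit link, not the exact specification of Section 3.1: $y_{ik}$ denotes the binary outcome of individual $k$ in area $i$, $w_{ik}$ the sampling weight, $f$ a smooth non-parametric function of the weight, and $u_i$ and $v_i$ the unstructured and spatially structured area effects.

```latex
% Assumed symbols: y_{ik} outcome, w_{ik} sampling weight, f smooth function,
% u_i unstructured and v_i spatially structured area-level random effects.
y_{ik} \mid p_{ik} \sim \mathrm{Bernoulli}(p_{ik}), \qquad
\operatorname{logit}(p_{ik}) = \beta_0 + f(w_{ik}) + u_i + v_i .
```

Predicting the outcome of a non-sampled individual from such a model requires that individual's weight, which is why the population distribution of the weights has to be modelled as well.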
The structure of the paper is as follows. In Section 2 we introduce notation and describe the traditional design-based approach to SAE from a health survey. Section 2 also describes several model-based approaches summarized in Mercer et al. (2014), which are used for comparison in the simulation study. We describe our proposed model-based approach in Section 3, and provide some details on the implementation of the models in standard software. A simulation study comparing our methods to other design- and model-based methods is provided in Section 4. In Section 5, we analyse the 2001 Belgian Health Interview Survey to estimate asthma prevalence across districts. We conclude the paper with a discussion in Section 6.
Section snippets
Notation
Let be a binary health outcome for individual in small area ( and ) with the population size in area . We assume that is known for each area. A sample of size is drawn from each area , where some of the could be zero. Denote the sampled values by . Let and represent the total population and sample size, respectively. We shall focus on estimating the true prevalence, , in each area , namely Let denote the binary
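One notation consistent with the description above can be written as follows. The symbol names ($y_{ik}$, $m$, $N_i$, $n_i$, $N$, $n$, $P_i$) are assumptions chosen for illustration and may differ from those used in the paper.

```latex
% Assumed notation: i indexes areas, k indexes individuals within an area.
y_{ik} \in \{0, 1\}, \qquad i = 1, \dots, m, \quad k = 1, \dots, N_i,
\qquad
N = \sum_{i=1}^{m} N_i, \qquad n = \sum_{i=1}^{m} n_i,
\qquad
P_i = \frac{1}{N_i} \sum_{k=1}^{N_i} y_{ik} .
```

Under this notation the true prevalence in an area is simply the population mean of the binary outcomes in that area.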
Proposed methods
In this section, we propose a hierarchical model for the observed outcomes (Section 3.1), and explain how to use this model to make predictions for non-sampled individuals in order to obtain an estimator of (Section 3.2).
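The prediction step can be sketched as follows, assuming that a fitted model supplies a predicted outcome probability for each non-sampled individual in an area. The function and argument names are hypothetical, and the sketch is a generic predictive estimator rather than the exact estimator of Section 3.2: observed outcomes are kept as they are, and predictions fill in the rest of the population.

```python
def predictive_prevalence(sampled_outcomes, predicted_probs, population_size):
    """Combine observed outcomes with model predictions for one area.

    sampled_outcomes: 0/1 outcomes of the sampled individuals.
    predicted_probs: predicted probabilities for the non-sampled individuals
        (assumed to come from some fitted model; not specified here).
    population_size: known population size of the area.
    """
    if len(sampled_outcomes) + len(predicted_probs) != population_size:
        raise ValueError(
            "sampled and non-sampled counts must add up to the population size"
        )
    # Observed cases plus expected cases among the non-sampled individuals.
    return (sum(sampled_outcomes) + sum(predicted_probs)) / population_size


# Toy area (made up): 10 individuals, 2 sampled, 8 predicted.
est = predictive_prevalence(
    [1, 0], [0.25, 0.25, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0], 10
)

# An area without any sampled individuals relies on predictions only.
est_unsampled = predictive_prevalence([], [0.5, 0.5, 0.5, 0.5], 4)
```

The second call shows why a predictive approach still produces an estimate for areas with a sample size of zero, provided the model can generate predictions there.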
Simulation setup
In this section we describe the setup of the simulation study to evaluate the performance of the different small area estimators described in this article. As geography, we took the administrative district division of Belgium (see Fig. 1 and Section 5). The total region consists of 43 districts. Population sizes stratified by five-year age-groups and gender (yielding a total of strata) at each district are available. The total population size is around ten million. Let denote the
Application to Belgian Health Interview Survey
Next, we focus our attention on empirical data measuring the prevalence of asthma across the 43 districts shown in Fig. 1 using the 2001 Belgian Health Interview Survey (HIS). Data were collected in response to the question “Have you experienced asthma in the previous year?”. In total, 12,003 individuals responded to this question. The number of respondents per district varied between 50 and 2949, and 4 districts were not selected in the survey. In total 612 (5.1%) individuals responded
Discussion
We have presented a predictive model-based approach to small area estimation from a health survey in which the survey weights of the sampled individuals are the only information available on the survey design. Our approach uses a hierarchical Bayesian model in which the health outcomes are regressed via a non-parametric function on the normalized survey weights to obtain predictions of the outcome for the non-sampled individuals. The hierarchical model accounts for the spatial
Acknowledgments
Support from a doctoral grant of Hasselt University is acknowledged (BOF11D04FAEC to YV). Support from the National Institutes of Health is acknowledged [award number R01CA172805 to CF]. Support from the University of Antwerp scientific chair in Evidence-Based Vaccinology, financed in 2009–2015 by a gift from Pfizer, is acknowledged [to NH]. Support from the IAP Research Network P7/06 of the Belgian State (Belgian Science Policy) is gratefully acknowledged (FEDRA P7/06). This research is
References (50)
et al. The use of sample weights in Bayesian hierarchical models for small area estimation. Spat. Spat.-Temporal Epidemiol. (2014)
et al. A comparison of spatial smoothing methods for small area estimation with sampling weights. Spat. Stat. (2014)
et al. Pseudo hierarchical Bayes small area estimation combining unit level models and survey weights. J. Statist. Plann. Inference (2003)
The multinomial-Poisson transformation. Statistician (1994)
et al. Bayesian image restoration, with two applications in spatial statistics. Ann. Inst. Statist. Math. (1991)
et al. Simultaneous probability statements for Bayesian P-splines. Stat. Model. (2008)
et al. Bayesian penalized spline model-based inference for finite population proportion in unequal probability sampling. Surv. Methodol. (2010)
Sampling Techniques (1977)
et al. Estimating small area diabetes prevalence in the US using the behavioral risk factor surveillance system. J. Data Sci. (2010)
et al. Bayesian prediction in linear models: Applications to small area estimation. Ann. Statist. (1991)
Health Interview Survey 2001: Protocol for the Sampling Design
Identifiability and convergence issues for Markov chain Monte Carlo fitting of spatial models. Stat. Med.
Flexible smoothing with B-splines and penalties (with discussion). Statist. Sci.
Bayesian inference for small area proportions. Indian J. Stat.
Estimates of income for small places: An application of James–Stein procedures to census data. J. Amer. Statist. Assoc.
Bayesian inference for generalized linear mixed models. Biostatistics
Generalized linear models for small-area estimation. J. Amer. Statist. Assoc.
Use and evaluation of synthetic estimators
The Elements of Statistical Learning
A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc.
Mixed model prediction and small area estimation. TEST
Robust small domain estimation using random effects modelling. Surv. Methodol.