Incorporating Hierarchical Structure Into Dynamic Systems: An Application Of Estimating HIV Epidemics At Sub-National And Sub-Population Level
Dynamic models have been successfully used in producing estimates of HIV epidemics at national level, due to their epidemiological nature and their ability to simultaneously estimate prevalence, incidence, and mortality rates. Recently, HIV interventions and policies have required more information at sub-national and sub-population levels to support local planning, decision making and resource allocation. Unfortunately, many areas and high-risk groups lack sufficient data for deriving stable and reliable results, and this is a critical technical barrier to more stratified estimates. One solution is to borrow information from other areas and groups within the same country. However, directly assuming hierarchical structures within the HIV dynamic models is complicated and computationally time consuming. In this paper, we propose a simple and innovative way to incorporate the hierarchical information into the dynamic systems by using auxiliary data. The proposed method efficiently uses information from multiple areas and risk groups within each country without increasing the computational burden. As a result, the new model improves predictive ability in general with especially significant improvement in areas and risk groups with sparse data.
Keywords: Hierarchical model, Dynamic systems, HIV epidemics
Mounting an effective response to HIV/AIDS requires reliable estimates of the global scope and recent trends in the epidemic. National governments, with support from UNAIDS, use a dynamic mathematical modeling software called Spectrum to annually generate national HIV epidemic estimates [Case2014]. Spectrum was applied in 163 countries in 2015. Recent estimates suggest that great progress has been made, including a 38% decline in new infections globally since 2001, a 58% drop in new infections in children since 2002, and a 35% fall in AIDS-related death since 2005 [UNAIDS2014].
However, it is perceived that among these successes, certain key populations at higher risk of acquiring HIV and sub-national geographic regions with greatly elevated burden have not benefited equally. Young women are disproportionately affected by HIV in Sub-Saharan Africa, accounting for 71% of all infections among adolescents. Prevalence reviews have shown that sex workers and men who have sex with men have ten-fold or higher prevalence compared to the population as a whole [Beyrer2012, Baral2012]. People who inject drugs account for an estimated 30% of new HIV infections outside of Sub-Saharan Africa. And within high burden countries, geographic variations by province and district can be large, e.g., in Zambia HIV regional HIV prevalence varied by a factor of three from 6.4% to 18.2% in 2013 [ZambiaDHS2015].â
The existing tools for generating estimates involve fitting a dynamic epidemic model called the Estimation and Projection Package (EPP) to national surveillance data about HIV prevalence using a Bayesian framework [Alkema2007]. The dynamic system modeling framework has several advantages including its epidemiological nature, ability to make short-term projection, and simultaneous estimation of prevalence, incidence, and mortality rates. It is well known that HIV epidemics can spread heterogeneously across sub-national areas and risk groups. Nevertheless, current approaches to monitor the epidemic do not always adequately capture this heterogeneity. Currently, the sub-national estimations are done by applying the epidemic models independently using only data from within the area/group [Mahy2014]. However, the availability and quality of the data vary widely. For areas/groups with sparse data, the models produce inaccurate results with large uncertainty bounds [Lyerla2008, Calleja2010].
One solution is to assume that the parameters of the epidemic models are correlated among areas, groups, or even countries in a hierarchical framework, thereby efficiently borrowing information from other areas and groups with similar epidemics. However, estimating parameters in the HIV epidemic models is already time consuming because of the dynamic systems that are used to construct the historic trends of multiple quantities of interest, such as the number of people living with HIV, the number of new infections, and the number of HIV deaths. Directly assuming hierarchical structures and modeling the joint distribution of parameters in multiple areas and risk groups further challenge the estimation and computation.
In this paper, we propose a simple and innovative way to incorporate the hierarchical information into the dynamic systems by using auxiliary data. The proposed method efficiently uses information from multiple areas and risk groups within each country without increasing the computational burden. In Section 2, we describe the dynamic models used to produce national estimates. In Section 3, we introduce our method that extends the sub-national / sub-population estimation to include the hierarchical structure for sharing information across areas and risk groups. In Section 4, we evaluate the model performance via test data prediction, and present results for Nigeria and Thailand. In Section 5, we offer conclusions and discussion for future work.
2 The HIV estimation model
The most commonly used model to fit and project HIV epidemics is the Estimation and Projection Package (EPP), which is the primary source of incidence estimates in Spectrum, the UNAIDS provided software used by most countries. EPP is based on a susceptible-infected (SI) epidemiological model. The dynamic model is as follows:
The model represents the adult (aged 15-49 year) population stratified into the susceptible population and infected population at time , such that the total adult population size is . The transmission rate over time from infected to uninfected adults is modeled as a flexible smooth function . The functional form for is either modeled by B-splines in the ‘r-spline’ model [Hogan2010] or a parametric autoregressive model in the r-trend model [Bao2010rflex, Bao2012rtrend]. is the number of new adults entering the population at age 15, is the prevalence, is the non-HIV death rate, and is the number of deaths among infected individuals. is the rate at which adults exit the model after attaining age 50, and is the rate of net migration into the population. For some high-risk groups, such as people who inject drugs and sex workers, entrance and exit into the population is also determined by the average duration of staying within the high-risk population. Given a small initial seed prevalence at the start of the epidemic, the model simulates a temporal trend of prevalence, incidence, and HIV mortality rates.
The main data source for EPP model estimation in the general population is sentinel surveillance of HIV prevalence among pregnant women attending antenatal clinics (ANC). These consist of the number of infected women and the number of tested women , for a given year and a given clinic . Where available, model estimation also uses population prevalence estimates for the entire population from nationally representative household surveys.
To account for multiple sentinel sites in the same region, the observed clinic-level prevalence and the population prevalence – – derived from the dynamic system (1) are linked through a mixed effects model. In addition, the prevalence is transformed to be put into a linear mixed effect model so that the random effects and their variance can be integrated out analytically under the normality assumption, and thus simplify the parameter estimation [Alkema2007, Bao2012rtrend].
where is the inverse cumulative distribution function of the standard normal distribution, , is the bias of ANC data with respect to prevalence data from national population-based household surveys (NPBS), is the site-specific random effect, is assumed to have an inverse-Gamma prior which gets integrated out in the likelihood evaluation, and is a fixed quantity that depends on the clinic data and approximates the binomial variation.
Finally, a Bayesian framework is used to bring in the prior information of the key parameters such as the starting time of the epidemic. Model parameters are estimated with posterior approximation via Incremental Mixture Importance Sampling (IMIS) [Raftery2010]. The final estimates reflect three sources of information: the prevalence data, the prior knowledge, and the epidemic trends inferred by the SI model in Equation 1. As a result, the EPP model has several advantages for estimating the HIV epidemics. Its basis in an dynamic epidemic model lends theoretical credibility to the resulting epidemic inference, and the model intrinsically represents the relationship between key epidemiological processes, enabling internally consistent estimates about multiple quantities of interest including prevalence, mortality, and incidence, about which there is no directly observable data. Finally, the dynamic model is linked to a statistical model so that it can produce uncertainty estimates and projection intervals for all quantities of interest. We will continue to use the EPP model as a building block to produce sub-national and sub-population epidemic estimates.
3 Incorporating hierarchical structure into estimating sub-epidemics
The current approach to extend the EPP model to fit sub-epidemics is to run the EPP model for each sub-national area and risk group combination independently using only data from within the sub-epidemic. However, many sub-epidemics lack sufficient data for deriving stable and reliable results (see Figure LABEL:fig:Nigeria (d) as an example where the uncertainty is too large for the early period of the epidemic without sufficient data). Our goal is to improve the accuracy of the results in areas and groups with sparse data, while retaining the simplicity of the epidemic model and not increasing the computational burden. In previous work, we have demonstrated the application of a full hierarchical model for joint parameter estimation across regions [Bao2015a]. However, this is burdensome for a computationally expensive dynamic model and likelihood, and may prove to be infeasible for some routine applications. The main idea proposed here is very simple: we propose to fit a generalized linear mixed model (GLMM) to data from all sub-national areas and risk groups, and incorporate the fitted prevalence into the EPP model estimation through auxiliary data to efficiently approximate the information that would be added through a hierarchical structure. The approach can be outlined as follows:
For a particular country, we pool the data from all areas and risk groups together and fit a GLMM with a non-parametric flexible time trend and random effects for areas, sites, and risk groups to represent the hierarchical structure.
We then use the predictive distribution of the sub-epidemic prevalence from the GLMM to create auxiliary data. The precision of the auxiliary data is determined by the uncertainty estimate from GLMM.
Finally, we add the auxiliary data to the original data and fit the EPP model separately for each sub-epidemic as before. The resulting prevalence, incidence, and mortality estimates can all be derived the same way as before, while they now contain the information from other areas that would be included through the hierarchical structure.
The details of the model are described in the following subsections.
3.1 Generalized linear mixed models (GLMM)
In the first step, we model the prevalence data without using the epidemic model, so that the information can be efficiently pooled across areas. GLMM is a natural choice to model such data which takes the hierarchy or even spatial dependence into consideration. Instead of using a dynamic model, we model prevalence trends as functions of time over the period in which data were observed. Spline models introduce great flexibility with fixed degree, and relieve the pressure of over-fitting [Wold1974, Wahba1990]. With some experiments, we choose the natural cubic spline model with equally spaced knots to describe the time trends. The natural spline uses boundary constraints to stablize the estimates near the data boundary, which is helpful when few observations are available in the begining or end year of the data.
We first apply GLMM to countries in which the overall prevalence is high and the epidemic is not confined to particular subgroups. Therefore, there is only one risk group to be considered, that is, the whole adult population of the country. The proportions of HIV+ cases among antenatal clinic (ANC) attendees observed over years are often used as a proxy for the time trend of adult prevalence in general population. In a typical epidemic with multiple areas, and multiple surveillance sites within each area, let , , indicate area, site, and time respectively. It is expected that prevalence data collected from the same site or different sites within the same area are correlated. We test GLMM specifications of varying complexity, and the following model is recommended:
where are the basis functions of the natural cubic spline; ’s are fixed effects that determine the time trend at the national level; ’s are the random effects at area and site levels. It reflects the hypothesis of different but similar area-level epidemic trends by including area-level random intercepts and spline coefficients.
In countries with low-level and concentrated epidemics, HIV has spread rapidly in the key populations that are most likely to acquire and transmit HIV, but is not well established in the general population. There is no set of representative data that can be used for the general population, and the estimation is done within each individual group. The sub-epidemics could be quite different across high-risk groups such as people who inject drugs, sex workers, clients of sex workers, and men who have sex with men. Therefore, we suggest introducing sub-population-specific intercepts and spline coefficients in Equation 3, and treating them as fixed effects. Other components remain the same, so that the area-specific and site-specific parameters are shared by all sub-populations, e.g. the area with relatively high HIV prevalence among people who inject drugs also has relatively high HIV prevalence among sex workers, compared other areas. Let be the index of risk groups. The following hierarchical model is suggested:
where ’s are group-specific fixed effects, ’s are the random effects at area and site levels.
For both models, we assign diffuse Gaussian priors to the fixed effects, and use the Monte Carlo Markov Chain (MCMC) algorithm implemented in the R package – MCMCglmm [Hadfield2010], to approximate the posterior distribution of area-specific prevalence, , with the site effects being marginalized. We run 105,000 iterations with 5,000 burn-in and 20 thinning interval, and collect a final set of 5,000 posterior samples. The convergence of Markov chains is checked based on trace plots and auto-correlation function (ACF) plots [Brockwell2013].
To verify our model choices, we compare the above models with the alternative hierarchical structures and prior distributions. The model performance is evaluated by using (1) the deviance information criterion (DIC) [Spiegelhalter2002, Spiegelhalter2014]; (2) the mean absolute error (MAE) of the test dataset prediction. This ensures that our proposed models either perform the best or nearly the best among all candidate models. Detailed model specifications and results are provided in the Appendix A.
3.2 Incorporating hierarchical information into EPP model estimation
GLMM utilizes data information efficiently by assuming similarity of the time trends (spline coefficients) while allowing heterogeneity across areas and risk groups, so that the prevalence trend in a specific area/risk group will be determined not only by its own data but also data from other areas and groups. However, it ignores the epidemic model and only provides inference for HIV prevalence trends within the period in which data were observed. Ideally, one could have replaced the spline function with the prevalence trend produced by the epidemic model in Equation 1. The difficulty is that estimating parameters in dynamic models is time consuming due to the lack of analytic solutions. Estimating multiple dynamic systems – one for each area – will further increase the computing cost and makes it unfeasible for a country with many areas.
Instead, we use the posterior distribution of the prevalence from the GLMM to create auxiliary data to add onto the original data for each area and each risk group. The auxiliary data can be viewed as an approximation for the prior distribution of the prevalence, and contains the hierarchical information from the GLMM. More specifically, for area and risk group in year , we denote the posterior estimate of as with mean and variance . We approximate this posterior information by converting each to a binomial observation with prevalence and sample size . This binomial formulation facilitates straightforward inclusion of auxiliary information into the existing EPP likelihood function and intuitive adjustment of the ‘strength’ of the auxiliary prevalence information by scaling the auxiliary data sample size. The total auxiliary data sample size for the sub-epidemic of area and risk group can be set to any predetermined number by scaling proportionally, so that the relative strength of remains the same within each sub-epidemic. corresponds to fitting Spectrum/EPP to data within area and group without borrowing information from other areas or groups. As increases, the prior prevalence distribution becomes more informative. In two illustrative examples below, we determine the optimal default value of using cross-validation.
3.3 Model validation
We evaluate the model performance by its out-of-sample prediction as outlined below:
For each sub-epidemic, randomly partition observations into training and test sets.
Apply GLMM to the training data of all sub-epidemics.
Generate auxiliary data from the predictive distributions of GLMM.
For each sub-epidemic, apply the EPP model to the training data with the addition of auxiliary data of size .
For each sub-epidemic, compare the observations in the test data with the corresponding predictions from the model. Let be the observed prevalence of site and year on the probit scale in the test data, and be the th posterior predictive sample from the model. The mean absolute error (MAE) is defined as averaged over all the posterior samples, sites, and years in the test data. This single-number criterion takes both the bias and the uncertainty of the predictive distribution into consideration. In addition, we evaluate the uncertainty estimates by the coverage and the average width of the 95% prediction intervals.
As illustrative examples, we present results for two countries—Nigeria and Thailand. Nigeria represents countries with high (>1%) HIV prevalence, where pregnant women prevalence (as measured in antenatal clinics) has been used as the indicator of the epidemic trend, together with available data from national population-based surveys. There are about 49 such countries, most of them in sub-Saharan Africa. Thailand represents epidemics where the HIV prevalence is lower and largely concentrated among specific sub-populations, including sex workers and their clients, gay and other men who have sex with men, and people who inject drugs. For each example, we compare the proposed model with the existing approach, i.e. independent fitting of EPP.
4.1 Nigeria Example
Nigeria consists of 36 states and the Federal Capital Territory Abuja with surveillance data starting from 1992. The data quality varies across states: the number of years for which ANC data is available ranges from 6 to 9; the number of clinic sites per state ranges from 2 to 8. The number of tested individuals of each area ranges from 3,482 to 10,789 with a median of 4,850.
Following the procedure described in Section 3.3, we partitioned the data using a training-test ratio of 1:1 at each site. The auxiliary data was generated by applying the GLMM specified in Equation 3 to the training datasets of 37 areas. For the test dataset of each area, we calculated the mean absolute error (MAE) with varying auxiliary data sample size, , , as well as the MAE of the original EPP estimate without auxiliary information (). The random training-test splitting and the corresponding MAE evaluations ware repeated 10 times, and thus, there were 370 test datasets from the 10 training-test splits and 37 areas.
For each test dataset, we calculated the MAE reduction of the proposed model relative to the original EPP as . Figure LABEL:fig:Nigeria (a) illustrates the relative MAE reduction with varying values of , and each curve represents one test dataset. The black solid curve shows the median among the 370 test datasets; and the black dashed curves correspond to 2.5th qunatile and 97.5th quantile. As increases from 15 to 200, the median MAE reduction increases from 0.2% to 3.1%, and MAE reductions are positive for most test datasets. For between 500 and 10,000, the median reduction further increases up to 5.0%. However, the variation of MAE reductions across test datasets becomes larger, and there are a few test datasets whose MAE reduction are less than -20%, or, equivalently, whose MAE increase by 20% or more after including auxiliary data. To prevent a large increase of MAE in any test dataset, we recommend at which the MAE reductions range from -7.2% to 18.0% with median 3.1%. Among 370 test datasets, 337 have positive reductions, and only 13 have negative reduction of more than 1%.
Similarly, we examine the relative reduction of the average width of the 95% credible intervals for HIV prevalence estimates. Figure LABEL:fig:Nigeria (b) shows that including the auxiliary data narrows the credible interval with positive width reductions. In addition, Figure LABEL:fig:Nigeria (c) shows the coverage of 95% credible intervals with varying from 0 to 10,000. There are too few data points in each area to derive reliable area-specific coverage. For each training-test split, we combine the test datasets of 37 areas, and define the coverage as the proportion of all test data points that fall within the 95% credible intervals. There are 10 gray curves, one for each training-test split, and the black curve indicates their median. Those proportions are fairly close to 95% and have no trend over . Therefore, we conclude that introducing the auxiliary data improves the test data prediction accuracy and reduces the range of uncertainty bounds without diminishing their coverage.
Finally, Figure LABEL:fig:Nigeria (d) compares the estimated area-level prevalence trends from different models in one training-test split as an illustrative example. The training data (brown color) are available between 2001 and 2010. The black curves show the posterior median and 95% credible interval (black dotted curves) of prevalence trends estimated from EPP without using auxiliary data, and there is a huge amount of uncertainties before 2001 due to lack of data. The auxiliary data (green color) incorporate prevalence trends from other areas in Nigeria, and suggest that the prevalence peaks in late 1990 and slowly declines since then. The blue curves show the posterior median and 95% credible interval of prevalence trends estimated from EPP with auxiliary data of sample size . Introducing the auxiliary data does not lead to major change in the prevalence estimates after 2000, but greatly reduces the uncertainty before 2000. The blue curves (with auxiliary data) suggest less dramatic changes of prevalence before 2000, and better agree with the test data average (red lines) than the black curves (without auxiliary data).