Multivariate Hierarchical Frameworks for Modelling Delayed Reporting in Count Data

Multivariate Hierarchical Frameworks for Modelling Delayed Reporting in Count Data

Oliver Stoner  
Department of Mathematics, University of Exeter
Theo Economou
Department of Mathematics, University of Exeter
The authors gratefully acknowledge the Natural Environment Research Council for funding this work through a GW4+ Doctoral Training Partnership studentship [NE/L002434/1].

In many fields and applications count data can be subject to delayed reporting. This is where the total count, such as the number of disease cases contracted in a given week, may not be immediately available, instead arriving in parts over time. For short term decision making, the statistical challenge lies in predicting the total count based on any observed partial counts, along with a robust quantification of uncertainty.

In this article we discuss previous approaches to modelling delayed reporting and present a multivariate hierarchical framework where the count generating process and delay mechanism are modelled simultaneously. Unlike other approaches, the framework can also be easily adapted to allow for the presence of under-reporting in the final observed count. To compare our approach with existing frameworks, one of which we extend to potentially improve predictive performance, we present a case study of reported dengue fever cases in Rio de Janeiro. Based on both within-sample and out-of-sample posterior predictive model checking and arguments of interpretability, adaptability, and computational efficiency, we discuss the advantages and disadvantages of each modelling framework.

Keywords: Multivariate count data; Generalized Dirichlet; Bayesian methods; Under-reporting; Notification delay; Censoring.

1 Introduction

In many fields and applications where count data are collected, a situation can arise where the available reported count is believed to be less than or equal to the true count. Delayed reporting in particular is where the total observable count, which may still be less than the true count, is only available after a certain amount of time. In some situations information will trickle in over time so that the current total count gets ever closer to the true count, before eventually reaching the final total observable count.

Figure 3: Bar plot of Rio de Janeiro dengue cases in the weeks leading up to time . The grey bars represent the total (as yet unobserved) number of reported cases, while the coloured bars show the number of cases reported in each week after the cases occurred.

An example of this situation is the occurrence of dengue fever, a viral infection spread by mosquitoes, in Rio de Janeiro. Imagining we are at the end of week , due to delayed reporting we have only observed a portion of the total observable number of cases which were contracted this week. A week from now, at time , a further portion will become available and so on, such that after a number of weeks the total number of observed cases we have observed from week eventually reaches a final total. Figure 3 shows an instance of the data, where we are at the end of week . The grey portions of each bar represent the yet unknown cases as of week , for instance we can see that for dengue cases that occurred in week we only have two weeks worth of information because we only have information that arrived in weeks and , while for cases occurring in week we have three weeks worth of information and so on.

Reporting delay becomes a problem when decisions need to be made based on the total count before it has been observed in its entirety. We can see in Figure 3, for example, that in the surveillance of dengue fever it can take months before the total observable number of cases contracted in a given week is known. This impedes the response time to severe outbreaks and puts lives at risk. It is therefore necessary to make predictions about the current state of the disease based on the partial number of cases observed (now-casting). This allows warnings to be issued and preparations made for predicted epidemics before they have been completely detected by the data. This motivates a statistical treatment of delayed reporting, with the goal of being able to predict the total count based on corresponding counts already observed. Further goals include predicting total counts which have not occurred yet (forecasting) and learning about the structure of the delay mechanism, so that improvements in reporting can be considered.

In this article we explore previous statistical approaches to modelling delayed reporting in count data, and discus their strengths and weaknesses. We then propose a general framework for modelling count data with discrete-time delays, which is sufficiently flexible to be used for a range of data, including those with complex spatio-temporal structures, and can be easily adapted to account for the presence of under-reporting in the final observed count. We present a case study based on counts of dengue fever cases in Rio de Janeiro, and assess the proposed framework by posterior predictive checking (now-casting and forecasting performance), relative to existing approaches.

The article is structured as follows: Section 2 presents an overview of existing approaches to modelling count data suffering from discrete-time delayed reporting, in addition to a substantial extension to one of the existing approaches. In Section 3 we propose a general framework for modelling delayed reporting. In Section 4 we present a case study of dengue fever data from Rio de Janeiro, with which we assess the efficacy of our framework compared to existing approaches. In Section 5, we discuss the potential issue of under-reporting in the final observed count and how the general framework from Section 3 can be adapted to account for it. Finally, Section 6 concludes with a discussion of interpretability, adaptability and ease of implementation.

2 Background

We begin by introducing some notation. Let be the total observable count occurring at temporal unit and spatial unit . We refer to as the total observable count because, in some cases, the final count we observe may still be an under-representation of the true count, an issue we return to in Section 5. Suppose that after some (temporal) delay unit (e.g. one week) a portion of has been reported. We denote this first portion . At the next delay unit we observe an additional portion of , denoted as . This continues so that at each delay unit we observe a count so that the sum of the observed gets closer to the total count .

2.1 Multinomial mixture approach

A sensible approach for modelling delayed reporting involves the idea of jointly modelling at the same time as the totals . Höhle and Heiden (2014) propose modelling the delayed counts as arising from a conditional Multinomial distribution. Here is the expected proportion of which will be reported at delay and is modelled as arising from Generalized-Dirichlet() (GD) distribution (Wong, 1998) where and are constant in time. The total observable count is also modelled explicitly as a latent Poisson variable in the Multinomial model:


Wang et al. (2018) also apply this approach to the monitoring of foodborne diseases, while a similar approach (without the General-Dirichlet layer) can be found in Salmon et al. (2015).

However, the assumption that the Generalized-Dirichlet distribution is time-invariant can be viewed as a restriction in capturing any delay mechanism which varies systematically over time, potentially inhibiting nowcasting and forecasting precision. Höhle and Heiden (2014) seek to address this by presenting a second approach in which the Generalized-Dirichlet model is replaced with a more conventional logistic regression on the Multinomial probabilities:


where is a linear combination of covariate effects. However, whilst this does allow the model to better capture heterogeneity in the delay mechanism over time, it is in part more restrictive. This is because in some applications the Multinomial delay model may be over-dispersed. We will discuss this issue in more detail in Section 3, where we propose a general framework which retains both the flexibility to capture spatio-temporal heterogeneity as well as the ability to appropriately separate variability in the delay mechanism from the model of the total count.

2.2 Conditional independence approach

An alternative approach, often used in the field of actuarial statistics for projecting ultimate losses from delayed insurance claims, is the Chain-Ladder method (Mack, 1993). The method is popular because it is easy to understand and is based entirely on empirical calculations. Renshaw and Verrall (1998) showed that the Chain-Ladder method can be presented as a Generalized Linear Model (Dobson and Barnett, 2018) of the following form:


This has been extended in various ways, for example to include potential covariates (see for instance England and Verrall (2002) and Barbosa and Struchiner (2002)). These approaches however, are restrictive in the sense that they assume the delay structure is homogeneous in time. In reality, the way in which reporting is delayed, for example the proportion of cases reported in the first week, changes over time. The baseline Chain-Ladder model has therefore been extended to accommodate such non-homogeneities as well as spatial variability.

A highly flexible approach that in some sense generalises the Chain-Ladder, is the conditional independence approach where the partial counts () are modelled as independent quantities, conditional on any spatio-temporal or delay structures in their expected value. We refer to this as the Generalized Linear Model (GLM) approach, as it is effectively (conditional on dispersion parameters) a Negative-Binomial GLM (Dobson and Barnett, 2018) for the partial counts :


Here and can be linear combinations of covariate effects or random effects, including complex temporal, spatial and spatio-temporal structures. The former is intended to capture variation in the total counts while the latter is intended to capture variation in the delay mechanism. Aside from the flexibility of incorporating complex structures in the model for , the key advantage of this approach is that it can be very easily implemented in a variety of frequentist frameworks (such as Generalized Additive Models, Wood (2017)), as well as Bayesian ones (such as Integrated Nested Laplacian Approximations (INLA) (Lindgren and Rue, 2015) and Markov Chain Monte Carlo (MCMC)). For example, Bastos et al. (2017) presents the application of this framework to dengue fever in Rio de Janeiro and to spatio-temporal Severe Acute Respiratory Infection (SARI) data in the state of Paraná (Brazil). Both were implemented in the Bayesian framework using INLA and in this case the framework was demonstrated to be a powerful tool for now-casting.

However, as is not modelled directly, inference is based on . Firstly, this means that uncertainty associated with the delay component of the GLM is potentially transferred through the summation of the into the uncertainty of the . A consequence of this uncertainty propagation is that models such as (8)-(9) may result in forecasting uncertainty (for example as quantified by 95% prediction intervals) that is prohibitively large, particularly when forecasting into the future where no are available.

Furthermore, to obtain reliable inference for , we would expect the model to capture well. As is not modelled directly this is given by:


This means that capturing the variance of well relies on modelling the covariances of the well. The issue with this is that the covariances of the are restricted by the assumption that are independent, conditional on . In many cases this may not be a valid assumption and consequently any inference based on is fundamentally flawed. To potentially address this, in the following subsection we present an extension to the conditional independence approach, which may capture better the dependency structure of over .

2.3 Extension of the conditional independence approach

We begin by noting that modelling with a Negative-Binomial distribution is equivalent to modelling as an over-dispersed Poisson quantity:


In this form we can consider the variance of as the sum of the variance of the Poisson component and the variance of the Gamma component. A Gamma component which contributes more to the total variance corresponds to a lower value for the Negative-Binomial shape parameter and vice-versa. In the GLM framework we assume that both the Poisson and Gamma quantities are conditionally-independent across the delay indices . Noting that in Bayesian hierarchical modelling the Gamma component is often approximated by a Log-Normal component, where the mean at the log-level includes an identically distributed Normal random effect, one approach to modelling conditional covariance between multivariate counts is to model the Poisson mean as a Multivariate-Log-Normal quantity (Aitchison and Ho, 1989):


In this framework, which we refer to as the “GLM+ framework”, the partial counts are still independent given . However, at least some of the total covariance can be described explicitly by the Multivariate-Normal covariance structure. The implication of this is that the model may be better able to capture the covariance structure of the , and consequently the variance of the total counts , compared to the GLM framework.

In the following section, we present a general framework based on the Multinomial mixture approach, which retains the desirable merits of jointly modelling as well as the necessary flexibility to capture variability in the spatio-temporal and delay structures.

3 Generalized-Dirichlet-Multinomial Framework

Recall that denotes the true count occurring at temporal unit and in spatial unit and that denotes the observed count corresponding to with delay . We begin by defining a Negative-Binomial model for the true counts:


with the same as in Section 2. Modelling directly (as opposed to indirectly using the GLM), reduces the risk that Var() will not be captured well. However in order to make predictions about which have not yet been fully observed, we also need a model for the delayed counts (which should provide partial information on the unobserved ):


Unlike the GLM approach, modelling the in this way implies they are not assumed to be conditionally independent. In the simplest formulation of this framework, the are not random but fixed, given any spatio-temporal structures or relevant covariates. However, this carries the risk of falsely confounding variability in the delay mechanism with variability in the true count when now-casting. We illustrate this by considering that the predictive distribution for unobserved totals , conditional on partial counts , is given by:


The issue is that is Multinomial, which lacks flexibility in the variance as, conditional on , both the mean and variance are defined wholly by . As such, if there is excess variability (over-dispersion) in , this is likely to be erroneously absorbed by . For example, if is too high for the Multinomial to reasonably capture given , predictions of may be too high when now-casting. Moreover, if both the mean and correlation structure are exclusively defined by , then this limits flexibility in capturing unusual covariance structures.

Both of these issues can be addressed by modelling as a distribution, the probability density function of which is:


The resulting marginal model can be obtained by integrating out to obtain a Generalized-Dirichlet-Multinomial or GDM mixture distribution for , with probability mass function:


To be useful as a tool for nowcasting and forecasting, the model needs to be able to provide inference for conditional on any corresponding which have been observed (as well as any preceding which have been observed by the time of prediction). In a Markov Chain Monte Carlo implementation framework (such as the one used here) this is possible by sampling the corresponding which have not yet been observed as well as the unobserved . However, to do the former we need to be able to sample from the conditional distributions . Fortunately, we can do this by defining and implementing the model in terms of the conditional structure of the GDM:


To model structured variability in the delay mechanism, it makes sense to re-parametrise the Beta-Binomial in terms of its mean and dispersion parameter , which relate to the parameters of the GDM by:


The intuition behind this characterisation is that, having already observed some delayed counts corresponding to the true count , then represents the proportion of the remaining (so far unreported) counts we expect to be reported in the next delay step . Variability about is controlled by the dispersion parameter . Both the mean and dispersion parameters can be generally characterised as functions of space, time and delay:


In contrast to the GLM approach, predictive inference for the unobserved is based on both the delayed counts and previous observed values for . In practice, using MCMC for model inference automatically generates predictive samples of the unobserved from . Furthermore, the delay mechanism does not appear in the model for , meaning that associated variability does not propagate into the predictive inference for unobserved . In the subsequent section we will apply equivalent GDM, GLM and GLM+ models to dengue fever data, discussing their relative merits with respect to performance in model checking, now-casting and forecasting.

4 Case Study

Dengue fever is a viral infection, transmitted by mosquitoes, which has flu symptoms that may evolve into a potentially fatal condition known as severe dengue (World Health Organization, 2018). The disease causes a major burden for the population it affects, particularly in Brazil, which reports more dengue cases than any other country (Silva et al., 2016). Effective response to dengue requires early detection (World Health Organization, 2018), so it is important that healthcare providers are able to prepare themselves for a possible outbreak. Though the reporting of dengue cases to the Brazilian national surveillance system (SINAN) is mandatory (Silva et al., 2016), it can take weeks or even months of delay for the number of reported cases occurring in a given week to approach a final count. For this reason, statistical delayed-reporting models are used to correct delays and predict outbreaks before the total count is available (Bastos et al., 2017).

Figure 2: Bar plot of Rio de Janeiro dengue cases in the weeks leading up to time . The grey bars represent the total (as yet unobserved) number of reported cases, while the coloured bars show the number of cases reported in each week after the cases occurred.
Figure 6: Total number of reported dengue cases from 2011 onwards in Rio de Janeiro. Different colours represent which data are fully observed, partially observed or unobserved at week (March 2013).

Here we consider data on dengue fever cases in Rio de Janeiro, Brazil, occurring in weeks (week commencing the 3rd of January 2011) to (week commencing the 15th of April 2013). For illustration, we assume that present day is week (week commencing the 4th of March 2013). Furthermore, we consider the total observable count to be the number of cases observed after 6 months (26 weeks) worth of data (in addition to the number of cases reported in the week of occurrence). With present day being week , this implies we have 88 weeks of fully observed total counts . Total counts occurring in weeks to are only partially observed and must be predicted based on the partial observations (now-casting). Total counts after present day () have not yet occurred and so they are completely unobserved. This is the forecasting period.

The time series of counts is illustrated in Figure 6, with different colours corresponding to the three different periods. There is some evidence of seasonality in the data, with outbreaks starting around the beginning of the calendar year and ending approximately 6 months later. This reflects the fact that the incidence of dengue fever is thought to depend heavily on the time of year and climatological conditions (Morales et al., 2016). We can also see some non-seasonal temporal structure, meanwhile, with the outbreak in 2012 being more severe than the one in 2011.

Figure 7: Proportion of dengue cases reported in the first week (the week in which they occurred, left), the second week (centre) and the third week (right).

The left panel in Figure 7 shows the proportion of dengue cases reported in the week they occurred (first week) plotted against time, while the middle and right panels show the proportion of cases reported a week after they occurred (second week) and the following (third) week, respectively. We can see strong evidence of temporal structure in the delay mechanism, with the average proportion reported in the first week steadily dropping throughout 2011, reaching its lowest point at the start of 2012 before beginning to rise again.

4.1 Formulation of competing models

We now model this data using three comparable models (in terms of flexibility and interpretation), namely the GDM, GLM and GLM+. Modelling every partial count (in this case all 27 weeks) will result in the greatest predictive precision, though this comes at a high computational cost. Instead, if the total is almost entirely observed after delay steps, it may be more pragmatic to model only counts up to as well as the sum of the remaining counts . In the GDM approach this is achieved by only including the conditional models for the first partial counts, such that the remainder is modelled implicitly, while in the GLM and GLM+ approaches this can be achieved by modelling in the same way as the individual counts.

Figure 5: Total number of reported dengue cases from 2011 onwards in Rio de Janeiro. Different colours represent which data are fully observed, partially observed or unobserved at week (March 2013).
Figure 10: Quantiles of the proportions of fully observed () total dengue cases covered by after each additional week of data.

One way to make this decision is to consider the proportions of each observed reported after each delay step. Figure 10 shows the 20%, 40%, 60%, and 80% quantiles of the proportions of the total dengue cases reported after each delay step. By looking at the 20% quantiles of these proportions we can see that the vast majority (over 80%) of total dengue cases are covered after weeks worth of data 80% of the time, with little to be gained unless many more weeks are considered. For this reason we choose to model only the first 8 weeks individually.

The model based on the GDM framework is defined by:


Where and are the expectations and dispersions parameters of the Beta-Binomial conditional distributions, as described in (22)-(26).

The model based on the GLM framework is defined by:


The model based on the GLM+ framework is defined by:


In all models is a penalized cyclic cubic spline (Wood, 2017) defined over weeks , which represents the effect of the time of year on the total number of reported dengue cases, and is a penalized cubic spline defined over the whole temporal range. The latter is designed to capture non-seasonal temporal structure in the rate of total reported dengue cases and is constrained to be linear beyond the final knot so that it can be used for forecasting. The effects are defined by a different penalized cubic spline (each with its own smoothness penalty) for each delay index , intended to capture temporal changes in the delay mechanism over time. As discussed in Wood (2016), the coefficients of each spline are assigned a Multivariate-Normal prior distribution and are penalized to prevent excessive wiggliness through an unknown penalty parameter (the scaling factor of the Multivariate-Normal prior precision matrix). The re-parametrisation is potentially more interpretable for the purpose of specifying a prior distribution, where smaller values of correspond to a stricter penalty on how flexible the smooth function is. The splines are centred to have zero-mean, and as such the models include the fixed effects and as intercepts.

The Negative-Binomial dispersion parameters ( and ) were assigned relatively non-informative Exponential() prior distributions. The GDM dispersion parameters were assigned Log-Normal() prior distributions, such that most of the prior density is over values of which result in a modest contribution from the Generalized-Dirichlet component to the overall variance of the GDM, without ruling out higher values which correspond to a Multinomial situation. Relatively non-informative Normal prior distributions were specified for the global intercept parameter and also for the delay-specific intercept parameters . In the GDM model, the intercept parameters represent the means of relative proportions at the logistic level. For these parameters we specified Normal prior distributions with the means chosen so that the prior mode implies approximately equal amount of cases being reported in each week of delay, with the variance chosen so that they are relatively non-informative. We specified Half-Normal prior distributions for the penalty parameters for splines and . This imposes a relatively strong smoothness penalty on the effects and , which are supposed to capture medium-to-long term trends in the incidence of dengue cases. We relaxed this penalty slightly for the effects by specifying weaker Half-Normal priors. Finally, for the Multivariate-Normal covariance of in the GLM+ model, we specified a fairly weak Inverse-Wishart prior with an identity scale matrix (dimension ) and degrees of freedom.

All code was written and executed using R (R Core Team (2018)) and all three models were implemented using NIMBLE (de Valpine et al., 2017), a facility for highly flexible implementation of MCMC. The model matrices for the splines were set up using the package jagam (Wood, 2016). Four MCMC chains were run from different initial values and with different random number generator seeds, until convergence criteria were met. We discuss how we assessed convergence of the chains to the posterior in the Appendix A.

4.2 Results

To compare the models we will begin by exploring which aspects of the results are similar. Figure 11 shows the posterior mean predicted temporal effect () as well as the seasonal effect () from the GDM, GLM and GLM+ models, with associated 95% credible intervals, on the incidence rate dengue cases (at the log-scale). Both effects are very similar in shape between the three models: in the left panel we can see that all models suggest a persistent increase in dengue incidence in 2012, which makes sense given the more severe outbreak shown in Figure 6, while the right panel shows a strong seasonal effect in all models, with a much higher incidence rate in the first half of the year than the second. Interestingly the seasonal effect is less certain, though still strong, for the GDM model compared to the GLM and GLM+ models. Given that there are only approximately two years of fully observed data, the uncertainty in the GDM model’s seasonal effect seems more reasonable.

Figure 11: Posterior mean temporal () and seasonal () spline effects on the incidence rate of dengue cases, from the GDM, GLM and GLM+ models, with associated 95% credible intervals.

Similarly, Figure 14 shows that, although not perfectly comparable because the models use different link functions (logistic for GDM and log for GLM and GLM+), the temporal effects on the number of cases reported in the first week are very similar between the three models. For example, all three models show a distinct drop in proportion of cases reported in the first week during the 2012 outbreak.

Figure 9: Quantiles of the proportions of fully observed () total dengue cases covered by after each additional week of data.
Figure 14: Posterior mean delay spline effect corresponding to counts reported in the first week , from the GDM, GLM and GLM+ models, with associated 95% credible intervals.

We now move on to ways in which the models differ. Recall from Section 2, that in the GLM framework, capturing the distribution of the true counts well relies on a potentially restrictive assumption that the delayed counts are conditionally independent. In contrast, by modelling the total counts explicitly, the GDM framework has more flexibility to capture their distribution well. Similarly, the addition of a covariance model in the GLM+ framework means that it may be able to capture the covariance of the partial counts , and consequently the variance of the total counts, better than the GLM framework.

We use in-sample posterior predictive checking (Gelman et al., 2014) to the fit of the models to the data. This is done by simulating replicates of the observed partial counts and the fully observed (weeks 1-88) total dengue counts from the respective predictive distributions. We can then see if particular statistics of the observed data are captured well, by comparing them to the distribution obtained by computing the corresponding statistics of the replicates.

We begin by looking at the covariance of the partial counts and the covariance of the proportion reported in each week . For each sets of replicates, we compute the sample covariance of these two quantities, resulting in a distribution of samples for each individual covariance and . The left column of Figure 17 shows the mean difference between the replicate covariances and the observed covariances, while the right column shows the mean squared difference between the replicate covariances and the observed covariances. For both the covariance of the partial counts and the covariance of the proportion reported in each week, we can see that the GDM model is the least biased (potentially even unbiased for the proportion reported in each week) and the most precise (lowest mean squared error).

Figure 13: Posterior mean delay spline effect corresponding to counts reported in the first week , from the GDM, GLM and GLM+ models, with associated 95% credible intervals.
Figure 17: Density plots of the mean bias (left column) and the logarith of the mean squared error (right column) of the covariance of the partial counts and the proportion reported in each week .

Similarly, the left and central panels of Figure 18 show density estimates of the distribution of the sample mean and the sample variance, respectively, of the replicate total counts We can see that in both cases the observed statistic, represented by a vertical line, is captured best by the GDM model, with the GLM faring better than the GLM+ model. This is a surprising result, given that the GLM+ has more flexibility than the GLM to capture the covariance structure of the partial counts . The right panel of Figure 18 shows posterior means of the sorted replicates, with 95% prediction interval. In this plot we can clearly see that, while the distribution of the total counts is captured best by the GDM and adequately well by the GLM, the GLM+ has an excessively heavy upper tail, compared to the data. This difference is likely because in the Poisson-Log-Normal mixture the logarithm of the Poisson mean is symmetric, compared to negatively skewed in the Poisson-Gamma mixture.

Figure 18: The left and central panels show density plots of the sample mean and sample variance of the posterior replicates of the fully observed (weeks 1-104) total dengue cases () from the GDM and GLM models. The vertical lines represent the corresponding statistics from the observed data. The right panel shows the mean of replicates of the total dengue cases , from the GDM and GLM models, with associated 95% posterior predictive intervals.

Recall that two important uses of delayed-reporting models are the prediction of total counts which have occurred but haven’t yet been fully observed (nowcasting) and the prediction of total counts which have not yet occurred (forecasting). In this case study we imagine we are in week 114 and we would like to predict the number of dengue cases in recent weeks (e.g. ) as well as to predict dengue cases over the next 6 weeks. Figure 19 shows the posterior median predicted number of dengue cases from the three models, with associated 95% posterior predictive intervals. We can see that, whilst the median predictions from all three models are virtually identical, the model with the least predictive uncertainty, in both the now-casting range and forecasting range, is the GDM, making the GDM forecast potentially most useful to decision-makers. Notably, the GLM+ is far closer to the GDM in terms of certainty than the GLM, suggesting our extension may have improved now-casting and forecasting precision. However, we would consider the GDM’s quantification of uncertainty more trustworthy, given its favourable results in the in-sample predictive checking.

Figure 19: Posterior median predictions of the unobserved total dengue cases , from the GDM, GLM and GLM+ models, with associated 95% posterior predictive intervals. Predictions beyond week are forecasting without any observed partial counts .

4.3 Comparison with other approaches

By this point we have demonstrated several ways, for this data, in which the GDM framework improves over the GLM framework our own extension of it, the GLM+ framework. It remains to show that the increased flexibility of the GDM over other approaches discussed in Secton 2 leads to tangible improvements in this example. Recall that one method presented by Höhle and Heiden (2014) and others, is to treat the parameters of the Generalized-Dirichlet component as stationary in time. As we saw in Figure 7, there is substantial variation over time in the proportion of dengue cases reported in the first week. This structure would not be captured by assuming time-stationarity in the Generalized-Dirichlet model, inevitably leading to poorer nowcasting and forecasting performance.

An alternative suggestion was to model the proportion of cases reported at each delay level in each week using a conventional Multinomial logistic regression, removing the additional variability provided by the Generalized-Dirichlet component.

Figure 20: Posterior median proportion, from the GDM model, of dengue cases reported in the first (left) and second (right) weeks after incidence, with associated 95% credible intervals. Also shown are 95% posterior predictive intervals of the proportion reported in the first and second weeks from the GDM model with and without the additional variance from the Generalized-Dirichlet layer.

One way to assess the contribution of the GD variance is to simulate posterior replicates of the proportion reported in each week of delay () both from the GDM using the posterior samples for the dispersion parameters and again from the same model but in the limiting case when , such that the joint conditional distribution of is Multinomial. Figure 20 shows 95% posterior predictive distributions for the proportion of dengue cases reported after 1 (left) and 2 (right) weeks of delay for both the model with GD variance and without. We can see that without the GD variance an excessively high number of points are not captured by the prediction intervals. Also shown are the 95% prediction interval coverages: the proportion of observations which lie within their corresponding 95% prediction intervals. The coverages with the GD variance are just over 95%, indicating a good fit to this data, while less than two-thirds of points are covered without the GD variance.

5 Under-reporting

An added challenge that occurs in data that are subject to reporting delay is that, in some situations, the final observed total count may still be a (substantial) under-estimate of the true count. In disease surveillance, this may translate to many cases never being reported, leading to a biased understanding (underestimation) of the actual magnitude of outbreaks. For instance, although reporting of dengue cases to the national surveillance system (SINAN) is mandatory, research suggests that the reported total may be substantially lower than the true number of dengue cases, owing to issues such as patients not seeking healthcare (Silva et al., 2016).

To address this, the GDM framework can be adapted to allow for under-reporting. In particular, it can be integrated into the hierarchical framework for under-reporting presented in Stoner et al. (2019). Suppose that in addition to the partial counts and the total counts there exist unobserved true counts such that . Then the complete model for delayed reporting and under-reporting is given by:


such that now represents the incidence rate of the true count (as opposed to the total observed count ) and represents the reporting rate. As illustrated in Stoner et al. (2019), both covariates and random effects can be used to model this reporting rate at the logistic level, represented by the generic function in (38).

Without any observations for the true count , the model is not identifiable between a high reporting rate and a low incidence rate or vice versa, but this can be resolved through the use at least one informative prior (such as for the overall reporting rate across the whole time series, as discussed in Stoner et al. (2019)).

Using this approach means that policy and intervention can be based on predictions for the true number of cases, taking into account both the delayed reporting and under-reporting mechanisms, to reduce the risk of an undersized response. Note further, that allowing for under-reporting in the total count would be much less straightforward using the GLM and GLM+ approaches, mainly because the totals are not modelled explicitly.

6 Discussion

In this article we have introduced the problem of delayed-reporting and its implications for society. We explained that it is a problem based around prediction, providing a motivation for a statistical approach to the problem. We explored several existing approaches, focusing on (a) approaches based on a Multinomial mixture distribution with either a time stationary Generalized-Dirichlet distribution or a logistic regression and (b) the conditional independence (GLM) approach. Both approaches are very flexible, in terms of incorporating complex spatio-temporal structures. However, we argue that they both have limitations: The approaches based on a Multinomial mixture are not sufficiently flexible to simultaneously capture delay mechanisms which are both heterogeneous in time and over-dispersed, with respect to the Multinomial variance. The GLM approach, on the other hand, does not explicitly model the total counts. This means it relies on capturing the covariance structure of the partial counts well in order to capture the distribution of the total counts well. This is hindered by the assumption that the partial counts are independent, conditional on any covariate or random effects. To potentially address this, we proposed an extension to this approach (which we refer to as the GLM+) which includes an explicit covariance model for the partial counts, with the aim of better capturing the distribution of the total counts.

We have proposed a general framework based on a Generalized-Dirichlet-Multinomial mixture, where the true total counts are explicitly modelled (unlike the GLM) and where the Multinomial probabilities are a Generalized-Dirichlet whose parameters may vary in space and time. We presented a case study of data on reported dengue fever cases in Rio de Janeiro. In-sample predictive model checking was used to assess the models with respect to how well the distribution of the total number of cases was captured. Out-of-sample predictive checking was also used to assess performance when nowcasting and forecasting. We found that in every test the GDM has the strongest performance, even compared to the GLM+ model which, despite potentially having the most general covariance structure of the three models, was hindered by having an excessively heavy upper tail.

In addition to considering the performance of each model for the particular data set, it is also important to consider other reasons why one might be preferable over the others. The GLM model, for instance, is by far the easiest to implement, especially in a non-Bayesian setting such as the Generalized Additive Model framework or in an approximate Bayesian setting such as INLA. The GDM, however, lends itself more to a full Bayesian treatment, where Markov Chain Monte Carlo (MCMC) is used, compared to the other frameworks. This is because the effects associated with the true total count and the effects associated with the delay mechanism are separated into different parts of the model and are related to different parts of the data (the total counts and the partial counts, respectively). In the GLM and GLM+ frameworks, meanwhile, all of the effects are in the same model and they end up competing with each other. For this reason, it is possible to obtain a higher effective number of posterior samples per second with the GDM model.

In our view, the GDM framework is the most interpretable of all of the frameworks discussed here. This is because the delay mechanism, and any associated variability, is completely separated from the process which generates total counts. This makes in turn it easier to adapt the model for a given data problem. For example, we can see in Figure 20 that the variability in the proportion of cases reported in the first week decreases in the middle of the time series. To capture this, it is a fairly trivial modification to model the logarithm of the dispersion parameters , as defined in (26), using a penalized spline in time. Knowing that variability in the delay mechanism at a certain time is likely to be lower or higher than usual could further improve now-casting precision. Whilst it would be possible to model the Negative-Binomial dispersion parameters as time-varying in the GLM and GLM+ frameworks, there is no equivalent way of separating temporal structure in the variance of the total counts, from structure in the variance of the delay mechanism, as is possible in the GDM framework. On the same theme of adaptability, the GDM framework can also be easily integrated into a hierarchical framework for correcting under-reporting, which may be essential in scenarios where the final observed total count is still a substantial under-representation of the true count. In such situations, allowing for both the delay mechanism and the under-reporting mechanism simultaneously may be crucial for well-informed decision making.

Appendix A Convergence of MCMC Chains

For each model, convergence of the four chains was assessed by visual inspection of trace plots and by computing the Multivariate Potential Scale Reduction Factor (MPSRF) (Brooks and Gelman, 1998) of a selection of model parameters. This compares the variance between the chains to the variance within the chains. If the two variances are similar then this typically results in an MPSRF of less than 1.05. Starting from different initial values and obtaining an MPSRF of less than 1.05 gives the best indication that the chains have converged to the posterior distribution.

  • For the GDM model, we computed the MPSRF of every 10th (), , every 10th and the . The model was run for a total of 400k iterations, discarding the first 200k as burn-in and thinning by 20 to save memory. The MPSRF was computed to be 1.05 indicating that the model had converged.

  • For the GLM model, we computed the MPSRF of every 10th and the . The model was run for a total of 800k iterations, discarding the first 400k as burn-in and thinning by 40 to save memory. The MPSRF was computed to be 1.04.

  • For the GLM+ model, we computed the MPSRF of every 10th . The model was run for a total of 800k iterations, discarding the first 400k as burn-in and thinning by 40 to save memory. The MPSRF was computed to be 1.02


  • Aitchison and Ho (1989) Aitchison, J. and C. H. Ho (1989, 12). The multivariate Poisson-log normal distribution. Biometrika 76(4), 643–653.
  • Barbosa and Struchiner (2002) Barbosa, M. T. S. and C. J. Struchiner (2002, 02). The estimated magnitude of AIDS in Brazil: a delay correction applied to cases with lost dates. Cadernos de Saude Publica 18, 279–285.
  • Bastos et al. (2017) Bastos, L., T. Economou, G. M., V. D., T. Bailey, and C. Codeço (2017, sep). Modelling reporting delays for disease surveillance data. Accessed: 2019-02-14.
  • Brooks and Gelman (1998) Brooks, S. P. and A. Gelman (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics 7(4), 434–455.
  • de Valpine et al. (2017) de Valpine, P., D. Turek, C. J. Paciorek, C. Anderson-Bergman, D. T. Lang, and R. Bodik (2017). Programming with models: Writing statistical algorithms for general model structures with nimble. Journal of Computational and Graphical Statistics 26(2), 403–413.
  • Dobson and Barnett (2018) Dobson, A. and A. Barnett (2018). An Introduction to Generalized Linear Models. Chapman & Hall/CRC Texts in Statistical Science. CRC Press.
  • England and Verrall (2002) England, P. and R. Verrall (2002). Stochastic claims reserving in general insurance. British Actuarial Journal 8(3), 443–518.
  • Gelman et al. (2014) Gelman, A., J. Carlin, H. Stern, D. Dunson, A. Vehtari, and D. Rubin (2014, November). Bayesian Data Analysis, Third Edition (Chapman and Hall/CRC Texts in Statistical Science) (Third ed.). London: Chapman and Hall/CRC.
  • Höhle and Heiden (2014) Höhle, M. and M. a. d. Heiden (2014, 6). Bayesian nowcasting during the stec o104:h4 outbreak in germany, 2011. Biometrics 70(4), 993–1002.
  • Lindgren and Rue (2015) Lindgren, F. and H. Rue (2015). Bayesian spatial modelling with r-inla. Journal of Statistical Software, Articles 63(19), 1–25.
  • Mack (1993) Mack, T. (1993). Distribution-free calculation of the standard error of chain ladder reserve estimates. ASTIN Bulletin 23(2), 213–225.
  • Morales et al. (2016) Morales, I., H. Salje, S. Saha, and E. S. Gurley (2016). Seasonal distribution and climatic correlates of dengue disease in dhaka, bangladesh. The American Journal of Tropical Medicine and Hygiene 94(6), 1359–1361.
  • R Core Team (2018) R Core Team (2018). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
  • Renshaw and Verrall (1998) Renshaw, A. E. and R. J. Verrall (1998). A stochastic model underlying the chain-ladder technique. British Actuarial Journal 4(4), 903–923.
  • Salmon et al. (2015) Salmon, M., D. Schumacher, K. Stark, and M. Höhle (2015). Bayesian outbreak detection in the presence of reporting delays. Biometrical Journal 57(6), 1051–1067.
  • Silva et al. (2016) Silva, M. M. O., M. M. de Souza Rodrigues, I. A. D. Paploski, M. Kikuti, A. M. Kasper, J. S. Cruz, T. L. Queiroz, A. S. Tavares, P. M. Santana, J. M. G. Araújo, A. I. Ko, M. G. Reis, and G. S. Ribeiro (2016). Accuracy of dengue reporting by national surveillance system, brazil. Emerging infectious diseases 22 2, 336–9.
  • Stoner et al. (2019) Stoner, O., T. Economou, and G. Drummond Marques da Silva (2019). A hierarchical framework for correcting under-reporting in count data. Journal of the American Statistical Association.
  • Wang et al. (2018) Wang, X., M. Zhou, J. Jia, Z. Geng, and G. Xiao (2018). A bayesian approach to real-time monitoring and forecasting of chinese foodborne diseases. International Journal of Environmental Research and Public Health 15(8).
  • Wong (1998) Wong, T.-T. (1998). Generalized dirichlet distribution in bayesian analysis. Applied Mathematics and Computation 97(2), 165 – 181.
  • Wood (2016) Wood, S. (2016). Just another gibbs additive modeler: Interfacing jags and mgcv. Journal of Statistical Software, Articles 75(7), 1–15.
  • Wood (2017) Wood, S. N. (2017, May). Generalized Additive Models: An Introduction with R, Second Edition (Chapman and Hall/CRC Texts in Statistical Science) (Second ed.). London: Chapman and Hall/CRC.
  • World Health Organization (2018) World Health Organization (2018, September). WHO dengue and severe dengue fact sheet. Accessed: 2019-02-14.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description