Decentralized Data Fusion and Active Sensing with Mobile Sensors for Modeling and Predicting Spatiotemporal Traffic Phenomena
Abstract
The problem of modeling and predicting spatiotemporal traffic phenomena over an urban road network is important to many traffic applications such as detecting and forecasting congestion hotspots. This paper presents a decentralized data fusion and active sensing (D^{2}FAS) algorithm for mobile sensors to actively explore the road network to gather and assimilate the most informative data for predicting the traffic phenomenon. We analyze the time and communication complexity of D^{2}FAS and demonstrate that it can scale well with a large number of observations and sensors. We provide a theoretical guarantee on its predictive performance to be equivalent to that of a sophisticated centralized sparse approximation for the Gaussian process (GP) model: The computation of such a sparse approximate GP model can thus be parallelized and distributed among the mobile sensors (in a Google-like MapReduce paradigm), thereby achieving efficient and scalable prediction. We also theoretically guarantee its active sensing performance that improves under various practical environmental conditions. Empirical evaluation on real-world urban road network data shows that our D^{2}FAS algorithm is significantly more time-efficient and scalable than state-of-the-art centralized algorithms while achieving comparable predictive performance.
Jie Chen^{†}, Kian Hsiang Low^{†}, Colin Keng-Yan Tan^{†}, Ali Oran^{§}
Dept. Computer Science, National University of Singapore^{†}
Singapore-MIT Alliance for Research and Technology^{§}
Republic of Singapore

Patrick Jaillet^{‡}, John Dolan^{¶}, Gaurav Sukhatme^{∗}
Dept. Electrical Engineering and Computer Science, MIT^{‡}
The Robotics Institute, Carnegie Mellon University^{¶}
Dept. Computer Science, University of Southern California^{∗}
USA
1 Introduction
Knowing and understanding the traffic conditions and phenomena over road networks has become increasingly important to the goal of achieving smooth-flowing, congestion-free traffic, especially in densely-populated urban cities. According to a 2011 urban mobility report (Schrank et al., 2011), traffic congestion in the USA has caused 1.9 billion gallons of extra fuel, 4.8 billion hours of travel delay, and $101 billion of delay and fuel cost. Such huge resource wastage can potentially be mitigated if the spatiotemporally varying traffic phenomena (e.g., speeds and travel times along road segments) are predicted accurately enough in real time to detect and forecast the congestion hotspots; network-level (e.g., ramp metering, road pricing) and user-level (e.g., route replanning) measures can then be taken to relieve this congestion, so as to improve the overall efficiency of road networks.
In practice, it is nontrivial to achieve real-time, accurate prediction of a spatiotemporally varying traffic phenomenon because the quantity of sensors that can be deployed to observe an entire road network is cost-constrained. Traditionally, static sensors such as loop detectors (Krause et al., 2008a; Wang and Papageorgiou, 2005) are placed at designated locations in a road network to collect data for predicting the traffic phenomenon. However, they provide sparse coverage (i.e., many road segments are not observed, thus leading to data sparsity), incur high installation and maintenance costs, and cannot reposition by themselves in response to changes in the traffic phenomenon. Low-cost GPS technology allows the collection of traffic data using passive mobile probes (Work et al., 2010) (e.g., taxis/cabs). Unlike static sensors, they can directly measure the travel times along road segments. But, they provide fairly sparse coverage due to low GPS sampling frequency (i.e., often imposed by taxi/cab companies) and no control over their routes, incur high initial implementation cost, pose privacy issues, and produce highly-varying speeds and travel times while traversing the same road segment due to inconsistent driving behaviors. A critical mass of probes is needed on each road segment to ease the severity of the last drawback (Srinivasan and Jovanis, 1996) but is often hard to achieve on non-highway segments due to sparse coverage. In contrast, we propose the use of active mobile probes (Turner et al., 1998) to overcome the limitations of static and passive mobile probes. In particular, they can be directed to explore any segments of a road network to gather traffic data at a desired GPS sampling rate while enforcing consistent driving behavior.
How then do the mobile probes/sensors actively explore a road network to gather and assimilate the most informative observations for predicting the traffic phenomenon? There are three key issues surrounding this problem, which will be discussed together with the related works:
Models for predicting spatiotemporal traffic phenomena. The spatiotemporal correlation structure of a traffic phenomenon can be exploited to predict the traffic conditions of any unobserved road segment at any time using the observations taken along the sensors’ paths. To achieve this, existing Bayesian filtering frameworks (Chen et al., 2011; Wang and Papageorgiou, 2005; Work et al., 2010) utilize various hand-crafted parametric models of traffic flow along a highway stretch that correlate only adjacent highway segments. So, their predictive performance is compromised when the current observations are sparse and/or the actual spatial correlation spans multiple segments. Their strong Markov assumption further exacerbates this problem. It is also not shown how these models can be generalized to work for arbitrary road network topologies and more complex correlation structures. Existing multivariate parametric traffic prediction models (Kamarianakis and Prastacos, 2003; Min and Wynter, 2011) do not quantify uncertainty estimates of the predictions and impose rigid spatial locality assumptions that do not adapt to the true underlying correlation structure.
In contrast, we assume the traffic phenomenon over an urban road network (i.e., comprising the full range of road types like highways, arterials, slip roads) to be realized from a rich class of Bayesian nonparametric models called the Gaussian process (GP) (Section 2) that can formally characterize its spatiotemporal correlation structure and be refined with a growing number of observations. More importantly, the GP can provide formal measures of predictive uncertainty (e.g., based on variance or entropy criterion) for directing the sensors to explore highly uncertain areas of the road network. Krause et al. (2008a) used a GP to represent the traffic phenomenon over a network of only highways and defined the correlation of speeds between highway segments to depend only on the geodesic (i.e., shortest path) distance of these segments with respect to the network topology; the features of these segments are not considered. Neumann et al. (2009) maintained a mixture of two independent GPs for flow prediction such that the correlation structure of one GP utilized road segment features while that of the other GP depended on manually specified relations (instead of geodesic distance) between segments with respect to an undirected network topology. Different from the above works, we propose a relational GP whose correlation structure exploits the geodesic distance between segments based on the topology of a directed road network with vertices denoting road segments and edges indicating adjacent segments weighted by dissimilarity of their features, hence tightly integrating the features and relational information.
Data fusion. The observations are gathered distributedly by each sensor along its path in the road network and have to be assimilated in order to predict the traffic phenomenon. Since a large number of observations are expected to be collected, a centralized approach to GP prediction cannot be performed in real time due to its cubic time complexity.
To resolve this, we propose a decentralized data fusion approach to efficient and scalable approximate GP prediction (Section 3). Existing decentralized and distributed Bayesian filtering frameworks for addressing non-traffic-related problems (Chung et al., 2004; Coates, 2004; Olfati-Saber, 2005; Rosencrantz et al., 2003; Sukkarieh et al., 2003) will face the same difficulties as their centralized counterparts described above if applied to predicting traffic phenomena, thus resulting in loss of predictive performance. Distributed regression algorithms (Guestrin et al., 2004; Paskin and Guestrin, 2004) for static sensor networks gain efficiency from spatial locality assumptions, which cannot be exploited by mobile sensors whose paths are not constrained by locality. Cortes (2009) proposed a distributed data fusion approach to approximate GP prediction based on an iterative Jacobi overrelaxation algorithm, which suffers from critical limitations: (a) the past observations taken along the sensors’ paths are assumed to be uncorrelated, which greatly undermines its predictive performance when they are in fact correlated and/or the current observations are sparse; (b) it converges very slowly when the number of sensors grows large; (c) it requires the range of positive correlation to be bounded by some factor of the communication range. Our proposed decentralized algorithm does not suffer from these limitations and can be computed exactly within efficient time bounds.
Active sensing. The sensors have to coordinate to actively gather the most informative observations for minimizing the uncertainty of modeling and predicting the traffic phenomenon. Existing centralized (Low et al., 2008, 2009a, 2011) and decentralized (Low et al., 2012; Stranders et al., 2009) active sensing algorithms scale poorly with a large number of observations and sensors. We propose a partially decentralized active sensing algorithm that overcomes these issues of scalability (Section 4).
This paper presents a novel Decentralized Data Fusion and Active Sensing (D^{2}FAS) algorithm (Sections 3 and 4) for sampling spatiotemporally varying environmental phenomena with mobile sensors. Note that the decentralized data fusion component of D^{2}FAS can also be used for static and passive mobile sensors. The practical applicability of D^{2}FAS is not restricted to traffic monitoring; it can be used in other environmental sensing applications such as mineral prospecting (Low et al., 2007), monitoring of ocean and freshwater phenomena (Dolan et al., 2009; Podnar et al., 2010; Low et al., 2009b, 2011, 2012) (e.g., plankton bloom, anoxic zones), forest ecosystems, pollution (e.g., oil spill), or contamination (e.g., radiation leak). The specific contributions of this paper include:

Analyzing the time and communication overheads of D^{2}FAS (Section 5): We prove that D^{2}FAS can scale better than existing state-of-the-art centralized algorithms with a large number of observations and sensors;

Theoretically guaranteeing the predictive performance of the decentralized data fusion component of D^{2}FAS to be equivalent to that of a sophisticated centralized sparse approximation for the GP model (Section 3): The computation of such a sparse approximate GP model can thus be parallelized and distributed among the mobile sensors (in a Google-like MapReduce paradigm), thereby achieving efficient and scalable prediction;

Theoretically guaranteeing the performance of the partially decentralized active sensing component of D^{2}FAS, from which various practical environmental conditions can be established to improve its performance;

Developing a relational GP model whose correlation structure can exploit both the road segment features and road network topology information (Section 2.1);

Empirically evaluating the predictive performance, time efficiency, and scalability of the D^{2}FAS algorithm on a real-world traffic phenomenon (i.e., speeds of road segments) dataset over an urban road network (Section 6): D^{2}FAS is more time-efficient and scales significantly better with increasing number of observations and sensors while achieving predictive performance close to that of existing state-of-the-art centralized algorithms.
2 Relational Gaussian Process Regression
The Gaussian process (GP) can be used to model a spatiotemporal traffic phenomenon over a road network as follows: The traffic phenomenon is defined to vary as a realization of a GP. Let V be a set of road segments representing the domain of the road network such that each road segment s\in V is specified by a p-dimensional vector of features and is associated with a realized (random) measurement z_{s} (Z_{s}) of the traffic condition such as speed if s is observed (unobserved). Let \{Z_{s}\}_{s\in V} denote a GP, that is, every finite subset of \{Z_{s}\}_{s\in V} follows a multivariate Gaussian distribution (Rasmussen and Williams, 2006). Then, the GP is fully specified by its prior mean \mu_{s}\triangleq\mathbb{E}[Z_{s}] and covariance \sigma_{ss^{\prime}}\triangleq\mbox{cov}[Z_{s},Z_{s^{\prime}}] for all s,s^{\prime}\in V. In particular, we will describe in Section 2.1 how the covariance \sigma_{ss^{\prime}} for modeling the correlation of measurements between all pairs of segments s,s^{\prime}\in V can be designed to exploit the road segment features and the road network topology.
A chief capability of the GP model is that of performing probabilistic regression: Given a set D\subset V of observed road segments and a column vector z_{D} of corresponding measurements, the joint distribution of the measurements at any set Y\subseteq V\setminus D of unobserved road segments remains Gaussian with the following posterior mean vector and covariance matrix
\mu_{Y|D}\triangleq\mu_{Y}+\Sigma_{YD}\Sigma_{DD}^{-1}(z_{D}-\mu_{D})  (1)

\Sigma_{YY|D}\triangleq\Sigma_{YY}-\Sigma_{YD}\Sigma_{DD}^{-1}\Sigma_{DY}  (2)
where \mu_{Y} (\mu_{D}) is a column vector with mean components \mu_{s} for all s\in Y (s\in D), \Sigma_{YD} (\Sigma_{DD}) is a covariance matrix with covariance components \sigma_{ss^{\prime}} for all s\in Y,s^{\prime}\in D (s,s^{\prime}\in D), and \Sigma_{DY} is the transpose of \Sigma_{YD}. The posterior mean vector \mu_{Y|D} (1) is used to predict the measurements at any set Y of unobserved road segments. The posterior covariance matrix \Sigma_{YY|D} (2), which is independent of the measurements z_{D}, can be processed in two ways to quantify the uncertainty of these predictions: (a) the trace of \Sigma_{YY|D} yields the sum of posterior variances \Sigma_{ss|D} over all s\in Y; (b) the determinant of \Sigma_{YY|D} is used in calculating the Gaussian posterior joint entropy
\mathbb{H}[Z_{Y}\,|\,Z_{D}]\triangleq\frac{1}{2}\log\left((2\pi e)^{|Y|}\left|\Sigma_{YY|D}\right|\right)\ .  (3)
In contrast to the first measure of uncertainty that assumes conditional independence between measurements in the set Y of unobserved road segments, the entropybased measure (3) accounts for their correlation, thereby not overestimating their uncertainty. Hence, we will focus on using the entropybased measure of uncertainty in this paper.
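For concreteness, (1)-(3) can be computed in a few lines of NumPy. The sketch below is illustrative only: the 1-D squared exponential kernel, inputs, and measurements are toy stand-ins (not the paper's traffic data or graph-based kernel), and an assumed noise term is added to \Sigma_{DD} for numerical stability:

```python
import numpy as np

def gp_posterior(k, x_d, z_d, x_y, mu=0.0, noise=0.1):
    """Posterior mean (1), covariance (2), and joint entropy (3) of a GP
    at unobserved inputs x_y given observations (x_d, z_d)."""
    s_dd = k(x_d, x_d) + noise * np.eye(len(x_d))
    s_yd = k(x_y, x_d)
    s_yy = k(x_y, x_y)
    mu_post = mu + s_yd @ np.linalg.solve(s_dd, z_d - mu)        # Eq. (1)
    cov_post = s_yy - s_yd @ np.linalg.solve(s_dd, s_yd.T)       # Eq. (2)
    _, logdet = np.linalg.slogdet(cov_post)
    h = 0.5 * (len(x_y) * np.log(2 * np.pi * np.e) + logdet)     # Eq. (3)
    return mu_post, cov_post, h

# Toy squared exponential kernel on 1-D inputs.
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)
x_d = np.array([0.0, 1.0, 2.5])    # observed inputs D
z_d = np.array([0.3, -0.1, 0.8])   # measurements z_D
x_y = np.array([1.5, 3.0])         # unobserved inputs Y
mu_post, cov_post, h = gp_posterior(k, x_d, z_d, x_y)
```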
2.1 GraphBased Kernel
If the observations are noisy (i.e., by assuming additive independent identically distributed Gaussian noise with variance \sigma^{2}_{n}), then their prior covariance \sigma_{ss^{\prime}} can be expressed as \sigma_{ss^{\prime}}=k(s,s^{\prime})+\sigma^{2}_{n}\delta_{ss^{\prime}} where \delta_{ss^{\prime}} is a Kronecker delta that is 1 if s=s^{\prime} and 0 otherwise, and k is a kernel function measuring the pairwise “similarity” of road segments. For a traffic phenomenon (e.g., road speeds), the correlation of measurements between pairs of road segments depends not only on their features (e.g., length, number of lanes, speed limit, direction) but also the road network topology. So, the kernel function is defined to exploit both the features and topology information, which will be described next.
Definition 1 (Road Network)
Let the road network be represented as a weighted directed graph G\triangleq(V,E,m) that consists of

a set V of vertices denoting the domain of all possible road segments,

a set E\subseteq V\times V of edges where there is an edge (s,s^{\prime}) from s\in V to s^{\prime}\in V iff the end of segment s connects to the start of segment s^{\prime} in the road network, and

a weight function m:E\rightarrow\mathbb{R}^{+} measuring the standardized Manhattan distance (Borg and Groenen, 2005) m((s,s^{\prime}))\triangleq\sum_{i=1}^{p}\left|[s]_{i}-[s^{\prime}]_{i}\right|/r_{i} of each edge (s,s^{\prime}) where [s]_{i} ([s^{\prime}]_{i}) is the ith component of the feature vector specifying road segment s (s^{\prime}), and r_{i} is the range of the ith feature. The weight function m serves as a dissimilarity measure between adjacent road segments.
The next step is to compute the shortest path distance d(s,s^{\prime}) between all pairs of road segments s,s^{\prime}\in V (i.e., using Floyd-Warshall or Johnson’s algorithm) with respect to the topology of the weighted directed graph G. Such a distance function is again a measure of dissimilarity, rather than one of similarity, as required by a kernel function. Furthermore, a valid GP kernel needs to be positive semidefinite and symmetric (Schölkopf and Smola, 2002), properties that are clearly violated by d (e.g., d is asymmetric over a directed graph).
To construct a valid GP kernel from d, multidimensional scaling (Borg and Groenen, 2005) is applied to embed the domain of road segments into the p^{\prime}-dimensional Euclidean space \mathbb{R}^{p^{\prime}}. Specifically, a mapping g:V\rightarrow\mathbb{R}^{p^{\prime}} is determined by minimizing the squared loss g^{\ast}=\mathop{\arg\min}_{g}\sum_{s,s^{\prime}\in V}(d(s,s^{\prime})-\|g(s)-g(s^{\prime})\|)^{2}. With a small squared loss, the Euclidean distance \|g^{\ast}(s)-g^{\ast}(s^{\prime})\| between g^{\ast}(s) and g^{\ast}(s^{\prime}) is expected to closely approximate the shortest path distance d(s,s^{\prime}) between any pair of road segments s and s^{\prime}. After embedding into Euclidean space, a conventional kernel function such as the squared exponential one (Rasmussen and Williams, 2006) can be used:
k(s,s^{\prime})=\sigma_{s}^{2}\exp\left(-\frac{1}{2}\sum_{i=1}^{p^{\prime}}\left(\frac{[g^{\ast}(s)]_{i}-[g^{\ast}(s^{\prime})]_{i}}{\ell_{i}}\right)^{2}\right)
where [g^{\ast}(s)]_{i} ([g^{\ast}(s^{\prime})]_{i}) is the ith component of the p^{\prime}-dimensional vector g^{\ast}(s) (g^{\ast}(s^{\prime})), and the hyperparameters \sigma_{s}, \ell_{1},\ldots,\ell_{p^{\prime}} are, respectively, the signal variance and length-scales that can be learned using maximum likelihood estimation (Rasmussen and Williams, 2006). The resulting kernel function k is guaranteed to be valid.¹

¹For spatiotemporal traffic modeling, the kernel function k can be extended to account for the temporal dimension.
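The construction can be sketched end-to-end as follows. This is a minimal, illustrative implementation under stated assumptions: the 3-segment graph and its edge weights are made up, classical MDS (via an eigendecomposition of the double-centered squared distances) stands in for the paper's squared-loss minimization, and the directed shortest-path distances are symmetrized before embedding, a simplification not prescribed by the paper:

```python
import numpy as np

def floyd_warshall(w):
    """All-pairs shortest path distances d(s, s') over a weighted directed
    graph; w[i, j] is the edge weight m((i, j)) (np.inf if no edge), w[i, i] = 0."""
    d = w.copy()
    for k in range(d.shape[0]):
        d = np.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    return d

def classical_mds(d, p_prime):
    """Embed the segments into R^{p'} so Euclidean distances approximate d.
    Classical MDS approximately minimizes the squared loss; the directed
    distances are symmetrized first (an illustrative choice)."""
    d_sym = 0.5 * (d + d.T)
    n = d_sym.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d_sym ** 2) @ j          # double-centered squared distances
    vals, vecs = np.linalg.eigh(b)
    top = np.argsort(vals)[::-1][:p_prime]   # largest eigenvalues first
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

def sq_exp_kernel(g, signal_var, lengthscales):
    """Squared exponential kernel on the embedded segments g (n x p')."""
    diff = (g[:, None, :] - g[None, :, :]) / lengthscales
    return signal_var * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

# Toy 3-segment cycle with made-up standardized Manhattan edge weights.
inf = np.inf
w = np.array([[0.0, 1.0, inf],
              [inf, 0.0, 2.0],
              [1.5, inf, 0.0]])
g = classical_mds(floyd_warshall(w), p_prime=2)
k_mat = sq_exp_kernel(g, signal_var=1.0, lengthscales=np.array([1.0, 1.0]))
```

The resulting k_mat is symmetric and positive semidefinite by construction, as a valid GP kernel requires.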
2.2 Subset of Data Approximation
Although the GP is an effective predictive model, it faces a practical limitation of cubic time complexity in the number |D| of observations; this can be observed from computing the posterior distribution (i.e., (1) and (2)), which requires inverting the covariance matrix \Sigma_{DD} and incurs \mathcal{O}(|D|^{3}) time. If |D| is expected to be large, GP prediction cannot be performed in real time. For practical usage, we have to resort to computationally cheaper approximate GP prediction.
A simple method of approximation is to select only a subset U of the entire set D of observed road segments (i.e., U\subset D) to compute the posterior distribution of the measurements at any set Y\subseteq V\setminus D of unobserved road segments. Such a subset of data (SoD) approximation method produces the following predictive Gaussian distribution, which closely resembles that of the full GP model (i.e., by simply replacing D in (1) and (2) with U):
\mu_{Y|U}=\mu_{Y}+\Sigma_{YU}\Sigma_{UU}^{-1}(z_{U}-\mu_{U})  (4)

\Sigma_{YY|U}=\Sigma_{YY}-\Sigma_{YU}\Sigma_{UU}^{-1}\Sigma_{UY}\ .  (5)
Notice that the covariance matrix \Sigma_{UU} to be inverted only incurs \mathcal{O}(|U|^{3}) time, which is independent of |D|.
The predictive performance of SoD approximation is sensitive to the selection of subset U. In practice, random subset selection often yields poor performance. This issue can be resolved by actively selecting an informative subset U in an iterative greedy manner: Firstly, U is initialized to be an empty set. Then, all road segments in D\setminus U are scored based on a criterion that can be chosen from, for example, the works of (Krause et al., 2008b; Lawrence et al., 2003; Seeger and Williams, 2003). The highest-scored segment is selected for inclusion in U and removed from D. This greedy selection procedure is iterated until U reaches a predefined size. Among the various criteria introduced earlier, the differential entropy score (Lawrence et al., 2003) is reported to perform well (Oh et al., 2010); it is a monotonic function of the posterior variance \Sigma_{ss|U} (5), thus resulting in the greedy selection of a segment s\in D\setminus U with the largest variance in each iteration.
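A minimal sketch of this greedy procedure, assuming a toy positive definite covariance matrix in place of the learned traffic kernel: in each iteration it scores every remaining segment by its posterior variance given the segments chosen so far, and adds the highest-scored one.

```python
import numpy as np

def greedy_support(k_dd, m):
    """Greedy subset selection with the differential entropy score:
    each iteration adds the segment with the largest posterior variance
    given the segments already chosen (Eq. (5) with Y = {s})."""
    n = k_dd.shape[0]
    chosen, remaining = [], list(range(n))
    for _ in range(m):
        best, best_var = None, -np.inf
        for s in remaining:
            if chosen:
                kuu = k_dd[np.ix_(chosen, chosen)]
                kus = k_dd[np.ix_(chosen, [s])]
                var = k_dd[s, s] - float(kus.T @ np.linalg.solve(kuu, kus))
            else:
                var = k_dd[s, s]   # prior variance when U is still empty
            if var > best_var:
                best, best_var = s, var
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy positive definite covariance over 6 observed segments.
rng = np.random.default_rng(0)
a = rng.standard_normal((6, 6))
cov = a @ a.T + 0.5 * np.eye(6)
u = greedy_support(cov, m=3)
```

This naive version rescores from scratch each iteration; an incremental Cholesky update would make it cheaper, but the selection it produces is the same.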
3 Decentralized Data Fusion
In the previous section, two centralized data fusion approaches to exact (i.e., (1) and (2)) and approximate (i.e., (4) and (5)) GP prediction were introduced. In this section, we discuss the decentralized data fusion component of our D^{2}FAS algorithm, which distributes the computational load among the mobile sensors to achieve efficient and scalable approximate GP prediction.
The intuition of our decentralized data fusion algorithm is as follows: Each of the K mobile sensors constructs a local summary of the observations taken along its own path in the road network and communicates its local summary to every other sensor. Then, it assimilates the local summaries received from the other sensors into a globally consistent summary, which is exploited for predicting the traffic phenomenon as well as active sensing. This intuition will be formally realized and described in the paragraphs below.
While exploring the road network, each mobile sensor summarizes its local observations taken along its path based on a common support set U\subset V known to all the other sensors. Its local summary is defined as follows:
Definition 2 (Local Summary)
Given a common support set U\subset V known to all K mobile sensors, a set D_{k}\subset V of observed road segments and a column vector z_{D_{k}} of corresponding measurements local to mobile sensor k, its local summary is defined as a tuple (\dot{z}_{U}^{k},\dot{\Sigma}_{UU}^{k}) where
\dot{z}_{U}^{k}\triangleq\Sigma_{UD_{k}}\Sigma^{-1}_{D_{k}D_{k}|U}(z_{D_{k}}-\mu_{D_{k}})  (6)

\dot{\Sigma}_{UU}^{k}\triangleq\Sigma_{UD_{k}}\Sigma_{D_{k}D_{k}|U}^{-1}\Sigma_{D_{k}U}  (7)
such that \Sigma_{D_{k}D_{k}|U} is defined in a similar manner to (5).
Remark. Unlike SoD (Section 2.2), the support set U of road segments does not have to be observed, since the local summary (i.e., (6) and (7)) is independent of the corresponding measurements z_{U}. So, U does not need to be a subset of D=\bigcup^{K}_{k=1}D_{k}. To select an informative support set U from the set V of all possible segments in the road network, an offline active selection procedure similar to that in the last paragraph of Section 2.2 can be performed just once prior to observing data to determine U. In contrast, SoD has to perform online active selection every time new road segments are being observed.
By communicating its local summary to every other sensor, each mobile sensor can then construct a globally consistent summary from the received local summaries:
Definition 3 (Global Summary)
Given a common support set U\subset V known to all K mobile sensors and the local summary (\dot{z}_{U}^{k},\dot{\Sigma}_{UU}^{k}) of every mobile sensor k=1,\ldots,K, the global summary is defined as a tuple (\ddot{z}_{U},\ddot{\Sigma}_{UU}) where
\ddot{z}_{U}\triangleq\sum_{k=1}^{K}\dot{z}_{U}^{k}  (8)

\ddot{\Sigma}_{UU}\triangleq\Sigma_{UU}+\sum_{k=1}^{K}\dot{\Sigma}_{UU}^{k}\ .  (9)
Remark. In this paper, we assume all-to-all communication between the K mobile sensors. Supposing this is not possible and each sensor can only communicate locally with its neighbors, the summation structure of the global summary (specifically, (8) and (9)) makes it amenable to being constructed using distributed consensus filters (Olfati-Saber, 2005). We omit these details since they are beyond the scope of this paper.
Finally, the global summary is exploited by each mobile sensor to compute a globally consistent predictive Gaussian distribution, as detailed in Theorem 1A below, as well as to perform decentralized active sensing (Section 4):
Theorem 1
Let a common support set U\subset V be known to all K mobile sensors.
A. Given the global summary (\ddot{z}_{U},\ddot{\Sigma}_{UU}), each mobile sensor computes a globally consistent predictive Gaussian distribution \mathcal{N}(\overline{\mu}_{Y},\overline{\Sigma}_{YY}) of the measurements at any set Y of unobserved road segments where

\overline{\mu}_{Y}\triangleq\mu_{Y}+\Sigma_{YU}\ddot{\Sigma}_{UU}^{-1}\ddot{z}_{U}  (10)

\overline{\Sigma}_{YY}\triangleq\Sigma_{YY}-\Sigma_{YU}\left(\Sigma_{UU}^{-1}-\ddot{\Sigma}_{UU}^{-1}\right)\Sigma_{UY}\ .  (11)

B. Let \mathcal{N}(\mu_{Y|D}^{\text{PITC}},\Sigma_{YY|D}^{\text{PITC}}) be the predictive Gaussian distribution computed by the centralized sparse partially independent training conditional (PITC) approximation of the GP model (Quiñonero-Candela and Rasmussen, 2005) where

\mu_{Y|D}^{\text{PITC}}\triangleq\mu_{Y}+\Gamma_{YD}\left(\Gamma_{DD}+\Lambda\right)^{-1}\left(z_{D}-\mu_{D}\right)  (12)

\Sigma_{YY|D}^{\text{PITC}}\triangleq\Sigma_{YY}-\Gamma_{YD}\left(\Gamma_{DD}+\Lambda\right)^{-1}\Gamma_{DY}  (13)

such that

\Gamma_{BB^{\prime}}\triangleq\Sigma_{BU}\Sigma_{UU}^{-1}\Sigma_{UB^{\prime}}  (14)

and \Lambda is a block-diagonal matrix constructed from the K diagonal blocks of \Sigma_{DD|U}, each of which is a matrix \Sigma_{D_{k}D_{k}|U} for k=1,\ldots,K where D=\bigcup_{k=1}^{K}D_{k}. Then, \overline{\mu}_{Y}=\mu_{Y|D}^{\text{PITC}} and \overline{\Sigma}_{YY}=\Sigma_{YY|D}^{\text{PITC}}.
The proof of Theorem 1B is given in Appendix A. The equivalence result of Theorem 1B bears two implications:
Remark 1. The computation of PITC can be parallelized and distributed among the mobile sensors in a Google-like MapReduce paradigm (Chu et al., 2007), thereby improving the time efficiency of prediction: Each of the K mappers (sensors) is tasked to compute its local summary while the reducer (any sensor) sums these local summaries into a global summary, which is then used to compute the predictive Gaussian distribution. Supposing |Y|\leq|U| for simplicity, the \mathcal{O}\left(|D|((|D|/K)^{2}+|U|^{2})\right) time incurred by PITC can be reduced to the \mathcal{O}\left((|D|/K)^{3}+|U|^{3}+|U|^{2}K\right) time of running our decentralized algorithm on each of the K sensors, the latter of which scales better with increasing number |D| of observations.
Remark 2. We can draw insights from PITC to elucidate an underlying property of our decentralized algorithm: It is assumed that Z_{D_{1}},\ldots,Z_{D_{K}},Z_{Y} are conditionally independent given the measurements at the support set U of road segments. To potentially reduce the degree of violation of this assumption, an informative support set U is actively selected, as described earlier in this section. Furthermore, the experimental results on real-world urban road network data² (Section 6) show that D^{2}FAS can achieve predictive performance comparable to that of the full GP model while enjoying significantly lower computational cost, thus demonstrating the practicality of such an assumption for predicting traffic phenomena. The predictive performance of D^{2}FAS can be improved by increasing the size of U at the expense of greater time and communication overhead.

²Quiñonero-Candela and Rasmussen (2005) only illustrated the predictive performance of PITC on a simulated toy example.
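The equivalence in Theorem 1B can also be checked numerically on a toy example. The sketch below uses a made-up 1-D squared exponential kernel, two sensors, zero prior mean, and an assumed observation noise; it computes the decentralized prediction from the local and global summaries ((6)-(11)) and the centralized PITC prediction ((12)-(14)), and verifies that the two coincide:

```python
import numpy as np

rng = np.random.default_rng(0)

def k(a, b):
    """Toy squared exponential kernel on 1-D inputs (a stand-in for
    the paper's graph-based kernel)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

# Toy setup: K = 2 sensors with local data D_1, D_2 and support set U.
u = np.linspace(0.0, 5.0, 4)                          # support set U
d = [rng.uniform(0, 5, 5), rng.uniform(0, 5, 6)]      # D_1, D_2
z = [rng.standard_normal(5), rng.standard_normal(6)]  # zero-mean measurements
y = np.array([1.3, 2.7])                              # unobserved inputs Y
noise = 0.1

s_uu = k(u, u)
inv_uu = np.linalg.inv(s_uu)

# Each sensor's local summary, Eqs. (6)-(7).
z_dot, s_dot = [], []
for dk, zk in zip(d, z):
    s_udk = k(u, dk)
    s_dkdk_u = k(dk, dk) + noise * np.eye(len(dk)) - s_udk.T @ inv_uu @ s_udk
    z_dot.append(s_udk @ np.linalg.solve(s_dkdk_u, zk))
    s_dot.append(s_udk @ np.linalg.solve(s_dkdk_u, s_udk.T))

# Global summary, Eqs. (8)-(9), and decentralized prediction, Eqs. (10)-(11).
zg = sum(z_dot)
sg = s_uu + sum(s_dot)
s_yu = k(y, u)
mu_dec = s_yu @ np.linalg.solve(sg, zg)
cov_dec = k(y, y) - s_yu @ (inv_uu - np.linalg.inv(sg)) @ s_yu.T

# Centralized PITC on the pooled data, Eqs. (12)-(14).
da, za = np.concatenate(d), np.concatenate(z)
s_ud = k(u, da)
gamma_yd = s_yu @ inv_uu @ s_ud
gamma_dd = s_ud.T @ inv_uu @ s_ud
lam = np.zeros((len(da), len(da)))                    # block-diagonal Lambda
i = 0
for dk in d:
    j = i + len(dk)
    lam[i:j, i:j] = (k(dk, dk) + noise * np.eye(len(dk))
                     - s_ud[:, i:j].T @ inv_uu @ s_ud[:, i:j])
    i = j
a_mat = gamma_dd + lam
mu_pitc = gamma_yd @ np.linalg.solve(a_mat, za)
cov_pitc = k(y, y) - gamma_yd @ np.linalg.solve(a_mat, gamma_yd.T)

# Theorem 1B: the two predictive distributions coincide.
assert np.allclose(mu_dec, mu_pitc)
assert np.allclose(cov_dec, cov_pitc)
```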
4 Decentralized Active Sensing
The problem of active sensing with K mobile sensors is formulated as follows: Given the set D_{k}\subset V of observed road segments and the currently traversed road segment s_{k}\in V of every mobile sensor k=1,\ldots,K, the mobile sensors have to coordinate to select the most informative walks w^{\ast}_{1},\ldots,w^{\ast}_{K} of length (i.e., number of road segments) L each and with respective origins s_{1},\ldots,s_{K} in the road network G:
(w^{\ast}_{1},\ldots,w^{\ast}_{K})=\mathop{\arg\max}_{(w_{1},\ldots,w_{K})}\mathbb{H}\left[Z_{\bigcup^{K}_{k=1}Y_{w_{k}}}\,\Big|\,Z_{\bigcup^{K}_{k=1}D_{k}}\right]  (15)
where Y_{w_{k}} denotes the set of unobserved road segments induced by the walk w_{k}. To simplify notation, let a joint walk be denoted by w\triangleq(w_{1},\ldots,w_{K}) (similarly, for w^{\ast}) and its induced set of unobserved road segments be Y_{w}\triangleq\bigcup^{K}_{k=1}Y_{w_{k}} from now on. Interestingly, it can be shown using the chain rule for entropy that these maximum-entropy walks w^{\ast} minimize the posterior joint entropy (i.e., \mathbb{H}[Z_{V\setminus(D\cup Y_{w^{\ast}})}\,|\,Z_{D\cup Y_{w^{\ast}}}]) of the measurements at the remaining unobserved segments (i.e., V\setminus(D\cup Y_{w^{\ast}})) in the road network. After executing the walk w^{\ast}_{k}, each mobile sensor k observes the set Y_{w^{\ast}_{k}} of road segments and updates its local information:
D_{k}\leftarrow D_{k}\cup Y_{w^{\ast}_{k}}\ ,\quad z_{D_{k}}\leftarrow z_{D_{k}\cup Y_{w^{\ast}_{k}}}\ ,\quad s_{k}\leftarrow\mbox{terminus of }w^{\ast}_{k}\ .  (16)
Evaluating the Gaussian posterior entropy term in (15) involves computing a posterior covariance matrix for use in (3) via one of the data fusion methods described earlier: If (2) of the full GP model (Section 2) or (5) of SoD (Section 2.2) is to be used, then the observations that are gathered distributedly by the sensors have to be fully communicated to a central data fusion center. In contrast, our decentralized data fusion algorithm (Section 3) only requires communicating local summaries (Definition 2) to compute (11) for solving the active sensing problem (15):
w^{\ast}=\mathop{\arg\max}_{w}\ \overline{\mathbb{H}}\left[Z_{Y_{w}}\right]\ ,  (17)

\overline{\mathbb{H}}\left[Z_{Y_{w}}\right]\triangleq\frac{1}{2}\log\left((2\pi e)^{|Y_{w}|}\left|\overline{\Sigma}_{Y_{w}Y_{w}}\right|\right)\ .  (18)
Without imposing any structural assumption, solving the active sensing problem (17) is prohibitively expensive because the space of possible joint walks w grows exponentially in the number K of mobile sensors. To overcome this scalability issue for D^{2}FAS, our key idea is to construct a block-diagonal matrix whose log-determinant closely approximates that of \overline{\Sigma}_{Y_{w}Y_{w}} (11) and to exploit the property that the log-determinant of such a block-diagonal matrix decomposes into a sum of log-determinants of its diagonal blocks, each of which depends only on the walks of a disjoint subset of the K mobile sensors. Consequently, the active sensing problem can be partially decentralized, leading to a reduced space of possible joint walks to be searched, as detailed in the rest of this section.
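The matrix property behind this decomposition can be illustrated directly. In the sketch below, the random positive definite diagonal blocks are toy stand-ins for the per-subset posterior covariances (they are not derived from the traffic GP); the point is only that the log-determinant of the assembled block-diagonal matrix equals the sum of the blocks' log-determinants, so the entropy objective splits into independent terms, one per disjoint subset of sensors:

```python
import numpy as np

rng = np.random.default_rng(1)

def rand_pd(n):
    """Random symmetric positive definite matrix (toy diagonal block)."""
    a = rng.standard_normal((n, n))
    return a @ a.T + n * np.eye(n)

# One diagonal block per disjoint subset of coordinating sensors.
blocks = [rand_pd(2), rand_pd(3), rand_pd(2)]

# Assemble the block-diagonal approximation of the joint covariance.
n = sum(b.shape[0] for b in blocks)
m = np.zeros((n, n))
i = 0
for b in blocks:
    j = i + b.shape[0]
    m[i:j, i:j] = b
    i = j

# log |blockdiag(B_1, ..., B_n)| = sum_i log |B_i|
lhs = np.linalg.slogdet(m)[1]
rhs = sum(np.linalg.slogdet(b)[1] for b in blocks)
assert np.isclose(lhs, rhs)
```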
Firstly, we extend an earlier assumption in Section 3: Z_{D_{1}},\ldots,Z_{D_{K}},Z_{Y_{w_{1}}},\ldots,Z_{Y_{w_{K}}} are conditionally independent given the measurements at the support set U of road segments. Then, it can be shown via the equivalence to PITC (Theorem 1B) that \overline{\Sigma}_{Y_{w}Y_{w}} (11) comprises diagonal blocks of the form \overline{\Sigma}_{Y_{w_{k}}Y_{w_{k}}} for k=1,\ldots,K and off-diagonal blocks of the form \Sigma_{Y_{w_{k}}U}\ddot{\Sigma}_{UU}^{-1}\Sigma_{UY_{w_{k^{\prime}}}} for k,k^{\prime}=1,\ldots,K and k\neq k^{\prime}. In particular, each off-diagonal block of \overline{\Sigma}_{Y_{w}Y_{w}} represents the correlation of measurements between the unobserved road segments Y_{w_{k}} and Y_{w_{k^{\prime}}} along the respective walks w_{k} of sensor k and w_{k^{\prime}} of sensor k^{\prime}. If the correlation between some pair of their possible walks is high enough, then their walks have to be coordinated. This is formally realized by the following coordination graph over the K sensors:
Definition 4 (Coordination Graph)
Define the coordination graph to be an undirected graph \mathcal{G}\triangleq(\mathcal{V},\mathcal{E}) that comprises

a set \mathcal{V} of vertices denoting the K mobile sensors, and

a set \mathcal{E} of edges denoting coordination dependencies between sensors such that there exists an edge \{k,k^{\prime}\} incident with sensors k\in\mathcal{V} and k^{\prime}\in\mathcal{V}\setminus\{k\} iff
\max_{s\in Y_{W_{k}},s^{\prime}\in Y_{W_{k^{\prime}}}}\left|\Sigma_{sU}\ddot{\Sigma}_{UU}^{-1}\Sigma_{Us^{\prime}}\right|>\varepsilon (19) for a predefined constant \varepsilon>0 where W_{k} denotes the set of possible walks of length L of mobile sensor k from origin s_{k} in the road network G and Y_{W_{k}}\triangleq\bigcup_{w_{k}\in W_{k}}Y_{w_{k}}.
Remark. The construction of \mathcal{G} can be decentralized as follows: Since \ddot{\Sigma}_{UU} is symmetric and positive definite, it can be decomposed by Cholesky factorization into \ddot{\Sigma}_{UU}=\Psi\Psi^{\top} where \Psi is a lower triangular matrix and \Psi^{\top} is the transpose of \Psi. Then, \Sigma_{sU}\ddot{\Sigma}_{UU}^{-1}\Sigma_{Us^{\prime}}=(\Psi\backslash\Sigma_{Us})^{\top}\Psi\backslash\Sigma_{Us^{\prime}} where \Psi\backslash B denotes the column vector \phi solving \Psi\phi=B. That is, \Sigma_{sU}\ddot{\Sigma}_{UU}^{-1}\Sigma_{Us^{\prime}} (19) can be expressed as a dot product of two vectors \Psi\backslash\Sigma_{Us} and \Psi\backslash\Sigma_{Us^{\prime}}; this property is exploited to determine adjacency between sensors in a decentralized manner:
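This Cholesky-and-triangular-solve trick can be sketched as follows; a minimal numerical check in which a random SPD matrix stands in for \ddot{\Sigma}_{UU} and random vectors stand in for the cross-covariance columns \Sigma_{Us} and \Sigma_{Us^{\prime}} (all names are illustrative):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(1)
m = 5  # size of the (hypothetical) support set U
A = rng.standard_normal((m, m))
Sigma_UU = A @ A.T + m * np.eye(m)   # stands in for the SPD matrix Sigma''_UU
sigma_Us = rng.standard_normal(m)    # stands in for Sigma_Us
sigma_Usp = rng.standard_normal(m)   # stands in for Sigma_Us'

# Cholesky factor Psi with Sigma''_UU = Psi Psi^T (lower triangular).
Psi = cholesky(Sigma_UU, lower=True)
# phi = Psi \ Sigma_Us : each sensor can compute such vectors locally.
phi = solve_triangular(Psi, sigma_Us, lower=True)
phi_p = solve_triangular(Psi, sigma_Usp, lower=True)

# The quadratic form Sigma_sU Sigma''_UU^{-1} Sigma_Us' reduces to a dot product.
direct = sigma_Us @ np.linalg.solve(Sigma_UU, sigma_Usp)
assert np.isclose(phi @ phi_p, direct)
```

The point of the factorization is that each sensor only needs to exchange its \phi vectors, not the full covariance matrices, to test the adjacency criterion (19).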
Definition 5 (Adjacency)
Let
\Phi_{k}\triangleq\{\Psi\backslash\Sigma_{Us}\}_{s\in Y_{W_{k}}}  (20)
for k=1,\ldots,K. A sensor k\in\mathcal{V} is adjacent to sensor k^{\prime}\in\mathcal{V}\setminus\{k\} in coordination graph \mathcal{G} iff
\max_{\phi\in\Phi_{k},\phi^{\prime}\in\Phi_{k^{\prime}}}\left|\phi^{\top}\phi^{\prime}\right|>\varepsilon\ .  (21)
It follows from the above definition that if each sensor k constructs \Phi_{k} and exchanges it with every other sensor, then it can determine its adjacency to all the other sensors and store this information in a column vector a_{k} of length K whose k^{\prime}-th component is defined as follows:
\left[a_{k}\right]_{k^{\prime}}=\Bigg\{\begin{array}[]{rl}1&\mbox{if sensor $k$ is adjacent to sensor $k^{\prime}$,}\\ 0&\mbox{otherwise.}\end{array}  (22)
By exchanging its adjacency vector a_{k} with every other sensor, each sensor can construct a globally consistent adjacency matrix A_{\mathcal{G}}\triangleq(a_{1}\ldots a_{K}) to represent coordination graph \mathcal{G}.
Next, by computing the connected components (say, \mathcal{K} of them) of coordination graph \mathcal{G}, their resulting vertex sets partition the set \mathcal{V} of K sensors into \mathcal{K} disjoint subsets \mathcal{V}_{1},\ldots,\mathcal{V}_{\mathcal{K}} such that the sensors within each subset have to coordinate their walks. Each sensor can determine its residing connected component in a decentralized way by performing a depth-first search in \mathcal{G} rooted at itself.
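The per-sensor component computation above can be sketched directly; a minimal depth-first search over the exchanged 0/1 adjacency vectors (the toy graph below is an assumed example, not from the paper):

```python
def connected_component(A, root):
    """Depth-first search over a 0/1 adjacency matrix A (list of lists),
    returning the vertex set of the component containing `root`.
    Each sensor can run this locally, rooted at its own index."""
    K = len(A)
    seen, stack = {root}, [root]
    while stack:
        k = stack.pop()
        for kp in range(K):
            if A[k][kp] == 1 and kp not in seen:
                seen.add(kp)
                stack.append(kp)
    return seen

# Toy coordination graph over K = 5 sensors: {0,1,2} coordinate, {3,4} coordinate.
A = [[0, 1, 0, 0, 0],
     [1, 0, 1, 0, 0],
     [0, 1, 0, 0, 0],
     [0, 0, 0, 0, 1],
     [0, 0, 0, 1, 0]]
assert connected_component(A, 0) == {0, 1, 2}
assert connected_component(A, 4) == {3, 4}
```

Since every sensor holds the same globally consistent adjacency matrix, all sensors in a component agree on its membership without further communication.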
Finally, construct a block-diagonal matrix \widehat{\Sigma}_{Y_{w}Y_{w}} to comprise diagonal blocks of the form \overline{\Sigma}_{Y_{w_{\mathcal{V}_{n}}}Y_{w_{\mathcal{V}_{n}}}} for n=1,\ldots,\mathcal{K} where w_{\mathcal{V}_{n}}\triangleq(w_{k})_{k\in\mathcal{V}_{n}} and Y_{w_{\mathcal{V}_{n}}}\triangleq\bigcup_{k\in\mathcal{V}_{n}}Y_{w_{k}}. The active sensing problem (17) is then approximated by
\begin{array}[]{l}\displaystyle\max_{w}\ \frac{1}{2}\log\left(2\pi e\right)^{|Y_{w}|}\left|\widehat{\Sigma}_{Y_{w}Y_{w}}\right|\\ \equiv\displaystyle\max_{(w_{\mathcal{V}_{1}},\ldots,w_{\mathcal{V}_{\mathcal{K}}})}\sum_{n=1}^{\mathcal{K}}\log\left(2\pi e\right)^{|Y_{w_{\mathcal{V}_{n}}}|}\left|\overline{\Sigma}_{Y_{w_{\mathcal{V}_{n}}}Y_{w_{\mathcal{V}_{n}}}}\right|\\ =\displaystyle\sum_{n=1}^{\mathcal{K}}\max_{w_{\mathcal{V}_{n}}}\ \log\left(2\pi e\right)^{|Y_{w_{\mathcal{V}_{n}}}|}\left|\overline{\Sigma}_{Y_{w_{\mathcal{V}_{n}}}Y_{w_{\mathcal{V}_{n}}}}\right|,\end{array}  (23)
which can be solved in a partially decentralized manner by each disjoint subset \mathcal{V}_{n} of mobile sensors:
\widehat{w}_{\mathcal{V}_{n}}=\mathop{\arg\max}_{w_{\mathcal{V}_{n}}}\ \log\left(2\pi e\right)^{|Y_{w_{\mathcal{V}_{n}}}|}\left|\overline{\Sigma}_{Y_{w_{\mathcal{V}_{n}}}Y_{w_{\mathcal{V}_{n}}}}\right|\ .  (24)
Our active sensing algorithm becomes fully decentralized if \varepsilon is set sufficiently large: more sensors become isolated in \mathcal{G}, consequently decreasing the size \kappa\triangleq\max_{n}|\mathcal{V}_{n}| of its largest connected component to 1. As shown in Section 5.1, decreasing \kappa improves its time efficiency. On the other hand, setting \varepsilon\rightarrow 0^{+} recovers the centralized behavior (17): \mathcal{G} becomes near-complete, thus resulting in \kappa\rightarrow K.
Let
\xi\triangleq\max_{n,w_{\mathcal{V}_{n}},i,i^{\prime}}\left|\left[\left(\overline{\Sigma}_{Y_{w_{\mathcal{V}_{n}}}Y_{w_{\mathcal{V}_{n}}}}\right)^{-1}\right]_{ii^{\prime}}\right|  (25)
and \epsilon\triangleq 0.5\log\left(1\Big{/}\left(1-\left(K^{1.5}L^{2.5}\kappa\xi\varepsilon\right)^{2}\right)\right). In the result below, we prove that the joint walk \widehat{w}\triangleq(\widehat{w}_{\mathcal{V}_{1}},\ldots,\widehat{w}_{\mathcal{V}_{\mathcal{K}}}) is guaranteed to achieve an entropy \overline{\mathbb{H}}\left[Z_{Y_{\widehat{w}}}\right] (i.e., by plugging \widehat{w} into (18)) that is no more than \epsilon below the maximum entropy \overline{\mathbb{H}}\left[Z_{Y_{w^{\ast}}}\right] achieved by joint walk w^{\ast} (17):
Theorem 2 (Performance Guarantee)
If K^{1.5}L^{2.5}\kappa\xi\varepsilon<1, then \overline{\mathbb{H}}\left[Z_{Y_{w^{\ast}}}\right]-\overline{\mathbb{H}}\left[Z_{Y_{\widehat{w}}}\right]\leq\epsilon.
The proof of Theorem 2 is given in Appendix B. The implication of Theorem 2 is that our partially decentralized active sensing algorithm can perform comparatively well (i.e., small \epsilon) under the following favorable environmental conditions: (a) the network of K sensors is not large, (b) the length L of each sensor's walk to be optimized is not long, (c) the largest subset of \kappa sensors formed to coordinate their walks (i.e., the largest connected component in \mathcal{G}) is reasonably small, and (d) the minimum required correlation \varepsilon between walks of adjacent sensors is kept low.
5 Time and Communication Overheads
In this section, the time and communication overheads of our D{}^{2}FAS algorithm are analyzed and compared to those of centralized active sensing (17) coupled with the data fusion methods: full GP (FGP) and SoD (Section 2).
5.1 Time Complexity
The data fusion component of D{}^{2}FAS involves computing the local and global summaries and the predictive Gaussian distribution. To construct the local summary using (6) and (7), each sensor has to evaluate \Sigma_{D_{k}D_{k}|U} in {\mathcal{O}}\left(|U|^{3}+|U|(|D|/K)^{2}\right) time and invert it in {\mathcal{O}}\left((|D|/K)^{3}\right) time, after which the local summary is obtained in {\mathcal{O}}\left(|U|^{2}|D|/K+|U|(|D|/K)^{2}\right) time. The global summary is computed in {\mathcal{O}}\left(|U|^{2}K\right) time by (8) and (9). Finally, the predictive Gaussian distribution is derived in {\mathcal{O}}\left(|U|^{3}+|U||Y|^{2}\right) time using (10) and (11). Supposing |Y|\leq|U| for simplicity, the time complexity of data fusion is then {\mathcal{O}}\left((|D|/K)^{3}+|U|^{3}+|U|^{2}K\right).
Let the maximum out-degree of G be denoted by \delta. Then, each sensor has to consider \Delta\triangleq\delta^{L} possible walks of length L. The active sensing component of D{}^{2}FAS involves computing \Phi_{k} in {\mathcal{O}}\left(\Delta L|U|^{2}\right) time, a_{k} in {\mathcal{O}}\left(\Delta^{2}L^{2}|U|K\right) time, its residing connected component in {\mathcal{O}}\left(\kappa^{2}\right) time, and the maximum-entropy joint walk by (11) and (24) with the following incurred time: the largest connected component of \kappa sensors in \mathcal{G} has to consider \Delta^{\kappa} possible joint walks. Note that \overline{\Sigma}_{Y_{w_{\mathcal{V}_{n}}}Y_{w_{\mathcal{V}_{n}}}}=\mbox{diag}\left((\Sigma_{Y_{w_{k}}Y_{w_{k}}|U})_{k\in\mathcal{V}_{n}}\right)+\Sigma_{Y_{w_{\mathcal{V}_{n}}}U}\ddot{\Sigma}_{UU}^{-1}\Sigma_{UY_{w_{\mathcal{V}_{n}}}} where \mbox{diag}(B) constructs a block-diagonal matrix by placing the blocks in B on its diagonal. By exploiting \Phi_{k}, the diagonal and latter matrix terms for all possible joint walks can be computed in {\mathcal{O}}\left(\kappa\Delta(L|U|^{2}+L^{2}|U|)\right) and {\mathcal{O}}\left(\kappa^{2}\Delta^{2}L^{2}|U|\right) time, respectively. For each joint walk w_{\mathcal{V}_{n}}, evaluating the determinant of \overline{\Sigma}_{Y_{w_{\mathcal{V}_{n}}}Y_{w_{\mathcal{V}_{n}}}} incurs {\mathcal{O}}\left((\kappa L)^{3}\right) time. Therefore, the time complexity of active sensing is {\mathcal{O}}\left(\kappa\Delta L|U|^{2}+\Delta^{2}L^{2}|U|(K+\kappa^{2})+\Delta^{\kappa}(\kappa L)^{3}\right).
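The \Delta=\delta^{L} count of candidate walks per sensor can be made concrete with a short recursion over the road network's directed graph; a minimal sketch on an assumed toy network (the graph and segment names are illustrative):

```python
def walks(G, origin, L):
    """Enumerate all length-L walks from `origin` in a directed graph G,
    given as a dict mapping each road segment to its successor segments."""
    if L == 0:
        return [[origin]]
    return [[origin] + w for s in G[origin] for w in walks(G, s, L - 1)]

# Toy road network where every segment has out-degree delta = 2.
G = {'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['a', 'b']}
W = walks(G, 'a', 2)
assert len(W) == 2 ** 2  # Delta = delta^L candidate walks for L = 2
```

This exponential growth in L (and, for joint walks, in \kappa) is precisely what the block-diagonal approximation and the coordination graph are designed to contain.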
Hence, the time complexity of our D{}^{2}FAS algorithm is {\mathcal{O}}\left((|D|/K)^{3}+|U|^{2}(|U|+K+\kappa\Delta L)+\Delta^{2}L^{2}|U|(K+\kappa^{2})+\Delta^{\kappa}(\kappa L)^{3}\right). In contrast, the times incurred by centralized active sensing coupled with FGP and SoD are, respectively, {\mathcal{O}}\left(|D|^{3}+\Delta^{K}KL(|D|^{2}+(KL)^{2})\right) and {\mathcal{O}}\left(|U|^{3}|D|+\Delta^{K}KL(|U|^{2}+(KL)^{2})\right). It can be observed that D{}^{2}FAS scales better with large |D| (i.e., number of observations) and K (i.e., number of sensors). The scalability of D{}^{2}FAS vs. FGP and SoD will be further evaluated empirically in Section 6.
5.2 Communication Complexity
Let the communication overhead be defined as the size of each broadcast message. Recall from the data fusion component of D{}^{2}FAS in Algorithm 1 that, in each iteration, each sensor broadcasts an {\mathcal{O}}\left(|U|^{2}\right)-sized summary encapsulating its local observations, which is robust against communication failure. In contrast, FGP and SoD require each sensor to broadcast, in each iteration, an {\mathcal{O}}\left(|D|/K\right)-sized message comprising exactly its local observations to handle communication failure. If the number of local observations grows to be larger in size than a local summary of predefined size, then the data fusion component of D{}^{2}FAS is more scalable than FGP and SoD in terms of communication overhead. For the partially decentralized active sensing component of D{}^{2}FAS, each sensor broadcasts an {\mathcal{O}}\left(\Delta L|U|\right)-sized \Phi_{k} message and an {\mathcal{O}}\left(K\right)-sized a_{k} message.
6 Experiments and Discussion
This section evaluates the predictive performance, time efficiency, and scalability of our D{}^{2}FAS algorithm on a real-world traffic phenomenon (i.e., speeds (km/h) of road segments) over an urban road network (top figure) in the Tampines area, Singapore during evening peak hours on April 20, 2011. It comprises 775 road segments including highways, arterials, slip roads, etc. The mean speed is 48.8 km/h and the standard deviation is 20.5 km/h.
The performance of D{}^{2}FAS is compared to that of centralized active sensing (17) coupled with the state-of-the-art data fusion methods: full GP (FGP) and SoD (Section 2). A network of K mobile sensors is tasked to explore the road network to gather a total of up to 960 observations. To reduce computational time, each sensor repeatedly computes and executes maximum-entropy walks of length L=2 (instead of computing a very long walk), unless otherwise stated. For D{}^{2}FAS and SoD, |U| is set to 64. For the active sensing component of D{}^{2}FAS, \varepsilon is set to 0.1, unless otherwise stated. The experiments are run on a Linux PC with an Intel Core 2 Quad CPU Q9550 at 2.83 GHz.
6.1 Performance Metrics
The first metric evaluates the predictive performance of a tested algorithm: it measures the root mean squared error (RMSE) \sqrt{|V|^{-1}\sum_{s\in V}\left(z_{s}-\widehat{\mu}_{s}\right)^{2}} over the entire domain V of the road network that is incurred by the predictive mean \widehat{\mu}_{s} of the tested algorithm, specifically, using (1) of FGP, (4) of SoD, or (10) of D{}^{2}FAS. The second metric evaluates the time efficiency and scalability of a tested algorithm by measuring its incurred time; for D{}^{2}FAS, the maximum of the times incurred by all subsets \mathcal{V}_{1},\ldots,\mathcal{V}_{\mathcal{K}} of sensors is recorded.
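The RMSE metric above is straightforward to compute; a minimal sketch (the speed values below are hypothetical, for illustration only):

```python
import math

def rmse(z, mu):
    """Root mean squared error over all |V| road segments, as in the first
    metric: sqrt( |V|^{-1} * sum_s (z_s - mu_s)^2 )."""
    assert len(z) == len(mu)
    return math.sqrt(sum((zs - ms) ** 2 for zs, ms in zip(z, mu)) / len(z))

# Hypothetical observed speeds (km/h) vs. predictive means on four segments.
observed = [50.0, 40.0, 60.0, 45.0]
predicted = [48.0, 43.0, 59.0, 46.0]
assert math.isclose(rmse(observed, predicted), math.sqrt((4 + 9 + 1 + 1) / 4))
```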
6.2 Results and Analysis
Figure 1: (a)-(c) predictive performance and (d)-(f) incurred time for K=4, 6, 8 sensors.
Predictive performance and time efficiency. Fig. 1 shows the performance of the tested algorithms averaged over 40 randomly generated starting sensor locations with varying number K=4, 6, 8 of sensors. It can be observed that D{}^{2}FAS is significantly more time-efficient and scales better with increasing number |D| of observations (Figs. 1d to 1f) while achieving predictive performance close to that of centralized active sensing coupled with FGP and SoD (Figs. 1a to 1c). Specifically, D{}^{2}FAS is about 1, 2, and 4 orders of magnitude faster than centralized active sensing coupled with FGP and SoD for K=4, 6, 8 sensors, respectively.
Figure 2: (a)-(c) predictive performance and (d)-(f) incurred time of D{}^{2}FAS, FGP, and SoD with varying number K of sensors.
Scalability of D{}^{2}FAS. Using the same results as in Fig. 1, Fig. 2 plots them differently to reveal the scalability of the tested algorithms with increasing number K of sensors. Additionally, we provide results of the performance of D{}^{2}FAS for K=10, 20, 30 sensors; such results are not available for centralized active sensing coupled with FGP and SoD due to extremely long incurred time. It can be observed from Figs. 2a to 2c that the predictive performance of all tested algorithms improves with a larger number of sensors because each sensor needs to execute fewer walks and its performance is therefore less adversely affected by its myopic selection (i.e., L=2) of maximum-entropy walks. As a result, more informative unobserved road segments are explored.
As shown in Fig. 2d, when the randomly placed sensors gather their initial observations (i.e., D<400), the time incurred by D{}^{2}FAS is higher for greater K due to larger subsets of sensors being formed to coordinate their walks (i.e., larger \kappa). As more observations are gathered (i.e., D\geq 400), its partially decentralized active sensing component directs the sensors to explore further apart from each other in order to maximize the entropy of their walks. This consequently decreases \kappa, leading to a reduction in incurred time. Furthermore, as K increases from 4 to 20, the incurred time decreases due to its decentralized data fusion component that can distribute the computational load among a greater number of sensors. When the road network becomes more crowded from K=20 to K=30 sensors, the incurred time increases slightly due to slightly larger \kappa. In contrast, Figs. 2e and 2f show that the time taken by FGP and SoD increases significantly primarily due to their centralized active sensing incurring exponential time in K. Hence, the scalability of our D{}^{2}FAS algorithm in the number of sensors allows the deployment of a largerscale mobile sensor network (i.e., K\geq 10) to achieve more accurate traffic modeling and prediction (Figs. 2a to 2c).
Figure 3: time incurred by the data fusion components of (a) D{}^{2}FAS, (b) FGP, and (c) SoD.
Scalability of data fusion. Fig. 3 shows results of the scalability of the tested data fusion methods with increasing number K of sensors. In order to produce meaningful results for fair comparison, the same active sensing component has to be coupled with the data fusion methods and its incurred time kept to a minimum. As such, we impose the use of a fully decentralized active sensing component to be performed by each mobile sensor k: w^{\ast}_{k}=\mathop{\arg\max}_{w_{k}}\mathbb{H}[Z_{Y_{w_{k}}}\mid Z_{D}]. For D{}^{2}FAS, this corresponds exactly to (24) by setting a large enough \varepsilon (in our experiments, \varepsilon=2) to yield \kappa=1; consequently, computation and communication pertaining to the coordination graph can be omitted.
It can be seen from Fig. 3a that the time incurred by the decentralized data fusion component of D{}^{2}FAS decreases with increasing K, as explained previously. In contrast, the time incurred by FGP and SoD increases (Figs. 3b and 3c): as discussed above, a larger number of sensors results in a greater quantity of more informative unique observations to be gathered (i.e., fewer repeated observations), which increases the time needed for data fusion. When K\geq 10, D{}^{2}FAS is at least 1 order of magnitude faster than FGP and SoD. It can also be observed that D{}^{2}FAS scales better with increasing number of observations. So, the real-time performance and scalability of D{}^{2}FAS's decentralized data fusion enable it to be used for persistent large-scale traffic modeling and prediction where a large number of observations and sensors (including static and passive ones) are expected to be available.
Figure 4: predictive performance and incurred time of (a, d) D{}^{2}FAS, (b, e) FGP, and (c, f) SoD with varying length L of walk.
Varying length L of walk. Fig. 4 shows results of the performance of the tested algorithms with varying length L=2, 4, 6, 8 of maximum-entropy joint walks; we choose to experiment with just 2 sensors since Figs. 2 and 3 reveal that a smaller number of sensors produces poorer predictive performance and higher incurred time with a large number of observations for D{}^{2}FAS. It can be observed that the predictive performance of all tested algorithms improves with increasing walk length L because the selection of maximum-entropy joint walks is less myopic. The time incurred by D{}^{2}FAS increases due to larger \kappa but grows more slowly and is lower than that incurred by centralized active sensing coupled with FGP and SoD. Specifically, when L=8, D{}^{2}FAS is at least 1 order of magnitude faster (i.e., an average of 60 s) than centralized active sensing coupled with SoD (i.e., an average of >732 s) and FGP (i.e., not available due to excessive incurred time). Also, notice from Figs. 2a and 2d that if a large number of sensors (i.e., K=30) is available, D{}^{2}FAS can select shorter walks of L=2 to be significantly more time-efficient (i.e., an average of >3 orders of magnitude faster) while achieving predictive performance comparable to that of SoD with L=8 and FGP with L=6.
7 Conclusion
This paper describes a decentralized data fusion and active sensing algorithm for modeling and predicting spatiotemporal traffic phenomena with mobile sensors. Analytical and empirical results have shown that our D{}^{2}FAS algorithm is significantly more time-efficient and scales better with increasing numbers of observations and sensors while achieving predictive performance close to that of state-of-the-art centralized active sensing coupled with FGP and SoD. Hence, D{}^{2}FAS is practical for deployment in a large-scale mobile sensor network to achieve persistent and accurate traffic modeling and prediction. For our future work, we will assume that each sensor can only communicate locally with its neighbors (instead of assuming all-to-all communication) and develop a distributed data fusion approach to efficient and scalable approximate GP prediction based on D{}^{2}FAS and consensus filters (Olfati-Saber, 2005).
Acknowledgments. This work was supported by the Singapore-MIT Alliance for Research and Technology (SMART) Subaward Agreement 14 R252000466592.
References
 Borg and Groenen (2005) Borg, I. and Groenen, P. J. F. (2005). Modern Multidimensional Scaling: Theory and Applications. Springer, NY.
 Chen et al. (2011) Chen, H., Rakha, H. A., and Sadek, S. (2011). Real-time freeway traffic state prediction: A particle filter approach. In Proc. IEEE ITSC, pages 626–631.
 Chu et al. (2007) Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G. R., Ng, A. Y., and Olukotun, K. (2007). Map-Reduce for machine learning on multicore. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 281–288, Cambridge, MA. MIT Press.
 Chung et al. (2004) Chung, T. H., Gupta, V., Burdick, J. W., and Murray, R. M. (2004). On a decentralized active sensing strategy using mobile sensor platforms in a network. In Proc. CDC, pages 1914–1919.
 Coates (2004) Coates, M. (2004). Distributed particle filters for sensor networks. In Proc. IPSN, pages 99–107.
 Cortes (2009) Cortes, J. (2009). Distributed kriged Kalman filter for spatial estimation. IEEE Trans. Automat. Contr., 54(12), 2816–2827.
 Dolan et al. (2009) Dolan, J. M., Podnar, G., Stancliff, S., Low, K. H., Elfes, A., Higinbotham, J., Hosler, J. C., Moisan, T. A., and Moisan, J. (2009). Cooperative aquatic sensing using the telesupervised adaptive ocean sensor fleet. In Proc. SPIE Conference on Remote Sensing of the Ocean, Sea Ice, and Large Water Regions, volume 7473.
 Golub and Van Loan (1996) Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations. Johns Hopkins Univ. Press, third edition.
 Guestrin et al. (2004) Guestrin, C., Bodik, P., Thibaus, R., Paskin, M., and Madden, S. (2004). Distributed regression: an efficient framework for modeling sensor network data. In Proc. IPSN, pages 1–10.
 Ipsen and Lee (2003) Ipsen, I. C. F. and Lee, D. J. (2003). Determinant approximations. Technical Report CRSC-TR03-30, Center for Research in Scientific Computation, North Carolina State University, Raleigh, NC.
 Kamarianakis and Prastacos (2003) Kamarianakis, Y. and Prastacos, P. (2003). Forecasting traffic flow conditions in an urban network: Comparison of multivariate and univariate approaches. Transport. Res. Rec., 1857, 74–84.
 Krause et al. (2008a) Krause, A., Horvitz, E., Kansal, A., and Zhao, F. (2008a). Toward community sensing. In Proc. IPSN, pages 481–492.
 Krause et al. (2008b) Krause, A., Singh, A., and Guestrin, C. (2008b). Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 9, 235–284.
 Lawrence et al. (2003) Lawrence, N., Seeger, M., and Herbrich, R. (2003). Fast sparse Gaussian process methods: The informative vector machine. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 609–616, Cambridge, MA. MIT Press.
 Low et al. (2007) Low, K. H., Gordon, G. J., Dolan, J. M., and Khosla, P. (2007). Adaptive sampling for multi-robot wide-area exploration. In Proc. IEEE ICRA, pages 755–760.
 Low et al. (2008) Low, K. H., Dolan, J. M., and Khosla, P. (2008). Adaptive multi-robot wide-area exploration and mapping. In Proc. AAMAS, pages 23–30.
 Low et al. (2009a) Low, K. H., Dolan, J. M., and Khosla, P. (2009a). Information-theoretic approach to efficient adaptive path planning for mobile robotic environmental sensing. In Proc. ICAPS, pages 233–240.
 Low et al. (2009b) Low, K. H., Podnar, G., Stancliff, S., Dolan, J. M., and Elfes, A. (2009b). Robot boats as a mobile aquatic sensor network. In Proc. IPSN-09 Workshop on Sensor Networks for Earth and Space Science Applications.
 Low et al. (2011) Low, K. H., Dolan, J. M., and Khosla, P. (2011). Active Markov information-theoretic path planning for robotic environmental sensing. In Proc. AAMAS, pages 753–760.
 Low et al. (2012) Low, K. H., Chen, J., Dolan, J. M., Chien, S., and Thompson, D. R. (2012). Decentralized active robotic exploration and mapping for probabilistic field classification in environmental sensing. In Proc. AAMAS, pages 105–112.
 Min and Wynter (2011) Min, W. and Wynter, L. (2011). Real-time road traffic prediction with spatio-temporal correlations. Transport. Res. C-Emer., 19(4), 606–616.
 Neumann et al. (2009) Neumann, M., Kersting, K., Xu, Z., and Schulz, D. (2009). Stacked Gaussian process learning. In Proc. ICDM, pages 387–396.
 Oh et al. (2010) Oh, S., Xu, Y., and Choi, J. (2010). Explorative navigation of mobile sensor networks using sparse Gaussian processes. In Proc. CDC, pages 3851–3856.
 Olfati-Saber (2005) Olfati-Saber, R. (2005). Distributed Kalman filter with embedded consensus filters. In Proc. CDC, pages 8179–8184.
 Paskin and Guestrin (2004) Paskin, M. A. and Guestrin, C. (2004). Robust probabilistic inference in distributed systems. In Proc. UAI, pages 436–445.
 Podnar et al. (2010) Podnar, G., Dolan, J. M., Low, K. H., and Elfes, A. (2010). Telesupervised remote surface water quality sensing. In Proc. IEEE Aerospace Conference.
 Quiñonero-Candela and Rasmussen (2005) Quiñonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. JMLR, 6, 1939–1959.
 Rasmussen and Williams (2006) Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA.
 Rosencrantz et al. (2003) Rosencrantz, M., Gordon, G., and Thrun, S. (2003). Decentralized sensor fusion with distributed particle filters. In Proc. UAI, pages 493–500.
 Schölkopf and Smola (2002) Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, first edition.
 Schrank et al. (2011) Schrank, D., Lomax, T., and Eisele, B. (2011). TTI’s 2011 Urban Mobility Report. Texas Transportation Institute, Texas A&M University.
 Seeger and Williams (2003) Seeger, M. and Williams, C. (2003). Fast forward selection to speed up sparse Gaussian process regression. In C. M. Bishop and B. J. Frey, editors, Proc. AISTATS.
 Srinivasan and Jovanis (1996) Srinivasan, K. K. and Jovanis, P. P. (1996). Determination of number of probe vehicles required for reliable travel time measurement in urban network. Transport. Res. Rec., 1537, 15–22.
 Stewart and Sun (1990) Stewart, G. W. and Sun, J.-G. (1990). Matrix Perturbation Theory. Academic Press.
 Stranders et al. (2009) Stranders, R., Farinelli, A., Rogers, A., and Jennings, N. R. (2009). Decentralised coordination of mobile sensors using the max-sum algorithm. In Proc. IJCAI, pages 299–304.
 Sukkarieh et al. (2003) Sukkarieh, S., Nettleton, E., Kim, J., Ridley, M., Goktogan, A., and Durrant-Whyte, H. (2003). The ANSER project: Data fusion across multiple uninhabited air vehicles. IJRR, 22(7-8), 505–539.
 Turner et al. (1998) Turner, S. M., Eisele, W. L., Benz, R. J., and Holdener, D. J. (1998). Travel time data collection handbook. Technical Report FHWA-PL-98-035, Federal Highway Administration, Office of Highway Information Management, Washington, DC.
 Wang and Papageorgiou (2005) Wang, Y. and Papageorgiou, M. (2005). Real-time freeway traffic state estimation based on extended Kalman filter: a general approach. Transport. Res. B-Meth., 39(2), 141–167.
 Work et al. (2010) Work, D. B., Blandin, S., Tossavainen, O., Piccoli, B., and Bayen, A. (2010). A traffic model for velocity data assimilation. AMRX, 2010(1), 1–35.
Appendix A Proof of Theorem 1B
We have to first simplify the \Gamma_{YD}\left(\Gamma_{DD}+\Lambda\right)^{-1} term in the expressions of \mu_{Y|D}^{\text{PITC}} (12) and \Sigma_{YY|D}^{\text{PITC}} (13).
\begin{array}[]{l}\left(\Gamma_{DD}+\Lambda\right)^{-1}\\ =\left(\Sigma_{DU}\Sigma_{UU}^{-1}\Sigma_{UD}+\Lambda\right)^{-1}\\ =\Lambda^{-1}-\Lambda^{-1}\Sigma_{DU}\left(\Sigma_{UU}+\Sigma_{UD}\Lambda^{-1}\Sigma_{DU}\right)^{-1}\Sigma_{UD}\Lambda^{-1}\\ =\Lambda^{-1}-\Lambda^{-1}\Sigma_{DU}\ddot{\Sigma}_{UU}^{-1}\Sigma_{UD}\Lambda^{-1}\ .\end{array}  (26)
The second equality follows from the matrix inversion lemma. The last equality is due to
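The matrix inversion lemma step in (26) can be checked numerically; a minimal sketch in which random SPD matrices stand in for \Sigma_{UU} and \Lambda (sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 3, 6  # sizes standing in for |U| and |D|
B = rng.standard_normal((m, m))
Sigma_UU = B @ B.T + m * np.eye(m)      # stands in for Sigma_UU (SPD)
Sigma_UD = rng.standard_normal((m, d))  # stands in for Sigma_UD; Sigma_DU is its transpose
C = rng.standard_normal((d, d))
Lam = C @ C.T + d * np.eye(d)           # stands in for Lambda (block-diagonal in the paper; any SPD works here)

# Left side: (Gamma_DD + Lambda)^{-1} with Gamma_DD = Sigma_DU Sigma_UU^{-1} Sigma_UD.
lhs = np.linalg.inv(Sigma_UD.T @ np.linalg.inv(Sigma_UU) @ Sigma_UD + Lam)

# Right side of the lemma, using Sigma''_UU = Sigma_UU + Sigma_UD Lambda^{-1} Sigma_DU.
Lam_inv = np.linalg.inv(Lam)
ddot = Sigma_UU + Sigma_UD @ Lam_inv @ Sigma_UD.T
rhs = Lam_inv - Lam_inv @ Sigma_UD.T @ np.linalg.inv(ddot) @ Sigma_UD @ Lam_inv

assert np.allclose(lhs, rhs)
```

The payoff of the lemma here is that the inversion happens in the (small) support-set dimension rather than the (large) data dimension.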
\begin{array}[]{l}\Sigma_{UU}+\Sigma_{UD}\Lambda^{-1}\Sigma_{DU}\\ =\displaystyle\Sigma_{UU}+\sum_{k=1}^{K}\Sigma_{UD_{k}}\Sigma_{D_{k}D_{k}|U}^{-1}\Sigma_{D_{k}U}\\ =\displaystyle\Sigma_{UU}+\sum_{k=1}^{K}\dot{\Sigma}_{UU}^{k}=\ddot{\Sigma}_{UU}\ .\end{array}  (27)
\begin{array}[]{l}\Gamma_{YD}\left(\Gamma_{DD}+\Lambda\right)^{-1}\\ =\Sigma_{YU}\Sigma_{UU}^{-1}\Sigma_{UD}\left(\Lambda^{-1}-\Lambda^{-1}\Sigma_{DU}\ddot{\Sigma}_{UU}^{-1}\Sigma_{UD}\Lambda^{-1}\right)\\ =\Sigma_{YU}\Sigma_{UU}^{-1}\left(\ddot{\Sigma}_{UU}-\Sigma_{UD}\Lambda^{-1}\Sigma_{DU}\right)\ddot{\Sigma}_{UU}^{-1}\Sigma_{UD}\Lambda^{-1}\\ =\Sigma_{YU}\ddot{\Sigma}_{UU}^{-1}\Sigma_{UD}\Lambda^{-1}\end{array}  (28)
The third equality is due to (27).
From (12),
\begin{array}[]{rl}\mu_{Y|D}^{\text{PITC}}&=\mu_{Y}+\Gamma_{YD}\left(\Gamma_{DD}+\Lambda\right)^{-1}\left(z_{D}-\mu_{D}\right)\\ &=\mu_{Y}+\Sigma_{YU}\ddot{\Sigma}_{UU}^{-1}\Sigma_{UD}\Lambda^{-1}\left(z_{D}-\mu_{D}\right)\\ &=\mu_{Y}+\Sigma_{YU}\ddot{\Sigma}_{UU}^{-1}\ddot{z}_{U}\\ &=\overline{\mu}_{Y}\ .\end{array}
The second equality is due to (28). The third equality follows from \Sigma_{UD}\Lambda^{-1}\left(z_{D}-\mu_{D}\right)=\sum_{k=1}^{K}\Sigma_{UD_{k}}\Sigma_{D_{k}D_{k}|U}^{-1}\left(z_{D_{k}}-\mu_{D_{k}}\right)=\sum_{k=1}^{K}\dot{z}^{k}_{U}=\ddot{z}_{U}.
From (13),
\begin{array}[]{l}\Sigma_{YY|D}^{\text{PITC}}\\ =\Sigma_{YY}-\Gamma_{YD}\left(\Gamma_{DD}+\Lambda\right)^{-1}\Gamma_{DY}\\ =\Sigma_{YY}-\Sigma_{YU}\ddot{\Sigma}_{UU}^{-1}\Sigma_{UD}\Lambda^{-1}\Sigma_{DU}\Sigma_{UU}^{-1}\Sigma_{UY}\\ =\Sigma_{YY}-\Big{(}\Sigma_{YU}\ddot{\Sigma}_{UU}^{-1}\Sigma_{UD}\Lambda^{-1}\Sigma_{DU}\Sigma_{UU}^{-1}\Sigma_{UY}\\ \ \ \ \ -\Sigma_{YU}\Sigma_{UU}^{-1}\Sigma_{UY}\Big{)}-\Sigma_{YU}\Sigma_{UU}^{-1}\Sigma_{UY}\\ =\Sigma_{YY}-\Sigma_{YU}\ddot{\Sigma}_{UU}^{-1}\left(\Sigma_{UD}\Lambda^{-1}\Sigma_{DU}-\ddot{\Sigma}_{UU}\right)\Sigma_{UU}^{-1}\Sigma_{UY}\\ \ \ \ \ -\Sigma_{YU}\Sigma_{UU}^{-1}\Sigma_{UY}\\ =\Sigma_{YY}-\left(\Sigma_{YU}\Sigma_{UU}^{-1}\Sigma_{UY}-\Sigma_{YU}\ddot{\Sigma}_{UU}^{-1}\Sigma_{UY}\right)\\ =\Sigma_{YY}-\Sigma_{YU}\left(\Sigma_{UU}^{-1}-\ddot{\Sigma}_{UU}^{-1}\right)\Sigma_{UY}\\ =\overline{\Sigma}_{YY}\ .\end{array}
The second equality follows from (14) and (28). The fifth equality is due to (27).
Appendix B Proof of Theorem 2
Let \widetilde{\Sigma}_{Y_{w}Y_{w}}\triangleq\overline{\Sigma}_{Y_{w}Y_{w}}-\widehat{\Sigma}_{Y_{w}Y_{w}} and \rho_{w} be the spectral radius of \left(\widehat{\Sigma}_{Y_{w}Y_{w}}\right)^{-1}\widetilde{\Sigma}_{Y_{w}Y_{w}}. We first bound \rho_{w} from above.
For any joint walk w, \left(\widehat{\Sigma}_{Y_{w}Y_{w}}\right)^{-1}\widetilde{\Sigma}_{Y_{w}Y_{w}} comprises diagonal blocks of size \left|Y_{w_{\mathcal{V}_{n}}}\right|\times\left|Y_{w_{\mathcal{V}_{n}}}\right| with components of value 0 for n=1,\ldots,\mathcal{K}, and off-diagonal blocks of the form \left(\overline{\Sigma}_{Y_{w_{\mathcal{V}_{n}}}Y_{w_{\mathcal{V}_{n}}}}\right)^{-1}\overline{\Sigma}_{Y_{w_{\mathcal{V}_{n}}}Y_{w_{\mathcal{V}_{n^{\prime}}}}} for n,n^{\prime}=1,\ldots,\mathcal{K} and n\neq n^{\prime}. Any pair of sensors k\in\mathcal{V}_{n} and k^{\prime}\in\mathcal{V}_{n^{\prime}} reside in different connected components of the coordination graph \mathcal{G} and are therefore not adjacent. So, by Definition 5,
\displaystyle\max_{i,i^{\prime}}\left|\left[\overline{\Sigma}_{Y_{w_{\mathcal{V}_{n}}}Y_{w_{\mathcal{V}_{n^{\prime}}}}}\right]_{ii^{\prime}}\right|\leq\varepsilon  (29)
for n,n^{\prime}=1,\ldots,\mathcal{K} and n\neq n^{\prime}. Using (25) and (29), each component in any off-diagonal block of \left(\widehat{\Sigma}_{Y_{w}Y_{w}}\right)^{-1}\widetilde{\Sigma}_{Y_{w}Y_{w}} can be bounded as follows:
\displaystyle\max_{i,i^{\prime}}\left|\left[\left(\overline{\Sigma}_{Y_{w_{\mathcal{V}_{n}}}Y_{w_{\mathcal{V}_{n}}}}\right)^{-1}\overline{\Sigma}_{Y_{w_{\mathcal{V}_{n}}}Y_{w_{\mathcal{V}_{n^{\prime}}}}}\right]_{ii^{\prime}}\right|\leq\left|Y_{w_{\mathcal{V}_{n}}}\right|\xi\varepsilon  (30)
for n,n^{\prime}=1,\ldots,\mathcal{K} and n\neq n^{\prime}. It follows from (30) that
\displaystyle\max_{i,i^{\prime}}\left|\left[\left(\widehat{\Sigma}_{Y_{w}Y_{w}}\right)^{-1}\widetilde{\Sigma}_{Y_{w}Y_{w}}\right]_{ii^{\prime}}\right|\leq\max_{n}\left|Y_{w_{\mathcal{V}_{n}}}\right|\xi\varepsilon\leq L\kappa\xi\varepsilon\ .  (31)
The last inequality is due to \displaystyle\max_{n}\left|Y_{w_{\mathcal{V}_{n}}}\right|\leq L\max_{n}\left|\mathcal{V}_{n}\right|\leq L\kappa. Then,
\begin{array}[]{rl}\rho_{w}&\displaystyle\leq\left\|\left(\widehat{\Sigma}_{Y_{w}Y_{w}}\right)^{-1}\widetilde{\Sigma}_{Y_{w}Y_{w}}\right\|_{2}\\ &\displaystyle\leq\left|Y_{w}\right|\max_{i,i^{\prime}}\left|\left[\left(\widehat{\Sigma}_{Y_{w}Y_{w}}\right)^{-1}\widetilde{\Sigma}_{Y_{w}Y_{w}}\right]_{ii^{\prime}}\right|\\ &\displaystyle\leq KL^{2}\kappa\xi\varepsilon\ .\end{array}  (32)
The first two inequalities follow from standard properties of matrix norm (Golub and Van Loan, 1996; Stewart and Sun, 1990). The last inequality is due to (31).
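Concretely, the chain of matrix norm properties used in (32) can be sketched as follows, writing A\triangleq\left(\widehat{\Sigma}_{Y_{w}Y_{w}}\right)^{-1}\widetilde{\Sigma}_{Y_{w}Y_{w}} and assuming \left|Y_{w}\right|\leq KL (each of the K sensors contributes a walk of length at most L):

\displaystyle\rho_{w}=\rho\left(A\right)\leq\left\|A\right\|_{2}\leq\sqrt{\left\|A\right\|_{1}\left\|A\right\|_{\infty}}\leq\left|Y_{w}\right|\max_{i,i^{\prime}}\left|A_{ii^{\prime}}\right|\leq KL\cdot L\kappa\xi\varepsilon=KL^{2}\kappa\xi\varepsilon\ ,

where the last step substitutes the componentwise bound (31).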
The rest of this proof utilizes the following result of Ipsen and Lee (2003), revised to reflect our notation:
Theorem 3
If \left|Y_{w}\right|\rho^{2}_{w}<1, then \log\left|\overline{\Sigma}_{Y_{w}Y_{w}}\right|\leq\log\left|\widehat{\Sigma}_{Y_{w}Y_{w}}\right|\leq\log\left|\overline{\Sigma}_{Y_{w}Y_{w}}\right|-\log\left(1-\left|Y_{w}\right|\rho^{2}_{w}\right) for any joint walk w.
Using Theorem 3 followed by (32),
\begin{array}[]{rl}\displaystyle\log\left|\widehat{\Sigma}_{Y_{w}Y_{w}}\right|-\log\left|\overline{\Sigma}_{Y_{w}Y_{w}}\right|&\leq\displaystyle\log\frac{1}{1-\left|Y_{w}\right|\rho^{2}_{w}}\\ &\leq\displaystyle\log\frac{1}{1-\left(K^{1.5}L^{2.5}\kappa\xi\varepsilon\right)^{2}}\end{array}  (33)
for any joint walk w.
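The second inequality of (33) follows from substituting the bound (32) on \rho_{w} together with \left|Y_{w}\right|\leq KL:

\displaystyle\left|Y_{w}\right|\rho^{2}_{w}\leq KL\left(KL^{2}\kappa\xi\varepsilon\right)^{2}=K^{3}L^{5}\left(\kappa\xi\varepsilon\right)^{2}=\left(K^{1.5}L^{2.5}\kappa\xi\varepsilon\right)^{2}\ .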
Finally,
\begin{array}[]{l}\overline{\mathbb{H}}\left[Z_{Y_{w^{\ast}}}\right]-\overline{\mathbb{H}}\left[Z_{Y_{\widehat{w}}}\right]\\ =\displaystyle\frac{1}{2}\left(\left(\left|Y_{w^{\ast}}\right|-\left|Y_{\widehat{w}}\right|\right)\log\left(2\pi e\right)+\log\left|\overline{\Sigma}_{Y_{w^{\ast}}Y_{w^{\ast}}}\right|-\log\left|\overline{\Sigma}_{Y_{\widehat{w}}Y_{\widehat{w}}}\right|\right)\\ \leq\displaystyle\frac{1}{2}\left(\left(\left|Y_{w^{\ast}}\right|-\left|Y_{\widehat{w}}\right|\right)\log\left(2\pi e\right)+\log\left|\widehat{\Sigma}_{Y_{w^{\ast}}Y_{w^{\ast}}}\right|-\log\left|\overline{\Sigma}_{Y_{\widehat{w}}Y_{\widehat{w}}}\right|\right)\\ \leq\displaystyle\frac{1}{2}\left(\left(\left|Y_{\widehat{w}}\right|-\left|Y_{\widehat{w}}\right|\right)\log\left(2\pi e\right)+\log\left|\widehat{\Sigma}_{Y_{\widehat{w}}Y_{\widehat{w}}}\right|-\log\left|\overline{\Sigma}_{Y_{\widehat{w}}Y_{\widehat{w}}}\right|\right)\\ \leq\displaystyle\frac{1}{2}\log\frac{1}{1-\left(K^{1.5}L^{2.5}\kappa\xi\varepsilon\right)^{2}}\ .\end{array}
The first equality is due to (18). The first, second, and last inequalities follow from Theorem 3, (23), and (33), respectively.