A streaming feature-based compression method for data from instrumented infrastructure
Abstract
An increasing number of civil engineering applications utilise data acquired from infrastructure instrumented with sensing devices. This data plays an important role in monitoring the response of these structures to excitation and in evaluating structural health. In this paper we seek to monitor pedestrian-events (such as a person walking) on a footbridge using strain and acceleration data. The rate of data acquisition and the number of sensing devices make the storage and analysis of this data a computational challenge. We introduce a streaming method to compress the sensor data whilst preserving key patterns and features (unique to different sensor types) corresponding to pedestrian-events. Numerical demonstrations of the methodology on data obtained from strain sensors and accelerometers on the pedestrian footbridge are provided to show the trade-off between compression and accuracy during and in between periods of pedestrian-events.
1 Introduction
Sensor networks are increasingly used in civil engineering applications. The data from these sensor networks are used to monitor and detect changes in the behaviour of instrumented physical structures (Cawley, 2018). In this study we are concerned with monitoring and detecting changes in data from a sensor network installed on a pedestrian bridge. The sensor network consists of two different types of sensor, accelerometers and strain gauges, that measure the vertical deflection of the bridge. Both types of sensor record measurements at a high frequency. These measurements are recorded without end, which will inevitably pose computational storage and runtime issues (Bao et al., 2011). Data compression has therefore been increasingly utilised in literature involving instrumented infrastructure (Khoa et al., 2014; Bao et al., 2019; Bose et al., 2016). Despite the computational challenges associated with the sensor data acquisition, it is of interest to study the observed measurements during pedestrian-events, such as a person walking over the bridge, so that we can monitor how the response of the structure to this particular excitation changes over time.
We seek to develop a streaming method that is able to summarise the data corresponding to these pedestrian-events, whilst representing the data obtained from the sensors in a compressed form. The compressed version of the data will therefore retain features of the original data that are prescribed by the user to be relevant. This is challenging for a number of reasons. First, constructing a sequential algorithm (for data compression) in the streaming data regime that can update at the data-acquisition rate is difficult (Lau et al., 2018); for possibly indefinite data streams this means that such analyses need to be incrementally updated rather than recomputed every time new data is observed. Second, determining which features in the original data are important, so that they are retained in the compressed version, requires expert knowledge. Embodying this expert knowledge so that relevant points of the data, with respect to user prescription, are preserved in the compressed version is not straightforward.
In this paper we propose a novel streaming method that summarises data in a compressed form, whilst preserving data relevant and corresponding to pedestrian-events. The developed method is based on segmenting the time-series, that is, breaking the time-series up into varying-length parts. The segmentation (time) points in our method are determined by a relevance score (Moniz et al., 2016; Torgo and Ribeiro, 2007). This relevance score quantifies the importance of each data point in the time-series. We use two types of relevance score: one that is based only on the data and another that uses a query shape. Using a query shape allows particular features in the data to be preserved in the compressed version of the data. The segmentation points can then be used to compress the time-series, e.g. by using a piecewise linear function between the segmentation points.
Segmentation is a commonly used method to compress time-series (Keogh et al., 2004; Fu, 2011). Typically, the segmentation points are computed using dynamic programming, which minimises an error between the original series and the compressed version (Terzi and Tsaparas, 2006). In our method, we develop a segmentation method for streaming applications, where segmentation is done as the time-series is observed in real-time. The proposed method uses optimal transport (Villani, 2008) and linear programming to compute the segmentation points. Moreover, the notion of finding patterns and features in time-series, as is done here using the relevance score, has received much attention (Cassisi et al., 2012; Keogh et al., 2004; Keogh, 1997). Our proposed method forms a bridge between this notion and data compression by finding a segmentation of a time-series that is probabilistically optimal with respect to representing these features. This allows us to compress the sensor data from the aforementioned pedestrian bridge, whilst retaining relevant data corresponding to pedestrian-events.
The remainder of this paper is organised as follows. Section 2 provides details of the sensor network installed on the pedestrian footbridge that we study, the accelerometers and strain sensors, and the data obtained from both of them. In Section 3, we introduce relevance scores that are used to weight each data point according to its importance. Sections 4 and 5 introduce the segmentation methodology using the relevance score designed for the sensor data. Section 6 reports a simulation study designed to gauge the performance of the method, and then applies the methodology to the sensor network data.
2 Strain and accelerometer data for instrumented infrastructure
The monitoring of civil infrastructure has typically been performed over finite periods of time. Engineering issues such as fatigue, corrosion and general degradation of concrete and other materials can require long-term studies. Due to storage limitations, health monitoring periods tend to be significantly shorter than the lifetime of the structure. However, certain projects require more continuous monitoring to help ensure safety and to study new materials in the environment. To prepare for this form of continuous monitoring and to provide a testbed for the development of new algorithms, we outfitted an existing indoor pedestrian footbridge with a variety of sensors including accelerometers and strain gauges. The bridge serves as a walkway over a machine shop in a building in San Francisco (Figure 1) and has a span of 55’ and a width of 88.75”. The design is a common steel truss bridge. To monitor strain, we installed foil strain gauges at the midpoint of the primary structural element of the bridge, as well as at half the distance between the middle and the ends, as seen in Figure 3. The strain gauges were set up to monitor the bending of the primary structural elements of the bridge in a standard Wheatstone bridge configuration using 2 extra gauges for temperature compensation. To keep costs low, custom hardware was designed for the 24-bit analog-to-digital converter (ADC) as a cape for a Raspberry Pi single-board computer, which provided the primary interface to the ADC. The core ADC chip used was the Texas Instruments ADS1231, which is capable of supporting 80 samples per second with a mV range. Accelerometers were manufactured by Analog Devices with a sensitivity of g and wired to a separate 10-bit ADC cape for the Raspberry Pi. Accelerometers were placed in the geometric center of each deck plate section, spaced approximately 5’ 1” apart, as seen in Figure 3.
No filtering or smoothing was performed for any of the sensors, so only raw voltage readings were stored in a remote cloud-based time-series database. Accelerometers were sampled at 40 Hz while strain gauges were sampled at 80 Hz. All devices were synchronized using SNTP (Simple Network Time Protocol) and timestamps were generated by the Raspbian operating system through a Python script at the time of sampling.
Figure 5 shows a snippet of time-series data from a single strain sensor (Cityside / left: pipier9bridgestrain2lefts0). The same measurements are shown in Figure 5 only with a truncated y-scale. Due to environmental electrical noise in the machine shop, initial autoscaled plots are dominated by numerous short outliers with high amplitude. However, for the application of using this data to monitor pedestrian-events (such as a person walking) on the bridge, what an analyzer of such strain sensor data is concerned with are the small sinusoidal-like signals towards the start of the illustrated time-series. These represent the structural response of the bridge as a pedestrian-event occurs. It is the periods of data that contain recurrent loading, the traversal of the pedestrian’s interaction with the bridge, that are of interest for our application. Next, we consider the snippet of time-series data shown in Figure 7; this time-series shows measurements from an accelerometer (Cityside: pipier9bridgeaccel59az9). Figure 7 shows a segment of this time-series that contains a visibly high-magnitude oscillatory-like signal that represents a pedestrian-event occurring near the accelerometer. This particular pattern of data obtained from an accelerometer is therefore of interest to an analyzer when aiming to monitor the response of the bridge to pedestrian-events over time. Any compressed version of these two datasets should aim to preserve relevant signals such as the ones discussed here, so that pedestrian-events can be monitored to a similar degree as with the raw data shown here. The next section explains how one can weight the data points within the time-series obtained from the accelerometers and strain sensors considered in this section according to their relevance for monitoring these pedestrian-events.
3 Relevance scores for features in time-series
In this section we introduce the notion of relevance scores. A relevance score operates on the time-series and is used to characterise the importance of each data point. Two types of relevance scores are selected for use with the accelerometer and strain data. Consider the univariate real-valued time series observed at the timestamps respectively. A relevance score transforms each value of the time-series into a real-valued score that quantifies its importance; the higher the score, the more important the data point. Denote the relevance score for a time-series as . We now introduce two different types of relevance scores.
Data-driven relevance scores
Define the following relevance scores, based on only the data in the time-series ,

(i) ,

(ii) ,

(iii) ,
where is the Heaviside function. The various relevance scores capture different features of the time-series. In (i), large scores are associated with high-magnitude values. In (ii), large-magnitude differences between contiguous pairs of the time-series lead to high relevance scores. In (iii), the scores at instance satisfying will have a nonzero value; otherwise the score is zero. This represents a procedure where only large differences between contiguous pairs of values lead to a nonzero score. The higher the value of in these relevance scores, the greater the difference between the score for relevant and non-relevant points. The scores (i) and (ii) were used in Liu and Müller (2004). For the accelerometer data we shall use the relevance scores presented in (ii) with . This choice is motivated by the oscillatory features in the data as noted in Section 2.
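As an illustration of how the three data-driven scores behave, the following is a minimal Python sketch. The displayed formulas are not reproduced here, so the forms below are assumptions inferred only from the verbal description: (i) the absolute value raised to a power, (ii) the absolute difference between contiguous values raised to a power, and (iii) the score in (ii) gated by a Heaviside step at a threshold. The function names and the parameters `p` (exponent) and `c` (threshold) are hypothetical.

```python
def magnitude_score(y, p=1.0):
    """Assumed form of (i): large scores for high-magnitude values."""
    return [abs(v) ** p for v in y]


def difference_score(y, p=1.0):
    """Assumed form of (ii): large scores for large jumps between neighbours.

    The first point has no predecessor, so it is given a score of zero here
    (an assumption made for this sketch).
    """
    if not y:
        return []
    return [0.0] + [abs(b - a) ** p for a, b in zip(y, y[1:])]


def thresholded_difference_score(y, p=1.0, c=0.0):
    """Assumed form of (iii): as (ii), but zero unless the jump exceeds c."""
    if not y:
        return []
    jumps = [0.0] + [abs(b - a) for a, b in zip(y, y[1:])]
    # The Heaviside step is modelled by the comparison jump > c.
    return [jump ** p if jump > c else 0.0 for jump in jumps]


series = [0.0, 1.0, 3.0, 3.0]
print(magnitude_score(series, p=2))
print(difference_score(series, p=2))
print(thresholded_difference_score(series, p=2, c=1.5))
```

In this sketch a larger `p` widens the gap between the scores of large and small values, which matches the role the text attributes to the exponent.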
Query-based relevance scores
Another type of relevance score can take a query shape as an additional input, which captures a feature of the original data that we wish to retain in the compressed version of the data. A query shape is described by the real-valued vector , with odd. One such score is
(1) 
where , also and is the Euclidean norm. The relevance score in (1) is used for the strain sensor data so that the sinusoidal-wave type shape seen in the previous section is preserved in the compressed data. For scale and shift invariance, there are warping distances (Keogh and Ratanamahatana, 2005; Paparrizos and Gravano, 2015) that can be used instead of the Euclidean metric used above; the Euclidean metric is sufficient for the scope of this paper. The better the subsequence of the time-series centred at matches the query shape, the greater the importance associated with the value through the relevance score.
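The idea of scoring each point by how well the window centred on it matches a query shape can be sketched as follows. This is not the score in (1), whose exact form is not reproduced here: the sketch only assumes a sliding window of odd length, the Euclidean distance to the query, and a placeholder mapping `1 / (1 + distance)` raised to a hypothetical exponent `p`, so that a perfect match scores one. Points too close to either end to hold a full window are given a score of zero, which is also an assumption.

```python
import math


def query_relevance(y, query, p=1.0):
    """Score each point by the match between its centred window and the query.

    Placeholder mapping (an assumption): score = (1 / (1 + distance)) ** p.
    """
    m = len(query)
    if m % 2 == 0:
        raise ValueError("query length must be odd so that a window has a centre")
    half = m // 2
    scores = [0.0] * len(y)
    for i in range(half, len(y) - half):
        window = y[i - half:i + half + 1]
        # Euclidean distance between the centred window and the query shape.
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(window, query)))
        scores[i] = (1.0 / (1.0 + distance)) ** p
    return scores


series = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
bump = [1.0, 2.0, 1.0]
print(query_relevance(series, bump))
```

The highest score falls on the centre of the bump in `series`, which is the behaviour the text describes for subsequences that match the query shape well.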
The following example will illustrate the behaviour of both types of relevance score. Figure 11 shows an example of an ECG time-series of length . The query shape, of length , is displayed in Figure 11. Figure 11 shows the relevance scores in (1) for the time-series, with . Note that the relevance score is higher for subsequences of the original ECG time-series that match the query shape well, e.g. during the three occurrences of peak signal. On the other hand, Figure 11 shows the relevance scores in (ii) for the time-series, with . Due to the high magnitude of the peak signals relative to the rest of the time-series, these relevance scores are similar to those in Figure 11, but exhibit slightly more variance.
The next section will introduce a method that divides a time-series based on its relevance scores. This segmentation of the time-series is used to generate a compressed version of the data.
4 Segmentation and compression
Relevance scores, introduced in Section 3, characterise the features of interest in a time-series. This section introduces a method that uses the relevance scores to divide the time-series at segmentation (time) points. Interpolation between these segmentation time points leads to a compressed version of the time-series. First, Section 4.2 describes how the segmentation points are computed in the static case where the time-series is not a data stream. When streaming data is considered, the method to divide the time-series at segmentation points is required to be recursive, as it is assumed infeasible to recompute this segmentation after every point is added to the time-series. Section 5 therefore introduces a method to incrementally compute an approximation to the segmentation points that one would obtain by following the static methodology in Section 4.2, for use in the streaming data setting.
4.1 Segmentation of time-series
The segmentation of a time-series leads to time-series data compression. As mentioned above, this compression is important for data acquired by sensor networks fitted to instrumented infrastructure, in order to reduce the complexity and increase the efficiency of any analysis. Segmentation can be used for time-series data compression by breaking the original time-series up into segments; one can then reconstruct a compressed version of the original time-series using these segments alongside some interpolation method. The type of compressed reconstruction that this paper considers is known as a piecewise aggregate approximation (PAA) (Fu, 2011). For the time-series observed at the timestamps , we denote segmentation points by , where and , for all . Since , the data is compressed. These points define a unique approximation , which is a compressed reconstructed version of the original time-series point , for . Exact forms for a couple of these compressed reconstructions are given in Sec. 4.3. A common metric to assess the space-efficiency of the compression of the original time-series is the compression ratio, given by,
In practice, a segmentation algorithm will typically be implemented and evaluated by (a) specifying a desired compression ratio, and then reporting the error of the approximation, or (b) specifying a condition for the segmentation to satisfy (e.g. maximum approximation error), and then reporting the compression ratio. The methodology in this paper is in line with (a). A naive choice for the segmentation points would simply be evenly spaced points; however, for relatively small values of this would lead to important signals in the data being lost in the compression. Therefore the segmentation problem is concerned with choosing the points for some predefined objective, such as minimizing the approximation error. For example, our predefined objective here is to preserve key, relevant features in the compression. Dynamic programming can be implemented to find the particular segmentation which minimizes the total error of the reconstruction away from the original data; this is at the computational expense of (Terzi and Tsaparas, 2006). In the next two sections we propose a segmentation algorithm that will focus on points in the original time-series data that have high relevance to an analyzer.
4.2 Computing the segmentation points
The contribution of this work, a methodology for time-series segmentation that preserves relevant features of the original data, is now described for the case where the time-series is not being streamed. We consider the streaming case in Sec. 5. We seek a segmentation of a time-series that is constructed using the segmentation points , where . Recall that the original time-series points were assigned relevance scores in Sec. 3 that described their importance. These can be made into weights via the normalization, . The segmentation points can also be assigned weights, and we would like each of them to be equally relevant in the compression. Therefore, let , where for , be these weights.
We will now describe the method used to compute the segmentation points , based on the method of optimal transport (Villani, 2008). At first glance, it seems reasonable to resample the segmentation points, using any standard resampling method (Douc and Cappé, 2005), from the weighted time-series points. This approach is not ideal for the objectives in this paper, since these resampling methods offer few guarantees of a particular placement for a segmentation point. Instead, in this paper we use a deterministic linear transformation from the original timestamps to the segmentation points. It will be shown later in this section and in Sec. 4.3 that this transformation allows one to guarantee particular placements of the segmentation points and prove properties about the corresponding compressed reconstruction. Define the coupling matrix , with the constraints,
(2) 
Using the coupling matrix, segmentation points can be computed for this methodology via the following linear transformation,
(3) 
for . We are interested in the particular coupling matrix, known as the optimal coupling, that solves the well-known Monge-Kantorovich optimization problem,
(4) 
In our case, the scheme chooses segmentation points that are as close to as possible whilst satisfying the constraints in (2). Linear programming can be used to numerically compute the optimal coupling , at the computational expense of . The matrix will have at most nonzero elements. The pseudocode of this algorithm is given in Algorithm 1 in Appendix A. Note that the timestamps are not an input for this algorithm; this is because all timestamps are assumed ordered. The segmentation scheme does not use information about the reconstruction, only about which features of the original time-series the reconstruction would be required to preserve. This is why it can utilise linear programming to solve the problem, and hence be significantly cheaper than alternative segmentation methods. If a particular application requires the segmentation points to take integer values, then we can use , for , instead of (3). The procedure in this section was proposed as a resampling scheme for nonparametric data assimilation in Reich (2013). The constraints in (2), which dictate the form that the optimal coupling matrix takes, are influenced by the relevance scores of the time-series points. By computing the segmentation points in this section using the coupling matrix , we therefore designate more segmentation points towards periods of timestamps that correspond to time-series points with high relevance scores.
Note that following Algorithm 1 in Appendix A, each segmentation point falls inside a particular interval, i.e.
(5) 
for and with . This interval forms an important part of the extension of this methodology to the streaming data case considered in Sec. 5. The expression in (5) is also an important aspect of the proposed methodology, as it ensures that there will be a segmentation point within a certain period of data, even if none of the points within it are particularly relevant. This is useful in many applications where sensor data have long-term drifts in background noise; this guaranteed interval will allow sparsely placed segmentation points to keep track of this drift. The next section will now consider how to reconstruct a compressed version of the original time-series using the segmentation points computed in this section. This reconstruction will therefore preserve highly relevant periods of time-series data in the compression to a greater extent than irrelevant periods of data.
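The placement behaviour described in this section can be illustrated with a small Python sketch. It is not the linear-programming implementation of Algorithm 1. It rests on three assumptions that are not confirmed by the text reproduced here: that the constraints in (2) are non-negativity plus marginals equal to the normalized relevance weights on one side and equal weights on the other; that the optimal coupling preserves the ordering of the timestamps, as it does for one-dimensional transport with a convex distance cost; and that the transformation in (3) is the coupling-weighted mean of the timestamps scaled by the number of segmentation points. Under these assumptions the coupling can be built greedily by the north-west corner rule. The function names are hypothetical.

```python
def monotone_coupling(weights, m):
    """Order-preserving coupling between N weighted points and m equal-mass bins.

    Row j holds the mass that segmentation point j takes from each original
    point; rows sum to 1/m and columns sum to the normalized weights.
    """
    total = float(sum(weights))
    if not weights or total <= 0.0:
        raise ValueError("weights must be non-empty with a positive sum")
    w = [x / total for x in weights]
    n = len(w)
    coupling = [[0.0] * n for _ in range(m)]
    eps = 1e-12                      # tolerance for floating-point residue
    i, j = 0, 0
    supply, demand = w[0], 1.0 / m   # mass left at point i, mass needed by bin j
    while i < n and j < m:
        moved = min(supply, demand)
        coupling[j][i] += moved
        supply -= moved
        demand -= moved
        if supply <= eps:            # point i is exhausted, move to the next one
            i += 1
            supply = w[i] if i < n else 0.0
        if demand <= eps:            # bin j is full, move to the next one
            j += 1
            demand = 1.0 / m
    return coupling


def segmentation_points(timestamps, weights, m):
    """Coupling-weighted mean timestamp for each of the m segmentation points."""
    coupling = monotone_coupling(weights, m)
    return [m * sum(c * t for c, t in zip(row, timestamps)) for row in coupling]


# One highly relevant point at timestamp 2 pulls the segmentation points to it.
print(segmentation_points([0, 1, 2, 3, 4], [1, 1, 8, 1, 1], 3))
```

Each segmentation point in this sketch lies between the timestamps at which the cumulative weight crosses consecutive multiples of 1/m, which is one way to read the guaranteed interval discussed above.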
4.3 Compressed reconstruction
This section will explore the PAA (Fu, 2011) compressed reconstruction , for , of the original time-series using the segmentation points computed within the methodology presented in the previous section. Let the segmentation points have the indices in the original time-series so that , for . Two examples of PAA reconstructions are now given: define the piecewise constant approximation,
(6) 
and piecewise linear approximation,
(7) 
for and where and . Another simple alternative to this approximation is the piecewise regression approximation.
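A short Python sketch of the two reconstructions follows. The exact expressions (6) and (7) are not reproduced here, so the sketch makes assumptions: the piecewise constant version holds the value observed at the most recent segmentation point, and the piecewise linear version interpolates between the values at neighbouring segmentation points. Segmentation points are passed as sorted integer indices `seg_idx` into the original series; the function and parameter names are hypothetical.

```python
from bisect import bisect_right


def piecewise_constant(y, seg_idx):
    """Hold the value at the last segmentation index at or before each point."""
    out = []
    for i in range(len(y)):
        k = max(bisect_right(seg_idx, i) - 1, 0)
        out.append(y[seg_idx[k]])
    return out


def piecewise_linear(y, seg_idx):
    """Linearly interpolate between values at neighbouring segmentation indices."""
    out = []
    last = len(seg_idx) - 1
    for i in range(len(y)):
        k = bisect_right(seg_idx, i) - 1
        if k < 0:                    # before the first segmentation point
            out.append(y[seg_idx[0]])
        elif k >= last:              # at or after the final segmentation point
            out.append(y[seg_idx[-1]])
        else:
            a, b = seg_idx[k], seg_idx[k + 1]
            frac = (i - a) / (b - a)
            out.append(y[a] + frac * (y[b] - y[a]))
    return out


series = [0.0, 5.0, 2.0, 7.0, 4.0]
print(piecewise_constant(series, [0, 2, 4]))
print(piecewise_linear(series, [0, 2, 4]))
```

Only the values at the segmentation indices are needed to rebuild either reconstruction, which is where the compression comes from.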
We now consider the error between the relevance scores of the compressed reconstruction, which uses the segmentation points computed in the previous section, and the relevance scores of the original time-series. This error metric is of particular interest to the scope of this paper since the proposed methodology is designed for when the practitioner would like to preserve relevant features in the compressed version. We shall assume a piecewise reconstruction satisfying
where . Recall that are the relevance scores of the original timeseries, and now let be the relevance score of for . Then,
(8) 
for all . The derivation of this is shown in Appendix B.
5 Streaming time-series segmentation
This section will now remove the assumption made in the previous section that the time-series is not streamed. In the streaming data case, new data points are added to the time-series sequentially, possibly indefinitely. We propose a recursive approximation to the segmentation points that one would obtain from using the methodology presented in the previous section; this approximation is updated every time a new data point is added to the time-series. An approximation is required since the segmentation points are derived using the linear transformation in (3) and this transformation is affected by the constraints in (2). These constraints are dependent on the normalized weights , for ; each time a data point is added to the time-series all previous weights will change, leading to the position of all segmentation points changing. Moreover, since the approximation is recursive, it is more efficient than recomputing the segmentation points using the methodology in the previous section each time a point is added to the time-series. There are two aspects to this approximation that are discussed in this section. First, we explain how one can update the number of segmentation points used for the compression as the time-series increases in size. Second, we outline how we approximate the segmentation points; a user-defined approximation error controls how accurate one wants the approximation to be. A general outline of the approximation is given below:

(1) Initialize the algorithm by observing the first points of the time-series, setting , to be the user’s choice, and . Also initialize:

,

,

.

(2) Prescribe a user-defined level of accuracy and set .

(3) Set , and observe the new data point in the time-series (or using a buffer, if required for a query-driven relevance score), at the timestamp , and compute the associated relevance score . Set , and .

(4) Update and prune , and : Set . If:
(9)
then implement:

(i) Set , , and .

(ii) Set and while , implement:

If , then set . On the other hand, if , then compute
(10)
and set , and . Set , and .

(5) For implement:

(6) Return to step (3).
The procedure outlined in the steps above is given in more detail in Appendix C. The intuition behind the approximation is the following. The vectors , and keep a synopsis of the timestamps, relevance scores, and products of timestamps and relevance scores over the time-series. In step (4ii), some elements of these vectors are combined and summed together when the corresponding relevance scores are low; these elements are unlikely to have segmentation points on them. As this synopsis is pruned over time, generating approximations from its elements instead of from the entire time-series will be efficient. Now, since we know that the segmentation point , for , is within the interval , it is important to always maintain approximations to the endpoints of the interval in (5); this is done in step (5i). Using the condition in (10) we have that approximations to the endpoints of the interval in (5) satisfy,
(11) 
A consequence of this on the accuracy of the segmentation point approximation, given by a rolling weighted sum in step (5ii), is that,
(12) 
for and where . This bound on the segmentation points is proved in Appendix D.
As an example to see how the number of segmentation points is updated in step (4) as data points are added to the time-series, consider the following time-series:
We assume that we start with , and let with . Therefore, and . After the fifth data point, , enters the time-series at the timestamp , we have that and . If , we have that and therefore . In this case we would increase the number of segmentation points by one. If on the other hand , we have that and therefore . In this case the number of segmentation points would stay as . The next section gives a numerical demonstration of this approximation to the segmentation points of a time-series, in the streaming data regime.
6 Numerical demonstrations
This section demonstrates the methodology presented throughout the paper by applying it to simulated streaming data and to data from the accelerometers and strain gauges installed on the pedestrian footbridge introduced in Sec. 2. These demonstrations illustrate the effectiveness of the proposed compression technique for efficiently managing data from instrumented infrastructure, whilst preserving key features in the original sensor data.
6.1 Simulated streaming data
This section will investigate the effectiveness of the streaming data approximation, presented in Sec. 5, for the segmentation points obtained from the optimal transport algorithm introduced in Sec. 4. Recall that this approximation is required in the streaming data regime since it is assumed infeasible to recompute the segmentation points every time a new element is added to the time-series. The implementation of the approximation is described in Appendix C and segmentation points are added on-the-fly when the condition in (9) is met. The simulated time-series considered in this problem is,
(13) 
where , and the relevance score used is , with . This time-series is chosen to simulate frequent occurrences of a particular magnitude-based feature in the data, which should allow the segmentation points to shift periodically to the peaks of the sinusoidal waves when they enter the time-series. Initially we start with , and . The value is used. After all elements have been added to the time-series, there are segmentation points.
First, Figure 13 shows the relative error of the approximate segmentation points,
with , after every 5000th element has been added to the time-series. The theoretical bound in (12) is also shown. The relative error stays approximately constant over time, and below the bound. Next, Figure 13 shows the runtime (in seconds) of computing the segmentation points via the approximation presented in Algorithm 5 within Appendix C, after every 5000th element has been added to the time-series. It shows this runtime in comparison to that of computing the actual segmentation points using an implementation of linear programming (Algorithm 1). Note that, for large , the runtime of the approximation is far less than that of rerunning linear programming each time a new element is added to the time-series. This shows the feasibility of applying the segmentation methodology (or an approximation of it) proposed in this paper to time-series obtained from a sensor acquiring data at a fast pace. Finally, Figure 14 shows the ratio of the reconstruction errors obtained from a piecewise linear reconstruction, in (7), using the approximate segmentation points and using those obtained by continuously implementing Algorithm 1. Note that this ratio of errors is approximately equal to one over the data stream, showing that there is negligible loss in reconstruction accuracy in computing approximate segmentation points instead of using the linear programming algorithm in Algorithm 1.
6.2 Accelerometer data
This section applies the proposed compression methodology to a time-series generated by accelerometers installed on the pedestrian footbridge introduced in Sec. 2. Data from one of the sensors (Cityside: pipier9bridgeaccel59az9) in the described accelerometer network is considered here. This time-series is acquired at 40 Hz over a total time of 20 seconds. There are three signals in the time-series, seemingly corresponding to a pedestrian-event occurring on the bridge near the sensor three times. As mentioned above, accelerometer signals have an oscillatory-like shape, and therefore the relevance score we use to generate the segmentation points in this example corresponds to . The intuition behind this choice is that oscillations in the time-series are larger during a signal than during background sensor noise. This can be seen in Figure 15, where segmentation points, obtained using the methodology presented in Sec. 4, are also shown.
Notice that the segmentation points gather around the points where the oscillations are largest, which could represent a pedestrian-event being detected by the accelerometer. On the other hand, they are more sparsely spread out at times when there appears to be only sensor noise present. Interestingly, the third and final signal has the least dense concentration of segmentation points of the three signals, since it does not exhibit oscillations as large as those of the other two.
6.3 Strain sensor data
This section now applies the proposed compression methodology to a time-series obtained from a strain sensor (Cityside / left: pipier9bridgestrain2lefts0) within the network installed on the pedestrian bridge introduced in Sec. 2. Relevant signals within the time-series, seemingly corresponding to a pedestrian-event occurring near the sensor, appear as a sinusoidal-like wave (see Figure 5). Inspired by the form of this signal, the relevance score used to generate the segmentation points in this example corresponds to that in (1), where the query shape is given by,
where and are the sample mean and standard deviation of the subsequence . Normalization is therefore employed here to aid the pattern detection metric in (1). Figures 17 and 17 show the placements of the segmentation points for two snippets of data from the strain sensor (acquired at 80 Hz), each containing a single signal that seemingly corresponds to a pedestrian-event. In both cases, the segmentation points are very sparse at all times outside the immediate interval of the signal; instead they are concentrated on the signal itself.
An interesting aspect of the piecewise linear compressed reconstruction obtained from the segmentation points is its relevance score. A piecewise-linear approximation using the segmentation points computed in Figure 17 is obtained, and Figure 18 shows the value of the relevance score in (1) for this approximation alongside that of the original timeseries. The relevance scores match well for large values (corresponding to values which are close to segmentation points), but do not match well for the lower, less relevant values. This shows the benefit of the proposed methodology in preserving the key features of the original timeseries that are specified by the relevance score used. The error of the relevance score for compressed reconstructions using segmentation points obtained via the proposed methodology is investigated in Sec. 4.3.
6.3.1 Reconstruction from compression
We now assess how well the compressed reconstruction of a timeseries from the strain sensor instrumented on the pedestrian footbridge introduced in Sec. 2, obtained using the compression methodology in this paper, represents the original timeseries in a lower dimensional form. To do this, we concentrate on the reconstruction error and space-efficiency within the parts of the original strain sensor timeseries that are highly relevant to the analyzer: the signals corresponding to pedestrianevents. A 400 second long timeseries was obtained from the strain sensor. This snippet contained 5 pedestrianevent signals. Five 2 second long intervals, containing these signals, were extracted manually, and the nonevent periods in between these intervals were recorded separately. Segmentation points were computed as in the previous section, and a compressed reconstruction was obtained using a piecewise regression approximation to the original timeseries. The reconstruction during one of the extracted event intervals containing a signal is shown in Figure 20, and the reconstruction during one of the nonevent periods in between the extracted intervals is also shown in Figure 20. Compression ratios, and the average relative squared reconstruction error, , where corresponds to all indices within a particular time interval of the reconstruction, were computed for each extracted event interval and each nonevent period in between the extracted intervals. These are shown in Table 1. One can note from the aforementioned figures and this table that the reconstruction is refined, leading to lower error, in the periods of high relevance (signals seemingly caused by pedestrian events). Compression ratios are lower in these periods than during the nonevent periods, as more segmentation points are concentrated on them.
The higher compression ratios in the nonevent periods, coupled with the lower error during the extracted event intervals, show that the reconstruction stays close to the original timeseries during the relevant periods while storing the data in a much lower dimensional form.
Table 1: Compression ratio and relative squared reconstruction error for each nonevent period and each extracted event interval.
Period      Compression Ratio  Relative Squared Reconstruction Error
Nonevent 1  183.1              0.99
Nonevent 2  168.5              0.24
Nonevent 3  143.4              0.41
Nonevent 4  149.0              0.44
Nonevent 5  190.5              0.41
Event 1     5.31               0.02
Event 2     5.94               0.03
Event 3     4.80               0.01
Event 4     4.81               0.02
Event 5     9.18               0.05
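The two quantities reported per period can be sketched as follows. This is an illustration under stated assumptions rather than the computation used for the results above: the reconstruction here interpolates linearly between retained points instead of fitting a piecewise regression, the compression ratio is taken to be the number of original points divided by the number of retained points, and the error is taken to be the squared error divided by the squared magnitude of the original values over the interval. The function names are hypothetical.

```python
def piecewise_linear(x, seg):
    """Interpolate linearly between the retained indices in `seg`.

    `seg` is assumed to be sorted and to contain the first and last index.
    """
    xhat = [0.0] * len(x)
    for a, b in zip(seg[:-1], seg[1:]):
        for i in range(a, b + 1):
            t = (i - a) / (b - a)
            xhat[i] = (1 - t) * x[a] + t * x[b]
    return xhat


def compression_ratio(n_points, n_kept):
    """Original number of points per retained point (assumed definition)."""
    return n_points / n_kept


def relative_squared_error(x, xhat, idx):
    """Squared error relative to the squared signal over the indices `idx`."""
    num = sum((x[i] - xhat[i]) ** 2 for i in idx)
    den = sum(x[i] ** 2 for i in idx)
    return num / den


# Toy series and three retained points (both ends plus the midpoint).
x = [float(i * i) for i in range(9)]
seg = [0, 4, 8]

xhat = piecewise_linear(x, seg)
ratio = compression_ratio(len(x), len(seg))
err = relative_squared_error(x, xhat, range(len(x)))
print(ratio, round(err, 4))
```

Evaluating `relative_squared_error` over the indices of a single event interval or nonevent period, rather than over the whole series, gives one value per period.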
7 Conclusion and Discussion
This paper has presented a compression technique for data streamed from instrumented infrastructure, such as bridges, roads and tunnels fitted with sensor networks, for applications including condition and structural health monitoring. When data is acquired frequently relative to any changes exhibited in the structure, it is especially important to compress the data down to a manageable quantity for storage and analysis. Methodology is presented for cases where data is given in a single batch, and where data is acquired sequentially in an indefinite stream. The proposed compression technique produces a piecewise aggregate approximation (segmented timeseries) that preserves particular user-defined patterns or features that exist in the original timeseries. The motivating example used in this paper takes, as the features that one would like to preserve in a compression, particular patterns of signals from accelerometers and strain sensors instrumented on a pedestrian footbridge that could represent a pedestrianevent (such as a person walking) in the vicinity of these sensors.
The methodology works as follows. A user-defined relevance score is first used to create weights for each data point in a timeseries; each point is weighted according to how important it is to the appearance of the features or patterns that one would like to preserve in the compression. Then optimal transport is used to find the optimal piecewise segmentation that preserves sequences of points within the timeseries that have high relevance. This can be done via linear programming, which can be implemented quickly even for relatively large datasets. In the case where the data is streamed sequentially over time (e.g. from a sensor network instrumented on an operational structure), a bounded approximation to the optimal piecewise segmentation can be maintained over time and queried in a significantly reduced runtime relative to recomputing the linear programming result.
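To make the two steps concrete, the sketch below places segmentation points where the cumulative relevance weight crosses equally spaced levels. This quantile rule is a swapped-in simplification: it assumes the transport problem matches the weighted points, in time order, to segments of equal relevance mass, and it is not the linear program of Algorithm 1, so its output may differ from that algorithm. The function name `segmentation_points` and the toy weights are hypothetical.

```python
def segmentation_points(weights, k):
    """Indices at which the cumulative weight first reaches j/k of the total.

    Repeated indices (one point crossing several levels) are collapsed,
    so fewer than k points may be returned.
    """
    total = sum(weights)
    points, cum, j = [], 0.0, 1
    for i, w in enumerate(weights):
        cum += w
        # The small tolerance guards against floating-point round-off.
        while j <= k and cum >= j * total / k - 1e-12:
            if not points or points[-1] != i:
                points.append(i)
            j += 1
    return points


# Toy relevance weights: low during noise, high during a "signal".
weights = [1] * 10 + [10] * 10 + [1] * 10

points = segmentation_points(weights, k=6)
print(points)
```

In this toy example five of the six points fall inside the high-relevance block (indices 10 to 19), which mirrors the concentration of segmentation points on signals seen in the numerical demonstrations.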
The features that the compression should preserve inform the choice of the relevance score used alongside the proposed methodology. For example, similarity search and distance measures can be used to preserve a particular query shape or pattern in the timeseries data. Future extensions of this work should explore the properties of compressions constructed using the proposed methodology alongside more exotic relevance scores (e.g. Markov models (Ge and Smyth, 2000) or probabilistic warping (Bautista et al., 2012)) and compositions of relevance scores. Composing relevance scores, for example, would be a natural way to extend this methodology to compressing a timeseries whilst preserving multiple important features or patterns, such as data acquired from sensors that produce different signals for different types of events (e.g. railway bridges, where different types of trains frequently pass over them).
The motivating example of applying the proposed methodology to compressing datasets obtained from a pedestrian footbridge instrumented with strain sensors and accelerometers is considered via a series of numerical demonstrations towards the end of this paper. These demonstrations highlight the effectiveness of the compression at preserving key features (e.g. a sinusoidal-type wave of measurements) in the original data from the strain sensors and accelerometers that represent a pedestrianevent near the sensor location. Due to the choice of these relevance scores for the implementation of the proposed methodology, the compression ignores large-magnitude noise and outliers (possibly due to electrical currents) that often make traditional analyses of raw data obtained from strain sensors and accelerometers difficult. The reconstructed compressed data obtained from this methodology exhibits low error, with respect to the original data, during occurrences of pedestrianevents, and high compression ratios (a metric for the space-efficiency of the compression) during unimportant periods of data. These demonstrated properties are necessary in alleviating the high complexity of storing and analyzing streaming sensor data from instrumented infrastructure. Therefore this work contributes towards important research efforts to improve structural health and condition monitoring systems used alongside novel and contemporary sensing technologies.
References
 Bao et al. (2011) Bao Y, Beck JL and Li H (2011) Compressive sampling for accelerometer signals in structural health monitoring. Structural Health Monitoring 10(3): 235–246.
 Bao et al. (2019) Bao Y, Tang Z and Li H (2019) Compressive-sensing data reconstruction for structural health monitoring: A machine-learning approach. arXiv preprint arXiv:1901.01995.
 Bautista et al. (2012) Bautista MA, Hernández-Vela A, Ponce V, Perez-Sala X, Baró X, Pujol O, Angulo C and Escalera S (2012) Probability-based dynamic time warping for gesture recognition on RGB-D data. In: International Workshop on Depth Image Analysis and Applications. Springer, pp. 126–135.
 Bose et al. (2016) Bose T, Bandyopadhyay S, Kumar S, Bhattacharyya A and Pal A (2016) Signal characteristics on sensor data compression in IoT - an investigation. In: 2016 13th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON). IEEE, pp. 1–6.
 Cassisi et al. (2012) Cassisi C, Montalto P, Aliotta M, Cannata A and Pulvirenti A (2012) Similarity measures and dimensionality reduction techniques for time series data mining. In: Advances in data mining knowledge discovery and applications. InTech.
 Cawley (2018) Cawley P (2018) Structural health monitoring: Closing the gap between research and industrial deployment. Structural Health Monitoring 17(5): 1225–1244.
 Douc and Cappé (2005) Douc R and Cappé O (2005) Comparison of resampling schemes for particle filtering. In: ISPA 2005. Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, 2005. IEEE, pp. 64–69.
 Fu (2011) Fu T (2011) A review on time series data mining. Engineering Applications of Artificial Intelligence 24(1): 164–181.
 Ge and Smyth (2000) Ge X and Smyth P (2000) Deformable Markov model templates for time-series pattern matching. In: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp. 81–90.
 Greenwald and Khanna (2001) Greenwald M and Khanna S (2001) Space-efficient online computation of quantile summaries. In: ACM SIGMOD Record, volume 30. ACM, pp. 58–66.
 Keogh (1997) Keogh E (1997) A fast and robust method for pattern matching in time series databases. In: Proceedings of WUSS 97.1, volume 99.
 Keogh et al. (2004) Keogh E, Chu S, Hart D and Pazzani M (2004) Segmenting time series: A survey and novel approach. In: Data mining in time series databases. World Scientific, pp. 1–21.
 Keogh and Ratanamahatana (2005) Keogh E and Ratanamahatana CA (2005) Exact indexing of dynamic time warping. Knowledge and information systems 7(3): 358–386.
 Khoa et al. (2014) Khoa NLD, Zhang B, Wang Y, Chen F and Mustapha S (2014) Robust dimensionality reduction and damage detection approaches in structural health monitoring. Structural Health Monitoring 13(4): 406–417.
 Lau et al. (2018) Lau FDH, Butler L, Adams N, Elshafie M and Girolami M (2018) Real-time statistical modelling of data generated from self-sensing bridges. In: Proceedings of the Institution of Civil Engineers - Civil Engineering.
 Liu and Müller (2004) Liu X and Müller HG (2004) Functional convex averaging and synchronization for time-warped random curves. Journal of the American Statistical Association 99(467): 687–699.
 Moniz et al. (2016) Moniz N, Branco P and Torgo L (2016) Resampling strategies for imbalanced time series. In: Data Science and Advanced Analytics (DSAA), 2016 IEEE International Conference on. IEEE, pp. 282–291.
 Paparrizos and Gravano (2015) Paparrizos J and Gravano L (2015) k-Shape: Efficient and accurate clustering of time series. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, pp. 1855–1870.
 Reich (2013) Reich S (2013) A nonparametric ensemble transform method for Bayesian inference. SIAM Journal on Scientific Computing 35(4): A2013–A2024.
 Terzi and Tsaparas (2006) Terzi E and Tsaparas P (2006) Efficient algorithms for sequence segmentation. In: Proceedings of the 2006 SIAM International Conference on Data Mining. SIAM, pp. 316–327.
 Torgo and Ribeiro (2007) Torgo L and Ribeiro R (2007) Utility-based regression. In: European Conference on Principles of Data Mining and Knowledge Discovery. Springer, pp. 597–604.
 Villani (2008) Villani C (2008) Optimal transport: old and new, volume 338. Springer Science & Business Media.
Appendix A Linear programming algorithm
The algorithm for linear programming is given in Algorithm 1.
Appendix B Proof of the error in relevance of reconstructions
This section explains the derivation of the error bound in the relevance scores of the compressed reconstruction with respect to that of the original timeseries, given in (8). We start by assuming smoothness of the relevance score ,
where . Let . We will now assume that the piecewise linear reconstruction in (7) is used, and a relevance score that just depends on , for , is used. We then investigate two cases: (a) , (b) . For case (a), it is clear that at any point , the error of the relevance score for the reconstructed timeseries is,
assuming . Now for case (b), we know that the last point in the interval will be the value of , where . Since a piecewise linear approximation is assumed, we have , and therefore
We also note that of course . Then at any timestamp , the error of the relevance score for the reconstructed timeseries is,
Therefore for either case, and for the assumptions placed on the reconstruction, we have that
for all .
∎
Appendix C Streaming approximation to segmentation points
The approximation to the segmentation points outlined in Sec. 5 is explained in more detail here. The construction of the approximation is based on storing a synopsis of the data points in the timeseries, and is inspired by the work in Greenwald and Khanna (2001). The synopsis is a set formed of the triples , for , where the values , for , are a succinct collection of timestamps within the timeseries. They are such that , with and . The values represent the sum of over all the timestamps . Finally the values represent the sum of the products of and , over all the timestamps . The approximation is more efficient than recomputing the segmentation points via Algorithm 1 since the approximation operates only on the data points stored in this synopsis, and given that . The approximation starts by initializing the values , , and in step (1) of the outline in Sec. 5. These triples are maintained over time to generate the approximations , to the segmentation points , using Algorithms 2, 3 and 4 below. Then, an approximation to the segmentation points , for , can be queried at any time via Algorithm 5. The triples are maintained as follows. Every time a new element is added to the timeseries at the timestamp , Algorithm 4 is run to update the synopsis. This routine uses Algorithm 2 and Algorithm 3; the latter algorithm allows the synopsis to be cut down in size in order to make the approximation efficient.
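The idea of maintaining and querying a bounded synopsis can be sketched as below. This is a hypothetical simplification and does not reproduce Algorithms 2 to 5: each triple is assumed to hold the last timestamp it covers, the sum of the relevance weights and the sum of weight times value over those timestamps, and the pruning rule (merge the adjacent pair with the smallest combined weight once a size cap is exceeded) is an assumption. The class name `Synopsis` and its methods are illustrative only.

```python
class Synopsis:
    """Bounded summary of a weighted stream (illustrative, assumed design)."""

    def __init__(self, max_size):
        self.max_size = max_size
        # Each triple: [last timestamp, sum of weights, sum of weight * value]
        self.triples = []

    def add(self, t, w, x):
        """Insert a new point, then prune if the synopsis is too large."""
        self.triples.append([t, w, w * x])
        if len(self.triples) > self.max_size:
            self._merge_lightest_pair()

    def _merge_lightest_pair(self):
        """Merge the adjacent pair of triples with the smallest total weight."""
        mass = [self.triples[i][1] + self.triples[i + 1][1]
                for i in range(len(self.triples) - 1)]
        i = min(range(len(mass)), key=mass.__getitem__)
        a, b = self.triples[i], self.triples[i + 1]
        self.triples[i:i + 2] = [[b[0], a[1] + b[1], a[2] + b[2]]]

    def query(self, k):
        """Approximate timestamps splitting the weight into k equal parts."""
        total = sum(tr[1] for tr in self.triples)
        points, cum, j = [], 0.0, 1
        for t, w, _ in self.triples:
            cum += w
            while j <= k and cum >= j * total / k - 1e-12:
                if not points or points[-1] != t:
                    points.append(t)
                j += 1
        return points


# Stream a toy series: low weights, then a high-relevance block, then low.
weights = [1] * 10 + [10] * 10 + [1] * 10
synopsis = Synopsis(max_size=12)
for t, w in enumerate(weights):
    synopsis.add(t, w, 0.0)

approx_points = synopsis.query(6)
print(len(synopsis.triples), approx_points)
```

The query touches only the stored triples, so its cost depends on the synopsis size rather than on the length of the stream, which is the efficiency argument made in this appendix.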
Appendix D Proof of streaming approximation error
This section provides a proof of the bound given in (12) for the error of the approximation to the segmentation points , for . First, let and recall that the indices and are the smallest and largest nonzero elements in respectively. Also recall that and . Assume that the approximations in Algorithm 5 to the indices and (corresponding to the ’th and ’th triple in respectively) are given by and . Note that due to the way that is constructed and maintained, we have that if we must have , and that if we must have . Also recall that
from (11). The error of the streaming approximation to the segmentation points , for , can be expressed as,
(14) 
where
and
Now, let and . Then let and . Finally let