Real-time processing of high resolution video and 3D model-based tracking in remote tower operations
High quality video data is a core component in emerging remote tower operations as it inherently contains a huge amount of information on which an air traffic controller can base decisions. Various digital technologies also have the potential to exploit this data to bring enhancements, including tracking ground movements by relating events in the video view to their positions in 3D space. The total resolution of remote tower setups with multiple cameras often exceeds 25 million RGB pixels and is captured at 30 frames per second or more. It is thus a challenge to efficiently process all the data in such a way as to provide relevant real-time enhancements to the controller. In this paper we discuss how a number of improvements can be implemented efficiently on a single workstation by decoupling processes and utilizing hardware for parallel computing. We also highlight how decoupling the processes in this way increases resilience of the software solution in the sense that failure of a single component does not impair the function of the other components.
Since the deployment of the first remote tower implementation in Örnsköldsvik, Sweden in 2014 interest in remote tower has been growing internationally. Today there is a focus on validating the multiple remote tower concept, in which an air traffic controller (ATCO) is responsible for more than one airport simultaneously [13, 11]. This movement of the concept towards more complex scenarios demands new technologies that support ATCO’s situational awareness in order to ensure that their cognitive capacities are not exceeded. Such technologies may include providing visual enhancements or even fully automating parts of the ATCO’s responsibilities. This requires processing of large amounts of data, including high-resolution video data, in real-time. Despite the ever-increasing capacity of modern computers, their computational power is still insufficient to allow for naive implementations when dealing with large amounts of data. There are, however, a number of solutions, both hardware and software based, that make such processing attainable.
In this paper we describe the implementation of several features that enhance remote tower based on raw video. These include: tracking of objects in the video, 3D event localization, video exposure correction, and map integration.
A software architecture describing how the various components utilize parallel processes on a single workstation is presented. We also emphasize how the software components communicate in such a way that failure of a non-vital component (e.g. tracking) does not affect the performance of the vital ones (i.e., video stream rendering).
II-A Video processing
Live streamed video is a core component of remote tower systems, including both light spectrum and infrared imaging. Given that the cameras are often placed some distance from the runway, high resolution of the streamed imagery is considered essential. In contrast, it has been shown that video frame rate is of less importance when it comes to maintaining visual detection performance and does not impact physiological stress levels [6]. Nevertheless, it is desirable from the point of view of video quality and system perception to utilize as high a frame rate as the bandwidth allows. Remote tower data is most often transferred on high bandwidth networks, meaning that, in many cases, high frame rates are available in addition to high resolution. Essentially, this means that a huge amount of information is being continually updated at a fast pace.
Another feature of remote tower implementations is that they often use multiple cameras to cover angles of up to 360 degrees. The exposure of each of these cameras is typically controlled individually in order to ensure that they present optimal contrast of the scene to the ATCO. For example, if the sun is shining directly towards one camera, the exposure profile required should be completely different to the profile of the camera pointing in the opposite direction. Nevertheless, this local correction of contrast often leads to visible ‘seams’ between the images when presented side-by-side (see Figure 1). It is therefore of interest to apply local filters that smoothly adjust the contrast to avoid such seams appearing as prominently (stitching).
The aim of a stitching algorithm is to produce a visually plausible mosaic, in the sense that it is as similar as possible to the input images, but in which the seam between the stitched images is invisible [10].
There exist many methods for stitching panorama images and thus reducing the visible seam if the exposures do not match up. One such method that has shown good results is Laplacian pyramid blending [3] using a feathered blend. However, common to many of these methods in a panorama setting is that they operate on adjacent images with a region of overlap. In our scenario, we do not have such an overlap as the cameras are intended to be perfectly aligned. Another method is gradient domain blending [10], which performs the operations in the gradient domain. The authors discuss two main image stitching methods. The optimal seam algorithm involves searching for a specific curve in the overlap region and is thus not applicable in our scenario. The second method is relevant for our setup: it minimizes the seam artifacts by smoothing the transition between the images. It is, however, too compute-intensive to be run for every frame with our setup of up to 14 Full HD cameras.
II-B 3D modelling for remote tower
One major difference between remote towers and traditional towers is that the video presented in remote tower lacks any depth information. At the same time, accurate digital elevation models (DEMs) that capture 3D behaviour of terrain and other static objects such as buildings and trees are readily available. It is therefore of interest to explore methods for presenting 3D/depth information in the ATCO working position in order to enhance the situational awareness of ATCOs.
Although remote tower systems are often supplemented with pan-tilt-zoom cameras, the majority of cameras in a remote tower system are static. This makes it possible to accurately calibrate the cameras with respect to a 3D coordinate system. Calibration involves determining the parameters of the cameras, that is: position, direction, tilt, field of view, aspect ratio and resolution. In some cases lens distortion is also an issue, but in our setting the problem is negligible. The advantage of calibrating the cameras is that any surface event that is detected on the video can be immediately positioned in 3D space [1]. A surface event is understood to be any event that is both visible on the video and that occurs on the surface of the airport, excluding only aerial targets. Such events could, for example, arise from an ATCO querying a certain location or from 2D video tracking software tracking the location of an object.
A common format for DEMs is GeoTIFF, which allows for encoding both a raster height map and georeferencing information in a single file. In our work we have had access to a 2 m × 2 m resolution digital surface model (DSM) in GeoTIFF format that covered an area of approximately 30 km in and around Sundsvall-Timrå airport. The vertical resolution of the data was 1 m.
Tracking an object in a video scene is based on repeated detection and localization of the object on successive frames and obtaining a continuous association between detections through time. This association is made easier if a classification of the object is available.
In recent years, machine learning algorithms have made great strides in performance, both with respect to computational efficiency and accuracy of the results. As a result, some tasks that were previously considered insurmountable can now be solved with superhuman performance. In particular, this is true for:
detection — deciding whether an image contains an object;
classification — determining the class of the dominating object in an image (in our case aircraft, vehicle, or person);
localization — estimating the location of an object (in our case as a tight bounding box);
tracking — locating a moving object over time.
In the context of remote tower, these tasks are of particular interest as they reflect some of the responsibilities of the ATCO.
A popular method for real-time object detection from image data is known as YOLO (you only look once) [14, 15]. YOLO frames the problem of detection and localization of objects in a scene as a regression problem that can be solved with a single evaluation of a neural network. This approach both performs better and is faster to evaluate than most of its predecessors. In addition to detecting and localizing objects, YOLO also provides a classification of the object and an estimated probability that the classification is correct.
Traditional approaches for real-time object tracking include methods such as mean shift [4]. More recent approaches include making use of Siamese convolutional neural networks (CNNs) [2]. In our setting, the main complication with respect to these approaches is that it is hard to achieve real-time tracking when considering the extremely high resolution of the data.
II-D Computational power
Given that one of the main drivers behind the remote tower concept is cost reduction, it is of interest to investigate how best to utilize computational resources. Despite the ever increasing performance of today’s computers, particularly with respect to Graphics Processing Units (GPUs), the sheer amount of data to be processed in our setting raises a challenge. Nevertheless, we still consider it feasible to implement real-time video processing, tracking and 3D event localization on a single workstation equipped with one or more GPUs. Utilizing computational resources efficiently involves balancing the requirements of the different software components in terms of memory consumption (both main memory and GPU memory), and exploiting both multi-threading and highly parallel GPU processes such as matrix multiplication, which is a core component of modern machine learning approaches. Tasks such as video decoding can be performed on fixed-function chips (e.g. NVIDIA’s PureVideo) that are part of modern GPUs, enabling other GPU resources to be utilized for different processes.
III-A Video processing
White balance and exposure correction
For a smooth transition between camera frames, narrow bands on each side of a seam should have close to identical colour spectra. Our fundamental assumption is the converse: If we are able to accurately match the colour spectra of adjacent narrow bands, then the transition will appear natural. We determined this assumption to be reasonable, as we expect the landscape to be approximately identical in these regions. This local constancy prior holds as long as the video streams are well aligned geometrically and temporally. The video processing is performed for the synchronized frames.
To obtain a smooth transition, we estimate a shared spectrum at the seams based on averaging the intensity distributions in narrow bands along the seam (see Figure 3). To flawlessly map from the measured spectra to the shared spectra would require a highly nonlinear and high-dimensional map. However, such a map would be too expensive to compute and too slow to apply in real time. Moreover, the measured spectra are only approximations of the underlying spectrum of the landscape, so a simpler mapping that correctly matches the essential features of these distributions is more appropriate and prevents overfitting.
GPUs are well known for their ability to swiftly apply linear operations. To keep our algorithm as efficient as possible, we consider an approach that only depends on adjacent video streams. Considering the 14 cameras in our setup, such a local approach yields a significant reduction in complexity. A further reduction is achieved by matching the spectra using an affine map for each individual colour channel. These maps were applied from the middle of the adjacent images towards the common border (see Figure 3). A gradual transition was obtained using convex combinations, ranging from the identity map at the centre of the image to the full map at the border.
It turned out that these maps were not expressive enough to accurately transform all dominant features in the spectra. For instance, in certain cases we obtained a map yielding a seamless transition in the sky but a poor transition on the ground (and vice versa). To overcome this issue, we decided to partition the stream vertically into blocks of identical size.
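As an illustration, the following minimal sketch shows one way such a per-channel affine map and its convex blending could look. It assumes (our assumption, the fitting criterion is not specified above) that the map is fitted by matching the mean and standard deviation of a band to those of the shared spectrum. The names `fit_affine` and `blend_towards_seam` are hypothetical.

```python
from statistics import fmean, pstdev

def fit_affine(band, target_mean, target_std):
    """Fit x -> gain * x + offset for one colour channel so that the band's
    mean and standard deviation are mapped onto the target values.
    Moment matching is an assumed fitting criterion, used for illustration."""
    mean, std = fmean(band), pstdev(band)
    gain = target_std / std if std > 0 else 1.0  # flat band: keep contrast
    return gain, target_mean - gain * mean

def blend_towards_seam(values, gain, offset):
    """Convex combination of the identity map (at the image centre,
    values[0]) and the affine map (at the seam, values[-1])."""
    n = len(values)
    out = []
    for k, x in enumerate(values):
        t = k / (n - 1) if n > 1 else 1.0  # 0 at the centre, 1 at the seam
        out.append((1.0 - t) * x + t * (gain * x + offset))
    return out

# Intensities of one channel in the narrow bands left and right of a seam.
left_band, right_band = [100, 110, 120], [80, 90, 100]
shared_mean = (fmean(left_band) + fmean(right_band)) / 2
shared_std = (pstdev(left_band) + pstdev(right_band)) / 2
gain, offset = fit_affine(left_band, shared_mean, shared_std)

# One half-row of the left image, ordered from the centre to the seam.
corrected = blend_towards_seam([110, 110, 110], gain, offset)
```

For the image on the other side of the seam, the same functions would be applied with the half-row ordered from its centre towards the shared border.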
Another issue, manifesting itself as a local flickering, appeared when an object moved from one stream to the next. In this case the pixels of this object suddenly outshine the pixels of the background landscape, dominating the colour spectrum and violating our fundamental assumption. To resolve this issue we implemented two supplemental methods.
The first method detects the movement using the thresholded absolute differences between frames (see Figure 5), removing the corresponding pixels from consideration in the measured spectra when defining our exposure correction map. This object removal approach is viable for objects that do not dominate the domain of the local exposure correction map.
Should a moving object cover most or all of a block, there will be few or no pixels left for defining our map. For these cases we use an exponential smoothing approach, which reduces the contribution of the moving object by blending the newly computed exposure function with the exposure function from the previous frame.
III-B 3D modelling for remote tower
The first step in combining the “out the window” video stream view with the 3D model is to calibrate the cameras in a 3D coordinate system. In our case we use the SWEREF99 geodetic reference system [7].
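The two supplemental methods can be sketched as follows. This is a minimal illustration, assuming that an exposure map is represented as a `(gain, offset)` pair per colour channel and that the smoothing weight `alpha` is a tunable parameter; both the representation and the value 0.25 are assumptions made for the example.

```python
def static_pixels(band, moving_mask):
    """Object removal: drop pixels flagged as moving before the spectrum
    of the band is measured."""
    return [x for x, moving in zip(band, moving_mask) if not moving]

def smooth_map(previous_map, new_map, alpha=0.25):
    """Exponential smoothing of an exposure map given as (gain, offset).
    alpha is the (hypothetical) weight of the newly computed map."""
    return tuple((1.0 - alpha) * p + alpha * c
                 for p, c in zip(previous_map, new_map))

# A bright object enters the band: the freshly fitted map jumps, but the
# smoothed map only moves a fraction alpha of the way per frame.
previous_map = (1.0, 0.0)
outlier_map = (2.0, 10.0)
smoothed = smooth_map(previous_map, outlier_map)

# Pixels of the moving object are excluded when enough background remains.
background = static_pixels([10, 200, 12], [False, True, False])
```

A smaller `alpha` suppresses flickering more strongly, at the price of a longer delay before the exposure correction follows genuine changes in lighting.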
To accurately calibrate the camera, it is normally possible to fix certain parameters in advance. For example, the position of the camera can normally be accurately determined by referring to map data, orthophotos or by GPS or equivalent positioning technologies, together with technical drawings of the remote tower structure. In addition, it is almost always the case that the cameras are horizontally aligned; that is, there is negligible tilt on the cameras. Information about the aspect ratio and resolution can easily be extracted from image metadata. The cameras considered in this paper also have negligible lens distortion, so non-linear correction is not required. Thus, it remains to define the direction and the field of view of the cameras. To aid this process, an interactive application has been implemented that allows both navigation in the 3D model and blending the video dynamically, see Figure 4. The known parameters can be supplied via a graphical user interface. The 3D navigation controls then allow the user to interactively align the features in the 3D model with the video imagery, and the parameters are updated dynamically.
Based on the calibrated camera parameters, we can accurately compute the distance in 3D space between the camera origin and any point on the ground corresponding to a pixel location. These depths can either be computed by intersecting a ray with the 3D model [1], or by simply rendering the scene with the depth buffer active. The resulting depths can then be used to position events in 3D space. They can either be computed on the fly or can be pre-computed for all pixels and stored in a depth map. In the spirit of decoupling the components as much as possible, we opt to pre-compute the depths. This results in a modest increase in memory requirements, but enables much better utilization of computational resources, freeing up CPU or GPU for other tasks.
III-C AI-based video tracking
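To make the use of a pre-computed depth concrete, the sketch below positions a pixel in 3D under simplifying assumptions that go beyond the text: an ideal pinhole camera with square pixels, no tilt and no lens distortion, world axes given as (east, north, up), and a heading measured clockwise from north. The function name `pixel_to_world` and its parameters are hypothetical.

```python
import math

def pixel_to_world(u, v, depth, cam_pos, heading, hfov, width, height):
    """Position pixel (u, v) in 3D, given its pre-computed depth.

    depth is the distance from the camera origin to the surface point,
    heading and hfov (horizontal field of view) are in radians.
    """
    focal = (width / 2) / math.tan(hfov / 2)  # focal length in pixels
    # Ray in camera coordinates: right, down, forward.
    dx, dy, dz = u - width / 2, v - height / 2, focal
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    right = (math.cos(heading), -math.sin(heading))
    forward = (math.sin(heading), math.cos(heading))
    direction = (
        (dx * right[0] + dz * forward[0]) / norm,  # east
        (dx * right[1] + dz * forward[1]) / norm,  # north
        -dy / norm,                                # up (image rows grow down)
    )
    return tuple(c + depth * d for c, d in zip(cam_pos, direction))

# A camera 10 m above the origin, looking north with a 90 degree field of
# view; the depth would normally be read from the stored depth map.
centre = pixel_to_world(50, 50, 5.0, (0.0, 0.0, 10.0), 0.0, math.pi / 2, 100, 100)
edge = pixel_to_world(100, 50, math.sqrt(2), (0.0, 0.0, 10.0), 0.0, math.pi / 2, 100, 100)
```

With the depths stored per pixel, positioning an event reduces to a single lookup followed by this fixed amount of arithmetic, which is what makes the pre-computation attractive.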
In this work we use YOLO for detection, localization and classification [14, 15]. YOLO is a convolutional neural network that extracts and uses the same features for classification and localization, in the form of multiple bounding box prediction. This makes the method both extremely fast and accurate, due to better generalization to unseen images achieved by this multitask learning.
Each detection consists of an object category (aircraft, vehicle, or person), axis-aligned bounding box, as well as a probability signifying the confidence of the detection. The three object classes were consistently colour-coded in their appearance as bounding boxes and close-ups in the video streams and as markers on the map (see Figure 2). To avoid false positives, only detections whose probability exceeds a threshold are processed.
The multiple detections returned by the YOLO architecture could correspond to the same object. Such superfluous detections are eliminated using greedy non-maximum suppression [17] as follows: for each category, pick the detection with highest probability, and suppress overlapping detections within this category by setting their probabilities to zero. This process is then repeated for the remaining detections, until only detections with zero probability remain.
Overlap of bounding boxes is quantified in terms of their Intersection over Union (IoU), defined as the quotient of the areas of their intersection and union, i.e.,

$$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

for bounding boxes $A$ and $B$, where $|\cdot|$ denotes area. It measures the similarity of the boxes, taking value 0 for disjoint boxes, value 1 for identical boxes, and otherwise values in between. Overlapping detections are then suppressed whenever their IoU exceeds a given threshold.
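The suppression procedure can be sketched in a few lines. The example below assumes axis-aligned boxes given as `(x1, y1, x2, y2)` and detections given as `(category, box, probability)` tuples; the threshold value 0.5 is a placeholder chosen for the example, not a value taken from our setup. Suppressed detections are dropped from the list rather than having their probability set to zero, which is equivalent for the purpose of the illustration.

```python
def iou(a, b):
    """Intersection over Union of two boxes (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0  # disjoint (or merely touching) boxes
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(detections, iou_threshold=0.5):
    """Greedy non-maximum suppression, applied per category."""
    by_category = {}
    for det in detections:
        by_category.setdefault(det[0], []).append(det)
    kept = []
    for dets in by_category.values():
        remaining = sorted(dets, key=lambda d: d[2], reverse=True)
        while remaining:
            best = remaining.pop(0)  # most probable remaining detection
            kept.append(best)
            remaining = [d for d in remaining
                         if iou(best[1], d[1]) <= iou_threshold]
    return kept

detections = [
    ("aircraft", (0, 0, 10, 10), 0.9),
    ("aircraft", (1, 1, 10, 10), 0.8),    # overlaps the first: suppressed
    ("vehicle", (0, 0, 10, 10), 0.7),     # other category: kept
    ("aircraft", (20, 20, 30, 30), 0.6),  # disjoint: kept
]
kept = greedy_nms(detections)
```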
However, YOLO cannot be applied directly to our situation, as it applies to square input images of fixed size. The image obtained by concatenating the video streams is not square. Moreover, our high-end consumer-grade GPU (GTX 1080 Ti, with 11 GB of memory) runs out of memory, even when attempting to run YOLO on a subimage.
While memory is not an issue for running YOLO on a sufficiently small image, the entire visual range can only be scanned once every couple of seconds in this manner. For the high spatial and temporal resolution of our setup, it is therefore important to develop effective attention mechanisms, i.e., strategies for deciding where to look.
We consider the following three mechanisms:
Sliding window approach. After concatenating the frames of all 14 cameras in a single image, slide a fixed-sized window across this image and run a detection in each window. As an option, it is possible to use overlapping windows to avoid unfortunate cropping of objects. Another option is to (in addition) resize the image to detect objects at various scales.
This strategy is computationally expensive, and therefore only run on start-up to get a good overview of the initial situation.
Difference approach. Moving objects can be detected by detecting significant local changes in the video streams. Technically, this is achieved by thresholding the absolute difference of two consecutive frames, as shown in Figure 5. Sliding a window across the resulting binary image, one runs a detection whenever the number of on-pixels (representing a significant change) exceeds a given threshold.
Expectation approach. Once we have an inventory of tracked objects with their locations and movements, we can predict the expected position of each object in a future frame, and run a detection there.
These mechanisms are combined in a high-level scheduler to effectively track objects, subject to the cost constraints imposed by the available computational resources.
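As an illustration of the difference approach, the following minimal sketch thresholds the absolute difference of two consecutive greyscale frames and reports the windows that contain enough changed pixels. Frames are plain nested lists here, and the window size, stride and both thresholds are hypothetical values chosen for the toy example.

```python
def change_mask(previous, current, threshold):
    """Binary image: True where the absolute frame difference is significant."""
    return [[abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(previous, current)]

def windows_to_inspect(mask, window, stride, min_on_pixels):
    """Slide a square window over the mask and return the (top, left)
    corners of windows whose number of on-pixels exceeds min_on_pixels."""
    height, width = len(mask), len(mask[0])
    hits = []
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            on_pixels = sum(mask[r][c]
                            for r in range(top, top + window)
                            for c in range(left, left + window))
            if on_pixels > min_on_pixels:
                hits.append((top, left))
    return hits

# Two 4x4 frames: a bright 2x2 object appears in the bottom-right corner.
previous = [[0] * 4 for _ in range(4)]
current = [[0, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 200, 200],
           [0, 0, 200, 200]]
mask = change_mask(previous, current, threshold=50)
hits = windows_to_inspect(mask, window=2, stride=2, min_on_pixels=3)
```

Only the windows returned in `hits` would be passed on to the detector, so the cost per frame scales with the amount of movement rather than with the total resolution.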
Upon start-up of the tracker, one first applies the sliding window approach to yield an initial list of detections. During the remainder of the tracking process, the difference and expectation approaches are used for deciding where to run detections. Besides being used within each YOLO detection, non-maximum suppression is used here to remove superfluous detections by the various attention mechanisms. To avoid the creation of duplicate objects (and an ensuing cascade effect), a low suppression tolerance is used here.
The problem of optimally assigning a set of detections to existing objects can be expressed as an assignment problem. For this, one first defines a cost function $c(d, o)$ of a detection $d$ and an object $o$, in which a higher cost reflects a less desirable match. The values of this function for detections $d_i$ and objects $o_j$ are assembled in a cost matrix

$$C = \big(c(d_i, o_j)\big)_{i,j}.$$

For the linear sum assignment problem, the goal is to find a one-to-one assignment $\sigma$ of detection indices to object indices, for which the total cost

$$\sum_i C_{i,\sigma(i)}$$

is minimal. This problem can be solved rapidly (in cubic running time) using the Hungarian algorithm [8]. Such an assignment problem is solved for every category separately.

Let $B_d$ and $B_o$ be the bounding boxes of detection $d$ and object $o$, measured at frame numbers $t_d$ and $t_o$. To impose a penalty for dissimilarity, we consider a cost function complementary to the IoU, defined by

$$c(d, o) = 1 - \mathrm{IoU}_\gamma(d, o), \qquad \mathrm{IoU}_\gamma(d, o) = \gamma^{\,t_d - t_o}\, \mathrm{IoU}(B_d, B_o).$$

This function imposes a higher cost for matching a detection with an object last observed in a distant frame, by discounting their IoU by a factor $\gamma$ for every frame that has since passed.

After finding the optimal assignment $\sigma$, each assigned detection $d_i$ is added to the history of the object $o_{\sigma(i)}$ if

$$\mathrm{IoU}_\gamma\big(d_i, o_{\sigma(i)}\big) > \tau_{\mathrm{match}},$$

i.e., if the discounted IoU exceeds a given tolerance. If this is not the case, as well as for the unassigned detections, it is checked whether

$$\max_j\, \mathrm{IoU}_\gamma(d_i, o_j) < \tau_{\mathrm{new}},$$

i.e., whether the detection wasn’t just outmatched, but not relevant for any of the existing objects. If this is the case, it is added as a new object. This rather strict tolerance avoids the duplication of objects due to inaccurate detections.
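The matching step can be sketched as follows for a single category. This is an illustration under stated assumptions, not our implementation: the discount factor and both tolerances are hypothetical values, and the optimal assignment is found by exhaustive search, which is only feasible for a handful of boxes. In practice one would use a Hungarian-type solver such as SciPy's `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over Union of two boxes (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def discounted_iou(det, obj, gamma):
    """det and obj are (box, frame) pairs; IoU is discounted per frame passed."""
    (box_d, t_d), (box_o, t_o) = det, obj
    return gamma ** (t_d - t_o) * iou(box_d, box_o)

def assign(cost):
    """Exhaustive minimum-cost one-to-one assignment {row: column}.
    Stand-in for the Hungarian algorithm; factorial time, toy sizes only."""
    n_rows, n_cols = len(cost), len(cost[0])
    if n_rows <= n_cols:
        options = (dict(enumerate(cols))
                   for cols in permutations(range(n_cols), n_rows))
    else:
        options = ({r: c for c, r in enumerate(rows)}
                   for rows in permutations(range(n_rows), n_cols))
    return min(options, key=lambda m: sum(cost[r][c] for r, c in m.items()))

def update_tracks(detections, tracks, gamma=0.9, match_tol=0.3, new_tol=0.1):
    """Extend matched tracks and create new tracks for irrelevant detections."""
    last = [track[-1] for track in tracks]  # latest observation per object
    assignment = {}
    if detections and last:
        cost = [[1.0 - discounted_iou(d, o, gamma) for o in last]
                for d in detections]
        assignment = assign(cost)
    for i, det in enumerate(detections):
        j = assignment.get(i)
        if j is not None and discounted_iou(det, last[j], gamma) > match_tol:
            tracks[j].append(det)   # continue an existing object
        elif all(discounted_iou(det, o, gamma) < new_tol for o in last):
            tracks.append([det])    # not relevant for any object: new object
    return tracks

tracks = [[((0, 0, 10, 10), 0)]]
detections = [((1, 0, 11, 10), 1), ((50, 50, 60, 60), 1)]
tracks = update_tracks(detections, tracks)
```

Detections that are neither matched nor sufficiently far from all existing objects are discarded, which is what prevents a slightly inaccurate detection from spawning a duplicate object.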
The video streams enter the system as H.264 compressed video streams [16] in Full HD resolution. In our case, 13 such streams had to be decoded and displayed in real-time. In order to achieve the required performance, we offload the decoding to the GPU using NVIDIA NVDEC, which on our system with a GeForce GTX 1070 GPU was able to decode up to 14 such streams in real-time.
With the decoding being done on the GPU, and the video frames residing in GPU memory after decoding, rendering in real-time and at full resolution is easily achieved. The frames are only moved into RAM, a relatively slow operation, a few times per second in order to calculate the white balance and exposure correction on the CPU, and when the object recognition module requests a new frame.
With the object tracking written as a separate Python application, ZeroMQ [9] is used for inter-process communication. The object detection also runs on a GPU, and since it is essential not to degrade the performance of the live video view, a separate GPU (GeForce GTX 1080 Ti) is used for this task. This also has the benefit that if the object tracking code experiences a crash or a slow-down, it does not inhibit the operator, who still receives a live video stream while the object recognition module recovers.
For a full block diagram of the application, see Figure 6.
Expert user feedback
A questionnaire answered by four ATCOs provided qualitative expert user feedback on a preliminary version of the developed functionality. The general consensus was that the developed technologies are promising, but require more testing. We summarize their feedback here by functionality (cf. Figure 2):
Tracking: Overall, the ATCOs considered the tracking with 3D integration as either somewhat useful or useful. At the time of the questionnaire, unsteady bounding boxes were mentioned as distracting. In a later version this was largely resolved, leaving only a minor wobbling. The object classification abstraction level was determined to be sufficient, and the possibility of adding a class for foreign flying objects (e.g. birds, drones) was mentioned. It was remarked that it is harder to see smaller objects in the video stream than from an air traffic control tower, and that visual tracking could help to increase their visibility.
3D event positioning: It was mentioned that interaction with the 3D model improved depth perception and situational awareness, and the various 3D functionalities were considered either somewhat useful or useful. Most ATCOs agreed that more testing is needed regarding the situational awareness and reliability.
Map view: Overall, the map view was considered either somewhat useful or useful. Some of the ATCOs found the shaded overlay useful, and some did not. In the map view, the orthophoto view was generally preferred over the abstract map view, as it is easier to relate to the video streams.
The exposure correction predominantly yields a mosaic with natural transitions, as visualized for cloudy and sunny weather conditions in Figure 1. This is also the case for video, in the sense that temporal changes also generally appear natural.
In the presence of moving objects, the method generates natural results most of the time. However, the method can struggle when moving objects cross the image seams, sometimes resulting in a local flickering. Typically, the problem is most pronounced right before and after a full crossing of the seam, i.e., when the object is fully present in the boundary band of one of the images but not in the other.
Table I shows the results of the proposed exposure correction methods with 16 and 64 blocks vertically, when applied to concatenated video streams with a moving object right after a full crossing of the seam. The original concatenated image is shown twice for easy comparison with the correction methods.
The standard exposure correction introduces a noticeable discolouration in the block next to the car, both for large and small blocks. The object removal approach shows natural results if the remaining number of pixels in the block is relatively high (left case). However, if the moving object fills most of the block (right case), too few pixels remain for computing a natural exposure correction. The exponential smoothing approach generally shows natural results. It does, however, add a slight delay to the update of the exposure correction. For this reason we prefer using the object removal approach when applicable.
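Table I: Results of the exposure correction methods with 16 blocks (left column) and 64 blocks (right column). Rows, from top to bottom: original frame (duplicate); standard exposure correction; exposure correction with object removal; exposure correction with exponential smoothing.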
Manually tuning several parameters (position, view direction, field of view, etc.) for aligning the 3D model to the video streams is a demanding process. It typically involves a field trip and expensive measuring equipment, and it can take several person-days to obtain an accurate result.
In contrast, the camera calibration application we developed provides a virtual environment in which the calibration can be performed, requiring only video/image data and a DSM (GeoTIFF). The application thereby greatly speeds up this process, reducing it to the order of minutes.
The methods were run on a single workstation. Efficiently exploiting hardware resources and reducing unnecessary computations makes it possible to achieve real-time performance on consumer-grade hardware.
Running the exposure correction on 14 HD cameras at 30 FPS simultaneously introduced only a minor overhead on the GPU (GTX 1070).
While visual inspection can indicate improvement of the quality of the seam, it is a highly subjective metric and difficult to judge consistently for a long scenario with up to 14 cameras. For a quantitative evaluation of the performance of our exposure correction algorithm, we use a cost function based on the method described in [10].
Consider two adjacent images, one to the left and one to the right of the seam, together with their pixel columns nearest to the seam. A naive measure of continuity is to directly compare the two columns at the seam, but this measure is sensitive to geometric misalignment and asymmetrical details. Instead, the trend at each row can be captured by measuring whether the gradient between the final column pixels of the left image (resp. the initial column pixels of the right image) continues across the seam.
Hence the total discrepancy of the trend can be measured by accumulating this mismatch over the rows; we refer to the resulting measure as (1).
To further reduce the contribution from geometric misalignment we down-sampled the input frames by a factor 8 in both directions.
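The following sketch shows one possible trend-continuity measure of this kind together with a simple block-average down-sampling. It is an assumed instantiation for illustration only and is not claimed to coincide with the measure (1): for every row, the gradient across the seam is compared with the average of the gradients just inside the two images. Images are nested lists with one grey value per pixel.

```python
def downsample(image, factor):
    """Block-average a 2D image by an integer factor in both directions."""
    rows, cols = len(image) // factor, len(image[0]) // factor
    return [[sum(image[r * factor + i][c * factor + j]
                 for i in range(factor) for j in range(factor)) / factor ** 2
             for c in range(cols)]
            for r in range(rows)]

def seam_trend_discrepancy(left, right):
    """Mean absolute mismatch, over the rows, between the gradient across
    the seam and the average gradient just inside the two images."""
    total = 0.0
    for row_l, row_r in zip(left, right):
        inside_left = row_l[-1] - row_l[-2]   # last two columns, left image
        across = row_r[0] - row_l[-1]         # step over the seam
        inside_right = row_r[1] - row_r[0]    # first two columns, right image
        total += abs(across - (inside_left + inside_right) / 2)
    return total / len(left)

left = [[0, 1, 2], [10, 11, 12]]
smooth_right = [[3, 4, 5], [13, 14, 15]]      # ramp continues across the seam
offset_right = [[13, 14, 15], [23, 24, 25]]   # exposure jump of 10 at the seam
smooth_cost = seam_trend_discrepancy(left, smooth_right)
offset_cost = seam_trend_discrepancy(left, offset_right)
```

A ramp that continues across the seam yields zero cost, whereas an exposure offset between the images shows up directly in the measure.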
Table II (column headers): image interior | image seam | image seam
Table II shows the average value of (1) for two scenes with 6 cameras and a duration of 60 seconds. The reference value in the left columns was evaluated at the middle of the input streams. The exposure corrected result is a significant improvement over the initial uncorrected seam, but is still significantly higher than the reference value. This deviation can partly be explained by a slight geometric misalignment at the seam.
We have developed and tested a number of techniques based on video processing, 3D modelling and object tracking that apply to high resolution video arising from remote towers.
It was shown that the methods can be implemented on a single workstation and still retain real-time performance by efficiently exploiting hardware resources and by reducing unnecessary computations. The techniques do not rely on expensive or special-made hardware, thereby supporting the cost-effectiveness of the remote tower concept by limiting start-up and maintenance costs.
Results from a questionnaire answered by ATCOs indicated that the developed technologies are promising, but require more testing. The visual tracking was remarked to have the potential to increase visibility, both of small objects and using night-vision technologies, with the potential to improve safety.
As future work, there are several possibilities for improving the proposed functionality. The attention mechanisms considered in this paper are based on detecting movements and expected locations of existing tracked objects. In the future, we could also consider where ATCOs concentrate their attention, by looking at heat maps from tracked eye movements [11] in order to attain better performance.
The tracking functionality has so far only been tested in relatively high visibility conditions during daytime. More testing is needed to see how reliable the tracking is in low visibility and nighttime scenarios. Testing using infrared sensors is also subject to future work.
Although the exposure correction works well in general, there are still situations where it could be improved. The exposure correction maps currently act on each colour channel separately. The quality of the corrections can be expected to improve when using linear maps combining the three channels, at a negligible computational overhead. Moreover, currently the exposure correction map is defined separately for each vertical block. The transition between these maps could be improved by either using convex combinations of the adjacent maps or by adding boundary conditions. Finally, more tuning is needed for automatically selecting which of the proposed methods to use.
The authors would like to thank Saab and Luftfartsverket Sweden (LFV) for their assistance in providing access to the remote tower video data from Sundsvall, Sweden. We also express our gratitude to the four ATCOs from the COOPANS partnership for their valuable feedback.
[1] (2015-07) Geometric modelling for 3D support to remote tower air traffic control operations. In [5].
[2] (2016) Fully-convolutional Siamese networks for object tracking. In European Conference on Computer Vision, pp. 850–865.
[3] (1983) A multiresolution spline with applications to image mosaics. ACM Transactions on Graphics 2, pp. 217–236.
[4] (2000) Real-time tracking of non-rigid objects using mean shift. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No. PR00662), Vol. 2, pp. 142–149.
[5] (2015-07) Fifth International Air Transport and Operations Symposium.
[6] (2018-04) Effects of lower frame rates in a remote tower environment. In [12], pp. 16–24.
[7] (2000) SWEREF 99 – an updated EUREF realisation for Sweden. In Symposium of the IAG Subcommission for Europe (EUREF).
[8] (1955) The Hungarian Method for the assignment problem. Naval Research Logistics Quarterly 2, pp. 83–97.
[9] (2018) How to design & implement a modern communication middleware based on ZeroMQ. In Proceedings, 16th International Conference on Accelerator and Large Experimental Physics Control Systems (ICALEPCS 2017).
[10] (2004) Seamless image stitching in the gradient domain. In Eighth European Conference on Computer Vision (ECCV 2004), pp. 377–389.
[11] (2018) How much is too much on monitoring tasks? Visual scan patterns of single air traffic controller performing multiple remote tower operations. International Journal of Industrial Ergonomics 67, pp. 135–144.
[12] J. L. Mauri and B. Gersbeck-Schierholz (Eds.) (2018-04) The Tenth International Conference on Advances in Multimedia. IARIA.
[13] (2016) Head up only – a design concept to enable multiple remote tower operations. In 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), pp. 1–10.
[14] (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
[15] (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767.
[16] (2010) The H.264 advanced video compression standard. 2nd edition, Wiley Publishing.
[17] (2014) Non-maximum suppression for object detection by passing messages between windows. In ACCV.