Structural Prior Driven Regularized Deep Learning for Sonar Image Classification

Isaac D. Gerg and Vishal Monga This work was partially supported by Office of Naval Research under grants N00014-19-1-2638, N00014-19-1-2513. I. D. Gerg is with the Applied Research Laboratory and School of EECS at the Pennsylvania State University. V. Monga is with the School of EECS at the Pennsylvania State University (

Deep learning has been recently shown to improve performance in the domain of synthetic aperture sonar (SAS) image classification. Given the constant resolution with range of a SAS, it is no surprise that deep learning techniques perform so well. Despite deep learning’s recent success, there are still compelling open challenges in reducing the high false alarm rate and enabling success when training imagery is limited, which is a practical challenge that distinguishes the SAS classification problem from standard image classification set-ups where training imagery may be abundant. We address these challenges by exploiting prior knowledge that humans use to grasp the scene. These include unconscious elimination of the image speckle and localization of objects in the scene. We introduce a new deep learning architecture which incorporates these priors with the goal of improving automatic target recognition (ATR) from SAS imagery. Our proposal – called SPDRDL, Structural Prior Driven Regularized Deep Learning – incorporates the previously mentioned priors in a multi-task convolutional neural network (CNN) and requires no additional training data when compared to traditional SAS ATR methods. Two structural priors are enforced via regularization terms in the learning of the network: (1) structural similarity prior – enhanced imagery (often through despeckling) aids human interpretation and is semantically similar to the original imagery and (2) structural scene context priors – learned features ideally encapsulate target centering information; hence learning may be enhanced via a regularization that encourages fidelity against known ground truth target shifts (relative target position from scene center). Experiments on a challenging real-world dataset reveal that SPDRDL outperforms state-of-the-art deep learning and other competing methods for SAS image classification.

Deep learning, self-supervised learning, automatic target recognition, synthetic aperture sonar


I Introduction

Underwater sonar was historically pursued for military purposes, but as the field matured and the commercialization of these systems became feasible, remote-sensing applications for the civilian domain developed. Synthetic aperture sonar (SAS) followed the same path: initially pursued for mine countermeasure applications [57], it now has broad civilian applications in remote sensing of the undersea environment.

Over the last few years, SAS has matured to a capability accessible in the civilian space, with several companies offering systems [55, 21]. Fundamental work in obtaining high-quality SAS images was carried out in the late 1990s to early 2000s [20, 23, 24, 6, 15, 29]. Contemporary SAS systems are capable of producing image quality that lends itself to tasks such as automatic target recognition (ATR). Despite the improvements, the problem of detecting and classifying objects in imagery remains challenging because of distractors in the environment and the many complex configurations targets can assume. Fig. 1 shows examples of difficult cases along with prototypes of the object classes used in this work.

Fig. 1: SAS is capable of producing high-quality, high-resolution imagery of seafloor and objects. The images here are examples collected from the Centre for Maritime Research and Experimentation MUSCLE system. In this work, we provide an ATR algorithm to classify MUSCLE images into four classes. Example objects from each class are shown: (a) background, (b) cylinder, (c) truncated cone, and (d) wedge. Difficulties in classification result because there are often objects which appear target-like (e,f), but are not targets (false alarms); and some targets are difficult to discern (g,h) because of orientation, burial depth, or background topography causing them to be ignored (missed detections).

I-A Open Challenges in SAS ATR

SAS ATR algorithms were originally established from ATR algorithms used in side scan sonar, which we call real aperture sonar (RAS). RAS systems have similar collection geometry to SAS, but cannot produce the constant resolution with range achieved by SAS. We refer the reader to Chapter Two of [6] and Chapter Three of [22] for the differences between RAS and SAS imaging. Despite these differences, initial SAS imagery looked similar enough to RAS for researchers to reasonably justify the use of RAS ATR algorithms on SAS.

Some of the more popular early work in sonar ATR involved the use of kernel filters [30, 11]. When large amounts of SAS imagery began to be produced, these were some of the first techniques applied to it. Over time, these methods began to utilize multiple target looks to aid in classification by exploiting the overlapped coverage of the seafloor that most SAS surveys exhibit [12]. Other techniques focused on model-based approaches [50] and uncertainty modeling. Eventually, the popular classification algorithms of the early 2000s, before the boom of deep learning, were applied to imagery, including decision trees [45] and Markov random fields [51].

This paper concerns the use of neural networks (NN) to address the classification problem. The use of such techniques is not new; one of the more popular early works employed them [14].

Despite the recent success of SAS, there remain persistent challenges with respect to ATR. One of the biggest challenges for obtaining good results is collecting and labeling the large amounts of imagery needed by contemporary machine learning (ML) algorithms using deep learning. SAS collection from unmanned underwater vehicles (UUVs) requires an inordinate amount of support infrastructure, including support vessels, ship crew, and divers, making the endeavor financially expensive. Furthermore, the objects sought during surveys are scarce.

It is also difficult to create an environment-independent ML algorithm when little training data is available. Practitioners quickly discover that deploying classification in unseen environments often results in high false alarm rates, even for state-of-the-art SAS ATR algorithms. Despite this, the detection performance of these methods is often quite good as the literature shows. However, all the false alarms returned by the algorithm quickly overwhelm human operators. Furthermore, objects simple to rule out by humans are often called by the ATR, preventing trust between the operators and the algorithms – this renders the ATR useless. When combined, these factors result in manual human inspection as a preferred means to cull the imagery; a costly process.

I-B Overview of Our Proposal

We present a deep learning classifier exhibiting significantly reduced false alarm rates compared to contemporary SAS ATR algorithms while maintaining high detection accuracy. Our approach achieves this by integrating high-level domain knowledge unique to SAS, in parcels which we call priors, into the training objective.

For a given problem, there exist attributes which are directly represented by the given training data. In the case of image classification, these are the image/label pairs used for training; this information is explicitly provided to the training algorithm. This is the common scheme for the vast majority of image classification problems. However, there exists domain-specific information which is projected into the training data but may not be explicitly represented by it. For example, for the dataset used in this work, we have some knowledge of how it was pre-processed before it was given to us for use. Specifically, we know that the detection algorithm used to produce the image chips is generally good at centering targets within the chip. However, we do not have any kind of bounds or statistics on how well the targets are centered. Furthermore, the true target centers are not explicitly encoded for each image. Thus, the fact that the detector is reasonably good at centering the targets is domain knowledge derived from a subject matter expert (SME) (we will see this forms the scene context prior, which we will discuss in later sections). We refer to these parcels of domain knowledge as priors. In a deep learning framework, each prior is employed through a regularization loss which is augmented to the primary task’s objective function. Fig. 2 is a Venn diagram illustrating the concept.

Fig. 2: The relationship of priors to the problem domain and the training data. Our priors encapsulate domain knowledge which is not explicitly represented in the data but projected into the data. The priors used in this work, structural similarity and structural scene context, are employed through regularization losses which are augmented to the primary task’s objective function which is classification error. We jointly train all losses so the network finds a minimum consistent with both the data and the domain priors.

In this work, we define two priors which, when used individually each improve classification performance, but when used together, act synergistically to improve performance beyond the use of each exclusively. The first prior we use addresses an often overlooked part of the SAS image reconstruction pipeline: image enhancement. This prior originates from the domain knowledge that image enhancement algorithms applied to SAS imagery aid in improving human interpretation. An example of such an algorithm is despeckling [48]. We name this prior the structural similarity prior because it encapsulates the function of any image enhancement algorithm: improve scene content in a way which improves downstream task performance (in this case classification) while simultaneously preserving scene structure and semantics. Quantitatively, this prior is captured by the regularization term in Eq (3) which is described in detail in Section III-C.

The second prior we use leverages a common quality exhibited by detection algorithms: the ability to localize targets. The majority of SAS classification algorithms are preceded by a detection algorithm whose purpose is to quickly find target-like objects in the queue of mostly-benign seafloor images. This prior originates from the domain knowledge that the detector algorithm is usually able to localize the target in the image, which humans also do when parsing a scene. We name this prior the structural scene context prior because it encapsulates the role of ground truth target position knowledge: well-learned features for image interpretation should encode target location. The images output by the detector usually center the target, and we can translate the target through image crops. We then encourage prediction of the new ground truth target location using the same features used for classification. In this manner, we improve the quality of the features, which consequently improves classification performance. Quantitatively, this prior is captured by Eq (4), which is described in detail in Section III-D.

Recall that our two priors, structural similarity prior and structural scene context prior, exist for the purpose of improving our primary task: classification. The domain knowledge captured by these two priors is employed through the use of regularization losses, Eq (3) and Eq (4), augmented to the primary task objective function, Eq (6). Together this forms the final loss we jointly-optimize during training, Eq (7).
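As a rough illustration of how regularization losses can be augmented to a primary classification loss, the sketch below forms a weighted sum of the three terms. The function names and the weights `w_ssp` and `w_sscp` are hypothetical and are not taken from Eq (6) or Eq (7); the numeric inputs are illustrative only.

```python
import math


def cross_entropy(probabilities, true_class):
    """Classification loss for one image: negative log-probability of the true class."""
    return -math.log(probabilities[true_class])


def joint_loss(class_loss, ssp_loss, sscp_loss, w_ssp=1.0, w_sscp=1.0):
    """Primary loss plus two weighted regularization terms.

    w_ssp and w_sscp are hypothetical weights, not values from the paper.
    """
    return class_loss + w_ssp * ssp_loss + w_sscp * sscp_loss


# Illustrative numbers only: softmax output over four classes, true class 1.
probs = [0.1, 0.7, 0.1, 0.1]
total = joint_loss(cross_entropy(probs, true_class=1), ssp_loss=0.2, sscp_loss=0.05)
```

Because all three terms are summed into one scalar, a single backward pass updates the shared features with respect to classification and both priors at once.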

Using the aforementioned priors, this paper makes the following technical contributions:

  1. Image enhancement through despeckling is often used to improve image interpretability for humans. We ask the question, Is there an image enhancement function which improves classification? We answer in the affirmative (results in Table III). To this end, we add a data-adaptive image enhancement network with a self-supervised, domain-specific loss to an existing classification network for the purpose of improving classification performance. Our image enhancement function is learned from the data, removing the onerous task of selecting a fixed despeckling algorithm. Furthermore, ground truth noisy/denoised image pairs as required by previous methods [63] are not needed.

  2. Most SAS ATRs only determine the presence of a target object but are agnostic to where in the image it appears. To this end, we incorporate a target localization network in addition to a classification network, again for the purpose of improving classification performance. Like the image enhancement network, this is also trained using a self-supervised, domain-specific loss. Now, our classifier not only learns target class, but also target position, thus acquiring scene context. Our target localization network is trained using the domain knowledge that objects are centered when passed to the classifier. We use the common data augmentation technique of image translation through cropping to induce new target positions when training, and have the target localization network estimate the induced target position in addition to the primary task of classification. This encourages the model to learn the “where” of the scene in addition to the “what”.

  3. We train the two aforementioned networks and the classification network simultaneously through the addition of regularization terms to the primary classification loss objective function. This makes our formulation self-supervised and thus requires no extra data or labels making it suitable as a drop-in replacement for training against existing datasets. Table III shows through ablation that each domain-specific loss improves classification performance and that when combined, the best classification performance is achieved.

Both of our priors incorporate structural domain knowledge, so we call our method Structural Prior Driven Regularized Deep Learning (SPDRDL). Neither prior has been used in previous SAS classification works. Table I shows the relationship among the losses used in our final objective function.

| Prior Name | Employed Domain Knowledge | Loss Type | Loss Equation |
|---|---|---|---|
| Structural similarity | Image enhancement improves human interpretability | Domain-specific | Equation 3 |
| Structural scene context | Targets output from detector are image centered | Domain-specific | Equation 4 |
| N/A | N/A | Primary task, classification | Equation 6 |

Table I: Brief summary of the loss terms in our objective function admitted from our domain priors. We list the relationships among the incorporated domain knowledge, the employed prior, and the associated regularization loss in our formulation. We also list the primary task, classification, and its loss function for completeness.

An overview of this paper follows. Section II provides a synopsis of past ATR approaches. Section III presents the necessary background and development of SPDRDL and its use of domain priors. Section IV shows experimental results of our algorithm and compares our results to other contemporary algorithms on a challenging real-world dataset. Finally, Section V provides a summary of our findings.

II Previous Work

Recent SAS ATR schemes have focused on improving feature representations through various means. For many years, representations were hand crafted and much of the research was in attempting to discover useful features through subject matter expert input. Techniques employed bag-of-words models [31] using handcrafted features as complex vocabularies describing SAS images. Eventually, techniques emerged which removed the need for this explicit feature engineering task. Dictionary learning methods were some of the first to forgo the explicit feature engineering path [43, 44] and automatically learn features as part of the classification process. Today, deep learning techniques are employed in the same vein [69].

Recently, investigations into alternate representations to improve classification have been shown to be a fruitful endeavor. The works [18, 68, 71] have examined representations derived from the $k$-space and have found that they contain useful information for classification. Traditionally, the human-consumable image, which arrives after extensive post-processing of the raw SAS data, has been used as input to the classifier. The human-consumable image is the result of a lengthy signal processing pipeline which discards information related to the frequency and direction of the received acoustic wavefronts. A coarse explanation of a typical image reconstruction pipeline is as follows: (1) raw sonar echoes are collected from the sonar array over multiple transmissions, (2) signal processing is applied to these echoes to correct them for imperfections, (3) the data is matched filtered to obtain resolution in the range dimension, (4) an optional motion compensation step is performed to interpolate the data to a regular grid (e.g. preparation for $\omega$-$k$ beamforming), (5) the data is beamformed to generate a single look complex (SLC) image, and (6) a human-consumable image is formed by taking the absolute value of the SLC and applying dynamic range compression (DRC). Consequently, the absolute value operation removes the phase portion of the SLC, potentially discarding useful information.
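Step (6) of the pipeline above can be sketched in a few lines. The sketch assumes a logarithmic DRC clipped at a hypothetical floor of -40 dB relative to the image peak; the text does not specify the DRC actually used, so `slc_to_display` and `floor_db` are illustrative names only.

```python
import math


def slc_to_display(slc, floor_db=-40.0):
    """Magnitude detection followed by log-scale dynamic range compression.

    slc is a 2D list of complex samples. The output lies in [0, 1].
    The log/clip form of the DRC and floor_db are assumptions for illustration.
    """
    # abs() keeps only the magnitude; the phase of each sample is discarded here.
    magnitudes = [[abs(z) for z in row] for row in slc]
    peak = max(max(row) for row in magnitudes)
    if peak == 0.0:
        return [[0.0 for _ in row] for row in magnitudes]
    display = []
    for row in magnitudes:
        out_row = []
        for m in row:
            db = 20.0 * math.log10(m / peak) if m > 0.0 else floor_db
            db = max(db, floor_db)               # clip everything below the floor
            out_row.append(1.0 - db / floor_db)  # map [floor_db, 0] dB to [0, 1]
        display.append(out_row)
    return display


# Two samples with equal magnitude but different phase map to the same pixel value.
slc = [[1 + 0j, 0 - 1j], [0.1 + 0j, 0j]]
display = slc_to_display(slc)
```

The first two samples differ only in phase, yet they produce identical display values, which is the information loss the paragraph describes.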

Deep learning has been applied to sonar ATR, resulting in a substantial improvement in classification performance. An initial work in the area is [69], in which convolutional neural networks (CNNs) were used to automatically learn features for classification. In [41], the authors demonstrated that the canonical transfer learning approach commonly used for training networks with limited data works well for SAS; a CNN pre-trained on the ImageNet dataset [10] was fine-tuned on SAS imagery, yielding good results. This work was generalized in [42], where the authors integrate the feature learning of both SAS images and selected photographs simultaneously, yielding good ATR performance despite limited training data. Finally, transfer learning among SAS sensors was demonstrated in [72], where a CNN initially trained on one SAS sensor was used to quickly train with another.

In several of the works described thus far, class imbalance has been mentioned as a noteworthy issue. Many SAS datasets have far more imagery of the benign seafloor than of objects of interest. To combat this issue, generative adversarial networks (GANs) have recently been applied to SAS for the purpose of generating more training data to balance the classes. In [49], a hybrid simulation and GAN-based approach is used to generate a simulated, optical version of the desired scene, and then a learned transform is applied to the simulated scene to give the appearance of a real SAS image. Their hybrid approach gives fine control over the generated scene content so the data balancing procedure can be accomplished with precision; particular objects, their orientations, and their range from the sonar can be specifically generated. Model-based GAN approaches have not been limited to SAS, but have also been used for real aperture sonar (RAS) systems, as described in [36], where GANs are applied to RAS imagery to augment data for underwater person detection.

Today, SAS systems are multi-band and operate over several frequency ranges. This ability has not been overlooked in the context of ATR. An early work utilizing multi-band sonars for classification is [61]. In this work, the authors demonstrate good detection performance when using a low-resolution broadband sonar in addition to a high frequency SAS. Even more recently, deep learning has been applied to multi-band SAS imagery with good success and without the need of using a pre-trained network [13, 17].

III Proposed Classification Method: SPDRDL

III-A Motivation of Approach

Recent ATR schemes using deep learning demonstrate great performance but at the cost of requiring large amounts of training data. As previously discussed, SAS data collection is costly resulting in small datasets which are almost always class imbalanced. Consequently, it is crucial to use all the available information from a SAS image during classifier training. To this end, we propose a new scheme which incorporates prior knowledge of SAS images in a novel way as a mechanism to extract more information from each image. This additional information is used to positively influence classifier training.

One mechanism by which we inject prior knowledge into the classification pipeline is by addressing the inherent speckle phenomenon present within every SAS image. The speckle is often seen as noise and a nuisance for human interpretation. Much work has been done in the development of despeckling algorithms [48] with the purpose of enhancing image interpretability. A natural outcome of this work is to ask if such types of enhancement are beneficial for improving classification performance and, if so, which methods provide the most benefit. Furthermore, can we forgo the onerous choice of selecting an enhancement algorithm and have the network learn the image enhancement transform in an unsupervised fashion?

Another prior which thus far has been overlooked in SAS ATR is the set of assumptions provided by the detector, sometimes called a pre-screener. As background, traditional SAS ATR methods use a detector-classifier approach. In this approach, a simple detection algorithm is first passed over the scene. The detector produces candidate images of interest, sometimes called chips, which are then passed to a classification algorithm for further inspection. Usually the detector is computationally efficient and can quickly prune areas of the image which appear to be benign (e.g. a flat sandy seafloor). Such a process reduces the amount of imagery the classifier has to process. It is believed such an approach was adopted initially for compute reasons: early SAS classification systems were not capable of processing every possible sub-tile of an image in a timely manner due to limited compute power. However, current compute capabilities, specifically in the form of graphics processing units (GPUs), provide ample compute power, enabling a classifier to examine the whole scene quickly and removing the need for the explicit detection step. Notwithstanding, for this approach to work, the classifier must be translation equivariant.

Good detectors can localize the target well and output SAS images with the target well-centered in the image. The Mondrian detector [70] is a good example of such a detector. It uses prior knowledge of the target and sonar geometries to model expected relationships among local pixel neighborhoods; it is quite capable of returning well-centered targets to a classifier. However, current classifiers for SAS do not use this information. They assume a target is present in the image, but do not explicitly estimate or assume its position. In contrast, our proposed method jointly estimates target class and target position.

Because our proposed method estimates target position (in addition to object class), it is desirable to have a feature space embedding which is translation equivariant. By equivariant, we mean that as the target translates smoothly across an image, its associated embedding also translates smoothly. Despite the convolutional nature of CNNs, they do not inherently provide translation equivariance. Recent works such as [73, 2, 40] have pointed out this common misconception and have made progress towards improvement. We utilize these techniques in our proposed method, making it very robust to scene translation.

Having the classifier robust to translations has an added benefit in that we can forgo the traditional detection step and run the classifier across the entire scene. This has an immediate benefit: the detection rate of the ATR is no longer bounded by the detector performance. For example, if a detector exhibits an eighty-percent detection rate, then even with an oracle classifier the overall ATR can achieve no better than an eighty-percent detection rate.
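The bound can be made concrete with a small worked example. The sketch assumes the classifier's detection rate is measured only on the targets the detector passes along, so the two rates multiply; `overall_detection_rate` is a hypothetical helper, and the 0.9 classifier rate is an illustrative number rather than a result from the paper.

```python
def overall_detection_rate(detector_rate, classifier_rate):
    """Fraction of true targets that survive both stages of a detector-classifier chain.

    Assumes classifier_rate is conditional on the target having been detected.
    """
    return detector_rate * classifier_rate


# Oracle classifier (rate 1.0) behind an eighty-percent detector.
bounded = overall_detection_rate(0.8, 1.0)
# A realistic, imperfect classifier can only lower the overall rate further.
realistic = overall_detection_rate(0.8, 0.9)
```

Running the classifier across the whole scene removes the first factor, so the overall rate is limited only by the classifier itself.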

III-B Feature Extraction Network

Traditional image classification pipelines using deep learning are composed of a feature extraction network followed by a classification network. Much recent work has been spent on designing an optimal feature extraction network, as illustrated by the vast number of off-the-shelf (OTS) options available. DenseNet [28], ResNet [26], Inception [58], MobileNet [27], VGGNet [56], and AlexNet [35] are popular examples of such OTS networks. We leverage the good results of these OTS network architectures and build SPDRDL around a popular one: DenseNet-121. SPDRDL is composed of DenseNet-121 as a feature extraction network (pink box) followed by a standard classification network (yellow box), as shown in Fig. 3.

III-C Structural Similarity Prior Via Data-Adaptive Image Enhancement Network

Building upon the feature extraction and classifier networks, we introduce a data-adaptive image enhancement network which is added to the front of the feature extraction network. This enhancement network is shown in the blue box in Fig.3. The purpose of this network is to learn an image transformation which improves classification performance while still maintaining the original image semantics by obtaining an enhanced image (i.e. enhanced for classification purposes not necessarily human consumption) that is structurally similar to the original. We implement this network as a U-Net architecture [52] with the original image as input and the enhanced image as output.

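To encourage image enhancement for classification, we utilize a novel loss function between the desired enhanced image and the original, the multiscale structural similarity measure (MS-SSIM) [65, 67], which is a scale-aware version of the SSIM measure,

$$\text{SSIM}(x, y) = \left[\frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}\right]^{\alpha} \left[\frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}\right]^{\beta} \left[\frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}\right]^{\gamma} \tag{1}$$

where $x$ and $y$ are the images being compared, $\mu$ and $\sigma$ are the patch-wise mean and standard deviation respectively of the corresponding image, $\sigma_{xy}$ is the covariance between images $x$ and $y$, $\alpha, \beta, \gamma$ are shaping constants, and $C_1, C_2, C_3$ are calibration constants. Higher SSIM values indicate higher perceptual similarity between the images. The SSIM is differentiable and tractable for incorporation into a deep learning network [60].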
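A minimal sketch of the SSIM computation follows, assuming a single window that covers the whole image rather than the patch-wise statistics described above. The constants follow the commonly used convention $C_1 = (0.01L)^2$, $C_2 = (0.03L)^2$, $C_3 = C_2/2$ with dynamic range $L = 1$; `ssim_global` and its argument names are hypothetical.

```python
import math


def ssim_global(x, y, c1=1e-4, c2=9e-4, alpha=1.0, beta=1.0, gamma=1.0):
    """SSIM over one window spanning both images (flat lists of equal length).

    Simplification: real implementations use local, patch-wise statistics.
    With non-integer gamma, a negative structure term would need clamping.
    """
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    c3 = c2 / 2.0
    luminance = (2.0 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2.0 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov + c3) / (sd_x * sd_y + c3)
    return luminance ** alpha * contrast ** beta * structure ** gamma


image = [0.1, 0.5, 0.9, 0.3]
inverted = [1.0 - v for v in image]           # same contrast, opposite structure
same_score = ssim_global(image, image)        # close to 1 for identical images
inverted_score = ssim_global(image, inverted)  # negative: structure is anti-correlated
```

The inverted image keeps the same contrast but flips the structure term, which is why its score falls below zero even though the pixel statistics are similar.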

MS-SSIM introduces scale dependence by computing the structural and contrast factors of SSIM over several staged, low-pass-filtered versions of the input image and then combining their results. It is given by,

$$\text{MS-SSIM}(x, y) = \left[l_M(x, y)\right]^{\alpha_M} \prod_{j=1}^{M} \left[c_j(x, y)\right]^{\beta_j} \left[s_j(x, y)\right]^{\gamma_j} \tag{2}$$

where functions $l$, $c$, and $s$ represent the corresponding luminance, contrast, and structural components of Eq (1) respectively, the subscript $j$ indexes the scale, and $M$ is the total number of scales to evaluate. For all constants, we use the same values as specified in [67].

By seeking to maximize the MS-SSIM between the original and the enhanced image in the enhancement net, we leverage human visual system priors designed into the MS-SSIM perceptual loss function [66]. This utilizes the desired domain knowledge which we seek to embed in our formulation: there exists an enhanced image which is structurally similar to the original image but is able to yield improved classification performance. Finally, we define the structural similarity prior (SSP) regularization loss as,

$$\mathcal{L}_{\text{SSP}} = 1 - \text{MS-SSIM}(x, \hat{x}) \tag{3}$$

where $x$ is the input image and $\hat{x}$ is the improved image output by the enhancement network. Without this loss term, the network has no notion of a “noise model” and simply seeks to minimize the weights of the function with no understanding of the hand-crafted network structure we designed to exploit the domain prior.

III-D Structural Scene Context Prior Via Target Localization Network

As previously mentioned, the detector returns targets centered in the image. During the data augmentation process, these targets are translated by a random amount. This augmentation procedure is commonly used in other SAS ATR methods. However, our method is different in that we do not discard the translation parameters but encourage the network to estimate them while simultaneously performing classification. In this manner, we embed the target position domain knowledge into the feature extraction network by encouraging it to learn a spatially-aware context of the scene in addition to features for classification. With this prior, the likelihood of the network learning features which are not target-centric is reduced, and the creation of features derived from biases within the dataset, like seafloor texture, is reduced.

We encourage the network to learn target localization by augmenting our feature extractor with a target localization network whose task is to estimate the target position from the feature embedding. Recall that the ground truth for this estimate is determined through the data augmentation procedure. The target localization network is represented by the orange box in Fig. 3. It is composed of a set of $1 \times 1$ convolutions to reduce the dimensionality of the embedding. This reduction serves as a bottleneck which then feeds two dense layers; neither has a post-activation function, so they simply return the position estimates. Formally, we define the structural scene context prior (SSCP) regularization loss as,

$$\mathcal{L}_{\text{SSCP}} = \text{MSE}(\Delta, \hat{\Delta}) \tag{4}$$

where $\text{MSE}$ represents the mean-squared error between the shift (i.e. translation) applied during data augmentation, $\Delta$, and the shift estimated by the network, $\hat{\Delta}$.

Fig. 3: The SPDRDL network architecture; the network input is a SAS image and the output is a classification and target center position when a target is present. SPDRDL is composed of four modules: image enhancement network, feature extraction network, target localization network, and a classification network. SPDRDL leverages two domain priors to improve classification: (1) image enhancement algorithms like despeckling improve image interpretability and (2) the detector produces SAS images with well-centered targets. For the former, an enhancement network leverages the human visual system priors incorporated into the structural similarity prior to enhance the image for classification. For the latter, input images are translated as part of the data augmentation procedure during training and this translation is estimated in addition to predicting image class. Image classification, enhancement, and target localization are simultaneously trained.
Fig. 4: The detector returns well-centered target images. We can use this prior during our data augmentation procedure. During data augmentation, we translate the image via random crops and record the effective translation shift induced as $\Delta$. To incorporate this prior into the network, we add a target localization task which encourages the network to estimate this translational shift in addition to outputting classification. As an example of this procedure, the detector returns a well-centered target image (left) and random crops of this image are fed to the network as data augmentation (middle, right). The target localization network estimates the difference between the true center of the target (green) and the shifted center (red). The translation shift estimate is denoted as $\hat{\Delta}$.

Fig. 4 shows an example of how the positional shifts are created. First, the Mondrian detector returns an image with the target centered. Next, a random crop is applied to the image during data augmentation. This induces a translation of the target in the image. We denote this translation as $\Delta$ and add this information to the network via backpropagation through the loss of Eq (4).
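The augmentation step and the resulting self-supervised label can be sketched as below. The sketch assumes the target sits exactly at the center of the detector chip and defines the shift as target position minus crop center in (row, column) pixels; that sign convention, the chip and crop sizes, and the helper names are assumptions for illustration.

```python
import random


def random_crop_with_shift(image, crop_h, crop_w, rng):
    """Crop at a random offset and return the crop plus the induced target shift.

    Assumes the target is centered in `image`. The shift is the target position
    relative to the crop center, in pixels, as (row, col).
    """
    full_h, full_w = len(image), len(image[0])
    top = rng.randint(0, full_h - crop_h)
    left = rng.randint(0, full_w - crop_w)
    crop = [row[left:left + crop_w] for row in image[top:top + crop_h]]
    # (full - 1)/2 is the target center, top + (crop - 1)/2 is the crop center.
    shift = ((full_h - crop_h) / 2.0 - top, (full_w - crop_w) / 2.0 - left)
    return crop, shift


def mse(true_shift, predicted_shift):
    """Mean-squared error between the applied and the estimated shift."""
    return sum((t - p) ** 2 for t, p in zip(true_shift, predicted_shift)) / len(true_shift)


# A 9x9 chip with a single bright pixel standing in for a centered target.
chip = [[0.0] * 9 for _ in range(9)]
chip[4][4] = 1.0
crop, shift = random_crop_with_shift(chip, 7, 7, random.Random(0))
loss_if_perfect = mse(shift, shift)          # zero when the estimate is exact
loss_if_centered_guess = mse(shift, (0.0, 0.0))
```

No manual annotation is involved: the label `shift` is a by-product of the crop offsets that the augmentation already draws.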

Thus far in this subsection, we have developed a novel method to encourage the network to learn target position within the scene, and we have done so in a self-supervised fashion. One assumption we have made but not yet addressed is that the feature space is translation equivariant. Despite the use of convolutional layers in our network, non-unitary strides associated with convolution and pooling operations prevent translation equivariance, as we will show. Additionally, we have not provided local pixel positioning information to the network, likely resulting in position information being determined by specific neurons in the dense layers, which is undesirable for generalization. In the next two subsections, we will address each of these issues.

III-D1 Addition of Anti-Aliasing Filtering Before Pooling Layers

Most CNNs are not inherently shift invariant when combined with pooling layers [2]. This is caused by the lack of proper filtering during image subsampling in pooling and convolutional layers when the stride is greater than one. Strided layers perform two operations: (1) a filtering procedure which is run over the entire image (in the case of max pooling, this is an order-statistic filter), and (2) image subsampling, most commonly done by striding, which reduces the image dimensions and hence the compute burden. The striding operation subsamples the image for the purpose of decimation. During this procedure, the energy of the discarded frequencies is folded into the retained lower frequency band, reducing the signal-to-noise ratio (SNR) of the resulting embedding. This results in a feature space which is not translation equivariant: translations in the input image do not correspond to translations in the feature embedding.

We can overcome the faults of the traditional pooling layer by introducing an anti-aliasing (AA) filter before all strided operations [74]. In our setup, this means placing an AA filter before all strided convolutions and pooling operations. The AA filter prevents out-of-band frequencies from aliasing back into the remaining spectrum after subsampling. This increases the SNR of the embedding and encourages translation equivariance.
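To make the idea concrete, the sketch below compares conventional max pooling with an anti-aliased variant on a toy image. It is an illustration under stated assumptions rather than the network's actual layer: the 3×3 binomial kernel `BLUR`, the edge padding, and the ordering (dense max filter, low-pass filter, then subsampling) are choices made for this example, and the helper names are hypothetical.

```python
import numpy as np

# Assumed 3x3 binomial low-pass kernel (sums to one); illustrative only.
BLUR = np.outer([1.0, 2.0, 1.0], [1.0, 2.0, 1.0]) / 16.0

def filter2d_same(x, kernel):
    """Correlate a 2D map with a small kernel using edge padding (same size)."""
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (kh, kw))
    return np.einsum("ijkl,kl->ij", windows, kernel)

def max_dense(x, size=2):
    """Max filter evaluated at every pixel (stride one), same size as input."""
    padded = np.pad(x, ((0, size - 1), (0, size - 1)), mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (size, size))
    return windows.max(axis=(2, 3))

def plain_max_pool(x, stride=2):
    """Conventional max pooling: max filter followed directly by subsampling."""
    return max_dense(x)[::stride, ::stride]

def aa_max_pool(x, stride=2):
    """Anti-aliased max pooling: max filter, low-pass filter, then subsample."""
    return filter2d_same(max_dense(x), BLUR)[::stride, ::stride]

rng = np.random.default_rng(0)
image = rng.random((64, 64))
shifted = np.roll(image, 1, axis=1)      # one-pixel horizontal translation

# Mean absolute change of the pooled output caused by the one-pixel shift.
plain_change = np.abs(plain_max_pool(image) - plain_max_pool(shifted)).mean()
aa_change = np.abs(aa_max_pool(image) - aa_max_pool(shifted)).mean()
print(f"plain: {plain_change:.4f}  anti-aliased: {aa_change:.4f}")
```

A smaller change indicates an output that is less sensitive to the input shift; in this toy setting the low-pass filter is expected to reduce that sensitivity because it suppresses the high-frequency content that subsampling would otherwise alias.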

III-D2 Feature Position Encoding

As previously mentioned, CNNs do not naturally provide translation invariance when used in tandem with strided convolution and/or pooling layers. In addition, [38] demonstrated that CNNs have difficulty with position-oriented tasks because they do not encode feature position. At first, this seems surprising given the translational nature of the convolution operator. However, the convolution operator takes a 2D map as input and outputs a 2D map; a feature’s position through this process is a function of the map domain but is not explicitly coded in the representation. Hence, when a 2D map is flattened and used as input to a dense layer, a feature’s position is lost.

The problem of CNNs not recording positional information was not only noted by [38]. In an early and popular work, [75] noted that CNNs are good at providing the “what” but not the “where,” and specifically designed their CNN architecture to compensate for this fault. Furthermore, the supplementary material of [62] also notes this, reporting that the addition of positional information improved image in-painting for their Deep Image Prior technique.

We augment SPDRDL with target position information by using the CoordConv solution of [38]. In particular, we augment the output of the image enhancement network, , with two additional channels, each one describing a positional dimension of the input map as described by Eq (5),


where and are the generated 2D maps augmented to the channel dimension of the input, are the height and width of the image respectively in pixels, and are pixel locations.
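A minimal sketch of this channel augmentation is given below. The function name `add_coord_channels` is hypothetical, and the scaling of the coordinates to the interval [-1, 1] is an assumption made for the example; it is not necessarily the normalization of Eq (5).

```python
import numpy as np

def add_coord_channels(feature_map):
    """Append row- and column-coordinate channels to an (H, W, C) map.

    Coordinates are scaled to [-1, 1]; this scaling is an assumption.
    """
    height, width = feature_map.shape[:2]
    rows = np.linspace(-1.0, 1.0, height)
    cols = np.linspace(-1.0, 1.0, width)
    row_map, col_map = np.meshgrid(rows, cols, indexing="ij")
    return np.concatenate(
        [feature_map, row_map[..., None], col_map[..., None]], axis=-1
    )

enhanced = np.zeros((256, 256, 1))       # stand-in for the enhanced image
augmented = add_coord_channels(enhanced)
print(augmented.shape)
```

Each pixel of the augmented map carries its own position, so later layers can condition on location without relying on specific dense-layer neurons.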

III-E Classification Loss

Categorical cross-entropy is a commonly used loss function for penalizing classification error in neural networks. It accounts for errors probabilistically by measuring the amount of surprise between the predicted and true labels. The measure works well in the presence of balanced classes and accurate labels. However, for SAS the number of negative examples far outweighs the number of positive examples.

To mitigate the shortcomings of categorical cross-entropy in the presence of class imbalance, we use a weighted version of the measure called the focal loss [37]. The focal loss is given by Eq (6),


where is the number of classes, is the true probability of class , is the estimated probability of class , and we use the strength coefficients given by the paper of and . Focal loss is a weighted version of the cross-entropy loss in which correct classifications are de-weighted. Consequently, the focal loss places more emphasis on grossly misclassified samples than on samples that are already classified nearly correctly. Through the use of the focal loss, the error gradients of correct classifications are greatly diminished during training, while grossly incorrect classifications maintain their error magnitude. In this way, the focal loss focuses the training on the misclassified samples and largely leaves the easy, correct classifications untouched.
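The de-weighting effect can be illustrated numerically. The sketch below implements a common multi-class form of the focal loss alongside plain categorical cross-entropy; the coefficient values `alpha=0.25` and `gamma=2.0` are assumed here as commonly quoted defaults and are not taken from this text, and the function names are hypothetical.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-7):
    """Categorical cross-entropy per sample for one-hot `y_true`."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -(y_true * np.log(y_pred)).sum(axis=-1)

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss per sample: cross-entropy scaled by alpha * (1 - p)^gamma.

    alpha and gamma are assumed values chosen for illustration.
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    per_class = -alpha * y_true * (1.0 - y_pred) ** gamma * np.log(y_pred)
    return per_class.sum(axis=-1)

y_true = np.array([[1.0, 0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0, 0.0]])
y_pred = np.array([[0.95, 0.03, 0.01, 0.01],    # easy, nearly correct
                   [0.10, 0.60, 0.20, 0.10]])   # hard, grossly misclassified

# Ratio of focal loss to cross-entropy equals alpha * (1 - p_true)^gamma.
ratio = focal_loss(y_true, y_pred) / cross_entropy(y_true, y_pred)
print(ratio)
```

With these assumed coefficients the easy sample retains 0.25 × 0.05² of its cross-entropy loss while the hard sample retains 0.25 × 0.9², a relative emphasis of more than two orders of magnitude on the hard sample.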

There are several ways to place emphasis on negative samples during training, a common one being to assign label weights. However, we chose the focal loss because of several positive properties it offers for our setup, which we describe next.

The first benefit realized by focal loss is that it can be viewed through the lens of importance sampling [32] but without the explicit overhead associated with such techniques. Recently, [68] showed importance sampling works well to improve the performance of SAS ATR. In importance sampling, misclassified samples are shown more often during the training procedure than correctly classified samples. In a similar manner, using the focal loss can instill a similar training policy without the overhead of maintaining a list of the misclassified samples. Using the focal loss, a mini-batch of images is fed to the training algorithm and the misclassified samples are dynamically weighted proportional to their error. For each batch, correctly classified samples induce little error gradient and effectively are removed from the batch.

The second benefit realized by focal loss is that the effective batch size is reduced over time. Reducing batch sizes has been associated with better generalization error [33]. At the beginning of training, all samples in the minibatch are effectively hard to classify. Assuming easy- and hard-to-classify samples are distributed uniformly across minibatches, as training progresses some samples become easier to classify and their error gradients vanish, effectively removing them from the minibatch and reducing the effective minibatch size.

III-F SPDRDL: Jointly Learned Image Enhancement and Object Location Estimation

In the previous sections, we examined sources of structural information in SAS images currently not utilized by contemporary ATR methods. Our proposed approach builds upon an existing CNN backbone network commonly used for feature extraction by utilizing this overlooked structural information. In this section, we bring together the aforementioned sections and fully present our proposed method, SPDRDL.

Incorporating the losses discussed in the previous section, we arrive at the final loss function for SPDRDL, Eq (7),


where is the input image, is the data adaptive enhanced image, is the true target class, is the predicted target class, is the true target translation, is the estimated target translation, are the target localization network parameters, are the image enhancement network parameters, is the classification network parameters, is the feature extraction network parameters, and are regularization weights. Finally, is a class-dependent weight for the localization task given by Eq (8),


SPDRDL’s network description is in Table II. Convolutional layers are followed by ReLU activation and use the initialization of [25]. Wherever subsampling was used (which includes pooling layers and strided convolutions), anti-aliasing filtering was applied before subsampling using a 3×3 kernel of .

Layer Name | Layer Function | Dimensions | # Filters | Input
input1 | Input | N/A | N/A | N/A
conv1a | Convolution | 3x3 | 16 | input1
conv1b | Convolution | 3x3 | 16 | conv1a
pool1 | AA Max Pooling | 2x2 | N/A | conv1b
conv2a | Convolution | 3x3 | 32 | pool1
conv2b | Convolution | 3x3 | 32 | conv2a
pool2 | AA Max Pooling | 2x2 | N/A | conv2b
conv3a | Convolution | 3x3 | 64 | pool2
conv3b | Convolution | 3x3 | 64 | conv3a
pool3 | AA Max Pooling | 2x2 | N/A | conv3b
conv4a | Convolution | 3x3 | 128 | pool3
conv4b | Convolution | 3x3 | 128 | conv4a
pool4 | AA Max Pooling | 2x2 | N/A | conv4b
conv5a | Convolution | 3x3 | 256 | pool4
conv5b | Convolution | 3x3 | 256 | conv5a
up1 | Upsampling | 2x2 | N/A | conv5b
conv6a | Convolution | 3x3 | 128 | up1
merge1 | Concatenate | N/A | N/A | conv6a, conv4b
conv6b | Convolution | 3x3 | 128 | merge1
conv6c | Convolution | 3x3 | 128 | conv6b
up2 | Upsampling | 2x2 | N/A | conv6c
conv7a | Convolution | 3x3 | 64 | up2
merge2 | Concatenate | N/A | N/A | conv7a, conv3b
conv7b | Convolution | 3x3 | 64 | merge2
conv7c | Convolution | 3x3 | 64 | conv7b
up3 | Upsampling | 2x2 | N/A | conv7c
conv8a | Convolution | 3x3 | 32 | up3
merge3 | Concatenate | N/A | N/A | conv8a, conv2b
conv8b | Convolution | 3x3 | 32 | merge3
conv8c | Convolution | 3x3 | 32 | conv8b
up4 | Upsampling | 2x2 | N/A | conv8c
conv9a | Convolution | 3x3 | 16 | up4
merge4 | Concatenate | N/A | N/A | conv9a, conv1b
conv9b | Convolution | 3x3 | 16 | merge4
conv9c | Convolution | 3x3 | 16 | conv9b
conv9d | Convolution | 3x3 | 2 | conv9c
conv9e | Convolution | 1x1 | 1 | conv9d
lambda1 | | N/A | N/A | conv9e
densenet1 | AA Densenet121 | N/A | N/A | lambda1
gap1 | Global Average Pooling | N/A | N/A | densenet1
classification | Dense with softmax | 4 | N/A | gap1
conv10 | Convolution | 1x1 | 256 | densenet1
conv11 | Convolution | 1x1 | 128 | conv10
conv12 | Convolution | 1x1 | 64 | conv11
flatten1 | Flatten layer | N/A | N/A | conv12
xPosEstimate | Dense | 1 | N/A | flatten1
yPosEstimate | Dense | 1 | N/A | flatten1
Table II: Description of the SPDRDL network architecture. The network input is a 256×256 pixel grayscale SAS image normalized to . The network has two outputs, a classification output and a target position output. AA indicates anti-aliasing was applied to the layer.

IV Experiments

In this section, we describe how we measure the performance of SPDRDL and demonstrate its efficacy against contemporary methods. First, we describe how we set up the experiments. Next, we describe the comparison methods. Finally, we present results comparing all the methods.

IV-A Setup

The ultimate goal of our experiments is to show the superiority of SPDRDL over existing methods. To that end, we characterize performance in two regimes. The first regime is classification performance for each object class. We show results in this regime using confusion matrices, which provide an overview of classifier accuracy as a function of class. The second regime is classification performance in a one-versus-all scenario, whereby the target classes are collated into a single group. This collation converts our four-class problem into a two-class problem consisting of a target class and a background class. Additionally, we use a variation of this regime by showing the performance of a particular target class versus all others.
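The collation into a binary problem can be sketched as follows. This is one plausible reading, not a confirmed detail of the evaluation: the background class is assumed to sit at index 0, the target score is assumed to be one minus the background probability, and the function names are hypothetical.

```python
import numpy as np

BACKGROUND = 0  # assumed index of the background class

def target_score(probs):
    """Collapse four-class probabilities into a target-versus-background score."""
    probs = np.asarray(probs, dtype=float)
    return 1.0 - probs[..., BACKGROUND]

def one_versus_rest(labels, probs, positive_class):
    """Binary labels and scores for one class against all other classes."""
    labels = np.asarray(labels)
    probs = np.asarray(probs, dtype=float)
    return (labels == positive_class).astype(int), probs[..., positive_class]

probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1]])
labels = np.array([0, 1])
scores = target_score(probs)
binary_labels, class_scores = one_versus_rest(labels, probs, positive_class=1)
```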

As mentioned, we present results in a one-versus-all regime by converting the multi-class problem into a binary one. For binary class problems, many metrics exist by which to measure performance. A popular choice is the area under the receiver operating characteristic curve (AUCROC). AUCROC can be interpreted as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample. However, the metric has been shown to be sensitive to class imbalance [9], which is pervasive here. Therefore, we choose the area under the precision-recall curve (AUCPR) as our performance metric, based on the analysis of [9], which determined that AUCPR is superior to AUCROC when the negative class samples greatly outnumber the positive class samples, as is true here. Furthermore, [9] demonstrates the stability of AUCPR over AUCROC, meaning that a performance curve dominating in precision-recall (PR) space also dominates in ROC space, but not vice versa.
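For reference, one standard estimator of the area under the PR curve is the average precision, sketched below. The choice of this step-wise estimator is an assumption; the text does not state which AUCPR estimator was used, and tied scores are broken here by input order.

```python
import numpy as np

def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    labels = np.asarray(labels, dtype=float)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = labels[order]                           # 1 where a positive is retrieved
    true_pos = np.cumsum(hits)
    precision = true_pos / np.arange(1, hits.size + 1)
    # Average the precision at the rank of each positive sample.
    return float((precision * hits).sum() / hits.sum())

labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]            # two targets among ten samples
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
ap = average_precision(labels, scores)
print(round(ap, 4))
```

In this toy example the two positives are retrieved at ranks one and three, giving precisions of 1 and 2/3 and an average precision of 5/6. Because every threshold contributes, no operating point has to be chosen.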

As previously mentioned, we will use confusion matrices to measure per class accuracy. For the one-versus-all cases mentioned, we will use AUCPR. This metric has a benefit over a confusion matrix because it does not force us to specify a threshold as all thresholds are evaluated. Usually, a threshold is set to optimize for a specific performance metric which is context dependent. In lieu of having to select a particular context, we simply report AUCPR on a one-versus-all basis.

Sonar image fidelity is often a function of range. For example, spreading and absorption losses in the medium attenuate the reflected signals as a function of range. Therefore, the SNR of sonar echoes is reduced at long ranges. To measure the contribution of such effects on our classifier, we evaluate the classifier performance as a function of observation range.

Recalling that CNNs with strides or pooling are not translation invariant, we also evaluate translation performance. Ideally, translation invariance would be evaluated at every possible translation of the target, but this is prohibitively expensive to compute. Instead, we evaluate translation invariance at eight extreme shifts of 59 cm as shown in Fig.6. Good translation invariance yields the same classification regardless of shift, so we measure performance as the standard deviation (stdev) of the output score across nine shifts (the eight extreme shifts plus the center crop).

IV-B Dataset Description

We train and evaluate SPDRDL on the dataset of images which are output from the Mondrian detector of SAS imagery collected by the CMRE MUSCLE SAS sensor [3]. It is the same dataset used in [18] but with three modifications: (1) The original dataset contains detections on image boundaries and these images are extrapolated by mirroring resulting in target shapes which are not seen in the real environment. (2) Some of the images contained quadratic phase error (QPE) based on visual inspection [8]. We removed this error by applying a brute force autofocus in the -space domain. (3) The images were dynamic range compressed using an algorithm based on the rational mapping function of [53].

Fig. 5: The dataset we use for our experiment is split into two groups based on collection year. Dataset A is used for the training set and Dataset B is used for the validation and test sets.

Overall, the dataset is composed of two partitions (Dataset A and Dataset B) based on collection year, as depicted in Fig.5. Dataset A is composed of 27,748 images containing 1,385 targets collected from 2008 through 2013. Dataset B is composed of 21,181 images containing 639 targets collected from 2013 through 2018. Each image in the dataset has a resolution of 1.5 cm and a size of 335×335 pixels. However, translation is induced by cropping the images down to 256×256 pixels at training and inference time.

(a) (b)
Fig. 6: The validation and test sets are derived from eight-way neighbor translations of 59 cm. The original tile is from Dataset B and cropped. All nine croppings are used for the test set, with a random crop used for the validation set, shown by the boxed image. We show the test and validation data generation scheme for example images of the (a) background class and (b) target class.

The data is partitioned into three sections for evaluation purposes: (1) The training set is composed of Dataset A and augmentations of it. (2) The validation set is composed of a single translation of each image of Dataset B. (3) The test set is composed of nine translations of each image of Dataset B.

The translations applied to the validation and test set are from the set of shifts in the horizontal and vertical dimensions of the set . This configuration yields a total of nine possible shifts (including the center crop which is not shifted at all) for each image of the test set. The validation set is composed of one random crop from the set of nine available for each image. Fig.6 shows two examples of how the proposed cropping scheme contributes to the test and validation sets. In the examples, a given image from Dataset B is eight-way shifted by 59 cm, yielding a total of nine images (the mosaic) which are assigned to the test set and one image (bounded by the solid line) which is randomly selected for the validation set. Overall, the training set is composed of 27,748 images, the validation set is composed of 21,181 images, and the test set is composed of 190,629 images.
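The cropping scheme and the resulting set sizes can be checked with a short sketch. The pixel shift is derived here by rounding 59 cm at 1.5 cm per pixel to 39 pixels, and the center crop origin is taken as the integer half of the 79-pixel margin; both are assumptions for the example, as are the names used.

```python
from itertools import product

import numpy as np

RESOLUTION_CM = 1.5                        # pixel size stated for the dataset
TILE_SIZE = 335
CROP_SIZE = 256
SHIFT_PX = round(59.0 / RESOLUTION_CM)     # 59 cm is roughly 39 pixels (assumed rounding)

def nine_crops(tile):
    """Return the centre crop and its eight shifted neighbours, keyed by (dy, dx)."""
    centre = (TILE_SIZE - CROP_SIZE) // 2
    crops = {}
    for dy, dx in product((-SHIFT_PX, 0, SHIFT_PX), repeat=2):
        top, left = centre + dy, centre + dx
        crops[(dy, dx)] = tile[top:top + CROP_SIZE, left:left + CROP_SIZE]
    return crops

rng = np.random.default_rng(0)
tile = rng.random((TILE_SIZE, TILE_SIZE))  # stand-in for a Dataset B tile
test_crops = nine_crops(tile)              # all nine go to the test set

# One of the nine crops is drawn at random for the validation set.
keys = sorted(test_crops)
validation_crop = test_crops[keys[rng.integers(len(keys))]]

test_set_size = 9 * 21181                  # nine crops per Dataset B image
print(SHIFT_PX, len(test_crops), test_set_size)
```

Nine crops for each of the 21,181 Dataset B images gives 190,629 test images, which agrees with the count stated above.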

IV-C SPDRDL Training Procedure

We train SPDRDL in a similar fashion to most other CNNs. We use a mini-batch size of sixteen, in which half the batch is drawn from the background class and half from the three target classes. This 50:50 background-to-target-class split is based on the analysis of [64]. Recall that images from the dataset are 335×335 pixels. For training, a random crop of 256×256 pixels is selected from each image in the mini-batch. The associated translational shift induced by cropping is recorded for the images containing a target. For each image in the mini-batch, the network estimates the class and translational shift (when the image is of a target class) and the errors are backpropagated appropriately. One epoch of training consists of the number of mini-batches required to see each image of the training set once on average.
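The balanced mini-batch construction can be sketched as below. Several details are assumptions made for the example and are not stated in the text: sampling is done with replacement, target images are drawn uniformly across the three target classes, and the background class is label 0. The function name is hypothetical.

```python
import numpy as np

def balanced_batch_indices(labels, batch_size, rng, background_class=0):
    """Draw indices for a mini-batch that is half background and half target.

    Sampling with replacement is an assumption for this sketch.
    """
    labels = np.asarray(labels)
    background = np.flatnonzero(labels == background_class)
    targets = np.flatnonzero(labels != background_class)
    half = batch_size // 2
    chosen = np.concatenate([
        rng.choice(background, half),
        rng.choice(targets, batch_size - half),
    ])
    rng.shuffle(chosen)
    return chosen

rng = np.random.default_rng(0)
labels = np.array([0] * 90 + [1, 2, 3] * 3 + [1])   # toy, heavily imbalanced labels
batch = balanced_batch_indices(labels, 16, rng)
print(labels[batch])
```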

CNNs perform best when lots of training data is available. In many situations though, large amounts of training data are not available. We call these instances low training data scenarios and in them, application of a domain prior becomes particularly important as its presence can significantly boost classification performance. To study this effect, we trained each of the methods on a random (but consistent across methods) 10% subset of the training set. Fig.10 shows these results and, for convenience, the results when the full training data is available; recall these results are the AUCPRs of Fig.9. We show that the application of domain priors results in improved performance over all of the comparison methods when operating in low training data scenarios. These results demonstrate how application of our domain priors, image enhancement and target localization, improves performance in both abundant and low training data scenarios. The priors we introduce utilize information implicit during human interpretation and provide useful contextual information for our image classification method.

Deep networks often require a hyper-parameter search for optimal performance; SPDRDL is no different. SPDRDL uses the RMSProp optimization scheme [59] with a fixed learning rate of which was determined through cross-validation. Furthermore, the weights of the domain priors in Eq (7) were also found through cross-validation giving the best results when and .

IV-D Ablation Study: Impact of Domain Priors

Table III shows the performance of SPDRDL with no additional multi-task losses and the incremental addition of each domain prior. For the low and high training scenarios, each additional loss provides improved classification performance with both priors giving better performance than each individual prior.

Next, we compare SPDRDL against three common despeckling techniques to show the benefit of using the learned enhancement network with the SSP. Table IV shows the results for the high training scenario when we retrain the network by setting in Eq (7) and supplanting the enhancement network with one of the following despeckling filters: Gaussian filter, median filter, and total variation [19, 5]. We can see that the best AUCPR performance of the pre-processed despeckled images is 0.9451 which is not as good as when the SSP is present, 0.9538.

Domain Priors | AUCPR (10% of training data) | AUCPR (100% of training data)
None, Only Classification Loss (CL) | 0.8742 | 0.9281
CL + Structural Scene Context Prior (SSCP) | 0.8919 | 0.9503
CL + Structural Similarity Prior (SSP) | 0.8969 | 0.9456
Both, CL + SSCP + SSP | 0.9079 | 0.9538
Table III: We evaluate performance of each domain prior in our loss function, Eq (7), to demonstrate their utility. AUCPR is reported on the test set for the high (100% of training data available) and low (10% of training data available) training data scenarios. Note, the enhancement network is still present in the CL and CL+SSCP scenarios but the associated regularization loss is removed from the objective function.
Despeckling Algorithm | AUCPR (100% of training data)
CL + SSCP + Gaussian Filter | 0.9450
CL + SSCP + Median Filter | 0.9409
CL + SSCP + Total Variation | 0.9451
Table IV: We evaluate the performance of our SSP domain prior against several off-the-shelf despeckling methods to show SSP’s utility. We do this by retraining the network but removing SSP’s associated loss term in Eq (7) and the Enhancement Network in Fig.3 (SSCP is still included). We feed the network three types of despeckled imagery and report AUCPR using 100% of the available training data. We see that despeckling does give some performance gains over the CL configuration of Table III, but not as much as when the SSP is active (CL+SSP and CL+SSP+SSCP configurations of Table III). Recall, our method (CL+SSCP+SSP) yields an AUCPR of 0.9538 as shown in Table III.

IV-E Comparison Against State of the Art

We demonstrate the efficacy of SPDRDL by comparing against three state-of-the-art deep learning methods for sonar ATR and two recent shallow-learning methods. The deep learning methods were trained using TensorFlow 1.13.1 [1]. The number of trainable parameters for each network is given in Table V. Development of the deep learning algorithms and their specific parameters follows.

Emigh, et al. (IOA SAS/SAR 2018) [13]. This network is based on the Resnet-18 architecture, does not rely on Imagenet pre-training, ingests dual-band SAS images, and is a binary classifier with output of a single scalar indicating target/non-target score. We modify the network to input a single-band SAS image and output four classes by using a softmax function after the last dense layer. Because the authors originally specified a binary cross-entropy loss, we instead use the categorical cross-entropy loss to adapt training to this multi-class scenario. We trained using the Adam optimizer [34] with a learning rate of . The paper mentions decreasing the learning rate when a loss plateau occurs but does not give details on its parameters. In lieu of this, we forgo the learning rate schedule and invoke the same early stopping rule used during SPDRDL training.

Galusha, et al. (SPIE 2019) [17]. This network has only a few layers and is an Alexnet-like architecture. Like Emigh, et al., this network originally consumed dual-band SAS imagery and output a binary classification score representing target/non-target. We modify the network in a similar fashion as Emigh, et al. by using only a single-band SAS image as input and modify the output to support classification of four classes using a softmax activation after the last dense layer. As in Emigh, et al., we use the categorical cross-entropy loss when training the network as the binary cross-entropy loss was originally specified by the authors. We train the network in the same fashion the authors used in their work: stochastic gradient descent for 2,000 epochs at a learning rate of .

DenseNet121 (CVPR 2017) [28]. This is a common state-of-the-art off-the-shelf DenseNet architecture with 121 layers pre-trained on the ImageNet dataset. We choose this network for comparison because it is the feature extraction network used in SPDRDL, and it serves as a baseline to provide evidence that our proposed priors improve classification performance. As with SPDRDL, we use the focal loss instead of the cross-entropy loss in order to demonstrate that our performance gains are not simply due to this different classification loss. DenseNet121 ingests three-channel color imagery; we simply replicate the SAS input image over two additional channels to arrive at a three-channel image. Finally, we use the global average pooling option after the feature extraction layer and apply a four-class output with softmax. We train using the same procedure as SPDRDL.

BoW-HOG (CVPR 2015). This method is a bag-of-words using histogram of oriented gradients features inspired by [31]. A comparison to a similar algorithm was also made by [43]. For this approach, each image is divided into pixel tiles of which the HOG features are computed. These features are then clustered using mini-batch k-means clustering [54] into what are known as words in this setup. The clusters of words form the vocabulary for the bag-of-words model.

The size of the vocabulary, , and the regularization parameters for the SVM, , are chosen using a random search [4, 47] of fifty iterations. The hyper-parameters for the search are chosen from and where represents the uniform distribution over the specified interval. The hyper-parameter search returned best results for . A radial basis function kernel was used. BoW-HOG method is costly to compute so we use no data augmentation during training and measure performance solely on the center crops of Dataset B.

DSRC (IEEE TGRS 2017). This method is a dictionary sparse reconstruction (DSRC) algorithm inspired by [43]. We use mini-batch dictionary learning [39] with coordinate descent to learn the dictionary atoms.

At test time, the learned dictionary for each class is used to reconstruct the test images one class at a time. The inverses of the resulting reconstruction errors are transformed by the softmax function into class probabilities.
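The conversion from per-class reconstruction errors to class probabilities can be sketched as follows. The function name and the example error values are hypothetical, and no temperature or scaling of the inverse errors is applied, which is an assumption.

```python
import numpy as np

def class_probabilities(reconstruction_errors):
    """Softmax over the inverse reconstruction error of each class dictionary."""
    inverse = 1.0 / np.asarray(reconstruction_errors, dtype=float)
    shifted = inverse - inverse.max()          # subtract the max for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()

errors = [0.5, 2.0, 4.0, 1.0]                  # hypothetical errors, one per class
probs = class_probabilities(errors)
print(probs, int(np.argmax(probs)))
```

The class whose dictionary reconstructs the image with the smallest error receives the largest probability.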

The sparsity of the reconstruction loss, , and the number of dictionary atoms per class, , are chosen using a random search of fifty iterations. The hyper-parameters for the search are chosen from and and the best results were . A mini-batch size of ten was used for dictionary learning.

Similar to the BoW-HOG method, the extensive compute resources required by this algorithm led us to use only center crops of the images of the training and validation sets. As with the BoW-HOG method, performance is only reported on the validation set.

Network Number of Trainable Parameters
Emigh, et al.
Galusha, et al.
Table V: Number of trainable parameters for the deep learning methods.

IV-F Results and Analysis

In this section, we present results comparing SPDRDL against several contemporary methods, demonstrate the necessity of each prior in our loss function formulation, analyze properties of the learned image enhancement function, and demonstrate the ability to reduce the network size considerably for use in low power embedded systems while maintaining good classification performance.

IV-F1 Classification Task

We show confusion matrices in Fig.7, precision recall curves for each target class versus background in Fig.8, and precision recall curves for target versus background in Fig.9. We demonstrate the learning efficiency of our method by showing results from the ablation study. From these figures, we can see the benefits our domain priors afford us. In almost all metrics, SPDRDL outperforms existing methods.

(a) BoW-HOG (b) DSRC (c) Emigh, et al.
(d) Galusha, et al. (e) Densenet121 (f) SPDRDL
Fig. 7: Confusion matrices for SPDRDL and comparison methods. Larger numbers along the diagonal indicate better performance. We see the benefits of the added domain priors through improved performance, especially of the wedge class.
(a) Cylinder (b) Wedge (c) Truncated Cone
Fig. 8: Precision-recall curves for target type: (a) cylinder, (b) wedge, (c) truncated cone. AUCPR in parentheses; larger values are better.
Fig. 9: PR curves for all methods obtained using a one-versus-all method. AUCPR in parentheses; larger values are better.
Fig. 10: Deep learning methods work well when the training data is plentiful. Sometimes such a situation is not possible as collection of training data can be expensive. To demonstrate the necessity of the domain priors our method introduces, we train each method using all the training data available and then train using a random 10% subset. We can see from the low training data scenario that SPDRDL still produces quite competitive results indicating the increased performance granted by the domain priors.
Fig. 11: To understand the sensitivity to selection bias in a low training data scenario, we trained the top three methods each ten times with a random sample of 10% of the training data and plot the results. Due to the large run-time associated with evaluating all the test set imagery, we report results on the validation set only. Larger AUCPR indicates better performance. As shown, we can clearly see a benefit in performance with the addition of the domain priors over the next best method, Densenet121.

Indeed, the results of Fig.10 show SPDRDL performance very similar to Densenet121 in the low training data scenario, with SPDRDL exhibiting a slight performance gain. However, the gap between SPDRDL and Densenet121 may be much larger in reality. In this case, it is quite likely we are seeing the effects of selection bias of the training data subset. To examine issues of sample selection bias, we train the top three methods ten times each using a random subset of 10% of the training data; the same subsets were used for each algorithm evaluation. Evaluating the test set on all of these trained models would be computationally prohibitive due to the large test set size. Therefore, we report the validation set performance for the best epoch of each. As shown in Fig.11, the performance gap between SPDRDL and the next best method, Densenet121, becomes much more distinct when viewing the results in the context of selection bias.

We also examined the performance gain of each regularization term of Eq (7) to demonstrate its necessity. This was done by setting each loss term to zero and then retraining and reevaluating the network. Consequently, we can see both priors of Eq (7) improve classification performance even in the low training data scenario demonstrating the benefits of incorporating domain knowledge.

(a) SPDRDL (b) Densenet121
(c) Image under test for (a) and (b) which contains a target-class object.
Fig. 12: (a) and (b) Target scores (SPDRDL/Densenet121 predicted classification probability) as a function of diagonal image translation of the top two performing algorithms for a sample input image. We can see for the Densenet121 method that a small translation of the input image results in an unpredictable classification whereas SPDRDL does not exhibit this. (c) Image under test for (a) and (b).
Network Shift Invariance Score (Lower is Better)
Emigh, et al.
Galusha, et al.
Table VI: Classifier scores were compiled for all nine shifts of each image in the test set. Next, we compute the standard deviation of scores for each image. Finally, the table shows the mean of those scores over the entire test set; lower numbers indicate more shift invariance (i.e. better performance). We see that SPDRDL has the best translation invariance of all the methods. We hypothesize that Densenet121 has the second best performance because it contains only a single instance of MaxPooling and has many levels of feature averaging.

We can see from the results that the SPDRDL method outperforms the deep learning methods and the shallow learning methods by a significant amount, demonstrating the usefulness of the domain priors. For our ablation analysis, we focus on two aspects: (1) Generalization efficacy in a low training data scenario where only 10% of the training data is used. (2) Necessity of the additional loss terms from our domain priors used in Eq (7).

Fig.10 shows AUCPR for all methods using 10% and 100% of the training data; SPDRDL outperforms all the comparison methods showing the efficacy of the additional domain priors.

Fig. 13: Top panel, AUCPR as a function of range from the sonar for all methods. Larger values indicate better performance. Bottom panel is a zoom of the top panel highlighting the differences of the top three methods: SPDRDL, DenseNet121, and Emigh, et al.

Finally, we examine the two-class performance as a function of object range from the sensor. Fig.13 shows SPDRDL performing well over an extensive range from the sonar due to the addition of the domain priors. SPDRDL outperforms the comparison methods especially at the nearest and furthest ranges from the SAS.

IV-F2 Image Enhancement Task

Because of speckle noise, image despeckling algorithms are often employed to enhance SAS images for the purpose of improving human interpretability. We posit that there exists an image enhancement function which improves classification performance and we call this the structural similarity prior; any enhancement algorithm must preserve the structural similarity between the input and output images. We employ this prior in SPDRDL through the use of a learned image enhancement transform which is constrained by the loss of Eq (3).

(a) (b)
Fig. 14: We analyze the frequency spectrum of the original images input to the network (a) and of our learned image enhancement (b). In panel (b), we can see selective attenuation across spatial frequencies and orientations due to the use of the data-adaptive image enhancement prior.

We analyze the output of the enhancement network by examining its frequency response and comparing it to the frequency response of the original images. That is, we compute the frequency response of all the images in the target classes and present their averaged spectra; all images are windowed with a 2D Hamming window prior to frequency domain conversion. Fig. 14 depicts these results. We can see selective suppression in both spatial frequency and orientation from the integration of our image enhancement network. This behavior is in contrast to what one would get with a simple 2D Gaussian filter, which would give an isotropic frequency response.
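The averaged-spectrum analysis described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `mean_log_spectrum`, the choice to average magnitudes (rather than power), and the dB scaling are all assumptions made for the example.

```python
import numpy as np

def mean_log_spectrum(images):
    """Average 2D magnitude spectrum (in dB) of a stack of equally sized images.

    Assumption: `images` has shape (num_images, rows, cols).
    """
    images = np.asarray(images, dtype=np.float64)
    _, rows, cols = images.shape
    # Separable 2D Hamming window built from two 1D windows.
    window = np.outer(np.hamming(rows), np.hamming(cols))
    # Window each image, take the 2D FFT, and move zero frequency to the center.
    spectra = np.fft.fft2(images * window, axes=(-2, -1))
    spectra = np.fft.fftshift(spectra, axes=(-2, -1))
    # Average the magnitudes over the stack (assumed; power averaging is another option).
    mean_magnitude = np.abs(spectra).mean(axis=0)
    # Small constant avoids log of zero.
    return 20.0 * np.log10(mean_magnitude + 1e-12)
```

Applying the same function to the original images and to the enhanced images gives two spectra that can be compared side by side, in the manner of the two panels of Fig. 14.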

IV-F3 Target Localization Task and Translation Equivariance

CNNs lose their translation equivariance through the addition of pooling and convolution layers with non-unit stride. In this section, we quantify this phenomenon across the deep learning methods to demonstrate the necessity of adding anti-aliasing filters before subsampling in the network. To first illustrate the problem, we show the classification scores of an image over a large set of translations in Fig. 12. As shown, classification scores can vary drastically even for small pixel shifts. For example, a well-centered target may be classified correctly, yet translating the image by a single pixel can cause the classifier to misclassify it.

We analyze the translation invariance of each deep learning algorithm. We examine only the target/background scenario by converting the class prediction estimates to a single scalar indicating the target score. For a single image, we compute its scores for a center crop and eight extreme crops. We then measure the standard deviation of these scores and call this metric, $s$, the image's shift invariance score. More formally, we define the shift invariance score as

$$ s = \sqrt{\frac{1}{9} \sum_{(i,j) \in \mathcal{S}} \left( f(x_{i,j}) - \mu \right)^2}, \qquad \mu = \frac{1}{9} \sum_{(i,j) \in \mathcal{S}} f(x_{i,j}), $$

where function $f$ is the inference model, $x_{i,j}$ is the input image translated by $i$ and $j$, and $\mathcal{S}$ is the set of nine shifts corresponding to the center crop and the eight extreme crops. As a result of this formulation, lower shift invariance scores indicate better translation robustness. We report the results in Table VI. We can see that our proposed method has the greatest translation invariance, which we attribute to the addition of anti-aliasing filters in the layers employing non-unit stride.
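As a rough sketch of how the shift invariance score could be computed for one image, the snippet below scores a center crop and eight extreme crops and returns the standard deviation of the nine scores. The helper names (`nine_crops`, `shift_invariance_score`), the crop geometry (corners and edge midpoints), the `score_fn` callable standing in for the inference model, and the use of the population standard deviation are all assumptions for illustration.

```python
import numpy as np

def nine_crops(image, crop):
    """Return the center crop and eight extreme crops (corners and edge midpoints).

    Assumption: `image` is at least `crop` pixels in each of its first two dimensions.
    """
    rows, cols = image.shape[:2]
    row_starts = (0, (rows - crop) // 2, rows - crop)
    col_starts = (0, (cols - crop) // 2, cols - crop)
    return [image[r:r + crop, c:c + crop] for r in row_starts for c in col_starts]

def shift_invariance_score(score_fn, image, crop):
    """Standard deviation of the target scores over the nine crops of one image.

    `score_fn` is a hypothetical callable mapping a crop to a scalar target score.
    """
    scores = np.array([float(score_fn(patch)) for patch in nine_crops(image, crop)])
    # Population standard deviation (ddof=0); an assumption of this sketch.
    return float(scores.std())
```

Averaging this per-image score over a test set gives a single number per method of the kind reported in Table VI, where lower values indicate greater robustness to translation.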

IV-F4 Network Compute Burden and Reduction

Deep networks consisting of a large number of parameters can be challenging to deploy on embedded hardware because of the large memory and computational footprint required. However, we can reduce the number of parameters used during inference by utilizing a network pruning algorithm. Although such algorithms remove weights, classifier performance is often maintained and in some cases, even improved.

We reduce the number of free parameters of the network using the pruning method of [16]. In this method, we sort the network weights by absolute value and set the lowest proportion of weights to zero. Fig. 15 illustrates the results of performing this operation on SPDRDL. We can see that even when the number of parameters is reduced by half, the network still achieves competitive classification performance.
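The sort-and-zero step described above can be illustrated with a short sketch. It is a simplified stand-in rather than the procedure of [16] in full: the function name `prune_by_magnitude` is hypothetical, the weights are treated as a single array (whether pruning is applied globally or per layer is not stated here), and no retraining is performed.

```python
import numpy as np

def prune_by_magnitude(weights, proportion):
    """Zero out the given proportion of weights with the smallest absolute value.

    Assumption: `proportion` lies in [0, 1] and `weights` is any array-like.
    """
    weights = np.asarray(weights, dtype=np.float64)
    n_prune = int(round(proportion * weights.size))
    pruned = weights.copy()
    if n_prune > 0:
        # Flattened indices of the n_prune smallest-magnitude entries.
        idx = np.argsort(np.abs(weights), axis=None)[:n_prune]
        pruned.flat[idx] = 0.0
    return pruned
```

Sweeping `proportion` from 0 to 1 and re-evaluating the classifier at each setting would produce a curve of the kind shown in Fig. 15.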

Fig. 15: AUCPR as a function of weight pruning proportion. Weights are sorted by magnitude and removed starting with the lowest magnitude weights first. Significant pruning is accomplished while still maintaining good classification performance.

V Conclusion

SAS is a relatively new field of remote sensing and has attracted increased interest in the last two decades because of the high-quality imagery such systems produce. Recent technological advancements in computing (e.g., GPUs) have made it easy to deploy SAS-equipped UUVs with on-board image formation. This capability paves the way for automatic machine interpretation of the imagery with the goal of influencing vehicle behavior. Despite the success of SAS, high false alarm rates make it difficult for autonomy to make decisions in situ.

In this work, we developed a SAS ATR algorithm exhibiting improved performance over state-of-the-art methods by integrating domain knowledge of SAS images previously overlooked by existing methods. Our formulation jointly learns image enhancement and target localization for the purpose of improving the downstream task of image classification. We compare our method to several state-of-the-art techniques, including two recent deep learning methods, and demonstrate its efficacy. Finally, we use a recently proposed pruning technique to show that we can halve the number of free parameters in our network and still achieve competitive performance, thus demonstrating feasibility for real-time deployment.

Future work includes extending the approach to phase and frequency representations of the SAS images.


The authors would like to thank the NATO Centre for Maritime Research & Experimentation (CMRE) for providing the data used in this work. The collection of the data was funded by the NATO Allied Command Transformation. Colors used in the plots are derived from [7] and [46]. I.G. would like to thank Dr. David Williams and Dr. John McKay for their helpful comments during the progress of this work.


  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu and X. Zheng (2015) TensorFlow: Large-scale machine learning on heterogeneous systems. Cited by: §IV-E.
  • [2] A. Azulay and Y. Weiss (2019) Why do deep convolutional networks generalize so poorly to small image transformations?. Journal of Machine Learning Research 20 (184), pp. 1–25. Cited by: §III-A, §III-D1.
  • [3] F. Baralli, M. Couillard, J. Ortiz and D. G. Caldwell (2013) GPU-based real-time synthetic aperture sonar processing on-board autonomous underwater vehicles. In OCEANS, pp. 1–8. Cited by: §IV-B.
  • [4] J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (Feb), pp. 281–305. Cited by: §IV-E.
  • [5] J. Bush (2011) Bregman algorithms. Senior Thesis. University of California, Santa Barbara. Cited by: §IV-D.
  • [6] H. J. Callow (2003) Signal processing for synthetic aperture sonar image enhancement. Ph.D. Thesis, University of Canterbury. Cited by: §I-A, §I.
  • [7] Color Brewer 2.0. Note: http://colorbrewer2.org. Accessed: 2020-01-01. Cited by: Acknowledgments.
  • [8] D. A. Cook and D. C. Brown (2008) Analysis of phase error effects on stripmap SAS. IEEE Journal of Oceanic Engineering 34 (3), pp. 250–261. Cited by: §IV-B.
  • [9] J. Davis and M. Goadrich (2006) The relationship between precision-recall and ROC curves. In International Conference on Machine learning, pp. 233–240. Cited by: §IV-A.
  • [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In International Conference on Computer Vision, Cited by: §II.
  • [11] G. J. Dobeck and J. C. Hyland (1997) Automated detection and classification of sea mines in sonar imagery. In Detection and Remediation Technologies for Mines and Minelike Targets II, Vol. 3079, pp. 90–110. Cited by: §I-A.
  • [12] G. J. Dobeck (1999) Fusing sonar images for mine detection and classification. In Detection and Remediation Technologies for Mines and Minelike Targets IV, Vol. 3710, pp. 602–614. Cited by: §I-A.
  • [13] M. Emigh, B. Marchand, M. Cook and J. Prater (2018) Supervised deep learning classification for multi-band synthetic aperture sonar. In Synthetic Aperture Sonar & Synthetic Aperture Radar Conference, Vol. 40, pp. 140–147. Cited by: §II, §IV-E.
  • [14] B. Erkmen and T. Yıldırım (2008) Improving classification performance of sonar targets by applying general regression neural network with PCA. Expert Systems with Applications 35 (1-2), pp. 472–475. Cited by: §I-A.
  • [15] S. Fortune, M. Hayes and P. Gough (2001) Statistical autofocus of synthetic aperture sonar images using image contrast optimisation. In OCEANS, Vol. 1, pp. 163–169. Cited by: §I.
  • [16] J. Frankle and M. Carbin (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In International Conference on Learning Representations, Cited by: §IV-F4.
  • [17] A. Galusha, J. Dale, J. Keller and A. Zare (2019) Deep convolutional neural network target classification for underwater synthetic aperture sonar imagery. In Detection and Sensing of Mines, Explosive Objects, and Obscured Targets XXIV, Vol. 11012, pp. 1101205. Cited by: §II, §IV-E.
  • [18] I. D. Gerg and D. Williams (2018) Additional representations for improving synthetic aperture sonar classification using convolutional neural networks. In Synthetic Aperture Sonar & Synthetic Aperture Radar Conference, pp. 11–22. Cited by: §II, §IV-B.
  • [19] P. Getreuer (2012) Rudin-osher-fatemi total variation denoising using split bregman. Image Processing On Line 2, pp. 74–95. Cited by: §IV-D.
  • [20] P. Gough (1986) A synthetic aperture sonar system capable of operating at high speed and in turbulent media. IEEE Journal of Oceanic Engineering 11 (2), pp. 333–339. Cited by: §I.
  • [21] R. E. Hansen, H. J. Callow, T. O. Sæbø, S. A. Synnes, P. E. Hagen, T. G. Fossum and B. Langli (2009) Synthetic aperture sonar in challenging environments: results from the HISAS 1030. Underwater Acoustic Measurements. Cited by: §I.
  • [22] D. W. Hawkins (1996) Synthetic aperture imaging algorithms: with application to wide bandwidth sonar. Cited by: §I-A.
  • [23] M. P. Hayes and P. T. Gough (1991) Results from an experimental synthetic aperture sonar. In Acoustical Imaging, pp. 455–466. Cited by: §I.
  • [24] M. P. Hayes and P. T. Gough (1992) Broad-band synthetic aperture sonar. IEEE Journal of Oceanic Engineering 17 (1), pp. 80–94. Cited by: §I.
  • [25] K. He, X. Zhang, S. Ren and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on Imagenet classification. In International Conference on Computer Vision, pp. 1026–1034. Cited by: §III-F.
  • [26] K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Computer vision and pattern recognition, pp. 770–778. Cited by: §III-B.
  • [27] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §III-B.
  • [28] G. Huang, Z. Liu, L. Van Der Maaten and K. Q. Weinberger (2017) Densely connected convolutional networks. In Computer Vision and Pattern Recognition, pp. 4700–4708. Cited by: §III-B, §IV-E.
  • [29] A. J. Hunter, M. P. Hayes and P. T. Gough (2003) Simulation of multiple-receiver, broadband interferometric SAS imagery. In OCEANS, Vol. 5, pp. 2629–2634. Cited by: §I.
  • [30] J. C. Hyland and G. J. Dobeck (1995) Sea mine detection and classification using side-looking sonar. In Detection Technologies for Mines and Minelike Targets, Vol. 2496, pp. 442–453. Cited by: §I-A.
  • [31] J. C. Isaacs (2015) Sonar automatic target recognition for underwater UXO remediation. In Computer Vision and Pattern Recognition Workshops, Cited by: §II, §IV-E.
  • [32] A. Katharopoulos and F. Fleuret (2018) Not all samples are created equal: deep learning with importance sampling. In International Conference on Machine Learning, pp. 2530–2539. Cited by: §III-E.
  • [33] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy and P. T. P. Tang (2017) On large-batch training for deep learning: generalization gap and sharp minima. International Conference on Learning Representations. Cited by: §III-E.
  • [34] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-E.
  • [35] A. Krizhevsky, I. Sutskever and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §III-B.
  • [36] S. Lee, B. Park and A. Kim (2019) Deep learning from shallow dives: sonar image generation and training for underwater object detection. In International Conference in Robotics and Automation, Cited by: §II.
  • [37] T. Lin, P. Goyal, R. Girshick, K. He and P. Dollár (2017) Focal loss for dense object detection. In International Conference on Computer Vision, pp. 2980–2988. Cited by: §III-E.
  • [38] R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev and J. Yosinski (2018) An intriguing failing of convolutional neural networks and the coordconv solution. In Advances in Neural Information Processing Systems, pp. 9605–9616. Cited by: §III-D2, §III-D2, §III-D2.
  • [39] J. Mairal, F. Bach, J. Ponce and G. Sapiro (2009) Online dictionary learning for sparse coding. In International Conference on Machine Learning, pp. 689–696. Cited by: §IV-E.
  • [40] J. Mairal, P. Koniusz, Z. Harchaoui and C. Schmid (2014) Convolutional kernel networks. In Advances in Neural Information Processing Systems, pp. 2627–2635. Cited by: §III-A.
  • [41] J. McKay, I. Gerg, V. Monga and R. G. Raj (2017) What’s mine is yours: pretrained CNNs for limited training sonar atr. In OCEANS, pp. 1–7. Cited by: §II.
  • [42] J. McKay, I. Gerg and V. Monga (2018) Bridging the gap: simultaneous fine tuning for data re-balancing. In International Geoscience and Remote Sensing Symposium, pp. 7062–7065. Cited by: §II.
  • [43] J. McKay, V. Monga and R. G. Raj (2017) Robust sonar ATR through bayesian pose-corrected sparse classification. IEEE Transactions on Geoscience and Remote Sensing 55 (10), pp. 5563–5576. Cited by: §II, §IV-E, §IV-E.
  • [44] J. McKay, V. Monga and R. Raj (2016) Localized dictionary design for geometrically robust sonar ATR. In International Geoscience and Remote Sensing Symposium, pp. 991–994. Cited by: §II.
  • [45] J. Novakovic (2009) Using information gain attribute evaluation to classify sonar targets. In 17th Telecommunications Forum, pp. 1351–1354. Cited by: §I-A.
  • [46] J. R. Nuñez, C. R. Anderton and R. S. Renslow (2018) Optimizing colormaps with consideration for color vision deficiency to enable accurate interpretation of scientific data. PloS one 13 (7). Cited by: Acknowledgments.
  • [47] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §IV-E.
  • [48] M. Poderico (2012) Denoising of SAR images. Ph.D. Thesis, Federico II University of Naples. Cited by: §I-B, §III-A.
  • [49] A. Reed, I. Gerg, J. McKay, D. Brown, D. Williams and S. Jayasuriya (2019) Coupling rendering and generative adversarial networks for artificial SAS image generation. In OCEANS, Cited by: §II.
  • [50] S. Reed, Y. Petillot and J. Bell (2004) Automated approach to classification of mine-like objects in sidescan sonar using highlight and shadow information. IEE Radar, Sonar and Navigation 151 (1), pp. 48–56. Cited by: §I-A.
  • [51] S. Reed, Y. Petillot and J. Bell (2003) An automatic approach to the detection and extraction of mine features in sidescan sonar. IEEE Journal of Oceanic Engineering 28 (1), pp. 90–105. Cited by: §I-A.
  • [52] O. Ronneberger, P. Fischer and T. Brox (2015) U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Cited by: §III-C.
  • [53] C. Schlick (1995) Quantization techniques for visualization of high dynamic range pictures. In Photorealistic Rendering Techniques, pp. 7–20. Cited by: §IV-B.
  • [54] D. Sculley (2010) Web-scale k-means clustering. In International Conference on World Wide Web, pp. 1177–1178. Cited by: §IV-E.
  • [55] D. Shea, D. Dawe, J. Dillon and S. Chapman (2014) Real-time SAS processing for high-arctic AUV surveys. In OES Autonomous Underwater Vehicles, pp. 1–5. Cited by: §I.
  • [56] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §III-B.
  • [57] J. Stack (2011) Automation for underwater mine recognition: current trends and future strategy. In Detection and Sensing of Mines, Explosive Objects, and Obscured Targets XVI, Vol. 8017, pp. 80170K. Cited by: §I.
  • [58] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich (2015) Going deeper with convolutions. In Computer vision and pattern recognition, pp. 1–9. Cited by: §III-B.
  • [59] T. Tieleman and G. Hinton (2012) Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. Note: COURSERA: Neural Networks for Machine Learning Cited by: §IV-C.
  • [60] M. Tofighi, T. Guo, J. K. Vanamala and V. Monga (2019) Prior information guided regularized deep learning for cell nucleus detection. IEEE Transactions on Medical Imaging 38 (9), pp. 2047–2058. Cited by: §III-C.
  • [61] J. D. Tucker and M. R. Azimi-Sadjadi (2011) Coherence-based underwater target detection from multiple disparate sonar platforms. IEEE Journal of Oceanic Engineering 36 (1), pp. 37–51. Cited by: §II.
  • [62] D. Ulyanov, A. Vedaldi and V. Lempitsky (2018) Deep image prior. In Computer Vision and Pattern Recognition, pp. 9446–9454. Cited by: §III-D2.
  • [63] T. H. Vu, L. Nguyen, T. Guo and V. Monga (2018) Deep network for simultaneous decomposition and classification in UWB-SAR imagery. In Radar Conference, pp. 0553–0558. Cited by: item 1.
  • [64] B. C. Wallace, K. Small, C. E. Brodley and T. A. Trikalinos (2011) Class imbalance, redux. In International Conference on Data Mining, pp. 754–763. Cited by: §IV-C.
  • [65] Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §III-C.
  • [66] Z. Wang and A. C. Bovik (2009) Mean squared error: love it or leave it? A new look at signal fidelity measures. IEEE Signal Processing Magazine 26 (1), pp. 98–117. Cited by: §III-C.
  • [67] Z. Wang, E. P. Simoncelli and A. C. Bovik (2003) Multiscale structural similarity for image quality assessment. In Asilomar Conference on Signals, Systems & Computers, Vol. 2, pp. 1398–1402. Cited by: §III-C, §III-C.
  • [68] D. P. Williams, R. Hamon and I. D. Gerg (2019) On the benefit of multiple representations with convolutional neural networks for improved target classification using sonar data. Underwater Acoustics Conference. Cited by: §II, §III-E.
  • [69] D. P. Williams (2016) Underwater target classification in synthetic aperture sonar imagery using deep convolutional neural networks. In International Conference on Pattern Recognition, pp. 2497–2502. Cited by: §II, §II.
  • [70] D. P. Williams (2017) The Mondrian detection algorithm for sonar imagery. IEEE Transactions on Geoscience and Remote Sensing 56 (2), pp. 1091–1102. Cited by: §III-A.
  • [71] D. P. Williams (2018) Exploiting phase information in synthetic aperture sonar images for target classification. In OCEANS, pp. 1–6. Cited by: §II.
  • [72] D. P. Williams (2019) Transfer learning with SAS-image convolutional neural networks for improved underwater target classification. In International Geoscience and Remote Sensing Symposium, pp. 78–81. Cited by: §II.
  • [73] C. Zhang, S. Bengio, M. Hardt, B. Recht and O. Vinyals (2017) Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, Cited by: §III-A.
  • [74] R. Zhang (2019) Making convolutional networks shift-invariant again. In International Conference on Machine Learning, pp. 7324–7334. Cited by: §III-D1.
  • [75] J. Zhao, M. Mathieu, R. Goroshin and Y. Lecun (2016) Stacked what-where auto-encoders. International Conference on Learning Representations. Cited by: §III-D2.