Structural Prior Driven Regularized Deep Learning for Sonar Image Classification
Deep learning has been recently shown to improve performance in the domain of synthetic aperture sonar (SAS) image classification. Given the constant resolution with range of a SAS, it is no surprise that deep learning techniques perform so well. Despite deep learning’s recent success, there are still compelling open challenges in reducing the high false alarm rate and enabling success when training imagery is limited, which is a practical challenge that distinguishes the SAS classification problem from standard image classification set-ups where training imagery may be abundant. We address these challenges by exploiting prior knowledge that humans use to grasp the scene. These include unconscious elimination of the image speckle and localization of objects in the scene. We introduce a new deep learning architecture which incorporates these priors with the goal of improving automatic target recognition (ATR) from SAS imagery. Our proposal – called SPDRDL, Structural Prior Driven Regularized Deep Learning – incorporates the previously mentioned priors in a multi-task convolutional neural network (CNN) and requires no additional training data when compared to traditional SAS ATR methods. Two structural priors are enforced via regularization terms in the learning of the network: (1) structural similarity prior – enhanced imagery (often through despeckling) aids human interpretation and is semantically similar to the original imagery and (2) structural scene context priors – learned features ideally encapsulate target centering information; hence learning may be enhanced via a regularization that encourages fidelity against known ground truth target shifts (relative target position from scene center). Experiments on a challenging real-world dataset reveal that SPDRDL outperforms state-of-the-art deep learning and other competing methods for SAS image classification.
Underwater sonar was historically pursued for military purposes, but as the field matured and the commercialization of these systems became feasible, remote-sensing applications for the civilian domain developed. Although synthetic aperture sonar (SAS) was initially pursued for mine countermeasure applications , it has broad civilian applications in remote sensing of the undersea environment today.
Over the last few years, SAS has matured into a capability accessible in the civilian space, with several companies offering systems . Fundamental work in obtaining high-quality SAS images was carried out in the late 1990s to early 2000s [20, 23, 24, 6, 15, 29]. Contemporary SAS systems are capable of producing image quality that lends itself to tasks such as automatic target recognition (ATR). Despite the improvements, the problem of detecting and classifying objects in imagery remains challenging because of distractors in the environment and the complex configurations targets can assume. Fig.1 shows examples of difficult cases along with prototypes of the object classes used in this work.
I-A Open Challenges in SAS ATR
SAS ATR algorithms were originally established from ATR algorithms used in side scan sonar, which we call real aperture sonar (RAS). RAS systems have similar collection geometry to SAS, but cannot produce the constant resolution with range achieved by SAS. We refer the reader to Chapter Two of  and Chapter Three of  for the differences between RAS and SAS imaging. Despite these differences, initial SAS imagery looked similar enough to RAS for researchers to reasonably justify the use of RAS ATR algorithms on SAS.
Some of the more popular early work in sonar ATR involved the use of kernel filters [30, 11]. When large amounts of SAS imagery began to be produced, these were some of the first techniques applied to it. Over time, these methods began to utilize multiple target looks to aid classification by exploiting the overlapped coverage of the seafloor most SAS surveys exhibit . Other techniques focused on model-based approaches  and uncertainty modeling. Eventually, the popular classification algorithms of the early 2000s, before the boom of deep learning, were applied to the imagery, including decision trees  and Markov random fields .
Incidentally, this paper is about the use of neural networks (NNs) to address the classification problem. The use of such techniques is not new, and one of the more popular early works employed them .
With all the recent success of SAS, there remain persistent challenges with respect to ATR. One of the biggest challenges for obtaining good results is collecting and labeling the large amounts of imagery needed by contemporary machine learning (ML) algorithms using deep learning. SAS collection from unmanned underwater vehicles (UUVs) requires an inordinate amount of support infrastructure, including support vessels, ship crews, and divers, making the endeavor financially expensive. Furthermore, the objects sought during surveys are often scarce.
It is also difficult to create an environment-independent ML algorithm when little training data is available. Practitioners quickly discover that deploying classification in unseen environments often results in high false alarm rates, even for state-of-the-art SAS ATR algorithms. Despite this, the detection performance of these methods is often quite good, as the literature shows. However, the false alarms returned by the algorithm quickly overwhelm human operators. Furthermore, objects that are simple for humans to rule out are often flagged by the ATR, undermining trust between the operators and the algorithms and rendering the ATR useless. Combined, these factors make manual human inspection the preferred means of culling the imagery; a costly process.
I-B Overview of Our Proposal
We present a deep learning classifier exhibiting significantly reduced false alarm rates compared to contemporary SAS ATR algorithms while maintaining high detection accuracy. Our approach integrates high-level, domain knowledge unique to the SAS domain in order to achieve these good results. We do this by integrating parcels of domain knowledge, which we call priors, into the training objective.
For a given problem, there exist attributes which are directly represented by the given training data. In the case of image classification, these are the image/label pairs used for training; this information is explicitly provided to the training algorithm. This is the common scheme for the vast majority of image classification problems. However, there exists domain-specific information which is projected into the training data but may not be explicitly represented by it. For example, for the dataset used in this work, we have some knowledge of how it was pre-processed before being given to us for use. Specifically, we know that the detection algorithm used to produce the image chips is generally good at centering targets within the chip. However, we do not have any kind of bound or statistic on how well the targets are centered. Furthermore, the true target centers are not explicitly encoded for each image. Thus, the fact that the detector is reasonably good at centering the targets is domain knowledge derived from a subject matter expert (SME) (we will see this forms the scene context prior, which we will discuss in later sections). We refer to these parcels of domain knowledge as priors. In a deep learning framework, each prior is employed through a regularization loss which is added to the primary task’s objective function. Fig.2 is a Venn diagram illustrating the concept.
In this work, we define two priors which each improve classification performance when used individually but, when used together, act synergistically to improve performance beyond the use of either alone. The first prior addresses an often overlooked part of the SAS image reconstruction pipeline: image enhancement. This prior originates from the domain knowledge that image enhancement algorithms applied to SAS imagery aid human interpretation. An example of such an algorithm is despeckling . We name this prior the structural similarity prior because it encapsulates the function of any image enhancement algorithm: improve scene content in a way which improves downstream task performance (in this case classification) while simultaneously preserving scene structure and semantics. Quantitatively, this prior is captured by the regularization term in Eq (3), which is described in detail in Section III-C.
The second prior we use leverages a common quality exhibited by detection algorithms: the ability to localize targets. The majority of SAS classification algorithms are preceded by a detection algorithm whose purpose is to quickly find target-like objects in the queue of mostly-benign seafloor images. This prior originates from the domain knowledge that the detector algorithm is usually able to localize the target in the image, as humans also do when parsing a scene. We name this prior the structural scene context prior because it encapsulates the role of ground truth target position knowledge: well-learned features for image interpretation should encode target location. The images output by the detector usually center the target, and we can translate the target through image crops. We then encourage prediction of the new ground-truth target location using the same features used for classification. In this manner, we improve the quality of the features, which consequently improves classification performance. Quantitatively, this prior is captured by Eq (4), which is described in detail in Section III-D.
Recall that our two priors, structural similarity prior and structural scene context prior, exist for the purpose of improving our primary task: classification. The domain knowledge captured by these two priors is employed through the use of regularization losses, Eq (3) and Eq (4), augmented to the primary task objective function, Eq (6). Together this forms the final loss we jointly-optimize during training, Eq (7).
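To make the joint optimization concrete, the combination described above can be sketched as a simple weighted sum; the λ weights and function name below are illustrative assumptions, not values or APIs from the paper:

```python
def total_loss(l_cls, l_ssp, l_sscp, lam_ssp=1.0, lam_sscp=1.0):
    # Joint objective in the spirit of Eq (7): the primary classification
    # loss (Eq (6)) augmented with the structural similarity regularizer
    # (Eq (3)) and the scene context regularizer (Eq (4)).
    # lam_ssp and lam_sscp are hypothetical tuning weights.
    return l_cls + lam_ssp * l_ssp + lam_sscp * l_sscp
```

In practice the three terms would be computed per mini-batch and the weighted sum backpropagated through all three sub-networks at once.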
Using the aforementioned priors, this paper makes the following technical contributions:
Image enhancement through despeckling is often used to improve image interpretability for humans. We ask the question, Is there an image enhancement function which improves classification?, which we answer in the affirmative (results in Table III). To this end, we add a data-adaptive image enhancement network with a self-supervised, domain-specific loss to an existing classification network for the purpose of improving classification performance. Our image enhancement function is learned from the data, removing the onerous task of selecting a fixed despeckling algorithm. Furthermore, ground truth noisy/denoised image pairs as required by previous methods  are not needed.
Most SAS ATRs only determine the presence of a target object but are agnostic to where in the image it appears. To this end, we incorporate a target localization network in addition to a classification network for the purpose of further improving classification performance. Like the image enhancement network, this is also trained using a self-supervised, domain-specific loss. Now, our classifier not only learns target class but also target position, thus acquiring scene context. Our target localization network is trained using the domain knowledge that objects are centered when passed to the classifier. We use the common data augmentation technique of image translation through cropping to induce new target positions during training, and have the target localization network estimate the induced target position in addition to the primary task of classification. This encourages the model to learn the “where” of the scene in addition to the “what”.
We train the two aforementioned networks and the classification network simultaneously through the addition of regularization terms to the primary classification loss objective function. This makes our formulation self-supervised and thus requires no extra data or labels making it suitable as a drop-in replacement for training against existing datasets. Table III shows through ablation that each domain-specific loss improves classification performance and that when combined, the best classification performance is achieved.
Both of our priors incorporate structural domain knowledge, so we call our method Structural Prior Driven Regularized Deep Learning (SPDRDL). Neither prior has been used in previous SAS classification works. Table I shows the relationship among the losses used in our final objective function.
| Prior Name | Employed Domain Knowledge | Loss Type | Loss Equation |
|---|---|---|---|
| Structural similarity | Image enhancement improves human interpretability | Domain-specific | Equation 3 |
| Structural scene context | Targets output from detector are image centered | Domain-specific | Equation 4 |
| N/A | N/A | Primary task, classification | Equation 6 |
An overview of this paper follows. Section II provides a synopsis of past ATR approaches. Section III presents the necessary background and development of SPDRDL and its use of domain priors. Section IV shows experimental results of our algorithm and compares our results to other contemporary algorithms on a challenging real-world dataset. Finally, Section V provides a summary of our findings.
II Previous Work
Recent SAS ATR schemes have focused on improving feature representations through various means. For many years, representations were hand crafted and much of the research was in attempting to discover useful features through subject matter expert input. Techniques employed bag-of-words models  using handcrafted features as complex vocabularies describing SAS images. Eventually, techniques emerged which removed the need for this explicit feature engineering task. Dictionary learning methods were some of the first to forgo the explicit feature engineering path [43, 44] and automatically learn features as part of the classification process. Today, deep learning techniques are employed in the same vein .
Recently, investigations into alternate representations to improve classification have proven to be a fruitful endeavor. [18, 68, 71] have examined representations derived from the k-space and found they contain useful information for classification. Traditionally, the human-consumable image, which arrives after extensive post-processing of the raw SAS data, has been used as input to the classifier. The human-consumable image is the result of a lengthy signal processing pipeline which discards information related to the frequency and direction of the received acoustic wavefronts. A coarse explanation of a typical image reconstruction pipeline is as follows: (1) raw sonar echoes are collected from the sonar array over multiple transmissions, (2) signal processing is applied to these echoes to correct them for imperfections, (3) the data is match filtered to obtain resolution in the range dimension, (4) an optional motion compensation step is performed to interpolate the data to a regular grid (e.g. in preparation for ω-k beamforming), (5) the data is beamformed to generate a single look complex (SLC) image, and (6) a human-consumable image is formed by taking the absolute value of the SLC and applying dynamic range compression (DRC). Consequently, the absolute value operation removes the phase portion of the SLC, potentially discarding useful information.
Deep learning has been applied to sonar ATR resulting in a substantial improvement in classification performance. An initial work in the area is  whereby convolutional neural networks (CNNs) were used to automatically learn features for classification. In , the authors demonstrated the canonical transfer learning approach commonly used in training data-limited networks works well for SAS; a pre-trained CNN trained on the Imagenet dataset  was fine-tuned on SAS imagery yielding good results. This work was generalized in , where the authors integrate the feature learning of both SAS images and selected photographs simultaneously, yielding good ATR performance in the midst of limited training data. Finally, transfer learning among SAS sensors was demonstrated in  where a CNN initially trained on one SAS sensor was used to quickly train with another.
In several of the works described thus far, class imbalance has been mentioned as a noteworthy issue. Many SAS datasets have far more imagery of the benign seafloor than of objects of interest. To combat this issue, generative adversarial networks (GANs) have recently been applied to SAS for the purpose of generating more training data to balance the classes. In , a hybrid simulation and GAN based approach is used to generate a simulated, optical version of the desired scene; a learned transform is then applied to the simulated scene to give the appearance of a real SAS image. Their hybrid approach gives fine control over the generated scene content so the data balancing procedure can be accomplished with precision; particular objects, their orientations, and their range from the sonar can be specifically generated. Model-based GAN approaches have not been limited to SAS; they have also been used for real aperture sonar (RAS) systems as described in , where GANs are applied to RAS imagery to augment data for underwater person detection.
Today, SAS systems are multi-band and operate over several frequency ranges. This ability has not been overlooked in the context of ATR. An early work utilizing multi-band sonars for classification is . In this work, the authors demonstrate good detection performance when using a low-resolution broadband sonar in addition to a high frequency SAS. Even more recently, deep learning has been applied to multi-band SAS imagery with good success and without the need of using a pre-trained network [13, 17].
III Proposed Classification Method: SPDRDL
III-A Motivation of Approach
Recent ATR schemes using deep learning demonstrate great performance but at the cost of requiring large amounts of training data. As previously discussed, SAS data collection is costly, resulting in small datasets which are almost always class imbalanced. Consequently, it is crucial to use all the available information from a SAS image during classifier training. To this end, we propose a new scheme which incorporates prior knowledge of SAS images in a novel way as a mechanism to extract more information from each image. This additional information is used to positively influence classifier training.
One mechanism by which we inject prior knowledge into the classification pipeline is by addressing the inherent speckle phenomenon present within every SAS image. The speckle is often seen as noise and a nuisance for human interpretation. Much work has been done in the development of despeckling algorithms  with the purpose of enhancing image interpretability. A natural outcome of this work is to ask if such types of enhancement are beneficial for improving classification performance and, if so, which methods provide the most benefit. Furthermore, can we forgo the onerous choice of selecting an enhancement algorithm and have the network learn the image enhancement transform in an unsupervised fashion?
Another prior, thus far overlooked in SAS ATR, is the set of assumptions given by the detector, sometimes called a pre-screener. As background, traditional SAS ATR methods use a detector-classifier approach. In this approach, a simple detection algorithm is first passed over the scene. The detector produces candidate images of interest, sometimes called chips, which are then passed to a classification algorithm for further inspection. Usually the detector is computationally efficient and can quickly prune areas of the image which appear benign (e.g. a flat sandy seafloor). Such a process reduces the amount of imagery the classifier has to process. It is believed such an approach was adopted initially for compute reasons – early SAS classification systems were not capable of processing every possible sub-tile of an image in a timely manner due to limited compute power. However, current compute capabilities, specifically in the form of graphics processing units (GPUs), provide ample compute power, enabling a classifier to examine the whole scene quickly and removing the need for the explicit detection step. Notwithstanding, for this approach to work, the classifier must be translation equivariant.
Good detectors can localize the target well and output SAS images with the target well-centered in the image. The Mondrian detector  is a good example of such a detector. It uses prior knowledge of the target and sonar geometries to model expected relationships among local pixel neighborhoods; it is quite capable of returning well-centered targets to a classifier. However, current classifiers for SAS do not use this information. They assume a target is present in the image but do not explicitly estimate or assume its position. In contrast, our proposed method jointly estimates target class and target position.
Because our proposed method estimates target position (in addition to object class), it is desirable to have a feature space embedding which is translation equivariant. By equivariant, we mean that as the target translates smoothly across an image, its associated embedding also translates smoothly. Despite the convolutional nature of CNNs, they do not inherently provide translation equivariance. Recent works such as [73, 2, 40] have pointed out this common misconception and have made progress towards improvement. We utilize these techniques in our proposed method, making it very robust to scene translation.
Having a classifier robust to translations has an added benefit: we can forgo the traditional detection step and run the classifier across the entire scene. This has an immediate consequence: the detection rate of the ATR is no longer bounded by the detector performance. For example, if a detector exhibits an eighty-percent detection rate, then even with an oracle classifier the overall ATR can do no better than an eighty-percent detection rate.
III-B Feature Extraction Network
Traditional image classification pipelines using deep learning are composed of a feature extraction network followed by a classification network. Much recent work has been spent on designing an optimal feature extraction network, as illustrated by the vast number of off-the-shelf (OTS) options available. DenseNet , ResNet , Inception , MobileNet , VGGNet , and AlexNet  are popular examples of such OTS networks. We leverage the good results of these OTS network architectures and build SPDRDL around a popular one: DenseNet-121. SPDRDL is composed of DenseNet-121 as a feature extraction network (pink box) followed by a standard classification network (yellow box), as shown in Fig.3.
III-C Structural Similarity Prior Via Data-Adaptive Image Enhancement Network
Building upon the feature extraction and classifier networks, we introduce a data-adaptive image enhancement network which is added to the front of the feature extraction network. This enhancement network is shown in the blue box in Fig.3. The purpose of this network is to learn an image transformation which improves classification performance while still maintaining the original image semantics by obtaining an enhanced image (i.e. enhanced for classification purposes not necessarily human consumption) that is structurally similar to the original. We implement this network as a U-Net architecture  with the original image as input and the enhanced image as output.
To encourage image enhancement for classification, we utilize a novel loss function between the desired enhanced image and the original: the multiscale structural similarity measure (MS-SSIM) [65, 67], which is a scale-aware version of the SSIM measure,

SSIM(x, y) = [l(x, y)]^α [c(x, y)]^β [s(x, y)]^γ,    (1)

with components

l(x, y) = (2 μ_x μ_y + C₁) / (μ_x² + μ_y² + C₁),  c(x, y) = (2 σ_x σ_y + C₂) / (σ_x² + σ_y² + C₂),  s(x, y) = (σ_xy + C₃) / (σ_x σ_y + C₃),

where x and y are the images being compared, μ_x, μ_y and σ_x, σ_y are the patch-wise means and standard deviations of the corresponding images, σ_xy is the covariance between images x and y, α, β, γ are shaping constants, and C₁, C₂, C₃ are calibration constants. Higher values indicate higher perceptual similarity between the images. The SSIM is differentiable and tractable for incorporation into a deep learning network .
MS-SSIM introduces scale dependence by computing the structural and contrast factors of SSIM over several staged, low-pass-filtered versions of the input image and then combining their results. It is given by,

MS-SSIM(x, y) = [l_M(x, y)]^{α_M} ∏_{j=1}^{M} [c_j(x, y)]^{β_j} [s_j(x, y)]^{γ_j},    (2)

where the functions l_M, c_j, and s_j represent the corresponding luminance, contrast, and structural components of Eq (1) respectively, evaluated at scale j, and M is the total number of scales to evaluate. For all constants, we use the same values as specified in .
By seeking to maximize the MS-SSIM between the original and the enhanced image in the enhancement net, we leverage the human visual system priors designed into the MS-SSIM perceptual loss function . This embeds the domain knowledge we seek to exploit in our formulation: there exists an enhanced image which is structurally similar to the original image but is able to yield improved classification performance. Finally, we define the structural similarity prior (SSP) regularization loss as,

L_SSP(x) = 1 − MS-SSIM(x, x̂),    (3)

where x is the input image and x̂ is the improved image output by the enhancement network. Without this loss term, the network has no notion of a “noise model” and simply seeks to minimize the weights of the function with no understanding of the hand-crafted network structure we designed to exploit the domain prior.
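For intuition, a simplified NumPy sketch of SSIM, MS-SSIM, and the SSP loss follows. It uses global image statistics instead of the patch-wise statistics of Eq (1), uniform per-scale weights, and illustrative constants, so it is a sketch of the idea rather than the paper's implementation:

```python
import numpy as np

def ssim_components(x, y, C1=1e-4, C2=9e-4):
    # Luminance, contrast, and structure terms of Eq (1), computed from
    # global image statistics (the real measure uses local patches).
    C3 = C2 / 2
    mu_x, mu_y = x.mean(), y.mean()
    s_x, s_y = x.std(), y.std()
    s_xy = ((x - mu_x) * (y - mu_y)).mean()
    l = (2 * mu_x * mu_y + C1) / (mu_x**2 + mu_y**2 + C1)
    c = (2 * s_x * s_y + C2) / (s_x**2 + s_y**2 + C2)
    s = (s_xy + C3) / (s_x * s_y + C3)
    return l, c, s

def ms_ssim(x, y, scales=3):
    # Eq (2) with uniform weights: contrast/structure at every scale,
    # luminance only at the coarsest scale.
    prod = 1.0
    for j in range(scales):
        l, c, s = ssim_components(x, y)
        prod *= c * s
        if j < scales - 1:  # 2x2 average pool as a crude low-pass + decimate
            x = 0.25 * (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2])
            y = 0.25 * (y[0::2, 0::2] + y[1::2, 0::2] + y[0::2, 1::2] + y[1::2, 1::2])
    return l * prod

def ssp_loss(x, x_enhanced):
    # Eq (3): maximizing MS-SSIM by minimizing 1 - MS-SSIM
    return 1.0 - ms_ssim(x, x_enhanced)
```

An identical image pair scores an MS-SSIM of one and hence a zero SSP loss, while unrelated images score well below one.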
III-D Structural Scene Context Prior Via Target Localization Network
As previously mentioned, the detector returns targets centered in the image. During the data augmentation process, these targets are translated by a random amount. This augmentation procedure is commonly used in other SAS ATR methods. However, our method differs in that we do not discard the translation parameters but encourage the network to estimate them while simultaneously performing classification. In this manner, we embed the target position domain knowledge into the feature extraction network by encouraging it to learn a spatially-aware context of the scene in addition to features for classification. With this prior, the likelihood of the network learning features which are not target-centric is reduced, as is the creation of features derived from biases within the dataset, such as seafloor texture.
We encourage the network to learn target localization by augmenting our feature extractor with a target localization network whose task is to estimate the target position from the feature embedding. Recall that the ground truth for this estimate is determined through the data augmentation procedure. The target localization network is represented by the orange box in Fig.3. It is composed of a set of 1×1 convolutions to reduce the dimensionality of the embedding. This reduction serves as a bottleneck which then feeds two dense layers, both of which have no post-activation function and simply return the position estimates. Formally, we define the structural scene context prior (SSCP) regularization loss as,

L_SSCP = ‖t − t̂‖₂²,    (4)

which is the mean-squared error between the shift (i.e. translation) applied during data augmentation, t, and the shift estimated by the network, t̂.
Fig.4 shows an example of how the positional shifts are created. First, the Mondrian detector returns an image with the target centered. Next, a random crop is applied to the image during data augmentation. This induces a translation of the target in the image. We denote this translation as t and provide this information to the network via backpropagation through the loss of Eq (4).
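To make the augmentation concrete, here is a minimal sketch of deriving the ground-truth shift from a random crop and scoring it with Eq (4); the function names and the sign convention for the shift are illustrative assumptions, not the paper's code:

```python
import numpy as np

def random_crop_with_shift(chip, crop_size, rng):
    # Crop a window from a detector-centered chip. The offset of the
    # (originally centered) target from the crop center becomes the
    # ground-truth shift t used by the SSCP regularizer.
    H, W = chip.shape
    ch, cw = crop_size
    top = rng.integers(0, H - ch + 1)
    left = rng.integers(0, W - cw + 1)
    crop = chip[top:top + ch, left:left + cw]
    # target shift relative to the crop center, in pixels
    t = np.array([(H - ch) / 2 - top, (W - cw) / 2 - left])
    return crop, t

def sscp_loss(t_true, t_est):
    # Eq (4): mean-squared error between applied and estimated shifts
    return float(np.mean((np.asarray(t_true) - np.asarray(t_est)) ** 2))
```

During training, `t` would be regressed by the localization head while the same crop feeds the classifier, so no extra labels are required.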
Thus far in this sub-section, we have developed a novel method to encourage the network to learn target position within the scene, and we have done so in a self-supervised fashion. One assumption we have made, but not yet addressed, is that the feature space is translation equivariant. Despite the use of convolutional layers in our network, non-unitary strides associated with convolution and pooling operations prevent translation equivariance, as we will show. Additionally, we have not provided local pixel position information to the network, likely resulting in position information being determined by specific neurons in the dense layers, which is undesirable for generalization. In the next two subsections, we address each assumption.
III-D1 Addition of Anti-Aliasing Filtering Before Pooling Layers
Most CNNs are not inherently shift invariant  when combined with pooling layers. This is caused by the lack of proper filtering during image subsampling in pooling and convolutional layers when the stride is greater than one. Strided layers perform two operations: (1) a filtering procedure which is run over the entire image (in the case of max pooling, this is an order-statistic filter), and (2) image subsampling to reduce the image dimensions, most commonly done by striding to reduce the compute burden. The striding operation subsamples the image for the purposes of decimation. During this procedure, the energy of the discarded frequencies is folded into the desired lower frequency band, reducing the signal-to-noise ratio (SNR) of the resulting embedding. This results in a feature space which is not translation equivariant: translations in the input image do not correspond to translations in the feature embedding.
We can overcome the faults of the traditional pooling layer by introducing an anti-aliasing (AA) filter before all strided operations . In our setup, this means placing an AA filter before all strided convolutions and pooling operations. The AA filter prevents out-of-band frequencies from aliasing back into the remaining spectrum post-subsampling. This increases the signal-to-noise ratio (SNR) of the embedding and encourages translation equivariance.
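A minimal single-channel sketch of the idea follows: a small binomial low-pass filter applied before stride-2 subsampling. The 3×3 kernel choice and the loop-based implementation are illustrative, not necessarily those of the referenced work:

```python
import numpy as np

def blur_pool(x, stride=2):
    # Low-pass filter with a 3x3 binomial kernel, then subsample. Filtering
    # first attenuates frequencies that would otherwise alias into the
    # decimated feature map.
    k = np.array([1.0, 2.0, 1.0])
    k2 = np.outer(k, k) / k.sum() ** 2   # normalized kernel, sums to 1
    pad = np.pad(x, 1, mode="reflect")
    out = np.empty_like(x)
    H, W = x.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(pad[i:i + 3, j:j + 3] * k2)
    return out[::stride, ::stride]   # subsample only after filtering
```

In a CNN, this blur-then-stride replaces the bare strided step of each pooling or strided-convolution layer, leaving the preceding filtering operation (e.g. the max) at stride one.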
III-D2 Feature Position Encoding
As previously mentioned, CNNs do not naturally provide translation invariance when used in tandem with strided convolution and/or pooling layers. In addition,  demonstrated that CNNs have difficulty with position-oriented tasks because they do not encode feature position. At first this seems surprising given the translational nature of the convolution operator. However, the convolution operator takes as input a 2D map and outputs a 2D map; a feature’s position through this process is a function of the map domain but is not explicitly coded in the representation. Hence, when a 2D map is flattened and used as input to a dense layer, a feature’s position is lost.
The interesting problem of CNNs not recording positional information was noted not only by . In an early and popular work,  noted that CNNs are good at providing the “what” but not the “where,” and specifically designed their CNN architecture to compensate for this fault. Furthermore, the supplementary material of  also notes this, citing that the addition of positional information improved image in-painting tasks for their Deep Image Prior technique.
We augment SPDRDL with target position information by using the CoordConv solution of . In particular, we augment the output of the image enhancement network with two additional channels, each describing a positional dimension of the input map as described by Eq (5),

C_h(i, j) = 2i/(H − 1) − 1,  C_w(i, j) = 2j/(W − 1) − 1,    (5)

where C_h and C_w are the generated 2D maps augmented to the channel dimension of the input, H and W are the height and width of the image respectively in pixels, and (i, j) are pixel locations.
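A sketch of generating the two positional channels, assuming CoordConv-style normalization to [−1, 1] (the exact scaling used in the paper's Eq (5) may differ):

```python
import numpy as np

def coord_channels(H, W):
    # Row and column position maps, normalized to [-1, 1], to concatenate
    # onto the channel dimension of the enhancement network's output.
    rows = np.repeat(np.linspace(-1.0, 1.0, H)[:, None], W, axis=1)
    cols = np.repeat(np.linspace(-1.0, 1.0, W)[None, :], H, axis=0)
    return rows, cols
```

Because the maps are fixed functions of pixel location, downstream convolutions can read absolute position directly instead of inferring it from specific neurons.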
III-E Classification Loss
Categorical cross-entropy is a commonly used loss function for penalizing classification error in neural networks. It accounts for errors probabilistically by measuring the amount of surprise between the predicted and true labels. The measure works well in the presence of balanced classes and accurate labels. However, we know that for SAS the number of negative examples far outweighs the positive examples.
To mitigate the shortcomings of categorical cross-entropy in the presence of class imbalance, we use a weighted version of the measure called the focal loss. The focal loss is given by Eq (6),

FL = −∑_{c=1}^{C} p_c (1 − q_c)^γ log(q_c),   (6)

where C is the number of classes, p_c is the true probability of class c, q_c is the estimated probability of class c, and we use the strength coefficients given by the original focal loss paper. Focal loss is a weighted version of the cross-entropy loss whereby correct classification is de-weighted. Consequently, the effect of the focal loss is to place more emphasis on grossly misclassified samples compared to virtually correctly classified samples. Through the use of the focal loss, error gradients of correct classifications are greatly diminished during training while grossly incorrect classifications maintain their error magnitude. In this way, the focal loss focuses training on the misclassified samples and largely leaves the easy, correct classifications untouched.
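A minimal NumPy sketch of the multi-class focal loss follows. The function name `focal_loss` is illustrative, and `gamma=2.0` is the default recommended in the original focal loss paper, not necessarily the strength coefficient used by SPDRDL.

```python
import numpy as np

def focal_loss(y_true, y_pred, gamma=2.0, eps=1e-7):
    """
    Multi-class focal loss: cross-entropy modulated by (1 - q)^gamma so that
    easy, correctly classified samples contribute little to the loss.
    y_true: one-hot labels, shape (N, C); y_pred: predicted probabilities, shape (N, C).
    """
    q = np.clip(y_pred, eps, 1.0 - eps)  # guard against log(0)
    # Per-sample cross-entropy, down-weighted per class by (1 - q)^gamma.
    per_sample = -np.sum(y_true * (1.0 - q) ** gamma * np.log(q), axis=-1)
    return float(per_sample.mean())
```

With `gamma=0` this reduces to ordinary categorical cross-entropy; increasing gamma shrinks the contribution of confidently correct predictions, which is the "importance sampling without the overhead" effect described above.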
There are several ways to place emphasis on negative samples during training, a common one being to assign label weights. However, we chose the focal loss because of several positive properties it offers for our setup, which we describe next.
The first benefit realized by the focal loss is that it can be viewed through the lens of importance sampling, but without the explicit overhead associated with such techniques. Recent work showed that importance sampling works well to improve the performance of SAS ATR. In importance sampling, misclassified samples are shown more often during the training procedure than correctly classified samples. The focal loss can instill a similar training policy without the overhead of maintaining a list of the misclassified samples: a mini-batch of images is fed to the training algorithm and the misclassified samples are dynamically weighted proportional to their error. For each batch, correctly classified samples induce little error gradient and are effectively removed from the batch.
The second benefit realized by the focal loss is that the effective batch size is reduced over time. Reducing batch sizes has been associated with better generalization error. Assuming the distribution of easy- and hard-to-classify samples is uniform throughout the mini-batch, at the beginning of training all samples in the mini-batch are considered hard to classify. As training progresses, some samples become easier to classify and their error gradients vanish, effectively removing them from the mini-batch and reducing the effective mini-batch size.
III-F SPDRDL: Jointly Learned Image Enhancement and Object Location Estimation
In the previous sections, we examined sources of structural information in SAS images currently not utilized by contemporary ATR methods. Our proposed approach builds upon an existing CNN backbone network commonly used for feature extraction by utilizing this overlooked structural information. In this section, we bring together the aforementioned sections and fully present our proposed method, SPDRDL.
Incorporating the losses discussed in the previous sections, we arrive at the final loss function for SPDRDL, Eq (7). The loss is a function of the input image, the data-adaptive enhanced image, the true and predicted target classes, the true and estimated target translations, the parameters of the target localization, image enhancement, classification, and feature extraction networks, and the regularization weights on the two prior terms. Finally, a class-dependent weight for the localization task is given by Eq (8).
SPDRDL’s network description is in Table II. Convolutional layers are followed by ReLU activation and use He initialization. Anywhere subsampling was used (which includes pooling layers and strided convolutions), anti-aliasing filtering was applied before subsampling using a 3×3 kernel.
| Layer Name | Layer Function | Dimensions | # Filters | Input |
| pool1 | AA Max Pooling | 2×2 | N/A | conv1b |
| pool2 | AA Max Pooling | 2×2 | N/A | conv2b |
| pool3 | AA Max Pooling | 2×2 | N/A | conv3b |
| pool4 | AA Max Pooling | 2×2 | N/A | conv4b |
| gap1 | Global Average Pooling | N/A | N/A | densenet1 |
| classification | Dense with softmax | 4 | N/A | gap1 |
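The anti-aliased (AA) max pooling used throughout the network — blur before subsampling — can be sketched for a single channel in NumPy. This is a simplified stand-in under stated assumptions: a 3×3 binomial low-pass kernel and edge padding, not necessarily the paper's exact filter.

```python
import numpy as np

def aa_max_pool(x, stride=2):
    """Anti-aliased max pooling: dense 2x2 max, 3x3 binomial blur, then subsample."""
    h, w = x.shape
    # Dense (stride-1) 2x2 max with edge padding.
    p = np.pad(x, ((0, 1), (0, 1)), mode="edge")
    m = np.maximum.reduce([p[:h, :w], p[1:h + 1, :w], p[:h, 1:w + 1], p[1:h + 1, 1:w + 1]])
    # Separable binomial [1, 2, 1] / 4 low-pass filter along each axis.
    k = np.array([1.0, 2.0, 1.0]) / 4.0
    mp = np.pad(m, 1, mode="edge")
    rows = k[0] * mp[:-2, 1:-1] + k[1] * mp[1:-1, 1:-1] + k[2] * mp[2:, 1:-1]
    rq = np.pad(rows, ((0, 0), (1, 1)), mode="edge")
    blurred = k[0] * rq[:, :-2] + k[1] * rq[:, 1:-1] + k[2] * rq[:, 2:]
    # Subsample only after low-pass filtering, reducing aliasing from the stride.
    return blurred[::stride, ::stride]
```

Separating the max operation (evaluated densely) from the subsampling step is what restores approximate shift equivariance; naive strided max pooling fuses the two and aliases.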
In this section, we describe how we measure the performance of SPDRDL and demonstrate its efficacy against contemporary methods. First, we describe how we set up the experiments. Next, we describe the comparison methods. Finally, we show results comparing all the methods.
The ultimate goal of our experiments is to show the superiority of SPDRDL over existing methods. Equally important are two regimes to characterize. The first regime is classification performance of each object class. We show results in this regime by using confusion matrices whose purpose is to provide an overview of the classifier accuracy as a function of class. The second regime is classification performance in a one-versus-all scenario, whereby the target classes are collated into a single group. This collation converts our four class problem to a two class problem consisting of a target class and background class. Additionally, we will use a variation of this regime by showing performance of a particular target class versus all others.
As mentioned, we present results in a one-versus-all regime through conversion of the multi-class problem into a binary one. For binary class problems, many metrics exist by which to measure performance. A popular choice is the area under the receiver operating characteristic curve (AUCROC), which reports the probability that a randomly chosen positive sample is ranked above a randomly chosen negative sample. However, the method has been shown to be sensitive to class imbalance, which is pervasive here. Therefore, we choose area under the precision-recall curve (AUCPR) as our performance metric, based on analysis which determined that AUCPR is superior to AUCROC when the number of negative class samples greatly outnumbers the positive class samples, as is true here. Furthermore, prior work demonstrates the stability of AUCPR over AUCROC: a performance curve dominating in precision-recall (PR) space also dominates in ROC space, but not vice versa.
As previously mentioned, we use confusion matrices to measure per-class accuracy. For the one-versus-all cases, we use AUCPR. This metric has a benefit over a confusion matrix in that it does not force us to specify a threshold, as all thresholds are evaluated. Usually, a threshold is set to optimize a specific, context-dependent performance metric. In lieu of selecting a particular context, we simply report AUCPR on a one-versus-all basis.
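To illustrate why AUCPR needs no threshold, average precision can be computed directly from a ranking of scores. This is a simplified step-integration sketch (the name `average_precision` is illustrative; library implementations such as scikit-learn's `average_precision_score` differ slightly in tie handling):

```python
import numpy as np

def average_precision(y_true, scores):
    """Area under the precision-recall curve via step integration; no threshold needed."""
    order = np.argsort(-np.asarray(scores))       # rank samples by descending score
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                             # true positives at each rank cutoff
    precision = tp / np.arange(1, len(y) + 1)     # precision at each rank cutoff
    n_pos = max(int(y.sum()), 1)
    # Sum precision at the ranks where a positive is retrieved (recall steps up).
    return float(np.sum(precision * y) / n_pos)
```

A perfect ranking (all positives scored above all negatives) yields 1.0 regardless of class imbalance in the negative direction, which is part of why AUCPR is preferred here.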
Sonar image fidelity is often a function of range. For example, spreading and absorption losses in the medium attenuate the reflected signals as a function of range. Therefore, the SNR of sonar echoes is reduced at long ranges. To measure the contribution of such effects on our classifier, we evaluate the classifier performance as a function of observation range.
Recalling that CNNs with strides or pooling are not translation invariant, we also evaluate translation performance. Ideally, translation invariance would be evaluated at every possible translation of the target, but this becomes prohibitively expensive to compute. Instead, we evaluate translation invariance at eight extreme shifts of 59 cm as shown in Fig. 6. Good translation invariance yields the same classification regardless of shift, so we compute performance by measuring the standard deviation (stdev) of the output score over these nine shifts (the eight extreme shifts plus the center crop).
IV-B Dataset Description
We train and evaluate SPDRDL on the dataset of images output from the Mondrian detector on SAS imagery collected by the CMRE MUSCLE SAS sensor. It is the same dataset used in previous work but with three modifications: (1) the original dataset contains detections on image boundaries, and these images are extrapolated by mirroring, resulting in target shapes which are not seen in the real environment; (2) some of the images contained quadratic phase error (QPE) based on visual inspection, which we removed by applying a brute-force autofocus in the k-space domain; (3) the images were dynamic-range compressed using an algorithm based on a rational mapping function.
Overall, the dataset is composed of two partitions (dataset A and dataset B) based on collection year, as Fig. 5 depicts. Dataset A is composed of 27,748 images containing 1,385 targets collected from 2008 through 2013. Dataset B is composed of 21,181 images containing 639 targets collected from 2013 through 2018. Each image in the dataset has a resolution of 1.5 cm per pixel and a size of 335×335 pixels. However, translation is induced by cropping the images down to 256×256 pixels at training/inference time.
The data is partitioned into three sections for evaluation purposes: (1) The training set is composed of Dataset A and augmentations of it. (2) The validation set is composed of a single translation of each image of Dataset B. (3) The test set is composed of nine translations of each image of Dataset B.
The translations applied to the validation and test sets are combinations of horizontal and vertical shifts drawn, per dimension, from the two extreme shifts and the unshifted center. This configuration yields a total of nine possible shifts (including the center crop, which is not shifted at all) for each image of the test set. The validation set is composed of one random crop from the set of nine available for each image. Fig. 6 shows two examples of how the proposed cropping scheme contributes to the test and validation sets. In the examples, a given image from Dataset B is eight-way shifted by 59 cm, yielding a total of nine images (the mosaic) which are assigned to the test set and one image (bounded by the solid line) which is randomly selected for the validation set. Overall, the training set is composed of 27,748 images, the validation set is composed of 21,181 images, and the test set is composed of 190,629 images.
IV-C SPDRDL Training Procedure
We train SPDRDL in a similar fashion to most other CNNs. We use a mini-batch size of sixteen, of which we assign half the batch images from the background class and half the batch images from one of the three target classes. This 50:50 background-to-target-class split is based on prior analysis. Recall, images from the dataset are 335×335 pixels. For training, a random crop of 256×256 pixels is selected for the mini-batch. The associated translational shift induced by cropping is recorded for the images containing a target. For each image in the mini-batch, the network estimates the class and the translational shift (when the image is of a target class) and the errors are backpropagated appropriately. One epoch of training consists of the number of mini-batches required to see each image of the training set once on average.
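The crop-and-record step above can be sketched as follows. The helper name `random_crop_with_shift` and the sign convention for the recorded shift are assumptions for illustration; the paper records the shift in whatever convention its localization loss expects.

```python
import numpy as np

def random_crop_with_shift(image, crop=256, rng=None):
    """
    Crop a larger image (e.g. 335x335) to crop x crop at a random offset and
    return the induced shift, in pixels, relative to the centered crop.
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    max_dy, max_dx = h - crop, w - crop            # 79 pixels of slack per axis for 335 -> 256
    dy = int(rng.integers(0, max_dy + 1))          # random top-left corner
    dx = int(rng.integers(0, max_dx + 1))
    patch = image[dy:dy + crop, dx:dx + crop]
    # Offset of this crop from the centered crop; used as the localization label.
    shift = (dy - max_dy / 2.0, dx - max_dx / 2.0)
    return patch, shift
```

At 1.5 cm per pixel, the extreme offsets of ±39.5 pixels correspond to the ±59 cm shifts used to build the test set.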
CNNs perform best when ample training data is available. In many situations, though, large amounts of training data are not available. We call these instances low training data scenarios, and in them the application of a domain prior becomes particularly important, as its presence can significantly boost classification performance. To study this effect, we trained each of the methods on a random (but consistent across methods) 10% subset of the training set. Fig. 10 shows these results and, for convenience, the results when the full training data is available; recall these results are the AUCPRs of Fig. 9. The application of domain priors results in improved performance over all of the comparison methods when operating in low training data scenarios. These results demonstrate how our domain priors, image enhancement and target localization, improve performance in both abundant and low training data scenarios. The priors we introduce utilize information implicit in human interpretation and provide useful contextual information for our image classification method.
Deep networks often require a hyper-parameter search for optimal performance; SPDRDL is no different. SPDRDL uses the RMSProp optimization scheme with a fixed learning rate determined through cross-validation. Furthermore, the weights of the domain priors in Eq (7) were also selected through cross-validation.
IV-D Ablation Study: Impact of Domain Priors
Table III shows the performance of SPDRDL with no additional multi-task losses and with the incremental addition of each domain prior. For both the low and high training scenarios, each additional loss provides improved classification performance, with both priors together giving better performance than either individual prior.
Next, we compare SPDRDL against three common despeckling techniques to show the benefit of using the learned enhancement network with the SSP. Table IV shows the results for the high training scenario when we retrain the network with the SSP term of Eq (7) disabled and replace the enhancement network with one of the following despeckling filters: Gaussian filter, median filter, and total variation [19, 5]. The best AUCPR of the pre-processed despeckled images is 0.9451, which is not as good as when the SSP is present, 0.9538.
| Domain Priors Used | AUCPR (10% Training) | AUCPR (100% Training) |
| None, Only Classification Loss (CL) | 0.8742 | 0.9281 |
| CL + Structural Scene Context Prior (SSCP) | 0.8919 | 0.9503 |
| CL + Structural Similarity Prior (SSP) | 0.8969 | 0.9456 |
| Both, CL + SSCP + SSP | 0.9079 | 0.9538 |
| Despeckling Method | AUCPR (100% Training) |
| CL + SSCP + Gaussian Filter | 0.9450 |
| CL + SSCP + Median Filter | 0.9409 |
| CL + SSCP + Total Variation | 0.9451 |
IV-E Comparison Against State of the Art
We demonstrate the efficacy of SPDRDL by comparing against three state-of-the-art deep learning methods for sonar ATR and two recent shallow-learning methods. The deep learning methods were trained using TensorFlow 1.13.1. The number of trainable parameters for each network is given in Table V. Descriptions of the deep learning algorithms and their specific parameters follow.
Emigh, et al. (IOA SAS/SAR 2018). This network is based on the ResNet-18 architecture, does not rely on ImageNet pre-training, ingests dual-band SAS images, and is a binary classifier outputting a single scalar indicating target/non-target score. We modify the network to input a single-band SAS image and output four classes by using a softmax function after the last dense layer. We use the categorical cross-entropy loss when training the network, as binary cross-entropy loss was originally specified by the authors and we adapted it to this multi-class scenario. We trained using the Adam optimizer with a fixed learning rate. The paper mentions decreasing the learning rate when a loss plateau occurs but does not give details on its parameters. In lieu of this, we forgo the learning rate schedule and invoke the same early stopping rule used during SPDRDL training.
Galusha, et al. (SPIE 2019). This network has only a few layers and is an AlexNet-like architecture. Like Emigh, et al., this network originally consumed dual-band SAS imagery and output a binary classification score representing target/non-target. We modify the network in a similar fashion as Emigh, et al. by using only a single-band SAS image as input and modify the output to support classification of four classes using a softmax activation after the last dense layer. As with Emigh, et al., we use the categorical cross-entropy loss when training the network, as the binary cross-entropy loss was originally specified by the authors. We train the network in the same fashion the authors used in their work: stochastic gradient descent for 2,000 epochs at a fixed learning rate.
DenseNet121 (CVPR 2017). This is a common state-of-the-art off-the-shelf DenseNet architecture with 121 layers pre-trained on the ImageNet dataset. We choose this network for comparison because it is the feature extraction network used in SPDRDL, and it serves as a baseline providing evidence that our proposed priors improve classification performance. As with SPDRDL, we use the focal loss instead of the cross-entropy loss in order to demonstrate that our performance gains are not simply from this different classification loss. DenseNet121 ingests three-channel color imagery; we simply replicate the SAS input image over two additional channels to arrive at a three-channel image. Finally, we use the global average pooling option after the feature extraction layers and apply a four-class output with softmax. We train using the same procedure as SPDRDL.
BoW-HOG (CVPR 2015). This method is a bag-of-words model using histogram of oriented gradients (HOG) features inspired by prior work; a comparison to a similar algorithm was also made previously. For this approach, each image is divided into pixel tiles from which the HOG features are computed. These features are then clustered using mini-batch k-means clustering into what are known as words in this setup. The clusters of words form the vocabulary for the bag-of-words model.
The size of the vocabulary and the regularization parameter for the SVM are chosen using a random search [4, 47] of fifty iterations, with candidates drawn from uniform distributions over specified intervals and the best-performing configuration retained. A radial basis function kernel was used. The BoW-HOG method is costly to compute, so we use no data augmentation during training and measure performance solely on the center crops of Dataset B.
DSRC (IEEE TGRS 2017). This method is a dictionary sparse reconstruction (DSRC) algorithm inspired by prior work. We use mini-batch dictionary learning with coordinate descent to learn the dictionary atoms.
At test time, the learned dictionary for each class is used to reconstruct the test images one class at a time. The reconstruction errors are inverted and transformed by the softmax function into class probabilities.
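The error-to-probability step can be sketched as follows. Note an assumption: "inverted" is interpreted here as negation (so the smallest reconstruction error maps to the highest probability); a reciprocal transform would behave similarly, and the helper name `class_probs_from_errors` is illustrative.

```python
import numpy as np

def class_probs_from_errors(recon_errors):
    """Softmax over negated per-class reconstruction errors: lowest error wins."""
    e = -np.asarray(recon_errors, dtype=float)
    e -= e.max()                 # shift for numerical stability; softmax is shift-invariant
    p = np.exp(e)
    return p / p.sum()
```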
The sparsity of the reconstruction loss and the number of dictionary atoms per class are chosen using a random search of fifty iterations over specified intervals, with the best-performing configuration retained. A mini-batch size of ten was used for dictionary learning.
Similar to the BoW-HOG method, the extensive compute resources necessitated by this algorithm led us to use only center crops of the images of the training and validation sets. As with the BoW-HOG method, performance is only reported on the validation set.
| Network | Number of Trainable Parameters |
| Emigh, et al. | |
| Galusha, et al. | |
IV-F Results and Analysis
In this section, we present results comparing SPDRDL against several contemporary methods, demonstrate the necessity of each prior in our loss function formulation, analyze properties of the learned image enhancement function, and demonstrate the ability to reduce the network size considerably for use in low-power embedded systems while maintaining good classification performance.
IV-F1 Classification Task
We show confusion matrices in Fig.7, precision recall curves for each target class versus background in Fig.8, and precision recall curves for target versus background in Fig.9. We demonstrate the learning efficiency of our method by showing results from the ablation study. From these figures, we can see the benefits our domain priors afford us. In almost all metrics, SPDRDL outperforms existing methods.
Fig. 7. Confusion matrices for (a) BoW-HOG, (b) DSRC, (c) Emigh, et al., (d) Galusha, et al., (e) DenseNet121, and (f) SPDRDL.
Fig. 8. Precision-recall curves for each target class versus background: (a) Cylinder, (b) Wedge, (c) Truncated Cone.
Indeed, the results of Fig. 10 show SPDRDL performance very similar to DenseNet121 in the low training data scenario, with SPDRDL exhibiting a slight performance gain. However, the gap between SPDRDL and DenseNet121 may be much larger in reality; it is quite likely we are seeing the effects of selection bias of the training data subset. To examine issues of sample selection bias, we train the top three methods ten times each using a random subset of 10% of the training data; the same subsets were used for each algorithm evaluation. Evaluating the test set on all of these trained models would be computationally prohibitive due to the large test set size. Therefore, we report validation set performance for the best epoch of each. As shown in Fig. 11, the performance gap between SPDRDL and the next best method, DenseNet121, becomes much more distinct when viewing the results in the context of selection bias.
We also examined the performance gain of each regularization term of Eq (7) to demonstrate its necessity. This was done by setting each loss term to zero and then retraining and reevaluating the network. We see that both priors of Eq (7) improve classification performance even in the low training data scenario, demonstrating the benefits of incorporating domain knowledge.
Fig. 12. Classification scores over translations for (a) SPDRDL and (b) DenseNet121; (c) the image under test for (a) and (b), which contains a target-class object.
| Network | Shift Invariance Score (Lower is Better) |
| Emigh, et al. | |
| Galusha, et al. | |
We can see from the results that SPDRDL outperforms both the deep learning methods and the shallow learning methods by a significant amount, demonstrating the usefulness of the domain priors. For our ablation analysis, we focus on two aspects: (1) generalization efficacy in a low training data scenario where only 10% of the training data is used, and (2) the necessity of the additional loss terms from our domain priors used in Eq (7).
Fig.10 shows AUCPR for all methods using 10% and 100% of the training data; SPDRDL outperforms all the comparison methods showing the efficacy of the additional domain priors.
Finally, we examine the two-class performance as a function of object range from the sensor. Fig. 13 shows SPDRDL performing well over an extensive range from the sonar due to the addition of the domain priors. Especially at the nearest and furthest ranges from the SAS, SPDRDL outperforms the comparison methods.
IV-F2 Image Enhancement Task
Because of speckle noise, image despeckling algorithms are often employed to enhance SAS images for the purpose of improving human interpretability. We posit that there exists an image enhancement function which improves classification performance, and we call the associated constraint the structural similarity prior: any enhancement algorithm must preserve the structural similarity between the input and output images. We employ this prior in SPDRDL through the use of a learned image enhancement transform constrained by the loss of Eq (3).
We analyze the output of the enhancement network by examining its frequency response and comparing it to the frequency response of the original images. That is, we compute the frequency response of all the images in the target classes and present their averaged spectra; all images are windowed with a 2D Hamming window prior to frequency domain conversion. Fig. 14 depicts these results. We see selective suppression in both spatial frequency and orientation from the integration of our image enhancement network. This behavior is in contrast to a simple 2D Gaussian filter, which would give an isotropic frequency response.
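The windowed, averaged spectrum analysis can be sketched as follows. The helper name `mean_log_spectrum` and the log-magnitude scaling are illustrative assumptions; the paper does not specify how the averaged spectra are scaled for display.

```python
import numpy as np

def mean_log_spectrum(images):
    """Average 2D log-magnitude spectrum of a stack of equal-size images,
    each multiplied by a 2D Hamming window before the FFT."""
    h, w = images[0].shape
    window = np.outer(np.hamming(h), np.hamming(w))  # separable 2D Hamming window
    acc = np.zeros((h, w))
    for img in images:
        # fftshift centers the DC component for visual comparison of spectra.
        spec = np.fft.fftshift(np.abs(np.fft.fft2(img * window)))
        acc += np.log1p(spec)
    return acc / len(images)
```

Comparing this map for original versus enhanced images reveals whether suppression is isotropic (as a Gaussian blur would produce) or selective in orientation, as observed for the learned enhancement network.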
IV-F3 Target Localization Task and Translation Equivariance
CNNs lose their translation equivariance through the addition of non-unitary strided pooling and convolution layers. In this section, we quantify this phenomenon over the deep learning methods to demonstrate the necessity of adding anti-aliasing filters before subsampling in the network. To first illustrate the problem, we show classification scores of an image over a large set of translations in Fig. 12. As shown, classification scores can vary drastically even at small pixel shifts. For example, a well-centered target presented to the classifier may be correctly classified, yet translating the image by a single pixel can cause the classifier to misclassify it.
We analyze the translation invariance of each deep learning algorithm to assess its ability to provide translation invariance. We only examine the target/background scenario by converting the class prediction estimates to a single scalar indicating target score. For a single image, we compute its scores for the center crop and the eight extreme crops. We then measure the standard deviation of these scores and call this metric the image's shift invariance score. More formally, we define the shift invariance score as

s = stdev_{(Δx, Δy) ∈ S} f(I_{Δx,Δy}),

where f is the inference model, I_{Δx,Δy} is the input image translated by Δx and Δy, and S is the set of nine shifts (the center crop plus the eight extreme shifts). As a result of this formulation, lower shift invariance scores indicate better translation robustness. We report the results in Table VI. Our proposed method has the greatest translation invariance due to the addition of anti-aliasing filters before layers employing non-unitary stride.
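The shift invariance score defined above can be sketched directly. This is a minimal sketch assuming a single-channel image and a model callable returning a scalar target score; the extreme offset of 39 pixels is an assumption matching a 335 to 256 crop.

```python
import numpy as np

def shift_invariance_score(model, image, crop=256, extreme=39):
    """
    Standard deviation of the model's target score over the center crop and
    the eight extreme crops; lower means better translation robustness.
    """
    h, w = image.shape[:2]
    cy, cx = (h - crop) // 2, (w - crop) // 2   # top-left corner of the center crop
    scores = []
    for dy in (-extreme, 0, extreme):
        for dx in (-extreme, 0, extreme):
            y, x = cy + dy, cx + dx
            scores.append(model(image[y:y + crop, x:x + crop]))
    return float(np.std(scores))
```

A perfectly shift-invariant classifier scores 0.0 here, since all nine crops receive the identical target score.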
IV-F4 Network Compute Burden and Reduction
Deep networks consisting of a large number of parameters can be challenging to deploy on embedded hardware because of the large memory and computational footprint required. However, we can reduce the number of parameters used during inference by utilizing a network pruning algorithm. Although such algorithms remove weights, classifier performance is often maintained and in some cases, even improved.
We reduce the number of free parameters of the network using a magnitude-based pruning method. In this method, we sort the absolute values of the network weights and set the lowest proportion of weights to zero. Fig. 15 illustrates the results of performing this operation on SPDRDL. Even when the number of parameters is reduced by half, the network still achieves competitive classification performance.
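Global magnitude pruning as described above can be sketched in NumPy. The function name `prune_by_magnitude` is illustrative; note that ties at the threshold may zero slightly more than the requested fraction.

```python
import numpy as np

def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with smallest absolute value,
    ranked globally across all weight arrays."""
    flat = np.concatenate([w.ravel() for w in weights])
    k = int(fraction * flat.size)
    if k == 0:
        return [w.copy() for w in weights]
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(flat), k - 1)[k - 1]
    return [np.where(np.abs(w) <= threshold, 0.0, w) for w in weights]
```

In practice the zeroed weights translate to reduced memory and, with sparse kernels, reduced compute on embedded hardware.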
SAS is a relatively new field of remote sensing which has seen increased interest in the last two decades because of the high-quality imagery such systems produce. Recent technological advancements in computing (e.g., GPUs) have made it easy to deploy SAS-equipped UUVs with on-board image formation. This capability paves the way for automatic machine interpretation of the imagery with the goal of influencing vehicle behavior. Despite the success of SAS, high false alarm rates make it difficult for autonomy to make decisions in situ.
In this work, we developed a SAS ATR algorithm exhibiting improved performance over state-of-the-art methods by integrating domain knowledge of SAS images previously overlooked by existing methods. Our formulation jointly learns image enhancement and target localization for the purpose of improving the downstream task of image classification. We compare our method to several state-of-the-art techniques, including two recent deep learning methods, and demonstrate its efficacy. Finally, we use a recently proposed pruning technique to show we can halve the number of free parameters in our network and still achieve competitive performance, thus demonstrating feasibility for real-time deployment.
Future work includes extending the approach to phase and frequency representations of the SAS images.
The authors would like to thank the NATO Centre for Maritime Research & Experimentation (CMRE) for providing the data used in this work. The collection of the data was funded by the NATO Allied Command Transformation. Colors used in the plots are derived from publicly available color schemes, including Color Brewer 2.0. I.G. would like to thank Dr. David Williams and Dr. John McKay for their helpful comments during the progress of this work.
-  (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. Cited by: §IV-E.
-  (2019) Why do deep convolutional networks generalize so poorly to small image transformations?. Journal of Machine Learning Research 20 (184), pp. 1–25. Cited by: §III-A, §III-D1.
-  (2013) GPU-based real-time synthetic aperture sonar processing on-board autonomous underwater vehicles. In OCEANS, pp. 1–8. Cited by: §IV-B.
-  (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (Feb), pp. 281–305. Cited by: §IV-E.
-  (2011) Bregman algorithms. Senior Thesis. University of California, Santa Barbara. Cited by: §IV-D.
-  (2003) Signal processing for synthetic aperture sonar image enhancement. Ph.D. Thesis, University of Canterbury. Cited by: §I-A, §I.
-  Color Brewer 2.0. http://colorbrewer2.org, accessed 2020-01-01. Cited by: Acknowledgments.
-  (2008) Analysis of phase error effects on stripmap SAS. IEEE Journal of Oceanic Engineering 34 (3), pp. 250–261. Cited by: §IV-B.
-  (2006) The relationship between precision-recall and ROC curves. In International Conference on Machine learning, pp. 233–240. Cited by: §IV-A.
-  (2009) ImageNet: a large-scale hierarchical image database. In International Conference on Computer Vision, Cited by: §II.
-  (1997) Automated detection and classification of sea mines in sonar imagery. In Detection and Remediation Technologies for Mines and Minelike Targets II, Vol. 3079, pp. 90–110. Cited by: §I-A.
-  (1999) Fusing sonar images for mine detection and classification. In Detection and Remediation Technologies for Mines and Minelike Targets IV, Vol. 3710, pp. 602–614. Cited by: §I-A.
-  (2018) Supervised deep learning classification for multi-band synthetic aperture sonar. In Synthetic Aperture Sonar & Synthetic Aperture Radar Conference, Vol. 40, pp. 140–147. Cited by: §II, §IV-E.
-  (2008) Improving classification performance of sonar targets by applying general regression neural network with PCA. Expert Systems with Applications 35 (1-2), pp. 472–475. Cited by: §I-A.
-  (2001) Statistical autofocus of synthetic aperture sonar images using image contrast optimisation. In OCEANS, Vol. 1, pp. 163–169. Cited by: §I.
-  (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In International Conference on Learning Representations, Cited by: §IV-F4.
-  (2019) Deep convolutional neural network target classification for underwater synthetic aperture sonar imagery. In Detection and Sensing of Mines, Explosive Objects, and Obscured Targets XXIV, Vol. 11012, pp. 1101205. Cited by: §II, §IV-E.
-  (2018) Additional representations for improving synthetic aperture sonar classification using convolutional neural networks. In Synthetic Aperture Sonar & Synthetic Aperture Radar Conference, pp. 11–22. Cited by: §II, §IV-B.
-  (2012) Rudin-osher-fatemi total variation denoising using split bregman. Image Processing On Line 2, pp. 74–95. Cited by: §IV-D.
-  (1986) A synthetic aperture sonar system capable of operating at high speed and in turbulent media. IEEE Journal of Oceanic Engineering 11 (2), pp. 333–339. Cited by: §I.
-  (2009) Synthetic aperture sonar in challenging environments: results from the HISAS 1030. Underwater Acoustic Measurements. Cited by: §I.
-  (1996) Synthetic aperture imaging algorithms: with application to wide bandwidth sonar. Cited by: §I-A.
-  (1991) Results from an experimental synthetic aperture sonar. In Acoustical Imaging, pp. 455–466. Cited by: §I.
-  (1992) Broad-band synthetic aperture sonar. IEEE Journal of Oceanic Engineering 17 (1), pp. 80–94. Cited by: §I.
-  (2015) Delving deep into rectifiers: surpassing human-level performance on Imagenet classification. In International Conference on Computer Vision, pp. 1026–1034. Cited by: §III-F.
-  (2016) Deep residual learning for image recognition. In Computer vision and pattern recognition, pp. 770–778. Cited by: §III-B.
-  (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §III-B.
-  (2017) Densely connected convolutional networks. In Computer Vision and Pattern Recognition, pp. 4700–4708. Cited by: §III-B, §IV-E.
-  (2003) Simulation of multiple-receiver, broadband interferometric SAS imagery. In OCEANS, Vol. 5, pp. 2629–2634. Cited by: §I.
-  (1995) Sea mine detection and classification using side-looking sonar. In Detection Technologies for Mines and Minelike Targets, Vol. 2496, pp. 442–453. Cited by: §I-A.
-  (2015) Sonar automatic target recognition for underwater UXO remediation. In Computer Vision and Pattern Recognition Workshops, Cited by: §II, §IV-E.
-  (2018) Not all samples are created equal: deep learning with importance sampling. In International Conference on Machine Learning, pp. 2530–2539. Cited by: §III-E.
-  (2017) On large-batch training for deep learning: generalization gap and sharp minima. International Conference on Learning Representations. Cited by: §III-E.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-E.
-  (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §III-B.
-  (2019) Deep learning from shallow dives: sonar image generation and training for underwater object detection. In International Conference on Robotics and Automation, Cited by: §II.
-  (2017) Focal loss for dense object detection. In International Conference on Computer Vision, pp. 2980–2988. Cited by: §III-E.
-  (2018) An intriguing failing of convolutional neural networks and the coordconv solution. In Advances in Neural Information Processing Systems, pp. 9605–9616. Cited by: §III-D2, §III-D2, §III-D2.
-  (2009) Online dictionary learning for sparse coding. In International Conference on Machine Learning, pp. 689–696. Cited by: §IV-E.
-  (2014) Convolutional kernel networks. In Advances in Neural Information Processing Systems, pp. 2627–2635. Cited by: §III-A.
-  (2017) What’s mine is yours: pretrained CNNs for limited training sonar ATR. In OCEANS, pp. 1–7. Cited by: §II.
-  (2018) Bridging the gap: simultaneous fine tuning for data re-balancing. In International Geoscience and Remote Sensing Symposium, pp. 7062–7065. Cited by: §II.
-  (2017) Robust sonar ATR through Bayesian pose-corrected sparse classification. IEEE Transactions on Geoscience and Remote Sensing 55 (10), pp. 5563–5576. Cited by: §II, §IV-E, §IV-E.
-  (2016) Localized dictionary design for geometrically robust sonar ATR. In International Geoscience and Remote Sensing Symposium, pp. 991–994. Cited by: §II.
-  (2009) Using information gain attribute evaluation to classify sonar targets. In 17th Telecommunications Forum, pp. 1351–1354. Cited by: §I-A.
-  (2018) Optimizing colormaps with consideration for color vision deficiency to enable accurate interpretation of scientific data. PLoS ONE 13 (7). Cited by: Acknowledgments.
-  (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §IV-E.
-  (2012) Denoising of SAR images. Ph.D. Thesis, Federico II University of Naples. Cited by: §I-B, §III-A.
-  (2019) Coupling rendering and generative adversarial networks for artificial SAS image generation. In OCEANS, Cited by: §II.
-  (2004) Automated approach to classification of mine-like objects in sidescan sonar using highlight and shadow information. IEE Radar, Sonar and Navigation 151 (1), pp. 48–56. Cited by: §I-A.
-  (2003) An automatic approach to the detection and extraction of mine features in sidescan sonar. IEEE Journal of Oceanic Engineering 28 (1), pp. 90–105. Cited by: §I-A.
-  (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Cited by: §III-C.
-  (1995) Quantization techniques for visualization of high dynamic range pictures. In Photorealistic Rendering Techniques, pp. 7–20. Cited by: §IV-B.
-  (2010) Web-scale k-means clustering. In International Conference on World Wide Web, pp. 1177–1178. Cited by: §IV-E.
-  (2014) Real-time SAS processing for high-arctic AUV surveys. In OES Autonomous Underwater Vehicles, pp. 1–5. Cited by: §I.
-  (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §III-B.
-  (2011) Automation for underwater mine recognition: current trends and future strategy. In Detection and Sensing of Mines, Explosive Objects, and Obscured Targets XVI, Vol. 8017, pp. 80170K. Cited by: §I.
-  (2015) Going deeper with convolutions. In Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §III-B.
-  (2012) Lecture 6.5 - RMSProp: divide the gradient by a running average of its recent magnitude. Note: COURSERA: Neural Networks for Machine Learning. Cited by: §IV-C.
-  (2019) Prior information guided regularized deep learning for cell nucleus detection. IEEE Transactions on Medical Imaging 38 (9), pp. 2047–2058. Cited by: §III-C.
-  (2011) Coherence-based underwater target detection from multiple disparate sonar platforms. IEEE Journal of Oceanic Engineering 36 (1), pp. 37–51. Cited by: §II.
-  (2018) Deep image prior. In Computer Vision and Pattern Recognition, pp. 9446–9454. Cited by: §III-D2.
-  (2018) Deep network for simultaneous decomposition and classification in UWB-SAR imagery. In Radar Conference, pp. 0553–0558. Cited by: item 1.
-  (2011) Class imbalance, redux. In International Conference on Data Mining, pp. 754–763. Cited by: §IV-C.
-  (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §III-C.
-  (2009) Mean squared error: love it or leave it? A new look at signal fidelity measures. IEEE Signal Processing Magazine 26 (1), pp. 98–117. Cited by: §III-C.
-  (2003) Multiscale structural similarity for image quality assessment. In Asilomar Conference on Signals, Systems & Computers, Vol. 2, pp. 1398–1402. Cited by: §III-C, §III-C.
-  (2019) On the benefit of multiple representations with convolutional neural networks for improved target classification using sonar data. Underwater Acoustics Conference. Cited by: §II, §III-E.
-  (2016) Underwater target classification in synthetic aperture sonar imagery using deep convolutional neural networks. In International Conference on Pattern Recognition, pp. 2497–2502. Cited by: §II, §II.
-  (2017) The Mondrian detection algorithm for sonar imagery. IEEE Transactions on Geoscience and Remote Sensing 56 (2), pp. 1091–1102. Cited by: §III-A.
-  (2018) Exploiting phase information in synthetic aperture sonar images for target classification. In OCEANS, pp. 1–6. Cited by: §II.
-  (2019) Transfer learning with SAS-image convolutional neural networks for improved underwater target classification. In International Geoscience and Remote Sensing Symposium, pp. 78–81. Cited by: §II.
-  (2017) Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, Cited by: §III-A.
-  (2019) Making convolutional networks shift-invariant again. In International Conference on Machine Learning, pp. 7324–7334. Cited by: §III-D1.
-  (2016) Stacked what-where auto-encoders. International Conference on Learning Representations. Cited by: §III-D2.