# Deep Learning Classification of 3.5 GHz Band Spectrograms with Applications to Spectrum Sensing

## Abstract

In the United States, the Federal Communications Commission has adopted rules permitting commercial wireless networks to share spectrum with federal incumbents in the 3.5 GHz Citizens Broadband Radio Service band (3550-3700 MHz). These rules require commercial wireless systems to vacate the band when coastal sensor networks detect radars operated by the U.S. military; a key example being the SPN-43 air traffic control radar. For such coastal sensor networks to meet their operating requirements, they require highly-accurate detection algorithms. In addition to their use in sensor networks, detection algorithms can assist in the generation of descriptive statistics for libraries of spectrum recordings. In this paper, using a library of over 14,000 3.5 GHz band spectrograms collected by a recent measurement campaign, we investigate the performance of three different methods for SPN-43 radar detection. Namely, we compare classical energy detection to two deep learning algorithms: a convolutional neural network and a long short-term memory recurrent neural network. Performing a thorough evaluation, we demonstrate that deep learning algorithms appreciably outperform energy detection. Finally, we apply the best-performing classifier to generate descriptive statistics for the 3.5 GHz spectrogram library. Overall, our findings highlight potential weaknesses of energy detection as well as the strengths of modern deep learning algorithms for radar detection in the 3.5 GHz band.


## I Introduction

In April 2015, the U.S. Federal Communications Commission (FCC) adopted rules for the Citizens Broadband Radio Service (CBRS) that permit commercial wireless usage of the 3550-3700 MHz band (the “3.5 GHz band”) in the United States [1]. The CBRS architecture outlined in the FCC rules includes a spectrum access system (SAS) together with environmental sensing capability (ESC) detectors to facilitate spectrum sharing in the 3550-3650 MHz band. The purpose of the SAS is to coordinate commercial-user CBRS access so that federal incumbents are given priority access.

The primary federal incumbents in the 3.5 GHz band are shipborne and ground-based radars operated by the U.S. military [2]. The CBRS framework requires that ESC sensors detect these radars, including the SPN-43 air traffic control radar [3], also identified as Shipborne Radar 1 in [2]. ESC detection capabilities are determined by intended and unintended emissions, as well as background noise. For example, out-of-band emissions (OOBE) from an adjacent-band U.S. Navy radar, identified as Shipborne Radar 3 in [2], are prevalent [4, 5, 6, 7], and could complicate SPN-43 detection.

As federal agencies collaborate with industry to refine standards and requirements for ESC detectors, sound methods for evaluation of ESC detector performance must be developed and potential limitations should be understood. Furthermore, efforts to design ESC detectors could benefit from detection algorithm comparisons and the characterization of emissions in the 3.5 GHz band.

In this paper, we address the above needs with a study of over 14,000 3.5 GHz band spectrograms recorded by a recent measurement campaign [6, 7] at two coastal locations: Point Loma, in San Diego, California and Fort Story, in Virginia Beach, Virginia. Specifically, we apply two deep learning methods to SPN-43 detection, a convolutional neural network (CNN) [8, 9] and a long short-term memory (LSTM) recurrent neural network [10], which have been extremely successful on a wide range of classification problems in recent years [11, 12]. A thorough performance evaluation utilizing a test set of unverified, human-labeled spectrograms reveals that deep learning appreciably outperforms conventional energy detection (ED) [13, 14, 15]. Last, we apply deep learning to classify the complete set of spectrograms collected in San Diego and Virginia Beach with respect to SPN-43 presence, from which we estimate SPN-43 spectrum occupancy and characterize the power of non-SPN-43 emissions.

## II 3.5 GHz Spectrograms

As described in [6, 7], 3.5 GHz band measurements were collected for a period of two months at each measurement site. The primary aim of the measurements was to acquire high-fidelity, time-domain recordings of SPN-43 radar waveforms in the 3.5 GHz band. For this purpose, a 60-second, complex-valued (i.e., I/Q) waveform was recorded roughly every ten minutes with a sample rate of 225 MS/s, and a corresponding spectrogram was computed. The decision to retain a given waveform recording was made by comparing the spectrogram amplitudes to a threshold over the band of interest. Although only a subset of the waveforms was retained for long-term storage, the entire set of spectrograms was saved.

In total, 14,739 spectrograms were collected over the measurement campaign. Of these, approximately 58% were acquired in San Diego and 42% in Virginia Beach. At each measurement site, data were collected with both an omni-directional antenna and a directional, cavity-backed spiral (CBS) antenna. Roughly 45% and 55% of the spectrograms were acquired with the omni-directional and CBS antennas, respectively. The spectrograms span a 200 MHz frequency range, typically 3465-3665 MHz, and a time-interval of one minute. Each spectrogram has dimensions 134x1024, with 134 time-bins of duration 0.455 seconds and 1024 frequency-bins of width approximately 220 kHz.

The spectrograms were computed by applying a short-time Fourier transform (STFT) and then retaining the maximum amplitude in each frequency bin (i.e., a max-hold) over each 0.455 second time-epoch. The window function for the discrete STFT was 1024 samples long, with the middle 800 points given a weight of one, and the left-most and right-most 112 points weighted with a cosine-squared taper. The STFT was implemented with 112-sample overlap between consecutive time-segments. Each spectrogram value is the maximum of time-averaged amplitudes, where the averaging duration is approximately 4.55 µs, because the STFT effectively averaged over a 1024-sample (1024 / 225 MS/s ≈ 4.55 µs) time window.
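The windowing and max-hold steps described above can be illustrated with a minimal NumPy sketch. This is not the code used in the measurement campaign: the function names, the exact phase of the cosine-squared ramp, and the `segments_per_epoch` parameter are assumptions introduced only for illustration.

```python
import numpy as np

N_FFT = 1024            # STFT window length (from the text)
N_TAPER = 112           # tapered samples on each side (from the text)
HOP = N_FFT - N_TAPER   # 112-sample overlap between consecutive segments


def tapered_window(n_fft=N_FFT, n_taper=N_TAPER):
    """Flat-top window: unit weight in the middle, cosine-squared ramps at the edges."""
    # Assumed ramp sampling: sin^2 rising from near 0 to near 1, mirrored on the right.
    ramp = np.sin(0.5 * np.pi * (np.arange(n_taper) + 0.5) / n_taper) ** 2
    window = np.ones(n_fft)
    window[:n_taper] = ramp
    window[-n_taper:] = ramp[::-1]
    return window


def max_hold_spectrogram(iq, segments_per_epoch):
    """Max-hold of STFT amplitudes over consecutive groups of segments (epochs)."""
    window = tapered_window()
    n_seg = (len(iq) - N_FFT) // HOP + 1
    n_epoch = n_seg // segments_per_epoch
    rows = np.zeros((n_epoch, N_FFT))
    for e in range(n_epoch):
        for s in range(segments_per_epoch):
            start = (e * segments_per_epoch + s) * HOP
            seg = iq[start:start + N_FFT] * window
            amp = np.abs(np.fft.fftshift(np.fft.fft(seg)))
            rows[e] = np.maximum(rows[e], amp)  # keep the largest amplitude per bin
    return rows
```

At a sample rate of 225 MS/s and a hop of 912 samples, a 0.455 second epoch would hold roughly 112,000 segments, so a practical implementation would process the recording in batches rather than with the nested Python loop shown here.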

When noted below, the spectrogram values were converted to power units (dBm) as follows. Each max-hold spectrogram value was (i) divided by 1024, the STFT window length, (ii) divided by the (site-specific) front-end gain, (iii) multiplied by a measurement instrument calibration factor, (iv) squared and divided by $2 \times 50 = 100$ (the 2 arises from the conversion between peak and root-mean-square (RMS) voltage for a narrowband signal, the 50 is for a 50 ohm load), and (v) converted to decibel-milliwatts (dBm) via the formula $10\log_{10}(P) + 30$, where $P$ is the power in watts from step (iv). For some calculations, the spectrogram values were further converted to dBm/MHz by subtracting the effective noise bandwidth of the time-domain window for the STFT, expressed in dBMHz [6, p. 32].
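Steps (i) through (v) amount to a short chain of scalings. The sketch below is illustrative only: the function and parameter names are hypothetical, and the front-end gain is assumed to be supplied as a linear voltage ratio.

```python
import numpy as np

N_FFT = 1024  # STFT window length used in step (i)


def spectrogram_to_dbm(values, front_end_gain, cal_factor):
    """Convert max-hold STFT amplitudes to power in dBm, following steps (i)-(v)."""
    volts_peak = np.asarray(values, dtype=float) / N_FFT   # (i) window length
    volts_peak = volts_peak / front_end_gain                # (ii) assumed linear voltage gain
    volts_peak = volts_peak * cal_factor                    # (iii) instrument calibration
    watts = volts_peak ** 2 / (2.0 * 50.0)                  # (iv) peak-to-RMS, 50 ohm load
    return 10.0 * np.log10(watts) + 30.0                    # (v) watts to dBm
```

For example, under these assumptions a raw value of 1024 with unit gain and unit calibration factor corresponds to a 1 V peak amplitude, i.e., 10 mW or 10 dBm.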

Figure 1 shows example spectrograms, cropped to the 3550-3650 MHz band of interest for ESC detection. The original full-bandwidth (approx. 3465-3665 MHz) versions of these cropped spectrograms are available in Figures 3.6 and 3.14 of [7], and Figure 3.18 of [6], for the

## III Classifiers

This section gives implementation details for the three classifiers that we utilized for SPN-43 detection. Specifically, we applied two deep learning algorithms, a CNN and an LSTM, as well as a conventional energy detector. The deep learning algorithms were implemented using the open-source TensorFlow™ Python library running on an Nvidia® DGX™ workstation with four Tesla® V100 graphics processing unit (GPU) cards.

The classifiers were first designed for the reduced task of detecting SPN-43 in a 10 MHz channel. For this purpose, spectrograms were divided into 10 MHz-wide channels centered at multiples of 10 MHz, e.g., 3550 MHz and 3560 MHz. Subsequently, to classify an entire spectrogram, copies of each classifier were connected together to classify multiple channels in parallel.
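The channelization step can be sketched as follows. The helper names and the `freqs_mhz` array of bin-center frequencies are hypothetical; how channel edges were treated in the study is not specified, so a half-open interval is assumed here.

```python
import numpy as np


def extract_channel(spec, freqs_mhz, center_mhz, width_mhz=10.0):
    """Return the spectrogram columns inside one channel (time x frequency)."""
    lo = center_mhz - width_mhz / 2.0
    hi = center_mhz + width_mhz / 2.0
    cols = (freqs_mhz >= lo) & (freqs_mhz < hi)   # assumed half-open interval
    return spec[:, cols]


def split_into_channels(spec, freqs_mhz, centers_mhz):
    """Map each channel center frequency to its sub-spectrogram."""
    return {float(c): extract_channel(spec, freqs_mhz, c) for c in centers_mhz}
```

Each sub-spectrogram can then be passed to its own copy of a single-channel classifier, which is one way to realize the parallel multichannel arrangement.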

### III-A Convolutional Neural Network

Figure 2 (left) summarizes the CNN architecture that we used for SPN-43 detection. First, the 10 MHz channel was passed through an average pooling layer with a window size of 10x2, resulting in a down-sampled spectrogram with time and frequency dimensions reduced by factors of 10 and 2, respectively. The down-sampled spectrogram was then passed to a convolutional layer with 20 filters of size 3x3 and step size 1x1. Zero-padding was not used. Subsequently, a bias (i.e., constant) was added to the filter activations and a rectified linear unit (ReLU) was applied to the resulting output. The output of the ReLU step consisted of 20 activation maps, one for each of the convolutional layer’s filters.

Next, the activation maps were averaged together to create a single averaged-activation map. To our knowledge, this averaging step has not been suggested previously in the CNN literature. The operation does, however, resemble 1x1 convolutions [16] and other channel pooling methods [17]. We found empirically that using the averaged activations rather than the individual activation maps improved accuracy. The averaged-activation map was then passed into a fully-connected layer containing 150 neurons. A bias was added to the output of this fully-connected layer and a ReLU was applied. The output of the ReLU was then fed through a dropout step, with a dropout probability of 50%. Subsequently, the output from the dropout step was fed into another fully-connected layer containing a single neuron followed by a bias. Finally, the biased output was passed through a sigmoid activation function to produce the prediction, a continuous-valued number between zero and one. The CNN was trained using stochastic gradient descent with Xavier initialization [18] and cross-entropy loss; further details on training are given in Section IV-B.
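To make the data flow concrete, the following NumPy sketch traces an inference pass through the layers described above. It is a shape-level illustration with untrained placeholder weights, not the TensorFlow implementation used in this study. The handling of leftover rows and columns in the pooling step and the parameter dictionary layout are assumptions.

```python
import numpy as np


def avg_pool(x, ph, pw):
    """Non-overlapping average pooling; leftover rows/columns are dropped (assumed)."""
    h, w = x.shape[0] // ph, x.shape[1] // pw
    return x[:h * ph, :w * pw].reshape(h, ph, w, pw).mean(axis=(1, 3))


def conv2d_valid(x, filters, bias):
    """Stride-1 filtering without zero-padding; filters has shape (n, kh, kw)."""
    n, kh, kw = filters.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((n, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + kh, j:j + kw]
            out[:, i, j] = np.tensordot(filters, patch, axes=([1, 2], [0, 1])) + bias
    return out


def cnn_forward(channel, p):
    """Inference pass; dropout is omitted because it is only active during training."""
    x = avg_pool(channel, 10, 2)
    maps = np.maximum(conv2d_valid(x, p["conv_w"], p["conv_b"]), 0.0)  # bias + ReLU
    avg_map = maps.mean(axis=0).ravel()        # average the 20 activation maps
    hidden = np.maximum(p["fc1_w"] @ avg_map + p["fc1_b"], 0.0)        # 150 neurons
    logit = p["fc2_w"] @ hidden + p["fc2_b"]                           # single neuron
    return 1.0 / (1.0 + np.exp(-logit))        # sigmoid score between 0 and 1


def random_params(channel_shape, rng):
    """Untrained placeholder weights, used only to exercise the shapes."""
    h, w = channel_shape[0] // 10 - 2, channel_shape[1] // 2 - 2
    return {
        "conv_w": rng.normal(0, 0.1, (20, 3, 3)), "conv_b": np.zeros(20),
        "fc1_w": rng.normal(0, 0.1, (150, h * w)), "fc1_b": np.zeros(150),
        "fc2_w": rng.normal(0, 0.1, 150), "fc2_b": 0.0,
    }
```

With a hypothetical 134x46 input channel, the pooled spectrogram is 13x23, the convolutional layer yields 20 maps of size 11x21, and the averaged map flattens to 231 inputs for the 150-neuron layer.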

### III-B Long Short-Term Memory Recurrent Neural Network

Figure 2 (right) summarizes our LSTM architecture. In order to effectively use the LSTM, we split the 10 MHz channel along the time axis to create sequential slices of approximately 0.455 seconds. Each of the time-slices was fed into the LSTM cell one at a time along with the previous output of the LSTM cell. This is known as a residual connection. Motivations for using residual connections in LSTMs include protection against the vanishing gradient problem in backpropagation [19] as well as greater network expressivity [20]. Dropout was used between LSTM cells with a probability of 50%.

After all of the time slices were fed into the LSTM, the output of the last cell was passed on to a fully-connected layer with 50 neurons. Next, a bias was added to the output of the 50-neuron fully-connected layer and a ReLU was applied. The output of the ReLU was then passed to a fully-connected layer of size 1 and a bias was added. Lastly, a sigmoid activation was applied to the output to generate the prediction, a continuous-valued number between zero and one. Like the CNN, the LSTM was trained using stochastic gradient descent with Xavier initialization [18] and cross-entropy loss; further details on training are given in Section IV-B.
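The sequence processing can be illustrated with a plain LSTM recurrence followed by the two fully-connected layers. This NumPy sketch is a simplification under stated assumptions: it uses the standard LSTM gate equations with a single cell, omits the residual connections and dropout, and takes the hidden-state size as a free parameter because it is not specified above. All names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_forward(slices, W, U, b):
    """Run a standard LSTM over time slices and return the last hidden state.

    slices: (T, F) array, one row per time bin of the 10 MHz channel.
    W: (4H, F), U: (4H, H), b: (4H,), stacked as input, forget, cell, output gates.
    """
    H = U.shape[1]
    h = np.zeros(H)
    c = np.zeros(H)
    for x in slices:
        z = W @ x + U @ h + b                      # current slice plus previous output
        i, f = sigmoid(z[:H]), sigmoid(z[H:2 * H])
        g, o = np.tanh(z[2 * H:3 * H]), sigmoid(z[3 * H:])
        c = f * c + i * g                          # update the cell state
        h = o * np.tanh(c)                         # new hidden state (cell output)
    return h


def lstm_classifier(slices, p):
    """Last LSTM output -> 50-neuron ReLU layer -> single neuron -> sigmoid."""
    h_last = lstm_forward(slices, p["W"], p["U"], p["b"])
    hidden = np.maximum(p["fc1_w"] @ h_last + p["fc1_b"], 0.0)
    return sigmoid(p["fc2_w"] @ hidden + p["fc2_b"])
```

Each row of `slices` plays the role of one 0.455 second time-slice, so a full channel contributes 134 recurrence steps.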

### III-C Energy Detection

Energy detection [13, 14, 15] is a classical strategy that assumes a signal of interest can be detected from the total energy across a given time and frequency range. If this energy exceeds a given detection threshold, the signal is declared present.

As detailed in [6, 7], the spectrogram captures were collected with different front-ends at the two measurement sites. To account for this fact, we applied site-dependent corrections to normalize the spectrograms to dBm units, as described in Section II. This normalization was only used for the energy detection algorithm; the deep learning methods did not require any data normalization.

To improve the performance of energy detection, the whole 10 MHz channel was not used. Instead, only the 3 middle spectrogram columns (approximately 660 kHz) of each 10 MHz channel were aggregated for energy detection. This range was chosen based on the results of an empirical evaluation. Because SPN-43 can generally be expected to have carrier frequencies near multiples of 10 MHz, this modification excluded confounding emissions from the rest of the 10 MHz channel. Although this exclusion yielded a small improvement in energy detection performance, it was not needed for the deep learning algorithms.
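A minimal sketch of such a detector is given below. The aggregation rule is an assumption made for illustration (mean power in linear units over all time bins and the three middle columns, reported in dBm), since the exact statistic is not spelled out above; the function names are hypothetical.

```python
import numpy as np


def energy_statistic_dbm(channel_dbm, n_cols=3):
    """Aggregate power over the middle columns of a channel (time x frequency)."""
    mid = channel_dbm.shape[1] // 2
    half = n_cols // 2
    cols = channel_dbm[:, mid - half: mid + half + 1]
    milliwatts = 10.0 ** (cols / 10.0)           # dBm to mW before averaging
    return 10.0 * np.log10(milliwatts.mean())    # assumed aggregate: mean power


def energy_detect(channel_dbm, threshold_dbm):
    """Declare the signal present when the statistic exceeds the threshold."""
    return bool(energy_statistic_dbm(channel_dbm) > threshold_dbm)
```

Because only the middle columns enter the statistic, a strong emission elsewhere in the 10 MHz channel does not affect the decision, which is the motivation for the restriction described above.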

## IV Classifier Performance Assessment

As noted in Section II, a set of 4,491 spectrograms was labeled for SPN-43 presence and Radar 3 OOBE. This collection of labeled data was partitioned into two disjoint sets: one for training and one for testing. In this section, after explaining how the two sets were selected, we present performance results for each classifier. Since ESC detection will only be required for the 3550-3650 MHz portion of the 3.5 GHz band, the training and testing sets described below were limited to the eleven 10 MHz channels covering the range 3545-3655 MHz, with each channel centered on a multiple of 10 MHz.

### IV-A Test Set Composition

The sample of labeled spectrogram data was potentially biased in two respects. First, because the data collection was observational, at only two geographic locations for two months each, the respective proportions of different emission types did not necessarily reflect those in the whole population of possible field measurements, i.e., the distribution of 3.5 GHz emissions at all coastal locations under all conditions. Second, as mentioned in Section II, nearly 74% of the labeled spectrograms were selected for labeling because they corresponded to captures that triggered retention of a recorded waveform. Consequently, the set of labeled data suffered from a selection bias that resulted in a disproportionate number of labeled spectrograms with high-amplitude emissions.

Due to the above potential biases in the labeled data set, and due to the need for sufficient testing of important sub-groups, a stratified sampling approach was utilized to construct the test set. Specifically, a test set was randomly selected from the set of labeled data with approximately equal proportions of spectrograms across emission categories (SPN-43, Radar 3 OOBE, Both SPN-43 and Radar 3 OOBE, Neither), measurement locations (Virginia Beach and San Diego), and antenna type (Omni-directional and CBS). In addition, the maximum number of spectrograms containing multiple SPN-43 emissions was included. Table I shows the proportions for each category in the most general test set, denoted Test Set A. Note that the proportions are not exactly equal because the random test set generation program had to satisfy a hierarchy of preferences that did not typically lead to a perfect solution. Also, observe that roughly 50% of the cases contained SPN-43 and 50% of the cases did not contain SPN-43. To assess classifier performance without the presence of Radar 3 OOBE, a subset of Test Set A, called Test Set B, was used; see Table I.

Table I: Composition of Test Sets A and B (R3-OOBE: Radar 3 OOBE; VB: Virginia Beach; SD: San Diego).

Set | Total | Multi SPN-43 | Emissions: SPN-43 | Emissions: R3-OOBE | Emissions: Both | Emissions: Neither | Location: VB | Location: SD | Antenna: Omni | Antenna: CBS |
---|---|---|---|---|---|---|---|---|---|---|
A | 509 | 109 | 24.56% | 24.36% | 26.72% | 24.36% | 51.28% | 48.72% | 48.92% | 51.08% |
B | 249 | 40 | 50.20% | 0% | 0% | 49.80% | 50.20% | 49.80% | 50.20% | 49.80% |

### IV-B Training

As stated in Section III, the deep learning classifiers were first trained on 10 MHz channels for single-channel detection and then connected in parallel for multichannel detection over the full spectrogram. The training set consisted of 10 MHz channels extracted from labeled spectrograms not in Test Set A. A total of 4,285 channels were randomly selected for training, where half contained SPN-43 and half did not.

To assess the sensitivity of the CNN and LSTM training to weight initialization, we trained both algorithms 100 times with different, randomly-generated Xavier initializations [18]. Each training instance was run using the same training set for 1,000 epochs (an epoch consists of one pass through all training examples). Figure 3 summarizes the performance of the CNN and LSTM on Test Set A over the 100 training initializations. In this figure, multichannel detection performance for each training initialization is summarized with an empirical free-response receiver operating characteristic (FROC) curve, shown in light gray. Appendix B reviews FROC curves, which can be used to summarize multichannel detection performance. The minimum and maximum bounds over all 100 FROC curves are shown in bold. Note that these bounds are not necessarily the same as any individual FROC curve.

From the plots in Figure 3, we can draw three conclusions. First, the spread of FROC curves across initializations is much tighter for the CNN than for the LSTM. Second, the best LSTMs and CNNs performed similarly. Last, these plots emphasize the necessity of testing classifier performance over multiple training initializations before settling on a particular set of weights. The results presented below were generated using the CNN and LSTM with the largest area under the FROC curve (FROC-AUC).

### IV-C Performance Evaluation

Single-channel and multichannel detection performance was assessed using receiver operating characteristic (ROC) and FROC curves, respectively. See Appendix B for an introduction to ROC and FROC curves. In the single-channel ROC evaluations, the eleven 10 MHz channels covering 3550-3650 MHz in each spectrogram were tested, and the results were aggregated on a per-channel basis. On the other hand, the multichannel FROC evaluations holistically assessed detection performance over the entire 3550-3650 MHz frequency range by aggregating detection results on a per-spectrogram basis.

Figures 4 and 6 show empirical ROC and FROC curves for the CNN, LSTM, and energy detection on Test Set A. From these plots, we see that the CNN and LSTM decidedly outperformed energy detection, with the CNN performing slightly better than the LSTM. Table II gives estimates of the area under the curve (AUC) for the Test Set A ROC and FROC curves. In this table, FROC-AUC was normalized to make it a number between 0 and 1; the normalization factors for Test Sets A and B were 10.28 and 10.36, respectively. See Appendix B for a discussion of our rationale for FROC-AUC normalization. The AUC point estimates were computed nonparametrically as the area under the empirical ROC and FROC curves. In the ROC-AUC case, this is mathematically equivalent to the normalized Mann-Whitney U statistic [21]. Ninety-five percent confidence intervals for ROC-AUC were estimated using the nonparametric method of DeLong et al. [22] together with the logit transformation method recommended by Pepe [21, p. 107]. Also, ninety-five percent confidence intervals for FROC-AUC were estimated using the percentile bootstrap method [23], where the bootstrapping was stratified to maintain the proportions in Table I. Note that because the classifiers were applied to the same test set, the confidence intervals are correlated.
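The equivalence between the area under the empirical ROC curve and the normalized Mann-Whitney U statistic can be checked numerically. The sketch below is illustrative only; the function names are hypothetical, and ties between scores are counted with weight one half.

```python
import numpy as np


def roc_auc_mann_whitney(scores_pos, scores_neg):
    """Normalized Mann-Whitney U: P(pos > neg) + 0.5 * P(pos == neg)."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (pos.size * neg.size)


def roc_auc_trapezoid(scores_pos, scores_neg):
    """Area under the empirical ROC curve obtained by sweeping the threshold."""
    pos = np.asarray(scores_pos, dtype=float)
    neg = np.asarray(scores_neg, dtype=float)
    thresholds = np.unique(np.concatenate([pos, neg]))[::-1]   # high to low
    tpr = np.concatenate([[0.0], [(pos >= t).mean() for t in thresholds]])
    fpr = np.concatenate([[0.0], [(neg >= t).mean() for t in thresholds]])
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

Both functions return the same value for any pair of score samples, which is the property used to interpret ROC-AUC as the probability that a randomly chosen SPN-43-present channel scores higher than a randomly chosen SPN-43-absent channel.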

The detection performance for Test Set B, a proper subset of Test Set A that excluded Radar 3 OOBE, is summarized in Figure 5, Figure 7, and Table III. Comparing these results to those for Test Set A, we see that the removal of Radar 3 OOBE only yielded a slight improvement in energy detection for low false-positive rates. To elucidate this finding, Figure 8 shows estimated probability density functions (PDFs) for the power in each 660 kHz-wide channel used by the energy detector for SPN-43-absent and SPN-43-present cases. Each PDF was estimated using the kernel density estimation method with a Gaussian kernel and a bandwidth of one. The peaks in the SPN-43-absent distributions correspond to the receiver noise floor, which varied with measurement location and receiver reference level [6, 7, 24]. From these plots, we see that the SPN-43-absent PDF for Test Set A has a fatter tail between -85 dBm and -70 dBm than for Test Set B, which is consistent with Radar 3 OOBE being present in set A. However, there is only a very slight shift in the peaks of each distribution for sets A and B. The very small differences between the distributions for sets A and B help to explain the observed lack of improvement in energy detection performance on Test Set B.
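For reference, a Gaussian kernel density estimate of the kind described above can be written in a few lines. This sketch is illustrative and the function name is hypothetical; the bandwidth of one is assumed to be in the same units as the power values.

```python
import numpy as np


def gaussian_kde_pdf(samples, grid, bandwidth=1.0):
    """Kernel density estimate with a Gaussian kernel, evaluated on a grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.mean(axis=1) / bandwidth   # average of kernels centered on samples
```

Evaluating the estimate separately for SPN-43-absent and SPN-43-present power values gives the two curves that are compared for each test set.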

Table II: AUC point estimates and 95% confidence intervals for Test Set A.

Classifier | ROC-AUC | FROC-AUC |
---|---|---|
CNN | .997, [.993, .998] | .997, [.994, .998] |
LSTM | .988, [.978, .994] | .987, [.975, .996] |
ED | .854, [.819, .883] | .809, [.760, .855] |

Table III: AUC point estimates and 95% confidence intervals for Test Set B.

Classifier | ROC-AUC | FROC-AUC |
---|---|---|
CNN | .997, [.992, .999] | .993, [.984, .997] |
LSTM | .994, [.987, .998] | .983, [.963, .995] |
ED | .852, [.796, .895] | .751, [.680, .823] |

### IV-D Detection Examples

To gain further insight into classifier performance, we examined spectrograms in which there was no consensus between the three classifiers. Three notable examples, denoted Example 1, Example 2, and Example 3, respectively, are shown in Figure 9. For the discussion below, each of the classifiers was applied with a decision threshold corresponding to a false-positive rate of 0.05 on Test Set A.

Example 1, shown in Figure 9 (

## V Applications

Distributions estimated from field measurements for SPN-43 spectrum occupancy and SPN-43-absent power density are potentially informative to both federal regulators and commercial industry, as they may be relevant to ESC requirements [25, 26] and ESC development efforts. Namely, occupancy distributions may be relevant to a requirement that the channel be vacated for a fixed time-interval after incumbent signals have been detected [25]. Distributions of SPN-43-absent power may be relevant to ESC developers since some detection strategies may result in unacceptably high false-alarm rates for channels with higher levels of non-SPN-43 emissions. In addition, field observations of ambient power levels are relevant to ESC certification testing [26], since they could inform selection of background noise levels.

As explained in Section II, a total of 14,739 spectrograms were collected in San Diego and Virginia Beach, of which 4,491 were human-labeled for SPN-43 presence. In this section, we describe how the best-performing classifier from Section IV-C, the CNN, was applied to classify the unlabeled spectrograms for SPN-43 presence and how the complete set of spectrograms was then used to estimate distributions for SPN-43 spectrum occupancy and for power density when SPN-43 was absent. To classify unlabeled spectrograms for SPN-43 presence, we chose a decision threshold to convert the continuous-valued CNN output for each 10 MHz channel into a binary prediction. Below, we give details on the selected decision threshold, which was different for each application to accommodate dissimilar preferences between true-positive and false-positive rates.

The findings given here are a partial selection from a larger set of descriptive statistics that will be provided in a forthcoming technical report [24]. It should be emphasized that because these results are derived from spectrum observations at only two geographic locations for two months each, the reader should be careful not to draw overly-general conclusions.

### V-A Channel Occupancy Statistics

For the goal of estimating SPN-43 channel occupancy, we chose the CNN decision threshold to control the false-positive rate. Specifically, the decision threshold was selected to correspond to a false-positive rate of on Test Set A; this operating point corresponds to a true-positive rate of . Note that false-positives lead to a positive bias in occupancy estimates and a negative bias in vacancy estimates. Thus, because we controlled the false-positive rate, our occupancy and vacancy estimates are conservative and liberal, respectively.

As stated in Section II, spectrograms were collected roughly every ten minutes. This sampling interval was not exact due to hardware restrictions, like the rate at which data was saved to disk. Despite this fact, to simplify our estimates of vacancy and occupancy time-intervals, we assumed that the captures were exactly ten minutes apart. To calculate the length of time a 10 MHz channel was either occupied by SPN-43 or vacant, we ordered the spectrograms by their capture time-stamp and then counted the number of consecutive vacant and occupied observations. The counts were then multiplied by 10 minutes to estimate durations. Note that this approach could not resolve changes in SPN-43 occupancy that occurred less than 10 minutes apart.
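The counting procedure can be sketched as a run-length computation over the time-ordered channel decisions. The function name is hypothetical, and the input is assumed to be already sorted by capture time-stamp.

```python
import numpy as np


def run_durations(occupied, minutes_per_capture=10):
    """Durations (in minutes) of consecutive occupied and vacant runs.

    occupied: boolean sequence for one channel, ordered by capture time-stamp.
    Returns two lists: occupied-run durations and vacant-run durations.
    """
    occupied = np.asarray(occupied, dtype=bool)
    if occupied.size == 0:
        return [], []
    # Indices where the occupancy state changes mark the start of a new run.
    change = np.flatnonzero(np.diff(occupied.astype(int))) + 1
    starts = np.concatenate([[0], change])
    ends = np.concatenate([change, [occupied.size]])
    occ, vac = [], []
    for s, e in zip(starts, ends):
        (occ if occupied[s] else vac).append(int(e - s) * minutes_per_capture)
    return occ, vac
```

For instance, the decision sequence vacant, occupied, occupied, occupied, vacant, vacant, occupied, vacant yields occupied durations of 30 and 10 minutes and vacant durations of 10, 20, and 10 minutes.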

Figure 10 shows histograms of occupied and vacant time-intervals measured in minutes for the 10 MHz channel centered at 3550 MHz in San Diego. Specifically, the occupancy histogram lists the number of time-intervals for which SPN-43 was continuously-observed for the specified duration. For example, the 3550 MHz channel was continuously occupied for 30-40 minutes nine times during the two-month measurement period in San Diego. Similarly, the vacancy histogram lists the number of time-intervals for which SPN-43 was not present for the specified duration, e.g., there were ten SPN-43 vacancies with durations of 50-60 minutes. Only time-intervals below 120 minutes are shown. Of the observed time-intervals, 24 occupancies and 138 vacancies exceeded 120 minutes.

To gain a better understanding of how often a channel was occupied, we estimated the occupancy ratio, i.e., the amount of time the channel was occupied by SPN-43 divided by the total observation time. Table IV lists the estimated occupancy ratio for channels where SPN-43 was observed in San Diego and Virginia Beach, respectively.

Table IV: Estimated SPN-43 occupancy ratio by measurement site and channel.

Measurement Site | Channel Center (MHz) | Occupancy Ratio |
---|---|---|
San Diego | 3520 | |
San Diego | 3550 | |
San Diego | 3600 | |
Virginia Beach | 3570 | |
Virginia Beach | 3600 | |
Virginia Beach | 3630 | |

### V-B Power Density Distributions

For the aim of estimating the distribution of power density when SPN-43 was absent, we chose the CNN decision threshold to control the number of missed SPN-43 detections (false-negatives). Specifically, the decision threshold was selected to correspond to a true-positive rate of on Test Set A; this operating point corresponds to a false-positive rate of . Because missed detections lead to the inclusion of SPN-43 emissions in estimates of the SPN-43-absent power density, and because the power density of SPN-43 emissions is typically quite high, missed detections are expected to add a positive bias. Therefore, to avoid such a bias, we controlled the rate of missed detections with the potential expense of additional false-positives, which shrank our sample size for SPN-43-absent observations.

After classification of the unlabeled spectrograms, the channels found to contain SPN-43 were discarded, and the empirical cumulative distribution function (CDF) for the power density was estimated from the set of spectrogram values (converted to dBm/MHz) in the 220 kHz-wide frequency-bin at the center of each 10 MHz channel (the expected location for SPN-43). Figure 11 shows examples of the empirical complementary CDF (CCDF), equal to one minus the CDF, for the SPN-43-absent power density for frequencies where SPN-43 was observed in San Diego and Virginia Beach. These plots can be used to quickly read off percentiles associated with the upper tails of each distribution. Namely, the 90th and 99th percentiles correspond to the power density values where the CCDF is equal to 0.1 and 0.01, respectively. Nonparametric simultaneous 95% confidence bands were estimated for the empirical CCDFs using a method based on the Dvoretzky-Kiefer-Wolfowitz inequality [27, Thm. 7.5]. The steep decrease in the CCDFs corresponds to the noise floor of the receiver, which varied with measurement location and receiver reference level [6, 7, 24]. The tail heaviness indicates the prevalence of non-SPN-43 emissions, such as Radar 3 OOBE.
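An empirical CCDF with a simultaneous confidence band can be computed as sketched below. The band uses the commonly stated form of the Dvoretzky-Kiefer-Wolfowitz bound, with half-width $\sqrt{\ln(2/\alpha)/(2n)}$; this is assumed to match the cited theorem, and the function names are hypothetical.

```python
import numpy as np


def empirical_ccdf_with_band(samples, alpha=0.05):
    """Empirical CCDF at the sorted samples with a simultaneous DKW band."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    ccdf = (n - np.arange(1, n + 1)) / n             # fraction of samples above x[i]
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))   # DKW half-width
    lower = np.clip(ccdf - eps, 0.0, 1.0)
    upper = np.clip(ccdf + eps, 0.0, 1.0)
    return x, ccdf, lower, upper


def percentile_from_ccdf(x, ccdf, tail_prob):
    """Smallest sample value at which the CCDF is at or below tail_prob."""
    return x[np.argmax(ccdf <= tail_prob)]
```

Reading off the values where the CCDF equals 0.1 and 0.01 then gives the 90th and 99th percentiles, respectively. Note that the band half-width depends only on the sample size and the confidence level, so it is the same at every power density value.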

## VI Summary and Discussion

Accurate detection of radar in the 3.5 GHz band with coastal sensors is needed to both protect federal incumbent systems and to enable economical commercial utilization of the band. In this paper, we investigated the effectiveness of two deep learning algorithms, a CNN and an LSTM, for detection of SPN-43 radar in spectrograms. The algorithms were trained and tested with a set of nearly 4,500 unverified, human-labeled spectrograms collected at two coastal locations. Our evaluations demonstrated that deep learning methods offer much better detection performance than conventional energy detection. Finally, we applied the best-performing detection method in our study, the CNN, to classify over 10,000 unlabeled spectrograms and used the human and machine-labeled spectrograms to estimate occupancy statistics and SPN-43-absent power density distributions.

A standard practice in machine learning is to use separate labeled data sets for training, validation, and testing. Although there are many variations on this basic premise, such as the holdout method or k-fold cross validation, all essentially follow the same form. The training set is used to fit the model, the validation set is used to determine when to stop training and avoid over-fitting, and the test set is used to assess classifier generalization. Utilization of a validation set is a very effective strategy for determining the optimal amount of training a model needs to generalize well. Unfortunately, as discussed in Section II, our labeled data set suffered from a selection bias that limited the number of cases available to construct training, validation, and test sets with sufficient representation of the subgroups listed in Table I. Therefore, to ensure that our training and testing sets contained enough cases for each relevant subgroup, we chose to not use a separate validation set. Instead, to avoid over-fitting, we stopped training after a fixed number of epochs, and trained 100 separate models with different training initializations. It is possible that the variability between different training initializations would be much smaller if a validation set was used. Still, it is important to note that the initialization of weights within a network can have a large impact on the final model. Despite variations due to training initialization, the CNN always outperformed energy detection by a wide margin, whereas the LSTM underperformed energy detection for a small number of initializations.

In addition to demonstrating the superiority of deep learning methods to energy detection, our evaluations indicate the minimum level of performance that can potentially be expected from an ESC detector. However, our test set was constructed so that challenging cases, e.g., multiple SPN-43s and Radar 3 OOBE, were well-represented, whereas the proportions of such observations in the field may be very different. Also, our data set only contained captures collected at two geographic locations for two months each. Therefore, the absolute detection performance levels observed in our tests may not be the same as those for a real-world ESC deployed in the field.

The excellent detection performance of the CNN allowed us to estimate spectrum occupancy statistics and power distributions for non-SPN-43 emissions with a much higher accuracy than would have been otherwise possible. In particular, using energy detection to classify the unlabeled data would have resulted in many more false-positives and missed detections at a given decision threshold, which would lead to biased estimates. Namely, false-positives lead to overestimation of spectrum occupancy, whereas missed detections result in an overestimation of spectrum vacancy and add a positive bias to non-SPN-43 power distributions. As explained in Section V, occupancy statistics and power distributions may have value to both ESC developers and spectrum regulators. A complete set of occupancy/vacancy statistics and power distributions for each channel in the 3.5 GHz band will be provided in a forthcoming technical report [24].

## Appendix A Elements of Deep Learning Architectures

In this appendix, we review elements present in our deep neural network implementations. For additional details on deep learning methods, refer to the textbook by Goodfellow et al. [12].

### A-A Convolutional Neural Network

A classic CNN has three common substructures: the convolutional layer, the pooling layer, and the fully connected layer. Multiple instances of these layers can be stacked and arranged in any number of permutations. For detailed definitions and further information on CNNs, see, e.g., [28, 29].

#### Convolutional Layer

Convolutional layers are designed to recognize shift-invariant features. As input, the layer accepts a three-dimensional array. Each layer is made up of filters, each of which is represented by an array with variable height and width but a fixed depth. The depth must match the depth of the input.

To generate the output, every subsection of the input that matches the filter size is extracted, and the dot product between each subsection and each filter is computed. The resulting three-dimensional output is called the activation map. It can be loosely interpreted as indicating how strongly each filter’s learned pattern is present at each point in the input.

Besides the filter size, the number of filters is variable. Larger step sizes (strides) can be used to skip subsections. Additionally, zero-padding can be applied to the input edges so that the filters can activate on the edges or to enlarge the output dimensions.
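The operations described above can be sketched in a few lines of pure Python. This is only an illustration under simplifying assumptions, not the implementation evaluated in this paper: the function name `conv2d`, its `stride` and `pad` parameters, and the nested-list data layout (height, width, depth) are all hypothetical choices made for readability.

```python
def conv2d(x, filters, stride=1, pad=0):
    """Illustrative convolutional layer (no bias, no activation).

    x:       input as nested lists with shape H x W x D
    filters: list of filters, each with shape fh x fw x D
    Returns an activation map with shape out_h x out_w x len(filters).
    """
    depth = len(x[0][0])
    if pad > 0:
        # Zero-pad the height and width dimensions of the input.
        zero = [0.0] * depth
        blank = [zero] * (len(x[0]) + 2 * pad)
        x = ([blank] * pad
             + [[zero] * pad + row + [zero] * pad for row in x]
             + [blank] * pad)
    height, width = len(x), len(x[0])
    fh, fw = len(filters[0]), len(filters[0][0])
    out = []
    for i in range(0, height - fh + 1, stride):
        out_row = []
        for j in range(0, width - fw + 1, stride):
            # Dot product between each filter and the current input patch.
            out_row.append([
                sum(f[a][b][c] * x[i + a][j + b][c]
                    for a in range(fh) for b in range(fw) for c in range(depth))
                for f in filters
            ])
        out.append(out_row)
    return out
```

For a 3 x 3 x 1 input and a single 2 x 2 x 1 filter, the default settings give a 2 x 2 x 1 activation map, `stride=2` gives a 1 x 1 x 1 map, and `pad=1` gives a 4 x 4 x 1 map.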

#### Pooling Layer

Pooling layers perform down-sampling. In the max-pooling layer, as the window steps through the input, the maximum activation within the window is kept as output. Average pooling is another type of pooling in which the average is kept instead. These layers have variable window heights and widths.
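A minimal sketch of both pooling variants, assuming a single-channel two-dimensional input and a square window, is given below. The function name `pool2d` and its parameters are hypothetical and chosen only for illustration.

```python
def pool2d(x, size=2, stride=2, mode="max"):
    """Illustrative pooling over a 2D nested list with a square window.

    mode="max" keeps the maximum in each window; any other value
    keeps the average instead.
    """
    if mode == "max":
        reduce_window = max
    else:
        def reduce_window(window):
            return sum(window) / len(window)
    out = []
    for i in range(0, len(x) - size + 1, stride):
        out.append([
            reduce_window([x[i + a][j + b]
                           for a in range(size) for b in range(size)])
            for j in range(0, len(x[0]) - size + 1, stride)
        ])
    return out
```

With a 2 x 2 window and a stride of 2, a 4 x 4 input is reduced to a 2 x 2 output, halving each dimension.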

#### Fully Connected Layer

Fully connected layers learn to recognize high-level features at the tail end of a network. The expected input to a fully connected layer is a column vector. The layer consists of a weight matrix, and its output is calculated as the product of this matrix with the input vector.

### A-B Long Short-Term Memory

LSTMs consist of sequential cells. Each cell takes the previous cell’s output and a single step from a sequence as input. The former is called the cell state. Each cell produces a prediction via a set of gates that control how information is processed within the cell.

The forget gate removes information from the previous cell state. The input gate controls which information from the sequence step is added to the cell state. The output gate controls which information from the new cell state is passed on as the prediction. All of these gates are essentially trainable fully-connected layers. The actual cell state and prediction are calculated using element-wise products between different gate outputs.
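To make the role of each gate concrete, the following sketch implements one step of a standard LSTM cell in the commonly used formulation with a forget gate. It is an illustration only and does not include the residual connections used in our networks. The names `lstm_step`, `affine`, and the layout of the `params` dictionary are assumptions introduced here. In this formulation, each gate is a fully connected layer acting on the concatenation of the previous prediction and the current sequence step, and the gate outputs are combined through element-wise products.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Fully connected layer: matrix-vector product plus bias."""
    return [sum(w * vi for w, vi in zip(row, v)) + bi
            for row, bi in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One step of a standard LSTM cell (illustrative).

    params maps "forget", "input", "candidate", and "output" to a
    (weights, biases) pair acting on the concatenation [h_prev, x].
    Returns the new prediction h and the new cell state c.
    """
    v = h_prev + x  # list concatenation
    f = [sigmoid(z) for z in affine(*params["forget"], v)]
    i = [sigmoid(z) for z in affine(*params["input"], v)]
    g = [math.tanh(z) for z in affine(*params["candidate"], v)]
    o = [sigmoid(z) for z in affine(*params["output"], v)]
    # Forget part of the old state, then add gated new information.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # The output gate selects what is exposed as the prediction.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c
```

Processing a full sequence amounts to calling `lstm_step` once per sequence step, feeding the returned `h` and `c` into the next call.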

One LSTM modification we implemented was residual connections. The prediction of a cell was added to the previous prediction. As a result, the size of the LSTM was controlled by the size of our input. Generally, LSTMs have variable size.

### A-C Additional Concepts

#### Activation Functions

An activation function is a nonlinear transformation applied to the outputs of some neural layer to control the flow of information through a network. A commonly-used activation function is the rectified linear unit (ReLU), which applies the transformation $f(x) = \max(0, x)$ to a layer output, $x$, mapping it to the range $[0, \infty)$. ReLU activation functions have been shown to improve the speed of convergence in training neural networks [9].

Another common activation function is the sigmoid function, defined as $\sigma(x) = 1/(1 + e^{-x})$. The sigmoid function maps a layer output to the range $(0, 1)$.
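Both activation functions are simple to state in code. The sketch below uses the standard definitions, with ReLU returning the larger of zero and its input and the sigmoid computed as one over one plus the exponential of the negated input. The branch in `sigmoid` is an implementation detail assumed here to avoid floating-point overflow for large negative inputs.

```python
import math

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + exp(-x)), evaluated stably."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # safe: x < 0, so exp(x) cannot overflow
    return e / (1.0 + e)
```

Note that in floating-point arithmetic the sigmoid can round to exactly zero or one for inputs of very large magnitude, even though the mathematical function never attains those values.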

#### Batch Normalization

When training a single neural layer, stochastic gradient descent and its variants assume all other layers remain constant. In reality, all layers are modified simultaneously, leading to mismatches called internal covariate shift. Batch normalization forces a layer’s output to match a desired distribution, somewhat alleviating this covariate shift [30].
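As a rough sketch of the training-time computation, the function below normalizes a batch of scalar activations to zero mean and approximately unit variance, then applies a scale and shift. The names `gamma`, `beta`, and `eps` follow common convention but are assumptions here, and the running statistics used at inference time are omitted for brevity.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Illustrative training-time batch normalization of scalar activations.

    gamma and beta are the trainable scale and shift; eps guards
    against division by zero when the batch variance is tiny.
    """
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]
```

With the default `gamma` and `beta`, the output has zero mean and a variance slightly below one because of `eps`.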

#### Dropout

Dropout is a process by which each value in a forward pass through a neural network has a random chance of being set to zero. This process reduces reliance on specific features as they may be dropped out. The resulting model learns better generalizations rather than specific features from the training data [31].
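The following sketch illustrates dropout on a list of activations. It assumes the common "inverted" convention, in which kept values are rescaled during training so that no adjustment is needed at test time; this convention, the function name `dropout`, and the explicit random generator argument are illustrative assumptions rather than details of our implementation.

```python
import random

def dropout(values, rate, rng, training=True):
    """Illustrative inverted dropout.

    Each value is set to zero with probability `rate`; kept values are
    divided by (1 - rate) so the expected output matches the input.
    """
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

For example, `dropout(activations, 0.5, random.Random(0))` zeroes roughly half of the entries and doubles the rest, whereas passing `training=False` returns the activations unchanged.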

#### Cross-Entropy Loss

Cross-entropy loss calculates the difference between a classifier’s predictions and the known labels of some input. During the training phase, using the back-propagation algorithm, the network’s weights are adjusted in a direction that will decrease the cross-entropy loss. For binary labels, the cross-entropy loss function has the following form:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i \log \hat{y}_i + (1 - y_i)\log\left(1 - \hat{y}_i\right) \right] \qquad (1)$$

where $N$ is the number of training examples, $\hat{y}_i$ is the prediction generated from the network on the set of inputs, $x_i$, and $y_i$ is the known label for each training example.
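As a small numerical illustration, the sketch below evaluates the binary form of cross-entropy loss: the negative average, over training examples, of the log-probability that the network assigns to the known label. The binary form, the function name, and the clipping constant `eps` are assumptions made for this example.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average binary cross-entropy between labels (0 or 1) and
    predicted probabilities. Predictions are clipped to [eps, 1 - eps]
    so that the logarithm is always finite."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)
```

A classifier that outputs 0.5 for every example incurs a loss of log 2, about 0.693, while confident and correct predictions drive the loss toward zero.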

## Appendix B ROC and FROC Curves

We present a brief introduction to two types of graphical plots that can be used to evaluate detection performance: the receiver operating characteristic (ROC) curve and a related generalization, called the free-response ROC (FROC) curve. Further background on ROC curves can be found in [32, 33, 21] and details on FROC curves are given in [34, 35]. Although ROC curves are commonly utilized in signal processing and machine learning, FROC curves are lesser-known, since they have been primarily applied in radiology to evaluate lesion detection performance. In the context of multichannel spectrum sensing for cognitive radio, a notable application of FROC curves to classifier performance evaluation is the work of Collins and Sirkeci-Mergen [36].

### B-A Binary Signal Detection: The ROC Curve

For a binary signal detection task, the aim is to use a data observation to decide whether or not a signal is present. Each decision results in one of four possible outcomes: true-positive (TP), false-positive (FP), true-negative (TN), or false-negative (FN). These outcomes give rise to four conditional probabilities (or rates). In the engineering literature, the TP rate, FN rate, and FP rate are commonly called the “detection”, “miss” and “false-alarm” probabilities, respectively. For a given decision threshold, binary classification performance is fully described by the FP and TP rates. Namely, the FN rate is equal to one minus the TP rate, and the TN rate is one minus the FP rate.

A useful way to summarize detection performance is the ROC curve, defined as the plot of TP rate versus FP rate, over all decision thresholds [32, 21]. An example of an ROC curve is shown in Figure 12 (left). When comparing ROC curves, better classifier performance is indicated by a higher curve that is closer to the upper left corner. Namely, for a perfect classifier, there exists a threshold where the TP rate is equal to one with an FP rate of zero. By contrast, for a useless classifier, the ROC curve is equal to or below the diagonal dashed “chance line” shown in Figure 12 (left) for which the TP rate is equal to the FP rate at all decision thresholds [21].

ROC curves possess three properties that make them particularly useful. First, they fully characterize binary classifier performance over all decision thresholds, which enables evaluation and comparison of classifiers that may be deployed at various operating points (thresholds) [33]. Second, ROC curves are invariant under strictly-increasing transformations of the decision variable [21]. Thus, classifiers with decision variables on different ordinal scales can be compared via ROC curves. Third, ROC curves are independent of signal prevalence [33]. This implies that ROC curves can be used to assess classifiers that may be deployed in environments with different signal prevalences.

A commonly-used summary measure for binary classification performance is the area under the ROC curve (ROC-AUC). ROC-AUC takes values between zero and one, with higher values indicating better performance. ROC-AUC can be interpreted as the average TP rate, averaged uniformly over all FP rates. Alternatively, ROC-AUC can be interpreted as a probability. Namely, given randomly-selected signal-absent and signal-present cases, ROC-AUC is the probability that the signal-present case is rated higher [21, Sec. 4.3].

In this paper, we use the so-called “empirical” nonparametric estimators for the ROC curve and its area. For details on these estimators, see [21, Sec. 5.2]. In addition to having simple implementations, the empirical estimators are unbiased.
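For readers unfamiliar with the empirical estimators, the following sketch shows one straightforward way to compute them from classifier scores for signal-absent and signal-present cases. The function names and the convention that a case is called positive when its score is greater than or equal to the threshold are assumptions made here for illustration. The area estimate uses the probabilistic interpretation of ROC-AUC, counting ties as one half.

```python
def empirical_roc(neg_scores, pos_scores):
    """Empirical ROC curve as a list of (FP rate, TP rate) points,
    obtained by sweeping the threshold over every observed score."""
    thresholds = sorted(set(neg_scores) | set(pos_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        points.append((fpr, tpr))
    return points

def empirical_auc(neg_scores, pos_scores):
    """Fraction of (signal-present, signal-absent) pairs in which the
    signal-present case is rated higher, with ties counted as 1/2."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

Applying `trapezoid_area` to the output of `empirical_roc` gives the same value as `empirical_auc`, which reflects the equivalence of the two interpretations of ROC-AUC given above.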

### B-B Multiple Signal Detection and Localization: The FROC Curve

The FROC curve [34, 35] is a generalization of the ROC curve designed to summarize classifier performance for a combined detection and localization task in which multiple detection decisions are made for each observation. An example of such a task arises in multichannel spectrum sensing, where the aim is to detect one or more signals and localize them in frequency. After specifying a criterion for correct signal localization, it is possible to determine if a detection result is a correctly-localized TP or a FP. Suppose that each correctly-localized TP detection occurs with the same probability, called the signal-detection fraction. The FROC curve is defined as the plot of signal-detection fraction versus the mean number of false-positives per observation, plotted over all decision thresholds; an example FROC curve is shown in Figure 12. Like the ROC curve, an FROC curve closer to the upper left corner of the graph indicates better classifier performance. To estimate the FROC curve, we use the usual empirical estimator [34].

When the number of detection decisions is bounded, the abscissa (x-axis) of the FROC curve is bounded, and the area under the FROC curve (FROC-AUC) is a well-defined summary measure. For example, multichannel spectrum sensing typically aims to assess spectrum occupancy for a fixed number of frequency channels. Because the maximum abscissa value for an empirical FROC curve depends on the maximum number of possible FP decisions in the test set, the maximum empirical FROC-AUC can be different for dissimilar test sets. Therefore, in this paper, to enable straightforward comparisons between test sets, we normalize FROC-AUC values to fall between zero and one. The normalization factor depends on the maximum number of possible FP decisions in the test set.
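The sketch below illustrates one way to compute an empirical FROC curve and a normalized area for the multichannel setting, in which each observation yields one score per channel and a decision on a signal-present channel counts as a correctly-localized TP. The data layout, the function names, and the specific normalization, which divides the area by the largest attainable mean number of FPs per observation, are assumptions made for this example and may differ in detail from the estimator used in our evaluation.

```python
def empirical_froc(scores, labels):
    """Empirical FROC curve for multichannel detection.

    scores, labels: one list per observation with one entry per channel;
    a label of 1 marks a channel that contains a signal.
    Returns (mean FPs per observation, signal-detection fraction) points.
    """
    n_obs = len(scores)
    flat = [(s, l) for obs_s, obs_l in zip(scores, labels)
            for s, l in zip(obs_s, obs_l)]
    n_signals = sum(l for _, l in flat)
    thresholds = sorted({s for s, _ in flat}, reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(1 for s, l in flat if l == 1 and s >= t)
        fp = sum(1 for s, l in flat if l == 0 and s >= t)
        points.append((fp / n_obs, tp / n_signals))
    return points

def normalized_froc_auc(points):
    """Trapezoidal area under the FROC curve divided by the maximum
    abscissa, so that the result lies between zero and one."""
    max_fp = points[-1][0]
    if max_fp == 0:
        raise ValueError("test set contains no possible FP decisions")
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return area / max_fp
```

With this normalization, a classifier that rates every signal-present channel above every signal-absent channel attains a value of one regardless of how many channels the test set contains.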

### B-C When to Use Which Curve?

Because ROC and FROC curves are designed for different but related tasks, they provide complementary insights into classifier performance. In particular, ROC curves focus solely on signal detection for a single decision, regardless of signal localization. Thus, for the problem of multichannel spectrum sensing, ROC curves are best suited to low-level assessment of classifier performance for a single channel. Such evaluations may be particularly useful for classifier development. On the other hand, FROC curves assess both detection and signal localization when multiple decisions must be made. For this reason, they are better suited to classifier performance evaluation for the full multichannel spectrum sensing task. If FROC curves are not a good match to the task and associated preferences, one can consider variations of FROC curves that weight TP and FP decisions differently; for further details on FROC variants and their generalizations, see [37].

### Footnotes

- Certain commercial software and hardware products are identified to fully specify our implementation. This does not imply endorsement by the National Institute of Standards and Technology or that the software and hardware are the best available for the purpose.

### References

- “Citizens Broadband Radio Service.” Code of Federal Regulations. Title 47, Part 96, June 2015.
- “An assessment of the near-term viability of accommodating wireless broadband systems in the 1675–1710 MHz, 1755–1780 MHz, 3500–3650 MHz, 4200–4220 MHz and 4380–4400 MHz bands.” National Telecommunications and Information Administration, Oct 2010.
- “Operation and maintenance instructions, organizational level, radar set AN/SPN-43C.” Naval Air Systems Command, Technical Manual, EE216-EB-OMI-010, vol. 1, Sept 2005.
- M. G. Cotton and R. A. Dalke, “Spectrum occupancy measurements of the 3550–3650 megahertz maritime radar band near San Diego, California,” Technical Report TR 14-500, National Telecommunications and Information Administration, Jan 2014.
- F. H. Sanders, J. E. Carroll, G. A. Sanders, and L. S. Cohen, “Measurements of selected naval radar emissions for electromagnetic compatibility analyses,” Technical Report TR 15-510, National Telecommunications and Information Administration, Oct 2014.
- P. Hale, J. Jargon, P. Jeavons, M. Lofquist, M. Souryal, and A. Wunderlich, “3.5 GHz radar waveform capture at Point Loma,” Technical Note 1954, National Institute of Standards and Technology, May 2017.
- P. Hale, J. Jargon, P. Jeavons, M. Lofquist, M. Souryal, and A. Wunderlich, “3.5 GHz radar waveform capture at Fort Story,” Technical Note 1967, National Institute of Standards and Technology, October 2017.
- H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in Proc. 26th Int. Conf. Machine Learning, pp. 609–616, ACM, 2009.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. 25th Int. Conf. Neural Information Processing Systems, pp. 1097–1105, 2012.
- S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
- Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015.
- I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
- H. Urkowitz, “Energy detection of unknown deterministic signals,” Proc. IEEE, vol. 55, no. 4, pp. 523–531, 1967.
- M. A. Abdulsattar and Z. A. Hussein, “Energy detection technique for spectrum sensing in cognitive radio: a survey,” Int. J. Computer Networks & Communications, vol. 4, pp. 223–224, Sept 2012.
- S. Atapattu, C. Tellambura, and H. Jiang, Energy detection for spectrum sensing in cognitive radio. Springer, 2014.
- M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint, 2013. https://arxiv.org/abs/1312.4400.
- Y. Huang, X. Sun, M. Lu, and M. Xu, “Channel-max, channel-drop and stochastic max-pooling,” in IEEE Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 9–17, IEEE, 2015.
- X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. 13th Int. Conf. Artificial Intelligence and Statistics, pp. 249–256, 2010.
- J. Kim, M. El-Khamy, and J. Lee, “Residual LSTM: Design of a deep recurrent architecture for distant speech recognition,” arXiv preprint, 2017. https://arxiv.org/abs/1701.03360.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
- M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford Univ. Press, 2003.
- E. R. DeLong, D. M. DeLong, and D. L. Clarke-Pearson, “Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach,” Biometrics, vol. 44, pp. 837–845, Sept. 1988.
- B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. Boca Raton, FL: Chapman & Hall/CRC, 1993.
- W. M. Lees, A. Wunderlich, P. Jeavons, P. D. Hale, and M. R. Souryal, “Spectrum Occupancy and Ambient Power Distributions for the 3.5 GHz Band Estimated from Observations at Point Loma and Fort Story,” Technical Note, National Institute of Standards and Technology, 2018, under review.
- Wireless Innovation Forum, Requirements for Commercial Operation in the U.S. 3550-3700 MHz Citizens Broadband Radio Service Band, Working Document WINNF-TS-0112, Version V1.5.0, May 2018.
- F. H. Sanders, J. E. Carroll, G. A. Sanders, R. L. Sole, J. S. Devereux, and E. F. Drocella, “Procedures for laboratory testing of environmental sensing capability sensor devices,” Technical Memorandum TM 18-527, National Telecommunications and Information Administration, November 2017.
- L. Wasserman, All of statistics: A concise course in statistical inference. New York: Springer, 2004.
- Y. LeCun, K. Kavukcuoglu, and C. Farabet, “Convolutional networks and applications in vision,” in Proc. IEEE Int. Symp. Circuits and Systems, pp. 253–256, 2010.
- D. C. Ciresan, U. Meier, J. Masci, L. Maria Gambardella, and J. Schmidhuber, “Flexible, high performance convolutional neural networks for image classification,” in Proc. Int. Joint Conf. Artificial Intelligence, pp. 1237–1242, 2011.
- S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Int. Conf. Machine Learning, pp. 448–456, 2015.
- N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
- H. L. Van Trees, Detection, Estimation, and Modulation Theory, Part I. New York: John Wiley & Sons, 1968.
- C. E. Metz, “Basic principles of ROC analysis,” Semin. Nucl. Med., vol. 8, pp. 283–298, 1978.
- P. Bunch, J. Hamilton, G. Sanderson, and A. Simmons, “Free-response approach to the measurement and characterization of radiographic observer performance,” J. Appl. Photogr. Eng., vol. 4, no. 4, pp. 166–171, 1978.
- C. E. Metz, “Receiver operating characteristic analysis: A tool for the quantitative evaluation of observer performance and imaging systems,” J. Am. Coll. Radiol., vol. 3, pp. 413–422, 2006.
- S. D. Collins and B. Sirkeci-Mergen, “Localization ROC analysis for multiband spectrum sensing in cognitive radio,” in Proc. IEEE Military Communications Conf., pp. 64–67, 2013.
- A. Wunderlich, B. Goossens, and C. K. Abbey, “Optimal joint detection and estimation that maximizes ROC-type curves,” IEEE Trans. Med. Imag., vol. 35, pp. 2164–2173, Sept. 2016.