Particle Identification In Camera Image Sensors Using Computer Vision
We present a deep learning, computer vision algorithm constructed for the purposes of identifying and classifying charged particles in camera image sensors. We apply our algorithm to data collected by the Distributed Electronic Cosmic-ray Observatory (DECO), a global network of smartphones that monitors camera image sensors for the signatures of cosmic rays and other energetic particles, such as those produced by radioactive decays. The algorithm, whose core component is a convolutional neural network, achieves classification performance comparable to human quality across four distinct DECO event topologies. We apply our model to the entire DECO data set and determine a selection that achieves purity for all event types. In particular, we estimate a purity of when applied to cosmic-ray muons. The automated classification is run on the public DECO data set in real time in order to provide classified particle interaction images to users of the app and other interested members of the public.
keywords: cosmic rays, deep learning, convolutional neural network, classification, citizen science
The Distributed Electronic Cosmic-ray Observatory (DECO) vandenbroucke2015 (); vandenbroucke2016 () taps into the rapidly growing network of smartphones worldwide to detect cosmic rays and other energetic charged particles. DECO accomplishes this by way of an Android application designed to detect ionizing radiation that traverses silicon image sensors in smartphones. The resulting dataset consists of images recorded by users worldwide (Figure 1) that contain evidence of charged particle interactions. Due to the diverse ecosystem of Android phones on the market, the systematic variation in data taking conditions, and the variety of particle event morphologies, classification of DECO events presents a unique challenge. Previous efforts to classify events in the highly heterogeneous dataset have been successful in classifying some event types, but identifying a cosmic-ray muon sample with high purity has proven challenging. We present a computer vision algorithm based on a convolutional neural network for classifying DECO events.
In addition to DECO, other cosmic-ray cell phone apps including crayfis_2016 () and cr_app () could benefit from the approach described here. We presented initial results from our CNN classification in meehan2017 (). More recently, during preparation of this paper, crayfis_cnn () appeared and describes a CNN algorithm intended for use as an online cosmic-ray muon trigger.
2 DECO App
The DECO detection technique uses ionization-detecting semiconductor technology similar to that found in professional particle physics experiments such as in silicon trackers ackermann2012 (); cms2008 (). Ionizing charged particles that travel through the sensitive region (i.e. depleted region) of a phone’s image sensor are detected via the electron-hole pairs they create. The DECO app, which can be run on any Android device with Android version 2.1, is designed to be run with the camera face down or covered in order to minimize contamination from background light. While running, the app repeatedly takes long-duration (50 ms) exposures and runs them through a two-stage filter to search for potentially interesting events. This filter first searches a low-resolution image for pixels above an intensity threshold, and if passed, analyzes a high-resolution image in the same manner. The intensity is the sum of the red, green, and blue color values (RGB) for each pixel. Images that pass both filters are tagged as “events” and are automatically uploaded to a central database for offline analysis. Additionally, the app has a “minimum bias” data stream that saves one image every five minutes per device for offline calibration and noise studies. The app’s online filter is simple and efficient in order to maximize livetime, while more detailed analyses of images are performed offline. The DECO data can be browsed using a public website deco_data (), where users can perform queries using various metadata including time stamp (UTC), latitude and longitude (rounded to the nearest 0.01° for privacy), event vs. minimum bias categorization, Android phone model, and device ID.
Offline analysis of images that pass the app’s online filter begins with a contour-finding algorithm to locate clusters of bright pixels. We use the marching squares algorithm, a special case of the marching cubes algorithm lorensen1987 (); scikit-image (), to search for groups of at least 10 pixels with a minimum RGB sum of 20. These clusters of pixels are then grouped together at a higher level: any clusters within 40 pixels of one another are considered a single group. This grouping is to account for electrons which can scatter in and out of the camera sensor, creating multiple nearby clusters of pixels with distinct contours. Figure 2 shows an example of the contours found in a DECO image with this algorithm.
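The higher-level grouping step can be illustrated with a short sketch. It assumes that each cluster is already available as an array of pixel coordinates, that the separation between two clusters is the distance between their closest pixels, and that grouping is transitive (single linkage); the text does not specify these details, and the function name `group_clusters` is hypothetical.

```python
import numpy as np

def group_clusters(clusters, max_sep=40.0):
    """Merge clusters whose closest pixels lie within max_sep pixels.

    clusters: list of (n_i, 2) arrays of (row, col) pixel coordinates.
    Returns a list of groups, each a list of cluster indices.
    """
    n = len(clusters)
    parent = list(range(n))  # union-find forest over cluster indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a in range(n):
        for b in range(a + 1, n):
            # Minimum pixel-to-pixel distance between the two clusters
            # (an assumed definition of cluster separation).
            diff = clusters[a][:, None, :] - clusters[b][None, :, :]
            if np.sqrt((diff ** 2).sum(axis=-1)).min() <= max_sep:
                parent[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Two nearby clusters (e.g. an electron scattering in and out of the sensor)
# and one distant cluster.
clusters = [
    np.array([[10, 10], [11, 11]]),
    np.array([[30, 30], [31, 30]]),
    np.array([[200, 200], [201, 201]]),
]
groups = group_clusters(clusters)
```

In this toy input the first two clusters are about 27 pixels apart and are merged, while the third remains its own group.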
2.1 Event Types
There are three categories of charged particle events in the DECO dataset: tracks, worms, and spots. These are named according to the convention in groom2002 (), which categorizes events based on their morphology. Tracks are long, straight clusters of pixels in an image created by high-energy (GeV) minimum-ionizing cosmic rays. These are predominantly cosmic-ray muons at sea level and primary cosmic rays (mostly protons) above 20,000 ft altitude PDG (). Worms are named for the curved clusters of pixels caused by the meandering paths of electrons that have undergone multiple Coulomb scattering interactions. These electrons are likely the result of local radioactivity. Worms can also be seen as two or more nearby, disconnected clusters of pixels, which are the result of an electron scattering in and out of the sensitive region of the camera sensor. Spots are smaller, approximately circular clusters of pixels that can be created by various interactions. They are likely predominantly caused by gamma rays that Compton scatter to produce a low energy electron that is quickly absorbed. Spots can also be produced by alpha particles, which also have a very short range in silicon, or by cosmic rays incident normal to the sensor plane. Figure 3 shows the characteristic camera sensor response for each of the three interaction signatures detected by DECO. In addition to the three particle interaction categories, there are also events due to light in the sensor occurring when it is not sufficiently shielded, and several categories of noise: hot spots, thermal noise fluctuations, and large-scale sensor artifacts such as rows of bright pixels vandenbroucke2015 (). While non-particle events, shown in Figure 4, are not particularly of interest from an analysis standpoint, they do cause potential classification confusion.
2.2 Initial Classification Approach
Given the numerous event types, both particle and non-particle, and the increasing number of images being collected by DECO, there is a growing need for a reliable computerized event classification system. However, there are several challenges associated with characterizing the DECO dataset in a way that requires little human intervention. Due to the inhomogeneity in hardware
An initial algorithm that classified DECO events used straight cuts applied to geometric metrics that were combined to make a binary classification: track or non-track. Clusters of pixels were identified using the marching squares algorithm described in Section 2. The binary classification identified low-noise images with a single cluster of pixels, not containing any sub-clusters (i.e. evidence of an electron scattering out of the sensor plane), with a minimum area of 10 pixels, and an eccentricity greater than 0.99, where eccentricity is calculated using image moments as described in image_moments (). The last two requirements were intended to select larger, line-like events, such as tracks. This method accurately distinguished tracks from spots, but struggled to separate tracks and worms, presumably due to their similar morphology. Many worms only curve slightly and have a high eccentricity. These events are unlikely to be high-energy muons due to their curvature, but the classification based on straight cuts could not distinguish them from tracks. Fortunately, advances in the quickly developing field of machine learning offer techniques to overcome these classification challenges.
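As an illustration of the eccentricity metric, the sketch below computes it from the second-order central moments of a binary pixel mask, using the standard definition in terms of the eigenvalues of the coordinate covariance matrix. Treating the cluster as an unweighted binary mask is an assumption; the original analysis may have weighted pixels by intensity.

```python
import numpy as np

def eccentricity(mask):
    """Eccentricity of a pixel cluster from second-order central moments.

    Returns a value near 1 for line-like clusters and near 0 for
    circularly symmetric ones.
    """
    rows, cols = np.nonzero(mask)
    # Covariance of pixel coordinates = second central moments / area.
    cov = np.cov(np.vstack([cols, rows]), bias=True)
    lam_small, lam_large = np.linalg.eigvalsh(cov)  # ascending order
    if lam_large <= 0.0:
        return 0.0  # degenerate single-pixel cluster
    ratio = max(lam_small, 0.0) / lam_large
    return float(np.sqrt(max(0.0, 1.0 - ratio)))

# A straight, one-pixel-wide line and a compact square blob.
line = np.zeros((20, 20), dtype=bool)
line[10, 2:18] = True
blob = np.zeros((20, 20), dtype=bool)
blob[5:10, 5:10] = True

ecc_line = eccentricity(line)
ecc_blob = eccentricity(blob)
```

A slightly curved worm would also score close to 1 under this metric, which is the failure mode described above.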
3 Deep Learning
Deep learning is a subset of machine learning focused on building models that are capable of learning how to describe data at multiple levels of abstraction. This is achieved with a nested hierarchy of simple algorithms that when combined can form highly complex and diverse representations. At each layer of the nested hierarchy, a non-linear transformation of the previous layer’s output is typically performed, which results in the deeper layers of the model seeing a progressively more abstract representation of the original input. By learning features at multiple levels of abstraction, the model has the ability to learn complex mappings between the input and output directly from data bengio2009 (). This is particularly advantageous when dealing with higher-level abstractions that humans may not know how to explicitly describe in terms of the available input.
Deep learning models are typically constructed with four basic components in mind: (1) a specific dataset, (2) an objective function
For classification, we begin by assuming that there exists some function, $f^*$, that describes the true mapping between input vector, $\mathbf{x}$, and category, $y$, such that $y = f^*(\mathbf{x})$. In this case, the goal of the feedforward neural network is to construct a mapping, $y = f(\mathbf{x}; \boldsymbol{\theta})$, then learn which value of the parameter vector, $\boldsymbol{\theta}$, provides the best approximation between $f$ and $f^*$ goodfellow2016 (). The categorical label, $y$, is a unit vector containing all zeros, except for the index that corresponds to the $i$th category in the model, which has a value of 1. The function $f$ is typically a series of nested functions, $f(\mathbf{x}) = f^{(n)}(\cdots f^{(2)}(f^{(1)}(\mathbf{x})))$, with depth $n$, where $f^{(1)}$ corresponds to the input layer, $f^{(2)}$ through $f^{(n-1)}$ are hidden layers
Each layer consists of a specified number of units, called neurons, that each compute a weighted linear combination of the inputs followed by a non-linear function which outputs a single, real-valued input for the next layer. Traditionally, layers have a dense, fully connected structure where the output of each neuron in a given layer is connected as input to all the neurons in the next layer. In this case, the output of the $l$th layer, $\mathbf{h}^{(l)}$, has the following vector representation:

$$\mathbf{h}^{(l)} = g\left(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right) \qquad (1)$$

where $\mathbf{h}^{(l-1)}$ is the output of the previous layer’s neurons, $\mathbf{W}^{(l)}$ is a matrix of weights, $\mathbf{b}^{(l)}$ is a vector of biases, and $g$ is the non-linear function, also known as the activation function. The weights and biases constitute the model’s parameters, which are optimized during the learning process. With the exception of the output layer, the typical choice for the activation function is the rectified linear unit, or ReLU nair2010 (), defined by $g(z) = \max(0, z)$, which outputs the maximum between the input and zero. A common variant is the leaky ReLU maas2013 (), where negative inputs are not set to zero, but are instead multiplied by a small constant $\alpha$. In the output layer, the softmax function (multi-class generalization of the logistic sigmoid, see for example goodfellow2016 ()) is used to produce a multinoulli distribution representing the probability that input $\mathbf{x}$ belongs to each of the $K$ different categories represented in the model. The category with the greatest probability is generally taken to be the classification; however, specific threshold cuts for each category can also be used.
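The forward pass through fully connected layers can be sketched in a few lines of NumPy. The layer sizes, the random weights, and the leaky ReLU slope of 0.01 are placeholders chosen for illustration and are not the values used in this work.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: negative inputs are scaled by alpha instead of zeroed."""
    return np.where(z > 0, z, alpha * z)

def softmax(z):
    """Numerically stable softmax returning a probability vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def dense(h_prev, W, b, activation):
    """One fully connected layer: activation(W h_prev + b)."""
    return activation(W @ h_prev + b)

rng = np.random.default_rng(0)
x = rng.normal(size=8)                           # toy input vector
W1, b1 = rng.normal(size=(5, 8)), np.zeros(5)    # hidden layer parameters
W2, b2 = rng.normal(size=(4, 5)), np.zeros(4)    # output layer, 4 categories

hidden = dense(x, W1, b1, leaky_relu)
probs = dense(hidden, W2, b2, softmax)
predicted = int(np.argmax(probs))                # category with greatest probability
```

Taking the arg-max corresponds to the default classification rule; per-category threshold cuts would instead compare individual entries of `probs` against chosen thresholds.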
During the learning process, the model is presented with a large number of training examples where each input, $\mathbf{x}$, has a single human assigned categorical label, $y$, which is taken by the model to be the ground truth. The ground truth label, $y$, is then typically represented in a conditional probability distribution, $p$, such that the conditional probability for the $i$th category in the model is given by $p_i = \delta_{i,y}$, which is the Kronecker delta. A loss function is used to compute the error between the model predictions and the ground truth. Modern neural networks are typically trained using the principle of maximum likelihood. In this approach, the loss function is the negative log-likelihood, which can be equivalently described as the cross-entropy between the training examples and the modeled distributions goodfellow2016 (). In the case of multinomial logistic regression (i.e., classification with multiple categories), the cross-entropy loss function for a single training example is:

$$L = -\sum_{i=1}^{K} p_i \log q_i \qquad (2)$$

where $K$ is the total number of categories in the model, $p_i$ is the ground truth, human-assigned probability for the $i$th category, and $q_i$ is the probability output by the model for the $i$th category. The gradient of the loss, as a function of the weights and biases, is calculated using the back-propagation algorithm rumelhart1986 (). The loss is then minimized by updating the weights and biases for all the neurons in each layer using the method of mini-batch stochastic gradient descent (SGD) lecun1998c (); bottou2016 (). When using mini-batches, the gradient of the loss function is estimated as the average instantaneous gradient over a small group of training examples (25 to 100, typically), which serves to balance gradient stability with computing time. This procedure is then repeated, iterating through mini-batches of training examples, until the error between the modeled and ground truth distributions reaches a satisfactory level. A single cycle through all of the mini-batches contained in the training set is typically referred to as an epoch.
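To make the training loop concrete, the following sketch applies mini-batch gradient descent to a single softmax layer with a cross-entropy loss. It is a toy example on random data with an assumed learning rate of 0.1 and a single repeated mini-batch of 32 examples; a real network would back-propagate the same gradient through its hidden layers.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between target distribution p and model output q."""
    return float(-np.sum(p * np.log(q + eps)))

def sgd_step(W, b, X, P, lr=0.1):
    """One mini-batch SGD step for a single softmax layer.

    X: (batch, features) inputs; P: (batch, categories) target distributions.
    Returns updated W, b and the model probabilities before the update.
    """
    logits = X @ W.T + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    Q = np.exp(logits)
    Q /= Q.sum(axis=1, keepdims=True)
    # For softmax + cross-entropy, dL/dlogits = Q - P; average over the batch.
    G = (Q - P) / len(X)
    return W - lr * G.T @ X, b - lr * G.sum(axis=0), Q

rng = np.random.default_rng(1)
X = rng.normal(size=(32, 6))                      # one mini-batch of 32 examples
labels = rng.integers(0, 4, size=32)
P = np.eye(4)[labels]                             # one-hot (Kronecker delta) targets

W, b = np.zeros((4, 6)), np.zeros(4)
losses = []
for _ in range(50):
    W, b, Q = sgd_step(W, b, X, P)
    losses.append(np.mean([cross_entropy(p, q) for p, q in zip(P, Q)]))
```

With zero-initialized weights the first recorded loss equals log 4, the cross-entropy of a uniform guess over four categories, and it decreases as the parameters are updated.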
3.2 Convolutional Neural Networks
Convolutional neural networks (CNNs) lecun1998a () are a subclass of neural networks in which standard matrix multiplication is replaced with the convolution operation in at least one of the model’s layers. CNNs have shown extraordinarily good performance learning features from datasets that are characterized by a known grid-like topology, such as pixels in an image or samples in a waveform. The core concept behind CNNs is to build many layers of “feature detectors” that take into account the topological and morphological structure of the input data simard2003 (). Throughout the training process, the model learns how to extract meaningful features from the input, which can then be used to model the contents of the input data. The first stages of a CNN typically contain two types of alternating layers that are used to perform “feature extraction”: convolutional layers and pooling layers.
Convolutional layers take a stack of inputs (e.g. color channels in an image) and convolve each with a set of learnable filters to produce a stack of output feature maps, where each feature map is simply a filtered version of the input data (input image, in our case). A given input image, $I$, convolved with a filter, $F$, will produce an output according to:

$$S_{i,j} = \sum_{c=1}^{C} \sum_{m=1}^{H} \sum_{n=1}^{W} I_{c,\,i+m,\,j+n}\, F_{c,m,n}$$

where $S_{i,j}$ is the $(i, j)$ pixel of the feature map (prior to applying the non-linear function), $H$ and $W$ correspond to the filter’s height and width in units of pixels, and $C$ is the number of color channels in the input image
where is the th of total feature maps output by the previous layer
Feature maps are essentially abstract representations of the input image, where each individual feature map is tasked with learning how to extract a specific feature from the input, such as edges, corners, contours, parts of objects, etc. It should be noted that the specific features learned by each feature map are not predetermined, but, rather, are selected solely by the model during the learning process. The feature maps nearest the input tend to resemble the original image. At layers further from the input, the feature maps gradually become more abstract and specialized. Figure 5 (in particular, panels (b) and (d)) highlights the difference between feature maps directly connected to the input image and those several layers deeper into the model.
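A single feature map can be computed with a direct, unoptimized loop, as in the sketch below. It uses the cross-correlation form (no filter flip) that most deep learning libraries implement under the name convolution, with no padding; the hand-made edge filter is purely illustrative, since in a CNN the filter values are learned.

```python
import numpy as np

def conv2d_valid(image, filt, bias=0.0):
    """Filter a multi-channel image with one filter (no padding).

    image: (C, H, W) array; filt: (C, fh, fw) array.
    Returns a single (H - fh + 1, W - fw + 1) feature map,
    prior to applying any non-linear function.
    """
    _, height, width = image.shape
    _, fh, fw = filt.shape
    out = np.empty((height - fh + 1, width - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sum over channels and over the filter's height and width.
            out[i, j] = np.sum(image[:, i:i + fh, j:j + fw] * filt) + bias
    return out

# One-channel image with a vertical edge between columns 2 and 3.
image = np.zeros((1, 5, 5))
image[0, :, 3:] = 1.0
edge_filter = np.array([[[-1.0, 1.0]]])           # responds to left-to-right steps

feature_map = conv2d_valid(image, edge_filter)
```

The resulting feature map is non-zero only along the column where the edge sits, which is the sense in which a filter acts as a feature detector.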
Replacing the matrix product with a sum of convolutions results in a series of additional benefits goodfellow2016 (): (1) a restricted connectivity pattern where each neuron is only connected to a local subset of the input, which reduces the number of computations, (2) the model learns a single set of parameters for each filter that can then be shared via convolution by all pixels in the input, which reduces the number of model parameters and improves the model’s generalization performance
Pooling layers boureau2010 () reduce the dimensionality of a feature map by using an aggregation function to compute a summary statistic across a small, local region of the input. The dimensional reduction gives the deeper layers of the model the ability to learn correlations between progressively larger, yet lower resolution, regions of the input. For example, max pooling zhou1988 () computes the maximum output located within a rectangular region of the input, then reduces that rectangular region to a single value equal to the maximum. A common choice is to divide each feature map into non-overlapping grids of $2 \times 2$ pixels that are then each reduced to a single pixel, converting a feature map from, say, $64 \times 64$ pixels to $32 \times 32$ pixels. As a result, only the most pronounced features in each rectangular region are forwarded to the deeper layers of the model. The pooling operation also gives rise to translation invariance
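Max pooling over non-overlapping square regions can be written compactly with a reshape, as sketched below. The pool size of 2 is the commonly used default and is an assumption here rather than a value taken from the model description.

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Max pooling over non-overlapping size x size regions.

    Rows and columns that do not fill a complete region are discarded.
    """
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool(fm)   # 4 x 4 feature map reduced to 2 x 2
```

Each output pixel keeps only the largest value in its region, so a bright feature that moves by one pixel within the same region leaves the pooled output unchanged.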
Finally, the features extracted from convolutional and pooling layers are typically used as input for a standard, fully connected, feedforward neural network (as explained in Section 3.1) where the desired output is then produced, which, in this case, is the CNN classification of the input image.
4 Constructing a DECO CNN
In the sections that follow, we describe the construction and optimization of a DECO-specific convolutional neural network. We begin by introducing the dataset and the challenges associated with both human classification error and the small number of training images. We explain how data augmentation was used to make the model approximately invariant to rotations as well as artificially boost the number of training images. We then discuss the problem of overfitting and the techniques used to address it. Next, we summarize the model structure and training process used. Finally, we present the classification results, evaluate the performance of the model, and discuss the model’s role in current and future DECO analyses.
4.1 Image Database and Human Labels
As discussed in Section 3.1, the model must not only be presented with a large number of training examples, but also with a set of corresponding human-determined categorical labels. However, assigning human labels to large datasets is time consuming and, depending on the dataset, difficult to do accurately. Previous deep learning models within the astronomy and particle physics communities have constructed labeled datasets by using a crowd-sourcing approach, for example by Galaxy Zoo willett2013 (); dieleman2015 (), or large-scale Monte Carlo event simulations, for example by the NOvA neutrino experiment ayres2007 (); aurisano2016 (). Both approaches require considerable human labor. At present, the DECO image database contains 45,000 events (images that passed the online filter), each of which potentially contains one or more clusters. Assigning human labels to each event cluster would be a very time consuming task. With this in mind, rather than labeling the entire dataset, we instead opted for an iterative, active learning lewis1994 (); luo2004 () inspired approach in which the number of labeled training examples was successively increased in parallel with the optimization of the CNN model structure.
To accomplish this, we began with an initial image dataset consisting of a few hundred images, each of which contained at least one hit pixel cluster. The individual event clusters were then inspected by eye, by multiple people, and assigned labels of track, spot, worm, ambiguous, or other. If an image contained noise artifacts, such as a hot spot in the camera sensor, the image was classified as other and not used for CNN training. Additionally, if a clear identification could not be made or if humans disagreed on the classification, which occurred 10% of the time, the image was labeled as ambiguous and excluded from the training set. The remaining images containing tracks, worms, and spots were then used to train the first iteration of the model. The trained model was used to classify approximately 1,000 events that were not in the original training sample. These classified images were then searched by eye for likely false positives, i.e., instances where the model reports a high probability that an event belongs to a certain category but appears to be wrong. These incorrectly classified events were then assigned a correct human label, added to the existing set of training images, and used to train the next iteration of the model. This process was repeated on increasingly larger sets of images.
As shown in Figure 6, with each new iteration, the examples that the model found most difficult to categorize were added to the labeled dataset, thus addressing the remaining weaknesses in the classifier. After applying the model to the full set of DECO images, it was found that images containing noise (thermal fluctuations, hot spots, etc.) were causing confusion for the classifier. To account for this, an additional “noise” category was added to the model. By learning the concept of noise, the model is in principle better able to distinguish particle events from sensor noise. The final training dataset, summarized in Figure 7, contained 5119 human-classified images: 442 tracks, 1063 worms, 1094 spots, and 2520 noise.
4.2 Preprocessing and Data Augmentation
Image-to-image variations in position, scale, and rotation pose a challenge to DECO event classification. When a DECO user collects data, both the position and orientation (at least in azimuth – zenith typically corresponds to phones operating flat on a table) of the phone are arbitrary. Both orientation and location data are collected in the app’s metadata. However, the position of a given event cluster within the camera sensor, as far as the model is concerned, should be considered a meaningless feature. Similarly, the orientation of a hit cluster within the plane, as well as reasonable variations in scale (e.g. the length of a track), should also be considered meaningless by the model. Fortunately, CNNs naturally handle translations in the input quite well lecun1998b (); gong2014 (). However, invariance to features such as scale and rotation needs to be learned.
For a given input image, the apparent size of the event with respect to the camera sensor can be affected by a number of factors such as the underlying hardware in the specific phone model (including the image sensor resolution), the energy of the particle, and the angle of incidence. The pooling operation provides resiliency to minor changes in shape and scale scherer2010 (); however, variations larger than a few pixels must be addressed by other means. Sophisticated solutions to this problem have been proposed xu2014 (), but the simplest method is to introduce scale-jittering via data augmentation, which is in widespread practice today simonyan2014 (); krizhevsky2012 (). Data augmentation consists of randomly transforming training images while preserving their human-assigned category labels. As with scale invariance, rotation invariance can also be learned through data augmentation. While rotation-invariant CNN architectures exist dieleman2015 () and have been shown to outperform data augmentation in certain cases marcos2016 (), the small number of training images in this study prohibited the use of such methods. Finally, due to the limited number of training images available, data augmentation was also used to artificially inflate the number of “unique” images seen by the model during training.
In general, data augmentation has been shown to be the simplest way to achieve approximate invariance to a given set of transformations simard2003 (). Assuming the model has the capacity to do so (i.e., enough feature maps), the model should be able to learn a wide variety of invariances directly from the data lenc2014 (). An additional benefit of data augmentation is that a single set of transformations can be used to address multiple different issues. With that in mind, the following operations were applied to each training image:
grayscale conversion and normalization: a dimensional reduction over the channel axis of each image was performed by calculating an unweighted sum of each pixel’s R+G+B value. The resulting grayscale images were then normalized to 1, taking the maximum possible R+G+B value to be 765 (i.e., $3 \times 255$). Grayscale reduces the variation seen from phone to phone and is also computationally more efficient. Furthermore, while color provides essential information for other image classification tasks, it does not for particle tracks.
translation: random left/right and up/down shifts, each by an integer number of pixels uniformly sampled between -8 and +8 with respect to the image center.
rescale: random zoom in/out uniformly sampled between 90% and 110% of the original image size, used for learning scaling invariance.
reflection: random horizontal and vertical reflections, each with a probability of 50%.
rotation: random rotation uniformly sampled between $0^\circ$ and $360^\circ$; used for learning rotation invariance. After the rotation, any remaining pixels outside the boundaries of the original input were assigned a value of 0.
crop: crop from pixels to pixels; used to reduce the amount of empty space created on the boundaries of the image as a result of rotation, translation, and rescaling.
With the exception of normalization and the conversion to grayscale, which could be performed ahead of time, all data augmentation was done in real time during the training process. Prior to the start of each training epoch, a new random set of perturbations is applied to each image.
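The operations listed above can be combined into a single random transform, sketched here with NumPy only. Several choices are assumptions made for illustration: nearest-neighbor resampling, a rotation drawn from the full circle, the order in which the operations are composed, and the hypothetical 100-pixel input and 64-pixel crop sizes. The function names `to_grayscale` and `augment` are likewise illustrative.

```python
import numpy as np

def to_grayscale(rgb):
    """Unweighted R+G+B sum, normalized by the maximum possible value of 765."""
    return rgb.astype(float).sum(axis=-1) / 765.0

def augment(gray, rng, crop=64, max_shift=8, zoom_range=(0.9, 1.1)):
    """Random shift, reflection, rotation and zoom, followed by a center crop."""
    h, w = gray.shape
    angle = rng.uniform(0.0, 2.0 * np.pi)          # assumed full-circle rotation
    zoom = rng.uniform(*zoom_range)
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    flip_y, flip_x = rng.random(2) < 0.5

    # Output pixel coordinates relative to the (shifted) image center.
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    yc = ys - (h - 1) / 2.0 - dy
    xc = xs - (w - 1) / 2.0 - dx
    if flip_y:
        yc = -yc
    if flip_x:
        xc = -xc

    # Inverse rotation and zoom give the source pixel for each output pixel.
    cos_a, sin_a = np.cos(angle), np.sin(angle)
    src_y = (cos_a * yc + sin_a * xc) / zoom + (h - 1) / 2.0
    src_x = (-sin_a * yc + cos_a * xc) / zoom + (w - 1) / 2.0
    iy = np.rint(src_y).astype(int)
    ix = np.rint(src_x).astype(int)

    # Pixels that map outside the original image are assigned a value of 0.
    inside = (iy >= 0) & (iy < h) & (ix >= 0) & (ix < w)
    out = np.zeros_like(gray)
    out[inside] = gray[iy[inside], ix[inside]]

    top, left = (h - crop) // 2, (w - crop) // 2
    return out[top:top + crop, left:left + crop]

rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(100, 100, 3), dtype=np.uint8)
gray = to_grayscale(rgb)        # done once, ahead of training
sample = augment(gray, rng)     # redrawn for every image at every epoch
```

Calling `augment` again with the same image yields a different perturbation, so the model rarely sees exactly the same pixels twice while the human label stays valid.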
4.3 Avoiding Overfitting Through Regularization
Deep neural networks typically have anywhere from tens of thousands to tens of millions of trainable parameters. The advantage of such a large number of parameters is that the model has the ability to fit extremely complex and diverse datasets. However, the downside of a model with such tremendous freedom is that there is considerable risk of over-fitting, which occurs when the model simply memorizes the training images. As a result, the model is overly sensitive to the specific features that were memorized during training and therefore generalizes poorly to new data. Over-fitting is of particular concern when dealing with a small number of training images, as is the case in this study. To combat this phenomenon, we used several regularization techniques goodfellow2016 (); kukacka2017 (), which are modifications to the learning process that are intended to reduce generalization error while leaving training error
data augmentation: artificially increasing the number of training examples by modifying the images in such a way that they look different for each particular training instance while still maintaining the correctness of the underlying human assigned label. The particular perturbations used are outlined in Section 4.2.
label smoothing: accounting for the uncertainty in human assigned labels by replacing the hard 0, 1 (false, true) label distribution, $p_i = \delta_{i,y}$, with $p_i' = (1 - \epsilon)\,\delta_{i,y} + \epsilon / K$, where $i$ is the $i$th of $K$ total categories in the model, $\epsilon$ is a small constant representing the probability of an incorrect label, and $y$ is the human label. This modification results in an additional penalty term being introduced into the loss function, Equation 2. Assuming that $\epsilon$ is reasonably small, this technique reduces the effect of incorrect labels while still encouraging correct classification szegedy2015 ().
dropout: at every step of the training process, each individual neuron in a given layer has a probability, $p$, of being temporarily set to zero, or “dropped out” hinton2012 (); srivastava2014 (). The purpose of dropout is to prevent the co-adaptation of neuron outputs such that each individual neuron depends less on other neurons being present in the network. To preserve the total scale of inputs, the neurons that weren’t dropped out are rescaled by a factor of $1/(1 - p)$. Dropout is only applied during training and turned off afterwards.
max-norm constraint: to prevent weights from blowing up, a max-norm constraint is applied to each neuron’s weight vector, $\mathbf{w}$, such that $\|\mathbf{w}\|_2 \leq c$, where $\|\cdot\|_2$ is the vector norm and $c$ is a user specified constant dictating the maximum value. After each training step the constraint is checked and, when necessary, the weights are updated according to $\mathbf{w} \rightarrow c\,\mathbf{w} / \|\mathbf{w}\|_2$. The max-norm constraint, both with and without dropout, has been shown to help reduce over-fitting srivastava2014 (); srebro2005 (). This constraint was applied to fully connected layers only.
early stopping: during the training process, testing loss (error) typically decreases, reaches a minimum value, and then begins to increase again once over-fitting has set in. To avoid using an overfit model, we capture running snapshots of the best version of the model during training, which correspond to the epochs where testing loss reaches a new minimum value bishop1995 (); sjoberg1995 ().
categorical weights: As seen in Figure 7, certain event types, tracks in particular, have fewer training images than others. As a result, the model sees more training examples from the abundant categories than the under-represented ones, which introduces bias into the classifier. To account for this imbalance, each category is assigned a weight, according to its abundance, which is applied to the loss function (Equation 2) to ensure that all categories are represented equally during optimization.
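Several of the techniques in this list reduce to a few lines of array arithmetic, sketched below. The uniform form of label smoothing, the inverse-frequency form of the category weights, and all numerical settings are assumptions for illustration; the text states only that weights are assigned according to abundance.

```python
import numpy as np

def smooth_labels(one_hot, eps):
    """Mix the hard label with a uniform distribution over the categories."""
    k = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / k

def dropout(h, p, rng):
    """Zero each activation with probability p and rescale the survivors."""
    keep = rng.random(h.shape) >= p
    return h * keep / (1.0 - p)

def max_norm(w, c):
    """Rescale a weight vector so that its L2 norm does not exceed c."""
    norm = np.linalg.norm(w)
    return w if norm <= c else w * (c / norm)

def class_weights(counts):
    """Inverse-frequency weights so each category contributes equally."""
    counts = np.asarray(counts, dtype=float)
    return counts.sum() / (len(counts) * counts)

rng = np.random.default_rng(0)
smoothed = smooth_labels(np.array([0.0, 0.0, 1.0, 0.0]), eps=0.1)
dropped = dropout(np.ones(10000), p=0.5, rng=rng)
clipped = max_norm(np.array([3.0, 4.0]), c=1.0)

counts = np.array([442, 1063, 1094, 2520])   # tracks, worms, spots, noise
weights = class_weights(counts)
```

With these weights, the under-represented track category receives the largest weight, and the weighted number of examples is the same for all four categories.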
4.4 Model Structure and Training
The best performing model trained in this study begins by taking a normalized, grayscale image (zoomed in on the hit pixel cluster) as input. The input is then transformed via data augmentation (Section 4.2), cropped to , and subjected to dropout with a probability . Next, feature extraction is performed using four three-layer-deep blocks, each of which consists of the following operations: convolution followed by a leaky ReLU activation, a second identical convolution with leaky ReLU, and, lastly, max pooling. For the leaky ReLU non-linearity, a constant multiplier is applied for all negative inputs. Following max pooling in each block, dropout is applied with probability . For each of the four blocks, the number of feature maps is doubled, starting with 64 in the first block and ending with 512 in the last. The model structure is loosely based on the VGG-16 network simonyan2014 (), which used only convolutional filters and max-pooling throughout the network. Following feature extraction, the feature maps are flattened to a single, one-dimensional vector that is used as input for a three-layer fully connected network (Section 3.1). The first two layers are identical dense (fully connected) layers with 2048 neurons, leaky ReLU activation with , and a max-norm constraint with (see Section 4.3). Each dense layer is also followed by dropout with a probability . Finally, the output layer performs softmax regression, which outputs the probability for each of the 4 categories in the model (track, spot, worm, and noise). Figure 8 shows a block diagram of the model structure and workflow. Specific details for each layer are summarized in Table 1.
To train the model, we used a variant of mini-batch SGD (see Section 3.1) known as Adadelta zeiler2012 (). For our model, Adadelta was found to converge slightly faster than both SGD and Adam kingma2014 (), another widely used variant of SGD. At the beginning of each training epoch, a new set of random data augmentation perturbations is applied to each image in the training set. The model was programmed in Python using the Keras neural network application programming interface keras () operating with a Theano theano () backend. The final model contains approximately 25 million trainable parameters and was trained on a single NVIDIA Quadro M4000 graphics processing unit (GPU) with 8 GB of RAM.
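The quoted parameter count can be cross-checked with simple arithmetic. The sketch below assumes VGG-style settings that are not confirmed by the text as extracted here: 3 x 3 filters, padding that preserves the spatial size, 2 x 2 max pooling, and a 64 x 64 single-channel input. The block widths, dense layer sizes, and four output categories are taken from the model description.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases for a k x k convolution layer."""
    return c_in * c_out * k * k + c_out

def dense_params(n_in, n_out):
    """Weights plus biases for a fully connected layer."""
    return n_in * n_out + n_out

input_size, channels = 64, 1          # assumed input: 64 x 64 grayscale image
total = 0
for c_out in (64, 128, 256, 512):     # four blocks of two convolutions each
    total += conv_params(channels, c_out) + conv_params(c_out, c_out)
    channels = c_out
    input_size //= 2                  # assumed 2 x 2 max pooling per block

flat = input_size * input_size * channels
total += dense_params(flat, 2048) + dense_params(2048, 2048) + dense_params(2048, 4)
```

Under these assumptions the total comes to 25,668,036 parameters, in line with the approximately 25 million quoted above, with roughly two thirds of them in the first dense layer.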
5 Results and Analysis
5.1 Model Performance
To estimate the overall performance of the model, independent sets of human-classified images were evaluated using the method of stratified k-fold cross-validation kohavi1995 (). In this procedure, the set of training images is split into k groups, where each group contains a roughly equal number of images from each of the categories represented in the model. Then, k otherwise identical versions of the model are trained, each time setting aside one group for testing and using the remaining k − 1 groups for training the model. Selecting a value of 10 for k, we trained each individual fold for a total of 800 epochs, where each epoch consists of a single cycle through the full set of training images. The accuracy and loss (defined below) for both training and testing sets, averaged over the 10 folds as a function of training epoch, are shown in Figure 9. The accuracy is defined to be the fraction of images correctly classified by the CNN, assuming the CNN category is equal to the output with the maximum probability. The loss
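The stratified splitting procedure can be sketched in plain Python as follows. This is an illustrative implementation under simple assumptions (round-robin assignment within each category), not the code used in this study; the label counts are made up.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, balancing every category across folds."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this category round-robin into the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Toy labels (hypothetical counts, not the DECO training set).
labels = ["track"] * 30 + ["worm"] * 20 + ["spot"] * 40 + ["noise"] * 50
folds = stratified_folds(labels, k=10)
splits = list(train_test_splits(folds))
```

Each of the 10 splits trains on nine folds and tests on the remaining one, so every image is used for testing exactly once.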
where is the number of training or testing images, is the number of categories in the model, and and are the respective CNN and human assigned categorical distributions for each image (defined in Section 3.1). Conceptually, the loss can be thought of as the average error between the human and CNN classifications.
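Assuming the loss is the standard categorical cross-entropy commonly paired with a softmax output, it can be written as below. The symbols are chosen here for illustration and may differ from the notation of Section 3.1: $N$ is the number of images, $K$ the number of categories, $p_{ij}$ the CNN output probability for image $i$ and category $j$, and $q_{ij}$ the corresponding human-assigned probability.

```latex
L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} q_{ij} \, \log p_{ij}
```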
Early stopping (Section 4.3) was used to obtain the best performing (lowest testing loss) versions of the model throughout each 800-epoch training session, which, on average, occurred near epoch 650. The mean and standard deviation of the accuracy and loss, evaluated using the best performing epoch for each fold, were found to be % and , respectively. As seen in Figure 9, testing and training accuracy remain approximately equal throughout the training process. However, this is not the case for loss. The substantial gap between the training and testing loss is likely caused by the heavy use of dropout (see Section 4.3), which is applied only during training. Lower testing loss than training loss can also be indicative of an underfit model. To test this, an alternate version of the model was trained with dropout removed from all convolutional layers and set to in the dense layers. The results of this test revealed that the gap between testing and training loss disappeared; however, the testing loss also increased slightly, suggesting that larger dropout was not causing the model to be underfit. To investigate the potential benefits of a longer training duration, an additional model was trained for 10,000 epochs. While training loss was observed to decrease slightly, no benefit was seen in the testing set, thus confirming that 800 epochs was sufficient. A value of was used for label smoothing. Setting to 0 as well as using larger values of and all resulted in marginally higher testing loss.
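For illustration, label smoothing in its commonly used form mixes each one-hot label with a uniform distribution over the categories. The sketch below assumes that formulation; the exact definition in Section 4.3 may differ, and `epsilon=0.1` is a placeholder rather than the value used here.

```python
import numpy as np

def smooth_labels(one_hot, epsilon):
    """Blend one-hot labels with a uniform distribution over the categories."""
    one_hot = np.asarray(one_hot, dtype=float)
    n_categories = one_hot.shape[-1]
    return (1.0 - epsilon) * one_hot + epsilon / n_categories

# One human-labeled "track" in the order (track, spot, worm, noise).
label = np.array([1.0, 0.0, 0.0, 0.0])
smoothed = smooth_labels(label, epsilon=0.1)  # epsilon is a placeholder value
```

Setting `epsilon` to 0 recovers the original one-hot labels.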
Figure 10 shows a category-by-category summary, known as a confusion matrix, quantifying the error between human and CNN classifications for each category in the model. Each square of this confusion matrix is calculated by averaging the testing set results over the 10 folds in the cross validation. It should be noted that the resulting distribution is not normalized and is biased according to the relative occurrence of each category in the training set. For example, noise events make up almost half of the training set (Figure 7). This bias can be removed by normalizing each row of the confusion matrix to the total counts contained in each row, i.e. the total number of human-labeled events for each category. The resulting row-normalized confusion matrix describes the conditional CNN probability distributions for each of the four human-assigned labels in the model. The probability of the CNN correctly identifying each event type, along with the probability of mis-identifying each category, can be read directly off of the row-normalized confusion matrix in Figure 11. For example, the model correctly identifies human-labeled tracks as tracks 91% of the time, while incorrectly identifying them as worms 7% of the time. This confusion in the classifier is both expected and comparable to human performance, given that, out of the four categories in the model, track and worm event morphologies are among the most similar. The model accurately labels noise events 98% of the time, which is the highest accuracy of any event type. This is also expected due to the vast differences between charged particle events and noise. Moreover, this also confirms that the model successfully learned the concept of noise, justifying the inclusion of this category in the model.
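The row normalization described above amounts to dividing each row of the count matrix by its sum. The following sketch uses made-up counts, chosen only so that the track and noise rows reproduce the percentages quoted in the text; they are not the values in Figure 10.

```python
import numpy as np

def row_normalize(confusion):
    """Divide each row (human label) by its total so that rows sum to 1."""
    confusion = np.asarray(confusion, dtype=float)
    return confusion / confusion.sum(axis=1, keepdims=True)

# Rows: human label; columns: CNN label; order (track, spot, worm, noise).
# Hypothetical counts for illustration only.
counts = np.array([
    [91,  1,  7,   1],
    [ 2, 95,  2,   1],
    [ 8,  3, 88,   1],
    [ 1,  2,  1, 196],
])
normalized = row_normalize(counts)
# normalized[i, j] estimates P(CNN label = j | human label = i).
```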
These results assume that a single classification is assigned to each image by choosing the category with the highest CNN output probability. We explore the performance of alternative choices below.
We further evaluate the model’s classification performance by calculating the true and false positive rates for each category, assuming a binary classification scheme (e.g. track and non-track). The true and false positive rates for each category are parameterized according to a threshold applied to its CNN output probability and plotted as a receiver operating characteristic (ROC) curve, as seen in the top panel of Figure 12. For example, requiring a track probability of at least 0.9 results in a true positive rate of 60% and a false positive rate of 0.1%. While the trade-off between efficiency and purity can be inferred from the ROC curve, these quantities were also explicitly calculated for tracks, which is the primary category of interest for most DECO users. The resulting efficiency, purity, and efficiency × purity curves, averaged over the 10 folds and plotted as a function of track probability threshold, are shown in the bottom panel of Figure 12. For a given fold and threshold, the efficiency is calculated from the testing set and defined to be the ratio of the number of tracks that pass the threshold to the total number of tracks. Likewise, for a given fold and corresponding test set, the purity is defined as the ratio of the number of human-labeled tracks that pass the threshold to the total number of events, regardless of event type, that pass the threshold. The product of the resulting curves is one metric that can be used to determine a threshold value that balances the efficiency vs. purity trade-off. As seen in Figure 12, the efficiency × purity curve is maximized at a threshold of 0.61, which corresponds to an efficiency of 86% and a purity of 88%.
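The efficiency and purity definitions above translate directly into a few lines of code. The sketch below is a minimal illustration with toy probabilities and labels (not DECO data); the function names are hypothetical.

```python
import numpy as np

def efficiency_and_purity(track_prob, is_track, threshold):
    """Efficiency and purity of a track selection at a given probability threshold."""
    track_prob = np.asarray(track_prob, dtype=float)
    is_track = np.asarray(is_track, dtype=bool)
    passed = track_prob >= threshold
    n_true_pass = np.sum(passed & is_track)
    efficiency = n_true_pass / is_track.sum()   # passing tracks / all tracks
    # passing tracks / all passing events; undefined if nothing passes
    purity = n_true_pass / passed.sum() if passed.any() else float("nan")
    return efficiency, purity

def best_threshold(track_prob, is_track, thresholds):
    """Return the threshold that maximizes efficiency x purity."""
    scores = []
    for t in thresholds:
        eff, pur = efficiency_and_purity(track_prob, is_track, t)
        scores.append(eff * pur)
    return thresholds[int(np.nanargmax(scores))]

# Toy CNN track probabilities and human labels (illustrative only).
track_prob = np.array([0.95, 0.85, 0.70, 0.40, 0.90, 0.65, 0.30, 0.10])
is_track = np.array([True, True, True, True, False, False, False, False])

eff, pur = efficiency_and_purity(track_prob, is_track, threshold=0.8)
best = best_threshold(track_prob, is_track, [0.2, 0.5, 0.8])
```

In this toy sample, raising the threshold trades efficiency for purity, which is the same trade-off the curves in Figure 12 display for the real data.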
5.2 Comparison With Straight Cuts
Early classification attempts, described in Section 2.2, sought to separate tracks from non-tracks in a binary fashion using straight cuts on simple metrics. This method, which used each image’s area, number of clusters, and eccentricity, can be directly compared to the CNN model. To accomplish this, we treat the CNN output as a binary classification scheme (track or non-track) and evaluate both classification methods on the same set of testing images and corresponding human-assigned labels. The initial, straight-cuts model yielded a track selection with an efficiency of 69% and a purity of 37%. The low purity is likely due to small differences in the event topologies of many tracks and worms, which can be difficult to capture with simple geometric metrics. Moreover, optimization of the straight-cuts approach required aggressive cuts on these metrics, which also contributes to its poor efficiency in identifying tracks. The CNN classification, on the other hand, identifies tracks with 73% efficiency and 96% purity (cutting at a track probability threshold of 0.8, to be explained in Section 5.3), and can also accurately identify worms, spots, and noise with similar performance. Furthermore, the output probabilities of the CNN model enable us to design an event selection with a desired efficiency and/or purity in mind.
5.3 Application To Full Dataset
While the CNN model has a number of uses, providing real-time classifications for the events listed in the public DECO data browser deco_data () is perhaps the most important. For this purpose, we seek to maintain a high-purity set of events identified as tracks. After evaluating constant cut-off values of 0.7, 0.8, and 0.9 on the testing set, we opted for a probability threshold of 0.8, which yields an event selection with a track efficiency of 70% and, most importantly, a track purity of 95%. As a result of applying a threshold cut rather than the maximum-probability criterion, there are some events with probability below threshold for every single category, which are therefore assigned a label of “ambiguous”. More aggressive threshold cuts result in more events being labeled “ambiguous”.
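A threshold-based labeling rule of this kind can be sketched as follows. This is an illustrative implementation, not the production DECO code; in particular, whether the comparison is inclusive (`>=`) is an assumption.

```python
import numpy as np

CATEGORIES = ["track", "spot", "worm", "noise"]

def classify(probabilities, threshold=0.8):
    """Return the category whose probability meets the threshold, else 'ambiguous'."""
    probabilities = np.asarray(probabilities, dtype=float)
    best = int(np.argmax(probabilities))
    if probabilities[best] >= threshold:
        return CATEGORIES[best]
    return "ambiguous"

confident = classify([0.85, 0.05, 0.08, 0.02])   # clear track
uncertain = classify([0.55, 0.05, 0.38, 0.02])   # track-like, but below threshold
```

Because the output probabilities sum to 1, any threshold above 0.5 guarantees that at most one category can pass.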
To investigate the effect of a given threshold choice on the full dataset we ran every event in the DECO database (45,000 images) through the CNN model and used the resulting output probabilities to classify each event according to several different threshold choices. The resulting distributions for all event types, shown in Figure 13, confirm that a threshold of 0.8 is indeed reasonable and results in ambiguous images 10% of the time, which is consistent with human categorization ambiguity (Section 4.1). With this in mind, the classification scheme based on a threshold of 0.8 was implemented in the public database, which can now be queried by event type as determined by the CNN deco_data ().
Given the classification assigned to any event using this scheme, it is desirable to know the probability that the CNN classification is in fact correct for each event type. As an example, for tracks this corresponds to the conditional probability , where is the human label and is the CNN label. This probability depends on the relative rate of each event type in the data set, i.e., the prior probability that a given event belongs to a given category. The conditional probability could be calculated directly from the testing data sets used in the 10-fold cross validation; however, the distribution of event types in this set of images is biased in comparison to the full dataset. This is because the training set was intentionally enriched with tracks and worms; tracks are the most interesting events from an astrophysical perspective and worms are the primary source of confusion for tracks. Compared to the training set, the full data set has relatively fewer worms and tracks and more spots and noise events. Fortunately, this bias can be corrected by rescaling the testing set results. To accomplish this, we begin with the approximation that the CNN classifications for the full dataset are entirely correct, an approximation that is justified by the excellent performance of the CNN. We then use the abundance of each event type in the full dataset according to the CNN classification to determine the prior probability that an event belongs to a given category. Next, we apply a threshold cut of 0.8 to the testing set and construct a new confusion matrix (similar to Figure 10). We rescale each row of this confusion matrix by the ratio of the number of events for each event type in the full data set (Figure 13 with a 0.8 threshold) to the number of each event type in the training set (Figure 7). Finally, we rescale the confusion matrix column-wise in order to calculate the conditional probability, , for each category.
By necessity, a 5th column for “ambiguous” events was added to the confusion matrix, which shows the distribution of events that don’t meet any of the CNN threshold requirements. The resulting confusion matrix, shown in Figure 14, suggests that all four event types in the full dataset are likely to be classified correctly 90% of the time. Most notably, we estimate that an event classified as a track by the CNN has a 98% probability of being a track according to human classification.
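The rescaling procedure can be sketched as a row-wise reweighting followed by a column-wise normalization. The example below uses a reduced two-category toy problem (track and non-track, plus an ambiguous column) with made-up counts; the function name and all numbers are hypothetical.

```python
import numpy as np

def rescaled_posteriors(confusion, n_labeled, n_full):
    """Estimate P(human label | CNN label) after correcting for category abundances.

    confusion: counts with rows = human label, columns = CNN label
               (including a final "ambiguous" column), after the threshold cut.
    n_labeled: number of events of each human-labeled type in the labeled sample.
    n_full:    estimated number of events of each type in the full dataset.
    """
    confusion = np.asarray(confusion, dtype=float)
    weights = np.asarray(n_full, dtype=float) / np.asarray(n_labeled, dtype=float)
    rescaled = confusion * weights[:, np.newaxis]           # row-wise rescaling
    return rescaled / rescaled.sum(axis=0, keepdims=True)   # column-wise normalization

# Rows: human (track, non-track); columns: CNN (track, non-track, ambiguous).
confusion = [[70, 10, 20],
             [ 5, 85, 10]]
posterior = rescaled_posteriors(confusion, n_labeled=[100, 100], n_full=[1000, 9000])
p_track_given_cnn_track = posterior[0, 0]
```

Only the relative abundances matter: multiplying every entry of `n_full` by a common factor leaves the column-normalized result unchanged.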
6 Conclusions and Future Work
We have described the development and validation of a convolutional neural network for the classification of images obtained by users running the DECO application. This new approach to image classification resulted in significant improvements over previous classification of DECO images using straight cuts. Event classification using the straight-cuts approach produced a track sample with 20% purity after applying the rescaling procedure described in Section 5.3. The CNN model, on the other hand, yields a data set with an estimated purity of 98% after rescaling to the full DECO data set. This classification scheme has been integrated into the standard DECO processing pipeline and the resulting classification of each event is available along with the event’s image and metadata on the public web site within several hours of detection. The CNN classification can be used in queries, allowing users to select a sample of images of any type, or multiple types, for analysis and outreach purposes.
In addition to improving the overall experience of DECO users, the new model opens the door for new and improved analyses. For example, the measurement of the depletion depth (i.e., sensitive region) of a phone’s camera sensor requires a large, pure sample of cosmic-ray muon tracks. Without a robust method of identifying tracks, the analysis published in vandenbroucke2016 () was limited to a single phone. The new classification enables us to extend this analysis to multiple phones with a lower non-cosmic-ray background in the data set. Once the thickness of the depletion region is known for a particular phone model, it can be used to constrain the incident zenith angle of individual cosmic rays. Together with the azimuthal direction of the track within the sensor plane, this would enable reconstructing the direction of DECO tracks.
While the model was developed exclusively using images in the Android DECO data set, we expect it to generalize to similar data sets with minimal changes. DECO for iOS, which is currently in development, will have a data set consisting of images created by the same charged-particle interactions discussed here. Although the overall camera response will differ from Android phones, the resulting event types are expected to be the same. It is worth mentioning once more that the Android data set consists of images from hundreds of different phone models, each with unique camera sensor response to DECO events. The data augmentation applied during training (Section 4.2) mitigates the effects of model-to-model variation by building invariances into the classification that should enable it to generalize to the iOS data set. The excellent performance of our CNN in identifying particle types in the DECO data set indicates that similar computer vision approaches could be well suited for other experiments that use CCD and CMOS sensors for particle detection.
DECO is supported by the American Physical Society, the Knight Foundation, the Simon Strauss Foundation, QuarkNet, and by National Science Foundation Grant #1707945. We are grateful for beta testing, software development, and valuable conversations with Raaha Azfar, Keith Bechtol, Segev BenZvi, Andy Biewer, Paul Brink, Patricia Burchat, Duncan Carlsmith, Alex Drlica-Wagner, Mike Duvernois, Brett Fisher, Lucy Fortson, Stefan Funk, Mandeep Gill, Laura Gladstone, Giorgio Gratta, Jim Haugen, Kenny Jensen, Kyle Jero, Peter Karn, David Kirkby, Matthew Plewa, David Saltzberg, Marcos Santander, Delia Tosi, and Ian Wisher. We would also like to thank Ilhan Bok, Adrian Cisneros, Alex Diebold, Tyler Dolan, Blake Gallay, Emmanuelle Hannibal, Heather Levi, and Owen Roszkowski for their contributions to the DECO project through our QuarkNet DECO high school internship program.
- journal: Astroparticle Physics
- Users have run DECO on 604 distinct phone models to date.
- In the case of minimization, the objective function is commonly referred to as the cost, loss, or error function.
- Intermediate layer outputs are always connected as inputs for other layers and are therefore never visible as network outputs, hence the term “hidden”.
- In our application, we sum the three color channels R, G, and B to produce a single grayscale color channel.
- The input layer, , isn’t a feature map but is simply the input image for the model.
- Generalization performance is a model’s ability to perform well on previously unseen examples that were not included in the training set.
- To be clear, is translation equivariant if , and translation invariant if , where is a translation operation.
- The error between the true and predicted classification for images in the training set
- Note that this is technically the logarithm of the loss and therefore is not expressed as a percentage.
- J. Vandenbroucke, S. Bravo, P. Karn, M. Meehan, M. Plewa, T. Ruggles, D. Schultz, J. Peacock, A. L. Simons, Detecting particles with cell phones: the Distributed Electronic Cosmic-ray Observatory, PoS ICRC2015 (2016) 691. arXiv:1510.07665.
- J. Vandenbroucke, S. BenZvi, S. Bravo, K. Jensen, P. Karn, M. Meehan, J. Peacock, M. Plewa, T. Ruggles, M. Santander, D. Schultz, A. Simons, D. Tosi, Measurement of cosmic-ray muons with the Distributed Electronic Cosmic-ray Observatory, a network of smartphones, Journal of Instrumentation 11 (04) (2016) P04019.
- D. Whiteson, M. Mulhearn, C. Shimmin, K. Cranmer, K. Brodie, D. Burns, Searching for ultra-high energy cosmic rays with smartphones, Astroparticle Physics 79 (2016) 1–9.
- M. Meehan, S. Bravo, F. Campos, J. Peacock, T. Ruggles, C. Schneider, A. L. Simons, J. Vandenbroucke, M. Winter, The particle detector in your pocket: The Distributed Electronic Cosmic-ray Observatory, in: Proceedings, 35th International Cosmic Ray Conference (ICRC 2017): Bexco, Busan, Korea, July 12-20, 2017, 2017. arXiv:1708.01281.
- M. Borisyak, M. Usvyatsov, M. Mulhearn, C. Shimmin, A. Ustyuzhanin, Muon Trigger for Mobile Phones, J. Phys. Conf. Ser. 898 (3) (2017) 032048. arXiv:1709.08605, doi:10.1088/1742-6596/898/3/032048.
- M. Ackermann, et al., The Fermi Large Area Telescope on Orbit: Event Classification, Instrument Response Functions, and Calibration, The Astrophysical Journal Supplement Series 203 (1) (2012) 4.
- The CMS Collaboration, The CMS experiment at the CERN LHC, Journal of Instrumentation 3 (08) (2008) S08004.
- W. E. Lorensen, H. E. Cline, Marching cubes: A high resolution 3d surface construction algorithm, COMPUTER GRAPHICS 21 (4) (1987) 163–169.
- S. van der Walt, J. L. Schönberger, J. Nunez-Iglesias, F. Boulogne, J. D. Warner, N. Yager, E. Gouillart, T. Yu, the scikit-image contributors, scikit-image: image processing in Python, PeerJ 2 (2014) e453.
- D. Groom, Cosmic rays and other nonsense in astronomical CCD imagers, Experimental Astronomy 14 (1)
- C. Patrignani, et al., Review of Particle Physics, Chin. Phys. C40 (10) (2016) 100001. doi:10.1088/1674-1137/40/10/100001.
- Y. D. Khan, S. A. Khan, F. Ahmad, S. Islam, Iris Recognition Using Image Moments and k-means Algorithm, The Scientific World Journal 2014 (2014) 9.
- Y. Bengio, Learning Deep Architectures for AI, Found. Trends Mach. Learn. 2 (1) (2009) 1–127.
- F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Spartan Books, Washington, 1962.
- R. D. Reed, R. J. Marks, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, MIT Press, Cambridge, MA, USA, 1998.
- I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org.
- V. Nair, G. E. Hinton, Rectified Linear Units Improve Restricted Boltzmann Machines, in: Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, Omnipress, USA, 2010, pp. 807–814.
- A. L. Maas, A. Y. Hannun, A. Y. Ng, Rectifier nonlinearities improve neural network acoustic models, in: in ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
- D. E. Rumelhart, G. E. Hinton, R. J. Williams, Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1, MIT Press, Cambridge, MA, USA, 1986, Ch. Learning Internal Representations by Error Propagation, pp. 318–362.
- Y. LeCun, L. Bottou, G. B. Orr, K.-R. Müller, Efficient BackProp, in: Neural Networks: Tricks of the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop, Springer-Verlag, London, UK, 1998, pp. 9–50.
- L. Bottou, F. E. Curtis, J. Nocedal, Optimization Methods for Large-Scale Machine Learning (Jun. 2016). arXiv:1606.04838.
- Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2323. doi:10.1109/5.726791.
- P. Y. Simard, D. Steinkraus, J. C. Platt, Best practices for convolutional neural networks applied to visual document analysis, in: Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2, ICDAR ’03, IEEE Computer Society, Washington, DC, USA, 2003, pp. 958–.
- Y. L. Boureau, J. Ponce, Y. Lecun, A Theoretical Analysis of Feature Pooling in Visual Recognition, in: ICML 2010 - Proceedings, 27th International Conference on Machine Learning, 2010, pp. 111–118.
- Y. T. Zhou, R. Chellappa, Computation of optical flow using a neural network, in: IEEE 1988 International Conference on Neural Networks, 1988, pp. 71–78 vol.2. doi:10.1109/ICNN.1988.23914.
- K. W. Willett, C. J. Lintott, S. P. Bamford, K. L. Masters, B. D. Simmons, K. R. V. Casteels, E. M. Edmondson, L. F. Fortson, S. Kaviraj, W. C. Keel, T. Melvin, R. C. Nichol, M. J. Raddick, K. Schawinski, R. J. Simpson, R. A. Skibba, A. M. Smith, D. Thomas, Galaxy Zoo 2: detailed morphological classifications for 304 122 galaxies from the Sloan Digital Sky Survey, MNRAS 435 (2013) 2835–2860. arXiv:1308.3496, doi:10.1093/mnras/stt1458.
- S. Dieleman, K. W. Willett, J. Dambre, Rotation-invariant convolutional neural networks for galaxy morphology prediction, MNRAS 450 (2015) 1441–1459. arXiv:1503.07077, doi:10.1093/mnras/stv632.
- D. S. Ayres, et al., The NOvA Technical Design Report. doi:10.2172/935497.
- A. Aurisano, A. Radovic, D. Rocco, A. Himmel, M. D. Messier, E. Niner, G. Pawloski, F. Psihas, A. Sousa, P. Vahle, A convolutional neural network neutrino event classifier, Journal of Instrumentation 11 (2016) P09001. arXiv:1604.01444, doi:10.1088/1748-0221/11/09/P09001.
- D. D. Lewis, W. A. Gale, A sequential algorithm for training text classifiers, in: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’94, Springer-Verlag New York, Inc., New York, NY, USA, 1994, pp. 3–12.
- T. Luo, K. Kramer, S. Samson, A. Remsen, D. B. Goldgof, L. O. Hall, T. Hopkins, Active learning to recognize multiple types of plankton, in: Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., Vol. 3, 2004, pp. 478–481 Vol.3. doi:10.1109/ICPR.2004.1334570.
- Y. LeCun, Y. Bengio, The Handbook of Brain Theory and Neural Networks, MIT Press, Cambridge, MA, USA, 1998, Ch. Convolutional Networks for Images, Speech, and Time Series, pp. 255–258.
- Y. Gong, L. Wang, R. Guo, S. Lazebnik, Multi-scale Orderless Pooling of Deep Convolutional Activation Features (Mar. 2014). arXiv:1403.1840.
- D. Scherer, A. Müller, S. Behnke, Evaluation of pooling operations in convolutional architectures for object recognition, in: Proceedings of the 20th International Conference on Artificial Neural Networks: Part III, ICANN’10, Springer-Verlag, Berlin, Heidelberg, 2010, pp.
- Y. Xu, T. Xiao, J. Zhang, K. Yang, Z. Zhang, Scale-Invariant Convolutional Neural Networks (Nov. 2014). arXiv:1411.6369.
- K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition (Sep. 2014). arXiv:1409.1556.
- A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, Curran Associates Inc., USA, 2012, pp. 1097–1105.
- D. Marcos, M. Volpi, D. Tuia, Learning rotation invariant convolutional filters for texture classification (Apr. 2016). arXiv:1604.06720.
- K. Lenc, A. Vedaldi, Understanding image representations by measuring their equivariance and equivalence (Nov. 2014). arXiv:1411.5908.
- J. Kukačka, V. Golkov, D. Cremers, Regularization for Deep Learning: A Taxonomy (Oct. 2017). arXiv:1710.10686.
- C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the Inception Architecture for Computer Vision (Dec. 2015). arXiv:1512.00567.
- G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors (Jul. 2012). arXiv:1207.0580.
- N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014) 1929–1958.
- N. Srebro, A. Shraibman, Rank, Trace-norm and Max-norm, in: Proceedings of the 18th Annual Conference on Learning Theory, COLT’05, Springer-Verlag, Berlin, Heidelberg, 2005, pp.
- C. M. Bishop, Regularization and Complexity Control in Feed-forward Networks, in: F. Fougelman-Soulie, P. Gallinari (Eds.), Proceedings International Conference on Artificial Neural Networks ICANN’95, Vol. 1, 1995, pp. 141–148.
- J. Sjöberg, L. Ljung, Overtraining, Regularization, and Searching for Minimum in Neural Networks, IFAC Proceedings Volumes 25 (14) (1992) 73–78, 4th IFAC Symposium on Adaptive Systems in Control and Signal Processing 1992, Grenoble, France, 1-3 July.
- M. D. Zeiler, ADADELTA: An Adaptive Learning Rate Method (Dec. 2012). arXiv:1212.5701.
- D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization (Dec. 2014). arXiv:1412.6980.
- F. Chollet, et al., Keras, https://github.com/fchollet/keras (2015).
- Theano Development Team, Theano: A Python framework for fast computation of mathematical expressions
- R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, Morgan Kaufmann, 1995, pp. 1137–1143.