A probabilistic atlas for cell identification

Greg Bubnis, Steven Ban, Matthew D. DiFranco, Saul Kato (corresponding author: saul.kato@ucsf.edu)
University of California, San Francisco
Weill Institute for Neurosciences, Department of Neurology
September 4, 2019

We propose a general framework for a collaborative machine learning system to assist bioscience researchers with the task of labeling specific cell identities from microscopic still or video imaging. The distinguishing features of this approach versus prior approaches include: (1) use of a statistical model of cell features that is iteratively improved, (2) generation of probabilistic guesses at cell ID rather than single best-guesses for each cell, (3) tracking of joint probabilities of features within and across cells, and (4) ability to exploit multi-modal features, such as cell position, morphology, reporter intensities, and activity. We provide an example implementation of such a system applicable to labeling fluorescently tagged C. elegans neurons. As a proof of concept, we use a generative spring-mass model to simulate sequences of cell imaging datasets with variable cell positions and fluorescence intensities. Training on synthetic data, we find that atlases that track inter-cell positional correlations give higher labeling accuracies than those that treat cell positions independently. Tracking an additional feature type, fluorescence intensity, boosts accuracy relative to a position-only atlas, suggesting that multiple cell features could be leveraged to improve automated label predictions.

Keywords: cell atlas, Bayesian, cell identification, cell type, calcium imaging

I Introduction

The development of effective fluorescent reporters, increase in computing power, and proliferation of volumetric microscopy techniques are now enabling acquisition of large tissue volumes or entire organisms, within which many individual cells can be discerned (Ahrens et al. 2013, Kato et al. 2015, Prevedel et al. 2014, Skocek et al. 2018). With the explosion in data collection capabilities and throughput, the time demand for the experimentalist is shifting from data acquisition to post-acquisition data analysis.

Common post-acquisition analyses are to (1) find and delineate these cells in space and, for video recordings, in time, often collectively referred to as region-of-interest (ROI) detection, segmentation, and tracking, (2) extract static or time-series data of fluorescence intensity arising from calcium dynamics or other cellular processes, and (3) determine the identity or type of the delineated cells, in order to fuse cellular-resolution data across trials and animals (“inter-trial registration”) and pave the way for statistically powerful analysis. This last step is the focus of this paper.

In eutelic organisms or certain experimental contexts in non-eutelic organisms, cell identity labels may be unique within a recording (“cell ID”); in other contexts, the labels may be non-unique (“cell type”). Here, we focus on the former problem, typified by volumetric recordings of fluorescently tagged neurons of the brain of the nematode C. elegans hermaphrodite, which is composed of precisely 302 neurons. Manual identification based on comparison to previously labeled images or a traditional reference atlas is extremely laborious, and there currently does not exist a widely used method for automated cell identification in this context. We propose a robust and practical approach for automated cell ID.

II Design Features

We outline what we believe to be necessary design requirements as follows:

Statistical model. A canonical reference atlas of anatomy is a timeworn pedagogical device, but it is well appreciated to neglect inherent biological variability. A more informative atlas – derived from experimental data across multiple trials – should maintain a record of the biological variability of anatomical or physiological features (Mazziotta et al. 2001, Van Essen and Dierker 2007, Toga et al. 2006). Specifically, the atlas should maintain a probability distribution for each atlas feature or, to capture relationships between these features, a joint probability distribution over the set of atlas features.

Incrementally trained with partially labeled data. We would like the quality of our atlas to improve as each new training dataset is contributed to the system, as measured by a declining error rate and increasing confidence levels of labeling. Furthermore, training sets are likely to lack complete labeling, possibly severely so, because manual identification of even a subset of neurons is extremely laborious and unreliable.

Probabilistic output. Earlier algorithmic approaches to cell identification, such as those based on bipartite graph matching, were prone to cascading errors, where the mislabeling of one cell could lead to the mislabeling of many others, prompting the exploration of probabilistic approaches (Long et al. 2009, Kainmueller et al. 2014, Wu et al. 2018). It is also probably unreasonable to expect a perfect automatic labeling for any large dataset, due to the noise and variability in all imaging of biological systems and the lack of perfect knowledge in real-world experimental scenarios. Therefore, a system which can report label guesses with marginal probabilities, reflecting confidence of labeling, is of more practical use (Figure 1).

Acceptance of datasets with missing or erroneous data. Variability in development, reporter expression patterns, feature extraction algorithms, and imaging contexts may result in incoming datasets having varying numbers of cells to be identified. To be generally applicable, identification should be possible on ROI sets that correspond to both subsets and supersets of atlas cells.

Flexibility in exploiting available cell properties. Cells have many observable features which may covary within and between cells, including position, size, shape, fluorescence intensity, and time series (Table 1). We want the atlas to leverage this covariance structure to ID cells with increased confidence.

Community driven. Each experimental context may require its own atlas depending on the characteristics of the data; however, there are likely to be several groups around the world working in similar contexts who need a cell ID system and who can, in return, contribute training datasets for the improvement of the atlas. The precise parameters of each experiment are likely to change, and the use of new datasets must be robust to these changes.

We recommend that this system be implemented as a cloud service for accessibility and to pool the efforts of researchers.

Figure 1: Example probabilistic and non-probabilistic output for a system containing 5 unlabeled ROIs and 5 atlas cells. In this case, labeling is only performed on positions of ROI centroids (dots), and the atlas contains information on the biological variability of the cells (ellipses).
Data Type                   Data Format    Dimension

Single frame:
  Position                                 3
  Size                                     3
  Orientation                              3
  Fluorescence intensity
  Expression pattern        fold(gene)
  Morphology                voxel set

Video:
  Fluorescence time series
  Spacetime trajectory

Table 1: Potential features of cells to learn

III System Description

Figure 2: Atlas system flowchart

We now describe a system that satisfies the aforementioned design requirements.

A sequence of datasets, possibly contributed by different users, is processed in the cloud one at a time by the system (Figure 2). We assume that a dataset arrives post-ROI detection (and post time series extraction for a video) as a set of unlabeled feature records, one for each cell. For ease of reference, visualization, and proof checking by end users, datasets may be accompanied by the original raw image stack data, although the atlas does not make use of this raw data.

To enable effective atlas learning, datasets must first be registered to match the coordinate system of the reference atlas, without the benefit of labels to match particular cells in the dataset to particular cells in the atlas. This registration operation may involve global transformations of the dataset, such as scaling, translation, and rotation, as well as more localized transformations such as volumetric warping to correct for tissue deformation. Registration of other feature types, such as fluorescence intensity, may involve other approaches such as histogram equalization or color space transformation.

After registration, labels must be assigned to each cell in the dataset. Label assignment across all cells in a dataset is a combinatorial optimization problem, in which an objective function representing likelihood is maximized. Estimated marginal probabilities of individual cell labelings are reported to the user (Figure 1), along with accompanying visualization data for ease of validation.

A curation step, either manual or automated, follows: if the quality of the dataset and the success of labeling are acceptable, then the labeled dataset, possibly in part or in whole, will be used for a training update to the atlas. In the early stages of atlas training, manual curation will likely be required to produce a useful system for early users. In the future, partial or weighted-importance acceptance of datasets could replace this binary acceptance procedure.

IV Mathematical Setting

We follow a sequential Bayesian updating approach. We define a probabilistic atlas $\mathcal{A}$ as a set of hyperparameters $\psi$ that describe a joint probability distribution of cell parameters $\theta$, along with the selection of a probability distribution function family $\mathcal{F}$:

$$\mathcal{A} \equiv (\psi, \mathcal{F}), \qquad \theta = (\theta_1, \ldots, \theta_N) \sim \mathcal{F}(\psi)$$

with $N$ being the number of atlas cells.

Each dataset comes into the system as an unrolled, concatenated feature vector $X = (x_1, \ldots, x_M)$, where $x_j$ contains the features for ROI (i.e. unlabeled cell) $j$. At present we only consider the case of $M = N$. For each dataset, global registration (spatial and/or in other feature spaces) is performed between dataset and atlas features, which produces a transformed dataset $\tilde{X}$:

$$\tilde{X} = T(X; \beta)$$

where the parameter vector $\beta$ is the result of optimizing over an objective function such as least-squares error to nearest neighbors. Without such global registration, direct application of probability models with location parameters is likely to fail. An example algorithm for spatial registration is coherent point drift (CPD) (Myronenko and Song 2010). We exclude these transformation parameters from our atlas hyperparameters, as they are likely to depend on the nature of particular datasets, though in the future a supervised learning approach could be applied to the registration task.
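The paper's registration step uses CPD, which handles unknown correspondences. As a simpler illustration of the global transform family it mentions (scaling, rotation, translation), here is a least-squares similarity alignment (Kabsch/Umeyama-style, not the authors' pipeline) that assumes point correspondences are already known:

```python
import numpy as np

def rigid_register(X, Y):
    """Find scale s, rotation R, and translation t minimizing
    ||s * X @ R.T + t - Y||^2, given row-wise point correspondences
    between X and Y (both (n, d) arrays)."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    U, S, Vt = np.linalg.svd(Xc.T @ Yc)            # cross-covariance SVD
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # avoid reflections
    D = np.diag([1.0] * (X.shape[1] - 1) + [float(d)])
    R = Vt.T @ D @ U.T
    s = (S * np.diag(D)).sum() / (Xc ** 2).sum()   # optimal isotropic scale
    t = my - s * mx @ R.T
    return s, R, t
```

CPD generalizes this by treating one point set as a Gaussian mixture and estimating soft correspondences and the transform jointly.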

After spatial registration, a putative labeling $L$, defined as a mapping from atlas cells to ROIs with $L(i) = j$ indicating that atlas cell $i$ is matched to ROI $j$, is determined by maximizing likelihood over the possible label permutations. This is a combinatorial search problem, and our strategies to solve it are discussed later in Section VI.

For a transformed, labeled dataset that passes curation, denoted by $D^*$, a sequential Bayesian update of the hyperparameters is performed:

$$\psi_{t+1} = U(\psi_t, D^*_t) \tag{4}$$

where the form of the update $U$ is determined by solving the Bayes' theorem equation

$$p(\theta \mid D^*, \psi_t) \propto p(D^* \mid \theta)\, p(\theta \mid \psi_t)$$

under the appropriate choice of conjugate prior form $p(\theta \mid \psi)$, which is in turn based on the choice of distribution family $\mathcal{F}$ (Murphy 2012). The initial values of the hyperparameters $\psi_0$ may be drawn from other data sources or based on uninformative or reference priors.

We consider three types of probabilistic atlas of increasing expressivity: univariate (UV), cell multivariate (CMV), and full multivariate (FMV). For ease of illustration, we use the normal distribution family, although it is not a requirement of the atlas.

Univariate model (UV). In this case, we assume that the probability distributions of the individual cell parameters are mutually independent. Each parameter $\theta_i$ is tracked with a normal distribution, with unknown mean $\mu$ and unknown variance $\sigma^2$. Estimates of the true $\mu$ and $\sigma^2$ can be obtained from sampled data using the normal-inverse-gamma (NIG) conjugate prior with hyperparameters $(\mu_0, \kappa_0, \alpha_0, \beta_0)$, where $\mu_0$ and $\beta_0$ are priors for the mean and variance (up to a scaling factor), and $\kappa_0$ and $\alpha_0$ are scalars representing the strength of belief in the respective priors.
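The NIG update for a single scalar feature can be sketched as follows (a standard textbook formulation, e.g. Murphy 2012; function and variable names are ours, not the authors'):

```python
import numpy as np

def nig_update(mu0, kappa0, alpha0, beta0, x):
    """Conjugate normal-inverse-gamma update for a scalar feature with
    unknown mean and variance, given new observations x (1D array).
    Returns the updated hyperparameters."""
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    kappa = kappa0 + n
    mu = (kappa0 * mu0 + n * xbar) / kappa
    alpha = alpha0 + 0.5 * n
    beta = (beta0
            + 0.5 * ((x - xbar) ** 2).sum()
            + 0.5 * kappa0 * n * (xbar - mu0) ** 2 / kappa)
    return mu, kappa, alpha, beta
```

Because the prior is conjugate, updating with two batches sequentially gives exactly the same hyperparameters as one batch update, which is what makes the incremental training of Equation 4 possible; the posterior expectation of the variance is `beta / (alpha - 1)`.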

Cell multivariate model (CMV). In addition to tracking the variability of individual features, we may wish to capture relationships between different feature types within each cell, such as correlation between the $x$, $y$, and $z$ coordinates of the cell. We still ignore inter-cell correlation and assume that the cell models are independent. In this case, the probability distribution for each cell is $\mathcal{N}(\mu, \Sigma)$, where $\mu$ is a vector of means, one for each parameter, and $\Sigma$ is the covariance matrix of all features of a single cell, both unknown.

Estimates of $\mu$ and $\Sigma$ can be obtained from sample data using the normal-inverse-Wishart (NIW) conjugate prior with hyperparameters $(\mu_0, \kappa_0, \nu_0, \Psi_0)$, where $\mu_0$ and $\Psi_0$ are priors for the mean and covariance (up to a scaling factor), and $\kappa_0$ and $\nu_0$ are scalars representing the strength of belief in the respective priors.
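The multivariate analogue of the NIG update, used by both the CMV and FMV models, can be sketched as (standard NIW formulas, e.g. Murphy 2012; naming is ours):

```python
import numpy as np

def niw_update(mu0, kappa0, nu0, Psi0, X):
    """Conjugate normal-inverse-Wishart update for a multivariate normal
    with unknown mean and covariance; rows of X are new observations."""
    X = np.asarray(X, dtype=float)
    n, xbar = X.shape[0], X.mean(axis=0)
    kappa, nu = kappa0 + n, nu0 + n
    mu = (kappa0 * mu0 + n * xbar) / kappa
    Xc = X - xbar
    dev = (xbar - mu0)[:, None]
    # prior scatter + sample scatter + mean-shift correction
    Psi = Psi0 + Xc.T @ Xc + (kappa0 * n / kappa) * (dev @ dev.T)
    return mu, kappa, nu, Psi
```

For a d-dimensional feature vector, the posterior expectation of the covariance is `Psi / (nu - d - 1)`; for the CMV model this is applied per cell, while the FMV model applies it once over the full stacked feature vector.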

Full multivariate model (FMV). In the full multivariate case, we drop the assumption of iid cells in order to capture inter-cell, inter-feature correlations and therefore use a single multivariate model spanning all parameters. Using a normal distribution, we can employ the same NIW conjugate prior across the full feature set.

In all three cases, we update the atlas hyperparameters with new labeled datasets according to Equation 4. Our atlas models then yield estimates of a multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$, but with different learned covariance matrix structure: $\Sigma_{\mathrm{UV}}$ is diagonal, $\Sigma_{\mathrm{CMV}}$ is block diagonal with each block modeling intra-cell feature correlations, and $\Sigma_{\mathrm{FMV}}$ has no a priori block diagonal structure.

Labeling. A putative labeling $L$ is scored using a cost function defined by the negative log likelihood as a function of the atlas hyperparameters:

$$C(L) = -\ln p(X_L \mid \psi) \tag{6}$$

where $X_L$ denotes the dataset features reordered according to $L$. For our models, $p(X_L \mid \psi)$ is given by the multivariate Student-t distribution (Murphy 2012). Any convenient cost function that is a monotonic function of the negative log likelihood is also suitable, such as

$$C(L) = -\ln p(X_L \mid \hat{\mu}, \hat{\Sigma})$$

where $\hat{\mu}$ and $\hat{\Sigma}$ are the maximum a posteriori parameters from the atlas posterior (4).
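The MAP-parameter variant of the cost reduces to a multivariate normal negative log likelihood, which can be sketched as (an illustration, assuming features are stacked per atlas cell; names are ours):

```python
import numpy as np

def labeling_cost(X, labeling, mu, cov):
    """Negative log likelihood cost of a labeling under MAP atlas
    parameters. X: (n, d) per-ROI features; labeling[i] = ROI index
    assigned to atlas cell i; mu: (n*d,) stacked atlas means;
    cov: (n*d, n*d) atlas covariance."""
    x = X[np.asarray(labeling)].ravel()   # reorder ROI features to atlas order
    r = x - mu
    _, logdet = np.linalg.slogdet(cov)
    quad = r @ np.linalg.solve(cov, r)    # Mahalanobis term
    return 0.5 * (quad + logdet + r.size * np.log(2 * np.pi))
```

Because the cost is monotonic in the negative log likelihood, comparing two labelings under this function agrees with comparing them under (6) when the atlas is well trained.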

Probabilistic output. The model output would ideally be the marginal probability $P_{ij}$ of ROI $j$ having cell label $i$, evaluated over all possible labelings:

$$P_{ij} = \frac{1}{Z} \sum_{L : L(i) = j} p(X_L \mid \psi), \qquad Z = \sum_{L} p(X_L \mid \psi) \tag{7}$$

where the sums run over the $N!$ possible labelings, and $p(X_L \mid \psi)$ is the marginal likelihood, i.e. integrated over $\theta$. The factorial scaling, however, makes this full evaluation intractable for more than a small number of cells. We presume, without proof, that for a well-fitting model, the terms in the summation for $P_{ij}$ are dominated by a comparatively small number of low-cost (high probability) labelings, such that a partial summation

$$\tilde{P}_{ij} = \frac{1}{\tilde{Z}} \sum_{L \in \mathcal{L}_K,\; L(i) = j} p(X_L \mid \psi) \tag{8}$$

where $\mathcal{L}_K$ is a set of $K$ low-cost labelings and $\tilde{Z}$ is analogous to the normalization above, is a reasonable estimate of (7). Note that in the $K = 1$ limiting case, with a single best-guess labeling, we recover the usual non-probabilistic labeling. In our probabilistic scheme, the cost global minimum (GM) labeling $L_{GM}$ has the most statistical weight in (7) and (8), and it is therefore essential to have an algorithm capable of finding it. In addition, however, we seek to sample other low-cost labelings such that the partial summation (8) can be used to approximate the fully marginalized probabilities (7).
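Since the cost is a negative log likelihood, the partial summation amounts to a softmax over negative costs of the sampled labelings. A minimal sketch (our own naming):

```python
import numpy as np

def approx_marginals(labelings, costs, n_roi):
    """Approximate P[i, j] = P(atlas cell i is ROI j) from a pool of
    low-cost legal labelings and their negative-log-likelihood costs."""
    costs = np.asarray(costs, dtype=float)
    w = np.exp(-(costs - costs.min()))   # relative likelihoods, stabilized
    w /= w.sum()                         # analogous to the normalization Z
    P = np.zeros((len(labelings[0]), n_roi))
    for L, wk in zip(labelings, w):
        for i, j in enumerate(L):
            P[i, j] += wk
    return P
```

Subtracting the minimum cost before exponentiating leaves the normalized weights unchanged but avoids numerical underflow when costs are large.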

V Proof-of-concept atlas: spring-mass model

Figure 3: Toy models with (A) 50, (B) 100, and (C) 302 cells. Systems (A) and (B) comprise cells (blue) and two stationary anchor points (gray), which are also included in labeling. System (C) comprises cells (blue spheres) used for labeling, as well as matrix points (orange spheres) and anchor points (gray spheres). In (C), thin lines distinguish cell-matrix or matrix-anchor connections from (thick) cell-cell connections.

We illustrate the construction and use of a probabilistic atlas for synthetic data. We generated a sequence of frames by simulating the dynamics of a spring-mass system, with each frame representing a new incoming dataset. This model is intended to capture the spatial variability of cells across experiments. The physical behavior of the toy model loosely mimics the phenomenon of non-overlapping cells jostling within a confined volume, with correlated motion due to deformations of the surrounding tissue. The cell feature set is limited to the $x$, $y$, and $z$ positions. In this model, the ground-truth labeling ($L_{GT}$) is known and the system's behavior is tunable, providing a convenient testbed. Note that in this proof-of-concept atlas, we assume each training frame is accurately labeled and contains the same number of ROIs as atlas cells.

The generative toy model comprises point masses in 2D or 3D space, representing cell centroids, connected by damped springs. To test different aspects of the atlas labeling scheme, we generated toy systems with 10, 50, 100, and 302 cells with varying positions and connectivity. The cells were embedded in different environments via spring connections to flexible “matrix” points and fixed “anchor” points.

The 10-cell system is 2D and sufficiently small that all labelings can be enumerated, thereby allowing comparison between the approximate (8) and exact (7) marginal probability outputs. The cells' coordinates were drawn from uniform distributions that allow for significant overlap, to test both atlas quality and probabilistic outputs.

The 50- and 100-cell systems (Figure 3A,B and Supplementary Movies 1-2) are 2D and exhibit large-amplitude correlated shifts in the cells' positions. The initial coordinates were generated by a branched growth scheme to give a lumpy but compact distribution of points in space.

The 302-cell system (Figure 3C and Supplementary Movie 3) is based partially on the empirical neuroanatomy of C. elegans. A set of 134 head neuron 3D positions derived from (White et al. 1986) was copied and translated along the axis corresponding to the worm head-tail axis to give a system of 302 cells with high aspect ratio. This system also incorporates regularly spaced anchor points as well as randomly positioned matrix particles, which are bound to both the cells and the anchors, mimicking the influence of surrounding tissue.

The spring-mass systems were simulated via Langevin dynamics (Schlick 2010, see Methods). The particles' $x$, $y$, and $z$ coordinates were snapshotted at fixed time intervals to reduce correlations between time-adjacent frames, and an initial segment of each trajectory was discarded to allow for equilibration of the dynamical behavior. Independent simulations, with different random seeds, were used for atlas training and labeling validation.
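A stripped-down version of such a generator can be sketched with an overdamped Euler-Maruyama integrator (a simplification of the damped Langevin scheme the paper uses, omitting the anchor/matrix machinery; all parameters here are illustrative):

```python
import numpy as np

def simulate(pos0, edges, k=1.0, rest=1.0, gamma=1.0, kT=0.05,
             dt=1e-3, n_steps=10000, stride=1000, seed=0):
    """Overdamped Langevin dynamics of a spring network. pos0: (n, d)
    initial coordinates; edges: list of (i, j) spring connections.
    Returns snapshots taken every `stride` steps."""
    rng = np.random.default_rng(seed)
    pos = pos0.copy()
    frames = []
    for step in range(1, n_steps + 1):
        F = np.zeros_like(pos)
        for i, j in edges:
            d = pos[j] - pos[i]
            r = np.linalg.norm(d)
            f = k * (r - rest) * d / r      # Hookean spring force on i
            F[i] += f
            F[j] -= f
        noise = np.sqrt(2 * kT * dt / gamma) * rng.standard_normal(pos.shape)
        pos = pos + dt * F / gamma + noise  # Euler-Maruyama step
        if step % stride == 0:
            frames.append(pos.copy())
    return np.array(frames)
```

The `stride` between snapshots plays the role of the fixed time intervals above, reducing correlations between time-adjacent frames.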

Figure 4: Cascading assignment errors using a CMV atlas and the Hungarian algorithm can be remedied using a FMV atlas and the PM scheme. (A) The minimum-cost labeling obtained using a CMV atlas for the 100-cell system. Ellipses represent the mean position and covariance (1σ contours) for each cell of the CMV atlas, and dots are cells to be labeled. Gray/red lines indicate correct/incorrect assignments, respectively. (B) Schematic view of the CMV/FMV cost landscapes and the PM search scheme. The HU labeling (red dot) is the cost GM of the CMV model (red curve), but not of the FMV model (orange dot). Starting from the HU labeling, the PM algorithm seeks to find the FMV GM by minimizing the FMV cost function (blue curve). Note that the vertical cost offset between FMV and CMV is arbitrary and the labeling cost landscapes are, in reality, discrete rather than continuous. (C) FMV labeling costs during PM, with semi-transparent blue circles and green diamonds representing the population pool of legal and illegal assignments, respectively. (D) Labeling accuracy (100 being perfect) during PM for the lowest cost legal labeling. In (C) and (D) the solid blue line and solid circles indicate the lowest cost legal assignment within the population. The accuracy of the lowest cost labeling in (D) is not guaranteed to increase monotonically. (E) The GT assignment determined using PM, where ellipses and dots are as in (A).

VI Labeling Algorithm

To perform atlas-guided cell labeling, we used the cost function (6), derived from the atlas (UV, CMV, or FMV), and an algorithm to search the combinatorial space of labelings. Depending on the atlas model, different schemes are available to solve the combinatorial labeling problem. For the UV and CMV atlas models, the probability factorizes and the cost function becomes a sum of single-cell assignment costs,

$$C(L) = \sum_{i=1}^{N} c_i(L(i)) \tag{9}$$

In these cases, the task of finding $L_{GM}$ equates to the well-known linear assignment problem (LAP). Deterministic LAP solvers, namely the Hungarian algorithm (HU) (Kuhn 1955), can reach the optimal solution in polynomial time. This straightforward scheme, however, does not guarantee that the result is the correct labeling $L_{GT}$, only that it is the optimum given the cost function. Figure 4 illustrates how $L_{GM}$ (for a CMV atlas) can contain assignment errors that 'cascade' throughout the structure.
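For the factorized case, an off-the-shelf LAP solver suffices; for example, using SciPy's Hungarian-style solver (the example cost matrix is illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def lap_labeling(cost):
    """cost[i, j] = single-cell cost of assigning ROI j to atlas cell i
    (Eq. 9); returns the globally optimal bipartite labeling, where
    labeling[i] is the ROI index for atlas cell i."""
    rows, cols = linear_sum_assignment(cost)
    return cols

cost = np.array([[0.1, 5.0, 5.0],
                 [5.0, 0.2, 5.0],
                 [5.0, 5.0, 0.3]])
```

The solver runs in polynomial time, but, as noted above, its optimality is only with respect to the supplied cost matrix, not the ground truth.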

These issues may be addressable with a full multivariate model (FMV) that captures correlations of features between cells. Unfortunately, the cost function for the FMV model does not factorize; thus the assignment problem is nonlinear. To approach this combinatorial optimization problem (GM search), we use a population cost minimization (PM) scheme that uses greedy and stochastic minimization steps as well as genetic algorithm-like mating operations. Non-bipartite ('illegal') assignments – where one ROI may hold two or more atlas cell labels while other ROIs are left unlabeled – can be generated as intermediates in this procedure, but they are 'legalized' by resolving missing and duplicate labels.

Figure 5: (A) Schematic of the PM algorithm. (B) The different moves used during PM. A labeling, shown for the mate move, is illustrated as two stacked rows where each column, an $(i, j)$ pair, indicates a single label assignment $L(i) = j$. The upper row of $i$ values is fixed, and is thus omitted for the rest of the illustrated labelings.

The PM scheme is carried out in cycles, where one cycle takes a population of labelings, performs intermediate steps, and returns a new population. Figure 5 illustrates one cycle (panel A) and its constituent labeling moves (panel B). Each cycle starts (step I) by creating a set of 32 new labelings, either by mating pairs or by copying individuals from the current population (with mating probability 50%). The new labelings then go through steps II-IV in series before the results are pooled (step V) and culled (step VI). The different intermediate steps/moves are summarized as follows:

  1. Mating: A child labeling is created where each element is drawn with a 50:50 chance from one of the two parent labelings. This introduces randomization and can potentially combine correct labels from disparate parent labelings, but can also create illegal assignments.

  2. Pairwise swaps (greedy): For each atlas index $i$ (order shuffled) of the labeling, the pairwise swap of $L(i)$ with $L(j)$ that gives the lowest cost (including the null swap) is selected. This step tests all pairwise swaps, requiring $O(N^2)$ cost function calls, assuming no heuristics are applied to winnow attempted swaps.

  3. Stochastic flips: One label is flipped, $L(i) \rightarrow j$, where $i$ and $j$ are both randomly drawn; the flip is accepted only if the cost decreases (1000 attempts per labeling).

  4. Legalization: For a labeling that is 'illegal' due to mating or stochastic flips, each duplicated label (shuffled order) is reassigned to the unused ROI index that gives the lowest cost. Legalization often increases cost and therefore does not directly drive minimization, but it converts illegal labelings back to legal ones so that the population of legal assignments can advance.

  5. Pooling: Pooling combines the input population with the labelings returned from steps I-IV and removes any duplicates.

  6. Culling: Culling is carried out on the pool using a modified cost function $C'(L) = C(L) + \lambda (N - u)$, where $\lambda = 4$ and $u$ is the number of unique ROI indices in $L$. The added term does not influence legal assignment costs ($u = N$) but adds a growing penalty for re-used ROI indices ('illegality') in order to prevent exceedingly non-physical, low-cost labelings from misdirecting the population. With the pool ranked by $C'$, the new population is built from the set union of: (i) the 24 best labelings (illegal or legal) and (ii) the 12 best legal labelings.

  7. Iterating: (Optional) The next cycle is carried out with the new population as input.
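One PM cycle can be sketched compactly as follows. This is our own simplified reading of steps I-VI, not the authors' implementation; parameter defaults mirror the text where stated, and the cost function is supplied by the caller:

```python
import numpy as np

def pm_cycle(pop, cost, n_roi, rng, n_children=32, p_mate=0.5,
             n_flips=1000, lam=4.0):
    """One PM cycle. pop: list of labelings (int arrays, labeling[i] = ROI
    for atlas cell i); cost: callable on a labeling. Returns the culled
    next population."""
    def illegality(L):
        return lam * (len(L) - len(np.unique(L)))   # lam * (N - u)

    children = []
    for _ in range(n_children):
        if len(pop) > 1 and rng.random() < p_mate:        # I. mate or copy
            a, b = rng.choice(len(pop), size=2, replace=False)
            child = np.where(rng.random(len(pop[0])) < 0.5, pop[a], pop[b])
        else:
            child = pop[rng.integers(len(pop))].copy()

        for i in rng.permutation(len(child)):             # II. greedy swaps
            best, best_c = i, cost(child)
            for j in range(len(child)):
                child[i], child[j] = child[j], child[i]
                c = cost(child)
                if c < best_c:
                    best, best_c = j, c
                child[i], child[j] = child[j], child[i]   # undo trial swap
            child[i], child[best] = child[best], child[i]

        for _ in range(n_flips):                          # III. stochastic flips
            i, j = rng.integers(len(child)), rng.integers(n_roi)
            old, c0 = child[i], cost(child)
            child[i] = j
            if cost(child) >= c0:
                child[i] = old                            # reject the flip

        unused = list(set(range(n_roi)) - set(child))     # IV. legalization
        seen = set()
        for i in rng.permutation(len(child)):
            if child[i] in seen and unused:
                best_j, best_c = None, np.inf
                for j in unused:
                    child[i] = j
                    if cost(child) < best_c:
                        best_j, best_c = j, cost(child)
                child[i] = best_j
                unused.remove(best_j)
            seen.add(child[i])
        children.append(child)

    pool = {tuple(L) for L in pop} | {tuple(L) for L in children}  # V. pool
    ranked = sorted((np.array(L) for L in pool),
                    key=lambda L: cost(L) + illegality(L))         # VI. cull
    legal = [L for L in ranked if illegality(L) == 0]
    keep = {tuple(L) for L in ranked[:24]} | {tuple(L) for L in legal[:12]}
    return [np.array(L) for L in keep]
```

With a factorizing toy cost the cycle rapidly recovers the optimum; with an FMV cost the same moves apply unchanged, since only the callable differs.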

The PM scheme was seeded with the HU labeling and the naive Bayes labeling (which minimizes each term of the linear cost sum in Equation 9 independently, irrespective of legality) from the corresponding CMV model. Only legal labelings are considered valid output, e.g. for use in Equation 7 or Equation 8.

Figure 6: (A-C) Atlas-guided labeling accuracies. The systems with N = 50, 100, and 302 cells all had validation sets of 250 frames. (A) The mean labeling accuracy. Error bars are 95% confidence intervals. (B) The minimum accuracy. The y-axis limits differ between (A) and (B) because the minimum accuracies span a larger range than the means. (C) The GT hit fraction. Horizontal gray lines in (A)-(C) indicate the theoretical maximum (either N or 1), and the legends in row (A) apply to all other panels in (B)-(C). (D) The number of PM cycles required (a minimum of three cycles was always run) for frames in the validation set.

Results. Here, we compare the GM labelings found using the different models (UV/CMV/FMV) and labeling schemes (HU/PM). For the UV/CMV models, the HU solution is guaranteed to be the GM, whereas the PM search using the FMV model is not guaranteed to find the GM. In the latter case, the best labelings found are considered putative global minima.

To quantify the accuracy of a labeling $L$ relative to $L_{GT}$ we use two metrics,

$$\mathrm{acc}(L) = \sum_{i=1}^{N} \mathbb{1}[L(i) = L_{GT}(i)], \qquad \mathrm{hit}(L) = \mathbb{1}[L = L_{GT}]$$

which are the number of correct labels and a test for an exact match to $L_{GT}$, respectively. When averaged over a set of validation frames, the former gives the mean labeling accuracy and the latter the GT hit fraction.
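The two metrics above are straightforward to compute (naming is ours):

```python
import numpy as np

def accuracy_metrics(labeling, gt):
    """Return (n_correct, gt_hit): the number of correct labels, and 1 if
    the labeling exactly matches the ground truth, else 0."""
    labeling, gt = np.asarray(labeling), np.asarray(gt)
    n_correct = int((labeling == gt).sum())
    gt_hit = int(n_correct == gt.size)
    return n_correct, gt_hit
```

Averaging `n_correct` and `gt_hit` over validation frames yields the mean labeling accuracy and the GT hit fraction, respectively.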

In Figure 6 we plot the mean labeling accuracy, the minimum accuracy (the least accurate labeling in the entire validation set), and the GT hit fraction as a function of how many frames were used for atlas training. For the 50- and 100-cell systems, the FMV model has equal or superior performance to the UV and CMV models. This shows that the correlations between cells' positions, learned by the FMV model, carry additional, useful information for cell labeling. The gain in accuracy is most evident in the minimum accuracy, reflecting the fact that the UV and CMV models are susceptible, on occasion, to large assignment errors that cascade through a structure, whereas the FMV model is not. Figure 4C shows one such case, where the CMV-HU labeling has an accuracy of 81/100. Using the FMV model and the PM algorithm, however, the population reaches lower cost labelings and eventually finds the ground truth after 11 cycles. Despite the high mean accuracy for the 302-cell system with a well-trained atlas, the fact that the GT hit fraction remains below its theoretical maximum suggests that perfect labeling accuracy remains a challenging goal for an atlas using only position data.

For the 302-cell system, the FMV model requires a considerable amount of training, >4000 frames, before it outperforms the CMV and UV models on all three accuracy metrics. The high training cost for the FMV model is partly attributable to the need to learn a covariance matrix for 302 cells in 3D (906 x 906), whereas the 2D 100-cell system has only a 200 x 200 covariance matrix. Moreover, unlike the 50- and 100-cell systems, where the cells form a compact cluster, the cells in the 302-cell system form many smaller clusters that are weakly coupled. In this latter case, it is to be expected that learning the high-dimensional covariance structure requires more training data. If the covariance structure of real data is similarly high-dimensional, this suggests that a FMV atlas will require a large amount of training data, spanning multiple experiments.

For the FMV results reported above, the PM scheme always converged to a legal labeling, which suggests that the combinatorial search is tractable for systems of this size. As Figure 6D shows, this typically required 20 or fewer cycles; however, in a handful of cases (<1%) with very inaccurate seed labelings, upwards of 200 cycles were required. Each PM cycle for the N = 302 system required around three minutes on a single CPU core (Intel i9-7940x).

VII Generation of Probabilistic Output

We now turn to the task of producing probabilistic output, as in Figure 1, computed using labelings found during the PM global minimum search or, for the 10-cell system, computed using all possible labelings.

In the toy system with 10 point masses, we constructed a matrix of marginal probabilities by fully evaluating all possible labelings and compared it to marginal probabilities estimated from only the 10 lowest-cost labelings (Figure 7). While increasing the number of labelings evaluated does provide a more complete picture of the model output (Figure 7B), the vast majority of labelings have negligibly small probabilities and do not significantly change the output (Figure 7C). In this illustrative example, sampling as few as 8 low-cost labelings yields a maximum error below 0.01%, which increases our confidence in reporting truncated marginal probabilities (Figure 7D).

Figure 7: Evaluation of marginalized probabilities, as defined in Equation 7, on the toy system containing 10 cells. (A) One frame of the system. The points represent an instance of an incoming dataset, and ellipses represent single-cell variances (1σ contours). (B) The log marginalized probability of an atlas cell (row) being assigned to a dataset cell (column), evaluated on the 10 lowest-cost labelings (left) and on all labelings (right). (C) The same log marginalized probabilities as in (B), thresholded such that only values above a cutoff are reported. Note that there are no clearly observable differences between these cases. (D) The maximum absolute error between fully evaluated and truncated marginal probabilities. Evaluating only 8 low-cost labelings (gray dashed line) is sufficient to have a maximum error below 0.01%.

Figure 8 shows one case, for a frame of the 302-cell system, where a probabilistic labeling report was computed using Equation 8. In this case, both the CMV-HU and FMV-PM labelings matched $L_{GT}$, but a group of neighboring cells (70, 85, 90) had shifted positions relative to their atlas means such that the labeling looks questionable. Without knowing $L_{GT}$, the table in Figure 8C captures the uncertainty in the labeling.

Figure 8: Probabilistic labeling. (A) One frame of the 302-cell system where both the CMV-HU and FMV-PM minimum cost labelings are the GT. (B) ROIs (cells) 70, 85, and 90, which have shifted positions relative to their atlas means. In (A) and (B) the formatting is as in Figure 4. (C) The probabilities computed using Equation 8 with labelings collected during the PM search. Only labelings contributing non-negligible statistical weight were included when computing the values. Labels not shown in the table were certain.

VIII Multi-Feature Atlas

Lastly, we tested what labeling accuracy might be achievable using an atlas with two feature types: position and cell fluorescence intensity. If cells' fluorescence intensities have a consistent dependence on cell identity across animals, are spatially uncorrelated, and are not too noisy, they could help distinguish neighboring cells and improve labeling performance (Figure 9D). For this test, we re-used the 50-cell system simulations and added a fluorescence intensity to each cell. Intensities were drawn from a uniform distribution across cells, with Gaussian noise added in each frame, as shown in Figure 9E.

In the well-trained limit, all three accuracy metrics increase with the addition of the intensity feature. For the UV and CMV models, the considerable increase in the minimum accuracy shows that large misassignment errors are prevented. The GT hit fractions also increase.

The FMV model is inaccurate when only 10-40 training frames are used. We attribute this to the model initially learning spurious correlations between cell positions and fluorescence intensities. With sufficient training, however, these correlations fade and the FMV model outperforms the UV and CMV models. With 640 or more training frames, the FMV model has near perfect labeling accuracy: 247/250 validation frames are perfectly labeled and the other 3/250 have a single erroneous pairwise flip.

This test suggests that an extra feature uncorrelated with other features could substantially increase atlas labeling performance for real data. Multi-feature datasets could also be used to produce labeled datasets to train a position-only atlas for other experimental contexts lacking intensity data. We suspect that additional fluorescence intensity channels could further increase prediction accuracy.

Figure 9: Boost in labeling accuracy for a position + fluorescence intensity atlas. (A) Mean labeling accuracy. (B) Minimum labeling accuracy. (C) GT hit fraction. In (A),(B) and (C), the legends and formatting are as in Figure 6. (D) Mean cell positions shaded by mean fluorescence intensity. (E) Cell fluorescence intensities (sorted). Error bars denote two standard deviations.

IX Discussion and challenges

We have proposed a machine learning framework to train a probabilistic cell atlas for use in computing probabilistic cell labelings for video or still imaging data. We chose a standard Bayesian parametric probability model, although non-parametric and non-Bayesian approaches could be considered as long as they generate a probabilistic labeling. We used synthetic models of point-masses connected by springs to simulate the biological variability of cells across organisms and experiments, and the collective movements caused by deformations of the surrounding tissue. Using this dataset, we compared different probabilistic atlas models (UV, CMV, FMV) and labeling schemes (HU, PM) for their labeling accuracy.

We found that a well-trained FMV atlas that models the covariance between cell positions outperforms the simpler models (UV/CMV) that do not track inter-cell covariance. Standard linear assignment solvers could not be used with the FMV model, due to its non-additive cost function; we therefore developed an algorithm (PM) to carry out the combinatorial labeling search for the FMV atlas.
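For the additive-cost case, a standard solver such as SciPy’s Hungarian-algorithm implementation suffices; a minimal sketch with a toy isotropic-Gaussian position cost (illustrative data, not the paper’s exact model):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)
n = 6

# Toy atlas means and observed cell positions.
atlas_mu = rng.uniform(0.0, 10.0, size=(n, 2))
observed = atlas_mu + rng.normal(0.0, 0.1, size=(n, 2))

# Additive cost of assigning ROI j to label i; for an isotropic Gaussian
# this is proportional to the squared distance to the atlas mean.
cost = ((observed[None, :, :] - atlas_mu[:, None, :]) ** 2).sum(axis=-1)

row, col = linear_sum_assignment(cost)  # minimum-cost labeling
```

A non-additive cost that couples assignments across cells cannot be written as such a matrix, which is why a combinatorial search like PM is needed for the FMV atlas.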

Using an atlas trained only with cell positions, we found that the single best-guess labeling was not reliably accurate. However, an atlas that also tracked fluorescence intensity yielded an increase in labeling accuracy, suggesting that utilization of multiple cell features may yield significantly higher performance for real data.

The limited predictive accuracy of a single labeling permutation suggests the need for a probabilistic output that reports label guesses and their uncertainties. We illustrated a means to compute such a probabilistic labeling output. The accuracy of this probabilistic output depends on the extent to which the low-cost labelings (including the GT labeling) are identified and how well the atlas has been trained. For the synthetic case, probabilistic outputs considering the 8 lowest-cost labelings were within <0.01% error of outputs evaluated on all labelings. This suggests, but does not prove, that a reasonably accurate report for the larger systems should not require an intractable amount of combinatorial sampling. For the larger system, probabilistic assignments as in Figure 8 might improve from a more comprehensive search for low-cost labelings beyond those collected during the PM search. Ensemble sampling methods such as Wang-Landau sampling (Wang and Landau 2001) or simulated tempering (Marinari and Parisi 1992) could potentially be applied.
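Assuming the probabilistic output weights each collected labeling by a Boltzmann factor exp(−cost) normalized over the collected set (our reading of equation (8); the labelings and costs below are toy values), per-ROI label probabilities can be sketched as:

```python
import numpy as np

# A handful of collected low-cost labelings (permutations: ROI -> label)
# and their costs (toy values).
labelings = [(0, 1, 2), (1, 0, 2), (0, 2, 1)]
costs = np.array([1.0, 3.0, 4.0])

# Boltzmann weights, normalized over the collected labelings only.
w = np.exp(-(costs - costs.min()))  # shift by min for numerical stability
w /= w.sum()

# Marginal probability that ROI j carries label l.
n = 3
p = np.zeros((n, n))
for weight, perm in zip(w, labelings):
    for roi, label in enumerate(perm):
        p[roi, label] += weight
```

Each row of p then sums to one, giving a per-ROI distribution over labels rather than a single best guess.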

In the current scheme, our model is trained on perfectly labeled frames containing the same number of ROIs as atlas labels. We do not anticipate this to be the case for data from real experiments, which can be expected to have both missing and false-positive ROIs, as well as partial and erroneous labels. Partially supervised learning is an active area of machine learning research.

Furthermore, we assumed that our training data were pre-registered into a common coordinate system. In practice, depending on experimental context, datasets will need to be pre-processed in a number of ways, each of which could influence labeling outcomes. Achieving high-quality registration is an open challenge.

The combinatorial search over labelings and the evaluation of multivariate probability models carry a large computational cost. This cost could be reduced by building a more restricted model, tailored to a particular system, that leverages locality in physical or feature space. This may allow the use of partial multivariate models that require fewer computations to train and evaluate, as well as efficient means of combinatorially searching labelings. Additionally, approaches that relax the combinatorial search over discrete labelings into a continuous optimization problem (Linderman et al. 2017, Mena et al. 2018) may yield computationally efficient probabilistic labeling algorithms.
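To illustrate the relaxation idea (a generic Sinkhorn sketch, not the cited papers’ full method): temperature-scaled negative costs are alternately row- and column-normalized, yielding a doubly stochastic “soft assignment” matrix.

```python
import numpy as np

def sinkhorn(cost, temperature=0.1, n_iters=50):
    """Relax a hard assignment into a doubly stochastic matrix (log-domain)."""
    log_p = -cost / temperature
    for _ in range(n_iters):
        log_p = log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True)  # rows
        log_p = log_p - np.logaddexp.reduce(log_p, axis=0, keepdims=True)  # cols
    return np.exp(log_p)

rng = np.random.default_rng(2)
soft = sinkhorn(rng.uniform(size=(5, 5)))
```

As the temperature is lowered, the soft assignment approaches a hard permutation matrix, allowing gradient-based optimization in place of discrete search.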

The selection of feature vectors and probability distribution family are crucial design decisions of the atlas curator; therefore, we advocate storage of datasets along with the parameterized distribution so that more sophisticated models can be tried later or formal model selection can be performed.

The general probabilistic scheme described here can also be applied to cell typing, rather than cell ID, although the choice of feature vectors, probability distribution families, and labeling algorithms is likely to be different.


Each spring-mass system consisted of point-masses with Cartesian coordinates connected by springs. Given the initial coordinates, spring connections were generated to give each point a minimum connectivity to its nearest neighbors. The system’s potential energy is a sum over connected pairs (springs) (i, j):

U = \sum_{(i,j)} \left[ \tfrac{k}{2}\,(d_{ij} - d^{0}_{ij})^{2} + a\,e^{-d_{ij}/b} \right]

where d_{ij} = \lVert x_i - x_j \rVert and d^{0}_{ij} is a spring’s relaxed distance taken from the initial configuration (here i and j index the point-masses within a simulation, and are not to be confused with the previous labeling notation). The exponential term mimics short-range inter-cell repulsion and prevents unphysical overlap between connected points. The potential energy parameters for the different toy systems are given in Table II.
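A hedged Python sketch of such a harmonic-springs-plus-repulsion potential (the function name and the parameters k, a, b are illustrative, not the paper’s notation or its Table II values):

```python
import numpy as np

def potential_energy(x, springs, d0, k=10.0, a=2.0, b=1.0):
    """Sum of harmonic spring terms and short-range exponential repulsion.

    x       : (N, 3) array of point coordinates
    springs : iterable of connected index pairs (i, j)
    d0      : dict mapping (i, j) -> relaxed spring length
    k, a, b : illustrative spring, repulsion-strength, and range parameters
    """
    U = 0.0
    for i, j in springs:
        d = np.linalg.norm(x[i] - x[j])
        U += 0.5 * k * (d - d0[(i, j)]) ** 2  # harmonic spring
        U += a * np.exp(-d / b)               # short-range repulsion
    return U
```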

Spring-mass system dynamics were integrated using the Langevin equation

\ddot{x}_i = -\nabla_{x_i} U - \gamma\,\dot{x}_i + \sqrt{2\gamma T}\,\eta_i(t)

and a Runge-Kutta 4th-order integrator with a timestep of \Delta t = 0.1, where \eta(t) is a stationary Gaussian process, all masses are set to unity, T is the reduced temperature, and \gamma = 0.2 provides damping. The system coordinates, snapshotted every 200 integration steps, were used as input for atlas training and validation.
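A minimal sketch of one Langevin update with unit masses (Euler-Maruyama for brevity, whereas the simulations used an RK4 scheme; `force_fn` is a hypothetical callable returning the force, i.e. the negative potential gradient):

```python
import numpy as np

def langevin_step(x, v, force_fn, dt=0.1, gamma=0.2, kT=1.0, rng=None):
    """One Euler-Maruyama Langevin step for unit-mass particles."""
    if rng is None:
        rng = np.random.default_rng()
    # Fluctuation-dissipation: noise amplitude tied to damping and temperature.
    noise = np.sqrt(2.0 * gamma * kT * dt) * rng.normal(size=x.shape)
    v = v + dt * (force_fn(x) - gamma * v) + noise
    x = x + dt * v
    return x, v
```

Snapshotting the coordinates every few hundred such steps would mimic the frame-generation procedure described above.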

N = 50:  1.6, 10, 20, 2, 7
N = 100: 1.6, 10, 20, 2, 7
N = 302: 1.7, 0.05, 20, 0.2, 8
Table 2: Spring-mass simulation parameters (each row lists the parameter values for the system of size N)

All code was implemented in Python and run on an Ubuntu 18.04 LTS laptop (Intel i7-7560U) or a Windows 10 Home desktop (Intel i9-7940X) and will be available in the focolab.org code repository. The molecular visualization package PyMOL (Schrödinger, LLC 2015) was used to visualize the spring-mass simulations.


We thank Sean Escola and Raymond L. Dunn for valuable discussion. This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R35GM124735 and UCSF grant PBBR-NFR-7028580. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.


  • Ahrens et al. [2013] M. B. Ahrens, M. B. Orger, D. N. Robson, J. M. Li, and P. J. Keller. Whole-brain functional imaging at cellular resolution using light-sheet microscopy. Nature Methods, 10:413–420, 2013.
  • Essen and Dierker [2007] D. C. V. Essen and D. L. Dierker. Surface-based and probabilistic atlases of primate cerebral cortex. Neuron, 56(2):209 – 225, 2007.
  • Kainmueller et al. [2014] D. Kainmueller, F. Jug, C. Rother, and G. Myers. Active graph matching for automatic joint segmentation and annotation of C. elegans. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2014, pages 81–88, Cham, 2014. Springer International Publishing.
  • Kato et al. [2015] S. Kato, H. S. Kaplan, T. Schrödel, S. Skora, T. H. Lindsay, E. Yemini, S. Lockery, and M. Zimmer. Global brain dynamics embed the motor command sequence of Caenorhabditis elegans. Cell, 163(3):656–669, 2015.
  • Kuhn [1955] H. Kuhn. The Hungarian method for the assignment problem. Naval Res. Logist. Quart., 2:83–98, 1955.
  • Linderman et al. [2017] S. Linderman, G. Mena, H. Cooper, L. Paninski, and J. Cunningham. Reparameterizing the Birkhoff polytope for variational permutation inference. arXiv:1710.09508, 2017.
  • Long et al. [2009] F. Long, H. Peng, X. Liu, S. K. Kim, and E. Myers. A 3D digital atlas of C. elegans and its application to single-cell analyses. Nature Methods, 6:667–672, 2009.
  • Marinari and Parisi [1992] E. Marinari and G. Parisi. Simulated tempering: A new Monte Carlo scheme. Europhysics Letters (EPL), 19(6):451–458, 1992.
  • Mazziotta et al. [2001] J. Mazziotta, A. Toga, A. Evans, P. Fox, K. Zilles, R. Woods, T. Paus, G. Simpson, B. Pike, C. Holmes, L. Collins, P. Thompson, D. MacDonald, M. Iacoboni, T. Schormann, K. Amunts, N. Palomero-Gallagher, S. Geyer, L. Parsons, K. Narr, N. Kabani, G. Le Goualher, D. Boomsma, T. Cannon, R. Kawashima, and B. Mazoyer. A probabilistic atlas and reference system for the human brain: International Consortium for Brain Mapping (ICBM). Philos Trans R Soc Lond B Biol Sci, 356(1412):1293–1322, 2001.
  • Mena et al. [2018] G. Mena, D. Belanger, S. Linderman, and J. Snoek. Learning Latent Permutations with Gumbel-Sinkhorn Networks. arXiv:1802.08665, 2018.
  • Murphy [2012] K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
  • Myronenko and Song [2010] A. Myronenko and X. Song. Point Set Registration: Coherent Point Drift. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(12):2262–2275, 2010.
  • Prevedel et al. [2014] R. Prevedel, Y.-G. Yoon, M. Hoffmann, N. Pak, G. Wetzstein, S. Kato, T. Schrödel, R. Raskar, M. Zimmer, E. S. Boyden, and A. Vaziri. Simultaneous whole-animal 3d imaging of neuronal activity using light-field microscopy. Nature Methods, 11:727 –730, 2014.
  • Schlick [2010] T. Schlick. Molecular modeling and simulation: an interdisciplinary guide, volume 21. Springer Science & Business Media, 2010.
  • Schrödinger, LLC [2015] Schrödinger, LLC. The PyMOL molecular graphics system, version 1.8. 2015.
  • Skocek et al. [2018] O. Skocek, T. Nöbauer, L. Weilguny, F. Martínez Traub, C. N. Xia, M. I. Molodtsov, A. Grama, M. Yamagata, D. Aharoni, D. D. Cox, P. Golshani, and A. Vaziri. High-speed volumetric imaging of neuronal activity in freely moving rodents. Nature Methods, 15(6):429–432, 2018.
  • Toga et al. [2006] A. W. Toga, P. M. Thompson, S. Mori, K. Amunts, and K. Zilles. Towards multimodal atlases of the human brain. Nat Rev Neurosci, 7(12):952–966, 2006.
  • Wang and Landau [2001] F. Wang and D. P. Landau. Efficient, multiple-range random walk algorithm to calculate the density of states. Phys. Rev. Lett., 86:2050–2053, 2001.
  • White et al. [1986] J. G. White, E. Southgate, J. N. Thomson, and S. Brenner. The Structure of the Nervous System of the Nematode Caenorhabditis elegans. Philosophical Transactions of the Royal Society B: Biological Sciences, 314(1165):1–340, 1986.
  • Wu et al. [2018] S. Wu, Y. Toyoshima, M. S. Jang, M. Kanamori, T. Teramoto, Y. Iwasaki, T. Ishihara, Y. Iino, and R. Yoshida. An ensemble learning approach to auto-annotation for whole-brain C. elegans imaging. bioRxiv, 2018. doi: 10.1101/180430.