Classification of crystallization outcomes using deep convolutional neural networks

Andrew E. Bruno Center for Computational Research, University at Buffalo, Buffalo, New York, United States of America.    Patrick Charbonneau Department of Chemistry, Duke University, Durham, North Carolina, USA Department of Physics, Duke University, Durham, North Carolina, USA    Janet Newman Collaborative Crystallisation Centre CSIRO, Parkville, Victoria, Australia.    Edward H. Snell Hauptman-Woodward Medical Research Institute and SUNY Buffalo, Department of Materials, Design, and Innovation, Buffalo, New York 14203, United States of America.    David R. So Google Brain, Google Inc., Mountain View, California, United States of America.    Vincent Vanhoucke Google Brain, Google Inc., Mountain View, California, United States of America.    Christopher J. Watkins IM&T Scientific Computing, CSIRO, Clayton South, Victoria, Australia    Shawn Williams Platform Technology and Sciences, GlaxoSmithKline Inc., Collegeville, Pennsylvania, United States of America.    Julie Wilson Department of Mathematics, University of York, York, United Kingdom.
August 1, 2019

The Machine Recognition of Crystallization Outcomes (MARCO) initiative has assembled roughly half a million annotated images of macromolecular crystallization experiments from various sources and setups. Here, state-of-the-art machine learning algorithms are trained and tested on different parts of this data set. We find that more than 94% of the test images can be correctly labeled, irrespective of their experimental origin. Because crystal recognition is key to high-density screening and the systematic analysis of crystallization experiments, this approach opens the door to both industrial and fundamental research applications.
Author summary: Protein crystal growth experiments are routinely imaged, but the mass of accumulated data is difficult to manage and analyze. Using state-of-the-art machine learning algorithms on a large and diverse set of reference images, we manage to recapitulate the labels of a remarkably large fraction of the set. This automation should enable a number of industrial and fundamental applications.

I Introduction

X-ray crystallography provides the atomic structure of molecules and molecular complexes. These structures in turn provide insight into the molecular driving forces for small molecule binding, protein-protein interactions, supramolecular assembly and other biomolecular processes. The technique is thus foundational to molecular modeling and design. Beyond the obvious importance of structure information for understanding and altering the role of biomolecules, it also has important industrial applications. The pharmaceutical industry, for instance, uses structures to guide chemistry as part of a “predict first” strategy Harrison et al. (2017), employing expert systems to reduce optimization cycle times and more effectively bring medicine to patients. Yet, despite decades of methodological advances, crystallizing molecular targets of interest remains the bottleneck of the entire crystallography program in structural biology.

Even when crystallization is facile, it is microscopically rare; for macromolecules it is also uncommon McPherson (1999); Chayen (2004); Fusco and Charbonneau (2016); Ng et al. (2016). Experimental trials typically involve: (i) mixing a purified sample with chemical cocktails designed to promote molecular association, (ii) generating a supersaturated solution of the desired molecule via evaporation or equilibration, and (iii) visually monitoring the outcomes, before (iv) optimizing those conditions and analyzing the resultant crystal with an X-ray beam. One hopes for the formation of a crystal instead of non-specific (amorphous) precipitates or of nothing at all. In order to help run these trials, commercial crystallization screens have been developed; each screen generally contains 96 formulations designed to promote crystal growth. Whether these screens are equally effective or not Ng et al. (2016); Fazio et al. (2015) remains debated, but their overall yield is in any case paltry. Typically fewer than 5% of crystallization attempts produce useful results (with a success rate as low as 0.2% in some contexts Newman et al. (2012)).

The practical solution to this hurdle has been to increase the convenience and number of crystallization trials. To offset the expense of reagents and scientist time, labs routinely employ industrial robotic liquid handlers, nanoliter-size drops, and record trial outcomes using automated imaging systems Kotseruba et al. (2012); Newman (2011); Zhang et al. (2017); Ng et al. (2016); Thielmann et al. (2012). Hoping to compensate for the rarity of crystallization, commercially available systems readily probe a large area of chemical space with minimal sample volume with a throughput of individual experiments per hour.

While liquid handling is readily automated, crystal recognition is not. Imaging systems may have made viewing results more comfortable than bending over a microscope, but crystallographers still manually inspect images and/or drops, looking for crystals or, more commonly, conditions that are likely to produce good crystals when optimized. This human cost makes crystal recognition a key experimental bottleneck within the larger challenge of crystallizing biomolecules Newman et al. (2012). A typical experiment for a given sample includes four 96-well screens at two temperatures, i.e., 768 conditions (and can have up to twice that Snell et al. (2008a)). Assuming that it takes 2 seconds to manually scan a droplet (and noting that the scans have to be repeated, as crystallization is time dependent), simply looking at a single set of 96 trials over the lifetime of an experiment can take the better part of an hour [Footnote 1: This estimate is based on personal communication with five experienced crystallographers at GlaxoSmithKline: 2 seconds/observation × 8 observations × 96 wells. Note that current technology can automatically store and image plates at about 3 min/plate.]. For the sake of illustration, the U.S. Structural Science group at GlaxoSmithKline performs 96-well experiments per year. If the targeted observation schedule were rigorously followed, the group would spend a quarter of the year staring at drops, of which the vast majority contains no crystal. Recording outcomes and analyzing the results of the 96 trials would further increase the time burden. Current operations are already straining existing resources, and the approach simply does not scale for proposed higher-density screening Zhang et al. (2017).

Crystal growth is also sufficiently uncommon that the tolerance for false negatives is almost nil. Yet most crystallographers are misguided in thinking that they themselves would never miss identifying a crystal given an image containing a crystal, or indeed miss a crystal in a droplet viewed directly under a microscope Wilson (). In fact, not only do crystallographers miss crystals due to lack of attention through boredom, they often disagree on the class an image should be assigned to. An overall agreement rate of was found when the classes assigned to 1200 images by 16 crystallographers were compared Wilson (). (When considering only crystalline outcomes, agreement rose to .) Consistency in visual scoring was also considered by Snell et al. when compiling a image dataset Snell et al. (2008b). They found that viewers give different scores to the same image on different occasions during the study, with the average agreement rate for scores on a control set at the beginning and middle of the study being 77%, rising to 84% for the agreement in scores between the middle and end of the study. Crystallographers also tend to be optimistically biased when scoring their own experiments Hargreaves (). A better use of expert time and attention would be to focus on scientific inquiry.

An algorithm that could analyze images of drops, distinguish crystals from trivial outcomes, and reduce the effort spent cataloging failure, would present clear value both to the discipline and to industry. Ideally, such an algorithm would act like an experienced crystallographer in:

  • recognizing macromolecular crystals appropriate for diffraction experiments;

  • recognizing outcomes that, while requiring optimization, would lead to crystals for diffraction experiments;

  • recognizing non-macromolecular crystals;

  • ignoring technical failures;

  • identifying non-crystalline outcomes that require follow up;

  • being agnostic as to the imaging platform used;

  • being indefatigable and unbiased;

  • occurring in a time frame that does not impede the process;

  • learning from experience.

Such an algorithm would further reduce the variance in the assessments, irrespective of its accuracy. A high-variance, manual process is not conducive to automating the quality control of the system end-to-end, including the imaging equipment. Enhanced reproducibility enables traceability of the outcomes, and paves the way for putting in place measurable, continuous improvement processes across the entire imaging chain.

Automated crystallization image classifications that attempt to meet the above criteria have been previously attempted. The research laboratories that first automated crystallization inspection quickly realized that image analysis would be a huge problem, and concomitantly developed algorithms to interpret them Spraggon et al. (2002); Cumbaa and Jurisica (2005); Kawabata et al. (2008); Buchala and Wilson (2008). None of these programs was ever widely adopted. This may have been due in part to their dependence on a particular imaging system, and to the relatively limited use of imaging systems at the time. Many of the early image analysis programs further required very time-consuming collation of features and significant preprocessing, e.g., drop segmentation to locate the experimental droplet within the image. To the best of our knowledge, there was also no widespread effort to make a widely available image analysis package in the same way that the diffraction-oriented programs have been organized, e.g., the CCP4 package Winn et al. (2011).

Can a better algorithm be constructed and trained? In order to help answer this question, the Machine Recognition of Crystallization Outcomes (MARCO) initiative was set up Mar (2017). MARCO assembled a set of roughly half a million classified images of crystallization trials through an international collaboration with five separate institutions. Here, we present a machine-learning based approach to categorize these images. Remarkably, the algorithm we employ manages to obtain an accuracy exceeding 94%, which is even above what was once thought possible for human categorization. This suggests that a deployment of this technology in a variety of laboratory settings is now conceivable. The rest of this paper is organized as follows. Section II describes the dataset and the scoring scheme, Sec. III describes the machine-learning model and training procedure, Secs. IV and V describe and discuss the results, respectively, and Sec. VI briefly concludes.

II Material and Methods

Image Data

Institution | Technical Setup | # of Images
Bristol-Myers Squibb | Formulatrix Rock Imager (FRI) | 8,719
CSIRO | Sitting drop, FRI, Rigaku Minstrel Vallotton et al. (2010); Rosa et al. () | 15,933
HWMRI | Under oil, Home system Snell et al. (2008b) | 79,632
GlaxoSmithKline | Sitting drop, FRI | 83,126
Merck | Sitting drop, FRI | 305,804
Table 1: Breakdown of data sources and imaging technology per institution contributing to MARCO.

The MARCO data set used in this study contains 493,214 scored images from five institutions (see Table 1 Mar (2017)). The images were collected from imagers made by two different manufacturers (Rigaku Automation and Formulatrix), which have different optical systems, as well as by the in-house imaging equipment built at the Hauptman-Woodward Medical Research Institute (HWMRI) High-Throughput Crystallization Center (HTCC). Different versions of the setups were also used – some Rigaku images are collected with a true color camera, some are collected as greyscale images. The zoom extent varies, with some imagers set up to collect a field-of-view (FOV) of only the experimental droplet, and some set for the FOV to encompass a larger area of the experimental setup. The Rigaku and Formulatrix automation imaged vapor diffusion based experiments while the HTCC system imaged microbatch-under-oil experiments. A random selection of 50,284 test images was held out for validation. Images in the test set were not represented in the training set. The precise data split is available from the MARCO website Mar (2017).


Images were scored by one or more crystallographers. As the dataset is composed of archival data, no common scoring system was imposed, nor were exemplar images distributed to the various contributors. Instead, existing scores were collapsed into four comprehensive and fairly robust categories: clear, precipitate, crystal, and other. This last category was originally used as a catchall for images not obviously falling into the three major classes, and came to assume a functional significance as the classification process was further investigated. Examination of the least classifiable five percent of images indeed revealed many instances of process failure, such as dispensing errors or illumination problems. These uninterpretable images were then labelled as “other” during the rescoring, which added an element of quality control to the overall process Mele et al. (2014a).


After a first baseline system was trained (see Sec. III), the 5% of the images that were most in disagreement with the classifier (regardless of whether the image was in the training or the test set) were relabeled by one expert, in order to cast a systematic eye on the most problematic images.

Because no rules were established and no exemplars were circulated prior to the initial scoring, individual viewpoints varied on classifying certain outcomes. For example, the bottom 5% contained many instances of phase separation, where the protein forms oil droplets or an oily film that coats the bottom of the crystallization well. Images were found to be inconsistently scored as “clear”, “precipitate”, or “other” depending on the amount and visibility of the oil film. This example highlights the difficulty of scoring experimental outcomes beyond crystal identification. A more serious source of ambiguity arises from process failure. Many of the problematic images did not capture experimental results at all. They were out of focus, dark, overexposed, dropless, etc. Whatever labeling convention was initially followed, for the relabeling the “other” category was deemed to also diagnose problems with the imaging process.

A total of 42.6% of annotations for the images that were revisited disagreed with the original label, suggesting somewhat high (1 to 2%) label noise in the dataset overall. For a fraction of this data, multiple raters were asked to label the images independently and had an inter-rater disagreement rate of approximately 22%. The inherent difficulty of assigning a label to a small fraction of the images is therefore consistent with the results of Ref. Wilson (). Table 2 shows the final image counts after relabeling.

Label | Number of images (Training) | Number of images (Validation)
Crystals | 56,672 | 6,632
Precipitate | 212,541 | 23,892
Clear | 148,861 | 16,760
Other | 24,856 | 3,000
Table 2: Data distribution. Final number of images in the dataset for each category after collapsing the labels and relabeling.

III Machine Learning Model

The goal of the classifier here is to take an image as an input, and output the probability of it belonging to each of four classes (crystals, precipitate, clear, other) (see Fig. 1). The classifier used is a deep Convolutional Neural Network (CNN). CNNs, originally proposed in Ref. LeCun et al. (1989), and their modern ‘deep’ variants (see, e.g., Refs. LeCun et al. (2015); Rawat and Wang (2017) for recent reviews), have proven to consistently provide reliable results on a broad variety of visual recognition tasks, and are particularly amenable to addressing data-rich problems. They have been, for instance, state of the art on the very competitive ILSVRC image recognition challenge Berg et al. (2010) since 2012.

This approach to visual perception has been making unprecedented inroads in areas such as medical imaging Litjens et al. (2017) and computational biology Angermueller et al. (2016), and has also been shown to be human-competitive on a variety of specialized visual identification tasks Krause et al. (2017); Liu et al. (2017). The chosen classifier is thus well suited for the current analysis.

Fig 1: Conceptual Representation of a Convolutional Neural Network. A CNN is a stack of nonlinear filters (three filter levels are depicted here) that progressively reduce the spatial extent of the image, while increasing the number of filter outputs that describe the image at every location. On top of this stack sits a multinomial logistic regression classifier, which maps the representation to one probability value per output class (Crystals vs. Precipitate vs. Clear vs. Others). The entire network is jointly optimized through backpropagation Rumelhart et al. (1986), in general by means of a variant of stochastic gradient descent Bottou (2010).
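The final stage described in Fig. 1 can be illustrated with a minimal sketch of a softmax, the function a multinomial logistic regression uses to turn one raw score per class into probabilities. The logit values and the helper name `softmax` are illustrative assumptions, not outputs or code from the trained model.

```python
import math

CLASSES = ["Crystals", "Precipitate", "Clear", "Other"]


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtracting the maximum improves numerical stability and
    # leaves the resulting probabilities unchanged.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a single image (not taken from the paper).
example_logits = [2.0, 0.5, -1.0, 0.0]
probabilities = dict(zip(CLASSES, softmax(example_logits)))

# The predicted label is the class with the highest posterior probability.
predicted = max(probabilities, key=probabilities.get)
```

In the real network the logits would come from the last layer of the convolutional stack rather than from a hand-written list.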

Model Architecture

The model is a variation on the widely-used Inception-v3 architecture Szegedy et al. (2016), which was state of the art on the ILSVRC challenge around 2015. Several more recent alternatives were tried, including Inception-ResNet-v2 Szegedy et al. (2017), and automatically generated variants of NASNet Zoph et al. (2017), but none yielded any significant improvements. An extensive hyperparameter search was also conducted using Vizier Golovin et al. (2017), also without providing significant improvement over the baseline.

The Inception-v3 architecture is a complex deep CNN architecture described in detail in Ref. Szegedy et al. (2016) as well as the reference implementation Silberman and Guadarrama (2017). We only describe here the modifications made to tailor the model to the task at hand.

Standard Inception-v3 operates on a 299x299 square image. Because the current problem involves very detailed, thin structures, it is plausible to assume that a larger input image may yield better outcomes. We use instead 599x599 images, and compress them down to 299x299 using an additional convolutional layer at the very bottom of the network, before the layer labeled Conv2d_1a_3x3 in the reference implementation. The additional convolutional layer has a depth (number of filters) of 16, a receptive field (it operates on a square patch convolved over the image) and a stride of 2 (it skips over every other location in the image to reduce the dimensionality of the feature map). This modification improved classification absolute accuracy by approximately 0.3%. A few other convolutional layers were shrunk compared to the standard Inception-v3 by capping their depth as described in Table 3, using the conventions from the reference implementation.

Layer Max depth
Conv2d_4a_3x3 144
Mixed_6b 128
Mixed_6c 144
Mixed_6d 144
Mixed_6e 96
Mixed_7a 96
Mixed_7b 192
Mixed_7c 192
Table 3: Limits applied to layer depths to reduce the model complexity. In each named layer of the deep network – here named after the conventions of the reference implementation – every convolutional subblock had its number of filters reduced to contain no more than these many outputs.
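The two modifications described in this subsection can be sketched in a few lines. The first part checks the spatial arithmetic of the extra input layer; the kernel size used (3, with no padding) is an assumption chosen because it maps 599 to 299 at stride 2, and is not a value quoted in the text above. The second part applies the caps of Table 3 to a list of filter counts; the branch depths and the helper names `conv_output_size` and `cap_depths` are hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Spatial size after a convolution (standard floor formula)."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# Assumption: a 3x3 kernel without padding. With stride 2 this maps
# a 599-pixel side to the 299-pixel side expected by Inception-v3.
ASSUMED_KERNEL = 3
stem_output = conv_output_size(599, ASSUMED_KERNEL, stride=2)

# Depth caps copied from Table 3.
MAX_DEPTH = {
    "Conv2d_4a_3x3": 144,
    "Mixed_6b": 128,
    "Mixed_6c": 144,
    "Mixed_6d": 144,
    "Mixed_6e": 96,
    "Mixed_7a": 96,
    "Mixed_7b": 192,
    "Mixed_7c": 192,
}


def cap_depths(layer_name, branch_depths):
    """Limit every filter count in a named layer to its cap, if one exists."""
    cap = MAX_DEPTH.get(layer_name)
    if cap is None:
        # Layers absent from Table 3 are left unchanged.
        return list(branch_depths)
    return [min(depth, cap) for depth in branch_depths]


# Hypothetical filter counts, for illustration only.
hypothetical_branches = [192, 160, 160, 192]
capped = cap_depths("Mixed_6c", hypothetical_branches)
untouched = cap_depths("Mixed_5b", [64, 96])
```

Capping with `min` only ever shrinks a layer, which matches the stated goal of reducing model complexity and memory use.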

While these parameters are exhaustively reported here to ensure reproducibility of the results, their fine tuning is not essential to maximizing the success rate, and was mainly motivated by improving the speed of training. In the end, it was possible to train the model at larger batch size (64 instead of 32) and still fit within the memory of an NVidia K80 GPU (see more details in the training section below). Given the large number of examples available, all dropout Srivastava et al. (2014) regularizers were removed from the model definition at no cost in performance.

Data Preprocessing and Augmentation

The source data is partitioned randomly into 415990 training images and 47062 test images.

The training data is generated dynamically by taking random 599x599 patches of the input images, and subjecting them to a wide array of photometric distortions, identical to the reference implementation:

  • randomized brightness (±32 out of 255),

  • randomized saturation (from 50% to 150%),

  • randomized hue (±0.2 out of 0.5),

  • randomized contrast (from 50% to 150%).

In addition, images are randomly flipped left to right with 50% probability, and, in contrast to the usual practice for natural scenes which don’t have a vertical symmetry, they are also flipped upside down with 50% probability. Because images in this dataset have full rotational invariance, one could also consider rotations beyond the 180° rotation that combining these two flips provides (for instance 90° and 270° rotations, or arbitrary angles), but we didn’t attempt it here, as we surmise the incremental benefits would likely be minimal for the additional computational cost. This form of aggressive data augmentation greatly improves the robustness of image classifiers, and partly alleviates the need for large quantities of human labels.
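A minimal sketch of this augmentation pipeline, written in plain Python on a toy grayscale image so that it runs without any dependency. It covers the random crop, the brightness shift and the two flips; contrast, saturation and hue are omitted for brevity. The function names and the 8x8 toy image are assumptions for illustration, and this is not the reference implementation.

```python
import random


def random_crop(image, size, rng):
    """Take a random size x size patch from a 2D list of pixel values."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


def flip_left_right(image):
    return [row[::-1] for row in image]


def flip_up_down(image):
    return image[::-1]


def random_brightness(image, max_delta, rng):
    """Shift all pixels by one random offset and clamp to [0, 255]."""
    delta = rng.uniform(-max_delta, max_delta)
    return [[min(255.0, max(0.0, p + delta)) for p in row] for row in image]


def augment(image, crop_size, rng):
    patch = random_crop(image, crop_size, rng)
    patch = random_brightness(patch, 32.0, rng)
    if rng.random() < 0.5:      # horizontal flip, 50% probability
        patch = flip_left_right(patch)
    if rng.random() < 0.5:      # vertical flip, 50% probability
        patch = flip_up_down(patch)
    return patch


# Toy stand-in for an input image; real patches would be 599x599.
rng = random.Random(0)
toy_image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
patch = augment(toy_image, crop_size=5, rng=rng)
```

Because a fresh crop and a fresh set of distortions are drawn on every call, the same source image yields a different training example each time it is visited.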

For evaluation, no distortion is applied. The test images are center cropped and resized to 599x599.


The model is implemented in TensorFlow Abadi et al. (2015), and trained using an asynchronous distributed training setup Dean et al. (2012) across 50 NVidia K80 GPUs. The optimizer is RmsProp Tieleman and Hinton (2012), with a batch size of 64, a learning rate of 0.045, a momentum of 0.9, a decay of 0.9 and an epsilon of 0.1. The learning rate is decayed every two epochs by a factor of 0.94. Training completed after 1.7M steps (Fig. 2) in approximately 19 hours, having processed 100M images, which is the equivalent of 260 epochs. The model thus sees every training sample 260 times on average, with a different crop and set of distortions applied each time. The model used at test time is a running average of the training model over a short window to help stabilize the predictions.
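The training figures quoted above can be cross-checked with a short calculation, and the learning-rate schedule written out explicitly. This sketch assumes that each step processes one batch of 64 images and that the decay is applied in discrete jumps every two epochs (a "staircase" schedule); both are assumptions, as is the function name `learning_rate`.

```python
TRAIN_IMAGES = 415_990      # training-set size quoted in the text
BATCH_SIZE = 64
TOTAL_STEPS = 1_700_000
BASE_LR = 0.045
DECAY_FACTOR = 0.94
EPOCHS_PER_DECAY = 2

# Assumption: one step consumes exactly one batch.
steps_per_epoch = TRAIN_IMAGES / BATCH_SIZE
images_processed = TOTAL_STEPS * BATCH_SIZE
epochs = images_processed / TRAIN_IMAGES


def learning_rate(step, staircase=True):
    """Exponentially decayed learning rate at a given training step."""
    exponent = step / (steps_per_epoch * EPOCHS_PER_DECAY)
    if staircase:
        # Assumed: the rate drops in discrete jumps, not continuously.
        exponent = int(exponent)
    return BASE_LR * DECAY_FACTOR ** exponent
```

Under these assumptions `images_processed` is on the order of 100M and `epochs` is close to 260, in line with the figures reported above.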

Fig 2: Classifier Accuracy. Accuracy on the training and validation sets as a function of the number of steps of training. Training halts when the performance on the evaluation set no longer increases (‘early stopping’). As is typical for this type of stochastic training, performance increases rapidly at first as large training steps are taken, and slows down as the learning rate is annealed and the model fine-tunes its weights.

IV Results


The original labeling gave rise to a model with 94.2% accuracy on the test set. Relabeling improved reported classification accuracy by approximately 0.3% absolute, with the caveat that the figures are not precisely comparable since some of the test labels changed in between. The revised model thus achieves 94.5% accuracy on the test set for the four-way classification task. It overfits modestly to the training set, reaching just above 97% at the early-stopping mark of 1.7M steps. Table 4 summarizes the confusions between classes. Although the classifier does not perform equally well on images from the various datasets, the standard deviation in performance from one set to another is fairly small, about 5% (see Table 5), compared to the overall performance of the classifier.

True label | Predicted: Crystals | Predicted: Precipitate | Predicted: Clear | Predicted: Other
Crystals | 91.0% | 5.8% | 1.7% | 1.5%
Precipitate | 0.8% | 96.1% | 2.3% | 0.7%
Clear | 0.2% | 1.8% | 97.9% | 0.2%
Other | 4.8% | 19.7% | 5.9% | 69.6%
Table 4: Confusion Matrix. Fraction of the test data that is assigned to each class based on the posterior probability assigned by the classifier. For instance, 0.8% of images labeled as Precipitate in the test set were classified as Crystals.
True label | Predicted: Crystals | Predicted: Precipitate | Predicted: Clear | Predicted: Other
Crystals | 5% | 4% | 1% | 1%
Precipitate | 2% | 4% | 1% | 2%
Clear | 1% | 3% | 5% | 1%
Other | 7% | 15% | 6% | 21%
Table 5: Standard Deviation of the predictions across data sources. Note in particular the large variability in the consistency of the label ’Other’ across datasets, which leads to comparatively poor selectivity of that less well-defined class.
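As a consistency check, the overall accuracy can be recovered from the per-class rates of Table 4 by weighting each diagonal entry by the number of held-out images in that class (Table 2). The sketch below does this; the function name `overall_accuracy` is an assumption, and small differences from the reported figure are expected because the table entries are rounded.

```python
CLASSES = ["Crystals", "Precipitate", "Clear", "Other"]

# Row-normalized confusion matrix from Table 4, in percent.
# Rows are true labels, columns are predictions in CLASSES order.
CONFUSION = {
    "Crystals":    [91.0, 5.8, 1.7, 1.5],
    "Precipitate": [0.8, 96.1, 2.3, 0.7],
    "Clear":       [0.2, 1.8, 97.9, 0.2],
    "Other":       [4.8, 19.7, 5.9, 69.6],
}

# Held-out image counts per class from Table 2.
HELD_OUT_COUNTS = {
    "Crystals": 6632,
    "Precipitate": 23892,
    "Clear": 16760,
    "Other": 3000,
}


def overall_accuracy(confusion, counts):
    """Weight each class's diagonal rate by its share of the images."""
    correct = 0.0
    for index, label in enumerate(CLASSES):
        correct += confusion[label][index] / 100.0 * counts[label]
    return correct / sum(counts.values())


accuracy = overall_accuracy(CONFUSION, HELD_OUT_COUNTS)
```

The result lands within a fraction of a percentage point of the reported four-way accuracy, even though the "Other" class is recognized far less reliably, because that class accounts for only a small share of the images.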

The classifier outputs a posterior probability for each class. By varying the acceptance threshold for a proposed classification, one can trade precision of the classification against recall. The receiver operating characteristic (ROC) curves can be seen in Fig. 5.
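The thresholding procedure can be sketched as follows: sweep the acceptance threshold over the scores assigned to one class, record the true and false positive rates at each value, and integrate the resulting curve. The scores and labels are invented toy values, and the helper names `roc_points` and `auc` are assumptions.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate), one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # An image is accepted when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    curve = [(0.0, 0.0)] + sorted(points) + [(1.0, 1.0)]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy posterior probabilities for the "crystal" class (1 = true crystal).
toy_scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_points(toy_scores, toy_labels)
area = auc(points)
```

Lowering the threshold moves along the curve toward higher recall at the cost of more false positives, which is the trade-off a laboratory would tune to its own tolerance for missed crystals.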


At CSIRO C3 a workflow Watkins (2018) has been set up which uses a variation of the analysis tool from DeepCrystal dee (2017) to analyze newly collected crystallisation images and to assign either no score, ‘crystal’ score or ‘clear’ score. A total of 37,851 images that were collected in Q1 2018 and assigned a human score by a C3 user were used as an independent dataset to test the MARCO tool. Within this dataset, 9746 images had been identified as containing crystals. The current DeepCrystal tool (which assigns only ‘crystal’ or ‘clear’ scores) was found to have an overall accuracy rate of 74%, while the MARCO tool has 90%. Although this retrospective analysis doesn’t allow for a direct comparison of the ROC, the precision, recall and accuracy of the two tools all favor the MARCO tool, as shown in Table 6. The precision achieved by MARCO on this dataset is also very similar to that seen for the CSIRO images in the training data.

DL tool | Precision | Recall | Accuracy
DeepCrystal | 0.4928 | 0.4520 | 0.7391
MARCO | 0.7777 | 0.8663 | 0.9018
Table 6: Validation at C3. Precision, recall and accuracy from an independent set of images collected after the MARCO tool was developed. The 38K images of sitting drop trials were collected between January 1 and March 30, 2018 on two Formulatrix Rock Imager (FRI) instruments.
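The three metrics in Table 6 are tied together by the underlying counts. The sketch below reconstructs approximate true/false positive and negative counts from the reported precision and recall, assuming that both refer to the crystal class with 9746 human-scored crystal images out of 37,851, and then recomputes accuracy from those counts. The function names are assumptions, and the counts are estimates limited by the rounding of the published figures.

```python
TOTAL_IMAGES = 37_851
CRYSTAL_IMAGES = 9_746


def counts_from_metrics(precision, recall, total, positives):
    """Estimate confusion counts from precision and recall."""
    tp = round(recall * positives)            # recall = tp / positives
    predicted_positive = round(tp / precision)  # precision = tp / predicted
    fp = predicted_positive - tp
    fn = positives - tp
    tn = total - tp - fp - fn
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def accuracy(counts):
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


marco = counts_from_metrics(0.7777, 0.8663, TOTAL_IMAGES, CRYSTAL_IMAGES)
deepcrystal = counts_from_metrics(0.4928, 0.4520, TOTAL_IMAGES, CRYSTAL_IMAGES)

marco_accuracy = accuracy(marco)
deepcrystal_accuracy = accuracy(deepcrystal)
```

Under these assumptions the recomputed accuracies agree with the Accuracy column of Table 6 to within rounding, which suggests the three columns are mutually consistent.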
Fig 5: Receiver Operating Characteristic Curves. (A) Percentage of the correctly accepted detection of crystals as a function of the percentage of incorrect detections (AUC: 98.8). 98.7% of the crystal images can be recalled at the cost of less than 19% false positives. Alternatively, 94% of the crystals can be retrieved with less than 1.6% false positives. (B) Percentage of the correctly accepted detection of precipitate as a function of the percentage of incorrect detections (AUC: 98.9). 99.6% of the precipitate images can be recalled at the cost of less than 25% false positives. Alternatively, 94% of the precipitates can be retrieved with less than 3.4% false positives.

Pixel Attribution

We visually inspect which parts of the image the classifier learns to attend to by aggregating, on a per-pixel basis, the gradients of the label score with respect to noisy copies of the image. The SmoothGrad Smilkov et al. (2017) approach is used to visualize the focus of the classifier. The images in Fig. 9 are constructed by overlaying a heat map of the classifier’s attention over a grayscale version of the input image.

Fig 9: Sample heatmaps for various types of images. (A) Crystal: the classifier focuses on some of the angular geometric features of individual crystals (arrows). (B) Precipitate: the classifier lands on the precipitate (arrows). (C) Clear: The classifier broadly samples the image, likely because this label is characterized by the absence of structures rather than their presence. Note the slightly more pronounced focus on some darker areas (circle and arrows) that could be confused for crystals or precipitate. Because the ‘Others’ class is defined negatively by the image not being identifiable as belonging to the other three classes, heatmaps for images of that class are not particularly informative.

Note that saliency methods are imperfect and do not in general weigh faithfully all the evidence present in an image according to their contributions to the decision, especially when the evidence is highly correlated. Although these visualizations paint a simplified and very partial picture of the classifier’s decision mechanisms, they help confirm that it is likely not picking up and overfitting to cues that are irrelevant to the task.
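The idea behind SmoothGrad can be sketched in a few lines: add Gaussian noise to the input several times, compute the gradient of the class score for each noisy copy, and average. The toy model below (a sigmoid of a weighted sum over four "pixels") stands in for the CNN so that the gradient can be written analytically; the weights, the noise level and the function names are all assumptions for illustration.

```python
import math
import random


def smoothgrad(grad_fn, image, noise_std, n_samples, rng):
    """Average the gradient over noisy copies of a flattened image."""
    total = [0.0] * len(image)
    for _ in range(n_samples):
        noisy = [p + rng.gauss(0.0, noise_std) for p in image]
        grad = grad_fn(noisy)
        total = [t + g for t, g in zip(total, grad)]
    return [t / n_samples for t in total]


# Toy stand-in for a classifier: score = sigmoid(w . x).
WEIGHTS = [0.0, 2.0, -1.0, 0.0]


def toy_gradient(pixels):
    """Analytic gradient of the toy score with respect to each pixel."""
    z = sum(w * p for w, p in zip(WEIGHTS, pixels))
    s = 1.0 / (1.0 + math.exp(-z))
    return [s * (1.0 - s) * w for w in WEIGHTS]


rng = random.Random(0)
saliency = smoothgrad(toy_gradient, [0.1, 0.5, 0.3, 0.9],
                      noise_std=0.1, n_samples=50, rng=rng)

# The heat map shows gradient magnitude regardless of sign.
heatmap = [abs(g) for g in saliency]
```

Pixels that the toy model ignores (zero weight) receive zero saliency, while the pixel with the largest weight dominates the heat map; in the real classifier the same averaging is applied to gradients obtained by backpropagation.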

Inference and Availability

The model is open-sourced and available online at Vanhoucke (2018). It can be run locally using TensorFlow or TensorFlow Lite, or as a Google Cloud Machine Learning clo (2018) endpoint. At time of writing, inference on a standard Cloud instance takes approximately 260ms end-to-end per standalone query. However, due to the very efficient parallelism properties of convolutional networks, latency per image can be dramatically cut down for batch requests.

V Discussion

Previous attempts at automating the analysis of crystallisation images have employed various pattern recognition and machine learning techniques, including linear discriminant analysis Cumbaa et al. (2003); Saitoh et al. (2005), decision trees and random forests Bern et al. (2004); Liu et al. (2008); Cumbaa and Jurisica (2010), and support vector machines Pan et al. (2006); Buchala and Wilson (2008). Neural networks, including self-organizing maps, have also been used to classify these images Spraggon et al. (2002); Po and Laine (2008), with the most recent involving deep learning Yann and Tang (2016). However, all previous approaches have required a consistent set of images with the same field of view and resolution, in order to identify the crystallization droplet in the well Vallotton et al. (2010), and thereby restrict the analysis. Various statistical, geometric or textural features were then extracted, either directly from the image or from some transformation of the region of interest, to be used as variables in the classification algorithms.

The results from various studies can be difficult to compare head-to-head because different groups present confusion matrices with the number of classes ranging from 2 to 11, only sometimes aggregating results for crystals/crystalline materials. There is also a tradeoff between the number of false negatives and the number of false positives. Yet most report classification rates for crystals around 80-85% even in more recent work Cumbaa and Jurisica (2010); Kotseruba et al. (2012); Hung et al. (), in which missed crystals are reported with much lower rates. This advance comes at the expense of more false positives. For example, Pan et al. report just under 3% false negatives, but almost 38% false positives Pan et al. (2006).

As the trained algorithms are specific to a set of images, they are also restricted to a particular type of crystallisation experiment. Prior to the curation of the current dataset, the largest set of images (by far) came from the Hauptman-Woodward Medical Research Institute HTCC Snell et al. (2008b). This dataset, which contains 147,456 images from 96 different proteins but is limited to experiments with the microbatch-under-oil technique, has been used in a number of studies Fusco et al. (2014); Yann and Tang (2016). Most notably, Yann et al. used a deep convolutional neural network that automatically extracted features, and reported correct classification rates as high as 97% for crystals and 96% for non-crystals. Although impressive, these results were obtained from a curated subset of 85,188 clean images, i.e., images with class labels on which several human experts agreed Yann and Tang (2016). In order to validate our approach, we retrained our model to perform the same 10-way classification on that subset of the data alone without any tuning of the model’s hyperparameters and achieved 94.7% accuracy, compared to the reported 90.8%.

In this context, the current results are especially remarkable. A crystallographer can classify images of experiments independently of the systems used to create those images. They can view an experiment with a microscope or look at a computer image and reach similar conclusions. They can look at a vapor diffusion experiment or a microbatch-under-oil setup and, again, assess either with confidence. Here, we show that this can be accomplished equally well, if not better, using deep CNNs. A benchtop researcher can classify many images, especially if they relate to a project that has been years in the making. For high-throughput approaches, however, that task becomes challenging. The strength of computational approaches is that each image is treated like the previous one, with no fatigue. Classification of 10,000 images is as consistent as classification of one. This advance opens the door for complete classification of all results in a high-throughput setting and for data mining of repositories of past image data.

Another remarkable aspect of our results is that they leverage a very generic computer vision architecture originally designed for a different classification problem – categorization of natural images – with very distinct characteristics. For instance, one can presume that the global geometric relationships between object parts would play a greater role in identifying a car or a dog in an image, compared to the very local, texture-like features involved in recognizing crystal-like structures. Yet no particular specialization of the model was required to adapt it to the widely differing visual appearances of the samples originating from different imagers. This convergence of approaches toward a unified perception architecture across a wide range of computer vision problems has been a common theme in recent years, further suggesting that the technology is now ready for wide adoption for any human-mediated visual recognition task.

VI Conclusion

In this work, we have collated biomolecular crystallization images for nearly half a million experiments across a large range of conditions, and trained a CNN on the labels of these images. Remarkably, the resulting machine-learning scheme was able to recapitulate the labels of more than 94% of a test set. Such accuracy has rarely been obtained, and has no equal for an uncurated dataset. The analysis also identified a small subset of problematic images, which upon reconsideration revealed a high level of label discrepancy. This variability inherent to using human labeling highlights one of the main benefits of automatic scoring. Such accuracy also makes high-density screening conceivable.

Enhancing the imaging capabilities by including UV or SONICC results, for instance, could certainly enrich the model. But several research avenues could also be pursued without additional laboratory equipment. In particular, it should be possible to leverage side information that is currently not being used.

  • The four-way classification scheme used is a distillation of 38 categories which are present in the source data. While these categories are presumed to be somewhat inconsistent across datasets, they could potentially provide an additional supervision signal.

  • Because one goal of this classifier is to be able to generalize across datasets, it would be worthwhile to investigate the contribution of techniques that have been designed to specifically reduce the effect of domain shift across data sources on the classification outcomes Ganin and Lempitsky (2015); Bousmalis et al. (2016).

  • Each crystallization experiment records a series of images taken over time. Using the time-course information could enhance the success rate of the classifier Mele et al. (2014b).
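As an illustration of the last item, one simple way to use time-course information is to pool the per-image class probabilities over all images of a drop before assigning a label. The sketch below assumes that a classifier returns one probability per class for each time point. The class names, the function name, the threshold, and the pooling rule (flag crystals if any time point is confident, otherwise average the remaining classes) are hypothetical choices made for illustration; they are not the method of Mele et al. (2014b) nor something evaluated in this study.

```python
# Assumed four-way label set; the exact names are an illustrative assumption.
CLASSES = ("crystals", "precipitate", "clear", "other")


def pool_time_course(prob_series, crystal_threshold=0.5):
    """Assign one label to a drop from its per-time-point class probabilities.

    prob_series: list of dicts mapping each class in CLASSES to a probability,
    one dict per image in the time course (earliest first).
    """
    if not prob_series:
        raise ValueError("empty time course")
    # A crystal seen confidently at any time point decides the outcome.
    if max(p["crystals"] for p in prob_series) >= crystal_threshold:
        return "crystals"
    # Otherwise average the remaining classes over time and take the largest.
    n = len(prob_series)
    mean = {c: sum(p[c] for p in prob_series) / n
            for c in CLASSES if c != "crystals"}
    return max(mean, key=mean.get)


# Hypothetical probabilities for three successive images of one drop.
series = [
    {"crystals": 0.05, "precipitate": 0.15, "clear": 0.75, "other": 0.05},
    {"crystals": 0.10, "precipitate": 0.50, "clear": 0.35, "other": 0.05},
    {"crystals": 0.70, "precipitate": 0.20, "clear": 0.05, "other": 0.05},
]
label_full = pool_time_course(series)        # crystal appears at the last time point
label_early = pool_time_course(series[:2])   # no confident crystal yet
```

Whether such pooling actually improves on single-image classification would have to be tested on labeled time courses; a learned sequence model is an equally plausible alternative.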

Note in closing that the current study focused on crystallization as an outcome, which is but a small fraction of the protein solubility diagram. Patterns of precipitation, phase separation, and clear drops also provide information as to whether and where crystallization might occur. The success in identifying crystals, precipitate and clear drops can thus also be used to accurately chart the crystallization regimes and to identify pathways for optimization Snell et al. (2008c); Fusco et al. (2014); Altan et al. (2016). The application of this approach to large libraries of historical data may therefore reveal patterns that guide future crystallization strategies, including novel chemical screens and mutagenesis programs.


We acknowledge discussions at various stages of this project with I. Altan, S. Bowman, R. Dorich, D. Fusco, E. Gualtieri, R. Judge, A. Narayanaswamy, J. Noah-Vanhoucke, P. Orth, M. Pokross, X. Qiu, P. F. Riley, V. Shanmugasundaram, B. Sherborne and F. von Delft. PC acknowledges support from National Science Foundation Grant no. NSF DMR-1749374.


  • Harrison et al. (2017) S Harrison, B Lahue, Z Peng, A Donofrio, C Chang,  and M Glick, “Extending ‘predict first’ to the design make-test cycle in small-molecule drug discovery,” Future Med. Chem. 9, 533–536 (2017).
  • McPherson (1999) Alexander McPherson, Crystallization of Biological Macromolecules (CSHL Press, Cold Spring Harbor, 1999) p. 586.
  • Chayen (2004) Naomi E. Chayen, “Turning protein crystallisation from an art into a science,” Curr. Opin. Struct. Biol. 14, 577–583 (2004).
  • Fusco and Charbonneau (2016) Diana Fusco and Patrick Charbonneau, “Soft matter perspective on protein crystal assembly,” Colloids Surf. B: Biointerfaces 137, 22–31 (2016).
  • Ng et al. (2016) J.T. Ng, C. Dekker, P. Reardon,  and F. von Delft, “Lessons from ten years of crystallization experiments at the SGC,” Acta Cryst. D 72, 224–235 (2016).
  • Fazio et al. (2015) V.J. Fazio, T.S. Peat,  and J. Newman, “Lessons for the future,” Methods Mol. Biol. 1261, 141–156 (2015).
  • Newman et al. (2012) Janet Newman, Evan E. Bolton, Jochen Muller-Dieckmann, Vincent J. Fazio, D. Travis Gallagher, David Lovell, Joseph R. Luft, Thomas S. Peat, David Ratcliffe, Roger A. Sayle, Edward H. Snell, Kerry Taylor, Pascal Vallotton, Sameer Velanker,  and Frank von Delft, “On the need for an international effort to capture, share and use crystallization screening data,” Acta Cryst. F 68, 253–258 (2012).
  • Kotseruba et al. (2012) Yulia Kotseruba, Christian A Cumbaa,  and Igor Jurisica, “High-throughput protein crystallization on the world community grid and the gpu,” J. Phys. Conf. Ser. 341, 012027 (2012).
  • Newman (2011) Janet Newman, “One plate, two plates, a thousand plates. how crystallisation changes with large numbers of samples,” Methods 55, 73 – 80 (2011).
  • Zhang et al. (2017) Shuheng Zhang, Charline J.J. Gerard, Aziza Ikni, Gilles Ferry, Laurent M. Vuillard, Jean A. Boutin, Nathalie Ferte, Romain Grossier, Nadine Candoni,  and Stéphane Veesler, “Microfluidic platform for optimization of crystallization conditions,” J. Cryst. Growth 472, 18 – 28 (2017), industrial Crystallization and Precipitation in France (CRISTAL-8), May 2016, Rouen (France).
  • Thielmann et al. (2012) Y. Thielmann, J. Koepke,  and H. Michel, “The esfri instruct core centre frankfurt: Automated high-throughput crystallization suited for membrane proteins and more,” J. Struct. Funct. Genomics 13, 63–69 (2012).
  • Snell et al. (2008a) Edward H. Snell, Angela M. Lauricella, Stephen A. Potter, Joseph R. Luft, Stacey M. Gulde, Robert J. Collins, Geoff Franks, Michael G. Malkowski, Christian Cumbaa, Igor Jurisica,  and George T. DeTitta, “Establishing a training set through the visual analysis of crystallization trials. part II: Crystal examples,” Acta Cryst. D 64, 1131–1137 (2008a).
  • (13) This estimate is based on personal communication with five experienced crystallographers at GlaxoSmithKline: 2 seconds/observation × 8 observations × 96 wells. Note that current technology can automatically store and image plates at about 3 min/plate.
  • (14) Julie Wilson, “Automated classification of images from crystallisation experiments,” in Advances in Data Mining. Applications in Medicine, Web Mining, Marketing, Image and Signal Mining, edited by Petra Perner (Springer Berlin Heidelberg) pp. 459–473.
  • Snell et al. (2008b) Edward H. Snell, Joseph R. Luft, Stephen A. Potter, Angela M. Lauricella, Stacey M. Gulde, Michael G. Malkowski, Mary Koszelak-Rosenblum, Meriem I. Said, Jennifer L. Smith, Christina K. Veatch, Robert J. Collins, Geoff Franks, Max Thayer, Christian Cumbaa, Igor Jurisica,  and George T. DeTitta, “Establishing a training set through the visual analysis of crystallization trials. part I: ~150 000 images,” Acta Cryst. D 64, 1123–1130 (2008b).
  • (16) David Hargreaves, Personal communication.
  • Spraggon et al. (2002) Glen Spraggon, Scott A. Lesley, Andreas Kreusch,  and John P. Priestle, “Computational analysis of crystallization trials,” Acta Cryst. D 58, 1915–1923 (2002).
  • Cumbaa and Jurisica (2005) Christian Cumbaa and Igor Jurisica, “Automatic classification and pattern discovery in high-throughput protein crystallization trials,” J. Struct. Funct. Genomics 6, 195–202 (2005).
  • Kawabata et al. (2008) Kuniaki Kawabata, Kanako Saitoh, Mutsunori Takahashi, Hajime Asama, Taketoshi Mishima, Mitsuaki Sugahara,  and Masashi Miyano, “Evaluation of protein crystallization state by sequential image classification,” Sensor Rev. 28, 242–247 (2008).
  • Buchala and Wilson (2008) Samarasena Buchala and Julie C. Wilson, “Improved classification of crystallization images using data fusion and multiple classifiers,” Acta Cryst. D 64, 823–833 (2008).
  • Winn et al. (2011) Martyn D. Winn, Charles C. Ballard, Kevin D. Cowtan, Eleanor J. Dodson, Paul Emsley, Phil R. Evans, Ronan M. Keegan, Eugene B. Krissinel, Andrew G. W. Leslie, Airlie McCoy, Stuart J. McNicholas, Garib N. Murshudov, Navraj S. Pannu, Elizabeth A. Potterton, Harold R. Powell, Randy J. Read, Alexei Vagin,  and Keith S. Wilson, “Overview of the ccp4 suite and current developments,” Acta Cryst. D 67, 235–242 (2011).
  • Mar (2017) “MAchine Recognition of Crystallization Outcomes (MARCO),”  (2017), [Online; accessed 17-March-2018].
  • Vallotton et al. (2010) Pascal Vallotton, Changming Sun, David Lovell, Vincent J. Fazio,  and Janet Newman, “Droplit, an improved image analysis method for droplet identification in high-throughput crystallization trials,” J. Appl. Crystallogr. 43, 1548–1552 (2010).
  • (24) N. Rosa, M. Ristic, B. Marshall,  and J. Newman, “Keeping crystallographers app-y,” Acta Cryst. F submitted.
  • Mele et al. (2014a) Katarina Mele, Rongxin Li, Vincent J. Fazio,  and Janet Newman, “Quantifying the quality of the experiments used to grow protein crystals: the iqc suite,” J. Appl. Cryst. 47, 1097–1106 (2014a).
  • LeCun et al. (1989) Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard,  and Lawrence D Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation 1, 541–551 (1989).
  • LeCun et al. (2015) Yann LeCun, Yoshua Bengio,  and Geoffrey Hinton, “Deep learning,” Nature 521, 436 (2015).
  • Rawat and Wang (2017) Waseem Rawat and Zenghui Wang, “Deep convolutional neural networks for image classification: A comprehensive review,” Neural Comput. 29, 2352–2449 (2017).
  • Berg et al. (2010) A Berg, J Deng,  and L Fei-Fei,  “Large scale visual recognition challenge (ILSVRC),”  (2010) .
  • Litjens et al. (2017) Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen AWM van der Laak, Bram van Ginneken,  and Clara I Sánchez, “A survey on deep learning in medical image analysis,” Med. Image Anal. 42, 60–88 (2017).
  • Angermueller et al. (2016) Christof Angermueller, Tanel Pärnamaa, Leopold Parts,  and Oliver Stegle, “Deep learning for computational biology,” Mol. Syst. Biol. 12, 878 (2016).
  • Krause et al. (2017) Jonathan Krause, Varun Gulshan, Ehsan Rahimy, Peter Karth, Kasumi Widner, Greg S Corrado, Lily Peng,  and Dale R Webster, “Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy,” arXiv:1710.01711 [cs.CV] (preprint) (2017).
  • Liu et al. (2017) Yun Liu, Krishna Gadepalli, Mohammad Norouzi, George E Dahl, Timo Kohlberger, Aleksey Boyko, Subhashini Venugopalan, Aleksei Timofeev, Philip Q Nelson, Greg S Corrado, et al., “Detecting cancer metastases on gigapixel pathology images,” arXiv:1703.02442 [cs.CV] (preprint) (2017).
  • Rumelhart et al. (1986) David E Rumelhart, Geoffrey E Hinton,  and Ronald J Williams, “Learning representations by back-propagating errors,” Nature 323, 533 (1986).
  • Bottou (2010) Léon Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010 (Springer, 2010) pp. 177–186.
  • Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens,  and Zbigniew Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) pp. 2818–2826.
  • Szegedy et al. (2017) Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke,  and Alexander A Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning.” arXiv:1602.07261 [cs.CV] (preprint) (2017).
  • Zoph et al. (2017) Barret Zoph, Vijay Vasudevan, Jonathon Shlens,  and Quoc V Le, “Learning transferable architectures for scalable image recognition,” arXiv:1707.07012 [cs.CV] (preprint) (2017).
  • Golovin et al. (2017) Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro,  and D Sculley, “Google vizier: A service for black-box optimization,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2017) pp. 1487–1495.
  • Silberman and Guadarrama (2017) Nathan Silberman and Sergio Guadarrama, “TensorFlow-Slim image classification model library,”  (2017), [Online; accessed 02-February-2018].
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever,  and Ruslan Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res. 15, 1929–1958 (2014).
  • Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu,  and Xiaoqiang Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,”  (2015).
  • Dean et al. (2012) Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al., “Large scale distributed deep networks,” in Advances in neural information processing systems (2012) pp. 1223–1231.
  • Tieleman and Hinton (2012) Tijmen Tieleman and Geoffrey Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning 4, 26–31 (2012).
  • Watkins (2018) Chris Watkins, “C4, C3 Classifier Pipeline. v1. CSIRO. Software Collection.”  (2018), [Online; accessed 09-May-2018].
  • dee (2017) “DeepCrystal,”  (2017), [Online; accessed 09-May-2018].
  • Smilkov et al. (2017) Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas,  and Martin Wattenberg, “Smoothgrad: Removing noise by adding noise,” arXiv:1706.03825 [cs.LG] (preprint) (2017).
  • Vanhoucke (2018) Vincent Vanhoucke, “Marco repository in TensorFlow Models,”  (2018), [Online; accessed 01-May-2018].
  • clo (2018) “Google Cloud Machine Learning Engine,”  (2018), [Online; accessed 02-February-2018].
  • Cumbaa et al. (2003) Christian A. Cumbaa, Angela Lauricella, Nancy Fehrman, Christina Veatch, Robert Collins, Joe Luft, George DeTitta,  and Igor Jurisica, “Automatic classification of sub-microlitre protein-crystallization trials in 1536-well plates,” Acta Cryst. D 59, 1619–1627 (2003).
  • Saitoh et al. (2005) Kanako Saitoh, Kuniaki Kawabata, Hajime Asama, Taketoshi Mishima, Mitsuaki Sugahara,  and Masashi Miyano, “Evaluation of protein crystallization states based on texture information derived from greyscale images,” Acta Cryst. D 61, 873–880 (2005).
  • Bern et al. (2004) Marshall Bern, David Goldberg, Raymond C. Stevens,  and Peter Kuhn, “Automatic classification of protein crystallization images using a curve-tracking algorithm,” J. Appl. Cryst. 37, 279–287 (2004).
  • Liu et al. (2008) R. Liu, Y. Freund,  and G. Spraggon, “Image-based crystal detection: a machine-learning approach.” Acta Cryst. D 64, 1187–95 (2008).
  • Cumbaa and Jurisica (2010) Christian A. Cumbaa and Igor Jurisica, “Protein crystallization analysis on the world community grid,” J. Struct. Funct. Genomics 11, 61–69 (2010).
  • Pan et al. (2006) Shen Pan, Gidon Shavit, Marta Penas-Centeno, Dong-Hui Xu, Linda Shapiro, Richard Ladner, Eve Riskin, Wim Hol,  and Deirdre Meldrum, “Automated classification of protein crystallization images using support vector machines with scale-invariant texture and gabor features,” Acta Cryst. D 62, 271–279 (2006).
  • Po and Laine (2008) M.J. Po and A.F. Laine, “Leveraging genetic algorithm and neural network in automated protein crystal recognition,” in Proceedings of the 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS’08 - ”Personalized Healthcare through Technology” (2008) pp. 1926–1929.
  • Yann and Tang (2016) Margot Lisa-Jing Yann and Yichuan Tang, “Learning deep convolutional neural networks for x-ray protein crystallization image analysis,” in Thirtieth AAAI Conference on Artificial Intelligence (2016).
  • (58) Jeffrey Hung, John Collins, Mehari Weldetsion, Oliver Newland, Eric Chiang, Steve Guerrero,  and Kazunori Okada, “Protein crystallization image classification with elastic net,” in SPIE Medical Imaging, Vol. 9034 (SPIE) p. 14.
  • Fusco et al. (2014) Diana Fusco, Timothy J. Barnum, Andrew E. Bruno, Joseph R. Luft, Edward H. Snell, Sayan Mukherjee,  and Patrick Charbonneau, “Statistical analysis of crystallization database links protein physico-chemical features with crystallization mechanisms,” PLoS ONE 9, e101123 (2014).
  • Ganin and Lempitsky (2015) Yaroslav Ganin and Victor Lempitsky, “Unsupervised domain adaptation by backpropagation,” in International Conference on Machine Learning (2015) pp. 1180–1189.
  • Bousmalis et al. (2016) Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan,  and Dumitru Erhan, “Domain separation networks,” in Advances in Neural Information Processing Systems (2016) pp. 343–351.
  • Mele et al. (2014b) Katarina Mele, B. M. Thamali Lekamge, Vincent J. Fazio,  and Janet Newman, “Using time courses to enrich the information obtained from images of crystallization trials,” Cryst. Growth Des 14, 261–269 (2014b).
  • Snell et al. (2008c) Edward H. Snell, Ray M. Nagel, Ann Wojtaszcyk, Hugh O’Neill, Jennifer L. Wolfley,  and Joseph R. Luft, “The application and use of chemical space mapping to interpret crystallization screening results,” Acta Cryst. D 64, 1240–1249 (2008c).
  • Altan et al. (2016) Irem Altan, Patrick Charbonneau,  and Edward H. Snell, “Computational crystallization,” Arch. Biochem. Biophys. 602, 12–20 (2016).