Embryo staging with weakly-supervised region selection and dynamically-decoded predictions
To optimize clinical outcomes, fertility clinics must strategically select which embryos to transfer. Common selection heuristics are formulas expressed in terms of the durations required to reach various developmental milestones, quantities historically annotated manually by experienced embryologists based on time-lapse EmbryoScope videos. We propose a new method for automatic embryo staging that exploits several sources of structure in this time-lapse data. First, noting that in each image the embryo occupies a small subregion, we jointly train a region proposal network with the downstream classifier to isolate the embryo. Notably, because we lack ground-truth bounding boxes, we weakly supervise the region proposal network, optimizing its parameters via reinforcement learning to improve the downstream classifier’s loss. Moreover, noting that embryos reaching the blastocyst stage progress monotonically through earlier stages, we develop a dynamic-programming-based decoder that post-processes our predictions to select the most likely monotonic sequence of developmental stages. Our methods outperform vanilla residual networks and rival the best numbers in contemporary papers, as measured by both per-frame accuracy and transition prediction error, despite operating on smaller data than many.
July 27, 2019
Following its introduction in 1978, in vitro fertilization (IVF), in which an egg and sperm are combined outside the body, has rapidly emerged as one of the most successful assisted reproductive technologies, contributing to roughly 1.7% of all births in the United States . A single cycle of IVF may lead to the growth of multiple ovarian follicles, each of which may contain an oocyte (egg cell). These oocytes are aspirated with a fine needle using ultrasound guidance while the patient is under anesthesia and subsequently fertilized with sperm. Only a fraction of the oocytes fertilize, and a smaller fraction continue to grow and develop normally as embryos before being considered ready for transfer into the uterus (typically after 5-6 days, though some labs use embryos grown only 3 days). Although this process typically generates multiple embryos, most infertility clinics strongly encourage (and some require) transfer of only one embryo at a time because of the greater maternal and fetal risks associated with multi-fetal gestation.
Unfortunately, even under the best circumstances and with genetic testing, the implantation rate following embryo transfer is around 70%  and may be significantly lower without genetic testing, meaning that patients may be forced to undergo multiple transfers of embryos generated from IVF cycle(s) in order to achieve a single normal pregnancy. This leads to the clinical challenge of identifying and prioritizing those embryo(s) most likely to lead to a normal pregnancy with the fewest total transfers, in the least time and at the lowest cost. These priorities are often in direct conflict, leading to the wide variability in clinical decisions made between doctors, clinics and countries.
To prioritize embryo selection, embryologists typically incorporate scores based on a morphological evaluation of each embryo. Historically, embryos were removed from the incubator for assessment under a microscope by a trained embryologist one to two times daily. The development of incubators with built-in time-lapse monitoring has enabled non-invasive embryo assessment with comparatively fine-grained detail, inspiring significant interest in applying embryo “morphokinetics” to score and prioritize embryos. Informally, the morphokinetics comprise the timing and morphologic appearance of embryos as they grow and pass through a series of sequential developmental stages, with earlier stages corresponding to cell divisions and subsequent stages corresponding to larger structural milestones, e.g. formation of the blastocyst.
Modern incubators use a high-powered microscope to capture images of a developing embryo approximately every minutes. Currently, embryologists must perform the morphokinetic analysis manually, viewing a sequence of photographs and annotating the time stamps at which each embryo achieves various developmental milestones. These scores are combined according to heuristic formulas to rank the embryos by their putative viability for transfer into a prepared endometrium.
In this paper, we investigate machine learning techniques for automatically detecting these transition times. Specifically, we propose several methods to (i) learn region proposal models from weak supervision to discard background so that the classifier can focus on the region corresponding to the embryo; (ii) incorporate the temporal context of the video into the model architecture; and (iii) post-process our predictions at the sequence level, using dynamic programming to determine the most likely monotonically-increasing sequence of morphokinetic stages.
Our experiments focus on a dataset consisting of time-lapse videos extracted from EmbryoScope™ (Vitrolife, Sweden) incubators at a large academic medical center’s fertility clinic. Each frame in the raw videos has resolution. We downsize these to the standard ImageNet dimensions for compatibility with pretrained nets and corresponding hyperparameters.
Compared to a baseline deep residual network (ResNet) , we find significant benefits from each of our three proposed techniques. Our region proposal network selects a region and is optimized by reinforcement learning, following the policy gradient algorithm, using the cross entropy loss of the downstream classifier as a reward signal. This technique improves frame-level accuracy from to . Adding a Long Short-Term Memory (LSTM) recurrent neural network to post-process the predictions, we achieve an additional gain, boosting frame-level accuracy to . Finally, we evaluate two variants of our dynamic programming technique for decoding monotonic predictions, one based on finding the most likely monotonically increasing sequence, and another that minimizes the expected distance between the predicted and actual states (e.g. when the ground truth is ‘stage ’, we prefer to predict stage over stage ). When applied to the raw classifier, the dynamic programming post-processing methods confer improvements to frame-level accuracy of and , respectively. Notably, our techniques yield complementary benefits: altogether, they combine to achieve a frame-level accuracy of , and reduce the transition-level error (number of frames off) of the raw classifier by to .
The EmbryoScope time-lapse system is an embryo incubator capable of holding up to 12 wells simultaneously, each containing one embryo. Built in to the device are both a high-powered microscope and a camera used jointly to photograph each embryo on a -minute cycle. Each frame of the resulting time-lapse video consists of a resolution grayscale image with a well number in the lower-left corner and the time superimposed in the lower-right corner, as seen in Figure 1. The system also captures each image in multiple focal planes, although the central focal plane alone is used in this study.
Our dataset consists of EmbryoScope time-lapse videos extracted from incubators at a large academic medical center. These videos span different patients, each with to wells and corresponding videos. Videos begin roughly hours after fertilization, and end roughly hours after fertilization. Annotations in the videos correspond to distinct morphokinetic stages, with the embryologist marking the time at which each embryo was first observed in each developmental stage. Among embryos that mature successfully (the rest are discarded), the stages are monotonically increasing, meaning that among non-discarded embryos the ground truth labels never regress from a more advanced stage to a less mature stage. We transform our stage transition annotations into per-frame stage labels by applying the most recently assigned stage. We also assign a special tStart stage label to frames before the first stage. The first observed stage corresponds to the moment when two pronuclei are visible (tPNf), and the next several stages correspond to cell divisions. After the embryo reaches cells, the subsequent stages correspond to higher-level features, like the formation of the blastocyst. The embryo stage distribution in the full data set is given in Figure 2.
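The conversion from transition annotations to per-frame labels can be illustrated with a minimal sketch. It assumes that each video's annotations are available as a sorted list of frame indices, one per annotated stage after tStart; the function name `frame_labels` and this input layout are hypothetical and are not taken from our pipeline.

```python
import numpy as np

# Stage names for the six stages studied here; index 0 is the tStart label.
STAGES = ["tStart", "tPNf", "t2", "t3", "t4", "t4+"]

def frame_labels(transition_frames, num_frames):
    """Expand transition annotations into one stage index per frame.

    transition_frames[i] is the (assumed) index of the first frame at which
    stage i + 1 was observed; the list must be sorted in non-decreasing order.
    Each frame receives the most recently assigned stage, and frames before
    the first annotation receive 0 (tStart).
    """
    frames = np.arange(num_frames)
    # Count how many transitions have occurred at or before each frame.
    return np.searchsorted(np.asarray(transition_frames), frames, side="right")

labels = frame_labels([2, 5, 5, 8], num_frames=10)
```

When two stages are first observed on the same frame (as with the two entries at frame 5 above), the later stage wins, so the intermediate stage receives no frames.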
Because most embryo selection heuristics depend only on the time to reach the cell division milestones, in this study, we focus on the first six stages of development for each embryo, cutting off each video at hour . Moreover, these stages admit a cleaner problem because for a significant portion of our videos, expert (ground truth) annotations are missing for the latter stages. The stages that we address include the initial stage (tStart), the appearance and breakdown of the male and female pronucleus (tPNf), and the appearance of 2 through 4+ cells (t2, t3, t4, t4+). Among these frames, the class distribution is 11.48%, 6.11%, 25.70%, 4.36%, 25.85%, 26.50%. Because we want to be sure that our models generalize not only across frames, or even across embryos but also across patients (mothers), we stratify the dataset by patient, creating training/validation/test splits by randomly selecting // patients and their respective wells. This yields // embryos in the respective splits, corresponding to // frames.
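A patient-level split of this kind can be sketched as follows. The split fractions, the seed, and the dictionary layout mapping a patient identifier to that patient's wells are illustrative assumptions, not the values used in our experiments.

```python
import random

def split_by_patient(wells_by_patient, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split whole patients (never individual wells) into train/val/test.

    wells_by_patient maps a patient id to the list of that patient's wells.
    The fractions are hypothetical placeholders.
    """
    patients = sorted(wells_by_patient)
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_train = int(round(fractions[0] * len(patients)))
    n_val = int(round(fractions[1] * len(patients)))
    groups = (
        patients[:n_train],
        patients[n_train:n_train + n_val],
        patients[n_train + n_val:],
    )
    # Every well of a patient lands in the same split as the patient.
    return [{p: wells_by_patient[p] for p in group} for group in groups]

toy = {f"patient{i}": [f"well{i}_{j}" for j in range(3)] for i in range(20)}
train, val, test = split_by_patient(toy)
```

Splitting on patients rather than on frames or embryos prevents images from the same patient from appearing on both sides of a split.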
We cast predicting embryo morphokinetics as a multiclass classification problem, where the input is a time-lapse EmbryoScope™ video, and the output is a sequence of labels indicating the predicted stage of the embryo at each frame of the video.
Our simplest method consists of applying standard image recognition tools to predict the stage label for each frame given the image . Image classification is now a mature technology, and for all known related tasks, the current best-performing methods are deep convolutional neural networks (CNNs). All of our approaches are based upon convolutional neural networks. Specifically, we choose the ResNet-50 architecture as our base model due to . By default, this model takes as input a resolution image which we downsize from the original image. The output consists of a -dimensional softmax layer corresponding to the class labels, and we optimize the network in the standard fashion to minimize cross entropy loss. We initialize the network with pre-trained weights learned on the benchmark ImageNet image recognition challenge , a practice widely known to confer significant transfer learning benefits . We suspect that given a more relevant source task with a comparably-large dataset (ideally, concerning gray-scale images from cellular microscopy), we might get even greater benefits, although we leave this investigation for future work.
3.1 Weakly-Supervised Embryo Detection
Motivating our first contribution for improving performance of the embryo classifier, we observe that the embryo’s cell(s) lie in a small region of the image, and that the rest of the image, containing the rest of the well and surrounding background, consists only of imaging artifacts that have no relevance to stage prediction. We postulate that by first detecting where the embryo is, and then basing classifications on the cropped region containing only the embryo, we could filter out the background noise, improving predictive performance. Moreover, since the subsequent classification is based on a smaller region, we could either (i) save computation, or (ii) refer back to the original image to extract a higher-resolution zoom on the cropped region, providing greater detail to the classifier.
The most standard way to cast the bounding box detection task is to train a model with labeled data corresponding to the height and width of the box as well as an and coordinates to locate the box. For typical detection tasks, current deep learning-based object detection systems require large annotated datasets with bounding box labels. However, we do not have any such labels available for our task.
To learn embryo-encapsulating bounding boxes without explicitly annotated boxes, we propose a new approach that relies only on image-level class labels, optimizing the region proposal model via weak supervision using reinforcement learning. To begin, noting that the embryo size does not vary much, we fix the box dimensions to (a crop), focusing only on identifying the box center. Since we only have the image-level label for the image classification task, the training objective of the detector is to help a downstream classifier to better classify the image. Our two-step detect-then-classify algorithm is described below:
Given an input image , the detector predicts a probabilistic distribution over a rectangular grid of candidate box centers.
Sample a region and get the cropped subregion .
Feed to the classifier to predict probabilities for each class .
Let be the label and be the usual cross entropy loss function. The expected loss of the two-step classification algorithm is
Note that both the detector and classifier share the objective of minimizing the expected classification loss . The intuition behind this objective is that if the image crop has a larger intersection with the cell, it is easier for the classifier to classify the image. On the other hand, if a large part of the cropped image is background, the classifier should not perform much better than random guessing. Note that our detector outputs a probability distribution over grid-cells. At test time we make predictions by centering the bounding box at the expected and coordinates.
The loss function involves computing the expectation with respect to all possible regions. We use the Monte Carlo method to estimate the loss by drawing sample regions . The optimization problem becomes
The gradient for the classifier’s parameter is
The gradient for the detector’s parameter is estimated using the policy gradient, a common reinforcement learning algorithm. Moreover, we incorporate a standard technique for variance reduction, using the average reward as a baseline. This gives us the gradient for ,
In preliminary experiments, we found that relying solely on the objective (2) converges quickly to an unsatisfactory local optimum where the distribution over regions is always peaked on one specific region proposal. To overcome this issue, we encourage exploration in the reinforcement learning objective by adding the negative entropy of the region distribution, a technique used by Mnih et al. The augmented overall loss function is
where is the weight to balance the term.
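To make the detector update concrete, the following is a minimal NumPy sketch of a score-function (policy gradient) estimate with an average-loss baseline and an entropy bonus. It assumes that the detector outputs one logit per candidate box center, flattened into a vector; the function names and the argument `entropy_weight` are hypothetical, and the sketch differentiates only with respect to the logits rather than the network parameters.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def detector_logit_gradient(logits, sampled_idx, losses, entropy_weight):
    """Gradient estimate of E_r[loss(r)] - entropy_weight * H(p) w.r.t. logits.

    logits:       float scores, one per candidate box center (flattened grid)
    sampled_idx:  indices of the regions sampled for this image
    losses:       classifier cross-entropy loss observed for each sample
    """
    p = softmax(np.asarray(logits, dtype=float))
    losses = np.asarray(losses, dtype=float)
    advantages = losses - losses.mean()      # average loss as the baseline
    grad = np.zeros_like(p)
    for idx, adv in zip(sampled_idx, advantages):
        score = -p.copy()                    # d log p[idx] / d logits
        score[idx] += 1.0
        grad += adv * score
    grad /= len(losses)
    # Gradient of the negative entropy: d(-H)/d logit_j = p_j (log p_j + H).
    log_p = np.log(p + 1e-12)
    entropy = -(p * log_p).sum()
    grad += entropy_weight * p * (log_p + entropy)
    return grad

g = detector_logit_gradient(np.zeros(4), [0, 1], [2.0, 0.0], entropy_weight=0.0)
```

In this toy call, a gradient descent step lowers the logit of the region that produced the higher loss and raises the logit of the region that produced the lower loss. The entropy term vanishes when the distribution is already uniform.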
The detector predicts the region distribution using a sliding window method based on a Region Proposal Network (RPN). The region proposal is computed from an intermediate feature map at conv4_2 in ResNet-50. Based on our exploration of the data, we found that the embryo is typically contained in a rectangular region that is roughly one quarter the size of the image. To simplify the distribution, we use this prior knowledge to fix the width and height of the rectangular region to be of the size of the image. We assume that the center of the region proposal lies on a grid, so that we only need to predict the probability of the region lying at each position in that grid. The probability is computed by applying a convolutional filter to the feature map followed by a softmax operation
where indicates the probability of selecting the box center at the -th row and -th column of the grids.
Our base classifier is a ResNet-50 convolutional network that takes a cropped image as its input. We remove the layers in conv5 to speed up computation.
While the detector outputs a distribution over regions, at test time we want to use only “the best” region. Early experiments led us to the heuristic of choosing the expected center coordinates of the predicted distribution. The average box center is computed by
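The expected-center heuristic amounts to a probability-weighted average of grid positions. The sketch below assumes that the detector's output is a 2-D array of probabilities indexed by grid row and column; it returns fractional grid coordinates, and the mapping from grid cells to pixel coordinates is left out because it depends on the grid layout.

```python
import numpy as np

def expected_center(grid_probs):
    """Return the expected (row, column) of the box center on the grid.

    grid_probs[i, j] is the probability of placing the box center at the
    i-th row and j-th column of the grid; entries are assumed to sum to 1.
    """
    probs = np.asarray(grid_probs, dtype=float)
    rows, cols = np.indices(probs.shape)
    return float((probs * rows).sum()), float((probs * cols).sum())

uniform = np.full((3, 3), 1.0 / 9.0)
center_row, center_col = expected_center(uniform)
```

One property of this choice is that a distribution split between two distant cells yields a center between them, so the heuristic is best suited to distributions with a single mode.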
Our first idea to incorporate context across adjacent frames is to employ recurrent neural networks with Long Short-Term Memory (LSTM) units. The LSTM takes a sequence of inputs, updates its internal state at each time step, and predicts a sequence of outputs. The inputs to the LSTM consist of -dimensional feature vectors extracted from the hidden layers of a vanilla CNN. We then feed the feature vector to a bi-directional LSTM layer with units for each direction. We apply a linear mapping of the LSTM output at each time step to classes to get a sequence of predictions . We set to optimizing the model to predict accurately on the middle frames. We do not use predictions made on the first or last frames because they lack sufficient context.
3.3 Structured Decoding with Dynamic Programming (DP)
For embryos that successfully reach the blastocyst stage, ground truth stages in our selected data set are monotonically non-decreasing, reflecting the condition that any viable embryo must continue to grow and develop rather than arrest and die. Frame-level CNNs, or LSTMs operating on short sequences, cannot learn this constraint since the model does not have enough context. Therefore we impose this inductive bias through a dynamic programming decoder that enforces monotonicity of predictions. For each video, our model predicts the probability of the embryo stages at every frame , where is the total number of frames in the video. We want to find a decoded label sequence such that and most match the frame prediction for each frame. We define a potential function to measure how much the decoded label deviates from and cast the decoding as the following optimization problem:
We investigate two potential functions, the negative log likelihood (NLL) and the earth mover’s distance (EMD), defined by and , respectively, where is the number of development stages. This optimization problem can be solved in polynomial time using Dynamic Programming (DP) with a forward pass and a backward pass.
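The decoder can be sketched in a few lines of NumPy. The sketch assumes that the per-frame predictions form a T x C array of probabilities, that the NLL potential of a label is the negative log of its predicted probability, and that the EMD potential is the expected absolute distance between the decoded stage index and a stage drawn from the predicted distribution. These definitions and the function names are illustrative assumptions rather than the exact forms used in our experiments.

```python
import numpy as np

def nll_potential(probs, eps=1e-12):
    """potential[t, y] = -log p_t(y)."""
    return -np.log(np.asarray(probs, dtype=float) + eps)

def emd_potential(probs):
    """potential[t, y] = sum_k p_t(k) * |y - k| (assumed EMD form)."""
    probs = np.asarray(probs, dtype=float)
    stages = np.arange(probs.shape[1])
    dist = np.abs(stages[:, None] - stages[None, :])
    return probs @ dist

def decode_monotonic(potential):
    """Return the non-decreasing label sequence with minimal total potential."""
    T, C = potential.shape
    cost = np.empty((T, C))
    back = np.zeros((T, C), dtype=int)
    cost[0] = potential[0]
    for t in range(1, T):                    # forward pass
        running, arg = np.inf, 0
        for y in range(C):
            # Best previous label among y' <= y, kept as a prefix minimum.
            if cost[t - 1, y] < running:
                running, arg = cost[t - 1, y], y
            cost[t, y] = potential[t, y] + running
            back[t, y] = arg
    labels = np.empty(T, dtype=int)
    labels[-1] = int(np.argmin(cost[-1]))
    for t in range(T - 1, 0, -1):            # backward pass
        labels[t - 1] = back[t, labels[t]]
    return labels

probs = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.0],
    [0.6, 0.4, 0.0],   # frame-wise argmax would regress to stage 0 here
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
])
decoded_nll = decode_monotonic(nll_potential(probs))
decoded_emd = decode_monotonic(emd_potential(probs))
```

Because the prefix minimum is maintained while sweeping over labels, the forward pass costs O(TC) rather than O(TC^2).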
4.1 Embryo Detection
We train the region proposal network with SGD with momentum and learning rate , with batch size set to . The image is first downsampled to before feeding into the detector. For each image in the batch, we sample regions, extracted as images cropped from the input image, and feed the cropped images into the classifier. We train two detectors, with and without entropy regularization ( respectively), to measure the effect of using the augmented loss function.
We also compare another approach to learn the detector using differentiable bi-linear sampling. The idea is that the detector only predicts a single region that is fed to the classifier. We use differentiable bi-linear sampling when cropping the image at that region so that the gradient with respect to the classification loss can be back-propagated to the detector. We change the last layer of the detector to be a fully connected layer to predict the coordinates of the center of the box. We were unable to make this alternative approach converge using SGD, so we eventually settled on the Adam optimizer with default parameters.
To evaluate the performance of the learned detector, we manually label a small data set of 120 images randomly sampled from the validation set, corresponding to 20 images from each embryo stage, and use these ground truth labels to get a quantitative evaluation of the detector. We report the Jaccard index, which is calculated as the intersection over union between the ground truth box and the predicted box, as well as the Euclidean distance between the ground truth box center and the predicted box center, measured in pixels in the raw image. We also include the classification accuracy of a two-step detector-classifier on the selected images, as this is our actual training objective.
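Both detection metrics are simple to compute for axis-aligned boxes. The sketch below assumes that a box is given as a tuple `(x0, y0, x1, y1)` of pixel coordinates with `x0 < x1` and `y0 < y1`; this representation and the function names are illustrative assumptions.

```python
import math

def jaccard_index(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def center_distance(box_a, box_b):
    """Euclidean distance between the centers of two boxes."""
    ca = ((box_a[0] + box_a[2]) / 2.0, (box_a[1] + box_a[3]) / 2.0)
    cb = ((box_b[0] + box_b[2]) / 2.0, (box_b[1] + box_b[3]) / 2.0)
    return math.hypot(ca[0] - cb[0], ca[1] - cb[1])

iou = jaccard_index((0, 0, 2, 2), (1, 0, 3, 2))
dist = center_distance((0, 0, 2, 2), (1, 0, 3, 2))
```

Note that when both boxes have the same fixed size, as with our fixed-size proposals compared against similarly sized labels, the Jaccard index is determined largely by the center offset, so the two metrics tend to move together.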
Detection results are shown in Table 1. Our RL training with entropy loss achieves a Jaccard index of and a center distance only pixels from the manually labeled images. The detector trained without the entropy term underperforms the detector with entropy, reflecting network convergence to some local optimum based on the current best performing region at an early stage. The differentiable sampling approach performs poorly for detection; this shows that using a stochastic region proposal in our RL training is crucial for successfully training a detector.
We visualize the detection results of a random sample of images in Figure 4. We see that the predicted boxes contain the region of the ground truth box in almost all images and are only fractionally larger than the ground truth boxes.
|Training Method|Jaccard Index|Center Distance (pixels)|Classification Accuracy|
|RL w/ Entropy Loss|0.6957|11.58|82.50%|
|RL w/o Entropy Loss|0.6876|18.83|75.83%|
4.2 Embryo Staging
The baseline model is ResNet-50 applied to raw image resized to . Our method (DetCls, ‘detect then classify’) first uses the detector learned in the previous section to identify the region of the embryo on resized input. We experiment with two image cropping methods. The first method crops the region on the resized input, while the other crops the raw image and resizes it to . The cropped image with size or will then be fed into the same ResNet-50 as the baseline. We also try to add LSTM to our DetCls method with crops.
After successfully training the detector in an end-to-end manner, we subsequently use the same detector to compare all downstream models. The detector is set to test mode to predict only one region. We initialize each classifier as a ResNet-50 with pretrained ImageNet weights, updating all weights using the Adam optimizer with a learning rate of and default parameters. We apply random rotation augmentation to the data. The validation data set is used for early stopping and all metrics are evaluated on test data.
We report the per-frame accuracy of our raw predictions, as well as the per-frame accuracy of the DP predictions (for both objectives). We also report the mean absolute error (MAE) and root mean squared error (RMSE) (measured in frames) of the predicted stage transition times after post-processing. To better justify these results, we include the result of a naive baseline that simply labels each frame using the mode stage among all frames captured at the same time in the training set, and predicts the transition time for each stage using the median transition time among all embryo videos in the training set. Table 2 summarizes the results of four models.
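The transition-level metrics can be computed directly from monotonic label sequences. The sketch below assumes that a stage's transition time is the index of the first frame whose label is at least that stage, and that a stage never reached is assigned the index one past the final frame. Both conventions, and the function names, are assumptions made for illustration.

```python
import numpy as np

def transition_frames(labels, num_stages):
    """First frame index at which each stage 1..num_stages-1 is reached.

    labels must be a non-decreasing sequence of stage indices, as produced
    by the DP decoder or by the ground-truth annotations.
    """
    labels = np.asarray(labels)
    return np.searchsorted(labels, np.arange(1, num_stages), side="left")

def transition_errors(pred_labels, true_labels, num_stages):
    """Return (MAE, RMSE) between predicted and true transition frames."""
    diff = (transition_frames(pred_labels, num_stages).astype(float)
            - transition_frames(true_labels, num_stages).astype(float))
    return float(np.abs(diff).mean()), float(np.sqrt((diff ** 2).mean()))

mae, rmse = transition_errors([0, 1, 1, 2, 2], [0, 0, 1, 1, 2], num_stages=3)
```

In this toy example both transitions are predicted one frame early, so the MAE and the RMSE are each one frame.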
Effect of detection.
Two of our single-frame DetCls models significantly outperformed the baseline before and after post-processing in all metrics. Of note, the gain in accuracy due to detection after post-processing is typically as great as or greater than the gains seen in raw accuracy (without post-processing). The performance of DetCls112 and DetCls224 is comparable. The model with high-resolution cropping performs only slightly better after post-processing, suggesting that the performance gains with respect to the baseline are mainly due to removing irrelevant background in the raw input and not due to enabling higher-resolution inputs.
Using temporal information.
DP post-processing yields an accuracy improvement of to to all three single-frame models and allows us to generate a monotonic prediction sequence to predict the stage transition time. DP using Earthmover’s distance achieves slightly better performances on three models (Baseline, DetCls224, DetCls224+LSTM) than DP using likelihood. Adding LSTM to DetCls224 further improves the raw accuracy and two metrics (accuracy, MAE) after post-processing. The improvement is less significant after post-processing. This suggests that the DP decoders already encode most of the temporal relationships between frames.
[Table 2: staging results, with post-processing columns for DP with label likelihood and DP with earth mover’s distance subject to monotonicity.]
5 Related Work
5.1 Computer Vision Methodology Papers
Over the past several years, a variety of papers have made rapid progress on both single and multiple object detection using convolutional neural networks. Two of the most popular approaches are Faster R-CNN and YOLO, from which we draw loose inspiration in designing our region proposal network for predicting the bounding box. Traditional object recognition methods are trained on large datasets where the true bounding boxes are annotated, data which is not freely available in many domains, including ours. Several previous works seek to address this problem, learning weakly-supervised object detection that relies only on the image-level class label [4, 3]. Unlike our method, these approaches are not end-to-end trainable with a classifier.
Several prior works use reinforcement learning to learn an attention mechanism for selecting the most relevant image regions or video frames for downstream visual recognition tasks. Jaderberg et al. propose an alternative method that learns a geometric transform on the input image, which is fed into a classifier using differentiable bi-linear sampling. Our work shares a similar idea of localizing the object before classifying it.
The idea of extending models to include temporal information has been explored extensively in recent years. One line of work used a two-stream architecture applied to a single frame as well as multi-frame optical flow in order to combine spatial and temporal information. Another studied techniques for using RNNs to improve frame-level object detection by incorporating context from adjacent frames, and also introduced several additional losses, e.g. to encourage smoothness in the predictions across adjacent frames.
5.2 Embryology Applications Papers
The problem of predicting embryo annotations from time-lapse videos has been addressed in prior work that uses an 8-layer convolutional network to count the number of cells in an embryo image (up to 5 cells), a related but different setting from ours. To incorporate temporal information, that work uses conditional random fields and similarly uses dynamic programming to enforce monotonicity constraints. Another study collected annotations of time-lapse morphokinetic data and used principal component analysis and logistic regression analysis to predict pregnancy with an AUC of 0.70. A meta-analysis of RCTs comparing use of a morphokinetic algorithm versus single time-point embryo evaluations found an improved ongoing clinical pregnancy rate with use of the technology. Other work demonstrated that there is significant variability in some morphokinetic intervals between IVF clinics, suggesting that the parameters used to select embryos may require tuning for each particular clinic. The human-selected morphokinetic annotations are in near-perfect agreement across repeated exams.
Multiple applications of CNNs to embryo assessment were presented at the 2018 American Society of Reproductive Medicine Annual Meeting. Zaninovic et al. analyzed 18,000 images of blastocysts using CNNs trained on raw time-lapse images and were able to classify the quality of the embryos into three morphologic quality grades with 75% accuracy. Iwata et al. performed a similar analysis to predict good-quality embryos with 80% accuracy. Malmsten et al. built a CNN using raw time-lapse images of 11,898 human embryos to classify up to the 8-cell stage with 82% reported accuracy. They also reported that the predicted cell-division transition times fell within frames of the embryologist’s annotation for 91% of transitions.
To our knowledge, our work is the first to use deep learning to predict embryo morphokinetics (the above works were published after our work was first made public), the first to improve performance by localizing the embryo through a weakly-supervised reinforcement learning method, and the first to demonstrate the benefit of incorporating contextual frames via LSTMs.
Beyond embryology, CNN-based classification techniques have emerged as popular tools in the clinical literature, with successes such as image-based classification of skin lesions  including keratinocyte carcinomas versus benign seborrheic keratoses and malignant melanomas versus benign nevi, and the detection of diabetic retinopathy from retinal fundus imaging .
This paper introduced a suite of techniques for recognizing stages of embryo development, achieving frame-level accuracy. We also achieve a mean absolute error for predicting stage transition times of . We believe that several directions realizable in future work could bring this technology to the level of clinical utility. To begin, our results are achieved using only embryos for training. Because the performance of deep learning methods tends not to saturate quickly with dataset size, we plan to access a considerably larger dataset for future studies, to test the limits of our current methodology. Additionally, our models are initialized with pretrained weights from an ImageNet classifier originally trained on full-color photographs. We suspect that transfer from a comparably large dataset of more relevant images (gray-scale microscopy) might yield additional gains. Identifying such a dataset for transfer remains a challenge. Moreover, even if we can access such a dataset of unlabeled images, deciding upon a (possibly unsupervised) objective for the source task poses an interesting research problem. Additionally, we plan to extend our experiments to predict not only the stages useful for current embryo selection heuristics but all stages of development. Finally, we hope to use the models learned for morphokinetic prediction as a source task themselves, fine-tuning the models to the more pressing downstream problem of assessing implantation potential directly. We note that assessing the viability of an embryo represents an interesting off-policy learning problem. Outcomes are only observed for those embryos that implanted. Success on this task may require not only representation learning, but also estimating counterfactual quantities.
- Adolfsson and Andershed  E Adolfsson and AN Andershed. Morphology vs morphokinetics: a retrospective comparison of inter-observer and intra-observer agreement between embryologists on blastocysts with known implantation outcome. JBRA Assist Reprod, pages 228–237, 2018.
- Ba et al.  Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. In Proceedings of the International Conference on Learning Representations, 2015.
- Bilen and Vedaldi  Hakan Bilen and Andrea Vedaldi. Weakly supervised deep detection networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2846–2854, 2016.
- Bilen et al.  Hakan Bilen, Marco Pedersoli, and Tinne Tuytelaars. Weakly supervised object detection with convex clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1081–1089, 2015.
- Center for Disease Control  Center for Disease Control. Assisted reproductive technology (art). https://www.cdc.gov/art/artdata/index.html, 2016. Accessed: 2019-03-27.
- Cetinkaya and Kahraman  Caroline Pirkevi Cetinkaya and Semra Kahraman. Morphokinetics of embryos - where are we now? Journal of Reproductive Biotechnology and Fertility, pages 1–8, 2016.
- Chamayou et al.  Sandrine Chamayou, Pasquale Patrizio, Giorgia Storaci, Venera Tomaselli, Carmelita Alecci, Carmen Ragolia, Claudia Crescenzo, and Antonino Guglielmino. The use of morphokinetic parameters to select all embryos with full capacity to implant. Assisted Reproductive Genetics, pages 703–710, 2013.
- Esteva et al.  Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115, 2017.
- Gulshan et al.  Varun Gulshan, Lily Peng, Marc Coram, Martin C Stumpe, Derek Wu, Arunachalam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Jorge Cuadros, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. Jama, 316(22):2402–2410, 2016.
- He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- Hochreiter and Schmidhuber  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Iwata et al.  K Iwata, M Sato, I Matsumoto, T Shimura, K Yumoto, A Negami, and Y Mio. Deep learning based on images of human embryos obtained from high-resolution time-lapse cinematography for predicting good-quality embryos. Fertility and Sterility, 110(4):e213, 2018.
- Jaderberg et al.  Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.
- Khan et al.  Aisha Khan, Stephen Gould, and Mathieu Salzmann. Deep convolutional neural networks for human embryonic cell counting. In ECCV, 2016.
- Malmsten et al.  J Malmsten, N Zaninovic, Q Zhan, M Toschi, Z Rosenwaks, and J Shan. Automatic prediction of embryo cell stages using artificial intelligence convolutional neural network. Fertility and Sterility, 110(4):e360, 2018.
- Milewski et al.  R Milewski, AJ Milewska, A Kuczyńska, B Stankiewicz, and W Kuczyński. Do morphokinetic data sets inform pregnancy potential? Assisted Reproductive Genetics, pages 357–365, 2016.
- Mnih et al.  Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
- Pribenszky et al.  C Pribenszky, AM Nilselid, and M Montag. Time-lapse culture with morphokinetic embryo selection improves pregnancy and live birth chances and reduces early pregnancy loss: a meta-analysis. Reprod Biomed Online, pages 511–520, 2017.
- Redmon et al.  Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
- Ren et al.  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
- Russakovsky et al.  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
- Sermanet et al.  Pierre Sermanet, Andrea Frome, and Esteban Real. Attention for fine-grained categorization. arXiv preprint arXiv:1412.7054, 2014.
- Simon et al.  Alexander L Simon, Michelle Kiehl, Erin Fischer, J Glenn Proctor, Mark R Bush, Carolyn Givens, Matthew Rabinowitz, and Zachary P Demko. Pregnancy outcomes from more than 1,800 in vitro fertilization cycles with the use of 24-chromosome single-nucleotide polymorphism–based preimplantation genetic testing for aneuploidy. Fertility and sterility, 110(1):113–121, 2018.
- Simonyan and Zisserman  Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
- Tripathi et al.  Subarna Tripathi, Zachary C Lipton, Serge Belongie, and Truong Nguyen. Context matters: Refining object detection in video with recurrent neural networks. arXiv preprint arXiv:1607.04648, 2016.
- Yeung et al.  Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei. End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2678–2687, 2016.
- Yosinski et al.  Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014.
- Zaninovic et al.  N Zaninovic, P Khosravi, I Hajirasouliha, JE Malmsten, E Kazemi, Q Zhan, M Toschi, O Elemento, and Z Rosenwaks. Assessing human blastocyst quality using artificial intelligence (AI) convolutional neural network (CNN). Fertility and Sterility, 110(4):e89, 2018.