# SampleAhead: Online Classifier-Sampler Communication for Learning from Synthesized Data

###### Abstract

State-of-the-art techniques of artificial intelligence, in particular deep learning, are mostly data-driven. However, collecting and manually labeling a large-scale dataset is both difficult and expensive. A promising alternative is to introduce synthesized training data, so that the dataset size can be significantly enlarged with little human labor. But this raises an important problem in active vision: given an infinite data space, how can one effectively sample a finite subset to train a visual classifier?

This paper presents an approach for learning from synthesized data effectively. The motivation is straightforward: increasing the probability of seeing difficult training data. We introduce a module named SampleAhead to formulate the learning process as an online communication between a classifier and a sampler, and update them iteratively. In each round, we adjust the sampling distribution according to the classification results, and train the classifier using the data sampled from the updated distribution. Experiments are performed by introducing synthesized images rendered from ShapeNet models to assist PASCAL3D+ classification. Our approach achieves higher classification accuracy, especially when the number of training samples is limited. This demonstrates its efficiency in exploring the infinite data space.

Qi Chen Weichao Qiu Yi Zhang Lingxi Xie Alan Yuille {qchen42,yzh,alan.yuille}@jhu.edu {qiuwch,198808xc}@gmail.com Department of Computer Science, The Johns Hopkins University

## 1 Introduction

Recent progress in computer vision has been boosted by deep neural networks trained with a large amount of labeled data. Researchers have made every effort to increase the volume [?; ?] and representativeness [?] of these datasets; however, collection and annotation remain labor-intensive and error-prone. A promising way to address this problem is to generate synthesized data (e.g., from a virtual world [?; ?]) with a minimal amount of human labor.

However, because the synthesized environment allows us to sample an infinite amount of data, an important yet unstudied problem arises: given a constrained time budget, how can we effectively sample a finite subset so as to maximize the performance of a vision system? We address this problem in the context of object recognition, a fundamental task in computer vision. Note that for some specific tasks such as object pose estimation, integrating synthesized data contributes substantially to recognition accuracy, but previous approaches often sampled data uniformly from the synthesized space [?], leading to a redundant set of easy training cases while the hard cases were not trained sufficiently.

Inspired by previous work [?] which adjusted data weights according to their difficulties in an online manner, we propose a learning system composed of two components: a classifier, parameterized by a set of network weights, which deals with the recognition task, and a sampler, parameterized by a class distribution over viewpoint parameters (e.g., azimuth and elevation angles), which samples training data from the infinite data space. The optimization algorithm is similar in spirit to AdaBoost [?], i.e., it increases the weight of difficult samples in training the classifier.

The training process updates the classifier weights and the sampling distribution in an iterative manner. The unit that controls the classifier-sampler communication is named SampleAhead. In each iteration, the distribution is determined by the testing results on a standalone validation set, and is then used to sample a new batch of data for training the classifier (i.e., updating its weights). To improve computational efficiency, we partition the entire space into a finite number of buckets. In each training epoch, the classifier is first applied to a validation set to estimate the difficulty of each bucket, and the sampler then constructs a new training subset. This is a two-stage sampling process: each time, a bucket is first sampled from the bucket-level distribution, and then a datum is sampled from the bucket following a uniform distribution. This iteration continues until the maximal number of rounds is reached.

We conduct experiments in a challenging task known as object pose estimation, which aims at predicting the viewpoint from which we capture a 2D image of an object. We use PASCAL3D+ [?] as the target (testing) dataset, and render a large number of synthesized images from ShapeNet [?]. In comparison to the baseline approach [?] which always sampled the data space from a fixed distribution, our method produces higher recognition accuracy especially in more challenging scenarios, in agreement with our motivation. In particular, when the number of extra training cases is limited, the advantage of our approach becomes even more significant.

## 2 Related Work

State-of-the-art artificial intelligence and machine learning systems are powered by big data. Training complicated models, especially deep neural networks, requires sufficient data to prevent over-fitting. The availability of large-scale datasets makes it possible to train very deep neural networks [?]. However, researchers often needed a considerable amount of labor to collect and annotate a large-scale dataset [?; ?], or a smaller one with reasonable variability [?; ?].

On the other hand, the rapid development of computer graphics allows researchers to construct an unreal environment [?], and sample a large amount of annotated synthesized data with little human labor [?; ?]. Another possibility is to apply generative deep learning models to simulate the distribution of real data [?]. It has been verified that synthesized [?; ?; ?; ?] or generated [?] data are helpful in training better models. However, in either case, we are provided with an infinite space of training data and face the issue of making use of these synthesized data in a constrained time, i.e., the number of sampled data is finite. A related area is Active Vision [?; ?], in which one is allowed to manipulate the viewpoint of the camera(s) in order to explore and learn richer visual knowledge from the environment. Recently, this idea was also applied to train robots in the task of visual question answering [?; ?].

There exist several ways of sampling training data from a given distribution. A straightforward solution is bootstrapping, which samples training data with replacement. Researchers soon developed other algorithms to increase the probability of sampling a hard example, such as AdaBoost [?] and a series of negative example mining methods [?; ?] to assist the training of SVMs [?] and CNNs [?]. At a finer level, it is also possible to adjust the weights of different elements, so that the loss function leans towards penalizing the errors in hard examples [?; ?; ?; ?]. All these approaches were verified to outperform uniform sampling, especially when the easy examples occupy a considerable fraction of the data space.

In this paper, we also focus on a more efficient sampling strategy. Different from the previous work, we are working in an infinite (continuous) data space. Instead of sampling from each instance (e.g., an image [?] or a regional feature [?]), we partition the entire data space into a finite number of buckets and perform two-stage sampling, detailed in Section 3.3.

## 3 Approach

### 3.1 Background

The goal of this work is to train an effective vision model from an infinite synthesized dataset. Throughout this paper, we assume the target model to be a classifier that maps an input vector (e.g., an image matrix) to an output vector (e.g., a one-hot encoded vector) and is governed by a set of parameters (e.g., the weights in a deep neural network).

Training data are sampled from an image space. The sampling process is a generator function that maps a parameter vector (e.g., object position, viewpoint, lighting, etc.) to an image. Note that each parameter vector is sampled from the parameter space, which is continuous and thus infinite. The core of this paper is how to sample a number of parameter vectors at each training iteration. Following a large corpus of previous work, we assume that each parameter vector is sampled independently and identically from a distribution defined on the parameter space. Generating a training datum thus amounts to drawing a parameter vector from this distribution and passing it through the generator.

A naive example is to set the distribution to be uniform over the parameter space, i.e., every parameter vector has the same probability density. This is equivalent to generating a sufficiently large synthesized dataset at the beginning and traversing its items in order. However, in most scenarios, the classifier then deals mostly with relatively easy training cases, e.g., those that have already been correctly classified, so that the weights cannot be trained efficiently.

### 3.2 The SampleAhead Module

To deal with the above issue, we introduce a module named SampleAhead. This module updates the data distribution before each iteration, increasing the probability that hard examples are sampled and fed into the classifier.

Ideally, at the $t$-th iteration, we hope that the sampling distribution has peaks at the hard cases. We start by defining the difficulty of a sample as the probability that it is not correctly classified by the classifier after the $(t-1)$-st iteration. However, directly computing this quantity for every sample could be sensitive to noise. We therefore make use of kernel estimation, which randomly distributes a set of probes over the entire parameter space and estimates the difficulty of a sample by:

(1) |

where

(2) |

and each probe contributes with a weight that depends on its proximity to the sample, e.g., a weight inversely proportional to the distance between the sample and the probe. The probe set is often large in order to guarantee coverage of the parameter space.
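The kernel estimate described above can be sketched roughly as follows. This is a minimal illustration rather than the exact formulation: the function name `estimate_difficulty`, the Euclidean distance, and the small constant `eps` are assumptions introduced here, and the weight is simply taken to be inversely proportional to the distance between the query point and each probe.

```python
import math


def estimate_difficulty(query, probes, errors, eps=1e-6):
    """Weighted average of probe errors around `query`.

    query  : tuple of floats, a point in the parameter space
    probes : list of tuples, probe locations in the same space
    errors : list of 0/1 flags, 1 if the classifier misclassified that probe
    eps    : small constant (an assumption) that avoids division by zero
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for probe, err in zip(probes, errors):
        dist = math.dist(query, probe)   # Euclidean distance (assumed)
        weight = 1.0 / (dist + eps)      # inverse-distance weight
        weighted_sum += weight * err
        weight_total += weight
    return weighted_sum / weight_total


# A query close to a misclassified probe receives a high difficulty.
probes = [(0.0, 0.0), (10.0, 0.0)]
errors = [1, 0]
print(estimate_difficulty((1.0, 0.0), probes, errors))
```

Normalizing by the total weight keeps the estimate in the range [0, 1], so it can still be read as a misclassification probability.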

The next step is to define the probability density at each sample. Inspired by AdaBoost [?], we take the classification results in the previous iteration into consideration. Mathematically,

(3) |

where and are hyper-parameters. We use rather than to avoid the distribution from being modified too much. This strategy improves training stability.

| # of Iterations | | | | | |
|---|---|---|---|---|---|
| Uniform Sampling | | | | | |
| Our Approach | | | | | |
| $p$-value | | | | | |

### 3.3 Approximation

Note that accurately sampling from the distribution in Eqn (3) requires computing the function value at each sample, which is computationally intractable given a large number of probes. Here we provide an approximation for efficient online sampling.

The basic idea is to partition the entire parameter space into a finite number of buckets. Each bucket is a continuous subset of the parameter space, and any two different buckets do not intersect. Thus, we simplify Eqn (1) by only considering the probes in the same bucket, namely,

(4) |

where is the indicator function. Note that for any , every element has the same distance to each probe, thus the same difficulty (omitting ):

(5) |

This actually leads to a two-stage sampling process, in which a bucket-level probability is computed for each bucket:

(6) |

Every time we generate a sample, we first determine the bucket index from the finite set of bucket indices, and then sample a parameter vector from that bucket following a uniform distribution.
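The two-stage sampling described above can be sketched as follows. This is only an illustration: the function name `sample_parameter` is hypothetical, and a one-dimensional parameter space with interval-shaped buckets is assumed for brevity.

```python
import random


def sample_parameter(bucket_probs, bucket_ranges, rng=random):
    """Two-stage sampling: pick a bucket, then a point uniformly inside it.

    bucket_probs  : list of non-negative bucket-level probabilities
    bucket_ranges : list of (low, high) intervals, one per bucket; a 1-D
                    parameter space is assumed purely for illustration
    rng           : the `random` module or a `random.Random` instance
    """
    # Stage 1: draw a bucket index according to the bucket-level distribution.
    index = rng.choices(range(len(bucket_probs)), weights=bucket_probs, k=1)[0]
    # Stage 2: draw a parameter uniformly within the chosen bucket.
    low, high = bucket_ranges[index]
    return index, rng.uniform(low, high)


# Three azimuth buckets; the middle one is currently considered hardest.
rng = random.Random(0)
print(sample_parameter([0.2, 0.6, 0.2], [(0, 120), (120, 240), (240, 360)], rng))
```

Because the second stage is uniform, only the bucket-level probabilities need to be stored and updated between epochs.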

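In practice, we update the bucket-level probabilities throughout the iterations. At the beginning, the probability of each bucket is simply defined as the probability that a uniform sample from the parameter space falls into that bucket. When updating with Eqn (3), note that both the previous probability and the difficulty are constant within a bucket, so Eqn (3) simplifies to: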

(7) |

The overall flowchart is illustrated in Algorithm 1. Note that each update of the bucket-level probabilities requires a complete test on the validation set. For efficiency, we only update them after each training epoch, rather than after each iteration (mini-batch).
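The per-epoch loop described above can be sketched as follows. The update rule used here, a convex blend of the previous bucket probabilities with the normalized difficulties, is an assumption made for illustration and is not claimed to be the exact form of Eqn (7); the names `update_bucket_probs`, `sample_ahead` and the mixing weight `gamma` are likewise hypothetical.

```python
def update_bucket_probs(prev_probs, difficulties, gamma=0.5):
    """Blend the previous bucket distribution with normalized difficulties.

    `gamma` is a hypothetical mixing weight: 0 freezes the distribution,
    1 replaces it with the normalized difficulties.
    """
    total = sum(difficulties)
    if total == 0.0:
        return list(prev_probs)          # no errors observed: keep as is
    target = [d / total for d in difficulties]
    return [(1.0 - gamma) * p + gamma * q for p, q in zip(prev_probs, target)]


def sample_ahead(num_epochs, init_probs, evaluate_buckets, train_one_epoch,
                 gamma=0.5):
    """Alternate between validation, distribution update and training.

    evaluate_buckets : callable returning per-bucket error rates measured
                       on the validation (probe) set
    train_one_epoch  : callable that samples data according to the given
                       bucket probabilities and trains the classifier
    """
    probs = list(init_probs)
    for _ in range(num_epochs):
        difficulties = evaluate_buckets()                         # classifier -> sampler
        probs = update_bucket_probs(probs, difficulties, gamma)   # once per epoch
        train_one_epoch(probs)                                    # sampler -> classifier
    return probs
```

With this blend, a difficult bucket gains probability gradually over epochs rather than in a single step, which is one simple way to keep the distribution from changing too abruptly.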

The definition of buckets differs from case to case, and is discussed individually in experiments.

## 4 Experiments

### 4.1 MNIST: Digit Classification

#### Dataset and Settings

We first evaluate our approach on a toy problem, namely handwritten digit classification on the MNIST dataset [?]. MNIST contains 60,000 training images and 10,000 testing images. The resolution of each image is 28×28. We use this relatively simple dataset to observe the behavior of our approach under a series of data augmentations, as well as to examine its advantage with respect to the number of training samples.

Following [?], we consider seven types of augmentation, including digit rotation, vertical/horizontal scaling, horizontal/vertical shifting, and horizontal/vertical shearing. Each digit is processed by exactly one augmentation. We further partition each type into finer bins according to the transformation parameter. The rotation angle is randomly sampled from a fixed range, which is divided into four bins. All other scaling/shifting/shearing parameters are divided into two bins. Thus, we obtain 4 + 6 × 2 = 16 bins for each original training image.

The bucket set is the Cartesian product of the class set (10 elements) and the 16 bins, i.e., there are 160 buckets in total. That is to say, we assume, by Eqn (4), that all samples with the same class and a similar transformation share the same difficulty, which is reasonable. We randomly sample a fixed number of images from each bucket to compose the probe set.
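The bucket construction above can be enumerated directly. In the sketch below the augmentation type names and the function `build_buckets` are hypothetical labels chosen for illustration; only the bin counts (four for rotation, two for each of the other six types) and the ten digit classes come from the description.

```python
from itertools import product

# Bins per augmentation type: rotation has four bins, and each
# scaling/shifting/shearing type has two.
BINS_PER_TYPE = {
    "rotation": 4,
    "vertical_scaling": 2,
    "horizontal_scaling": 2,
    "horizontal_shifting": 2,
    "vertical_shifting": 2,
    "horizontal_shearing": 2,
    "vertical_shearing": 2,
}


def build_buckets(num_classes=10):
    """Enumerate buckets as (class, augmentation type, bin index) triples."""
    bins = [(name, b) for name, n in BINS_PER_TYPE.items() for b in range(n)]
    return [(c, name, b) for c, (name, b) in product(range(num_classes), bins)]


buckets = build_buckets()
print(len(buckets))  # 10 classes x (4 + 6 * 2) bins = 160 buckets
```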

#### Results and Analysis

We use LeNet [?] as the classifier . It contains convolutional layers, pooling layers and fully-connected layers. The network is trained with Stochastic Gradient Descent. Each iteration contains a mini-batch of size ( real data and augmented data). The initial learning rate is , decayed with the inv policy, and the weight decay is fixed to be . The original training process lasts for iterations, but we allow a larger number (, , and ) of iterations so that more augmented data are seen. Without data augmentation, the classification error is not reduced () through this large number of iterations.

Results are summarized in Table 1. We can observe that our approach works consistently better than the baseline approach, which always performs uniform sampling from the data space. In particular, when the model is constrained to see a limited amount of training data, the advantage of our approach becomes even more significant: with a small number of iterations, the absolute and relative error rate drops brought by our approach come with a $p$-value that demonstrates strong statistical significance. This implies that our approach explores the data space more efficiently by aggressively looking for challenging training samples. However, as the amount of training data increases, the advantage becomes smaller: with a large number of iterations, the error rate drops shrink, although the $p$-value still suggests statistical significance. This is because MNIST is a simple dataset. Given a sufficient amount of training data, purely random sampling can gradually achieve performance comparable to our approach.

### 4.2 ShapeNet: Object Pose Estimation

| Approach | aero. | bicy. | boat | bus | car | chair | table | moto. | sofa | train | tv | mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [?] | | | | | | | | | | | | |
| Baseline | | | | | | | | | | | | |
| Ours | | | | | | | | | | | | |
| [?] | | | | | | | | | | | | |
| Baseline | | | | | | | | | | | | |
| Ours | | | | | | | | | | | | |
| [?] | | | | | | | | | | | | |
| Baseline | | | | | | | | | | | | |
| Ours | | | | | | | | | | | | |
| [?] | | | | | | | | | | | | |
| Baseline | | | | | | | | | | | | |
| Ours | | | | | | | | | | | | |

#### Dataset and Settings

We move to a natural image dataset named PASCAL3D+ [?], a challenging corpus for 3D object detection and pose estimation. The 12 rigid object classes (with more than 3,000 images per class) in the PASCAL VOC dataset [?] were augmented with 3D annotations, exhibiting more variability than other 3D datasets.

Due to the limited amount of data, we follow a recent baseline named RenderForCNN [?], which generated synthesized data to assist network training. To construct an augmented training set, a joint distribution of viewpoint angles and camera distances was first estimated from the PASCAL3D+ real training set, and millions of synthesized images were rendered from 3D models of ShapeNet [?] following the same distribution. That is to say, the data distribution is fixed throughout the entire training process. In contrast, we add the SampleAhead module, which updates the data distribution according to validation results, based on Eqn (7).

Note that one hyper-parameter in Eqn (7) controls the fraction of newly generated data. At one extreme, our algorithm degenerates to the baseline, i.e., the distribution is frozen throughout the entire training process. In practice, we choose a value that takes advantage of new data while preventing the training process from being slowed down by the time-consuming data generation (image rendering) process.

Based on these settings, we perform two challenging tasks, known as object-detection-and-pose-estimation and viewpoint prediction [?].

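| Approach | aero. | bicy. | boat | bott. | bus | car | chair | table | moto. | sofa | train | tv | mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (Baseline) | | | | | | | | | | | | | |
| (Ours) | | | | | | | | | | | | | |
| (Baseline) | | | | | | | | | | | | | |
| (Ours) | | | | | | | | | | | | | |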

#### Object Detection and Pose Estimation

In the first task, the system is asked to detect the object and estimate its azimuth view angle simultaneously (the elevation view angle is not considered). Following [?], the output is considered correct if it is accepted by both object detection and pose estimation. The correctness of object detection is measured by the IOU between the predicted and ground-truth bounding boxes. For view angle prediction, we partition the entire azimuth range into 4, 8, 16 and 24 bins, and compute the accuracy with which the predicted angle falls into the same bin as the ground-truth angle. One of the 12 classes (bottle) is not evaluated in this task, as the azimuth angle of such objects is unrecognizable.
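The view-angle part of this criterion can be sketched as follows. This is a simplified illustration: it ignores the detection (IOU) check, and it assumes that bins start at 0 degrees, whereas the benchmark's exact bin alignment may differ. The function names `azimuth_bin` and `view_accuracy` are hypothetical.

```python
def azimuth_bin(angle_deg, num_bins):
    """Map an azimuth angle in degrees to one of `num_bins` equal bins.

    Bins are assumed to start at 0 degrees (an illustrative choice).
    """
    width = 360.0 / num_bins
    # The final modulo guards against floating-point wrap-around at 360.
    return int((angle_deg % 360.0) // width) % num_bins


def view_accuracy(predicted, ground_truth, num_bins):
    """Fraction of predictions falling into the ground-truth azimuth bin."""
    hits = sum(
        azimuth_bin(p, num_bins) == azimuth_bin(g, num_bins)
        for p, g in zip(predicted, ground_truth)
    )
    return hits / len(ground_truth)


# The same predictions are judged more strictly as the bins get finer.
pred = [10.0, 100.0, 200.0]
gt = [20.0, 170.0, 300.0]
print(view_accuracy(pred, gt, 4), view_accuracy(pred, gt, 8))
```

A finer partition makes each bin narrower, which is why the task becomes harder as the number of bins increases.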

The synthesized training set contains 3D objects captured over a range of azimuth and elevation angles. We partition the viewpoint hemisphere into bins of equal size; combined with the object classes, this defines the full set of buckets. The validation subset of PASCAL3D+ is used as the probe set.

Following the baseline [?], we extract region proposals from RCNN [?], and use these images to train an AlexNet [?] for joint object and viewpoint classification ( classes, ). All technical details (learning rate, weight decay, etc.) remain the same as the baseline. We train the network for iterations, while the baseline needs iterations to traverse all synthesized images. We do not update data distribution (performing uniform sampling) in the first iterations so as to provide a stable initialization.

Results are summarized in Table 2. In terms of average accuracy (the last column), our approach outperforms the baseline in every single task. Note that we only use half the number of iterations compared to the baseline, which demonstrates a favorable efficiency in exploring the infinite data space. Our baseline used an older detector (RCNN) and classifier (AlexNet), which limited its accuracy, and recent work [?; ?] reported higher accuracy than ours with stronger backbones, e.g., [?] used Fast-RCNN for detection and VGGNet for classification. We chose to report on the same network configuration in order to make a fair comparison to our baseline [?], but our approach generalizes easily to a wide range of network architectures.

An interesting property of our approach is the increase in accuracy gain when the number of bins goes up. As shown in Figure 3, this happens in both the overall accuracy and individual classes e.g., bike. This is a side benefit brought by our approach, which mines more difficult examples to improve the performance in these challenging tasks.

We diagnose our approach with additional experiments. Our approach mainly benefits from two abilities, i.e., updating the sampling distribution during the training process and generating new data based on the updated distribution. Switching off the former ability turns it back to the baseline, with accuracy drops in every bin setting. The benefit brought by our approach becomes more significant as the number of bins goes up (i.e., the task becomes more challenging). This is qualitatively verified in Figure 2, in which our approach increases the sampling probability of the difficult buckets, thus improving the overall accuracy. On the other hand, we also disable the latter ability by only allowing our approach to sample from the original set of synthesized images. This also causes accuracy drops, because the synthesized dataset is fixed and a difficult class that requires more samples ends up seeing duplicated training data. This ablation study shows that the generating and sampling strategies are both useful and complementary in our approach.

As a final note, we find that our approach does not work well for the class table, which contributes the largest deficit compared to the baseline. This class differs significantly from the others in that rotating it by certain angles hardly changes its appearance, so that fine-grained viewpoint estimation is close to a random guess. In this scenario, the baseline approach memorizes the data distribution, but our approach actually discards this “cheating benefit” and thus performs “a worse guess”.

#### Viewpoint Estimation

The second task aims at estimating the viewpoint of the target object. Following [?], we directly use the previously trained model, and remove the factor of inaccurate object detection by using the ground-truth bounding box for each object. Given the ground-truth azimuth, elevation and in-plane angles and the predicted values, we compute their rotation matrices $R_{\mathrm{gt}}$ and $R_{\mathrm{pred}}$ accordingly, and the included angle between them is computed by $\Delta(R_{\mathrm{gt}}, R_{\mathrm{pred}}) = \|\log(R_{\mathrm{gt}}^{\top} R_{\mathrm{pred}})\|_F / \sqrt{2}$, where $\|\cdot\|_F$ is the Frobenius norm. There are two evaluation metrics. The first one, named $Acc_{\pi/6}$, computes the fraction of instances for which $\Delta(R_{\mathrm{gt}}, R_{\mathrm{pred}}) < \pi/6$; the second one, named $MedErr$, directly measures the median value of $\Delta$ in degrees.
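As an illustration of how such an included angle and the two summary metrics could be computed, the sketch below assumes the standard geodesic distance between rotations, obtained from the trace of the relative rotation, which equals the Frobenius norm of its matrix logarithm divided by the square root of two. The helper names (`rotation_z`, `geodesic_angle`, `accuracy_and_median`) and the threshold argument are assumptions introduced here.

```python
import math
import statistics


def rotation_z(theta):
    """3x3 rotation about the z-axis by `theta` radians (toy example input)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def geodesic_angle(r_a, r_b):
    """Angle in radians of the relative rotation r_a^T r_b."""
    # trace(r_a^T r_b) equals the element-wise inner product of the matrices.
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to absorb rounding errors before calling acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


def accuracy_and_median(angles, threshold):
    """Fraction of angles below `threshold`, and the median angle in degrees."""
    acc = sum(a < threshold for a in angles) / len(angles)
    return acc, math.degrees(statistics.median(angles))


# Toy usage with three predictions that are off by 0.1, 0.2 and 1.0 radians.
angles = [geodesic_angle(rotation_z(0.0), rotation_z(t)) for t in (0.1, 0.2, 1.0)]
print(accuracy_and_median(angles, threshold=math.pi / 6))
```

Using the trace avoids an explicit matrix logarithm while measuring the same rotation angle.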

Results are summarized in Table 3. Our mean value is just slightly higher than the baseline. Note that the table class contributes negatively for the same reason analyzed in the previous task; but in all the remaining classes, our approach performs better. The average accuracies over the remaining classes are vs. . In addition, the median estimation error is significantly reduced (a relative drop). All these experiments verify the effectiveness of our approach in learning from synthesized data.

## 5 Conclusions

This paper focuses on a new problem, which aims at effectively sampling synthesized data from an infinitely large parameter space. Our motivation is very simple, i.e., increasing the probability of generating hard examples so that the classifier gets trained better. To this end, we insert a novel module named SampleAhead, which maintains a distribution over the sampling space. In each training iteration, the distribution is first updated according to the current recognition results, and then used to sample synthesized training data and optimize the vision system. The concept of buckets is introduced to accelerate this process. Although simple, our approach works well in a challenging vision task, namely joint object detection and pose estimation, especially when the recognition task is difficult (e.g., the number of azimuth bins is large). The advantage of our approach becomes more significant when training time is limited.

Our algorithm has potential applications in reinforcement learning in the real world. A typical setting is to place an agent (e.g., a robot) in a room, and let it learn from the surrounding world by itself. Our research matches this scenario very well, since the data space is almost infinite but training time is limited.

## References

- [Bajcsy, 1988] R. Bajcsy. Active perception. Proceedings of the IEEE, 76(8):966–1005, 1988.
- [Blake and Yuille, 1993] Andrew Blake and Alan Yuille. Active vision. MIT press, 1993.
- [Butler et al., 2012] D.J. Butler, J. Wulff, G.B. Stanley, and M.J. Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision, 2012.
- [Chang et al., 2015] A.X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
- [Chen et al., 2016] W. Chen, H. Wang, Y. Li, H. Su, Z. Wang, C. Tu, D. Lischinski, D. Cohen-Or, and B. Chen. Synthesizing training images for boosting human 3d pose estimation. In International Conference on 3D Vision, 2016.
- [Ciresan et al., 2010] D.C. Ciresan, U. Meier, L.M. Gambardella, and J. Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220, 2010.
- [Das et al., 2017] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra. Embodied question answering. arXiv preprint arXiv:1711.11543, 2017.
- [Deng et al., 2009] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.
- [Everingham et al., 2010] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
- [Felzenszwalb et al., 2010] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
- [Freund and Schapire, 1997] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
- [Girshick et al., 2014] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.
- [Goodfellow et al., 2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
- [Gordon et al., 2017] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi. Iqa: Visual question answering in interactive environments. arXiv preprint arXiv:1712.03316, 2017.
- [He et al., 2014] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision. Springer, 2014.
- [Johnson et al., 2017] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C.L. Zitnick, and R. Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition, 2017.
- [Krizhevsky et al., 2012] A. Krizhevsky, I. Sutskever, and G.E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
- [LeCun et al., 1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- [Loshchilov and Hutter, 2015] I. Loshchilov and F. Hutter. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343, 2015.
- [Massa et al., 2016] F. Massa, R. Marlet, and M. Aubry. Crafting a multi-task cnn for viewpoint estimation. arXiv preprint arXiv:1609.03894, 2016.
- [Poirson et al., 2016] P. Poirson, P. Ammirato, C.Y. Fu, W. Liu, J. Kosecka, and A.C. Berg. Fast single shot detection and pose estimation. In International Conference on 3D Vision, 2016.
- [Qiu and Yuille, 2016] W. Qiu and A. Yuille. Unrealcv: Connecting computer vision to unreal engine. In Workshops on European Conference on Computer Vision, 2016.
- [Richardson et al., 2016] E. Richardson, M. Sela, and R. Kimmel. 3d face reconstruction by learning from synthetic data. In International Conference on 3D Vision, 2016.
- [Rowley et al., 1998] H.A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23–38, 1998.
- [Shrivastava et al., 2016] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Computer Vision and Pattern Recognition, 2016.
- [Shrivastava et al., 2017] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In Computer Vision and Pattern Recognition, 2017.
- [Simo-Serra et al., 2014] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, and F. Moreno-Noguer. Fracking deep convolutional image descriptors. arXiv preprint arXiv:1412.6537, 2014.
- [Su et al., 2015] H. Su, C.R. Qi, Y. Li, and L.J. Guibas. Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views. In International Conference on Computer Vision, 2015.
- [Sung, 1996] K.K. Sung. Learning and example selection for object and pattern detection. 1996.
- [Tulsiani and Malik, 2015] S. Tulsiani and J. Malik. Viewpoints and keypoints. In Computer Vision and Pattern Recognition, 2015.
- [Varol et al., 2017] G. Varol, J. Romero, X. Martin, N. Mahmood, M.J. Black, I. Laptev, and C. Schmid. Learning from synthetic humans. In Computer Vision and Pattern Recognition, 2017.
- [Wang and Gupta, 2015] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In International Conference on Computer Vision, 2015.
- [Wu et al., 2014] Z. Wu, S. Song, A. Khosla, X. Tang, and J. Xiao. 3d shapenets for 2.5d object recognition and next-best-view prediction. arXiv preprint arXiv:1406.5670, 2014.
- [Xiang et al., 2014] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond pascal: A benchmark for 3d object detection in the wild. In Winter Conference on Applications of Computer Vision, 2014.