VConv-DAE: Deep Volumetric Shape Learning Without Object Labels
With the advent of affordable depth sensors, 3D capture is becoming increasingly ubiquitous and has already made its way into commercial products. Yet capturing the geometry or complete shape of everyday objects with scanning devices (e.g. Kinect) still poses several challenges that result in noisy or even incomplete shapes.
Recent successes in deep learning have shown how to learn complex shape distributions in a data-driven way from large-scale 3D CAD model collections and to use them for 3D processing on volumetric representations, thereby circumventing problems of topology and tessellation. Prior work has shown encouraging results on problems ranging from shape completion to recognition. We provide an analysis of such approaches and find that both the training and the resulting representation are strongly and unnecessarily tied to the notion of object labels. We therefore propose a fully convolutional volumetric auto-encoder that learns a volumetric representation from noisy data by estimating voxel occupancy grids. The proposed method outperforms prior work on challenging tasks such as denoising and shape completion. We also show that the resulting deep embedding gives competitive performance when used for classification and promising results for shape interpolation.
Despite recent advances in 3D scanning technology, acquiring the 3D geometry or shape of an object remains a challenging task. Scanning devices such as Kinect are very useful but suffer from sensor noise, occlusion, and complete failure modes (e.g. dark surfaces and grazing angles). Incomplete geometry poses severe challenges for a range of applications such as interaction with the environment in Virtual Reality or Augmented Reality scenarios, planning for robotic interaction, or 3D printing and manufacturing.
To overcome some of these difficulties, there is a large body of work on fusing multiple scans into a single 3D model [?]. While the surface reconstruction is impressive in many scenarios, acquiring geometry from multiple viewpoints can be infeasible in some situations. For example, failure modes of the sensor will not be resolved, and some viewing angles might simply not be accessible, e.g. for a bed or cupboard placed against a wall or chairs occluded by tables.
There has also been significant research on analyzing 3D CAD model collections of everyday objects. Most of this work [?] uses an assembly-based approach to build part-based models of shapes. These methods therefore rely on part annotations and cannot model the variations of large-scale shape collections across classes. In contrast, Wu et al. (Shapenet [?]) make a first attempt to apply deep learning to this task and learn complex shape distributions in a data-driven way from raw 3D data. Their model achieves generative capability by formulating a probabilistic model over the voxel grid and labels. Despite the impressive and promising results of such deep belief nets [?], these models can be challenging to train. While they show encouraging results on the challenging task of shape completion, there is no quantitative evaluation. Furthermore, the approach requires costly sampling techniques for test-time inference, which severely limits the range of future 3D deep learning applications.
While deep learning has made remarkable progress on computer vision problems through powerful hierarchical feature learning, unsupervised feature learning remains an open challenge that is only slowly gaining traction. Labels are even more expensive to obtain for 3D data such as point clouds. Recently, Lai et al. [?] proposed a sparse coding method for learning hierarchical feature representations over point cloud data. However, their approach is based on dictionary learning, which is generally slower and less scalable than convolution-based models. Our work also falls in this line of research and aims at bringing the success of deep and unsupervised feature learning to 3D representations.
To this end, we make the following contributions:
We propose a fully convolutional volumetric auto-encoder which, to our knowledge, is the first attempt to learn a deep embedding of object shapes in an unsupervised fashion.
Our method outperforms the previous supervised approach of Shapenet [?] on the denoising and shape completion tasks while obtaining competitive results on shape classification. Furthermore, shape interpolation in the learned embedding space shows promising results.
We provide an extensive quantitative evaluation protocol for the task of shape completion, which is essential for comparing and evaluating the generative capabilities of deep learning when ground truth for real-world data is hard to obtain.
Our method is trained from scratch and end to end, thus circumventing the training issues of previous work (Shapenet [?]), such as layer-wise pre-training. At test time, our method is at least two orders of magnitude faster than Shapenet.
Part and Symmetry based Shape Synthesis. Prior work [?] uses an assembly-based approach to build deformable part-based models from CAD models. There is also work that detects symmetry in point cloud data and uses it to complete partial or noisy reconstructions. A comprehensive survey of such techniques is given by Mitra et al. [?]. Huang et al. [?] learn to predict procedural model parameters for shape synthesis from a 2D sketch using a CNN. However, part and symmetry based methods are typically class specific and require part annotations, which are expensive. In contrast, our work does not require additional supervision in the form of parts, symmetry, multi-view images or their correspondence to 3D data.
Deep learning for 3D data. ShapeNet [?] is the first work that applied deep learning to learn a 3D representation on a large-scale CAD model database. Apart from recognition, it also aims at shape completion. It builds a generative model with a convolutional RBM [?] by learning a probability distribution over class labels and the voxel grid. The learned model is then fine-tuned for the task of shape completion. Following the success of Shapenet, recent work improves the recognition results on 3D data [?], uses 3D-2D (multi-view image) correspondence to improve shape completion (repairing) results [?], proposes an intrinsic CNN [?] for 3D data, or learns correspondence between two surfaces (depth maps) [?]. Our work is mainly inspired by Shapenet in functionality but differs in methodology. In particular, our network is trained completely unsupervised and discovers useful visual representations without the use of explicitly curated labels. By learning to predict missing voxels from input voxels, our model ends up learning an embedding that is useful for both classification and interpolation.
Denoising Auto-Encoders. Our network architecture is inspired by the main principle of the DAE [?]: predicting any subset of variables from the rest is a sufficient condition for completely capturing the joint distribution over a set of variables. To share weights for stationary distributions such as those occurring in images, convolutional auto-encoders have been proposed [?]. Our model differs from such architectures in that it is not stacked and is learned end to end without any layer-wise fine-tuning or pre-training. Furthermore, we use learnable upsampling units (deconvolutional layers) to reconstruct the encoded input.
Learnable Upsampling Layer. The concept of an upsampling layer was first introduced by Zeiler and Fergus [?] to visualize the filters of internal layers in a 2D ConvNet. However, they simply transpose the weights and do not learn the filters for upsampling. Long et al. [?] and Dosovitskiy et al. [?] instead first introduced the idea of deconvolution (up-sampling) as a trainable layer, although for different applications. Recently, using up-sampling to produce spatial output [?] has also seen first applications in computer graphics. Note that a few concurrent works [?] also propose a decoder based on volumetric upsampling that outputs a 3D reconstruction. In particular, Yumer et al. [?] use a similar architecture for predicting a deformed version of the input shape. In contrast, we propose a denoising volumetric auto-encoder for shape classification and completion that learns an embedding of shapes in an unsupervised manner.
The rest of the paper is organized as follows. In the next section, we formulate the problem and describe our deep network and training details. We then move on to the experiment section, where we first show the classification experiments on the ModelNet database and the qualitative results for shape interpolation on the learned embedding. We then formulate the protocol for evaluating current techniques on the task of shape completion and show our quantitative results. We conclude with qualitative results for shape completion and denoising.
3 Unsupervised Learning of Volumetric Representation by Completion
Given a collection of shapes of various objects in different poses, our aim is to learn the shape distributions of various classes by predicting the missing voxels from the rest. Later, we want to leverage the learnt embedding for shape recognition and interpolation tasks, as well as use the generative capabilities of the auto-encoder architecture to predict enhanced versions of corrupted representations. These corruptions can range from noise such as missing voxels to more severe structured noise patterns.
3.1 VConv-DAE: Fully Convolutional Denoising Auto Encoder
Voxel Grid Representation. Following Shapenet [?], we adopt the same input representation of a geometric shape: a voxel cube of resolution $24 \times 24 \times 24$. Each mesh is first converted to a voxel representation with 3 extra cells of padding on each side to reduce convolution border artifacts, and stored as a binary tensor where 1 indicates that the voxel is inside the mesh surface and 0 indicates that it is outside. This results in an overall voxel cube of size $30 \times 30 \times 30$.
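The padding step can be illustrated with a small sketch in plain Python. It assumes a cubic binary grid stored as nested lists, with a $24^3$ core as implied by the normalisation used later in the metric. The helper name `pad_voxel_grid` is hypothetical and is not taken from the authors' Torch code, and the mesh-to-voxel conversion itself is not shown.

```python
def pad_voxel_grid(grid, pad=3):
    """Surround a cubic binary occupancy grid with `pad` empty cells per side."""
    n = len(grid)
    size = n + 2 * pad
    # Start from an all-empty cube and copy the core into its centre.
    padded = [[[0] * size for _ in range(size)] for _ in range(size)]
    for x in range(n):
        for y in range(n):
            for z in range(n):
                padded[x + pad][y + pad][z + pad] = grid[x][y][z]
    return padded


# Toy input: a fully occupied 24x24x24 core (an illustrative assumption).
core = [[[1] * 24 for _ in range(24)] for _ in range(24)]
padded = pad_voxel_grid(core)
```

With `pad=3`, a core of side 24 becomes a cube of side 30, and every voxel in the outer three layers stays empty.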
Overview of Architecture. We learn an end-to-end, voxel-to-voxel mapping by phrasing it as a two-class (1-0) auto-encoder formulation from a whole voxel grid to a whole voxel grid. An overview of our VConv-DAE architecture is shown in Figure 1. Labels in our training correspond to voxel occupancy and not to class labels. Our architecture starts with a dropout layer directly connected to the input layer. The left half of our network can be seen as an encoder stage that results in a condensed representation (bottom of the figure), which is connected to a fully connected layer in between. In the second stage (right half), the network reconstructs the input from this intermediate representation with deconvolutional (Deconv) layers, which act as learnable local up-sampling units. We now explain the key components of the architecture in more detail.
Data Augmentation Layer. While data augmentation has been used extensively to build deep invariant features for images [?], it is relatively little explored for volumetric data. We put a dropout [?] layer on the input. This serves as input data augmentation and as implicit training on a virtually infinite amount of data, and in our experiments it greatly reduced over-fitting.
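The effect of input dropout on a binary occupancy grid can be sketched as follows. This is a minimal illustration in plain Python rather than the authors' Torch layer: `drop_voxels` is a hypothetical helper, the drop probability `p` is a parameter of the sketch (the training details state that half of the input is removed), and the sketch only zeroes voxels without the rescaling of surviving values that standard dropout implementations often apply.

```python
import random


def drop_voxels(grid, p=0.5, rng=None):
    """Zero each voxel independently with probability p (hypothetical helper)."""
    rng = rng or random.Random()
    return [[[0 if rng.random() < p else v for v in row] for row in plane]
            for plane in grid]


# Toy 4x4x4 fully occupied grid; a fixed seed keeps the example repeatable.
toy = [[[1] * 4 for _ in range(4)] for _ in range(4)]
noisy = drop_voxels(toy, p=0.5, rng=random.Random(0))
kept = sum(v for plane in noisy for row in plane for v in row)
```

Because a fresh random mask is drawn on every call, each pass over the training set presents the network with a different corrupted version of the same shape.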
Encoding Layers: 3D Convolutions. The first convolutional layer has 64 filters of size 9 and stride 3. The second convolutional layer has 256 filters of size 4 and stride 2, meaning each filter has $64 \times 4 \times 4 \times 4$ parameters. This results in 256 channels of size $3 \times 3 \times 3$. These feature maps are then flattened into a one-dimensional vector of length 6912 ($= 256 \times 3 \times 3 \times 3$), which is followed by a fully connected layer of the same length (6912). This bottleneck layer later acts as the shape embedding in the classification and interpolation experiments. The fixed-size encoded input is then reconstructed with two deconv layers. The first deconv layer contains 64 filters of size 5 and stride 2, while the last deconv layer merges all 64 feature cubes back into the original voxel grid; it contains a filter of size 6 and stride 3.
Decoding Layers: 3D Deconvolutions. While CNN architectures based on the convolution operator have been very powerful and effective in a range of vision problems, deconvolutional (also called convolution transpose) architectures have recently been gaining traction. Deconvolution (Deconv) is essentially a transposed convolution: it takes one value from the input, multiplies it by the weights in the filter, and places the result in the output channel. Thus, if the 2D filter has size $f \times f$, it generates an $f \times f$ output matrix for each input pixel. The outputs are generally stored with an overlap determined by the stride $d$ in the output channel. Thus, for input size $x$, filter size $f$, and stride $d$, the output is of dims $(x - 1) \cdot d + f$. Upsampling is performed until the original size of the input has been regained.
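The layer sizes quoted above can be checked with a few lines of arithmetic. The sketch below assumes that no zero padding is applied inside the convolutional and deconvolutional layers, and it uses the standard size relations $(x - f)/d + 1$ for a strided convolution and $(x - 1) \cdot d + f$ for a transposed convolution. The function names are illustrative only.

```python
def conv_out(size, kernel, stride):
    """Output side length of an unpadded strided convolution."""
    return (size - kernel) // stride + 1


def deconv_out(size, kernel, stride):
    """Output side length of an unpadded transposed convolution."""
    return (size - 1) * stride + kernel


s0 = 30                              # padded input cube, 30 per side
s1 = conv_out(s0, kernel=9, stride=3)     # first conv layer, 64 filters
s2 = conv_out(s1, kernel=4, stride=2)     # second conv layer, 256 filters
flat = 256 * s2 ** 3                 # length of the flattened bottleneck
params_per_filter = 64 * 4 ** 3      # weights per filter in the second conv layer
s3 = deconv_out(s2, kernel=5, stride=2)   # first deconv layer, 64 filters
s4 = deconv_out(s3, kernel=6, stride=3)   # last deconv layer, back to the input size
```

Under these assumptions the side length goes 30, 8, 3 through the encoder and 3, 9, 30 through the decoder, which matches the 6912-dimensional bottleneck and the requirement that the output regains the input resolution.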
We did not extensively experiment with different network configurations. However, small variations in network depth and width did not seem to have a significant effect on performance. Some of the design choices also take into account the input voxel resolution. We chose two convolutional layers for the encoder to extract robust features at multiple scales. Learning a robust shape representation essentially means capturing the correlation between different voxels. The receptive field of the convolutional filters therefore plays a major role, and we observe the best performance with large conv filters in the first layer and a large deconv filter in the last layer. We experimented with two types of loss functions: mean square loss and cross entropy loss. Since we have only two classes, there is not much variation in performance, with cross entropy being slightly better.
3.2 Dataset and Training
Dataset. Wu et al. [?] use ModelNet, a large-scale 3D CAD model dataset, for their experiments. It contains 151,128 3D CAD models belonging to 660 unique object categories. They provide two subsets of this dataset for experiments. The first subset contains 10 classes that overlap with the NYU dataset [?] and consists of indoor scene classes such as sofa, table, chair, and bed. The second subset contains 40 classes, where each class has at least 100 unique CAD models. Following the protocol of [?], we use both the 10-class and 40-class subsets for classification, while completion is restricted to the 10-class subset, which mostly corresponds to indoor scene objects.
Training Details. We train our network end-to-end from scratch. We experiment with different levels of dropout noise and observe that training with noisier data helps the network generalise well on the tasks of denoising and shape completion. Thus, we set a noise level of $p = 0.5$ for our dropout data augmentation layer, which eliminates half of the input at random, so the network is trained for reconstruction while observing only 50 % of the input voxels. We train our network with pure stochastic gradient descent and a learning rate of for epochs. We use momentum with a value of . We use the open source library Torch for implementing our network and will make our code public at the time of publication.
We conduct a series of experiments, in particular to establish a comparison with the related work of Shapenet [?]. First, we evaluate the representation that our approach acquires in an unsupervised way on a classification task, thereby comparing directly to Shapenet. We then support the empirical performance of feature learning with qualitative results obtained by linear interpolation in the embedding space of various shapes. Thereafter, we propose two settings to quantitatively evaluate the generative performance of 3D deep learning approaches on denoising and shape completion tasks, on which we also benchmark against Shapenet and baselines related to our own setup. We conclude the experiments with qualitative results.
4.1 Evaluating the Unsupervised Embedding Space
Features learned by deep networks are state-of-the-art in various computer vision problems. However, unsupervised volumetric feature learning is a less explored area. Our architecture is primarily designed for the shape completion and denoising tasks. Nevertheless, we are also interested in how features learned in an unsupervised manner compare with fully supervised, state-of-the-art architectures built only for 3D classification.
We conduct 3D classification experiments to evaluate our features. Following Shapenet, we use the same train/test split, taking the first 80 models of each class for training and the first 20 examples for testing. Each CAD model is rotated about the gravity direction in steps of 30 degrees, which results in a total of 38,400 CAD models for training and 9,600 for testing. We evaluate our network on the classification task in the following two ways:
Ours-UnSup: We feed the test set forward through the network, take the fixed-length bottleneck layer of 6912 dimensions, and use it as a feature vector for a linear SVM. Note that the representation is trained completely unsupervised.
Ours-Fine Tuned (FT): We follow the setup of Shapenet [?], which puts a layer with class labels on the topmost feature layer and fine-tunes the network. In our network, we take the 6912-dimensional bottleneck layer and put another layer between the bottleneck and the softmax layer. The resulting classifier thus has an intermediate fully connected layer, giving 6912-512-40.
For comparison, we also report performance for Light Field descriptor (LFD [?], 4,700 dimensions) and Spherical Harmonic descriptor (SPH [?], 544 dimensions). We also report the overall best performance achieved so far on this dataset [?].
|10 classes||SPH [?]||LFD [?]||SN [?]||Ours-UnSup||Ours-FT||VoxNet||MvCnn [?]|
|40 classes||SPH [?]||LFD [?]||SN [?]||Ours-UnSup||Ours-FT||VoxNet||MvCnn [?]|
Our representation, Ours-UnSup, achieves 75 % accuracy on the 40 classes while being trained completely unsupervised. Compared to the fine-tuned Shapenet setup, our fine-tuned representation, Ours-FT, compares favorably and outperforms Shapenet on both 10 and 40 classes.
Comparison with the state-of-the-art. Our architecture is designed for shape completion and denoising, while Voxnet and MvCnn are designed for recognition only. We demonstrate that it also lends itself to recognition and shows promising performance. MvCnn [?] outperforms the other methods by a large margin. However, unlike the rest of the methods, it works in the image domain: for each model, MvCnn [?] first renders it from different views and then aggregates the CNN scores of all rendered images. Compared to the fully supervised classification network of Voxnet, our accuracy is only 3 % lower on 40 classes. This is because the Voxnet architecture is shallow and contains pooling layers that help classification but are not suitable for reconstruction, as also noted in 3D Shapenet.
4.2 Linear Interpolation in the embedding space.
|Source(t=1)||t= 3||t=5||t=7||t = 9||Target(t= 10)|
Encouraged by the unsupervised volumetric feature learning performance, we analyze the representation further to understand the embedding space learned by our auto-encoder. To this end, we randomly choose two different instances of a class in the same pose as input and feed them to the encoder part of our trained model, which transforms any shape into a fixed-length encoding vector of length 6912. We call these two instances Source and Target in Table 1. On a scale from 1 to 10, where 1 corresponds to the source instance and 10 to the target, we then linearly interpolate the eight intermediate encoded vectors. These interpolated vectors are fed to the second (decoder) part of the model, which decodes each encoded vector into a volume. Note that we chose linear over non-linear interpolation partly because of its simplicity and partly because the feature space already achieves linear separability on 10 classes with an accuracy of 80 %. In Table 1, we show the interpolated volumes between source and target at every other step. We observe that in most cases new, connected shapes are inferred as intermediate steps and a plausible transition is produced even for highly non-convex shapes.
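The interpolation scheme can be sketched in plain Python as follows. The sketch assumes uniformly spaced steps between the two codes, which the text does not state explicitly, and it uses short toy vectors in place of the 6912-dimensional encodings. The encoder and decoder are not modelled here, and `interpolate_codes` is a hypothetical helper name.

```python
def interpolate_codes(source, target, steps=10):
    """Return `steps` codes from source (t=1) to target (t=steps), evenly spaced."""
    codes = []
    for t in range(1, steps + 1):
        alpha = (t - 1) / (steps - 1)   # 0 at the source, 1 at the target
        codes.append([(1 - alpha) * s + alpha * g
                      for s, g in zip(source, target)])
    return codes


# Toy 3-dimensional codes standing in for the 6912-dimensional embeddings.
source = [0.0, 2.0, -1.0]
target = [9.0, -7.0, 8.0]
codes = interpolate_codes(source, target)
```

Here `codes[0]` and `codes[-1]` reproduce the source and target codes, and the eight entries in between are the intermediate vectors that would be passed to the decoder.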
5 Denoising and Shape Completion
In this section, we show experiments that evaluate our network on the tasks of denoising and shape completion. This is very relevant in scenarios where geometry is captured with depth sensors, which often produce holes and noise due to sensor and surface properties. Ideally, this task should be evaluated on real-world depth or volumetric data, but obtaining ground truth for real-world data is very challenging and, to the best of our knowledge, there exists no dataset that contains ground truth for the missing parts and holes of Kinect data. Kinect fusion type approaches still suffer from sensor failure modes, and large objects such as furniture often cannot be scanned from all sides in their typical location.
We thus rely on a CAD model dataset, where the complete geometry of various objects is available, and simulate different noise characteristics to test our network. We use the 10-class subset of the ModelNet database for the experiments. To remain comparable to Shapenet, we use their pretrained generative model for comparison and accordingly train our model on the first 80 (before rotation) CAD models of each class. This results in 9600 training models for the following experiments.
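The dataset sizes quoted here and in the classification experiments follow directly from the rotation augmentation, as the short check below illustrates. It assumes that rotating every 30 degrees yields 12 poses per model; `augmented_count` is an illustrative helper name.

```python
def augmented_count(num_classes, models_per_class, step_degrees=30):
    """Number of voxel grids after rotating each model in fixed angular steps."""
    rotations = 360 // step_degrees      # 12 poses for 30-degree steps
    return num_classes * models_per_class * rotations


train_10 = augmented_count(10, 80)   # completion experiments, 10-class subset
train_40 = augmented_count(40, 80)   # classification training set, 40 classes
test_40 = augmented_count(40, 20)    # classification test set, 40 classes
```

The three values come out as 9600, 38400 and 9600, matching the counts reported in the text.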
We first evaluate our network on the same random noise characteristics with which we train it. This is challenging since the test set contains different instances than the training set. We also vary the amount of random noise injected at test time. Training is the same as for classification, and we use the same model trained on the first 80 CAD models. At test time, we use all the test models available for these 10 classes for evaluation.
Baseline and Metric. To better understand the network's reconstruction performance at test time, we study the following methods:
Convolutional Auto Encoder (CAE): We train the same network without any noise, i.e. without the dropout layer. This baseline shows the importance of data augmentation, or injecting noise, during training.
Shapenet (SN): We use the pretrained generative model for the 10-class subset made public by Shapenet, together with their code for completion. We use the same hyper-parameters as given in their source code for completion. Therefore, we set number of epochs to , number of Gibbs iteration to 1 and threshold parameter to . Their method assumes that an object mask is available for the task of completion at test time. Our model makes no such assumption, since such a mask is difficult to obtain at test time. We therefore evaluate Shapenet with two different scenarios for the mask: first, SN-1, which sets the whole voxel grid as the mask, and second, SN-2, which sets the occupied voxels of the test input as the mask. Given the range of hyper-parameters, we report performance for the best hyper-parameters.
Metric: We count the number of voxels that differ from the actual input, i.e. we take the absolute difference between the reconstruction of the noisy input and the original (noise-free) version. We then normalise the reconstruction error by the total number of voxels in the grid ($13824 = 24 \times 24 \times 24$). Note that the voxel resolution of $30 \times 30 \times 30$ is obtained by padding 3 voxels on each side, so the network never sees an input with occupied voxels in the padding. This gives us the resulting reconstruction or denoising error in %.
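A minimal sketch of this metric in plain Python is given below. It assumes that the reconstruction has already been binarised (the thresholding rule is not specified in the text) and that both grids are nested lists of 0/1 values of the same shape. The function name and the toy example are illustrative.

```python
def reconstruction_error(reconstruction, ground_truth, core_voxels=24 ** 3):
    """Percentage of differing voxels, normalised by the unpadded grid size."""
    differing = sum(
        abs(r - g)
        for rec_plane, gt_plane in zip(reconstruction, ground_truth)
        for rec_row, gt_row in zip(rec_plane, gt_plane)
        for r, g in zip(rec_row, gt_row)
    )
    return 100.0 * differing / core_voxels


# Toy ground truth: a 30x30x30 grid whose 24x24x24 core is fully occupied.
truth = [[[1 if all(3 <= c < 27 for c in (x, y, z)) else 0
           for z in range(30)] for y in range(30)] for x in range(30)]
# Toy reconstruction: identical except for one missing voxel.
recon = [[row[:] for row in plane] for plane in truth]
recon[10][10][10] = 0
error = reconstruction_error(recon, truth)
```

In this toy case exactly one voxel differs, so the error is one part in 13824 expressed as a percentage.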
5.2 Slicing Noise and Shape completion
In this section, we evaluate our network on a structured version of noise that is motivated by occlusions in real-world scenarios and by failure modes of the sensor, which generate “holes” in the data. To simulate such scenarios, we inject slicing noise into the test set as follows: for each instance, we randomly choose n slices of the volumetric cube and remove them. We then evaluate our network on three amounts of slicing noise, depending on how many slices are removed.
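The slicing procedure can be sketched as follows in plain Python. Several details are assumptions of the sketch rather than statements from the text: slices are taken along the first axis only, the number of removed slices is the requested fraction of the side length rounded to the nearest integer, and `slice_noise` is a hypothetical helper name.

```python
import random


def slice_noise(grid, fraction, rng=None):
    """Zero a random subset of whole slices along the first axis."""
    rng = rng or random.Random()
    size = len(grid)
    n_slices = round(fraction * size)           # e.g. 30 % of 30 slices -> 9
    removed = set(rng.sample(range(size), n_slices))
    noisy = []
    for x, plane in enumerate(grid):
        if x in removed:
            noisy.append([[0] * len(row) for row in plane])   # wipe the slice
        else:
            noisy.append([row[:] for row in plane])           # keep a copy
    return noisy, sorted(removed)


# Toy input: a fully occupied 30x30x30 grid, with a fixed seed for repeatability.
full = [[[1] * 30 for _ in range(30)] for _ in range(30)]
noisy, removed = slice_noise(full, 0.3, rng=random.Random(0))
```

With a fraction of 0.3 on a grid of side 30, nine complete slices are emptied while all other slices are left unchanged.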
Injected slicing noise is challenging on two counts. First, our network is not trained for this noise. Second, injecting 30 % slicing noise removes a large portion of the object. Thus, evaluating on this noise relates to the task of shape completion. For comparison, we again use Shapenet with the same parameters as described in the previous section. In the table below, 10, 20 and 30 indicate the % of slicing noise; 30 % means that we randomly remove all voxels lying on 9 (30 %) faces of the cube. We use the same metric as described in the previous section to arrive at the following numbers in %.
Discussion. Our network performs significantly better than the CAE as well as Shapenet. This is also visible in the qualitative results shown later in Table 2. The superior performance of our network over the no-noise network (CAE) justifies learning voxel occupancy from noisy shape data. The performance on the different noise types also suggests that our network finds completing slicing noise (completion) more challenging than denoising random noise; 30 % slicing noise removes a significant chunk of the object.
In Table 2 and Table 3, each row contains 4 images: the first is the ground truth, and the second is obtained by injecting noise (random and slicing) and acts as the input to our network. The third image is the reconstruction obtained by our network, while the fourth is the output of Shapenet.
As shown in the qualitative results, our network can fill in significant missing portions of objects when compared to Shapenet. All images shown in Table 3 are for the 30 % slicing noise scenario, whereas Table 2 corresponds to inputs with % random noise. Judging by our quantitative evaluation, our model finds slicing noise to be the most challenging scenario. This is also evident in the qualitative results and is partially explained by the fact that the network is not trained for slicing noise. In some cases, edges and boundaries are smoothed out to some extent.
Runtime comparison with Shapenet. We compare our runtime during training and testing with Shapenet. All runtimes reported here are obtained by running the code on an Nvidia K40 GPU. The training time for Shapenet is quoted from their paper, which mentions that pre-training as well as fine tuning each takes days, and the test time of ms is calculated by estimating the time it takes for one test completion. In contrast, our model trains in only day. We observe the strongest improvements in runtime at test time, where our model takes only ms, which is x faster than Shapenet, an improvement of two orders of magnitude. This is in part due to our network not requiring sampling at test time.
7 Conclusion and Future Work
We have presented a simple and novel unsupervised approach that learns volumetric representation by completion. The learned embedding delivers comparable results on recognition and promising results for shape interpolation. Furthermore, we obtain stronger results on denoising and shape completion while being trained without labels. We believe that the transition from RBM to feed forward models, first evaluation-qualitative results for shape completion, promising recognition performance and shape interpolation results will stimulate further work on deep learning for 3D geometry. In future, we plan to extend our work to deal with deformable objects and larger scenes.
Acknowledgement. This work was supported by funding from the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No 642841.