ChainerCV: a Library for Deep Learning in Computer Vision
Despite significant progress of deep learning in the field of computer vision, there has not been a software library that covers these methods in a unifying manner. We introduce ChainerCV, a software library that is intended to fill this gap. ChainerCV supports numerous neural network models as well as software components needed to conduct research in computer vision. These implementations emphasize simplicity, flexibility and good software engineering practices. The library is designed to perform on par with the results reported in published papers and its tools can be used as a baseline for future research in computer vision. Our implementation includes sophisticated models like Faster R-CNN and SSD, and covers tasks such as object detection and semantic segmentation.
In recent years, the computer vision community has witnessed rapid progress thanks to deep learning methods in areas including image classification (Alex Krizhevsky and Hinton, 2012), object detection (Ren et al., 2015) and semantic segmentation (Badrinarayanan et al., 2017). High quality software tools are essential to keep up with the rapid pace of innovation in deep learning research. The quality of a software library is strongly influenced by traditional software quality metrics such as consistent coding conventions and coverage of tests and documentation. In addition, a deep learning library, specifically a library that hosts implementations of deep learning models, needs to guarantee quality during the training phase. Training a machine learning model to perform well is difficult because numerous details can prevent it from reaching its full potential. This makes it all the more important that a library hosts high performance training code, so that it can guide developers and researchers who want to extend these implementations. We think that training code should perform on par with the performance reported by the paper that the implementation is based on. On top of providing a high quality implementation for training a model, we also aim to make it more accessible, especially for users with limited experience, to run inference on sophisticated computer vision models such as Faster R-CNN (Ren et al., 2015).
The rapid progress of deep learning research has been enabled by a number of frameworks including Chainer (Tokui et al., 2015), TensorFlow (Abadi et al., 2015) and PyTorch (http://pytorch.org). These frameworks have supported fundamental components of deep learning software such as automatic differentiation and effective parallelization using GPUs. However, they are intended to target general usage, and do not aim to provide complete implementations of vision algorithms.
Our software, ChainerCV, supports algorithms to solve tasks in the computer vision field such as object detection, while treating usability and predictable performance as the top priorities. This makes it well suited for use as a building block in larger software projects, such as robotic software systems, even by developers who are not computer vision experts. Recently there has been a growing trend of building new neural network models using existing architectures as building blocks. Examples can be seen in tasks such as instance segmentation (He et al., 2017) and scene graph generation (Xu et al., 2017), which depend on object detection algorithms to localize objects in images. ChainerCV’s algorithms can be used as components to construct software that can solve complex computer vision problems.
Training a network is a critical part of a machine learning algorithm, and ChainerCV is designed to make this process easy. In many use cases, users need a machine learning model to perform well on a particular dataset they have. Often a pretrained model is not sufficient for the users’ tasks, and so they must re-train the model using their datasets. To make training a model easier in such cases, ChainerCV provides reference implementations to train models, which can be used as a baseline to write new training code. In addition, pretrained models can be used together with the users’ dataset to fine-tune the model. ChainerCV also provides a set of tools for training a model, including dataset loaders, prediction evaluators and visualization tools.
Reproducibility in machine learning and computer vision is one of the most important factors affecting the quality of the research. ChainerCV aims at easing the process of reproducing the published results by providing training code that is guaranteed to perform on par with them. These algorithms would serve as baselines to find a new idea through refinement and as a tool to compare a new approach against existing approaches.
To summarize, ChainerCV offers the following two contributions:
- High quality implementations of deep learning-based computer vision algorithms to solve problems with emphasis on usability.
- Reference code and tools to train models, which are guaranteed to perform on par with the published results.
Our code is released at https://github.com/pfnet/chainercv.
2. Related Work
Deep learning frameworks such as Chainer (Tokui et al., 2015) and TensorFlow (Abadi et al., 2015) play a fundamental role in deep learning software. However, these software packages focus on fundamental components such as automatic differentiation and GPU support. Keras (Chollet et al., 2015) is a high-level deep learning API that is intended to enable fast experimentation. While ChainerCV shares with Keras the goal of enabling fast prototyping, our software provides more thorough coverage of software components for computer vision tasks. In addition, Keras does not provide high performance training code for sophisticated vision models like Faster R-CNN (Ren et al., 2015).
OpenCV (Itseez, 2015) is a prominent example of a computer vision software library supporting numerous highly tuned implementations. The library supports a wide range of algorithms, including some deep learning-based algorithms, with an emphasis on running inference in a cross-platform environment. Different from their work, ChainerCV aims at accelerating research in this field in a more comprehensive manner by providing high quality training code on top of implementations to conduct inference.
Orthogonal to our open source work, there are several proprietary software solutions that support computer vision algorithms based on deep learning. These include Computer Vision System Toolbox by MATLAB and Google Cloud Vision API by Google Cloud Platform.
Model Zoo hosts a number of open source implementations and their trained models. Algorithms for a wide range of tasks are hosted on their website, but they are not provided as a library that organizes code in some standardized manner. There are open source implementations released by the authors of papers and third-party implementations released by open source developers. The primary aim of these works is to make a prototype of a research idea. Unlike them, one of our goals is to develop an implementation that follows good software engineering practices so that it is readable and easily extendable to other projects. We assure the quality by developing through a peer review process and by thorough coverage of documentation and tests.
More closely related to our work is pytorch/vision, which is a computer vision library that uses PyTorch as its backend. Similar to our work, it hosts pretrained models to let users use high performance convolutional neural networks off-the-shelf. At the time of writing, its support for pretrained models and data preparation is limited to classification tasks.
3.1. Task Specific Models
Currently, ChainerCV supports networks for object detection and semantic segmentation (Figure 1). Object detection is the task of finding objects in an image and classifying them. Semantic segmentation is the task of segmenting an image into pieces and assigning object labels to them.
We implemented our detection algorithms in a unifying manner by exploiting the fact that many of the leading state-of-the-art architectures have converged on a similar structure (Huang et al., 2016). All of these architectures use convolutional neural networks to extract features and use sliding windows to predict localization and classification. Our implementation includes architectures that can be grouped into the Faster R-CNN (Ren et al., 2015) and Single Shot Multibox Detector (SSD) (Liu et al., 2016) meta-architectures. Faster R-CNN takes crops proposed by a separate neural network, called the Region Proposal Network, and carries out classification on each crop of the input image. SSD tries to avoid the extra time spent running a Region Proposal Network by directly predicting classes and coordinates of bounding boxes. These meta-architectures are instantiated into more concrete networks that have different feature extractors or different head architectures. These different implementations inherit from the base class for each meta-architecture using our flexible class design.
Our implementation of semantic segmentation models includes SegNet (Badrinarayanan et al., 2017). The architecture follows an encoder-decoder style. We have separated the module that calculates the loss from the network that predicts a probability map. This design makes the loss reusable in other implementations of semantic segmentation models, which we are planning to add in the future.
Models for a certain task are designed to have a common interface. For example, detection models support a predict method that takes images and outputs bounding boxes around regions where objects are predicted to be located. The common interface allows users to swap different models easily inside their code. On top of that, the common interface is necessary to build functions that interact with neural network models by passing input images and receiving predictions. For instance, thanks to this interface, we can write a function that iterates over a dataset and visualizes the predictions of all the samples.
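The benefit of a common interface can be illustrated with a small sketch. The classes below are hypothetical toy models, not ChainerCV classes, and the exact return format (per-image lists of boxes, labels and scores, with boxes ordered as y_min, x_min, y_max, x_max) is an assumption made for the example. The point is only that a helper written against the shared predict method works with any model that provides it.

```python
from typing import List, Protocol, Sequence, Tuple

# A box is (y_min, x_min, y_max, x_max); this ordering is an assumption of the sketch.
Box = Tuple[float, float, float, float]
Image = List[List[float]]  # nested lists stand in for an image array
Prediction = Tuple[List[List[Box]], List[List[int]], List[List[float]]]


class Detector(Protocol):
    """Hypothetical common interface: every detector exposes predict()."""

    def predict(self, imgs: Sequence[Image]) -> Prediction:
        ...


class WholeImageDetector:
    """Toy model: reports a single box covering each whole image."""

    def predict(self, imgs):
        bboxes = [[(0.0, 0.0, float(len(img)), float(len(img[0])))] for img in imgs]
        labels = [[0] for _ in imgs]
        scores = [[1.0] for _ in imgs]
        return bboxes, labels, scores


class NoObjectDetector:
    """Toy model: never detects anything."""

    def predict(self, imgs):
        return [[] for _ in imgs], [[] for _ in imgs], [[] for _ in imgs]


def count_detections(model: Detector, dataset: Sequence[Image]) -> int:
    """Model-agnostic helper: relies only on the shared predict() method."""
    total = 0
    for img in dataset:
        bboxes, _, _ = model.predict([img])
        total += len(bboxes[0])
    return total


# Two dummy images of size 3 x 4 and 2 x 2.
dataset = [[[0.0] * 4 for _ in range(3)], [[0.0] * 2 for _ in range(2)]]
```

Swapping one model for another changes only the object passed to count_detections; the helper itself stays untouched.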
3.2. Datasets
In order to train and evaluate deep learning models, datasets are needed. ChainerCV provides an interface to datasets commonly used in computer vision tasks, such as datasets from the Pascal VOC Challenge (Everingham et al., 2010). A dataset object downloads data from the Internet if necessary, and returns the requested contents through an array-like interface.
3.3. Transforms
A transform is a function that takes an image and annotations as inputs and applies a modification to the inputs such as image resizing. These functions are composed together to create a custom data preprocessing pipeline.
ChainerCV uses TransformDataset to compose different transforms. This is a class that wraps around a dataset by applying a function to a sample retrieved from the underlying dataset, which is often prepared to simply load data from a file system without any modifications.
We found that extending a dataset with an arbitrary function is especially effective when multiple objects are processed in an interdependent manner. Such interdependence of transforms happens, for example, when an image is randomly flipped horizontally to augment a dataset and the coordinates of its bounding boxes must be altered depending on whether the image was flipped. See Figure 2 for the code to carry out the data preprocessing pipeline.
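The interdependence described above can be sketched in a few lines. This is an illustrative stand-in rather than the code of Figure 2: TransformedView is a hypothetical minimal wrapper in the spirit of TransformDataset, images are nested lists of shape H x W to keep the example dependency-free, and boxes are assumed to be ordered as (y_min, x_min, y_max, x_max).

```python
import random


class TransformedView:
    """Hypothetical wrapper that applies a function to each retrieved sample."""

    def __init__(self, dataset, transform):
        self._dataset = dataset
        self._transform = transform

    def __len__(self):
        return len(self._dataset)

    def __getitem__(self, i):
        return self._transform(self._dataset[i])


def flip_sample(img, bboxes, flip):
    """Flip an H x W image horizontally and move its boxes consistently."""
    if not flip:
        return img, bboxes
    width = len(img[0])
    flipped_img = [row[::-1] for row in img]
    # Only the x coordinates change, and min/max swap roles after mirroring.
    flipped_bboxes = [
        (y_min, width - x_max, y_max, width - x_min)
        for (y_min, x_min, y_max, x_max) in bboxes
    ]
    return flipped_img, flipped_bboxes


def make_random_flip(rng):
    """Build a transform that draws one flip decision per sample."""
    def transform(sample):
        img, bboxes = sample
        # The same decision is applied to both the image and its boxes.
        return flip_sample(img, bboxes, flip=rng.random() < 0.5)
    return transform


# One sample: a 2 x 4 image with a box covering its leftmost column.
base = [([[1, 2, 3, 4], [5, 6, 7, 8]], [(0, 0, 2, 1)])]
augmented = TransformedView(base, make_random_flip(random.Random(0)))
img, bboxes = augmented[0]
```

Because the transform receives the whole sample, the flip decision is drawn once and shared by the image and its annotations, which would be awkward to express with independent per-field transforms.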
3.4. Visualizations and Evaluation Metrics
ChainerCV supports a set of functions for visualization and evaluation, which are important for conducting research in computer vision. These functions can be used across different models by enforcing a consistent data representation for each type of data. For example, an image array is assumed to be RGB and shaped as (C, H, W), where the elements of the tuple are channel size, height and width of the image. The evaluations and visualizations in Section 4 are carried out using the functions in ChainerCV.
3.5. GPU Arrays
As done in Chainer (Tokui et al., 2015), ChainerCV uses cupy.ndarray to represent arrays stored in GPU memory and numpy.ndarray to represent arrays stored in CPU memory. CuPy (https://github.com/cupy/cupy) is a NumPy-like multi-dimensional array library with GPU acceleration. Many functions in ChainerCV accept both types as arguments and return the output with the same type as the input. These functions include non-maximum suppression (Oro et al., 2016), which is efficiently implemented using CUDA parallelization. It is often complicated to set up a library to call CUDA kernels from a Python module because the installation needs to consider a variety of machine configurations. ChainerCV relies on CuPy when calling CUDA kernels, and its installation procedure is quite simple.
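For readers unfamiliar with non-maximum suppression, the following is a minimal sequential sketch of the greedy algorithm in plain Python. It is not the CUDA-parallel implementation mentioned above and makes no claim about ChainerCV's function signature; the box ordering (y_min, x_min, y_max, x_max) and the function names are assumptions of the example.

```python
def iou(a, b):
    """Intersection over union of two (y_min, x_min, y_max, x_max) boxes."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(bboxes, scores, thresh):
    """Return indices of the kept boxes, in order of decreasing score."""
    order = sorted(range(len(bboxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(bboxes[i], bboxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = non_maximum_suppression(boxes, scores=[0.9, 0.8, 0.7], thresh=0.5)
```

Here the second box overlaps the first with an IoU of 81/119 and is suppressed, while the third box is disjoint from both and is kept. The sequential loop costs quadratic time in the number of boxes, which is the motivation for a parallel GPU version.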
We report the performance of the implemented training code, and verify that the scores are on par with the ones reported in the original papers. Note that, due to randomness in training, scores that differ slightly from those in the original papers are unavoidable.
4.1. Faster R-CNN
We evaluated the performance of our implementation of Faster R-CNN, and compared it to the performance reported in the original paper (Ren et al., 2015). We experimented with a model that uses the VGG-16 model (Simonyan and Zisserman, 2015) as a feature extractor. Using our training code, the model is trained on the 2007 trainval split of the PASCAL VOC detection dataset and evaluated on the 2007 test split. Some detection results of this trained model are shown in Figure 3. The performance is compared against the original implementation using mean average precision in Table 1. Due to the stochastic training process, it is known that the final performance fluctuates (Chen and Gupta, 2017).
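As a reminder of how mean average precision is obtained, the sketch below computes a per-class average precision and averages it over classes. It assumes the 11-point interpolated variant associated with PASCAL VOC 2007; the text does not state which variant was used, so this choice is an assumption. The matching of detections to ground-truth objects is also omitted: each detection is assumed to be already flagged as a true or false positive, and n_positives is assumed to be greater than zero.

```python
def average_precision_11pt(scores, is_true_positive, n_positives):
    """11-point interpolated average precision for a single class.

    scores: confidence of each detection.
    is_true_positive: whether each detection matched a ground-truth object.
    n_positives: number of ground-truth objects of this class (must be > 0).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    precisions, recalls = [], []
    tp = 0
    for rank, i in enumerate(order, start=1):
        tp += 1 if is_true_positive[i] else 0
        precisions.append(tp / rank)
        recalls.append(tp / n_positives)
    ap = 0.0
    for k in range(11):
        level = k / 10
        # Interpolated precision: the best precision at any recall >= level.
        candidates = [p for p, r in zip(precisions, recalls) if r >= level]
        ap += max(candidates, default=0.0) / 11
    return ap


def mean_average_precision(per_class_aps):
    """Average the per-class APs into a single score."""
    return sum(per_class_aps) / len(per_class_aps)


# Three detections, two ground-truth objects; the middle detection is a miss.
ap = average_precision_11pt([0.9, 0.8, 0.7], [True, False, True], n_positives=2)
```

In this example the precision is 1 up to a recall of 0.5 and 2/3 beyond it, so the six lowest recall levels contribute 1 and the remaining five contribute 2/3 each.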
4.2. Single Shot Multibox Detector (SSD)
We evaluated the performance of our implementations of SSD300 and SSD512, and compared them to the performance reported in (Cheng-Yang Fu and Berg, 2017). We trained these models on the trainval splits of PASCAL VOC 2007 and 2012. The performance is compared against the original implementation using mean average precision in Table 1. Note that we changed the training batch size of SSD512 from 32 to 24 due to the GPU memory limitation. We also show some detection results of SSD512 in Figure 3.
4.3. SegNet
We evaluated the performance of our SegNet implementation, and compared it to the performance reported in the journal version of the original paper (Badrinarayanan et al., 2017). It is trained on the train split of CamVid (Badrinarayanan et al., 2017), and evaluated on the test split. The performance is measured by pixel accuracy, mean pixel accuracy and mean IoU, which are the metrics used in (Badrinarayanan et al., 2017). The score is shown in Table 2 and an example result is shown in Figure 3.
|Implementation|pixel accuracy|mean pixel accuracy|mIoU|
|Original (Badrinarayanan et al., 2017)|82.7|62.3|46.3|
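The three segmentation metrics reported in this section can all be derived from a confusion matrix over pixels. The sketch below uses flat lists of per-pixel class labels and plain Python; the function names are hypothetical and this is not ChainerCV's evaluation code. Skipping classes that never appear in the ground truth when averaging is an assumption of the example.

```python
def confusion_matrix(pred, truth, n_class):
    """Count pixels: rows are ground-truth classes, columns are predictions."""
    conf = [[0] * n_class for _ in range(n_class)]
    for p, t in zip(pred, truth):
        conf[t][p] += 1
    return conf


def segmentation_scores(conf):
    """Pixel accuracy, mean pixel accuracy and mean IoU from a confusion matrix."""
    n_class = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[c][c] for c in range(n_class))
    class_accs, ious = [], []
    for c in range(n_class):
        gt = sum(conf[c])  # pixels whose true class is c
        predicted = sum(conf[r][c] for r in range(n_class))  # pixels predicted as c
        union = gt + predicted - conf[c][c]
        if gt > 0:
            class_accs.append(conf[c][c] / gt)
        if union > 0:
            ious.append(conf[c][c] / union)
    return {
        "pixel_accuracy": correct / total,
        "mean_pixel_accuracy": sum(class_accs) / len(class_accs),
        "miou": sum(ious) / len(ious),
    }


# Four pixels, two classes; one class-0 pixel is mislabeled as class 1.
truth = [0, 0, 1, 1]
pred = [0, 1, 1, 1]
scores = segmentation_scores(confusion_matrix(pred, truth, n_class=2))
```

Pixel accuracy weights every pixel equally, whereas mean pixel accuracy and mean IoU weight every class equally, which is why the latter two are typically lower on datasets dominated by a few large classes.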
In this article we have introduced a new computer vision software library that focuses on deep learning-based methods. Our software lowers the barrier of entry to use deep learning-based computer vision algorithms by providing a convenient and unified interface. It also provides evaluation and visualization tools to aid research and development in the field. Our implementation achieves performance on par with the reported results, and we expect it to be used as a baseline to be extended with new ideas.
We would like to thank Richard Calland for helpful discussion.
- Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015). tensorflow.org
- Alex Krizhevsky and Hinton (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS. 1097–1105.
- Badrinarayanan et al. (2017) Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. 2017. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. PAMI (2017).
- Chen and Gupta (2017) Xinlei Chen and Abhinav Gupta. 2017. An Implementation of Faster RCNN with Study for Region Sampling. arXiv preprint arXiv:1702.02138 (2017).
- Cheng-Yang Fu and Berg (2017) Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C. Berg. 2017. DSSD: Deconvolutional Single Shot Detector. arXiv preprint arXiv:1701.06659 (2017).
- Chollet et al. (2015) François Chollet and others. 2015. Keras. github.com/fchollet/keras. (2015).
- Everingham et al. (2010) M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. 2010. The Pascal Visual Object Classes (VOC) Challenge. IJCV 88, 2 (June 2010), 303–338.
- He et al. (2017) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. arXiv preprint arXiv:1703.06870 (2017).
- Huang et al. (2016) Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, Kevin Murphy, and Google Research. 2016. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012 (2016).
- Itseez (2015) Itseez. 2015. OpenCV. (2015). github.com/itseez/opencv
- Liu et al. (2016) Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-yang Fu, and Alexander C Berg. 2016. SSD: Single Shot MultiBox Detector. arXiv preprint arXiv:1512.02325v2 (2016).
- Oro et al. (2016) D. Oro, C. Fernández, X. Martorell, and J. Hernando. 2016. Work-efficient parallel non-maximum suppression for embedded GPU architectures. In ICASSP.
- Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv preprint arXiv:1506.01497v1 (2015).
- Simonyan and Zisserman (2015) K. Simonyan and A. Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR.
- Tokui et al. (2015) Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. 2015. Chainer: a Next-Generation Open Source Framework for Deep Learning. In Proceedings of Workshop on Machine Learning Systems in NIPS.
- Xu et al. (2017) Danfei Xu, Yuke Zhu, Christopher Choy, and Li Fei-Fei. 2017. Scene Graph Generation by Iterative Message Passing. In CVPR.