Automatic Annotation for Object Manipulation
We propose an invisible marker for accurate automatic annotation in object manipulation.
The invisible marker cannot be seen under visible light, but becomes visible when ultraviolet light is applied in the dark.
By capturing images while alternately switching between visible and invisible light at high speed, massive annotation datasets for objects painted with the invisible marker are created quickly and inexpensively.
We show comparison with manual annotation and demonstrations of semantic segmentation by deep learning for deformable objects such as cloth, liquid, and powder.
An accompanying video is available at the following link:
I Introduction
Accurate object recognition is essential for robots to manipulate objects. For example, recognizing deformable objects such as clothes, liquid, and powder is important for realizing automatic laundry folding and applications in biology and medicine, such as fully automated laboratory experiments. However, these deformable objects are challenging to recognize accurately.
Deep learning has succeeded in computer vision (CV), natural language processing (NLP), and robotics [long2015fully, ronneberger2015u, badrinarayanan2017segnet, young2018recent, hatori2018interactive, takahashi2019deep]. However, deep learning generally requires massive training datasets to achieve good performance. Although datasets are important, creating them consumes enormous amounts of time, human resources, and money. Many objects are still beyond the reach of deep learning recognition because of the difficulty of creating appropriate datasets.
Hence, we focus on automatic dataset creation for object manipulation, especially data annotation for semantic segmentation in deep learning. We use a marker, which we call an invisible marker, that emits light in response to light of wavelengths outside regular white (visible) light and does not change the appearance of objects under visible light (see Fig. 1). We compare the method with manual annotation and show its effectiveness for automatic annotation of deformable objects such as cloth, liquid, and powder.
The rest of this paper is organized as follows. Section II describes related works and our contribution, and Section III explains the proposed method. Section IV outlines our experiment setup and evaluation settings, and Section V presents the results. Section VI discusses the cost of the proposed method. Finally, Section VII describes conclusions and future work.
II Related Works & Contributions
Many researchers have recently been developing annotation methods to reduce the effort of creating datasets. Most of these methods fall into one of two categories: manual annotation and automatic annotation.
II-A Manual Annotation
Popular semantic segmentation datasets such as Pascal VOC, MS-COCO, and COCO-Stuff are manually annotated [everingham2010pascal, lin2014microsoft, caesar2018coco]. Reducing workers' manual effort with annotation tools [russell2008labelme, apolloscape_arXiv_2018, 2018arXiv180504687Y, andriluka2018fluid] and using crowdsourcing [hatori2018interactive, chang2017matterport3d] enable the creation of large-scale datasets. One of the challenges in manual annotation is that manual labeling introduces human error and individual judgment in ambiguous cases.
II-B Automatic Annotation
To exclude human error and to create high-quality datasets easily, research on automatic annotation has been developing. The studies can be roughly classified into three groups.
1) Approaches that focus on features of the object itself, such as color tracking [liensberger2009color], object temperature for hot liquid using thermography [schenck2016detection], and movement detected by background subtraction [brutzer2011evaluation]: These approaches have difficulty annotating the object when its feature is shared with other objects or with the environment. For example, in color tracking and background subtraction, if something other than the target object has a similar color or moves, it is also annotated as the target object.
2) Approaches that focus on given features such as augmented reality (AR) markers [brachmann2014learning, kehl2017ssd]: Even if there are multiple objects, they are distinguished by attaching a different marker to each object. However, giving a feature such as a marker to the object changes the appearance of the object.
3) Simulation approaches [hong2018virtual, aleksi2019affordance]: These methods can artificially create large datasets by varying the background images and by capturing images of the target object from multiple locations. However, the quality of the dataset is usually low for objects that are difficult to simulate, such as deformable objects. The gap between simulation and the real world is also a challenge when transferring what is learned in simulation to the real world.
Our method attaches the marker to objects in the real environment. However, the invisible marker does not change the appearance of the object. Our method applies not only to rigid objects; it is particularly effective for objects that require enormous time and cost to annotate manually, such as deformable objects. We believe this research can be applied to various robotics applications, especially object manipulation.
III Invisible Marker
An invisible marker emits light when illuminated by wavelengths outside the visible range; under regular visible light it emits no light, or light much weaker than the visible illumination. By using a material that is transparent, or white but nearly transparent, under visible light, the appearance of an object does not change when the invisible marker is attached to it. Any material can be used as long as it satisfies the above condition. Depending on the target object, the marker can be applied as a liquid or as a spray. This study uses fluorescent paint since it can be easily obtained. An object painted with fluorescent paint is luminescent under ultraviolet (UV) light, and this property becomes particularly evident in the dark. One limitation of the invisible marker is that datasets are difficult to create in bright places, since the marker is almost invisible there. Data acquisition outdoors is therefore difficult, but the method can be used indoors, where the amount of light can be controlled.
In darkness lit only by UV light, objects without fluorescent paint are not visible. When an image is captured under UV light, only the parts to which fluorescent paint is applied emit visible light (see the second row of Fig. 1). Annotated data are then obtained from the image captured under UV light by applying RGB threshold values for each target object to assign the class labels for semantic segmentation. Therefore, our proposed method does not depend on the complexity of the background, and only objects with fluorescent paint attached are detected. In addition, our method can distinguish individual objects among multiple objects by applying a different color to each object, since fluorescent paints of various colors can be created by mixing red, green, and blue (Fig. 2). The detailed results are discussed in Section V-E.
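The thresholding step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `annotate_from_uv`, the table `CLASS_RANGES`, and all RGB ranges are hypothetical values chosen for the example, and an 8-bit RGB image is assumed.

```python
import numpy as np

# Hypothetical inclusive RGB ranges (lower, upper) per class for an 8-bit UV image.
# Label 0 is reserved for the background (pixels that match no range).
CLASS_RANGES = {
    1: ((0, 120, 0), (100, 255, 100)),  # e.g. an object with green fluorescent paint
    2: ((120, 0, 0), (255, 100, 100)),  # e.g. an object with red fluorescent paint
}


def annotate_from_uv(uv_image, class_ranges=CLASS_RANGES):
    """Return an (H, W) label map from an (H, W, 3) RGB image captured under UV light."""
    uv_image = np.asarray(uv_image)
    labels = np.zeros(uv_image.shape[:2], dtype=np.uint8)
    for class_id, (lower, upper) in class_ranges.items():
        lower = np.array(lower)
        upper = np.array(upper)
        # A pixel belongs to the class only if all three channels fall inside the range.
        mask = np.all((uv_image >= lower) & (uv_image <= upper), axis=-1)
        labels[mask] = class_id
    return labels


# Tiny synthetic UV image: dark everywhere except two glowing pixels.
uv = np.zeros((2, 3, 3), dtype=np.uint8)
uv[0, 0] = (10, 200, 20)   # falls in the hypothetical "green" range
uv[1, 2] = (220, 30, 30)   # falls in the hypothetical "red" range
label_map = annotate_from_uv(uv)
```

Because everything without paint stays dark under UV light, the background never enters any range, which is why the labels do not depend on background complexity in this sketch.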
IV Experiment Setup
The purpose of the experiments is to verify the accuracy of datasets automatically generated with invisible markers, followed by a discussion of cost that compares the proposed method with manual annotation.
We note that the hyper-parameter values provided in this article were tuned by random search.
IV-A Data Acquisition Device
Datasets for semantic segmentation require images under the visible light condition as input data, each paired with output data that labels the class to which each pixel of the image belongs. We develop a system that can create such paired images automatically using the invisible marker (see Fig. 3). Our system is composed of three parts:
Camera part which captures images of a target object
Lighting part which controls the lighting output toward the target object
Control part which controls the timing of capturing images and changing the light condition
At the camera part, we use a camera whose capture timing can be controlled by an external input. When the camera receives an external trigger input from the control part, it captures an image of the target object. The image is then sent to a control computer. In this experiment, we use a RealSense D415 camera [keselman2017intel] wired to receive an external capture trigger from the control computer.
At the lighting part, we use an RGB LED for lighting as the visible light, and a high brightness power UV LED as the invisible light. We also use power LED drivers to control them.
At the control part, a NUCLEO-F303K8 board [nucleo] provides trigger signals to the lighting part to control the emission timing and intensity of the LEDs. The NUCLEO-F303K8 board can be controlled from the computer through USB. At the same time, the control part also outputs a trigger signal for the camera part to control the capture timing (see the timing chart in Fig. 3). To create semantic segmentation datasets for dynamically changing objects, visible light and UV light are alternately switched at high speed.
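On the computer side, the alternately lit frames have to be grouped into (visible, UV) pairs. The following is a minimal sketch under the assumption that the lighting strictly alternates on every trigger and that no frame is dropped; the function `pair_frames` and its `first_is_visible` flag are hypothetical names, and the strings stand in for image arrays.

```python
def pair_frames(frames, first_is_visible=True):
    """Split an interleaved capture sequence into (visible, uv) pairs.

    Assumes strict alternation between visible and UV lighting. A trailing
    frame without a partner is discarded.
    """
    start = 0 if first_is_visible else 1
    pairs = []
    for i in range(start, len(frames) - 1, 2):
        pairs.append((frames[i], frames[i + 1]))
    return pairs


# Placeholder frame identifiers: v = visible light, u = UV light.
frames = ["v0", "u0", "v1", "u1", "v2"]
pairs = pair_frames(frames)
shifted = pair_frames(["u", "v0", "u0", "v1"], first_is_visible=False)
```

Each pair combines an input image with the UV image captured one trigger later, so a fast-moving object can shift slightly between the two frames of a pair.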
IV-B Target objects
Accurate recognition of deformable objects such as cloth, liquid, and powder is important for folding clothes, handling medicine, and cooking, both in industry and in daily life. We prepare a handkerchief as the cloth, water as the liquid, and baking soda as the powder. These are mixed with fluorescent paint.
IV-C Data Collection
For data collection, images are captured by the data acquisition system described in Section IV-A during motions such as folding the cloth, waving the liquid in a plastic bottle, and stirring the powder with a spoon. The sampling rate for capturing the input image in visible light and the annotation image in UV light is 30 Hz. For each object, one motion is performed against each background (Fig. 4), for a total of six motions per object. The total numbers of acquired images for cloth, liquid, and powder are 3920, 3081, and 4199, respectively. For each object, the data from five of the six motions are used for training and the data from the remaining motion are used for evaluation, over all such combinations, that is, 6-fold cross-validation. This means evaluation is always performed on an untrained background. Acquired images are resized from to . Then, all values are normalized to be between 0 and 1.
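The split and normalization described above can be sketched as follows. The helper names `leave_one_motion_out` and `normalize` are hypothetical, motions are identified by integer indices for illustration, and 8-bit input images are an assumption not stated in the text.

```python
import numpy as np


def leave_one_motion_out(num_motions=6):
    """Return (train_motion_ids, held_out_motion_id) for each fold.

    Each motion was recorded against its own background, so the held-out
    motion also corresponds to a background unseen during training.
    """
    folds = []
    for held_out in range(num_motions):
        train = [m for m in range(num_motions) if m != held_out]
        folds.append((train, held_out))
    return folds


def normalize(image_uint8):
    """Scale an 8-bit image to floating-point values in [0, 1]."""
    return np.asarray(image_uint8, dtype=np.float32) / 255.0


folds = leave_one_motion_out()
example = normalize(np.array([[0, 128, 255]], dtype=np.uint8))
```

Splitting by motion rather than by individual frame keeps near-duplicate consecutive frames from appearing in both the training and the evaluation data.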
IV-D Deep Learning & Training
To show the effectiveness of the datasets created using the invisible marker, we verify them with typical deep learning models for semantic segmentation: FCN [long2015fully], U-Net [ronneberger2015u], and SegNet [badrinarayanan2017segnet]. These models are trained with datasets composed of images in visible light as input and the data annotated by the proposed method as output. Chainer, a deep learning library, is used for the implementation [tokui2015chainer]. All network experiments were performed on a machine equipped with 256 GB RAM, an Intel Xeon E5-2667v4 CPU, and eight Tesla P100-PCIE GPUs with 12 GB each, resulting in about 10 to 30 minutes of training time for each material and deep learning model.
|Table I||Cloth||Liquid||Powder|
|IoU among manual||94.0%||93.6%||84.6%|
|manual and invisible marker||89.8%||77.6%||84.0%|
V Experiment Results
V-A Annotation result by invisible marker
We first show examples of datasets created by the data acquisition device (Fig. 1). In Fig. 1, we can observe that the fine unevenness of the edge of the powder is annotated accurately. In addition, only the target object is annotated even though the background includes colors similar to the objects, as well as other items such as a human hand and a spoon. Conventional methods, such as those based on color or background subtraction, have difficulty annotating these situations. A color-based method cannot annotate correctly if the background color is similar to that of the target object. With background subtraction, not only the target object but also the human hand, spoon, and plastic bottle are annotated. Besides, these methods have difficulty handling multiple objects and instance segmentation, as described in Section V-E.
V-B Comparison with the Proposed Method and Manual Method
Next, we compare manual annotation with annotation using the invisible marker. A strict comparison would require creating a manually annotated dataset and training on it. However, even though our dataset contains only 11200 images, this experiment is impractical because the manual annotation would cost about $ 12500. Instead, we selected three images for each object and each background, 54 images in total, and assigned three people per image for annotation using Amazon Mechanical Turk (AMT). We then performed two comparisons using the intersection over union (IoU) as the evaluation metric, as follows:
Applying IoU among the manual annotations only, to measure individual differences, and
applying IoU to the proposed method and manual annotation, to measure how close the proposed method is to manual annotation.
IoU is simply the ratio of the area of the intersection of two regions to the area of their union. Table I shows the mean value and standard deviation for these comparisons.
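For concreteness, IoU on binary masks can be computed as in the sketch below. The function names `iou` and `pairwise_iou` are hypothetical. Averaging over all annotator pairs is only one plausible way to measure individual differences; the text does not specify how the three manual annotations per image were aggregated.

```python
from itertools import combinations

import numpy as np


def iou(mask_a, mask_b):
    """Intersection over union of two binary masks of the same shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Convention chosen here: two empty masks agree perfectly.
        return 1.0
    return float(np.logical_and(a, b).sum() / union)


def pairwise_iou(masks):
    """Mean and standard deviation of IoU over all pairs of masks."""
    scores = [iou(a, b) for a, b in combinations(masks, 2)]
    return float(np.mean(scores)), float(np.std(scores))


a = np.array([[1, 1, 0, 0]])
b = np.array([[0, 1, 1, 0]])
score = iou(a, b)                    # intersection 1 pixel, union 3 pixels
mean_score, std_score = pairwise_iou([a, a, b])
```

In the example, the two masks share one pixel out of three covered in total, so the IoU is 1/3.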
V-B1 Individual Differences in Manual
In Table I, the IoU among manual annotations shows that individual differences depend on the clarity of the boundaries between the object and its surroundings. Individual differences are small for cloth and liquid because their boundaries are clear. On the other hand, individual differences are large when the boundary is unclear, as with the small amount of powder scattered around the boundary. Fig. 5 shows images of powder annotated manually and by the proposed method. Manual annotation No. 1 ignores the fine powder area, while manual annotation No. 2 covers the entire area of the fine powder. Manual annotation No. 3 is intermediate between No. 1 and No. 2. When humans annotate ambiguous scenes such as this powder, individual judgment creates variance in the datasets. Our proposed method can control the annotation by adjusting the amount of fluorescent paint, which determines the light intensity, and the threshold value applied to images captured in UV light. In this experiment, the amount of fluorescent paint and the threshold value are adjusted to ignore fine powder, like manual annotation No. 1, since the ignored area is too fine for robots to manipulate.
|Table II||Stationary liquid|
|IoU among manual||94.6%|
|manual and invisible marker||93.6%|
V-B2 Gaps between the Proposed Method and Manual Method
In Table I, the IoU between manual annotation and the invisible marker shows that the difference is larger for liquid than for cloth and powder. The first and second rows of Fig. 6 show examples of the largest gaps between the image in visible light and the annotated image, which are caused by fast movement of the objects. The liquid surface is shifted, and the position of the human hand on the cloth is shifted. We think the gap was larger for liquid than for cloth because the liquid moved more, and faster, than the cloth. This challenge can be alleviated by hardware and software approaches. In the hardware approach, a camera with a higher sampling rate reduces the gap. In the software approach, training on a large-scale dataset lets the generalization performance of deep learning absorb the gaps in the dataset. In this research, we focus on the software approach because of the limited camera sampling rate. The experimental results are described in Section V-C.
To investigate whether the gap in capture timing is the main factor, we evaluate IoU in the same way as in Table I, using stationary liquid on a desk instead of liquid in motion. We selected four images of stationary liquid and assigned three people per image for annotation using AMT. The result is shown in Table II. The difference between the proposed method and the manual method is almost within the range of individual human differences. According to these results, if the gap in capture timing can be alleviated by the software approach above, the method can be applied even to target objects that move fast.
V-C Results of Semantic Segmentation
The purposes of training networks with the invisible marker datasets are to confirm whether inference works for fine annotation, and to evaluate the generalization ability that absorbs the gap between images captured under visible light and under UV light caused by fast movement of the objects.
Table III shows the IoU of the inferred semantic segmentation on validation data with untrained backgrounds. With U-Net, the accuracy on validation data is over 80% for cloth, liquid, and powder. Fig. 7 shows the inferred results for cloth, liquid, and powder using data with untrained backgrounds. As can be seen from the accuracy of around 80% in Table III and from Fig. 7, deep learning can infer fine annotation from our datasets.
As described in Section V-B2, fast movement of the object creates a gap between the images captured in visible light and in UV light (see the first and second rows of Fig. 6). However, the inferred results in the third row of Fig. 6 show that the generalization performance of deep learning absorbs these gaps and that accurate semantic segmentation images are inferred. These images are inferred using untrained backgrounds. We conclude that our proposed method can automatically create annotation datasets that are accurate enough for deep learning for semantic segmentation of deformable materials.
V-D Inference for Non-mixed Fluorescent Object
Fig. 8 shows the inference result for an object that is not mixed with fluorescent paint, using networks trained on the datasets of objects with fluorescent paint. The target object is the liquid, whose appearance would be expected to change the most. The inferred results show that the semantic segmentation is accurate. We can say that the change in appearance due to the fluorescent paint is small enough to fall within the generalization ability of deep learning.
V-E Extension of proposed method
In the proposed method, the color of the invisible marker can be changed for each object. Even if there are multiple objects, a color can be assigned to each one. Therefore, even when objects overlap, or when there are several identical or different objects, each object can be annotated separately. In addition, it is also possible to create annotation datasets for the grasping part of an object by applying the invisible marker to only a part of it instead of the whole object. Moreover, by giving a different color to each grasping point of the same object, robots can choose the appropriate grasping point according to the purpose.
Fig. 9 (a) shows the results of segmentation for multiple objects, overlapping objects, and instance segmentation of two white bottles and a brown bottle. Fig. 9 (b) shows the results of annotating two grasping points with different labels. These annotations are challenging for conventional methods based on color and background subtraction. With color-based methods, other objects and backgrounds of the same color as the object are annotated as the same class. With background subtraction, multiple objects that move at the same time are treated as the same class.
VI Discussion of cost for manual annotation and proposed method
In this section, we discuss the cost in time and money of manual annotation and of the proposed method. Table IV shows the cost of the manual method and the proposed method. In manual annotation, the manual annotation step incurs the cost. In the proposed method, coloring the objects incurs the cost. Coloring with fluorescent paint took only a few minutes per object. The time per image is about 0 minutes, since the painting time is negligible when many images are taken per object. The cost of the fluorescent paint was $ 18 for all objects. For the acquisition system, the proposed method requires UV light as an additional device. Once the data acquisition system is built, it can be used again. In manual annotation, the total cost increases as the number of images increases, and the cost per image does not depend on the dataset size. In the proposed method, however, the cost in time and money per image decreases because objects can be reused once they are painted with the invisible marker. Since deep learning requires thousands to tens of thousands of images, it is clear that the proposed method can create datasets quickly and inexpensively.
|Table IV||Manual annotation||Proposed method|
|Time in total for objects and images||30 [min] for recording images||30 [min] for recording images|
| ||31250 [min] for annotation||0 [min] for annotation|
| ||0 [min] for painting||Less than 5 [min] for painting|
|Time per image||About 2.5 [min] / image||About 0 [min] / image|
|Money in total for objects and images||$ 200 for camera||$ 200 for camera|
| ||$ 75 for LED||$ 75 for LED|
| ||$ 12500 for annotation||$ 18 for fluorescent paint|
| || ||$ 25 for UV light|
|Money per image||$ 1.12 / image||$ 0.0014 / image|
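The scaling argument of this section, a constant per-image cost for manual annotation against a fixed cost that is amortized over the dataset for the proposed method, can be illustrated with a small sketch. The function `cost_per_image` is a hypothetical helper. The inputs reuse the $ 1.12 per image and $ 18 paint figures quoted in this section, while the dataset sizes are arbitrary, so the outputs are not meant to reproduce the exact per-image values reported by the authors, which depend on which items and how many images they counted.

```python
def cost_per_image(fixed_cost, variable_cost_per_image, num_images):
    """Per-image cost when a fixed cost is spread over num_images images."""
    return fixed_cost / num_images + variable_cost_per_image


# Manual annotation: no amortized fixed cost, constant cost per image.
manual_small = cost_per_image(0.0, 1.12, 1_000)
manual_large = cost_per_image(0.0, 1.12, 10_000)

# Proposed method: one-time paint cost, no per-image annotation cost.
proposed_small = cost_per_image(18.0, 0.0, 1_000)
proposed_large = cost_per_image(18.0, 0.0, 10_000)
```

Under these assumptions the manual cost per image stays the same at any dataset size, whereas the per-image cost of the proposed method shrinks in proportion to the number of images captured.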
VII Conclusion
In this paper, we proposed a method to create annotations automatically using an invisible marker, which is visible under UV light but invisible under visible light. By switching between visible light and invisible light at high speed, our system can acquire large datasets of dynamically changing deformable objects such as cloth, liquid, and powder. The annotation gaps caused by the shift in capture timing can be absorbed sufficiently by the generalization performance of deep learning. High segmentation accuracy is shown with multiple deep learning models: FCN, SegNet, and U-Net. Manual annotation costs $ 1.12 and 2.5 minutes per image, whereas the proposed method costs $ 0.0014 per image, and the work time other than capturing is negligible. We conclude that our proposed method can create accurately annotated datasets quickly and inexpensively.
These datasets can widen the range of robotic tasks such as folding clothes, cooking, and biomedical applications. For future work, we would like to develop manipulation tasks with a robot, such as cooking and laundry folding.