# Integrating Objects into Monocular SLAM: Line Based Category Specific Models

###### Abstract.

We propose a novel line based parameterization for category specific CAD models. The proposed parameterization associates the 3D category-specific CAD model with the object under consideration using a dictionary based RANSAC method that uses object viewpoints as a prior, together with edges detected in the corresponding intensity image of the scene. The association problem is posed as a classical geometry problem rather than a dataset driven one, which saves the time and labour otherwise invested in annotating datasets to train keypoint networks (newell2016stacked, ; parkhiya2018constructing, ) for each object category. Besides eliminating dataset preparation, the approach also speeds up the overall process: the image is processed only once for all objects, so no network has to be invoked for every object in every image. A 3D-2D edge association module followed by a resection algorithm for lines is used to recover object poses. The formulation optimizes for both the shape and the pose of the object, which helps recover the 3D structure of the object more accurately. Finally, a factor graph formulation combines object poses with camera odometry to formulate a SLAM problem.

Published in: 11th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP 2018), December 18–22, 2018, Hyderabad, India. Article 81. DOI: 10.1145/3293353.3293434. ISBN: 978-1-4503-6615-1/18/12.

CCS concepts: Computing methodologies → Vision for robotics; Computing methodologies → Shape inference.

## 1. Introduction

Simultaneous Localization and Mapping (SLAM) is the most vital cog in various mobile robotic applications involving ground robots (lategahn2011visual, ), aerial vehicles (ccelik2013monocular, ) and underwater vehicles (cho2017visibility, ). Monocular SLAM has emerged as a popular choice given its light weight and easy portability, especially in payload-restricted systems such as micro aerial vehicles (MAVs) and hand held camera platforms.

SLAM has evolved in various flavors such as active SLAM (leung2006active, ), wherein planning is interleaved with SLAM, dynamic SLAM (kundu2011realtime, ), which reconstructs moving objects, and robust SLAM (agarwal2015others, ). Object SLAM (SLAM++, ) is a relatively new paradigm wherein SLAM information is augmented with objects, in the form of their poses, to achieve more semantically meaningful maps with the eventual objective of improving the accuracy of SLAM systems.

Object SLAM presents itself in two popular threads. In the first, instance specific models are assumed to be known a priori (Choudhary, ). In the second, a general model for an object is used, such as ellipsoids and cuboids (yang2018cubeslam, ) and (SLAM++, ). Relying on instance level models makes the first thread difficult to scale to the variety of objects in a scene, whereas general models such as cuboids do not provide meaningful information at the level of object parts, which limits their relevance in applications that require grasping and handling objects.

To overcome such limitations, (parkhiya2018constructing, ) positioned their research as one that combines the benefits of both. In particular, category specific models were developed in lieu of instance level models, retaining the semantic potential of the former along with the generic nature of the latter at the level of the object category. However, the reliance of (parkhiya2018constructing, ) on a keypoint network trained for a particular category limits its expressive power, as every new object category entails estimating a new network model for that category, along with the concomitant issues of annotation, GPU requirements and dataset preparation. More specifically, in a scene that contains three object categories, (parkhiya2018constructing, ) has to invoke three separate network models, one per category, to solve for the pose and shape of the objects of the respective category.

Motivated by the fact that many objects can be represented as line structures, this paper presents a novel line parameterization of objects for an object category. By associating the 3D lines that characterize the object category with their observations in the image, in the form of 2D line segments, we solve for the object pose and shape in a decoupled formulation.

Significantly, this approach bypasses the need for keypoint annotation as we extend our pipeline to new categories, as well as the requirement of estimating and maintaining an assortment of network models for various categories of objects. It achieves this by relying on line segment detectors to observe object line segments in the image rather than on network models trained for semantic keypoints.

The paper shows the scalability of the line parameterized objects to three categories (chair, table and laptop) and successfully integrates the shape and pose optimized objects with a factor graph based backend pose-graph optimization. Thereby, it successfully embeds 3D objects into the scene while simultaneously estimating the camera trajectory. High fidelity estimation of the camera trajectory and object poses vindicates the efficacy as well as the novelty of the proposed framework.

Fig. 1 shows a typical Object SLAM run, with the object poses rendered in 3D as the closest CAD model corresponding to the optimized wireframe meshes shown in the inset image. Sample camera locations from the trajectory are shown as pink circles, with the camera trajectory itself shown as black dotted lines.

## 2. Related Work

Most state-of-the-art SLAM systems (ORB, ; LSD, ; SVO, ; Edge, ) and reconstruction methods using IMUs (sm1, ; sm2, ) rely on pose-graph or factor-graph optimization (g2o, ; GTSAM, ) or on bundle adjustment. In this section we review related work on object-SLAM, discuss some of its limitations, and discuss the keypoint based approach that motivated the proposed approach. Some approaches fuse classical geometry with deep learning models to improve object pose and shape. The latest in this line is (zhu2018object, ), which recovers both the global camera pose and a 3D point cloud based shape from very few, limited-view observations.

### 2.1. Object-SLAM

Recent developments and the subsequent stabilization of SLAM systems have led the community to incorporate objects into the SLAM framework and to solve for object poses and shapes along with the robot poses in a unified framework. Some of the recent approaches to object-oriented SLAM are (SLAM++, ; Paull, ; RAS2016, ).

The majority of object-based SLAM approaches rely on depth information from RGB-D or stereo sensors. In (crocco2016structure, ; Choudhary, ) instance level models are assumed, which is known as a shape prior. In (Choudhary, ), a framework for multi-robot object-SLAM is proposed, but again with a shape prior and RGB-D sensors. In another paradigm, no instance-level models are available a priori. In (Paull, ), again with the help of RGB-D cameras, the association and the object poses are solved jointly in a factor graph framework. Among monocular object-SLAM/SfM approaches, (crocco2016structure, ; RAS2016, ) fall under this paradigm. In such approaches, objects are modeled as bounding boxes (RAS2016, ; Sunderhauf2015, ) or as ellipsoids (crocco2016structure, ).

Our proposed approach hence falls under a third paradigm, where we assume line based category-models, and not instance-level models.

### 2.2. Object-Category Models

Over the last few years researchers have gradually started to re-introduce more and more geometric structure into object class models to improve their performance (felzenszwalb2010object, ). Object-category model based approaches are employed to solve various problems in monocular vision; in fact, (murthy2017reconstructing, ) and (tulsiani2017learning, ) employed category-level models to reconstruct objects from a single image. (yang2018cubeslam, ) propose a method for 3D cuboid object detection and multi-view object-SLAM without prior object models, together with an efficient and accurate 3D cuboid fitting approach on a single image that needs no prior knowledge of the object model or orientation.

Approaches based on category-level models advocate incorporating category specific shape priors of an object to compensate for the information lost when working with monocular images. We employ these models (Fig. 2) to incorporate object observation factors into monocular SLAM by representing all instances of a category by the same model.

### 2.3. Object Detection and View Point Estimation

Convolutional Neural Networks (CNNs) have been the driving factor behind the recent advances in object detection (redmon2016you, ; ren2015faster, ; liu2016ssd, ). These CNNs are not only highly accurate but also very fast. In fact, when run on a GPU, they process each image frame with a latency of 100-300 milliseconds. Estimating good bounding boxes for objects belonging to a specific category is the starting point of our architecture.

One such CNN based model is Render for CNN (su2015render, ), which our proposed solution uses to estimate the viewpoint of an object in an image. Render for CNN has been trained on large, category specific datasets for several objects, rendered using easily accessible 3D CAD models (chang2015shapenet, ). Models trained for object viewpoint prediction on rendered datasets work very well when they are fine-tuned on a large dataset of real images (everingham2010pascal, ).

## 3. Methodology

In this section we explain the end to end functioning of our Line based pipeline, giving detailed insight into each of the constituent stages.

### 3.1. Pipeline Overview

The Render for CNN pipeline (su2015render, ) is trained for category specific viewpoint estimation of an object. Given an image, the YOLO detector (redmon2016you, ) regresses bounding boxes on objects of interest, the LSD detector (von2012lsd, ) outputs the line segments within the YOLO bounding boxes, and the Render for CNN model outputs the viewpoint prior. The data-association module associates lines of the mean wireframe model in 3D with the LSD line segments observed within the bounding boxes. Following data association, a pose-shape optimization module using Ceres Solver (agarwal2015others, ) outputs the pose and shape of these objects. In an Object SLAM run, the pose-shape optimization outputs constitute the camera pose-landmark constraints, whereas the camera motion is estimated using a state-of-the-art SLAM module (mur2015orb, ). These constraints are finally optimized with GTSAM (dellaert2012gtsam, ) as the backend engine to output the camera trajectory along with the objects embedded in the scene. This pipeline is shown in Fig. 3.

### 3.2. Line based Category-Level Model

In our approach, we lay an emphasis on the use of category-level models as opposed to instance-level models for objects. To construct a line based category level model, each object is first characterized as a set of 3D lines that are common across all instances of the category. For example, such lines for the chair category could be legs of the chair, edges of the chair backrest, for laptops they can be the edges around the display screen and those contouring the keyboard, constituting the base and so on.

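Any line based model is represented by a vector of dimension $6N$, where $N$ is the number of lines present in the parameterized model, each corresponding to a key edge of a model representing the object. Each of these lines is represented by a 3D direction $\mathbf{d}$ and a 3D point $\mathbf{P}$ that lies on the line.

$$\mathbf{d}_i = [d_x,\ d_y,\ d_z]^\top \tag{1}$$

$$\mathbf{P}_i = [X,\ Y,\ Z]^\top \tag{2}$$

$$L_i = [\mathbf{d}_i^\top,\ \mathbf{P}_i^\top]^\top \tag{3}$$

$$S = [L_1^\top,\ L_2^\top,\ \dots,\ L_N^\top]^\top \in \mathbb{R}^{6N} \tag{4}$$

While the 3D point can be any point lying on the line, it is chosen to be roughly the midpoint of the corresponding edge of the 3D CAD model (e.g. the midpoint of the leg of a chair).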

If no prior information about the object is known, the search space is a prohibitive $6N$ dimensional space representing the shape of the object. Based on the 3D annotation of CAD models, however, the search space can be reduced so that shape optimization only considers deformations that are possible for that object, rather than arbitrary line deformations. A simple principal component analysis (jolliffe2011principal, ) is performed on the annotated CAD model dataset to get the top seven linearly independent principal directions of deformation. These eigenvectors are sorted by their eigenvalues. The number seven is chosen based on the coverage of the eigenvectors.

While solving for shape, an object is represented by the mean shape plus a weighted linear combination of the deformation directions. In such a shape representation, each chair can be represented by the weights (or shape parameters, $\lambda$) for each principal deformation direction. This linear subspace model has a much lower dimension than $6N$. This is easy to see, since there are various planar conditions and symmetries present in the objects.

Mathematically, if $\bar{S}$ is the mean shape of the category, and the $V_b$ are a deformation basis obtained from PCA over a collection of aligned, ordered 3D CAD models as explained in this section, any object obtained with shape parameters $\lambda_b$ can be represented as

$$S = \bar{S} + \sum_{b=1}^{B} \lambda_b V_b \tag{5}$$

where $B$ is the number of basis vectors (the top-$B$ eigenvectors after PCA) and $\Lambda$ is the vector consisting of all $\lambda_b$.
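As an illustration of the linear shape model, the following minimal Python sketch builds a shape vector from a mean shape, a deformation basis and shape parameters. The function names and the per-line layout (direction first, then point) are assumptions made for this example and are not taken from the authors' implementation.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def synthesize_shape(mean_shape: Sequence[float],
                     basis: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Return mean_shape + sum_b weights[b] * basis[b] as a flat vector."""
    if len(basis) != len(weights):
        raise ValueError("need exactly one weight per basis vector")
    shape = [float(v) for v in mean_shape]
    for direction, weight in zip(basis, weights):
        if len(direction) != len(shape):
            raise ValueError("basis vector and mean shape differ in length")
        for k, value in enumerate(direction):
            shape[k] += weight * value
    return shape


def split_into_lines(shape: Sequence[float]) -> List[Tuple[Vec3, Vec3]]:
    """Split a flat 6N vector into N (direction, point) pairs.

    Assumed layout per line: [dx, dy, dz, X, Y, Z].
    """
    if len(shape) % 6 != 0:
        raise ValueError("shape length must be a multiple of 6")
    lines = []
    for start in range(0, len(shape), 6):
        direction = tuple(shape[start:start + 3])
        point = tuple(shape[start + 3:start + 6])
        lines.append((direction, point))
    return lines
```

With all weights set to zero the function returns the mean shape, so the shape parameters can be read as offsets from the category mean along each deformation direction.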

### 3.3. Edge Correspondence

Object invariant line detection is easier than finding salient keypoints with non machine learning methods. We use the LSD edge detector (von2012lsd, ) for this. The main problem lies in associating the correct 2D lines, out of all the lines detected, with the respective 3D lines. Finding the association is a chicken and egg problem: we need a good pose estimate to find the correspondence between the 3D CAD model and the image, and we need a good association to estimate the pose of the object. We get an approximate viewpoint of the object using the RenderForCNN viewpoint network (su2015render, ) and introduce a method to compute an approximate translation of the object. We employ this viewpoint and translation as the initialization for a dictionary based RANSAC method to get the most suitable edge correspondences.

The parameterization discussed in section 3.2 allows CAD models to be represented as a set of vectors, where each vector represents a line. Formally, we find a correspondence map from 3D lines to 2D line segments. First, the line segments in the image are filtered using the bounding box provided by the object detector (redmon2016you, ). We then use a custom cost function to score a 3D-2D correspondence:

$$C = C_\theta + C_\perp + C_r \tag{6}$$

where $C_\theta$ accounts for the angle, and $C_\perp$ and $C_r$ account for the distance between the line and the line segment. In the following subsections, we discuss the method to compute the translation and the aforementioned costs.
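The bounding box filtering step can be sketched as follows. This is a minimal Python example that keeps a segment only when both of its end points fall inside the (optionally enlarged) box; the acceptance rule, the `margin` parameter and the function name are assumptions for illustration, since the text does not specify how partially overlapping segments are treated.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]
Box = Tuple[float, float, float, float]  # (x, y, w, h), (x, y) is the top left corner


def segments_in_box(segments: Sequence[Segment], box: Box,
                    margin: float = 0.0) -> List[Segment]:
    """Keep the segments whose two end points lie inside the enlarged box."""
    x, y, w, h = box

    def inside(pt: Point) -> bool:
        return (x - margin <= pt[0] <= x + w + margin
                and y - margin <= pt[1] <= y + h + margin)

    return [seg for seg in segments if inside(seg[0]) and inside(seg[1])]
```

A small positive margin tolerates detections whose boxes are slightly tighter than the object outline.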

#### 3.3.1. Computing Translation

Apart from a viewpoint initialization, an approximate value of the translation is also needed for projection. Getting the exact translation requires the 3D length and the projected 2D length of a line segment, but since the exact 3D information of the object is not known, we rely on the approximate 3D model of that particular object category.

We use the information available from the bounding box and the mean 3D model to approximate the translation. The height and the width of the bounding box are each independently sufficient to get a good estimate of $T_z$, provided that the object's height or width, respectively, matches that of the mean model. To get a better estimate in the general case, where both the height and the width of the object can deviate from the mean model, we simply take the average of both estimates.

$$T_z = \frac{1}{2}\left(\frac{f_y\,k_h}{h} + \frac{f_x\,k_w}{w}\right) \tag{7}$$

$$T_x = \frac{\left(x + \frac{w}{2} - c_x\right) T_z}{f_x} \tag{8}$$

$$T_y = \frac{\left(y + \frac{h}{2} - c_y\right) T_z}{f_y} \tag{9}$$

Here, $f_x$, $f_y$, $c_x$ and $c_y$ are taken from the camera matrix, $h$ and $w$ are the height and width of the bounding box, and $x$ and $y$ are the coordinates of its top left corner. $k_h$ and $k_w$ are constants obtained from the mean 3D model.
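A minimal Python sketch of this initialization is given below. It assumes a pinhole camera, interprets the two model constants as the metric height and width of the mean 3D model, and back-projects the centre of the bounding box at the averaged depth; these interpretations and all names are assumptions for illustration rather than the authors' exact formulas.

```python
from typing import Tuple


def approximate_translation(box: Tuple[float, float, float, float],
                            fx: float, fy: float, cx: float, cy: float,
                            mean_height: float,
                            mean_width: float) -> Tuple[float, float, float]:
    """Estimate (Tx, Ty, Tz) of an object from its bounding box.

    box is (x, y, w, h) in pixels with (x, y) the top left corner.
    mean_height and mean_width are assumed to be metric sizes of the mean model.
    """
    x, y, w, h = box
    if w <= 0 or h <= 0:
        raise ValueError("bounding box must have positive width and height")
    tz_from_height = fy * mean_height / h   # similar triangles on the height
    tz_from_width = fx * mean_width / w     # similar triangles on the width
    tz = 0.5 * (tz_from_height + tz_from_width)
    u = x + w / 2.0                          # bounding box centre, pixels
    v = y + h / 2.0
    tx = (u - cx) * tz / fx                  # back-project the centre at depth tz
    ty = (v - cy) * tz / fy
    return tx, ty, tz
```

Averaging the two depth estimates means that an object that is taller but narrower than the mean model still receives a depth close to the true one.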

#### 3.3.2. Computing the Association Costs

The projection of the 3D edge onto the image plane can be found by projecting any two points of the 3D line and then taking their direction and midpoint (see Fig. 5). $R$ and $T$ are the rotation and translation applied to the 3D line.

$$p_1 = \pi\big(R\,\mathbf{P} + T\big), \qquad p_2 = \pi\big(R\,(\mathbf{P} + k\,\mathbf{d}) + T\big)$$

Here, $k$ is some non-zero number used to obtain two points on the line from one point and the direction, and $\pi$ is the projection function.
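The projection of a parameterized 3D line can be sketched in a few lines of Python. The example assumes a pinhole projection with intrinsics `(fx, fy, cx, cy)` and takes the two points on the line as the stored point and that point shifted by `k` times the direction; the helper names are hypothetical.

```python
from typing import Sequence, Tuple

Point2 = Tuple[float, float]


def mat_vec(R: Sequence[Sequence[float]], v: Sequence[float]) -> Tuple[float, ...]:
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))


def project_point(X: Sequence[float], fx: float, fy: float,
                  cx: float, cy: float) -> Point2:
    """Pinhole projection of a point given in the camera frame."""
    x, y, z = X
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def project_line(P: Sequence[float], d: Sequence[float],
                 R: Sequence[Sequence[float]], T: Sequence[float],
                 intrinsics: Tuple[float, float, float, float],
                 k: float = 1.0) -> Tuple[Point2, Point2]:
    """Project two points of the 3D line (P and P + k*d) into the image."""
    if k == 0:
        raise ValueError("k must be non-zero to obtain two distinct points")
    fx, fy, cx, cy = intrinsics
    second = [P[i] + k * d[i] for i in range(3)]
    cam_first = [r + t for r, t in zip(mat_vec(R, P), T)]
    cam_second = [r + t for r, t in zip(mat_vec(R, second), T)]
    return (project_point(cam_first, fx, fy, cx, cy),
            project_point(cam_second, fx, fy, cx, cy))
```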

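In Fig. 5, $a$ and $b$ are the end points of an edge detected by LSD (for a line segment to be categorized as an associated line, it has to be very close to the projected 3D line; the figure is exaggerated for illustration), and $p_1$ and $p_2$ are the projections of two points of the 3D line. $d_1$ and $d_2$ are the signed perpendicular distances of $a$ and $b$ from the line through $p_1$ and $p_2$. Using simple projective geometry, with $\tilde{x}$ the homogeneous coordinates of an image point $x$ and $\ell = \tilde{p}_1 \times \tilde{p}_2 = (\ell_1, \ell_2, \ell_3)^\top$ the projected line, we get,

$$d_1 = \frac{\ell^\top \tilde{a}}{\sqrt{\ell_1^2 + \ell_2^2}} \tag{10}$$

$$d_2 = \frac{\ell^\top \tilde{b}}{\sqrt{\ell_1^2 + \ell_2^2}} \tag{11}$$

Adding the angle directly to the cost function would mix distances with angles, so instead we observe that the value $|d_1 - d_2|$ captures the variation of the angle between the two lines. This is used as

$$C_\theta = |d_1 - d_2| \tag{12}$$

$\tfrac{1}{2}|d_1 + d_2|$ captures the perpendicular distance of the midpoint of $ab$ from the projected line $p_1 p_2$. This is used as

$$C_\perp = \tfrac{1}{2}\,|d_1 + d_2| \tag{13}$$

Lastly, the distance between the midpoints of $ab$ and $p_1 p_2$ is minimized to pick the lines radially closer to the projected line. This is used as

$$C_r = \left\| \tfrac{1}{2}(a + b) - \tfrac{1}{2}(p_1 + p_2) \right\|_2 \tag{14}$$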
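The three cost terms can be illustrated with a short Python sketch. It assumes signed perpendicular distances computed from the homogeneous line through the two projected points, an angle term equal to the absolute difference of the two distances, a perpendicular term equal to the distance of the segment midpoint from the line, and a radial term equal to the distance between the two midpoints, summed without weights. These forms and the function names are assumptions for illustration.

```python
import math
from typing import Tuple

Point2 = Tuple[float, float]


def line_through(p1: Point2, p2: Point2) -> Tuple[float, float, float]:
    """Homogeneous line through two image points (cross product of [x, y, 1])."""
    return (p1[1] - p2[1], p2[0] - p1[0], p1[0] * p2[1] - p2[0] * p1[1])


def signed_distance(line: Tuple[float, float, float], pt: Point2) -> float:
    """Signed perpendicular distance of pt from the line."""
    l1, l2, l3 = line
    norm = math.hypot(l1, l2)
    if norm == 0:
        raise ValueError("the two projected points coincide")
    return (l1 * pt[0] + l2 * pt[1] + l3) / norm


def association_cost(a: Point2, b: Point2, p1: Point2,
                     p2: Point2) -> Tuple[float, float, float, float]:
    """Return (angle term, perpendicular term, radial term, total cost)."""
    line = line_through(p1, p2)
    d1 = signed_distance(line, a)
    d2 = signed_distance(line, b)
    c_angle = abs(d1 - d2)           # zero when the segment is parallel to the line
    c_perp = abs(d1 + d2) / 2.0      # distance of the segment midpoint from the line
    mid_ab = ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)
    mid_p = ((p1[0] + p2[0]) / 2.0, (p1[1] + p2[1]) / 2.0)
    c_radial = math.dist(mid_ab, mid_p)
    return c_angle, c_perp, c_radial, c_angle + c_perp + c_radial
```

Expressing the angle term as a difference of distances keeps all three terms in pixel units, which is the motivation given in the text for not adding the angle directly.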

#### 3.3.3. Association Pseudocode

We generate a dictionary $D$ of the 3-5 most representative CAD models (selected manually) for each object category. We also sample viewpoints around the azimuth initialization and translations around the computed translation $T$. We call the sampled sets $\mathcal{V}$ and $\mathcal{T}$, respectively.

The dictionary based association algorithm iterates over the dictionary models and the sampled viewpoints and translations, projects the model lines, and computes the associated lines and the cost of association. The association for a model line under a given viewpoint and translation is the image line segment that has the minimum cost for that model line.

Finally, the algorithm picks the association with the lowest association cost.
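The association loop described above can be sketched as follows. This Python example enumerates every combination of dictionary model, sampled viewpoint and sampled translation, whereas a RANSAC-style variant would sample hypotheses instead of enumerating them. The `project` and `cost` callables and the return layout are hypothetical placeholders chosen for illustration.

```python
from itertools import product
from typing import Any, Callable, List, Optional, Sequence, Tuple

Hypothesis = Tuple[float, int, Any, Any, List[int]]


def associate(dictionary: Sequence[Sequence[Any]],
              viewpoints: Sequence[Any],
              translations: Sequence[Any],
              segments: Sequence[Any],
              project: Callable[[Any, Any, Any], Any],
              cost: Callable[[Any, Any], float]) -> Hypothesis:
    """Return (total cost, model index, viewpoint, translation, matches).

    matches[i] is the index of the image segment assigned to model line i.
    """
    if not segments:
        raise ValueError("no image line segments to associate")
    best: Optional[Hypothesis] = None
    hypotheses = product(enumerate(dictionary), viewpoints, translations)
    for (model_id, model), view, trans in hypotheses:
        total = 0.0
        matches: List[int] = []
        for line3d in model:
            projected = project(line3d, view, trans)
            costs = [cost(seg, projected) for seg in segments]
            j = min(range(len(segments)), key=costs.__getitem__)
            matches.append(j)          # cheapest segment for this model line
            total += costs[j]
        if best is None or total < best[0]:
            best = (total, model_id, view, trans, matches)
    if best is None:
        raise ValueError("no hypotheses: dictionary or sample sets are empty")
    return best
```

Note that this sketch lets two model lines share the same image segment; enforcing one-to-one matches would need an extra assignment step that is not shown here.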

### 3.4. Pose and Shape Optimization

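Once the association information is known, we formulate an optimization problem to find the pose and shape of the object. The Ceres (agarwal2015others, ) toolbox is used for this purpose. In the following subsections we look at the different constraints used in the formulation.

The final cost function is

$$\mathcal{E} = \mathcal{E}_{pose} + \mathcal{E}_{normal} + \mathcal{E}_{shape} \tag{16}$$

where the three terms are the pose, normal and shape constraints described below.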

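#### 3.4.1. Pose Constraints

In Fig. 6, $L$ is a 3D line that projects onto the image plane as the 2D line $l$ with end points $a$ and $b$. The normal $\mathbf{n}$, constructed as the cross product of the rays from the camera centre through $a$ and $b$, is perpendicular to the 3D line. Let $\mathbf{P}$ be the point on the line and $\mathbf{d}$ be its direction.

$$\mathbf{n}^\top (R\,\mathbf{P} + T) = 0 \tag{17}$$

Taking the difference between two points $\mathbf{P}$ and $\mathbf{P} + k\,\mathbf{d}$ of the same line gives

$$\mathbf{n}^\top R\,\mathbf{d} = 0 \tag{18}$$

So, the cost function is

$$\mathcal{E}_{pose} = \sum_{i} \left( \mathbf{n}_i^\top (R\,\mathbf{P}_i + T) \right)^2 + \left( \mathbf{n}_i^\top R\,\mathbf{d}_i \right)^2 \tag{19}$$

$R$ and $T$ are the parameters we want to optimize for.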
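The two residuals of a single 3D-2D line correspondence can be sketched in Python as below. The sketch assumes that the plane normal is obtained from the back-projected end points of the detected segment under a pinhole model, and that the residuals are the dot products of that normal with the transformed point and direction; function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def cross(a: Sequence[float], b: Sequence[float]) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mat_vec(R: Sequence[Sequence[float]], v: Sequence[float]) -> Vec3:
    return tuple(dot(row, v) for row in R)


def plane_normal(a: Sequence[float], b: Sequence[float],
                 fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Unit normal of the plane through the camera centre and the segment ab."""
    ray_a = ((a[0] - cx) / fx, (a[1] - cy) / fy, 1.0)
    ray_b = ((b[0] - cx) / fx, (b[1] - cy) / fy, 1.0)
    n = cross(ray_a, ray_b)
    length = math.sqrt(dot(n, n))
    if length == 0:
        raise ValueError("segment end points coincide")
    return tuple(c / length for c in n)


def pose_residuals(n: Sequence[float], P: Sequence[float], d: Sequence[float],
                   R: Sequence[Sequence[float]],
                   T: Sequence[float]) -> Tuple[float, float]:
    """Residuals n.(R P + T) and n.(R d); both vanish for a consistent pose."""
    cam_point = tuple(r + t for r, t in zip(mat_vec(R, P), T))
    return dot(n, cam_point), dot(n, mat_vec(R, d))
```

Summing the squares of both residuals over all associated lines gives a least squares cost in the rotation and translation that a nonlinear solver can minimize.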

#### 3.4.2. Normal Constraint

Each object category has a base, e.g. the seat of a chair. We define the base of the object as the plane that is parallel to the ground plane when the object is kept in its normal position.

We use this observation and add a constraint that forces the base of the object to be parallel to the ground. We consider the normal of the ground plane to be the y-axis.

$$\mathcal{E}_{normal} = \left\| \hat{\mathbf{y}} \times R\,(\mathbf{d}_i \times \mathbf{d}_j) \right\|^2 \tag{20}$$

Here $\hat{\mathbf{y}}$ is the y-axis, and $\mathbf{d}_i$ and $\mathbf{d}_j$ belong to adjacent base lines from $S$.
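One possible form of this constraint is sketched below in Python: the base normal is taken as the rotated cross product of two adjacent base line directions, and the residual is the squared sine of its angle to the y-axis. The exact residual used by the authors is not recoverable from the text, so this form and the function names are assumptions.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def cross(a: Sequence[float], b: Sequence[float]) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mat_vec(R: Sequence[Sequence[float]], v: Sequence[float]) -> Vec3:
    return tuple(dot(row, v) for row in R)


def normal_residual(d_i: Sequence[float], d_j: Sequence[float],
                    R: Sequence[Sequence[float]],
                    up: Sequence[float] = (0.0, 1.0, 0.0)) -> float:
    """Squared sine of the angle between the rotated base normal and `up`."""
    base_normal = mat_vec(R, cross(d_i, d_j))
    length = math.sqrt(dot(base_normal, base_normal))
    if length == 0:
        raise ValueError("base line directions are parallel")
    unit = tuple(c / length for c in base_normal)
    misalignment = cross(up, unit)   # zero vector when the normal is along `up`
    return dot(misalignment, misalignment)
```

Because the residual uses a cross product, it is zero both when the base normal points along the y-axis and when it points against it, so the sign convention of the base line ordering does not matter.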

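#### 3.4.3. Shape Constraints

Finally, we use the eigenvector formulation discussed in section 3.2 to optimize for the shape of the object.

So, for any line $L_i$

$$L_i = \bar{L}_i + \sum_{b=1}^{B} \lambda_b V_b^{L_i}$$

expanding $L_i = [\mathbf{d}_i^\top,\ \mathbf{P}_i^\top]^\top$

$$\mathbf{P}_i = \bar{\mathbf{P}}_i + \sum_{b=1}^{B} \lambda_b V_b^{\mathbf{P}_i} \tag{21}$$

$$\mathbf{d}_i = \bar{\mathbf{d}}_i + \sum_{b=1}^{B} \lambda_b V_b^{\mathbf{d}_i} \tag{22}$$

where $V_b^{\mathbf{P}_i}$ and $V_b^{\mathbf{d}_i}$ denote the entries of the basis vector $V_b$ that correspond to the point and the direction of line $i$. Using these in the pose cost (equation 19) gives the shape constraint

$$\mathcal{E}_{shape} = \sum_{i} \left( \mathbf{n}_i^\top \Big( R \big( \bar{\mathbf{P}}_i + \sum_{b=1}^{B} \lambda_b V_b^{\mathbf{P}_i} \big) + T \Big) \right)^2 + \left( \mathbf{n}_i^\top R \big( \bar{\mathbf{d}}_i + \sum_{b=1}^{B} \lambda_b V_b^{\mathbf{d}_i} \big) \right)^2 \tag{23}$$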

#### 3.4.4. Optimizing Pose and Shape

The optimizer is called for the pose, $R$ and $T$, of the object with cost

$$\min_{R,\,T}\ \mathcal{E}_{pose} + \mathcal{E}_{normal} \tag{24}$$

followed by the call to the optimizer for the shape, $\Lambda$, of the object with cost

$$\min_{\Lambda}\ \mathcal{E}_{shape} + \rho(\Lambda) \tag{25}$$

where $\rho(\Lambda)$ is a regularizer that prevents the shape parameters ($\lambda_b$) from deviating from the category-model. An improvement in shape can result in an improvement in the pose of the object and vice versa; thus, both optimizations are called iteratively to achieve better results.
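The alternation between a pose step and a regularized shape step can be illustrated on a deliberately small toy problem. In the Python sketch below a single scale `s` stands in for the pose and a single weight `lam` stands in for the shape parameters, with observations modelled as `y = s * (mean + lam * basis)`; both steps have closed form least squares solutions. This is not the authors' Ceres based setup, only an assumed stand-in that shows the structure of the iteration.

```python
from typing import Sequence, Tuple


def alternate_pose_shape(y: Sequence[float], mean: Sequence[float],
                         basis: Sequence[float], rho: float = 0.0,
                         iterations: int = 50) -> Tuple[float, float]:
    """Alternately solve for a pose-like scale s and a shape weight lam.

    Model: y[k] ~ s * (mean[k] + lam * basis[k]); rho penalizes lam**2.
    """
    s, lam = 1.0, 0.0
    for _ in range(iterations):
        # Pose step: shape fixed, least squares solution for s.
        g = [m + lam * v for m, v in zip(mean, basis)]
        denom = sum(gk * gk for gk in g)
        if denom == 0:
            raise ValueError("degenerate shape: all model entries are zero")
        s = sum(yk * gk for yk, gk in zip(y, g)) / denom
        # Shape step: pose fixed, ridge regularized solution for lam.
        num = sum(s * v * (yk - s * m) for yk, m, v in zip(y, mean, basis))
        den = sum((s * v) ** 2 for v in basis) + rho
        if den == 0:
            raise ValueError("shape weight is unobservable")
        lam = num / den
    return s, lam
```

With `rho = 0` and noise free data the iteration recovers the generating parameters; a positive `rho` pulls `lam` towards zero, that is, towards the mean shape.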

### 3.5. Integrating Object Pose with Monocular SLAM

The category-models learned using the line based approach are incorporated into a monocular SLAM back end. Here we have $T_i^j = (R_i^j, t_i^j)$, where $T_i^j \in SE(3)$ represents the rigid-body transform of a 3D point in the camera frame at time $i$ with respect to the camera frame at time $j$. $T_i^j$ is a $4 \times 4$ matrix represented below

$$T_i^j = \begin{bmatrix} R_i^j & t_i^j \\ \mathbf{0}^\top & 1 \end{bmatrix} \tag{26}$$

If the 3D coordinate of a world point $X$ with respect to frame $i$ is $X_i$, then using the transformation we can represent it with respect to camera frame $j$ as $X_j = T_i^j X_i$.
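A rigid-body transform of this form can be handled with a few helper functions. The Python sketch below builds the 4 by 4 matrix from a rotation and a translation, applies it to a point, inverts it and composes two transforms; the helper names are chosen for this example only.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def make_transform(R: Sequence[Sequence[float]], t: Sequence[float]) -> Matrix:
    """Assemble [[R, t], [0, 1]] as a 4x4 nested list."""
    rows = [[float(v) for v in R[r]] + [float(t[r])] for r in range(3)]
    rows.append([0.0, 0.0, 0.0, 1.0])
    return rows


def apply_transform(T: Matrix, X: Sequence[float]) -> Tuple[float, ...]:
    """Map a 3D point from the source frame into the target frame."""
    return tuple(sum(T[r][c] * X[c] for c in range(3)) + T[r][3]
                 for r in range(3))


def invert_transform(T: Matrix) -> Matrix:
    """Inverse of a rigid transform: rotation R^T, translation -R^T t."""
    Rt = [[T[c][r] for c in range(3)] for r in range(3)]
    t = [T[r][3] for r in range(3)]
    t_inv = [-sum(Rt[r][c] * t[c] for c in range(3)) for r in range(3)]
    return make_transform(Rt, t_inv)


def compose(A: Matrix, B: Matrix) -> Matrix:
    """Matrix product A @ B, i.e. apply B first and then A."""
    return [[sum(A[r][k] * B[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]
```

Composing a transform with its inverse gives the identity, which is a convenient sanity check when chaining relative camera poses.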

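For a given set of relative pose measurements $\{\hat{T}_i^j\}$ of the robot across all frame pairs $(i, j)$, we define the pose-SLAM problem as estimating the poses $T_i$ that maximize the log-likelihood of the relative pose measurements, which can be framed as the problem of minimizing the observation errors (minimizing the negative log-likelihood).

$$\mathcal{E}_{odom} = \sum_{(i,j)} \left\| \log\!\left( (\hat{T}_i^j)^{-1}\, T_j^{-1}\, T_i \right) \right\|^2_{\Sigma_{ij}} \tag{27}$$

Here $T_i$ denotes the pose of camera frame $i$ in the world frame, so that $T_j^{-1} T_i$ is the predicted relative transform, $\log(\cdot)$ maps an element of $SE(3)$ to its Lie algebra coordinates, and $\Sigma_{ij}$ is assumed to be the uncertainty associated with each pose measurement $\hat{T}_i^j$. To minimize the problem posed above (27), we employ factor graphs (kaess2012isam2, ) and use the publicly available GTSAM (dellaert2012gtsam, ) framework to construct and optimize the proposed factor graph model.

Minimizing the error functions (24) and (25) in an alternating manner with respect to the object shape and pose parameters yields the estimated shape ($\Lambda$) and pose ($\hat{T}_{ik}$) of the $k$-th object observed in a given frame $i$. The pose observations obtained after shape and pose error minimization form additional factors in the SLAM factor graph. Therefore, for each object node in the factor graph, if the pose of object $m$ in the world frame is denoted by $T^{O}_{m}$, the following error is minimized.

$$\mathcal{E}_{obj} = \sum_{i} \sum_{k} \left\| \log\!\left( \hat{T}_{ik}^{-1}\, T_i^{-1}\, T^{O}_{\delta(i,k)} \right) \right\|^2_{\Sigma_{ik}} \tag{28}$$

Here $\delta$ denotes the data association function that uniquely identifies every object observed so far. Finally, the object-SLAM error that jointly estimates the robot poses and the object poses using relative object pose observations is expressed as:

$$\mathcal{E}_{SLAM} = \mathcal{E}_{odom} + \mathcal{E}_{obj} \tag{29}$$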

## 4. Results

In this section we present experimental results on multiple real-world sequences comprising objects of different categories, namely chair, table and laptop. We evaluate the performance of the proposed line based approach for Object SLAM. We also emphasize that our approach exploits the key edges of an object, corresponding to the respective wireframe model, to obtain the object trajectory and precisely estimate object poses in various real-world scenarios. Fig. 10 shows results of our line based pipeline on the PASCAL VOC (everingham2010pascal, ) dataset.

Table 1 compares our approach against the trajectory generated by ORB-SLAM. The localization error is computed for each object, and the best, worst and average errors are reported. Our object CAD models are in metric scale, and we scale the trajectory using the ratio of translation between the end points of the trajectory. After this scaling, the results are in metres. The ground truth is collected by placing markers at the object positions. The table emphasizes that our approach is able to embed objects in 3D space without deteriorating (and even slightly improving) the trajectory generated by ORB-SLAM.
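The scaling step can be sketched in Python as below. The sketch assumes that the metric distance between the two end points of the trajectory is known (for example from the marker based ground truth) and that the monocular trajectory is multiplied by the ratio of that distance to the estimated one; this reading of the scaling rule and the function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def metric_scale(estimated_start: Sequence[float],
                 estimated_end: Sequence[float],
                 metric_distance: float) -> float:
    """Ratio between the known metric distance and the estimated one."""
    estimated = math.dist(estimated_start, estimated_end)
    if estimated == 0:
        raise ValueError("estimated end points coincide; scale is undefined")
    return metric_distance / estimated


def scale_trajectory(trajectory: Sequence[Sequence[float]],
                     scale: float) -> List[Point3]:
    """Multiply every camera position by the scale factor."""
    return [tuple(scale * c for c in position) for position in trajectory]
```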

Lastly, we evaluate our pipeline against the keypoint method (parkhiya2018constructing, ) by comparing execution times. The time bottleneck for the keypoint method during evaluation is the forward pass of the network.

Here, we compare the frame processing time of both methods for an image containing 3 objects. The hardware for the keypoint method is a TitanX GPU with 12 GB of memory, and for the line based method an Intel i5 processor with 8 GB of RAM.

Time per frame in keypoint method

= inference time per object

= 285 ms

=

Time per frame in our method

= time per frame for LSD + processing time per image

= 120

=

So, we have an increase in speed of more than 2 times for the same process.

Further, we provide a video run and other relevant results in the supplementary material.

### 4.1. Dataset

We demonstrate object SLAM using our approach on numerous monocular video sequences in an indoor setting, comprising office spaces and a laboratory, which constitute our dataset. We collected the dataset using a micro aerial vehicle (MAV) flying at a constant height above the ground. Sequences 1 and 2 of our dataset are elongated loops with 2 parallel sides, following dominant straight line motion, while Sequence 3 is a 360° rotation in place with no translation from the origin. The estimated robot (MAV) trajectory and object locations for these runs are visualized in Fig. 8 for both ORB-SLAM (ORB, ) and our line based object-SLAM, with and without object loop closure.

Table 1. Object localization error (metres).

| Sequence ID | Approach | # Objects | Best | Worst | Average |
|---|---|---|---|---|---|
| 1 | ORB | 7 | 0.1558 | 1.0331 | 0.457 |
| 1 | Ours | 7 | 0.1592 | 0.9190 | 0.5030 |
| 2 | ORB | 12 | 1.52 | 3.20 | 2.23 |
| 2 | Ours | 12 | 1.55 | 3.12 | 2.1 |
| 3 | ORB | 9 | 3.05 | 4.65 | 3.89 |
| 3 | Ours | 9 | 3.75 | 4.61 | 3.85 |

### 4.2. Instance Retrieval

We apply principal component analysis (jolliffe2011principal, ) to select the eigenvectors that represent the object space in section 3.2. In section 3.4, we formulate the optimization problem to solve for the shape of the object. The solution of this optimization gives us the coefficients of the top eigenvectors, which represent the shape of the object.

We then retrieve the closest instance from the 3D CAD model collection, the one that best describes the shape of the object, by running a K-Nearest Neighbors search. In Fig. 9 we present results of retrieving an instance of a 3D CAD model by running a 5-Nearest Neighbor search and then manually selecting the closest instance. We use these retrieved instances to visualize objects in a robot trajectory.
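The retrieval step can be sketched as a nearest neighbor search in the space of shape parameters. The Python example below assumes that every CAD model in the collection is stored with its own shape coefficients and that Euclidean distance is used for ranking; the collection layout, the distance choice and the function name are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence


def k_nearest_instances(query: Sequence[float],
                        collection: Dict[str, Sequence[float]],
                        k: int = 5) -> List[str]:
    """Return the names of the k CAD models closest to the query coefficients."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(collection.items(),
                    key=lambda item: math.dist(query, item[1]))
    return [name for name, _ in ranked[:k]]
```

The candidates returned by such a search would then be inspected manually to select the final instance, as described above.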

### 4.3. Normal Correction

In the pose optimization formulation, while solving for the pose, $R$ and $T$, of an object, a normal correction cost (eq. 20) is also included. In Fig. 7 a clear improvement can be seen in the pitch and roll of the objects with the inclusion of the normal cost, demonstrated here using the trajectory of sequence 3 of our dataset visualized with Gazebo.

## 5. Conclusions

The paper introduces a novel line based parameterization to represent various objects that are commonly found in indoor environments. We provide a complete pipeline that finds object poses using pose and shape optimization and then embeds the objects in the map together with the monocular SLAM trajectory, using a factor graph optimization backend, to localize the objects with reasonable accuracy in the navigable space.

We show results of the proposed pipeline on various real world scenes containing objects from multiple categories. The pipeline is able to localize the objects in the map without deteriorating the ORB-SLAM performance, and in fact improves the trajectory to some extent.

The line based parameterization can prove useful in cases where keypoint information is hard to obtain. It circumvents the training and data collection phases and speeds up the evaluation process for association.

The performance of the pipeline depends on the robustness of the association algorithm. We plan to implement a graph based optimization method to provide the associations for objects and further improve the performance and robustness of the proposed pipeline.

## References

- (1) A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in European Conference on Computer Vision, pp. 483–499, Springer, 2016.
- (2) P. Parkhiya, R. Khawad, J. K. Murthy, B. Bhowmick, and K. M. Krishna, “Constructing category-specific models for monocular object-slam,” arXiv preprint arXiv:1802.09292, 2018.
- (3) H. Lategahn, A. Geiger, and B. Kitt, “Visual slam for autonomous ground vehicles,” in Robotics and Automation (ICRA) 2011 IEEE International Conference on, pp. 1732–1737, IEEE, 2011.
- (4) K. Çelik and A. K. Somani, “Monocular vision slam for indoor aerial vehicles,” Journal of electrical and computer engineering, vol. 2013, pp. 4–1573, 2013.
- (5) Y. Cho and A. Kim, “Visibility enhancement for underwater visual slam based on underwater light scattering model,” in Robotics and Automation (ICRA) 2017 IEEE International Conference on, pp. 710–717, IEEE, 2017.
- (6) C. Leung, S. Huang, and G. Dissanayake, “Active slam using model predictive control and attractor based exploration,” in Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pp. 5026–5031, IEEE, 2006.
- (7) A. Kundu, K. M. Krishna, and C. V. Jawahar, “Realtime multibody visual slam with a smoothly moving monocular camera,” in Computer Vision (ICCV) 2011 IEEE International Conference on, pp. 2080–2087, IEEE, 2011.
- (8) S. Agarwal, K. Mierle, and others, “Ceres solver,” 2015.
- (9) R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. J. Kelly, and A. J. Davison, “Slam++: Simultaneous localisation and mapping at the level of objects,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1352–1359, 2013.
- (10) S. Choudhary, L. Carlone, C. Nieto, J. Rogers, Z. Liu, H. I. Christensen, and F. Dellaert, “Multi robot object-based slam,” in International Symposium on Experimental Robotics, pp. 729–741, Springer, 2016.
- (11) S. Yang and S. Scherer, “Cubeslam: Monocular 3d object detection and slam without prior models,” arXiv preprint arXiv:1806.00557, 2018.
- (12) R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM: a versatile and accurate monocular SLAM system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
- (13) J. Engel, T. Schöps, and D. Cremers, “Lsd-slam: Large-scale direct monocular slam,” in European Conference on Computer Vision, pp. 834–849, Springer, 2014.
- (14) C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, “Svo: Semidirect visual odometry for monocular and multicamera systems,” IEEE Transactions on Robotics, vol. 33, no. 2, pp. 249–265, 2017.
- (15) S. Maity, A. Saha, and B. Bhowmick, “Edge slam: Edge points based monocular visual slam,” in ICCV Workshop, pp. 2408–2417, 2017.
- (16) A. Mallik, B. Bhowmick, and S. Alam, “A multi-sensor information fusion approach for efficient 3d reconstruction in smart phone,” in IPCV, 2015.
- (17) B. Bhowmick, A. Mallik, and A. Saha, “Mobiscan3d: A low cost framework for real time dense 3d reconstruction on mobile devices,” in UIC, 2014.
- (18) R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, “g2o: A general framework for graph optimization,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on, pp. 3607–3613, IEEE, 2011.
- (19) F. Dellaert et al., “Gtsam,” URL: https://borg.cc.gatech.edu, 2012.
- (20) R. Zhu, C. Wang, C.-H. Lin, Z. Wang, and S. Lucey, “Object-centric photometric bundle adjustment with deep shape prior,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 894–902, IEEE, 2018.
- (21) B. Mu, S.-Y. Liu, L. Paull, J. Leonard, and J. P. How, “Slam with objects using a nonparametric pose graph,” in Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, pp. 4602–4609, IEEE, 2016.
- (22) D. Gálvez-López, M. Salas, J. D. Tardós, and J. M. M. Montiel, “Real-time monocular object slam,” Robotics and Autonomous Systems, vol. 75, pp. 435–449, 2016.
- (23) M. Crocco, C. Rubino, and A. Del Bue, “Structure from motion with objects,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4141–4149, 2016.
- (24) N. Sünderhauf, F. Dayoub, S. McMahon, M. Eich, B. Upcroft, and M. Milford, “Slam–quo vadis? in support of object oriented and semantic slam,” 2015.
- (25) P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
- (26) J. K. Murthy, G. S. Krishna, F. Chhaya, and K. M. Krishna, “Reconstructing vehicles from a single image: Shape priors for road scene understanding,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 724–731, IEEE, 2017.
- (27) S. Tulsiani, A. Kar, J. Carreira, and J. Malik, “Learning category-specific deformable 3d models for object reconstruction,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 4, pp. 719–731, 2017.
- (28) J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788, 2016.
- (29) S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, pp. 91–99, 2015.
- (30) W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision, pp. 21–37, Springer, 2016.
- (31) H. Su, C. R. Qi, Y. Li, and L. J. Guibas, “Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2686–2694, 2015.
- (32) A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al., “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015.
- (33) M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
- (34) R. Grompone von Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall, “Lsd: a line segment detector,” Image Processing On Line, vol. 2, pp. 35–55, 2012.
- (35) R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, “Orb-slam: a versatile and accurate monocular slam system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
- (36) F. Dellaert et al., “Gtsam,” URL: https://borg.cc.gatech.edu, 2012.
- (37) I. Jolliffe, “Principal component analysis,” in International encyclopedia of statistical science, pp. 1094–1096, Springer, 2011.
- (38) M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and F. Dellaert, “isam2: Incremental smoothing and mapping using the bayes tree,” The International Journal of Robotics Research, vol. 31, no. 2, pp. 216–235, 2012.