I Introduction
Recent progress in computer vision has enabled the implementation of autonomous vehicle prototypes across urban and highway scenarios [schwarting2018planning]. Autonomous vehicles need accurate self-localization in the environment to plan their actions. For accurate localization, High Definition (HD) maps of the environment, containing information on the 3D geometry of road boundaries, lanes, traffic signs, and other semantically meaningful landmarks, are necessary. However, creating these HD maps involves expensive sensors mounted on the collection vehicles [jiao2018machine], thereby limiting the scale of their coverage. It is also desired that any changes in the environment, such as the type or positions of traffic signs, are regularly reflected in the map. Therefore, the creation and maintenance of HD maps at scale remain a challenge.
To extend map coverage to more regions or to update landmarks over time, crowdsourced maps are an attractive solution. However, in contrast with automotive data collection vehicles equipped with high-grade calibrated sensors, crowdsourced mapping would rely on consumer-grade sensors whose intrinsics may be unknown or change over time. The commonly available sensors for crowdsourced mapping are a monocular color camera and a global positioning system (GPS). To utilize these sensors for crowdsourced mapping, camera self-calibration must be performed, followed by monocular depth or ego-motion estimation. Over the years, both geometry-based and deep learning based approaches have been proposed to compute the camera intrinsics [bogdan2018deepcalib, gordon2019depthwild, schonberger2016structure] and to estimate depth/ego-motion [mur2015orb, engel2017direct, gordon2019depthwild, zhou2017unsupervised] from a sequence of images. However, the state-of-the-art solution to crowdsourced mapping assumes the camera intrinsics to be known a priori, and relies only upon geometry-based ego-motion estimation [dabeer2017end].
Geometry-based approaches to self-calibration and visual depth/ego-motion estimation often depend on carefully designed features and on matching them across frames. Thus, they fail in scenarios with limited features, such as highways, during illumination changes or occlusions, or when matching is poor due to repetitive structure. Recently, deep learning based approaches for camera self-calibration as well as depth and ego-motion estimation have been proposed [zhou2018deeptam, zhou2017unsupervised, godard2018digging, gordon2019depthwild]. These methods operate end-to-end and, often being self-supervised, can be applied in challenging scenarios. They are usually more accurate than geometry-based approaches on short linear trajectories, resulting in a higher local agreement with the ground truth [zhou2017unsupervised, gordon2019depthwild]. Moreover, deep learning based approaches can estimate monocular depth from a single frame, as opposed to geometry-based approaches that require multiple frames. Nonetheless, the localization accuracy of geometry-based approaches is higher for longer trajectories due to loop closure and bundle adjustment. Therefore, we hypothesize that eliminating the requirement to know the camera intrinsics a priori, and mapping through a hybrid of geometry and deep learning methods, will increase global map coverage and enhance the scope of its application.
In this work, we focus on the 3D positioning of traffic signs, as it is critical to the safe performance of autonomous vehicles, and is useful for traffic inventory and sign maintenance. We propose a framework for crowdsourced 3D traffic sign positioning that combines the strengths of geometry and deep learning approaches to self-calibration and depth/ego-motion estimation. Our contributions are as follows:

• We evaluate the sensitivity of the 3D position triangulation to the accuracy of the self-calibration.
• We quantitatively compare deep learning and multi-view geometry based approaches to camera self-calibration, as well as depth and ego-motion estimation, for crowdsourced traffic sign positioning.
• We demonstrate crowdsourced 3D traffic sign positioning using only GPS information and a monocular color camera, without prior knowledge of the camera parameters.
• We show that combining the strengths of deep learning with multi-view geometry is important for increased map coverage.
• To facilitate evaluation and comparison on this task, we construct and provide an open-source 3D traffic sign ground truth positioning dataset on KITTI¹.

¹ https://github.com/hemangchawla/3dgroundtruthtrafficsignpositions.git
II Related Work
Traffic sign 3D positioning
Arnoul et al. [arnoul1996traffic] used a Kalman filter for tracking and estimating the positions of traffic signs in static scenes. In contrast, Madeira et al. [madeira2005automatic] estimated traffic sign positions through least-squares triangulation using GPS, an Inertial Measurement Unit (IMU), and wheel odometry. Approaches using only a monocular color camera and GPS were also proposed [krsak2011traffic, welzel2014accurate]. However, Welzel et al. [welzel2014accurate] utilized prior information about the size and height of traffic signs to achieve an average absolute positioning accuracy of up to 1 m. The similar problem of mapping the 3D positions and orientations of traffic lights was tackled by Fairfield et al. [fairfield2011traffic]. For the related tasks of 3D object positioning and distance estimation, deep learning approaches [chen2016monocular, ku2019monocular, qin2019monogrnet, zhu2019learning] have been proposed. However, they primarily focus on volumetric objects, ignoring near-planar traffic signs. Recently, Dabeer et al. [dabeer2017end] proposed an approach to crowdsource the 3D positions and orientations of traffic signs using cost-effective sensors with known camera intrinsics, and achieved single-journey average relative and absolute positioning accuracies of 46 cm and 57 cm respectively. All the above methods either relied upon collection hardware dedicated to mapping the positions of traffic control devices or assumed known accurate camera intrinsics.
Camera self-calibration
Geometry-based approaches to self-calibration use two or more views of the scene to estimate the focal lengths [bocquillon2007constant, gherardi2010practical], while often fixing the principal point at the image center [de1998self]. Structure from Motion (SfM) reconstruction using a sequence of images has also been applied to self-calibration [pollefeys2008detailed, schonberger2016structure]. Moreover, deep learning approaches have been proposed to estimate the camera intrinsics from a single image through direct supervision [lopez2019deep, rong2016radial, workman2015deepfocal, zhuang2019degeneracy], or as part of a multi-task network [bogdan2018deepcalib, gordon2019depthwild]. While self-calibration is essential for crowdsourced 3D traffic sign positioning, its utility has not been evaluated until now.
Monocular depth and ego-motion estimation
Multi-view geometry based monocular visual odometry (VO) and simultaneous localization and mapping (SLAM) estimate the camera trajectory using visual feature matching and local bundle adjustment [klein2007parallel, mur2015orb], or through minimization of the photometric reprojection error [engel2014lsd, engel2017direct, newcombe2011dtam]. Supervised learning approaches predict monocular depth [cao2017estimating, eigen2014depth, liu2015learning] and ego-motion [wang2017deepvo, zhou2018deeptam] using ground truth depths and trajectories, respectively. In contrast, self-supervised approaches jointly predict ego-motion and depth utilizing image reconstruction as a supervisory signal [casser2019unsupervised1, godard2018digging, gordon2019depthwild, zhou2017unsupervised, godard2017unsupervised, li2018undeepvo, zhan2018unsupervised]. Self-supervised depth prediction has also been integrated with geometry-based direct sparse odometry [engel2017direct] as a virtual depth signal [yang2018deep]. However, some of these self-supervised approaches rely upon stereo image pairs during training [yang2018deep, godard2017unsupervised, li2018undeepvo, zhan2018unsupervised].
III Method
In this section, we describe our proposed system for 3D traffic sign positioning. The input is a sequence of color images $I_i$ of width $w$ and height $h$, and corresponding GPS coordinates $G_i$. The output is a list of detected traffic signs with the corresponding class identifiers $c_j$, absolute positions $p_j^{abs}$, and relative positions $p_{ij}^{rel}$ with respect to the frames in which each sign was detected. An overview of the proposed system for 3D traffic sign positioning is depicted in Fig. 2. Our system comprises the following key modules:
III-A Traffic Sign Detection & Inter-Frame Sign Association
The first requirement for estimating the 3D positions of traffic signs is detecting their coordinates in the image sequence and identifying their class. The output of this step is a list of 2D bounding boxes enclosing the detected signs, along with their corresponding track and frame numbers. Using the center of each bounding box, we extract the image coordinates of the traffic sign. However, we disregard bounding boxes detected at the edges of the images to account for possible occlusions.
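A minimal sketch of this observation extraction step follows. The detector output format (track_id, frame_idx, x1, y1, x2, y2), the 5-pixel edge margin, and the function name are illustrative assumptions, not the paper's exact interface.

```python
def extract_observations(detections, img_w, img_h, margin=5):
    """Return {track_id: [(frame_idx, (u, v)), ...]} of sign image coordinates."""
    observations = {}
    for track_id, frame_idx, x1, y1, x2, y2 in detections:
        # Discard boxes touching the image border (possible occlusion/truncation).
        if x1 < margin or y1 < margin or x2 > img_w - margin or y2 > img_h - margin:
            continue
        u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0   # bounding-box center
        observations.setdefault(track_id, []).append((frame_idx, (u, v)))
    return observations
```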
III-B Camera Self-Calibration
To utilize crowdsourced image sequences for estimating the 3D positions of traffic signs, we must perform self-calibration for cameras whose intrinsics are previously unknown. In this work, we utilize the pinhole camera model. From the set of geometry-based approaches, we evaluate the Structure from Motion based method using Colmap [schonberger2016structure]. Note that self-calibration suffers from ambiguity in the case of forward motion with parallel optical axes [bocquillon2007constant]. Therefore, we only utilize those parts of the sequences in which the car is turning. To extract the subsequences in which the car is turning, the Ramer-Douglas-Peucker (RDP) algorithm [ramer1972iterative, douglas1973algorithms] is used, as sketched below. From the deep learning based approaches, we evaluate Self-Supervised Depth From Videos in the Wild (VITW) [gordon2019depthwild]. The burden of annotating training data [lopez2019deep, zhuang2019degeneracy] makes supervised approaches inapplicable to crowdsourced use cases.
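As a rough illustration of the turn extraction, the sketch below applies RDP to the 2D ENU trajectory: the interior vertices retained by the simplification are points of significant heading change, and frames around them are treated as turn subsequences. The epsilon threshold and the window size are illustrative assumptions.

```python
import numpy as np

def rdp_indices(points, epsilon):
    """Indices of polyline vertices kept by RDP (endpoints always kept)."""
    start, end = points[0], points[-1]
    dx, dy = end - start
    norm = np.hypot(dx, dy)
    if norm == 0.0:
        dists = np.linalg.norm(points - start, axis=1)
    else:
        # Perpendicular distance of every point to the start-end chord.
        dists = np.abs(dx * (points[:, 1] - start[1])
                       - dy * (points[:, 0] - start[0])) / norm
    idx = int(np.argmax(dists))
    if dists[idx] > epsilon:
        left = rdp_indices(points[:idx + 1], epsilon)
        right = rdp_indices(points[idx:], epsilon)
        return left + [i + idx for i in right[1:]]
    return [0, len(points) - 1]

def turn_frames(enu_xy, epsilon=1.0, window=15):
    """Frame indices within `window` frames of an RDP-detected turn vertex."""
    keep = rdp_indices(np.asarray(enu_xy, dtype=float), epsilon)
    frames = set()
    for t in keep[1:-1]:  # interior vertices = significant heading changes
        frames.update(range(max(0, t - window), min(len(enu_xy), t + window + 1)))
    return sorted(frames)
```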
III-C Camera Ego-Motion and Depth Estimation
Given the camera calibration, 3D traffic sign positioning requires the computation of the camera ego-motion or depth, as shown in Figs. 2 and 3.
Ego-Motion
To apply approach A described in Fig. 3 to 3D traffic sign positioning, the ego-motion of the camera must be computed from the image sequence. Note that camera calibration through Colmap involves SfM, but only utilizes those subsequences which contain a turn (Sec. III-B). Therefore, we evaluate the state-of-the-art geometry-based monocular approach ORB-SLAM [mur2015orb] against the self-supervised Monodepth 2 [godard2018digging] and VITW. While the geometry-based approaches compute the complete trajectory for the sequence, the self-supervised learning based approaches output the camera rotation and translation per image pair. The adjacent pair transformations are then concatenated to compute the complete trajectory. After performing visual ego-motion estimation, we use the GPS coordinates to scale the estimated trajectory. First, we transform the GPS geodetic coordinates to local East-North-Up (ENU) coordinates. Thereafter, using Umeyama's algorithm [umeyama1991least], a similarity transformation (rotation $R$, translation $t$, and scale $s$) is computed that scales and aligns the estimated camera positions $x_i$ with the ENU positions $y_i$, minimizing the mean squared error between them. The scaled and aligned camera positions are therefore given by
$\hat{x}_i = s R x_i + t. \quad (1)$
Thereafter, this camera trajectory is used for the computation of the 3D traffic sign positions as described in Sec. III-D.
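A minimal sketch of this alignment step, assuming (N, 3) arrays of estimated camera positions and GPS-derived ENU positions; it implements the closed-form similarity estimation of Umeyama [umeyama1991least] used in Eq. 1.

```python
import numpy as np

def umeyama_alignment(x, y):
    """Similarity (s, R, t) minimizing mean ||(s R x_i + t) - y_i||^2.
    x, y: (N, 3) arrays of estimated and ENU positions."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    xc, yc = x - mu_x, y - mu_y
    cov = yc.T @ xc / x.shape[0]               # cross-covariance matrix
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # avoid a reflection
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (xc ** 2).sum() * x.shape[0]
    t = mu_y - s * (R @ mu_x)
    return s, R, t

# Eq. 1: aligned positions are x_hat = s * (R @ x.T).T + t
```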
Monocular Depth
To apply approach B described in Fig. 3 to 3D traffic sign positioning, dense monocular depth maps are needed. To generate the depth maps, we evaluate the self-supervised approaches Monodepth 2 and VITW. These approaches simultaneously predict the monocular depth as well as the ego-motion of the camera. While the estimated dense depth maps maintain the relative depth of the observed objects, we obtain metric depth by preserving forward and backward scale consistency. Given the camera calibration matrix $K$, the shift in pixel coordinates due to rotation $R$ and translation $t$ between adjacent frames $i$ and $i+1$ is given by
$d_{i+1}\, p_{i+1} = K R K^{-1} d_i\, p_i + K t, \quad (2)$
where $d_i$ and $d_{i+1}$ represent the unscaled depths corresponding to the homogeneous coordinates $p_i$ and $p_{i+1}$ of matching pixels. By multiplying equation 2 with the forward scale estimate $s_f$, it is seen that scaling the relative translation $t$ similarly scales the depths $d_i$ and $d_{i+1}$. This is also explained through the concept of similar triangles in Fig. 4. Given the relative ENU translation $t^{ENU}$, we note that the scaled relative translation is given by
$\hat{t} = s_f\, t. \quad (3)$
Therefore, the forward scale estimate is
$s_f = \lVert t^{ENU} \rVert / \lVert t \rVert. \quad (4)$
Similarly, the backward scale estimate $s_b$ is computed. Accordingly, for frame $i$, the scaling factor is given by the average of the forward and backward scale estimates, $s_{f,i}$ and $s_{b,i}$. Thereafter, these scaled dense depth maps are used for the computation of the 3D traffic sign positions as described in Sec. III-D.
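The per-frame scaling can be sketched as follows, assuming the network's relative translations `pred_trans[i]` (frame i to i+1) and the ENU positions `enu[i]`; here the backward scale estimate of frame i is taken from the preceding frame pair, which is an approximation of the procedure described above.

```python
import numpy as np

def depth_scales(pred_trans, enu):
    """Metric scale per frame from forward/backward estimates (Eqs. 3-4)."""
    n = len(enu)
    # Forward scale of frame i compares GPS and predicted translation i -> i+1.
    s_fwd = [np.linalg.norm(enu[i + 1] - enu[i]) / np.linalg.norm(pred_trans[i])
             for i in range(n - 1)]
    scales = np.empty(n)
    scales[0], scales[-1] = s_fwd[0], s_fwd[-1]   # boundary frames: one estimate
    for i in range(1, n - 1):
        # Average of forward (pair i, i+1) and backward (pair i-1, i) estimates.
        scales[i] = 0.5 * (s_fwd[i] + s_fwd[i - 1])
    return scales

# metric_depth_i = scales[i] * predicted_depth_i
```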
III-D 3D Positioning and Optimization
For the final step of estimating and optimizing the 3D positions of the detected traffic signs, we adopt two approaches as shown in Fig. 3.
Approach A
In this approach, the estimated camera parameters, the computed and scaled ego-motion trajectory, and the 2D sign observations in the images are used to compute the sign position through triangulation. For a sign observed in $n$ frames, we compute the initial sign position estimate using the midpoint algorithm [szeliski2010computer]. Thereafter, non-linear Bundle Adjustment (BA) is applied to refine the initial estimate by minimizing the reprojection error, yielding
$p^{abs} = \arg\min_{p} \sum_{i=1}^{n} \lVert \pi(K (R_i\, p + t_i)) - u_i \rVert^2, \quad (5)$
where $\pi$ denotes the perspective projection, $u_i$ the observed pixel coordinates of the sign in frame $i$, and $(R_i, t_i)$ the world-to-camera transformation of frame $i$.
To compute the sign positions relative to the frames, the estimated absolute sign position is projected into the corresponding frames in which it was observed:
$p_i^{rel} = R_i\, p^{abs} + t_i. \quad (6)$
If the relative depth of the sign is found to be negative, the triangulation of that sign is considered to have failed. We can use this approach with the full trajectory of the sequence or with short subsequences corresponding to the detection tracks. The use of full and short trajectories for triangulation is compared in Sec. IV-D.
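A minimal sketch of the midpoint initialization and the negative-depth check, assuming camera-to-world poses (R, t); the bundle adjustment refinement of Eq. 5 (e.g., with scipy.optimize.least_squares) is omitted.

```python
import numpy as np

def triangulate_midpoint(K, poses, pixels):
    """Midpoint triangulation from camera-to-world poses (R, t) and pixels (u, v).
    Returns the 3D point, or None if it lies behind any camera."""
    K_inv = np.linalg.inv(K)
    A, b = np.zeros((3, 3)), np.zeros(3)
    rays, centers = [], []
    for (R, t), (u, v) in zip(poses, pixels):
        d = R @ (K_inv @ np.array([u, v, 1.0]))   # viewing ray in world frame
        d /= np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)            # projector orthogonal to ray
        A += P
        b += P @ t
        rays.append(d)
        centers.append(t)
    p = np.linalg.solve(A, b)                     # point closest to all rays
    for d, c in zip(rays, centers):
        if np.dot(p - c, d) <= 0.0:               # negative depth: behind camera
            return None
    return p
```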
Approach B
In approach B, the estimated camera parameters, the scaled dense depth maps, and the 2D sign observations in the images are used to compute the 3D traffic sign positions through inverse projection. For a sign observed in $n$ frames, each corresponding depth map produces a sign position hypothesis given by
$p_i^{rel} = s_i\, d_i\, K^{-1} u_i, \quad (7)$
where $u_i$ represents the homogeneous pixel coordinates of the sign in frame $i$, $d_i$ the predicted depth, and $s_i$ the corresponding depth scaling factor. Since sign depth estimation may not be reliable beyond a certain distance, we discard sign position hypotheses whose estimated relative depth is more than 20 m. To compute the absolute coordinates of the sign, each relative sign position is projected to world coordinates, and their centroid is computed as the absolute sign position:
$p^{abs} = \frac{1}{n} \sum_{i=1}^{n} R_i^{-1} (p_i^{rel} - t_i). \quad (8)$
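A minimal sketch of Eqs. 7 and 8, assuming camera-to-world poses and per-frame scaled depth maps; the 20 m cutoff follows the text, while the data layout is illustrative.

```python
import numpy as np

def sign_position(K, observations, max_depth=20.0):
    """Eqs. 7-8: average inverse-projected sign hypotheses in world coordinates.
    observations: list of ((u, v), depth_map, scale, (R, t)), camera-to-world poses."""
    K_inv = np.linalg.inv(K)
    hypotheses = []
    for (u, v), depth_map, scale, (R, t) in observations:
        d = scale * depth_map[int(v), int(u)]         # metric depth at the sign pixel
        if d > max_depth:                             # unreliable beyond 20 m
            continue
        p_rel = d * (K_inv @ np.array([u, v, 1.0]))   # Eq. 7: camera-frame position
        hypotheses.append(R @ p_rel + t)              # transform to world frame
    return np.mean(hypotheses, axis=0) if hypotheses else None  # Eq. 8: centroid
```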
Finally, for both approaches, the metric absolute positions of the traffic signs are converted back to GPS geodetic coordinates.
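For this final conversion, a library such as pymap3d can be used, assuming the ENU origin is the first GPS fix of the sequence:

```python
import pymap3d as pm

def enu_to_gps(sign_enu, origin):
    """Convert a metric ENU sign position back to geodetic (lat, lon, height)."""
    lat0, lon0, h0 = origin            # ENU origin, e.g. the sequence's first GPS fix
    e, n, u = sign_enu
    return pm.enu2geodetic(e, n, u, lat0, lon0, h0)
```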
IV Experiments
In order to evaluate the best approach to 3D traffic sign positioning, it is pertinent to consider the impact of the different components on the overall accuracy of the estimation. First, we analyze the sensitivity of the 3D traffic sign positioning performance to the camera calibration accuracy, demonstrating the importance of good self-calibration. Thereafter, we compare the approaches to ego-motion and depth estimation, and to camera self-calibration, that compose the 3D sign positioning system. Finally, the relative and absolute traffic sign positioning errors corresponding to Approaches A and B are evaluated. For the above comparisons, we use the traffic signs found in the raw KITTI odometry dataset [geiger2012we], sequences (Seq) 0 to 10 (Seq 3 is missing from the raw dataset), unless specified otherwise.
IV-A Ground Truth Traffic Sign Positions
While 3D object localization datasets usually contain annotations for volumetric objects, such as vehicles and pedestrians, such annotations for near-planar objects like traffic signs are lacking. Furthermore, related works dealing with 3D traffic sign positioning have relied upon closed-source datasets [dabeer2017end, welzel2014accurate]. Therefore, we generate the ground truth (GT) traffic sign positions required for validating the proposed approaches on the KITTI dataset. We choose the challenging KITTI dataset, commonly used for benchmarking ego-motion as well as depth estimation, because it contains the camera calibration parameters and synced LiDAR information that allow annotation of GT 3D traffic sign positions.
As shown in Fig. 5, the LiDAR scans corresponding to the captured images, along with the GT trajectory poses, are used to annotate the absolute as well as relative GT positions of the traffic signs. In total, we have annotated 73 signs across the 10 validation sequences.
IV-B Sensitivity to Camera Calibration
The state-of-the-art approach to 3D sign positioning relies upon multi-view geometry triangulation. In this section, we analyze the sensitivity of this method to errors in the estimates of the camera focal lengths and principal point. To evaluate the sensitivity, we introduce error into the GT camera intrinsics and perform SLAM, both with (w/) and without (w/o) loop closure (LC), using the incorrect camera matrix, followed by sign position triangulation using the full trajectory. The performance for each set of camera intrinsics is then evaluated as the mean relative positioning error normalized by the number of signs successfully triangulated. We perform this analysis for KITTI Seq 5 (containing multiple loops) and Seq 7 (containing a single loop). For each combination of camera parameters, we repeat the experiment 10 times and report the minimum of the above metric.
One-at-a-Time
The one-at-a-time (OAT) sensitivity analysis measures the effect of error (−15% to +15%) in a single camera parameter while keeping the others at their GT values. Fig. 6 shows the sensitivity of the sign positioning performance to errors in the focal lengths ($f_x$ and $f_y$ are varied simultaneously) and the principal point ($c_x$ and $c_y$ are varied simultaneously). The performance w/ LC is better than that w/o LC. Furthermore, the performance gap between triangulation w/ and w/o LC is larger when the number of loops is higher (Seq 5). Moreover, the triangulation is more sensitive to underestimating the focal length and overestimating the principal point, primarily at large errors.
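The perturbation itself is a simple scaling of the intrinsic matrix entries; a sketch, with the parameter grouping described above (focal lengths varied together, principal point varied together):

```python
import numpy as np

def perturbed_K(K_gt, focal_err_pct=0.0, pp_err_pct=0.0):
    """Scale (f_x, f_y) and/or (c_x, c_y) of the GT intrinsics by a percentage."""
    K = K_gt.astype(float).copy()
    K[0, 0] *= 1.0 + focal_err_pct / 100.0   # f_x
    K[1, 1] *= 1.0 + focal_err_pct / 100.0   # f_y, varied together with f_x
    K[0, 2] *= 1.0 + pp_err_pct / 100.0      # c_x
    K[1, 2] *= 1.0 + pp_err_pct / 100.0      # c_y, varied together with c_x
    return K

# OAT sweep over the focal lengths, keeping the principal point at its GT value:
# for e in range(-15, 16, 3):
#     run SLAM + triangulation with perturbed_K(K_gt, focal_err_pct=e)
```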
Interaction
The interaction sensitivity analysis measures the effect of error (−5% to +5%) while varying the focal lengths and the principal point simultaneously. Fig. 7 shows the sensitivity to the combined errors in focal lengths and principal point for Seq 5 and Seq 7. The sensitivity to varying the principal point is higher than the sensitivity to varying the focal length for both sequences. Furthermore, for this smaller range of errors, underestimating the focal length and overestimating the principal point results in better performance than the opposite. This is in contrast to the observed effect when the percentage errors in the intrinsics are higher (cf. Fig. 6). Note that the best performance is not achieved at zero percentage error in the focal length and principal point. We conclude that accurate estimation of the camera intrinsics is pertinent for accurate sign positioning.
IV-C Sign Positioning Components Analysis
In order to compute the sign positions, we need the camera intrinsics through self-calibration, and the ego-motion/depth maps, as shown in Fig. 2. Here we quantitatively compare state-of-the-art deep learning and multi-view geometry based methods for monocular camera self-calibration, as well as depth and ego-motion estimation. For these experiments, Monodepth 2 and VITW are trained on 44 sequences from KITTI raw in the city, residential, and road categories.
Self-Calibration
Table I shows the average percentage error for self-calibration with VITW and Colmap. VITW estimates the camera intrinsics for each pair of images in a sequence. Therefore, we compute the mean ($\mu$) and median (m) of each parameter across image pairs as the final estimate. To evaluate the impact of turns on self-calibration with VITW, we also compute the parameters considering only those frames detected as part of a turn (through the RDP algorithm). The multi-view geometry based Colmap gives the lowest average percentage error for each parameter. The second best self-calibration estimate is given by VITW Turns (m). However, both of the above fail to self-calibrate the camera on Seq 4, which does not contain any turns. For such a sequence, VITW (m) performs better than VITW ($\mu$). All methods underestimate the focal lengths and overestimate the principal point. Moreover, VITW estimates the focal lengths with a higher magnitude of error than the principal point. The upper bound on the error in the estimate of $f_x$ ($f_y$) is inversely proportional to the amount of rotation about the $y$ ($x$) axis [gordon2019depthwild]. Therefore, the estimates of $c_x$ and $c_y$ are better than those of $f_x$ and $f_y$ for all methods, because of the near-planar motion in the sequences.
Table I: Average percentage error (mean ± std) of self-calibration.

Method | $f_x$ (%) | $f_y$ (%) | $c_x$ (%) | $c_y$ (%)
Colmap | 0.90 ± 1.51 | 0.90 ± 1.51 | 0.77 ± 0.40 | 1.76 ± 0.34
VITW ($\mu$) | 23.12 ± 6.68 | 25.55 ± 3.28 | 1.07 ± 0.59 | 3.64 ± 1.14
VITW (m) | 23.20 ± 7.41 | 25.33 ± 3.63 | 0.99 ± 0.61 | 3.32 ± 1.10
VITW Turns ($\mu$) | 14.69 ± 4.34 | 22.96 ± 2.83 | 1.25 ± 0.38 | 4.08 ± 1.17
VITW Turns (m) | 11.82 ± 6.65 | 22.62 ± 2.92 | 1.20 ± 0.34 | 3.88 ± 1.18
Ego-Motion
Table II shows the average absolute trajectory errors (ATE) in meters for the full trajectories [horn1987closed] and for 5-frame subsequences (ATE-5) [zhou2017unsupervised] from ego-motion estimation. The multi-view geometry based ORB-SLAM w/ LC has the lowest ATE full. However, ORB-SLAM w/o LC has a higher local agreement with the GT trajectory, as depicted by its lowest ATE-5 mean of 0.014 m. Both ORB-SLAM variants suffer from tracking failure on Seq 1, unlike Monodepth 2 and VITW. For Seq 1, VITW performs better than Monodepth 2. While Monodepth 2 has the lowest ATE-5 std, and an ATE-5 mean similar to that of ORB-SLAM w/o LC, its ATE full is much higher than that of the ORB-SLAM variants.
Table II: Average absolute trajectory errors (m) of ego-motion estimation.

Method | ATE Full | ATE-5 Mean | ATE-5 Std
ORB-SLAM (w/ LC) | 17.034 | 0.015 | 0.017
ORB-SLAM (w/o LC) | 37.631 | 0.014 | 0.015
VITW (Learned) | 85.478 | 0.031 | 0.026
Monodepth 2 (Average) | 66.494 | 0.014 | 0.010
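For reference, the ATE-5 metric can be sketched as follows, loosely following the snippet protocol of Zhou et al. [zhou2017unsupervised]: each 5-frame snippet is anchored at its first frame and aligned to the ground truth with a least-squares scale before computing the RMSE. Details of the official evaluation code may differ.

```python
import numpy as np

def ate5(est, gt, window=5):
    """Mean/std of snippet ATE over (N, 3) estimated and GT positions."""
    errs = []
    for i in range(len(est) - window + 1):
        x = est[i:i + window] - est[i]                # anchor snippet at frame i
        y = gt[i:i + window] - gt[i]
        scale = np.sum(x * y) / max(np.sum(x ** 2), 1e-12)  # LS scale factor
        errs.append(np.sqrt(np.mean(np.sum((scale * x - y) ** 2, axis=1))))
    return float(np.mean(errs)), float(np.std(errs))
```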
Depth
Table III shows the performance of depth estimation based on the metrics defined by Zhou et al. [zhou2017unsupervised]. While Monodepth 2 outperforms VITW on all metrics, its training uses the average camera parameters of the dataset being trained on, thereby necessitating some prior knowledge of the dataset.
Table III: Depth estimation performance.

Method | Abs Rel Diff | Sq Rel Diff | RMSE | RMSE (log) | $\delta < 1.25$ | $\delta < 1.25^2$ | $\delta < 1.25^3$
VITW | 0.172 | 1.325 | 5.662 | 0.246 | 0.767 | 0.920 | 0.970
Monodepth 2 | 0.138 | 1.132 | 5.121 | 0.211 | 0.838 | 0.948 | 0.979
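For reference, a minimal sketch of these standard metrics, assuming median scaling of the scale-ambiguous predictions and omitting the usual depth capping (e.g., at 80 m) for brevity:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Abs Rel, Sq Rel, RMSE, RMSE(log), and delta < 1.25^k accuracies."""
    mask = gt > 0                                    # valid ground-truth pixels
    pred, gt = pred[mask], gt[mask]
    pred = pred * (np.median(gt) / np.median(pred))  # median scaling
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = [float(np.mean(ratio < 1.25 ** k)) for k in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, rmse_log, deltas
```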
Table IV: Mean relative sign positioning error (m, mean ± std) and average number of successfully positioned signs $\bar{N}$, for combinations of self-calibration (rows) with ego-motion estimation (Approach A) and depth estimation (Approach B).

Calibration | ORB-SLAM w/ LC: $\bar{\epsilon}_{full}$ / $\bar{N}$ / $\bar{\epsilon}_{short}$ | ORB-SLAM w/o LC: $\bar{\epsilon}_{full}$ / $\bar{N}$ / $\bar{\epsilon}_{short}$ | VITW: $\bar{\epsilon}_{depth}$ / $\bar{N}$ | Monodepth 2: $\bar{\epsilon}_{depth}$ / $\bar{N}$
Colmap | 1.01 ± 0.32 / 4.2 / 0.24 ± 0.09 | 2.05 ± 0.28 / 4.1 / 0.58 ± 0.07 | 5.51 ± 3.3 / 1.98 | 2.97 ± 3.6 / 1.64
VITW (m) | 7.94 ± 6.19 / 2.3 / 2.74 ± 1.67 | 10.03 ± 2.85 / 1.8 / 5.71 ± 1.94 | 7.07 ± 3.4 / 3.49 | 4.17 ± 3.6 / 2.61
VITW Turns (m) | 5.21 ± 2.70 / 3.4 / 1.29 ± 0.72 | 4.09 ± 0.67 / 3.4 / 1.00 ± 0.22 | 5.93 ± 3.3 / 2.12 | 3.53 ± 3.5 / 1.98
Table V: Mean relative (Rel) and absolute (Abs) 3D traffic sign positioning errors (m) per sequence using the proposed scheme.

Seq | 0 | 1 | 2 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Average
Rel | 0.35 | 1.09 | 0.24 | 2.1 | 0.07 | 0.48 | 0.22 | 0.82 | 0.32 | 0.10 | 0.58
Abs | 0.79 | 1.56 | 0.84 | 4.62 | 0.20 | 1.19 | 0.34 | 4.32 | 0.92 | 0.60 | 1.54
Thus, we conclude that for self-calibration, Colmap, VITW (m), and VITW Turns (m) are the better choices. For sign positioning with Approach A using ego-motion estimation, ORB-SLAM (w/ and w/o LC) are the better choices. However, for sign positioning with Approach B using depth estimation, both Monodepth 2 and VITW need to be considered. Finally, we hypothesize that a combination of multi-view geometry and deep learning approaches is needed for successful sign positioning across all sequences.
IV-D 3D Positioning Analysis
We compare the accuracy of 3D traffic sign positioning using Approach A against Approach B. We also compare the effect of multi-view geometry and deep learning based self-calibration on the 3D sign positioning accuracy. As the metric, we compute the average relative sign positioning error normalized by the number of signs successfully positioned.
Table IV compares the mean performance of 3D traffic sign positioning for the different combinations of self-calibration and depth/ego-motion estimation techniques. The average relative sign positioning error using the full and short trajectories is denoted by $\bar{\epsilon}_{full}$ and $\bar{\epsilon}_{short}$ respectively, while $\bar{N}$ denotes the average number of successfully positioned signs. The relative sign positioning error using depth maps in Approach B is denoted by $\bar{\epsilon}_{depth}$. Note that the best average performance is given by Approach A using Colmap for self-calibration and short ORB-SLAM (w/o LC) for ego-motion estimation. The better performance of ORB-SLAM w/o LC for relative sign positioning is explained by its lower ATE-5 (Table II) as compared to ORB-SLAM w/ LC. Therefore, Approach A using short subsequences for triangulation generally performs better than Approach B. However, this is not the case for all sequences. For Seq 1, where ORB-SLAM fails tracking, the best sign positioning error is given by Approach B using a combination of Colmap for self-calibration and Monodepth 2 for depth estimation. For Seq 4, which does not contain any turns, calibration with Colmap or VITW Turns (m) is not feasible, and VITW (m) has to be used.
While ORB-SLAM (short) w/o LC gives a better relative positioning error than w/ LC, the average absolute positioning error is lower when using ORB-SLAM w/ LC than w/o LC. This is because loop closures help correct the accumulated trajectory drift, thereby improving the absolute positions of the traffic signs.
We therefore propose a scheme for crowdsourced 3D traffic sign positioning that combines the strengths of multi-view geometry and deep learning techniques for self-calibration, ego-motion, and depth estimation to increase the map coverage. This scheme is shown in Fig. 8. The mean relative and absolute 3D traffic sign positioning errors for each validation sequence, computed using this scheme, are shown in Table V. With this approach, our single-journey average relative and absolute sign positioning errors per sequence are 0.58 m and 1.54 m respectively. The average relative positioning error for all frames is 0.39 m, while the absolute positioning error for all signs is 1.26 m. Our relative positioning accuracy is comparable to that of [dabeer2017end], which, unlike our framework, uses a camera with known intrinsics, GPS, as well as an IMU to estimate the traffic sign positions. Our absolute positioning accuracy is comparable to that of [welzel2014accurate], which also assumes prior knowledge of the camera intrinsics as well as the size and height of traffic signs.
V Conclusion
In this paper, we proposed a framework for 3D traffic sign positioning using crowdsourced data from only monocular color cameras and GPS, without prior knowledge of the camera intrinsics. We demonstrated that combining the strengths of multi-view geometry and deep learning based approaches to self-calibration, depth, and ego-motion estimation results in increased map coverage. We validated our framework on traffic signs in the public KITTI dataset for single-journey sign positioning. In the future, the sign positioning accuracy can be further improved through optimization over multiple journeys on the same path. We will also explore the effects of camera distortion and rolling shutter in crowdsourced data to expand the scope of our method.