1 Introduction
In contrast to rotating lidars, the SSL uses an optical phased array as a transmitter of depth measuring light pulses. Since the builtin electronics can independently steer pulses of light by shifting its phase as it is projected through the array, the SSL can focus its attention on the parts of the scene important for the current task. Taskdriven reactive control steering hundreds of thousands of rays per second using only an onboard computer is a challenging problem, which calls for highly efficient parallelizable algorithms. As a first step towards this goal, we propose an active mapping method for SSLlike sensors, which simultaneously (i) learns to reconstruct a dense 3D voxelmap from sparse depth measurements and (ii) optimize the reactive control of depthmeasuring rays, see Figure 1. The proposed method is evaluated on a subset of the KITTI dataset [5], where sparse SSL measurements are artificially synthesized from captured lidar scans, and compared to a stateoftheart 3D reconstruction approach [3].
The main contribution of this paper lies in proposing a computationally tractable approach for very highdimensional active perception task, which couples learning of the 3D reconstruction with the optimization of depthmeasuring rays. Unlike other approaches such as active object detection [6] or segmentation [7]
, SSLlike reactive control has significantly higher dimensionality of the stateaction space, which makes a direct application of unsupervised reinforcement learning
[6] prohibitively expensive. Keeping the onboard reactive control in mind, we propose prioritized greedy optimization of depth measuring rays, which in contrast to a naïve greedy algorithm reevaluates only rays in each iteration. We derive the approximation ratio of the proposed algorithm.The 3D mapping is handled by an iteratively learned convolution neural network (CNN), as CNNs proved their superior performance in
[3, 17]. The iterative learning procedure stems from the fact that both (i) the directions in which the depth should be measured and (ii) the weights of the 3D reconstruction network are unknown. We initialize the learning procedure by selecting depthmeasuring rays randomly to learn an initial 3D mapping network which estimates occupancy of each particular voxel. Then, using this network, depthmeasuring rays along the expected vehicle trajectory can be planned based on the expected reconstruction (in)accuracy in each voxel. To reduce the trainingplanning discrepancy, the mapping network is relearned on optimized sparse measurements and the whole process is iterated until validation error stops decreasing.
2 Previous work
High performance of imagebased models is demonstrated in [14], where a CNN pooling results from multiple rendered views outperforms commonly used 3D shape descriptors in object recognition task. Qi [10] compare several volumetric and multiview network architectures and propose an anisotropic probing kernel to close the performance gap between the two approaches. Our network architecture uses a similar design principle.
Choy [3] proposed a unified approach for single and multiview 3D object reconstruction which employs a recurrent neural architecture. Despite providing competitive results in the object reconstruction domain, the architecture is not suitable for dealing with highdimensional outputs due to its high memory requirements and would need significant modifications to train with fullresolution maps which we use. We provide a comparison of this method to ours in Sec. 6.2, in a limited setting.
Modelfitting methods such as [13, 15, 11] rely on a manuallyannotated dataset of models and assume that objects can be decomposed into a predefined set of parts. Besides that these methods are suited mostly for manmade objects of rigid structure, fitting of the models and their parts to the input points is computationally very expensive; e.g., minutes per input for [13, 15]. Decomposition of the scene into plane primitives as in [8] does not scale well with scene size (quadratically due to candidate pairs) and could not most likely deal with the level of sparsity we encounter.
Geometrical and physical reasoning comprising stability of objects in the scene is used by Zheng [18] to improve object segmentation and 3D volumetric recovery. Their assumption of objects being aligned with coordinate axes which seems unrealistic in practice. Moreover, it is not clear how to incorporate learned shape priors for complex realworld objects which were shown to be beneficial for many tasks (e.g., in [9]). Firman [4] use a structuredoutput regression forest to complete unobserved geometry of tabletopsized objects. A generative model proposed by Wu [17]
, termed Deep Belief Network, learns joint probability distribution
of complex 3D shapes across various object categories .Endtoend learning of stochastic motion control policies for active object and scene categorization is proposed by Jayaraman and Grauman [6]. Their CNN policy successively proposes views to capture with RGB camera to minimize categorization error. The authors use a lookahead error as an unsupervised regularizer on the classification objective. Andreopoulos [2] solve the problem of an active search for an object in a 3D environment. While they minimize the classification error of a single yet apriori unknown voxel containing the searched object, we minimize the expected reconstruction error of all voxels. Also, their action space is significantly smaller than ours because they consider only local viewpoint changes at the next position while the SSL planning chooses from tens of thousands of rays over a longer horizon.
3 Overview of the active 3D mapping
We assume that the vehicle follows a known path consisting of discrete positions and a depth measuring device (SSL) can capture at most rays at each position. The set of rays to be captured at position is denoted .
We denote the global groundtruth occupancy map, its estimate, and the map of the sparse measurements. All these map share common global reference frame corresponding to the first position in the path. For each of these maps there are local counterparts , and , respectively. Local maps corresponding to position all share a common reference frame which is aligned with the sensor and captures its local neighborhood. The global groundtruth map is used to synthesize sensor measurements and to generate local groundtruth maps for training.
The active mapping pipeline, consisting of a measurereconstructplan loop, is depicted in Fig. 1 and detailed in Alg. 1.
Neglecting sensor noise, the set of depthmeasuring rays obtained from the planning, the measurements , and the resulting reconstruction can all be seen as a deterministic function of mapping parameters and . If we assume that that groundtruth maps come from a probability distribution, both learning of and planning of the depthmeasuring rays approximately minimize common objective
(1) 
where is the weighted logistic loss, and denote the elements of and , respectively, corresponding to voxel . In learning, are used to balance the two classes, empty with and occupied with , and to ignore the voxels with unknown occupancy. We assume independence of measurements and use, for corresponding voxels , additive updates of the occupancy confidence with . denotes the conditional probability of voxel being occupied given measurements and is its current estimate.
4 Learning of 3D mapping network
The learning is defined as approximate minimization of Equation 1. Since (i) the result of planning is not differentiable with respect to and (ii) we want to reduce variability of training data^{1}^{1}1We introduce a canonical frame by using the local maps instead of the global ones, which helps the mapping network to capture local regularities., we locally approximate the criterion around a point as
(2) 
by fixing the result of planning in . We also introduce a canonical frame by using the local maps instead of the global ones, which helps the mapping network to capture local regularities. The learning then becomes the following iterative optimization
(3) 
where minimization in each iteration is tackled by Stochastic Gradient Descent. Learning is summarized in Alg.
2.Note, that in order to achieve (i) local optimality of the criterion and (ii) statistical consistency of the learning process (i.e., that the training distribution of sparse measurements corresponds to the one obtained by planning), one would have to find a fixed point of Equation 3. Since there are no guarantees that any fixed point exists, we instead iterate the minimization until validation error is decreasing.
The mapping network consists of 6 convolutional layers with kernels followed by linear rectifier units (elementwise
) and, in 2 cases, by max pooling layers with
kernels and stride
, see Fig. 2. In the end, there is an fourfold upsampling layer so that the output has same size as input. The network was implemented in MatConvNet [16].5 Planning of depth measuring rays
Planning at position searches for a set of rays , which approximately minimizes the expected logistic loss between ground truth map and reconstruction obtained from sparse measurements at the horizon . The result of planning is the set of rays , which will provide measurements for a sparse set of voxels. This set of voxels is referred to as covered by and denoted as . While the mapping network is trained offline on the groundtruth maps, the planning have to search the subset of rays online without any explicit knowledge of the groundtruth occupancy . Since it is not clear how to directly quantify the impact of measuring a subset of voxels on the reconstruction , we introduce simplified reconstruction model , which predicts the loss based on currently available map . This model conservatively assumes that the reconstruction in covered voxels is correct (i.e. ) and reconstruction of not covered voxels does not change (i.e. ). Given this reconstruction model, the expected loss simplifies to:
(4) 
Since the groundtruth occupancy of voxels is apriori unknown, neither the voxelwise loss nor the coverage are known. We model the expected loss in voxel as
(5) 
where
is the entropy of the Bernoulli distribution with parameter
, denoting the probability of outcome from the possible outcomes. The vector of concatenated losses
is denoted .The length of particular rays is also unknown, therefore coverage of voxels by particular rays cannot be determined uniquely. Consequently, we introduce probability that voxel will not be covered by ray . This probability is estimated from currently available map as the product of (i) the probability that the voxels on ray which lie between voxel and the sensor are unoccupied and (ii) the probability that at least one of the following voxels or the voxel itself are occupied. If ray does not intersect voxel , then . The vector of probabilities for ray is denoted . Assuming that rays are independent measurements, the expected loss is modeled as .
The planning searches for the set of subsets of depthmeasuring rays for the following positions, which minimize the expected loss, subject to budget constraints
(6) 
where denotes cardinality of the set .
This is a nonconvex combinatorial problem^{2}^{2}2In our experiments, the number of possible combinations is greater then . which needs to be solved online repeatedly for millions of potential rays. We tried several convex approximations, however the highdimensional optimization has been extremely time consuming and the improvement with respect to the significantly faster greedy algorithm was negligible. As a consequence of that, we have decided to use the greedy algorithm. We first introduce its simplified version (Alg. 3) and derive its properties, the significantly faster prioritized greedy algorithm (Alg. 4) is explained later.
We denote the list of available rays at position as . At the beginning, the list of all available rays is initialized as follows . Alg. 3 successively builds the set of selected rays . In each iteration the best ray is selected, added into and removed from . The position from which the ray is chosen is denoted . If the budget of is reached, all rays from are removed from .
In order to avoid multiplication of all selected rays at each iteration, we introduce the vector , which keeps voxel loss. Vector is initialized as and whenever ray is selected, voxel losses are updated as follows , where denotes elementwise multiplication.
The rest of this section is organized as follows: Section 5.1 shows the upper bound for the approximation ratio of the greedy algorithm. Section 5.2 introduces the prioritized greedy algorithm, which in each iteration needs to reevaluate the cost function only for a small fraction of rays.
5.1 Approximation ratio of the greedy algorithm
We define the approximation ratio of a minimization algorithm to be , where is the cost function achieved by the algorithm and opt is the optimal value of the cost function. Given , we know that the algorithm provides solution whose value is at most . In this section we derive the upper bound of the approximation ratio of Algorithm 3. Figure 3 shows values of for different number of positions .
The greedy algorithm successively selects rays that reduce the cost function the most. To show how cost function differs from opt, an upper bound on the cost function need to be derived. Let us suppose that in the beginning of an arbitrary iteration we have voxel losses given by vector , the following lemma states that for arbitrary voxel , there always exists a ray , that reduces the cost function to , where is the unknown optimum value of the cost function which is achievable by rays .
Lemma 5.1.
If for some rays then
(7) 
Proof: We know that there is optimal solution consisting from rays. Without loss of generality we assume that holds for first rays, then
(8) 
This holds for an arbitrary positive scaling factor , therefore
(9) 
We sum up inequalities over all voxels
(10) 
We switch sums in the left hand side of the inequality to obtain addition of terms as follows
(11) 
Hence, we know that at least one of these terms has to be smaller than or equal to of the right hand side
Especially, if there is only one position, all optimal rays are either already selected or still available. This assumption allows to derive the following upper bound on the cost function of the greedy algorithm after iterations for .
Theorem 5.1.
Upper bound of the greedy algorithm after iterations is
(13) 
where and is Euler number.
Proof: We prove the upper bound by complete induction. In the beginning no ray is selected, pervoxel loss is and the value of the cost function . Using Lemma 5.1, we know that there exists ray such that , therefore we know that
(14) 
Greedy algorithm continues by updating the pervoxel loss . In the second iteration there are two possible cases: (i) we have either used the optimal ray in the first iteration, then the situation is better and we know there is rays which achieves optimum, or (ii) we have not selected the optimal ray in the first iteration, therefore we have still rays which achieves the optimum. Since the cost function reduction in the latter case gives the upper bound on the cost function reduction in the former one, we assume that there is still optimal rays available, therefore there exists ray such that
(15)  
We assume that the following holds
(16) 
and prove the inequality for . Using the assumption (16) and Lemma 5.1, the following inequalities hold
(17) 
Since ^{3}^{3}3 for . and , the upper bound for cost function of the greedy algorithm in th iteration is
Theorem 5.1 reveals that the approximation ratio of the greedy algorithm after iterations has following upper bound
(18) 
We can simply find by considering for each voxel the best rays independently.
So far we have assumed that the greedy algorithm chooses only rays and that all rays are available in all iterations. Since there are positions and the greedy algorithm can choose only rays at each position, some rays may be no longer available when choosing th ray. In the worst case possible, the rays from the most promising position will become unavailable. Since we have not chosen optimal rays we can no longer achieve opt. Nevertheless, we can still choose from rays which achieve a new optimum.
We introduce as the optimum achievable after closing positions. Obviously . Let us assume that, when the first position is closed we cannot lose more than , therefore . Without any additional assumption, could be arbitrarily large. We discuss potential assumptions later. Similarly , and . The following theorem states the upper bound for as a function of .
Theorem 5.2.
Upper bound of the greedy algorithm after iterations is
(19) 
where
Proof.
We start from the result (17) shown in the proof of Theorem 5.1. Since there is rays achieving optimum , the cost function in th iteration is bounded as follows
(20) 
In the th iteration, there are two possible cases: (i) rays from some position become not available and there is rays available which can achieve a new optimum which is not higher than or (ii) all rays are available and there is still rays which achieve . Noticing that the upper bound is increasing in and , we can cover both cases by considering there is still rays which achieves , therefore
(21) 
We can now continue up to the iteration in which the upper bound is as follows
(22) 
For th iteration the situation is similar as for th iteration. In order to cover both cases, we consider that there is rays which achieves and continue up to the th iteration, which yields the following upper bound
(23)  
Finally after iterations the upper bound is
(24) 
The last inequality stems from the fact that and that . ∎
Finally we derive the upper bound of the approximation ratio .
Theorem 5.3.
Upper bound of the approximation ratio is
(25) 
where is lower bound of the opt.
Proof:
(26)  
The approximation ratio depends on the opt, if then , if then . If we make an assumption that each position covers only fraction of voxels, then . Figure 3 shows values of for different ratios of for this case.
5.2 Prioritized greedy planning
In practice we observed a significant speed up of the greedy planning (Alg. 3) by imposing prioritized search for . Namely, let us denote the decrease of the expected reconstruction error achieved by selecting ray in iteration , and show that it is nonincreasing. For and it follows that . Summing the inequalities for all voxels , we get
(27) 
for an arbitrary ray selected in iteration . Note that for any .
Now, when we search for maximizing in decreasing order of , , we can stop once for the next ray because none of the remaining rays can be better than . Moreover, we can take advantage of the fact that all the remaining rays including remained sorted when updating the priority for the next iteration. The proposed planning is detailed in Alg. 4.
6 Experiments
Dataset
All experiments were conducted on selected sequences from categories City and Residential from the KITTI dataset [5]. We first brought the point clouds (captured by the Velodyne HDL64E laser scanner) to a common reference frame using the localization data from the inertial navigation system (OXTS RT 3003 GPS/IMU) and created the groundtruth voxel maps from these. The voxels traced from the sensor origin towards each measured point were updated as empty except for the voxels incident with any of the end points which were updated as occupied for each incident end point. The dynamic objects were mostly removed in the process since the voxels belonging to these objects were also many times updated as empty while moving. All maps used axisaligned voxels of edge size .
For generating the sparse measurements, we consider an SSL sensor with the field of view of horizontally and vertically discretized in directions. At each position, we select rays and raytrace in these directions until an occupied voxel is hit or the maximum distance of is reached. Only the rays which end up hitting an occupied voxel produce valid measurements, as is the case with the timeofflight sensors. Local maps and contain volume of discretized into voxels.
6.1 Active 3D mapping
In this experiment, we used and sequences from the Residential category for training and validation, respectively, and sequences from the City category for testing. We evaluate the iterative planninglearning procedure described in Sec. 4. For learning the mapping networks, we used learning rate
based on epoch number
, batch size , and momentum . Networks were trained for epochs.The ROC curves shown in Fig. 4 (left) are computed using groundtruth maps and predicted global occupancy maps . The performance of the network (denoted Coupled) significantly outperforms the network (Random), which shows the benefit of the proposed iterative planningmapping procedure. Examples of reconstructed global occupancy maps are shown in Fig. 5. Note that the valid measurements covered around of the input voxels.
6.2 Comparison to a recurrent imagebased architecture
We provide a comparison with the imagebased reconstruction method of Choy [3]. Namely, we modify their residual Res3DGRU3 network to use sparse depth maps of size instead of RGB images. The sensor pose corresponding to the last received depth map was used for reconstruction. The number of views were fixed to , with randomly selected depthmeasuring rays in each image. For this experiment, we used 20 sequences from the Residential category—18 for training, 1 for validation and 1 for testing. Since the Res3DGRU3 architecture is not suited for highdimensional outputs due to its high memory requirements, we limit the batch size to and the size of the maps to , which corresponds to recurrent units. Our mapping network was trained and tested on voxel maps instead of depth images.
The corresponding ROC curves, computed from local maps and , are shown in Fig. 4 (right). Both and networks outperforms the Res3DGRU3 network. We attribute this result mostly to the fact that our method is implicitly provided the known trajectory, while the Res3DGRU3 network is not. Another reason may be the rayvoxel mapping which is also known implicitly in our case, compared to [3].
7 Conclusions
We have proposed a computationally tractable approach for the very highdimensional active perception task. The proposed 3Dreconstruction CNN outperforms a stateoftheart approach by in recall, and it is shown that when learning is coupled with planning, recall increases by additional on the same false positive rate. The proposed prioritized greedy planning algorithm seems to be a promising direction with respect to onboard reactive control since it is about faster and requires only of ray evaluations compared to a naïve greedy solution.
Acknowledgment
The research leading to these results has received funding from the European Union under grant agreement FP7ICT609763 TRADR and No. 692455 EnableS3, from the Czech Science Foundation under Project 1708842S, and from the Grant Agency of the CTU in Prague under Project SGS16/161/OHK3/2T/13.
References
 [1] E. Ackerman. Quanergy announces $250 solidstate LIDAR for cars, robots, and more. In IEEE Spectrum, January 2016.
 [2] A. Andreopoulos, S. Hasler, H. Wersing, H. Janssen, J. K. Tsotsos, and E. Körner. Active 3D object localization using a humanoid robot. IEEE Transactions on Robotics, 27(1):47–64, 2011.
 [3] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3DR2N2: A unified approach for single and multiview 3D object reconstruction. In Computer Vision – ECCV 2016: 14th European Conference on, pages 628–644, Cham, 2016. Springer International Publishing.

[4]
M. Firman, O. M. Aodha, S. Julier, and G. J. Brostow.
Structured prediction of unobserved voxels from a single depth image.
In
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, pages 5431–5440, June 2016.  [5] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research, 32(11):1231–1237, Sept. 2013.
 [6] D. Jayaraman and K. Grauman. Lookahead before you leap: Endtoend active recognition by forecasting the effect of motion. In Computer Vision – ECCV 2016: 14th European Conference on, pages 489–505. Springer International Publishing, Cham, 2016.
 [7] A. K. Mishra, Y. Aloimonos, L. F. Cheong, and A. Kassim. Active visual segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):639–653, April 2012.
 [8] A. Monszpart, N. Mellado, G. J. Brostow, and N. J. Mitra. RAPter: Rebuilding manmade scenes with regular arrangements of planes. ACM Trans. Graph., 34(4):103:1–103:12, July 2015.
 [9] D. T. Nguyen, B. S. Hua, M. K. Tran, Q. H. Pham, and S. K. Yeung. A field model for repairing 3d shapes. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5676–5684, June 2016.
 [10] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multiview CNNs for object classification on 3D data. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5648–5656, June 2016.
 [11] J. Rock, T. Gupta, J. Thorsen, J. Gwak, D. Shin, and D. Hoiem. Completing 3d object shape from one depth image. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2484–2493, June 2015.
 [12] H. Seifa and X. Hub. Autonomous driving in the iCity—HD maps as a key challenge of the automotive industry. Autonomous Robots, 2(2):159–162, 2016.
 [13] C.H. Shen, H. Fu, K. Chen, and S.M. Hu. Structure recovery by part assembly. ACM Trans. Graph., 31(6):180:1–180:11, Nov. 2012.
 [14] H. Su, S. Maji, E. Kalogerakis, and E. LearnedMiller. Multiview convolutional neural networks for 3D shape recognition. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 945–953, Dec 2015.
 [15] M. Sung, V. G. Kim, R. Angst, and L. Guibas. Datadriven structural priors for shape completion. ACM Trans. Graph., 34(6):175:1–175:11, Oct. 2015.
 [16] A. Vedaldi and K. Lenc. Matconvnet: Convolutional neural networks for matlab. In Proceedings of the 23rd ACM International Conference on Multimedia, MM ’15, pages 689–692, New York, NY, USA, 2015. ACM.
 [17] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1912–1920, June 2015.

[18]
B. Zheng, Y. Zhao, J. C. Yu, K. Ikeuchi, and S. C. Zhu.
Beyond point clouds: Scene understanding by reasoning geometry and physics.
In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 3127–3134, June 2013.
Comments
There are no comments yet.