<h1 id="dynamic-photometric-odometry">Dynamic Photometric Odometry</h1>
<p>Anton Troynikov, 2019-05-04</p>
<p>Back in December 2015, I had the idea of using direct photometric odometry, which is used to estimate the motion of a camera through a static world, to track moving objects as well. There are two important core ideas:</p>
<ul>
<li>Motion of the camera relative to an object looks identical to the motion of an object relative to the camera in an image sequence (transform equivalence).</li>
<li>Moving objects create differences between consecutive image frames that cannot be explained by the assumption of a static world (photometric consistency violations).</li>
</ul>
<p>Doing tracking this way has several advantages over standard keypoint, template, optical flow, and other methods. By reducing the problem of rigid tracking to direct photometric alignment, our algorithm can operate on the same input data as direct photometric SLAMs. This means we get:</p>
<ul>
<li>3D tracking from just a monocular camera (in principle)</li>
<li>A unified, mathematically transparent optimization problem and all the tools that come with that.</li>
<li>A unified representation of the world we can use in whatever other applications we want (SLAM, loop closure).</li>
<li>The ability to apply advances from the direct SLAMs to object tracking (and many others).</li>
</ul>
<p>In 2018 I started to work on this more seriously, and in March 2019, with the help of <a href="https://vision.in.tum.de/members/demmeln">Nikolaus Demmel</a> and <a href="https://www.is.mpg.de/person/jstueckler">Jörg Stückler</a>, I completed the proof-of-concept and have some interesting results to share. The short version is a conditional ‘it works’. The proof of concept is pretty primitive and not at all performant, but I’ve shown the fundamental ideas to be sound. This write-up doesn’t go into the full depth of the work but should give the interested reader a solid overview.</p>
<h2 id="direct-photometric-alignment-primer">Direct photometric alignment primer</h2>
<p>Direct photometric approaches to odometry use image sequences to estimate rigid body transforms corresponding to the motion of a camera through a static world. Unlike keypoint-based approaches, the direct photometric approaches use all intensity information present in each image, by finding a transform that minimizes the so-called photometric residual.</p>
<p>The underlying idea of all the direct photometric methods is that a point in the world should have the same radiance regardless of where it is relative to the observing camera, and so it should look the same in every image where it appears. This is expressed in the photometric consistency assumption;</p>
<script type="math/tex; mode=display">I_1\left(\mathbf{x}_1\right) = I_2\left(\tau\left(\mathbf{x}_1, \mathbf{T}\right)\right) = I_2\left(\mathbf{x}_2\right), \forall \mathbf{x}_1 \in \mathbf{I}_1, \tau\left(\mathbf{x}_1, \mathbf{T}\right) \in \mathbf{I}_2</script>
<p>Here <script type="math/tex">\mathbf{x}_i</script> is a point in image <script type="math/tex">\mathbf{I}_i</script>, <script type="math/tex">\tau</script> is the reprojection function (sometimes called the warping function) which transforms a point in image <script type="math/tex">i</script> to a point in image <script type="math/tex">j</script> given the rigid body transform <script type="math/tex">\mathbf{T}</script>. <script type="math/tex">\mathbf{T}</script> is the transform from the coordinate frame of the camera observing image 1 to that observing image 2, i.e. the pose of camera 2 in the coordinate frame of camera 1.</p>
<p>The photometric residual is then just the difference between what we saw in image 2 and what we expected to see given the <script type="math/tex">\mathbf{T}</script> we supplied to the reprojection function;</p>
<script type="math/tex; mode=display">r\left(\mathbf{x_1}, \mathbf{T}\right) := \mathbf{I}_2\left(\tau\left(\mathbf{x}_1, \mathbf{T}\right)\right) - \mathbf{I}_1\left(\mathbf{x}_1\right)</script>
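<p>As a concrete sketch (a minimal illustration, not the implementation from this work), the residual for a single pixel with known depth can be computed by back-projecting with the camera intrinsics, transforming, reprojecting, and sampling; all function and variable names here are hypothetical:</p>

```python
import numpy as np

def warp_point(x1, depth, K, K_inv, T):
    """Reprojection function tau: map pixel x1 = (u, v) with known depth
    from image 1 into image 2.

    K is the 3x3 camera intrinsic matrix; T is taken here as the 4x4 rigid
    body transform mapping camera-1 coordinates into camera-2 coordinates
    (conventions for the direction of T vary).
    """
    u, v = x1
    p1 = depth * (K_inv @ np.array([u, v, 1.0]))  # back-project to a 3D point
    p2 = (T @ np.append(p1, 1.0))[:3]             # express it in camera 2's frame
    uvw = K @ p2                                  # project into image 2
    return uvw[:2] / uvw[2]

def photometric_residual(I1, I2, x1, depth, K, K_inv, T):
    """r(x1, T) = I2(tau(x1, T)) - I1(x1), with nearest-neighbour sampling."""
    u2, v2 = np.round(warp_point(x1, depth, K, K_inv, T)).astype(int)
    u1, v1 = x1
    return float(I2[v2, u2]) - float(I1[v1, u1])
```

<p>With the correct transform (trivially, the identity for two identical images) the residual vanishes; real implementations use bilinear interpolation rather than rounding to the nearest pixel.</p>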
<p>Finding an estimate for the most likely transform <script type="math/tex">\mathbf{T}^*</script> given observations on each pixel’s photometric residual <script type="math/tex">r_i</script> then reduces to a fairly standard nonlinear least-squares problem, for which Gauss-Newton works pretty well;</p>
<script type="math/tex; mode=display">\mathbf{T}^* = \arg \min_{\mathbf{T}} \sum_i w\left(r_i\right)\left(r_i\left(\mathbf{T}\right)\right)^2</script>
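<p>The weight function w(r) is not pinned down above; a common choice in direct methods (an assumption here, not necessarily the one used in this work) is the Huber weight, which leaves small residuals untouched and down-weights outliers:</p>

```python
def huber_weight(r, delta=1.0):
    """Huber robust weight w(r): 1 inside the inlier band |r| <= delta,
    decaying as delta / |r| outside it.

    In the weighted least-squares cost this stops large residuals (e.g.
    from moving objects or occlusions) from dominating the estimate.
    """
    a = abs(r)
    return 1.0 if a <= delta else delta / a
```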
<p>It’s kind of nuts that you can take the derivative of an image with respect to a rigid body transform and come out with something sensible, but in 2019 just about anything is possible. A more in-depth starting point for direct photometric approaches is <a href="https://vision.in.tum.de/_media/spezial/bib/kerl13icra.pdf">this paper by Kerl</a>.</p>
<h2 id="transform-equivalence">Transform equivalence</h2>
<p>A camera moving relative to a static object will observe the same thing as a static camera observing a moving object.</p>
<p><img src="/assets/images/post_images/transform_equivalence.png" alt="Transform Equivalence" /></p>
<p>This means that any method we care to use to estimate rigid body transforms corresponding to the motion of a camera through a static world, we may also use to estimate the motion of dynamic objects. This is pretty intuitive but things get a little subtle in our choice of coordinate frames. Reading the next part isn’t really necessary unless you want to try this for yourself.</p>
<h3 id="torturous-math">Torturous math</h3>
<p>Consider the case where the camera is moving relative to the world, and independently an object is moving relative to the camera and the world.
Let <script type="math/tex">W</script> be the coordinate frame attached to the static world, and <script type="math/tex">O</script> the coordinate frame attached to the moving object.</p>
<p>Assuming we can separate points belonging to the world from points belonging to the object, we may estimate the motion of the camera through the world from pose <script type="math/tex">c_1</script> to pose <script type="math/tex">c_2</script> as seen in the world frame as <script type="math/tex">\mathbf{T}^{W}_{c_1,c_2}</script>. Similarly, we can estimate the motion of the camera relative to the object (i.e. in the object frame) as <script type="math/tex">\mathbf{T}^O_{c_1,c_2}</script>.</p>
<p>Assume we have an estimate for the transform between the coordinate frame attached to the camera <script type="math/tex">C</script> and that attached to the object, for the first image: <script type="math/tex">\mathbf{T}^O_{c_1,o_1}</script>. Note that we may attach a coordinate frame to the object anywhere we like, including at a point outside the object itself. A natural choice for an arbitrary object may be the geometric centroid.</p>
<p>Assume also that we know the initial position of the camera in the world: <script type="math/tex">\mathbf{T}_{w,c_1}</script> — we can choose this arbitrarily to e.g. put the camera at the world origin.</p>
<p>Then to get the transform corresponding to the position of the object in the world frame at the time of the second image,</p>
<script type="math/tex; mode=display">\mathbf{T}_{w,o_2} = \mathbf{T}_{w_2,o_2}</script>
<p>(The world frame is the same for all times, i.e. <script type="math/tex">w = w_2</script>)</p>
<script type="math/tex; mode=display">=\mathbf{T}_{w_2,c_2}\mathbf{T}_{c_2,o_2}</script>
<p>(By chaining coordinate transforms: <script type="math/tex">\mathbf{T}_{a,c} = \mathbf{T}_{a,b}\mathbf{T}_{b,c}</script>)</p>
<script type="math/tex; mode=display">=\mathbf{T}_{w_2,c_2}\mathbf{T}^O_{c_2,o_2}</script>
<p>(<script type="math/tex">\mathbf{T}_{c_2,o_2}</script> is the same in any coordinate frame, since we observe the relative pose at the same time <script type="math/tex">t = 2</script>).</p>
<p>Finally we have:</p>
<script type="math/tex; mode=display">\mathbf{T}_{w,o_2} = \mathbf{T}_{w,c_1}\mathbf{T}^{W}_{c_1,c_2}(\mathbf{T}^O_{c_1,c_2})^{-1}\mathbf{T}^O_{c_1,o_1}</script>
<p>(Because <script type="math/tex">O</script> is fixed to the object, <script type="math/tex">\mathbf{T}^O_{c_1,o_2} = \mathbf{T}^O_{c_1,o_1}</script>. Also taking the inverse of a transform gives: <script type="math/tex">\mathbf{T}_{a,b}^{-1} = \mathbf{T}_{b,a}</script>.)</p>
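<p>The final chain of transforms can be sanity-checked numerically with 4×4 homogeneous matrices; the poses below are arbitrary made-up values, not data from any experiment:</p>

```python
import numpy as np

def se3(yaw=0.0, t=(0.0, 0.0, 0.0)):
    """Build a 4x4 rigid body transform from a yaw rotation and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = t
    return T

# Arbitrary example poses (hypothetical values).
T_w_c1 = se3(0.1, (1.0, 0.0, 0.0))     # initial camera pose in the world
T_c1c2_W = se3(0.05, (0.2, 0.0, 0.0))  # camera motion estimated in the world frame
T_c1c2_O = se3(-0.3, (0.0, 0.1, 0.0))  # camera motion estimated in the object frame
T_c1_o1 = se3(0.0, (0.0, 0.0, 2.0))    # initial object pose relative to the camera

# T_{w,o2} = T_{w,c1} T^W_{c1,c2} (T^O_{c1,c2})^{-1} T^O_{c1,o1}
T_w_o2 = T_w_c1 @ T_c1c2_W @ np.linalg.inv(T_c1c2_O) @ T_c1_o1
```

<p>Composing the same result the other way (camera pose in the world at time 2, then camera-to-object at time 2) gives an identical matrix, which is the consistency the derivation relies on.</p>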
<p>Now that we have done some torturous manipulations to get to an intuitive result, we can do what we already knew we could, but using math.</p>
<h2 id="photometric-consistency-violation-for-motion-detection">Photometric consistency violation for motion detection</h2>
<p>In the case of a static world and a moving camera, <script type="math/tex">\mathbf{T}</script> corresponding to the change in pose of the camera is enough to explain what we see between image frames. We reproject the static world from one image to the other, and everything lines up so the photometric residual is everywhere zero. Analogously, if there’s one moving object and a static camera, a single transform is sufficient.</p>
<p>In the case of two or more independently moving objects, one transform is no longer enough, and we get the situation illustrated below:</p>
<p><img src="/assets/images/post_images/moving_residuals.png" alt="Photometric consistency violation" />
Here two objects, A and B, are moving relative to a static camera. On the left we see the initial position of the objects in space, as well as where they’re projected into the image. In the center is what we observe in the second image. On the right is what we observe if we apply <script type="math/tex">\mathbf{T}_A</script> to points on both objects in the first frame. Wavy lines correspond to a large photometric residual.</p>
<p><script type="math/tex">\mathbf{T}_A</script> is the relative pose of the object A, <script type="math/tex">\mathbf{T}_B</script> is the relative pose of object B. If we apply the reprojection function <script type="math/tex">\tau(\mathbf{x},\mathbf{T}_A)</script> to points on both A and B, we get large photometric residuals at the image points marked by wavy lines; one set where B actually is but we didn’t expect it to be, and one set where we expected B to be but it isn’t.</p>
<p>Direct photometric approaches usually discard points with large photometric residuals and only use points from the static world. Instead we can use the large residuals created by moving objects as a cue that such an object might be present. Under the assumption that moving objects correspond to contiguous regions of pixels in the first image, we can use a Markov random field / graph-cut to segment them as in the classic approach of <a href="https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1989.tb01764.x">Greig et al.</a>.</p>
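<p>A full MRF/graph-cut solver is beyond the scope of a sketch, but the idea of turning high residuals into candidate regions can be illustrated with a cruder stand-in: threshold the absolute residual image, then group pixels by connected components. The names and thresholds here are illustrative, not from the actual pipeline:</p>

```python
import numpy as np

def candidate_regions(residuals, threshold=0.2, min_pixels=4):
    """Label 4-connected high-residual regions as moving-object candidates.

    A crude stand-in for the MRF/graph-cut segmentation: threshold the
    absolute photometric residual, flood-fill connected components, and
    keep only components large enough to plausibly be an object.
    """
    mask = np.abs(residuals) > threshold
    labels = np.zeros(mask.shape, dtype=int)
    next_label = 0
    h, w = mask.shape
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue
        next_label += 1
        labels[start] = next_label
        stack = [start]
        while stack:
            r, c = stack.pop()
            for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if 0 <= rr < h and 0 <= cc < w and mask[rr, cc] and not labels[rr, cc]:
                    labels[rr, cc] = next_label
                    stack.append((rr, cc))
    sizes = {i: int((labels == i).sum()) for i in range(1, next_label + 1)}
    keep = [i for i, n in sizes.items() if n >= min_pixels]
    return labels, keep
```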
<h3 id="some-problems">Some problems</h3>
<p>It would be nice if we could use this approach directly to find moving objects but there are some problems.</p>
<p>Large photometric residuals might be caused by things other than objects; reflections, lighting changes, sensor noise etc.</p>
<p>We can differentiate between motion and other sources of large photometric residuals by simply trying to estimate a motion for each high-residual region. If we can’t get an estimate, we can’t track the region anyway, and its large residuals likely come from some other source.</p>
<p>A trickier problem is that this method relies on object and background having enough texture. Unfortunately, this isn’t the case for many objects of interest.</p>
<p><img src="/assets/images/post_images/disambiguation.png" alt="Ambiguous photometric residuals." />
Here a white van is moving from left (a) to right (b), as observed from a static camera. The resulting photometric residuals are shown in (c); brighter pixels have a larger magnitude.</p>
<p>The important things to note are;</p>
<ul>
<li>Parts belonging to the moving object don’t have a large residual, because of the lack of texture (e.g. the center/front of the van).</li>
<li>Parts not belonging to the moving object do have a large residual, because the object occludes them (e.g. the vegetation to the left of the van).</li>
</ul>
<p>If we accumulate enough image frames from the sequence it might be possible to disambiguate between regions with large residuals that are due to occlusion and those belonging to the moving object. It may also be possible to accumulate all the pixels belonging to the moving object even if they don’t have much texture. We didn’t attempt that here but it would be interesting for future work. Additionally, this doesn’t admit real-time operation, since we would need to accumulate several, potentially many, frames before we have sufficient information.</p>
<p>Instead, we took the easy way out and used a standard semantic instance segmentation approach, using pre-trained <a href="https://github.com/matterport/Mask_RCNN">Mask R-CNN</a>. The segmentation masks that Mask R-CNN produces aren’t perfect, but we can refine them. We first dilate them slightly (2-3px), then after tracking using direct alignment, we exclude pixels with high photometric residuals from the segmentation. This doesn’t improve things much frame-to-frame but can provide a more consistent segmentation over time. We didn’t explore this idea in this work but it would be interesting to look into in the future.</p>
<h2 id="overall-algorithm">Overall Algorithm</h2>
<p>The core algorithm is as follows:</p>
<ol>
<li>Initialize a Keyframe consisting of an image frame, registered depth values, and an instance segmentation mask.</li>
<li>For a new input image frame, estimate a rigid body transform using all pixels in the keyframe.</li>
<li>Compute regions with large photometric residuals for the estimated transform.</li>
<li>Determine whether each such region corresponds to an instance segmentation. We used a simple intersection-over-union threshold.
<ul>
<li>Regions with large residuals that have a corresponding segmentation mask are treated as independently moving objects</li>
<li>Regions without large residuals are treated as belonging to the static world</li>
<li>Regions with large residuals but no corresponding segmentation mask are discarded.</li>
</ul>
</li>
<li>Estimate the motion of each object instance using all the corresponding pixels in the keyframe.</li>
</ol>
<p>Note that unlike most approaches that use instance segmentation in the motion estimation pipeline, we don’t try to estimate an independent motion for every instance, only for those we have determined to be moving. This has several advantages: the required computation is reduced, and static objects are not assigned small (erroneous) motions due to noisy data and estimates from a small number of points.</p>
<p>We further refine the segmentation and transform estimates in an alternating optimization (so-called ‘hard EM’, analogous to coordinate ascent). After computing outlying regions or pixels, we re-estimate the transform with only low residual pixels, initializing with the previously estimated transform. We repeat this process until convergence. This is not especially rigorous but does lead to fast convergence. A better approach for this refinement might be something like the multi-label joint segmentation-motion approach presented in <a href="http://ais.uni-bonn.de/papers/BMVC2013_Stueckler.pdf">Efficient Dense Rigid-Body Motion Segmentation and Estimation in RGB-D Video</a>, but this would generally be slower. Additionally, the segmentation is not guaranteed to find a global minimum, though we suspect that it would converge reasonably, except in pathological cases.</p>
<p>As presented, we track the camera motion and the moving objects against a single keyframe. It’s pretty straightforward to extend this approach to multiple keyframes; all that is required is to carefully handle the world coordinate frame, and to perform association between object instances in subsequent keyframes. We perform association through Intersection-Over-Union of the reprojected objects with instances in the new keyframe, but more sophisticated approaches are possible.</p>
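<p>The IoU-based association between reprojected objects and new-keyframe instances can be sketched as follows; the greedy matching and the 0.5 threshold are illustrative choices, not necessarily the ones used in the experiments:</p>

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0

def associate(reprojected, instances, threshold=0.5):
    """Greedily match each reprojected object mask to at most one new instance.

    reprojected / instances: dicts mapping ids to boolean masks.
    Returns a dict {object_id: instance_id} for matches above the threshold.
    """
    matches, used = {}, set()
    for obj_id, obj_mask in reprojected.items():
        best, best_iou = None, threshold
        for inst_id, inst_mask in instances.items():
            if inst_id in used:
                continue
            score = iou(obj_mask, inst_mask)
            if score >= best_iou:
                best, best_iou = inst_id, score
        if best is not None:
            matches[obj_id] = best
            used.add(best)
    return matches
```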
<p>Though we don’t yet pursue it here, a natural extension is to perform graph based bundle adjustment / joint optimization with multiple observations of the same object from multiple keyframes.</p>
<h2 id="evaluation">Evaluation</h2>
<p>We performed experiments with the <a href="https://robotic-esp.com/datasets/omd/">Oxford Multimotion Dataset (OMMD)</a>.</p>
<p><img src="/assets/images/post_images/ommd_example.png" alt="OMMD example frame" />
An example frame from OMMD. The boxes move independently with translational, rotational, and compound motions.</p>
<p>This dataset provides ground truth motions for several independently moving objects, observed from both static and moving cameras, in a fairly representative indoor scene. It comes with several sequences with both calibrated stereo and RGB-D data, as well as IMU data and Vicon tracking as ground truth.</p>
<h3 id="preprocessing">Preprocessing</h3>
<p>The raw RGB-D data is unusable without considerable filtering over the depth images, and is at a lower resolution and frame rate than the stereo data. To get usable depth we used <a href="https://github.com/siposcsaba89/sps-stereo">SPS-Stereo</a>, which gave good disparity estimates since most surfaces in the scene are planar. Converting disparity into depth is straightforward using the camera intrinsics provided with the dataset.</p>
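<p>The conversion is the standard rectified-stereo relation, depth = f · B / d (focal length in pixels times baseline over disparity); a minimal sketch with hypothetical parameter names, not the dataset’s calibration:</p>

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, min_disp=1e-3):
    """depth = f * B / d for a rectified stereo pair.

    focal_px: focal length in pixels; baseline_m: stereo baseline in metres.
    Pixels with (near-)zero disparity get depth 0 as an 'invalid' marker.
    """
    d = np.asarray(disparity, dtype=float)
    depth = np.zeros_like(d)
    valid = d > min_disp
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth
```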
<p><img src="/assets/images/post_images/ommd_disparity_comparison.png" alt="Depth/Disparity Comparison" />
Left: Raw depth data from the Intel RealSense stream included in the dataset. Right: Disparity estimate from SPS-Stereo. Still noisy but much cleaner, and usable for our experiments.</p>
<p>Additionally, the colorful boxes that are the moving objects of interest in this dataset don’t belong to any classes that Mask R-CNN recognizes, so we need some other way to make the required instance segmentations. Fortunately, the faces of the boxes are high saturation and high contrast, and they’re far enough apart that simple color thresholding followed by connected components gets us a reasonable segmentation without much work. An MRF binary segmentation pass helps us smooth out the noise.</p>
<p><img src="/assets/images/post_images/ommd_instance_mask.png" alt="Instance Masking via Color Thresholds" /></p>
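<p>The color-thresholding step can be sketched as follows, computing an HSV-style saturation per pixel and masking high-saturation, non-dark pixels; the thresholds are illustrative, not the tuned values from our pipeline:</p>

```python
import numpy as np

def saturation_mask(rgb, s_thresh=0.5, v_thresh=0.2):
    """Mask of high-saturation pixels in an RGB image (values in [0, 1]).

    HSV-style saturation computed as (max - min) / max per pixel; connected
    components of this mask then become instance candidates.
    """
    mx = rgb.max(axis=-1)
    mn = rgb.min(axis=-1)
    sat = np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-9), 0.0)
    return (sat > s_thresh) & (mx > v_thresh)
```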
<p>We have color, depth, and an instance segmentation, which is everything we need to construct a keyframe.</p>
<h3 id="tracking-results">Tracking Results</h3>
<p>We evaluate our approach against the ground truth provided by OMMD. We apply the appropriate coordinate transforms to get everything into the same frame. Here are some results of our approach.</p>
<p><img src="/assets/images/post_images/ommd_trans_tracks.png" alt="OMMD Translational tracks" />
Tracks for objects with translational motions only. Box 1 (left) moves back and forth relative to the camera, Box 3 (right) moves side to side. Blue is ground truth, green is the naive motion estimate, orange is the joint segmentation-motion estimate.</p>
<p>We’re able to get reasonable tracks for objects with translational motions from a static camera. Unfortunately, tracking rotating objects with this approach doesn’t work well:</p>
<p><img src="/assets/images/post_images/ommd_rot_tracks.png" alt="OMMD Rotational tracks" />
Tracks for objects with rotational motion. Box 2 (left) rotates on the spot while Box 4 (right) both rotates and translates. The ground truth track for Box 4 is not visible at this scale because the estimated trajectory diverges completely very quickly.</p>
<p>Why is this so? This result is somewhat unexpected, since rotational motions are not generally more difficult to track than translational motions when direct photometric alignment is applied to camera tracking. A clue is that direct photometric algorithms tend to be more susceptible to tracking loss for large motions between frames, especially translational motions.</p>
<h3 id="induced-virtual-translation">Induced Virtual Translation</h3>
<p>The (as yet not fully tested) theory I have about what’s going on is that tracking a rotating object introduces what can be thought of as a large ‘virtual’ translation. Consider the following diagram:</p>
<p><img src="/assets/images/post_images/cam_arc.png" alt="Induced Arc" /></p>
<p>An object (green) is rotating clockwise in space. The relative motion of the camera (black) has a rotational and a translational component, in the opposite direction to the rotation of the object. The translational component increases as the radius of the arc increases. The rotational components are equal and opposite for camera and object.</p>
<p>This ‘virtual’ translation will be large in the camera origin frame even though the relative rotation might be small: the distance between the camera center and the center of rotation of the object determines the radius of the induced arc, and hence the arc length itself can be quite large.</p>
<p>Because the target image frames are discrete in time, the virtual motion between frames has as translational component the chord of the start and end points of the virtual arc. This is given as <script type="math/tex">2r \sin (\theta / 2)</script> where <script type="math/tex">r</script> is the radius, and <script type="math/tex">\theta</script> is the rotation angle. Taking the small angle approximation <script type="math/tex">\sin(\theta) \approx \theta</script> we see that even for small angles, the chord length, and hence the virtual translational component, is linear in the distance from camera to object.</p>
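<p>A quick numerical illustration of the chord formula (with made-up values): a rotation of a few degrees per frame induces a virtual translation that grows linearly with the camera-object distance.</p>

```python
import numpy as np

def virtual_translation(radius_m, angle_rad):
    """Chord length 2 r sin(theta / 2): the translational component induced
    in the camera frame when an object at distance radius_m from the camera
    rotates by angle_rad between two frames."""
    return 2.0 * radius_m * np.sin(angle_rad / 2.0)

# A hypothetical 5-degree inter-frame rotation at 1 m, 2 m, and 4 m:
theta = np.deg2rad(5.0)
chords = [virtual_translation(r, theta) for r in (1.0, 2.0, 4.0)]
```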
<p>I haven’t examined this theory rigorously yet, but intuitively it seems very plausible. Testing it would require computing cost surfaces over a synthetic dataset, which seems very error prone and tedious, so I haven’t set about doing it yet.</p>
<h3 id="mitigation">Mitigation</h3>
<p>In the direct photometric alignment context, large motions between image frames have been compensated for by using a coarse-to-fine approach, where alignment is first performed on recursively subsampled (lower-resolution) versions of the keyframe and target image frame, usually with three or four levels of subsampling forming a so-called ‘pyramid’. Doing so broadens the basin of attraction for our optimization algorithm, allowing it to arrive at the desired minimum even if we initially start further from it. This approach is unfortunately not available to us in the object tracking application, because our objects of interest represent only a small fraction of the total pixels in a given keyframe. We therefore can only use one or two pyramid levels, which are unfortunately not enough to compensate for the large induced translation.</p>
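<p>For concreteness, such a pyramid can be built by repeated 2×2 averaging; this is a generic sketch of the standard construction, not our implementation:</p>

```python
import numpy as np

def build_pyramid(image, levels=3):
    """Coarse-to-fine pyramid: each level halves resolution by 2x2 averaging.

    Alignment starts at the coarsest level, and each estimate initializes the
    next-finer level, widening the basin of attraction for large motions.
    """
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        im = pyramid[-1]
        h, w = (im.shape[0] // 2) * 2, (im.shape[1] // 2) * 2
        im = im[:h, :w]  # crop odd rows/cols so 2x2 blocks tile exactly
        down = 0.25 * (im[0::2, 0::2] + im[1::2, 0::2] + im[0::2, 1::2] + im[1::2, 1::2])
        pyramid.append(down)
    return pyramid  # pyramid[-1] is the coarsest level
```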
<p>Another approach is to instead reformulate the reprojection function so as to perform alignment in the object rather than the camera’s origin frame. This would work fine if the camera was stationary in the world, but doesn’t help us in the case of the moving camera, since the camera cannot disambiguate its own motion relative to the object from the object’s motion in the object’s origin frame. I’ll save you another round of tortuous coordinate manipulation, but demonstrating this is straightforward.</p>
<p>One final idea is to use a feature-based approach as an initialization for the direct alignment approach. <a href="http://www.robots.ox.ac.uk/~mobile/esp/Papers/judd_iros18.pdf">This paper</a> from Judd et al. demonstrates such a feature-based approach for multi-object and camera motion tracking. This should get us closer to the desired minimum in the basin of attraction for our optimization algorithm, but on the other hand initializing from features negates many of the advantages of direct photometric alignment, is kind of ugly, and we might as well return to the caves. That said, it’s likely that this will be the approach I take for the next steps until I can think of something better.</p>
<h3 id="some-additional-notes">Some additional notes</h3>
<p>Joint segmentation-motion estimation did not significantly improve our tracking results, but significantly increased the runtime for each instance, since we need to perform the (expensive) direct alignment step multiple times. I suspect this is because we got reasonably good segmentations right away, and hence jointly optimizing over the segmentation and motion did not produce a significant improvement over just estimating the motion. We investigated the segmentation performance in depth but the results were not very interesting, so I omit them here. It would be interesting to find the point where this tradeoff starts to matter; one could imagine a very simple blob detector being good enough for initialization here, then further refined via the photometric cost term. This is something I also haven’t done yet due to the lack of a suitable dataset and the tedious experimentation required.</p>
<p>The consistent misalignment between ground truth and estimated frames is probably due to a fixed misalignment between the initial coordinate frames; we could evaluate our results by first doing e.g. Horn alignment between the estimated and ground truth trajectories, as commonly used and suggested in <a href="https://vision.in.tum.de/_media/spezial/bib/sturm12iros.pdf">Sturm et al. (2012)</a>, but we felt showing the raw results is more representative of the performance of the system.</p>
<p>You have probably noticed that we don’t give any timing benchmarks; that’s because we did not attempt to produce a performant implementation, and a lot of the visualization cruft slows the runtime considerably. That said, we’re currently working on a GPU implementation and framework for the underlying algorithms that should speed things up considerably. Because each pixel in the keyframe is treated independently in the cost (it’s just a sum), the costs and Hessians etc. can be computed in parallel, making this approach particularly amenable to GPU implementations.</p>
<p>We also performed experiments on the <a href="https://europe.naverlabs.com/research/computer-vision/proxy-virtual-worlds/">Virtual KITTI (VKITTI)</a> dataset which is a rendered version of the commonly used <a href="http://www.cvlibs.net/datasets/kitti/">KITTI</a> dataset, providing ground truth segmentation and depth, as well as varying weather and lighting conditions, and varying camera angles. The moving objects are various kinds of vehicles, moving on the road plane. We got good results but they’re less interesting than those discussed above, so I have omitted them in this write-up.</p>
<h2 id="future-work">Future Work</h2>
<p>So where to from here? We’ve demonstrated a proof of concept system, but we’ve also run into a fundamental problem with our approach here, for which I haven’t yet come up with a good solution. That said, there is still plenty to do;</p>
<ul>
<li>
<p>In the spirit of <a href="https://vision.in.tum.de/_media/spezial/bib/engel2016dso.pdf">DSO by Engel et al.</a>, the next step should be to move to a full joint optimization framework using only photometric costs. Specifically, jointly estimating depth should become part of the cost function we minimize. This will probably require using the horrible feature-based initialization we discussed earlier, but would be a significant step forward; if nothing else we would be able to perform much richer, dense reconstructions which are impossible with feature based approaches alone.</p>
</li>
<li>
<p>Work like this requires not just better datasets, but better tools for creating synthetic datasets for computer vision applications. Currently these are limited to game engine based tools like <a href="http://carla.org/">CARLA</a> or else basically Blender scripts. I’d love a tool that gives me what I can do with Blender scripts with better ergonomics so that I can quickly generate datasets and variations on the fly.</p>
</li>
<li>
<p>Temporal consistency of the segmentations and photometric residuals is a direction we started down but did not fully exploit in this work. However, as mentioned previously, I don’t believe that we really need the instance segmentation, and that given a long enough sequence we should be able to find a consistent segmentation for entire moving objects if we perform joint segmentation-motion optimization over the whole sequence. This will probably come down to some discrete data association problem but would be interesting to tackle.</p>
</li>
<li>
<p>Put the whole thing on the GPU. The main reason this is hard is debugging tools for GPU code are pretty bad and it’s easy to make mistakes in the numerics.</p>
</li>
</ul>
<p>Thanks for reading. If you’re interested in learning about what we did in more depth, have comments, or would like to know more about my research interests in geometric computer vision please feel free to contact me.</p>
<h1 id="thoughts-on-the-autonomous-vehicle-industry">Thoughts on the Autonomous Vehicle Industry</h1>
<p>Anton Troynikov, 2018-03-28</p>
<p>These are some rough thoughts I’m having on the current autonomy landscape, having watched the industry develop over the last few years. It’s by no means sourced, definitive, or necessarily always accurate, and represents only my opinion. Nevertheless, as an active researcher in robotics in general and applied machine perception in particular, I enjoy a degree of <em>living in the future</em> that might make my viewpoint interesting to you.</p>
<h2 id="passenger-autonomy">Passenger Autonomy</h2>
<p>Passenger Autonomy is the big sexy topic getting most of the media attention, positive and negative. The <a href="https://www.theverge.com/2018/3/28/17174636/uber-self-driving-crash-fatal-arizona-update">recent Uber crash</a> has raised questions regarding the safety of autonomy testing programs, while Waymo <a href="https://www.wired.com/story/waymo-buys-jaguar-suvs/">continues its PR offensive</a> ahead of what will probably be some form of public launch in late 2018 or early 2019. Passenger autonomy is also receiving the majority of regulatory scrutiny, with legislators in various states either licensing passenger autonomy testing on public roads, or else giving autonomy companies free rein.</p>
<p>Cruise, Waymo, Uber and others have been saying a public launch of an autonomous taxi service is “imminent” for quite some time now, and Cruise seems to be closest to delivering at time of writing. Once one company launches, the rest will follow in short order. This is a market with untested business models and huge up-front R&amp;D costs that will need to be recouped. Additionally, there are the untested behavioral factors - will the general public trust autonomous taxis? Will they treat them with care, or trash them because no one is watching? Will other road users interfere with them on purpose?</p>
<p>Technically speaking, many of the problems remaining to be solved are unknown-unknowns which autonomy teams will only encounter through on-road testing in real traffic conditions. Despite Waymo’s marketing materials, very little real testing has gone on so far. Additionally, I don’t believe any company has the clear best autonomy team. Though the consensus is that Waymo is furthest ahead technology-wise, many of the original core members have left to found most of the current crop of autonomy companies, and it’s unclear how much difference having better technology will make pre-launch.</p>
<h3 id="the-launch">The Launch</h3>
<p>The truth is the launch of robotaxis is going to be underwhelming, and might be the reason we haven’t quite seen a first mover yet - the first one in is going to receive much of the PR backlash when the service just kinda sucks. At launch, robotaxi services will be very limited; limited to a geofenced geography, limited to weather and other operational conditions, they will be slow, and if I had to ballpark it, roughly 10% of the time they won’t do what the user wants in some way (rough/scary ride, unexpected stop, weird road user interaction, other technical fault).</p>
<p>The novelty and hype surrounding the launch will get people excited for a few months initially, but ultimately even heavily subsidized (I fully expect the services to be either completely free initially, or with some nominal price so that the user feels they’re getting something of value) they’re not going to be competitive with rideshare services. Companies like Uber have the advantage here in that they can probably just keep their own drivers out of the autonomy geofenced areas, but that invites competitors in.</p>
<p>Who’s going to be first? Either Uber or Cruise. It’s still not clear to me whether Waymo intends to run their own service except in isolated markets as a demo - the actual robotaxi service doesn’t appear to be their core business. Since GM has Cruise, I’m guessing they’ll partner with someone like Ford or a European vehicle OEM to run their robotaxi service under license.</p>
<h3 id="the-hangover">The Hangover</h3>
<p>What happens in the landscape after the launch depends on what kind of company you are.</p>
<p>If you’ve got a big external entity funding you (Waymo, Cruise), you settle in for the grind, or else your parent panics and cuts you off. You slowly expand your geofence and your operating conditions, you minimize the PR damage from various big events like fatalities or major crashes or bizarre behaviors (someone is going to have sex in these things, someone is going to get killed by another person in one, someone is going to hack into one, the cops will want to pull one over), and you try to keep people interested enough so that an upstart competitor doesn’t eat your lunch. It will be ~5-10 years before this is a real business.</p>
<p>If you don’t have a big external entity paying your bills, but you’re going for the fully vertically integrated scenario like Zoox, this is the moment where you absolutely need to raise a gigantic amount of money to fund your warchest for the 5-10 years it’s going to take for your service to turn into a real business. Whether or not you can raise that money depends mainly on external market conditions, and at this point you need to bring in really big institutional investors. You probably have the highest risk out of all the passenger autonomy models, and it’s not clear that you’ll be rewarded with an outsize return in the face of e.g. GM Cruise.</p>
<p>This is also the peril that Uber faces - they have other businesses, but it’s unclear whether they’re profitable enough to sustain an autonomy effort, or whether the markets will bear giving Uber more money to set on fire. A possible outcome is that companies in this category significantly scale down their ambitions and focus on a niche aspect of autonomy, fade into irrelevance while hemorrhaging money, or are acquired by (most likely) a German OEM who can’t do autonomy themselves.</p>
<p>The outcome for companies like Aurora which exist to license their technology, and other ‘Tier 1 autonomy suppliers’ with similar models, depends on the appetite of vehicle OEM’s to continue to pursue passenger autonomy in the wake of the lackluster launches. The OEM’s have considerably more markets they can attempt the service in, and have profitable businesses that can subsidize repeated small-scale roll-outs. How long companies in this category last as independent entities is going to depend on how panicked the OEM’s feel and how Aurora can spin the capabilities they can deliver.</p>
<p>These companies are significantly less capital intensive than their fully vertically integrated cousins, which forms part of their competitive advantage. It’s worth noting that there are a lot of traditional Tier-1 suppliers like Continental and Bosch working on various autonomy products, but from what I’ve seen first-hand, it’s unlikely that they have the internal expertise to deliver a complete system; they are experts on sensors and control loops, and have virtually no experience in modern machine perception.</p>
<p>The timeline looks something like:</p>
<ul>
<li>Launch, tons of hype, lasting ~2 months. Lots of companies announce a launch at once, including at least one OEM running Waymo technology.</li>
<li>Users enjoy the novelty for ~4-5 months.</li>
<li>Usage starts to drop, companies compensate by launching in more regions.</li>
<li>More launches increase the likelihood of a negative PR incident, and of regulators coming down hard on the whole industry.</li>
<li>The music stops, and those caught without a backer or funding either collapse or are acquired.</li>
</ul>
<p>Look for a big wave of consolidation about a year after launch. My bets: Waymo survives, learns, and keeps going. Cruise survives unless GM has a really bad quarter for some other reason and spins them out, then they die. Uber botches the launch and focuses on other autonomy verticals thereafter less dependent on PR. About a thousand minor companies launch with incomplete products, fail to get any traction, and are then acquired or collapse. Zoox is late to the party.</p>
<p>One exceptional company to watch in this space is Voyage. Voyage has faced the reality of steep capital investments required to deliver a vertically integrated autonomy product, and has instead chosen to focus on the user first. This will allow them to continue to act as a nimble start-up, reaping the rewards of core autonomy technologies developed elsewhere while executing on a strategy of constrained roll-outs. Rolling out their robotaxis in retirement communities first was a masterful move on CEO Oliver Cameron’s part.</p>
<p>This approach should allow them to remain capital efficient, always at the edge but never exceeding the capabilities of the main-line of autonomy technology development. By not focusing on the core technology development, Voyage can think about the overall user experience, and since behavioral change is so important in this market, this should lead to a sustainable moat. They will need to raise more funds in the future, but I hope they can go the distance and last the 5-10 years necessary to become a general autonomous taxi company.</p>
<h2 id="autonomous-logistics">Autonomous Logistics</h2>
<p>Full disclosure: I’m totally enamored with logistics as a business and as a technology. I firmly believe that the multimodal shipping container is the most significant invention of the 20th century, and that everyone should read <a href="https://www.amazon.com/Box-Shipping-Container-Smaller-Economy/dp/0691170819">‘The Box’ by Marc Levinson</a>.</p>
<p>Logistics autonomy is getting relatively little coverage, though there were a rash of autonomous truck announcements from <a href="https://www.wired.com/story/starsky-robotics-truck-self-driving-florida-test/">Starsky</a>, <a href="http://money.cnn.com/2018/03/07/technology/uber-trucks-autonomous/index.html">Uber</a>, and <a href="https://www.theverge.com/2018/3/9/17100518/waymo-self-driving-truck-google-atlanta">Waymo</a> in recent months, possibly in response to Tesla’s electric semi tractor. The structure of this space is rather different to passenger autonomy; logistics is a commodity business, defined by the costs to move a given weight and volume of goods from point A to point B in a reasonable time. It is a wholly numbers-defined business. Margins for overland shippers (companies who sell freight capacity and run the trucks) are extremely tight, around 5% if you’re operating close to peak efficiency. The value add for autonomy in trucking is in either reducing the need for drivers, or else making each driver more efficient.</p>
<p>Overland logistics is an industry that has operated under the same model for decades, but systemic global pressures are now forcing change. First, technology such as electronic logging is being mandated by regulators for safety. Second, there is an increasingly acute labor shortage. Fewer young people are entering the industry and many are aging out. The labor shortage is already resulting in reduced capacity for shippers, which has the potential to push many out of business as they simply cannot move the volumes needed to cover fixed operating costs. Under these conditions, the industry is ripe for new entrants and for new technology-enabled processes.</p>
<p>Because there are direct economic advantages in a large industry, regulation is likely to proceed in a measured way. Being able to demonstrate concrete numbers in terms of economic gains tends to look good when arguing your case. Long-haul trucking is also a relatively nice environment for autonomy in comparison to busy urban centers, and logistics vehicle access is already legislated according to class.</p>
<p>Ironically, the heavy regulation of the trucking industry provides an existing legal framework where none exists yet for passenger autonomy. The same is true for insurance and liability issues. There are already proposals in the U.S. and in Europe for autonomous and semi-autonomous trucking corridors and lanes on existing highways, which will act as proving grounds for the industry in general.</p>
<p>It’s unclear who the leader in logistics autonomy is, and whether there is one. Several startups are tackling the space, as are larger companies like Uber and Waymo, along with Tier 1 suppliers and OEM’s like MAN, Volvo and Daimler. I don’t have first hand experience of what these teams look like in practice, but I suspect a heavy bias toward electromechanical engineering approaches to autonomy. It’s likely that these companies can get along with this technical basis for quite some time to come, as the highway autonomy problem can be adequately tackled for many conditions with traditional automotive sensors including radar, and appropriate control and limited-horizon planning algorithms.</p>
<h3 id="full-autonomy-from-day-1">Full Autonomy from day 1?</h3>
<p>Embark, Uber, Waymo, MAN, Volvo and others are all working on a fully autonomous solution for highway driving. This is ambitious and risky for many of the same reasons as passenger autonomy, but does have the advantage of a clearer business case, in terms of reducing labor costs and improving operating efficiency. Labor costs represent roughly 50% of on-road operating costs for long-haul trucking, and reducing these would allow logistics companies to expand their margins.</p>
<p>That said, many trucking operations, including loading, parking, refueling and others, are currently performed by the driver - it’s unclear where the labor and efficiency savings would come from. Some of these operations can be simplified by building out autonomy infrastructure such as semi-automated distribution and fueling centers, but this would be a significant departure for most shippers, and require massive capital investment before real economic gains could be realized, negating some of the reasons for entering this space in the first place. Even so, as autonomy makes its impact felt in logistics over time, it’s likely that operations will be restructured to take advantage.</p>
<p>The dark horse player is the Chinese TuSimple. It’s likely that the Chinese government has identified autonomous logistics as a strategic capability of national importance, and, like Baidu, DiDi and other companies, will bring implicit state subsidies their way. Whether they can deliver a viable system is yet to be seen, but I will be watching them closely.</p>
<p>Interestingly, unlike the passenger autonomy space, it does not appear that any company is taking the fully vertical full-autonomy approach, i.e. building their own trucks - all are presently modifying various truck makes, with or without OEM cooperation. Truck development cycles are usually approximately 6-8 years long, and must anticipate the future economics of the business as well as fleet depreciation and replacement rates, but with changing business models this may accelerate and we may see a fully autonomous, purpose-built truck arrive in the near future as the business case is further proved out.</p>
<h3 id="staged-autonomy">Staged Autonomy</h3>
<p>Other companies are aiming at a staged approach. Unlike the passenger car market, advanced driver assistance systems have made little impact in trucking - even basic lane-keeping is absent from most models, as driver comfort rarely factors into the commodity pricing for overland goods transportation, and shippers try to maximize efficiency over their fleets. However, Platooning, where one truck follows another very closely to create beneficial aerodynamics for the pair, has emerged as a technology where the semi-automated ‘driver assistance’ approach can translate into economic benefits for shippers.</p>
<p>Platooning saves roughly 6-8% on fuel over the paired trucks, and requires superhuman reaction speeds coupled with safe engagement of the platooning system. 6-8% may not seem like much, but fuel represents a further 40% of on-road operating costs - Platooning is therefore an instance where autonomy technologies contribute directly to the bottom line of shippers without requiring full autonomy or extensive new infrastructure.</p>
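<p>The arithmetic here is worth making explicit. A back-of-the-envelope sketch in Python, using only the figures quoted above (no real shipper data):</p>

```python
# Platooning savings as a share of total on-road operating costs.
# Figures from the text: fuel is ~40% of on-road operating costs,
# and platooning saves ~6-8% of fuel over the paired trucks.
fuel_share = 0.40

for fuel_saving in (0.06, 0.08):
    cost_saving = fuel_share * fuel_saving
    print(f"{fuel_saving:.0%} fuel saving -> {cost_saving:.1%} of operating costs")
```

<p>A 2.4-3.2% reduction in operating costs is small in absolute terms, but against margins of roughly 5% it is a very large relative gain.</p>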
<p>There are remarkably few competitors in this space, with Peloton Technologies as the front runner. Daimler and some other OEM’s are working in this direction with some demos in the wild, but Peloton should have a system on the market by Q2 2018. Executing on Platooning creates a lot of the organizational technical knowledge that will be required in other highway trucking autonomy technologies; planning, control, perception, vehicle-vehicle communication and overall fleet management.</p>
<p>Additionally, as an aftermarket solution it should be an easier sell than requiring shippers to automate their entire fleet. There are also some very interesting platform effects that are possible; if the technology provider keeps a hold of the platform for e.g. matching trucks from different shippers into platooning pairs on the road, these relationships could provide for other opportunities. Think AWS for shipping.</p>
<p>Platooning requires one driver per truck. The next logical step is to have a single driver per several trucks, automatically ‘herding’ behind the human-crewed lead. Once you’ve removed all but one driver, the next step would be to operate the entire convoy remotely. Starsky’s approach, semi-autonomous teleoperated trucks, is intriguing but likely a little too early. The connectivity isn’t available yet over sufficient routes, which will require more infrastructure buildout (though with access to low earth orbit getting cheaper year on year, space based telecoms will likely be able to provide much of the needed bandwidth if not latency). However, air-traffic control style fleet management over highway zones between large freight-port style logistics hubs on the outskirts of major cities seems like a plausible future for trucking.</p>
<p>It’s important to note that at each stage of technology development, economic value is being created in terms of reduced costs and expanded capacity.</p>
<p>Here’s how I see the future of truck automation playing out:</p>
<ul>
<li>Platooning hits the road in late 2018, alongside limited autonomous trucks on some toy runs.</li>
<li>More routes open up; the industry advances conservatively and avoids too much hype.</li>
<li>As benefits become clearer, more investment in the required infrastructure is made over time, and business models start to change.</li>
<li>Legacy shippers are challenged by upstart tech-first shippers pushing autonomy and novel freight consolidation models.</li>
<li>More advanced technologies (‘herding’, teleoperations, and finally full autonomy) appear in time.</li>
</ul>
<p>Patience is probably the name of the game, but the economic realities of trucking make a hype-to-bust cycle as we are about to experience in passenger autonomy unlikely. It should be possible to build a durable, capital efficient, technologically advanced business in this space. Going straight to full autonomy isn’t likely to work (at least in Europe and the U.S), but there is the slim chance that the immediate economic benefits of even limited-area limited-condition full autonomy would be so great that at least some regulators and shippers would move quickly to capitalize.</p>
<h3 id="the-last-mile">The last mile</h3>
<p>Last mile logistics is mainly about bringing small packages to individuals, with speed and convenience as the key differentiators; both are key pillars of e-commerce strategies.</p>
<p>Sidewalk robots and aerial drones delivering pizza are likely to be non-starters in this generation, for the same reasons that passenger autonomy isn’t likely to be a real business within 5 years. The initial service will be of limited deployment and poor quality, and it’s unlikely that the economics will beat paying gig-economy workers on bikes enough to pay back up-front R&D, especially given the deep venture capital subsidies these delivery companies enjoy. I expect a contracted version of the passenger autonomy timeline, but without any major company backing a serious team, the entire domain may just fade away without fanfare or much consolidation.</p>
<p>Another class of last-mile vehicle is more interesting; package delivery of the sort performed by UPS and FedEx in Grumman Long Life Vehicles (the doorless, boxy vans you see driving around your neighborhood if you live in the U.S). I am not as up to date on the economics of this space as I’d like to be, but it seems like exactly the sort of niche where autonomy could make a strong economic case and provide a better user experience, while mitigating the issue of limited service areas and conditions.</p>
<p>One fundamental problem with automated last mile is it’s still not possible for a robot to come to your door (especially if you live in an apartment complex) and drop off your package. That’s an advantage the Postmates guy on a bike is going to enjoy for a while yet.</p>
<h2 id="closing-remarks">Closing remarks</h2>
<p>The autonomy industry is in for a wild ride over the next few years. An analogy I’ve been using is we are in the Altair / Apple I era of development. The technical advances are real, and made up of incremental advances in sensing, compute, and batteries coming together at the right time. It’s likely that autonomous vehicles will be a transformative economic force over the next decade. But it’s too early to buy into the passenger hype, and the benefits will take a while to come. Strap in.</p>Anton TroynikovA long, rambling brain-dump about what I think will happen.Shape Priors Part 2: Embedding Shapes2017-10-29T00:00:00+00:002017-10-29T00:00:00+00:00http://troynikov.io/shape-priors-pca<p>In <a href="/shape-priors-tsdf/">part 1</a> of this series, we discussed using truncated signed distance functions to represent shapes. The next step in the pipeline is to find a way to parametrize a whole class of shapes in a unified way. This will allow us to later reason about that class collectively, instead of only working with particular shape instances.</p>
<h2 id="introduction-what-is-a-shape-embedding">Introduction: What is a shape embedding?</h2>
<p>With TSDFs we have an accurate and efficient way of representing any shape - whether a car, a person, a building, or anything else. However, because they’re so general, TSDFs don’t allow us to reason about the common properties of the shapes of objects of a given class in a unified way. We’d like to find a way to constrain our shape representation even further, so that we can answer questions like ‘what is the car-like shape that best fits to the data we have observed’. We’d also like to generate and compare car-like shapes from just a few numbers.</p>
<p>Shape embeddings can be thought of as a parametrization of shapes based on their most salient features. For example; though the shapes of cars differ from one another in many ways, some of the biggest sources of difference are their lengths, heights, widths, and whether or not they have a hatchback. We’d like to find a way to describe the shapes of all cars in terms of only a small number of their most salient properties. We’d like to ‘encode’ the shapes of all cars into only a small vector.</p>
<p>More formally, an embedding is a mapping from a high dimensional space onto a lower dimensional space. We would like to find such a mapping that is smooth and differentiable, i.e. one that, given smooth changes in the lower dimensional space, produces smooth changes in the higher dimensional space. We will see in a future post that this allows us to perform optimization in this lower dimensional space to efficiently approximate the shapes of objects.</p>
<h2 id="pca-basics">PCA Basics</h2>
<p>Principal Component Analysis (PCA) is a classic data analysis technique with roots in statistics. I will omit the rigorous mathematical treatment of PCA here as the <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">wikipedia article</a> covers it nicely. There are many intuitive ways to think about what PCA does, however the one I find most useful is to consider the resulting principal components as an orthogonal basis for the vectors of the higher dimensional space, with each basis vector corresponding to some ‘axis of variability’ in the data. The first axis has the greatest variability, the second has the next greatest, and so on (some readers will note that this seems like an orthogonal linear transformation, and it is!).</p>
<p>A convenient way to compute the principal components is via Eigendecomposition, also known as Spectral Decomposition. Suppose we have some <script type="math/tex">k</script> vectors <script type="math/tex">\mathbf{X}</script> from an <script type="math/tex">N</script> dimensional space, <script type="math/tex">\mathbf{x} \in \mathbb{R}^{N}</script>. We compute the <script type="math/tex">N\times N</script> <a href="https://en.wikipedia.org/wiki/Sample_mean_and_covariance#Sample_covariance">covariance matrix</a>, <script type="math/tex">\mathbb{\Sigma}</script> over our <script type="math/tex">k</script> vectors. <script type="math/tex">\mathbb{\Sigma}</script> has the nice property of being real valued and symmetric (<script type="math/tex">\mathbb{\Sigma} = \mathbb{\Sigma}^T</script>) by definition, and hence is always diagonalizable.</p>
<p>We may then factorize the covariance matrix as <script type="math/tex">\mathbb{\Sigma} = \mathbf{VDV}^T</script>, where <script type="math/tex">\mathbf{V}</script> is an <script type="math/tex">N \times N</script> matrix whose columns consist of the eigenvectors of <script type="math/tex">\mathbb{\Sigma}</script>, and <script type="math/tex">\mathbf{D}</script> the diagonal matrix of corresponding eigenvalues. The vectors in <script type="math/tex">\mathbf{V}</script> then form the basis in the space of the PCA. Projecting the points in <script type="math/tex">\mathbf{X}</script> to the new basis then amounts to computing;</p>
<script type="math/tex; mode=display">\mathbf{x'} = \mathbf{V}^T(\mathbf{x} - \mu(\mathbf{X}))</script>
<p>where <script type="math/tex">\mu(\mathbf{X})</script> is the mean over the <script type="math/tex">k</script> samples in <script type="math/tex">\mathbf{X}</script>.</p>
<p>Note that this projection also results in an N-dimensional space, <script type="math/tex">\mathbf{x'} \in \mathbb{R}^N</script>. In order to get a low dimensional mapping, <script type="math/tex">% <![CDATA[
\mathbf{x'} \in \mathbb{R}^M,\ M < N %]]></script>, all we need to do now is discard the basis vectors corresponding to low variability, i.e. construct <script type="math/tex">\mathbf{W}</script> as an <script type="math/tex">N \times M</script> matrix consisting of the eigenvectors in <script type="math/tex">\mathbf{V}</script> corresponding only to the largest <script type="math/tex">M</script> eigenvalues in <script type="math/tex">\mathbf{D}</script>. Hence <script type="math/tex">\mathbf{W}^T : \mathbb{R}^N \mapsto \mathbb{R}^M</script>, and we have achieved our dimensional reduction.</p>
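<p>The procedure above can be sketched in a few lines of NumPy. This is a toy illustration with random data standing in for real samples (the function names and data are my own, not from the papers):</p>

```python
import numpy as np

def pca_basis(X, M):
    """Compute an M-component PCA basis from the k samples (rows) of X in R^N,
    via eigendecomposition of the sample covariance matrix."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)        # N x N sample covariance
    eigvals, V = np.linalg.eigh(Sigma)     # symmetric, so always diagonalizable
    order = np.argsort(eigvals)[::-1]      # sort by decreasing variance
    W = V[:, order[:M]]                    # keep the M highest-variance directions
    return W, mu

def project(x, W, mu):
    """x' = W^T (x - mu): map a sample from R^N down to R^M."""
    return W.T @ (x - mu)

# Toy data: 100 samples in R^10, with most variance in the first 3 coordinates.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10)) * np.array([5, 4, 3] + [0.1] * 7)
W, mu = pca_basis(X, M=3)
print(project(X[0], W, mu).shape)  # (3,)
```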
<h2 id="from-tsdf-to-pca">From TSDF to PCA</h2>
<p>As a linear mapping, <script type="math/tex">\mathbf{W}</script> accomplishes our goal of finding a smooth, differentiable mapping to a lower dimensional space. Applying this mapping to our TSDF representation is straightforward. We need only unroll the 3-D TSDF to a 1-D vector representation. The order in which we do this is unimportant, so long as we maintain the same ordering for each TSDF.</p>
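<p>Concretely, the unrolling is a single reshape; a sketch with a random volume standing in for a real TSDF:</p>

```python
import numpy as np

# Unroll a 3-D TSDF volume into a 1-D vector for PCA. Any fixed ordering works,
# as long as every TSDF uses the same one; numpy's default C-order is one choice.
tsdf = np.random.default_rng(2).normal(size=(32, 32, 32))  # toy 32^3 volume
vec = tsdf.reshape(-1)                                     # shape (32768,)

# The ordering is invertible, so the volume can be recovered exactly.
assert np.array_equal(vec.reshape(32, 32, 32), tsdf)
print(vec.shape)  # (32768,)
```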
<p>We then apply PCA to the resulting vector. <a href="https://www.vci.rwth-aachen.de/publication/00146/">Engelmann et al.</a> choose a 5-component <script type="math/tex">\mathbf{W}</script>. The result is illustrated below;</p>
<p><img src="/assets/images/post_images/TSDF_PCA.png" alt="TSDF to PCA" /></p>
<p>We now have a low-dimensional shape embedding, through a linear mapping. This will allow us to later optimize efficiently over shapes using only a few variables that encode the most salient aspects, rather than over the entire high dimensional TSDF representation. As a by-product, we also compute a ‘mean car shape’, which will be useful to us later on in initializing our optimization.</p>
<h2 id="getting-back-to-a-tsdf">Getting back to a TSDF</h2>
<p>We would also like to be able to reconstruct the TSDF, and hence a 3-D representation of our shape. Fortunately, since <script type="math/tex">\mathbf{W}</script> is a linear mapping, reprojecting back into the TSDF space is also simple and computationally cheap. To get back to <script type="math/tex">\mathbf{x} \in \mathbb{R}^N</script> from <script type="math/tex">\mathbf{x'} \in \mathbb{R}^M</script> we need only compute;</p>
<script type="math/tex; mode=display">\mathbf{x} \approx \mathbf{W}\mathbf{x'} + \mu(\mathbf{X})</script>
<p>Of course, in performing the projection, information about the shape will be lost. An intuitive way of thinking about this is that discarding eigenvectors corresponds to an ‘averaging’ over the dimension that was discarded, and encoding only this average over the remaining eigenvectors. The results of performing PCA over TSDFs, reducing to five components, then reprojecting back to the shape is visualized in the following;</p>
<p><img src="/assets/images/post_images/shape_PCA.png" alt="Shape to PCA and Back" /></p>
<p>Clearly some detail is lost. However, we can make this tradeoff in a principled way, since the magnitude of the discarded eigenvalues corresponds to the variance of the discarded components. Now that we have a reasonable low-dimensional shape representation, we may perform all sorts of manipulations in this space, including qualitatively observing the influence of the various components.</p>
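<p>A small NumPy sketch of this round trip, again with random vectors standing in for unrolled TSDFs:</p>

```python
import numpy as np

# Embed a sample with a 5-component PCA basis, then reconstruct it.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 20)) * np.linspace(5.0, 0.1, 20)  # decaying variance
mu = X.mean(axis=0)
eigvals, V = np.linalg.eigh(np.cov(X, rowvar=False))
W = V[:, np.argsort(eigvals)[::-1][:5]]                    # 5 largest components

x = X[0]
x_emb = W.T @ (x - mu)       # R^20 -> R^5
x_rec = W @ x_emb + mu       # back to R^20; the discarded components are lost

# The reconstruction is closer to x than the mean shape alone.
print(np.linalg.norm(x - x_rec) < np.linalg.norm(x - mu))  # True
```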
<p>While other approaches to creating embeddings for shapes exist, including recent machine-learning approaches with autoencoders and generative adversarial networks, in practice PCA is suitable for our immediate needs. In part 3, we will examine how to use the TSDF and the low dimensional embedding to solve an important class of problems in photometric computer vision, using a principled optimization approach.</p>Anton TroynikovThe whys and hows of embedding shapes using PCA.Shape Priors Part 1: Representing Shapes2017-10-03T00:00:00+00:002017-10-03T00:00:00+00:00http://troynikov.io/shape-priors-tsdf<p>Since October 2016, I have been working on some research motivated by Engelmann et al. from RWTH Aachen. Notably, the two papers <a href="https://www.vci.rwth-aachen.de/publication/00135/">Joint Object Pose Estimation and Shape Reconstruction in Urban Street Scenes Using 3D Shape Priors</a> and <a href="https://www.vci.rwth-aachen.de/publication/00146/">SAMP: Shape and Motion Priors for 4D Vehicle Reconstruction</a> were a great inspiration. My work has been to extend their approach to be faster, more robust, and work on-line. However, the fundamental ideas, particularly with respect to shape representation and optimization, are due to these works.</p>
<p>In this series of articles I intend to write up the step-by-step shape priors pipeline, before discussing my results and extensions of the algorithms detailed in these papers. The articles are intended for the enthusiast; again, a more technical treatment is coming in an upcoming scientific paper.</p>
<h2 id="introduction-what-are-shape-priors">Introduction: What are Shape Priors?</h2>
<p>‘Shape Priors’ is a fancy way of saying that our understanding of objects we can see is informed by our experiences with similar objects we’ve seen before.
Humans have an intuitive understanding of the shape of objects. We can easily infer the overall shape of a lamp, or a car or an aeroplane, with a high degree of accuracy - even if we only observe a small piece of it from only one angle. In effect, we can <em>predict</em> the shape of an object from very limited observations, because we have an <em>expectation</em> about that object’s shape. Being able to guess the overall shape of an object from partial information is a very useful skill that allows us to more easily predict and interact with the world.</p>
<p>In contrast, machines have no inherent notion of what objects look like; in computer vision we often regard the world as simply a collection of textured points in space. It is impossible for ‘naive’ machines to predict the overall shapes of objects, because no expectation about shape is encoded into these collections of points. In order for machines to have the same degree of autonomy when interacting with the world, they must have some idea of ‘shape’.</p>
<p>Key to giving machines this ‘idea of shape’ is having a way to represent shapes so that computers can understand and manipulate them.</p>
<h2 id="representing-shapes-with-signed-distance-functions">Representing Shapes with Signed Distance Functions</h2>
<p>Our goal is to find a way to represent arbitrary 3D shapes. In the real world, at least at the visual level, shapes are made up of continuous compound surfaces. To a certain degree of simplification, this means that in order to describe an arbitrary shape we would need to store an infinite number of points; clearly this isn’t feasible with finite memory.</p>
<p>Instead of storing every point on the shape, we seek an approximation that is sufficiently good for our purposes. There are many ways to approximate shapes, including as simple point clouds, meshes, implicit surfaces, and others. All of these have their own particular applications, advantages and drawbacks.</p>
<p>A representation of particular interest is called a Signed Distance Function (SDF). Signed distance functions are very useful because they combine a compact representation of shape with high fidelity and lightweight computations. The key idea is that an arbitrary shape can be implicitly represented by measuring the distances from the surface of that shape to some chosen, finite, set of points.</p>
<h3 id="computing-the-signed-distance">Computing the signed distance</h3>
<p>Signed distance functions are relatively straightforward to understand and compute, but first let’s visualize them in 2D before we extend the idea to three dimensions. Let’s investigate how to represent a curve in 2D space using signed distance functions. First we must understand what is meant by <em>signed distance</em>.</p>
<p><img src="/assets/images/post_images/2D_sdf_io.png" alt="2D SDF - Inside and Out" /></p>
<p>Consider a plane divided into a grid of squares, with a curve passing through it. Label one side of the curve the ‘inside’ and one the ‘outside’ (for closed curves there is a rigorous idea of outside and inside, but for our purposes this is an arbitrary label).</p>
<p><img src="/assets/images/post_images/2D_sdf_outside_distance.png" alt="2D SDF - Distances" /></p>
<p>The signed distance from the curve to a point is then simply the shortest Euclidean distance between the point and the curve, signed positively if the point is on the <em>outside</em> and negatively if it’s on the <em>inside</em>. Importantly, if a point lies <em>on</em> the curve, the signed distance will be zero.</p>
<h3 id="signed-distance-functions">Signed distance functions</h3>
<p>A signed distance <em>function</em> is a mapping from the position of a set of points in space, to their signed distance with respect to the curve we wish to represent. The curve can then be reconstructed as the <em>zero level set</em> of the signed distance function; that is, the set of points for which the signed distance function is 0.</p>
<p>We can compute these signed distances for arbitrary points relative to the curve, but we would like to avoid computing them everywhere; after all, there are an infinite number of points to choose from. You can think of each point at which we compute the distance as a <em>sample</em> of the shape of the curve; each distance contains some information about what the curve looks like. The more densely we sample the distances, the more closely the zero level set of the SDF corresponds to the true curve.</p>
<p>Part of the power of signed distance functions comes from this ability to choose the sampling that best suits our needs. For example, we can more densely sample areas of the curve that are somehow more ‘interesting’, we can sample the space near the curve randomly (which is like a Monte Carlo approximation of the shape of the curve), or we can sample on a regular grid.</p>
<p>Sampling on a regular grid gives us another advantage that leads to a more compact representation, which we discuss below.</p>
<h3 id="sdfs-and-interpolation">SDFs and Interpolation</h3>
<p>Suppose we sample the signed distance to the curve at each vertex of the squares of our regular grid;</p>
<p><img src="/assets/images/post_images/2D_sdf_vertices.png" alt="2D SDF - Vertex Samples" /></p>
<p>We would like to compute a single value for each grid cell, to <em>implicitly</em> represent the signed distance function over the entire space. To do this we linearly interpolate the values at each vertex to the center of the cell.</p>
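<p>At the cell center, bilinear interpolation of the four vertex samples reduces to their average; a minimal sketch:</p>

```python
def cell_center_value(v00, v10, v01, v11):
    """Bilinear interpolation of four vertex samples, evaluated at the
    center of the cell (u = v = 0.5); at the center this reduces to the
    average of the four corners."""
    u = v = 0.5
    return ((1 - u) * (1 - v) * v00 + u * (1 - v) * v10
            + (1 - u) * v * v01 + u * v * v11)

# Signed distances sampled at the four corners of one grid cell:
print(cell_center_value(0.2, -0.1, 0.4, 0.1))  # ≈ 0.15
```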
<p><img src="/assets/images/post_images/2D_sdf_interpolate.png" alt="2D SDF - Interpolated" /></p>
<p>We repeat this process over the entire space to get the interpolated SDF at every cell. This then gives us an <em>implicit</em> signed distance function;</p>
<p><img src="/assets/images/post_images/2D_sdf_function.png" alt="2D SDF - Function Values" /></p>
<p>Though relatively few values are stored, in practice they are sufficient to accurately find the level set of the implied SDF, and hence the original curve. Several algorithms are available, including the famous <a href="https://en.wikipedia.org/wiki/Marching_squares">marching squares</a> and <a href="https://en.wikipedia.org/wiki/Ray_casting">raycasting</a>. This decomposition allows us to encode arbitrary closed shapes in a compact, computationally tractable way.</p>
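<p>The first step of a marching-squares-style reconstruction can be sketched as finding the cells whose corner signs disagree, since those are the only cells the zero level set can pass through (a toy sketch, not a full marching squares implementation):</p>

```python
def crossing_cells(samples):
    """Given signed distances sampled at the vertices of a regular grid,
    return the (row, col) of every cell whose corner signs differ: the
    only cells the zero level set (the curve) can pass through."""
    cells = []
    for i in range(len(samples) - 1):
        for j in range(len(samples[0]) - 1):
            corners = [samples[i][j], samples[i][j + 1],
                       samples[i + 1][j], samples[i + 1][j + 1]]
            if min(corners) < 0.0 <= max(corners):
                cells.append((i, j))
    return cells

# Vertex samples of sdf(x, y) = x - 1.5 (a vertical line at x = 1.5)
# on a 3x3 vertex grid:
grid = [[x - 1.5 for x in range(3)] for _ in range(3)]
print(crossing_cells(grid))  # the middle column of cells: [(0, 1), (1, 1)]
```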
<h3 id="a-note-on-truncation">A note on truncation</h3>
<p>Grid cells that are relatively distant from the shape don’t really add any extra information when we reconstruct from the implicit SDF to the explicit shape representation. We therefore <em>truncate</em> the values of the signed distance function beyond a certain threshold, in both positive and negative directions. This means that in practice we only need to store the values of the SDF close to the curve, making the representation even more compact and improving the computation efficiency.</p>
<p>This gives rise to the Truncated Signed Distance Function, or TSDF. In every other respect, they function just the same as the SDF.</p>
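<p>Truncation itself is just a clamp; a minimal sketch, with an assumed threshold:</p>

```python
def truncate(sdf_value, tau=0.3):
    """Clamp a signed distance to [-tau, +tau].  Cells further than tau
    from the surface all store the same saturated value, so only cells
    near the surface need to be kept explicitly."""
    return max(-tau, min(tau, sdf_value))

print(truncate(0.1))   # near the surface: kept as-is, 0.1
print(truncate(5.0))   # far outside: saturates to 0.3
print(truncate(-2.0))  # far inside: saturates to -0.3
```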
<h2 id="extending-to-3d">Extending to 3D</h2>
<p>Being able to compute an SDF and recover a shape in 2D is well and good, but in machine perception we wish to be able to represent shapes in the 3D world. Fortunately, SDFs can be readily extended to the 3D case using voxels. Voxels divide 3D space into a uniform lattice.</p>
<p><img src="/assets/images/post_images/3D_sdf.png" alt="3D SDF" /></p>
<p>We compute the signed distance from each vertex to the nearest point on the 3D surface of the shape, and interpolate in three rather than two dimensions. This results in an implicit SDF, with a signed distance value for each voxel.</p>
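<p>A toy 3D sketch, sampling the SDF of a unit sphere at the vertices of a small voxel lattice:</p>

```python
import math

def sphere_sdf(x, y, z, r=1.0):
    """Signed distance to a sphere of radius r centered at the origin."""
    return math.sqrt(x * x + y * y + z * z) - r

# Sample at the vertices of a small voxel lattice spanning [-1, 1]^3:
n = 5
coords = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
sdf = [[[sphere_sdf(x, y, z) for z in coords] for y in coords] for x in coords]

print(sdf[2][2][2])  # lattice center (0, 0, 0): -1.0, deep inside
print(sdf[2][2][4])  # (0, 0, 1) lies on the surface: 0.0
```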
<h2 id="putting-it-together">Putting it Together</h2>
<p>We now have a method for encoding 3D shapes as a signed distance function, a convenient and compact representation.</p>
<p><img src="/assets/images/post_images/CAD_sdf.png" alt="SDF of some CAD models of cars" /></p>
<p>Here we see some CAD models of cars and their corresponding TSDFs (thanks to Engelmann et al.). This is the first step in working with shape priors for 3D computer vision. We can easily convert between the explicit surface of the shape, and the implicit signed distance function. Furthermore, this compact numerical representation will allow us to analyze and manipulate the underlying shape in some very useful ways.</p>
<p>In part 2, we will discuss some of the ways we can decompose the shape representation even further, and find a common way to represent different shapes from the same class (in our case, road vehicles), using only a few parameters.</p>
<p><em>Anton Troynikov. An intro to TSDFs for representing 3D shapes.</em></p>
<h1 id="lsd-hallucinations">LSD Hallucinations - When SLAM Goes Wrong</h1>
<p><em>Posted 2017-09-14 at <a href="http://troynikov.io/lsd-hallucinations">troynikov.io/lsd-hallucinations</a></em></p>
<h2 id="introduction">Introduction</h2>
<p>While working for Iris Automation last fall <a href="http://irisonboard.com">(they’re hiring!)</a>, I encountered a particularly interesting failure case of direct visual odometry. Unlike most failures, this one was not due to sudden motions that prevent tracking between frames, nor hard-to-deal-with aspects of the world like moving objects or specularity. Nor was the failure mode random; the system generated an internally consistent representation of the world, and of the camera trajectory in it – one that was divorced from reality.</p>
<p>In effect, the system was hallucinating — seeing things that weren’t there.</p>
<p>This overview is intended for computer vision enthusiasts, and doesn’t assume a deep mathematical background. I will link out to a more technical report for experts when I’ve finished typesetting all the math.</p>
<h2 id="the-failure">The Failure</h2>
<p>Our input sequence is a video collected from a drone, flying straight and level (replicated here in Xplane):</p>
<p><img src="/assets/images/post_images/xplane_vid.gif" alt="The xplane video" /></p>
<p>To a human, the motion of the camera into the scene is obvious, and we could give a rough estimate of the 3D structure of the terrain we were flying over.</p>
<p><img src="/assets/images/post_images/sane_reconstruction.png" alt="A sane reconstruction" /></p>
<p>The SLAM system however, arrived at a surprisingly different reconstruction:</p>
<p><img src="/assets/images/post_images/lsd_hallucination.gif" alt="The hallucination" /></p>
<p>It may not be completely clear from the above animation, but the SLAM system arrives at a reconstruction which puts every point in the world that it can see onto a 2D plane at some distance from the camera. The reconstructed trajectory is arc-shaped, including a rotational component.</p>
<p><img src="/assets/images/post_images/crazy_reconstruction.png" alt="The strange reconstruction" /></p>
<p>Tracking is never lost, and the 3D points remain consistent over time. The reconstructed motion is smooth. Everything looks well behaved, but at the same time totally wrong. So what’s going on here?</p>
<h2 id="background">Background</h2>
<p>The causes of this failure are complex; they are partly a consequence of the environment (flying straight and level over mostly flat terrain), and partly a consequence of some of the properties of the algorithm being used.</p>
<h3 id="the-algorithm">The algorithm</h3>
<p>SLAM algorithms are a family of computer vision algorithms that attempt to use a video or sequence of images to simultaneously estimate the motion of a camera in space (and therefore e.g. the drone it’s attached to), as well as reconstructing a 3D representation of the world. Hence Simultaneous Localization And Mapping - SLAM.</p>
<p>There are many different SLAM approaches and variations. The algorithm used here is from the family of direct photometric SLAM algorithms. This means that it attempts to figure out the position of the camera for a given frame by aligning what it can see in that frame to what it <em>should</em> see based on what it already knows about the 3D structure of the world. Once it has an estimate for where the camera is (the localization), it adds to the reconstruction of the world (the map), and repeats the process for the next frame.</p>
<p>What differentiates this family of algorithms from other SLAMs is that it doesn’t require any hand-tuned features; it just uses what it can see. There are a few other neat tricks that different algorithms from this family use to get good, stable results over time, but this is the key advantage.</p>
<p>I originally discovered this failure with Iris’ proprietary algorithms, but the failure mode is sufficiently general that I was able to replicate it with the famous <a href="https://github.com/tum-vision/lsd_slam">LSD-SLAM</a> from Engel et al. I think the failure will also occur with algorithms from other SLAM families.</p>
<h2 id="the-source-of-the-hallucination">The source of the hallucination</h2>
<p>It took me some time to figure out what exactly was happening. This is far from a conventional failure mode, especially on a ‘real’ dataset (the original failure was discovered when I was testing against some real flight data), but serves to illustrate how important it is to think about all aspects of the system when developing any kind of autonomy.</p>
<h3 id="where-the-plane-comes-from-and-why-it-stays-consistent">Where the plane comes from, and why it stays consistent</h3>
<p>Arguably the most unusual feature of this failure is that the algorithm reconstructs the world into a plane — almost as if it thinks it’s looking at a matte painting on which the trees, hills, buildings, etc. are painted. Why this specific reconstruction? How is it possible that it remains consistent between all the frames?</p>
<p>To get an idea, we need to pay attention to how this class of algorithms is usually initialized. Because we don’t yet know anything about the world when we start up, photometric SLAM algorithms usually initialize their map of the world by setting the distance of every point it can see in the first frame to 1, plus some random noise. This looks like the fuzzy plane that we see in the reconstruction. That explains why we get a plane, but not why it stays consistent.</p>
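<p>A minimal sketch of this kind of initialization (the function name and noise scale are illustrative, not LSD-SLAM’s actual code):</p>

```python
import random

def init_depth_map(width, height, mean_depth=1.0, noise=0.05, seed=0):
    """Set every pixel's initial depth to mean_depth plus a little
    Gaussian noise, as direct photometric SLAMs typically do on the
    first frame.  The result is the 'fuzzy plane' in the reconstruction."""
    rng = random.Random(seed)
    return [[mean_depth + rng.gauss(0.0, noise) for _ in range(width)]
            for _ in range(height)]

depths = init_depth_map(4, 3)
flat = [d for row in depths for d in row]
print(min(flat), max(flat))  # all depths cluster tightly around 1.0
```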
<p>What usually happens after the system has seen a few more frames is the algorithm quickly ‘breaks out’ of this crude initialization; the random initialization correctly assigns a consistent <em>relative distance</em> between some of the points, which get reconstructed consistently across several frames. The algorithm then ‘locks on’ to these points, allowing it to estimate a more consistent trajectory, which in turn allows it to reconstruct other points, and so on in a virtuous cycle. So why doesn’t that happen here?</p>
<p>The key is the small <em>relative motion</em> of our drone and the <em>relative distance</em> of the objects in the scene. The distance from the drone to the nearest object it can see in the scene (d1) is only slightly different from the distance to the farthest object it can see (d2), because both these distances are dominated by the height of the drone’s camera above the ground. Flying high means everything is about the same distance away, relatively speaking.</p>
<p><img src="/assets/images/post_images/distances.png" alt="Relative Distances" /></p>
<p>Additionally, relative to the distances between the drone and objects on the ground, the drone only moves a tiny amount in between frames. When some objects are close by and some are far away, there is a parallax effect. More distant objects seem to move relatively little in between images, and objects close by move a lot.</p>
<p><img src="/assets/images/post_images/parallax.gif" alt="Parallax - Courtesy of Wikipedia" /></p>
<p>This is the effect that usually allows this family of algorithms to break out of its initialization, since some objects ‘pop out’ of the scene when the camera moves. But when everything is roughly equally far away, no parallax is observed, and so we have no new distance information.</p>
<p>This means that the plane <em>is</em> a consistent reconstruction of what the drone is seeing: it is exactly what we would observe if we were moving relative to a genuinely flat object. With only tiny motions between frames, the plane will stay a plane.</p>
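<p>We can put rough numbers on this. Under a simple pinhole model, a translation t between frames moves a point at depth Z by roughly f*t/Z pixels, so the parallax between the nearest and farthest points scales with (1/d1 - 1/d2). The values below are illustrative, not taken from the real flight:</p>

```python
# Differential image motion (parallax) between the nearest and farthest
# visible points, under a simple pinhole model.  Illustrative numbers.
f = 600.0  # focal length in pixels (assumed)
t = 0.5    # metres travelled between frames (assumed)

def parallax_px(d1, d2):
    """A point at depth Z moves about f * t / Z pixels between frames,
    so near and far points differ by f * t * (1/d1 - 1/d2) pixels."""
    return f * t * (1.0 / d1 - 1.0 / d2)

print(parallax_px(5.0, 50.0))     # ground level: ~54 px of parallax
print(parallax_px(200.0, 300.0))  # high altitude: ~0.5 px, effectively none
```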
<p>This serves to explain the planar reconstruction, but where does the arc-shaped trajectory the drone takes come from?</p>
<h3 id="where-the-arc-shaped-trajectory-comes-from">Where the arc-shaped trajectory comes from</h3>
<p>Though there is very little parallax and little relative motion, what the drone sees still changes over time. The aim of every SLAM algorithm is to find a motion that is consistent with its knowledge of the world so far, and what it sees in each frame. We can use these clues to figure out why the trajectory is being reconstructed the way it is. Let’s take a look at the input video again and see what critical features might be influencing the trajectory reconstruction:</p>
<p><img src="/assets/images/post_images/xplane_vid.gif" alt="The xplane video, again" /></p>
<p>There are a few important things to note:</p>
<ul>
<li>The line between sky and ground always stays in roughly the same place in the image.</li>
<li>As the drone flies forward, objects in the bottom of the image ‘fall off’, and the drone can’t see them anymore.</li>
<li>There is also some falloff at the edges, but it’s not even across the image; closer to the bottom of the image, there is more falloff.</li>
</ul>
<p><img src="/assets/images/post_images/falloff.png" alt="Falloff diagram" /></p>
<p>We know all these effects come from the fact that we’re moving in a world with perspective, but remember — the drone thinks it’s flying relative to a flat plane painted to resemble objects in the world. What path would the drone have to take, if it had to reproduce all these effects from motion near a plane?</p>
<p>We could try rotating upwards in place (pitching up):</p>
<p><img src="/assets/images/post_images/motion_2.png" alt="Rotate" /></p>
<p>We’d see the falloff off the bottom, but we wouldn’t have any falloff at the edges. What’s more, the line between ground and sky would move in the image which isn’t what the drone observes.</p>
<p>We could try moving directly toward the plane:</p>
<p><img src="/assets/images/post_images/motion_1.png" alt="Directly towards" /></p>
<p>While this motion could replicate the falloff at the bottom of the image, and keep the horizon in the same position, we would not see the right kind of falloff at the edges: things would fall off the sides more or less uniformly, which isn’t what the drone is seeing.</p>
<p>We need a compound combination of rotation and translation to replicate what we’re seeing in the images with motion relative to a plane — and this is exactly what the SLAM algorithm hallucinates:</p>
<p><img src="/assets/images/post_images/reconstructed_motion.png" alt="Reconstruction" /></p>
<p>This compound motion satisfies all three of the constraints we observe in our images: a certain rate of falloff off the bottom, due to the drone pitching up; a correct falloff off the edges, due to the drone moving slightly toward the plane; and a constant position of the sky/ground line in the images, compensated for by moving the drone downwards as it rotates.</p>
<p><img src="/assets/images/post_images/lsd_hallucination.gif" alt="The hallucination again" /></p>
<p>I intend to give a complete mathematical account of how the constraints we observe in the images give rise to this as the observed trajectory, but this gives the main intuition: If your reconstruction is to be consistently planar (as it must be from what we noted in the previous section), then this is the only possible motion.</p>
<p>Thus we have a full description of both the reconstruction and the trajectory — we know why our system hallucinates!</p>
<h2 id="coming-down-back-to-reality">Coming down (back to reality)</h2>
<p>Now that we understand the failure, what can we do about it? There are a few possible solutions, mainly focused on better initializing the algorithm so we don’t get this very strong plane initialization in the first place.</p>
<ul>
<li>
<p>We can use domain knowledge to initialize the point cloud. If we know we’re frequently flying high over flat terrain, we can initialize that way. Rather than using a mean distance of 1 from the camera, we could instead initialize as if the camera is at distance 1 above a plane, and perturb that. This makes it more likely that there are some points to ‘lock’ onto, and hence break out of the initialization. Of course, this kind of prior-based initialization may not work so well if we find ourselves initializing in other important situations, for example flying toward the ground.</p>
</li>
<li>
<p>We can try to use a feature-based approach to initialize the direct photometric approach. This means we find some features in the scene we can easily track (for example, <a href="http://docs.opencv.org/3.0-beta/doc/py_tutorials/py_feature2d/py_orb/py_orb.html">ORB features</a>) and estimate our motion using just these features with some <a href="https://en.wikipedia.org/wiki/Eight-point_algorithm">classical algorithms</a>. When we’ve moved ‘far enough’ to notice some parallax effects, we can start our reconstruction. While this approach is relatively simple, it has its own pitfalls: the risk of losing tracking during initialization is high, in which case we may never be able to initialize at all.</p>
</li>
<li>
<p>My personal favorite is to simply increase the variance of the random initialization of the distances of each point. This means that rather than a plane, we get a random volume of points. In this way it’s more likely that objects that are distant from one another will also be assigned correct <em>relative</em> distances. Of course, because the initialization is stochastic, there is no guarantee that it will be good, and the algorithm may take many more steps to converge to a ‘good’ motion estimate and reconstruction.</p>
</li>
<li>
<p>Use other information, like inertial measurement units, to bias our motion estimates. The map-odometry feedback loop is what produces our ‘hallucination’. If we use other information about how we are moving through the world, including inertial measurement units, a motion model for the drone, and the drone’s control inputs, we can get another estimate for the motion. If we rely on this estimate (through Kalman filtering, for example) while we’re not yet confident of our reconstruction of the world, we can help the algorithm ‘break out’. This of course relies on having access to all these extra measurements, which is not always the case.</p>
</li>
</ul>
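<p>The first option above, a plane-prior initialization, can be sketched as assigning each pixel row the depth at which its viewing ray meets an assumed ground plane, instead of a constant depth. All names and numbers here are illustrative, and the geometry is deliberately simplified to one depth per row:</p>

```python
import math

def plane_prior_depths(rows, f=300.0, pitch_deg=20.0, height=1.0):
    """Initialize one depth per pixel row as if the camera sits `height`
    above a flat ground plane, pitched down by pitch_deg, instead of
    using a constant depth.  Rays above the horizon get no depth (None)."""
    pitch = math.radians(pitch_deg)
    depths = []
    for v in range(rows):
        # angle of this row's viewing ray below the horizontal
        angle = pitch + math.atan2(v - rows / 2.0, f)
        depths.append(height / math.sin(angle) if angle > 0 else None)
    return depths

d = plane_prior_depths(8)
# rows lower in the image look at nearer ground, so depth decreases with v
print(d[0] > d[-1])  # True
```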
<h2 id="a-note-on-stereo">A note on stereo</h2>
<p>One naturally wonders if having two cameras would save us. The obvious approach is to use a stereo system with parallel cameras. We would be able to estimate distances by using the relative position of the two cameras. Unfortunately, this won’t help in this particular case - the distance between the two cameras will be tiny compared to the distances between the cameras and objects in the world. For the same reason that the drone’s tiny relative motion doesn’t produce the parallax we need, the distance between the two cameras won’t produce a sufficient difference in the two images to break us out.</p>
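<p>A back-of-the-envelope check with a pinhole stereo model (illustrative numbers, not a real rig) shows why the baseline is too small:</p>

```python
# Stereo disparity under a pinhole model: disparity = f * B / Z pixels.
f = 600.0  # focal length in pixels (assumed)
B = 0.2    # 20 cm baseline between the two cameras (assumed)

def disparity_px(Z):
    """Disparity in pixels for a point at depth Z metres."""
    return f * B / Z

print(disparity_px(2.0))    # indoor scene at 2 m: ~60 px, plenty
print(disparity_px(250.0))  # drone altitude, 250 m: ~0.5 px, useless
```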
<p>Another idea is to use a pair of cameras mounted orthogonally, which would introduce an additional constraint on the possible trajectories. If a side-facing camera also initialized into a plane prior and became locked into it, it would observe motion parallel to its plane. This would be inconsistent with the motion we described in the previous section, and hence force the system to reconstruct a trajectory more consistent with the real world.</p>
<p>Another approach is to let two systems that are relatively far apart from one another communicate about what they’re seeing. Pairs of drones flying more or less independently could produce a ‘virtual’ stereo baseline, if they could be sufficiently well localized relative to one another. There is research in this direction, notably from the <a href="http://www.asl.ethz.ch/">autonomous systems lab</a> at ETH Zurich.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I hope you’ve enjoyed this account of an interesting failure mode I discovered in a common class of SLAM algorithms. The explanation in this article is intended for the enthusiast — a more rigorous mathematical treatment is coming soon.</p>
<p>Autonomy is a challenging problem space, and a deep understanding of the foundations of the algorithms being used is critical for this kind of debugging, especially when safety is a priority.</p>
<p>I’d like to acknowledge the help of my Iris colleagues Alexander Harmsen and James Howard for helping me get some of the details straight and for providing me with some simulated data for public consumption. <a href="http://irisonboard.com">They’re hiring</a> across a range of positions, especially in computer vision. It’s a great team with a great product!</p>
<p>If you’re interested in trying this particular failure out for yourself, please contact me at anton [at] troynikov [dot] io and I’ll help you get started.</p>
<p><em>Anton Troynikov. A peculiar failure in computer vision.</em></p>