While working for Iris Automation last fall (they’re hiring!), I encountered a particularly interesting failure case of direct visual odometry. Unlike most failures, this one was not due to sudden motions that prevent tracking between frames, nor hard-to-deal-with aspects of the world like moving objects or specularity. Nor was the failure mode random; the system generated an internally consistent representation of the world, and of the camera trajectory in it – one that was divorced from reality.
In effect, the system was hallucinating — seeing things that weren’t there.
This overview is intended for computer vision enthusiasts, and doesn’t assume a deep mathematical background. I will link out to a more technical report for experts when I’ve finished typesetting all the math.
Our input sequence is a video collected from a drone, flying straight and level (replicated here in Xplane):
To a human, the motion of the camera into the scene is obvious, and we could give a rough estimate of the 3D structure of the terrain we were flying over.
The SLAM system however, arrived at a surprisingly different reconstruction:
It may not be completely clear from the above animation, but the SLAM system arrives at a reconstruction which puts every point in the world that it can see onto a 2D plane at some distance from the camera. The reconstructed trajectory is arc-shaped, including a rotational component.
Tracking is never lost, and the 3D points remain consistent over time. The reconstructed motion is smooth. Everything looks well behaved, but at the same time totally wrong. So what’s going on here?
The causes of this failure are complex; they are partly a consequence of the environment (flying straight and level over mostly flat terrain), and partly a consequence of some of the properties of the algorithm being used.
SLAM algorithms are a family of computer vision algorithms that attempt to use a video or sequence of images to simultaneously estimate the motion of a camera in space (and therefore e.g. the drone it’s attached to), as well as reconstructing a 3D representation of the world. Hence Simultaneous Localization And Mapping - SLAM.
There are many different SLAM approaches and variations. The algorithm used here is from the family of direct photometric SLAM algorithms. This means that it attempts to figure out the position of the camera for a given frame by aligning what it can see in that frame to what it should see based on what it already knows about the 3D structure of the world. Once it has an estimate for where the camera is (the localization), it adds to the reconstruction of the world (the map), and repeats the process for the next frame.
What differentiates this family of algorithms from other SLAMs is that it doesn’t require any hand-tuned features; it just uses what it can see. There are a few other neat tricks that different algorithms from this family use to get good, stable results over time, but this is the key advantage.
I originally discovered this failure with Iris’ proprietary algorithms, but the failure mode is sufficiently general that I was able to replicate it with the famous LSD-SLAM from Engel et al. I suspect the failure will also occur with algorithms from other SLAM families.
The source of the hallucination
It took me some time to figure out what exactly was happening. This is far from a conventional failure mode, especially on a ‘real’ dataset (the original failure was discovered when I was testing against some real flight data), but serves to illustrate how important it is to think about all aspects of the system when developing any kind of autonomy.
Where the plane comes from, and why it stays consistent
Arguably the most unusual feature of this failure is that the algorithm reconstructs the world into a plane — almost as if it thinks it’s looking at a matte painting, on which the trees, hills, buildings etc. are ‘painted on’. Why this specific reconstruction? How is it possible that it remains consistent between all the frames?
To get an idea, we need to pay attention to how this class of algorithms is usually initialized. Because we don’t yet know anything about the world when we start up, photometric SLAM algorithms usually initialize their map by setting the distance of every point visible in the first frame to 1, plus some random noise. This looks like the fuzzy plane we see in the reconstruction. That explains why we get a plane, but not why it stays consistent.
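This initialization can be sketched in a few lines (the image size and noise scale here are illustrative, not taken from any particular system; I use the inverse-depth parameterization common to this family of algorithms):

```python
import numpy as np

rng = np.random.default_rng(0)

# Map initialization for the first frame: every visible pixel starts at
# depth ~1 (inverse depth ~1), perturbed by a small amount of noise.
H, W = 480, 640
inv_depth = 1.0 + 0.1 * rng.standard_normal((H, W))
depth = 1.0 / np.clip(inv_depth, 0.1, None)  # clip keeps depths positive

# The result is a fuzzy plane of points roughly unit distance from the camera.
print(round(float(depth.mean()), 2))
```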
What usually happens after the system has seen a few more frames is that the algorithm quickly ‘breaks out’ of this crude initialization: the random initialization happens to correctly assign a consistent relative distance between some of the points, which get reconstructed consistently across several frames. The algorithm then ‘locks on’ to these points, allowing it to estimate a more consistent trajectory, which in turn allows it to reconstruct other points, and so on in a virtuous cycle. So why doesn’t that happen here?
The key is the combination of our drone’s small motion between frames and the similar distances of the objects in the scene. The distance from the drone to the nearest object it can see in the scene (d1) is only slightly different from the distance to the farthest object it can see (d2), because both distances are dominated by the height of the drone’s camera above the ground. Flying high means everything is about the same distance away, relatively speaking.
Additionally, relative to the distances between the drone and objects on the ground, the drone only moves a tiny amount in between frames. When some objects are close by and some are far away, there is a parallax effect. More distant objects seem to move relatively little in between images, and objects close by move a lot.
This is the effect that usually allows this family of algorithms to break out from their initialization, since some objects ‘pop out’ of the scene when the camera moves. But when everything is roughly equally far away, no parallax is observed, and so we have no new distance information.
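A quick back-of-the-envelope sketch makes the magnitudes concrete. For a pinhole camera with focal length f (in pixels) translating sideways by t, a point at distance d moves roughly f·t/d pixels in the image (forward motion behaves differently in detail, but the order of magnitude is the same). All the numbers below are assumptions, chosen to be plausible for a small drone:

```python
# Approximate image motion (pixels) of a point at distance d for a camera
# with focal length f (pixels) translating by t metres between frames.
def pixel_shift(f, t, d):
    return f * t / d

f = 600.0  # focal length in pixels (assumed)
t = 0.7    # frame-to-frame camera translation in metres (assumed)

# High altitude: nearest and farthest visible points are both far away.
high_alt = pixel_shift(f, t, 100.0) - pixel_shift(f, t, 120.0)
# Low altitude: a nearby object against a distant background.
low_alt = pixel_shift(f, t, 5.0) - pixel_shift(f, t, 100.0)

print(round(high_alt, 2), round(low_alt, 2))
```

At altitude, the differential motion between near and far points is well under a pixel per frame, while the low-altitude case produces tens of pixels of parallax: plenty for the algorithm to lock onto.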
This means that the plane is a consistent reconstruction of what the drone is seeing: motion relative to a flat object is another situation in which no parallax is observed. With only tiny motions between frames, the plane stays a plane.
This explains the planar reconstruction, but where does the reconstructed arc-shaped trajectory come from?
Where the arc-shaped trajectory comes from
Though there is very little parallax and little relative motion, what the drone sees still changes over time. The aim of every SLAM algorithm is to find a motion that is consistent with its knowledge of the world so far, and what it sees in each frame. We can use these clues to figure out why the trajectory is being reconstructed the way it is. Let’s take a look at the input video again and see what critical features might be influencing the trajectory reconstruction:
There are a few important things to note:
- The line between sky and ground always stays in roughly the same place in the image.
- As the drone flies forward, objects in the bottom of the image ‘fall off’, and the drone can’t see them anymore.
- There is also some falloff at the edges, but it’s not even across the image; closer to the bottom of the image, there is more falloff.
We know all these effects come from the fact that we’re moving in a world with perspective, but remember — the drone thinks it’s flying relative to a flat plane painted to resemble objects in the world. What path would the drone have to take, if it had to reproduce all these effects from motion near a plane?
We could try rotating upwards in place (pitching up):
We’d see the falloff off the bottom, but we wouldn’t have any falloff at the edges. What’s more, the line between ground and sky would move in the image, which isn’t what the drone observes.
We could try moving directly toward the plane:
While this motion could replicate the falloff at the bottom of the image, and keep the horizon in the same position, we would not see quite the right kind of falloff at the edges; things would fall off the sides more or less uniformly, which isn’t what the drone is seeing.
We need a combination of rotation and translation to replicate what we’re seeing in the images with motion relative to a plane — and this is exactly what the SLAM algorithm hallucinates:
This compound motion satisfies all three of the constraints we observe in our images: a certain rate of falloff off the bottom, due to the drone pitching up; the correct falloff off the edges, due to the drone moving slightly toward the plane; and a constant position of the sky/ground line in the images, maintained by moving the drone downwards as it rotates.
I intend to give a complete mathematical account of how the constraints we observe in the images give rise to this trajectory, but this gives the main intuition: if the reconstruction is to remain consistently planar (as it must be, given what we noted in the previous section), this is the only possible motion.
Thus we have a full description of both the reconstruction and the trajectory — we know why our system hallucinates!
Coming down (back to reality)
Now that we understand the failure, what can we do about it? There are a few possible solutions, mainly focused on better initializing the algorithm so we don’t get this very strong plane initialization in the first place.
We can use domain knowledge to initialize the point cloud. If we know we’re frequently flying high over flat terrain, we can initialize that way. Rather than using a mean distance of 1 from the camera, we could instead initialize as if the camera is distance 1 above a plane, and perturb that. This should provide a higher likelihood of having some points to ‘lock’ onto, and hence break out of the initialization. Of course, this kind of prior-based initialization may not work so well if we find ourselves initializing in other important situations, flying toward the ground for example.
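A sketch of such a ground-plane prior (the intrinsics, image size, and pitch angle below are made-up numbers, and the geometry is simplified to a camera pitched down toward a flat plane at unit height):

```python
import numpy as np

# Hypothetical plane-prior initialization: instead of depth ~1 everywhere,
# initialize each pixel row with the depth to a ground plane a unit
# distance below a camera pitched down by `pitch` radians.
def ground_plane_depths(H, W, f, pitch, height=1.0):
    cy = H / 2.0
    v = np.arange(H).reshape(-1, 1)                 # pixel rows
    ray_angle = pitch + np.arctan((v - cy) / f)     # angle below horizontal
    ray_angle = np.clip(ray_angle, 1e-3, None)      # rays near/above horizon
    depth = height / np.sin(ray_angle)              # distance along the ray
    return np.broadcast_to(depth, (H, W)).copy()

d = ground_plane_depths(240, 320, f=300.0, pitch=0.5)
# Rows lower in the image look at nearer ground, so depth decreases down
# the image, unlike the flat constant-depth initialization.
print(bool(d[230, 0] < d[10, 0]))
```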
We can try to use a feature-based approach to initialize the direct photometric approach. What this means is we find some features in the scene we can easily track (for example, ORB features), and estimate our motion using just these features with some classical algorithms. When we’ve moved ‘far enough’ to notice some parallax effects, we can start our reconstruction. While this approach is relatively simple, it has its own pitfalls; the danger of losing tracking during initialization is very high, which puts you in the situation that you’ll never be able to initialize.
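The ‘far enough’ check is the interesting part. One hypothetical gate (not the full feature pipeline, which would use something like OpenCV’s essential-matrix estimation on the matched features): since pure rotation or tiny translation moves all features by nearly the same amount, parallax shows up as *spread* in the flow magnitudes, and we only attempt reconstruction once that spread is large enough. The threshold and data below are made up:

```python
import numpy as np

# Hypothetical parallax gate for feature-based initialization: only attempt
# two-view reconstruction once the spread of tracked-feature motion
# indicates real parallax, not just uniform image shift.
def ready_to_initialize(pts_ref, pts_cur, min_spread_px=2.0):
    """pts_ref, pts_cur: (N, 2) arrays of matched feature positions."""
    flow = np.linalg.norm(pts_cur - pts_ref, axis=1)
    return bool(np.std(flow) > min_spread_px)

rng = np.random.default_rng(1)
ref = rng.uniform(0, 640, size=(100, 2))
uniform_flow = ref + np.array([3.0, 0.0])                 # everything shifts alike
parallax_flow = ref + rng.uniform(0, 10, size=(100, 2))   # depth-dependent shifts

print(ready_to_initialize(ref, uniform_flow),
      ready_to_initialize(ref, parallax_flow))
```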
My personal favorite is to simply increase the variance of the random initialization of the distances of each point. This means that rather than a plane, we get a random volume of points. In this way it’s more likely that objects that are distant from one another will also be assigned correct relative distances. Of course, the problem here is that because the initialization is stochastic, you don’t get a guarantee that it will be very good, and the algorithm may take many more steps to converge to a ‘good’ motion estimate and reconstruction.
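As a sketch, the change is a one-liner; the noise scales below are illustrative, and in practice the clipping needed to keep depths positive is one reason the high-variance version converges more slowly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Low-variance initialization: a thin, near-planar shell of points.
flat = 1.0 + 0.05 * rng.standard_normal(10_000)
# High-variance initialization: a spread-out random volume of points,
# clipped to keep all depths positive.
volume = np.clip(1.0 + 0.5 * rng.standard_normal(10_000), 0.1, None)

print(bool(flat.std() < volume.std()))
```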
We can use other information, like inertial measurement units, to bias our motion estimates. The map-odometry feedback loop is what produces our ‘hallucination’. If we use other information about how we are moving through the world, including inertial measurement units, a motion model for the drone, and the drone’s control inputs, we can get another estimate for the motion. If we rely on this estimate (through Kalman filtering, for example) while we’re not yet confident of our reconstruction of the world, we can help the algorithm ‘break out’. This of course relies on having access to all these extra measurements, which is not always the case.
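A minimal one-dimensional sketch of the idea, using the inverse-variance weighting at the core of a Kalman update. The velocity estimates and variances here are made-up numbers; the point is that while the map is untrusted, the visual-odometry estimate gets a large variance, so the fused value stays close to the IMU:

```python
# Blend a visual-odometry estimate with an IMU-derived one, weighting each
# by its confidence (inverse-variance weighting, as in a Kalman update).
def fuse(vo_est, imu_est, vo_var, imu_var):
    w = imu_var / (vo_var + imu_var)  # weight on the VO estimate
    return w * vo_est + (1.0 - w) * imu_est

# VO hallucinating a near-stationary arc (0.2 m/s) vs. an IMU-derived
# forward speed of 20 m/s. With the map untrusted, VO variance is large.
fused = fuse(vo_est=0.2, imu_est=20.0, vo_var=100.0, imu_var=1.0)
print(round(fused, 2))
```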
A note on stereo
One naturally wonders if having two cameras would save us. The obvious approach is to use a stereo system with parallel cameras. We would be able to estimate distances by using the relative position of the two cameras. Unfortunately, this won’t help in this particular case - the distance between the two cameras will be tiny compared to the distances between the cameras and objects in the world. For the same reason that the drone’s tiny relative motion doesn’t produce the parallax we need, the distance between the two cameras won’t produce a sufficient difference in the two images to break us out.
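We can see why with the standard pinhole stereo relation, disparity = f·B/Z. The focal length, baseline, and altitudes below are assumptions chosen to be plausible for a small drone:

```python
# Stereo disparity (pixels) for focal length f (pixels), baseline B (m),
# and depth Z (m), under the standard pinhole stereo model.
def disparity_px(f, B, Z):
    return f * B / Z

f = 600.0  # focal length in pixels (assumed)
B = 0.2    # stereo baseline in metres (assumed)

print(round(disparity_px(f, B, 5.0), 2),    # low altitude: usable disparity
      round(disparity_px(f, B, 150.0), 2))  # high altitude: below a pixel
```

At altitude the disparity drops below a pixel, which is the stereo analogue of the missing motion parallax: the baseline is simply too small relative to the scene depth.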
Another idea is to use a pair of cameras mounted orthogonally, which would introduce an additional constraint on the possible trajectories. Even if an additional side-facing camera also initialized and became locked into a plane prior, it would observe motion parallel to its plane. This would be inconsistent with the motion we described in the previous section, and hence force the system to reconstruct a trajectory more consistent with the real world.
Another approach would be to have two systems that are relatively far apart from one another communicate about what they’re seeing. Pairs of drones flying more or less independently could produce a ‘virtual’ stereo baseline, if they could be sufficiently well localized relative to one another. There is research in this direction, notably from the autonomous systems lab at ETH Zurich.
I hope you’ve enjoyed this account of an interesting failure mode I discovered in a common class of SLAM algorithms. The explanation in this article is intended for the enthusiast — a more rigorous mathematical treatment is coming soon.
Autonomy is a challenging problem space, and a deep understanding of the foundations of the algorithms being used is critical for this kind of debugging, especially when safety is a priority.
I’d like to acknowledge my Iris colleagues, Alexander Harmsen and James Howard, for helping me get some of the details straight and for providing some simulated data for public consumption. They’re hiring across a range of positions, especially in computer vision. It’s a great team with a great product!
If you’re interested in trying this particular failure out for yourself, please contact me at anton [at] troynikov [dot] io and I’ll help you get started.