Back in December 2015, I had the idea of using direct photometric odometry, which is used to estimate the motion of a camera through a static world, to track moving objects as well. There are two important core ideas:
- Motion of the camera relative to an object looks identical to the motion of an object relative to the camera in an image sequence (transform equivalence).
- Moving objects create differences between consecutive image frames that cannot be explained by the assumption of a static world (photometric consistency violations).
Doing tracking this way has several advantages over standard keypoint, template, optical flow, and other methods. By reducing the problem of rigid tracking to direct photometric alignment, our algorithm can operate on the same input data as direct photometric SLAMs. This means we get:
- 3D tracking from just a monocular camera (in principle)
- A unified, mathematically transparent optimization problem and all the tools that come with that.
- A unified representation of the world we can use in whatever other applications we want (SLAM, loop closure).
- The ability to apply advances from the direct SLAMs to object tracking.
(and lots of others)
In 2018 I started to work on this more seriously, and in March 2019, with the help of Nikolaus Demmel and Jörg Stückler, I completed the proof of concept and have some interesting results to share. The short version is a conditional ‘it works’. The proof of concept is pretty primitive and not at all performant, but I’ve shown the fundamental ideas to be sound. This write-up doesn’t go into the full depth of the work but should give the interested reader a solid overview.
Direct photometric alignment primer
Direct photometric approaches to odometry use an image sequence to estimate rigid body transforms corresponding to the motion of a camera through a static world. Unlike keypoint-based approaches, direct photometric approaches use all the intensity information present in each image, by finding a transform that minimizes the so-called photometric residual.
The underlying idea of all the direct photometric methods is that a point in the world should have the same radiosity regardless of where it is relative to the observing camera, and so it should look the same in every image where it appears. This is expressed in the photometric consistency assumption:

$$I_1(\mathbf{x}) = I_2\big(\omega(\mathbf{x}, T_{12})\big)$$
Here $\mathbf{x}$ is a point in image 1, $I_i$ is the intensity function of image $i$, and $\omega$ is the reprojection function (sometimes called the warping function), which transforms a point in image 1 (using that point's depth) to a point in image 2 given the rigid body transform $T_{12}$. $T_{12}$ is the transform from the coordinate frame of the camera observing image 1 to that observing image 2, i.e. the pose of camera 2 in the coordinate frame of camera 1.
The photometric residual is then just the difference between what we saw in image 2 and what we expected to see given the $T_{12}$ we supplied to the reprojection function:

$$r(\mathbf{x}) = I_2\big(\omega(\mathbf{x}, T_{12})\big) - I_1(\mathbf{x})$$
Finding an estimate for the most likely transform given observations of each pixel’s photometric residual then reduces to a fairly standard nonlinear least-squares problem, for which Gauss-Newton works pretty well:

$$T_{12}^{*} = \underset{T_{12}}{\arg\min} \sum_{\mathbf{x}} r(\mathbf{x})^{2}$$
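It’s easier to see how this optimization works with a toy example. The sketch below runs Gauss-Newton on a photometric residual, but estimates only a 2D image-plane translation instead of a full rigid body transform (so the warp needs no depth or camera model); the structure (warp, residual, Jacobian from image gradients, normal equations) carries over to the SE(3) case. Everything here is illustrative rather than taken from any particular implementation.

```python
import numpy as np

def bilerp(img, x, y):
    """Bilinear interpolation of img at float coordinates (x, y)."""
    x0 = np.clip(np.floor(x).astype(int), 0, img.shape[1] - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, img.shape[0] - 2)
    fx, fy = x - x0, y - y0
    return (img[y0, x0] * (1 - fx) * (1 - fy) + img[y0, x0 + 1] * fx * (1 - fy)
            + img[y0 + 1, x0] * (1 - fx) * fy + img[y0 + 1, x0 + 1] * fx * fy)

def align_translation(I1, I2, iters=30):
    """Gauss-Newton estimate of a 2D translation t minimizing
    sum_x (I2(x + t) - I1(x))^2, a toy analogue of direct alignment."""
    gy, gx = np.gradient(I2)                 # image gradients of I2
    H, W = I1.shape
    ys, xs = np.mgrid[2:H - 2, 2:W - 2]      # interior pixels only
    t = np.zeros(2)
    for _ in range(iters):
        wx, wy = xs + t[0], ys + t[1]        # warped coordinates
        r = (bilerp(I2, wx, wy) - I1[ys, xs]).ravel()   # photometric residual
        # Jacobian of the residual w.r.t. t is the gradient of I2 at the warp.
        J = np.stack([bilerp(gx, wx, wy).ravel(),
                      bilerp(gy, wx, wy).ravel()], axis=1)
        # Normal equations (J^T J) dt = -J^T r, lightly damped for safety.
        dt = np.linalg.solve(J.T @ J + 1e-9 * np.eye(2), -J.T @ r)
        t += dt
        if np.linalg.norm(dt) < 1e-8:
            break
    return t
```

Running this on a smooth synthetic image and a shifted copy of it recovers the shift to sub-pixel accuracy in a handful of iterations.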
It’s kind of nuts that you can take the derivative of an image with respect to a rigid body transform and come out with something sensible, but in 2019 just about anything is possible. A more in-depth starting point for direct photometric approaches is this paper by Kerl.
A camera moving relative to a static object will observe the same thing as a static camera observing a moving object.
This means that any method we can use to estimate rigid body transforms corresponding to the motion of a camera through a static world can also be used to estimate the motion of dynamic objects. This is pretty intuitive, but things get a little subtle in our choice of coordinate frames. Reading the next part isn’t really necessary unless you want to try this for yourself.
Consider the case where the camera is moving relative to the world, and independently an object is moving relative to the camera and the world. Let $\mathcal{W}$ be the coordinate frame attached to the static world, and $\mathcal{O}$ the coordinate frame attached to the moving object. Write $\mathcal{C}_1$ and $\mathcal{C}_2$ for the camera frames at the times of the first and second images, and $T_{AB}$ for the pose of frame $B$ expressed in frame $A$.
Assuming we can separate points belonging to the world from points belonging to the object, we may estimate the motion of the camera through the world from the first pose to the second as $T_{\mathcal{C}_1 \mathcal{C}_2}$. Similarly, we can estimate the motion of the camera relative to the object (i.e. in the object frame) as $\hat{T}_{\mathcal{C}_1 \mathcal{C}_2}$, the motion the camera would appear to have undergone if the object were static.
Assume we have an estimate for the transform between the coordinate frame attached to the camera and that attached to the object, for the first image: $T_{\mathcal{C}_1 \mathcal{O}_1}$. Note that we may attach a coordinate frame to the object anywhere we like, including at a point outside the object itself. A natural choice for an arbitrary object may be the geometric centroid.
Assume also that we know the initial position of the camera in the world, $T_{\mathcal{W} \mathcal{C}_1}$; we can choose this arbitrarily to e.g. put the camera at the world origin.
Then to get the transform corresponding to the position of the object in the world frame at the time of the second image,

$$T_{\mathcal{W} \mathcal{O}_2} = T_{\mathcal{W} \mathcal{C}_2} \, T_{\mathcal{C}_2 \mathcal{O}_2}$$

(The world frame is the same for all times, i.e. $\mathcal{W}_1 = \mathcal{W}_2 = \mathcal{W}$.)

$$T_{\mathcal{W} \mathcal{C}_2} = T_{\mathcal{W} \mathcal{C}_1} \, T_{\mathcal{C}_1 \mathcal{C}_2}$$

(By chaining coordinate transforms.)

$$T_{\mathcal{C}_2 \mathcal{O}_2} = \hat{T}_{\mathcal{C}_1 \mathcal{C}_2}^{-1} \, T_{\mathcal{C}_1 \mathcal{O}_1}$$

(The relative pose between camera and object at a given time is the same in any coordinate frame, since we observe the relative pose at the same time.)
Finally we have:

$$T_{\mathcal{W} \mathcal{O}_2} = T_{\mathcal{W} \mathcal{C}_1} \, T_{\mathcal{C}_1 \mathcal{C}_2} \, \hat{T}_{\mathcal{C}_1 \mathcal{C}_2}^{-1} \, T_{\mathcal{C}_1 \mathcal{O}_1}$$

(Because $\mathcal{O}$ is fixed to the object, the object is static in its own frame, so all the relative motion appears in $\hat{T}_{\mathcal{C}_1 \mathcal{C}_2}$. Also, taking the inverse of a transform swaps its frames: $T_{AB}^{-1} = T_{BA}$.)
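The chaining can be sanity-checked numerically with random rigid body transforms: construct ground-truth camera and object poses, compute the two quantities the alignments would estimate (the camera motion from world points, and the apparent camera motion in the object frame from object points), and verify the chained expression recovers the object’s second pose. All names here are illustrative.

```python
import numpy as np

def rot(axis, angle):
    """Rotation matrix about a unit axis (Rodrigues' formula)."""
    a = np.asarray(axis, float) / np.linalg.norm(axis)
    K = np.array([[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def se3(R, t):
    """Homogeneous 4x4 transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

inv = np.linalg.inv

# Ground-truth poses in the world frame: camera and object at times 1 and 2.
T_WC1 = se3(rot([0, 0, 1], 0.2), [1.0, 0.0, 0.0])
T_WC2 = se3(rot([0, 1, 0], -0.3), [1.2, 0.1, 0.0])
T_WO1 = se3(rot([1, 0, 0], 0.5), [0.0, 2.0, 1.0])
T_WO2 = se3(rot([1, 1, 0], 0.7), [0.3, 2.1, 0.8])

# What the two alignments would estimate:
T_C1C2 = inv(T_WC1) @ T_WC2                # camera motion, from world points
T_hat = inv(inv(T_WO1) @ T_WC1) @ (inv(T_WO2) @ T_WC2)  # apparent camera motion in the object frame
T_C1O1 = inv(T_WC1) @ T_WO1                # initial camera-to-object pose

# The chained expression recovers the object's second pose in the world.
T_WO2_est = T_WC1 @ T_C1C2 @ inv(T_hat) @ T_C1O1
```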
Now that we have done some torturous manipulations to get to an intuitive result, we can do what we already knew we could, but using math.
Photometric consistency violation for motion detection
In the case of a static world and a moving camera, a single transform, corresponding to the change in pose of the camera, is enough to explain what we see between image frames. We reproject the static world from one image to the other, and everything lines up so the photometric residual is everywhere zero. Analogously, if there’s one moving object and a static camera, a single transform is sufficient.
In the case of two or more independently moving objects, one transform is no longer enough, and we get the situation illustrated below:
Here two objects, A and B, are moving relative to a static camera. On the left we see the initial position of the objects in space, as well as where they’re projected into the image. In the center is what we observe in the second image. On the right is what we observe if we apply $T_A$ to points on both objects in the first frame. Wavy lines correspond to a large photometric residual.
$T_A$ is the relative pose of object A, and $T_B$ is the relative pose of object B. If we apply the reprojection function with $T_A$ to points on both A and B, we get large photometric residuals at the image points marked by wavy lines: one set where B actually is but we didn’t expect it to be, and one set where we expected B to be but it isn’t.
Direct photometric approaches usually discard points with large photometric residuals and only use points from the static world. Instead, we can use the large residuals created by moving objects as a cue that such an object might be present. Under the assumption that moving objects correspond to contiguous regions of pixels in the first image, we can use a Markov random field / graph-cut to segment them, as in the classic approach of Greig et al.
It would be nice if we could use this approach directly to find moving objects but there are some problems.
Large photometric residuals might be caused by things other than objects; reflections, lighting changes, sensor noise etc.
We can differentiate between motion and other sources of large photometric residuals by simply trying to estimate a motion for each high-residual region. If we can’t get an estimate, we can’t track anyway, and it’s likely the region conforms to some other source of high residual.
A trickier problem is that this method relies on object and background having enough texture. Unfortunately, this isn’t the case for many objects of interest.
Here a white van is moving from left (a) to right (b), as observed from a static camera. The resulting photometric residuals are shown in (c); brighter pixels have a larger magnitude.
The important things to note are:
- Parts belonging to the moving object don’t have a large residual, because of the lack of texture (e.g. the center/front of the van).
- Parts not belonging to the moving object do have a large residual, because the object occludes them (e.g. the vegetation to the left of the van).
If we accumulate enough image frames from the sequence, it might be possible to disambiguate between regions with large residuals due to occlusion and those belonging to the moving object. It may also be possible to recover all the pixels belonging to the moving object even if they don’t have much texture. We didn’t attempt that here, but it would be interesting future work. Additionally, this doesn’t admit real-time operation, since we need to observe several, potentially many, frames before we have sufficient information.
Instead, we took the easy way out and used a standard semantic instance segmentation approach, using a pre-trained Mask R-CNN. The segmentation masks that Mask R-CNN produces aren’t perfect, but we can refine them. We first dilate them slightly (2-3 px), then, after tracking using direct alignment, we exclude pixels with high photometric residuals from the segmentation. This doesn’t improve things much frame-to-frame but can provide a more consistent segmentation over time. We didn’t explore this idea in this work but it would be interesting to look into in the future.
The core algorithm is as follows:
- Initialize a Keyframe consisting of an image frame, registered depth values, and an instance segmentation mask.
- For a new input image frame, estimate a rigid body transform using all pixels in the keyframe.
- Compute regions with large photometric residuals for the estimated transform.
- Determine whether each such region corresponds to an instance segmentation. We used a simple intersection-over-union threshold.
  - Regions with large residuals that have a corresponding segmentation mask are treated as independently moving objects.
  - Regions without large residuals are treated as belonging to the static world.
  - Regions with large residuals but no corresponding segmentation mask are discarded.
- Estimate the motion of each object instance using all the corresponding pixels in the keyframe.
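The correspondence test between high-residual regions and segmentation masks can be sketched as follows. For brevity this scores each instance mask against the high-residual mask as a whole rather than per connected region, and the threshold value is illustrative, not the one we tuned:

```python
import numpy as np

def moving_instances(high_residual, instance_masks, iou_thresh=0.3):
    """Return indices of instance masks whose intersection-over-union with
    the high-residual regions is large enough for the instance to be
    treated as an independently moving object."""
    moving = []
    for i, mask in enumerate(instance_masks):
        inter = np.logical_and(mask, high_residual).sum()
        union = np.logical_or(mask, high_residual).sum()
        if union > 0 and inter / union >= iou_thresh:
            moving.append(i)
    return moving
```

Instances that never light up in the residual image simply stay in the static-world set and get no motion estimate of their own.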
Note that unlike most approaches that use instance segmentation in the motion estimation pipeline, we don’t try to estimate an independent motion for every instance, only for those we have determined to be moving. This has several advantages: the required computation is reduced, and static objects are not assigned small (erroneous) motions due to noisy data and estimates from a small number of points.
We further refine the segmentation and transform estimates in an alternating optimization (so-called ‘hard EM’, analogous to coordinate ascent). After computing outlying regions or pixels, we re-estimate the transform with only low residual pixels, initializing with the previously estimated transform. We repeat this process until convergence. This is not especially rigorous but does lead to fast convergence. A better approach for this refinement might be something like the multi-label joint segmentation-motion approach presented in Efficient Dense Rigid-Body Motion Segmentation and Estimation in RGB-D Video, but this would generally be slower. Additionally, the segmentation is not guaranteed to find a global minimum, though we suspect that it would converge reasonably, except in pathological cases.
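The alternating refinement can be illustrated with a deliberately tiny analogue: a scalar ‘motion’ (a constant offset between two signals) estimated while re-labelling inliers by residual magnitude. This is just the shape of the loop, not the actual implementation:

```python
import numpy as np

def hard_em_offset(a, b, thresh, iters=10):
    """Alternating ('hard EM') estimation of a scalar offset t such that
    b ~ a + t, where some samples follow a different motion. Alternates a
    least-squares fit on the current inliers with re-segmentation of the
    inlier set by residual size."""
    inliers = np.ones(len(a), dtype=bool)
    t = 0.0
    for _ in range(iters):
        if not inliers.any():
            break
        t = float(np.mean(b[inliers] - a[inliers]))  # fit motion on inliers
        inliers = np.abs(b - (a + t)) < thresh       # re-segment by residual
    return t, inliers
```

As in the real pipeline, the first fit is biased by the outliers, but once they fall outside the residual threshold the estimate converges to the dominant motion.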
As presented, we track the camera motion and the moving objects against a single keyframe. It’s pretty straightforward to extend this approach to multiple keyframes; all that is required is to carefully handle the world coordinate frame, and to perform association between object instances in subsequent keyframes. We perform association through intersection-over-union of the reprojected objects with instances in the new keyframe, but more sophisticated approaches are possible.
Though we don’t yet pursue it here, a natural extension is to perform graph based bundle adjustment / joint optimization with multiple observations of the same object from multiple keyframes.
We performed experiments with the Oxford Multimotion Dataset (OMMD).
An example frame from OMMD. The boxes move independently with translational, rotational, and compound motions.
This dataset provides ground truth motions for several independently moving objects, observed from both static and moving cameras, in a fairly representative indoor scene. It comes with several sequences with both calibrated stereo and RGB-D data, as well as IMU data and Vicon tracking as ground truth.
The raw RGB-D data is unusable without considerable filtering of the depth images, and is at a lower resolution and frame rate than the stereo data. To get usable depth we used SPS-Stereo, which gave good disparity estimates, since most surfaces in the scene are planar. Converting disparity into depth is straightforward using the camera intrinsics provided with the dataset.
Left: Raw depth data from the Intel Realsense stream included in the dataset. Right: Disparity estimate from SPS Stereo. Still noisy but much cleaner, and usable for our experiments.
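The disparity-to-depth conversion is the usual pinhole-stereo relation $z = f_x b / d$; a minimal sketch (the focal length and baseline in the test are made up for illustration, not OMMD’s actual calibration):

```python
import numpy as np

def disparity_to_depth(disparity, fx, baseline):
    """Pinhole-stereo depth from disparity: z = fx * baseline / d,
    with fx in pixels and baseline in metres. Zero or negative
    disparities carry no depth information and map to +inf."""
    d = np.asarray(disparity, dtype=float)
    z = fx * baseline / np.maximum(d, 1e-9)  # avoid division by zero
    return np.where(d > 0, z, np.inf)
```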
Additionally, the colorful boxes that are the moving objects of interest in this dataset don’t belong to any classes that Mask R-CNN recognizes, so we need some other way to make the required instance segmentations. Fortunately, the faces of the boxes are highly saturated and high-contrast, and they’re far enough apart that simple color thresholding followed by connected components gets us a reasonable segmentation without much work. An MRF binary segmentation pass helps smooth out the noise.
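A minimal sketch of that segmentation idea, saturation thresholding followed by 4-connected component labelling (the threshold and toy image are illustrative, and the MRF smoothing pass is omitted):

```python
import numpy as np
from collections import deque

def saturation_mask(rgb, s_thresh=0.5):
    """Binary mask of high-saturation pixels; rgb is float in [0, 1]."""
    mx, mn = rgb.max(axis=2), rgb.min(axis=2)
    sat = np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-9), 0.0)
    return sat > s_thresh

def connected_components(mask):
    """4-connected component labelling by BFS flood fill; 0 = background."""
    labels = np.zeros(mask.shape, dtype=int)
    H, W = mask.shape
    n = 0
    for sy in range(H):
        for sx in range(W):
            if mask[sy, sx] and labels[sy, sx] == 0:
                n += 1
                q = deque([(sy, sx)])
                labels[sy, sx] = n
                while q:
                    y, x = q.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < H and 0 <= nx < W and mask[ny, nx] and labels[ny, nx] == 0:
                            labels[ny, nx] = n
                            q.append((ny, nx))
    return labels, n
```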
We have color, depth, and an instance segmentation, which is everything we need to construct a keyframe.
We evaluate our approach against the ground truth provided by OMMD. We apply the appropriate coordinate transforms to get everything into the same frame. Here are some results of our approach.
Tracks for objects with translational motions only. Box 1 (left) moves back and forth relative to the camera, Box 3 (right) moves side to side. Blue is ground truth, green is the naive motion estimate, orange is the joint segmentation-motion estimate.
We’re able to get reasonable tracks for objects with translational motions from a static camera. Unfortunately, tracking rotating objects with this approach doesn’t work well:
Tracks for objects with rotational motion. Box 2 (left) rotates on the spot while Box 4 (right) both rotates and translates. The ground truth track for Box 4 is not visible because the tracked trajectory completely diverges very quickly.
Why is this so? This result is somewhat unexpected, since rotational motions are not generally more difficult to track than translational motions when direct photometric alignment is applied to camera tracking. A clue is that direct photometric algorithms tend to be more susceptible to tracking loss for large motions between frames, especially translational motions.
Induced Virtual Translation
The (as yet not fully tested) theory I have about what’s going on is that tracking a rotating object introduces what can be thought of as a large ‘virtual’ translation. Consider the following diagram:
An object (green) is rotating clockwise in space. The relative motion of the camera (black) has a rotational and a translational component, in the opposite direction to the rotation of the object. The translational component increases as the radius of the arc increases. The rotational components are equal and opposite for camera and object.
This ‘virtual’ translation will be large in the camera origin frame even though the relative rotation might be small: the distance between the camera center and the center of rotation of the object determines the radius of the induced arc, and hence the arc length itself can be quite large.
Because the target image frames are discrete in time, the virtual motion between frames has as its translational component the chord between the start and end points of the virtual arc. This is given as $c = 2r\sin(\theta/2)$, where $r$ is the radius and $\theta$ is the rotation angle. Taking the small-angle approximation, $c \approx r\theta$: even for small angles, the chord length, and hence the virtual translational component, is linear in the distance from camera to object.
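The chord computation itself is trivial, but it makes the linear dependence on the radius easy to check numerically:

```python
import math

def virtual_translation(radius, theta):
    """Chord length of the arc swept between two discrete frames by an
    object rotating by theta about a center at the given radius from the
    camera: c = 2 r sin(theta / 2)."""
    return 2.0 * radius * math.sin(theta / 2.0)
```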
I haven’t examined this theory rigorously yet, but intuitively it seems very plausible. Testing it would require computing cost surfaces over a synthetic dataset, which seems very error prone and tedious, so I haven’t set about doing it yet.
In the direct photometric alignment context, large motions between image frames have been compensated for by using a coarse-to-fine approach, where alignment is first performed on recursively subsampled (lower-resolution) versions of the keyframe and target image frame, usually with three or four levels of subsampling forming a so-called ‘pyramid’. Doing so broadens the basin of attraction of our optimization algorithm, allowing it to arrive at the desired minimum even if we initially start further from it. Unfortunately, this approach is not available to us in the object tracking application, because our objects of interest represent only a small fraction of the total pixels in a given keyframe. We can therefore only use one or two pyramid levels, which are not enough to compensate for the large induced translation.
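For reference, building such a pyramid is just repeated 2x subsampling; the sketch below uses mean-pooling as a crude stand-in for the usual blur-then-subsample:

```python
import numpy as np

def build_pyramid(img, levels=4):
    """Coarse-to-fine pyramid by repeated 2x mean-pooling. Alignment
    would start at the coarsest level, pyr[-1], and propagate the
    estimate down, scaling translations by 2 at each finer level."""
    pyr = [np.asarray(img, dtype=float)]
    for _ in range(levels - 1):
        a = pyr[-1]
        h, w = (a.shape[0] // 2) * 2, (a.shape[1] // 2) * 2  # crop to even size
        pyr.append(a[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyr
```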
Another approach is to reformulate the reprojection function so as to perform alignment in the object’s rather than the camera’s origin frame. This would work fine if the camera were stationary in the world, but doesn’t help us in the case of a moving camera, since the camera cannot disambiguate its own motion relative to the object from the object’s motion in the object’s origin frame. I’ll save you another round of tortuous coordinate manipulation, but demonstrating this is straightforward.
One final idea is to use a feature-based approach as an initialization for the direct alignment approach. This paper from Judd et al. demonstrates such a feature-based approach for multi-object and camera motion tracking. This should get us closer to the desired minimum in the basin of attraction of our optimization algorithm, but on the other hand initializing from features negates many of the advantages of direct photometric alignment, is kind of ugly, and we might as well return to the caves. That said, it’s likely that this will be the approach I take for the next steps until I can think of something better.
Some additional notes
Joint segmentation-motion estimation did not significantly improve our tracking results, but significantly increased the runtime for each instance, since we need to perform the (expensive) direct alignment step multiple times. I suspect this is because we got reasonably good segmentations right away, and hence jointly optimizing over the segmentation and motion did not produce a significant improvement over just estimating the motion. We investigated the segmentation performance in depth, but the results were not very interesting, so I omit them here. It would be interesting to find the point where this tradeoff starts to matter; one could imagine a very simple blob detector being good enough for initialization here, then further refined via the photometric cost term. This is something I also haven’t done yet, due to the lack of a suitable dataset and the tedious experimentation required.
The consistent misalignment between ground truth and estimated frames is probably due to a fixed misalignment between the initial coordinate frames; we could evaluate our results by first performing e.g. Horn alignment between the estimated and ground-truth trajectories, as is commonly done and suggested in Sturm et al. (2012), but we felt showing the raw results is more representative of the performance of the system.
You have probably noticed that we don’t give any timing benchmarks; that’s because we did not attempt to produce a performant implementation, and a lot of the visualization cruft slows the runtime considerably. That said, we’re currently working on a GPU implementation and framework for the underlying algorithms that should speed things up considerably. Because each pixel in the keyframe is treated independently in the cost (it’s just a sum), the costs and Hessians etc. can be computed in parallel, making this approach particularly amenable to GPU implementations.
We also performed experiments on the Virtual KITTI (VKITTI) dataset which is a rendered version of the commonly used KITTI dataset, providing ground truth segmentation and depth, as well as varying weather and lighting conditions, and varying camera angles. The moving objects are various kinds of vehicles, moving on the road plane. We got good results but they’re less interesting than those discussed above, so I have omitted them in this write-up.
So where to from here? We’ve demonstrated a proof-of-concept system, but we’ve also run into a fundamental problem with our approach, for which I haven’t yet come up with a good solution. That said, there is still plenty to do:
In the spirit of DSO by Engel et al., the next step should be to move to a full joint optimization framework using only photometric costs. Specifically, jointly estimating depth should become part of the cost function we minimize. This will probably require using the horrible feature-based initialization we discussed earlier, but would be a significant step forward; if nothing else we would be able to perform much richer, dense reconstructions, which are impossible with feature-based approaches alone.
Work like this requires not just better datasets, but better tools for creating synthetic datasets for computer vision applications. Currently these are limited to game engine based tools like CARLA or else basically Blender scripts. I’d love a tool that gives me what I can do with Blender scripts with better ergonomics so that I can quickly generate datasets and variations on the fly.
Temporal consistency of the segmentations and photometric residuals is a direction we started down but did not fully exploit in this work. However, as mentioned previously, I don’t believe that we really need the instance segmentation, and that given a long enough sequence we should be able to find a consistent segmentation for entire moving objects if we perform joint segmentation-motion optimization over the whole sequence. This will probably come down to some discrete data association problem but would be interesting to tackle.
Put the whole thing on the GPU. The main reason this is hard is debugging tools for GPU code are pretty bad and it’s easy to make mistakes in the numerics.
Thanks for reading. If you’re interested in learning about what we did in more depth, have comments, or would like to know more about my research interests in geometric computer vision please feel free to contact me.