Visual SLAM from a monocular camera is accomplished through two broad classes of algorithms:

1. Feature-based methods, such as ORB-SLAM, which rely on identifying corresponding points across a sequence of images.
2. The direct alignment approaches I’ll discuss here, which use intensity information from the entire image.
Feature-based approaches are typically more robust to common effects like lighting changes, motion blur, and rolling-shutter artifacts, owing to the (usually hand-tuned) features used to construct correspondences. Feature descriptors also allow relatively straightforward loop-closure detection and relocalization using bag-of-words approaches, and feature-based methods are usually less computationally intensive than direct alignment. For these reasons and others, feature-based approaches are the most widely deployed, particularly for camera tracking, as in Apple’s ARKit.
In contrast, under favorable conditions (fixed lighting or compensated exposure, global shutter, high frame rates), direct methods outperform feature-based methods in tracking accuracy. Because information from the entire image is used, direct methods also generate a far denser map, and tend to be more robust in low-texture environments where feature-based methods cannot extract sufficient feature points. An in-depth technical comparison is available in Yang et al.
The direct methods also have the advantage of mathematical consistency and transparency, because the model and subsequent optimization problem are constructed in the same domain as the data. This leads to interpretable results and straightforward parametrization. As a side-effect, we have an end-to-end differentiable residual (loss), which is of interest in machine-learning applications. Because our optimization problem is constructed as a functional of each pixel in the image sequence, it is inherently highly parallelizable.
Recent advances have started to overcome some of the shortcomings of direct approaches in the presence of image-formation artifacts, and advances in commodity hardware have reduced the impact of the required computation.
What follows is a publication genealogy of direct alignment approaches to visual SLAM. These papers can be considered a primer or curriculum for getting familiar with the state of the art in direct photometric SLAM.
This paper details the fundamental algorithm behind all direct photometric approaches to monocular SLAM. It’s mandatory reading. The Lucas-Kanade algorithm was originally used to track planar objects using affine or translational transforms, but it is readily extended to rigid-body motion.
All direct methods include a cost based on the photometric error, which is the difference in intensity between a pixel in some reference image frame (often referred to as the keyframe) and a corresponding pixel in some target image frame. The correspondence is computed using a ‘warp’ function which maps coordinates in the reference to coordinates in the target, with a corresponding parametrization.
Lucas-Kanade (LK) tracking seeks to minimize the sum of squared photometric errors between the reference frame (sometimes called the template) and the target frame. This is accomplished through an iterative non-linear least-squares approach, using the Gauss-Newton algorithm to find the parametrization of the warp function that minimizes this error.
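In symbols (notation assumed here, following the common presentation; the paper’s own symbols may differ), with template $T$, image $I$, and warp $W(\mathbf{x};\mathbf{p})$ with parameters $\mathbf{p}$:

```latex
% Sum of squared photometric errors over pixels x
E(\mathbf{p}) = \sum_{\mathbf{x}} \left[ I(W(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) \right]^2

% Gauss-Newton update, with p <- p + \Delta p iterated to convergence
\Delta\mathbf{p} = H^{-1} \sum_{\mathbf{x}}
  \left[ \nabla I \, \tfrac{\partial W}{\partial \mathbf{p}} \right]^{\top}
  \left[ T(\mathbf{x}) - I(W(\mathbf{x};\mathbf{p})) \right],
\qquad
H = \sum_{\mathbf{x}}
  \left[ \nabla I \, \tfrac{\partial W}{\partial \mathbf{p}} \right]^{\top}
  \left[ \nabla I \, \tfrac{\partial W}{\partial \mathbf{p}} \right]
```

Here $\nabla I$ is the image gradient evaluated at the warped coordinates, and $H$ is the Gauss-Newton approximation to the Hessian.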
You should really try to understand this paper before moving on. The translational and affine warps presented are the simplest parametrizations of the warp function and make the rest relatively straightforward to understand.
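To make that concrete, here is a minimal NumPy sketch of LK with the simplest (purely translational) warp. The function names and the synthetic setup are my own, not from the paper; a real implementation would add coarse-to-fine pyramids and robust weighting.

```python
import numpy as np

def warp_translation(image, p):
    """Sample image at coordinates shifted by p = (px, py), bilinearly."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    xs, ys = xs + p[0], ys + p[1]
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    ax = np.clip(xs - x0, 0.0, 1.0)
    ay = np.clip(ys - y0, 0.0, 1.0)
    return ((1 - ay) * (1 - ax) * image[y0, x0]
            + (1 - ay) * ax * image[y0, x0 + 1]
            + ay * (1 - ax) * image[y0 + 1, x0]
            + ay * ax * image[y0 + 1, x0 + 1])

def lk_translation(ref, target, iters=50):
    """Estimate the translation p that minimizes the sum of squared
    photometric errors, via Gauss-Newton iteration."""
    p = np.zeros(2)
    for _ in range(iters):
        warped = warp_translation(target, p)
        r = (warped - ref).ravel()            # photometric residual
        gy, gx = np.gradient(warped)          # image gradients at warped coords
        J = np.stack([gx.ravel(), gy.ravel()], axis=1)  # dr/dp
        H = J.T @ J                           # Gauss-Newton Hessian
        dp = np.linalg.solve(H, -J.T @ r)
        p = p + dp
        if np.linalg.norm(dp) < 1e-6:
            break
    return p
```

For a translational warp the Jacobian of the warp is the identity, so the residual Jacobian is just the image gradient; richer warps (affine, SE(3)) only change that Jacobian.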
DTAM is, to my knowledge, the first example of a direct photometric approach to the complete monocular SLAM problem. The paper focuses on the then-novel approach of constructing a dense 3D map and subsequently performing tracking against it via direct alignment.
Given a reference frame, a ‘cost volume’ is constructed by computing an average photometric error over many target frames as a function of depth in the reference frame, for each reference pixel. The optimal depth is then computed for each pixel based on minimizing this average error.
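Here is a heavily simplified sketch of that construction. I use a rectified-stereo stand-in, where each target camera is translated purely horizontally so the inverse-depth hypothesis reduces to a per-pixel disparity; DTAM itself uses full SE(3) poses and projective warps, and all names below are my own.

```python
import numpy as np

def photometric_cost_volume(ref, targets, baselines, disparities):
    """Average photometric error per pixel, per hypothesized disparity.

    Simplification: target i is the scene seen from a camera translated
    horizontally by baselines[i], so a reference pixel at disparity d
    (proportional to inverse depth) appears d * baselines[i] pixels away.
    """
    h, w = ref.shape
    cost = np.empty((len(disparities), h, w))
    xs = np.arange(w)
    for k, d in enumerate(disparities):
        errs = []
        for img, b in zip(targets, baselines):
            # Sample each target at the hypothesized correspondence
            x_src = np.clip(np.round(xs + d * b).astype(int), 0, w - 1)
            errs.append(np.abs(img[:, x_src] - ref))
        cost[k] = np.mean(errs, axis=0)   # average over target frames
    return cost

def winner_take_all(cost, disparities):
    """Optimal per-pixel disparity: the hypothesis with minimum average error."""
    return np.asarray(disparities)[cost.argmin(axis=0)]
```

The winner-take-all step is the per-pixel minimization described above; DTAM replaces it with a regularized optimization over the whole volume.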
In regions with little texture, the cost for a large subset of possible depths for a given pixel will be equal, and hence it is difficult to find a minimum. Since DTAM attempts to construct a dense 3D map (i.e. one in which each reference pixel is assigned some optimal depth), the authors add a geometric prior that results in smooth depths (and hence smooth 3D surfaces) except at edges. This is effective, but adds a coupling between neighboring pixels which we will see creates problems later on.
Optimizing the cost-volume functional is not trivial, and the authors go into depth on their specific approach using a dual form. The optimization reduces to sampling within an increasingly constrained region based on a quadratic approximation.
The model is initialized from a feature-based method in order to bootstrap the first set of frames used to construct the cost volume. Tracking is performed against the dense model using direct alignment. Here the warp corresponds to a reprojection from the reference to the target frame, given the relative pose of the camera and the dense depth model. The warp is thus parametrized by the relative camera pose of the target frame, using elements of the Lie group SE(3), with updates in the corresponding Lie algebra.
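A sketch of that reprojection warp, assuming a standard pinhole model (names are my own; the se(3) exponential-map pose updates are omitted):

```python
import numpy as np

def reproject(x_ref, depth, T, K):
    """Warp pixel coordinates from the reference frame to the target frame.

    x_ref : (N, 2) pixel coordinates in the reference image
    depth : (N,)   depth of each pixel in the reference frame
    T     : (4, 4) relative pose (reference -> target), an element of SE(3)
    K     : (3, 3) pinhole camera intrinsics
    """
    # Back-project pixels to 3D points in the reference camera frame
    ones = np.ones((x_ref.shape[0], 1))
    rays = np.linalg.inv(K) @ np.hstack([x_ref, ones]).T  # (3, N) unit-depth rays
    pts = rays * depth                                    # scale rays by depth
    # Rigid-body transform into the target camera frame
    pts_t = T[:3, :3] @ pts + T[:3, 3:4]
    # Project with the pinhole model
    proj = K @ pts_t
    return (proj[:2] / proj[2]).T
```

The photometric error is then the intensity difference between each reference pixel and the target image sampled at these warped coordinates, exactly as in LK but with the pose (and here the fixed dense depth) as the warp parameters.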
Important to note is that tracking and mapping are performed as separate steps, as in the venerable feature-based PTAM paper. This is not very neat, since tracking and mapping are in reality coupled. Generally this coupling is addressed in feature-based methods using full bundle adjustment, where the map and camera poses are jointly optimized. Full bundle adjustment is however typically infeasible for real-time operation, especially for dense direct methods which would add a very large number of factors into the optimization problem.
We shall later see different approaches to addressing this problem.