Introduction: Building on Simulation Foundations

In the previous installment of this series, we completed a crucial preparatory step—constructing a "virtual world" that adheres to strict physical optics laws and photogrammetry specifications. By simulating a UAV aerial survey workflow, we generated a synthetic dataset containing real terrain (DEM), camera poses, and feature point observations (tracks). This dataset grants us a "god's eye view": before the algorithm even runs, we already know the ground truth of the reconstruction.

With reliable test data in hand, our next objective targets the core optimization workflow of Incremental Structure from Motion (Incremental SFM). SFM is fundamentally a massive nonlinear least squares problem, aiming to simultaneously recover the camera's three-dimensional motion trajectory (Motion) and the scene's geometric structure (Structure) from an unordered sequence of two-dimensional images.

We chose incremental SFM as the implementation target for this series because its logic is intuitive: start simple and build up step by step. As introduced in earlier installments, sub-problems like PnP (Perspective-n-Point), triangulation, and bundle adjustment form the core building blocks of incremental SFM. Before reading this article, it's therefore highly recommended to review the prerequisite articles in this series covering those sub-problems.

On the other hand, we must clearly recognize that an industrial-grade SFM system needs to handle massive amounts of real-world data, riddled with feature matching errors (outliers), lighting variations, dynamic objects, and image sequences with extremely short baselines. To ensure beginners can grasp the essence of the problem without being overwhelmed by complex robustness processing logic early on, this implementation will temporarily strip away these interfering factors.

Based on the "perfect" synthetic data generated in Article 17 (meaning: no outlier mismatches, no missed matches, and ideal baseline conditions between images), we will focus on implementing the core skeleton logic of incremental SFM.

Preparation: Establishing Test Benchmarks and Constraints

Before diving into the core logic of incremental SFM, we need to establish the test benchmarks and constraints for this implementation.

Based on the synthetic data generated in the previous chapter, we could construct various test scenarios. To keep readers focused on the core SFM workflow, and to avoid distraction from complex robustness processing and geometric degeneracy problems, this article adopts a visual-inertial fusion scenario relying on 2D-2D matches plus INS/GNSS priors as the primary test environment.

Why not use a pure vision scenario requiring only 2D-2D data for testing? There are two main reasons:

Initialization Difficulties

Pure vision workflows typically require solving epipolar geometry first (the essential or fundamental matrix), which is itself a complex sub-problem involving RANSAC and nonlinear optimization. Furthermore, decomposing the essential matrix yields four candidate relative poses, and the correct one must be selected via a cheirality (positive-depth) check.

Scale Ambiguity

Pure vision can recover structure only up to an unknown global scale; it provides no metric scale. This scale ambiguity, and the drift it causes, introduces additional complexity into pose propagation during incremental reconstruction, increasing debugging and comprehension difficulty.

In contrast, introducing noisy GNSS/IMU priors (the "visual-inertial fusion" scenario) provides reliable initial guesses and absolute scale constraints for the optimization. This not only allows the nonlinear optimization (Bundle Adjustment) to converge faster, but also lets us skip the tricky two-view initialization phase and jump directly into the core implementation of incremental expansion and optimization.

Three Key Sub-Problem Implementations

Before formally implementing the incremental SFM core workflow, let's prepare implementations for three critical sub-problems:

1. PnP Problem Implementation

The PnP (Perspective-n-Point) problem solves for camera pose given 3D world points and their 2D image observations. Our implementation uses Ceres Solver for nonlinear optimization:

#pragma once
#include <Eigen/Core>
#include <Eigen/Geometry>
#include <vector>

class PnPProblem {
 public:
  PnPProblem(const Eigen::Matrix3d& K,
             const std::vector<Eigen::Vector3d>& points_world,
             const std::vector<Eigen::Vector2d>& observations);

  bool Solve(Eigen::Matrix3d& R_opt, Eigen::Vector3d& t_opt);

 private:
  Eigen::Quaterniond CeresQuatToEigen(const double* q_ceres);
  void EigenQuatToCeres(const Eigen::Quaterniond& q_eigen, double* q_ceres);

  double fx, fy, cx, cy;
  const std::vector<Eigen::Vector3d>& points_world;
  const std::vector<Eigen::Vector2d>& observations;
};

The core of the PnP solver uses reprojection error minimization through Ceres:

#include <ceres/rotation.h>

struct ReprojectionError {
  ReprojectionError(const Eigen::Vector3d& X_, const Eigen::Vector2d& obs_,
                    double fx_, double fy_, double cx_, double cy_)
    : X(X_), obs(obs_), fx(fx_), fy(fy_), cx(cx_), cy(cy_) {}

  template <typename T>
  bool operator()(const T* const q, const T* const t, T* residuals) const {
    T p_camera[3];
    T X_world[3] = {T(X.x()), T(X.y()), T(X.z())};

    // Ceres expects quaternion order [w, x, y, z]
    ceres::QuaternionRotatePoint(q, X_world, p_camera);

    p_camera[0] += t[0];
    p_camera[1] += t[1];
    p_camera[2] += t[2];

    T xp = p_camera[0] / p_camera[2];
    T yp = p_camera[1] / p_camera[2];

    T u = T(fx) * xp + T(cx);
    T v = T(fy) * yp + T(cy);

    residuals[0] = u - T(obs.x());
    residuals[1] = v - T(obs.y());

    return true;
  }

  Eigen::Vector3d X;
  Eigen::Vector2d obs;
  double fx, fy, cx, cy;
};

The solver initializes from an estimated pose, builds the Ceres problem with a quaternion manifold constraint on the rotation, adds one residual block per observation, and solves with the DENSE_QR linear solver. Success is determined by the convergence status and a final RMSE below a threshold (typically 6.0 pixels for synthetic data).

2. Triangulation Problem Implementation

Triangulation computes 3D point positions from multiple 2D observations and known camera poses:

#pragma once
#include <Eigen/Core>
#include <vector>

class TriangulateProblem {
 public:
  TriangulateProblem(const Eigen::Matrix3d& K,
                     const std::vector<Eigen::Matrix3d>& Rs,
                     const std::vector<Eigen::Vector3d>& ts,
                     const std::vector<Eigen::Vector2d>& observations);

  bool Solve(Eigen::Vector3d& X_est);

 private:
  Eigen::Vector3d TriangulateDLT();

  const Eigen::Matrix3d& K;
  const std::vector<Eigen::Matrix3d>& Rs;
  const std::vector<Eigen::Vector3d>& ts;
  const std::vector<Eigen::Vector2d>& observations;
};

The triangulation process uses a two-stage approach:

Stage 1: DLT Initial Solution

The Direct Linear Transform (DLT) method provides an initial estimate by solving a homogeneous linear system. For N observations, we construct a 2N×4 matrix A where each observation contributes two rows:

A.row(2*i)   = u_i * P_i.row(2) - P_i.row(0)
A.row(2*i+1) = v_i * P_i.row(2) - P_i.row(1)

SVD decomposition yields the solution as the right singular vector corresponding to the smallest singular value.

Stage 2: Nonlinear Refinement

The DLT solution is refined through nonlinear optimization minimizing reprojection error across all views. The optimization uses Ceres with automatic differentiation, typically converging within 20 iterations.

Sign Correction:

Since homogeneous coordinates X and -X represent the same point, we must choose the sign that places the point in front of the most cameras (positive depth in camera coordinate system).

3. Bundle Adjustment Problem Implementation

Bundle Adjustment jointly optimizes all camera poses and 3D point positions:

#pragma once
#include <Eigen/Core>
#include <vector>

class BAProblem {
 public:
  struct CameraExtrinsics {
    double q[4];  // quaternion
    double t[3];  // translation
  };

  struct ObjectPoint {
    double xyz[3];
  };

  struct Observation {
    int cam_id;
    int pt_id;
    double x;
    double y;
  };

  BAProblem(const Eigen::Matrix3d& K, const std::vector<Observation>& obs,
            std::vector<CameraExtrinsics>& est_cams,
            std::vector<ObjectPoint>& est_pts);

  bool Solve();

 private:
  double ComputeRMSE();
  // ... member variables
};

The BA implementation uses a sparse Schur complement solver (SPARSE_SCHUR) for efficiency, with the first camera held fixed to eliminate gauge freedom. Quaternion manifold constraints keep the rotation parameters valid throughout the optimization.

Core Implementation: The Incremental SFM Pipeline

Incremental SfM is fundamentally an iterative closed loop of pose registration → structure expansion → global optimization. Leveraging the priors in the visual-inertial fusion data, the system starts from an initial seed pair, repeatedly finds 2D-3D correspondences between unregistered images and the current model, estimates each new image's pose via PnP, triangulates new 3D points, and runs global Bundle Adjustment to suppress accumulated error. The loop continues until all registrable images are reconstructed.

The core logic consists of four phases:

Phase 1: Initialization

Select the image pair with highest overlap to establish the baseline model:

void Init(const Eigen::Matrix3d& K, const sfm::Bundle& bundle,
          vector<cl::CameraExtrinsics>& cameraExtrinsics,
          vector<cl::ObjectPoint>& objectPoints) {
  // Find two images with highest matching point count
  map<pair<int, int>, int> imagePairCounter;
  for (const auto& track : bundle.tracks) {
    if (track.obs.size() < 2) continue;
    
    for (size_t i = 0; i < track.obs.size(); ++i) {
      for (size_t j = i + 1; j < track.obs.size(); ++j) {
        pair<int, int> imagePair{track.obs[i].imgId, track.obs[j].imgId};
        imagePairCounter[imagePair]++;
      }
    }
  }
  
  // Select pair with maximum matches
  int bestCount = 0;
  int image1 = -1, image2 = -1;
  for (const auto& [imagePair, count] : imagePairCounter) {
    if (count > bestCount) {
      bestCount = count;
      image1 = imagePair.first;
      image2 = imagePair.second;
    }
  }
  
  // Mark both cameras as registered
  cameraExtrinsics[image1].registered = true;
  cameraExtrinsics[image2].registered = true;
  
  // Triangulate initial points and run global BA
  TriangulateNewPoints(K, bundle, cameraExtrinsics, objectPoints);
  GlobalBundleAdjustment(K, bundle, cameraExtrinsics, objectPoints);
}

Phase 2: Pose Registration

Use PnP algorithm to "anchor" new images into the current model:

bool RegisterNextImage(const Eigen::Matrix3d& K, const sfm::Bundle& bundle,
                       vector<cl::CameraExtrinsics>& cameraExtrinsics,
                       vector<cl::ObjectPoint>& objectPoints) {
  // Build map from image ID to 3D-2D correspondences
  map<int, vector<pair<int, int>>> mapImgTo3dAnd2d;
  for (const auto& track : bundle.tracks) {
    for (const auto& ob : track.obs) {
      if (cameraExtrinsics[ob.imgId].registered) continue;
      if (objectPoints[track.pointId].triangulated) {
        mapImgTo3dAnd2d[ob.imgId].push_back({track.pointId, ob.kpId});
      }
    }
  }
  
  // Select image with most 3D-2D correspondences
  int bestImgId = -1;
  int maxPoints = 0;
  for (const auto& [imgId, ids] : mapImgTo3dAnd2d) {
    if (ids.size() > static_cast<size_t>(maxPoints)) {
      bestImgId = imgId;
      maxPoints = static_cast<int>(ids.size());
    }
  }
  
  if (bestImgId == -1) return false;
  
  // Prepare PnP input
  const auto& idsMap = mapImgTo3dAnd2d[bestImgId];
  vector<Eigen::Vector3d> points;
  vector<Eigen::Vector2d> observations;
  for (const auto& [id3d, id2d] : idsMap) {
    points.emplace_back(bundle.points[id3d]);
    observations.emplace_back(bundle.views[bestImgId][id2d]);
  }
  
  // Solve PnP
  Eigen::Matrix3d R_est = cameraExtrinsics[bestImgId].R;
  Eigen::Vector3d t_est = cameraExtrinsics[bestImgId].t;
  PnPProblem problem(K, points, observations);
  
  if (problem.Solve(R_est, t_est)) {
    cameraExtrinsics[bestImgId].R = R_est;
    cameraExtrinsics[bestImgId].t = t_est;
    cameraExtrinsics[bestImgId].registered = true;
    return true;
  }
  
  return false;
}

Phase 3: Structure Expansion

Triangulate new 3D points using newly registered cameras:

void TriangulateNewPoints(const Eigen::Matrix3d& K, const sfm::Bundle& bundle,
                          vector<cl::CameraExtrinsics>& cameraExtrinsics,
                          vector<cl::ObjectPoint>& objectPoints) {
  for (const auto& track : bundle.tracks) {
    if (objectPoints[track.pointId].triangulated) continue;
    
    // Collect observations from registered cameras
    vector<sfm::Observation> registerObservation;
    for (const auto& ob : track.obs) {
      if (cameraExtrinsics[ob.imgId].registered) {
        registerObservation.push_back(ob);
      }
    }
    
    // Need at least 2 views for triangulation
    if (registerObservation.size() < 2) continue;
    
    // Prepare triangulation input
    std::vector<Eigen::Matrix3d> Rs;
    std::vector<Eigen::Vector3d> ts;
    std::vector<Eigen::Vector2d> observations;
    for (const auto& ob : registerObservation) {
      Rs.push_back(cameraExtrinsics[ob.imgId].R);
      ts.push_back(cameraExtrinsics[ob.imgId].t);
      observations.push_back(bundle.views[ob.imgId][ob.kpId]);
    }
    
    // Attempt triangulation
    Eigen::Vector3d X = Eigen::Vector3d::Zero();
    TriangulateProblem triangulateProblem(K, Rs, ts, observations);
    if (triangulateProblem.Solve(X)) {
      objectPoints[track.pointId].pos = X;
      objectPoints[track.pointId].triangulated = true;
    }
  }
}

Phase 4: Global Optimization

Run Bundle Adjustment to jointly optimize all parameters:

bool GlobalBundleAdjustment(const Eigen::Matrix3d& K, const sfm::Bundle& bundle,
                            vector<cl::CameraExtrinsics>& cameras,
                            vector<cl::ObjectPoint>& objectPoints) {
  // Build mapping from original IDs to BA problem IDs
  map<int, int> cameraIdMap;
  vector<BAProblem::CameraExtrinsics> est_cams;
  for (size_t ci = 0; ci < cameras.size(); ++ci) {
    if (cameras[ci].registered) {
      BAProblem::CameraExtrinsics cam;
      Eigen::Quaterniond quaternion(cameras[ci].R);
      cam.q[0] = quaternion.w();
      cam.q[1] = quaternion.x();
      cam.q[2] = quaternion.y();
      cam.q[3] = quaternion.z();
      cam.t[0] = cameras[ci].t.x();
      cam.t[1] = cameras[ci].t.y();
      cam.t[2] = cameras[ci].t.z();
      cameraIdMap[ci] = est_cams.size();
      est_cams.push_back(cam);
    }
  }
  
  // Build observation list for BA
  vector<BAProblem::ObjectPoint> est_pts;
  vector<BAProblem::Observation> obs;
  map<int, int> pointIdMap;
  
  for (const auto& track : bundle.tracks) {
    if (objectPoints[track.pointId].triangulated) {
      BAProblem::ObjectPoint pt;
      // Initialize from the current estimate, not the ground-truth points
      pt.xyz[0] = objectPoints[track.pointId].pos.x();
      pt.xyz[1] = objectPoints[track.pointId].pos.y();
      pt.xyz[2] = objectPoints[track.pointId].pos.z();
      
      for (const auto& ob : track.obs) {
        auto iter = cameraIdMap.find(ob.imgId);
        if (iter != cameraIdMap.end()) {
          BAProblem::Observation observation;
          observation.cam_id = iter->second;
          observation.pt_id = est_pts.size();
          observation.x = bundle.views[ob.imgId][ob.kpId].x();
          observation.y = bundle.views[ob.imgId][ob.kpId].y();
          obs.push_back(observation);
        }
      }
      pointIdMap[track.pointId] = est_pts.size();
      est_pts.push_back(pt);
    }
  }
  
  // Solve BA problem
  BAProblem problem(K, obs, est_cams, est_pts);
  if (!problem.Solve()) return false;
  
  // Extract optimized results
  for (const auto& [oldId, id] : cameraIdMap) {
    Eigen::Quaterniond q(est_cams[id].q[0], est_cams[id].q[1],
                         est_cams[id].q[2], est_cams[id].q[3]);
    cameras[oldId].R = q.toRotationMatrix();
    cameras[oldId].t.x() = est_cams[id].t[0];
    cameras[oldId].t.y() = est_cams[id].t[1];
    cameras[oldId].t.z() = est_cams[id].t[2];
  }
  
  for (const auto& track : bundle.tracks) {
    if (objectPoints[track.pointId].triangulated) {
      auto iter = pointIdMap.find(track.pointId);
      if (iter != pointIdMap.end()) {
        objectPoints[track.pointId].pos.x() = est_pts[iter->second].xyz[0];
        objectPoints[track.pointId].pos.y() = est_pts[iter->second].xyz[1];
        objectPoints[track.pointId].pos.z() = est_pts[iter->second].xyz[2];
      }
    }
  }
  
  return true;
}

The Main Iteration Loop

The complete incremental SFM workflow ties everything together:

void Iterate(const Eigen::Matrix3d& K, const sfm::Bundle& bundle,
             vector<cl::CameraExtrinsics>& cameraExtrinsics,
             vector<cl::ObjectPoint>& objectPoints) {
  while (true) {
    cout << "--- New Iteration ---" << endl;
    
    // 1. Register next image
    bool success = RegisterNextImage(K, bundle, cameraExtrinsics, objectPoints);
    if (!success) {
      cout << "No more images to register, finishing." << endl;
      break;
    }
    
    // 2. Triangulate new points
    TriangulateNewPoints(K, bundle, cameraExtrinsics, objectPoints);
    
    // 3. Global Bundle Adjustment
    cout << "----GlobalBundleAdjustment----" << endl;
    if (!GlobalBundleAdjustment(K, bundle, cameraExtrinsics, objectPoints)) {
      break;
    }
  }
  
  cout << "=== SfM Complete ===" << endl;
  // Report statistics...
}

void IncrementalSFM(const Eigen::Matrix3d& K,
                    vector<cl::CameraExtrinsics>& cameraExtrinsics,
                    const sfm::Bundle& bundle,
                    vector<cl::ObjectPoint>& objectPoints) {
  Init(K, bundle, cameraExtrinsics, objectPoints);
  Iterate(K, bundle, cameraExtrinsics, objectPoints);
}

Accuracy Evaluation and Results

The implementation includes comprehensive reprojection error evaluation:

void EvaluateReprojectionError(
    const Eigen::Matrix3d& K, const sfm::Bundle& bundle,
    const std::vector<cl::CameraExtrinsics>& cameras,
    const std::vector<cl::ObjectPoint>& objectPoints,
    const std::string& reportFilePath) {
  
  double sum_squared_error = 0.0;
  double max_error = 0.0;
  int error_count = 0;
  
  // Collect per-camera statistics
  map<int, vector<double>> camera_errors;
  map<int, int> camera_obs_counts;
  
  for (const auto& track : bundle.tracks) {
    if (!objectPoints[track.pointId].triangulated) continue;
    
    Eigen::Vector3d P_world = objectPoints[track.pointId].pos;
    
    for (const auto& ob : track.obs) {
      int imgId = ob.imgId;
      int kpId = ob.kpId;
      
      if (!cameras[imgId].registered) continue;
      
      // Compute reprojection
      Eigen::Vector2d observed_point = bundle.views[imgId][kpId];
      Eigen::Matrix3d R = cameras[imgId].R;
      Eigen::Vector3d t = cameras[imgId].t;
      Eigen::Vector3d P_cam = R * P_world + t;
      
      if (P_cam.z() <= 0) continue;
      
      Eigen::Vector2d p_normalized(P_cam.x() / P_cam.z(),
                                   P_cam.y() / P_cam.z());
      Eigen::Vector2d projected_point =
          K.topLeftCorner<2, 2>() * p_normalized + K.topRightCorner<2, 1>();
      
      double error = (observed_point - projected_point).norm();
      sum_squared_error += error * error;
      max_error = std::max(max_error, error);
      error_count++;
      
      camera_errors[imgId].push_back(error);
      camera_obs_counts[imgId]++;
    }
  }
  
  // Compute RMSE (guarding against an empty observation set)
  double rmse =
      error_count > 0 ? std::sqrt(sum_squared_error / error_count) : 0.0;
  
  // Generate report...
}

Performance Results

The final accuracy report demonstrates excellent reconstruction quality:

=== SfM Reconstruction Accuracy Report ===

=== Overall Reprojection Error Evaluation ===
Total valid observation points: 213,825
Root Mean Square Reprojection Error (RMSE): 1.308902 pixels
Maximum reprojection error: 4.736810 pixels

Evaluation Result: Good (RMSE < 2.0 px)

=== Per-Camera Reprojection Error Statistics ===
CameraID  Observations  Mean Error(px)  Max Error(px)
------    ----------    ------------    -----------
0         669           1.139404        3.513298
1         800           1.143394        3.342467
...

Analysis and Interpretation

From the generated accuracy report, we can observe that the overall root mean square reprojection error (RMSE) is only 1.31 pixels. Recall that we added Gaussian noise with a standard deviation of 1 pixel per coordinate to the observations, which puts the theoretical residual-norm noise floor at about √2 ≈ 1.41 pixels; landing at 1.31 pixels despite biased initial poses is remarkably close to that floor, with the slight dip reflecting the optimizer fitting part of the noise.

This result indicates that the algorithm, after experiencing the complete optimization workflow from initial pose estimation through global bundle adjustment, successfully minimized the impact of both initial errors and random noise. The per-camera statistics show consistent performance across all 108 cameras, with mean errors ranging from 1.07 to 1.20 pixels—well within acceptable bounds for high-quality reconstruction.

Key Insights and Lessons Learned

1. Visual-Inertial Fusion Simplifies Initialization

By leveraging GNSS/IMU priors, we bypassed the challenging pure-vision initialization phase. This allowed us to focus on the core incremental SFM logic without getting bogged down in epipolar geometry estimation and scale ambiguity resolution.

2. Incremental Approach Provides Natural Error Control

The iterative registration-triangulation-optimization cycle naturally controls error accumulation. Each global BA step corrects accumulated drift before proceeding to the next image, maintaining reconstruction consistency throughout.

3. Ceres Solver Enables Efficient Optimization

The use of Ceres Solver with appropriate manifold constraints (quaternion manifold for rotations) and sparse linear algebra (SPARSE_SCHUR for BA) enables efficient optimization even with hundreds of cameras and thousands of 3D points.

4. Synthetic Data Validates Core Algorithm

The "perfect" synthetic data—free from outliers and matching errors—allowed us to validate the core algorithm without the complexity of robust estimation. This provides a solid foundation for future work on handling real-world data challenges.

Future Directions

While this implementation successfully demonstrates the core incremental SFM workflow, several important extensions remain for production systems:

  1. Robust Estimation: Incorporating RANSAC or other robust methods to handle outlier matches
  2. Loop Closure: Detecting and correcting drift when revisiting previously mapped areas
  3. Scalability: Implementing hierarchical or distributed approaches for large-scale reconstructions
  4. Real-time Processing: Optimizing for streaming input and incremental updates
  5. Multi-scale Features: Integrating feature detection and matching into the pipeline

Conclusion

This implementation provides a complete, working incremental SFM system that achieves sub-2-pixel reprojection error on synthetic data. The modular design—separating PnP, triangulation, and bundle adjustment into reusable components—facilitates understanding and future extension.

The key takeaway is that incremental SFM, while conceptually straightforward, requires careful attention to numerical details: proper initialization, manifold-constrained optimization, and systematic error evaluation. With these foundations in place, the path to handling real-world data becomes much clearer.

The complete source code for this implementation is available in the associated code repository, providing a solid starting point for further exploration and development in structure from motion and 3D reconstruction.