I have a set of RGB-D scans of the same scene, along with corresponding camera poses and the intrinsics and extrinsics of both the color and depth cameras. The RGB resolution is 1290×960; the depth resolution is 640×480.

The depth is reasonably accurate but still noisy, and the same goes for the camera poses. The intrinsics and extrinsics, however, are not very accurate: for example, the intrinsics contain no distortion terms, while the images are clearly distorted. The extrinsics are just identity matrices, i.e. they claim both cameras sit at exactly the same point, which may still be a decent approximation.

When I compute a point cloud from a frame by downsampling the RGB image, unprojecting the depth, transforming it with the RGB extrinsics and projecting it with the RGB intrinsics, the RGB frame appears shifted by a few cm relative to the depth: for example, an object's border gets colored as background.
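For reference, the pipeline I described is roughly the following (a simplified numpy sketch with placeholder names; in my actual data `T_d2rgb` is the identity, which is presumably part of the problem):

```python
import numpy as np

def depth_to_colored_points(depth, rgb, K_d, K_rgb, T_d2rgb):
    """Unproject depth pixels, map them into the RGB camera, sample colors.

    depth: (H, W) metric depth; K_d, K_rgb: 3x3 intrinsics;
    T_d2rgb: 4x4 depth->RGB extrinsics (identity in my data).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.ravel()
    valid = z > 0
    # Unproject depth pixels to 3D points in the depth camera frame.
    x = (u.ravel() - K_d[0, 2]) * z / K_d[0, 0]
    y = (v.ravel() - K_d[1, 2]) * z / K_d[1, 1]
    pts = np.stack([x, y, z], axis=1)[valid]
    # Transform the points into the RGB camera frame.
    pts_rgb = pts @ T_d2rgb[:3, :3].T + T_d2rgb[:3, 3]
    # Project with the RGB intrinsics (no distortion model available).
    uc = K_rgb[0, 0] * pts_rgb[:, 0] / pts_rgb[:, 2] + K_rgb[0, 2]
    vc = K_rgb[1, 1] * pts_rgb[:, 1] / pts_rgb[:, 2] + K_rgb[1, 2]
    ui = np.clip(np.round(uc).astype(int), 0, rgb.shape[1] - 1)
    vi = np.clip(np.round(vc).astype(int), 0, rgb.shape[0] - 1)
    return pts_rgb, rgb[vi, ui]
```

With identity extrinsics this amounts to assuming the two optical centers coincide, so any real baseline between the sensors shows up as exactly the kind of parallax-dependent color shift at object borders that I'm seeing.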

Presumably such artifacts could be eliminated by aligning depth to RGB with better extrinsics. Is there a way to estimate a more accurate transformation between the RGB and depth cameras in this scenario?

I'm familiar with a closely related question, but there the required Rt matrices of both cameras are known.

If a well-performing monocular depth estimator were available, one could align the two point clouds (one computed from the measured depth, one estimated from RGB) directly and recover the required transformation.
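If the monocular network predicts a depth value per RGB pixel, correspondences between the two clouds would essentially come for free, so instead of full ICP the rigid transform could be estimated in closed form with the Kabsch algorithm (ignoring, for the sake of the sketch, the scale ambiguity of monocular depth, which would call for the similarity-transform Umeyama variant). A toy numpy sketch of what I have in mind, with names of my own:

```python
import numpy as np

def kabsch(src, dst):
    """Closed-form rigid transform (R, t) minimizing ||R @ src_i + t - dst_i||.

    src, dst: (N, 3) arrays of corresponding 3D points.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    # Sign correction to guarantee a proper rotation (det(R) = +1).
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

In practice the correspondences would be noisy (depth noise, imperfect monocular prediction), so a robust wrapper such as RANSAC over this closed-form solver, or ICP seeded with its result, would likely be needed.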

Am I missing something, or is there a well-known approach to this problem?