DEPTH VALUE OF EACH PIXEL #261

Open
AdnanErdogn opened this issue Dec 31, 2023 · 12 comments
Comments

@AdnanErdogn

How can I get this information from MiDaS?

@heyoeyo

heyoeyo commented Jan 4, 2024

The midas models output inverse depth maps (or images). So each pixel of the output corresponds to a value like: 1/depth

However, the mapping is also only relative, it doesn't tell you the exact (absolute) depth. Aside from noise/errors, the true depth value is shifted/scaled compared to the result you get from the midas output after inverting, so more like:

true depth = A + B * (1 / midas output)
(see post below)

Where A is some offset and B is some scaling factor, that generally aren't knowable using the midas models alone. You can try something like ZoeDepth to get actual depth values or otherwise try fitting the midas output to some other reference depth map, like in issue #171

@JoshMSmith44

JoshMSmith44 commented Jan 7, 2024

According to #171 I believe the equation is:
(1.0 / true_depth) = A + (B * midas_output)
so then
true_depth = 1.0 / (A + (B * midas_output))
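
As a rough illustration (assuming NumPy, and with made-up A and B values that would in practice have to be fitted for each image), the conversion would look something like:

```python
import numpy as np

# Hypothetical shift/scale values; in practice they must be fitted
# against reference depth measurements for each image.
A, B = 0.05, 0.002

# Placeholder for a MiDaS prediction (relative inverse depth).
midas_output = np.random.rand(480, 640).astype(np.float32)

# 1 / true_depth = A + B * midas_output, so shift/scale first, then invert.
eps = 1e-8  # guard against division by zero
true_depth = 1.0 / np.clip(A + B * midas_output, eps, None)
```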

@heyoeyo

heyoeyo commented Jan 7, 2024

so then true_depth = 1.0 / (A + (B * midas_output))

Good point! I was thinking these are the same mathematically, but there is a difference, and having the shifting done before inverting makes more sense.

@Eyshika

Eyshika commented Feb 5, 2024

How are A and B calculated for a video ? @JoshMSmith44

@JoshMSmith44

How are A and B calculated for a video ? @JoshMSmith44

I believe MiDaS is a single-image method and therefore there is a different A and B for each frame in the video sequence.

@Eyshika

Eyshika commented Feb 6, 2024

How are A and B calculated for a video ? @JoshMSmith44

I believe MiDaS is a single-image method and therefore there is a different A and B for each frame in the video sequence.

But in MiDaS, A and B are calculated by comparing the true depth with the predicted depth. What if we have completely new images and want to find metric depth?

@JoshMSmith44

JoshMSmith44 commented Feb 6, 2024

In order to get the true depth using the above method you need to know at least two true depth pixel values for each relative depth image you correct (realistically you want many more). These could come from a sensor, a sparse structure-from-motion point cloud, etc. If you don't have access to true depth and you need metric depth, then you should look into metric depth estimation methods like ZoeDepth, Depth-Anything, and ZeroDepth.
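
As a minimal sketch of that fitting step (assuming NumPy; the function name and the sample correspondences are made up for illustration), a least-squares fit of A and B in inverse-depth space could look like:

```python
import numpy as np

def fit_shift_scale(midas_vals, true_depths):
    """Fit A, B in (1 / true_depth) = A + B * midas_value via least squares.

    midas_vals:  MiDaS predictions sampled at pixels with known depth.
    true_depths: corresponding metric depths (e.g. from a sensor or a
                 sparse structure-from-motion point cloud).
    """
    inv_depth = 1.0 / np.asarray(true_depths, dtype=np.float64)
    X = np.stack([np.ones(len(midas_vals)),
                  np.asarray(midas_vals, dtype=np.float64)], axis=1)
    (A, B), *_ = np.linalg.lstsq(X, inv_depth, rcond=None)
    return A, B

# Made-up correspondences (at least 2 are needed, more is better):
midas_vals = np.array([120.0, 80.0, 30.0, 10.0])
true_depths = np.array([2.0, 3.0, 8.0, 20.0])
A, B = fit_shift_scale(midas_vals, true_depths)
metric_depth = 1.0 / (A + B * midas_vals)
```

For video, the same fit would have to be repeated per frame, since A and B generally differ between frames.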

@puyiwen

puyiwen commented Apr 17, 2024

The midas models output inverse depth maps (or images). So each pixel of the output corresponds to a value like: 1/depth

However, the mapping is also only relative, it doesn't tell you the exact (absolute) depth. Aside from noise/errors, the true depth value is shifted/scaled compared to the result you get from the midas output after inverting, so more like:

true depth = A + B * (1 / midas output) (see post below)

Where A is some offset and B is some scaling factor, that generally aren't knowable using the midas models alone. You can try something like ZoeDepth to get actual depth values or otherwise try fitting the midas output to some other reference depth map, like in issue #171

Hi, if I just use the midas output, which you said is inverse depth, to train my model, and I want to get relative depth for an image, am I doing something wrong?

@puyiwen

puyiwen commented May 7, 2024

The midas models output inverse depth maps (or images). So each pixel of the output corresponds to a value like: 1/depth

However, the mapping is also only relative, it doesn't tell you the exact (absolute) depth. Aside from noise/errors, the true depth value is shifted/scaled compared to the result you get from the midas output after inverting, so more like:

true depth = A + B * (1 / midas output) (see post below)

Where A is some offset and B is some scaling factor, that generally aren't knowable using the midas models alone. You can try something like ZoeDepth to get actual depth values or otherwise try fitting the midas output to some other reference depth map, like in issue #171

Hi @heyoeyo, I want to know how a metric depth dataset (like DIML) and a relative depth dataset (like RedWeb) can be trained on together. Does the metric depth dataset need to be converted to a relative depth dataset first? Can you help me? Thank you very much!!

@heyoeyo

heyoeyo commented May 7, 2024

One of the MiDaS papers describes how the data is processed for training. The explanation starts on page 5, under the section: Training on Diverse Data

There they describe several approaches they considered, which are later compared on plots (see page 7) showing that the combination of the 'ssitrim + reg' loss functions worked the best. These loss functions are both described on page 6 (equations 7 & 11).

The explanation just above the 'ssitrim' loss is where they describe how different data sets are handled. The basic idea is that they first run their model on an input image to get a raw prediction, which is then normalized (using equation 6 in the paper). They repeat the same normalization procedure for the ground truth, and then calculate the error as:
abs(normalized_prediction - normalized_ground_truth_disparity)
Which is calculated for each 'pixel' in the prediction and summed together. For the 'ssitrim' loss specifically, they ignore the top 20% largest errors when calculating the sum.

So due to the normalization step, both relative & metric depth data sources should be able to be processed/trained using the same procedure.
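
As a loose sketch of that 'ssitrim' procedure based on the paper's description (not the official implementation; assuming PyTorch, per-image normalization by median and mean absolute deviation, and an average over the kept pixels rather than a raw sum), it might look like:

```python
import torch

def ssi_trim_loss(prediction, target, trim_fraction=0.2):
    """Scale-and-shift-invariant trimmed loss, sketched from the paper's
    description: normalize prediction and ground truth (subtract the median,
    divide by the mean absolute deviation), take the absolute per-pixel
    error, and ignore the largest 20% of errors per image."""
    def normalize(d):
        d = d.flatten(1)                              # (batch, H*W)
        t = d.median(dim=1, keepdim=True).values      # shift (median)
        s = (d - t).abs().mean(dim=1, keepdim=True)   # scale (mean abs deviation)
        return (d - t) / s.clamp(min=1e-8)

    err = (normalize(prediction) - normalize(target)).abs()
    n_keep = int(err.shape[1] * (1.0 - trim_fraction))  # keep the smallest 80%
    trimmed, _ = torch.sort(err, dim=1)
    return trimmed[:, :n_keep].mean()
```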

@puyiwen

puyiwen commented May 15, 2024

@heyoeyo, thank you for your reply. I have another question about relative depth evaluation. Why should the output of the model (relative depth) be converted to metric depth and evaluated on metric depth datasets like NYU and KITTI, using metrics such as RMSE, AbsRel, etc.? Why not just use a relative depth dataset for evaluation?

@heyoeyo

heyoeyo commented May 15, 2024

I think it depends on what the evaluation is trying to show. Converting to metric depth would have the effect of more heavily weighting errors on scenes that have wider depth ranges. For example a 10% error on an indoor scene with elements that are only 10m away would be a 1m error, whereas a 10% error on an outdoor scene with objects 100m away would have a 10m error, and that might be something the authors want to prioritize (i.e. model accuracy across very large depth ranges).

It does seem strange to me that the MiDaS paper converted some results to metric depth for their experiments section though. Since it seems they just used a least squares fit to align the relative depth results with the metric ground truth (described on pg 7), it really feels like this just over-weights the performance of the model on outdoor scenes.

It makes a lot more sense to do the evaluation directly in absolute depth for something like ZoeDepth, where the model is directly predicting the metric values and therefore those 1m vs 10m errors are actually relevant to the model's capability.
(but I might be missing something, I haven't really worked with metric depth data myself)
