Results using ST-P3 and VAD metrics #166

Open · wljungbergh opened this issue Feb 27, 2024 · 10 comments

@wljungbergh (Contributor)

Hi,

As mentioned here, you are not using the same definition of the L2 metric as other works (e.g., VAD or ST-P3).

However, you present UniAD's numbers using your new metric against ST-P3's numbers using the old metric. This caused us (and surely many others) to assume that UniAD uses the same definition as ST-P3 (for example, VAD makes this assumption when comparing to your method; see here).

Unfortunately, this greatly underrepresents UniAD's performance. To reduce confusion, it would greatly help if you presented your numbers using both definitions.

We ran the numbers for you, and not so surprisingly, your scores improve considerably, especially at the longer horizons. We include VAD's numbers to highlight that the definition makes a big difference when comparing methods against each other.

Here are your displacement values when using both metric definitions:

| Method | L2 (m) 1s | L2 (m) 2s | L2 (m) 3s |
|---|---|---|---|
| ST-P3 | 1.33 | 2.11 | 2.90 |
| UniAD (your metric) | 0.48 | 0.96 | 1.65 |
| UniAD (old ST-P3 metric) | 0.42 | 0.64 | 0.91 |
| VAD-Tiny | 0.46 | 0.76 | 1.12 |
| VAD-Base | 0.41 | 0.70 | 1.05 |

FYI, we simply changed the code from

```python
for i in range(len(value)):
    # report the raw value at timestep i (non-cumulative)
    row_value.append("%.4f" % float(value[i]))
```

to

```python
for i in range(len(value)):
    # report the mean over timesteps 0..i (cumulative, as in ST-P3/VAD)
    row_value.append("%.4f" % float(value[:i+1].mean()))
```
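
To make the difference concrete, here is a minimal sketch of the two conventions (the array values below are made up for illustration; `value` in the snippets above plays the same role as `l2_per_step`):

```python
import numpy as np

# Hypothetical per-timestep L2 errors in meters, one entry per 0.5 s
# up to the 3 s horizon (values are illustrative, not measured).
l2_per_step = np.array([0.20, 0.42, 0.66, 0.96, 1.28, 1.65])

# UniAD convention (non-cumulative): report the error at the horizon itself.
uniad_l2_3s = float(l2_per_step[-1])    # 1.65

# ST-P3/VAD convention (cumulative): average the errors over all
# timesteps up to the horizon.
stp3_l2_3s = float(l2_per_step.mean())  # ~0.86
```
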
@YTEP-ZHI (Collaborator)

Hi @wljungbergh, thank you so much for sharing this valuable information. Could you please make a pull request for the code change? I'd like to present your verification in the README of this project for clarification, especially the table of displacement performance. Your efforts are greatly appreciated. Thanks again.

@wljungbergh (Contributor, Author)

No worries, glad to help. I'd be happy to make a pull request, but I will take this up again sometime after March 7th for obvious reasons. ✍️ 📄

@wljungbergh (Contributor, Author)

Alright! Now I have some time on my hands. Do you want to change the evaluation to match that of ST-P3 and VAD, or do you want it to be configurable?

Also, where would be the most appropriate place to present the results: in the TRAIN_EVAL.md file or in the main README?

@YTEP-ZHI YTEP-ZHI self-assigned this Mar 18, 2024
@YTEP-ZHI YTEP-ZHI added the enhancement New feature or request label Mar 18, 2024
@YTEP-ZHI (Collaborator) commented Mar 18, 2024

Hi @wljungbergh! I'd prefer it to be configurable. How about distinguishing the two evaluation protocols by paper name? One could specify the evaluation strategy using `evaluation_strategy='stp3'` or `'uniad'` in their config, with `'uniad'` as the default.
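
For example, the formatting loop from above could branch on that flag, roughly like this (the helper name and signature are illustrative, not actual UniAD code; only the `evaluation_strategy` values come from the proposal):

```python
def format_metric_row(value, evaluation_strategy="uniad"):
    """Format per-timestep metrics under the chosen evaluation protocol."""
    row_value = []
    for i in range(len(value)):
        if evaluation_strategy == "stp3":
            # ST-P3/VAD protocol: cumulative mean over timesteps 0..i.
            row_value.append("%.4f" % float(value[: i + 1].mean()))
        elif evaluation_strategy == "uniad":
            # Original UniAD protocol: value at timestep i only.
            row_value.append("%.4f" % float(value[i]))
        else:
            raise ValueError(f"Unknown evaluation_strategy: {evaluation_strategy}")
    return row_value
```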

@YTEP-ZHI (Collaborator)

I will update the README and other instruction files accordingly. No need to worry about it.

@YTEP-ZHI (Collaborator) commented Mar 18, 2024

Also, I want to know whether the misalignment issue also affects the collision rate metric: is the ST-P3 metric cumulative there as well, while UniAD's is non-cumulative? Have you looked into that?

@wljungbergh (Contributor, Author)

Yes, they do the same for L2 and collision rate (i.e., cumulative). As you can see here, they create one planning metric for each of the timesteps they evaluate and then take the mean over the values added so far (see here). You compute this non-cumulatively, just as with the L2 definition.

And yes, if we change that definition too, UniAD achieves a lower collision rate than other works.
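
The same cumulative-vs-non-cumulative split can be sketched for the collision rate as well (the values below are made up for illustration, not measured):

```python
import numpy as np

# Hypothetical per-timestep collision rates in %, one entry per 0.5 s
# up to the 3 s horizon.
cr_per_step = np.array([0.02, 0.10, 0.12, 0.15, 0.40, 0.61])

# Non-cumulative (current UniAD): the rate at each horizon as-is.
cr_uniad = cr_per_step

# Cumulative (ST-P3/VAD): running mean up to each horizon.
cr_stp3 = np.cumsum(cr_per_step) / np.arange(1, len(cr_per_step) + 1)
```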

@YTEP-ZHI (Collaborator)

Understood. After modifying the code, could you please make a complete table for both the L2 and CR metrics under the `stp3` and `uniad` strategies, like the one you posted?

It would be of great help if the table included L2 and CR results for ST-P3, UniAD, and VAD under the two different evaluation strategies.

@wljungbergh (Contributor, Author) commented Mar 19, 2024

Added PR #173. Cross-posting the table here:

| Method | L2 1s (m) | L2 2s (m) | L2 3s (m) | CR 1s (%) | CR 2s (%) | CR 3s (%) |
|---|---|---|---|---|---|---|
| ST-P3 (legacy ST-P3 metric) | 1.33 | 2.11 | 2.90 | 0.23 | 0.62 | 1.27 |
| UniAD (your current metric) | 0.51 | 0.98 | 1.65 | 0.10 | 0.15 | 0.61 |
| UniAD (legacy ST-P3 metric) | 0.42 | 0.64 | 0.91 | 0.07 | 0.10 | 0.22 |
| VAD-Tiny (legacy ST-P3 metric) | 0.46 | 0.76 | 1.12 | 0.21 | 0.35 | 0.58 |
| VAD-Base (legacy ST-P3 metric) | 0.41 | 0.70 | 1.05 | 0.07 | 0.17 | 0.41 |

@YTEP-ZHI (Collaborator)

Great! Thanks for your help.
