Potential misleading parts in the paper and code #161
Closed
PointsCoder opened this issue Jan 25, 2024 · 5 comments

@PointsCoder
Hello,

Thanks for your work and contribution to the community.

I noticed that your paper may contain information that misleads researchers who follow your work.

  1. Involving ego status from can_bus and the ego vehicle's historical trajectory in planning.

As shown in [1][2][3], ego status and historical trajectory have short-cut effects on planning. The method diagram in this paper does not show that such information is included. However, after carefully checking the provided code, the implementation does include both ego status and the ego historical trajectory:

  • As noticed by [4], this paper feeds CAN bus information into the BEVFormer encoder. As shown in Table 1 of [4], removing the CAN bus information results in a significant performance drop in planning.
  • This paper uses BEVFormer, which aligns and stacks multi-frame BEV features using historical ego-motion. Historical ego-motion is mainly what generates the ego historical trajectory, so there is an implicit effect on planning.
  • In the motion head, in addition to predicting detected objects, this paper also predicts the ego vehicle's future trajectory based on the ego vehicle's historical trajectory (tracks). This involves the ego's historical trajectory during training, and the trained ego embedding is reused for the subsequent motion planning, which leaks ego-historical information into motion planning. See the code here for stacking sdc_track as input, the code here for predicting the ego future based on the ego track, and the code here for using the sdc query in the planning head.

Incorporating ego status and historical trajectory, explicitly and implicitly, has a huge effect on motion-planning performance. Removing point 1 alone can cause a significant performance drop, and I believe points 2 and 3 also play a significant role in boosting the planning numbers. I am curious what would happen if all of them were removed. I would suggest the authors state clearly where ego information enters the system in this paper and how it affects the final performance. A small illustrative sketch of the shortcut effect follows below.
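To make the shortcut concern concrete, here is a minimal, hypothetical sketch (not UniAD's actual code; the module and variable names are invented for illustration) of why giving a planner direct access to ego status such as velocity and yaw rate lets a tiny network extrapolate plausible future waypoints without looking at perception at all:

```python
import torch
import torch.nn as nn

class EgoStatusShortcutPlanner(nn.Module):
    """Toy planner: given only ego status, a small MLP can extrapolate
    a plausible future trajectory while ignoring the driving scene.
    This illustrates the shortcut discussed in [1][2][3][4]; it is not
    a reimplementation of UniAD."""
    def __init__(self, horizon: int = 6):
        super().__init__()
        self.horizon = horizon
        self.mlp = nn.Sequential(
            nn.Linear(4, 64), nn.ReLU(),
            nn.Linear(64, horizon * 2),   # (x, y) offset for each future step
        )

    def forward(self, ego_status: torch.Tensor) -> torch.Tensor:
        # ego_status: (B, 4) = [vx, vy, yaw_rate, accel], e.g. read from can_bus
        return self.mlp(ego_status).view(-1, self.horizon, 2)

planner = EgoStatusShortcutPlanner()
waypoints = planner(torch.tensor([[5.0, 0.0, 0.01, 0.2]]))  # -> shape (1, 6, 2)
```

As [3] and [4] report, even ego-only baselines of roughly this form can reach competitive open-loop L2 numbers on nuScenes, which is exactly why hidden ego inputs make comparisons hard to interpret.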

  2. Incorrect evaluation metrics and unfair comparison with previous approaches.

This paper compares with ST-P3 in Table 7. However, after carefully checking the code of both ST-P3 and this repo, I found that the evaluation metrics of ST-P3 and this repo are implemented differently, which makes the comparison unfair (I am actually very curious why you chose different implementations, since both works are from your group).

Specifically, I summarized the differences below:

  • Validation samples are different: this paper evaluates 6019 frames with masking, while ST-P3 drops the first and last several frames of each log (around 4800 test frames). This has a marginal effect on performance.
  • GT occupancy: this paper only considers vehicles in the collision calculation, while ST-P3 takes both vehicles and pedestrians into account, and the vehicle counts also differ between the two implementations. This has a non-negligible impact on the collision metric.
  • Average over average: this paper reports L2 and collision at each individual time-step, while ST-P3 reports, at each time-step, the average over all previous time-steps. This causes a significant difference in all results.

Given these differences, the comparison in Table 7 looks very tricky and unfair. A small sketch contrasting the two metric conventions is given below.
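To make the third point concrete, here is a minimal sketch (toy arrays and invented function names, not the actual evaluation code of either repository) contrasting the per-time-step convention with the cumulative-average convention:

```python
import numpy as np

def l2_per_step(pred, gt):
    """L2 error at each future time-step (per-step convention)."""
    # pred, gt: (T, 2) arrays of planned / ground-truth xy waypoints
    return np.linalg.norm(pred - gt, axis=-1)                     # shape (T,)

def l2_cumulative_average(pred, gt):
    """At each step t, the average L2 over steps 1..t (ST-P3-style convention)."""
    per_step = l2_per_step(pred, gt)
    return np.cumsum(per_step) / np.arange(1, len(per_step) + 1)  # shape (T,)

# Toy example: the two conventions diverge whenever error grows with the horizon.
pred = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
gt   = np.array([[0.0, 0.1], [1.0, 0.4], [2.0, 1.0]])
print(l2_per_step(pred, gt))            # approx. [0.1, 0.4, 1.0]
print(l2_cumulative_average(pred, gt))  # approx. [0.1, 0.25, 0.5]
```

Because planning error typically grows with the horizon, the cumulative-average convention reports much lower numbers at later time-steps, so results computed under the two conventions are not directly comparable.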

  3. Unclear experimental results.

Tables 2, 7, and 10 report different planning scores for the same model, but the paper does not say under which conditions the model was tested in each case. The differences in experimental settings should be highlighted in the paper.

Overall, your group may be the first to try open-loop planning on nuScenes. It is possible that these unintentional mistakes and misleading parts will cause a lot of trouble for researchers who try to faithfully follow your work. I believe the use of ego and historical information in your system, and the different metric implementations, should be highlighted in the paper to avoid misleading other researchers.

Thank you again for your work and efforts. Looking forward to your response.

[1] Codevilla et al. Exploring the Limitations of Behavior Cloning for Autonomous Driving.
[2] Dauner et al. Parting with Misconceptions about Learning-based Vehicle Motion Planning.
[3] Zhai et al. Rethinking the Open-Loop Evaluation of End-to-End Autonomous Driving in nuScenes.
[4] Li et al. Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

@ilnehc
Contributor

ilnehc commented Jan 26, 2024

Hello @PointsCoder ,

Thank you for raising this issue.

In general, I personally believe that research is a mixture of successes and mistakes in the exploration of novel things, particularly in new areas lacking handy or mature tools and platforms. While we acknowledge the possibility of flaws in our projects, I assure you that we never intentionally make mistakes, and we prioritize academic integrity. That is why we choose to open-source almost all of our projects. We greatly appreciate the efforts of researchers in identifying any shortcomings and working towards building more robust and comprehensive platforms for future use.

For your questions,

  1. Ego status and historical trajectory:
  • This topic has been widely discussed in the papers you listed, and we have also engaged in a discussion on this matter here. In real-world driving scenarios, incorporating ego status and historical information is beneficial.
  • As you said, the historical trajectory in the motion head is used during training but is not provided during testing. This is different from other works that explicitly incorporate ego status.
  2. Evaluation metrics:
  • Some setups need re-implementation when using the two different codebases (FIERY and mmDet3D), which caused some inconsistencies. Most of these concerns are also raised in Li's BEV-Planner paper. We also found and reported the average-over-average issue here. We think the final-average setting in this project makes more sense, and you have also updated the related results in your paper. We recognize the importance of consistent implementation for fair comparisons, and I think the findings in BEV-Planner do not undermine this paper's main contribution and performance superiority.
  3. Experimental results:
  • We mention in the supplementary material (above Eq. 14) that ablations were trained for fewer epochs for efficiency. Besides, due to the huge amount of work in this project, we had to use different computational resources, such as V100/A100 GPUs, which might also lead to some inconsistencies in the final results.

Misc: We are not the first group to try open-loop planning on nuScenes. You can find pioneering results on FF [a], EE [b], etc.
[a] Safe Local Motion Planning with Self-Supervised Freespace Forecasting. CVPR, 2021.
[b] Differentiable Raycasting for Self-supervised Occupancy Forecasting. ECCV, 2022.


Lastly, my personal opinion is that we should evaluate the value of research works from comprehensive perspectives. It is disheartening to witness the tendency in the CV/ML community nowadays to seek an easier way to judge works, based solely on final numbers. We acknowledge that the current evaluation setup has some internal drawbacks, and we sincerely hope the research community can devise practical ways to alleviate the issue, which would be immensely valuable.

@PointsCoder
Author

PointsCoder commented Jan 26, 2024

Hi,

Thanks for your response.

I understand that researchers sometimes make mistakes, and I believe those mistakes were unintentional. I raised this issue not to criticize your work (your work is great in my opinion). My request is that you clarify the above-mentioned misleading parts by updating your paper.

After reading the response, I believe we all agree that the following misleading parts exist in the paper:

  1. Including ego states and historical info in your system, i.e., in the BEV backbone, in the temporal frame-stacking, in the motion-planning head (this may need further discussion, see below).
  2. Incorrect evaluation and comparison with ST-P3 in Table 7.
  3. Misaligned planning scores across Tables 2, 7, and 10. I noticed in the supplementary material that ablations were trained for 8 epochs; however, Tables 2 and 10 are both ablations, yet their results are still inconsistent.

Regardless of the intentions behind them (I generally believe your intentions were good), I believe the above-mentioned misleading parts should be fixed by updating your paper (at least the arXiv version), for the following reasons:

  1. Your work has long-term impact (it is a CVPR best paper). Many researchers have been following your work and have used your Table 7 for comparison, and more researchers will read and follow it in the years to come. If Table 7 is not correct, then the comparisons in those follow-up works are also incorrect.
  2. The paper has more visibility than GitHub issues. You cannot expect every researcher or reviewer to read the GitHub issues to understand this. The clarifications should be in the paper so that all researchers in this community are on the same page.
  3. The current paper has misleading effects, especially regarding whether or not ego information was used. I have talked to many researchers, and most of them did not think UniAD utilized ego information; however, this is not correct. Suppose a researcher tries to follow your work: should they also use ego information? If they do, reviewers may judge this as unfair based on the misunderstanding; if they do not, they cannot fairly compare with UniAD, especially since ego information gives a huge performance boost in planning.

Here are my suggestions for the paper revision:

  1. The use of ego states and historical information should be clearly stated in the method section of the main paper. The effect of using ego information should be ablated in Table 10, by adding three rows: removing can_bus from the BEV encoder, removing temporal frame stacking, and removing ego-forecasting from the motion head.
  2. ST-P3's performance in Table 7 should be recomputed under the same metric, with an additional statement in the supplementary material explaining why the prior metrics are incorrect.
  3. Tables 2, 7, and 10 should state clearly under which conditions the method was tested. At the very least, the planning scores in Tables 2 and 10 should be the same, as they are both ablations.

Again, thanks for your contribution to the community. I am not judging or criticizing your work, but I believe these misleading parts should be clarified in the paper.

@PointsCoder
Author

PointsCoder commented Jan 26, 2024

Regarding the use of historical ego information in the motion and planning heads: from my understanding, you still use sdc_embedding in your track query here, and this still leaks sdc_tracks (the ego history) into planning. Also, ego-forecasting in the motion head is unnatural, since you effectively plan twice in your system. Let me know if my understanding is correct.

@ilnehc
Contributor

ilnehc commented Jan 27, 2024

The tracking status is estimated during testing. Using predicted history for planning is not a leakage, in my opinion. Ego-forecasting is a common practice when employing motion-forecasting methods for planning, as can be seen in works from the motion-forecasting field and in the nuPlan challenges.

As I said, I appreciate the following works that have initiated the discussions and presented additional results. For your requests and suggestions, I will discuss with the team.

  • I would accept updating the clarification of the evaluation setup. We will try to align Tables 2 and 10 when we are available in the future.
  • However, I do not think it is appropriate to add further information to the original paper. Discussions on ego status are contributions and findings of later papers. I would not say the original paper is misleading.

@PointsCoder
Author

Thanks for your reply.

  1. Ego-historical information in motion prediction and then planning:
    It is essentially ego-forecasting in motion prediction, and you then feed both the ego forecast and the ego embedding into planning. Using online tracking results rather than localization data makes very little difference, as (1) both provide a historical trajectory of the ego vehicle, and (2) online trackers still need localization data to estimate ego motion; see the use of l2g_r, l2g_t (the lidar-to-global transform) in uniad_track.py. A small sketch of how such transforms yield the ego motion is given after this list.

  2. Whether or not the paper has misleading parts:
    Researchers have different interpretations of an academic paper. I respect your opinion. At least from my point of view, the above-mentioned problems are quite misleading.

  3. Whether or not the paper needs to be updated:
    It is totally up to you, but I believe an update would greatly help researchers understand the problem. It is not an after-the-fact discussion of the original paper; on the contrary, the problem (e.g., the incorrect comparison) has existed and will remain in the paper.
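As a minimal sketch of the point in item 1 (a hypothetical helper with invented names, not the repo's code): given lidar-to-global rotations and translations at two consecutive frames, the relative ego motion, and hence the ego's historical trajectory, can be recovered directly:

```python
import numpy as np

def relative_ego_motion(l2g_r_prev, l2g_t_prev, l2g_r_cur, l2g_t_cur):
    """Hypothetical helper: recover the ego pose change between two frames
    from lidar-to-global rotations (3x3) and translations (3,).
    Wherever such transforms are available, the ego's past trajectory is too."""
    # Express the current lidar frame in the previous lidar frame.
    r_rel = l2g_r_prev.T @ l2g_r_cur
    t_rel = l2g_r_prev.T @ (l2g_t_cur - l2g_t_prev)
    return r_rel, t_rel

# Toy example: the ego moved 2 m forward between the two frames.
r = np.eye(3)
print(relative_ego_motion(r, np.zeros(3), r, np.array([2.0, 0.0, 0.0]))[1])  # [2. 0. 0.]
```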

Overall, I appreciate your feedback, and the discussion makes things clearer. Good luck to your team and hope to see more influential works from your group : )

@YTEP-ZHI closed this as completed Feb 8, 2024