nan for obj, cls and 0 for P/R/mAP #7740

Closed · 1 of 2 tasks
saumitrabg opened this issue May 9, 2022 · 9 comments
Labels: bug (Something isn't working), Stale

Comments

@saumitrabg

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

No response

Bug

Our training mAP is very low with the latest code, so we are staying on release v6.0, where mAP trends as before. Our YOLOv5m model has been in production for some time and we are familiar with the pipeline.

However, recently with v6.0, Python 3.7, and CUDA 11.4 on Google Cloud (4x T4), at exactly the 26th epoch our obj and cls losses become nan and mAP/P/R drop to 0 from roughly 0.6 mAP@0.5:0.95. We are in production and need to turn around a newly tuned model, but we are now stuck.

So we moved to Python 3.9 with Miniconda and CUDA 11.4 on Google Cloud (4x T4). Now, at the 4th epoch, we see nan for obj/cls and 0 for mAP/P/R.

We are stuck: if we move to the latest YOLOv5 code, training is incredibly slow with a very low mAP score, and now v6.0 is producing nans. Any clue? By the way, we already went through a number of steps (#7027) to reproduce the slow training / low mAP with the latest code; until that is fixed, we need to figure out why v6.0 is going to nan.

[screenshot: training log showing nan obj/cls losses and 0 for P/R/mAP]
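As a debugging aid (this helper is not part of YOLOv5, and the names `compute_loss`, `loss_items`, `epoch`, and `i` are placeholders for illustration), a small finiteness check around the loss makes it possible to log the exact epoch and batch where obj/cls first turn to nan, instead of discovering it from the progress bar:

```python
# Illustrative debugging helper, not YOLOv5 code: stop as soon as any loss
# component stops being finite so the offending batch can be inspected.
import torch

def check_finite(loss_items, epoch, batch_idx):
    """Raise immediately when a loss component becomes nan or inf."""
    for name, value in zip(("box", "obj", "cls"), loss_items):
        if not torch.isfinite(value):
            raise FloatingPointError(
                f"{name} loss became {value.item()} at epoch {epoch}, batch {batch_idx}"
            )

# Hypothetical usage inside the training loop:
# loss, loss_items = compute_loss(pred, targets)
# check_finite(loss_items, epoch, i)
```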

Environment

- YOLOv5 v6.0 (release 6.0)
- Google Cloud, 4x T4
- Python 3.7 and Python 3.9
- CUDA 11.4 & 11.6

Minimal Reproducible Example

python -m torch.distributed.launch --master_port 1234 --nproc_per_node=4 train.py --rect --name 640_m_xxx_6_May_22 --img 640 --batch 384 --epochs 450 --data coco128.yaml --weights /home/engineering_borde_io/yolov5-v6.0/runs/train/640_m_xxx_1_May_222_best.pt --device 0,1,2,3

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
saumitrabg added the bug (Something isn't working) label on May 9, 2022
@glenn-jocher
Member

@saumitrabg your training is unstable as your losses are obviously increasing.

@saumitrabg
Author

saumitrabg commented May 10, 2022

@glenn-jocher the losses settle down and we always get to mAP in the high 60s. However, it is painfully low with the latest code. In any case, we are now seeing the issue with the latest code as well: in the 3rd epoch (epoch 2), it goes to nan. We also see an NMS time-limit warning, and based on an old thread we changed the timeout value. Any clue how to make progress?
[screenshot: training log showing the losses going to nan in epoch 2]

@glenn-jocher
Member

@saumitrabg this is pretty standard training instability. You might want to sit down with your ML team and have an engineer review the hyps and warmup etc.
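One low-effort way to act on this advice is to start from a copy of the stock hyperparameter file, lower the initial learning rate and/or lengthen warmup, and pass the new file to train.py with --hyp. The sketch below assumes the v6.0 layout where the stock hyperparameters live in data/hyps/hyp.scratch.yaml; the chosen values are illustrative, not recommendations.

```python
# Sketch: derive a custom hyperparameter file with a gentler start.
import yaml

with open("data/hyps/hyp.scratch.yaml") as f:  # stock v6.0 hyp file (assumed path)
    hyp = yaml.safe_load(f)

hyp["lr0"] = 0.005            # stock value is 0.01; halve the initial LR
hyp["warmup_epochs"] = 5.0    # stock value is 3.0; longer warmup ramp

with open("hyp.custom.yaml", "w") as f:
    yaml.safe_dump(hyp, f, sort_keys=False)

# then: python train.py --hyp hyp.custom.yaml ...
```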

saumitrabg changed the title from "nan for obj, cls and 0 for P/R/mAP with v6.0 release" to "nan for obj, cls and 0 for P/R/mAP" on May 11, 2022
@saumitrabg
Author

saumitrabg commented May 11, 2022

@glenn-jocher thank you for the response. We are not asking here about the slow training with the latest code (by the way, we used the exact same hyperparameters from previous YOLO versions, where training was faster, in the VOC yaml file, since we are using the old model as a baseline, and that did not speed things up).

Our issue is a logical one: we are hitting nans with every combination we try:

  1. YOLOv5 v6.0 / 4x T4 / CUDA 11.4 or 11.6: nan
  2. YOLOv5 latest / 4x T4 / CUDA 11.4 or 11.6: nan
  3. The above combos with AMP disabled (enabled=False), based on an earlier thread; note that the T4 has tensor cores (see the sketch after this list for what disabling AMP looks like): nan
  4. YOLOv5 v6.0 / 4x T4 / CUDA 10.2: nan

We have pretty much eliminated all environment-related variables and cannot make progress.
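For reference on item 3, here is a minimal sketch of what running with AMP disabled means in practice, assuming the v6.0 train.py pattern built on torch.cuda.amp; the train_step function and its arguments are illustrative, not the actual train.py code. Forcing both the autocast context and the GradScaler off keeps all math in FP32, which rules out half precision on the T4 tensor cores as the source of the nans.

```python
# Illustrative sketch of an FP32-only training step (AMP forced off).
import torch
from torch.cuda import amp

use_amp = False                           # set True to restore mixed precision
scaler = amp.GradScaler(enabled=use_amp)  # no-op passthrough when disabled

def train_step(model, imgs, targets, compute_loss, optimizer):
    with amp.autocast(enabled=use_amp):   # everything stays in FP32 when use_amp is False
        pred = model(imgs)
        loss, loss_items = compute_loss(pred, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)                # plain optimizer.step() when AMP is off
    scaler.update()
    optimizer.zero_grad()
    return loss_items
```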

@glenn-jocher
Member

@saumitrabg we don't provide support for custom dataset training, etc. As I said, this is standard training instability that an ML engineer on your team should know how to handle.

@github-actions
Contributor

github-actions bot commented Jun 11, 2022

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

@ghost

ghost commented Jun 30, 2022

I am also facing the same problem.
But I noticed that if I use only one GPU, the losses have normal values, whereas with multiple GPUs they become nan.

@saumitrabg
Author

We ended up experimenting with a lower learning rate (lr0) to work around it.

@glenn-jocher
Member

@saumitrabg I'm glad to hear that you were able to make progress by adjusting the learning rate. Training with multiple GPUs can introduce complexities that may require some fine-tuning. If you have any other issues or questions, feel free to ask!
