nan for obj, cls and 0 for P/R/mAP #7740

Closed · 1 of 2 tasks
saumitrabg opened this issue May 9, 2022 · 9 comments
Labels: bug (Something isn't working), Stale

Comments

@saumitrabg

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

No response

Bug

Our training mAP is very low with the latest code, so we are staying on release v6.0, where mAP trends as before. Our YOLOv5m model has been in production for some time and we are familiar with the pipeline.

However, recently with v6.0, Python 3.7, and CUDA 11.4 on Google Cloud (4x T4), at exactly the 26th epoch our obj and cls losses become nan and mAP/P/R drop to 0 from roughly 0.6 mAP@0.5:0.95. We are in production and need to turn around a newly tuned model, but we are now stuck.

So we moved to Python 3.9 with Miniconda and CUDA 11.4 on Google Cloud (4x T4). Now, at the 4th epoch, we see nan for obj/cls and 0 for mAP/P/R.

We are stuck: if we move to the latest YOLOv5 code, training is incredibly slow with a very low mAP score, and now v6.0 is producing nans. Any clue? By the way, we already went through a number of steps (#7027) to reproduce the slow training / low mAP with the latest code; until that is fixed, we need to figure out why v6.0 is going to nan.

[screenshot: training log showing nan obj/cls losses and 0 for P/R/mAP]
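As a debugging aid (this helper is not part of YOLOv5, and the names `compute_loss`, `loss_items`, `epoch`, and `i` are placeholders for illustration), a small finiteness check around the loss makes it possible to log the exact epoch and batch where obj/cls first turn to nan, instead of discovering it from the progress bar:

```python
# Illustrative debugging helper, not YOLOv5 code: stop as soon as any loss
# component stops being finite so the offending batch can be inspected.
import torch

def check_finite(loss_items, epoch, batch_idx):
    """Raise immediately when a loss component becomes nan or inf."""
    for name, value in zip(("box", "obj", "cls"), loss_items):
        if not torch.isfinite(value):
            raise FloatingPointError(
                f"{name} loss became {value.item()} at epoch {epoch}, batch {batch_idx}"
            )

# Hypothetical usage inside the training loop:
# loss, loss_items = compute_loss(pred, targets)
# check_finite(loss_items, epoch, i)
```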

Environment

- YOLOv5 v6.0 (release 6.0)
- Google Cloud, 4x T4
- Python 3.7 and Python 3.9
- CUDA 11.4 & 11.6

Minimal Reproducible Example

python -m torch.distributed.launch --master_port 1234 --nproc_per_node=4 train.py --rect --name 640_m_xxx_6_May_22 --img 640 --batch 384 --epochs 450 --data coco128.yaml --weights /home/engineering_borde_io/yolov5-v6.0/runs/train/640_m_xxx_1_May_222_best.pt --device 0,1,2,3

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
saumitrabg added the bug (Something isn't working) label on May 9, 2022
@glenn-jocher
Member

@saumitrabg your training is unstable as your losses are obviously increasing.

@saumitrabg
Author

saumitrabg commented May 10, 2022

@glenn-jocher the losses settle down and we always get to mAP in the high 60s. However, it is painfully low with the latest code. In any case, we are now seeing the issue with the latest code as well: in the 3rd epoch (epoch 2), it goes to nan. We also see an NMS time-limit warning, and based on an old thread we changed the timeout value. Any clue how to make progress?
[screenshot: training log showing the losses going to nan in epoch 2]

@glenn-jocher
Member

@saumitrabg this is pretty standard training instability. You might want to sit down with your ML team and have an engineer review the hyps and warmup etc.
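One low-effort way to act on this advice is to start from a copy of the stock hyperparameter file, lower the initial learning rate and/or lengthen warmup, and pass the new file to train.py with --hyp. The sketch below assumes the v6.0 layout where the stock hyperparameters live in data/hyps/hyp.scratch.yaml; the chosen values are illustrative, not recommendations.

```python
# Sketch: derive a custom hyperparameter file with a gentler start.
import yaml

with open("data/hyps/hyp.scratch.yaml") as f:  # stock v6.0 hyp file (assumed path)
    hyp = yaml.safe_load(f)

hyp["lr0"] = 0.005            # stock value is 0.01; halve the initial LR
hyp["warmup_epochs"] = 5.0    # stock value is 3.0; longer warmup ramp

with open("hyp.custom.yaml", "w") as f:
    yaml.safe_dump(hyp, f, sort_keys=False)

# then: python train.py --hyp hyp.custom.yaml ...
```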

saumitrabg changed the title from "nan for obj, cls and 0 for P/R/mAP with v6.0 release" to "nan for obj, cls and 0 for P/R/mAP" on May 11, 2022
@saumitrabg
Author

saumitrabg commented May 11, 2022

@glenn-jocher thank you for the response. We are not asking here about the slow training with the latest code (by the way, we used the exact same hyperparameters from previous YOLO versions, where training was faster, in the VOC yaml file, since we are using the old model as a baseline, and that did not speed things up).

Our issue is a logical one: we are hitting nans with every combination we try:

  1. YOLOv5 v6.0 / 4x T4 / CUDA 11.4 or 11.6: nan
  2. YOLOv5 latest / 4x T4 / CUDA 11.4 or 11.6: nan
  3. The above combos with AMP disabled (enabled=False), based on an earlier thread; note that the T4 has tensor cores (see the sketch after this list for what disabling AMP looks like): nan
  4. YOLOv5 v6.0 / 4x T4 / CUDA 10.2: nan

We have pretty much eliminated all environment-related variables and cannot make progress.
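For reference on item 3, here is a minimal sketch of what running with AMP disabled means in practice, assuming the v6.0 train.py pattern built on torch.cuda.amp; the train_step function and its arguments are illustrative, not the actual train.py code. Forcing both the autocast context and the GradScaler off keeps all math in FP32, which rules out half precision on the T4 tensor cores as the source of the nans.

```python
# Illustrative sketch of an FP32-only training step (AMP forced off).
import torch
from torch.cuda import amp

use_amp = False                           # set True to restore mixed precision
scaler = amp.GradScaler(enabled=use_amp)  # no-op passthrough when disabled

def train_step(model, imgs, targets, compute_loss, optimizer):
    with amp.autocast(enabled=use_amp):   # everything stays in FP32 when use_amp is False
        pred = model(imgs)
        loss, loss_items = compute_loss(pred, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)                # plain optimizer.step() when AMP is off
    scaler.update()
    optimizer.zero_grad()
    return loss_items
```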

@glenn-jocher
Member

@saumitrabg we don't provide support for custom dataset training, etc. As I said, this is standard training instability that an ML engineer on your team should know how to handle.

@github-actions
Contributor

github-actions bot commented Jun 11, 2022

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

@ghost

ghost commented Jun 30, 2022

I am also facing the same problem.
But I noticed that if I use only one GPU, the losses have normal values, whereas with multiple GPUs they become nan.

@saumitrabg
Author

We ended up experimenting with a lower learning rate (lr0) to work around it.

@glenn-jocher
Member

@saumitrabg I'm glad to hear that you were able to make progress by adjusting the learning rate. Training with multiple GPUs can introduce complexities that may require some fine-tuning. If you have any other issues or questions, feel free to ask!
