nan for obj, cls and 0 for P/R/mAP #7740
Comments
@saumitrabg your training is unstable, as your losses are obviously increasing.
@glenn-jocher they settle down and we always get to mAP in the high 60s. However, it is painfully low with the latest code. Anyhow, we are seeing the issue with the latest code as well: in the 3rd epoch (epoch 2), it goes to NaN. We also see NMS timeouts, and based on an old thread we changed the timeout value. Any clue how to make progress?
@saumitrabg this is pretty standard training instability. You might want to sit down with your ML team and have an engineer review the hyps, warmup, etc.
@glenn-jocher thank you for the response. We are not asking about the slow training with the latest code (btw, we used the exact same hyperparameters from previous YOLO versions, where training was faster, in the VOC yaml file, since we use the old model as a baseline, and that did not speed things up). Our issue is a logical one: we are hitting NaNs in every combination we have tried.
We have eliminated essentially all environment-related variables and cannot make progress.
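One generic way to catch this kind of NaN blow-up as soon as it starts, rather than noticing it in the epoch metrics, is a finiteness guard on the loss before the backward/optimizer step. A minimal PyTorch sketch (illustrative only, not YOLOv5's actual training loop):

```python
import torch


def check_finite_loss(loss: torch.Tensor) -> bool:
    """Return True if every element of the loss tensor is finite (no NaN/Inf)."""
    return bool(torch.isfinite(loss).all())


# Hypothetical use inside a training step (names are illustrative):
loss = torch.tensor(float("nan"))  # pretend this came from the model
if not check_finite_loss(loss):
    # e.g. skip this optimizer step, dump the batch for inspection,
    # or lower the learning rate and resume from the last checkpoint
    print("non-finite loss detected; skipping step")
```

Logging the first step at which the loss goes non-finite also helps distinguish a data problem (always the same batch) from an optimization problem (varies with LR/warmup).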
@saumitrabg we don't provide support for custom dataset training etc. As I said, this is standard training instability, which an ML engineer on your team should know how to handle.
👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed! Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
We ended up lowering the learning rate lr0 to get around it.
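For anyone landing here later: lr0 is set in the hyperparameter YAML passed to train.py via --hyp. An illustrative fragment (the values below are examples, not the poster's actual settings):

```yaml
# hyp.custom.yaml -- example values only
lr0: 0.005        # initial LR, halved from the common SGD default of 0.01
lrf: 0.1          # final LR fraction (final LR = lr0 * lrf)
warmup_epochs: 5.0  # a longer warmup can also tame early-epoch instability
```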
@saumitrabg I'm glad to hear that you were able to make progress by adjusting the learning rates. Training with multiple GPUs can introduce complexities that may require some fine-tuning. If you have any other issues or questions, feel free to ask!
Search before asking
YOLOv5 Component
No response
Bug
Our training/mAP score is really low with the latest code, so we are currently stuck on release v6.0, where mAP trends as before. Our YOLOv5m model has been in production for some time and we are familiar with it.
However, recently with v6.0 on Python 3.7 and CUDA 11.4 in Google Cloud (4x T4), at exactly the 26th epoch all our obj and cls losses become NaN, and mAP/P/R drop to 0 from, say, 0.6 mAP@0.5:0.95. We are in production and need to turn around a newly tuned model, but now we are stuck.
So we moved to Python 3.9 with Miniconda and CUDA 11.4 in Google Cloud (4x T4). Now, at the 4th epoch, we see NaN for obj/cls and 0 for mAP/P/R etc.
We are stuck because if we move to the latest YOLOv5 code, training is incredibly slow with a very low mAP score, and now v6.0 is showing NaNs. Any clue? Btw, we already went through a bunch of steps (#7027) to reproduce the slow/low mAP with the latest code, and without that fixed, we need to figure out why v6.0 is going NaN.
Environment
- YOLOv5 v6.0
- Google Cloud, 4x T4
- Python 3.7 and Python 3.9
- CUDA 11.4 and 11.6
Minimal Reproducible Example
```shell
python -m torch.distributed.launch --master_port 1234 --nproc_per_node=4 train.py --rect --name 640_m_xxx_6_May_22 --img 640 --batch 384 --epochs 450 --data coco128.yaml --weights /home/engineering_borde_io/yolov5-v6.0/runs/train/640_m_xxx_1_May_222_best.pt --device 0,1,2,3
```
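Side note: `torch.distributed.launch` is deprecated in newer PyTorch releases in favor of `torchrun`. An equivalent invocation with the same train.py flags might look like this (assuming PyTorch >= 1.10; flag names for `torchrun` should be checked against your installed version):

```shell
# deprecated-launcher replacement sketch -- verify flags with `torchrun --help`
torchrun --master_port 1234 --nproc_per_node 4 train.py --rect --name 640_m_xxx_6_May_22 --img 640 --batch 384 --epochs 450 --data coco128.yaml --weights /home/engineering_borde_io/yolov5-v6.0/runs/train/640_m_xxx_1_May_222_best.pt --device 0,1,2,3
```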
Additional
No response
Are you willing to submit a PR?