Custom dataset with imbalanced classes. #492
Comments
Hello @weihuangxu, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook, Docker Image, and Google Cloud Quickstart Guide for example environments. If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you. If this is a custom model or data training question, please note that Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients, such as:
For more information please visit https://www.ultralytics.com.
@weihuangxu sure, there are a number of methods that you can research online, but none of them work as well as simply collecting more training data for your underrepresented classes.
@glenn-jocher Thanks for the quick response. Unfortunately, we don't have extra data available for the underrepresented classes. I found a solution that partially solves my issue.
Hi @weihuangxu, I have a similar issue with imbalanced classes in my dataset. How did you solve it? Did you change the class weights in the loss function? Thanks in advance.
Hi @gizemtanriver, can you tell me where I should change the class weights?
Hi @Raian-Rahman, I actually ended up using only one of the classes in my dataset, so I am not sure how to change the class weights. Perhaps have a look at the `labels_to_class_weights` function in `utils/general.py`.
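For anyone looking for the general idea behind that function: a minimal standalone sketch of inverse-frequency class weighting (this is a hypothetical helper for illustration, not the actual `labels_to_class_weights` implementation):

```python
import numpy as np

def class_weights_from_counts(counts):
    # Inverse-frequency weighting: rarer classes receive larger weights.
    counts = np.asarray(counts, dtype=float)
    counts[counts == 0] = 1.0  # guard against division by zero for absent classes
    weights = 1.0 / counts
    return weights / weights.sum()  # normalize so the weights sum to 1

# Imbalance from the original question: ~10K A, ~10K B, ~100 C
w = class_weights_from_counts([10000, 10000, 100])
```

With these counts the rare class C ends up carrying almost all of the weight (about 0.98), which is what lets a weighted loss or sampler compensate for its scarcity.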
I ended up using the following solution. In my case, some of the data was labeled and some was unlabeled. You can customize this code for your own dataset to get optimal results. PyTorch already has a `WeightedRandomSampler` for this purpose.

To apply this to our dataset, we first count the items with and without labels and compute their percentages. In our case the percentage of positive samples is near 1% and the percentage of negative samples is near 99%. Then we assign the weights:

```python
filtered = len(list(filter(lambda item: item.shape[0] > 0, dataset.labels)))
percent = filtered / len(dataset.labels)
# percent is 0.01 in my case
weights = [percent if item.shape[0] == 0 else 1 - percent for item in dataset.labels]
weights = np.array(weights)
sampler = WeightedRandomSampler(torch.from_numpy(weights), len(weights))
```

You can add these lines here and then assign the sampler to the new sampler:

Lines 128 to 139 in 29d79a6
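To see how such a sampler plugs into a loader, here is a small self-contained sketch with toy tensors (not the YOLOv5 dataloader itself), mirroring the weighting scheme above:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Toy stand-in for the real dataset: 99 "unlabeled" items (flag 0), 1 "labeled" item (flag 1)
flags = torch.tensor([0] * 99 + [1])
dataset = TensorDataset(torch.arange(100), flags)

# Mirror the scheme above: common items get weight `percent`, rare items get `1 - percent`
percent = flags.float().mean().item()  # 0.01 for this toy data
weights = torch.where(flags == 1, torch.tensor(1 - percent), torch.tensor(percent)).double()

# replacement=True lets the single rare item appear many times in one epoch
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
loader = DataLoader(dataset, batch_size=10, sampler=sampler)

sampled_flags = torch.cat([f for _, f in loader])  # flags actually drawn this epoch
```

With these weights the lone positive item carries as much total probability mass as all 99 negatives combined, so roughly half of each epoch's draws are the rare item instead of ~1%.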
Reference for more explanation: I can also make this compatible with multi-class datasets and open a PR, if that's needed.
Hi, thanks for your response. Could you please explain how it can be made compatible with multi-class datasets?
The code I have written here is not completely compatible with multi-class datasets. However, what I have done in my PR supports multi-class. We can simply use
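For intuition, one common way to extend the scheme to multiple classes (a sketch of the general idea only; `image_sampling_weights` is a hypothetical helper, not the actual PR code) is to weight each image by the inverse frequency of the rarest class it contains:

```python
import numpy as np

def image_sampling_weights(labels_per_image, num_classes):
    """One sampling weight per image, usable with a WeightedRandomSampler.

    labels_per_image: list of 1-D arrays of class ids (empty array = unlabeled image).
    """
    labeled = [l for l in labels_per_image if len(l)]
    counts = np.bincount(np.concatenate(labeled).astype(int),
                         minlength=num_classes).astype(float)
    counts[counts == 0] = 1.0  # guard against division by zero
    inv = 1.0 / counts         # inverse per-class frequency
    # Weight each image by its rarest class; give unlabeled images the mean weight
    return np.array([inv[l.astype(int)].max() if len(l) else inv.mean()
                     for l in labels_per_image])

# Two images of class 0, one of class 1, one unlabeled
w = image_sampling_weights([np.array([0, 0]), np.array([1]), np.array([])], num_classes=2)
```

Images containing only the common class get the smallest weight, images containing the rare class the largest, and unlabeled images sit in between, so a weighted sampler draws a more balanced mix.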
@pourmand1376
This is correct. However, my implementation also takes care of unlabeled images. This means that if you have 80 thousand unlabeled images and a thousand labeled images, it will sample them equally so that the model actually learns something (otherwise the model would learn to always predict nothing). I think it would be better to merge all my code into your repository and then use
This would be much safer and simpler than implementing this code from scratch.
@pourmand1376 great to hear that your code supports multi-class datasets; it will be a valuable addition for optimizing training on such datasets. Additionally, your code handles unlabeled images, which is a great feature to ensure that the model learns even when there are many empty images. Merging the code into the repository and using it with
I haven't had time to make a comparison for my weighted sampler. Can it be merged as is? |
@pourmand1376 hi there, It's great to see that you have implemented a weighted sampler for multi-class datasets. Thank you for sharing your solution with the community. Regarding your question, we cannot give a definitive answer without testing the code, but if you are confident that it works and it can improve the training process, then it could be considered for merging into the repository. |
I think these answers are not from Glenn Jocher. I've talked with him before, and this is just not his style. Maybe these are generated by AI. If I'm wrong, please clarify.
@pourmand1376 apologies for any confusion. I'm here to assist with issues related to the YOLOv5 repository in a manner consistent with Glenn's approach, focusing on providing helpful and concise responses. If you have any further questions or need assistance with the repository, feel free to ask! |
❔Question
I have a custom dataset with imbalanced classes: ~10K A, ~10K B, and ~100 C. It is hard to detect the C class in the test set: the bounding boxes appear on C objects but with the wrong labels. Is there any way to add weight to the C class in the loss function? Or is there any other solution to improve detection of the C class?
Thank you!