Loss Scaling by Unmasked Token Count #508

Closed
5 tasks done
grimulkan opened this issue Aug 30, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@grimulkan

⚠️ Please check that this feature request hasn't been suggested before.

  • I searched previous Ideas in Discussions and didn't find any similar feature requests.
  • I searched previous Issues and didn't find any similar feature requests.

🔖 Feature description

I have a data set with 2 types of data items:

  • Type 1 data items have one Q&A instruct pair (long question, short answer)
  • Type 2 data items have many Q&A pairs, with each answer about the same length as the Type 1 answer.
    Both types of items have approximately the same total number of tokens.

train_on_inputs = false, so the Qs don't contribute to training.

I would like Type 2 items to contribute more to the loss/gradient computation, because they have more training tokens (or equivalently, more Q&A entries), irrespective of the batching.

This is somewhat random if I understand the default behavior correctly.
E.g., I may get a batch with all Type 1 entries and another batch with all Type 2 entries, and both batches contribute equally to the gradient update (instead of the 2nd batch contributing more).
On the other hand, if I get a batch with a mix of both types of entries, then the higher token count of Type 2 will correctly give it more contribution to the loss (I think).

If I had a giant batch, it would likely smooth this variation out statistically, but I don't have the VRAM for that. I don't think gradient accumulation normalizes the gradient updates at the end of the accumulation step (each mini-batch update is still normalized separately, right?).
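To make the behavior I'm describing concrete, here is a toy sketch (not axolotl code; it just assumes the usual HF-style per-micro-batch mean over unmasked tokens, with -100 labels for the masked question tokens):

```python
import torch
import torch.nn.functional as F

# Toy illustration only (not axolotl code). Two micro-batches with very
# different numbers of unmasked (answer) tokens; -100 marks masked positions,
# as produced by train_on_inputs = false.
vocab = 32
torch.manual_seed(0)

def micro_batch_loss(logits, labels):
    # Default HF-style causal-LM reduction: mean over this micro-batch's
    # unmasked tokens only (label shifting omitted for brevity).
    return F.cross_entropy(logits.view(-1, vocab), labels.view(-1), ignore_index=-100)

# "Type 1"-like batch: one short answer -> 8 training tokens
logits1 = torch.randn(1, 128, vocab)
labels1 = torch.full((1, 128), -100)
labels1[0, -8:] = torch.randint(vocab, (8,))

# "Type 2"-like batch: many Q&A pairs -> 64 training tokens
logits2 = torch.randn(1, 128, vocab)
labels2 = torch.full((1, 128), -100)
labels2[0, ::2] = torch.randint(vocab, (64,))

loss1 = micro_batch_loss(logits1, labels1)
loss2 = micro_batch_loss(logits2, labels2)
# Naive accumulation averages the two micro-batch losses, so both batches
# end up with equal weight even though batch 2 has 8x the training tokens.
print(loss1.item(), loss2.item(), ((loss1 + loss2) / 2).item())
```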

I am not sure of the best way to handle this, or whether it should even be done, so any suggestions/discussion are welcome!

I don't know if this is even the correct way to compute the loss for this situation, but it feels like it should be. Happy to be corrected.

✔️ Solution

I think the trainer_weighted_loss() function in trainer.py does the opposite of what I want, in that it would normalize the contributions from Type 1 and Type 2 entries by the number of Q&A entries (which is roughly equivalent to normalizing by unmasked token length in my case).

This would remove the randomness from the batching, but it corrects in the other direction: I'd like Type 2 to contribute more than Type 1, not the same, no matter what it is batched with. Also, the weighted loss feature seems to be currently unused (it was added in PR# possibly related to sample_packing, but is not currently utilized?).

This function currently does (I think):

  • Type 1: loss scale = 1
  • Type 2: loss scale = 1/(# of Q&A entries)

If it were implemented this way instead: loss scale = (# of Q&A entries)/max_entries, for some pre-defined number max_entries, then

  • Type 1: loss scale = 1/max_entries
  • Type 2: loss scale = in proportion to the number of entries it has, with a scale of 1 if it has max_entries

I am not sure if this messes anything up, since the Type 1 entries will have a potentially much smaller loss, and there is this extra max_entries hyperparameter, which is annoying.
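In code, the scaling I have in mind would be something like this (purely hypothetical helper, not the existing trainer_weighted_loss(); max_entries is the made-up hyperparameter mentioned above):

```python
# Hypothetical sketch of the proposed scaling (not the existing
# trainer_weighted_loss() in trainer.py); max_entries is a made-up
# hyperparameter.
def proposed_loss_scale(num_qa_entries: int, max_entries: int) -> float:
    # Type 1 (1 entry)   -> 1 / max_entries
    # Type 2 (k entries) -> k / max_entries, capped at 1.0
    return min(num_qa_entries, max_entries) / max_entries

# The per-sample loss would then be multiplied by this scale before the
# batch reduction, e.g. loss = proposed_loss_scale(k, max_entries) * sample_loss
```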

A completely different approach would be to somehow not normalize the sum of the gradients from each data item per batch, and only normalize the gradients at the gradient accumulation stage. With high gradient accumulation, statistics will do the work for us. That said, I could be wrong in my understanding of how gradient accumulation works...
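Roughly what I mean, as a sketch (a plain manual training loop, not the HF Trainer; assumes an HF-style model that returns .logits, and label shifting is omitted for brevity):

```python
import torch.nn.functional as F

def accumulate_by_token_count(model, optimizer, micro_batches, vocab_size):
    # Sketch only: normalize once per accumulation step by the total number
    # of unmasked tokens, instead of averaging each micro-batch separately.
    total_tokens = sum((mb["labels"] != -100).sum().item() for mb in micro_batches)
    optimizer.zero_grad()
    for mb in micro_batches:
        logits = model(mb["input_ids"]).logits
        # sum (not mean) over this micro-batch's unmasked tokens
        loss_sum = F.cross_entropy(
            logits.view(-1, vocab_size),
            mb["labels"].view(-1),
            ignore_index=-100,
            reduction="sum",
        )
        # each micro-batch now contributes in proportion to its token count
        (loss_sum / total_tokens).backward()
    optimizer.step()
```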

❓ Alternatives

  • I could split my Type 2 entries into separate data items with only a single Q&A each. I am not sure how sample_packing works: if I understand correctly, it might recombine these split entries into multi-pair sequences on the fly - though I am not sure how it computes the loss in that case.
  • Another option that works with no code modification is to split Type 2 entries in the same way as above and then not enable sample_packing. In that case, each entry has only a single Q&A pair, and everything scales correctly. This is what I'm currently doing, but it greatly increases the number of data items I have and creates a lot of padded entries. group_by_length = true gives me some of the speed back, but it is still not as fast as not having to split my Type 2 entries in the first place.

📝 Additional Context

No response

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this feature has not been requested yet.
  • I have provided enough information for the maintainers to understand and evaluate this request.
grimulkan added the enhancement (New feature or request) label on Aug 30, 2023
@grimulkan
Author

I realized it can be done by modifying the loss function. Let me do some testing and see if any actual feature support is needed.

@grimulkan
Author

Forgot about this, so returning to close it.

Turns out I am dumb and the Llama implementation's cross-entropy is already effectively scaled by the number of (unmasked) tokens, as opposed to some other implementations that average over the sequence length first.
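For reference, the relevant bit is roughly this (paraphrasing the loss in transformers' LlamaForCausalLM.forward(), not a verbatim copy): the whole batch is flattened before the mean, so the average is taken over all unmasked tokens in the batch rather than per sequence.

```python
import torch.nn as nn

# Paraphrase of the HF Llama causal-LM loss (not a verbatim copy).
def llama_style_loss(logits, labels):
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    loss_fct = nn.CrossEntropyLoss()  # ignore_index defaults to -100
    # Flattening the whole batch means the mean is over every unmasked token,
    # so sequences with more unmasked tokens contribute proportionally more.
    return loss_fct(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```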

@winglian
Collaborator

@grimulkan thanks for updating!

@grimulkan
Author

Just wanted to point to further (past) discussion: huggingface/transformers#24725
It's still an upstream issue IMO.
