Handle undefined model max length #219
Conversation
Signed-off-by: Thanawan Atchariyachanvanit <latchari@amazon.com>
Codecov Report
@@            Coverage Diff             @@
##             main     #219      +/-   ##
==========================================
+ Coverage   91.23%   91.25%   +0.01%
==========================================
  Files          37       37
  Lines        4131     4138       +7
==========================================
+ Hits         3769     3776       +7
  Misses        362      362
@@ -791,6 +791,13 @@ def save_as_pt(
        zip_file_name = str(model_id.split("/")[-1] + ".zip")
        zip_file_path = os.path.join(model_output_path, zip_file_name)

        # handle undefined model_max_length in model's tokenizer (e.g. "intfloat/e5-small-v2")
        if model.tokenizer.model_max_length == 1000000000000000019884624838656:
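For context, a minimal sketch of the idea behind this check: when `model_max_length` is missing from a tokenizer's config, Hugging Face `transformers` falls back to the sentinel `VERY_LARGE_INTEGER` (`int(1e30)`, which is exactly the magic number in the diff). The helper name `resolve_max_length` below is hypothetical, not the PR's code:

```python
# transformers uses int(1e30) as the "model_max_length is undefined" sentinel;
# as a float-to-int conversion it equals 1000000000000000019884624838656.
VERY_LARGE_INTEGER = int(1e30)

def resolve_max_length(tokenizer_max_length: int, model_max_seq_length: int) -> int:
    """Return a usable max length, replacing the undefined sentinel
    with the model's own maximum sequence length."""
    if tokenizer_max_length >= VERY_LARGE_INTEGER:
        return model_max_seq_length
    return tokenizer_max_length
```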
Let's assign a static variable for this magic number and also add more comments, including a link to the GitHub issue explaining why we need this.
In addition, I assume this large number is the same in every case, but I was wondering if we should do something like:
model.tokenizer.model_max_length = min(model.tokenizer.model_max_length, model.get_max_seq_length())
Can you please check whether that is applicable in every case?
Okay! I can add a static variable for this magic number and more comments.
From what I've seen, the number is the same in every case.
As for model.tokenizer.model_max_length = min(model.tokenizer.model_max_length, model.get_max_seq_length()), I agree that it is more generalizable. From what I've checked, it should be applicable and beneficial.
I just made some changes. I use if model.tokenizer.model_max_length > model.get_max_seq_length() instead of the min function, so that we can add a print() to let the user know what's happening.
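The comparison-based approach described above can be sketched as follows. This is a hedged illustration, not the PR's exact code: the helper name `clamp_model_max_length` and the printed message are assumptions.

```python
def clamp_model_max_length(tokenizer, max_seq_length: int) -> None:
    # A tokenizer whose model_max_length is undefined reports a huge
    # sentinel value, which is always larger than the model's real
    # sequence limit; clamp it and tell the user what happened.
    if tokenizer.model_max_length > max_seq_length:
        tokenizer.model_max_length = max_seq_length
        print(
            f"model_max_length was not properly defined in the tokenizer; "
            f"setting it to {max_seq_length}"
        )
```

Compared with `min()`, the explicit `if` changes the value only when clamping is actually needed, which gives a natural place to emit the notice.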
Signed-off-by: Thanawan Atchariyachanvanit <latchari@amazon.com>
Force-pushed from 69f5f75 to 1402bf4
Thanks for raising this PR.
* Handle undefined model max length
* Update CHANGELOG.md
* Remove scratch notebook
* Fix bug
* Use compare instead
* Update sentencetransformermodel.py

Signed-off-by: Thanawan Atchariyachanvanit <latchari@amazon.com>
(cherry picked from commit 9b558c7)
Co-authored-by: Thanawan Atchariyachanvanit <latchari@amazon.com>
Description
For some models, model.tokenizer.model_max_length is not properly defined (i.e., it is set to a very large sentinel number). This led to truncation problems and caused model deployment to fail.

Issues Resolved
Helped in tracing intfloat/e5-small-v2 in #218

Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.