
Handle undefined model max length #219

Merged

Conversation

thanawan-atc
Contributor

@thanawan-atc commented Aug 11, 2023

Description

For some models, model.tokenizer.model_max_length is not properly defined (i.e., it is set to a very large sentinel value). This leads to truncation problems and causes model deployment to fail.

[Two screenshots attached, 2023-08-10 9:26 PM and 9:27 PM.]
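For illustration, a minimal sketch of how the issue shows up, assuming the sentence-transformers package is installed (intfloat/e5-small-v2 is the model from #218; the printed values may differ depending on the hosted tokenizer config):

```python
from sentence_transformers import SentenceTransformer

# Model from #218 whose hosted tokenizer config lacked a proper model_max_length.
model = SentenceTransformer("intfloat/e5-small-v2")

# When model_max_length is undefined, Hugging Face tokenizers fall back to a huge
# sentinel value (int(1e30) == 1000000000000000019884624838656), so inputs are
# effectively never truncated during tracing.
print(model.tokenizer.model_max_length)  # e.g. 1000000000000000019884624838656
print(model.get_max_seq_length())        # the model's real limit, e.g. 512
```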

Issues Resolved

Helped with tracing intfloat/e5-small-v2 in #218.

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Thanawan Atchariyachanvanit <latchari@amazon.com>
@codecov

codecov bot commented Aug 11, 2023

Codecov Report

Merging #219 (f783502) into main (bd50b8b) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #219      +/-   ##
==========================================
+ Coverage   91.23%   91.25%   +0.01%     
==========================================
  Files          37       37              
  Lines        4131     4138       +7     
==========================================
+ Hits         3769     3776       +7     
  Misses        362      362              
Files Changed Coverage Δ
...search_py_ml/ml_models/sentencetransformermodel.py 74.35% <100.00%> (+0.42%) ⬆️

@@ -791,6 +791,13 @@ def save_as_pt(
zip_file_name = str(model_id.split("/")[-1] + ".zip")
zip_file_path = os.path.join(model_output_path, zip_file_name)

# handle undefined model_max_length in model's tokenizer (e.g. "intfloat/e5-small-v2" )
if model.tokenizer.model_max_length == 1000000000000000019884624838656:
Collaborator
Let's assign a static variable for this magic number and also add more comments, including the GitHub issue explaining why we need this.

In addition, I assume this large value is the same in every case, but I was wondering if we should do something like:

model.tokenizer.model_max_length = min(model.tokenizer.model_max_length, model.get_max_seq_length())

Can you please check whether that's applicable in every case?

Contributor Author

Okay! I can add a static variable for this magic number and more comments.

From what I've seen, the number is the same.

For model.tokenizer.model_max_length = min(model.tokenizer.model_max_length, model.get_max_seq_length()), I agree that it is more generalizable, and from what I've checked, it should be applicable and beneficial.

Contributor Author

I just made some changes. I use if model.tokenizer.model_max_length > model.get_max_seq_length() instead of the min function, so that we can add a print() to let the user know what's happening.
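A minimal sketch of the comparison-based fix described in this thread (the helper name and warning message are illustrative; the merged change lives in opensearch_py_ml/ml_models/sentencetransformermodel.py):

```python
from sentence_transformers import SentenceTransformer

def fix_model_max_length(model: SentenceTransformer) -> None:
    # Illustrative helper: clamp an undefined or oversized tokenizer
    # model_max_length down to the model's actual maximum sequence length,
    # and tell the user that the value was adjusted.
    max_seq_length = model.get_max_seq_length()
    if model.tokenizer.model_max_length > max_seq_length:
        model.tokenizer.model_max_length = max_seq_length
        print(
            "model_max_length is not properly defined in the tokenizer config; "
            f"setting it to {max_seq_length}"
        )

model = SentenceTransformer("intfloat/e5-small-v2")
fix_model_max_length(model)  # ensures tracing/truncation uses the real limit
```

Using the comparison rather than min() keeps the assignment conditional, so the notice is printed only when an adjustment actually happens.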

Signed-off-by: Thanawan Atchariyachanvanit <latchari@amazon.com>
@dhrubo-os
Collaborator

Thanks for raising this PR.

@dhrubo-os dhrubo-os merged commit 9b558c7 into opensearch-project:main Aug 15, 2023
13 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 15, 2023
* Handle undefined model max length
* Update CHANGELOG.md
* Remove scratch notebook
* Fix bug
* Use compare instead
* Update sentencetransformermodel.py

Signed-off-by: Thanawan Atchariyachanvanit <latchari@amazon.com>
(cherry picked from commit 9b558c7)
dhrubo-os pushed a commit that referenced this pull request Aug 15, 2023
(cherry picked from commit 9b558c7)

Co-authored-by: Thanawan Atchariyachanvanit <latchari@amazon.com>