
Handle undefined model max length #219

Merged

Conversation

thanawan-atc
Contributor

@thanawan-atc commented Aug 11, 2023

Description

For some models, model.tokenizer.model_max_length is not properly defined (i.e., it is set to a very large sentinel value). This leads to truncation problems and causes model deployment to fail.

[Two screenshots attached, 2023-08-10 9:26 PM and 9:27 PM.]
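For illustration, a minimal sketch of how the issue shows up, assuming the sentence-transformers package is installed (intfloat/e5-small-v2 is the model from #218; the printed values may differ depending on the hosted tokenizer config):

```python
from sentence_transformers import SentenceTransformer

# Model from #218 whose hosted tokenizer config lacked a proper model_max_length.
model = SentenceTransformer("intfloat/e5-small-v2")

# When model_max_length is undefined, Hugging Face tokenizers fall back to a huge
# sentinel value (int(1e30) == 1000000000000000019884624838656), so inputs are
# effectively never truncated during tracing.
print(model.tokenizer.model_max_length)  # e.g. 1000000000000000019884624838656
print(model.get_max_seq_length())        # the model's real limit, e.g. 512
```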

Issues Resolved

Helped with tracing intfloat/e5-small-v2 in #218.

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Thanawan Atchariyachanvanit <latchari@amazon.com>
@codecov

codecov bot commented Aug 11, 2023

Codecov Report

Merging #219 (f783502) into main (bd50b8b) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #219      +/-   ##
==========================================
+ Coverage   91.23%   91.25%   +0.01%     
==========================================
  Files          37       37              
  Lines        4131     4138       +7     
==========================================
+ Hits         3769     3776       +7     
  Misses        362      362              
Files Changed Coverage Δ
...search_py_ml/ml_models/sentencetransformermodel.py 74.35% <100.00%> (+0.42%) ⬆️

@@ -791,6 +791,13 @@ def save_as_pt(
zip_file_name = str(model_id.split("/")[-1] + ".zip")
zip_file_path = os.path.join(model_output_path, zip_file_name)

# handle undefined model_max_length in model's tokenizer (e.g. "intfloat/e5-small-v2" )
if model.tokenizer.model_max_length == 1000000000000000019884624838656:
Collaborator
Let's assign a static variable for this magic number and also add more comments, including the GitHub issue explaining why we need this.

In addition, I assume this large value is the same in every case, but I was wondering if we should do something like:

model.tokenizer.model_max_length = min(model.tokenizer.model_max_length, model.get_max_seq_length())

Can you please check whether that's applicable in every case?

Contributor Author

Okay! I can add a static variable for this magic number and more comments.

From what I've seen, the number is the same.

For model.tokenizer.model_max_length = min(model.tokenizer.model_max_length, model.get_max_seq_length()), I agree that it is more generalizable, and from what I've checked, it should be applicable and beneficial.

Contributor Author

I just made some changes. I use if model.tokenizer.model_max_length > model.get_max_seq_length() instead of the min function, so that we can add a print() to let the user know what's happening.
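A minimal sketch of the comparison-based fix described in this thread (the helper name and warning message are illustrative; the merged change lives in opensearch_py_ml/ml_models/sentencetransformermodel.py):

```python
from sentence_transformers import SentenceTransformer

def fix_model_max_length(model: SentenceTransformer) -> None:
    # Illustrative helper: clamp an undefined or oversized tokenizer
    # model_max_length down to the model's actual maximum sequence length,
    # and tell the user that the value was adjusted.
    max_seq_length = model.get_max_seq_length()
    if model.tokenizer.model_max_length > max_seq_length:
        model.tokenizer.model_max_length = max_seq_length
        print(
            "model_max_length is not properly defined in the tokenizer config; "
            f"setting it to {max_seq_length}"
        )

model = SentenceTransformer("intfloat/e5-small-v2")
fix_model_max_length(model)  # ensures tracing/truncation uses the real limit
```

Using the comparison rather than min() keeps the assignment conditional, so the notice is printed only when an adjustment actually happens.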

Signed-off-by: Thanawan Atchariyachanvanit <latchari@amazon.com>
@dhrubo-os
Collaborator

Thanks for raising this PR.

@dhrubo-os dhrubo-os merged commit 9b558c7 into opensearch-project:main Aug 15, 2023
13 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 15, 2023
* Handle undefined model max length
* Update CHANGELOG.md
* Remove scratch notebook
* Fix bug
* Use compare instead
* Update sentencetransformermodel.py

Signed-off-by: Thanawan Atchariyachanvanit <latchari@amazon.com>
(cherry picked from commit 9b558c7)
dhrubo-os pushed a commit that referenced this pull request Aug 15, 2023
(cherry picked from commit 9b558c7)

Co-authored-by: Thanawan Atchariyachanvanit <latchari@amazon.com>