
[Failing Test]: beam_Inference_Python_Benchmarks_Dataflow failing with RuntimeError: Error(s) in loading state_dict for BertForMaskedLM. #27647

Closed
tvalentyn opened this issue Jul 24, 2023 · 8 comments
Labels: bug, done & done (Issue has been reviewed after it was closed for verification, followups, etc.), failing test, P2, permared, python, run-inference, tests

Comments

@tvalentyn (Contributor)

What happened?

Test suite: https://ci-beam.apache.org/job/beam_Inference_Python_Benchmarks_Dataflow/

...
  File "/opt/apache/beam-venv/beam-venv-****-sdk-0-0/lib/python3.8/site-packages/apache_beam/utils/shared.py", line 139, in acquire
    result = constructor_fn()
  File "/opt/apache/beam-venv/beam-venv-****-sdk-0-0/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", line 834, in load
    model = self._model_handler.load_model()
  File "/opt/apache/beam-venv/beam-venv-****-sdk-0-0/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", line 270, in load_model
    return self._unkeyed.load_model()
  File "/opt/apache/beam-venv/beam-venv-****-sdk-0-0/lib/python3.8/site-packages/apache_beam/ml/inference/pytorch_inference.py", line 501, in load_model
    model, device = _load_model(
  File "/opt/apache/beam-venv/beam-venv-****-sdk-0-0/lib/python3.8/site-packages/apache_beam/ml/inference/pytorch_inference.py", line 119, in _load_model
    raise e
  File "/opt/apache/beam-venv/beam-venv-****-sdk-0-0/lib/python3.8/site-packages/apache_beam/ml/inference/pytorch_inference.py", line 102, in _load_model
    model.load_state_dict(state_dict)
  File "/opt/apache/beam-venv/beam-venv-****-sdk-0-0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for BertForMaskedLM:
	Unexpected key(s) in state_dict: "bert.embeddings.position_ids".  [while running 'PyTorchRunInference/BeamML_RunInference-ptransform-81']
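
The failure mode above (an extra `bert.embeddings.position_ids` key in the saved checkpoint) can be reproduced and worked around in isolation. A minimal sketch, assuming a plain PyTorch module as a stand-in for `BertForMaskedLM`:

```python
import torch
import torch.nn as nn

# Stand-in model; the real failure involves BertForMaskedLM, but any module
# with a state_dict shows the same strict-loading behavior.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

model = TinyModel()
state_dict = dict(model.state_dict())
# Simulate a stale buffer key left over from an older library version.
state_dict["embeddings.position_ids"] = torch.arange(4)

# strict=True (the default) raises RuntimeError: Unexpected key(s) in state_dict.
# Workaround 1: strict=False reports, rather than raises on, the mismatch.
result = model.load_state_dict(state_dict, strict=False)
print(result.unexpected_keys)  # ['embeddings.position_ids']

# Workaround 2: filter the checkpoint down to keys the model actually expects.
filtered = {k: v for k, v in state_dict.items() if k in model.state_dict()}
model.load_state_dict(filtered)  # strict load now succeeds
```

Whether ignoring or filtering is appropriate depends on whether the extra key carries state the model needs; for a non-persistent positional-ids buffer it does not.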

Issue Failure

Failure: Test is continually failing

Issue Priority

Priority: 2 (backlog / disabled test but we think the product is healthy)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@tvalentyn (Contributor, Author)

cc: @AnandInguva @damccorm in case you are aware of recent changes.

@AnandInguva (Contributor)

Could be due to the new release of the transformers library: https://pypi.org/project/transformers/#history

@riteshghorse (Contributor) commented Jul 31, 2023

Check #27734. I updated the apache-beam-ml/models/huggingface.BertForMaskedLM.bert-base-uncased.pth file. I'll update the other saved file on GCS as well.

@damccorm (Contributor)

Thanks, Ritesh! I kicked off a run (https://ci-beam.apache.org/job/beam_Inference_Python_Benchmarks_Dataflow/325/); we can close this when it passes.

@riteshghorse (Contributor) commented Jul 31, 2023

The language_modeling builds with BertForMaskedLM were successful.

benchmark-tests-pytorch-imagenet-python-gpu0731125250 failed with a CUDA error:

  File "/opt/apache/beam-venv/beam-venv-****-sdk-0-0/lib/python3.8/site-packages/apache_beam/ml/inference/pytorch_inference.py", line 135, in _convert_to_device
    examples = examples.to(device)
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
 [while running 'PyTorchRunInference/BeamML_RunInference-ptransform-73']

@riteshghorse (Contributor) commented Jul 31, 2023

Could be a one-off. Let's see in this run: https://ci-beam.apache.org/job/beam_Inference_Python_Benchmarks_Dataflow/326/

Also, we could wrap .to(device) in a try/except to avoid failing the pipeline when there is no memory left on the GPU.

@riteshghorse (Contributor)

There is also a quota-exceeded error: `QUOTA_EXCEEDED: Instance 'benchmark-tests-pytorch-i-07311243-i2rg-harness-f2x1' creation failed: Quota 'NVIDIA_T4_GPUS' exceeded. Limit: 32.0 in region us-central1.`

@riteshghorse (Contributor) commented Aug 2, 2023

Latest run #327 passed. Looks like it was a problem with the GPU not being available, which ultimately resulted in `RuntimeError: CUDA error: misaligned address`. Confirmed that with #27785, where I checked for an available GPU before moving tensors; it failed at the QUOTA_EXCEEDED error without ever reaching the .to(device) step.
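
The availability check described here can be sketched as follows; `resolve_device` is a hypothetical name for illustration, not the helper used in #27785:

```python
import torch

def resolve_device(prefer_gpu: bool = True) -> torch.device:
    # Only select CUDA when a GPU is actually usable on this worker;
    # otherwise fall back to CPU so tensors are never moved to a bad device.
    if prefer_gpu and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = resolve_device()
tensor = torch.zeros(2).to(device)  # safe on both GPU and CPU-only workers
```

Note that `torch.cuda.is_available()` confirms a visible, initialized CUDA device, so this check would have surfaced the missing-GPU condition before any transfer was attempted.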

@github-actions github-actions bot added this to the 2.50.0 Release milestone Aug 2, 2023
@damccorm damccorm added the done & done Issue has been reviewed after it was closed for verification, followups, etc. label Aug 8, 2023

4 participants