
[Failing Test]: beam_Inference_Python_Benchmarks_Dataflow failing with RuntimeError: Error(s) in loading state_dict for BertForMaskedLM. #27647

Closed
tvalentyn opened this issue Jul 24, 2023 · 8 comments
Labels: bug, done & done (Issue has been reviewed after it was closed for verification, followups, etc.), failing test, P2, permared, python, run-inference, tests

Comments

@tvalentyn (Contributor)

What happened?

Test suite: https://ci-beam.apache.org/job/beam_Inference_Python_Benchmarks_Dataflow/

...
  File "/opt/apache/beam-venv/beam-venv-****-sdk-0-0/lib/python3.8/site-packages/apache_beam/utils/shared.py", line 139, in acquire
    result = constructor_fn()
  File "/opt/apache/beam-venv/beam-venv-****-sdk-0-0/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", line 834, in load
    model = self._model_handler.load_model()
  File "/opt/apache/beam-venv/beam-venv-****-sdk-0-0/lib/python3.8/site-packages/apache_beam/ml/inference/base.py", line 270, in load_model
    return self._unkeyed.load_model()
  File "/opt/apache/beam-venv/beam-venv-****-sdk-0-0/lib/python3.8/site-packages/apache_beam/ml/inference/pytorch_inference.py", line 501, in load_model
    model, device = _load_model(
  File "/opt/apache/beam-venv/beam-venv-****-sdk-0-0/lib/python3.8/site-packages/apache_beam/ml/inference/pytorch_inference.py", line 119, in _load_model
    raise e
  File "/opt/apache/beam-venv/beam-venv-****-sdk-0-0/lib/python3.8/site-packages/apache_beam/ml/inference/pytorch_inference.py", line 102, in _load_model
    model.load_state_dict(state_dict)
  File "/opt/apache/beam-venv/beam-venv-****-sdk-0-0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for BertForMaskedLM:
	Unexpected key(s) in state_dict: "bert.embeddings.position_ids".  [while running 'PyTorchRunInference/BeamML_RunInference-ptransform-81']
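
The failure mode above (an extra `bert.embeddings.position_ids` key in the saved checkpoint) can be reproduced and worked around in isolation. A minimal sketch, assuming a plain PyTorch module as a stand-in for `BertForMaskedLM`:

```python
import torch
import torch.nn as nn

# Stand-in model; the real failure involves BertForMaskedLM, but any module
# with a state_dict shows the same strict-loading behavior.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

model = TinyModel()
state_dict = dict(model.state_dict())
# Simulate a stale buffer key left over from an older library version.
state_dict["embeddings.position_ids"] = torch.arange(4)

# strict=True (the default) raises RuntimeError: Unexpected key(s) in state_dict.
# Workaround 1: strict=False reports, rather than raises on, the mismatch.
result = model.load_state_dict(state_dict, strict=False)
print(result.unexpected_keys)  # ['embeddings.position_ids']

# Workaround 2: filter the checkpoint down to keys the model actually expects.
filtered = {k: v for k, v in state_dict.items() if k in model.state_dict()}
model.load_state_dict(filtered)  # strict load now succeeds
```

Whether ignoring or filtering is appropriate depends on whether the extra key carries state the model needs; for a non-persistent positional-ids buffer it does not.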

Issue Failure

Failure: Test is continually failing

Issue Priority

Priority: 2 (backlog / disabled test but we think the product is healthy)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@tvalentyn (Contributor, Author)

cc: @AnandInguva @damccorm in case you are aware of recent changes.

@AnandInguva (Contributor)

Could be due to the new release of the transformers library: https://pypi.org/project/transformers/#history

@riteshghorse (Contributor) commented Jul 31, 2023

Check #27734. I updated the apache-beam-ml/models/huggingface.BertForMaskedLM.bert-base-uncased.pth file. I'll update the other saved file on GCS as well.

@damccorm (Contributor)

Thanks, Ritesh! I kicked off a run (https://ci-beam.apache.org/job/beam_Inference_Python_Benchmarks_Dataflow/325/); we can close this when it passes.

@riteshghorse (Contributor) commented Jul 31, 2023

The language_modeling builds with BertForMaskedLM were successful.

benchmark-tests-pytorch-imagenet-python-gpu0731125250 failed with a CUDA error:

  File "/opt/apache/beam-venv/beam-venv-****-sdk-0-0/lib/python3.8/site-packages/apache_beam/ml/inference/pytorch_inference.py", line 135, in _convert_to_device
    examples = examples.to(device)
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
 [while running 'PyTorchRunInference/BeamML_RunInference-ptransform-73']

@riteshghorse (Contributor) commented Jul 31, 2023

Could be a one-off. Let's see in this run: https://ci-beam.apache.org/job/beam_Inference_Python_Benchmarks_Dataflow/326/

Also, we could wrap .to(device) in a try/except to avoid failing the pipeline when there is no memory left on the GPU.

@riteshghorse (Contributor)

There is also a quota-exceeded error: `QUOTA_EXCEEDED: Instance 'benchmark-tests-pytorch-i-07311243-i2rg-harness-f2x1' creation failed: Quota 'NVIDIA_T4_GPUS' exceeded. Limit: 32.0 in region us-central1.`

@riteshghorse (Contributor) commented Aug 2, 2023

Latest run #327 passed. Looks like it was a problem with the GPU not being available, which ultimately resulted in `RuntimeError: CUDA error: misaligned address`. Confirmed that with #27785, where I checked for an available GPU before moving tensors; it failed at the QUOTA_EXCEEDED error without ever reaching the .to(device) step.
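
The availability check described here can be sketched as follows; `resolve_device` is a hypothetical name for illustration, not the helper used in #27785:

```python
import torch

def resolve_device(prefer_gpu: bool = True) -> torch.device:
    # Only select CUDA when a GPU is actually usable on this worker;
    # otherwise fall back to CPU so tensors are never moved to a bad device.
    if prefer_gpu and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = resolve_device()
tensor = torch.zeros(2).to(device)  # safe on both GPU and CPU-only workers
```

Note that `torch.cuda.is_available()` confirms a visible, initialized CUDA device, so this check would have surfaced the missing-GPU condition before any transfer was attempted.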

@github-actions github-actions bot added this to the 2.50.0 Release milestone Aug 2, 2023
@damccorm damccorm added the done & done Issue has been reviewed after it was closed for verification, followups, etc. label Aug 8, 2023

4 participants