
regression in 0.5 with pytorch segfault #2447

Closed
vlad17 opened this issue Jul 20, 2018 · 23 comments


@vlad17

vlad17 commented Jul 20, 2018

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Ray installed from (source or binary): pip
  • Ray version: 0.5.0
  • Python version: 3.5
  • Exact command to reproduce:

Describe the problem

I hit an issue moving a PyTorch 0.4 model onto a k80 GPU from a tune worker where I was unable to see any error trace: the worker was segfaulting.

I was able to replicate the segfault by invoking the same training function (which is my application code) in the same main file that I started ray with ray.init. As soon as I called model.cuda(), and in particular when a Conv2d module was being moved to the GPU, there was a segfault in the pytorch code at lazy_cuda_init. The only interaction with ray is that ray was initialized in the same process.

When I demote ray to version 0.4 the issue disappears. This was on an AWS p2 instance.

I'll make a minimal example when I have some time; I just wanted to post the issue after noticing that the ray downgrade resolved the problem.
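
The failing pattern boils down to roughly this (just a sketch, not yet verified as the minimal case; the layer sizes are placeholders):

import ray
import torch

ray.init(num_gpus=1)  # ray 0.5 initialized in the same process

# The segfault shows up when a Conv2d module is moved to the GPU
torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()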

Source code / logs

to come

@pcmoritz @ericl @richardliaw @robertnishihara

@robertnishihara
Collaborator

  1. Do you have tensorflow installed? If you install tensorflow does that change the behavior at all?
  2. Does it matter if you import ray or pytorch first? (A quick check is sketched below.)

Could be related to #2391 or #2159.
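
A quick way to check question 2 is two separate runs that differ only in import order (just a sketch; needs a CUDA device):

# Run 1: ray imported first (the reported setup)
import ray
import torch
torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()

# Run 2, in a fresh interpreter: torch imported first
#   import torch
#   import ray
#   torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()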

@robertnishihara
Collaborator

@vlad17 please share the code when you have a chance.

@vlad17
Author

vlad17 commented Jul 21, 2018

Please see the inline script and attached install file. It gets a segfault as shown on a machine with a K80 with CUDA 9.1 installed. I'm pretty sure the rllib/tf monkey patch is unnecessary; I didn't slim things down past my personal code deps, but I figured you'd rather get reproducing code earlier.

conda create -y -n breaking-env python=3.5
source activate breaking-env
./scripts/install-pytorch.sh
pip install ray==0.5 absl-py

# on a cuda 9.1 device
CUDA_VISIBLE_DEVICES=0 python -c '
from absl import app
from absl import flags
import ray
# monkey patch rllib dep to avoid bringing in gym and TF
ray.rllib = None
import ray.tune
from ray.tune import register_trainable, run_experiments

def ray_train(config, status_reporter):
    import torch
    torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()

def _main(_):
    ray.init(num_gpus=1)
    ray_train(None, None)

if __name__ == "__main__":
    app.run(_main)
'

install-pytorch.sh.zip

Fatal Python error: Segmentation fault

Thread 0x00007ff3204ae700 (most recent call first):
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/socket.py", line 134 in __init__
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/site-packages/redis/connection.py", line 515 in _connect
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/site-packages/redis/connection.py", line 484 in connect
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/site-packages/redis/connection.py", line 585 in send_packed_command
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/site-packages/redis/connection.py", line 610 in send_command
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/site-packages/redis/client.py", line 667 in execute_command
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/site-packages/redis/client.py", line 1347 in lrange
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/site-packages/ray/worker.py", line 1920 in print_error_messages
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/threading.py", line 862 in run
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/threading.py", line 914 in _bootstrap_inner
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/threading.py", line 882 in _bootstrap

Thread 0x00007ff31fcad700 (most recent call first):
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/site-packages/ray/worker.py", line 2076 in import_thread
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/threading.py", line 862 in run
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/threading.py", line 914 in _bootstrap_inner
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/threading.py", line 882 in _bootstrap

Current thread 0x00007ff356346700 (most recent call first):
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 249 in <lambda>
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 182 in _apply
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 249 in cuda
  File "<string>", line 12 in ray_train
  File "<string>", line 16 in _main
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/site-packages/absl/app.py", line 238 in _run_main
  File "/home/ubuntu/conda/envs/breaking-env/lib/python3.5/site-packages/absl/app.py", line 274 in run
  File "<string>", line 19 in <module>
Segmentation fault (core dumped)

@richardliaw
Contributor

Works on a Titan Xp; trying to reproduce on a separate env now... is there a difference with just using conda install pytorch torchvision cuda91 -c pytorch?

@vlad17
Author

vlad17 commented Jul 22, 2018

@richardliaw still segfaults even with that setup. Can you replicate on a p2?

@richardliaw
Contributor

trying now

@richardliaw
Contributor

Got the segfault on a p2.xlarge. Simply

import ray
import torch
torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()

Fails with Torch 0.4 (cuda 9.0) and ray 0.5. Seems to be exactly the same as #2413

@richardliaw
Contributor

This also fails: @pcmoritz, @robertnishihara

import sys
sys.path.insert(0, "/home/ubuntu/anaconda3/envs/breaking-env/lib/python3.5/site-packages/ray/pyarrow_files/")
import pyarrow
import torch
print(pyarrow.__file__) # /home/ubuntu/anaconda3/envs/breaking-env/lib/python3.5/site-packages/ray/pyarrow_files/pyarrow/__init__.py
torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()

No error is thrown if the import order of pyarrow and torch is switched. The pyarrow from pip works fine, though.
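
For comparison, a sketch of the import order that does not crash, per the note above (same sys.path tweak and module as before):

import sys
sys.path.insert(0, "/home/ubuntu/anaconda3/envs/breaking-env/lib/python3.5/site-packages/ray/pyarrow_files/")
import torch    # importing torch before pyarrow avoids the segfault
import pyarrow
torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()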

@pcmoritz
Contributor

@richardliaw Which AMI is this using? On the deep learning AMI, this works for me (with the latest master and pytorch from the AMI).

@richardliaw
Contributor

richardliaw commented Jul 24, 2018 via email

@pcmoritz
Contributor

Ok, if I use the python3 environment in the DL AMI, install pytorch from pip and ray from source, I still can't reproduce it unfortunately. What else could be different?

@richardliaw
Contributor

richardliaw commented Jul 25, 2018 via email

@pcmoritz
Contributor

Already tried that, no segfault.

@richardliaw
Contributor

I used this autoscaler setup:

# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal_2

# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers. min_workers defaults to 0.
min_workers: 0
max_workers: 0

# docker:
#     image: tensorflow/tensorflow:1.5.0-py3
#     container_name: ray_docker

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-east-1
    availability_zone: us-east-1f

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu

head_node:
    InstanceType: p2.xlarge
    ImageId: ami-4aa57835

setup_commands: 
    - echo "export PYTHONNOUSERSITE=True" >> ~/.bashrc
    - conda create -y -n breaking-env python=3.5
    - source activate breaking-env && conda install pytorch torchvision cuda91 -c pytorch && pip install ray==0.5 absl-py

@pcmoritz
Contributor

So even with

conda create -y -n breaking-env python=3.5
source activate breaking-env && conda install pytorch torchvision cuda91 -c pytorch && pip install ray==0.5 absl-py

and then in IPython:

In [1]: import ray
/home/ubuntu/anaconda3/envs/python3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/ubuntu/anaconda3/envs/python3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)

In [2]: import torch

In [3]: torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()
   ...: 
Out[3]: Conv2d(64, 2, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)

I'm not able to reproduce it. Could it be that the environment of the autoscaler is different in some way (maybe env variables)?
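
One way to narrow down the env-variable question is to dump the environment in both settings and diff the results (just a sketch; run it once in the autoscaler-launched shell and once interactively):

import os

# Write a sorted dump of the environment; compare the two files with diff.
with open("env_dump.txt", "w") as f:
    for key in sorted(os.environ):
        f.write("{}={}\n".format(key, os.environ[key]))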

@richardliaw
Contributor

richardliaw commented Jul 26, 2018 via email

@pcmoritz
Contributor

Good point, IPython doesn't seem to be present in the env:

(breaking-env) ubuntu@ip-172-31-56-152:~$ which ipython
/home/ubuntu/anaconda3/envs/python3/bin/ipython
(breaking-env) ubuntu@ip-172-31-56-152:~$ which python
/home/ubuntu/anaconda3/envs/breaking-env/bin/python
(breaking-env) ubuntu@ip-172-31-56-152:~$ 

@pcmoritz
Contributor

With plain python (instead of IPython) the repro works:

(breaking-env) ubuntu@ip-172-31-56-152:~$ python
Python 3.5.5 |Anaconda, Inc.| (default, May 13 2018, 21:12:35) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ray
>>> import torch
>>> torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()
Segmentation fault (core dumped)

@pcmoritz
Contributor

Here is the backtrace:

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff7bc8a99 in __pthread_once_slow (once_control=0x7fffdb227e50 <at::globalContext()::globalContext_+400>, init_routine=0x7fffe4973fe1 <std::__once_proxy()>)
    at pthread_once.c:116
#2  0x00007fffda3f2302 in at::Type::toBackend(at::Backend) const () from /home/ubuntu/anaconda3/envs/breaking-env/lib/python3.5/site-packages/torch/lib/libcaffe2.so
#3  0x00007fffdc031231 in torch::autograd::VariableType::toBackend (this=<optimized out>, b=<optimized out>) at torch/csrc/autograd/generated/VariableType.cpp:145
#4  0x00007fffdc371e8a in torch::autograd::THPVariable_cuda (self=0x7ffff6dbfdc8, args=0x7ffff6daf710, kwargs=0x0) at torch/csrc/autograd/generated/python_variable_methods.cpp:333
#5  0x000055555569f4e8 in PyCFunction_Call ()
#6  0x00005555556f67cc in PyEval_EvalFrameEx ()
#7  0x00005555556fbe08 in PyEval_EvalFrameEx ()
#8  0x00005555556f6e90 in PyEval_EvalFrameEx ()
#9  0x00005555556fbe08 in PyEval_EvalFrameEx ()
#10 0x000055555570103d in PyEval_EvalCodeEx ()
#11 0x0000555555701f5c in PyEval_EvalCode ()
#12 0x000055555575e454 in run_mod ()
#13 0x000055555562ab5e in PyRun_InteractiveOneObject ()
#14 0x000055555562ad01 in PyRun_InteractiveLoopFlags ()
#15 0x000055555562ad62 in PyRun_AnyFileExFlags.cold.2784 ()
#16 0x000055555562b080 in Py_Main.cold.2785 ()
#17 0x000055555562b871 in main ()
(gdb) 

So I'm pretty sure it's the same problem that happened with TensorFlow, for which we deployed a workaround in apache/arrow#2210.

I'll open a JIRA in arrow. This is super annoying; I hope we can fix the arrow thread pool altogether, otherwise we will need a similar workaround for pytorch too.
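
Until an arrow-side fix lands, one possible user-side mitigation suggested by the import-order observation above is to import torch before ray (just a sketch; whether the pyarrow/torch ordering carries over through "import ray" is an assumption):

import torch  # imported first, before ray pulls in its bundled pyarrow
import ray

ray.init(num_gpus=1)
# Reportedly no segfault when torch is imported before pyarrow (see the earlier comment)
torch.nn.Conv2d(64, 2, kernel_size=3, stride=1, padding=1, bias=False).cuda()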

@pcmoritz
Contributor

This is tough: I can only reproduce it with ray installed from pip, not compiled from source. And not with pyarrow from pip (maybe that's too old).

@richardliaw
Contributor

richardliaw commented Jul 26, 2018 via email

@pcmoritz
Contributor

Yes it does, but only if ray is pip installed (not if locally compiled).

Fortunately now I have also been able to reproduce it with manylinux1 pyarrow wheels compiled from the latest arrow master inside of manylinux1 docker :)

@pcmoritz
Contributor

Here is the arrow bug report: https://issues.apache.org/jira/browse/ARROW-2920

wesm pushed a commit to apache/arrow that referenced this issue Jul 27, 2018
This fixes ARROW-2920 (see also ray-project/ray#2447) for me

Unfortunately we might not be able to have regression tests for this right now because we don't have CUDA in our test toolchain.

Author: Philipp Moritz <pcmoritz@gmail.com>

Closes #2329 from pcmoritz/fix-pytorch-segfault and squashes the following commits:

1d82825 <Philipp Moritz> fix
74bc93e <Philipp Moritz> add note
ff14c4d <Philipp Moritz> fix
b343ca6 <Philipp Moritz> add regression test
5f0cafa <Philipp Moritz> fix
2751679 <Philipp Moritz> fix
10c5a5c <Philipp Moritz> workaround for pyarrow segfault
ericl closed this as completed Aug 1, 2018