
Confusion about ~/.nv/ComputeCache behavior with docker #272

Closed
ajhool opened this issue Mar 14, 2019 · 9 comments


ajhool commented Mar 14, 2019

As the README.md says:

Note that running waifu2x without JIT caching is very slow, which is what would happen if you use docker. For a workaround, you can mount a host volume to the CUDA_CACHE_PATH, for instance,

nvidia-docker run -v $PWD/ComputeCache:/root/.nv/ComputeCache waifu2x th waifu2x.lua --help

Does this mean that, when using Docker, waifu2x will always run very slowly the first time it is executed against a given host volume, and that subsequent executions against that volume will be faster? Is LuaJIT compiling the program the first time it is used and then executing the compiled version on subsequent runs?

My specific use case is that I'm executing the Docker image in the cloud (AWS EC2 -- p3.2xlarge instances using the Volta architecture). This means that the host volume changes frequently. So, if I spin up a new EC2 instance from an AMI that has never executed waifu2x before, will the first execution of the Docker image always be slow (even if I pass the ComputeCache path to Docker)? If so, I would generate the AMI after executing waifu2x so that the binary is already in the ComputeCache when the server is started, but that step is nontrivial in practice.

Are there additional steps I need to take to "prime" the host container with precompiled binaries/libraries for the Volta architecture that would make subsequent docker executions run more quickly? Is it possible to simply build waifu2x ahead of time, instead of relying on JIT?

@nagadomi (Owner)

First, I am not familiar with Docker and I do not use it personally.

Are there additional steps I need to take to "prime" the host container with precompiled binaries/libraries for the Volta architecture that would make subsequent docker executions run more quickly? Is it possible to simply build waifu2x ahead of time, instead of relying on JIT?

It is related to CUDA, not LuaJIT.

From https://devblogs.nvidia.com/cuda-pro-tip-understand-fat-binaries-jit-caching/:

The first approach is to completely avoid the JIT cost by including binary code for one or more architectures in the application binary along with PTX code. The CUDA runtime looks for code for the present GPU architecture in the binary, and runs it if found. If binary code is not found but PTX is available, then the driver compiles the PTX code. In this way deployed CUDA applications can support new GPUs when they come out.

So I guess that if you build Torch7 (cutorch and cunn) with the -gencode arch=compute_70,code=sm_70 option, you can avoid CUDA JIT compilation.
However, kaixhin/cuda-torch:7.5 is very old, and its cutorch was last updated before Volta was released.
So we probably need to recreate the Docker image so that it builds Torch7 with a compute_70 binary.
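
For illustration, a rough sketch of that idea against a torch/distro checkout (treat TORCH_NVCC_FLAGS and the exact flag set as assumptions; it is the conventional way to pass extra nvcc flags to the distro build):

# Build cutorch/cunn with Volta SASS plus PTX so the driver does not have to JIT-compile.
# TORCH_NVCC_FLAGS and its handling by install.sh are assumptions in this sketch.
export TORCH_NVCC_FLAGS="-gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70"
./install.sh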

I would generate the AMI after executing waifu2x so that the binary is already in the ComputeCache when the server is started

I guess this will work. waifu2x.udp.jp uses an AMI without Docker (however, the first execution still takes about 30 seconds).
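
As a sketch of that "prime the cache, then bake the AMI" idea (paths and image name follow the README example quoted above; the create-image call is only illustrative):

# One slow warm-up run fills the mounted ComputeCache with compiled kernels.
mkdir -p $PWD/ComputeCache
nvidia-docker run -v $PWD/ComputeCache:/root/.nv/ComputeCache -v $PWD/images:/images waifu2x \
    th waifu2x.lua -m scale -i /images/miku_small.png -o /images/out.png
# Later runs that mount the same ComputeCache skip the JIT step.
# Once the cache is populated, bake the instance into an AMI, e.g.:
aws ec2 create-image --instance-id <this-instance-id> --name waifu2x-primed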


ajhool commented Mar 15, 2019

I was also using the AMI without Docker and things were working properly, but when I added Docker the initial execution took 10 minutes (as opposed to 30 seconds on the bare AMI), so it might just be a simple Docker integration issue. The specific hangup is that importing cudnn takes 10 minutes... there's a line in cudnn that tries to configure the GPUs and struggles with Volta.

I'm also new to Docker, so the caching issue might be a red herring, but I'm still working through it. It seems plausible. You had mentioned that caching might have been the issue in soumith/cudnn.torch#385 (I hadn't realized it was you who pointed me here :) ).

@nagadomi (Owner)

OK, I will try to build a cuda-torch:10.1 image and test it on a p3 instance.


ajhool commented Mar 15, 2019

I don't believe cudnn has CUDA 10 bindings; I was seeing this behavior with CUDA 9 and cudnn 7.1.

Here's the issue + Dockerfile for how I was building it:
torch/torch7#1193


nagadomi commented Mar 16, 2019

I have built a Docker image; I changed it to generate binaries for sm_70 (Volta) and sm_75 at docker build time.
https://hub.docker.com/r/nagadomi/torch7
https://hub.docker.com/r/nagadomi/waifu2x

However, it requires NVIDIA driver >= 418, and it seems that a 418 driver for the Tesla V100 has not been released yet. I just noticed this now. 😢 So it is not tested.

Dockerfile for torch7: https://github.com/nagadomi/distro/blob/cuda10/Dockerfile
If you want to try it immediately, you can change nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04 to nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04 and run docker build.
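
For example, one way to make that edit and build locally (the clone URL and branch follow the Dockerfile link above; the image tag is arbitrary):

git clone -b cuda10 https://github.com/nagadomi/distro.git torch-distro
cd torch-distro
# Swap the base image from CUDA 10.1 to CUDA 10.0 as suggested above.
sed -i 's/cuda:10.1-cudnn7-devel-ubuntu18.04/cuda:10.0-cudnn7-devel-ubuntu18.04/' Dockerfile
docker build -t torch7:cuda10.0 .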


nagadomi commented Mar 16, 2019

I found https://devtalk.nvidia.com/default/topic/1047710/announcements-and-news/linux-solaris-and-freebsd-driver-418-43-long-lived-branch-release-/ .

OK, it works.

  1. Launch p3.2xlarge instance with Deep Learning AMI (Ubuntu) Version 20.0 (ami-0f9e8c4a1305ecd22)
  2. Change the volume to 150GB because there is not enough disk space.
  3. Install nvidia 418 driver
  4. Test
$ docker pull nagadomi/waifu2x
$ time nvidia-docker run -v `pwd`/images:/images nagadomi/waifu2x th waifu2x.lua -force_cudnn 1 -m scale -scale 2 -i /images/miku_small.png -o /images/output.png
/images/output.png: 1.4679141044617 sec
real	0m7.688s
user	0m0.044s
sys	0m0.012s


ajhool commented Mar 18, 2019

Thanks for the incredibly quick response and guidance.

I notice that the waifu2x Dockerfile skips the soumith cudnn install/make that is seen in the install_lua_modules.sh script. I am confused as to how Lua is finding the cudnn bindings without that package.

From install_lua_modules.sh:

install_cudnn()
{
    rm -fr $CUDNN_WORK_DIR
    git clone https://github.com/soumith/cudnn.torch.git -b $CUDNN_BRANCH $CUDNN_WORK_DIR
    cd $CUDNN_WORK_DIR
    luarocks make cudnn-scm-1.rockspec
    cd ..
    rm -fr $CUDNN_WORK_DIR
}

@nagadomi (Owner)

cudnn.torch is installed at the time torch7 itself is installed:
https://github.com/nagadomi/distro/blob/2798753bf053e0e2535465697c40df78c251c7d4/install.sh#L156
I have updated distro's cudnn.torch submodule to the R7 branch.
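
A quick way to confirm the bindings are present inside the published image (a sketch; cudnn.version is the version field exposed by cudnn.torch):

# Should print the bound cuDNN version rather than fail with "module 'cudnn' not found".
nvidia-docker run nagadomi/waifu2x th -e "require 'cudnn'; print(cudnn.version)"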


ajhool commented Mar 23, 2019

Can confirm that this strategy worked.

I realized that I was originally using the Amazon Linux Deep Learning AMI instead of the Ubuntu Deep Learning AMI. It's very possible that the Amazon Linux distro simply doesn't work properly with NVIDIA, CUDA, or nvidia-docker; there have been reports of similar issues. Thanks for taking the time to help me through this, @nagadomi.
