Add CentOS documentation and improve dockerfiles for UCX (NVIDIA#2537)
* Create Dockerfiles for Ubuntu and CentOS for RDMA and basic UCX installs

* Add a section specific to CentOS baremetal in docs

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>

* Fix typos

* Update docs/additional-functionality/rapids-shuffle.md

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update docs/additional-functionality/rapids-shuffle.md

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* PR review comments

* Add info on where to fetch CUDA rpm from

* Fix typos

* Update docs/additional-functionality/rapids-shuffle.md

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
abellina and jlowe authored May 28, 2021
1 parent 32e4a1c commit 9a34857
Showing 5 changed files with 251 additions and 44 deletions.
98 changes: 54 additions & 44 deletions docs/additional-functionality/rapids-shuffle.md
@@ -77,6 +77,51 @@ The minimum UCX requirement for the RAPIDS Shuffle Manager is

---

##### CentOS UCX RPM
The UCX packages for CentOS 7 and 8 are divided into several RPMs. For example, the UCX 1.10.1
release archive available at
https://github.com/openucx/ucx/releases/download/v1.10.1/ucx-v1.10.1-centos7-mofed5.x-cuda11.0.tar.bz2
contains:

```
ucx-devel-1.10.1-1.el7.x86_64.rpm
ucx-debuginfo-1.10.1-1.el7.x86_64.rpm
ucx-1.10.1-1.el7.x86_64.rpm
ucx-cuda-1.10.1-1.el7.x86_64.rpm
ucx-rdmacm-1.10.1-1.el7.x86_64.rpm
ucx-cma-1.10.1-1.el7.x86_64.rpm
ucx-ib-1.10.1-1.el7.x86_64.rpm
```

For a setup without RoCE or InfiniBand networking, the only packages required are:

```
ucx-1.10.1-1.el7.x86_64.rpm
ucx-cuda-1.10.1-1.el7.x86_64.rpm
```

If accelerated networking is available, the package list is:

```
ucx-1.10.1-1.el7.x86_64.rpm
ucx-cuda-1.10.1-1.el7.x86_64.rpm
ucx-rdmacm-1.10.1-1.el7.x86_64.rpm
ucx-ib-1.10.1-1.el7.x86_64.rpm
```
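
As a minimal sketch of the install itself (assuming the release tarball above has been downloaded
to the working directory, and that the CUDA RPM dependencies described in the note below are
already in place):

```
tar -xvf ucx-v1.10.1-centos7-mofed5.x-cuda11.0.tar.bz2
# Base and CUDA packages, sufficient without RoCE/InfiniBand:
sudo yum install -y ucx-1.10.1-1.el7.x86_64.rpm ucx-cuda-1.10.1-1.el7.x86_64.rpm
# Additional packages when accelerated networking is available:
sudo yum install -y ucx-rdmacm-1.10.1-1.el7.x86_64.rpm ucx-ib-1.10.1-1.el7.x86_64.rpm
```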

---
**NOTE:**

The CentOS UCX RPMs require CUDA to be installed via RPMs to satisfy their dependencies. The CUDA
runtime can be downloaded from
[https://developer.nvidia.com/cuda-downloads](https://developer.nvidia.com/cuda-downloads)
(note the [Archive of Previous CUDA releases](https://developer.nvidia.com/cuda-toolkit-archive)
link to download prior versions of the runtime).

For example, in order to download the CUDA RPM for CentOS running on x86_64, select:
`Linux` > `x86_64` > `CentOS` > `7` or `8` > `rpm (local)` or `rpm (network)`.
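
For the `rpm (network)` route, a hedged sketch for CentOS 7 on x86_64 (the repo URL follows
NVIDIA's published pattern, and `yum-config-manager` comes from the `yum-utils` package; adjust
for your OS and CUDA version):

```
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo \
    https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
sudo yum install -y cuda-runtime-11-0
```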

---

#### Docker containers

Running with UCX in containers imposes certain requirements. In a multi-GPU system, all GPUs that
@@ -100,50 +100,15 @@ system if you have RDMA capable hardware.
Within the Docker container we need to install UCX and its requirements. The following are
examples of Docker containers with UCX 1.10.1 and cuda-11.0 support:

| OS Type | RDMA | Dockerfile |
| ------- | ---- | ---------- |
| Ubuntu  | Yes  | [Dockerfile.ubuntu_rdma](shuffle-docker-examples/Dockerfile.ubuntu_rdma) |
| Ubuntu  | No   | [Dockerfile.ubuntu_no_rdma](shuffle-docker-examples/Dockerfile.ubuntu_no_rdma) |
| CentOS  | Yes  | [Dockerfile.centos_rdma](shuffle-docker-examples/Dockerfile.centos_rdma) |
| CentOS  | No   | [Dockerfile.centos_no_rdma](shuffle-docker-examples/Dockerfile.centos_no_rdma) |
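
As a usage sketch (the image tag is an assumption; the command is run from the
`shuffle-docker-examples` directory):

```
docker build -f Dockerfile.centos_rdma -t ucx-rapids-shuffle:centos7-rdma .
```

The resulting image can then serve as a base layer for the containers your executors will use.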

### Validating UCX Environment

After installing UCX you can utilize `ucx_info` and `ucx_perftest` to validate the installation.
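
For instance, a hedged sketch of a GPU-to-GPU bandwidth test with `ucx_perftest` (host name,
message size, and iteration count are placeholders):

```
# On the server node:
ucx_perftest -t tag_bw -m cuda
# On the client node, pointing at the server:
ucx_perftest <server-hostname> -t tag_bw -m cuda -s 10000000 -n 1000
```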
@@ -166,7 +176,7 @@ In this section, we are using a docker container built using the sample dockerfiles

2. Test to check whether UCX can link against CUDA:
```
root@test-machine:/# ucx_info -d|grep cuda
# Memory domain: cuda_cpy
# Component: cuda_cpy
# Transport: cuda_copy
```

@@ -353,4 +363,4 @@ for this, other than to trigger a GC cycle on the driver.

Spark has a configuration `spark.cleaner.periodicGC.interval` (defaults to 30 minutes), that
can be used to periodically cause garbage collection. If you are experiencing OOM situations, or
performance degradation with several Spark actions, consider tuning this setting in your jobs.

37 changes: 37 additions & 0 deletions docs/additional-functionality/shuffle-docker-examples/Dockerfile.centos_no_rdma
@@ -0,0 +1,37 @@
#
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Sample Dockerfile to install UCX in a CentOS 7 image
#
# The parameters are:
# - CUDA_VER: 11.0.3 to pick up the latest 11.x CUDA base layer
# - UCX_VER and UCX_CUDA_VER: these are used to pick a package matching a specific UCX version and
#   CUDA runtime from the UCX GitHub repo.
# See: https://github.com/openucx/ucx/releases/
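#
# Example build invocation (the image tag is illustrative):
#   docker build --build-arg UCX_VER=v1.10.1 --build-arg UCX_CUDA_VER=11.0 \
#     -t ucx-centos7 -f Dockerfile.centos_no_rdma .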

ARG CUDA_VER=11.0.3
ARG UCX_VER=v1.10.1
ARG UCX_CUDA_VER=11.0

FROM nvidia/cuda:${CUDA_VER}-runtime-centos7
ARG UCX_VER
ARG UCX_CUDA_VER

RUN yum update -y && yum install -y wget bzip2
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/$UCX_VER/ucx-$UCX_VER-centos7-mofed5.x-cuda$UCX_CUDA_VER.tar.bz2
RUN cd /tmp && tar -xvf *.bz2 && \
    yum install -y ucx-1.10.1-1.el7.x86_64.rpm && \
    yum install -y ucx-cuda-1.10.1-1.el7.x86_64.rpm && \
    rm -rf /tmp/*.rpm

70 changes: 70 additions & 0 deletions docs/additional-functionality/shuffle-docker-examples/Dockerfile.centos_rdma
@@ -0,0 +1,70 @@
#
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# Sample Dockerfile to install UCX in a CentOS 7 image with RDMA support.
#
# The parameters are:
# - RDMA_CORE_VERSION: Set to 32.1 to match the rdma-core line in the latest
# released MLNX_OFED 5.x driver
# - CUDA_VER: 11.0.3 to pick up the latest 11.x CUDA base layer
# - UCX_VER and UCX_CUDA_VER: these are used to pick a package matching a specific UCX version and
#   CUDA runtime from the UCX GitHub repo.
# See: https://github.com/openucx/ucx/releases/
#
# The Dockerfile first fetches and builds `rdma-core` to satisfy requirements for
# the ucx-ib and ucx-rdmacm RPMs.
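#
# Example build invocation (the image tag is illustrative):
#   docker build --build-arg RDMA_CORE_VERSION=32.1 --build-arg UCX_VER=v1.10.1 \
#     --build-arg UCX_CUDA_VER=11.0 -t ucx-centos7-rdma -f Dockerfile.centos_rdma .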

ARG RDMA_CORE_VERSION=32.1
ARG CUDA_VER=11.0.3
ARG UCX_VER=v1.10.1
ARG UCX_CUDA_VER=11.0

# Throw away image to build rdma_core
FROM centos:7 as rdma_core
ARG RDMA_CORE_VERSION

RUN yum update -y
RUN yum install -y wget gcc cmake libnl3-devel libudev-devel make pkgconfig valgrind-devel epel-release
RUN yum install -y cmake3 ninja-build pandoc rpm-build python-docutils

RUN wget https://github.com/linux-rdma/rdma-core/releases/download/v${RDMA_CORE_VERSION}/rdma-core-${RDMA_CORE_VERSION}.tar.gz

# Build RPM
RUN mkdir -p rpmbuild/SOURCES tmp && \
    tar --wildcards -xzf rdma-core*.tar.gz */redhat/rdma-core.spec --strip-components=2 && \
    RPM_SRC=$((rpmspec -P *.spec || grep ^Source: *.spec) | awk '/^Source:/{split($0,a,"[ \t]+");print(a[2])}') && \
    (cd rpmbuild/SOURCES && ln -sf ../../rdma-core*.tar.gz "$RPM_SRC")
RUN rpmbuild --define '_tmppath '$(pwd)'/tmp' --define '_topdir '$(pwd)'/rpmbuild' -bb *.spec
RUN mv rpmbuild/RPMS/x86_64/*.rpm /tmp

# Now start the main container
FROM nvidia/cuda:${CUDA_VER}-runtime-centos7
ARG UCX_VER
ARG UCX_CUDA_VER

COPY --from=rdma_core /tmp/*.rpm /tmp/

RUN yum update -y
RUN yum install -y wget bzip2
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/$UCX_VER/ucx-$UCX_VER-centos7-mofed5.x-cuda$UCX_CUDA_VER.tar.bz2
RUN cd /tmp && \
    yum install -y *.rpm && \
    tar -xvf *.bz2 && \
    yum install -y ucx-1.10.1-1.el7.x86_64.rpm && \
    yum install -y ucx-cuda-1.10.1-1.el7.x86_64.rpm && \
    yum install -y ucx-ib-1.10.1-1.el7.x86_64.rpm && \
    yum install -y ucx-rdmacm-1.10.1-1.el7.x86_64.rpm
RUN rm -rf /tmp/*.rpm && rm /tmp/*.bz2

35 changes: 35 additions & 0 deletions docs/additional-functionality/shuffle-docker-examples/Dockerfile.ubuntu_no_rdma
@@ -0,0 +1,35 @@
#
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Sample Dockerfile to install UCX in an Ubuntu 18.04 image
#
# The parameters are:
# - CUDA_VER: defaults to 11.0 to pick up a CUDA 11.0 base layer
# - UCX_VER and UCX_CUDA_VER: these are used to pick a package matching a specific UCX version and
#   CUDA runtime from the UCX GitHub repo.
# See: https://github.com/openucx/ucx/releases/
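#
# Example build invocation (the image tag is illustrative):
#   docker build --build-arg UCX_VER=v1.10.1 --build-arg UCX_CUDA_VER=11.0 \
#     -t ucx-ubuntu1804 -f Dockerfile.ubuntu_no_rdma .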

ARG CUDA_VER=11.0
ARG UCX_VER=v1.10.1
ARG UCX_CUDA_VER=11.0

FROM nvidia/cuda:${CUDA_VER}-runtime-ubuntu18.04
ARG UCX_VER
ARG UCX_CUDA_VER

RUN apt update
RUN apt-get install -y wget
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/$UCX_VER/ucx-$UCX_VER-ubuntu18.04-mofed5.x-cuda$UCX_CUDA_VER.deb
RUN apt install -y /tmp/*.deb && rm -rf /tmp/*.deb

55 changes: 55 additions & 0 deletions docs/additional-functionality/shuffle-docker-examples/Dockerfile.ubuntu_rdma
@@ -0,0 +1,55 @@
#
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# Sample Dockerfile to install UCX in an Ubuntu 18.04 image with RDMA support.
#
# The parameters are:
# - RDMA_CORE_VERSION: Set to 32.1 to match the rdma-core line in the latest
# released MLNX_OFED 5.x driver
# - CUDA_VER: 11.0.3 to pick up the latest 11.x CUDA base layer
# - UCX_VER and UCX_CUDA_VER: these are used to pick a package matching a specific UCX version and
#   CUDA runtime from the UCX GitHub repo.
# See: https://github.com/openucx/ucx/releases/
#
# The Dockerfile first fetches and builds `rdma-core` to satisfy the RDMA
# requirements of the UCX debian package.
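#
# Example build invocation (the image tag is illustrative):
#   docker build --build-arg RDMA_CORE_VERSION=32.1 --build-arg UCX_VER=v1.10.1 \
#     --build-arg UCX_CUDA_VER=11.0 -t ucx-ubuntu1804-rdma -f Dockerfile.ubuntu_rdma .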

ARG RDMA_CORE_VERSION=32.1
ARG CUDA_VER=11.0.3
ARG UCX_VER=v1.10.1
ARG UCX_CUDA_VER=11.0

# Throw away image to build rdma_core
FROM ubuntu:18.04 as rdma_core
ARG RDMA_CORE_VERSION

RUN apt update
RUN apt-get install -y dh-make wget build-essential cmake gcc libudev-dev libnl-3-dev libnl-route-3-dev ninja-build pkg-config valgrind python3-dev cython3 python3-docutils pandoc

RUN wget https://github.com/linux-rdma/rdma-core/releases/download/v${RDMA_CORE_VERSION}/rdma-core-${RDMA_CORE_VERSION}.tar.gz
RUN tar -xvf *.tar.gz && cd rdma-core*/ && dpkg-buildpackage -b -d

# Now start the main container
FROM nvidia/cuda:${CUDA_VER}-runtime-ubuntu18.04
ARG UCX_VER
ARG UCX_CUDA_VER

COPY --from=rdma_core /*.deb /tmp/

RUN apt update
RUN apt-get install -y cuda-compat-11-0 wget udev dh-make libudev-dev libnl-3-dev libnl-route-3-dev python3-dev cython3
RUN cd /tmp && wget https://github.com/openucx/ucx/releases/download/$UCX_VER/ucx-$UCX_VER-ubuntu18.04-mofed5.x-cuda$UCX_CUDA_VER.deb
RUN apt install -y /tmp/*.deb && rm -rf /tmp/*.deb
