Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix MPI example not working in IPv6 #2714

Merged
merged 1 commit into from
Mar 6, 2023
Merged

Fix MPI example not working in IPv6 #2714

merged 1 commit into from
Mar 6, 2023

Conversation

gengwg
Copy link

@gengwg gengwg commented Feb 28, 2023

This is to fix errors for the MPI example in #2713.

This PR does 2 things:

  • By updating the Docker base image, the OpenMPI version is effectively upgraded.
  • Add --mca option to mpiexec. This seems to have fixed the IPv6 issue.

Before:

$ k -n gengwg exec -it lm-mpi-job-mpiworker-0 -- mpirun --version
mpirun (Open MPI) 1.10.2

Besides, getting the mount error in #2713. Example:

Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/4GCDNOZIAYJDMKBWJVVSMRDLWE:/var/lib/docker/overlay2/l/RJ4EXRGNG4CKKFCJPAANAG6KWQ:/var/lib/docker/overlay2/l/3GCIAIOLS32GIVNV3KMWN6TPNN:/var/lib/docker/overlay2/l/RDG5AJQ3G64UCFYZPHBDGC7PR6:/var/lib/docker/overlay2/l/LZSWVN7NELVR3CTN6D254MGKD6:/var/lib/docker/overlay2/l/OMLRKFEAEDPGWVCU5JGAZVSL3X:/var/lib/docker/overlay2/l/FTXOND2LJMKTD5FQJSXBG6JEWN,upperdir=/var/lib/docker/overlay2/fbfe3afd54f7512357817dde4c887fd8083bbcb9508d3c4265dfb4344f4c'

After:

$ k -n gengwg exec -it lm-mpi-job-mpimaster-0 -- mpirun --version
mpirun (Open MPI) 2.1.1 

No mount error any more. Also MPI runs successful:

$ k logs lm-mpi-job-mpimaster-0 -n gengwg
Hello world from processor lm-mpi-job-mpimaster-0, rank 1 out of 2 processors
Hello world from processor lm-mpi-job-mpimaster-0, rank 0 out of 2 processors

Would it be possible for someone to perform a test of this on a cluster that uses only IPv4? (should still be many in current world). This would ensure that backward compatibility is maintained. It's worth noting that our environment consists of hosts and pods that are IPv6 only, so it's a bit hard for me to test IPv4.

@volcano-sh-bot volcano-sh-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Feb 28, 2023
@gengwg
Copy link
Author

gengwg commented Feb 28, 2023

/assign @Thor-wl

@@ -1,4 +1,4 @@
FROM ubuntu:16.04
FROM ubuntu:18.04
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we update this to 22.04?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of that too, but I only tested 18.04 so far. I can test 22.04 tomorrow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I test 22.04 on arm64. It is OK.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested 22.04, and confirmed this works on x86_64 IPv6 too.

$ k -n gengwg logs  lm-mpi-job-mpimaster-0
Hello world from processor lm-mpi-job-mpimaster-0, rank 0 out of 2 processors
Hello world from processor lm-mpi-job-mpimaster-0, rank 1 out of 2 processors

$ k -n gengwg get po  -w | grep mpi
lm-mpi-job-mpimaster-0         1/1     Running            0                 4m51s
lm-mpi-job-mpiworker-0         1/1     Running            0                 4m51s
lm-mpi-job-mpiworker-1         1/1     Running            0                 4m51s

$ k -n gengwg exec -it lm-mpi-job-mpimaster-0 -- mpirun --version
mpirun (Open MPI) 4.1.2

@@ -23,7 +23,7 @@ spec:
- |
MPI_HOST=`cat /etc/volcano/mpiworker.host | tr "\n" ","`;
mkdir -p /var/run/sshd; /usr/sbin/sshd;
mpiexec --allow-run-as-root --host ${MPI_HOST} -np 2 mpi_hello_world;
mpiexec --allow-run-as-root --mca --host ${MPI_HOST} -np 2 mpi_hello_world;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this parameter have any effect on ipv4?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure. That's why I wanted someone to test it on ipv4. If we don't find someone to help, I can try spining up minikube, and test it there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I test in kind, with arm64 and ipv4, it works fine.But we'd better explain what this parameter is for

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we'd better explain what this parameter is for

That's a great question, and one that I was also curious about. I am actually new to learning MPI myself, and I have been experimenting with various methods. It turns out that using the "--mca" flag has been successful for me, but I did some further research and discovered that it actually requires a key value pair!

-mca, --mca <key> <value>
    Send arguments to various MCA modules. See the "MCA" section, below. 

https://www.open-mpi.org/doc/v2.1/man1/mpiexec.1.php

I'm confused now. Hope some expert in OpenMPI can provide clarification and help me understand it better. I can also dig more tomorrow. Time for bed now. Thanks for the help!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I retested using 22.04, and it appears to be working without the need for the --mca option.

$ grep mpiexec mpi-example.yaml
                  mpiexec --allow-run-as-root --host ${MPI_HOST} -np 2 mpi_hello_world; sleep 600;

$ k -n gengwg logs  lm-mpi-job-mpimaster-0
Warning: Permanently added 'lm-mpi-job-mpiworker-0.lm-mpi-job' (ED25519) to the list of known hosts.
Warning: Permanently added 'lm-mpi-job-mpiworker-1.lm-mpi-job' (ED25519) to the list of known hosts.
Hello world from processor lm-mpi-job-mpiworker-0, rank 0 out of 2 processors

$ k -n gengwg exec -it lm-mpi-job-mpimaster-0 -- mpirun --version
mpirun (Open MPI) 4.1.2

Let me update this PR to remove the --mca. Image/MPI version update seems sufficient.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Maybe someone could rebuild and upload the image to Docker Hub, such as example-mpi:0.0.3, if it is not built automatically. It would be beneficial for others experiencing the same issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the pr is merged, I can push the new example-mpi image.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the pr is merged, I can push the new example-mpi image.

please also modify

image: volcanosh/example-mpi:0.0.1

Copy link
Member

@wangyang0616 wangyang0616 Mar 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right, thanks!

@hwdef
Copy link
Member

hwdef commented Feb 28, 2023

And please make the DCO CI happy

  1. Ensure you have a local copy of your branch by checking out the pull request locally via command line.
  2. In your local branch, run: git rebase HEAD~1 --signoff
  3. Force push your changes to overwrite the branch: git push --force-with-lease origin master

@gengwg
Copy link
Author

gengwg commented Feb 28, 2023

And please make the DCO CI happy

1. Ensure you have a local copy of your branch by [checking out the pull request locally via command line](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/checking-out-pull-requests-locally).

2. In your local branch, run: git rebase HEAD~1 --signoff

3. Force push your changes to overwrite the branch: git push --force-with-lease origin master

Thanks. Just tried. Seems working now.

 DCO / DCO succeeded Feb 27, 2023 in 0s
DCO

All commits are signed off!

@hwdef
Copy link
Member

hwdef commented Feb 28, 2023

/ok-to-test

@volcano-sh-bot volcano-sh-bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Feb 28, 2023
@hwdef
Copy link
Member

hwdef commented Mar 1, 2023

I think it's better to squash your commits to one commit. The three commits now are too many, which will pollute the commit history

@gengwg
Copy link
Author

gengwg commented Mar 1, 2023

I think it's better to squash your commits to one commit. The three commits now are too many, which will pollute the commit history

I'm not very familiar with squashing commits since I don't use git frequently. However, I'll do some research how to do it. If I'm unable to do so, I'll create a new pull request as a last resort.

Signed-off-by: gengwg <genwg@users.noreply.github.com>

Update image to 22.04

Signed-off-by: gengwg <genwg@users.noreply.github.com>

remove --mca option

Signed-off-by: gengwg <genwg@users.noreply.github.com>
@hwdef
Copy link
Member

hwdef commented Mar 1, 2023

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Mar 1, 2023
@gengwg
Copy link
Author

gengwg commented Mar 1, 2023

Squashed. I followed the instructions here. I think better than ChatGPT :)

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Thor-wl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 6, 2023
@volcano-sh-bot volcano-sh-bot merged commit 7ec3f2d into volcano-sh:master Mar 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants