-
Notifications
You must be signed in to change notification settings - Fork 946
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix MPI example not working in IPv6 #2714
Conversation
/assign @Thor-wl |
example/integrations/mpi/Dockerfile
Outdated
@@ -1,4 +1,4 @@ | |||
FROM ubuntu:16.04 | |||
FROM ubuntu:18.04 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we update this to 22.04?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of that too, but I only tested 18.04 so far. I can test 22.04 tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I test 22.04 on arm64. It is OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested 22.04, and confirmed this works on x86_64 IPv6 too.
$ k -n gengwg logs lm-mpi-job-mpimaster-0
Hello world from processor lm-mpi-job-mpimaster-0, rank 0 out of 2 processors
Hello world from processor lm-mpi-job-mpimaster-0, rank 1 out of 2 processors
$ k -n gengwg get po -w | grep mpi
lm-mpi-job-mpimaster-0 1/1 Running 0 4m51s
lm-mpi-job-mpiworker-0 1/1 Running 0 4m51s
lm-mpi-job-mpiworker-1 1/1 Running 0 4m51s
$ k -n gengwg exec -it lm-mpi-job-mpimaster-0 -- mpirun --version
mpirun (Open MPI) 4.1.2
@@ -23,7 +23,7 @@ spec: | |||
- | | |||
MPI_HOST=`cat /etc/volcano/mpiworker.host | tr "\n" ","`; | |||
mkdir -p /var/run/sshd; /usr/sbin/sshd; | |||
mpiexec --allow-run-as-root --host ${MPI_HOST} -np 2 mpi_hello_world; | |||
mpiexec --allow-run-as-root --mca --host ${MPI_HOST} -np 2 mpi_hello_world; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this parameter have any effect on ipv4?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure. That's why I wanted someone to test it on ipv4. If we don't find someone to help, I can try spining up minikube, and test it there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I test in kind, with arm64 and ipv4, it works fine.But we'd better explain what this parameter is for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we'd better explain what this parameter is for
That's a great question, and one that I was also curious about. I am actually new to learning MPI myself, and I have been experimenting with various methods. It turns out that using the "--mca" flag has been successful for me, but I did some further research and discovered that it actually requires a key value pair!
-mca, --mca <key> <value>
Send arguments to various MCA modules. See the "MCA" section, below.
https://www.open-mpi.org/doc/v2.1/man1/mpiexec.1.php
I'm confused now. Hope some expert in OpenMPI can provide clarification and help me understand it better. I can also dig more tomorrow. Time for bed now. Thanks for the help!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I retested using 22.04, and it appears to be working without the need for the --mca option.
$ grep mpiexec mpi-example.yaml
mpiexec --allow-run-as-root --host ${MPI_HOST} -np 2 mpi_hello_world; sleep 600;
$ k -n gengwg logs lm-mpi-job-mpimaster-0
Warning: Permanently added 'lm-mpi-job-mpiworker-0.lm-mpi-job' (ED25519) to the list of known hosts.
Warning: Permanently added 'lm-mpi-job-mpiworker-1.lm-mpi-job' (ED25519) to the list of known hosts.
Hello world from processor lm-mpi-job-mpiworker-0, rank 0 out of 2 processors
$ k -n gengwg exec -it lm-mpi-job-mpimaster-0 -- mpirun --version
mpirun (Open MPI) 4.1.2
Let me update this PR to remove the --mca. Image/MPI version update seems sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Maybe someone could rebuild and upload the image to Docker Hub, such as example-mpi:0.0.3, if it is not built automatically. It would be beneficial for others experiencing the same issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the pr is merged, I can push the new example-mpi image.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the pr is merged, I can push the new example-mpi image.
please also modify
image: volcanosh/example-mpi:0.0.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right, thanks!
And please make the DCO CI happy
|
Thanks. Just tried. Seems working now.
|
/ok-to-test |
I think it's better to squash your commits to one commit. The three commits now are too many, which will pollute the commit history |
I'm not very familiar with squashing commits since I don't use git frequently. However, I'll do some research how to do it. If I'm unable to do so, I'll create a new pull request as a last resort. |
Signed-off-by: gengwg <genwg@users.noreply.github.com> Update image to 22.04 Signed-off-by: gengwg <genwg@users.noreply.github.com> remove --mca option Signed-off-by: gengwg <genwg@users.noreply.github.com>
/lgtm |
Squashed. I followed the instructions here. I think better than ChatGPT :) |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Thor-wl The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This is to fix errors for the MPI example in #2713.
This PR does 2 things:
--mca
option to mpiexec. This seems to have fixed the IPv6 issue.Before:
Besides, getting the mount error in #2713. Example:
After:
No mount error any more. Also MPI runs successful:
Would it be possible for someone to perform a test of this on a cluster that uses only IPv4? (should still be many in current world). This would ensure that backward compatibility is maintained. It's worth noting that our environment consists of hosts and pods that are IPv6 only, so it's a bit hard for me to test IPv4.