TCP connectivity problem in OpenMPI 4.1.4 #10734
Comments
Hello. Thanks for submitting an issue. I'd be curious to see how you're launching your job (e.g., your mpirun command line). NOTE: your configure line appears to have gotten chopped off in the post.
Sorry - the configure line got chopped off; I edited the post above to correct it. Yes, I'm submitting via LSF, so my mpirun line looks something like:
|
Is the IP address that it tried to connect to correct (9.9.11.33)? Also, is there a reason you're using TCP over IB? That is known to be pretty slow compared to native IB protocols. I think early versions of TCP over IB had some reliability issues, too. You might just want to switch to building Open MPI with UCX and let Open MPI use the native IB protocols. |
I think the IP address is correct, but there are some connectivity problems. What's puzzling is that OpenMPI 3.1.0 works. Is there any way to see what interface is being used by mpirun? Yes, UCX would be preferable, but SLES12 is fairly old at this point, and the version of librdmacm that we have on the platform fails at configure time for UCX, so my understanding is that it falls back on TCP anyway. (That's why I disabled UCX in the build of OpenMPI.)
UCX/old SLES: ah, got it. I assume the cost (e.g., in time/resources) to upgrade to a newer OS is too prohibitive. That being said, it might not be that hard to get new librdmacm + new UCX + Open MPI v4.x to work with UCX/native IB. E.g., if you install all of them into the same installation tree, and ensure that that installation tree appears first in your

One big disclaimer: I don't follow the SLES distro and the IB software stacks these days; I don't know if there's anything in the SLES 12 kernel, for example, that would explicitly prohibit using new librdmacm / new UCX. E.g., I don't know if you'll need new IB kernel drivers or not.

All that being said, let's talk TCP. Yes, you can make the TCP BTL be very chatty about what it is doing. Set the MCA parameter
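For illustration (the exact parameter meant above was cut off from the comment), the usual knob in Open MPI 4.x is the btl_base_verbose MCA parameter:

# Hedged example: make the BTLs (including TCP) log their interface selection
# and connection attempts; the application name is a placeholder.
mpirun --mca btl_base_verbose 100 -np 16 ./my_app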
Yes, unfortunately, upgrading the OS is a major undertaking and is not an option at this time. I ran some additional tests with one of our parallel applications on a portion of our cluster that has been partitioned off for investigation of this issue. This portion does not seem to have the TCP connect() error, but it does exhibit another issue that I've seen with OpenMPI 4.1 versus 3.1: considerably more erratic performance. These jobs all use 16 processes on systems that have 28 slots each, so there is relatively limited communication between hosts - many of the jobs should just be using vader. Here's the performance with OpenMPI 3.1.0:
Here's the same application compiled with OpenMPI 4.1.4:
I've attached outputs generated with |
Bump. Any thoughts on how to narrow down the problem? |
From the logs, Open MPI 3.1.0 uses both. I suggest you try forcing
|
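For concreteness, a hedged sketch of how one might force a specific set of transports on both versions (the exact components suggested above were cut off; these component names are the standard ones in the 3.x/4.x series):

# Restrict Open MPI to shared memory + TCP only:
mpirun --mca btl self,vader,tcp -np 16 ./my_app
# Or simply exclude the openib BTL and let the rest negotiate:
mpirun --mca btl ^openib -np 16 ./my_app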
@ggouaillardet is right. But I see that the v4.1.x log is also using the In both versions of Open MPI, I'd suggest what @ggouaillardet suggested: force the use of Have you tried uninstalling the OS IB stack and installing your own, per my prior comment? |
Forcing the use of Part of the dysfunction here may be differing versions of OFED being installed on the build machine as compared to the rest of the cluster. (I'm asking the admins to look into it.) I thought 3.1.0 was using IB via TCP, but that seems to not be correct - I see If I force 3.1.0 to use |
I just re-read your comments and see this:
Does this mean each run is on a single node, launching MPI processes on 16 out of 28 total cores? |
There are three nodes in this special testing queue - each with 28 slots, so 84 slots in total. I submitted ten 16-process jobs to the queue, so about half of them would run entirely within a single node and the other half would be split between nodes. |
Oh, that makes a huge difference. If an MPI job is running entirely on a single node, it won't use TCP at all: it will use shared memory to communicate on-node. More generally, Open MPI processes will use shared memory (which is significantly faster than both TCP and native IB) to communicate with peers that are on the same node, and will use some kind of network to communicate with peers off-node.

So if your jobs end up having different numbers of on-node / off-node peers, that can certainly explain why there are variations in total execution times. That being said, it doesn't explain why there are large differences between v3.x and v4.x.

It would be good to get some apples-to-apples comparisons between v3.x and v4.x, though. Let's get the network out of the equation, and only test shared memory as an MPI transport. That avoids any questions about IPoIB. Can you get some timings of all-on-one-node runs with Open MPI v3.x and v4.x?
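A hedged sketch of such an all-on-one-node, shared-memory-only run (hostname and binary are placeholders):

# Keep all 16 ranks on a single node and allow only the self + vader (shared
# memory) BTLs, so no network path is involved at all.
mpirun -np 16 --host nodeA:16 --mca btl self,vader ./my_app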
OK, the machines I was running on got wiped and re-inserted into the general population and some other machines were swapped in to my partition of the network. These new machines are running SLES12-SP5, and the OFED version mismatch issue was sorted out. I re-compiled OpenMPI 4.1.4, and
I'm not sure what these mean or how catastrophic they are, but the jobs seem to run with With
On Version 3.1.0, I get:
Practically equivalent performance. Interestingly, if I run
Does that suggest some kind of network configuration issue? |
Some clarifying questions:
|
|
FYI: You should be able to download and install a later version of UCX yourself (e.g., just install it under your |
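A hedged sketch of what such a user-space install could look like (paths are placeholders; --with-ucx is the standard Open MPI configure flag):

# In the UCX source tree:
./configure --prefix=$HOME/sw/ucx && make -j && make install
# In the Open MPI source tree, pointing configure at that UCX install:
./configure --prefix=$HOME/sw/ompi --with-ucx=$HOME/sw/ucx && make -j && make install
# Put that tree first in your environment:
export PATH=$HOME/sw/ompi/bin:$PATH
export LD_LIBRARY_PATH=$HOME/sw/ompi/lib:$HOME/sw/ucx/lib:$LD_LIBRARY_PATH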
Understood, but current UCX does not work with the version of librdmacm from the OS. In principle, I could install a newer version, but it would be far easier if the OS load set could be made to work.

Job #1, which is performing somewhat slowly, has:
Job #2, which is performing very slowly, has:
Seems like there's overlap, no? |
I forgot about your librdmacm issue. Yes, you could install that manually, too -- it's also a 100% userspace library. Yes, those 2 jobs definitely overlap -- that's why you're seeing dramatic slowdowns: multiple MPI processes are being bound to the same core, and therefore they're fighting for cycles. At this point, I have to turn you back over to @gpaulsen because I don't know how Open MPI reads the LSF job info and decides which cores to use.
If you are running multiple |
@markalle Can you please take a look? Perhaps some ORTE verbosity will shed some light on things? |
What parameters should I set? |
If you have built with |
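In case it helps, these MCA verbosity parameters exist in the 4.x series and are the usual ones for mapping/launch questions (the specific settings meant above were cut off):

# Verbose output from the mapper and launcher, plus a report of where each
# rank ends up bound:
mpirun --mca rmaps_base_verbose 10 --mca plm_base_verbose 10 --report-bindings -np 16 ./my_app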
Are these jobs running at the same time? If they're not running at the same time then I don't think there's any overlap; they both look like 2-host jobs where Job 1 is: and Job 2 is:

But if they're both bsubed simultaneously and are both trying to use bl3402 at the same time, then I see what you're saying about overlap.

I don't actually remember which version of OMPI prints full-host affinity output vs. which would only show the cgroup it was handed and the binding relative to that cgroup... when it does the latter it leaves the output kind of unclear looking, IMO. My expectation is that if those LSF jobs were running at the same time, then LSF should have handed a different cgroup to each job and those cgroups shouldn't overlap each other.

I think those are probably all full-host affinity displays, but when in doubt I just stick my own function somewhere so I know what it's printing. E.g., something like:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sched.h>
#include <unistd.h>
void
print_affinity()
{
int i, n;
char hostname[64];
char *str;
cpu_set_t mask;
n = sysconf(_SC_NPROCESSORS_ONLN);
sched_getaffinity(0, sizeof(mask), &mask);
str = malloc(n + 256);
if (!str) { return; }
gethostname(hostname, 64);
sprintf(str, "%s:%d ", hostname, getpid());
for (i=0; i<n; ++i) {
if (CPU_ISSET(i, &mask)) {
strcat(str, "1");
} else {
strcat(str, "0");
}
}
printf("%s\n", str);
free(str);
}
int
main() {
print_affinity();
return(0);
}
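A minimal way to use the snippet above (the binary name is arbitrary):

# Compile the affinity printer and launch it with mpirun so every process
# reports its host, PID, and allowed-CPU bitmap:
gcc -o print_affinity print_affinity.c
mpirun -np 16 ./print_affinity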
Yes, both jobs were running at the same time. Re-compiling with |
To get LSF to pick the affinity for the job, which would let the two jobs not overlap each other, I think the bsub setting you need is -R, for example:

When an option like that is in use, the job should get an affinity assignment, and you can confirm what it did by looking at some environment variables. For example, on my machine, if I run two jobs at the same time with the above settings I get
Or if you wanted a more socket-oriented binding style that's closer to what OMPI v3.x was doing, you could use |
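For concreteness, hedged examples of the kind of LSF affinity resource strings being described (counts are placeholders; double-check the exact syntax against your LSF version's documentation):

# One core per task, chosen and enforced by LSF:
bsub -n 16 -R "affinity[core(1)]" mpirun ./my_app
# A more socket-oriented style, closer to what OMPI v3.x was doing by default:
bsub -n 16 -R "affinity[socket(1)]" mpirun ./my_app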
Ta-da! Run with OpenMPI 4.1.4 using
Thanks, guys! Before I mark this closed, are there similar options for Torque? We're between job schedulers at the moment, and I suspect this issue is affecting Torque jobs, also. |
I guarantee there are, but heck if I can remember them. Easiest solution is to skip the hieroglyphics and just use |
On any given machine, there may be a handful of non-MPI jobs running where our MPI job is dispatched. Are the job schedulers smart enough to look at the state of the machine and pick a core (or socket) based on which slots are idle? |
It sounds like your scheduler is configured to support that behavior - it must not be allocating solely at the node level. If it is allowed to allocate subdivisions of a node, then it will certainly do so. |
Let me re-phrase your question -- tell me if this is inaccurate:
I'm not an expert in this area / I do not closely follow the latest features of the various job schedulers out there, but this does not sound like a good idea. My gut reaction/assumption is that your scheduler(s) do not account for jobs launched outside of the job scheduler, but you should check the specific documentation of your system to be sure.

My $0.02: you should launch everything through the job scheduler so that you 100% know that all the jobs are using the policies that are designated for your cluster (e.g., whether you want to allow oversubscription or not). If jobs are launched from outside the scheduler, there is no guarantee that the job scheduler will account for them, and therefore you can end up unexpectedly oversubscribing the resources, resulting in poor performance.

Also, jobs launched outside of the job scheduler may not be restricted to specific resources (e.g., cores), so they may end up floating around between cores, and can therefore have varying effects on job-scheduler-launched jobs. These kinds of behaviors tend to make all users unhappy -- even those who are stubborn enough to not use the job scheduler to launch their jobs.
I agree with jsquyres. On a former project I worked on, we inherited an affinity system that tried to examine the load and pick a non-busy subset of the machines to run on, and it was a terrible feature. It produced unpredictable behavior from run to run and was just about always a degradation in performance.

As far as I know, LSF and any other scheduler is just going to partition the machines up among the jobs it manages and not try to deduce the load of other jobs from outside the scheduler. So I'd also recommend not mixing non-scheduled jobs with scheduled jobs, and instead send everything through a scheduler so it can make sensible assignments. That said, I'm also not familiar with Torque options.

Vanilla mpirun --bind-to socket is an interesting solution though. On the one hand, that option wouldn't be keeping track of cross-job assignments, but in practical terms on ordinary configurations it's still likely to produce balanced workloads, and it's enough of a binding that I expect it would give mostly the same benefits as specific core bindings. I would be curious, though, to see performance numbers comparing the core vs. socket bindings.
No, that's not correct. All jobs are launched via the scheduler(s). I'm wondering if the scheduler has control over which slots a non-MPI job occupies, or whether that's a decision made by the OS. It sounds like the scheduler is probably deciding which slot to run the job on - or at least is capable of seeing which slots are vacant. If that's the case, binding to core (and using mappings from the scheduler) sounds like the way to fly.
That could be okay. If all the jobs are bsubed through LSF then at least the total amount of work being put on each machine would match the number of slots/cores the machine has. Then as to the specifics of which cores the various jobs are using you've got a few options, but I'd guess the non-MPI jobs wouldn't need to be created with a specific affinity, instead letting the OS decide. I'd probably run both ways just to see how the performance compares, but that's my guess. The situation where you ran into trouble was when LSF was putting two MPI jobs on the same system but leaving the affinity up to the app, and MPI doesn't have the cross-job awareness to deal with that and was binding them to the same cores. I'd say any of the following are decent solutions:
I like both the "bsub -R" and the "bsub without -R but bind-to socket" ideas and would just test the performance and pick based on that. Going just a little further down the details though, I'd also consider using "--rank-by core" alongside the "--bind-to socket" option. The same sockets would be used, but it would change which ranks are assigned to which socket, eg
On average I'd expect adjacent ranks to do slightly more communication between each other than with further away ranks, so I'd pick the second binding above in absence of other info about the job. So I'd compare the performance with
|
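A hedged sketch of that comparison (all three are standard mpirun options in the 4.x series; the binary name is a placeholder):

# Default binding vs. socket binding vs. socket binding with core-ordered ranks:
mpirun -np 16 ./my_app
mpirun -np 16 --bind-to socket ./my_app
mpirun -np 16 --bind-to socket --rank-by core ./my_app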
Ah, you really did mean MPI job. Ok. I was confused because you used the word "slot", but "slot" is very much an Open MPI / PMIx term -- not a scheduler term. I think @markalle outlined the situation well in his comment, above. I'd add one clarification: if you use

For example, say your nodes have 2 packages of 6 cores each. If LSF assigns cores in 3 different jobs on a single node like this:
In this situation,
This may be ok from your perspective, but just realize that it's different than every MPI process effectively being bound to all 6 cores in a single package. Put differently: Open MPI will bind to all the cores in a given package within the set of cores that are allocated to that specific job. Regardless, you can do a bunch of experimentation, and use |
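One concrete way to do that verification (--report-bindings is a real mpirun option; the binary name is a placeholder):

# Print where each rank actually lands, relative to the cores LSF handed the job:
mpirun --report-bindings -np 16 ./my_app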
When I do
Which is not what I want, and predictably performance isn't great if there are competing jobs on the node. However, if I do
This is what I want, and performance is good if there are other jobs running on the host (even when no affinity is specified). Why is the former different than the latter? Isn't I'm also curious why the OpenMPI default behavior changed. I can't imagine I'm the only less-sophisticated user scratching their head as a result of this. Interestingly, However, LSF also allows you to specify
And this also gives good performance when other jobs are running on the host. Also, IBM says that |
Since v4 (IIRC), Open MPI switched to:
In your first pattern:
In your second pattern:
By default, LSF does not specify an affinity so it is left to Open MPI to determine how to map/bind/rank. If you specify affinity options to LSF then it creates a It should be noted that the Is there anything else that needs to be addressed in this issue before we close it? |
Actually, that is backwards - just a typo: if procs <= 2: map by core |
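A quick, hedged way to see which default you are actually getting on a given build (hostname is just a convenient no-op payload):

# Small vs. larger process counts, showing the resulting default map/bind choices:
mpirun -np 2 --report-bindings hostname
mpirun -np 16 --report-bindings hostname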
Yep you are correct (just re-reviewed this code ), but that doesn't quite explain what the user is seeing then. The Is one of the other CLI options overriding the default mapping? |
Remember, my brain is totally fizzed right now with all the drugs, so take this with a grain of salt. The difference is that the code defaults to |
I wanted to try to get the rules down and noticed a default mapping/binding issue in PRRTE.
Open MPI v4.1.x
Example: <= 2
Example: > 2
Example: > 2 and
Example: > 2 and
Open MPI
|
Sounds right for OMPI v4 and below. It was changed for OMPI v5 and above because you are electing to use the PRRTE defaults. You have the option of defining your own "defaults" logic using the schizo APIs (one each for map/rank/bind). FWIW, my actual intent was to have PRRTE default to "package" in place of "numa" given all the problems with defining "numa" nowadays, but I saw in the code that I had not done that yet. If we did, then that would more closely approximate prior OMPI behavior. A PR is welcome, if you have the time. |
The comment went up before it was ready. I updated it showing the problem with the default mapping (i.e., it always defaults to bind to core regardless of number of processes) |
Ah, so you aren't talking about a problem with setting the mapping default - you are talking about how the default ranking/binding get set in the absence of any directive IF mapping is specified by the user? Yeah, I can see a bug in there. I'll take a crack at it. |
Both really.
|
Yeah,
In PRRTE, your only real choices would be NODE (if they mapped by node), SLOT (if they mapped by slot), or FILL (everything else). That is close to what it currently does, but it is missing that last option, which is why you got that output. |
I also found that the default bind policy was only being set to follow the mapping policy IF the user specified the latter - which makes no sense, really. We should set the default bind to follow mapping regardless of default mapping or user-defined. I fixed it as well. Please give the referenced PR a try and see if it behaves better now. |
Testing Open MPI
Policy for Open MPI v4.1.x with ORTE
Policy for Open MPI main with PRRTE (please double check me here)
Example: <= 2
Open MPI v4.1.x
Open MPI main
Example: > 2
Open MPI v4.1.x
Open MPI main
Example: > 2 and
|
Actually, it technically would be
It technically is
I'm not quite sure I follow this sentence - the result would be to
Yes
Yes
Not really - it technically is There also is a rule that |
Just to be clear: this comment addresses the case where the user provides "--map-by foo", but nothing about the ranking or binding policies.

I went back to try and understand where the alternative rules came from, and I think I understand. It all boils down to what you want from "default" ranking and binding patterns. What we now have (after the change) is probably more what people might expect - i.e., if I only specify a mapping policy, then ranking and binding simply match it. However, this does not result in the best performance in most cases.

As was pointed out in a "debate" about this some time ago, the best performance is obtained by "dense" packing nodes/objects to ensure maximum usage of shared memory by adjacent ranks (i.e., procs whose rank differs by one) since that is the typical expectation of developers, and binding those procs to core. Having ranking simply follow mapping ensures the lowest density is obtained, and thus the poorest performance for the typical developer. Likewise, binding a proc to an object (other than a single core or cpu) is correct if the proc is multi-threaded, but otherwise reduces performance.

So I'm not sure what is "correct" here. Giving the user something that "looks" like what they might expect, but achieves lower performance in general? Or giving them something that probably provides better performance, but "looks" wrong since ranking and binding don't follow mapping by default?
The discussion on this issue diverged from the original post. Reading back, I think the issue has been addressed. @gregfi is there something further to investigate or do you have any other questions? |
Background information
What version of Open MPI are you using?
4.1.4
Describe how Open MPI was installed
Compiled from source
/openmpi-4.1.4/configure --with-tm=/local/xxxxxxxx/REQ0135770/torque-6.1.1/src --prefix=/tools/openmpi/4.1.4 --without-ucx --without-verbs --with-lsf=/tools/lsf/10.1 --with-lsf-libdir=/tools/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
Please describe the system on which you are running
Details of the problem
When I try the ring test (ring_c.c) across multiple hosts, I get the following error:
When I try the same test using OpenMPI 3.1.0, it works without issue. How can I identify and work around the problem?
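For reference, the kind of multi-host launch described above looks roughly like this (hostnames are placeholders):

mpicc ring_c.c -o ring_c
mpirun -np 4 --host nodeA:2,nodeB:2 ./ring_c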