TCP connectivity problem in OpenMPI 4.1.4 #10734

Closed
gregfi opened this issue Aug 30, 2022 · 57 comments

Comments

@gregfi

gregfi commented Aug 30, 2022

Background information

What version of Open MPI are you using?

4.1.4

Describe how Open MPI was installed

Compiled from source

/openmpi-4.1.4/configure --with-tm=/local/xxxxxxxx/REQ0135770/torque-6.1.1/src --prefix=/tools/openmpi/4.1.4 --without-ucx --without-verbs --with-lsf=/tools/lsf/10.1 --with-lsf-libdir=/tools/lsf/10.1/linux3.10-glibc2.17-x86_64/lib

Please describe the system on which you are running

  • Operating system/version: SLES12-SP3
  • Computer hardware: Intel Xeon class
  • Network type: TCP over Infiniband

Details of the problem

When I try the ring test (ring_c.c) across multiple hosts, I get the following error:

--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: bl2609
  PID:        26165
  Message:    connect() to 9.9.11.33:1048 failed
  Error:      Operation now in progress (115)
--------------------------------------------------------------------------

When I try the same test using OpenMPI 3.1.0, it works without issue. How can I identify and work around the problem?

@gpaulsen
Member

Hello. Thanks for submitting an issue.

I'd be curious to see your mpirun command line. I usually use something like mpirun -host <host1>:4,<host2>:4 a.out to run 4 ranks on each node. Of course if you're inside of an LSF or Torque allocation, it may auto-detect your job's allocation and launch that way.

NOTE: your configure option --with-lsf-libdir=/tools/lsf/10.1/lin, I would guess, should end in lib instead of lin?

@gregfi
Author

gregfi commented Aug 31, 2022

Sorry - the configure line got chopped off; I edited the post above to correct it.

Yes, I'm submitting via LSF, so my mpirun line looks something like:

bsub -n 32 -I mpirun /path/to/ring_c

@jsquyres
Member

Is the IP address that it tried to connect to correct (9.9.11.33)?

Also, is there a reason you're using TCP over IB? That is known to be pretty slow compared to native IB protocols. I think early versions of TCP over IB had some reliability issues, too. You might just want to switch to building Open MPI with UCX and let Open MPI use the native IB protocols.

@gregfi
Author

gregfi commented Aug 31, 2022

I think the IP address is correct, but there are some connectivity problems. What's puzzling is that OpenMPI 3.1.0 works. Is there any way to see what interface is being used by mpirun?

Yes, UCX would be preferable, but SLES12 is fairly old at this point, and the version of librdmacm that we have on the platform fails at configure time for UCX, so my understanding is that it falls back on TCP anyway. (That's why I disabled UCX in the build of OpenMPI.)

@jsquyres
Member

jsquyres commented Aug 31, 2022

UCX/old SLES: ah, got it. I assume the cost (e.g., in time/resources) to upgrade to a newer OS is too prohibitive.

That being said, it might not be that hard to get a new librdmacm + new UCX + Open MPI v4.x working with UCX/native IB. E.g., install all of them into the same installation tree and ensure that that tree appears first in your LD_LIBRARY_PATH (i.e., so that the new librdmacm is found before the OS-installed librdmacm). Better yet, if you can, fully uninstall all the OS packages needed for IB support and install a whole new IB stack in an alternate location (e.g., /opt/hpc or wherever Nvidia installs all of its stuff these days -- my point is to not install the libraries and whatnot under /usr/lib, or wherever your OS installs libraries by default). This would mean that there is zero confusion between the OS IB stack and a new/modern IB stack. Open MPI and UCX can definitely work in this kind of scenario, if you're interested in investigating it.

One big disclaimer: I don't follow the SLES distro and the IB software stacks these days; I don't know if there's anything in the SLES 12 kernel, for example, that would explicitly prohibit using new librdmacm / new UCX. E.g., I don't know if you'll need new IB kernel drivers or not.

All that being said, let's talk TCP.

Yes, you can make the TCP BTL be very chatty about what it is doing. Set the MCA parameter btl_base_verbose to 100. For example, mpirun --mca btl_base_verbose 100 .... I don't know the exact syntax for this using bsub. This should make the TCP BTL tell you which IP interface(s) it is using, etc.
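
For example, following the bsub pattern used earlier in this thread (the -n count and application path are placeholders), something along these lines should pass the parameter through, since the MCA flag is just another argument to mpirun:

bsub -n 32 -I mpirun --mca btl_base_verbose 100 /path/to/ring_c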

@gregfi
Author

gregfi commented Sep 2, 2022

Yes, unfortunately, upgrading the OS is a major undertaking and is not an option at this time.

I ran some additional tests with one of our parallel applications on a portion of our cluster that has been partitioned off for investigation of this issue. This portion does not seem to have the TCP connect() error, but it does exhibit another issue that I've seen with OpenMPI 4.1 versus 3.1: considerably more erratic performance.

These jobs all use 16 processes on systems that have 28 slots each, so there is relatively limited communication between hosts - many of the jobs should just be using vader. Here's the performance with OpenMPI 3.1.0:

Execution  time  on  16  processor(s):  9   min,  29.7  sec
Execution  time  on  16  processor(s):  6   min,  49.9  sec
Execution  time  on  16  processor(s):  7   min,  11.8  sec
Execution  time  on  16  processor(s):  7   min,  24.7  sec
Execution  time  on  16  processor(s):  7   min,  12.0  sec
Execution  time  on  16  processor(s):  10  min,  50.0  sec
Execution  time  on  16  processor(s):  7   min,  4.5   sec
Execution  time  on  16  processor(s):  7   min,  20.0  sec
Execution  time  on  16  processor(s):  6   min,  25.8  sec

Here's the same application compiled with OpenMPI 4.1.4:

Execution  time  on  16  processor(s):  15  min,  46.8  sec
Execution  time  on  16  processor(s):  25  min,  14.7  sec
Execution  time  on  16  processor(s):  25  min,  46.3  sec
Execution  time  on  16  processor(s):  13  min,  17.6  sec
Execution  time  on  16  processor(s):  18  min,  41.2  sec
Execution  time  on  16  processor(s):  45  min,  53.3  sec
Execution  time  on  16  processor(s):  20  min,  23.6  sec
Execution  time  on  16  processor(s):  21  min,  26.3  sec
Execution  time  on  16  processor(s):  20  min,  21.1  sec

I've attached outputs generated with --mca pml_base_verbose 100 --mca btl_base_verbose 100. Any idea where I should look to identify the problem here?

3.1.0_pml_btl_verbose.txt

4.1.4_pml_btl_verbose.txt

@gregfi
Author

gregfi commented Sep 13, 2022

Bump. Any thoughts on how to narrow down the problem?

@ggouaillardet
Contributor

From the logs, Open MPI 3.1.0 uses both eth0 and ib0, but Open MPI 4.1.4 only uses eth0.

I suggest you try forcing ib0 and see how it goes:

mpirun --mca btl_tcp_if_include ib0 ...

@jsquyres
Member

@ggouaillardet is right. But I see that the v4.1.x log is also using the sppp interface -- I don't know what that is offhand.

In both versions of Open MPI, I'd suggest what @ggouaillardet suggested: force the use of ib0. Splitting network traffic over a much-slower eth0 and a much-faster ib0 can have weird performance effects.

Have you tried uninstalling the OS IB stack and installing your own, per my prior comment?

@gregfi
Author

gregfi commented Sep 13, 2022

Forcing the use of ib0 with OpenMPI 4.1.4 does not seem to improve the performance.

Part of the dysfunction here may be differing versions of OFED being installed on the build machine as compared to the rest of the cluster. (I'm asking the admins to look into it.) I thought 3.1.0 was using TCP over IB, but that seems to not be correct - I see openib being cited in the 3.1.0 verbose output. OpenMPI 3.1.0 may have been compiled at a time prior to this mismatch.

If I force 3.1.0 to use tcp, the performance deteriorates a little bit, but not nearly to the extent seen in 4.1.4. So there still seems to be something causing tcp performance to drag in 4.1.4.

@jsquyres
Member

I just re-read your comments and see this:

These jobs all use 16 processes on systems that have 28 slots each, so there is relatively limited communication between hosts - many of the jobs should just be using vader. Here's the performance with OpenMPI 3.1.0:

Does this mean each run is on a single node, launching MPI processes on 16 out of 28 total cores?

@gregfi
Author

gregfi commented Sep 14, 2022

There are three nodes in this special testing queue - each with 28 slots, so 84 slots in total. I submitted ten 16-process jobs to the queue, so about half of them would run entirely within a single node and the other half would be split between nodes.

@jsquyres
Member

Oh, that makes a huge difference.

If an MPI job is running entirely on a single node, it won't use TCP at all: it will use shared memory to communicate on-node. More generally, Open MPI processes will use shared memory (which is significantly faster than both TCP and native IB) to communicate with peers that are on the same node, and will use some kind of network to communicate with peers off-node.

So if your jobs end up having different numbers of on-node / off-node peers, that can certainly explain why there are variations in total execution times.

That being said, it doesn't explain why there are large differences between v3.x and v4.x. It would be good to get some apples-to-apples comparisons between v3.x and v4.x, though. Let's get the network out of the equation, and only test shared memory as an MPI transport. That avoids any questions about IPoIB.

Can you get some timings of all-on-one-node runs with Open MPI v3.x and v4.x?
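
One way to do that, sketched here assuming LSF's span[hosts=1] resource string is available on your cluster, is to pin the whole job to a single host and restrict Open MPI to the shared-memory and self BTLs (the -n count and application path are placeholders; the same command should work for both the v3.x and v4.x builds, since both ship the vader BTL):

bsub -n 16 -R "span[hosts=1]" -I mpirun --mca btl self,vader /path/to/app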

@gregfi
Author

gregfi commented Sep 20, 2022

OK, the machines I was running on got wiped and re-inserted into the general population and some other machines were swapped in to my partition of the network. These new machines are running SLES12-SP5, and the OFED version mismatch issue was sorted out. I re-compiled OpenMPI 4.1.4, and openib seems to be working better... mostly. I still see some messages to the effect of:

[bl3402:01258] rdmacm CPC only supported when the first QP is a PP QP; skipped
[bl3402:01258] openib BTL: rdmacm CPC unavailable for use on mlx4_0:1; skipped

I'm not sure what these mean or how catastrophic they are, but the jobs seem to run with --mca btl ^tcp when spanning multiple hosts, so the openib btl seems to be working in some capacity.

With --mca btl vader,self on 4.1.4, I get:

Execution  time  on  12  processor(s):  15  min,  44.9  sec
Execution  time  on  12  processor(s):  14  min,  35.3  sec
Execution  time  on  12  processor(s):  15  min,  31.1  sec
Execution  time  on  12  processor(s):  15  min,  1.0   sec
Execution  time  on  12  processor(s):  14  min,  41.0  sec
Execution  time  on  12  processor(s):  15  min,  26.4  sec
Execution  time  on  12  processor(s):  15  min,  29.7  sec
Execution  time  on  12  processor(s):  15  min,  27.5  sec
Execution  time  on  12  processor(s):  14  min,  42.8  sec

On Version 3.1.0, I get:

Execution  time  on  12  processor(s):  15  min,  35.2  sec
Execution  time  on  12  processor(s):  14  min,  31.4  sec
Execution  time  on  12  processor(s):  15  min,  27.8  sec
Execution  time  on  12  processor(s):  14  min,  28.2  sec
Execution  time  on  12  processor(s):  14  min,  59.4  sec
Execution  time  on  12  processor(s):  15  min,  39.5  sec
Execution  time  on  12  processor(s):  15  min,  30.8  sec
Execution  time  on  12  processor(s):  15  min,  16.6  sec
Execution  time  on  12  processor(s):  14  min,  45.6  sec

Practically equivalent performance. Interestingly, if I run --mca btl ^tcp, I see the same inconsistent performance, with some jobs running very slowly. However, on the last (slowest) job that runs, performance improves dramatically when the other MPI jobs finish. Here are the times (in seconds) for each computational iteration that I see on the last running job:

100.152
101.964
 99.710
101.042
102.910
102.894
102.817
102.995
102.481
102.479
 82.162
 35.575
 35.576
 35.578
 35.599
 35.600
 35.607

Does that suggest some kind of network configuration issue?

@jsquyres
Member

Some clarifying questions:

  • With your shared memory tests, are you running with 12 dedicated cores on a single host (and no other MPI processes on the node at the same time)? If not, can you explain exactly how the jobs are run?
  • If you're able to run with openib, you should probably also be able to run with UCX. Have you tried that?
    • I keep asking about UCX because it is better supported than openib. Indeed, openib is disappearing in the upcoming Open MPI v5.0 -- the UCX PML will effectively be the only way to run on InfiniBand.
  • With your ^tcp tests, are you mixing multiple jobs on the same host at the same time? Your comment about "performance improves dramatically when the other MPI jobs finish" suggests that there might be some overloading occurring -- i.e., multiple MPI processes are being bound to the same core. You might want to run with mpirun --report-bindings to see exactly which core(s) each process is being bound to.

@gregfi
Author

gregfi commented Sep 20, 2022

  • Yes, the shared memory tests are running with 12 dedicated cores and no other simultaneous processes.
  • I've gotten the admins to install the UCX devel libraries, and I'm trying the configuration right now. It's an older version (1.4) that's distributed with the OS, but I'm hoping it can be made to work. (I see the warning about 1.8, but hopefully earlier versions are OK.)
  • Yes, with the ^tcp jobs, I'm running 16-process jobs on 12-slot hosts. So the division is host1,host2 = 12,4 or 8,8 depending on the machine. I will rerun with --report-bindings and post the results.

@jsquyres
Member

FYI: You should be able to download and install a later version of UCX yourself (e.g., just install it under your $HOME, such as to $HOME/install/ucx or somesuch). It's a 100% userspace library; there's no special permissions needed. Then you can build Open MPI with ./configure --with-ucx=$HOME/install/ucx ....
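
A rough sketch of what that could look like (the UCX version number, download URL, and paths are only illustrative; substitute whatever release and locations you prefer):

wget https://github.com/openucx/ucx/releases/download/v1.13.1/ucx-1.13.1.tar.gz
tar xzf ucx-1.13.1.tar.gz && cd ucx-1.13.1
./configure --prefix=$HOME/install/ucx
make -j8 install
cd /path/to/openmpi-4.1.4
./configure --with-ucx=$HOME/install/ucx <other configure options as before>
make -j8 install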

@gregfi
Author

gregfi commented Sep 20, 2022

Understood, but current UCX does not work with the version of librdmacm from the OS. In principle, I could install a newer version, but it would be far easier if the OS load set could be made to work.

Job #1, which is performing somewhat slowly, has:

[bl3403:19505] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3403:19505] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3403:19505] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3403:19505] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]
[bl3403:19505] MCW rank 4 bound to socket 0[core 4[hwt 0]]: [././././B/.][./././././.]
[bl3403:19505] MCW rank 5 bound to socket 0[core 5[hwt 0]]: [./././././B][./././././.]
[bl3403:19505] MCW rank 6 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
[bl3403:19505] MCW rank 7 bound to socket 1[core 7[hwt 0]]: [./././././.][./B/./././.]
[bl3403:19505] MCW rank 8 bound to socket 1[core 8[hwt 0]]: [./././././.][././B/././.]
[bl3403:19505] MCW rank 9 bound to socket 1[core 9[hwt 0]]: [./././././.][./././B/./.]
[bl3403:19505] MCW rank 10 bound to socket 1[core 10[hwt 0]]: [./././././.][././././B/.]
[bl3403:19505] MCW rank 11 bound to socket 1[core 11[hwt 0]]: [./././././.][./././././B]
[bl3402:18730] MCW rank 12 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3402:18730] MCW rank 13 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3402:18730] MCW rank 14 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3402:18730] MCW rank 15 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]

Job #2, which is performing very slowly, has:

[bl3402:18717] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3402:18717] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3402:18717] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3402:18717] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]
[bl3402:18717] MCW rank 4 bound to socket 0[core 4[hwt 0]]: [././././B/.][./././././.]
[bl3402:18717] MCW rank 5 bound to socket 0[core 5[hwt 0]]: [./././././B][./././././.]
[bl3402:18717] MCW rank 6 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
[bl3402:18717] MCW rank 7 bound to socket 1[core 7[hwt 0]]: [./././././.][./B/./././.]
[bl3401:02154] MCW rank 8 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
[bl3401:02154] MCW rank 9 bound to socket 0[core 1[hwt 0]]: [./B/./././.][./././././.]
[bl3401:02154] MCW rank 10 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
[bl3401:02154] MCW rank 11 bound to socket 0[core 3[hwt 0]]: [./././B/./.][./././././.]
[bl3401:02154] MCW rank 12 bound to socket 0[core 4[hwt 0]]: [././././B/.][./././././.]
[bl3401:02154] MCW rank 13 bound to socket 0[core 5[hwt 0]]: [./././././B][./././././.]
[bl3401:02154] MCW rank 14 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
[bl3401:02154] MCW rank 15 bound to socket 1[core 7[hwt 0]]: [./././././.][./B/./././.]

Seems like there's overlap, no?

@jsquyres
Member

I forgot about your librdmacm issue. Yes, you could install that manually, too -- it's also a 100% userspace library.

Yes, those 2 jobs definitely overlap -- that's why you're seeing dramatic slowdowns: multiple MPI processes are being bound to the same core, and therefore they're fighting for cycles.

At this point, I have to turn you back over to @gpaulsen because I don't know how Open MPI reads the LSF job info and decides which cores to use.

@gpaulsen gpaulsen self-assigned this Sep 20, 2022
@rhc54
Contributor

rhc54 commented Sep 20, 2022

If you are running multiple mpirun calls that are receiving the same allocation information, then they will overlap, as they don't know about each other. It sounds to me like either an error in your bsub command or a bug in the ORTE internal code that reads the resulting allocation info. If you are saying this worked with OMPI v3.x, I very much doubt the ORTE code changed when going to OMPI v4.x - though someone could easily check the relevant orte/mca/ras component to see.

@gpaulsen gpaulsen assigned markalle and unassigned gpaulsen Sep 20, 2022
@gpaulsen
Member

@markalle Can you please take a look?

Perhaps some ORTE verbosity will shed some light on things?

@gregfi
Author

gregfi commented Sep 20, 2022

What parameters should I set?

@rhc54
Contributor

rhc54 commented Sep 20, 2022

If you have built with --enable-debug, add --mca ras_base_verbose 10 to your mpirun cmd line.
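
Following the same bsub pattern as before (the -n count and application path are placeholders), that would look something like:

bsub -n 16 -I mpirun --mca ras_base_verbose 10 /path/to/app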

@markalle
Contributor

markalle commented Sep 20, 2022

Are these jobs running at the same time? If they're not running at the same time then I don't think there's any overlap; they both look like 2-host jobs where

Job 1 is:
host bl3403 : 12 ranks
host bl3402 : 4 ranks

and Job 2 is:
host bl3402 : 8 ranks
host bl3401 : 8 ranks

But if they're both bsubed simultaneously and are both trying to use bl3402 at the same time then I see what you're saying about overlap.

I don't actually remember which version of OMPI prints full-host affinity output vs which would only show the cgroup it was handed and the binding relative to that cgroup... when it does the latter it leaves the output kind of unclear looking IMO. My expectation is that if those LSF jobs were running at the same time, then LSF should have handed a different cgroup to each job and those cgroups shouldn't overlap each other.

I think those are probably all full-host affinity displays, but when in doubt I just stick my own function somewhere so I know what it's printing. Eg something like:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sched.h>
#include <unistd.h>

void
print_affinity()
{
    /* Print "hostname:pid" followed by one digit per online CPU:
     * 1 if this process is allowed to run on that CPU, 0 otherwise. */
    int i, n;
    char hostname[64];
    char *str;
    cpu_set_t mask;
    n = sysconf(_SC_NPROCESSORS_ONLN);         /* number of online CPUs */
    sched_getaffinity(0, sizeof(mask), &mask); /* affinity mask of this process */
    str = malloc(n + 256);
    if (!str) { return; }
    gethostname(hostname, 64);
    sprintf(str, "%s:%d ", hostname, getpid());
    for (i=0; i<n; ++i) {
        if (CPU_ISSET(i, &mask)) {
            strcat(str, "1");
        } else {
            strcat(str, "0");
        }
    }
    printf("%s\n", str);
    free(str);
}

int
main() {
    print_affinity();
    return(0);
}
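
For example (file name and flags are arbitrary), the snippet above can be built with plain gcc, since it doesn't call MPI itself, and then launched under the same bsub/mpirun setup so that every launched process reports the mask it actually inherited:

gcc print_affinity.c -o print_affinity
bsub -n 16 -I mpirun ./print_affinity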

@gregfi
Author

gregfi commented Sep 20, 2022

Yes, both jobs were running at the same time.

Re-compiling with --enable-debug to try Ralph's suggestion. I'm not sure I understand @markalle's suggestion.

@markalle
Contributor

To get LSF to pick the affinity for the job (which would keep the two jobs from overlapping each other), I think the bsub setting you need is -R, for example:
bsub -R'affinity[core(1):distribute=pack]' ...

When an option like that is in use the job should get an affinity assignment, and you can confirm what it did by looking at some environment variables. For example on my machine if I run two jobs at the same time with the above settings I get

first job: (bsub -R'affinity[core(1):distribute=pack]' -n 2 ...)
LSB_BIND_CPU_LIST=0,1,2,3,4,5,6,7
RM_CPUTASK1=0,1,2,3
RM_CPUTASK2=4,5,6,7
Rank0: 11110000000000000000000000000000...
Rank1: 00001111000000000000000000000000...

second job: (bsub -R'affinity[core(1):distribute=pack]' -n 2 ...)
LSB_BIND_CPU_LIST=8,9,10,11,12,13,14,15
RM_CPUTASK1=8,9,10,11
RM_CPUTASK2=12,13,14,15
Rank0: 00000000111100000000000000000000...
Rank1: 00000000000011110000000000000000...

Or if you wanted a more socket-oriented binding style that's closer to what OMPI v3.x was doing, you could use
bsub -R'affinity[core(1):cpubind=socket:distribute=balance]' ...
and that would still allocate based on the number of cores, but then bind the affinity of each rank to the whole containing socket instead of just one core.

@gregfi
Author

gregfi commented Sep 21, 2022

Ta-da! Run with OpenMPI 4.1.4 using -R'affinity[core(1):distribute=pack]':

Execution  time  on  16  processor(s):  12  min,  3.0   sec
Execution  time  on  16  processor(s):  13  min,  22.8  sec
Execution  time  on  16  processor(s):  12  min,  16.1  sec
Execution  time  on  16  processor(s):  12  min,  38.6  sec
Execution  time  on  16  processor(s):  12  min,  5.1   sec
Execution  time  on  16  processor(s):  13  min,  59.3  sec
Execution  time  on  16  processor(s):  12  min,  7.9   sec
Execution  time  on  16  processor(s):  12  min,  34.5  sec
Execution  time  on  16  processor(s):  12  min,  14.8  sec

Thanks, guys! Before I mark this closed, are there similar options for Torque? We're between job schedulers at the moment, and I suspect this issue is affecting Torque jobs, also.

@rhc54
Contributor

rhc54 commented Sep 21, 2022

I guarantee there are, but heck if I can remember them. Easiest solution is to skip the hieroglyphics and just use mpirun --bind-to socket in both environments and be done with it. You'll get nearly identical performance without the hassle of environment-unique cmd scripts.

@gregfi
Author

gregfi commented Sep 21, 2022

On any given machine, there may be a handful of non-MPI jobs running where our MPI job is dispatched. Are the job schedulers smart enough to look at the state of the machine and pick a core (or socket) based on which slots are idle?

@rhc54
Contributor

rhc54 commented Sep 21, 2022

It sounds like your scheduler is configured to support that behavior - it must not be allocating solely at the node level. If it is allowed to allocate subdivisions of a node, then it will certainly do so.

@jsquyres
Member

On any given machine, there may be a handful of non-MPI jobs running where our MPI job is dispatched. Are the job schedulers smart enough to look at the state of the machine and pick a core (or socket) based on which slots are idle?

Let me re-phrase your question -- tell me if this is inaccurate:

On any given machine, there may be a handful of jobs running that were not launched via our scheduler. Are the job schedulers smart enough to look at the state of the machine and pick a core (or socket) based on which cores are idle?

I'm not an expert in this area / I do not closely follow the latest features of the various job schedulers out there, but this does not sound like a good idea. My gut reaction/assumption is that your scheduler(s) do not account for jobs launched outside of the job scheduler, but you should check the specific documentation of your system to be sure.

My $0.02: you should launch everything through the job scheduler so that you 100% know that all the jobs are using the policies that are designated for your cluster (e.g., whether you want to allow oversubscription or not). If jobs are launched from outside the scheduler, there is no guarantee that the job scheduler will account for them, and therefore you can end up unexpectedly oversubscribing the resources, resulting in poor performance. Also, jobs launched outside of the job scheduler may not be restricted to specific resources (e.g., cores), so they may end up floating around between cores, and can therefore have varying effects on job-scheduler-launched jobs. These kinds of behaviors tend to make all users unhappy -- even those who are stubborn enough to not use the job scheduler to launch their jobs.

@markalle
Contributor

I agree with jsquyres. On a former project I worked on, we inherited an affinity system that tried to examine the load and pick a non-busy subset of the machines to run on, and it was a terrible feature. It produced unpredictable behavior from run to run and was just about always a degradation in performance.

As far as I know, LSF and any other scheduler is just going to partition the machines up among the jobs it manages and not try to deduce load of other jobs from outside the scheduler. So I'd also recommend not mixing non-scheduled jobs with scheduled jobs, and instead send everything through a scheduler so it can make sensible assignments.

That said, I'm also not familiar with Torque options.

Vanilla mpirun --bind-to socket is an interesting solution, though. On the one hand that option wouldn't be keeping track of cross-job assignments, but in practical terms on ordinary configurations it's still likely to produce balanced workloads, and it's enough of a binding that I expect it would give mostly the same benefits as specific core bindings. I would be curious, though, to see performance numbers comparing the core vs. socket bindings.

@gregfi
Author

gregfi commented Sep 22, 2022

On any given machine, there may be a handful of jobs running that were not launched via our scheduler. Are the job schedulers smart enough to look at the state of the machine and pick a core (or socket) based on which cores are idle?

No, that's not correct. All jobs are launched via the scheduler(s). I'm wondering if the scheduler has control over which slots a non-MPI job occupies, or whether that's a decision made by the OS, and it sounds like the scheduler is probably deciding which slot to run the job on - or at least is capable of seeing which slots are vacant. If that's the case, binding to core (and using mappings from the scheduler) sounds like the way to fly.

@markalle
Contributor

That could be okay. If all the jobs are bsubed through LSF then at least the total amount of work being put on each machine would match the number of slots/cores the machine has. Then as to the specifics of which cores the various jobs are using you've got a few options, but I'd guess the non-MPI jobs wouldn't need to be created with a specific affinity, instead letting the OS decide. I'd probably run both ways just to see how the performance compares, but that's my guess.

The situation where you ran into trouble was when LSF was putting two MPI jobs on the same system but leaving the affinity up to the app, and MPI doesn't have the cross-job awareness to deal with that and was binding them to the same cores. I'd say any of the following are decent solutions:

  • use bsub -R just on the MPI jobs : then LSF would hand non-overlapping cpusets to each MPI job, and any non-MPI jobs would float on their own to the unused cores
  • use bsub -R on all the jobs : then the non-MPI jobs would be bound to specific cores too, I doubt this would make a visible performance difference but I'd be curious to see
  • use bsub without -R but use mpirun with a loose binding like --bind-to socket : in this mode LSF wouldn't be making specific cpuset assignments to the jobs, and you'd be relying on the fact that socket-level binding is a pretty broad distribution of the work to keep the MPI jobs from stepping on each other

I like both the "bsub -R" and the "bsub without -R but bind-to socket" ideas and would just test the performance and pick based on that.

Going just a little further down the details though, I'd also consider using "--rank-by core" alongside the "--bind-to socket" option. The same sockets would be used, but it would change which ranks are assigned to which socket, eg

--bind-to socket example:
R0: [B/B/B/B][./././.]
R1: [./././.][B/B/B/B]
R2: [B/B/B/B][./././.]
R3: [./././.][B/B/B/B]

--bind-to socket --rank-by core example:
R0: [B/B/B/B][./././.]
R1: [B/B/B/B][./././.]
R2: [./././.][B/B/B/B]
R3: [./././.][B/B/B/B]

On average I'd expect adjacent ranks to do slightly more communication with each other than with further-away ranks, so I'd pick the second binding above in the absence of other info about the job.

So I'd compare the performance with the following (example command lines are sketched after this list):

  • MPI jobs bound to specific cores (eg -R with affinity[core(1):distribute=pack])
  • MPI jobs bound to sockets alternating (eg --bind-to socket)
  • MPI jobs bound to sockets contiguously (eg --bind-to socket --rank-by core or -R with affinity[core(1):cpubind=socket:distribute=balance])
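
Concretely, assuming the same 16-process bsub/mpirun pattern used earlier in the thread (the application path is a placeholder; the last two lines are the two alternatives for the contiguous-socket case), those comparisons might look something like:

bsub -n 16 -R'affinity[core(1):distribute=pack]' -I mpirun /path/to/app
bsub -n 16 -I mpirun --bind-to socket /path/to/app
bsub -n 16 -I mpirun --bind-to socket --rank-by core /path/to/app
bsub -n 16 -R'affinity[core(1):cpubind=socket:distribute=balance]' -I mpirun /path/to/app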

@jsquyres
Member

No, that's not correct. All jobs are launched via the scheduler(s). I'm wondering if the scheduler has control over the which slots a non-MPI job occupies, or whether that's a decision made by the OS, and it sounds like the scheduler is probably deciding which slot to run the job on - or at least is capable of seeing which slots are vacant. If that's the case, it sounds like binding to core (and using mappings from the scheduler) sounds like the way to fly.

Ah, you really did mean MPI job. Ok. I was confused because you used the word "slot", but "slot" is very much an Open MPI / PMIx term -- not a scheduler term.

I think @markalle outlined the situation well in his comment, above. I'd add one clarification: if you use bsub -R, then just doing --bind-to socket on all your jobs may not do what you expect. In that situation, Open MPI should bind each MPI process to its allocated cores in the package (socket) on which it landed. This may be less than all the cores on that package.

For example, say your nodes each have 2 packages of 6 cores.

If LSF assigns cores in 3 different jobs on a single node like this:

  • job A: package 0, cores 0-3
  • job B: package 0, cores 4-5 and package 1, cores 0-1
  • job C: package 1, cores 2-5

In this situation, --bind-to socket will bind like this:

  • job A to package 0, cores 0-3 (not 0-5)
  • job B: 2 processes to package 0 cores 4-5 and the other 2 processes to package 1 cores 0-1
  • job C to package 1 cores 2-5

This may be ok from your perspective, but just realize that it's different than every MPI process effectively being bound to all 6 cores in a single package.

Put differently: Open MPI will bind to all the cores in a given package within the set of cores that are allocated to that specific job.

Regardless, you can do a bunch of experimentation, and use --report-bindings a) to see where processes are actually bound, and b) what effect that has on performance.

@gregfi
Author

gregfi commented Oct 14, 2022

When I do --bind-to socket --rank-by core, I get:

[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[./././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B]
[./././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B]

Which is not what I want, and predictably performance isn't great if there are competing jobs on the node. However, if I do mpirun --map-by socket --bind-to socket --rank-by core, I get:

[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[B/B/B/B/B/B/B/B/B/B/B/B/B/B][./././././././././././././.]
[./././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B]
[./././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B]
[./././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B]
[./././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B]
[./././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B]
[./././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B]
[./././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B]
[./././././././././././././.][B/B/B/B/B/B/B/B/B/B/B/B/B/B]

This is what I want, and performance is good if there are other jobs running on the host (even when no affinity is specified). Why is the former different than the latter? Isn't --map-by socket the default?

I'm also curious why the OpenMPI default behavior changed. I can't imagine I'm the only less-sophisticated user scratching their head as a result of this.

Interestingly, -R'affinity[core(1):distribute=pack]' works fine if the other jobs running on the host are submitted with -R'affinity[core(1)]', but performance is atrocious if the non-MPI jobs did not specify affinity. Apparently, if no affinity is specified, LSF leaves everything up to the OS. But even if I start the MPI job first and then add the other non-MPI processes, performance still deteriorates. I don't understand why the OS would dispatch the non-MPI jobs to slots that are already loaded down, but that seems to be what's happening. Do all new processes just alternate between sockets by default?

However, LSF also allows you to specify -R "affinity[core(1):distribute=balance]", which results in:

[B/././././././././././././.][./././././././././././././.]
[./B/./././././././././././.][./././././././././././././.]
[././B/././././././././././.][./././././././././././././.]
[./././B/./././././././././.][./././././././././././././.]
[././././B/././././././././.][./././././././././././././.]
[./././././B/./././././././.][./././././././././././././.]
[././././././B/././././././.][./././././././././././././.]
[./././././././B/./././././.][./././././././././././././.]
[./././././././././././././.][B/././././././././././././.]
[./././././././././././././.][./B/./././././././././././.]
[./././././././././././././.][././B/././././././././././.]
[./././././././././././././.][./././B/./././././././././.]
[./././././././././././././.][././././B/././././././././.]
[./././././././././././././.][./././././B/./././././././.]
[./././././././././././././.][././././././B/././././././.]
[./././././././././././././.][./././././././B/./././././.]

And this also gives good performance when other jobs are running on the host. Also, IBM says that affinity[core(1)] can be added to RES_REQ of the queue configuration so that jobs have affinity specified by default.

@jjhursey
Member

Since Open MPI v4 (IIRC) Open MPI switched to:

  • If the number of procs >= 2 : map by core
  • If the number of procs < 2 : map by numa/package

In your first pattern: --bind-to socket --rank-by core (which implies --map-by core --bind-to socket --rank-by core)

  1. First processes are mapped to core starting from core 0. So it will fill the first socket before moving to the next socket.
  2. After mapping the processes are bound to the socket.
  3. Finally the processes are ranked by their assigned core locations

In your second pattern: --map-by socket --bind-to socket --rank-by core

  1. First processes are mapped to socket starting from socket 0. It will bounce between socket 0 and 1 assigning one process to each.
  2. After mapping the processes are bound to the socket.
  3. Finally the processes are ranked by their core locations on the assigned socket.

By default, LSF does not specify an affinity so it is left to Open MPI to determine how to map/bind/rank. If you specify affinity options to LSF then it creates a LSB_AFFINITY_HOSTFILE which Open MPI will read and make a best attempt at honoring the affinity described within.

It should be noted that the LSB_AFFINITY_HOSTFILE support is broken in main and v5.0.x. It will emit an error message if the file contains any affinity. I'm working to see if I can restore that functionality.

Is there anything else that needs to be addressed in this issue before we close it?

@rhc54
Contributor

rhc54 commented Oct 26, 2022

  • If the number of procs >= 2 : map by core
  • If the number of procs < 2 : map by numa/package

Actually, that is backwards - just a typo:

if procs <= 2: map by core
if procs > 2: map by numa/package

@jjhursey
Member

Yep, you are correct (just re-reviewed this code), but that doesn't quite explain what the user is seeing then.

The --bind-to socket --rank-by core output makes sense to me if it implied --map-by core, but per your comment the default is --map-by package, so I would expect their second set of output.

Is one of the other CLI options overriding the default mapping?

@rhc54
Contributor

rhc54 commented Oct 26, 2022

Remember, my brain is totally fizzed right now with all the drugs, so take this with a grain of salt. The difference is that the code defaults to map-by NUMA, which I suspect is not the same as map-by SOCKET on the user's system. The map is just showing that difference. We typically conflate the two, which is something we should probably get used to not doing.

@jjhursey
Member

jjhursey commented Oct 28, 2022

I wanted to try to get the rules down and noticed a default mapping/binding issue in PRRTE.

  • main always --map-by core regardless of the number of processes
  • main does not carry forward the ranking implication of the --map-by object like it does in v4.1.x
    • main does not have a --rank-by core

Open MPI v4.1.x

  • If number of processes <= 2 then default --map-by core --bind-to core --rank-by core
  • If number of processes > 2 then default --map-by socket --bind-to socket --rank-by socket
  • If --bind-to and --rank-by are not specified and --map-by is specified then --map-by object is implied for both --bind-to and --rank-by

Example: <= 2

shell$ mpirun -np 2 ./get-pretty-cpu | sort -k 1 -n
  0/  2 c660f5n18: [BBBBBBBB/......../......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  1/  2 c660f5n18: [......../BBBBBBBB/......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]

Example: > 2

shell$ mpirun -np 3 ./get-pretty-cpu | sort -k 1 -n
  0/  3 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  1/  3 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
  2/  3 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
shell$ mpirun -np 16 ./get-pretty-cpu | sort -k 1 -n
  0/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  1/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
  2/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  3/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
  4/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  5/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
  6/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  7/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
  8/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  9/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 10/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 11/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 12/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 13/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 14/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 15/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]

Example: > 2 and --map-by core

shell$ mpirun -np 16 --map-by core ./get-pretty-cpu | sort -k 1 -n
  0/ 16 c660f5n18: [BBBBBBBB/......../......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  1/ 16 c660f5n18: [......../BBBBBBBB/......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  2/ 16 c660f5n18: [......../......../BBBBBBBB/......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  3/ 16 c660f5n18: [......../......../......../BBBBBBBB/......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  4/ 16 c660f5n18: [......../......../......../......../BBBBBBBB/......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  5/ 16 c660f5n18: [......../......../......../......../......../BBBBBBBB/......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  6/ 16 c660f5n18: [......../......../......../......../......../......../BBBBBBBB/......../......../........][......../......../......../......../......../......../......../......../......../........]
  7/ 16 c660f5n18: [......../......../......../......../......../......../......../BBBBBBBB/......../........][......../......../......../......../......../......../......../......../......../........]
  8/ 16 c660f5n18: [......../......../......../......../......../......../......../......../BBBBBBBB/........][......../......../......../......../......../......../......../......../......../........]
  9/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 10/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/......../......../......../......../......../......../......../......../........]
 11/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../BBBBBBBB/......../......../......../......../......../......../......../........]
 12/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../BBBBBBBB/......../......../......../......../......../......../........]
 13/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../BBBBBBBB/......../......../......../......../......../........]
 14/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../......../BBBBBBBB/......../......../......../......../........]
 15/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../......../......../BBBBBBBB/......../......../......../........]

Example: > 2 and --map-by socket

  0/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  1/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
  2/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  3/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
  4/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  5/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
  6/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  7/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
  8/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  9/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 10/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 11/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 12/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 13/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 14/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 15/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]

Open MPI main

  • I would expect this to have the same rules as v4.1.x, but it looks like it binds to core regardless of the number of processes.

Example: <= 2

shell$ mpirun -np 2 ./get-pretty-cpu | sort -k 1 -n
  0/  2 c660f5n18: [BBBBBBBB/......../......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  1/  2 c660f5n18: [......../BBBBBBBB/......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]

Example: > 2

shell$ mpirun -np 3 ./get-pretty-cpu | sort -k 1 -n
  0/  3 c660f5n18: [BBBBBBBB/......../......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  1/  3 c660f5n18: [......../BBBBBBBB/......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  2/  3 c660f5n18: [......../......../BBBBBBBB/......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
shell$ mpirun -np 16 ./get-pretty-cpu | sort -k 1 -n
  0/ 16 c660f5n18: [BBBBBBBB/......../......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  1/ 16 c660f5n18: [......../BBBBBBBB/......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  2/ 16 c660f5n18: [......../......../BBBBBBBB/......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  3/ 16 c660f5n18: [......../......../......../BBBBBBBB/......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  4/ 16 c660f5n18: [......../......../......../......../BBBBBBBB/......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  5/ 16 c660f5n18: [......../......../......../......../......../BBBBBBBB/......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  6/ 16 c660f5n18: [......../......../......../......../......../......../BBBBBBBB/......../......../........][......../......../......../......../......../......../......../......../......../........]
  7/ 16 c660f5n18: [......../......../......../......../......../......../......../BBBBBBBB/......../........][......../......../......../......../......../......../......../......../......../........]
  8/ 16 c660f5n18: [......../......../......../......../......../......../......../......../BBBBBBBB/........][......../......../......../......../......../......../......../......../......../........]
  9/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 10/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/......../......../......../......../......../......../......../......../........]
 11/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../BBBBBBBB/......../......../......../......../......../......../......../........]
 12/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../BBBBBBBB/......../......../......../......../......../......../........]
 13/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../BBBBBBBB/......../......../......../......../......../........]
 14/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../......../BBBBBBBB/......../......../......../......../........]
 15/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../......../......../BBBBBBBB/......../......../......../........]

Example: > 2 and --map-by core

  0/ 16 c660f5n18: [BBBBBBBB/......../......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  1/ 16 c660f5n18: [......../BBBBBBBB/......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  2/ 16 c660f5n18: [......../......../BBBBBBBB/......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  3/ 16 c660f5n18: [......../......../......../BBBBBBBB/......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  4/ 16 c660f5n18: [......../......../......../......../BBBBBBBB/......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  5/ 16 c660f5n18: [......../......../......../......../......../BBBBBBBB/......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  6/ 16 c660f5n18: [......../......../......../......../......../......../BBBBBBBB/......../......../........][......../......../......../......../......../......../......../......../......../........]
  7/ 16 c660f5n18: [......../......../......../......../......../......../......../BBBBBBBB/......../........][......../......../......../......../......../......../......../......../......../........]
  8/ 16 c660f5n18: [......../......../......../......../......../......../......../......../BBBBBBBB/........][......../......../......../......../......../......../......../......../......../........]
  9/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 10/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/......../......../......../......../......../......../......../......../........]
 11/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../BBBBBBBB/......../......../......../......../......../......../......../........]
 12/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../BBBBBBBB/......../......../......../......../......../......../........]
 13/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../BBBBBBBB/......../......../......../......../......../........]
 14/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../......../BBBBBBBB/......../......../......../......../........]
 15/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../......../......../BBBBBBBB/......../......../......../........]

Example: > 2 and --map-by socket

  0/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  1/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  2/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  3/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  4/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  5/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  6/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  7/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  8/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  9/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 10/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 11/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 12/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 13/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 14/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 15/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]

@rhc54
Copy link
Contributor

rhc54 commented Oct 28, 2022

Sounds right for OMPI v4 and below. It was changed for OMPI v5 and above because you are electing to use the PRRTE defaults. You have the option of defining your own "defaults" logic using the schizo APIs (one each for map/rank/bind).

FWIW, my actual intent was to have PRRTE default to "package" in place of "numa" given all the problems with defining "numa" nowadays, but I saw in the code that I had not done that yet. If we did, then that would more closely approximate prior OMPI behavior.

A PR is welcome, if you have the time.

@jjhursey
Copy link
Member

The comment went up before it was ready. I updated it to show the problem with the default mapping (i.e., it always defaults to bind to core regardless of the number of processes).

@rhc54
Copy link
Contributor

rhc54 commented Oct 28, 2022

Ah, so you aren't talking about a problem with setting the mapping default - you are talking about how the default ranking/binding get set in the absence of any directive IF mapping is specified by the user?

Yeah, I can see a bug in there. I'll take a crack at it.

@jjhursey
Copy link
Member

Both really.

  • If nothing is specified and the number of processes > 2, then it maps by core instead of by socket.
  • If they specify --map-by but nothing else, then it seems --rank-by is not set correctly. But then again, in PRRTE there is no core or socket in --rank-by IIRC, so maybe this is working as expected and is just a difference between PRRTE and ORTE.

@rhc54
Copy link
Contributor

rhc54 commented Oct 28, 2022

If nothing is specified and the number of processes > 2, then it maps by core instead of by socket.

Yeah, options.nprocs isn't being initialized, so it is zero

If they specify --map-by but nothing else, then it seems --rank-by is not set correctly. But then again, in PRRTE there is no core or socket in --rank-by IIRC, so maybe this is working as expected and is just a difference between PRRTE and ORTE.

In PRRTE, your only real choices would be NODE (if they mapped by node), SLOT (if they mapped by slot), or FILL (everything else). That is close to what it currently does, but it is missing that last option, which is why you got that output.
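
To make the nprocs bug concrete, here is a minimal, hypothetical C sketch (the struct and names are invented for illustration and are not the actual PRRTE code) of how a default mapping/ranking policy might be chosen from the job size. If nprocs is never initialized it stays at zero, so the "<= 2" branch is always taken and the job is mapped by core no matter how many processes were requested:

#include <stdio.h>

typedef enum { MAP_BY_CORE, MAP_BY_PACKAGE } map_policy_t;
typedef enum { RANK_BY_SLOT, RANK_BY_FILL } rank_policy_t;

typedef struct {
    int nprocs;               /* must be filled in from the job before use */
    map_policy_t  map;
    rank_policy_t rank;
} job_opts_t;

/* Pick the default mapping/ranking when the user gave no directives. */
static void set_default_policies(job_opts_t *opts)
{
    if (opts->nprocs <= 2) {
        opts->map  = MAP_BY_CORE;    /* small jobs: map by core */
        opts->rank = RANK_BY_SLOT;   /* equivalent to rank-by core here */
    } else {
        opts->map  = MAP_BY_PACKAGE; /* larger jobs: map by socket/package */
        opts->rank = RANK_BY_FILL;   /* fill each object before moving on */
    }
}

int main(void)
{
    job_opts_t opts = { .nprocs = 16 };  /* the fix: initialize from the job size */
    set_default_policies(&opts);
    printf("map=%s rank=%s\n",
           opts.map  == MAP_BY_CORE  ? "core" : "package",
           opts.rank == RANK_BY_SLOT ? "slot" : "fill");
    return 0;
}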

@rhc54
Copy link
Contributor

rhc54 commented Oct 28, 2022

I also found that the default bind policy was only being set to follow the mapping policy IF the user specified the latter - which makes no sense, really. We should set the default bind to follow the mapping regardless of whether the mapping policy was defaulted or user-defined. I fixed that as well.
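
As a hedged sketch of that rule (again with invented names, not the actual PRRTE code), the binding default can be derived from whichever mapping policy is in effect, with an explicit --bind-to always taking precedence:

typedef enum { MAP_BY_CORE, MAP_BY_PACKAGE } map_policy_t;
typedef enum { BIND_UNSET, BIND_TO_CORE, BIND_TO_PACKAGE } bind_policy_t;

/* Derive the default binding from the mapping policy in effect,
 * whether that mapping was defaulted or set by the user. */
bind_policy_t default_bind(map_policy_t map, bind_policy_t user_bind)
{
    if (user_bind != BIND_UNSET) {
        return user_bind;            /* an explicit --bind-to always wins */
    }
    return (map == MAP_BY_CORE) ? BIND_TO_CORE : BIND_TO_PACKAGE;
}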

Please give the referenced PR a try and see if it behaves better now.

@jjhursey
Copy link
Member

jjhursey commented Oct 28, 2022

Testing Open MPI main with PRRTE main + openpmix/prrte#1571

  • This fixes the default binding policy
  • As noted above, the ranking policy is slightly different because there is no longer a 1-to-1 match between the --map-by and --rank-by options, but it's workable.

Policy for Open MPI v4.1.x with ORTE

  • If number of processes <= 2 then default --map-by core --bind-to core --rank-by core
  • If number of processes > 2 then default --map-by socket --bind-to socket --rank-by socket
  • If --bind-to and --rank-by are not specified and --map-by is specified then --map-by object is implied for both --bind-to and --rank-by

Policy for Open MPI main with PRRTE (please double check me here)

  • If number of processes <= 2 then default --map-by core --bind-to core --rank-by slot
  • If number of processes > 2 then default --map-by socket --bind-to socket --rank-by slot
  • If --bind-to is not specified and --map-by is specified then --map-by object is implied for --bind-to
    • If --map-by core implies --bind-to core
    • If --map-by socket implies --bind-to socket
  • --rank-by slot is the default

Example: <= 2

Open MPI v4.1.x

shell$ mpirun -np 2 ./get-pretty-cpu | sort -k 1 -n
  0/  2 c660f5n18: [BBBBBBBB/......../......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  1/  2 c660f5n18: [......../BBBBBBBB/......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]

Open MPI main

shell$ mpirun -np 2 ./get-pretty-cpu | sort -k 1 -n
  0/  2 c660f5n18: [BBBBBBBB/......../......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  1/  2 c660f5n18: [......../BBBBBBBB/......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]

Example: > 2

Open MPI v4.1.x

shell$ mpirun -np 3 ./get-pretty-cpu | sort -k 1 -n
  0/  3 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  1/  3 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
  2/  3 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
shell$ mpirun -np 16 ./get-pretty-cpu | sort -k 1 -n
  0/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  1/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
  2/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  3/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
  4/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  5/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
  6/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  7/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
  8/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  9/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 10/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 11/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 12/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 13/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 14/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 15/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]

Open MPI main

shell$ mpirun -np 3 ./get-pretty-cpu | sort -k 1 -n
  0/  3 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  1/  3 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  2/  3 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
shell$ mpirun -np 16 ./get-pretty-cpu | sort -k 1 -n
  0/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  1/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  2/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  3/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  4/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  5/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  6/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  7/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  8/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  9/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 10/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 11/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 12/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 13/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 14/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 15/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]

Example: > 2 and --map-by core

Open MPI v4.1.x

shell$ mpirun -np 16 --map-by core ./get-pretty-cpu | sort -k 1 -n
  0/ 16 c660f5n18: [BBBBBBBB/......../......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  1/ 16 c660f5n18: [......../BBBBBBBB/......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  2/ 16 c660f5n18: [......../......../BBBBBBBB/......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  3/ 16 c660f5n18: [......../......../......../BBBBBBBB/......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  4/ 16 c660f5n18: [......../......../......../......../BBBBBBBB/......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  5/ 16 c660f5n18: [......../......../......../......../......../BBBBBBBB/......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  6/ 16 c660f5n18: [......../......../......../......../......../......../BBBBBBBB/......../......../........][......../......../......../......../......../......../......../......../......../........]
  7/ 16 c660f5n18: [......../......../......../......../......../......../......../BBBBBBBB/......../........][......../......../......../......../......../......../......../......../......../........]
  8/ 16 c660f5n18: [......../......../......../......../......../......../......../......../BBBBBBBB/........][......../......../......../......../......../......../......../......../......../........]
  9/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 10/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/......../......../......../......../......../......../......../......../........]
 11/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../BBBBBBBB/......../......../......../......../......../......../......../........]
 12/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../BBBBBBBB/......../......../......../......../......../......../........]
 13/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../BBBBBBBB/......../......../......../......../......../........]
 14/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../......../BBBBBBBB/......../......../......../......../........]
 15/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../......../......../BBBBBBBB/......../......../......../........]

Open MPI main

shell$ mpirun -np 16 --map-by core ./get-pretty-cpu | sort -k 1 -n
  0/ 16 c660f5n18: [BBBBBBBB/......../......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  1/ 16 c660f5n18: [......../BBBBBBBB/......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  2/ 16 c660f5n18: [......../......../BBBBBBBB/......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  3/ 16 c660f5n18: [......../......../......../BBBBBBBB/......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  4/ 16 c660f5n18: [......../......../......../......../BBBBBBBB/......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  5/ 16 c660f5n18: [......../......../......../......../......../BBBBBBBB/......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  6/ 16 c660f5n18: [......../......../......../......../......../......../BBBBBBBB/......../......../........][......../......../......../......../......../......../......../......../......../........]
  7/ 16 c660f5n18: [......../......../......../......../......../......../......../BBBBBBBB/......../........][......../......../......../......../......../......../......../......../......../........]
  8/ 16 c660f5n18: [......../......../......../......../......../......../......../......../BBBBBBBB/........][......../......../......../......../......../......../......../......../......../........]
  9/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 10/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/......../......../......../......../......../......../......../......../........]
 11/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../BBBBBBBB/......../......../......../......../......../......../......../........]
 12/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../BBBBBBBB/......../......../......../......../......../......../........]
 13/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../BBBBBBBB/......../......../......../......../......../........]
 14/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../......../BBBBBBBB/......../......../......../......../........]
 15/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../......../......../BBBBBBBB/......../......../......../........]

Example: > 2 and --map-by socket

Open MPI v4.1.x

shell$ mpirun -np 16 --map-by socket ./get-pretty-cpu | sort -k 1 -n
  0/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  1/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
  2/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  3/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
  4/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  5/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
  6/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  7/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
  8/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  9/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 10/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 11/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 12/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 13/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 14/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 15/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]

Open MPI main

shell$ mpirun -np 16 --map-by socket ./get-pretty-cpu | sort -k 1 -n
  0/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  1/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  2/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  3/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  4/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  5/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  6/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  7/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  8/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
  9/ 16 c660f5n18: [BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 10/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 11/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 12/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 13/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 14/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]
 15/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB/BBBBBBBB]

Playing with --rank-by on Open MPI main

shell$ mpirun -np 16 --bind-to core --rank-by slot ./get-pretty-cpu | sort -k 1 -n
  0/ 16 c660f5n18: [BBBBBBBB/......../......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  1/ 16 c660f5n18: [......../BBBBBBBB/......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  2/ 16 c660f5n18: [......../......../BBBBBBBB/......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  3/ 16 c660f5n18: [......../......../......../BBBBBBBB/......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  4/ 16 c660f5n18: [......../......../......../......../BBBBBBBB/......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  5/ 16 c660f5n18: [......../......../......../......../......../BBBBBBBB/......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  6/ 16 c660f5n18: [......../......../......../......../......../......../BBBBBBBB/......../......../........][......../......../......../......../......../......../......../......../......../........]
  7/ 16 c660f5n18: [......../......../......../......../......../......../......../BBBBBBBB/......../........][......../......../......../......../......../......../......../......../......../........]
  8/ 16 c660f5n18: [......../......../......../......../......../......../......../......../BBBBBBBB/........][......../......../......../......../......../......../......../......../......../........]
  9/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../BBBBBBBB][......../......../......../......../......../......../......../......../......../........]
 10/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/......../......../......../......../......../......../......../......../........]
 11/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../BBBBBBBB/......../......../......../......../......../......../......../........]
 12/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../BBBBBBBB/......../......../......../......../......../......../........]
 13/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../BBBBBBBB/......../......../......../......../......../........]
 14/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../......../BBBBBBBB/......../......../......../......../........]
 15/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../......../......../BBBBBBBB/......../......../......../........]
shell$ mpirun -np 16 --bind-to core --rank-by span ./get-pretty-cpu | sort -k 1 -n
  0/ 16 c660f5n18: [BBBBBBBB/......../......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  1/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][BBBBBBBB/......../......../......../......../......../......../......../......../........]
  2/ 16 c660f5n18: [......../BBBBBBBB/......../......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  3/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../BBBBBBBB/......../......../......../......../......../......../......../........]
  4/ 16 c660f5n18: [......../......../BBBBBBBB/......../......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  5/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../BBBBBBBB/......../......../......../......../......../......../........]
  6/ 16 c660f5n18: [......../......../......../BBBBBBBB/......../......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  7/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../BBBBBBBB/......../......../......../......../......../........]
  8/ 16 c660f5n18: [......../......../......../......../BBBBBBBB/......../......../......../......../........][......../......../......../......../......../......../......../......../......../........]
  9/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../......../BBBBBBBB/......../......../......../......../........]
 10/ 16 c660f5n18: [......../......../......../......../......../BBBBBBBB/......../......../......../........][......../......../......../......../......../......../......../......../......../........]
 11/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../........][......../......../......../......../......../BBBBBBBB/......../......../......../........]
 12/ 16 c660f5n18: [......../......../......../......../......../......../BBBBBBBB/......../......../........][......../......../......../......../......../......../......../......../......../........]
 13/ 16 c660f5n18: [......../......../......../......../......../......../......../BBBBBBBB/......../........][......../......../......../......../......../......../......../......../......../........]
 14/ 16 c660f5n18: [......../......../......../......../......../......../......../......../BBBBBBBB/........][......../......../......../......../......../......../......../......../......../........]
 15/ 16 c660f5n18: [......../......../......../......../......../......../......../......../......../BBBBBBBB][......../......../......../......../......../......../......../......../......../........]

@rhc54
Copy link
Contributor

rhc54 commented Oct 28, 2022

Policy for Open MPI main with PRRTE (please double check me here)
If number of processes <= 2 then default --map-by core --bind-to core --rank-by slot

Actually, it technically would be rank-by core, but that is equivalent here

If number of processes > 2 then default --map-by socket --bind-to socket --rank-by slot

It technically is rank-by fill since you are mapping by an object

If --bind-to is not specified and --map-by is specified then --map-by object is implied for --bind-to

I'm not quite sure I follow this sentence - the result would be to bind-to whatever map-by was set to

If --map-by core implies --bind-to core

Yes

If --map-by socket implies --bind-to socket

Yes

--rank-by slot is the default

Not really - it technically is rank-by fill.

There is also a rule that map-by foo:span will default to rank-by span.

@rhc54
Copy link
Contributor

rhc54 commented Oct 28, 2022

Just to be clear: this comment addresses the case where the user provides "--map-by foo", but nothing about the ranking or binding policies.

I went back to try and understand where the alternative rules came from, and I think I understand. It all boils down to what you want from "default" ranking and binding patterns. What we now have (after the change) is probably more what people might expect - i.e., if I only specify a mapping policy, then ranking and binding simply match it.

However, this does not result in the best performance in most cases. As was pointed out in a "debate" about this some time ago, the best performance is obtained by "dense" packing of nodes/objects, so that adjacent ranks (i.e., procs whose ranks differ by one) make maximum use of shared memory, since that is the typical expectation of developers, and by binding those procs to core. Having ranking simply follow mapping yields the lowest density, and thus the poorest performance for the typical developer. Likewise, binding a proc to an object (other than a single core or cpu) is correct if the proc is multi-threaded, but otherwise reduces performance.

So I'm not sure what is "correct" here. Giving the user something that "looks" like what they might expect, but achieves lower performance in general? Or giving them something that probably provides better performance, but "looks" wrong since ranking and binding don't follow mapping by default?

@jjhursey
Copy link
Member

jjhursey commented Nov 3, 2022

The discussion on this issue diverged from the original post. Reading back, I think the issue has been addressed. @gregfi is there something further to investigate or do you have any other questions?
