
@everywhere is slow on HPC with multi-node environment #39291

Closed
algorithmx opened this issue Jan 17, 2021 · 17 comments · Fixed by #44671
Labels
domain:parallelism Parallel or distributed computation

Comments

@algorithmx

algorithmx commented Jan 17, 2021

remotecall_eval(Main, procs, ex)

Please check here for descriptions of the problem by three Julia users:

https://discourse.julialang.org/t/everywhere-takes-a-very-long-time-when-using-a-cluster/35724

I have tested @everywhere and pmap() on an HPC. Test code and results are available here:
https://github.com/algorithmx/nodeba

Basically I just put timestamps between the lines. You can see in the t*.log files that the largest gap is the one between timestamps 3 and 4. More interestingly, I found that increasing nworkers() causes the gap to increase linearly. I believe this gap represents the execution time of the @everywhere macro as seen from the master process.

The version info is:

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c7 (2020-11-09 13:37 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD EPYC 7452 32-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, znver2)
Environment:
JULIA_PKG_SERVER = https://mirrors.tuna.tsinghua.edu.cn/julia
@algorithmx
Author

algorithmx commented Jan 18, 2021

related issue #28966
@KristofferC

@arnauqb

arnauqb commented Feb 22, 2021

Hey @algorithmx, I'm facing exactly the same issue you describe. Have you found any workaround yet?

Could it be related to this? JuliaLang/Pkg.jl#1219

@algorithmx
Author

Hey @algorithmx, I'm facing exactly the same issue you describe. Have you found any workaround yet?

Could it be related to this? JuliaLang/Pkg.jl#1219

not yet :-(

@arnauqb

arnauqb commented Mar 3, 2021

I have been able to reduce the delay quite significantly by using the latest Julia 1.6 release (it seems that the faster compilation speed helps) and also by changing Base.DEPOT_PATH:

using Distributed, ClusterManagers
pids = addprocs_slurm(...)
@everywhere pushfirst!(Base.DEPOT_PATH, "/tmp/julia.cache")

@moble
Contributor

moble commented Aug 9, 2021

I've also run into this problem (posted on discourse here), and traced it back to just using @everywhere with basically any simple statement — even

@everywhere 1+2

It so happens that most of us run @everywhere using <SomePackage> first, so it looks like it has to do with precompilation, but I don't think it does. If I literally run @everywhere 1+2 first, then do all my imports, the imports are nice and fast — but only after 1+2 finishes, which takes forever.

This is a real killer for my use case, which involves scaling up to thousands of processors and will waste (and has already wasted for me) thousands of CPU-hours just running that first @everywhere statement.
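A minimal way to reproduce the observation above (a sketch, not from the original post; Dates is used only as a stand-in for a real package import):

```
using Distributed
addprocs(8)

@time @everywhere 1 + 2          # slow: the first remote eval pays the per-worker cost
@time @everywhere using Dates    # fast once the first @everywhere has completed
```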

@moble
Contributor

moble commented Aug 10, 2021

Also note that @affans reported that this was a regression, with Julia 1.0.5 running very quickly and 1.3 running very slowly.

@KristofferC
Sponsor Member

If possible, it would be good to run a bisect between Julia 1.0 and 1.3 to find out whether a specific commit caused the regression.

@LarkAnspach

Yes

@vancleve

vancleve commented Sep 7, 2021

I've also noticed this on Julia 1.6.2, and it's not just multi-node environments. When I am on a 128-core AMD machine and run @everywhere using pkgs, I notice in top that soon only a handful of Julia processes are using any CPU at all (~10% or so) and only one of them is actually running. Which process is running changes until the @everywhere using finally completes. This happens on multi-node systems too, except it's one node at a time with a handful of processes at low CPU and one process running.

I have a video of what this looks like on a single node here:
https://youtu.be/mTar7HvIMQo

@moble
Contributor

moble commented Sep 7, 2021

Workaround described here. Basically, you have to precompile the code that lets processes talk to each other.
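For reference, a hedged sketch of what such a workaround can look like; it assumes PackageCompiler.jl and uses made-up file names, so treat the linked post as the authoritative recipe:

```
# Step 1 (shell): record the precompile statements emitted by a trivial
# Distributed session on the primary process:
#   julia --trace-compile=dist_trace.jl -e 'using Distributed; addprocs(1); @everywhere 1+1'

# Step 2: bake those statements into a custom system image.
using PackageCompiler
create_sysimage(["Distributed"];
                sysimage_path = "sys_dist.so",
                precompile_statements_file = "dist_trace.jl")

# Step 3 (shell): start Julia with that image, e.g.
#   julia --sysimage sys_dist.so my_script.jl
# Depending on the cluster manager, the workers may also need the image,
# e.g. addprocs(...; exeflags = "--sysimage=sys_dist.so").
```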

@vancleve

vancleve commented Sep 8, 2021

Thanks @moble! I guess what I wonder is why the precompilation problem seems to be worse with many more processors. In other words, why aren't they all just precompiling simultaneously? (The video makes it look like they're doing it almost one by one.)

@moble
Contributor

moble commented Sep 9, 2021

I don't actually know how things work under the hood, but I have tested it and found that the timing increases linearly with the number of processes. So my mental model is that @everywhere involves the primary process sending the instruction to workers and waiting for some type of confirmation that the instruction was received — or at least that sending has started. (I don't know whether it's an actual receipt confirmation, the opening of the socket, the creation of that worker's log, or the beginning of deserialization...)

But the primary must do this in serial to some extent, meaning it doesn't start sending the instruction to the next worker until it has whatever confirmation it needs. Normally this wouldn't be a problem, because confirmation is presumably almost instantaneous the great majority of the time. But compiling the code required to confirm takes ~1 second. And that's the part that has to happen on each worker in serial. That 1 second is not used in compiling the statement itself (which I know because I've tried statements that take much longer to compile); it must be just some piece of code required to let the primary know it got the message.

By precompiling something as simple as @everywhere 1+1, each worker can skip that step of compiling the confirming function, so the primary can move on to the next worker more quickly. And that's exactly what KristofferC has added/re-enabled in #42156.
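A rough way to check the linear scaling described above (a sketch; the worker counts are arbitrary):

```
using Distributed

for n in (2, 4, 8, 16)
    ps = addprocs(n)
    t = @elapsed @everywhere 1 + 1   # fresh workers, so each run pays the first-eval cost
    println("workers = $n  first @everywhere: $(round(t; digits = 2)) s")
    rmprocs(ps)
end
```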

@vancleve

maybe the precompilation in #42156 didn't fix this issue?

#42156 (comment)

maybe there is some other code that needs to be precompiled on the worker end and isn't covered by just calling @everywhere 1+1 or the other lines in generate_precompile.jl?

@moble
Contributor

moble commented Sep 19, 2021

Maybe. I don't know how julia's own build process works; maybe the image you used isn't being built with multiple processes.

Also, I'll point out that in the workaround I linked above, I actually used the --trace-compile flag on both the primary julia process and the worker process, then combined the outputs in case the worker's output wasn't a subset of the primary's. (I didn't actually check whether it was or not.) I don't know whether or not julia does this when building itself.
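A sketch of that two-sided tracing, with assumed file names (note that several workers writing to one trace file may interleave, so a single local worker or per-worker files are safer):

```
# Primary started as:  julia --trace-compile=primary_trace.jl warmup.jl
using Distributed

# Ask the worker to emit its own trace; exeflags is an addprocs keyword,
# the file name is just an assumption for this sketch.
addprocs(1; exeflags = `--trace-compile=worker_trace.jl`)
@everywhere 1 + 1
rmprocs(workers())

# Afterwards (shell): combine both traces for sysimage building:
#   cat primary_trace.jl worker_trace.jl | sort -u > combined_trace.jl
```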

@carlocastoldi

Is it possible that this bug also affects MPI.jl?
I have built a framework that revolves around MPI calls to synchronize I/O operations over all HPC nodes. With a few nodes (e.g. 4) it works like a charm, but as soon as I go up to 50-100 nodes it just becomes unbearable.

The @everywhere calls don't seem to be the cause this time, since I'm using @moble's trick, but I'm now starting to think it's the MPI calls' fault.
I tried adding the MPI calls I'm using to the precompilation, but it doesn't seem to work.
I did it by adding :MPI to the create_sysimage() call in precompile.jl. Then in precompile_everywhere.jl I wrote the calls I use:

# all precompile(...) calls
function main()
    MPI.Initialized()
    MPI.Init(threadlevel=:multiple)
    base_comm = MPI.COMM_WORLD
    print(" comm size: $(MPI.Comm_size(base_comm)) ---")
    base_grp = MPI.Comm_group(base_comm)
    id_group = MPI.Group_incl(base_grp, Int32[0])
    comm = MPI.Comm_create_group(base_comm, id_group, 42)
    MPI.Barrier(base_comm)
    fh = MPI.File.open(comm, "path/to/file/foo.bar"; append=true, write=true, create=true)
    MPI.File.seek_shared(fh, 0)
    MPI.File.write_ordered(fh, Int32[42,420])
    MPI.File.write_at_all(fh, 1, Int32[69])
    MPI.File.write_at(fh, 0, Int32[1])
    close(fh)
end

main()

And then I just execute julia precompile.jl precompile.
Do you have any idea what could be going on? It seems like even doing MPI.Init() on 100 nodes takes ~1 hour...

@giordano
Contributor

giordano commented Dec 9, 2021

Is it possible that this bug also affects MPI.jl?

Unlikely, since MPI.jl has nothing to do with the Distributed standard library. You may want to report the issue to the MPI.jl repository, but you have to provide more details about your system.

@carlocastoldi

Unlikely, since MPI.jl has nothing to do with the Distributed standard library. You may want to report the issue to the MPI.jl repository, but you have to provide more details about your system.

Sure, thank you. I'm now investigating it so that I have more information about it.

@ViralBShah ViralBShah added the domain:parallelism Parallel or distributed computation label Mar 13, 2022
KristofferC pushed a commit that referenced this issue Mar 23, 2022
* avoid using `@sync_add` on remotecalls

It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which
in turn calls wait() for all the futures synchronously. Not only is that
slightly detrimental to network operations (latencies add up), but in the case of
Distributed the call to wait() may actually cause some compilation on the remote
processes, which is also wait()ed for. As a result, some operations took a great
amount of "serial" processing time if executed on many workers at once.

For me, this closes #44645.

The major change can be illustrated as follows: First add some workers:

```
using Distributed
addprocs(10)
```

and then trigger something that, for example, causes package imports on the
workers:

```
using SomeTinyPackage
```

In my case (importing UnicodePlots on 10 workers), this improves the loading
time over 10 workers from ~11s to ~5.5s.

This is a far bigger issue when worker count gets high. The time of the
processing on each worker is usually around 0.3s, so triggering this problem
even on a relatively small cluster (64 workers) causes a really annoying delay,
and running `@everywhere` for the first time on reasonable clusters (I tested
with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks.

Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s,
and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't
bother to measure that precisely now, sorry) to ~11s.

Related issues:
- Probably fixes #39291.
- #42156 is kind of complementary -- it removes the most painful source of
  slowness (the 0.3s precompilation on the workers), but the fact that the
  wait()ing is serial remains a problem if the network latencies are high.

May help with #38931

Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>
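The gist of the change can be illustrated with user-level code (this is only an illustration of the idea, not the actual Distributed internals):

```
# Before the fix, each remote eval was effectively waited on one after another,
# so per-worker compilation and network latency added up. Launching all remote
# calls first and only then waiting lets those costs overlap:
using Distributed
addprocs(4)

expr = :(1 + 1)                                   # stand-in for an @everywhere body
futures = [remotecall(Core.eval, p, Main, expr) for p in workers()]
foreach(wait, futures)                            # workers are already busy; waits overlap
```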
KristofferC pushed a commit that referenced this issue Mar 25, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue Apr 20, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue May 23, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue May 23, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue Jul 4, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue Dec 21, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
staticfloat pushed a commit that referenced this issue Dec 23, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
vchuravy pushed a commit to JuliaLang/Distributed.jl that referenced this issue Oct 6, 2023
* avoid using `@sync_add` on remotecalls (cherry picked from commit 3b57a49)
Keno pushed a commit that referenced this issue Jun 5, 2024
* avoid using `@sync_add` on remotecalls