Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial implementation of subworlds in CoreNEURON #2185

Merged
merged 28 commits into from
Jan 31, 2023

Conversation

iraikov
Copy link
Contributor

@iraikov iraikov commented Jan 24, 2023

This PR contains initial support for subworlds in CoreNEURON per Slack discussion on 1/23/2023.

@iraikov iraikov marked this pull request as draft January 24, 2023 15:01
Copy link
Member

@iomaganaris iomaganaris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello Ivan!
Thanks a lot for your really nice contribution.
I took some time to skim through your PR to understand the changes and left some comments/suggestions.
I'm trying to follow the slack conversation as well but if you have any other question or need and clarification feel free to post it here as well 👍

src/nrnmpi/nrnmpi.cpp Outdated Show resolved Hide resolved
src/nrnmpi/nrnmpi.cpp Outdated Show resolved Hide resolved
src/nrnmpi/bbsmpipack.cpp Outdated Show resolved Hide resolved
src/oc/nrnmpi.h Outdated Show resolved Hide resolved
src/nrnmpi/nrnmpi.cpp Show resolved Hide resolved
test/coreneuron/test_subworlds.py Show resolved Hide resolved
src/coreneuron/apps/corenrn_parameters.hpp Outdated Show resolved Hide resolved
@nrnhines
Copy link
Member

I'm not experienced in working on a pull request originating on another repository. I followed the "Merging via command line" Step1 instructions but because this pr is currently out of date with the master, a successful pull needed a merge or rebase.
I chose:

git pull --rebase git@github.com:iraikov/nrn.git iraikov/coreneuron_subworlds

but I suppose this means getting my changes back to the pr will require more or less a manual patch.

@iraikov
Copy link
Contributor Author

iraikov commented Jan 25, 2023

The definition of the nrn2core_subworld_info_ function pointer is in the CoreNeuron version of nrnmpi.cpp, but it looks like this file is not compiled when MPI is disabled. This results in the error "Could not get symbol nrn2core_subworld_info_ from CoreNEURON" when CoreNEURON is enabled and psolve is invoked.

Does anybody have any advice where to place it?

@nrnhines
Copy link
Member

Does anybody have any advice where to place it?

Looking into this.

@iraikov
Copy link
Contributor Author

iraikov commented Jan 25, 2023

I'm not experienced in working on a pull request originating on another repository. I followed the "Merging via command line" Step1 instructions but because this pr is currently out of date with the master, a successful pull needed a merge or rebase. I chose:

git pull --rebase git@github.com:iraikov/nrn.git iraikov/coreneuron_subworlds

but I suppose this means getting my changes back to the pr will require more or less a manual patch.

@nrnhines I have checked the option that is supposed to give you write access to the branch. You should be able to push commits to the branch, but if not I can add you manually.

@nrnhines
Copy link
Member

looking into this

I think that since

void (*nrn2core_subworld_info_)(int&, int&, int&);

args and return do not depend on mpi.h, it can be declared just about anywhere that is always compiled and become in nrnmpi.cpp

extern void (*nrn2core_subworld_info_)(int&, int&, int&);

adding comment about nrnmpi_subworld_change_cnt;
moving declaration of nrn2core_subworld_info_ to src/coreneuron/utils/utils.cpp;
set values of myid_bbs and numprocs_bbs on all ranks
adding corenrn subworlds test to ctest config;
somewhat more robust corenrn subworlds
@nrnhines
Copy link
Member

In at least one run (perhaps many?) I see the failure in external-nrntest

run with -> celltype=1  <-
run with mpi -> celltype=1  <-

This is also occurring on my desktop. Perhaps it has something to do with a recent change to nrntests

hines@hines-t7500:~/neuron/corenrn-subworlds/external/tests/nrntest$ git status
HEAD detached at aa899ce

in external/tests/nrntest/nrniv/Parallel/recv

$ git log test.sh
commit 1d283f3e72fba9fd1dda3a81e8ccd7616b49b559
Author: Alexandru Săvulescu <alexandru.savulescu@epfl.ch>
Date:   Wed Nov 24 16:11:06 2021 +0100

    enable more parallel testing (#25)
    
    * make `nrniv/Parallel/recv` CI enabled

@iraikov
Copy link
Contributor Author

iraikov commented Jan 26, 2023

In at least one run (perhaps many?) I see the failure in external-nrntest

run with -> celltype=1  <-
run with mpi -> celltype=1  <-

This is also occurring on my desktop. Perhaps it has something to do with a recent change to nrntests

hines@hines-t7500:~/neuron/corenrn-subworlds/external/tests/nrntest$ git status
HEAD detached at aa899ce

in external/tests/nrntest/nrniv/Parallel/recv

$ git log test.sh
commit 1d283f3e72fba9fd1dda3a81e8ccd7616b49b559
Author: Alexandru Săvulescu <alexandru.savulescu@epfl.ch>
Date:   Wed Nov 24 16:11:06 2021 +0100

    enable more parallel testing (#25)
    
    * make `nrniv/Parallel/recv` CI enabled

I see this test hangs on my computer, and I suspect it is possibly caused by the changed semantics of nhost_bbs and myid_bbs. It looks like calls to context somehow are not sent to the workers.

@nrnhines
Copy link
Member

this test hangs on my computer, and I suspect it is possibly caused by the changed semantics of nhost_bbs and myid_bbs.

That is worth exploring. I thought the issue was something else but my machine does successfully do ctest -R external_nrntest

@iraikov
Copy link
Contributor Author

iraikov commented Jan 26, 2023

this test hangs on my computer, and I suspect it is possibly caused by the changed semantics of nhost_bbs and myid_bbs.

That is worth exploring. I thought the issue was something else but my machine does successfully do ctest -R external_nrntest

@nrnhines The reason for the hang on my system is that when runworker is called, the worker processes (myid > 0) all execute BBS::subworld_worker_execute, where they invoke nrnmpi_int_broadcast and wait. At the same time, the master calls context and then take to obtain results, but somehow the call to context does not result in the workers receiving the call, so they are stuck in the nrnmpi_int_broadcast. I do not understand enough about the BBS internals to know the correct sequence that should occur. But it looks like the change in numprocs_bbs myid_bbs is breaking context in the default case, when subworlds is not called.

@nrnhines
Copy link
Member

nrnhines commented Jan 26, 2023

The reason for the hang

Ah yes. When there are subworlds some code relies on nrnmpi_myid_bbs == -1 .
That code is probably reformulatable by looking at both the new nrnmpi_myid_bbs and nrnmpi_myid
The comment in subworld.cpp is interesting:

void BBSImpl::subworld_worker_execute() {
    // execute the same thing that execute_worker is executing. This
    // is done for all the nrnmpi_myid_bbs == -1 workers associated with
    // the specific nrnmpi_myid == 0 with nrnmpi_myid_bbs >= 0.   
    // All the nrnmpi/mpispike.cpp functions can be used since the
    // proper communicators for a subworld are used by those functions.
    // The broadcast functions are particularly useful and those are
    // how execute_worker passes messages into here.

I will review in context any substantive use of nrnmpi_myid_bbs. I.e.

ocbbs.cpp:246:    return nrnmpi_myid_bbs;
bbsdirectmpi.cpp:191:    BBSDirectServer::server_->post_todo(parentid, nrnmpi_myid_bbs, sendbuf_);
bbs.cpp:57:        BBSImpl::is_master_ = (nrnmpi_myid_bbs == 0) ? true : false;
bbs.cpp:431:        if (nrnmpi_myid_bbs == -1) {  // wait for message from

@nrnhines
Copy link
Member

nrnhines commented Jan 27, 2023

I think it will be best to return to the former '-1' values for nrnmpi_myid_bbs and nrnmpi_nhost_bbs. This can be revisited after this pr is merged. Instead, for purposes of filling the arguments in nrn/src/nrniv/nrncore_write/callbacks/nrncore_callbacks.cpp
void nrn2core_subworld_info(int& cnt, int& subworld_index, int& subworld_rank) {
try using something like

subworld_index = nrnmpi_myid_world/nrnmpi_numprocs;

or else copy the present calculation of nrnmpi_myid_bbs into another global variable and then for nrnmpi_myid != 0 set
nrnmpi_myid_bbs =-1 and nrnmpi_nhost_bbs = -1.

By the way, I realize that calling pc.subworlds(n) multiple times is useless from the point of view of the bulletin board. After calling pc.runworker you are locked in to whatever subworld organization there was.

@iraikov
Copy link
Contributor Author

iraikov commented Jan 27, 2023

I think it will be best to return to the former '-1' values for nrnmpi_myid_bbs and nrnmpi_nhost_bbs. This can be revisited after this pr is merged. Instead, for purposes of filling the arguments in nrn/src/nrniv/nrncore_write/callbacks/nrncore_callbacks.cpp void nrn2core_subworld_info(int& cnt, int& subworld_index, int& subworld_rank) { try using something like

subworld_index = nrnmpi_myid_world/nrnmpi_numprocs;

or else copy the present calculation of nrnmpi_myid_bbs into another global variable and then for nrnmpi_myid != 0 set nrnmpi_myid_bbs =-1 and nrnmpi_nhost_bbs = -1.

Ok, will do.

By the way, I realize that calling pc.subworlds(n) multiple times is useless from the point of view of the bulletin board. After calling pc.runworker you are locked in to whatever subworld organization there was.

I thought it was possible to call pc.done and then subworlds again, but I agree this is a fairly contrived scenario.

@codecov-commenter
Copy link

codecov-commenter commented Jan 27, 2023

Codecov Report

Merging #2185 (5301035) into master (aa19bcb) will increase coverage by 0.00%.
The diff coverage is 84.09%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@           Coverage Diff           @@
##           master    #2185   +/-   ##
=======================================
  Coverage   55.76%   55.77%           
=======================================
  Files         616      616           
  Lines      123975   124017   +42     
=======================================
+ Hits        69133    69168   +35     
- Misses      54842    54849    +7     
Impacted Files Coverage Δ
src/coreneuron/utils/utils.cpp 60.00% <ø> (ø)
src/parallel/bbs.cpp 73.96% <33.33%> (-0.47%) ⬇️
src/nrnmpi/nrnmpi.cpp 72.79% <58.33%> (-1.40%) ⬇️
src/coreneuron/mpi/lib/nrnmpi.cpp 64.94% <100.00%> (+9.10%) ⬆️
...rniv/nrncore_write/callbacks/nrncore_callbacks.cpp 95.46% <100.00%> (+0.05%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@iraikov iraikov marked this pull request as ready for review January 27, 2023 22:27
@iraikov
Copy link
Contributor Author

iraikov commented Jan 27, 2023

The NEURON CI checks are now passing, but the neuronsimulator.nrn checks did not pass and I do not understand the error.

@ramcdougal
Copy link
Member

The CI failures have nothing to do with this pull request. We have the same issue in #2199 which is a typo fix for the documentation, but I have no idea why the CI is misbehaving.

@nrnhines
Copy link
Member

I'm guessing something about the system has changed on one of the CI runners. Our CI scripts for the CI resources seem to need occasional maintenance as the latter upgrades to new versions of software that we use.
This one is a mystery to me but @alexsavulescu often works wonders.
https://dev.azure.com/neuronsimulator/aa1fb98d-a914-45c3-a215-5e5ef1bd7687/_apis/build/builds/7827/logs/207

2023-01-27T22:20:35.1664148Z ##[section]Starting: GitHubComment
2023-01-27T22:20:35.1672416Z ==============================================================================
2023-01-27T22:20:35.1672724Z Task         : GitHub Comment
2023-01-27T22:20:35.1672888Z Description  : Write a comment to your Github entity i.e. issue or a Pull Request (PR)
2023-01-27T22:20:35.1673139Z Version      : 0.198.0
2023-01-27T22:20:35.1673294Z Author       : Microsoft Corporation
2023-01-27T22:20:35.1674178Z Help         : 
2023-01-27T22:20:35.1674324Z ==============================================================================
2023-01-27T22:20:35.7460734Z ##[error]Writing comment failed
2023-01-27T22:20:35.7479471Z ##[section]Finishing: GitHubComment

@alexsavulescu
Copy link
Member

As far as I can tell, it's related to permissions. Since this PR is incoming from a fork. Will look into it Monday

@olupton
Copy link
Collaborator

olupton commented Jan 30, 2023

As far as I can tell, it's related to permissions. Since this PR is incoming from a fork. Will look into it Monday

Note that the ci/gitlab/bbpgitlab.epfl.ch pipeline will also not run on PRs coming from forks.
It's not currently marked as required, so it won't block merging, but we will be blind to any issues specific to NVIDIA/Intel compilers, GPU execution, HPE-MPI, and so on.

src/coreneuron/mpi/lib/nrnmpi.cpp Show resolved Hide resolved
@alexsavulescu
Copy link
Member

As far as I can tell, it's related to permissions. Since this PR is incoming from a fork. Will look into it Monday

I've explored in dealing with permissions but with no success. So mitigation in place via #2201, this means that for fork PRs people need to manually get the Azure drop URL in case they need the build artifacts (wheels)

@iraikov
Copy link
Contributor Author

iraikov commented Jan 30, 2023

As far as I can tell, it's related to permissions. Since this PR is incoming from a fork. Will look into it Monday

I've explored in dealing with permissions but with no success. So mitigation in place via #2201, this means that for fork PRs people need to manually get the Azure drop URL in case they need the build artifacts (wheels)

Thank you for looking into this! Is there anything I can do on my end to ensure that the CI-related checks pass?

@alexsavulescu
Copy link
Member

Thank you for looking into this! Is there anything I can do on my end to ensure that the CI-related checks pass?

No special consideration required for that job anymore.

@nrnhines nrnhines merged commit d2a76ab into neuronsimulator:master Jan 31, 2023
@iraikov iraikov deleted the iraikov/coreneuron_subworlds branch January 31, 2023 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants