[BUG] RuntimeError during running spleen_ct_segmentation_sim and spleen_ct_segmentation_local #2427

Closed
KumoLiu opened this issue Mar 20, 2024 · 2 comments · Fixed by #2458
Labels
bug Something isn't working

Comments

@KumoLiu
Contributor

KumoLiu commented Mar 20, 2024

2024-03-20 14:37:59,155 - ClientTaskWorker - INFO - Clean up ClientRunner for : site-1 
2024-03-20 14:37:59,157 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 Not Connected] is closed PID: 71655
2024-03-20 14:37:59,157 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00004 Not Connected] is closed PID: 71550
2024-03-20 14:37:59,401 - CoreCell - ERROR - site-1.simulate_job.0: error stopping Communicator: RuntimeError: cannot join current thread
2024-03-20 14:37:59,402 - CoreCell - ERROR - Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvflare/fuel/f3/cellnet/core_cell.py", line 899, in stop
    self.communicator.stop()
  File "/usr/local/lib/python3.10/dist-packages/nvflare/fuel/f3/communicator.py", line 84, in stop
    self.conn_manager.stop()
  File "/usr/local/lib/python3.10/dist-packages/nvflare/fuel/f3/sfm/conn_manager.py", line 155, in stop
    self.frame_mgr_executor.shutdown(True)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 235, in shutdown
    t.join()
  File "/usr/lib/python3.10/threading.py", line 1093, in join
    raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread

2024-03-20 14:37:59,765 - SubWorkerExecutor - INFO - SubWorkerExecutor process shutdown.
2024-03-20 14:38:00,090 - SubWorkerExecutor - INFO - SubWorkerExecutor process shutdown.
2024-03-20 14:38:00,417 - SimulatorServer - INFO - Server app stopped.

The run command:
nvflare simulator /opt/toolkit/tutorials/fl/spleen_ct_segmentation_sim/job_multi_gpu --workspace sim_spleen_ct_seg --threads 1 --n_clients 1

nvflare version: 2.4.1rc1
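
For context, the traceback ends in `ThreadPoolExecutor.shutdown(True)` calling `t.join()` on the thread that is currently running, which suggests the stop call was made from one of the executor's own worker threads. The snippet below is a minimal standalone sketch of that mechanism using only the Python standard library. It is not NVFlare code, and the function names `reproduce_self_join` and `stop_from_worker` are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def reproduce_self_join():
    """Call shutdown(wait=True) from inside one of the pool's own workers.

    Returns the message of the RuntimeError raised by Thread.join(),
    or None if no error was raised.
    """
    executor = ThreadPoolExecutor(max_workers=1)

    def stop_from_worker():
        # This runs on the pool's worker thread, so shutdown(True) ends up
        # trying to join the thread that is executing this very call.
        try:
            executor.shutdown(True)
        except RuntimeError as exc:
            return str(exc)
        return None

    message = executor.submit(stop_from_worker).result()
    # Joining from the main thread is fine and lets the worker exit cleanly.
    executor.shutdown(wait=True)
    return message


if __name__ == "__main__":
    print(reproduce_self_join())
```

If this matches what happens in `conn_manager.stop()`, the error would be a side effect of where the stop is triggered from rather than a failure of the job itself.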

@KumoLiu KumoLiu added the bug Something isn't working label Mar 20, 2024
@YuanTingHsieh YuanTingHsieh self-assigned this Mar 22, 2024
@YuanTingHsieh
Collaborator

I'll update the tutorials inside the MONAI repo.

@YuanTingHsieh
Collaborator

This comes from the MONAI Toolkit example.

The real issue is that our PTMultiProcessExecutor does not shut down completely gracefully.

I ran the test and found that the example starts and finishes.
Only some error messages are printed at the end.

@yhwen I think this means we are not shutting down PTMultiProcessExecutor gracefully. We can see whether there is something we can do there to improve it.

Error message:

2024-03-28 07:03:11,503 - FederatedClient - INFO - Shutting down client run: site-1
2024-03-28 07:03:11,503 - CoreCell - WARNING - [ME=site-1.:ate_job O=? D=site-1.:ate_job.1 F=? T=? CH=client_sub_worker_command TP=fire_event SCH=? STP=? SEQ=?] no connection to child site-1.simulate_job.0
2024-03-28 07:03:11,504 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=cross_site_model_eval]: asked to abort - triggered abort_signal to stop the RUN
2024-03-28 07:03:11,504 - CoreCell - ERROR - [ME=site-1.:ate_job O=site-1.:ate_job D=site-1.:ate_job.0 F=site-1.:ate_job T=site-1.:ate_job.0 CH=client_sub_worker_command TP=fire_event SCH=? STP=? SEQ=?] cannot send to 'site-1.simulate_job.0': target_unreachable
2024-03-28 07:03:11,504 - CoreCell - WARNING - [ME=site-1.:ate_job O=? D=site-1.:ate_job.1 F=? T=? CH=client_sub_worker_command TP=fire_event SCH=? STP=? SEQ=?] no connection to child site-1.simulate_job.1
2024-03-28 07:03:11,504 - CoreCell - ERROR - [ME=site-1.:ate_job O=site-1.:ate_job D=site-1.:ate_job.1 F=site-1.:ate_job T=site-1.:ate_job.1 CH=client_sub_worker_command TP=fire_event SCH=? STP=? SEQ=?] cannot send to 'site-1.simulate_job.1': target_unreachable
2024-03-28 07:03:11,505 - ClientTaskWorker - INFO - Clean up ClientRunner for : site-1 
2024-03-28 07:03:11,507 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00004 Not Connected] is closed PID: 25591
2024-03-28 07:03:11,507 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 Not Connected] is closed PID: 25646
2024-03-28 07:03:11,653 - CoreCell - ERROR - site-1.simulate_job.0: error stopping Communicator: RuntimeError: cannot join current thread
2024-03-28 07:03:11,655 - CoreCell - ERROR - Traceback (most recent call last):
  File "/my_home/NVFlare/nvflare/fuel/f3/cellnet/core_cell.py", line 899, in stop
    self.communicator.stop()
  File "/my_home/NVFlare/nvflare/fuel/f3/communicator.py", line 84, in stop
    self.conn_manager.stop()
  File "/my_home/NVFlare/nvflare/fuel/f3/sfm/conn_manager.py", line 155, in stop
    self.frame_mgr_executor.shutdown(True)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 235, in shutdown
    t.join()
  File "/usr/lib/python3.10/threading.py", line 1093, in join
    raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread


2024-03-28 07:03:11,696 - CoreCell - ERROR - site-1.simulate_job.1: error stopping Communicator: RuntimeError: cannot join current thread
2024-03-28 07:03:11,697 - CoreCell - ERROR - Traceback (most recent call last):
  File "/my_home/NVFlare/nvflare/fuel/f3/cellnet/core_cell.py", line 899, in stop
    self.communicator.stop()
  File "/my_home/NVFlare/nvflare/fuel/f3/communicator.py", line 84, in stop
    self.conn_manager.stop()
  File "/my_home/NVFlare/nvflare/fuel/f3/sfm/conn_manager.py", line 155, in stop
    self.frame_mgr_executor.shutdown(True)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 235, in shutdown
    t.join()
  File "/usr/lib/python3.10/threading.py", line 1093, in join
    raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread


2024-03-28 07:03:11,848 - SubWorkerExecutor - INFO - SubWorkerExecutor process shutdown.
2024-03-28 07:03:12,152 - SubWorkerExecutor - INFO - SubWorkerExecutor process shutdown.
2024-03-28 07:03:12,844 - SimulatorServer - INFO - Server app stopped.
2024-03-28 07:03:13,008 - nvflare.fuel.hci.server.hci - INFO - Admin Server localhost on Port 60259 shutdown!
2024-03-28 07:03:13,009 - SimulatorServer - INFO - shutting down server
2024-03-28 07:03:13,009 - SimulatorServer - INFO - canceling sync locks
2024-03-28 07:03:13,009 - SimulatorServer - INFO - server off
2024-03-28 07:03:13,355 - MPM - WARNING - #### MPM: still running thread Thread-2 (monitor_parent_process)
2024-03-28 07:03:13,355 - MPM - INFO - MPM: Good Bye!
2024-03-28 07:03:13,659 - MPM - WARNING - #### MPM: still running thread Thread-2 (monitor_parent_process)
2024-03-28 07:03:13,660 - MPM - INFO - MPM: Good Bye!
2024-03-28 07:03:16,513 - MPM - INFO - MPM: Good Bye! 
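
One possible way to avoid the self-join seen in the tracebacks is to skip the blocking wait when the stop call arrives on one of the executor's own worker threads. The sketch below is only an illustration under assumptions: it is not the change made in the linked fix, which this thread does not describe, and `shutdown_safely` and the `WORKER_PREFIX` value are hypothetical names. It relies on the standard `thread_name_prefix` argument of `ThreadPoolExecutor` to recognise worker threads.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical prefix; the executor must be created with this same value.
WORKER_PREFIX = "frame_mgr"


def shutdown_safely(executor, worker_prefix=WORKER_PREFIX):
    """Shut down the executor, waiting only when it is safe to do so.

    Returns True if the call waited for the workers to finish, and False
    if it was invoked from a worker thread and therefore did not wait.
    """
    called_from_worker = threading.current_thread().name.startswith(worker_prefix)
    # A worker cannot join itself, so only block when called from outside.
    executor.shutdown(wait=not called_from_worker)
    return not called_from_worker


if __name__ == "__main__":
    pool = ThreadPoolExecutor(max_workers=1, thread_name_prefix=WORKER_PREFIX)
    # Called from inside the pool: returns without joining.
    print(pool.submit(shutdown_safely, pool).result())
    # Called from the main thread: waits for the worker to exit.
    print(shutdown_safely(pool))
```

The trade-off is that a stop issued from a worker thread no longer guarantees that all workers have exited when it returns, so a later join from another thread may still be needed.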
