WIP: server_app josn export.

generate the job app config. fully functional pythonic job creation. Added simulator_run for pythonic API. reformat. Added filters support for pythonic job creation. handled the direct import case in fed_job. refactor. Added the resource_spec set function for FedJob. refactored. Moved the ClientApp and ServerApp into fed_app.py. Refactored: removed the _FilterDef class. refactored. Rename job config classes (#3) * rename config related classes * add client api example * fix metric streaming * add to() routine Enable obj in the constructor as paramenter. Added support for the launcher script. refactored. reformat. Update the comment. re-arrange the package location. Added add_ext_script() for BaseAppConfig. codestyle fix. Removed the client-api-pt example. removed no used import. fixed the in_time_accumulate_weighted_aggregator_test.py Added Enum parameter support. Added docstring. Fix typo (#2432) Enable StreamCell for all application channels (#2407) Add back request header (#2440) Check wandb login (#2445) * check wandb login * Use default wandb offline mode * add mode online check Add note about delay in workspace creation for larger jobs (#2454) Client API Update: Job Templates, examples to reflect different type of Client API (#2456) * 1. Update README 2. fix bugs on in-proc client API 3. update examples to use in-proc client api in cases make sense * 1. update documentation * 1. update job template description 2. update in process API to allow user keep the existing configuration 3. update notebooks for step-by-step sag * update README.md * remove task_fn_args argument in the executor * remove task_fn_args argument in the executor add controller interface (#2451) Update README.md (#2460) fix typo improve reliable msg (#2459) CC block byoc jobs (#2403) * WIP: tdx_cc integration. * fixed toke_file read. * WIP: added info for CC add client tokens.: * Fixed an error when client does not have CC token reported. * Added handle for client does not have CC_INFO. 
* Added CLIENT_QUIT event for CCManager to remove client token. * Added _add_client_token client token logging info. * Added peer_ctx for client quit. * set_peer_context for client quit. * Changed the AUTHORIZATION_REASON set_prop sticky to False. * WIP: TokenPundit interface change. * WIP: added cc_authorizer_ids config. * Added cc_issuer_id for CCManager. * renamed the TokenPundit to CCAutorizer. * Added CC token adding through client heartbeat. * Added function to stop current running job if CC verify fail. * if CC failed to get toke, don't allow the system to start. * Added exceptions None check. * Address the client side CC check before job scheduled. * fixed the PEER_FL_CONTEXT error. * Added CCManager support to have multiple cc_issuers. * optimized CCManager. * updated the _verify_participants() logic. * set up the proper fl_ctx for admin send_requests(). * Add proper fl_ctx. * Refactor the CCManager. * Refactor the CCManager and TDX_authorizer. * Added TOKEN_EXPIRATION for each cc_issue in CCManager. * Fixed CC TOKEN_EXPIRATION error. * refactor the CCManager _prepare_cc_info() * Refactor. * refactor the cc tokens periodic verification. * added critical_level for CCManager. * codestyle fix. * removed no used import. * removed no use import. * Fixed the unitest. * Added CCManager unit tests. * Added CCTokenGenerateError and CCTokenVerifyError. Updated CCAuthorizer interface. * WIP: CC block byoc job. * block BYOC job for CC. * Addressed some PR reviews. * Added exception catch for TDXAuthorizer. * codestyle fix. * renamed some events. * renamed event names. * renamed event names. --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Fixed the authz and site_security check for check_resource command. 
(#2462) Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> add garbage collect at ends of round-based workflows (#2463) add WFController (#2468) Add warning when the same admin in project.yml has different role Add custom order and early termination to CyclicController (#2387) * Add custom order and early termination to CyclicController and add tests * Add more error handling Add IPC agent and exchanger (#2435) * support av ipc agent * removed unused import * address PR comments fix typo (#2473) Refactor WFController and ModelController (#2475) * refactor wf and model controller * clarify persisor_id Add example for mulitparty kaplan-meier analysis with HE (#2259) * add example for mulitparty kaplan meier analysis with HE * update requirements * update baseline script, remove complex settings and keep basic only * add readme with details * add readme with details * add curves, modify saving functions (curve and km details) * job name update * remove redundant print * move data preparation part out of local code * move HE context part out of FL process to better accomodate the transition to real application * update to use new controller interface * change to send_model_and_wait * format * updated readme * fix merge conflict * update readme * update readme * update readme * update readme * move to job template --------- Co-authored-by: Sean Yang <seany314@gmail.com> remove old task_fn_args (#2479) Enable simulator to run HE (#2339) * Enable simulator to run HE. * fixed the unittest. * Created startup folder for simulator run if not exist. * Changed to use setup and teardown for pytest. * extract common codes init_security_content_service(). * removed no use import. 
--------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> not creating Workspace object (#2489) Fix xgboost integration tests (#2486) * change to use path * update finance and vertical xgboost Added ability to handle parameters from base class. Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor. Added params_exchange_format for PTInProcessClientAPIExecutor. codestyle fix. Fixed a custom code folder structure issue. work for sub-folder custom files. backed to handle parameters from base classes. Support folder structure job config. Added support for flat folder from '.XXX' import. codestyle fix. refactored and add docstring. Add FedBPT research example (#2465) * Add FedBPT research example initial fedbpt files add roberta model and run FL move send to end upgrade to 2.4.1rc and run experiment with 10 clients move init to top debug using pickle record successful setting use custom decomposer clean code add summary writer add result figure formatting fix broken links remove debug messages update readme with system resources use decomposer widget on server * address comments; enable selection of evaluation client * use new FedAvg api * exclude dir from license test * only exclude file for license check fix xgboost test setup (#2494) add Client API documentation (#2497) * add Client API documentation * add Client API documentation Added more logging for the job status changing. (#2480) * Added more logging for the job status changing. * Fixed a logging call error. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Fix update client status (#2508) * check workflow id before updating client status * change order of checks Address some of the PR reviews. 
Rename job config classes (#3) * rename config related classes * add client api example * fix metric streaming * add to() routine run demo run demo set gpus and external scripts move FedJob api change folder structure xval example xval example reuse code add filter example minor updates update job dir refactor Controller/ExcecutorApps hide ControllerApp/ExecutorApp fix doubled deploy call handle filters handle cross-site val add swarm example (wip) Add user guide on how to deploy to EKS (#2510) * Add user guide on how to deploy to EKS * Address comments Improve dead client handling (#2506) * dev * test dead client cmd * added more info for dead client tracing * remove unused imports * fix unit test * fix test case * address PR comments --------- Co-authored-by: Sean Yang <seany314@gmail.com> Enhance WFController (#2505) * set flmodel variables in basefedavg * make round info optional, fix inproc api bug temporarily disable preflight tests (#2521) Upgrade dependencies (#2516) Use full path for PSI components (#2437) (#2517) Multiple bug fixes from 2.4 (#2518) * [2.4] Support client custom code in simulator (#2447) * Support client custom code in simulator * Fix client custom code * Remove cancel_futures args (#2457) * Fix sub_worker_process shutdown (#2458) * Set GRPC_ENABLE_FORK_SUPPORT to False (#2474) Pythonic job creation (#2483) * WIP: constructed the FedJob. * WIP: server_app josn export. * generate the job app config. * fully functional pythonic job creation. * Added simulator_run for pythonic API. * reformat. * Added filters support for pythonic job creation. * handled the direct import case in fed_job. * refactor. * Added the resource_spec set function for FedJob. * refactored. * Moved the ClientApp and ServerApp into fed_app.py. * Refactored: removed the _FilterDef class. * refactored. 
* Rename job config classes (#3) * rename config related classes * add client api example * fix metric streaming * add to() routine * Enable obj in the constructor as paramenter. * Added support for the launcher script. * refactored. * reformat. * Update the comment. * re-arrange the package location. * Added add_ext_script() for BaseAppConfig. * codestyle fix. * Removed the client-api-pt example. * removed no used import. * fixed the in_time_accumulate_weighted_aggregator_test.py * Added Enum parameter support. * Added docstring. * Added ability to handle parameters from base class. * Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor. * Added params_exchange_format for PTInProcessClientAPIExecutor. * codestyle fix. * Fixed a custom code folder structure issue. * work for sub-folder custom files. * backed to handle parameters from base classes. * Support folder structure job config. * Added support for flat folder from '.XXX' import. * codestyle fix. * refactored and add docstring. * Address some of the PR reviews. 
--------- Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Enhancements from 2.4 (#2519) * Starts heartbeat after task is pull and before task execution (#2415) * Starts pipe handler heartbeat send/check after task is pull before task execution (#2442) * [2.4] Improve cell pipe timeout handling (#2441) * improve cell pipe timeout handling * improved end and abort handling * improve timeout handling --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * [2.4] Enhance launcher executor (#2433) * Update LauncherExecutor logs and execution setup timeout * Change name * [2.4] Fire and forget for pipe handler control messages (#2413) * Fire and forget for pipe handler control messages * Add default timeout value * fix wait-for-reply (#2478) * Fix pipe handler timeout in task exchanger and launcher executor (#2495) * Fix metric relay pipe handler timeout (#2496) * Rely on launcher check_run_status to pause/resume hb (#2502) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> --------- Co-authored-by: Yan Cheng <58191769+yanchengnv@users.noreply.github.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Update ci cd from 2.4 (#2520) * Update github actions (#2450) * Fix premerge (#2467) * Fix issues on hello-world TF2 notebook * Fix tf integration test (#2504) * Add client api integration tests --------- Co-authored-by: Isaac Yang <isaacy@nvidia.com> Co-authored-by: Sean Yang <seany314@gmail.com> WIP: constructed the FedJob. WIP: server_app josn export. generate the job app config. fully functional pythonic job creation. Added simulator_run for pythonic API. reformat. Added filters support for pythonic job creation. handled the direct import case in fed_job. refactor. Added the resource_spec set function for FedJob. refactored. 
Moved the ClientApp and ServerApp into fed_app.py. Refactored: removed the _FilterDef class. refactored. Rename job config classes (#3) * rename config related classes * add client api example * fix metric streaming * add to() routine Enable obj in the constructor as paramenter. Added support for the launcher script. refactored. reformat. Update the comment. re-arrange the package location. Added add_ext_script() for BaseAppConfig. codestyle fix. Removed the client-api-pt example. Rename job config classes (#3) * rename config related classes * add client api example * fix metric streaming * add to() routine run demo set gpus and external scripts move FedJob api change folder structure xval example xval example reuse code add filter example minor updates update job dir refactor Controller/ExcecutorApps hide ControllerApp/ExecutorApp fix doubled deploy call handle filters handle cross-site val add swarm example (wip) make FedJob2 default FedJob use ScriptExecutor test swarm learning add cyclic workflow add todo update swarm learning make FedJob2 default again use controller name for stats (#2522) Simulator workspace re-design (#2492) * Redesign simulator workspace structure. * working, needs clean. * Changed the simulator workspacce structure to be consistent with POC. * Moved the logfile init to start_server_app(). * optimzed. * adjust the stats pool location. * Addressed the PR views. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Simulator end run for all clients (#2514) * Provide an option to run END_RUN for all clients. * Added end_run_all option for simulator to run END_RUN event for all clients. * Fixed a add_argument type, added help message. * Changed to use add_argument(() compatible with python 3.8. * reformat. * rewrite the _end_run_clients() and add docstring for easier understanding. * reformat. * adjusting the locking in the _end_run_clients. 
* Fixed a potential None pointer error. * renamed the clients_finished_end_run variable. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Sean Yang <seany314@gmail.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Secure XGBoost Integration (#2512) * Updated FOBS readme to add DatumManager, added agrpcs as secure scheme * Refactoring * Refactored the secure version to histogram_based_v2 * Replaced Paillier with a mock encryptor * Added license header * Put mock back * Added metrics_writer back and fixed GRPC error reply use ScriptExecutor add kmeans example simplify job simulator_run to take only one workspace parameter. (#2528) test kmeans, use latest main fix kmeans some redesign address comments rename source dir Add missing client api test jobs (#2535) Fixed the simulator server workspace root dir (#2533) * Fixed the simulator server root dir error. * Added unit test for SimulatorRunner start_server_app. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Improve InProcessClientAPIExecutor (#2536) * 1. rename ExeTaskFnWrapper class to TaskScriptRunner 2. Replace implementation of the inprocess function exection from calling a main() function to user runpy.run_path() which reduce the user requirements to have main() function 3. redirect print() to logger.info() * 1. rename ExeTaskFnWrapper class to TaskScriptRunner 2. Replace implementation of the inprocess function exection from calling a main() function to user runpy.run_path() which reduce the user requirements to have main() function 3. 
redirect print() to logger.info() * make result check and result pull use the same configurable variable * rename exec_task_fn_wrapper to task_script_runner.py * fix typo remove use of uuid4 handle ids of built-in components expose aggregate_fn to users for overwriting (#2539) FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (#2537) * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job * 1. Remove the default code to use configuration 2. 
fix some broken notebook * rollback changes Fix decorator issue (#2542) FLModel summary (#2544) * add FLModel Summary * format Update KM example, add 2-stage solution without HE (#2541) * add KM without HE, update everything * fix license header * fix license header - update year to 2024 * fix format --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> handle cases where the script with relative path in Script Runner (#2543) * handle cases where the script with relative path * handle cases where the script with relative path * add more unit test cases and change the file search logics * code format * add more unit test cases and change the file search logics Lr newton raphson (#2529) * Implement federated logistic regression with second-order newton raphson. Update file headers. Update README. Update README. Fix README. Refine README. Update README. Added more logging for the job status changing. (#2480) * Added more logging for the job status changing. * Fixed a logging call error. 
--------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Fix update client status (#2508) * check workflow id before updating client status * change order of checks Add user guide on how to deploy to EKS (#2510) * Add user guide on how to deploy to EKS * Address comments Improve dead client handling (#2506) * dev * test dead client cmd * added more info for dead client tracing * remove unused imports * fix unit test * fix test case * address PR comments --------- Co-authored-by: Sean Yang <seany314@gmail.com> Enhance WFController (#2505) * set flmodel variables in basefedavg * make round info optional, fix inproc api bug temporarily disable preflight tests (#2521) Upgrade dependencies (#2516) Use full path for PSI components (#2437) (#2517) Multiple bug fixes from 2.4 (#2518) * [2.4] Support client custom code in simulator (#2447) * Support client custom code in simulator * Fix client custom code * Remove cancel_futures args (#2457) * Fix sub_worker_process shutdown (#2458) * Set GRPC_ENABLE_FORK_SUPPORT to False (#2474) Pythonic job creation (#2483) * WIP: constructed the FedJob. * WIP: server_app josn export. * generate the job app config. * fully functional pythonic job creation. * Added simulator_run for pythonic API. * reformat. * Added filters support for pythonic job creation. * handled the direct import case in fed_job. * refactor. * Added the resource_spec set function for FedJob. * refactored. * Moved the ClientApp and ServerApp into fed_app.py. * Refactored: removed the _FilterDef class. * refactored. * Rename job config classes (#3) * rename config related classes * add client api example * fix metric streaming * add to() routine * Enable obj in the constructor as paramenter. * Added support for the launcher script. * refactored. * reformat. * Update the comment. * re-arrange the package location. * Added add_ext_script() for BaseAppConfig. * codestyle fix. 
* Removed the client-api-pt example. * removed no used import. * fixed the in_time_accumulate_weighted_aggregator_test.py * Added Enum parameter support. * Added docstring. * Added ability to handle parameters from base class. * Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor. * Added params_exchange_format for PTInProcessClientAPIExecutor. * codestyle fix. * Fixed a custom code folder structure issue. * work for sub-folder custom files. * backed to handle parameters from base classes. * Support folder structure job config. * Added support for flat folder from '.XXX' import. * codestyle fix. * refactored and add docstring. * Address some of the PR reviews. --------- Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Enhancements from 2.4 (#2519) * Starts heartbeat after task is pull and before task execution (#2415) * Starts pipe handler heartbeat send/check after task is pull before task execution (#2442) * [2.4] Improve cell pipe timeout handling (#2441) * improve cell pipe timeout handling * improved end and abort handling * improve timeout handling --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * [2.4] Enhance launcher executor (#2433) * Update LauncherExecutor logs and execution setup timeout * Change name * [2.4] Fire and forget for pipe handler control messages (#2413) * Fire and forget for pipe handler control messages * Add default timeout value * fix wait-for-reply (#2478) * Fix pipe handler timeout in task exchanger and launcher executor (#2495) * Fix metric relay pipe handler timeout (#2496) * Rely on launcher check_run_status to pause/resume hb (#2502) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> --------- Co-authored-by: Yan Cheng <58191769+yanchengnv@users.noreply.github.com> 
Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Update ci cd from 2.4 (#2520) * Update github actions (#2450) * Fix premerge (#2467) * Fix issues on hello-world TF2 notebook * Fix tf integration test (#2504) * Add client api integration tests --------- Co-authored-by: Isaac Yang <isaacy@nvidia.com> Co-authored-by: Sean Yang <seany314@gmail.com> use controller name for stats (#2522) Simulator workspace re-design (#2492) * Redesign simulator workspace structure. * working, needs clean. * Changed the simulator workspacce structure to be consistent with POC. * Moved the logfile init to start_server_app(). * optimzed. * adjust the stats pool location. * Addressed the PR views. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Simulator end run for all clients (#2514) * Provide an option to run END_RUN for all clients. * Added end_run_all option for simulator to run END_RUN event for all clients. * Fixed a add_argument type, added help message. * Changed to use add_argument(() compatible with python 3.8. * reformat. * rewrite the _end_run_clients() and add docstring for easier understanding. * reformat. * adjusting the locking in the _end_run_clients. * Fixed a potential None pointer error. * renamed the clients_finished_end_run variable. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Sean Yang <seany314@gmail.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Secure XGBoost Integration (#2512) * Updated FOBS readme to add DatumManager, added agrpcs as secure scheme * Refactoring * Refactored the secure version to histogram_based_v2 * Replaced Paillier with a mock encryptor * Added license header * Put mock back * Added metrics_writer back and fixed GRPC error reply simplify job simulator_run to take only one workspace parameter. (#2528) Fix README. Fix file links in README. 
Fix file links in README. Add comparison between centralized and federated training code. Add missing client api test jobs (#2535) Fixed the simulator server workspace root dir (#2533) * Fixed the simulator server root dir error. * Added unit test for SimulatorRunner start_server_app. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Improve InProcessClientAPIExecutor (#2536) * 1. rename ExeTaskFnWrapper class to TaskScriptRunner 2. Replace implementation of the inprocess function exection from calling a main() function to user runpy.run_path() which reduce the user requirements to have main() function 3. redirect print() to logger.info() * 1. rename ExeTaskFnWrapper class to TaskScriptRunner 2. Replace implementation of the inprocess function exection from calling a main() function to user runpy.run_path() which reduce the user requirements to have main() function 3. redirect print() to logger.info() * make result check and result pull use the same configurable variable * rename exec_task_fn_wrapper to task_script_runner.py * fix typo Update README for launching python script. Modify tensorboard logdir. Link to environment setup instructions. expose aggregate_fn to users for overwriting (#2539) FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (#2537) * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. 
This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job * 1. Remove the default code to use configuration 2. fix some broken notebook * rollback changes Fix decorator issue (#2542) Remove line number in code link. FLModel summary (#2544) * add FLModel Summary * format formatting Update KM example, add 2-stage solution without HE (#2541) * add KM without HE, update everything * fix license header * fix license header - update year to 2024 * fix format --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * update license --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Holger Roth <hroth@nvidia.com> handle ids minor updates rename folder use default ids update kmeans add lightning example handle multiple GPUs make model selection metric configurable make model selection metric configurable add docstrings Add information about dig (bind9-dnsutils) in the document Update monai readme to remove logging.conf (#2552) MONAI mednist example (#2532) * add monai notebook * add training script * update example * update notebook * use job template * call init later * swith back * add gitignore * update notebooks * add readmes * send received model to GPU * use monai tb stats handler * formatting Improve AWS cloud launch script restore files reset file. Add docstring formatting
NVIDIA · May 6, 2024 · d2d1d4d
1 parent d94b23a
commit d2d1d4d
Showing 413 changed files with 22,655 additions and 2,085 deletions.
diff --git a/.github/workflows/blossom-ci.yml b/.github/workflows/blossom-ci.yml
@@ -74,7 +74,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Checkout code
-        uses: actions/checkout@v2
+        uses: actions/checkout@v4
         with:
           repository: ${{ fromJson(needs.Authorization.outputs.args).repo }}
           ref: ${{ fromJson(needs.Authorization.outputs.args).ref }}

diff --git a/.github/workflows/codeql.yml b/.github/workflows/codeql.yml
@@ -36,7 +36,7 @@ jobs:
 
     steps:
     - name: Checkout repository
-      uses: actions/checkout@v3
+      uses: actions/checkout@v4
 
     # Initializes the CodeQL tools for scanning.
     - name: Initialize CodeQL

diff --git a/.github/workflows/markdown-links-check.yml b/.github/workflows/markdown-links-check.yml
@@ -23,7 +23,7 @@ jobs:
   markdown-link-check:
     runs-on: ubuntu-latest
     steps:
-    - uses: actions/checkout@master
+    - uses: actions/checkout@v4
     - uses: gaurav-nelson/github-action-markdown-link-check@1.0.15
       with:
         max-depth: -1

diff --git a/.github/workflows/premerge.yml b/.github/workflows/premerge.yml
@@ -29,15 +29,15 @@ jobs:
         os: [ ubuntu-22.04, ubuntu-20.04 ]
         python-version: [ "3.8", "3.9", "3.10" ]
     steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4
       - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v4
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
       - name: Install dependencies
         run: |
-          python -m pip install --upgrade pip
-          pip install -e .[dev]
+          python3 -m pip install --upgrade pip
+          python3 -m pip install --no-cache-dir -e .[dev]
       - name: Run unit test
         run: ./runtest.sh
 
@@ -49,15 +49,15 @@ jobs:
         os: [ ubuntu-22.04, ubuntu-20.04 ]
         python-version: [ "3.8", "3.9", "3.10" ]
     steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4
       - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v4
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
       - name: Install dependencies
         run: |
-          python -m pip install --upgrade pip
-          pip install -e .[dev]
-          pip install build twine torch torchvision
+          python3 -m pip install --upgrade pip
+          python3 -m pip install --no-cache-dir -e .[dev]
+          python3 -m pip install --no-cache-dir build twine torch torchvision
       - name: Run wheel build
         run: python3 -m build --wheel
diff --git a/docs/programming_guide/execution_api_type/client_api.rst b/docs/programming_guide/execution_api_type/client_api.rst
@@ -159,24 +159,75 @@ If you are using PyTorch Lightning in your training code, you can check the
 Lightning API Module :mod:`nvflare.app_opt.lightning.api`.
 
 
+Client API communication patterns
+=================================
+
+.. image:: ../../resources/client_api.png
+    :height: 300px
+
+We offer various implementations of Client APIs tailored to different scenarios, each linked with distinct communication patterns.
+
+Broadly, there are in-process and sub-process executors. With the in-process executor, slated for release in NVFlare 2.5.0,
+the training script and the client executor run within the same process. Communication between them occurs
+through an in-memory databus.
+
+On the other hand, the LauncherExecutor employs a sub-process to execute training scripts, leading to the client executor
+and training scripts residing in separate processes. Communication between them is facilitated by either CellPipe
+(default) or FilePipe.
+
+When training uses a single GPU or no GPU, and the training script doesn't integrate third-party
+training systems, the in-process executor is preferable (when available). For multi-GPU training or
+training that relies on external training infrastructure, the LauncherExecutor may be more suitable.
+
+
+Choice of different Pipes
+=========================
+In the 2.5.x release, for most users, we recommend utilizing the default setting with the in-process executor
+(defaulting to memory-based data exchanges).
+Conversely, in the 2.4.x release, we suggest using the default setting with CellPipe for most users.
+
+CellPipe facilitates TCP-based cell-to-cell connections between the Executor and training script processes on
+the local host. A cell is a logical endpoint. This communication enables the exchange of models, metrics,
+and metadata between the two processes.
+
+In contrast, FilePipe offers file-based communication between the Executor and training script processes,
+utilizing a job-specific file directory for exchanging models and metadata via files. While FilePipe is easier to set up
+than CellPipe, it's not suitable for high-frequency metrics exchange.
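
The file-based exchange described above can be illustrated with a minimal, standard-library-only sketch. It is not NVFlare's ``FilePipe`` implementation; the ``send``, ``receive`` and ``demo`` helpers, the topic name, and the JSON payload are all hypothetical and only show the general idea of passing messages through a job-specific directory.

```python
import json
import tempfile
from pathlib import Path


def send(pipe_dir: Path, topic: str, payload: dict) -> Path:
    """Write a message as a JSON file (hypothetical helper, not an NVFlare API)."""
    tmp = pipe_dir / f"{topic}.tmp"
    tmp.write_text(json.dumps(payload))
    final = pipe_dir / f"{topic}.json"
    # Rename after writing so a reader never sees a half-written file.
    tmp.replace(final)
    return final


def receive(pipe_dir: Path, topic: str):
    """Return the message for `topic` and delete its file, or None if absent."""
    path = pipe_dir / f"{topic}.json"
    if not path.exists():
        return None
    payload = json.loads(path.read_text())
    path.unlink()
    return payload


def demo() -> dict:
    # A temporary directory stands in for the job-specific file directory.
    with tempfile.TemporaryDirectory() as job_dir:
        pipe_dir = Path(job_dir)
        send(pipe_dir, "global_model", {"round": 1, "weights": [0.1, 0.2]})
        return receive(pipe_dir, "global_model")
```

In this sketch every message costs a file write, a rename, a read, and a delete, which hints at why a file-based pipe is a poor match for high-frequency metrics exchange.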
+
+
 Configuration
 =============
 
-In the config_fed_client in the FLARE app, in order to launch the training
-script we use the
-:class:`SubprocessLauncher<nvflare.app_common.launchers.subprocess_launcher.SubprocessLauncher>` component.
-The defined ``script`` is invoked, and ``launch_once`` can be set to either
-launch once for the whole job, or launch a process for each task received from the server.
+Different configurations are available for each type of executor.
+
+The configuration for each executor type is described below.
+
+in-process executor configuration
+    .. literalinclude:: ../../../job_templates/sag_pt_in_proc/config_fed_client.conf
+
+    This configuration specifically caters to PyTorch applications, providing serialization and deserialization
+    (aka Decomposers) for commonly used PyTorch objects. For non-PyTorch applications, the generic
+    ``InProcessClientAPIExecutor`` can be employed.
 
-A corresponding :class:`LauncherExecutor<nvflare.app_common.executors.launcher_executor.LauncherExecutor>`
-is used as the executor to handle the tasks and perform the data exchange using the pipe.
-For the Pipe component we provide implementations of :class:`FilePipe<nvflare.fuel.utils.pipe.file_pipe>`
-and :class:`CellPipe<nvflare.fuel.utils.pipe.cell_pipe>`.
+subprocess launcher Executor configuration
+    In the config_fed_client in the FLARE app, in order to launch the training script we use the
+    :class:`SubprocessLauncher<nvflare.app_common.launchers.subprocess_launcher.SubprocessLauncher>` component.
+    The defined ``script`` is invoked, and ``launch_once`` can be set to either launch once for the whole job
+    (``launch_once = True``), or launch a process for each task received from the server (``launch_once = False``).
 
-.. literalinclude:: ../../../job_templates/sag_pt/config_fed_client.conf
+    ``launch_once`` dictates how many times the training script is invoked during the overall training process.
+    When set to False, the executor essentially invokes ``python <training script>.py`` every round of training.
+    Typically, ``launch_once`` is set to True.
 
-For example configurations, take a look at the :github_nvflare_link:`job_templates <job_templates>`
-directory for templates using the launcher and Client API.
+    A corresponding :class:`LauncherExecutor<nvflare.app_common.executors.launcher_executor.LauncherExecutor>`
+    is used as the executor to handle the tasks and perform the data exchange using the pipe.
+    For the Pipe component we provide implementations of :class:`FilePipe<nvflare.fuel.utils.pipe.file_pipe>`
+    and :class:`CellPipe<nvflare.fuel.utils.pipe.cell_pipe>`.
+
+    .. literalinclude:: ../../../job_templates/sag_pt/config_fed_client.conf
+
+    For example configurations, take a look at the :github_nvflare_link:`job_templates <job_templates>`
+    directory for templates using the launcher and Client API.
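
The effect of ``launch_once`` on the number of process launches can be sketched as follows. This is a simplified illustration that only counts launches, not the actual ``SubprocessLauncher`` logic; ``run_job`` is a hypothetical function, and a no-op Python command stands in for the training script.

```python
import subprocess
import sys


def run_job(num_rounds: int, launch_once: bool) -> int:
    """Return how many times the stand-in training script was started."""
    # Stand-in for `python <training script>.py`; it exits immediately.
    script = [sys.executable, "-c", "pass"]
    launches = 0

    if launch_once:
        # One launch for the whole job; a real training process would
        # stay alive and handle every round's task.
        subprocess.run(script, check=True)
        launches += 1

    for _ in range(num_rounds):
        if not launch_once:
            # A fresh process for each task received from the server.
            subprocess.run(script, check=True)
            launches += 1

    return launches
```

Under these assumptions, ``run_job(3, launch_once=True)`` starts one process, while ``run_job(3, launch_once=False)`` starts three, so any per-process startup cost (imports, data loading, model construction) is paid every round in the second case.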
 
 .. note::
    In that case that the user does not need to launch the process and instead
@@ -193,3 +244,24 @@ For additional examples, also take a look at the
 :github_nvflare_link:`step-by-step series <examples/hello-world/step-by-step>`
 that use Client API to write the
 :github_nvflare_link:`train script <examples/hello-world/step-by-step/cifar10/code/fl/train.py>`.
+
+
+Selection of Job Templates
+==========================
+To help users quickly set up job configurations, we provide many job templates. You can pick the job template closest to your use case
+and adapt it to your needs by modifying the relevant variables.
+
+Use the command ``nvflare job list_templates`` to list all the job templates that NVFlare provides.
+
+.. image:: ../../resources/list_templates_results.png
+    :height: 300px
+
+Look for ``client_api`` under ``Execution API Type``; it indicates that the job template uses the
+Client API configuration. You can narrow down the selection further by machine learning framework (pytorch, sklearn, or xgboost),
+in-process or not, type of model (GNN, NeMo LLM), workflow pattern (swarm learning or standard FedAvg with scatter and gather (sag)), etc.
+
+
+
+
+
+
diff --git a/docs/real_world_fl.rst b/docs/real_world_fl.rst
@@ -30,5 +30,6 @@ to see the capabilities of the system and how it can be operated.
    real_world_fl/job
    real_world_fl/workspace
    real_world_fl/cloud_deployment
+   real_world_fl/kubernetes
    real_world_fl/notes_on_large_models
    user_guide/security/identity_security
diff --git a/docs/real_world_fl/cloud_deployment.rst b/docs/real_world_fl/cloud_deployment.rst
@@ -44,11 +44,11 @@ To run NVFlare dashboard on Azure, run:
 
 .. note::
 
-    The script also requires sshpass and jq.  Both can be installed on Ubuntu, with:
+    The script also requires sshpass, dig and jq.  All can be installed on Ubuntu, with:
 
         .. code-block:: shell
 
-           sudo apt install sshpass jq
+           sudo apt install sshpass bind9-dnsutils jq
 
 Users only need to enter an email address and press Enter. This user needs to remember this email and the temporary password that will be provided, as
 this is the login credentials for the NVFLARE Dashboard once the Dashboard is up and running. 
@@ -101,11 +101,11 @@ To run NVFlare dashboard on AWS, run:
 
 .. note::
 
-    The script also requires sshpass and jq.  They can be installed on Ubuntu, with:
+    The script also requires sshpass, dig and jq.  They can be installed on Ubuntu, with:
 
         .. code-block:: shell
 
-           sudo apt install sshpass jq
+           sudo apt install sshpass bind9-dnsutils jq
 
 AWS manages authentications via AWS access_key and access_secret, you will need to have these credentials before you can start creating AWS infrastructure.
 
@@ -128,9 +128,10 @@ You can accept all default values by pressing ENTER.
 
 .. code-block:: none
 
-    This script requires az (Azure CLI), sshpass and jq.  Now checking if they are installed.
+    This script requires az (Azure CLI), sshpass, dig and jq.  Now checking if they are installed.
     Checking if az exists. => found
     Checking if sshpass exists. => found
+    Checking if dig exists. => found
     Checking if jq exists. => found
     Cloud VM image, press ENTER to accept default Canonical:0001-com-ubuntu-server-focal:20_04-lts-gen2:latest: 
     Cloud VM size, press ENTER to accept default Standard_B2ms: 
@@ -190,9 +191,10 @@ You can accept all default values by pressing ENTER.
 
 .. code-block::
 
-    This script requires aws (AWS CLI), sshpass and jq.  Now checking if they are installed.
+    This script requires aws (AWS CLI), sshpass, dig and jq.  Now checking if they are installed.
     Checking if aws exists. => found
     Checking if sshpass exists. => found
+    Checking if dig exists. => found
     Checking if jq exists. => found
     If the server requires additional dependencies, please copy the requirements.txt to /home/nvflare/workspace/aws/nvflareserver/startup.
     Press ENTER when it's done or no additional dependencies.