add FedBN Implementation on NVFlare research folder - a local batch normalization federated learning method #2524

Merged: 37 commits, May 10, 2024
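
For readers unfamiliar with the idea named in the title, here is a minimal sketch of local batch normalization in federated averaging: every parameter is averaged across sites except the batch-norm entries, which each site keeps for itself. This is not the code added under research/fedbn and it uses no NVFlare APIs. The helper names (`is_bn_key`, `aggregate_without_bn`, `apply_global`), the plain-list "state dicts", and the rule that a batch-norm parameter is any key containing "bn" are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat list of floats.
State = Dict[str, List[float]]


def is_bn_key(name: str) -> bool:
    # Assumption: batch-norm parameters can be recognised by "bn" in their name.
    return "bn" in name


def aggregate_without_bn(client_states: List[State]) -> State:
    """Average every non-BN parameter across clients; BN entries are left out."""
    n = len(client_states)
    shared_keys = [k for k in client_states[0] if not is_bn_key(k)]
    return {
        k: [sum(vals) / n for vals in zip(*(s[k] for s in client_states))]
        for k in shared_keys
    }


def apply_global(local_state: State, global_state: State) -> State:
    """Overwrite shared parameters with the global average; keep local BN entries."""
    merged = dict(local_state)
    merged.update(global_state)
    return merged


site_a = {"conv1.weight": [1.0, 2.0], "bn1.weight": [0.5], "bn1.running_mean": [0.1]}
site_b = {"conv1.weight": [3.0, 4.0], "bn1.weight": [1.5], "bn1.running_mean": [0.9]}

global_state = aggregate_without_bn([site_a, site_b])
new_a = apply_global(site_a, global_state)
new_b = apply_global(site_b, global_state)
```

After one round, both sites share the averaged `conv1.weight`, while `bn1.weight` and `bn1.running_mean` remain whatever each site had locally.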

Commits on Apr 20, 2024

  1. add research/fedbn

    MinghuiChen43 committed Apr 20, 2024
    a25cf54

Commits on Apr 24, 2024

  1. 37587c5
  2. de454b4

Commits on Apr 25, 2024

  1. a682140

Commits on Apr 27, 2024

  1. rewrite fedbn

    MinghuiChen43 committed Apr 27, 2024
    1a9db50

Commits on Apr 28, 2024

  1. update jobs

    MinghuiChen43 committed Apr 28, 2024
    98b30a8
  2. remove workspace

    MinghuiChen43 committed Apr 28, 2024
    113b5ad
  3. update README

    MinghuiChen43 committed Apr 28, 2024
    69c9a34

Commits on Apr 29, 2024

  1. dc5ef27

Commits on Apr 30, 2024

  1. Fixed the simulator server workspace root dir (NVIDIA#2533)

    * Fixed the simulator server root dir error.
    
    * Added unit test for SimulatorRunner start_server_app.
    
    ---------
    
    Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
    2 people authored and MinghuiChen43 committed Apr 30, 2024
    a5abb01
  2. Improve InProcessClientAPIExecutor (NVIDIA#2536)

    * 1. rename ExeTaskFnWrapper class to TaskScriptRunner
    2. Replace the in-process function execution: instead of calling a main() function, use runpy.run_path(), which removes the requirement that user scripts define main()
    3. redirect print() to logger.info()
    
    * make result check and result pull use the same configurable variable
    
    * rename exec_task_fn_wrapper to task_script_runner.py
    
    * fix typo
    chesterxgchen authored and MinghuiChen43 committed Apr 30, 2024
    4fac311
  3. FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (NVIDIA#2537)
    
    * 1) fix mlruns and tb_events dirs due to workspace directory changes
    2) for MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This a) is consistent with TensorBoard and b) avoids later job output overwriting that of the first job
    
    * 1. Remove the default code to use configuration
    2. fix some broken notebook
    
    * rollback changes
    chesterxgchen authored and MinghuiChen43 committed Apr 30, 2024
    c83039b
  4. 0c35216
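
The InProcessClientAPIExecutor commit in the list above mentions switching from calling a `main()` function to `runpy.run_path()`. The sketch below shows only the standard-library behaviour that makes this possible: `runpy.run_path()` executes a script file top to bottom and returns its resulting globals, so the script needs no `main()`. The script contents and the file name `train_script.py` are made up for illustration; this is not NVFlare's TaskScriptRunner.

```python
import runpy
import tempfile
from pathlib import Path

# Hypothetical user script: plain top-level code, no main() function defined.
script_source = "result = sum(range(5))\n"

with tempfile.TemporaryDirectory() as tmp:
    script_path = Path(tmp) / "train_script.py"
    script_path.write_text(script_source)
    # Execute the file as if it were run as a script; the returned dict holds
    # the module-level names the script defined.
    namespace = runpy.run_path(str(script_path), run_name="__main__")

result = namespace["result"]
```

Passing `run_name="__main__"` means any `if __name__ == "__main__":` guard in the script also runs, so scripts written either way behave the same.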

Commits on May 1, 2024

  1. f76f71b
  2. 394e137
  3. FLModel summary (NVIDIA#2544)

    * add FLModel Summary
    
    * format
    chesterxgchen authored and MinghuiChen43 committed May 1, 2024
    6655321
  4. remove jobs folder

    MinghuiChen43 committed May 1, 2024
    feab6e6
  5. 3865a59
  6. e4dbfc4

Commits on May 2, 2024

  1. handle cases where the script with relative path in Script Runner (NVIDIA#2543)
    
    * handle cases where the script with relative path
    
    * add more unit test cases and change the file search logics
    
    * code format
    chesterxgchen authored and MinghuiChen43 committed May 2, 2024
    985182b
  2. Lr newton raphson (NVIDIA#2529)

    * Implement federated logistic regression with second-order newton raphson.
    
    Update file headers.
    
    Update README.
    
    Update README.
    
    Fix README.
    
    Refine README.
    
    Update README.
    
    Added more logging for the job status changing. (NVIDIA#2480)
    
    * Added more logging for the job status changing.
    
    * Fixed a logging call error.
    
    ---------
    
    Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
    Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
    
    Fix update client status (NVIDIA#2508)
    
    * check workflow id before updating client status
    
    * change order of checks
    
    Add user guide on how to deploy to EKS (NVIDIA#2510)
    
    * Add user guide on how to deploy to EKS
    
    * Address comments
    
    Improve dead client handling (NVIDIA#2506)
    
    * dev
    
    * test dead client cmd
    
    * added more info for dead client tracing
    
    * remove unused imports
    
    * fix unit test
    
    * fix test case
    
    * address PR comments
    
    ---------
    
    Co-authored-by: Sean Yang <seany314@gmail.com>
    
    Enhance WFController (NVIDIA#2505)
    
    * set flmodel variables in basefedavg
    
    * make round info optional, fix inproc api bug
    
    temporarily disable preflight tests (NVIDIA#2521)
    
    Upgrade dependencies (NVIDIA#2516)
    
    Use full path for PSI components (NVIDIA#2437) (NVIDIA#2517)
    
    Multiple bug fixes from 2.4 (NVIDIA#2518)
    
    * [2.4] Support client custom code in simulator (NVIDIA#2447)
    
    * Support client custom code in simulator
    
    * Fix client custom code
    
    * Remove cancel_futures args (NVIDIA#2457)
    
    * Fix sub_worker_process shutdown (NVIDIA#2458)
    
    * Set GRPC_ENABLE_FORK_SUPPORT to False (NVIDIA#2474)
    
    Pythonic job creation (NVIDIA#2483)
    
    * WIP: constructed the FedJob.
    
    * WIP: server_app JSON export.
    
    * generate the job app config.
    
    * fully functional pythonic job creation.
    
    * Added simulator_run for pythonic API.
    
    * reformat.
    
    * Added filters support for pythonic job creation.
    
    * handled the direct import case in fed_job.
    
    * refactor.
    
    * Added the resource_spec set function for FedJob.
    
    * refactored.
    
    * Moved the ClientApp and ServerApp into fed_app.py.
    
    * Refactored: removed the _FilterDef class.
    
    * refactored.
    
    * Rename job config classes (NVIDIA#3)
    
    * rename config related classes
    
    * add client api example
    
    * fix metric streaming
    
    * add to() routine
    
    * Enable obj in the constructor as parameter.
    
    * Added support for the launcher script.
    
    * refactored.
    
    * reformat.
    
    * Update the comment.
    
    * re-arrange the package location.
    
    * Added add_ext_script() for BaseAppConfig.
    
    * codestyle fix.
    
    * Removed the client-api-pt example.
    
    * removed unused import.
    
    * fixed the in_time_accumulate_weighted_aggregator_test.py
    
    * Added Enum parameter support.
    
    * Added docstring.
    
    * Added ability to handle parameters from base class.
    
    * Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor.
    
    * Added params_exchange_format for PTInProcessClientAPIExecutor.
    
    * codestyle fix.
    
    * Fixed a custom code folder structure issue.
    
    * work for sub-folder custom files.
    
    * backed to handle parameters from base classes.
    
    * Support folder structure job config.
    
    * Added support for flat folder from '.XXX' import.
    
    * codestyle fix.
    
    * refactored and add docstring.
    
    * Address some of the PR reviews.
    
    ---------
    
    Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com>
    Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
    Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
    
    Enhancements from 2.4 (NVIDIA#2519)
    
    * Starts heartbeat after task is pulled and before task execution (NVIDIA#2415)
    
    * Starts pipe handler heartbeat send/check after task is pulled and before task execution (NVIDIA#2442)
    
    * [2.4] Improve cell pipe timeout handling (NVIDIA#2441)
    
    * improve cell pipe timeout handling
    
    * improved end and abort handling
    
    * improve timeout handling
    
    ---------
    
    Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
    
    * [2.4] Enhance launcher executor (NVIDIA#2433)
    
    * Update LauncherExecutor logs and execution setup timeout
    
    * Change name
    
    * [2.4] Fire and forget for pipe handler control messages (NVIDIA#2413)
    
    * Fire and forget for pipe handler control messages
    
    * Add default timeout value
    
    * fix wait-for-reply (NVIDIA#2478)
    
    * Fix pipe handler timeout in task exchanger and launcher executor (NVIDIA#2495)
    
    * Fix metric relay pipe handler timeout (NVIDIA#2496)
    
    * Rely on launcher check_run_status to pause/resume hb (NVIDIA#2502)
    
    Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
    
    ---------
    
    Co-authored-by: Yan Cheng <58191769+yanchengnv@users.noreply.github.com>
    Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
    
    Update ci cd from 2.4 (NVIDIA#2520)
    
    * Update github actions (NVIDIA#2450)
    
    * Fix premerge (NVIDIA#2467)
    
    * Fix issues on hello-world TF2 notebook
    
    * Fix tf integration test (NVIDIA#2504)
    
    * Add client api integration tests
    
    ---------
    
    Co-authored-by: Isaac Yang <isaacy@nvidia.com>
    Co-authored-by: Sean Yang <seany314@gmail.com>
    
    use controller name for stats (NVIDIA#2522)
    
    Simulator workspace re-design (NVIDIA#2492)
    
    * Redesign simulator workspace structure.
    
    * working, needs clean.
    
    * Changed the simulator workspace structure to be consistent with POC.
    
    * Moved the logfile init to start_server_app().
    
    * optimized.
    
    * adjust the stats pool location.
    
    * Addressed the PR reviews.
    
    ---------
    
    Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
    Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
    
    Simulator end run for all clients (NVIDIA#2514)
    
    * Provide an option to run END_RUN for all clients.
    
    * Added end_run_all option for simulator to run END_RUN event for all clients.
    
    * Fixed an add_argument type, added help message.
    
    * Changed to use add_argument() compatible with python 3.8.
    
    * reformat.
    
    * rewrite the _end_run_clients() and add docstring for easier understanding.
    
    * reformat.
    
    * adjusting the locking in the _end_run_clients.
    
    * Fixed a potential None pointer error.
    
    * renamed the clients_finished_end_run variable.
    
    ---------
    
    Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
    Co-authored-by: Sean Yang <seany314@gmail.com>
    Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
    
    Secure XGBoost Integration (NVIDIA#2512)
    
    * Updated FOBS readme to add DatumManager, added agrpcs as secure scheme
    
    * Refactoring
    
    * Refactored the secure version to histogram_based_v2
    
    * Replaced Paillier with a mock encryptor
    
    * Added license header
    
    * Put mock back
    
    * Added metrics_writer back and fixed GRPC error reply
    
    simplify job simulator_run to take only one workspace parameter. (NVIDIA#2528)
    
    Fix README.
    
    Fix file links in README.
    
    Fix file links in README.
    
    Add comparison between centralized and federated training code.
    
    Add missing client api test jobs (NVIDIA#2535)
    
    Fixed the simulator server workspace root dir (NVIDIA#2533)
    
    * Fixed the simulator server root dir error.
    
    * Added unit test for SimulatorRunner start_server_app.
    
    ---------
    
    Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
    
    Improve InProcessClientAPIExecutor  (NVIDIA#2536)
    
    * 1. rename ExeTaskFnWrapper class to TaskScriptRunner
    2. Replace the in-process function execution: instead of calling a main() function, use runpy.run_path(), which removes the requirement that user scripts define main()
    3. redirect print() to logger.info()
    
    * make result check and result pull use the same configurable variable
    
    * rename exec_task_fn_wrapper to task_script_runner.py
    
    * fix typo
    
    Update README for launching python script.
    
    Modify tensorboard logdir.
    
    Link to environment setup instructions.
    
    expose aggregate_fn to users for overwriting (NVIDIA#2539)
    
    FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (NVIDIA#2537)
    
    * 1) fix mlruns and tb_events dirs due to workspace directory changes
    2) for MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This a) is consistent with TensorBoard and b) avoids later job output overwriting that of the first job
    
    * 1. Remove the default code to use configuration
    2. fix some broken notebook
    
    * rollback changes
    
    Fix decorator issue (NVIDIA#2542)
    
    Remove line number in code link.
    
    FLModel summary (NVIDIA#2544)
    
    * add FLModel Summary
    
    * format
    
    formatting
    
    Update KM example, add 2-stage solution without HE (NVIDIA#2541)
    
    * add KM without HE, update everything
    
    * fix license header
    
    * fix license header - update year to 2024
    
    * fix format
    
    ---------
    
    Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
    
    * update license
    
    ---------
    
    Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
    Co-authored-by: Holger Roth <hroth@nvidia.com>
    3 people authored and MinghuiChen43 committed May 2, 2024
    1a8dd1b
  3. format update

    ZiyueXu77 authored and MinghuiChen43 committed May 2, 2024
    2a36592
  4. Update KM example, add 2-stage solution without HE (NVIDIA#2541)

    * add KM without HE, update everything
    
    * fix license header
    
    * fix license header - update year to 2024
    
    * fix format
    
    ---------
    
    Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
    2 people authored and MinghuiChen43 committed May 2, 2024
    514bb03

Commits on May 6, 2024

  1. f251451
  2. MONAI mednist example (NVIDIA#2532)

    * add monai notebook
    
    * add training script
    
    * update example
    
    * update notebook
    
    * use job template
    
    * call init later
    
    * switch back
    
    * add gitignore
    
    * update notebooks
    
    * add readmes
    
    * send received model to GPU
    
    * use monai tb stats handler
    
    * formatting
    holgerroth authored and MinghuiChen43 committed May 6, 2024
    46b8d2a

Commits on May 7, 2024

  1. Add in process client api tests (NVIDIA#2549)

    * Add in process client api tests
    
    * Fix headers
    
    * Fix comments
    YuanTingHsieh authored and MinghuiChen43 committed May 7, 2024
    42afc68
  2. Add client controller executor (NVIDIA#2530)

    * add client controller executor
    
    * address comments
    
    * enhance abort, set peer props
    
    * remove asserts
    
    ---------
    
    Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
    2 people authored and MinghuiChen43 committed May 7, 2024
    0d71db7

Commits on May 8, 2024

  1. 2279fa8

Commits on May 9, 2024

  1. 5f2c9be
  2. update README

    MinghuiChen43 committed May 9, 2024
    d943892
  3. update readme

    MinghuiChen43 committed May 9, 2024
    4438f60
  4. update readme

    MinghuiChen43 committed May 9, 2024
    1268634
  5. update readme

    MinghuiChen43 committed May 9, 2024
    e3e9bec
  6. [2.5] Clean up to allow creation of nvflare light (NVIDIA#2573)

    * clean up to allow creation of nvflare light
    
    * move defs to cellnet
    yanchengnv authored and MinghuiChen43 committed May 9, 2024
    8f187eb
  7. 7d133c9

Commits on May 10, 2024

  1. verified commit

    MinghuiChen43 committed May 10, 2024
    8e183cb
  2. a9f8c10