LR Newton-Raphson (NVIDIA#2529)
* Implement federated logistic regression with second-order Newton-Raphson.

Update file headers.

Update README.

Update README.

Fix README.

Refine README.

Update README.

Added more logging for job status changes. (NVIDIA#2480)

* Added more logging for job status changes.

* Fixed a logging call error.

---------

Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>

Fix update client status (NVIDIA#2508)

* check workflow id before updating client status

* change order of checks

Add user guide on how to deploy to EKS (NVIDIA#2510)

* Add user guide on how to deploy to EKS

* Address comments

Improve dead client handling (NVIDIA#2506)

* dev

* test dead client cmd

* added more info for dead client tracing

* remove unused imports

* fix unit test

* fix test case

* address PR comments

---------

Co-authored-by: Sean Yang <seany314@gmail.com>

Enhance WFController (NVIDIA#2505)

* set flmodel variables in basefedavg

* make round info optional, fix inproc api bug

temporarily disable preflight tests (NVIDIA#2521)

Upgrade dependencies (NVIDIA#2516)

Use full path for PSI components (NVIDIA#2437) (NVIDIA#2517)

Multiple bug fixes from 2.4 (NVIDIA#2518)

* [2.4] Support client custom code in simulator (NVIDIA#2447)

* Support client custom code in simulator

* Fix client custom code

* Remove cancel_futures args (NVIDIA#2457)

* Fix sub_worker_process shutdown (NVIDIA#2458)

* Set GRPC_ENABLE_FORK_SUPPORT to False (NVIDIA#2474)

Pythonic job creation (NVIDIA#2483)

* WIP: constructed the FedJob.

* WIP: server_app json export.

* generate the job app config.

* fully functional pythonic job creation.

* Added simulator_run for pythonic API.

* reformat.

* Added filters support for pythonic job creation.

* handled the direct import case in fed_job.

* refactor.

* Added the resource_spec set function for FedJob.

* refactored.

* Moved the ClientApp and ServerApp into fed_app.py.

* Refactored: removed the _FilterDef class.

* refactored.

* Rename job config classes (NVIDIA#3)

* rename config related classes

* add client api example

* fix metric streaming

* add to() routine

* Enable obj in the constructor as parameter.

* Added support for the launcher script.

* refactored.

* reformat.

* Update the comment.

* re-arrange the package location.

* Added add_ext_script() for BaseAppConfig.

* codestyle fix.

* Removed the client-api-pt example.

* removed unused import.

* fixed the in_time_accumulate_weighted_aggregator_test.py

* Added Enum parameter support.

* Added docstring.

* Added ability to handle parameters from base class.

* Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor.

* Added params_exchange_format for PTInProcessClientAPIExecutor.

* codestyle fix.

* Fixed a custom code folder structure issue.

* work for sub-folder custom files.

* back to handling parameters from base classes.

* Support folder structure job config.

* Added support for flat folder from '.XXX' import.

* codestyle fix.

* refactored and add docstring.

* Address some of the PR reviews.

---------

Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>

Enhancements from 2.4 (NVIDIA#2519)

* Start heartbeat after task is pulled and before task execution (NVIDIA#2415)

* Start pipe handler heartbeat send/check after task is pulled and before task execution (NVIDIA#2442)

* [2.4] Improve cell pipe timeout handling (NVIDIA#2441)

* improve cell pipe timeout handling

* improved end and abort handling

* improve timeout handling

---------

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>

* [2.4] Enhance launcher executor (NVIDIA#2433)

* Update LauncherExecutor logs and execution setup timeout

* Change name

* [2.4] Fire and forget for pipe handler control messages (NVIDIA#2413)

* Fire and forget for pipe handler control messages

* Add default timeout value

* fix wait-for-reply (NVIDIA#2478)

* Fix pipe handler timeout in task exchanger and launcher executor (NVIDIA#2495)

* Fix metric relay pipe handler timeout (NVIDIA#2496)

* Rely on launcher check_run_status to pause/resume hb (NVIDIA#2502)

Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>

---------

Co-authored-by: Yan Cheng <58191769+yanchengnv@users.noreply.github.com>
Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>

Update ci cd from 2.4 (NVIDIA#2520)

* Update github actions (NVIDIA#2450)

* Fix premerge (NVIDIA#2467)

* Fix issues on hello-world TF2 notebook

* Fix tf integration test (NVIDIA#2504)

* Add client api integration tests

---------

Co-authored-by: Isaac Yang <isaacy@nvidia.com>
Co-authored-by: Sean Yang <seany314@gmail.com>

use controller name for stats (NVIDIA#2522)

Simulator workspace re-design (NVIDIA#2492)

* Redesign simulator workspace structure.

* working, needs cleanup.

* Changed the simulator workspace structure to be consistent with POC.

* Moved the logfile init to start_server_app().

* optimized.

* adjust the stats pool location.

* Addressed the PR reviews.

---------

Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>

Simulator end run for all clients (NVIDIA#2514)

* Provide an option to run END_RUN for all clients.

* Added end_run_all option for simulator to run END_RUN event for all clients.

* Fixed an add_argument type, added help message.

* Changed to use add_argument() compatible with Python 3.8.

* reformat.

* rewrite the _end_run_clients() and add docstring for easier understanding.

* reformat.

* adjusting the locking in the _end_run_clients.

* Fixed a potential None pointer error.

* renamed the clients_finished_end_run variable.

---------

Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Co-authored-by: Sean Yang <seany314@gmail.com>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>

Secure XGBoost Integration (NVIDIA#2512)

* Updated FOBS readme to add DatumManager, added agrpcs as secure scheme

* Refactoring

* Refactored the secure version to histogram_based_v2

* Replaced Paillier with a mock encryptor

* Added license header

* Put mock back

* Added metrics_writer back and fixed GRPC error reply

simplify job simulator_run to take only one workspace parameter. (NVIDIA#2528)

Fix README.

Fix file links in README.

Fix file links in README.

Add comparison between centralized and federated training code.

Add missing client api test jobs (NVIDIA#2535)

Fixed the simulator server workspace root dir (NVIDIA#2533)

* Fixed the simulator server root dir error.

* Added unit test for SimulatorRunner start_server_app.

---------

Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>

Improve InProcessClientAPIExecutor  (NVIDIA#2536)

* 1. Rename ExeTaskFnWrapper class to TaskScriptRunner
2. Replace the in-process function execution: instead of calling a main() function, use runpy.run_path(), which removes the requirement that user code define a main() function
3. Redirect print() to logger.info()

* make result check and result pull use the same configurable variable

* rename exec_task_fn_wrapper to task_script_runner.py

* fix typo

Update README for launching python script.

Modify tensorboard logdir.

Link to environment setup instructions.

expose aggregate_fn to users for overwriting (NVIDIA#2539)

Fix MLflow and Tensorboard output to be consistent with new workspace root changes (NVIDIA#2537)

* 1) Fix mlruns and tb_events dirs due to workspace directory changes
2) For MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This is (a) consistent with Tensorboard and (b) avoids one job's output overwriting the first job's

* 1. Remove the default code to use configuration
2. Fix some broken notebooks

* rollback changes

Fix decorator issue (NVIDIA#2542)

Remove line number in code link.

FLModel summary (NVIDIA#2544)

* add FLModel Summary

* format

formatting

Update KM example, add 2-stage solution without HE (NVIDIA#2541)

* add KM without HE, update everything

* fix license header

* fix license header - update year to 2024

* fix format

---------

Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>

* update license

---------

Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Co-authored-by: Holger Roth <hroth@nvidia.com>
3 people authored and nvidianz committed May 6, 2024
1 parent a0250cd commit af3392e
Showing 12 changed files with 987 additions and 0 deletions.
examples/advanced/lr-newton-raphson/README.md (new file, 246 additions)
# Federated Logistic Regression with Second-Order Newton-Raphson Optimization

This example shows how to implement federated binary
classification using logistic regression with second-order Newton-Raphson optimization.

The [UCI Heart Disease
dataset](https://archive.ics.uci.edu/dataset/45/heart+disease) is
used in this example. Scripts are provided to download and process the
dataset as described
[here](https://github.com/owkin/FLamby/tree/main/flamby/datasets/fed_heart_disease). This
dataset contains samples from 4 sites, split into training and
testing sets as described below:
|site | sample split |
|-------------|---------------------------------------|
|Cleveland | train: 199 samples, test: 104 samples |
|Hungary | train: 172 samples, test: 89 samples |
|Switzerland | train: 30 samples, test: 16 samples |
|Long Beach V | train: 85 samples, test: 45 samples |

The number of features in each sample is 13.

## Introduction

The [Newton-Raphson
optimization](https://en.wikipedia.org/wiki/Newton%27s_method) problem
can be described as follows.

In a binary classification task with logistic regression, the
probability of a data sample $x$ classified as positive is formulated
as:
$$p(x) = \sigma(\beta \cdot x + \beta_{0})$$
where $\sigma(.)$ denotes the sigmoid function. We can incorporate
$\beta_{0}$ and $\beta$ into a single parameter vector $\theta =
( \beta_{0}, \beta)$. Let $d$ be the number
of features for each data sample $x$ and let $N$ be the number of data
samples. We then have the matrix version of the above probability
equation:
$$p(X) = \sigma( X \theta )$$
Here $X$ is the matrix of all samples, with shape $N \times (d+1)$,
with its first column filled with the value 1 to account for the
intercept $\theta_{0}$.

The goal is to compute the parameter vector $\theta$ that maximizes the
likelihood function:
$$L_{\theta} = \prod_{i=1}^{N} p(x_i)^{y_i} (1 - p(x_i))^{1 - y_i}$$
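
Equivalently, one can maximize the log-likelihood
$$l_{\theta} = \sum_{i=1}^{N} \left[ y_i \log p(x_i) + (1 - y_i) \log(1 - p(x_i)) \right]$$
whose first and second derivatives with respect to $\theta$ yield the
gradient and Hessian used below.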

The Newton-Raphson method optimizes the likelihood function via
quadratic approximation. Omitting the maths, the theoretical update
formula for parameter vector $\theta$ is:
$$\theta^{n+1} = \theta^{n} - H_{\theta^{n}}^{-1} \nabla L_{\theta^{n}}$$
where
$$\nabla L_{\theta^{n}} = X^{T}(y - p(X))$$
is the gradient of the likelihood function, with $y$ being the vector
of ground truth for sample data matrix $X$, and
$$H_{\theta^{n}} = -X^{T} D X$$
is the Hessian of the likelihood function, with $D$ a diagonal matrix
where diagonal value at $(i,i)$ is $D(i,i) = p(x_i) (1 - p(x_i))$.
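
As a concrete single-site illustration, one Newton-Raphson step can be
written in a few lines of numpy, directly following the three formulas
above (function and variable names here are illustrative, not taken from
the example's code):
```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_raphson_step(theta, X, y):
    # X: N x (d+1) sample matrix with a leading column of ones; y: labels in {0, 1}.
    p = sigmoid(X @ theta)                         # p(X) = sigma(X theta)
    grad = X.T @ (y - p)                           # X^T (y - p(X))
    D = np.diag(p * (1.0 - p))                     # D(i, i) = p(x_i)(1 - p(x_i))
    hessian = -X.T @ D @ X                         # -X^T D X
    return theta - np.linalg.solve(hessian, grad)  # theta - H^{-1} grad
```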

In federated Newton-Raphson optimization, each client will compute its
own gradient $\nabla L_{\theta^{n}}$ and Hessian $H_{\theta^{n}}$
based on local training samples. A server will aggregate the gradients
and Hessians computed from all clients, and perform the update of
parameter $\theta$ based on the theoretical update formula described
above.
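
A minimal sketch of this server-side aggregation and update, assuming each
client reports its gradient, Hessian, and local sample count as numpy data
(the function and key names are illustrative):
```
import numpy as np

def federated_newton_raphson_update(theta, client_results, damping_factor=1.0):
    # client_results: list of dicts with "gradient", "hessian", "n_samples".
    total = sum(r["n_samples"] for r in client_results)
    grad = sum((r["n_samples"] / total) * r["gradient"] for r in client_results)
    hess = sum((r["n_samples"] / total) * r["hessian"] for r in client_results)
    # Damped Newton-Raphson step (cf. "damping_factor" in the server config).
    return theta - damping_factor * np.linalg.solve(hess, grad)
```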

## Implementation

Using `nvflare`, the federated logistic regression with Newton-Raphson
optimization is implemented as follows.

On the server side, all of the workflow logic is implemented in the
class `FedAvgNewtonRaphson`, which can be found
[here](job/newton_raphson/app/custom/newton_raphson_workflow.py). The
`FedAvgNewtonRaphson` class inherits from the
[`BaseFedAvg`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_common/workflows/base_fedavg.py)
class, which itself inherits from the **Workflow Controller**
([`WFController`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_common/workflows/wf_controller.py))
class. This is the preferred approach for implementing a custom
workflow, since `WFController` decouples communication logic from the
actual workflow (training & validation) logic. The mandatory
method to override in `WFController` is the
[`run()`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_common/workflows/wf_controller.py#L37)
method, where the orchestration of the server-side workflow actually
happens. The implementation of the `run()` method in
[`FedAvgNewtonRaphson`](job/newton_raphson/app/custom/newton_raphson_workflow.py)
is similar to the classic
[`FedAvg`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_common/workflows/fedavg.py#L44)
(a condensed sketch follows this list):
- Initialize the global model. This is achieved through the method `load_model()`
from the base class
[`ModelController`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_common/workflows/model_controller.py#L292),
which relies on the
[`ModelPersistor`](https://nvflare.readthedocs.io/en/main/glossary.html#persistor). A
custom
[`NewtonRaphsonModelPersistor`](job/newton_raphson/app/custom/newton_raphson_persistor.py)
is implemented in this example, based on the
[`NPModelPersistor`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_common/np/np_model_persistor.py)
for numpy data, since the _model_ in the case of logistic regression
is just the parameter vector $\theta$, which can be represented by a
numpy array. Only the `__init__` method needs to be re-implemented
to provide a proper initialization for the global parameter vector
$\theta$.
- During each training round, the global model will be sent to the
list of participating clients to perform a training task. This is
done using the
[`send_model_and_wait()`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_common/workflows/wf_controller.py#L41)
method. Once
the clients finish their local training, results will be collected
and sent back to the server as
[`FLModel`](https://nvflare.readthedocs.io/en/main/programming_guide/fl_model.html#flmodel)s.
- Results sent by clients contain their locally computed gradient and
Hessian. A [custom aggregation
function](job/newton_raphson/app/custom/newton_raphson_workflow.py)
is implemented to get the averaged gradient and Hessian, and compute
the Newton-Raphson update for the global parameter vector $\theta$,
based on the theoretical formula shown above. The averaging of
gradient and Hessian is based on the
[`WeightedAggregationHelper`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_common/aggregators/weighted_aggregation_helper.py#L20),
which weighs the contribution from each client based on the number
of local training samples. The aggregated Newton-Raphson update is
returned as an `FLModel`.
- After getting the aggregated Newton-Raphson update, an
[`update_model()`](job/newton_raphson/app/custom/newton_raphson_workflow.py#L172)
method is implemented to actually apply the Newton-Raphson update to
the global model.
- The last step is to save the updated global model, again through
the `NewtonRaphsonModelPersistor` using `save_model()`.
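
Putting these steps together, a condensed sketch of the `run()` method
(`load_model()`, `send_model_and_wait()`, `update_model()` and
`save_model()` are the APIs referenced above; the aggregation callback
name and round bookkeeping are simplified):
```
def run(self):
    # Initialize the global model via the persistor.
    model = self.load_model()
    for current_round in range(self.num_rounds):
        model.current_round = current_round
        # Send the global model to participating clients and wait for results.
        results = self.send_model_and_wait(data=model)
        # Average gradients/Hessians and compute the Newton-Raphson update.
        update = self.aggregate(results, aggregate_fn=self.newton_raphson_aggregator_fn)
        # Apply the aggregated update to the global parameter vector.
        model = self.update_model(model, update)
    # Persist the final global model.
    self.save_model(model)
```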


On the client side, the local training logic is implemented
[here](job/newton_raphson/app/custom/newton_raphson_train.py). The
implementation is based on the [`Client
API`](https://nvflare.readthedocs.io/en/main/programming_guide/execution_api_type.html#client-api). This
allows users to add minimal `nvflare`-specific code to turn a typical
centralized training script into a federated client-side local training
script (a minimal sketch follows this list).
- During local training, each client receives a copy of the global
model, sent by the server, using `flare.receive()` API. The received
global model is an instance of `FLModel`.
- A local validation is first performed, where validation metrics
(accuracy and precision) are streamed to server using the
[`SummaryWriter`](https://nvflare.readthedocs.io/en/main/apidocs/nvflare.client.tracking.html#nvflare.client.tracking.SummaryWriter). The
streamed metrics can be loaded and visualized using tensorboard.
- Then each client computes its gradient and Hessian based on local
training data, using the respective theoretical formulas described
above. This is implemented in the
[`train_newton_raphson()`](job/newton_raphson/app/custom/newton_raphson_train.py#L82)
method. Each client then sends the computed results (always in
`FLModel` format) to server for aggregation, using `flare.send()`
API.
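
A minimal sketch of this client-side structure (the `flare.*` calls are
the Client API referenced above; data loading and the parameter key names
are illustrative assumptions):
```
import numpy as np
import nvflare.client as flare
from nvflare.app_common.abstract.fl_model import FLModel

flare.init()
X, y = load_local_data()  # hypothetical helper: local design matrix and labels

while flare.is_running():
    input_model = flare.receive()  # FLModel carrying the current global theta
    theta = input_model.params["weights"]  # key name is an assumption
    # ... local validation and metric streaming (SummaryWriter) go here ...
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    gradient = X.T @ (y - p)
    hessian = -X.T @ np.diag(p * (1.0 - p)) @ X
    flare.send(FLModel(params={"gradient": gradient, "hessian": hessian},
                       meta={"sample_size": X.shape[0]}))
```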

Each client site corresponds to a site listed in the data table above.

A [centralized training script](./train_centralized.py) is also
provided, which allows comparing the federated Newton-Raphson
optimization with the centralized version. In the centralized
version, training data samples from all 4 sites are concatenated into
a single matrix, which is used to optimize the model parameters. The
optimized model is then tested separately on the testing data samples
of the 4 sites, using accuracy and precision as metrics.

Comparing the federated [client-side training
code](job/newton_raphson/app/custom/newton_raphson_train.py) with the
centralized [training code](./train_centralized.py), we can see that
the training logic remains similar: load data, perform training
(Newton-Raphson updates), and validate the trained model. The only
differences added in the federated code relate to interaction with the
FL system, such as receiving and sending `FLModel`s.

## Set Up Environment & Install Dependencies

Follow instructions
[here](https://github.com/NVIDIA/NVFlare/tree/main/examples#set-up-a-virtual-environment)
to set up a virtual environment for `nvflare` examples and install
dependencies for this example.

## Download and prepare data

Execute the following script:
```
bash ./prepare_heart_disease_data.sh
```
This will download the heart disease dataset under
`/tmp/flare/dataset/heart_disease_data/`.

## Centralized Logistic Regression

Launch the following script:
```
python ./train_centralized.py --solver custom
```

Two implementations of logistic regression are provided in the
centralized training script, which can be selected via the `--solver`
argument:
- One uses `sklearn.linear_model.LogisticRegression` with the
`newton-cholesky` solver
- The other is implemented manually, using the theoretical update
formulas described above.

Both implementations were tested to converge in 4 iterations and to
give the same result.
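
For reference, the sklearn-based path amounts to something like the
following sketch (the actual script's arguments may differ;
`newton-cholesky` and `penalty=None` require scikit-learn >= 1.2):
```
from sklearn.linear_model import LogisticRegression

# X_train, y_train, X_test, y_test: numpy arrays from the prepared dataset.
clf = LogisticRegression(solver="newton-cholesky", penalty=None)
clf.fit(X_train, y_train.ravel())
print("accuracy:", clf.score(X_test, y_test))
```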

Example output:
```
using solver: custom
loading training data.
training data X loaded. shape: (486, 13)
training data y loaded. shape: (486, 1)
site - 1
validation set n_samples: 104
accuracy: 0.75
precision: 0.7115384615384616
site - 2
validation set n_samples: 89
accuracy: 0.7528089887640449
precision: 0.6122448979591837
site - 3
validation set n_samples: 16
accuracy: 0.75
precision: 1.0
site - 4
validation set n_samples: 45
accuracy: 0.6
precision: 0.9047619047619048
```

## Federated Logistic Regression

Execute the following command to launch federated logistic
regression. This will run in `nvflare`'s simulator mode.
```
nvflare simulator -w ./workspace -n 4 -t 4 job/newton_raphson/
```

Accuracy and precision for each site can be viewed in Tensorboard:
```
tensorboard --logdir=./workspace/server/simulate_job/tb_events
```
As can be seen from the figure below, the per-site evaluation metrics of
federated logistic regression are on par with the centralized version.

<img src="./figs/tb-metrics.png" alt="Tensorboard metrics server"/>
config_fed_client.json (new file, 75 additions)
{
  "format_version": 2,
  "app_script": "newton_raphson_train.py",
  "app_config": "--data_root /tmp/flare/dataset/heart_disease_data",
  "executors": [
    {
      "tasks": [
        "train"
      ],
      "executor": {
        "path": "nvflare.app_common.executors.client_api_launcher_executor.ClientAPILauncherExecutor",
        "args": {
          "launcher_id": "launcher",
          "pipe_id": "pipe",
          "heartbeat_timeout": 60,
          "params_exchange_format": "raw",
          "params_transfer_type": "FULL",
          "train_with_evaluation": false
        }
      }
    }
  ],
  "task_data_filters": [],
  "task_result_filters": [],
  "components": [
    {
      "id": "launcher",
      "path": "nvflare.app_common.launchers.subprocess_launcher.SubprocessLauncher",
      "args": {
        "script": "python3 custom/{app_script} {app_config}",
        "launch_once": true
      }
    },
    {
      "id": "pipe",
      "path": "nvflare.fuel.utils.pipe.cell_pipe.CellPipe",
      "args": {
        "mode": "PASSIVE",
        "site_name": "{SITE_NAME}",
        "token": "{JOB_ID}",
        "root_url": "{ROOT_URL}",
        "secure_mode": "{SECURE_MODE}",
        "workspace_dir": "{WORKSPACE}"
      }
    },
    {
      "id": "metrics_pipe",
      "path": "nvflare.fuel.utils.pipe.cell_pipe.CellPipe",
      "args": {
        "mode": "PASSIVE",
        "site_name": "{SITE_NAME}",
        "token": "{JOB_ID}",
        "root_url": "{ROOT_URL}",
        "secure_mode": "{SECURE_MODE}",
        "workspace_dir": "{WORKSPACE}"
      }
    },
    {
      "id": "metric_relay",
      "path": "nvflare.app_common.widgets.metric_relay.MetricRelay",
      "args": {
        "pipe_id": "metrics_pipe",
        "event_type": "fed.analytix_log_stats",
        "read_interval": 0.1
      }
    },
    {
      "id": "client_api_config_preparer",
      "path": "nvflare.app_common.widgets.external_configurator.ExternalConfigurator",
      "args": {
        "component_ids": ["metric_relay"]
      }
    }
  ]
}
config_fed_server.json (new file, 34 additions)
{
  "format_version": 2,
  "server": {
    "heart_beat_timeout": 600
  },
  "task_data_filters": [],
  "task_result_filters": [],
  "components": [
    {
      "id": "newton_raphson_persistor",
      "path": "newton_raphson_persistor.NewtonRaphsonModelPersistor",
      "args": {
        "n_features": 13
      }
    },
    {
      "id": "tb_analytics_receiver",
      "path": "nvflare.app_opt.tracking.tb.tb_receiver.TBAnalyticsReceiver",
      "args.events": ["fed.analytix_log_stats"]
    }
  ],
  "workflows": [
    {
      "id": "fedavg_newton_raphson",
      "path": "newton_raphson_workflow.FedAvgNewtonRaphson",
      "args": {
        "min_clients": 4,
        "num_rounds": 5,
        "damping_factor": 0.8,
        "persistor_id": "newton_raphson_persistor"
      }
    }
  ]
}
