[2.4] Secure XGBoost Documentation (NVIDIA#2671)
* add 2.4.2 documentation

* update plugin configuration section

* address comments

* address comments 2

* change default plugin to cuda_paillier

---------

Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
SYangster and chesterxgchen committed Aug 5, 2024
1 parent 8366a9b commit 8cae824
Showing 13 changed files with 1,025 additions and 142 deletions.
151 changes: 9 additions & 142 deletions docs/programming_guide/experiment_tracking.rst
@@ -4,150 +4,17 @@
Experiment Tracking
###################

***********************
Overview and Approaches
***********************

FLARE seamlessly integrates with leading experiment tracking systems, including MLflow, Weights & Biases, and TensorBoard, to facilitate comprehensive monitoring of metrics.

In a federated computing setting, the data is distributed across multiple devices or systems, and training is run
on each device independently while preserving each client's data privacy.

Assuming a federated system consisting of one server and many clients, with the server coordinating the ML training of the clients,
we can interact with ML experiment tracking tools in two different ways. You can choose between decentralized and centralized tracking configurations:

- **Decentralized tracking** (client-side experiment tracking): Each client manages its own metrics and experiment tracking locally, directly sending the logged
  metrics/parameters to the ML experiment tracking server (like MLflow or Weights and Biases) or the local file system (like TensorBoard). This maintains training
  metric privacy, but limits the ability to compare data across different sites.
- **Centralized tracking** (aggregated experiment tracking): Clients send the logged metrics/parameters to a central FL server, which then pushes the data to the
  selected tracking system (an ML experiment tracking server or local file system). This setup supports effective cross-site metric comparisons.

We provide solutions for different client execution types. For the Client API, use the corresponding experiment tracking APIs. For Executors or Learners, use the experiment tracking LogWriters.

.. toctree::
   :maxdepth: 1

   experiment_tracking/experiment_tracking_apis
   experiment_tracking/experiment_tracking_log_writer

Each approach will have its use cases and unique challenges. In NVFLARE, we developed a server-side approach (in the
provided examples, the Receiver is on the FL server, but it could also be on the FL client):

- Clients don't need to have access to the tracking server, avoiding the additional
  authentication for every client. In many cases, the clients may be from organizations
  different from the host organization of the experiment tracking server.
- Since we reduce the connections to the tracking server from N clients to just one server, the traffic to the tracking server
  is greatly reduced. In some cases, such as with MLflow, the events can be buffered on the server and sent to the tracking
  server in batches, further reducing the traffic to the tracking server. The buffer may add additional latency, so you can
  disable buffering by setting the buffer flush time to 0, assuming the tracking server can handle the traffic.
- Another key benefit of using server-side experiment tracking is that we separate the metrics data collection
  from the metrics data delivery to the tracking server. Clients are only responsible for collecting metrics, and only the server needs to
  know about the tracking server. This allows us to use different tools for data collection and data delivery.
  For example, if the client has training code with logging in TensorBoard syntax, the server can, without any change to that code,
  receive the logged data and deliver the metrics to MLflow.
- Server-side experiment tracking can also organize different clients' results into different experiment runs so they can be easily
  compared side-by-side.

.. note::

    This page covers experiment tracking using :class:`LogWriters <nvflare.app_common.tracking.log_writer.LogWriter>`,
    which are configured and used with :ref:`executor` or :ref:`model_learner` in FLARE-side code.
    However, if using the Client API, please refer to :ref:`client_api` and :ref:`nvflare.client.tracking` for adding experiment tracking to your custom training code.


**************************************
Tools, Sender, LogWriter and Receivers
**************************************

With the "experiment_tracking" examples in the advanced examples directory, you can see how to track and visualize
experiments in real time and compare results by leveraging several experiment tracking solutions:

- `Tensorboard <https://www.tensorflow.org/tensorboard>`_
- `MLflow <https://mlflow.org/>`_
- `Weights and Biases <https://wandb.ai/site>`_

.. note::

    The user needs to sign up at Weights and Biases to access the service; NVFlare cannot provide access.

In the Federated Learning phase, users can choose an API syntax that they are used to from one
of the above tools. NVFlare has developed components that mimic these APIs, called
:class:`LogWriters <nvflare.app_common.tracking.log_writer.LogWriter>`. All clients' experiment logs
are streamed to the FL server (with :class:`ConvertToFedEvent <nvflare.app_common.widgets.convert_to_fed_event.ConvertToFedEvent>`),
where the actual experiment logs are recorded. The components that receive
these logs are called Receivers, based on :class:`AnalyticsReceiver <nvflare.app_common.widgets.streaming.AnalyticsReceiver>`.
The receiver component leverages the experiment tracking tool and records the logs during the experiment run.

In a normal setting, we would have pairs of senders and receivers, with some provided implementations in :mod:`nvflare.app_opt.tracking`:

- TBWriter <-> TBAnalyticsReceiver
- MLflowWriter <-> MLflowReceiver
- WandBWriter <-> WandBReceiver

You can also mix and match any combination of LogWriter and Receiver so you can write the ML code using one API
but use any experiment tracking tool or tools (you can use multiple receivers for the same log data sent from one sender).

.. image:: ../resources/experiment_tracking.jpg

*************************
Experiment logs streaming
*************************

On the client side, when a :class:`LogWriter <nvflare.app_common.tracking.log_writer.LogWriter>` writes the
metrics, instead of writing to files, it actually generates an NVFLARE event (of type `analytix_log_stats` by default).
The `ConvertToFedEvent` widget will turn the local event `analytix_log_stats` into a
fed event `fed.analytix_log_stats`, which will be delivered to the server side.

On the server side, the :class:`AnalyticsReceiver <nvflare.app_common.widgets.streaming.AnalyticsReceiver>` is configured
to process `fed.analytix_log_stats` events and write the received log data to the appropriate tracking solution.
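
Conceptually, the conversion is a thin event adapter. The following is a minimal sketch (not the actual
implementation of :class:`ConvertToFedEvent <nvflare.app_common.widgets.convert_to_fed_event.ConvertToFedEvent>`)
of how a local event can be re-fired as a fed event from an ``FLComponent``, assuming the standard
``fire_fed_event()`` helper and the ``FLContextKey.EVENT_DATA`` property:

.. code-block:: python

    from nvflare.apis.fl_component import FLComponent
    from nvflare.apis.fl_constant import FLContextKey
    from nvflare.apis.fl_context import FLContext


    class ConvertToFedEventSketch(FLComponent):
        """Sketch only: re-fires selected local events as fed events."""

        def __init__(self, events_to_convert, fed_event_prefix="fed."):
            super().__init__()
            self.events_to_convert = events_to_convert
            self.fed_event_prefix = fed_event_prefix

        def handle_event(self, event_type: str, fl_ctx: FLContext):
            if event_type in self.events_to_convert:
                # carry the original event data over to the fed event,
                # which FLARE delivers to the server side
                data = fl_ctx.get_prop(FLContextKey.EVENT_DATA)
                self.fire_fed_event(self.fed_event_prefix + event_type, data, fl_ctx)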

****************************************
Support custom experiment tracking tools
****************************************

There are many different experiment tracking tools, and you might want to write a custom writer and/or receiver for your needs.

There are three things to consider for developing a custom experiment tracking tool.

Data Type
=========

Currently, the supported data types are listed in :class:`AnalyticsDataType <nvflare.apis.analytix.AnalyticsDataType>`, and other data types can be added as needed.
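
As a quick check, you can enumerate the data types supported by your installed NVFLARE version (``AnalyticsDataType`` is an ``Enum``):

.. code-block:: python

    from nvflare.apis.analytix import AnalyticsDataType

    # List the supported data types, e.g. SCALAR, SCALARS, TEXT, IMAGE, ...
    print([dt.name for dt in AnalyticsDataType])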

Writer
======
Implement the :class:`LogWriter <nvflare.app_common.tracking.log_writer.LogWriter>` interface with the API syntax of your tool. For each tool, we mimic the API syntax of the underlying tool,
so users can use what they are familiar with without learning a new API.
For example, for TensorBoard, TBWriter uses add_scalar() and add_scalars(); for MLflow, the syntax is
log_metric(), log_metrics(), log_parameter(), and log_parameters(); for W&B, the writer just has log().
The data collected with these calls will all be sent to the AnalyticsSender to deliver to the FL server.
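
As an illustration, a minimal custom writer might look like the following sketch. The ``track()`` method name
is hypothetical, and the sketch assumes the base class exposes a ``write()`` helper that generates the
metrics event; check :class:`LogWriter <nvflare.app_common.tracking.log_writer.LogWriter>` in your NVFLARE
version for the exact abstract methods to implement:

.. code-block:: python

    from typing import Optional

    from nvflare.apis.analytix import AnalyticsDataType
    from nvflare.app_common.tracking.log_writer import LogWriter


    class MyToolWriter(LogWriter):
        """Sketch of a writer mimicking a hypothetical tool's ``track()`` syntax."""

        def track(self, name: str, value: float, step: Optional[int] = None):
            # write() is assumed to generate the analytix_log_stats event
            # that is later streamed to the FL server
            self.write(tag=name, value=value, data_type=AnalyticsDataType.SCALAR, global_step=step)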

Receiver
========

Implement the :class:`AnalyticsReceiver <nvflare.app_common.widgets.streaming.AnalyticsReceiver>` interface and determine how to represent different sites' logs. In all three implementations
(TensorBoard, MLflow, WandB), each site's log is represented as one run. Depending on the individual tool, the implementation
can be different. For example, for both TensorBoard and MLflow, we create a separate run for each client, mapped to the
site name. In the WandB implementation, we have to leverage multiprocessing and run each site's logging in a separate process.
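
A trivial receiver sketch, assuming the ``initialize``/``save``/``finalize`` methods of
:class:`AnalyticsReceiver <nvflare.app_common.widgets.streaming.AnalyticsReceiver>` and the
``AnalyticsData.from_dxo()`` helper, might simply print each site's metrics:

.. code-block:: python

    from nvflare.apis.analytix import AnalyticsData
    from nvflare.apis.dxo import from_shareable
    from nvflare.apis.fl_context import FLContext
    from nvflare.apis.shareable import Shareable
    from nvflare.app_common.widgets.streaming import AnalyticsReceiver


    class PrintReceiver(AnalyticsReceiver):
        """Sketch of a receiver that prints each site's metrics to stdout."""

        def initialize(self, fl_ctx: FLContext):
            # open connections or create one run per site here
            pass

        def save(self, fl_ctx: FLContext, shareable: Shareable, record_origin: str):
            # record_origin is the site name; use it to separate runs
            data = AnalyticsData.from_dxo(from_shareable(shareable))
            print(f"[{record_origin}] {data.tag} = {data.value}")

        def finalize(self, fl_ctx: FLContext):
            # flush and close tracking-tool resources here
            pass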

*****************
Examples Overview
*****************

The :github_nvflare_link:`experiment tracking examples <examples/advanced/experiment-tracking>`
illustrate how to leverage different writers and receivers. All examples are based upon the hello-pt example.

TensorBoard
===========
The example in the "tensorboard" directory shows how to use the TensorBoard tracking tool (for both the
sender and receiver). See :ref:`tensorboard_streaming` for details.

MLflow
======
Under the "mlflow" directory, the "hello-pt-mlflow" job shows how to use MLflow for tracking with both the MLflow sender
and receiver. The "hello-pt-tb-mlflow" job shows how to use the Tensorboard Sender, while the receiver is MLflow.
See :ref:`experiment_tracking_mlflow` for details.

Weights & Biases
================
Under the :github_nvflare_link:`wandb <examples/advanced/experiment-tracking/wandb>` directory, the
"hello-pt-wandb" job shows how to use Weights and Biases for experiment tracking with
the WandBWriter and WandBReceiver to log metrics.

MONAI Integration
=================

:github_nvflare_link:`Integration with MONAI <integration/monai>` uses the `NVFlareStatsHandler`, a
:class:`LogWriterForMetricsExchanger <nvflare.app_common.tracking.LogWriterForMetricsExchanger>`, to connect to a
:class:`MetricsRetriever <nvflare.app_common.metrics_exchange.MetricsRetriever>`. See the job
:github_nvflare_link:`spleen_ct_segmentation_local <integration/monai/examples/spleen_ct_segmentation_local/jobs/spleen_ct_segmentation_local>`
for more details on this configuration.
211 changes: 211 additions & 0 deletions docs/programming_guide/experiment_tracking/experiment_tracking_apis.rst
@@ -0,0 +1,211 @@
.. _experiment_tracking_apis:

########################
Experiment Tracking APIs
########################

.. figure:: ../../resources/experiment_tracking_diagram.png
    :height: 500px

To track training metrics such as accuracy, loss, or AUC, we need to log these metrics with one of the experiment tracking systems.
Here we will discuss the following topics:

- Logging metrics with MLflow, TensorBoard, or Weights & Biases
- Streaming metrics to the FL server
- Streaming metrics to FL clients

Logging metrics with MLflow, TensorBoard, or Weights & Biases
=============================================================

Integrate MLflow logging to efficiently stream metrics to the MLflow server with just three lines of code:

.. code-block:: python

    from nvflare.client.tracking import MLflowWriter

    mlflow = MLflowWriter()

    mlflow.log_metric("loss", running_loss / 2000, global_step)

In this setup, we use ``MLflowWriter`` instead of using the MLflow API directly.
This abstraction is important, as it enables users to flexibly redirect their logged metrics to any destination, which we discuss in more detail later.
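
For context, here is a minimal sketch of how the writer might sit inside a Client API training script.
The ``train_step()`` helper is hypothetical; the surrounding calls follow the standard Client API pattern
(``flare.init()``, ``flare.receive()``, ``flare.send()``):

.. code-block:: python

    import nvflare.client as flare
    from nvflare.client.tracking import MLflowWriter

    flare.init()
    writer = MLflowWriter()

    while flare.is_running():
        input_model = flare.receive()  # global model from the FL server
        for step in range(100):
            loss = train_step(input_model)  # hypothetical local training step
            writer.log_metric("loss", loss, step)
        flare.send(input_model)  # return the locally updated model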

The use of MLflow, TensorBoard, or Weights & Biases syntax will all work to stream the collected metrics to any supported experiment tracking system.
Choosing between ``TBWriter``, ``MLflowWriter``, and ``WandBWriter`` is a matter of user preference, based on your existing code and requirements.

- ``MLflowWriter`` uses the MLflow API operation syntax ``log_metric()``
- ``TBWriter`` uses the TensorBoard SummaryWriter operation ``add_scalar()``
- ``WandBWriter`` uses the Weights & Biases API operation ``log()``
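
To illustrate, the same loss value could be logged through any of the three writers. This sketch assumes all
three writer classes are importable from ``nvflare.client.tracking``:

.. code-block:: python

    from nvflare.client.tracking import MLflowWriter, TBWriter, WandBWriter

    step, loss = 10, 0.25

    MLflowWriter().log_metric("loss", loss, step)  # MLflow-style syntax
    TBWriter().add_scalar("loss", loss, step)  # TensorBoard-style syntax
    WandBWriter().log({"loss": loss}, step)  # W&B-style syntax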

Here are the APIs:

.. code-block:: python

    class TBWriter(LogWriter):
        def add_scalar(self, tag: str, scalar: float, global_step: Optional[int] = None, **kwargs): ...
        def add_scalars(self, tag: str, scalars: dict, global_step: Optional[int] = None, **kwargs): ...


    class WandBWriter(LogWriter):
        def log(self, metrics: Dict[str, float], step: Optional[int] = None): ...


    class MLflowWriter(LogWriter):
        def log_param(self, key: str, value: any) -> None: ...
        def log_params(self, values: dict) -> None: ...
        def log_metric(self, key: str, value: float, step: Optional[int] = None) -> None: ...
        def log_metrics(self, metrics: Dict[str, float], step: Optional[int] = None) -> None: ...
        def log_text(self, text: str, artifact_file_path: str) -> None: ...
        def set_tag(self, key: str, tag: any) -> None: ...
        def set_tags(self, tags: dict) -> None: ...

After you've modified the training code, you can use NVFlare's job configuration to configure the system to stream the logs appropriately.

Streaming metrics to the FL server
==================================

All metric key values are captured as events, with the flexibility to stream them to the most suitable destinations.
Let's add the ``ConvertToFedEvent`` widget to convert these metric events to federated events so they will be sent to the server.

Add this component to ``config_fed_client.json``:

.. code-block:: json

    {
        "id": "event_to_fed",
        "name": "ConvertToFedEvent",
        "args": {"events_to_convert": ["analytix_log_stats"], "fed_event_prefix": "fed."}
    }

If using the subprocess Client API with the ``ClientAPILauncherExecutor`` (rather than the in-process Client API with the ``InProcessClientAPIExecutor``),
we need to add the ``MetricRelay`` to fire fed events, a ``CellPipe`` for metrics, and an ``ExternalConfigurator`` for Client API initialization.

.. code-block:: yaml

    {
        id = "metric_relay"
        path = "nvflare.app_common.widgets.metric_relay.MetricRelay"
        args {
            pipe_id = "metrics_pipe"
            event_type = "fed.analytix_log_stats"
            read_interval = 0.1
        }
    },
    {
        id = "metrics_pipe"
        path = "nvflare.fuel.utils.pipe.cell_pipe.CellPipe"
        args {
            mode = "PASSIVE"
            site_name = "{SITE_NAME}"
            token = "{JOB_ID}"
            root_url = "{ROOT_URL}"
            secure_mode = "{SECURE_MODE}"
            workspace_dir = "{WORKSPACE}"
        }
    },
    {
        id = "config_preparer"
        path = "nvflare.app_common.widgets.external_configurator.ExternalConfigurator"
        args {
            component_ids = ["metric_relay"]
        }
    }

On the server, configure the experiment tracking system in ``config_fed_server.conf`` using one of the following receivers.
Note that any of the receivers can be used regardless of which writer is used.

- ``MLflowReceiver`` for MLflow
- ``TBAnalyticsReceiver`` for TensorBoard
- ``WandBReceiver`` for Weights & Biases

For example, here we add the ``MLflowReceiver`` component to the components configuration array:

.. code-block:: json

    {
        "id": "mlflow_receiver_with_tracking_uri",
        "path": "nvflare.app_opt.tracking.mlflow.mlflow_receiver.MLflowReceiver",
        "args": {
            "tracking_uri": "file:///{WORKSPACE}/{JOB_ID}/mlruns",
            "kwargs": {
                "experiment_name": "hello-pt-experiment",
                "run_name": "hello-pt-with-mlflow",
                "experiment_tags": {
                    "mlflow.note.content": "markdown for the experiment"
                },
                "run_tags": {
                    "mlflow.note.content": "markdown describes details of experiment"
                }
            },
            "artifact_location": "artifacts"
        }
    }

Notice that the ``args`` are user-defined (for example, ``tracking_uri``, ``experiment_name``, and tags) and will be specific to the receiver being configured.

The MLflow tracking URI argument ``tracking_uri`` is ``None`` by default, which uses the MLflow default URL, ``http://localhost:5000``.
To make this accessible from another machine, make sure to change it to the correct URL, or point it to the ``mlruns`` directory in the workspace:

::

    tracking_uri = <the MLflow server endpoint URL>

::

    tracking_uri = "file:///{WORKSPACE}/{JOB_ID}/mlruns"

You can change other arguments such as the experiment name, run name, tags (using Markdown syntax), and artifact location.

Start the MLflow server with one of the following commands:

::

    mlflow server --host 127.0.0.1 --port 5000

::

    mlflow ui --port 5000

For more information with an example walkthrough, see the :github_nvflare_link:`FedAvg with SAG with MLflow tutorial <examples/hello-world/step-by-step/cifar10/sag_mlflow/sag_mlflow.ipynb>`.


Streaming metrics to FL clients
===============================

If streaming metrics to the FL server isn't preferred due to privacy or other concerns, users can alternatively stream metrics to the FL client.
In this case, there's no need to add the ``ConvertToFedEvent`` component on the client side.
Additionally, since we're not streaming to the server side, there's no need to configure receivers in the server configuration.

Instead, to receive records on the client side, configure the metrics receiver in the client configuration rather than the server configuration.

For example, in the TensorBoard configuration, add this component to ``config_fed_client.conf``:

.. code-block:: json

    {
        "id": "tb_analytics_receiver",
        "name": "TBAnalyticsReceiver",
        "args": {"events": ["analytix_log_stats"]}
    }

Note that the ``events`` argument is ``analytix_log_stats``, not ``fed.analytix_log_stats``, indicating that this is a local event.

If using the ``MetricRelay`` component, we can similarly change its ``event_type`` value from ``fed.analytix_log_stats`` to ``analytix_log_stats`` to follow this convention.
We then must set the ``MetricRelay`` argument ``fed_event`` to ``false`` so it fires local events rather than the default fed events.

.. code-block:: yaml

    {
        id = "metric_relay"
        path = "nvflare.app_common.widgets.metric_relay.MetricRelay"
        args {
            pipe_id = "metrics_pipe"
            event_type = "analytix_log_stats"
            # how fast should it read from the peer
            read_interval = 0.1
            fed_event = false
        }
    },

Then, the metrics will stream to the client.