cifar10 example update for 2.1 (NVIDIA#497)
* Update cifar10 for 2.1

update best_acc

migrating to job configs

use number of gpus in resource

test admin api scripts

run data splitting on the fly as server-side handler

add data download step to readme

load valid set only at run time (validate function)

submit jobs using admin api

build in check for admin login

explicitly point to model when using FedOpt

train for full rounds

make alpha configurable automatically

refactor create dataset call

submit jobs with different alpha values

submit jobs using absolute path

update readme

update plot script

update readme doc urls, set max_jobs=2

* remove unused variables from shell scripts

Co-authored-by: YuanTingHsieh <yuantinghsieh@gmail.com>
holgerroth and YuanTingHsieh authored May 9, 2022
1 parent 739479c commit 8a68bce
Showing 41 changed files with 1,129 additions and 450 deletions.
111 changes: 70 additions & 41 deletions examples/cifar10/README.md
@@ -23,14 +23,21 @@ pip install -r ./virtualenv/min-requirements.txt
pip install -r ./virtualenv/plot-requirements.txt
```

## 2. Create your FL workspace and start FL system

The next scripts will start the FL server and 8 clients automatically to run FL experiments on localhost.
In this example, we run all 8 clients on one GPU with 12 GB memory per job.

### 2.1 POC ("proof of concept") workspace
To run FL experiments in POC mode, create your local FL workspace with the command below.
In the following experiments, we will be using 8 clients. Press "y" and Enter when prompted.
```
./create_poc_workpace.sh 8
```
Then, start the FL system with 8 clients by running
```
./start_fl_poc.sh 8
```

### 2.2 (Optional) Secure FL workspace

@@ -41,11 +48,16 @@ To create the secure workspace, please use the following to build a package and
to `secure_workspace` for later experimentation.
```
cd ./workspaces
python3 -m nvflare.lighter.provision -p ./secure_project.yml
cp -r ./workspace/secure_project/prod_00 ./secure_workspace
cd ..
```
For more information about secure provisioning see the [documentation](https://nvflare.readthedocs.io/en/dev-2.1/user_guide/provisioning_tool.html).

To start the FL system with 8 clients in the secure workspace, run
```
./start_fl_secure.sh 8
```

> **_NOTE:_** **POC** stands for "proof of concept" and is used for quick experimentation
> with different numbers of clients.
@@ -55,42 +67,59 @@
> homomorphic encryption (HE) one shown below. These startup kits allow secure deployment of FL in real-world scenarios
> using SSL-certified communication channels.
### Multi-tasking
In this example, we assume two local GPUs with at least 12GB of memory are available.
Hence, in the secure project configuration [./workspaces/secure_project.yml](./workspaces/secure_project.yml),
we set the available GPU indices as `gpu: [0, 1]` using the `ListResourceManager` and `max_jobs: 2` in `DefaultJobScheduler`.

For the POC workspace, adjust the default values in `nvflare/poc/client/startup/fed_client.json`
and `nvflare/poc/server/startup/fed_server.json` in your NVFlare installation.

For details, please refer to the [documentation](https://nvflare.readthedocs.io/en/dev-2.1/user_guide/job.html).
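
As a rough illustration, adjusting such a default programmatically could look like the following sketch (the file path comes from the paragraph above, but the top-level `max_jobs` key is a hypothetical stand-in; consult the linked documentation for the actual schema):
```
# Illustrative sketch only: the key layout of fed_server.json is an assumption,
# check the job documentation linked above for the actual schema.
import json

path = "nvflare/poc/server/startup/fed_server.json"
with open(path, "r") as f:
    cfg = json.load(f)

cfg["max_jobs"] = 2  # hypothetical key: allow two jobs to run concurrently
with open(path, "w") as f:
    json.dump(cfg, f, indent=4)
```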

### 2.3 Download the CIFAR-10 dataset
To speed up the following experiments, first download the [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset:
```
python3 ./pt/utils/cifar10_download_data.py
```
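
For reference, the download step likely amounts to fetching the torchvision copy of CIFAR-10 into a shared location; a minimal sketch (assuming torchvision is installed and `/tmp/cifar10_data` as used by the job configs) could look like:
```
# Minimal sketch of the download step; the actual logic lives in
# pt/utils/cifar10_download_data.py.
from torchvision import datasets

# Fetch train and test sets once so later FL runs read from the cache.
datasets.CIFAR10(root="/tmp/cifar10_data", train=True, download=True)
datasets.CIFAR10(root="/tmp/cifar10_data", train=False, download=True)
```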

## 3. Run automated experiments

Next, we will submit jobs to start FL training automatically.

The [submit_job.sh](./submit_job.sh) script follows this pattern:
```
./submit_job.sh [config] [alpha]
```
If you want to use the POC workspace, append `--poc` to this command, e.g.:
```
./submit_job.sh [config] [alpha] --poc
```

During the run, the server will split the CIFAR-10 dataset on the fly to simulate each client having a different data distribution.

The `config` argument controls which experiment job to submit.
The respective folder under `job_configs` will be submitted using the admin API with [submit_job.py](./submit_job.py) for scheduling.
The admin API script ([submit_job.py](./submit_job.py)) also overwrites the alpha value inside the
job configuration file depending on the provided command-line argument.
Jobs will be executed automatically depending on the available resources at each client (see "Multi-tasking" section).
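
As a rough illustration of the alpha overwrite, the core step might look like the sketch below (a hypothetical helper; the actual logic, including the admin-API submission, lives in [submit_job.py](./submit_job.py)):
```
# Hypothetical sketch: overwrite `alpha` in a job's server config before
# submission. The folder layout shown here is an assumption.
import json
import os

def set_alpha(job_folder, app_name, alpha):
    config_path = os.path.join(job_folder, app_name, "config", "config_fed_server.json")
    with open(config_path, "r") as f:
        server_config = json.load(f)
    server_config["alpha"] = alpha  # top-level key, cf. the plotting script
    with open(config_path, "w") as f:
        json.dump(server_config, f, indent=4)

set_alpha("./job_configs/cifar10_fedavg", "cifar10_fedavg", 0.1)
```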

### 3.1 Varying the heterogeneity of data splits

We use an implementation to generate heterogeneous data splits from CIFAR-10 based on a Dirichlet sampling strategy
from FedMA (https://github.com/IBM/FedMA), where `alpha` controls the amount of heterogeneity;
see [Wang et al.](https://arxiv.org/abs/2002.06440).
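
For intuition, the core of such a Dirichlet split can be sketched in a few lines of NumPy (an illustrative sketch, not the server-side handler used in this example):
```
# Illustrative Dirichlet label split; smaller alpha -> more heterogeneity.
import numpy as np

def dirichlet_split(labels, n_clients, alpha):
    """Partition sample indices across clients via per-class Dirichlet draws."""
    client_idxs = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx_c = np.where(labels == c)[0]
        np.random.shuffle(idx_c)
        props = np.random.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx_c)).astype(int)
        for i, split in enumerate(np.split(idx_c, cuts)):
            client_idxs[i].extend(split.tolist())
    return client_idxs
```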

### 3.2 Centralized training

To simulate a centralized training baseline, we run FL with 1 client for 25 local epochs but only for one round.
It takes about 6 minutes on an NVIDIA TitanX GPU.
```
./submit_job.sh cifar10_central 0.0
```
Note that `alpha=0.0` here means no heterogeneous data splits are generated.

You can visualize the training progress by running `tensorboard --logdir=[workspace]/.`
![Central training curve](./figs/central_training.png)

### 3.3 FedAvg on different data splits
@@ -101,73 +130,73 @@ Each run will take about 35 minutes, depending on your system.

You can copy the whole block into the terminal, and it will execute each experiment one after the other.
```
./submit_job.sh cifar10_fedavg 1.0
./submit_job.sh cifar10_fedavg 0.5
./submit_job.sh cifar10_fedavg 0.3
./submit_job.sh cifar10_fedavg 0.1
```

> **_NOTE:_** You can always use the admin console to manually abort the automatically started runs
> using `abort_job [RUN_ID]`.
> For a complete list of admin commands, see [here](https://nvflare.readthedocs.io/en/dev-2.1/user_guide/admin_commands.html).
> To log into the POC workspace admin console, use username "admin" and password "admin".
> For the secure workspace admin console, use username "admin@nvidia.com"
After training, each client's best model will be used for cross-site validation. The results can be shown with, for example:
```
cat ./workspaces/poc_workspace/server/[RUN_ID]/cross_site_val/cross_site_val.json
```
where [RUN_ID] is the ID assigned by the system when submitting the job.
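
To tabulate those scores programmatically, a small script along these lines works (assuming the JSON layout `results[site][model]["val_accuracy"]`, which matches what the plotting script below reads):
```
import json

# Path is illustrative; point it at the cross_site_val.json from the run above.
with open("cross_site_val.json", "r") as f:
    results = json.load(f)

for site, models in results.items():
    for model, metrics in models.items():
        print(f"{site} -> {model}: val_accuracy={metrics['val_accuracy']:.4f}")
```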

### 3.4 Advanced FL algorithms (FedProx, FedOpt, and SCAFFOLD)

Next, let's try some different FL algorithms on a more heterogeneous split:

[FedProx](https://arxiv.org/abs/1812.06127) adds a regularizer to the loss used in `CIFAR10Learner` (`fedproxloss_mu`):
```
./submit_job.sh cifar10_fedprox 0.1
```
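
Conceptually, the FedProx term adds `mu/2 * ||w - w_global||^2` to the task loss; a PyTorch sketch of just that term (illustrative, not the `CIFAR10Learner` code):
```
# Illustrative FedProx proximal term, weighted by `fedproxloss_mu` in the
# actual CIFAR10Learner implementation.
import torch

def fedprox_penalty(model, global_model, mu):
    """Return (mu/2) * ||w - w_global||^2, to be added to the task loss."""
    penalty = 0.0
    for w, w_g in zip(model.parameters(), global_model.parameters()):
        penalty = penalty + ((w - w_g.detach()) ** 2).sum()
    return 0.5 * mu * penalty
```
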
[FedOpt](https://arxiv.org/abs/2003.00295) uses a new ShareableGenerator to update the global model on the server using a PyTorch optimizer.
Here, SGD with momentum and cosine learning rate decay is used:
```
./submit_job.sh cifar10_fedopt 0.1
```
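
The server-side step is conceptually an optimizer update that treats the aggregated model delta as a pseudo-gradient; a sketch under that assumption (the model and hyperparameter values are illustrative, not taken from the `cifar10_fedopt` job config):
```
# Illustrative FedOpt-style server update with SGD + cosine decay.
import torch

num_rounds = 50
global_model = torch.nn.Linear(3072, 10)  # stand-in for the global model
opt = torch.optim.SGD(global_model.parameters(), lr=1.0, momentum=0.9)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=num_rounds)

def apply_round(aggregated_delta):
    """Step the server optimizer along the aggregated client update."""
    opt.zero_grad()
    for p, d in zip(global_model.parameters(), aggregated_delta):
        p.grad = -d.detach()  # negative delta, since optimizers minimize
    opt.step()
    sched.step()
```
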
[SCAFFOLD](https://arxiv.org/abs/1910.06378) uses a slightly modified version of the CIFAR-10 Learner implementation, namely the `CIFAR10ScaffoldLearner`, which adds a correction term during local training following the [implementation](https://github.com/Xtra-Computing/NIID-Bench) as described in [Li et al.](https://arxiv.org/abs/2102.02079):
```
./submit_job.sh cifar10_scaffold 0.1
```
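
In essence, SCAFFOLD replaces each plain local step `w <- w - lr * g` with a corrected one using server (`c_global`) and client (`c_local`) control variates; a sketch of that correction (illustrative, see `CIFAR10ScaffoldLearner` for the real implementation):
```
# Illustrative SCAFFOLD-corrected local SGD step.
import torch

@torch.no_grad()
def scaffold_step(params, grads, c_global, c_local, lr):
    """w <- w - lr * (g - c_local + c_global) for each parameter tensor."""
    for w, g, c, c_i in zip(params, grads, c_global, c_local):
        w -= lr * (g - c_i + c)
```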

### 3.5 Secure aggregation using homomorphic encryption

Next, we run FedAvg using homomorphic encryption (HE) for secure aggregation on the server in a non-heterogeneous setting (`alpha=1`).

> **_NOTE:_** For HE, we need to use the securely provisioned workspace.
> It will also take longer due to the additional encryption, decryption, encrypted aggregation,
> and increased encrypted message sizes involved.
FedAvg with HE:
```
./submit_job.sh cifar10_fedavg_he 1.0
```

> **_NOTE:_** Currently, FedOpt is not supported with HE as it would involve running the optimizer on encrypted values.
### 3.6 Running all examples

You can use `./run_experiments.sh` to submit all above-mentioned experiments at once if preferred.
This script uses the secure workspace to also support the HE experiment.

## 4. Results

Let's summarize the result of the experiments run above. First, we will compare the final validation scores of
the global models for different settings. In this example, all clients compute their validation scores using the
same CIFAR-10 test set. The plotting script used for the below graphs is in
[./figs/plot_tensorboard_events.py](./figs/plot_tensorboard_events.py)
(please install [./virtualenv/plot-requirements.txt](./virtualenv/plot-requirements.txt)).

### 4.1 Central vs. FedAvg
With a data split using `alpha=1.0`, i.e. a non-heterogeneous split, we achieve the following final validation scores.
@@ -216,6 +245,6 @@ to update the global model on the server. Both achieve better performance with t

In a real-world scenario, the researcher won't have access to the TensorBoard events of the individual clients. In order to visualize the training performance in a central place, `AnalyticsSender`, `ConvertToFedEvent` on the client, and `TBAnalyticsReceiver` on the server can be used. For an example using FedAvg and metric streaming during training, run:
```
./submit_job.sh cifar10_fedavg_stream_tb 1.0
```
Using this configuration, a `tb_events` folder will be created under the `[RUN_ID]` folder of the server that includes all the TensorBoard event values of the different clients.
71 changes: 56 additions & 15 deletions examples/cifar10/figs/plot_tensorboard_events.py
@@ -21,31 +21,65 @@
import seaborn as sns
import tensorflow as tf

# secure workspace
client_results_root = "./workspaces/secure_workspace/site-1"
server_results_root = "./workspaces/secure_workspace/localhost"

# poc workspace
# client_results_root = "./workspaces/poc_workspace/site-1"
# server_results_root = "./workspaces/poc_workspace/server"

# 4.1 Central vs. FedAvg
experiments = {
"cifar10_central": {"run": "run_1", "tag": "val_acc_local_model"},
"cifar10_fedavg": {"run": "run_2", "tag": "val_acc_global_model"},
"cifar10_fedavg_he": {"run": "run_9", "tag": "val_acc_global_model"},
"cifar10_central": {"tag": "val_acc_local_model"},
"cifar10_fedavg": {"tag": "val_acc_global_model", "alpha": 1.0},
"cifar10_fedavg_he": {"tag": "val_acc_global_model", "alpha": 1.0},
}

# # 4.2 Impact of client data heterogeneity
# experiments = {"cifar10_fedavg (alpha=1.0)": {"run": "run_2", "tag": "val_acc_global_model"},
# "cifar10_fedavg (alpha=0.5)": {"run": "run_3", "tag": "val_acc_global_model"},
# "cifar10_fedavg (alpha=0.3)": {"run": "run_4", "tag": "val_acc_global_model"},
# "cifar10_fedavg (alpha=0.1)": {"run": "run_5", "tag": "val_acc_global_model"}}
#
# experiments = {"cifar10_fedavg (alpha=1.0)": {"tag": "val_acc_global_model", "alpha": 1.0},
# "cifar10_fedavg (alpha=0.5)": {"tag": "val_acc_global_model", "alpha": 0.5},
# "cifar10_fedavg (alpha=0.3)": {"tag": "val_acc_global_model", "alpha": 0.3},
# "cifar10_fedavg (alpha=0.1)": {"tag": "val_acc_global_model", "alpha": 0.1}
# }

# # 4.3 FedProx vs. FedOpt vs. SCAFFOLD
# experiments = {"cifar10_fedavg": {"run": "run_5", "tag": "val_acc_global_model"},
# "cifar10_fedprox": {"run": "run_6", "tag": "val_acc_global_model"},
# "cifar10_fedopt": {"run": "run_7", "tag": "val_acc_global_model"},
# "cifar10_scaffold": {"run": "run_8", "tag": "val_acc_global_model"}}
# experiments = {"cifar10_fedavg": {"tag": "val_acc_global_model", "alpha": 0.1},
# "cifar10_fedprox": {"tag": "val_acc_global_model", "alpha": 0.1},
# "cifar10_fedopt": {"tag": "val_acc_global_model", "alpha": 0.1},
# "cifar10_scaffold": {"tag": "val_acc_global_model", "alpha": 0.1}
# }

add_cross_site_val = True


def find_run_number(workdir, fl_app_name="cifar10_fedavg", alpha=None):
"""Find the first matching experiment"""
# TODO: return several experiment run_numbers with matching settings
fl_app_files = glob.glob(os.path.join(workdir, "**", "fl_app.txt"), recursive=True)
assert len(fl_app_files) > 0, f"No `fl_app.txt` files found in workdir={workdir}."
for fl_app_file in fl_app_files:
with open(fl_app_file, "r") as f:
_fl_app_name = f.read()
if fl_app_name in _fl_app_name: # alpha will be matched based on value in config file
run_number = os.path.basename(os.path.dirname(fl_app_file))
if alpha is not None:
config_fed_server_file = glob.glob(
os.path.join(os.path.dirname(fl_app_file), "**", "config_fed_server.json"), recursive=True
)
assert (
len(config_fed_server_file) == 1
), f"No unique server config found in {os.path.dirname(fl_app_file)}"
with open(config_fed_server_file[0], "r") as f:
server_config = json.load(f)
_alpha = server_config["alpha"]
if _alpha == alpha:
return run_number
else:
return run_number
raise ValueError(f"No run number found for fl_app_name={fl_app_name} in workdir={workdir}")


def read_eventfile(filepath, tags=["val_acc_global_model"]):
data = {}
for summary in tf.compat.v1.train.summary_iterator(filepath):
Expand Down Expand Up @@ -86,23 +120,30 @@ def main():

# add event files
for config, exp in experiments.items():
config_name = config.split(" ")[0]
alpha = exp.get("alpha", None)
run_number = find_run_number(workdir=server_results_root, fl_app_name=config_name, alpha=alpha)
print(f"Found run {run_number} for {config_name} with alpha={alpha}")
eventfile = glob.glob(os.path.join(client_results_root, run_number, "**", "events.*"), recursive=True)
assert len(eventfile) == 1, "No unique event file found!"
eventfile = eventfile[0]
print("adding", eventfile)
add_eventdata(data, config, eventfile, tag=exp["tag"])

if add_cross_site_val:
xsite_file = glob.glob(
os.path.join(server_results_root, run_number, "**", "cross_val_results.json"), recursive=True
)
assert len(xsite_file) == 1, "No unique x-site file found!"
with open(xsite_file[0], "r") as f:
xsite_results = json.load(f)

xsite_data["Config"].append(config)
for k in xsite_keys:
try:
    xsite_data[k].append(xsite_results["site-1"][k]["val_accuracy"])
except Exception as e:
    raise ValueError(f"No val_accuracy for {k} in {xsite_file}!") from e

print("Training TB data:")
print(pd.DataFrame(data))
@@ -1,8 +1,6 @@
{
"format_version": 2,

"DATASET_ROOT": "/tmp/cifar10_data",

"executors": [
{
"tasks": [
@@ -28,7 +26,6 @@
"id": "cifar10-learner",
"path": "pt.learners.cifar10_learner.CIFAR10Learner",
"args": {
"dataset_root": "{DATASET_ROOT}",
"aggregation_epochs": 25,
"lr": 1e-2,
"central": true
Expand Up @@ -32,7 +32,7 @@
},
{
"id": "model_selector",
"name": "IntimeModelSelectionHandler",
"name": "IntimeModelSelector",
"args": {}
},
{
19 changes: 19 additions & 0 deletions examples/cifar10/job_configs/cifar10_central/meta.json
@@ -0,0 +1,19 @@
{
"study_name": "",
"name": "cifar10_central",
"resource_spec": {
"site-1": {
"gpu": 1
}
},
"deploy_map": {
"cifar10_central": [
"server",
"site-1"
]
},
"min_clients": 1,
"mandatory_clients": [
"site-1"
]
}
@@ -1,7 +1,7 @@
{
"format_version": 2,

"DATASET_ROOT": "/tmp/cifar10_data",
"TRAIN_SPLIT_ROOT": "/tmp/cifar10_splits",

"executors": [
{
@@ -28,7 +28,7 @@
"id": "cifar10-learner",
"path": "pt.learners.cifar10_learner.CIFAR10Learner",
"args": {
"dataset_root": "{DATASET_ROOT}",
"train_idx_root": "{TRAIN_SPLIT_ROOT}",
"aggregation_epochs": 4,
"lr": 1e-2
}
