From 8d5c2b77918b34b07a6389c6121d6d52252d3587 Mon Sep 17 00:00:00 2001
From: Corentin Henry
Date: Wed, 26 Feb 2020 09:46:05 +0100
Subject: [PATCH] PB-314: document the new weight exchange mechanism (#308)

---
 docs/network_architecture.rst | 90 +++++++++++++++++++++++++++++------
 1 file changed, 76 insertions(+), 14 deletions(-)

diff --git a/docs/network_architecture.rst b/docs/network_architecture.rst
index 230cbde00..e2f5205e5 100644
--- a/docs/network_architecture.rst
+++ b/docs/network_architecture.rst
@@ -53,6 +53,74 @@ Federated Machine Learning Flow
       b. Resume the task
 7. Once all rounds are completed the *Coordinator* can just exit
 
+Model weights distribution
+--------------------------
+
+Models can be massive (several dozen megabytes), and protobuf is
+not suited for exchanging such data. Instead, the participants and the
+coordinator use an S3 bucket to exchange their weights. The exact
+mechanism is shown in the sequence diagram below.
+
+At the beginning of a round (1), each selected participant sends a
+``StartTrainingRoundRequest``.
+
+Once it receives a response, the participant fetches the weights for
+the current global model from the S3 store (2). S3 buckets are
+key-value stores, and the key for global weights is
+``/global`` (``round/global`` in the diagram).
+
+Then, the participant trains. Once done, it uploads its local weights to
+the S3 bucket (3). The key is ``/`` (``round/participant`` in the diagram).
+
+Finally (4), the participant sends its ``EndTrainingRoundRequest``. Before
+answering, the coordinator retrieves the local weights the participant
+has uploaded.
+
+.. code::
+
+        P                                C                      Store
+    1.  | StartTrainingRoundRequest      |                        |
+        | -----------------------------> |                        |
+        | StartTrainingRoundResponse     |                        |
+        | <----------------------------- |                        |
+        |                                |                        |
+        |         Get global weights (key="round/global")         |
+    2.  | ------------------------------------------------------> |
+        |                     Global weights                      |
+        | <------------------------------------------------------ |
+        |                                |                        |
+        | [train...]                     |                        |
+        |                                |                        |
+    3.  |       Set local weights (key="round/participant")       |
+        | ------------------------------------------------------> |
+        |                           Ok                            |
+        | <------------------------------------------------------ |
+        |                                |                        |
+    4.  | EndTrainingRoundRequest        |                        |
+        | -----------------------------> | Get local weights (key="round/participant")
+        |                                | ---------------------> |
+        |                                |     Local weights      |
+        | EndTrainingRoundResponse       | <--------------------- |
+        | <----------------------------- |                        |
+
+At the end of the round, the coordinator writes the global weights to the
+S3 bucket, using the upcoming round number as key (see the sequence
+diagram below).
+
+.. code::
+
+        P                                C                      Store
+        | EndTrainingRoundRequest        |                        |
+        | -----------------------------> | Get local weights (key="round/participant")
+        |                                | ---------------------> |
+        |                                |     Local weights      |
+        | EndTrainingRoundResponse       | <--------------------- |
+        | <----------------------------- |                        |
+        |                                |                        |
+        |                                | Set global weights (key="round+1/global")
+        |                                | ---------------------> |
+        |                                |           Ok           |
+        |                                | <--------------------- |
 
 Coordinator
 -----------
@@ -89,13 +157,13 @@ A **Rendezvous** method that allows *Participants* to register with a
 about the *Participant* in order to keep track of what the *Participant* is
 doing.
 
-A **StartTrainingRound** method that allows *Participants* to get the current global
-model as well as signaling their intent to participate in a given round.
+A **StartTrainingRound** method that allows *Participants* to retrieve
+the current global model and to signal their intent to participate in
+a given round.
 
 An **EndTrainingRound** method that allows *Participants* to submit their
 updated models after they finished their training task.
 
-
 In order to remain agnostic to the machine learning framework *Participants* and
 *Coordinator* exchange models in the form of numpy arrays. How models are
 converted from a particular machine learning framework model into numpy arrays
@@ -347,27 +415,21 @@ where the request and response data are given as the following protobuf messages
 
     message StartTrainingRoundRequest {}
 
+
     message StartTrainingRoundResponse {
-        xain_proto.np.NDArray weights = 1;
-        int32 epochs = 2;
-        int32 epoch_base = 3;
+        int32 epochs = 1;
+        int32 epoch_base = 2;
     }
 
     message EndTrainingRoundRequest {
-        xain_proto.np.NDArray weights = 1;
+        string participant_id = 1;
         int32 number_samples = 2;
-        map metrics = 3;
+        string metrics = 3;
     }
 
     message EndTrainingRoundResponse {}
 
-Note that while most of the Python data types to be exchanged can be
-"protobuf-erized" (and back), :code:`ndarray` requires more work. Fortunately we
-have the
-`xain_proto/np `_
-project to help with this conversion.
-
 Training Round Communication
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^