Skip to content
This repository has been archived by the owner on Aug 30, 2022. It is now read-only.

PB-398 Round resumption in Coordinator #285

Merged
merged 7 commits into from
Feb 14, 2020
Merged

Conversation

finiteprods
Copy link
Contributor

@finiteprods finiteprods commented Feb 11, 2020

References

PB-398

Summary

Adds proper support for round resumption in the Coordinator.

This merge request adds some resilience to the Coordinator in the face of Participant dropouts. In particular, when there is a dropout mid-round, the Coordinator pauses the round by going to standby. When enough Participants connect again, the round is resumed. While the idea of round resumption has been in the Coordinator for some time (and has recently been supported properly in the Participant too), up until now it has not been properly exercised, and this MR fixes the obvious faults, so that:

  • remove_participant previously ignored the collection of selected Participants for a round; now it also takes that into account.
  • _handle_rendezvous previously forced a complete re-selection each time a round resumes; the selection logic has been adjusted to only select from the connected pool while keeping the already selected survivors intact.

In addition:

  • test_select_outstanding tests the above functionality.
  • test_remove_participant has been split into 2 cases to properly test the above.
  • tighter checks in the functions is_finished and get_weight_updates of the Round class.

Are there any open tasks/blockers for the ticket?

This MR does not attempt to enable full fault-tolerance nor repair all the potential concurrency-related problems in the Coordinator e.g. ensuring various state collections are thread-safe, but these will be follow-up tasks going forward.


Reviewer checklist

Reviewer agreement:

  • Reviewers assign themselves at the start of the review.
  • Reviewers do not commit or merge the merge request.
  • Reviewers have to check and mark items in the checklist.

Merge request checklist

  • Conforms to the merge request title naming XP-XXX <a description in imperative form>.
  • Each commit conforms to the naming convention XP-XXX <a description in imperative form>.
  • Linked the ticket in the merge request title or the references section.
  • Added an informative merge request summary.

Code checklist

  • Conforms to the branch naming XP-XXX-<a_small_stub>.
  • Passed scope checks.
  • Added or updated tests if needed.
  • Added or updated code documentation if needed.
  • Conforms to Google docstring style.
  • Conforms to XAIN structlog style.

@finiteprods finiteprods changed the title Pb 398 round resume PB-398 Proper round resumption in Coordinator Feb 11, 2020
@finiteprods finiteprods changed the title PB-398 Proper round resumption in Coordinator PB-398 Round resumption in Coordinator Feb 11, 2020
@finiteprods finiteprods marked this pull request as ready for review February 11, 2020 08:14
@little-dude little-dude self-requested a review February 13, 2020 09:34
Copy link
Contributor

@janpetschexain janpetschexain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on a general note: this mr currently contains quite some changes wich doesn't belong to the scope (like stuff for docker, logging, config) and which make the mr difficult to read. rebasing this feature branch on the latest dev branch should clean that up.

xain_fl/coordinator/coordinator.py Show resolved Hide resolved
xain_fl/coordinator/round.py Outdated Show resolved Hide resolved
little-dude
little-dude previously approved these changes Feb 13, 2020
Copy link
Contributor

@little-dude little-dude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank for your detailed explanations :)
As discussed it would be nice to add a few extra checks, but that looks good to me already.

janpetschexain
janpetschexain previously approved these changes Feb 13, 2020
Copy link
Contributor

@Robert-Steiner Robert-Steiner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During the testing of your PR I got the following error:

xain-fl-dev_1      | Traceback (most recent call last):
xain-fl-dev_1      |   File "/usr/local/bin/coordinator", line 11, in <module>
xain-fl-dev_1      |     load_entry_point('xain-fl', 'console_scripts', 'coordinator')()
xain-fl-dev_1      |   File "/app/xain_fl/__main__.py", line 52, in main
xain-fl-dev_1      | ['ipv4:172.22.0.1:35554' 'ipv4:172.22.0.1:35544' 'ipv4:172.22.0.1:35548']
xain-fl-dev_1      |     serve(coordinator=coordinator, server_config=config.server)
xain-fl-dev_1      |   File "/app/xain_fl/serve.py", line 46, in serve
xain-fl-dev_1      |     monitor_heartbeats(coordinator, terminate_event)
xain-fl-dev_1      |   File "/app/xain_fl/coordinator/heartbeat.py", line 46, in monitor_heartbeats
xain-fl-dev_1      |     coordinator.remove_participant(participant_id)
xain-fl-dev_1      |   File "/app/xain_fl/coordinator/coordinator.py", line 233, in remove_participant
xain-fl-dev_1      |     self.round.remove_selected(participant_id)
xain-fl-dev_1      |   File "/app/xain_fl/coordinator/round.py", line 43, in remove_selected
xain-fl-dev_1      |     self.participant_ids.remove(participant_id)
xain-fl-dev_1      | AttributeError: 'numpy.ndarray' object has no attribute 'remove'

What I did is:

  • Set min_participants = 3
  • start 3 participants
  • kill one participant after the first round

I will check why it happens.

@Robert-Steiner
Copy link
Contributor

Robert-Steiner commented Feb 14, 2020

During the testing of your PR I got the following error:

xain-fl-dev_1      | Traceback (most recent call last):
xain-fl-dev_1      |   File "/usr/local/bin/coordinator", line 11, in <module>
xain-fl-dev_1      |     load_entry_point('xain-fl', 'console_scripts', 'coordinator')()
xain-fl-dev_1      |   File "/app/xain_fl/__main__.py", line 52, in main
xain-fl-dev_1      | ['ipv4:172.22.0.1:35554' 'ipv4:172.22.0.1:35544' 'ipv4:172.22.0.1:35548']
xain-fl-dev_1      |     serve(coordinator=coordinator, server_config=config.server)
xain-fl-dev_1      |   File "/app/xain_fl/serve.py", line 46, in serve
xain-fl-dev_1      |     monitor_heartbeats(coordinator, terminate_event)
xain-fl-dev_1      |   File "/app/xain_fl/coordinator/heartbeat.py", line 46, in monitor_heartbeats
xain-fl-dev_1      |     coordinator.remove_participant(participant_id)
xain-fl-dev_1      |   File "/app/xain_fl/coordinator/coordinator.py", line 233, in remove_participant
xain-fl-dev_1      |     self.round.remove_selected(participant_id)
xain-fl-dev_1      |   File "/app/xain_fl/coordinator/round.py", line 43, in remove_selected
xain-fl-dev_1      |     self.participant_ids.remove(participant_id)
xain-fl-dev_1      | AttributeError: 'numpy.ndarray' object has no attribute 'remove'

What I did is:

* Set `min_participants = 3`

* start 3 participants

* kill one participant after the first round

I will check why it happens.

The issues comes from RandomController
It returns a numpy array:

In controller.py

def select_ids(self, participant_ids: List[str]) -> List[str]:
...
        return np.random.choice(participant_ids, size=num_ids_to_select, replace=False)

Is called in coordinator.py line 242

selected_ids = self.controller.select_ids(self.participants.ids())
self.round = Round(selected_ids)

Copy link
Contributor

@Robert-Steiner Robert-Steiner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works👍

@finiteprods finiteprods merged commit 9d75daa into development Feb 14, 2020
@finiteprods finiteprods deleted the PB-398-round_resume branch February 14, 2020 11:43
little-dude pushed a commit that referenced this pull request Feb 25, 2020
* PB-398 add/remove selected in round

* PB-398 order controller for selecting init segment of sorted

* PB-398 split test_remove_participants; test_select_outstanding

* PB-398 fix remove_participant; fix _handle_rendezvous

* PB-398 preconditions; remove from list suggestion

* PB-398 black format

* PB-398 fix random controller non-list
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants