Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chainSync: Avoid reputation bans for slower timed out peer block requests #4987

Closed
wants to merge 4 commits into from

Conversation

lexnv
Copy link
Contributor

@lexnv lexnv commented Jul 9, 2024

lexnv added 4 commits July 8, 2024 21:28
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv added A1-insubstantial Pull request requires no code review (e.g., a sub-repository hash update). R0-silent Changes should not be mentioned in any release notes I5-enhancement An additional feature request. D1-medium Can be fixed by a coder with good Rust knowledge but little knowledge of the codebase. labels Jul 9, 2024
@lexnv lexnv self-assigned this Jul 9, 2024
@paritytech-cicd-pr
Copy link

The CI pipeline was cancelled due to failure one of the required jobs.
Job name: cargo-clippy
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6663244

@lexnv
Copy link
Contributor Author

lexnv commented Jul 16, 2024

Superseded by: #5029

  • the new PR handles the behavior for all sync strategies as opposed to just chainSync

@lexnv lexnv closed this Jul 16, 2024
github-merge-queue bot pushed a commit that referenced this pull request Aug 5, 2024
…g the same request multiple times (#5029)

This PR avoids submitting the same block or state request multiple times
to the same slow peer.

Previously, we submitted the same request to the same slow peer, which
resulted in reputation bans on the slow peer side.
Furthermore, the strategy selected the same slow peer multiple times to
submit queries to, although a better candidate may exist.

Instead, in this PR we:
- introduce a `DisconnectedPeers` via LRU with 512 peer capacity to only
track the state of disconnected peers with a request in flight
- when the `DisconnectedPeers` detects a peer disconnected with a
request in flight, the peer is backed off
  - on the first disconnection: 60 seconds
  - on second disconnection: 120 seconds
- on the third disconnection the peer is banned, and the peer remains
banned until the peerstore decays its reputation
  
This PR lifts the pressure from overloaded nodes that cannot process
requests in due time.
And if a peer is detected to be slow after backoffs, the peer is banned.

Theoretically, submitting the same request multiple times can still
happen when:
- (a) we backoff and ban the peer 
- (b) the network does not discover other peers -- this may also be a
test net
- (c) the peer gets reconnected after the reputation decay and is still
slow to respond



Aims to improve:
- #4924
- #531

Next Steps:
- Investigate the network after this is deployed, possibly bumping the
keep-alive timeout or seeing if there's something else misbehaving




This PR builds on top of:
- #4987


### Testing Done
- Added a couple of unit tests where test harness were set in place

- Local testnet

```bash
13:13:25.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Added first time peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD

13:14:39.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 2, last_disconnect: Instant { tv_sec: 93355, tv_nsec: 942016062 } }, should ban: false

13:16:49.107 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 3, last_disconnect: Instant { tv_sec: 93485, tv_nsec: 947551051 } }, should ban: true

13:16:49.108  WARN tokio-runtime-worker peerset: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648 to -2147483648. Reason: Slow peer after backoffs. Banned, disconnecting.
```

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
dharjeezy pushed a commit to dharjeezy/polkadot-sdk that referenced this pull request Aug 28, 2024
…g the same request multiple times (paritytech#5029)

This PR avoids submitting the same block or state request multiple times
to the same slow peer.

Previously, we submitted the same request to the same slow peer, which
resulted in reputation bans on the slow peer side.
Furthermore, the strategy selected the same slow peer multiple times to
submit queries to, although a better candidate may exist.

Instead, in this PR we:
- introduce a `DisconnectedPeers` via LRU with 512 peer capacity to only
track the state of disconnected peers with a request in flight
- when the `DisconnectedPeers` detects a peer disconnected with a
request in flight, the peer is backed off
  - on the first disconnection: 60 seconds
  - on second disconnection: 120 seconds
- on the third disconnection the peer is banned, and the peer remains
banned until the peerstore decays its reputation
  
This PR lifts the pressure from overloaded nodes that cannot process
requests in due time.
And if a peer is detected to be slow after backoffs, the peer is banned.

Theoretically, submitting the same request multiple times can still
happen when:
- (a) we backoff and ban the peer 
- (b) the network does not discover other peers -- this may also be a
test net
- (c) the peer gets reconnected after the reputation decay and is still
slow to respond



Aims to improve:
- paritytech#4924
- paritytech#531

Next Steps:
- Investigate the network after this is deployed, possibly bumping the
keep-alive timeout or seeing if there's something else misbehaving




This PR builds on top of:
- paritytech#4987


### Testing Done
- Added a couple of unit tests where test harness were set in place

- Local testnet

```bash
13:13:25.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Added first time peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD

13:14:39.102 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 2, last_disconnect: Instant { tv_sec: 93355, tv_nsec: 942016062 } }, should ban: false

13:16:49.107 DEBUG tokio-runtime-worker sync::persistent_peer_state: Remove known peer 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD state: DisconnectedPeerState { num_disconnects: 3, last_disconnect: Instant { tv_sec: 93485, tv_nsec: 947551051 } }, should ban: true

13:16:49.108  WARN tokio-runtime-worker peerset: Report 12D3KooWHdiAxVd8uMQR1hGWXccidmfCwLqcMpGwR6QcTP6QRMuD: -2147483648 to -2147483648. Reason: Slow peer after backoffs. Banned, disconnecting.
```

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A1-insubstantial Pull request requires no code review (e.g., a sub-repository hash update). D1-medium Can be fixed by a coder with good Rust knowledge but little knowledge of the codebase. I5-enhancement An additional feature request. R0-silent Changes should not be mentioned in any release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants