
Block requests sometimes take a lot of time #5831

Closed
tomaka opened this issue Apr 29, 2020 · 7 comments
Labels
I4-annoyance The client behaves within expectations, however this “expected behaviour” itself is at issue.

Comments

@tomaka
Contributor

tomaka commented Apr 29, 2020

When syncing Kusama from scratch, for example, one can see the sync speed fluctuate a lot. The reason is that we spend a lot of time waiting for block responses.

This should be investigated.

tomaka added the I4-annoyance label Apr 29, 2020
@tomaka
Contributor Author

tomaka commented Apr 30, 2020

0.7.32 includes the metrics added by #5811, which will be helpful for understanding this.
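
For context, Substrate exposes its metrics through Prometheus. A minimal sketch of this kind of request-duration instrumentation, using the `prometheus` crate, might look like the following; the metric name, bucket boundaries, and handler shape are illustrative assumptions, not necessarily what #5811 actually adds:

```rust
use prometheus::{Histogram, HistogramOpts, Registry};

fn register_block_request_histogram(registry: &Registry) -> prometheus::Result<Histogram> {
    let histogram = Histogram::with_opts(
        HistogramOpts::new(
            "sync_block_request_duration_seconds", // hypothetical metric name
            "Time spent answering an incoming block request",
        )
        .buckets(vec![0.001, 0.01, 0.1, 1.0, 5.0, 10.0, 40.0]), // hypothetical buckets
    )?;
    registry.register(Box::new(histogram.clone()))?;
    Ok(histogram)
}

fn on_block_request(histogram: &Histogram) {
    // The timer records the elapsed time into the histogram when it is dropped.
    let _timer = histogram.start_timer();
    // ... build and send the block response ...
}

fn main() -> prometheus::Result<()> {
    let registry = Registry::new();
    let histogram = register_block_request_histogram(&registry)?;
    on_block_request(&histogram);
    Ok(())
}
```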

@tomaka
Contributor Author

tomaka commented May 5, 2020

I can now see on Grafana that the on_block_request method usually takes less than 1ms, but occasionally takes up to 4-6 seconds for an unknown reason.

We've also deployed #5854 on two validator nodes, and I can see on these two nodes that the networking overhead is roughly a constant 200ms. This is actually pretty high in absolute terms, and should also be fixed. I suppose it's because the network worker is quite busy.

EDIT: the 200ms also includes the time to put the response into the TCP buffer for sending to the remote, which could explain it.
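
To make the distinction concrete, here is a minimal sketch of how the handler time and the networking overhead can be measured separately. All types and function names below are hypothetical stand-ins, not the actual Substrate code: `handle_request` plays the role of on_block_request, and `enqueue_response` stands for handing the encoded response over to the network worker.

```rust
use std::time::{Duration, Instant};

struct Request;
struct Response;

fn handle_request(_req: &Request) -> Response {
    Response
}

fn enqueue_response(_resp: Response) {
    // Hand the encoded response off to the network worker, which still has
    // to flush it into the TCP send buffer.
}

fn answer_request(request: Request) -> (Duration, Duration) {
    let received_at = Instant::now();

    let response = handle_request(&request);
    let handler_time = received_at.elapsed(); // usually well under 1 ms

    enqueue_response(response);
    // Everything beyond `handler_time` is the "networking overhead" mentioned
    // above: waiting for the network worker plus copying into the TCP buffer.
    let overhead = received_at.elapsed() - handler_time;

    (handler_time, overhead)
}

fn main() {
    let (handler_time, overhead) = answer_request(Request);
    println!("handler: {handler_time:?}, overhead: {overhead:?}");
}
```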

@arkpar
Member

arkpar commented May 6, 2020

This does not explain the 40-second timeouts that happen all the time when syncing. It would be good to have a sync log from a machine where on_block_request takes more than 40 seconds.

@tomaka
Contributor Author

tomaka commented May 27, 2020

Update: from the serving side, the fix to the "network freezes" seems to have improved the situation.

I think the next step here would be to continue smoothing out the network event propagation/performance story, and see if this still happens afterwards.

Even if the network event propagation/performance issue is not directly responsible, fixing it first would eliminate the noise.

@tomaka
Contributor Author

tomaka commented Jul 29, 2020

I feel like this was caused by the network worker being overloaded. #6692 should considerably improve the situation. We'll see after it has been deployed.

@tomaka
Contributor Author

tomaka commented Jan 28, 2021

I think this is fixed. While it's hard to definitively claim that block request times are correct, here is, for example, the 99th percentile of block request time over 48 hours:

[Graph: 99th percentile of block request time over 48 hours]

The occasional peak is most likely just a timeout.
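
For reference, a small illustrative sketch of what "99th percentile of request time" means over a set of recorded samples; in practice Grafana derives this value from the Prometheus histogram buckets rather than from raw samples, so this is only to make the figure concrete:

```rust
use std::time::Duration;

// Nearest-rank percentile: the smallest sample such that at least p% of all
// samples are less than or equal to it.
fn percentile(mut samples: Vec<Duration>, p: f64) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    samples.sort();
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    samples.get(rank.saturating_sub(1)).copied()
}

fn main() {
    // 100 evenly spread samples from 1 ms to 100 ms.
    let samples: Vec<Duration> = (1..=100).map(Duration::from_millis).collect();
    // The 99th percentile is then the 99 ms sample.
    assert_eq!(percentile(samples, 99.0), Some(Duration::from_millis(99)));
}
```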

tomaka closed this as completed Jan 28, 2021
@tomaka
Contributor Author

tomaka commented Jan 28, 2021

Perhaps more importantly, here is the 99th percentile of the time between receiving a request and sending back an answer:

[Graph: 99th percentile of time between receiving a request and sending back an answer]

There is an occasional peak, which is a bit intriguing, but it's overall flat.
