
In sc-network, isolate accesses to client, txpool, finality proof builder, ... in separate tasks. #559

Open
tomaka opened this issue Mar 27, 2020 · 8 comments
Labels: I2-bug (The node fails to follow expected behavior), I4-refactor (Code needs refactoring)

Comments

tomaka (Contributor) commented Mar 27, 2020

Right now, sc-network directly calls the client, the tx-pool, the finality proof builder, and so on (the full list is in the config module).

However, whenever one of these components takes a long time to respond, the entire networking stack freezes for everyone.
Not only does this happen right now due to some runtime performance issues, but it also makes us quite vulnerable to attacks.

Instead of performing straight calls, we should, in my opinion, spawn a few background tasks (or even a new background task per call, when that is possible) and perform the calls there.

My personal opinion is that isolating the calls to the clients/tx-pool/... should be performed by sc-service, but from a pragmatic perspective let's do it in sc-network.

Instead of passing an Arc<dyn Client> directly to the sync state machine, we should pass some sort of Isolated<Arc<dyn Client>>, where Isolated exposes an asynchronous API.
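
A minimal sketch of what such an `Isolated` wrapper could look like, assuming a channel-based design where the wrapped component lives on a background task. Only the name `Isolated` comes from the proposal above; the `Job` alias, the bounded capacity, and the use of `futures` channels are illustrative assumptions, not an existing Substrate API:

```rust
use futures::channel::{mpsc, oneshot};
use futures::future::{BoxFuture, FutureExt};
use futures::StreamExt;
use std::sync::Arc;

/// One call to run against the wrapped component, executed on the background task.
type Job<T> = Box<dyn FnOnce(Arc<T>) + Send>;

/// Handle exposing an asynchronous API to an otherwise synchronous component.
pub struct Isolated<T> {
    sender: mpsc::Sender<Job<T>>,
}

impl<T: Send + Sync + 'static> Isolated<T> {
    /// Wraps `inner` and returns the handle plus the background future that
    /// must be spawned on the executor (e.g. by sc-service).
    pub fn new(inner: Arc<T>, capacity: usize) -> (Self, BoxFuture<'static, ()>) {
        let (sender, mut receiver) = mpsc::channel::<Job<T>>(capacity);
        let task = async move {
            // Every queued call runs here, outside of the networking `poll`.
            while let Some(job) = receiver.next().await {
                job(inner.clone());
            }
        }
        .boxed();
        (Isolated { sender }, task)
    }

    /// Runs `f` against the wrapped component on the background task and
    /// resolves with its result; returns `None` if the queue is full or the
    /// background task has gone away.
    pub async fn call<R, F>(&self, f: F) -> Option<R>
    where
        R: Send + 'static,
        F: FnOnce(&T) -> R + Send + 'static,
    {
        let (tx, rx) = oneshot::channel();
        let job: Job<T> = Box::new(move |inner: Arc<T>| {
            let _ = tx.send(f(&*inner));
        });
        let mut sender = self.sender.clone();
        sender.try_send(job).ok()?;
        rx.await.ok()
    }
}
```

The sync state machine would then hold such a handle and `call(...).await` into it, while sc-service (or the networking service) spawns the returned background future.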

seunlanlege (Contributor) commented Mar 30, 2020

Isn't this related to #3230? In terms of effort, wouldn't it be best to just coordinate work on #3230?

tomaka (Contributor, Author) commented Mar 30, 2020

Yes and no. It depends on how paritytech/substrate#3230 is implemented.

What I had in mind for paritytech/substrate#3230 is that when we call client.call(...), we should not put the thread to sleep while waiting for something to happen. The example I gave in paritytech/substrate#3230 is that we shouldn't block the thread while waiting for a network message to be received.

But even if we do that, it is still possible for client.call(...) to take several seconds of CPU time (because of heavy computations), during which the network will be frozen.

But we could also consider using separate tasks as part of paritytech/substrate#3230; I don't know.
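
To illustrate the distinction with a purely hypothetical example (the names below are made up, and tokio is assumed as the executor only for illustration): making the call async removes the thread-sleeping problem targeted by paritytech/substrate#3230, but a CPU-bound body still monopolises whichever task drives it unless the work is moved onto its own task.

```rust
// Hypothetical async "client call": it never sleeps the thread waiting for
// I/O, yet it can still spend seconds of pure computation.
async fn client_call(input: Vec<u8>) -> Vec<u8> {
    expensive_runtime_execution(input) // CPU-bound, no await points inside
}

// Placeholder for a long-running runtime/Wasm computation.
fn expensive_runtime_execution(input: Vec<u8>) -> Vec<u8> {
    input
}

// Awaiting it inline still freezes the calling (networking) task for the
// whole duration of the computation...
async fn handle_request_inline(input: Vec<u8>) -> Vec<u8> {
    client_call(input).await
}

// ...whereas handing it to a separate task lets the networking task keep
// being polled while the computation runs elsewhere.
async fn handle_request_isolated(input: Vec<u8>) -> Vec<u8> {
    tokio::spawn(client_call(input))
        .await
        .expect("background task panicked")
}

#[tokio::main]
async fn main() {
    let _ = handle_request_inline(vec![0; 16]).await;
    let _ = handle_request_isolated(vec![0; 16]).await;
}
```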

seunlanlege (Contributor) commented

We should start a channel on Matrix and continue there.

bkchr (Member) commented Mar 30, 2020

> We should start a channel on Matrix and continue there.

Why? This is the right place to discuss the implementation.

tomaka (Contributor, Author) commented Apr 16, 2020

The code that is relevant for this issue is here: https://github.com/paritytech/substrate/blob/ae36c62223a8191e5bbca14ca5df3bc7d0edb6ea/client/network/src/chain.rs#L45
Ideally all the traits mentioned in this file would have asynchronous methods.

Since this is far from trivial, we should, as mentioned above, add to the networking code an IsolatedClient struct and an IsolatedFinalityProofProvider struct that wrap implementations of these traits and do the work in a background task (a rough sketch follows at the end of this comment).

When it comes to making the networking code async-friendly, it shouldn't be very hard to adjust light_client_handler.rs and block_requests.rs. However, protocol.rs would be very tough to change.

I suggest we keep protocol.rs synchronous and only adjust the new protocols.
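
As a minimal sketch of the IsolatedFinalityProofProvider idea mentioned above: the trait below is a simplified stand-in, not the real trait from client/network/src/chain.rs, `H` stands in for the block hash type, and tokio's spawn_blocking is used only as a generic stand-in for "run it on a background task" (the real code would use whatever spawner sc-service provides).

```rust
use std::sync::Arc;

/// Simplified stand-in for the synchronous provider trait used by the
/// networking code.
pub trait FinalityProofProvider<H>: Send + Sync {
    fn prove_finality(&self, hash: H) -> Result<Option<Vec<u8>>, String>;
}

/// Wrapper exposing the same functionality through an async method.
pub struct IsolatedFinalityProofProvider<H: 'static> {
    inner: Arc<dyn FinalityProofProvider<H>>,
}

impl<H: Send + 'static> IsolatedFinalityProofProvider<H> {
    pub fn new(inner: Arc<dyn FinalityProofProvider<H>>) -> Self {
        Self { inner }
    }

    /// Async version of `prove_finality`: the potentially slow call runs on a
    /// blocking-friendly background task, so the protocol handler's `poll`
    /// never waits on it.
    pub async fn prove_finality(&self, hash: H) -> Result<Option<Vec<u8>>, String> {
        let inner = self.inner.clone();
        tokio::task::spawn_blocking(move || inner.prove_finality(hash))
            .await
            .map_err(|e| format!("background task failed: {}", e))?
    }
}
```

block_requests.rs and light_client_handler.rs could then await such wrappers from their request-handling futures, while protocol.rs keeps its current synchronous calls.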

arkpar (Member) commented Apr 16, 2020

> However, whenever one of these components takes a long time to respond, the entire networking stack freezes for everyone.

How so? Isn't it run on a thread pool?

I don't see how this proposed change will solve anything. Simply moving a bottleneck to a different place won't help much. You'd just get an overflow in the task queue.

tomaka (Contributor, Author) commented Apr 16, 2020

> How so? Isn't it run on a thread pool?

Answering pings, parsing messages, and everything "close to the socket" is done in a thread pool. Answering requests that need access to the client is not.

> Simply moving a bottleneck to a different place won't help much.

Right now, if the client takes a long time, we just accumulate network messages from nodes. After this change, we would process all messages quickly and instead accumulate the messages that we need to answer.

This means that:

1- We could detect when the client is over-capacity and return some sort of "sorry we're busy" error to the remote.
2- We could continue processing the block announces, transactions, and so on that the nodes send us.

As a more general problem, we call the client from within Future::poll, which is supposed to return quickly. Blocking in Future::poll is normally a no-no.
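
A rough sketch of point 1 above, with made-up types: if the pending client work goes through a bounded queue, the networking code can answer "busy" immediately when that queue is full, instead of silently buffering requests while the client is slow.

```rust
use futures::channel::mpsc;

/// Hypothetical request taken off the wire that needs the client to answer it.
pub struct PendingRequest {
    pub peer_payload: Vec<u8>,
}

/// Hypothetical immediate reply decided by the networking code.
pub enum Reply {
    /// Queued; the actual answer will be produced by the background client task.
    Accepted,
    /// The queue is full (or the client task has gone away): tell the peer we
    /// are busy right away.
    Busy,
}

/// Bounded queue of client work, owned by the networking code.
pub struct ClientQueue {
    sender: mpsc::Sender<PendingRequest>,
}

impl ClientQueue {
    /// The receiver end is handed to the background task that talks to the client.
    pub fn new(capacity: usize) -> (Self, mpsc::Receiver<PendingRequest>) {
        let (sender, receiver) = mpsc::channel(capacity);
        (Self { sender }, receiver)
    }

    /// Called from the networking code: never blocks, never waits on the
    /// client, and reports over-capacity instantly.
    pub fn enqueue(&mut self, request: PendingRequest) -> Reply {
        match self.sender.try_send(request) {
            Ok(()) => Reply::Accepted,
            Err(_) => Reply::Busy,
        }
    }
}
```

The receiving end of the queue would be drained by the background task that actually calls into the client, which keeps the networking `poll` free to handle block announces, transactions, and so on in the meantime.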

arkpar (Member) commented Apr 16, 2020

> Answering requests that need access to the client is not.

Why not? The client itself is thread-safe. I'm pretty sure handling requests does not lock anything for the duration of the client query. At least it did not initially.

> 1- We could detect when the client is over-capacity and return some sort of "sorry we're busy" error to the remote.

Request handling needs to be fair: e.g. bootnode resources need to scale evenly across all connecting peers. If the node is overloaded, the request should wait until it times out rather than being denied immediately.

> 2- We could continue processing the block announces, transactions, and so on that the nodes send us.

See above. I don't see why that is not the case right now.

This sounds like a lot of additional complexity to work around the actual problem. Handling network requests should be fast. If your network handler takes seconds to execute, it needs to be optimized first, not shoved into a queue. And if such a queue were to be introduced, I'd argue that it should be managed explicitly, rather than relying on the tokio executor or creating yet another unbounded channel, as it will most certainly require some kind of per-peer throttling.

altonen transferred this issue from paritytech/substrate on Aug 24, 2023
the-right-joyce added the I2-bug and I4-refactor labels and removed the I3-bug label on Aug 25, 2023