
feat(congestion): reject new transactions on RPC level #11419

Merged
merged 5 commits into near:master from jakmeier:rpc-rejects-tx
Jun 4, 2024

Conversation

jakmeier
Contributor

@jakmeier jakmeier commented May 29, 2024

Summary: In this PR, we introduce a new failure mode on the RPC level when a transaction is submitted under congestion. The error is of type InvalidTxError and called ShardCongested with a single field shard_id referencing the congested shard.

Details

With cross-shard congestion control being stabilized soon, we must deal with the case when a shard rejects new transactions.

On the chunk producer level, all transactions going to a congested shard will be dropped. This keeps the memory requirements of chunk producers bounded. Further, we decided to go for a relatively low threshold in order to keep the latency of accepted transactions low, preventing new transactions as soon as we hit 25% congestion on a specific shard. Consequently, when shards are congested, it will not be long before transactions are rejected.

This has consequences for the users. On the positive side, they will no longer have to wait a long time not knowing whether their transaction will be accepted. Either it is executed within a bounded time (at most 20 blocks after inclusion), or it is rejected immediately.

But on the negative side, when a shard is congested, they will have to actively retry sending the transaction until it gets accepted. We hope that this can be automated by wallets, which can also provide useful live updates to the user about what is happening. But for this, they will need to understand and handle the new error ShardCongested differently from existing errors. The key difference is that the same signed transaction can be sent again and will be accepted if congestion has gone down.
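The retry behavior described above is the key point for clients: `ShardCongested` is the one error where resubmitting the identical signed transaction can later succeed. A minimal sketch of how a wallet might distinguish it (only `InvalidTxError`, `ShardCongested`, and `shard_id` come from this PR; the extra variant and the helper function are illustrative, not nearcore's actual API):

```rust
// Sketch of the new error variant and client-side handling.
// Names other than `InvalidTxError::ShardCongested { shard_id }`
// are hypothetical placeholders.
#[derive(Debug)]
enum InvalidTxError {
    /// The receiver shard is too congested to accept new transactions.
    ShardCongested { shard_id: u64 },
    /// Stand-in for the existing, non-retryable variants.
    InvalidNonce,
}

/// `ShardCongested` is special: the identical signed transaction can
/// be resubmitted unchanged and will be accepted once congestion drops.
/// Other invalid-tx errors require a fixed (re-signed) transaction.
fn can_retry_unchanged(err: &InvalidTxError) -> bool {
    matches!(err, InvalidTxError::ShardCongested { .. })
}
```

A wallet could use such a check to drive an automatic retry loop with backoff, surfacing a "shard busy, retrying" status to the user instead of a generic failure.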

@jakmeier
Contributor Author

One of the points raised in near/NEPs#539 was that we have to clearly propagate the information about rejected transactions to users.

This PR is my suggestion for how to expose the information at the RPC node level.

Some tests should run without congestion control so that they function the same way they did before. And some congestion control tests will observe the new error, which is expected.

codecov bot commented May 31, 2024

Codecov Report

Attention: Patch coverage is 88.09524% with 5 lines in your changes missing coverage. Please review.

Project coverage is 71.42%. Comparing base (96af8f7) to head (f1b492d).
Report is 1 commit behind head on master.

Files Patch % Lines
chain/client/src/client.rs 91.66% 0 Missing and 2 partials ⚠️
core/primitives/src/errors.rs 0.00% 2 Missing ⚠️
chain/chain/src/runtime/mod.rs 93.33% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11419      +/-   ##
==========================================
- Coverage   71.45%   71.42%   -0.03%     
==========================================
  Files         785      785              
  Lines      158807   158846      +39     
  Branches   158807   158846      +39     
==========================================
- Hits       113473   113463      -10     
- Misses      40429    40479      +50     
+ Partials     4905     4904       -1     
Flag Coverage Δ
backward-compatibility 0.24% <0.00%> (-0.01%) ⬇️
db-migration 0.24% <0.00%> (-0.01%) ⬇️
genesis-check 1.37% <0.00%> (-0.01%) ⬇️
integration-tests 37.46% <88.09%> (-0.06%) ⬇️
linux 68.95% <59.52%> (+0.03%) ⬆️
linux-nightly 70.92% <88.09%> (-0.02%) ⬇️
macos 52.54% <59.52%> (-0.01%) ⬇️
pytests 1.59% <0.00%> (-0.01%) ⬇️
sanity-checks 1.38% <0.00%> (-0.01%) ⬇️
unittests 66.18% <78.57%> (-0.02%) ⬇️
upgradability 0.28% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


@jakmeier jakmeier marked this pull request as ready for review May 31, 2024 08:15
@jakmeier jakmeier requested a review from a team as a code owner May 31, 2024 08:15
@jakmeier
Contributor Author

@wacban and @bowenwang1996 this PR is what I propose for how we handle rejected transactions.

In summary:

  • RPC nodes return InvalidTxError::ShardCongested when they see that the receiver shard is congested. (introduced in this PR)
  • When an RPC forwards a TX, chunk producers may drop it later due to congestion. (no change in this PR)

In the second case, RPC nodes will wait for a tx_status timeout to occur and provide no useful information back to the API user. This may happen every time a shard transitions from not congested to congested, with slightly unlucky tx submission timing.

I don't really like that it is marked as InvalidTxError, but that's minor IMO. What I really don't like is the case where we rely on timeouts. If transactions are sent with TxExecutionStatus::Included or stronger guarantees, getting dubious timeouts due to congestion is a bad developer and user experience. I just don't know of anything better. I only see two alternatives:

  1. Don't drop transactions from the tx pool due to congestion, with the downside that, during congestion, the pool will quickly fill up with transactions that cannot currently be processed. Potentially that starves all other shards.
  2. Or, somehow signal the transactions dropped from the transaction pool back to the RPCs. That seems like it would be a major effort. First, we send multiple RoutedMessageBody::ForwardTx from one RPC to several chunk producers. Then, a transaction is inserted into the pool successfully, or perhaps it already existed, or maybe the pool is full. Already at this point we don't report back to the RPC in any way, regardless of congestion. RPCs just poll the chain state for inclusion of the tx. With congestion control enabled, we add the possibility that we drop a transaction from the pool in the future, at which point we have no idea which RPC(s) originally sent us the transaction. So yeah, reporting it back without a large overhead seems tricky to me.

Please let me know if you have any ideas. Otherwise, I suggest we discuss it again in our next congestion control call.

Also cc @Akashin and @telezhnaya. I believe for dropped transactions on a full transaction pool, we already have the same problem today, right? The problem being that RPC nodes just time out, without a useful error message. What is the recommended way for wallets to deal with it in that case?

@bowenwang1996
Collaborator

If transactions are sent with TxExecutionStatus::Included or stronger guarantees, getting dubious timeouts due to congestion is a bad developer and user experience.

I agree. I also don't have a good solution in mind and would like to hear what others think. At the same time, this would only be a problem when a shard goes from noncongested to congested, so overall the impact should not be that large.

Collaborator

@bowenwang1996 bowenwang1996 left a comment


Could we test this behavior?

Contributor

@wacban wacban left a comment


LGTM
We can also consider doing the same for the transaction signer shard but it's optional and can be done separately.

Comment on lines 181 to 182
/// The receiver shard of the transaction is too congestion to accept new
/// transactions at the moment.
Contributor


typo: too congested

@@ -643,9 +643,25 @@ impl RuntimeAdapter for NightshadeRuntime {
verify_signature: bool,
epoch_id: &EpochId,
current_protocol_version: ProtocolVersion,
receiver_congestion_info: Option<ExtendedCongestionInfo>,
Contributor


How about the congestion info of the sender? If the sender is congested their tx limit will be lowered and some transactions may also be rejected because of that.

Contributor Author


I don't think we reject them. We just lower the gas limit. When we reach the gas limit, we keep following transactions in the pool rather than rejecting (dropping) them. Eventually, new transactions are dropped but that's because the pool is full, not due to something we added in nep-539.

We could still try to be smart about it, like reject something on the RPC if we think the pool is filling up with transactions all using the same sender shard. But it seems a bit tricky to get the heuristics right. If we are too aggressive, we risk frustrating users by rejecting their transactions for no good reason. Not to mention that "rogue RPCs" could abuse the system by filling the transaction pool beyond what our RPCs do, which would starve the honest RPCs completely under congestion.
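The sender-side behavior described here (a lowered gas limit defers transactions; only a full pool drops them) can be modeled with a toy pool. This is an illustrative sketch, not nearcore's actual transaction pool; all names and the simplified gas accounting are assumptions:

```rust
// Toy model: transactions past the per-chunk gas limit stay pooled for
// later chunks; only a full pool causes new transactions to be dropped.
struct TxPool {
    txs: Vec<u64>, // pending transactions, each represented by its gas cost
    capacity: usize,
}

impl TxPool {
    /// Returns false only when the pool itself is full -- this is the
    /// pre-existing drop condition, unrelated to NEP-539 congestion.
    fn insert(&mut self, gas: u64) -> bool {
        if self.txs.len() >= self.capacity {
            return false;
        }
        self.txs.push(gas);
        true
    }

    /// Take transactions for the next chunk, up to the (possibly lowered)
    /// gas limit. Transactions over the limit are kept, not rejected.
    fn drain_for_chunk(&mut self, gas_limit: u64) -> Vec<u64> {
        let mut used = 0u64;
        let mut taken = Vec::new();
        while let Some(&gas) = self.txs.first() {
            if used + gas > gas_limit {
                break; // stays in the pool for a later chunk
            }
            used += gas;
            taken.push(self.txs.remove(0));
        }
        taken
    }
}
```

The point of the model: lowering `gas_limit` under sender congestion shrinks what `drain_for_chunk` takes per chunk, but nothing is dropped until `insert` hits `capacity`.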

Contributor


Sounds good, we can observe what happens in the wild and make adjustments if we find it necessary.

congestion_info.congestion_info,
congestion_info.missed_chunks_count,
);
if !congestion_control.shard_accepts_transactions() {
Contributor


You may consider making this more flexible by making the threshold a configurable value rather than using the runtime parameter of 0.25. The drawback is that the two would need to be kept somewhat in sync. I also dislike that some RPC nodes may artificially lower it to give their users priority, but that is something they can do anyway.

Contributor Author


Good point. We can consider adding the flexibility.

But just to be clear, there should not be much to be gained by setting a higher threshold. Unless the congestion level drops between now and when the chunk producer considers the transaction, it will be rejected and dropped anyway. Most of the time, the RPC will just end up waiting for a timeout before it inevitably fails, annoying the user and wasting resources on the RPC in the process.
At least that is my theory; I guess it could be different in practice. Maybe the congestion level keeps moving below and above the threshold, which could give an advantage to more aggressive forwarding on the RPC node compared to retrying on the wallet/client side.
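The configurable-threshold idea discussed here could look roughly like the following. `shard_accepts_transactions` and the 0.25 runtime parameter appear in this PR; the config struct, field names, and the free-function form are assumptions for illustration:

```rust
// Sketch of an RPC-side configurable rejection threshold. The protocol
// threshold (0.25) still applies at chunk producers; a different RPC-side
// value only changes how early the RPC itself rejects or forwards.
struct RpcCongestionConfig {
    /// Reject incoming transactions once the receiver shard's congestion
    /// level reaches this fraction.
    reject_threshold: f64,
}

impl Default for RpcCongestionConfig {
    fn default() -> Self {
        // Default mirrors the runtime parameter used by chunk producers.
        Self { reject_threshold: 0.25 }
    }
}

/// RPC-side analogue of `shard_accepts_transactions`: accept only while
/// the congestion level is strictly below the configured threshold.
fn rpc_accepts_transactions(congestion_level: f64, cfg: &RpcCongestionConfig) -> bool {
    congestion_level < cfg.reject_threshold
}
```

An operator who sets `reject_threshold` above 0.25 would forward transactions the chunk producer is still likely to drop, which is exactly the timeout-heavy behavior discussed above.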

Contributor


For some users the UX doesn't matter as much as just maximizing the chances of getting their transaction in. For those it may be useful. It may also be useful for us if we discover that we did the math wrong and need to change it quickly :)

@jakmeier
Contributor Author

jakmeier commented Jun 3, 2024

Could we test this behavior?

Yes, makes sense. I added an integration test checking that the new error is returned when the receiver is congested.

@jakmeier jakmeier added this pull request to the merge queue Jun 4, 2024
Merged via the queue into near:master with commit dd1e278 Jun 4, 2024
29 checks passed
@jakmeier jakmeier deleted the rpc-rejects-tx branch June 4, 2024 16:01