
XCM: General XCM Queue Pallet #6129

Closed
gavofyork opened this issue Oct 7, 2022 · 9 comments · Fixed by paritytech/substrate#12485 or #6271
Labels: J0-enhancement An additional feature request.

gavofyork (Member) commented Oct 7, 2022

Right now we have several different XCM message queues spread across UMP, DMP, HRMP/XCMP and (a yet to be done) bridge/loopback. They all implement roughly the same thing - accept one or more messages, then execute them either immediately if the queue is empty, or store them for later execution. They then execute the contents of the queue at regular intervals, paying attention to the weight limits. When a message takes too much weight to execute autonomously, it is generally placed into its own storage item for later execution (or deletion after a timeout).

Instead of this, we should have a single XCM queue pallet. It should implement a new trait EnqueueMessage and dispatch messages regularly. DMP, UMP, HRMP/XCMP and bridge code can use this pallet through its trait to enqueue messages which should (eventually) be executed, either autonomously if it is light enough or manually if overweight.

The EnqueueMessage trait should look like:

pub enum Priority {
	Regular,
	High,
}
pub trait EnqueueMessage {
	/// Enqueue a single encoded `VersionedXcm` `message` from a specific `origin` at a
	/// specific `priority`.
	fn enqueue_message(message: &[u8], origin: MultiLocation, priority: Priority);

	/// Enqueue multiple encoded `VersionedXcm` `messages` from a specific `origin`.
	/// 
	/// Priority is always regular.
	fn enqueue_messages(messages: &[&[u8]], origin: MultiLocation);
}
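
For illustration, here is a hedged sketch of the caller side; the function name and the generic parameter are hypothetical, not part of the proposal:

// Hypothetical glue code showing how a transport layer (e.g. the UMP/DMP/HRMP
// handlers) might hand a just-delivered batch over to the queue pallet.
// `on_messages_received` and `Q` are illustrative names only.
fn on_messages_received<Q: EnqueueMessage>(messages: &[&[u8]], origin: MultiLocation) {
	// Batched transport delivery always goes in at regular priority; anything
	// needing `Priority::High` would be enqueued singly via `enqueue_message`.
	Q::enqueue_messages(messages, origin);
}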

In terms of the state data format, it could be implemented in a few ways:

  1. Single storage item for the entire set of messages.
  2. One storage item for each queued message.
  3. One storage item per page of messages, with a maximum item size and message count per page. We assume 64KB pages, each of which will hold up to 256 messages of up to 256 bytes each.

These options can also be augmented with the possibility of using the Preimages pallet to store any large message in its own storage item. We'll ignore that for now since the benefit and overhead will be mostly the same across the options.

Let's assume we tend to have two modes of operation: Mostly-Processing (MP), where we receive 100 messages per block to enqueue but have 10,000 messages from previous blocks to execute, of which we process 1,000; and Mostly-Receiving (MR), where we receive 1,000 messages per block to be enqueued onto the existing queue, which is 2,000 messages long and of which we process 500.

We'll also assume messages (including metadata) range between 64 bytes and 256 bytes, averaging 128 bytes and occasionally going as high as 16KB.

Our storage item trie proofs require the key and value plus a trie overhead of 16 items * 32 bytes per level; the number of levels will typically be 2 + log_16(size), so around (2 * 16 * 32) = 1KB for singular values, 2KB for maps with up to around 200-300 values, and 3KB for maps with up to around 50,000 values.

For option 1/MP we'll need a proof of a regular value (1KB) plus all items in the queue (128 * 10,000) = ~1,250 KB, and 1/MR would be 1 KB + 128 * 2,000 = ~250 KB. That's quite a lot of overhead.

For option 2/MP, it's 3KB * 1,100 = 3.3MB and 2/MR is 3KB * 1,500 = 4.5MB. This is impossibly high.

For option 3/MP, it's 1 page access for enqueuing and 4 page accesses for processing. We'll be conservative and assume a total of 6 since we may be on a boundary. That's 6 * (2KB + 64KB) =~ 400 KB. For option 3/MR, we must deposit 2 pages' worth and dequeue 4; again being conservative, that's 7 pages and therefore 7 * (2KB + 64KB) =~ 460 KB.
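
As a sanity check on these figures, a small self-contained calculation under the stated assumptions (128-byte average messages, 64KB pages, and the trie overheads above); the page counts are taken directly from the text:

// Back-of-envelope reproduction of the PoV estimates above, under the stated
// assumptions: 128-byte average messages, 64KB pages, and per-item trie
// overheads of ~1KB (single values), ~2KB (small maps) and ~3KB (large maps).
fn main() {
	const KB: u64 = 1024;
	const MSG: u64 = 128; // average encoded message size in bytes
	const PAGE: u64 = 64 * KB;

	// Option 1: a single storage item holding the whole queue.
	let opt1_mp = KB + MSG * 10_000;
	let opt1_mr = KB + MSG * 2_000;
	// Option 2: one storage item per message (large map: ~3KB overhead each).
	let opt2_mp = 3 * KB * (100 + 1_000);
	let opt2_mr = 3 * KB * (1_000 + 500);
	// Option 3: one item per page (~2KB overhead per page), using the
	// conservative page counts from the text (6 for MP, 7 for MR).
	let opt3_mp = 6 * (2 * KB + PAGE);
	let opt3_mr = 7 * (2 * KB + PAGE);

	for (name, bytes) in [
		("1/MP", opt1_mp), ("1/MR", opt1_mr),
		("2/MP", opt2_mp), ("2/MR", opt2_mr),
		("3/MP", opt3_mp), ("3/MR", opt3_mr),
	] {
		println!("option {name}: ~{} KB", bytes / KB);
	}
}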

Storing items in a linked-list heap may lead to substantial gains in performance since:

  • No Vec<u8>s are decoded or instantiated for each message.
  • The 64 KB page is packed efficiently, leading to a `MaxEncodedLen` which actually reflects the data stored (a `BoundedVec<BoundedVec<_>>` will always have quite an inaccurate MEL unless the inner `BoundedVec` has very little variance).

Furthermore, depending on realistic message sizes and processing dynamics, option 3 can be optimised in terms of page size.

Suggested Storage Format

A reasonable storage format might be:

/// A message and the origin which sent it, as versioned XCM. Will not decode if the XCM
/// version becomes ancient, in which case unpermissioned deletion is authorized.
pub struct MessageItem {
	origin: VersionedMultiLocation,
	message: VersionedXcm,
}

/// Type for identifying a page.
type PageIndex = u32;
/// Type for representing the size of an offset into a page heap.
type Offset = u16;
/// Type for representing the size of an individual encoded `MessageItem`.
type Size = u16;
/// Maximum size of a page's heap.
const HEAP_SIZE: u32 = 1u32 << 16;
/// Data encoded and prefixed to the encoded `MessageItem`.
pub struct ItemHeader {
	next: Offset,
	len: Size,
}
/// A page of messages. Pages always contain at least one item.
pub struct Page {
	/// The next page with some items in it or `None` if this is the last page.
	next_page: Option<PageIndex>,
	/// The heap-offset of the `(ItemHeader, MessageItem)` of the first message item in
	/// this page.
	offset: Offset,
	/// The heap. If `self.offset == self.heap.len()` then the page is empty and should
	/// be deleted.
	heap: BoundedVec<u8, ConstU32<HEAP_SIZE>>,
}
/// The index of the first (non-empty) regular-priority page.
value RegularHead: Option<PageIndex>;
/// The index of the first (non-empty) high-priority page.
value PriorityHead: Option<PageIndex>;
/// The lowest known unused page index.
value NextPage: PageIndex;
/// The map of page indices to pages.
map PageIndex => Page;

/// An overweight message.
pub struct Overweight {
	/// Deletion allowed if realisation fails (indicating that the XCM version is no longer supported or the preimage is not being stored).
	payload: Bounded<MessageItem>,
}
/// The first item of the key is the block number at which the message arrived. This can
/// be used to authorise deletion of stale messages.
map (BlockNumber, u32) => Overweight;
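
To make the heap layout concrete, here is a minimal sketch of appending to and draining a page, using a plain Vec<u8> in place of the BoundedVec and a fixed little-endian header encoding; both simplifications are assumptions for illustration, and the `next_page` bookkeeping is omitted:

// Minimal sketch of the page heap: each item is a 4-byte header (`next`
// offset, `len`) followed by the encoded `MessageItem` bytes. A plain
// `Vec<u8>` stands in for the `BoundedVec`, and the little-endian header
// encoding is an assumption for illustration.
struct PageSketch {
	offset: u16,   // heap offset of the first unprocessed item
	heap: Vec<u8>, // bounded to HEAP_SIZE in the real thing
}

impl PageSketch {
	/// Append an encoded `MessageItem`; returns `false` if it doesn't fit.
	fn try_append(&mut self, item: &[u8]) -> bool {
		let end = self.heap.len() + 4 + item.len();
		// Keep all offsets representable as `u16` (a production version
		// would bound against HEAP_SIZE itself).
		if end > u16::MAX as usize {
			return false;
		}
		self.heap.extend_from_slice(&(end as u16).to_le_bytes()); // `next`
		self.heap.extend_from_slice(&(item.len() as u16).to_le_bytes()); // `len`
		self.heap.extend_from_slice(item);
		true
	}

	/// Pop the front item, advancing `offset`; `None` once the page is empty
	/// (at which point the page would be deleted and `next_page` followed).
	fn pop_front(&mut self) -> Option<&[u8]> {
		let at = self.offset as usize;
		if at >= self.heap.len() {
			return None;
		}
		let next = u16::from_le_bytes([self.heap[at], self.heap[at + 1]]);
		let len = u16::from_le_bytes([self.heap[at + 2], self.heap[at + 3]]) as usize;
		self.offset = next;
		Some(&self.heap[at + 4..at + 4 + len])
	}
}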

Messages are placed in the overweight index if either the message's size is beyond `max_message_size` (something like 1024 bytes) or the weight required for execution is beyond `overweight_threshold` (something like 0.5% of block weight).

Inline Execution

If the queue is empty when non-overweight messages arrive, then it may be reasonable to have them be executed immediately to avoid state churn.
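
Taken together with the overweight rule above, the arrival-time decision might look like this sketch; the names and thresholds are illustrative, not a concrete pallet API:

// Illustrative decision on message arrival, combining the overweight rule
// with inline execution. Names and thresholds are sketches, not pallet API.
enum Disposition {
	ExecuteInline, // queue empty and light enough: run it now, no state churn
	Enqueue,       // store in the paged queue for later servicing
	Overweight,    // park in the overweight map for manual execution
}

fn classify(
	encoded_len: usize,
	required_weight: u64,
	queue_is_empty: bool,
	max_message_size: usize,   // something like 1024
	overweight_threshold: u64, // something like 0.5% of block weight
) -> Disposition {
	if encoded_len > max_message_size || required_weight > overweight_threshold {
		Disposition::Overweight
	} else if queue_is_empty {
		Disposition::ExecuteInline
	} else {
		Disposition::Enqueue
	}
}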

QoS

This basic design has only very limited QoS safeguards: the two levels of prioritisation allow for high-priority messages to take precedence over regular-priority messages, assuming that they can be identified at the delivery stage. However, if all peer parachains' messages are dispatched with regular priority, then one parachain could effectively pay to delay the execution of any messages from other parachains by introducing a large number of heavy messages into the queue immediately prior to the point at which it wants to DoS other parachains.

This attack can be mitigated quite easily and cheaply (in terms of both performance and code complexity) by modestly expanding the number of queues to one per message origin. These queues would then need to be enumerated and processed in some fair fashion; round-robin (with a cycling starting origin, sketched below the note) would be the simplest, but shuffling them with a secure random seed would bring extra security guarantees.

Note: This only works for the immediate delivery source. It will fail if a single delivery origin can enqueue messages on behalf of multiple upstream origins, since it would then be able to "sybil attack" the QoS system.
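
A sketch of the round-robin variant under the assumption of one queue per origin; the types and the weight model are stand-ins:

// Sketch of fair servicing across per-origin queues: round-robin with a
// cycling start index, so a flood from one origin delays its peers by at
// most one message per pass. Types and weight model are stand-ins.
struct OriginQueue {
	// Encoded messages awaiting execution; for simplicity the front of the
	// queue is at the *end* of the `Vec`.
	pending: Vec<Vec<u8>>,
}

/// Service queues until the weight budget is spent; returns the origin
/// index to start from in the next block (the "cycling" part).
fn service_round_robin(queues: &mut [OriginQueue], start: usize, mut weight_left: u64) -> usize {
	let n = queues.len();
	if n == 0 {
		return 0;
	}
	loop {
		let mut progressed = false;
		for step in 0..n {
			let idx = (start + step) % n;
			// Peek first so an unaffordable message stays in place.
			let cost = match queues[idx].pending.last() {
				Some(msg) => msg.len() as u64, // stand-in for real weight
				None => continue,
			};
			if cost > weight_left {
				return idx; // out of weight: resume here next block
			}
			weight_left -= cost;
			let msg = queues[idx].pending.pop().expect("peeked above");
			execute(&msg);
			progressed = true;
		}
		if !progressed {
			return start; // every queue drained
		}
	}
}

fn execute(_msg: &[u8]) {
	// Stand-in for the XCM executor.
}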

rphmeier (Contributor) commented Oct 11, 2022

Extracting out a common utility makes sense, but I have a couple of comments on the approach:

DMP and HRMP (as implemented) and XCMP (as designed) both use a Message Queue Chain (MQC) system designed to separate messages out into their own hash-chain. This means that systems looking to import and dequeue messages only need a single state proof of a recent MQC entry. Any other 'synchronization' between the system's previous MQC and the new MQC can be handled without further relay-chain state proofs.
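
For context, each MQC link is roughly a rolling hash over the previous head, the block at which the message was sent, and the message hash; a dependency-free sketch, where the exact preimage encoding and hasher are assumptions (blake2-256 in practice):

// Rough sketch of an MQC: every enqueued message extends a rolling hash
// chain, so one recent head plus the raw messages lets a chain verify the
// whole backlog without further state proofs. `hash` stands in for the
// runtime hasher and the exact preimage encoding is illustrative.
type MqcHash = [u8; 32];

struct MessageQueueChain(MqcHash);

impl MessageQueueChain {
	fn extend(&mut self, sent_at: u32, message: &[u8]) {
		// head' = hash(head ++ sent_at ++ hash(message))
		let mut preimage = Vec::with_capacity(32 + 4 + 32);
		preimage.extend_from_slice(&self.0);
		preimage.extend_from_slice(&sent_at.to_le_bytes());
		preimage.extend_from_slice(&hash(message));
		self.0 = hash(&preimage);
	}
}

fn hash(_data: &[u8]) -> MqcHash {
	unimplemented!("stand-in for blake2-256")
}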

The queue should also have support for blob items. DMP, HRMP and XCMP are all blob transfer protocols at the moment with XCM being a higher-level concern.

gavofyork (Member Author) commented:

Yeah - I'm happy to move things over to blob and abstract away much of XCM.

MQC is cool, but AFAIK we process a minimum of one MQC page at a time. This means that we'll need to deal with a whole page's worth of messages at once. It might easily be that we cannot execute a whole page's worth of messages, so we'll need to be able to deal with storing pages and incrementally executing messages over several blocks, hence the need for this primitive.

rphmeier (Contributor) commented Oct 12, 2022

At the moment there is no MQC paging mechanism in VMP or HRMP. The outstanding MQC contents in each case are stored in a Big Unbounded Vector and should be paged, but that's not the current state of master (hrmp, ump, dmp), and we've only had scattered discussions on how to improve the data structures. This issue or paritytech/polkadot-sdk#489 would be good places to continue those discussions - but this issue seems more directly relevant than paritytech/polkadot-sdk#489.

It is true that we cannot expect chains to execute all of or even large swathes of the outstanding MQC entries in a single block and therefore there's a need for state machines to separate synchronization of the MQC (backwards) from processing of the MQC (forwards). It should never be necessary for state machines to evaluate more than a small constant number of relay-chain state proofs when updating their knowledge about the MQC head and processing messages, since MQCs already provide cryptographic guarantees of queue contents between some previous known link and the head. My mental model here is that parachain runtimes should have a multi-block algorithm something like this:

  1. if the trunk descending from the most recently processed MQC entry is partially or completely known, process messages forward via an inherent
  2. if not, get a recent MQC head from the relay-chain state (1 state proof). This could be any MQC head easily indexable through relay-chain state. The most recent MQC head is the easiest but not the only option.
  3. walk the MQC backwards, potentially over the course of multiple blocks, until we have a partial or complete trunk descending from the most recently processed MQC entry.

We could introduce some more clever relay-chain indexing of recent MQC heads for optimization of (2) and (3), and I think there are a few viable solutions like Hierarchical MQCs, Merkle Mountain Ranges, or power-of-two indexes. Pagination makes more sense at the message storage level than the MQC level itself to me. But since MQC entries are quite small (68 bytes apiece) a parachain runtime can update its local knowledge of the MQC trunk by ~15,000 messages per MB of PoV space and then process the messages once updated - it's not clear to me that this needs optimization.
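
To illustrate, the multi-block algorithm above could be modelled as a small state machine; everything here (types, the `prev_of` accessor) is an illustrative stand-in rather than existing runtime code:

// Sketch of steps 1-3 as a per-block state machine: either we know a trunk
// descending from the last-processed MQC entry and process forward, or we
// walk the chain backwards until we reconnect. All names are stand-ins.
type MqcHash = [u8; 32];

enum MqcSync {
	// Step 1: trunk (partially) known; process messages forward via inherent.
	ProcessForward { next_entry: MqcHash },
	// Steps 2-3: trunk unknown; `cursor` starts at a recent head obtained
	// from one relay-chain state proof, then walks backwards, potentially
	// over multiple blocks.
	WalkBack { cursor: MqcHash, last_processed: MqcHash },
}

fn on_block(state: MqcSync, prev_of: impl Fn(&MqcHash) -> MqcHash) -> MqcSync {
	match state {
		MqcSync::ProcessForward { next_entry } => {
			// ...process as many queued messages as weight allows...
			MqcSync::ProcessForward { next_entry }
		}
		MqcSync::WalkBack { cursor, last_processed } => {
			// Each MQC entry commits to its parent, so `prev_of` can be fed
			// from untrusted data supplied in the block and verified by hash.
			let prev = prev_of(&cursor);
			if prev == last_processed {
				MqcSync::ProcessForward { next_entry: cursor }
			} else {
				MqcSync::WalkBack { cursor: prev, last_processed }
			}
		}
	}
}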

By processing, I don't mean execution but rather import/queuing/discarding. It seems to me that we have two related problems here: 1. queuing messages within a state machine and 2. retrieval and indexing of metadata about un-processed portions of message queues stored outside of a state machine. The original post of this issue is mostly focused on the former, but HRMP/DMP (on both the relay-chain & parachain side) are also quite reliant on the latter. I bring up (2) because it needs to be accounted for in any general pallet that's used on the relay-chain side of HRMP/DMP, but if this is just about the parachain side of things we can certainly separate those concerns.

Regardless of storage optimizations, in HRMP/XCMP we need a mechanism to be able to clear ancient message storage via a 'watermark' block number for backwards compatibility. There are possibly better approaches than the watermark but those would require a lot of parachains-protocol level changes that would be quite involved and are best bundled with a new major version of the candidate format.

gavofyork (Member Author) commented Oct 12, 2022

Ok - I'm really not wanting to get deeply involved in the MQC side of things. I just want to figure out a reasonable API to connect the output of MQC reading/dequeuing to the input of the (potentially deferring, as required) message processing system.

My understanding from originally integrating this with HRMP is that each chain has at most one large data chunk (inter-chain message blob or ICMB) arriving into it in a block from any other chain. This one large data block is actually comprised of many smaller (higher-level) message blobs (likely in XCM format) concatenated together, but this is not a concern of the transport layer.

I don't know if it's reasonable to expect to be able to process (by which I essentially mean enqueue, though in some circumstances it may be reasonable to execute some or all of the constituent XCMs also) ICMBs in integer amounts per block: it would simplify things for sure, but if they're too big then they could cause too much PoV bloat to the point that blocks become exhausted just from message enqueuing.

It would be helpful for me to understand the exact logistics of ICMBs - how large they are, where they are being fed into a block and what expectations there are over how many are fed in at a time.

gavofyork (Member Author) commented:

Draft PR open and in progress: paritytech/substrate#12485

rphmeier (Contributor) commented Oct 12, 2022

> Ok - I'm really not wanting to get deeply involved in the MQC side of things. I just want to figure out a reasonable API to connect the output of MQC reading/dequeuing to the input of the (potentially deferring, as required) message processing system.

Yeah, that makes sense. I misinterpreted the scope of the issue at first and we can continue the MQC conversations elsewhere. But to be clear, my underlying point is that the 'generalized XCM queue' is more useful for the Cumulus side of XCMP/HRMP/DMP than the relay-chain runtime side.

> My understanding from originally integrating this with HRMP is that each chain has at most one large data chunk (inter-chain message blob or ICMB) arriving into it in a block from any other chain. This one large data block is actually comprised of many smaller (higher-level) message blobs (likely in XCM format) concatenated together, but this is not a concern of the transport layer.

That is true. The maximum size of these blobs is determined by governance via the parachainsConfiguration pallet and currently sits at 100KB for HRMP and 50KB on Polkadot and Kusama. In HRMP, the watermark mechanism only allows them to be cleared based on the block number, but each channel is only allowed to send 100KB of messages per block. The number of messages a parachain needs to process is based on the watermark advancement rule. At the moment there is a single watermark for all inbound channels and we should probably consider other mechanisms.

gavofyork (Member Author) commented:

> But to be clear, my underlying point is that the 'generalized XCM queue' is more useful for the Cumulus side of XCMP/HRMP/DMP than the relay-chain runtime side.

Yes, though it has applications on the Relay-chain too since there is UMP which needs to be serviced safely.

gavofyork (Member Author) commented:

It would be helpful to confirm that the given API in paritytech/substrate#12485 (trait EnqueueMessage) is adequate for the ICMB handler in UMP/DMP/XCMP. Performance-wise, if we make the blob format and the page format identical, then we could perhaps get away with a single memcpy per page rather than two memcpys per message. But I'm not sure it's really that important, and even if it is, I'm not sure it should be addressed right now. The main thing is that there can be no PoV exhaustion attacks, and as long as only one 100KB page needs to be processed at once, I don't see how there could be.

@Polkadot-Forum commented:

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/polkadot-release-analysis-v0-9-36/1529/1
