Add design for velero backup performance improvements #7628
Conversation
/kind changelog-not-required
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff            @@
##             main    #7628     +/-  ##
========================================
- Coverage   61.88%   58.77%   -3.11%
========================================
  Files         266      345      +79
  Lines       29462    28766     -696
========================================
- Hits        18232    16908    -1324
- Misses       9938    10428     +490
- Partials     1292     1430     +138
```
I finally got a chance to read the proposal and it looks pretty good. However, I have a few comments. What is the impact of the "orderedResources" setting (in the Backup/Schedule CR) on the concurrency? It seems to me that if users specify a particular order for resources, say PVCs, backup of the item blocks that contain the ordered resources must be serialized. Also, I think overall backup performance is determined by how fast CSI snapshot operations and data mover operations are done, rather than by the concurrency of the resource manifest backup. In that sense, the concurrency proposed in this document will only benefit if there are multiple item blocks, each with its own set of PVCs. If that is not the case (say, you are backing up a single namespace with a pod that has 5 PVCs), users won't see much benefit even if the worker count is set high. My point in general is that this proposal should be studied from the point of view of its impact on the PVC backup. Apart from what I mentioned above, the concurrency can result in more CSI snapshot API calls and more data mover operations taking place. Perhaps it is worth explicitly calling this out in the document.
This is actually a great point. I don't think that's accounted for right now. Current design takes the item collector list and generates item blocks from that. orderedResources are moved to the beginning of the list, but to take a specific example, say we have item1-item5 from ordered resources, none of which add additional items via plugins, if we have a worker pool of size 5, that means with the current design, all 5 items will be processed concurrently. I think your proposal would be that everything that comes via orderedResources should be forced to be executed sequentially. If we decide to take this approach, this means (at a minimum), everything that's in orderedResources must belong to the same ItemBlock -- that guarantees that these items will be processed in order, in the same thread. But that may not be sufficient. I think we may also want to make sure that we don't process any other items until orderedResources are complete. Is this correct? If yes, then I think the following design modification would resolve it:
@draghuram does the above resolve this appropriately?
For fs-backup scenarios, with multiple pods, this design will probably help us. It will also help in cases where there are a great many CSI snapshot operations. One case we ran into recently was one where there were a large number of VMs being backed up where the storage capacity for each was relatively small, but the synchronous component of CSI operations added up to many hours for the whole backup -- if we had a pool with 10 workers, that synchronous component would have been greatly reduced.
datamover operations have their own concurrency limitations: (number of node agent pods) × (per-node-agent concurrency). In backups where that's the limiting factor, this design won't do much. In backups where that is not the limiting factor, I'd expect some improvement here. Also note that there are two out-of-scope enhancements (see Non Goals) that will build on parts of this design to meet other future needs:
@draghuram @shawn-hurley Actually, a slight modification to this suggestion above probably makes sense: "instead of one loop passing over the full item list, we have 2 loops. First loop stops iterating when it hits an item not identified in 1) as an ordered resource. If there are no ordered resources, this means we don't process anything here. We do the usual itemBlock processing, except instead of one block per item, all items identified as ordered resources are appended together into a single ItemBlock. We now back up this one ItemBlock consisting of all OrderedResources synchronously. Once it's done, we then go into the item processing loop described in the current design (with ordered resources already having been removed/handled)." Rather than doing this initial OrderedResources ItemBlock completely synchronously, we should probably send it to the worker channel just like we'll do later for the other ItemBlocks -- it's just we won't send any additional ItemBlocks until this one has completed, otherwise once we get to multiple backups at once, the worker count for "how many items to back up at the same time" will be off by one while this synchronous ItemBlock backup is happening. Instead, we should probably just send this first block to the channel, wait for it to complete, and then do the second pass through everything as proposed above.
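A hedged sketch of this dispatch pattern (all names here are hypothetical, not the actual Velero implementation): blocks built from ordered resources are sorted to the front, sent to the worker channel one at a time with a wait for each result, and only then are the remaining blocks released for concurrent processing.

```go
package main

import "fmt"

// itemBlock is a simplified stand-in for the design's ItemBlock;
// "ordered" marks blocks built from orderedResources entries.
type itemBlock struct {
	name    string
	ordered bool
}

// dispatch sends ordered blocks to the worker channel one at a time,
// waiting for each result before sending the next, then releases the
// remaining blocks for concurrent processing. Ordered blocks are
// assumed to be sorted to the front of the slice.
func dispatch(blocks []itemBlock, input chan<- itemBlock, done <-chan string) {
	i := 0
	for ; i < len(blocks) && blocks[i].ordered; i++ {
		input <- blocks[i]
		<-done // serialize: wait for this ordered block to finish
	}
	for ; i < len(blocks); i++ {
		input <- blocks[i] // workers pick these up concurrently
	}
	close(input)
}

func main() {
	input := make(chan itemBlock, 10)
	done := make(chan string, 10)
	// Single stand-in worker; the real design runs a pool of them.
	go func() {
		for b := range input {
			done <- b.name // a real worker would back up the block here
		}
	}()
	dispatch([]itemBlock{{"pvc-1", true}, {"pvc-2", true}, {"pod-1", false}}, input, done)
	fmt.Println("dispatched")
}
```

Because the ordered blocks go through the same worker channel as everything else, the worker count stays accurate once multiple backups run in parallel.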
The k8s resources are already downloaded from the API server to the local disk by the item collection process before the concurrent processing proposed by this design, so it seems the concurrency isn't really necessary. And I agree with @draghuram that the overall backup performance is determined by how fast the data is handled, not by the k8s resources.
@sseago I was thinking that you would still create multiple item blocks spanning ordered resources, but you would process them in order, rather than sending them to the workers concurrently. Your proposal of using one item block would work too, but the difference is that other item-block-level operations such as hooks may be affected by putting all the resources in one item block. By that I mean, we may be executing hooks for a group of pods that may not be strictly related. To be sure, I don't know if this is such a big concern. I haven't used the "orderedResources" feature myself a great deal. Also, I agree with you that the proposal helps with performance by parallelizing CSI and other time-consuming operations. BTW, were you referring to the KubeVirt plug-in when you said "VM operations"? And yes, item block functionality will come in handy when it is time to implement volume group snapshots.
@draghuram Ahh, yes, you're right. If pods are included, we'd want to treat them as separate blocks (since hooks run before/after ItemBlocks) -- so basically, we'd still need to identify which items in the collector output come from ordered resources, so we first process the ordered resources, creating ItemBlocks normally, but passing one at a time to be processed, waiting for the response before going to the next, once we hit the first ItemBlock without ordered resources, we can start processing them in parallel. @ywk253100 See the comment above, but we've seen scenarios where if there are a large number of volumes, the backup took many hours due to the synchronous first part of the CSI backup process -- that's one example where processing items in parallel should shorten overall backup time. Also note the other two use cases which this design enables:
@sseago

2. Volume group snapshot
3. Large number of volumes backup
4. The CSI snapshot corner case: the CSI driver retries for a long time before the CSI handle appears
5. Parallel backups

Secondly, let me clarify the major concern from us:
Therefore, we suggest the adjustment below:
@Lyndon-Li I spent some time discussing this with @shawn-hurley today, and I think we may be able to drop the "everything must be v3" requirement, as long as we document clearly for plugin authors that for any pod-related dependencies that would break if the deps were done in parallel, they need to provide the v3 plugin. Effectively, the only use case that we know would break here would be a scenario with multiple pods that need to be backed up together. But that's a use case that's already somewhat broken, since inter-pod dependencies and hooks aren't handled at all right now.
On your second concern, we're adding the new call after velero already has the full content in the current code, which we need anyway to back up the item. The only extra memory usage, then, is related to the buffer size of the channel -- if we have a buffer size of 10, then 10 ItemBlocks have content in-memory, which is a relatively small amount of memory, all things considered, especially since Kubernetes metadata is limited in size for each resource. You cannot have arbitrarily large k8s resources. Also, since we already have the logic in the existing BIAs to identify items needed (see the pod_action upgrade example in the design), we're just moving logic around to reuse it. If we were to move this into discovery, we'd need to replicate that logic elsewhere and modify the item collector output to be block-aware. It also would not be extensible at all, but we really do need to do this in a plugin-aware way (more on that below). The main objection we have to hard-coding the logic around pod->PVC dependencies and the multi-pod dependencies for VolumeGroup support is that this prevents people from extending this concept via plugins. There are other use cases that require pods to be backed up together that internal Velero logic won't know about. For example, if I'm maintaining an operator that I want my users to be able to back up with Velero, with the plugin approach, I can write a BackupItemAction which operates on pods in my operator and returns the other pods in the GetAdditionalItems method, and this will ensure that 1) pod hooks are run for my operator pods before and after the group of pods is backed up, and 2) these pods and their associated PVCs will be backed up in a single worker, since they're in one ItemBlock.
To summarize my last comment, regarding the two major points of disagreement:
Or an even shorter summary: on point 1 above, we agree with you now and will update the design, but on point 2, we're going to keep the BIAv3 as-is, since it enables some use cases that we need which aren't handled without it. |
A little more detail on the concerns around calling the new plugin methods after item collection is done. First, as for the objection that we need the full object in memory -- we already need it in memory to process it a bit later when it gets to the worker thread, so worst case scenario is we're holding on to a few of these longer than we would be without ItemBlocks. How much memory are we talking about? Kubernetes resource metadata is limited in size -- somewhere around 1mb max. So we're talking about resourceSize × bufferSize × itemsPerBlock -- if we assume a buffer size of 10 and 5 items per block (I expect most items will be in blocks of one, and a few items will be in blocks larger than 5) -- this means we're holding an extra 50mb of item data in memory, at most. As for the suggestion of doing this during item collection, there are a couple of problems with that. Currently the item collector grabs one GroupResource at a time via a
By processing the ItemBlock at the end of item collection, the items we need will already be loaded from the cluster and referenced in the items list, so when the current item returns "item 2" as a needed item, we can just grab that item from the list, mark it as "in a block already" (so later it won't be added to another one), and then call this item's additional items method, etc. This can all be done with minimal disruption to the rest of the workflow, and we're limiting the number of items in memory to the size of the input channel's buffer. This also preserves the ability for plugin authors to create their own ItemBlocks as necessary for their backup needs. |
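The worst-case arithmetic above can be spelled out; the 1MB per-resource cap, the buffer size of 10, and the 5 items per block are the assumptions stated in the comment, not measured values:

```go
package main

import "fmt"

func main() {
	const maxResourceMB = 1 // k8s object size is capped at roughly 1MB
	const bufferSize = 10   // assumed input-channel buffer size
	const itemsPerBlock = 5 // pessimistic average; most blocks hold one item

	extraMB := maxResourceMB * bufferSize * itemsPerBlock
	fmt.Println(extraMB, "MB") // 50 MB of extra item data held in memory, at most
}
```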
I've updated the design based on the conversations earlier this week:
#### Concerns about the design from review and meeting comments

At this point, there seem to be two primary objections to the design in the review (and community meeting) comments:
#### Responses to concern 1
#### Responses to concern 2
#### Specific use cases that would require the proposed plugin

**Cassandra (cass-operator)**

For users of Cassandra, cass-operator creates a StatefulSet with three pods, each of which has its own mounted volume. With a plugin API to associate related entities, we can make sure that all three pods are included in the same ItemBlock. The internal-to-velero pod plugin will make sure that the volumes for each pod are included. This will make sure that pre-backup hooks on all three Cassandra pods are run first, followed by CSI snapshot creation, followed by the post-backup hooks.

#### Alternate design considerations

**New plugin type ItemBlockAction**

At the last community meeting, Daniel suggested we might consider creating a new ItemBlockAction plugin type instead of extending BackupItemAction. While this will require more upfront work than extending BIA, once the plugin API is added, there's very little in the overall design here that changes. I expect that it will work just as well (no better, no worse) as the proposed BIAv3 design. The work required for plugin authors is also approximately the same. Because of this, while I personally have a mild preference for the BIAv3 design, I don't have any strong objections to the ItemBlockAction approach. This has been added to the "Alternatives considered" section of the design.
@sseago
Finally, we want to share our suggestions below:
I think this distinction is incorrect. All of the groupings are for logically associated items. The distinction is as follows:
In particular, the external groupings are related to item dependencies/associations, just like the internal ones. Also, these aren't provided by end users but by application or infrastructure developers. End users are not expected to know or understand these dependencies at the individual resource level.
This can't come from user input -- it comes from application/operator developers, which means even if we did it during item collection, it would need to be a plugin of some sort. That plugin would need to take in a resource (with content) and return a list of items that need to be associated with it. So doing this in the ItemCollector won't be any simpler than defining a new ItemBlockAction plugin which takes in the exact same thing and returns the same list of resources. Doing it in the item collector would be more complicated, though, since then we'll need to change the whole output structure to return groups of items rather than just a list, and on top of that since the associated items may not have been listed from the cluster yet, we would need to pull them out individually with a Get call, and then we'd need to check for every future item as we collect it to make sure it's not already in a separate group. Essentially we'd be taking all of the item block processing logic that the design proposes post-item-collection and doing it during item collection, but since we don't have the full item list at this point, it's much less efficient. Is there any disagreement here?
We don't have a fixed number of ItemBlocks because, as we process the list we get from the collector, we end up adding items from further down into the current block. By the time we have the full list of item blocks, we are almost done with processing. More to the point, items returned by the IBA plugin API call don't create new ItemBlocks; they add later items to the current block.
Hey folks, I want to add a small summary to try and see if we can get agreement on some points. The goal here is to lay out some high-level concepts to make sure that we agree with what is proposed and that it solves the problem.

Points:
I hope that we can gain agreement on these three points; seeing what we agree and disagree on will make the disagreements clear from here. I am worried that it is not clear which points are under discussion and which we agree on, so a level set here will be helpful.
#### Proto definition (compiled into golang by protoc)
The ItemBlockAction plugin type is defined as follows:
```
all code blocks are Go code examples, right?
Would add language to these code blocks for syntax highlighting, to improve reading: change each opening fence to specify the language, i.e. ```go instead of a bare ```.
@mateusoliveira43 most of them, yes. I'll add it where appropriate. The one exception is the protobuf API spec section.
- If `item` is already in `itemsInBlock`, continue. This one has already been processed.
- Add `item` to `itemsInBlock`.
- Load item from ItemCollector file. Close/remove file after loading (on error return or not, possibly with a similar anonymous func to the current impl)
- Get matching IBA plugins for item, call `GetRelatedItems` for each. For each item returned, get full item content from ItemCollector (if present in item list, pulling from file, removing file when done) or from cluster (if not present in item list), add item to the current block, add item to `itemsInBlock` map, and then recursively apply the current step to each (i.e. call BIA method, add to block, etc.)
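The recursive assembly steps above can be sketched roughly as follows. This is a simplified illustration with hypothetical types: `relatedFn` stands in for the IBA plugin `GetRelatedItems` call, and items are plain string keys rather than loaded resources.

```go
package main

import "fmt"

// relatedFn stands in for the ItemBlockAction GetRelatedItems call;
// it maps an item key to the keys of its related items.
type relatedFn func(key string) []string

// buildBlock recursively assembles one ItemBlock starting from key,
// using the shared itemsInBlock map to guarantee no item lands in
// more than one block (back-references are simply skipped).
func buildBlock(key string, related relatedFn, itemsInBlock map[string]bool) []string {
	if itemsInBlock[key] {
		return nil // already processed, possibly in another block
	}
	itemsInBlock[key] = true
	block := []string{key}
	for _, r := range related(key) {
		block = append(block, buildBlock(r, related, itemsInBlock)...)
	}
	return block
}

func main() {
	rel := func(key string) []string {
		switch key {
		case "pod-1":
			return []string{"pvc-1"}
		case "pvc-1":
			return []string{"pv-1", "pod-1"} // back-reference is ignored via itemsInBlock
		}
		return nil
	}
	seen := map[string]bool{}
	fmt.Println(buildBlock("pod-1", rel, seen)) // [pod-1 pvc-1 pv-1]
}
```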
(i.e. call BIA method, add to block, etc.)
Do you mean call "BIA" method or "IBA" method?
@reasonerjt good catch. This was originally BIA, but we recently changed the design to use the new plugin type and I forgot to change this one. I'll fix it.
After backing up the items in the block, we now execute post hooks using the same filtered item list we used for pre hooks, again taking the logic from `itemBackupper.backupItemInternal`.
#### `itemBackupper.backupItemInternal` cleanup |
will `backupItem` be removed?
@reasonerjt No. BackupItemBlock calls BackupItem in turn on each item in the ItemBlock, so that will remain. It will just be called from a different place than it is now -- and it will also be called within BackupItemInternal on the returned additional items.
Sorry, this is related to another comment of mine. I was thinking `backupItemBlock` will be part of `kubernetesBackupper` and it will replace `kubernetesBackupper.backupItem`.
```
type ItemBlock struct {
	log logrus.FieldLogger
	itemBackupper *itemBackupper
Do we need different itemBackupper instances in different ItemBlocks?
There are trackers in the itemBackupper struct which collect information throughout the backup procedure; if we allow different itemBackuppers, we'll also need to consolidate the info collected by each tracker.
I also have doubts about having the logger within ItemBlock, but this is trivial...
The ItemBackupper is connected to the backup, so we won't want a different instance for each block. The reason that the reference is in the ItemBlock struct is that this is how the worker will get access to the ItemBackupper, since in the future when we have multiple backups in parallel, the workers may be handling ItemBlocks from different backups, so the ItemBlock needs to contain everything it needs to access. This is the same reason we need the logger -- so that when the worker thread calls BackupItemBlock it can pass in the correct logger for the current backup so that the logs get uploaded to the BSL, etc. Basically everything that's currently passed in to BackupItem needs to be made available in the ItemBlock struct.
The reason that the reference is in the ItemBlock struct is that this is how the worker will get access to the ItemBackupper, since in the future when we have multiple backups in parallel, the workers may be handling ItemBlocks from different backups, so the ItemBlock needs to contain everything it needs to access.
Thanks for the reply. IMO we need to ensure the constraint that different itemBlocks belonging to one backup share one itemBackupper instance. For example, add a reference to BackupRequest to the itemBlock struct and a reference to the itemBackupper. Or at least add a comment to highlight that in the definition of the ItemBlock struct.
I think adding the comment is probably sufficient here -- there's only one place in the code where ItemBlocks are created, and that will always pass a reference to the existing itemBackupper for the backup.
Method signature for new func `backupItemBlock` is as follows:
```
func backupItemBlock(block ItemBlock) []schema.GroupResource |
is it `func (kb *kubernetesBackupper) backupItemBlock(...)`?
I'm not quite sure. It's not an interface method for the Backupper interface, and we're not calling it from one of the specific KubernetesBackupper interface method implementations -- since it's being called from the worker threads, it's probably better as a top-level func.
IMO this should be a part of `kubernetesBackupper`, but we can double check during the implementation.
Hmm. So if we added this to kubernetesBackupper, then we'd need to pass a reference to that, in addition to the ItemBlock, into the worker thread -- it might complicate things too much. But I think this is a detail we can work through during the implementation -- if it makes more sense to do it that way, we can consider it.
After implementing backup hooks in `backupItemBlock`, hook processing should be removed from `itemBackupper.backupItemInternal`.
### Phase 2: Process ItemBlocks for a single backup in multiple threads |
I think more details should be added to this section, like how to process the returned value from `returnChan`.
@sseago @shawn-hurley
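For illustration, here is a hedged sketch of the Phase 2 flow, with hypothetical names (`runPool`, `blockResult`): a fixed number of workers consume ItemBlocks from a buffered input channel and report results on a return channel, which the dispatcher drains much as the current sequential loop tallies errors and backed-up GroupResources. The real design's returnChan payload is not specified here.

```go
package main

import (
	"fmt"
	"sync"
)

// itemBlock is a simplified stand-in for the design's ItemBlock; in the
// proposal each block also carries its backup's logger and itemBackupper.
type itemBlock struct{ items []string }

// blockResult is a hypothetical shape for what a worker sends back on
// the return channel: which block finished and any error backing it up.
type blockResult struct {
	block itemBlock
	err   error
}

// runPool dispatches blocks to a fixed worker pool over a buffered
// channel, drains the return channel, and reports how many succeeded.
func runPool(blocks []itemBlock, workers int) int {
	input := make(chan itemBlock, 10) // buffer bounds blocks held in memory
	returnChan := make(chan blockResult)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for block := range input {
				// A real worker would call backupItemBlock here.
				returnChan <- blockResult{block: block}
			}
		}()
	}
	go func() { wg.Wait(); close(returnChan) }()

	go func() {
		for _, b := range blocks {
			input <- b
		}
		close(input)
	}()

	done := 0
	for res := range returnChan {
		if res.err == nil {
			done++
		}
	}
	return done
}

func main() {
	blocks := make([]itemBlock, 5)
	for i := range blocks {
		blocks[i] = itemBlock{items: []string{fmt.Sprintf("item-%d", i)}}
	}
	fmt.Println(runPool(blocks, 3)) // 5
}
```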
@Lyndon-Li That is a good question, actually. First, to clarify, a PV can only be bound by one PVC, even if it's RWX, so your scenario above would be "Pod1->PVC1->PV1 and Pod2->PVC1->PV1" -- the same PVC mounted by both pods. This is an example of something that we don't actually handle well right now at all, since we process everything sequentially, so with current Velero 1.14, Pod1 pre hooks run, we back up PVC1, PV1, then Pod1 post hooks, then Pod 2 gets backed up, pre-hooks run (but PV has already been backed up). I think this can be resolved without any changes to the design. We just need the PVC IBA plugins to return dependencies in both directions. So we'd have the following IBA plugins:
@sseago
So could you also add some details on how to create and maintain an itemBlock in various cases to the document?
It sounds like what's missing is the specific example of an RWX volume shared by multiple pods -- I can add this to the ItemBlock examples in the document -- but to be clear, the only thing necessary to completely handle this use case is an IBA for PVCs which returns the list of mounting pods. I'll make sure the document makes that clear as well.
@anshulahuja98 Can you please take another look and re-ack the design if all your queries/concerns were addressed by @sseago?
Approved.
Thank you for contributing to Velero!

This PR adds a design document to enhance single backup performance with multithreaded item backup. This design will also implement some of the prerequisites for future support of VolumeGroupSnapshots.

Fixes #7474

/kind changelog-not-required