Add design for velero backup performance improvements #7628
Conversation
/kind changelog-not-required
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff            @@
##             main    #7628     +/-  ##
========================================
- Coverage   61.88%   58.77%   -3.11%
========================================
  Files         266      345      +79
  Lines       29462    28766     -696
========================================
- Hits        18232    16908    -1324
- Misses       9938    10428     +490
- Partials     1292     1430     +138
```
I finally got a chance to read the proposal and it looks pretty good. However, I have a few comments. What is the impact of the "orderedResources" setting (in the Backup/Schedule CR) on the concurrency? It seems to me that if users specify a particular order for resources, say PVCs, backup of the item blocks that contain the ordered resources must be serialized. Also, I think overall backup performance is determined by how fast CSI snapshot operations and data mover operations are done, rather than by the concurrency of the resource manifest backup. In that sense, the concurrency proposed in this document will only benefit if there are multiple item blocks, each with its own set of PVCs. If that is not the case (say, you are backing up a single namespace with a pod that has 5 PVCs), users won't see much benefit even if the worker count is set high. My point in general is that this proposal should be studied from the point of view of its impact on the PVC backup. Apart from what I mentioned above, the concurrency can result in more CSI snapshot API calls and more data mover operations taking place. Perhaps it is worth explicitly calling this out in the document.
This is actually a great point. I don't think that's accounted for right now. Current design takes the item collector list and generates item blocks from that. orderedResources are moved to the beginning of the list, but to take a specific example, say we have item1-item5 from ordered resources, none of which add additional items via plugins, if we have a worker pool of size 5, that means with the current design, all 5 items will be processed concurrently. I think your proposal would be that everything that comes via orderedResources should be forced to be executed sequentially. If we decide to take this approach, this means (at a minimum), everything that's in orderedResources must belong to the same ItemBlock -- that guarantees that these items will be processed in order, in the same thread. But that may not be sufficient. I think we may also want to make sure that we don't process any other items until orderedResources are complete. Is this correct? If yes, then I think the following design modification would resolve it:
@draghuram does the above resolve this appropriately?
For fs-backup scenarios, with multiple pods, this design will probably help us. It will also help in cases where there are a great many CSI snapshot operations. One case we ran into recently was one where there were a large number of VMs being backed up where the storage capacity for each was relatively small, but the synchronous component of CSI operations added up to many hours for the whole backup -- if we had a pool with 10 workers, that synchronous component would have been greatly reduced.
datamover operations have their own concurrency limitations: (number of node agent pods) × (per-node-agent concurrency). In backups where that's the limiting factor, this design won't do much. In backups where that is not the limiting factor, I'd expect some improvement here. Also note that there are two out-of-scope enhancements (see Non Goals) that will build on parts of this design to meet other future needs:
@draghuram @shawn-hurley Actually, a slight modification to this suggestion above probably makes sense: "instead of one loop passing over the full item list, we have 2 loops. First loop stops iterating when it hits an item not identified in 1) as an ordered resource. If there are no ordered resources, this means we don't process anything here. We do the usual itemBlock processing, except instead of one block per item, all items identified as ordered resources are appended together into a single ItemBlock. We now back up this one ItemBlock consisting of all OrderedResources synchronously. Once it's done, we then go into the item processing loop described in the current design (with ordered resources already having been removed/handled)." Rather than doing this initial OrderedResources ItemBlock completely synchronously, we should probably send it to the worker channel just like we'll do later for the other ItemBlocks -- it's just we won't send any additional ItemBlocks until this one has completed, otherwise once we get to multiple backups at once, the worker count for "how many items to back up at the same time" will be off by one while this synchronous ItemBlock backup is happening. Instead, we should probably just send this first block to the channel, wait for it to complete, and then do the second pass through everything as proposed above.
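A hedged sketch of this dispatch pattern (all names here are hypothetical, not the actual Velero implementation): blocks built from ordered resources are sorted to the front, sent to the worker channel one at a time with a wait for each result, and only then are the remaining blocks released for concurrent processing.

```go
package main

import "fmt"

// itemBlock is a simplified stand-in for the design's ItemBlock;
// "ordered" marks blocks built from orderedResources entries.
type itemBlock struct {
	name    string
	ordered bool
}

// dispatch sends ordered blocks to the worker channel one at a time,
// waiting for each result before sending the next, then releases the
// remaining blocks for concurrent processing. Ordered blocks are
// assumed to be sorted to the front of the slice.
func dispatch(blocks []itemBlock, input chan<- itemBlock, done <-chan string) {
	i := 0
	for ; i < len(blocks) && blocks[i].ordered; i++ {
		input <- blocks[i]
		<-done // serialize: wait for this ordered block to finish
	}
	for ; i < len(blocks); i++ {
		input <- blocks[i] // workers pick these up concurrently
	}
	close(input)
}

func main() {
	input := make(chan itemBlock, 10)
	done := make(chan string, 10)
	// Single stand-in worker; the real design runs a pool of them.
	go func() {
		for b := range input {
			done <- b.name // a real worker would back up the block here
		}
	}()
	dispatch([]itemBlock{{"pvc-1", true}, {"pvc-2", true}, {"pod-1", false}}, input, done)
	fmt.Println("dispatched")
}
```

Because the ordered blocks go through the same worker channel as everything else, the worker count stays accurate once multiple backups run in parallel.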
The k8s resources are already downloaded from the API server to the local disk by the item collection process before the concurrent processing proposed by this design, so it seems the concurrency isn't really necessary. And I agree with @draghuram that the overall backup performance is determined by how fast the data is handled, not by the k8s resources.
@sseago I was thinking that you would still create multiple item blocks spanning ordered resources, but you would process them in order, rather than sending them to the workers concurrently. Your proposal of using one item block would work too, but the difference is that other item-block-level operations such as hooks may be affected by putting all the resources in one item block. By that I mean, we may be executing hooks for a group of pods that may not be strictly related. To be sure, I don't know if this is such a big concern. I haven't used the "orderedResources" feature myself a great deal. Also, I agree with you that the proposal helps with performance by parallelizing CSI and other time-consuming operations. BTW, were you referring to the KubeVirt plug-in when you said "VM operations"? And yes, item block functionality will come in handy when it is time to implement volume group snapshots.
@draghuram Ahh, yes, you're right. If pods are included, we'd want to treat them as separate blocks (since hooks run before/after ItemBlocks) -- so basically, we'd still need to identify which items in the collector output come from ordered resources, so we first process the ordered resources, creating ItemBlocks normally, but passing one at a time to be processed, waiting for the response before going to the next, once we hit the first ItemBlock without ordered resources, we can start processing them in parallel. @ywk253100 See the comment above, but we've seen scenarios where if there are a large number of volumes, the backup took many hours due to the synchronous first part of the CSI backup process -- that's one example where processing items in parallel should shorten overall backup time. Also note the other two use cases which this design enables:
@sseago

2. Volume group snapshot
3. Large number of volumes backup
4. The CSI snapshot corner case: the CSI driver retries for a long time before the CSI handle appears
5. Parallel backups

Secondly, let me clarify the major concern from us:
Therefore, we suggest the adjustment below:
@Lyndon-Li I spent some time discussing this with @shawn-hurley today, and I think we may be able to drop the "everything must be v3" requirement, as long as we document clearly for plugin authors that for any pod-related dependencies that would break if the deps were done in parallel, they need to provide the v3 plugin. Effectively, the only use case that we know would break here would be a scenario with multiple pods that need to be backed up together. But that's a use case that's already somewhat broken, since inter-pod dependencies and hooks aren't handled at all right now.
On your second concern, we're adding the new call after velero already has the full content in the current code, which we need anyway to back up the item. The only extra memory usage, then, is related to the buffer size of the channel -- if we have a buffer size of 10, then 10 ItemBlocks have content in-memory, which is a relatively small amount of memory, all things considered, especially since Kubernetes metadata is limited in size for each resource. You cannot have arbitrarily large k8s resources. Also, since we already have the logic in the existing BIAs to identify items needed (see the pod_action upgrade example in the design), we're just moving logic around to reuse it. If we were to move this into discovery, we'd need to replicate that logic elsewhere and modify the item collector output to be block-aware. It also would not be extensible at all, but we really do need to do this in a plugin-aware way (more on that below). The main objection we have to hard-coding the logic around pod->PVC dependencies and the multi-pod dependencies for VolumeGroup support is that this prevents people from extending this concept via plugins. There are other use cases that require pods to be backed up together that internal Velero logic won't know about. For example, if I'm maintaining an operator that I want my users to be able to back up with Velero, with the plugin approach, I can write a BackupItemAction which operates on pods in my operator and returns the other pods in the GetAdditionalItems method, and this will ensure that 1) pod hooks are run for my operator pods before and after the group of pods is backed up, and 2) these pods and their associated PVCs will be backed up in a single worker, since they're in one ItemBlock.
To summarize my last comment, regarding the two major points of disagreement:
Or an even shorter summary: on point 1 above, we agree with you now and will update the design, but on point 2, we're going to keep the BIAv3 as-is, since it enables some use cases that we need which aren't handled without it. |
A little more detail on the concerns around calling the new plugin methods after item collection is done. First, as for the objection that we need the full object in memory -- we already need it in memory to process it a bit later when it gets to the worker thread, so worst case scenario is we're holding on to a few of these longer than we would be without ItemBlocks. How much memory are we talking about? Kubernetes resource metadata is limited in size -- somewhere around 1mb max. So we're talking about resourceSize × bufferSize × itemsPerBlock -- if we assume a buffer size of 10 and 5 items per block (I expect most items will be in blocks of one, and a few items will be in blocks larger than 5) -- this means we're holding an extra 50mb of item data in memory, at most. As for the suggestion of doing this during item collection, there are a couple of problems with that. Currently the item collector grabs one GroupResource at a time via a
By processing the ItemBlock at the end of item collection, the items we need will already be loaded from the cluster and referenced in the items list, so when the current item returns "item 2" as a needed item, we can just grab that item from the list, mark it as "in a block already" (so later it won't be added to another one), and then call this item's additional items method, etc. This can all be done with minimal disruption to the rest of the workflow, and we're limiting the number of items in memory to the size of the input channel's buffer. This also preserves the ability for plugin authors to create their own ItemBlocks as necessary for their backup needs. |
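The worst-case arithmetic above can be spelled out; the 1MB per-resource cap, the buffer size of 10, and the 5 items per block are the assumptions stated in the comment, not measured values:

```go
package main

import "fmt"

func main() {
	const maxResourceMB = 1 // k8s object size is capped at roughly 1MB
	const bufferSize = 10   // assumed input-channel buffer size
	const itemsPerBlock = 5 // pessimistic average; most blocks hold one item

	extraMB := maxResourceMB * bufferSize * itemsPerBlock
	fmt.Println(extraMB, "MB") // 50 MB of extra item data held in memory, at most
}
```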
I've updated the design based on the conversations earlier this week:
#### Concerns about the design from review and meeting comments

At this point, there seem to be two primary objections to the design in the review (and community meeting) comments:
#### Responses to concern 1
#### Responses to concern 2
#### Specific use cases that would require the proposed plugin

**Cassandra (cass-operator)**

For users of Cassandra, cass-operator creates a StatefulSet with three pods, each of which has its own mounted volume. With a plugin API to associate related entities, we can make sure that all three pods are included in the same ItemBlock. The internal-to-velero pod plugin will make sure that the volumes for each pod are included. This will make sure that pre-backup hooks on all three Cassandra pods are run first, followed by CSI snapshot creation, followed by the post-backup hooks.

#### Alternate design considerations

**New plugin type ItemBlockAction**

At the last community meeting, Daniel suggested we might consider creating a new ItemBlockAction plugin type instead of extending BackupItemAction. While this will require more upfront work than extending BIA, once the plugin API is added, there's very little in the overall design here that changes. I expect that it will work just as well (no better, no worse) as the proposed BIAv3 design. The work required for plugin authors is also approximately the same. Because of this, while I personally have a mild preference for the BIAv3 design, I don't have any strong objections to the ItemBlockAction approach. This has been added to the "Alternatives considered" section of the design.
@sseago
Finally, we want to share our suggestions below:
I think this distinction is incorrect. All of the groupings are for logically associated items. The distinction is as follows:
In particular, the external groupings are related to item dependencies/associations, just like the internal ones. Also, these aren't provided by end users but by application or infrastructure developers. End users are not expected to know or understand these dependencies at the individual resource level.
This can't come from user input -- it comes from application/operator developers, which means even if we did it during item collection, it would need to be a plugin of some sort. That plugin would need to take in a resource (with content) and return a list of items that need to be associated with it. So doing this in the ItemCollector won't be any simpler than defining a new ItemBlockAction plugin which takes in the exact same thing and returns the same list of resources. Doing it in the item collector would be more complicated, though, since then we'll need to change the whole output structure to return groups of items rather than just a list, and on top of that since the associated items may not have been listed from the cluster yet, we would need to pull them out individually with a Get call, and then we'd need to check for every future item as we collect it to make sure it's not already in a separate group. Essentially we'd be taking all of the item block processing logic that the design proposes post-item-collection and doing it during item collection, but since we don't have the full item list at this point, it's much less efficient. Is there any disagreement here?
We don't have a fixed number of ItemBlocks because, as we process the list we get from the collector, we end up adding items from further down into the current block. By the time we have the full list of item blocks, we are almost done with processing. More to the point, items returned by the IBA plugin API call don't create new ItemBlocks; they add later items to the current block.
Hey folks, I want to add a small summary to try and see if we can get agreement on some points. The goal here is to lay out some high-level concepts to make sure that we agree with what is proposed and that it solves the problem.

Points:
I hope that we can gain agreement on these three points; seeing what we agree and disagree on will make the disagreements clear from here. I am worried that it is not clear which points are under discussion and which we agree on, so a level set here will be helpful.
#### Proto definition (compiled into golang by protoc)
The ItemBlockAction plugin type is defined as follows:
```
all code blocks are Go code examples, right?
Would add language to these code blocks for syntax highlighting, to improve reading: change each opening fence to specify the language, i.e. ```go instead of a bare ```.
@mateusoliveira43 most of them, yes. I'll add it where appropriate. The one exception is the protobuf API spec section.
- If `item` is already in `itemsInBlock`, continue. This one has already been processed.
- Add `item` to `itemsInBlock`.
- Load item from ItemCollector file. Close/remove file after loading (on error return or not, possibly with a similar anonymous func to the current impl)
- Get matching IBA plugins for item, call `GetRelatedItems` for each. For each item returned, get full item content from ItemCollector (if present in item list, pulling from file, removing file when done) or from cluster (if not present in item list), add item to the current block, add item to `itemsInBlock` map, and then recursively apply the current step to each (i.e. call BIA method, add to block, etc.)
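The recursive assembly steps above can be sketched roughly as follows. This is a simplified illustration with hypothetical types: `relatedFn` stands in for the IBA plugin `GetRelatedItems` call, and items are plain string keys rather than loaded resources.

```go
package main

import "fmt"

// relatedFn stands in for the ItemBlockAction GetRelatedItems call;
// it maps an item key to the keys of its related items.
type relatedFn func(key string) []string

// buildBlock recursively assembles one ItemBlock starting from key,
// using the shared itemsInBlock map to guarantee no item lands in
// more than one block (back-references are simply skipped).
func buildBlock(key string, related relatedFn, itemsInBlock map[string]bool) []string {
	if itemsInBlock[key] {
		return nil // already processed, possibly in another block
	}
	itemsInBlock[key] = true
	block := []string{key}
	for _, r := range related(key) {
		block = append(block, buildBlock(r, related, itemsInBlock)...)
	}
	return block
}

func main() {
	rel := func(key string) []string {
		switch key {
		case "pod-1":
			return []string{"pvc-1"}
		case "pvc-1":
			return []string{"pv-1", "pod-1"} // back-reference is ignored via itemsInBlock
		}
		return nil
	}
	seen := map[string]bool{}
	fmt.Println(buildBlock("pod-1", rel, seen)) // [pod-1 pvc-1 pv-1]
}
```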
(i.e. call BIA method, add to block, etc.)
Do you mean call "BIA" method or "IBA" method?
@reasonerjt good catch. This was originally BIA, but we recently changed the design to use the new plugin type and I forgot to change this one. I'll fix it.
After backing up the items in the block, we now execute post hooks using the same filtered item list we used for pre hooks, again taking the logic from `itemBackupper.backupItemInternal`.
#### `itemBackupper.backupItemInternal` cleanup |
will `backupItem` be removed?
@reasonerjt No. BackupItemBlock calls BackupItem in turn on each item in the ItemBlock, so that will remain. It will just be called from a different place than it is now -- and it will also be called within BackupItemInternal on the returned additional items.
Sorry, this is related to another comment of mine. I was thinking `backupItemBlock` will be part of `kubernetesBackupper` and it will replace `kubernetesBackupper.backupItem`.
```
type ItemBlock struct {
	log logrus.FieldLogger
	itemBackupper *itemBackupper
Do we need different itemBackupper instances in different ItemBlocks?
There are trackers in the itemBackupper struct which collect information throughout the backup procedure; if we allow different itemBackuppers, we'll also need to consolidate the info collected by each tracker.
I also have doubts about having the logger within ItemBlock, but this is trivial...
The ItemBackupper is connected to the backup, so we won't want a different instance for each block. The reason that the reference is in the ItemBlock struct is that this is how the worker will get access to the ItemBackupper, since in the future when we have multiple backups in parallel, the workers may be handling ItemBlocks from different backups, so the ItemBlock needs to contain everything it needs to access. This is the same reason we need the logger -- so that when the worker thread calls BackupItemBlock it can pass in the correct logger for the current backup so that the logs get uploaded to the BSL, etc. Basically everything that's currently passed in to BackupItem needs to be made available in the ItemBlock struct.
The reason that the reference is in the ItemBlock struct is that this is how the worker will get access to the ItemBackupper, since in the future when we have multiple backups in parallel, the workers may be handling ItemBlocks from different backups, so the ItemBlock needs to contain everything it needs to access.
Thanks for the reply. IMO we need to ensure the constraint that different itemBlocks belonging to one backup share one itemBackupper instance. For example, add a reference to BackupRequest to the itemBlock struct and a reference to the itemBackupper. Or at least add a comment to highlight that in the definition of the ItemBlock struct.
I think adding the comment is probably sufficient here -- there's only one place in the code where ItemBlocks are created, and that will always pass a reference to the existing itemBackupper for the backup.
Method signature for new func `backupItemBlock` is as follows:
```
func backupItemBlock(block ItemBlock) []schema.GroupResource |
is it `func (kb *kubernetesBackupper) backupItemBlock(...)`?
I'm not quite sure. It's not an interface method for the Backupper interface, and we're not calling it from one of the specific KubernetesBackupper interface method implementations -- since it's being called from the worker threads, it's probably better as a top-level func.
IMO this should be a part of `kubernetesBackupper`, but we can double check during the implementation.
Hmm. So if we added this to kubernetesBackupper, then we'd need to pass a reference to that, in addition to the ItemBlock, into the worker thread -- it might complicate things too much. But I think this is a detail we can work through during the implementation -- if it makes more sense to do it that way, we can consider it.
After implementing backup hooks in `backupItemBlock`, hook processing should be removed from `itemBackupper.backupItemInternal`.
### Phase 2: Process ItemBlocks for a single backup in multiple threads |
I think more details should be added to this section, like how to process the returned value from `returnChan`.
@sseago @shawn-hurley
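For illustration, here is a hedged sketch of the Phase 2 flow, with hypothetical names (`runPool`, `blockResult`): a fixed number of workers consume ItemBlocks from a buffered input channel and report results on a return channel, which the dispatcher drains much as the current sequential loop tallies errors and backed-up GroupResources. The real design's returnChan payload is not specified here.

```go
package main

import (
	"fmt"
	"sync"
)

// itemBlock is a simplified stand-in for the design's ItemBlock; in the
// proposal each block also carries its backup's logger and itemBackupper.
type itemBlock struct{ items []string }

// blockResult is a hypothetical shape for what a worker sends back on
// the return channel: which block finished and any error backing it up.
type blockResult struct {
	block itemBlock
	err   error
}

// runPool dispatches blocks to a fixed worker pool over a buffered
// channel, drains the return channel, and reports how many succeeded.
func runPool(blocks []itemBlock, workers int) int {
	input := make(chan itemBlock, 10) // buffer bounds blocks held in memory
	returnChan := make(chan blockResult)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for block := range input {
				// A real worker would call backupItemBlock here.
				returnChan <- blockResult{block: block}
			}
		}()
	}
	go func() { wg.Wait(); close(returnChan) }()

	go func() {
		for _, b := range blocks {
			input <- b
		}
		close(input)
	}()

	done := 0
	for res := range returnChan {
		if res.err == nil {
			done++
		}
	}
	return done
}

func main() {
	blocks := make([]itemBlock, 5)
	for i := range blocks {
		blocks[i] = itemBlock{items: []string{fmt.Sprintf("item-%d", i)}}
	}
	fmt.Println(runPool(blocks, 3)) // 5
}
```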
@Lyndon-Li That is a good question, actually. First, to clarify, a PV can only be bound by one PVC, even if it's RWX, so your scenario above would be "Pod1->PVC1->PV1 and Pod2->PVC1->PV1" -- the same PVC mounted by both pods. This is an example of something that we don't actually handle well right now at all, since we process everything sequentially, so with current Velero 1.14, Pod1 pre hooks run, we back up PVC1, PV1, then Pod1 post hooks, then Pod 2 gets backed up, pre-hooks run (but PV has already been backed up). I think this can be resolved without any changes to the design. We just need the PVC IBA plugins to return dependencies in both directions. So we'd have the following IBA plugins:
@sseago
So could you also add some details on how to create and maintain an itemBlock in various cases to the document?
It sounds like what's missing is the specific example of an RWX volume shared by multiple pods -- I can add this to the ItemBlock examples in the document -- but to be clear, the only thing necessary to completely handle this use case is an IBA for PVCs which returns the list of mounting pods. I'll make sure the document makes that clear as well.
@anshulahuja98 Can you please take another look and re-ack the design if all your queries/concerns were addressed by @sseago?
Approved.
Thank you for contributing to Velero!

This PR adds a design document to enhance single backup performance with multithreaded item backup. This design will also implement some of the prerequisites for future support of VolumeGroupSnapshots.

Fixes #7474

/kind changelog-not-required