Design proposal: Support Dynamic Primary Pod NIC Name #324

Draft
wants to merge 1 commit into base: main

Conversation

orelmisan
Member

What this PR does / why we need it:
This proposal describes a mechanism to dynamically determine the primary pod interface for KubeVirt virtual machines
(VMs).
The goal is to move away from the hard-coded eth0 interface and allow flexible, environment-driven interface
selection.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Checklist

This checklist is not enforced, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note:

None

@kubevirt-bot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@kubevirt-bot added the do-not-merge/work-in-progress and dco-signoff: yes labels on Sep 15, 2024
@kubevirt-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign alonakaplan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member Author

@orelmisan left a comment

Moved unanswered comments from another document here.


## User Stories

- As a cluster administrator I want to use a CNI that configures pods’ primary interface to a name other than `eth0`.
Member Author

@EdDev commented: On its own, this story has nothing to do with KubeVirt.
Please try to tell a full story in one paragraph if all the details are related.
Please provide stories for core bindings and plugin bindings, to make it clear we intend to cover them all.

Also, please elaborate on the correctness of the interface status reporting, e.g. that the information is collected from the correct pod interface.

I would also mention a story that assures existing setups with "eth0" continue working as expected.

Also, please provide a story for migration: the source uses "eth0" and the target uses something else. An upgrade path may introduce this scenario, so let's see whether we have a problem with it or not.


- https://github.com/kubevirt/kubevirt

# Design
Member Author

@ormergi commented: From what I understand, we need to pass yet another piece of information from a pod annotation down to virt-handler and virt-launcher.
virt-controller has access to the pod and to the subject information,
and there is already a mechanism that enables passing info from virt-controller to virt-handler: the kubevirt.io/network-info annotation.
Can we use this mechanism to pass the primary network interface name as well?
If so, I think it will save us from extending the status API.

Member Author

@ormergi I've added your suggestion as an alternative; it will be visible in the next push.

Since the kubevirt.io/network-info annotation is set on the virt-launcher pod (and not on the VMI object), virt-handler doesn't have access to it.

Since this information is needed in virt-handler as well (for network setup and network status reporting), a new mechanism would have to be created in order to share it.

Another con of this suggestion is the need to use the downward API for every virt-launcher pod, as opposed to only the ones derived from VMIs with SR-IOV NICs or network binding plugins that require it.
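For reference, a minimal Go sketch (using the Kubernetes core/v1 types) of the downward API volume that exposes a pod annotation such as kubevirt.io/network-info to the container. The volume name and file path here are illustrative assumptions, not KubeVirt's actual definitions:

```go
package example

import corev1 "k8s.io/api/core/v1"

// networkInfoDownwardAPIVolume is a hedged sketch of a downward API volume
// exposing the kubevirt.io/network-info pod annotation inside the
// virt-launcher container. Volume and file names are illustrative.
func networkInfoDownwardAPIVolume() corev1.Volume {
	return corev1.Volume{
		Name: "network-info-annotation",
		VolumeSource: corev1.VolumeSource{
			DownwardAPI: &corev1.DownwardAPIVolumeSource{
				Items: []corev1.DownwardAPIVolumeFile{{
					// File inside the mounted volume that will contain the
					// annotation's value.
					Path: "network-info",
					FieldRef: &corev1.ObjectFieldSelector{
						FieldPath: "metadata.annotations['kubevirt.io/network-info']",
					},
				}},
			},
		},
	}
}
```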

]
```

This annotation could be parsed to infer the default (primary) network interface instead of relying on a hard coded
Member Author

@ormergi commented: Does the check rely on default=true?
If so, I think it should be noted.

Member Author

Yes, the primary interface will be the first one that has default == true.
The Kubernetes Network Custom Resource Definition De-facto Standard states that there should be only one, but a bug was recently discovered where multiple interfaces could be marked with default == true.

I've added it to the document; it will be visible in the next push.
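For illustration, a minimal Go sketch of this selection logic, assuming a standalone struct that mirrors the relevant fields of the de-facto standard's network-status entry. The helper and constant names are made up for the example and are not KubeVirt's actual implementation:

```go
package network

import "encoding/json"

const (
	networkStatusAnnotation = "k8s.v1.cni.cncf.io/network-status"
	fallbackPrimaryIface    = "eth0"
)

// networkStatusEntry mirrors the fields of a network-status entry relevant
// here, as defined by the Kubernetes Network CRD de-facto standard.
type networkStatusEntry struct {
	Name      string `json:"name"`
	Interface string `json:"interface"`
	Default   bool   `json:"default"`
}

// primaryPodInterfaceName returns the pod interface of the first entry marked
// default=true, falling back to eth0 when the annotation is missing, cannot
// be parsed, or contains no default entry.
func primaryPodInterfaceName(podAnnotations map[string]string) string {
	raw, exists := podAnnotations[networkStatusAnnotation]
	if !exists {
		return fallbackPrimaryIface
	}

	var entries []networkStatusEntry
	if err := json.Unmarshal([]byte(raw), &entries); err != nil {
		// Unparsable annotation: keep today's behavior.
		return fallbackPrimaryIface
	}

	for _, entry := range entries {
		if entry.Default && entry.Interface != "" {
			return entry.Interface
		}
	}
	return fallbackPrimaryIface
}
```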


Contributor

This was fixed already, and is available in multus 4.1.1.

Member Author

Thank you for fixing it @maiqueb :)

This annotation could be parsed to infer the default (primary) network interface instead of relying on a hard-coded
interface name (like eth0).

If the network-status annotation is missing or doesn’t specify a default interface, KubeVirt will fall back to using
Member Author

@ormergi commented: Would KubeVirt's network setup require changes? (e.g., be executed again in case the annotation is missing)

Member Author

AFAIU, the network setup is agnostic to the interface it configures - meaning it doesn't care whether the interface being configured is bound to the primary pod interface (connected to the cluster network) or to a secondary interface (connected to some other network).

The logic that would probably have to change in virt-handler is the one calculating the pod interfaces' names:
https://github.com/kubevirt/kubevirt/blob/d50cc6c7cc02c5f5c33ffb77fc6710828370376c/pkg/network/setup/netpod/netpod.go#L240

If we go with the alternative that adds each pod interface's name to VMI.status.interfaces[x], we will not have to calculate it and can take the information directly from the VMI object.

This will enable us to change the naming scheme more easily in the future, in case we need to.
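To make the per-interface alternative concrete, here is a simplified, illustrative Go sketch of what such a status field could look like; it is not the actual KubeVirt API type, which carries many more fields:

```go
package v1 // illustrative placement; the real type lives in KubeVirt's API package

// VirtualMachineInstanceNetworkInterface is a simplified, illustrative
// version of the VMI interface status entry.
type VirtualMachineInstanceNetworkInterface struct {
	// Name of the interface as specified in the VMI spec.
	Name string `json:"name,omitempty"`
	// InfoSource lists the sources the reported data was collected from.
	InfoSource string `json:"infoSource,omitempty"`
	// PodInterfaceName is the name of the pod interface backing this VMI
	// interface (e.g. "eth0", or a CNI-specific name such as "net1").
	PodInterfaceName string `json:"podInterfaceName,omitempty"`
}
```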

- No configuration is required in KubeVirt
- Provides backward compatibility

### Cons
Member Author

@EdDev commented:

  • The added information is probably not useful for a user.
  • The added information is not useful for anything beyond this primary network, and it opens a big question: what is the primary network, and when is it even relevant?


- https://github.com/kubevirt/kubevirt

# Design
Member Author

@ormergi commented: Can we have KubeVirt determine which interface is primary (e.g., by the default route) as an alternative?
Would that work?

Member Author

I don't think that it should be KubeVirt's responsibility.
Please note that a secondary network could also have a default route.

@orelmisan
Member Author

/uncc @aburdenthehand @cwilkers
/cc @EdDev

@kubevirt-bot requested review from EdDev and removed the request for aburdenthehand and cwilkers on September 15, 2024 16:58
Contributor

@maiqueb left a comment

Would it make sense to configure a cluster-wide preferred interface to latch on to in the KubeVirt CR?

I.e., the admin would indicate the preferred pod interface name to attach to; when not present, default to eth0, as done today.

It's a bit tied to the use case, but it has the advantage of not requiring new API (... in KubeVirt) and doesn't expose pointless info in the status.
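For illustration, a hedged Go sketch of what such a cluster-wide knob could look like; the PrimaryPodInterfaceName field and the helper are hypothetical and only demonstrate the alternative being suggested, not an existing KubeVirt API:

```go
package config

// NetworkConfiguration is a simplified sketch of a cluster-wide network
// configuration section in the KubeVirt CR. PrimaryPodInterfaceName is a
// hypothetical field used only to illustrate the suggestion above.
type NetworkConfiguration struct {
	// PrimaryPodInterfaceName is the pod interface name KubeVirt would bind
	// the VM's primary NIC to. When empty, KubeVirt keeps today's behavior
	// and uses "eth0".
	PrimaryPodInterfaceName string `json:"primaryPodInterfaceName,omitempty"`
}

// primaryPodInterfaceOrDefault returns the configured name, or "eth0" when
// nothing is configured.
func primaryPodInterfaceOrDefault(cfg *NetworkConfiguration) string {
	if cfg != nil && cfg.PrimaryPodInterfaceName != "" {
		return cfg.PrimaryPodInterfaceName
	}
	return "eth0"
}
```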

@Acedus

Acedus commented Sep 16, 2024

The proposal assumes that KubeVirt is used with a multi-homing CNI plugin, necessitating a clear distinction between interfaces and their intended roles. For instance, if a VM uses a single interface from a CNI where the interface name is different from the standard eth0, it should be straightforward to identify this interface as primary by selecting the only available one. Consequently, the proposal is relevant only if the Pod uses n >= 2 non-standard interfaces (where "non-standard" refers to interfaces not defined by the user or CNI).

A few questions arise:

  1. How is the primary interface defined? Is it the interface used to connect to the default cluster network as specified by Multus? What if a VM only uses bridge-CNI and does not connect to the cluster network (e.g., using two bridge interfaces)? Determining the primary interface of a Pod seems challenging without a clear definition, especially with complex scenarios like OVN-Kubernetes' UDNs. It seems overly context-specific to determine this dynamically with sufficient accuracy.

  2. Regarding the n >= 2 remark, if we rely on Multus for inference, other multi-homing methods will need to adhere to the fallback logic implemented (e.g., defaulting to eth0). Are there plans to address this?

  3. Instead of relying on interface names to manage network plumbing, could we use other, more easily inferred parameters? Would it be more advantageous to let users specify the primary interface directly, as deviations from the standard eth0 are exceptions rather than the norm?

@qinqon left a comment

Some nits while the doc is being finished.


## Alternative 1 - Report Pod Interface Name Per Interface

Currently, the `VirtualMachineInstanceStatus` struct holds a slice of interface status objects:

Using the k8s subresource to define something is not a good idea, since depending on how it's implemented it may be ignored by reconcile.

Also, do we really need to configure this per VM or per network? I doubt per VM makes much sense.

Member Author

Using the k8s subresource to define something is not a good idea, since depending on how it's implemented it may be ignored by reconcile.

AFAIK, ATM the VirtualMachineInstance CRD does not have a /status sub-resource [1].
Could you please elaborate on the scenarios in which updating the VMI status could be ignored?

Also, do we really need to configure this per VM or per network? I doubt per VM makes much sense.

Currently, both per-interface and once-per-VMI options are described.
Both have their pros and cons.

Could you please elaborate on why you think having the information per VM does not make sense?

[1] kubevirt/kubevirt#3889

Member

Using the k8s subresource to define something is not a good idea, since depending on how it's implemented it may be ignored by reconcile.

I do not understand this. The status of any resource is always reconciled; that is the idea of having a status that represents the current state.
Whether it is a subresource or not is an implementation detail that is supposed to improve safety.

AFAIK, ATM the VirtualMachineInstance CRD does not have a /status sub-resource [1].

The VMI has a status subresource.

@qinqon Sep 17, 2024

Using the k8s subresource to define something is not a good idea, since depending on how it's implemented it may be ignored by reconcile.

I do not understand this. The status of any resource is always reconciled; that is the idea of having a status that represents the current state. Whether it is a subresource or not is an implementation detail that is supposed to improve safety.

If you have a CRD with a status where you put a condition about "success" and a reconcile cycle that updates it, you don't want to re-enter that reconcile when you update it, since it would loop; so depending on the implementation, that update is skipped, but maybe that is not the case for KubeVirt.
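For context, this is the generic controller-runtime pattern being described: filtering out update events that do not bump metadata.generation (which status-only writes do not), so a controller does not re-enter its reconcile loop because of its own status updates. The sketch below is illustrative only and is not taken from KubeVirt's controller code:

```go
package controller

import (
	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// setupController wires a reconciler so that update events caused by
// status-only writes are filtered out, since they do not change
// metadata.generation.
func setupController(mgr manager.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&appsv1.Deployment{}). // illustrative resource; a real controller watches its own kind
		WithEventFilter(predicate.GenerationChangedPredicate{}).
		Complete(r)
}
```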

Member Author

Using the k8s subresource to define something is not a good idea, since depending on how it's implemented it may be ignored by reconcile.

I do not understand this. The status of any resource is always reconciled; that is the idea of having a status that represents the current state. Whether it is a subresource or not is an implementation detail that is supposed to improve safety.

AFAIK, ATM the VirtualMachineInstance CRD does not have a /status sub-resource [1].

The VMI has a status subresource.

The VMI CRD does not have a status sub-resource.
Please see:

https://github.com/kubevirt/kubevirt/blob/dc6b76f494ffdb95305fed5ce038963539d469a1/pkg/virt-operator/resource/generate/components/crds.go#L167

in comparison to the VirtualMachine CRD that does have it:
https://github.com/kubevirt/kubevirt/blob/dc6b76f494ffdb95305fed5ce038963539d469a1/pkg/virt-operator/resource/generate/components/crds.go#L205


Currently, both per-interface and once-per-VMI options are described. Both have their pros and cons.

Could you please elaborate on why you think having the information per VM does not make sense?

[1] kubevirt/kubevirt#3889

As far as I have seen, the primary interface is common between all the VMs on the same network, so it does not make sense to configure it per VM.

Member

The VMI CRD does not have a status sub-resource.

This is surprising.

As far as I have seen, the primary interface is common between all the VMs on the same network, so it does not make sense to configure it per VM.

I am not familiar with an entity that describes a VM collection per namespace.
The KubeVirt CR is at the cluster level, and creating a new resource kind for this does not make sense.

But I also do not understand how this is even relevant to the line this comment thread refers to.
Are you suggesting another alternative? Could you add one at a proper location with all the details?

design-proposals/dynamic-primary-pod-nic-name.md (outdated; resolved)
@oshoval left a comment

Thanks

## Using Multus’ network-status annotation

When KubeVirt is used in a cluster that has Multus as its default CNI, Multus adds the
`k8s.v1.cni.cncf.io/network-status` annotation

Is it worth considering looking at k8s.ovn.org/pod-networks?
Or is k8s.v1.cni.cncf.io/network-status preferable, because it is the CNI's answer?

Member Author

Since KubeVirt could be used with an arbitrary CNI, I don't think it should look for CNI-specific annotations.
Having said that, Multus is an exception since KubeVirt is currently coupled to it.

design-proposals/dynamic-primary-pod-nic-name.md (outdated; resolved)
}
```

VMI controller will set this field's value from Multus' network-status annotation, or fallback to `eth0` in case Multus

Please consider removing the common part from the alternatives, as the alternatives differ in what to report,
but the how is common.


Comment on lines 17 to 15
However, users might require custom network setups where the pod’s primary interface is not named `eth0`.
Currently, KubeVirt attaches the VM’s interface destined to the pod network to the pod’s `eth0` interface.
Member

These two sentences probably should be swapped. First describe how it is now and then what is missing.

Member Author

Good catch.
Done.

Comment on lines 20 to 22
When creating a VM with secondary networks, KubeVirt specifies the pod interfaces’ names for all secondary interfaces.
It is currently not possible to specify the primary interface’s name when templating the virt-launcher pod, since there
is no K8s API that enables it.
Member

This looks like a solution to me. Defining the name of the interface in advance is what we do for secondary interfaces, and it could have been an option to solve this in the same way, but we cannot. Perhaps you can add this as a note or as part of an alternative that is not chosen due to the lack of an ability to implement it.
(I must admit, it sounds strange that this cannot be done using an env variable on the node, but this is not really something we need to explore further at this stage.)


- Allow the primary pod interface name to be dynamically determined based on the environment.
- Ensure backward compatibility, so existing VMs with `eth0` as the primary interface do not break.
- Handle the scenario of a non-eth0 primary interface with a binding plugin.
Member

I do not understand this goal. If we can consume a non-standard name, why is this needed?

Comment on lines 32 to 33
- Change Kubernetes' API.
- Change Multus' API.
Member

These are not goals, these are implementations.

A non-goal would be, for example: support dynamic consumption of a primary pod interface name on deployments without Multus.
But it may be better not to mention anything.

...
interfaces:
  - infoSource: 'domain, guest-agent, multus-status'
    podInterfaceName: eth0
Member

I suggest you use a different name in the examples to emphasize the need. Same for the other places.
Perhaps sync the name with the network-status dump used before.


### Pros

- Less API than the first alternative
Member

I would either drop this or move it to the bottom. I see no special pro here... on the contrary.

Comment on lines 238 to 237
## Alternative 3 - Use annotations instead of K8s API

<!-- TODO: Complete this section -->
Member

Please do not add it unless you have content.


## Network Binding Plugins

A Virtual Machine's NIC could be bound to the pod's primary NIC using
Member

I do not understand this section.
It already does, so repeating this here adds no special value.

If you want to express that it should consume a non-standard name, then please elaborate and explain whether or not you need to change something.

A Virtual Machine's NIC could be bound to the pod's primary NIC using
a [network binding plugin](https://kubevirt.io/user-guide/network/network_binding_plugins/).

## CNI
Member

@EdDev Sep 17, 2024

Looks like this belongs under the Network Binding Plugins section, not at the same level.

But in continuation of the previous comment, I do not understand why there is a need for such a long intro here about what the CNI of a binding plugin is. You can just leave the reference to the user guide or the design proposal.

In general, we are not providing any information about this as input to the binding plugin today.
I see no option to do so, so that should be the statement here.

Adding a recommendation on what a binding plugin author can do to mitigate this is possible, but we do not expect anything more than that.

@orelmisan
Member Author

Would it make sense to configure a cluster-wide preferred interface to latch on to in the KubeVirt CR?

I.e., the admin would indicate the preferred pod interface name to attach to; when not present, default to eth0, as done today.

It's a bit tied to the use case, but it has the advantage of not requiring new API (... in KubeVirt) and doesn't expose pointless info in the status.

@maiqueb I've added this alternative here.

Signed-off-by: Orel Misan <omisan@redhat.com>
Labels
dco-signoff: yes · do-not-merge/work-in-progress · size/L
Projects
None yet
Development


7 participants