Add more checks on generic plugin to discover discrepancies from the desired state #587

@vasrem vasrem commented Jan 17, 2024

Summary

This PR does the following things:

  • Adds additional test coverage for OnNodeStateChange() function to indicate when we are draining the node
  • Adjusts NeedToUpdateSriov() helper function to ensure that:
    • VF MAC Address is set when we are in Ethernet mode
    • VF GUID is set when we are in Ethernet mode and RDMA is enabled
    • VF GUID is set when we are in Infiniband mode
    • PF Link is up
  • (API Change) Adds necessary fields in the SriovNetworkNodeState struct to facilitate the above
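For readers who want the checks above in one place, here is a minimal sketch. The `pfStatus` and `vfStatus` types are hypothetical, simplified stand-ins and not the real SriovNetworkNodeState fields. The sketch also assumes that "not set" means an empty string and that the link type values are "ETH" and "IB".

```go
package main

import (
	"fmt"
	"strings"
)

// vfStatus and pfStatus are hypothetical stand-ins for the status fields
// this PR adds; the real API types in the operator differ.
type vfStatus struct {
	Mac  string
	GUID string
}

type pfStatus struct {
	LinkType       string // assumed values: "ETH" or "IB"
	LinkAdminState string // assumed values: "up" or "down"
	RdmaEnabled    bool
	VFs            []vfStatus
}

// findDiscrepancies returns one message per violated condition from the
// summary list. An empty string is assumed to mean "not set".
func findDiscrepancies(pf pfStatus) []string {
	var out []string
	if pf.LinkAdminState != "up" {
		out = append(out, "PF link is not up")
	}
	isEth := strings.EqualFold(pf.LinkType, "ETH")
	isIB := strings.EqualFold(pf.LinkType, "IB")
	for i, vf := range pf.VFs {
		if isEth && vf.Mac == "" {
			out = append(out, fmt.Sprintf("VF %d: MAC address is not set", i))
		}
		if isEth && pf.RdmaEnabled && vf.GUID == "" {
			out = append(out, fmt.Sprintf("VF %d: GUID is not set (Ethernet with RDMA)", i))
		}
		if isIB && vf.GUID == "" {
			out = append(out, fmt.Sprintf("VF %d: GUID is not set (Infiniband)", i))
		}
	}
	return out
}

func main() {
	pf := pfStatus{
		LinkType:       "ETH",
		LinkAdminState: "down",
		RdmaEnabled:    true,
		VFs:            []vfStatus{{Mac: "aa:bb:cc:dd:ee:ff"}},
	}
	for _, d := range findDiscrepancies(pf) {
		fmt.Println(d)
	}
}
```

In the PR itself these conditions feed the boolean result of NeedToUpdateSriov(); the sketch returns messages only to make each condition visible.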

Partial Reconciliation

Today we reconcile only on .spec changes: reconciliation is skipped when the generation of the object is the same as the generation that was last reconciled successfully. Therefore, changes in .status, like the ones this PR tries to detect, are not reconciled while the config daemon runs continuously (i.e. no restart) unless the .spec also changes. After a daemon restart, the .status field is reconciled once, on the first pass, because the last successfully reconciled generation is kept only in memory and has not been saved yet.
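The generation-based skip described above can be sketched as follows. The `reconciler` type and its methods are hypothetical names used for illustration, not the config daemon's actual code. The sketch relies on Kubernetes bumping metadata.generation on .spec changes but not on .status updates.

```go
package main

import "fmt"

// reconciler is a hypothetical stand-in for the config daemon's in-memory
// state. lastAppliedGeneration is zero right after a restart.
type reconciler struct {
	lastAppliedGeneration int64
}

// shouldReconcile reports whether the object's generation differs from the
// generation that was last reconciled successfully.
func (r *reconciler) shouldReconcile(generation int64) bool {
	return generation != r.lastAppliedGeneration
}

// markReconciled remembers the generation after a successful reconciliation.
func (r *reconciler) markReconciled(generation int64) {
	r.lastAppliedGeneration = generation
}

func main() {
	r := &reconciler{}
	fmt.Println("first pass after restart:", r.shouldReconcile(3))
	r.markReconciled(3)
	fmt.Println("status drift, same generation:", r.shouldReconcile(3))
	fmt.Println("spec changed, new generation:", r.shouldReconcile(4))
}
```

With this shape, drift in .status is noticed only on the first pass after a restart or when the .spec changes, which is the partial reconciliation this section refers to.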

Decisions

  • Decided to add tests on the OnNodeStateChange() instead of NeedToUpdateSriov() because that one looks to be the most impactful for the whole system (i.e. decides whether to drain). I can add on both Apply() and OnNodeStateChange() if we choose to split the NeedToUpdateSriov() (see open questions below).
  • Decided to just look for Node GUID and not the Port GUID since we expect both of them to be populated together on bind/unbind. This is the case because we parse the GUID from the RDMA link, and the node GUID is populated there on bind/unbind.

Open questions

  • Is it enough to partially reconcile or do we need to change the logic to take into account changes on .status (i.e. full reconciliation of changes that the controller does to the system)? (relatively big change in the operator behaviour I suppose)
  • Do we need to drain on any of the discrepancies we find? (e.g. link is not up). If not, we will need to adjust the function that is used in Apply() and OnNodeStateChange() to be something different than NeedToUpdateSriov().
  • Are there additional end to end tests to be added? I would appreciate if you can point out the place I should be putting those (if we want to test end to end).
  • Do we prefer getting info by reading /sys/class/net/* directly or by using netlink? I see both approaches:
    • func (n *network) GetNetDevMac(ifaceName string) string {
          log.Log.V(2).Info("GetNetDevMac(): get Mac", "device", ifaceName)
          macFilePath := filepath.Join(vars.FilesystemRoot, consts.SysClassNet, ifaceName, "address")
          data, err := os.ReadFile(macFilePath)
          if err != nil {
              log.Log.Error(err, "GetNetDevMac(): fail to read Mac file", "path", macFilePath)
              return ""
          }
          return strings.TrimSpace(string(data))
      }
    • func (s *sriov) GetLinkType(ifaceStatus sriovnetworkv1.InterfaceExt) string {
          log.Log.V(2).Info("GetLinkType()", "device", ifaceStatus.PciAddress)
          if ifaceStatus.Name != "" {
              link, err := netlink.LinkByName(ifaceStatus.Name)
              if err != nil {
                  log.Log.Error(err, "GetLinkType(): failed to get link", "device", ifaceStatus.Name)
                  return ""
              }
              linkType := link.Attrs().EncapType
              if linkType == "ether" {
                  return consts.LinkTypeETH
              } else if linkType == "infiniband" {
                  return consts.LinkTypeIB
              }
          }
          return ""
      }

      Based on the answer I will use or discard d18aae1
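As a point of comparison for the question above, the sysfs approach can be exercised without real hardware by pointing the filesystem root at a temporary directory. This is a minimal sketch using only the standard library; `readNetDevMac` is a hypothetical simplified variant of GetNetDevMac(), and the `sys/class/net` path is an assumption about what consts.SysClassNet resolves to.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readNetDevMac is a hypothetical, dependency-free variant of GetNetDevMac():
// it reads <root>/sys/class/net/<iface>/address and returns "" on any error.
func readNetDevMac(root, ifaceName string) string {
	data, err := os.ReadFile(filepath.Join(root, "sys/class/net", ifaceName, "address"))
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(data))
}

func main() {
	// Build a fake sysfs tree so the example does not depend on host devices.
	root, err := os.MkdirTemp("", "fakesys")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)

	dir := filepath.Join(root, "sys/class/net", "eth0")
	if err := os.MkdirAll(dir, 0o755); err != nil {
		panic(err)
	}
	if err := os.WriteFile(filepath.Join(dir, "address"), []byte("aa:bb:cc:dd:ee:ff\n"), 0o644); err != nil {
		panic(err)
	}

	fmt.Println(readNetDevMac(root, "eth0"))
}
```

The netlink approach would need a mockable wrapper around the library instead of a fake directory tree, which is one practical difference between the two options when writing tests.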

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@vasrem vasrem force-pushed the fix/enhance-checks-on-generic-plugin branch from 7d4380a to 2e88940 Compare January 17, 2024 11:33

@vasrem vasrem force-pushed the fix/enhance-checks-on-generic-plugin branch from 2e88940 to f942ac6 Compare January 17, 2024 11:36

@vasrem vasrem force-pushed the fix/enhance-checks-on-generic-plugin branch from f942ac6 to 54a1148 Compare January 17, 2024 11:53

coveralls commented Jan 17, 2024

Pull Request Test Coverage Report for Build 9206688050

Details

  • 54 of 127 (42.52%) changed or added relevant lines in 7 files are covered.
  • 6 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.06%) to 39.655%

Changes Missing Coverage                    Covered Lines   Changed/Added Lines   %
pkg/host/internal/lib/netlink/netlink.go    0               6                     0.0%
pkg/host/internal/sriov/sriov.go            1               12                    8.33%
pkg/host/internal/network/network.go        19              35                    54.29%
pkg/helper/mock/mock_helper.go              0               20                    0.0%
pkg/host/mock/mock_host.go                  0               20                    0.0%

Files with Coverage Reduction               New Missed Lines   %
controllers/drain_controller.go             1                  68.06%
controllers/generic_network_controller.go   5                  74.53%

Totals Coverage Status
Change from base Build 9203023396: 0.06%
Covered Lines: 5175
Relevant Lines: 13050

💛 - Coveralls

@vasrem vasrem force-pushed the fix/enhance-checks-on-generic-plugin branch from 54a1148 to efe6e46 Compare January 17, 2024 12:45

@vasrem vasrem force-pushed the fix/enhance-checks-on-generic-plugin branch from 3826a39 to c07958d Compare January 17, 2024 14:11

@vasrem vasrem force-pushed the fix/enhance-checks-on-generic-plugin branch from c07958d to 69c9dd5 Compare January 17, 2024 14:13

@vasrem vasrem force-pushed the fix/enhance-checks-on-generic-plugin branch from 69c9dd5 to d18aae1 Compare January 17, 2024 14:15

@vasrem vasrem marked this pull request as ready for review January 18, 2024 09:05
@vasrem vasrem force-pushed the fix/enhance-checks-on-generic-plugin branch from d18aae1 to 7fe0497 Compare January 18, 2024 14:31

@vasrem vasrem force-pushed the fix/enhance-checks-on-generic-plugin branch from 7fe0497 to 4d50701 Compare January 18, 2024 14:32

vasrem commented Jan 19, 2024

/test-e2e-all

@vasrem vasrem force-pushed the fix/enhance-checks-on-generic-plugin branch from 4d50701 to f9e4337 Compare January 19, 2024 06:58

@vasrem vasrem force-pushed the fix/enhance-checks-on-generic-plugin branch from f9e4337 to 1ffcbc3 Compare January 29, 2024 07:53

vasrem commented Jan 29, 2024

@adrianchiris PTAL


@vasrem vasrem force-pushed the fix/enhance-checks-on-generic-plugin branch from edcd15e to b30ab24 Compare May 8, 2024 10:36

api/v1/helper.go Outdated
@@ -300,6 +306,28 @@ func NeedToUpdateSriov(ifaceSpec *Interface, ifaceStatus *InterfaceExt) bool {
        return true
    }

    if strings.EqualFold(ifaceStatus.LinkType, consts.LinkTypeETH) {
Collaborator

nit: any chance to have a single check to avoid duplication?

e.g. if (ETH && RDMA) || IB { ....}

Contributor Author

Yeap, makes sense. Addressing.
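The single check suggested in this thread can be sketched as one predicate. `guidRequired` is a hypothetical helper name, and the "ETH" and "IB" strings are assumed values for consts.LinkTypeETH and consts.LinkTypeIB; the merged code may look different.

```go
package main

import (
	"fmt"
	"strings"
)

// guidRequired reports whether a VF GUID is expected to be set:
// on Ethernet only when RDMA is enabled, on Infiniband always.
// The link type strings are assumptions for illustration.
func guidRequired(linkType string, rdmaEnabled bool) bool {
	isEth := strings.EqualFold(linkType, "ETH")
	isIB := strings.EqualFold(linkType, "IB")
	return (isEth && rdmaEnabled) || isIB
}

func main() {
	fmt.Println("ETH, no RDMA:", guidRequired("ETH", false))
	fmt.Println("ETH, RDMA:   ", guidRequired("ETH", true))
	fmt.Println("IB, no RDMA: ", guidRequired("IB", false))
}
```

Folding both cases into one condition means the GUID comparison itself is written once, which is the duplication the nit is about.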

Collaborator

@adrianchiris adrianchiris left a comment

overall LGTM, one minor nit, sorry for not raising it in the previous round.

@vasrem vasrem force-pushed the fix/enhance-checks-on-generic-plugin branch from b30ab24 to 83c8397 Compare May 8, 2024 16:08

@@ -187,6 +190,36 @@ func (n *network) GetNetDevMac(ifaceName string) string {
    return link.Attrs().HardwareAddr.String()
}

// GetNetDevNodeGUID returns the network interface node GUID if device is RDMA capable otherwise returns empty string
func (n *network) GetNetDevNodeGUID(pciAddr string) string {
    log.Log.V(2).Info("GetNetDevNodeGUID(): get node GUID", "pciAddr", pciAddr)
Member

This function gets called during the polling of NIC status, for each configured VF.
I would avoid this logging statement, as it increases config-daemon logs without adding specific value.

You can have a look at the end2end job artifact
https://github.com/k8snetworkplumbingwg/sriov-network-operator/actions/runs/9005014290/artifacts/1484967983

Contributor Author

Agreed, removed.

@vasrem vasrem force-pushed the fix/enhance-checks-on-generic-plugin branch from 83c8397 to 0ecd227 Compare May 10, 2024 08:26

@adrianchiris
Collaborator

@zeeke @SchSeba can we get that merged ?


vasrem commented May 21, 2024

@zeeke and @SchSeba bump on that one if you could take a look.

Collaborator

@SchSeba SchSeba left a comment

Sorry it took so much time to merge this one.

It's almost there, just small comments and we can merge this PR.

    }

    if link.Attrs().Flags&net.FlagUp == 0 {
        return "down"
Collaborator

        return "down"
    }

    return "up"
Collaborator

please put this one as a const

@@ -571,7 +574,7 @@ func (s *sriov) configSriovDevice(iface *sriovnetworkv1.Interface, skipVFConfigu
     if err != nil {
         return err
     }
-    if pfLink.Attrs().OperState != netlink.OperUp {
+    if pfLink.Attrs().Flags&net.FlagUp == 0 {
Collaborator

Please make this check part of the netlinkLib, and let's use it here and in the other place in the code.
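The two review requests above (named constants for the state strings and one shared helper for the flag test) could be combined roughly as below. This is a sketch under assumptions: the constant and function names are hypothetical, and the helper takes a standard library `net.Flags` value instead of a netlink link so that the example stays self-contained.

```go
package main

import (
	"fmt"
	"net"
)

// Hypothetical constants replacing the "up" and "down" string literals.
const (
	linkAdminStateUp   = "up"
	linkAdminStateDown = "down"
)

// isLinkAdminStateUp reports whether the administrative "up" flag is set.
// It checks the interface flags, not the operational state of the link.
func isLinkAdminStateUp(flags net.Flags) bool {
	return flags&net.FlagUp != 0
}

// linkAdminState maps the flags to one of the state constants.
func linkAdminState(flags net.Flags) string {
	if isLinkAdminStateUp(flags) {
		return linkAdminStateUp
	}
	return linkAdminStateDown
}

func main() {
	fmt.Println(linkAdminState(net.FlagUp | net.FlagBroadcast))
	fmt.Println(linkAdminState(net.FlagBroadcast))
}
```

Keeping the flag test in one helper lets both call sites in the diff share the same definition of "up".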

Today, we are missing 2 checks on settings that sriov netop is
configuring:
* PF is up
* GUID is set

Without checking for these 2, we risk that changes made by the user
directly to the system are not reconciled by the netop leaving the
system in a bad state.

This commit is adding those checks.

Signed-off-by: Vasilis Remmas <vremmas@nvidia.com>
@vasrem vasrem force-pushed the fix/enhance-checks-on-generic-plugin branch from 0ecd227 to 5f8d1a1 Compare May 23, 2024 10:30

vasrem commented May 23, 2024

@SchSeba please have a look, I addressed your comments.

Signed-off-by: Vasilis Remmas <vremmas@nvidia.com>
@vasrem vasrem force-pushed the fix/enhance-checks-on-generic-plugin branch from 5f8d1a1 to 5f3c4e9 Compare May 23, 2024 10:50

Collaborator

@SchSeba SchSeba left a comment

nice work!

sorry about the amount of time the review took

@SchSeba SchSeba merged commit 8d32f42 into k8snetworkplumbingwg:master May 23, 2024
12 checks passed
@vasrem vasrem deleted the fix/enhance-checks-on-generic-plugin branch May 27, 2024 11:55
zeeke added a commit to zeeke/sriov-network-operator-1 that referenced this pull request Jun 25, 2024
Merge issue coming from
- k8snetworkplumbingwg#690
- k8snetworkplumbingwg#587

Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
@zeeke zeeke mentioned this pull request Jun 25, 2024
zeeke added a commit to zeeke/sriov-network-operator-1 that referenced this pull request Jun 25, 2024
Merge issue coming from
- k8snetworkplumbingwg#690
- k8snetworkplumbingwg#587

Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
maze88 pushed a commit to Mellanox/sriov-network-operator that referenced this pull request Jul 2, 2024
Merge issue coming from
- k8snetworkplumbingwg#690
- k8snetworkplumbingwg#587

Signed-off-by: Andrea Panattoni <apanatto@redhat.com>