Infiniband interface state Down #249
Comments
Hi @aym-frikha, could you please share your OpenSM config? defmember should be set to 'full'.
Hi @e0ne, how do I check the OpenSM config? Should it be on the underlay host or in the pod? I tried to install it in the pod (apt install opensm), but the /etc/opensm/ directory is empty. Do you have a configuration example? BTW, ibping between the hosts works fine even without enabling OpenSM. Thank you.
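One way to check the defmember setting is to grep the partition file on whichever node runs the subnet manager. This is only a sketch: the helper name check_defmember is hypothetical, and the default path /etc/opensm/partitions.conf is an assumption that may not match an OFED installation.

```shell
#!/bin/bash
# Hypothetical helper: report whether an OpenSM partition file sets defmember=full.
# The default path is an assumption; pass the real path as the first argument.
check_defmember() {
    local conf="${1:-/etc/opensm/partitions.conf}"
    if [ ! -r "$conf" ]; then
        echo "missing"
        return 0
    fi
    # Drop comment lines, then look for the defmember flag set to full.
    if grep -v '^[[:space:]]*#' "$conf" | grep -q 'defmember[[:space:]]*=[[:space:]]*full'; then
        echo "full"
    else
        echo "not-full"
    fi
}
```

Calling check_defmember with no argument prints full, not-full, or missing for the assumed default path. A partition line that sets the flag would look something like Default=0x7fff, ipoib, defmember=full : ALL, ALL_SWITCHES=full, SELF=full; but the exact membership list depends on the fabric.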
If you are using SR-IOV, you must use the OpenSM from OFED, not the inbox OpenSM.
Hi @moshe010, I added OFED using this small script:
#!/bin/bash
MOFED_VERSION=5.4-1.0.3.0
wget http://content.mellanox.com/ofed/MLNX_OFED-${MOFED_VERSION}/MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}.tgz
But the port is still down. When I tried to start opensm, it says:
Hi @moshe010, could you elaborate on what the significant difference is? Are the necessary change(s) available in upstream OpenSM? I'm wondering if this isn't something we could pull into Ubuntu.
@aym-frikha it seems that you have another instance of OpenSM running ("Perhaps another instance of OpenSM is already running"). Make sure other nodes in the fabric don't run it as well.
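A quick local check for a second instance could look like the sketch below. It only inspects the node it runs on, so it would have to be repeated on every node in the fabric; the function name count_opensm is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count opensm processes on the local node only.
count_opensm() {
    # pgrep -x matches the exact process name; wc -l counts the matching PIDs.
    pgrep -x opensm 2>/dev/null | wc -l | tr -d '[:space:]'
}

n=$(count_opensm)
echo "local opensm processes: $n"

# If infiniband-diags is installed, sminfo queries the subnet manager
# that the fabric currently reports as master.
if command -v sminfo >/dev/null 2>&1; then
    sminfo || echo "sminfo could not query a subnet manager"
fi
```

A count above 1 on a single node, or opensm running on several nodes at once, would be consistent with the error message quoted above.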
What happened:
The InfiniBand interface is mounted in the pod and the physical state is LinkUp (physically connected), but the port state is Down.
root@test-sriov-ib-pod:~# ibstat mlx5_7
CA 'mlx5_7'
    CA type: MT4124
    Number of ports: 1
    Firmware version: 20.28.4000
    Hardware version: 0
    Node GUID: 0x9aecdde4e1e642ba
    System image GUID: 0x043f720300df3e00
    Port 1:
        State: Down
        Physical state: LinkUp
        Rate: 200
        Base lid: 65535
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651ec48
        Port GUID: 0x9aecdde4e1e642ba
        Link layer: InfiniBand
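The mismatch reported here (Physical state LinkUp, State Down) can be detected from a script by parsing the ibstat output. The following is a minimal sketch; ib_field and diagnose_port are hypothetical helper names, and the parsing assumes the "Label: value" layout shown above.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one "Label: value" field
# from ibstat output supplied on stdin (first match only).
ib_field() {
    sed -n "s/^[[:space:]]*$1:[[:space:]]*//p" | head -n 1
}

# Hypothetical helper: flag a port whose physical link is up
# while its logical state is anything other than Active.
diagnose_port() {
    local out state phys
    out=$(cat)
    state=$(printf '%s\n' "$out" | ib_field "State")
    phys=$(printf '%s\n' "$out" | ib_field "Physical state")
    if [ "$phys" = "LinkUp" ] && [ "$state" != "Active" ]; then
        echo "link is up but port state is $state"
    else
        echo "state=$state physical=$phys"
    fi
}
```

Usage would be along the lines of ibstat mlx5_7 | diagnose_port, run inside the pod.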
What you expected to happen:
The InfiniBand interface is mounted in the pod with the link up and the port state reported as up in ibstat.
How to reproduce it (as minimally and precisely as possible):
helm install -f ./values.yaml -n network-operator --create-namespace --wait network-operator mellanox/network-operator --version v1.0.0
Anything else we need to know?:
Logs:
NicClusterPolicy CR spec and state:
Output of kubectl -n nvidia-network-operator-resources get -A: operator-state.txt
Network Operator version: 1.0.0
Logs of Network Operator controller:
whereabouts-2b4l2.txt
sriov-network-config-daemon-7dktz.txt
sriov-device-plugin-ksfns.txt
sriov-device-plugin-8q85p.txt
sriov-cni-7fm9n-sriov-cni.txt
sriov-cni-7fm9n.txt
sriov-cni-7fm-sriov-infiniband-cni.txt
rdma-shared-dp-ds-dblbm.txt
network-operator-sriov-network-operator-76dc5c7879-lgwkn.txt
network-operator-node-feature-discovery-worker-mp8z6.txt
network-operator-node-feature-discovery-worker-8j2wr.txt
network-operator-node-feature-discovery-master-596fb8b7cb-xkxt9.txt
network-operator-547cb8d999-2fhwc.txt
Logs of the various Pods in nvidia-network-operator-resources namespace:
Helm Configuration (if applicable):
helm config.txt
Kubernetes nodes information (labels, annotations and status): kubectl get node -o yaml
Environment:
Kubernetes version (use kubectl version): 1.21
Hardware configuration: DGX2 server
[PN] Part number: MC
[EC] Engineering changes: AC
[V2] Vendor specific: MCX653105A-HDAT
[SN] Serial number: MT2043T04073
OS (e.g. cat /etc/os-release): ubuntu focal
Kernel (e.g. uname -a): Linux 5.4.0-86-generic 97-Ubuntu SMP Fri Sep 17 19:19:40 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Others: