Infiniband interface state Down #249

Closed
aym-frikha opened this issue Oct 1, 2021 · 6 comments
Labels
bug Something isn't working

Comments

@aym-frikha

aym-frikha commented Oct 1, 2021

What happened:
The InfiniBand interface is attached to the pod and its physical state is LinkUp (physically connected), but the port state is Down.
root@test-sriov-ib-pod:~# ibstat mlx5_7
CA 'mlx5_7'
    CA type: MT4124
    Number of ports: 1
    Firmware version: 20.28.4000
    Hardware version: 0
    Node GUID: 0x9aecdde4e1e642ba
    System image GUID: 0x043f720300df3e00
    Port 1:
        State: Down
        Physical state: LinkUp
        Rate: 200
        Base lid: 65535
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651ec48
        Port GUID: 0x9aecdde4e1e642ba
        Link layer: InfiniBand
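
The combination reported above (Physical state LinkUp, State Down, Base lid 65535) can be picked out of ibstat output mechanically. A minimal sketch, assuming ibstat prints the field names shown above; `port_diagnosis` is a hypothetical helper name and not part of infiniband-diags. It only reports the mismatch and does not attempt to explain the cause.

```shell
# Hypothetical helper: reads ibstat-style output on stdin and flags the
# "physical link up, port state Down" mismatch described in this issue.
port_diagnosis() {
    awk -F': *' '
        # "Physical state:" does not match this pattern because the line
        # starts with "Physical", so the two fields are kept apart.
        /^[[:space:]]*State:/          { state = $2 }
        /^[[:space:]]*Physical state:/ { phys = $2 }
        /^[[:space:]]*Base lid:/       { lid = $2 }
        END {
            if (phys == "LinkUp" && state == "Down")
                print "physical link up but port state Down (base lid " lid ")"
            else
                print "state=" state " physical=" phys
        }'
}
```

Usage would be along the lines of `ibstat mlx5_7 | port_diagnosis`.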

What you expected to happen:
The InfiniBand interface is attached to the pod with the physical link up and the port state up (Active) in ibstat.

How to reproduce it (as minimally and precisely as possible):
helm install -f ./values.yaml -n network-operator --create-namespace --wait network-operator mellanox/network-operator --version v1.0.0

Anything else we need to know?:

Logs:

Environment:

  • Kubernetes version (use kubectl version): 1.21

  • Hardware configuration: DGX2 server

    • Network adapter model and firmware version: Mellanox Technologies MT28908 Family [ConnectX-6]
      [PN] Part number: MC
      [EC] Engineering changes: AC
      [V2] Vendor specific: MCX653105A-HDAT
      [SN] Serial number: MT2043T04073
  • OS (e.g: cat /etc/os-release): ubuntu focal

  • Kernel (e.g. uname -a): Linux 5.4.0-86-generic 97-Ubuntu SMP Fri Sep 17 19:19:40 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

  • Others:

@aym-frikha aym-frikha added the bug Something isn't working label Oct 1, 2021
@e0ne
Collaborator

e0ne commented Oct 4, 2021

Hi @aym-frikha, could you please share your OpenSM config? defmember should be set to 'full'.
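
For anyone following along, defmember is a per-partition flag in OpenSM's partition configuration file. A minimal sketch of a check, assuming the file is at /etc/opensm/partitions.conf (an assumption; the path depends on how OpenSM was installed) and using a hypothetical helper name `has_full_defmember`. The example partition line in the comment is illustrative only and is not taken from this thread.

```shell
# Hypothetical helper: succeeds if any non-comment line of the partition
# config sets defmember=full. The default path is an assumption.
# Illustrative partition line (adapt to your fabric):
#   Default=0x7fff, ipoib, defmember=full : ALL, SELF=full;
has_full_defmember() {
    conf="${1:-/etc/opensm/partitions.conf}"
    # Drop comment lines first so a commented-out setting is not counted.
    grep -v '^[[:space:]]*#' "$conf" 2>/dev/null |
        grep -Eq 'defmember[[:space:]]*=[[:space:]]*full'
}
```

Run it on the node that hosts the subnet manager, for example `has_full_defmember && echo "defmember=full is set"`.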

@aym-frikha
Author

Hi @e0ne, how do I check the OpenSM config? Should OpenSM run on the underlying host or in the pod? I tried installing it in the pod (apt install opensm), but the /etc/opensm/ directory is empty. Do you have a configuration example?

BTW, ibping between the hosts works fine even without enabling OpenSM.

Thank you

@moshe010
Collaborator

moshe010 commented Oct 4, 2021

If you are using SR-IOV, you must use the OpenSM from OFED, not the inbox OpenSM.

@aym-frikha
Author

Hi @moshe010, I installed OFED using this small script:

#!/bin/bash
apt install build-essential cmake tcsh tcl tk make git curl vim wget ca-certificates iputils-ping net-tools ethtool perl lsb-release python-libxml2 iproute2 pciutils libnl-route-3-200 kmod libnuma1 lsof openssh-server swig libelf1 automake libglib2.0-0 autoconf graphviz chrpath flex libnl-3-200 m4 debhelper autotools-dev gfortran libltdl-dev

MOFED_VERSION=5.4-1.0.3.0
OS_VERSION=ubuntu18.04
PLATFORM=x86_64

wget http://content.mellanox.com/ofed/MLNX_OFED-${MOFED_VERSION}/MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}.tgz
tar -xvf MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}.tgz
MLNX_OFED_LINUX-${MOFED_VERSION}-${OS_VERSION}-${PLATFORM}/mlnxofedinstall --user-space-only --without-fw-update -q
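
The environment section lists Ubuntu focal while the script pins OS_VERSION=ubuntu18.04; the thread does not say whether that was deliberate (for example, to match the container image). A minimal sketch of deriving the value from the running system instead, assuming the standard ID and VERSION_ID keys in /etc/os-release and assuming a tarball is published for the resulting combination. `mofed_os_version` is a hypothetical helper name.

```shell
# Hypothetical helper: prints e.g. "ubuntu20.04" from an os-release file.
# Sourcing happens in a subshell so ID/VERSION_ID do not leak into the caller.
mofed_os_version() {
    ( . "${1:-/etc/os-release}" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}
```

In the script this would replace the hard-coded assignment with `OS_VERSION=$(mofed_os_version)`.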

But the port is still down:
root@test-sriov-ib-pod2:~# ibv_devinfo
hca_id: mlx5_1
    transport: InfiniBand (0)
    fw_ver: 20.28.4000
    node_guid: 4cdc:c189:9792:6a40
    sys_image_guid: 043f:7203:00df:3df4
    vendor_id: 0x02c9
    vendor_part_id: 4124
    hw_ver: 0x0
    board_id: MT_0000000223
    phys_port_cnt: 1
        port: 1
            state: PORT_DOWN (1)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 1
            port_lid: 65535
            port_lmc: 0x00
            link_layer: InfiniBand

When I try to start opensm, it reports:
root@test-sriov-ib-pod:~# opensm

OpenSM 5.9.0.MLNX20210617.c9f2ade
Command Line Arguments:
Log File: /var/log/opensm.log

OpenSM 5.9.0.MLNX20210617.c9f2ade

Using default GUID 0x43f720300df3dac
Entering DISCOVERING state

Error from osm_opensm_bind (0x2A)
Perhaps another instance of OpenSM is already running
Exiting SM

root@test-sriov-ib-pod:~# service opensmd status
OpenSM is not running.

@dannf

dannf commented Oct 5, 2021

> if you are using SR-IOV you must use opensm from OFED not inbox OpenSM.

Hi @moshe010, could you elaborate on what the significant difference is? Are the necessary change(s) available in upstream OpenSM? I'm wondering if this isn't something we could pull into Ubuntu.

@moshe010
Collaborator

@aym-frikha it seems that you have another instance of OpenSM running ("Perhaps another instance of OpenSM is already running"). Make sure other nodes in the fabric aren't running it as well.
@dannf SR-IOV support in OpenSM wasn't upstreamed. I think it's best to use the OFED OpenSM. You can contact me at moshele@nvidia.com for further discussion.
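
One way to check for a subnet manager that is already active on the fabric is to query it with sminfo from infiniband-diags. A minimal sketch, assuming sminfo is installed and prints its usual single line ending in a state name such as SMINFO_MASTER; `master_sm_lid` is a hypothetical helper name.

```shell
# Hypothetical helper: reads sminfo-style output on stdin, prints the LID
# of a master SM, and returns non-zero if no master is reported.
master_sm_lid() {
    awk '/SMINFO_MASTER/ {
             # The LID is the token that follows the first "lid" word.
             for (i = 1; i < NF; i++)
                 if ($i == "lid") { print $(i + 1); found = 1; exit }
         }
         END { exit(found ? 0 : 1) }'
}
```

Usage would be along the lines of `sminfo | master_sm_lid`. A printed LID means a master SM is already answering, which would fit the sm_lid 1 shown in the earlier output.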

@moshe010 moshe010 closed this as completed Nov 5, 2021