Update netlink messages handler #2233

liorghub · 2022-04-18T19:57:30Z

What I did
Ignore netlink DELLINK messages if the port has a master. This applies to the case where the port was part of a VLAN bridge or a LAG.

Why I did it
The netlink message handler in portsyncd was ignoring all messages that had a master.
As a result, we ignored messages on interfaces that belong to a LAG, not only on interfaces that belong to a bridge as intended.
The result was "netdev_oper_status" down in PORT_TABLE in state DB for a port that is part of a LAG, even though the port is actually up.

How I verified it
Check "netdev_oper_status" in PORT_TABLE in state DB for a port that is part of a LAG.

Details if related

zhenggen-xu · 2022-04-20T04:26:24Z

@liorghub Thanks for the fix. Can you please add VS tests to cover the cases where the port is part of a bridge or a LAG?

prsunny · 2022-04-20T22:29:06Z

portsyncd/linksync.cpp

    if (master)
    {
-        return;
+        LinkCache &linkCache = LinkCache::getInstance();


AFAIK, this would be handled by teamsyncd. Can you check?

@prsunny I checked. teamsyncd handles the messages sent for the port-channel interface itself; those messages are marked with type="team". The bug I fixed concerns the handling of messages for ports that belong to a port-channel. These messages are not marked with type="team".

ok, @judyjoseph, can you check this? This seems to be a basic change that was missed. @liorghub, what is the functional impact?

The functional impact is in LLDP. There we check that "netdev_oper_status" is up in state DB PORT_TABLE before sending LLDP commands. If "netdev_oper_status" is down, the LLDP command is not sent, which causes wrong LLDP behavior.

See the following code in lldpmgrd.
https://github.com/Azure/sonic-buildimage/blob/cc30771f6b97234a6dd19d8f97d5dfd44551cf20/dockers/docker-lldp/lldpmgrd#L170

ok. lgtm. As Xu suggested, please add VS tests to cover this.

@prsunny I did a quick check, noting down the events from syslog. I find that 'netdev_oper_status' is set much earlier for an interface, as long as the interface is connected and up. The teamd member addition happens earlier.

Apr 26 18:33:56.812132 str2---1 NOTICE swss0#orchagent: :- initializePort: Initializing port alias:Ethernet4 pid:1000000000006
Apr 26 18:33:56.817494 str2---1 NOTICE swss0#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet4 admin:0 oper:0 addr:40:7c:7d:bb:26:0b ifindex:22 master:0
Apr 26 18:33:56.817741 str2---1 NOTICE swss0#portsyncd: :- onMsg: Publish Ethernet4(ok:down) to state db
Apr 26 18:33:56.818394 str2---1 NOTICE swss0#orchagent: :- addHostIntfs: Create host interface for port Ethernet4
Apr 26 18:33:56.833381 str2---1 NOTICE swss0#orchagent: :- setHostIntfsOperStatus: Set operation status DOWN to host interface Ethernet4
Apr 26 18:33:56.833450 str2---1 NOTICE swss0#orchagent: :- initPort: Initialized port Ethernet4
Apr 26 18:33:56.897841 str2---1 NOTICE swss0#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet4 admin:1 oper:1 addr:40:7c:7d:bb:26:0b ifindex:22 master:0
Apr 26 18:33:56.898243 str2---1 NOTICE swss0#portsyncd: :- onMsg: Publish Ethernet4(ok:up) to state db
Apr 26 18:33:56.898260 str2---1 NOTICE swss0#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet4 admin:1 oper:1 addr:40:7c:7d:bb:26:0b ifindex:22 master:2
Apr 26 18:33:56.898310 str2---1 NOTICE swss0#portsyncd: message repeated 2 times: [ :- onMsg: nlmsg type:16 key:Ethernet4 admin:1 oper:1 addr:40:7c:7d:bb:26:0b ifindex:22 master:2]
Apr 26 18:33:56.900044 str2---1 NOTICE swss0#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet4 admin:1 oper:1 addr:40:7c:7d:bb:26:0b ifindex:22 master:2
Apr 26 18:33:56.901037 str2---1 INFO kernel: [ 140.005295] PortChannel102: Port device Ethernet4 added
Apr 26 18:33:56.901375 str2---1 NOTICE teamd0#teammgrd: :- addLagMember: Add Ethernet4 to port channel PortChannel102
Apr 26 18:33:56.912638 str2---1 NOTICE swss0#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet4 admin:1 oper:1 addr:40:7c:7d:bb:26:0b ifindex:22 master:2

@liorghub could you share a bit more detail on when you observe this behavior? Is it always seen with LLDP? Is it seen for all port-channel member interfaces, or only for interfaces that were initially oper down and became oper up a while later, as they became part of the port-channel?

@judyjoseph
Hi judy,
Issue happens when switch is booting.
Ethernet0 is part of port-channel.

As you can see below, portsyncd gets several netlink messages for Ethernet0.
The last message that arrives without "master" (master:0) is at 07:19:15.359655, and it is oper down.
Later we get more messages for Ethernet0 with oper up, but we ignore them since they are marked with "master".
Interfaces that have a master can be part of either a VLAN bridge or a port-channel.
We want to ignore only the VLAN bridge case (confirmed with @zhenggen-xu).

Since the last message we handle for Ethernet0 is oper down, state DB holds "netdev_oper_status" = "down", which causes the wrong LLDP behaviour.
The issue is persistent and occurs after each reboot.

See below logs:

root@r-tigon-20:/home/admin# grep -e "nlmsg type" -e Publish /var/log/syslog | egrep "Ethernet0"
Apr 28 07:19:15.287582 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:0 oper:0 addr:1c:34:da:c9:60:68 ifindex:77 master:0 type:sx_netdev
Apr 28 07:19:15.287898 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet0(ok:down) to state db
Apr 28 07:19:15.291418 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:0 oper:0 addr:1c:34:da:c9:60:00 ifindex:77 master:0 type:sx_netdev
Apr 28 07:19:15.291972 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet0(ok:down) to state db
Apr 28 07:19:15.359292 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:0 oper:0 addr:1c:34:da:c9:60:00 ifindex:77 master:0 type:sx_netdev
Apr 28 07:19:15.359510 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet0(ok:down) to state db
Apr 28 07:19:15.359655 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:1 oper:0 addr:1c:34:da:c9:60:00 ifindex:77 master:0 type:sx_netdev
Apr 28 07:19:15.359866 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet0(ok:down) to state db
Apr 28 07:19:15.360309 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:1 oper:0 addr:1c:34:da:c9:60:00 ifindex:77 master:4 type:sx_netdev
Apr 28 07:19:15.360352 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:1 oper:0 addr:1c:34:da:c9:60:00 ifindex:77 master:4 type:sx_netdev
Apr 28 07:19:15.365219 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:1 oper:0 addr:1c:34:da:c9:60:00 ifindex:77 master:4 type:sx_netdev
Apr 28 07:19:15.367925 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:1 oper:0 addr:1c:34:da:c9:60:00 ifindex:77 master:4 type:sx_netdev
Apr 28 07:19:27.880041 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:1 oper:1 addr:1c:34:da:c9:60:00 ifindex:77 master:4 type:sx_netdev
Apr 28 07:19:28.011930 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:1 oper:1 addr:1c:34:da:c9:60:00 ifindex:77 master:4 type:sx_netdev
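
To make the sequence above easier to follow, here is a minimal sketch that replays the Ethernet0 messages through a simplified model of the old handler. `LinkMsg`, `replayIgnoringMaster` and `ethernet0BootSequence` are hypothetical names made up for this illustration; they are not part of portsyncd, and the real handler does more than this.

```cpp
#include <string>
#include <vector>

// Simplified stand-in for two of the fields portsyncd logs per netlink message.
struct LinkMsg {
    bool oper_up;  // "oper:1" in the log
    int master;    // ifindex of the master device, 0 if there is none
};

// Hypothetical model of the pre-fix handler: any message that carries a master
// is dropped, so only master-less messages update the published status.
std::string replayIgnoringMaster(const std::vector<LinkMsg> &msgs)
{
    std::string published = "unknown";
    for (const auto &m : msgs) {
        if (m.master != 0) {
            continue;  // mirrors the old "if (master) return;"
        }
        published = m.oper_up ? "up" : "down";
    }
    return published;
}

// The Ethernet0 boot sequence from the log, reduced to (oper, master).
std::vector<LinkMsg> ethernet0BootSequence()
{
    return {
        {false, 0}, {false, 0}, {false, 0}, {false, 0},  // 07:19:15.287582 to 07:19:15.359655
        {false, 4}, {false, 4}, {false, 4}, {false, 4},  // 07:19:15.360309 to 07:19:15.367925
        {true, 4},  {true, 4},                           // 07:19:27.880041 and 07:19:28.011930
    };
}
```

Calling `replayIgnoringMaster(ethernet0BootSequence())` yields "down" in this model, because both oper-up messages arrive with master:4 and are skipped.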

liorghub · 2022-05-01T10:10:53Z

@judyjoseph I added a VS test as requested.

liorghub · 2022-05-01T10:44:46Z

/azpw run Azure.sonic-swss

mssonicbld · 2022-05-01T10:44:48Z

/AzurePipelines run Azure.sonic-swss

azure-pipelines · 2022-05-01T10:44:55Z

Azure Pipelines successfully started running 1 pipeline(s).

liorghub · 2022-05-01T14:53:23Z

Azure.sonic-swss (BuildArm arm64) is failing at the download artifacts step.
Download from the specified build: #94906
Download artifact to: /data/myagent/_work/1/s/sonic-buildimage.centec-arm64
Using default max parallelism.
Max dedup parallelism: 192
ApplicationInsightsTelemetrySender will correlate events with X-TFS-Session d7dd35bc-155c-4fca-a243-832655dcc403
DedupManifestArtifactClient will correlate http requests with X-TFS-Session d7dd35bc-155c-4fca-a243-832655dcc403
Minimatch patterns: [**]
Filtered 719 files from the Minimatch filters supplied.
Downloaded 0.0 MB out of 5,916.6 MB (0%).
Downloaded 362.1 MB out of 5,916.6 MB (6%).
##[error]Exit code 139 returned from process: file name '/data/myagent/bin.2.200.2/Agent.PluginHost', arguments 'task "Agent.Plugins.PipelineArtifact.DownloadPipelineArtifactTaskV2_0_0, Agent.Plugins"'.

liorghub · 2022-05-01T14:53:42Z

/azpw run Azure.sonic-swss

mssonicbld · 2022-05-01T14:53:43Z

/AzurePipelines run Azure.sonic-swss

azure-pipelines · 2022-05-01T14:53:52Z

Azure Pipelines successfully started running 1 pipeline(s).

liorghub · 2022-05-02T05:54:52Z

The following tests failed; trying a rerun.

test_NeighborAddRemoveIpv6LinkLocal
test_PortMirrorToLagAddRemove
test_RouteAddRemoveIpv4Route
test_RouteAddRemoveIpv4RouteUnresolvedNeigh
test_RouteAddRemoveIpv4RouteWithVrf
test_RouteAddRemoveIpv4BlackholeRoute

liorghub · 2022-05-02T05:55:11Z

/azpw run Azure.sonic-swss

mssonicbld · 2022-05-02T05:55:13Z

/AzurePipelines run Azure.sonic-swss

azure-pipelines · 2022-05-02T05:55:23Z

Azure Pipelines successfully started running 1 pipeline(s).

liorghub · 2022-05-08T05:36:09Z

/azpw run Azure.sonic-swss

mssonicbld · 2022-05-08T05:36:11Z

/AzurePipelines run Azure.sonic-swss

azure-pipelines · 2022-05-08T05:36:20Z

Azure Pipelines successfully started running 1 pipeline(s).

liorghub · 2022-05-09T07:34:24Z

/azpw run Azure.sonic-swss

mssonicbld · 2022-05-09T07:34:30Z

/AzurePipelines run Azure.sonic-swss

azure-pipelines · 2022-05-09T07:34:39Z

Azure Pipelines successfully started running 1 pipeline(s).

prgeor · 2022-05-11T23:15:30Z

@liorghub can you test the following scenario:
The "admin_status" is the same in APPL_DB's PORT_TABLE and STATE_DB's PORT_TABLE, both for ports that are part of a VLAN or portchannel and for ports that are not part of any VLAN or portchannel.

liorghub · 2022-05-19T06:55:15Z

@prgeor
I performed the test you asked for; the results are as expected.
Port not part of a VLAN and not part of a port-channel:

redis-cli -n 0 hgetall "PORT_TABLE:Ethernet96"
 1) "oper_status"
 2) "up"

redis-cli -n 6 hgetall "PORT_TABLE|Ethernet96"
 3) "netdev_oper_status"
 4) "up"

Port part of a port-channel:

redis-cli -n 0 hgetall "PORT_TABLE:Ethernet112"
 1) "oper_status"
 2) "up"

redis-cli -n 6 hgetall "PORT_TABLE|Ethernet112"
 3) "netdev_oper_status"
 4) "up"

Port part of a VLAN:

redis-cli -n 0 hgetall "PORT_TABLE:Ethernet92"
 1) "oper_status"
 2) "up"

redis-cli -n 6 hgetall "PORT_TABLE|Ethernet92"
 3) "netdev_oper_status"
 4) "down"

For the port that is part of a VLAN, there is indeed an inconsistency between the databases.
Before my changes, the inconsistency also occurred for ports that are part of a LAG.
It looks like we should completely remove the ignore operation we have here:
https://github.com/Azure/sonic-swss/blob/2ea8581da4ba6f97bebde4845a234d7c810e5515/portsyncd/linksync.cpp#L215
Once we remove it completely (so far I removed it only for LAG members), the inconsistency will be resolved.
@zhenggen-xu Can you please approve?
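
The manual comparison above could be expressed as a small check. This is only a sketch under stated assumptions: the two tables are modelled as plain maps from port name to status, `findMismatchedPorts` and the two sample functions are hypothetical names, and nothing here talks to Redis. The sample values are the ones reported above.

```cpp
#include <map>
#include <string>
#include <vector>

// Port name -> status string ("up" / "down").
using StatusTable = std::map<std::string, std::string>;

// Returns the ports whose APPL_DB "oper_status" differs from the STATE_DB
// "netdev_oper_status". A port missing from the state table also counts as a
// mismatch (an assumption made for this sketch).
std::vector<std::string> findMismatchedPorts(const StatusTable &applOper,
                                             const StatusTable &stateNetdevOper)
{
    std::vector<std::string> mismatched;
    for (const auto &entry : applOper) {
        const auto it = stateNetdevOper.find(entry.first);
        if (it == stateNetdevOper.end() || it->second != entry.second) {
            mismatched.push_back(entry.first);
        }
    }
    return mismatched;
}

// "oper_status" values from `redis-cli -n 0` in the results above.
StatusTable sampleApplOperStatus()
{
    return {{"Ethernet96", "up"}, {"Ethernet112", "up"}, {"Ethernet92", "up"}};
}

// "netdev_oper_status" values from `redis-cli -n 6` in the results above.
StatusTable sampleStateNetdevOperStatus()
{
    return {{"Ethernet96", "up"}, {"Ethernet112", "up"}, {"Ethernet92", "down"}};
}
```

With the sample values, only Ethernet92 (the VLAN member) is reported as mismatched.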

liorghub · 2022-05-19T08:28:22Z

@prgeor @zhenggen-xu
Guys, can we meet via Teams and discuss?
I have all the details, and we can close it really fast if we talk.

prgeor · 2022-05-19T18:29:04Z

@liorghub OK

zhenggen-xu · 2022-05-20T06:05:37Z

@liorghub I thought about this a little more. I think the right fix is to change:

if (master)
{
    return;
}

to:

if (master && nlmsg_type == RTM_DELLINK)
{
    return;
}

What we were really trying to avoid before was removing the port itself when the port was removed from the bridge. I think this should apply to LAG too (in case the port was removed from a LAG), hence the change above. This should also fix the inconsistency of the link status across the tables that you mentioned above. My email: zxu@linkedin.com; we can meet in Teams.
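
For reference, the proposed condition can be isolated as a small predicate. This is a sketch, not the actual linksync.cpp code: `shouldIgnoreLinkMsg` is a hypothetical helper, and the two constants are redefined locally (with the same values as RTM_NEWLINK and RTM_DELLINK in `<linux/rtnetlink.h>`) only so that the snippet compiles without kernel headers.

```cpp
// Same values as RTM_NEWLINK / RTM_DELLINK in <linux/rtnetlink.h>,
// redefined under different names to keep the sketch header-free.
constexpr int kRtmNewLink = 16;
constexpr int kRtmDelLink = 17;

// Hypothetical helper: true when the handler should skip the message.
// Only a DELLINK for a port that still has a master (for example, the port
// being removed from a bridge or a LAG) is skipped. NEWLINK messages are
// processed even when a master is set, so oper status updates for LAG and
// VLAN members are no longer lost.
bool shouldIgnoreLinkMsg(int nlmsg_type, int master)
{
    return master != 0 && nlmsg_type == kRtmDelLink;
}
```

In this model, the oper-up messages for a LAG member (type 16 with a non-zero master) are no longer skipped, while a DELLINK that carries a master is still ignored, so the port itself is not removed.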

liorghub · 2022-05-22T11:39:35Z

@zhenggen-xu your fix made it work, thanks!
@prgeor I retested the scenarios you mentioned, and also the scenario of removing a port from a VLAN and from a LAG. Everything is working as expected; there is no longer any inconsistency between the databases.

zhenggen-xu · 2022-05-22T15:07:54Z

portsyncd/linksync.cpp

@@ -215,7 +215,7 @@ void LinkSync::onMsg(int nlmsg_type, struct nl_object *obj)
    /* If netlink for this port has master, we ignore that for now
     * This could be the case where the port was removed from VLAN bridge
     */
-    if (master)
+    if (master && nlmsg_type == RTM_DELLINK)


Let's change the comment section above to state that we ignore the DELLINK message if the port has a master; this applies to the case where the port was part of a VLAN, LAG, etc. You should rename the PR title too.

dprital · 2022-05-24T07:26:40Z

@zhenggen-xu, @prgeor: now that all the comments have been addressed, can you please approve this PR?

liorghub requested a review from prsunny as a code owner April 18, 2022 19:57

Do not ignore netlink messages on interfaces belong to LAG

4f2cf4f

liorghub force-pushed the fix_netdev_oper branch from 2786614 to 4f2cf4f Compare April 18, 2022 19:58

dprital added the Request for 202111 Branch label Apr 19, 2022

prsunny reviewed Apr 20, 2022

dprital requested a review from judyjoseph April 26, 2022 07:02

Add VS test for fix in netlink messages handler

df2d788

Merge branch 'Azure:master' into fix_netdev_oper

7f21453

prsunny requested a review from prgeor May 11, 2022 22:44

prgeor mentioned this pull request May 12, 2022

Consumer of same field subscribed to different DBs to have DBs/TABLE check prior to processing event sonic-net/sonic-platform-daemons#259

Open

shyam77git mentioned this pull request May 16, 2022

admin_status field to be made consistent across different DBs (STATE_DB, APPL_DB) #2275

Open

Revert former fix and ignore netlink messages of a port only when it …

a8228f6

…is being removed from bridge

Remove unneeded include

45232f2

zhenggen-xu previously approved these changes May 22, 2022

zhenggen-xu reviewed May 22, 2022

Fix comment

5bb10de

liorghub dismissed zhenggen-xu’s stale review via 5bb10de May 23, 2022 17:53

liorghub changed the title ~~Do not ignore netlink messages on interfaces belong to LAG~~ Ignore only DELLINK message if port has master May 23, 2022

liorghub changed the title ~~Ignore only DELLINK message if port has master~~ Update netlink messages handler May 24, 2022

zhenggen-xu approved these changes May 24, 2022

liat-grozovik approved these changes May 25, 2022

liat-grozovik merged commit 7fc0f73 into sonic-net:master May 25, 2022

dprital mentioned this pull request May 25, 2022

[submodule] Advanved sonic-swss pointer sonic-net/sonic-buildimage#10926

Merged

6 tasks

judyjoseph added the Included in 202111 Branch label May 25, 2022

prsunny mentioned this pull request Jul 6, 2022

DUT sent lldp frame with incorrect port ID to EOS neighbor sonic-net/sonic-buildimage#11255

Closed

akokhan mentioned this pull request Sep 20, 2022

[teammgr] Added LAG member check into addLagMember() #2464

Merged

yenlu-keith mentioned this pull request Feb 13, 2023

Added LAG member check on addLagMember() (#2464) #2665

Closed
