Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bgp nexthop group optimisations #15488

Draft
wants to merge 60 commits into
base: master
Choose a base branch
from

Conversation

pguibert6WIND
Copy link
Member

This patch set proposes to bring BGP changes. Among the reasons of those changes:

  • split the BGP information between prefix and nexthop. This change applies to single path routes, but also multiple path routes (with addpath functionality).
  • reduce the message exchanged between BGP and ZEBRA during failover scenarios

Link: https://github.com/pguibert6WIND/SONiC/blob/proposal_bgp/doc/pic/bgp_pic_arch_doc.md#9-frrouting-ipc-messaging-from-bgp-to-zebra

The BGP nexthop-groups created need to be installed to ZEBRA.
Notifications from ZEBRA must be sent back to BGP.

Extend the BGP nexthop group library: add the syncronisation with the
ZEBRA_NHG_ADD config messages to ZEBRA, and installation and
removal events from ZEBRA.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Each BGP path should reference the nexthop-group according to its
nexthop attribute information.

A bgp_nhg_cache pointer is added in each BGP path structure, as
well as a pointer in the linked list.

At installation time, the nexthops of the paths are parsed.
Some paths will not have a BGP nexthop-group identifier.
- Aggregated paths
- nexthops resolving over blackhole routes
- directly connected nexthops
- ECMP nexthops
- IPv4 mapped IPv6 nexthops

The selected BGP path will get a bgp_nhg_cache entry.
The BGP route will be added to ZEBRA with the BGP nexthop-group
identifier. The installation to ZEBRA will be delayed after the
BGP nexthop-group is installed.

The route removal procedure is kept unchanged.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Nexthop tracking changes should be handled for each BGP
nexthop-group, prior to evaluate each BGP path.

Add a hook on nexthop tracking to parse the list of available
bgp_nhg_cache entries matching the changed nexthop tracking
context.

The nexthop-group is either detached and removed from the BGP
paths; or updated to take into account the new resolved path.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Before using nexthop-groups, ZEBRA could cancel route installation:
- if the prefix was not a host prefix and was resolving over itself
- if the prefix was resolving over the default route whereas the
resolution to the default route was unconfigured on ZEBRA.

When using BGP nexthop-grups, at nexthop-group installation, ZEBRA
does not know the prefix, and can not do this control.

Add the control at BGP level to compare BGP prefixes and the nexthop
resolution: the control will eventually install the route without using
the nexthop-group.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Nexthop group utilities used for testing in topotests, need to be shared
in a common place. New BGP tests will be done.

Move the nexthop_group functions from the all_protocol_startup test to a
new file in lib/nexthop_group.py
A new router name parameter is passed to the following functions:
- route_get_nhg_id()
- verify_nexthop_group()
- verify_route_nexthop_group()

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
This topology uses an IGP network with BGP between r1, and
r5/r6 on the one hand, and two CEs at the edge side, on the
other hand. The bgp nexthop-group mode is used, meaning that
BGP gathers prefixes which use a shared nexthop together, and
use the NHG_ADD zapi messages to inform ZEBRA.

Consequently, ZEBRA will pick up an identifier (either the one
proposed by BGP or an other one chosen by ZEBRA), and will stick
that identifier to the BGP nexthop.

The test ensures that that identifier is correctly used in the
ZEBRA RIB, and the Linux FIB, even when the IGP resolution
changes (interfaces going down, or remote IGP label value change).

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Some basic path information is used by the 'show bgp nexthops detail'
command. This information only deals with path information, and could
be used by other show commands to list bgp_paths.

Move the specific code from bgp_nexthop.c to bgpd.c
- bgp_show_bgp_path_info_flags() is renamed to bgp_path_info_show_flags()
- add a new function bgp_path_info_display() to display this information

The AS of the bgp path is displayed based on the BGP instance, where
the path is.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Add a vty function that displays the nexthop groups
used by BGP.

> r1# show bgp nexthop-group  detail
> ID: 75757571, #paths 1
>   Flags: 0x0000
>   State: 0x0001 (Installed)
>            via 172.31.10.7 (vrf vrf1) inactive, weight 1
>   Paths:
>     1/1 192.0.2.7/32 VRF vrf1 flags 0x418
> ID: 75757572, #paths 1
>   Flags: 0x0000
>   State: 0x0001 (Installed)
>            via 172.31.11.8 (vrf vrf2) inactive, weight 1
>   Paths:
>     1/1 192.0.2.7/32 VRF vrf2 flags 0x418
> ID: 75757575, #paths 1
>   Flags: 0x0003 (allowRecursion, internalBgp)
>   State: 0x0001 (Installed)
>            via 192.0.2.6 (vrf default) inactive, weight 1
>   Paths:
>     1/1 192.0.2.9/32 VRF default flags 0x418
> ID: 75757576, #paths 1
>   Flags: 0x0003 (allowRecursion, internalBgp)
>   State: 0x0001 (Installed)
>            via 192.0.2.6 (vrf default) inactive, label 6000, weight 1
>   Paths:
>     1/1 192.0.2.9/32 VRF vrf1 flags 0x4018

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When BGP addpath is used, BGP prefixes may use one or multiple
nexthops. To implement multipath with BGP nexthop-groups, each
nexthop should be handled separately for each single path, and
at the same time, grouped for the multiple paths.

Because of this requirement, the choice of having a two level
nexthop group hierarchy with two kind of nexthop-groups is done:
- child nexthop-group contain the original bgp nexthop-group
definition. Each child can have only one nexthop.
- parent nexthop-group contain the child nexthop-group identifiers

A nhg_childs tree implements the connection of the parent to one
or multiple childs (depending if it is a single or multiple path)
A nhg_groups tree implements the connection of the child to the
parents using that child (because one nexthop may be used by
single or multiple paths).

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The bgp_nhg_cache entry needs to handle an hierarchy to either reference
parent groups or child nexthops.

Move the original nexthop information into a sub structure and use the
new 'nexthops' instance into the original bgp_nhg_cache entry.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The list of paths will be available for any kind of bgp nhg.
To avoid code redundancy, a rework is done.

Move the code of that parsing from the show_bgp_nhg_id_helper(),
to a separte show_bgp_nhg_id_helper_detail().

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Introducing a nexthop-group hierarchy requires to handle
child and parent nexthop-groups in different ways.

Reusing a same child nexthop-group by using hash entries
based on nexthop attributes makes sense, and guarantees
that each nexthop is unique. For instance, at installation
time of an incoming BGP update, BGP will figure out an
existing nexthop exists.

For parent nexthop-groups, using key based on the number
of childs, and the id of each child is not wishable: this
implies that at failover events, the key of the parent
nexthop-group would be recomputed, and may conflict with
an already present parent nexthop-group. This would result
in moving bgp paths to a new nexthop-group, whereas the
goal of this implementation is to minimise the number of
changes at failover events.

This is why it is proposed to have a separate hash list
for parent nexthop-groups, which will be ordered by
id: each id is unique and will never conflict with a
parent that shares the same childs.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Extend the current BGP nexthop-group API to support an hierarchy
between child nexthop-groups and parent nexthop-groups.

- The original bgp_nhg_cache hash entry is used to store both
kind of nexthop-groups.

- Extend the debug API to dump parent nexthop groups.

- Extend the BGP API to send NHG_CHILD_ADD command to zebra.

- Extend the BGP NHG add and delete API by sorting out the
NHG in the appropriate hash-list

- Extend the BGP NHG show command to reach both hash lists.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Add a new 'bgp_nhg_nexthop' pointer in each bgp path.
This pointer is the child nexthop-group, is not used
in this commit.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Unlock the possibility to use nexthop groups for ECMP paths.

For each path to install, two nexthop groups will be used:
parent (bgp_nhg attribute) and child (bgp_nhg_nexthop attribute).

Handle the ZEBRA installation order at protocol level: the nexthop
first, the parent after.

Unlock the ability to handle multiple paths with NHGs.

At nexthop tracking event, when a nexthop is invalid, the child
nexthop-groups are detached, and the parent nexthop-group is updated.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When BGP peering is considered down due to BFD event, or when the
operator performs a clear command on a given peer, the parent
nexthop groups are never kept.

Actually, upon each bgp path removal, the peer routing table paths
are flushed, and the two bgp nexthop-groups of each path are
detached, whereas only the child nexthop-group of each path of the
peer should be detached.

Prevent from detaching the parent nexthop-group by introducing
a two step procedure:
- for each peer failover event, the bgp tables are parsed:
for each path of the peer, only the child nexthop-group is detached.
- when all tables are parsed, a check is done on the unused
nexthops. The same parent will eventually be updated with the
updated child group number.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The 'show bgp nexthop-group' function is extended to display
the parent and child information.

Display if a nexthop-group is a child or a parent.
Display the nexthop-group interconnections between child
list and parent list.

> r1# show bgp nexthop-group detail
> ID: 75757571, #paths 1
>   Flags: 0x0003 (allowRecursion, internalBgp)
>   State: 0x0001 (Installed)
>            via 192.0.2.5 (vrf default) inactive
>           parent group 75757574                <--- added
>   Paths:					 <--- added
>     1/1 192.0.2.9/32 VRF default flags 0xc10
> ID: 75757573, #paths 1
>   Flags: 0x0003 (allowRecursion, internalBgp)
>   State: 0x0001 (Installed)
>            via 192.0.2.6 (vrf default) inactive
>           parent group 75757574
> ID: 75757574, #paths 2
>   Flags: 0x000b (allowRecursion, internalBgp, TypeParent)
>   State: 0x0001 (Installed)
>           child list count 2			<--- added
>           childi(s) 75757571 75757573		<--- added
> r1# show bgp nexthop-group json
> [..]
>   {
>     "nhgId":75757575,
> [..]
>     "parentList":[				<--- added
>       {
>         "Id":75757576
>       }
>      ]
>   },
>   {
>     "nhgId":75757576,
>  [..]
>     "pathCount":3,
>     "childListCount":3,			<--- added
>     "childList":[				<--- added
>        {
>          "Id":75757571
>        },
>        {
>          "Id":75757573
>        },
>        {
>          "Id":75757575
>        }
>      ],
>      "paths": [				<--- added
>       {
>        "afi": "IPv4",
>        "safi:" "unicast",
>        "prefix": "192.0.2.9/32",
>        [..]
>       }
>      ]
>  }
>]

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Until now, only the PATH_SELECTED path had a link to the
parent nexthop group. This may be a problem if we have a
single path with no bgp_nhg, and a bgp_nexthop.

Instead of parsing the whole path list for a given prefix,
reuse the bgp_nhg pointer for all entries.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When using BGP nexthop-groups with route, it is not needed to reinstall
routes when only the nexthop needs to be updated for any reason (IGP,
or same incoming BGP update received with nexthop modification).

Fix this by introducing a flag at node level, to know if a route update
is needed or not. This feature works if 'bgp suppress-fib-enabled'
functionality is used. Without that, the BGP service does not know if
a route has already been installed or not.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When a BFD failure happens with multiple same nexthops belonging
to different peers, the parent nexthop group is not refreshed,
and after one BGP failover, a new parent NHGID is chosen.

This case happens on a multihomed setup with a full-route that
contains recursive routes that have the same nexthop.

The below case shows the same prefix with the same nexthop
learned from 3 different peers. A single nexthop group is
created for all the three identical nexthops.

> r1# show bgp nexthop-group 71428590 detail
> ID: 71428590, #paths 6
>   Flags: 0x000b (allowRecursion, internalBgp, TypeParent)
>   State: 0x0001 (Installed)
>           child list count 1 (peer count 3)
>           child(s) 71428577
>   Paths:
>     1/1 172.18.1.101/32 VRF default flags 0xc10
>     1/1 172.18.1.101/32 VRF default flags 0xc10
>     1/1 172.18.1.101/32 VRF default flags 0x418
>     1/1 172.18.1.100/32 VRF default flags 0xc10
>     1/1 172.18.1.100/32 VRF default flags 0xc10
>     1/1 172.18.1.100/32 VRF default flags 0x418
> r1# show bgp nexthop-group 71428577 detail
> ID: 71428577, #paths 6
>   Flags: 0x0003 (allowRecursion, internalBgp)
>   State: 0x0001 (Installed)
>            via 172.16.0.100 (vrf default) inactive
>           parent list count 1
>           parent(s) 71428590
>   Paths:
>     1/1 172.18.1.101/32 VRF default flags 0x418
>     1/1 172.18.1.100/32 VRF default flags 0x418
>     1/1 172.18.1.100/32 VRF default flags 0xc10
>     1/1 172.18.1.101/32 VRF default flags 0xc10
>     1/1 172.18.1.101/32 VRF default flags 0xc10
>     1/1 172.18.1.100/32 VRF default flags 0xc10
> r1# show bgp ipv4 172.18.1.101/32
> BGP routing table entry for 172.18.1.101/32, version 26
> Paths: (3 available, best FRRouting#1, table default, vrf (null))
>   Not advertised to any peer
>   Local
>     172.16.0.100 from 192.0.2.5 (192.0.2.5)
>       Origin incomplete, metric 0, localpref 100, valid, internal, multipath, best (Router ID)
>       AddPath ID: RX 5, TX-All 0 TX-Best-Per-AS 0 TX-Best-Selected 0
>       Last update: Fri Sep 13 14:47:14 2024
>   Local
>     172.16.0.100 from 192.0.2.3 (192.0.2.8)
>       Origin incomplete, metric 0, localpref 100, valid, internal, multipath
>       Originator: 192.0.2.8, Cluster list: 192.0.2.3
>       AddPath ID: RX 18, TX-All 0 TX-Best-Per-AS 0 TX-Best-Selected 0
>       Last update: Fri Sep 13 14:45:12 2024
>   Local
>     172.16.0.100 from 192.0.2.3 (192.0.2.6)
>       Origin incomplete, metric 0, localpref 100, valid, internal, multipath
>       Originator: 192.0.2.6, Cluster list: 192.0.2.3
>       AddPath ID: RX 13, TX-All 0 TX-Best-Per-AS 0 TX-Best-Selected 0
>       Last update: Fri Sep 13 14:45:12 2024

After failover, the parent nexthop group is replaced, and all the
remaining BGP paths are updated with the new parent nexthop group.

Instead, the same parent NHGID should be kept. Create an hash
list of peers for each child nexthop. Before failover, 3 peers
are mentioned along with the exact path counter. At failover,
the failed peer decrements, and the parent NHGID is refreshed

> 2024/09/13 13:52:11.277173 BGP: [MNZ4S-8HW2T] NHG 71428578: peer count changed (3 -> 2) for nexthop (172.16.0.100 if 0 VRF 0 wt 0  )

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Add BGP user documentation about nexthop-group (show, debug, config).

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Add technical information about BGP nexthop groups.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When using bgp nexthop groups, it is possible to avoid route
updates, and this test ensures that the number of zapi route
messages from BGP to ZEBRA can be reduced, for instance when
a failover happens.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When not used by default, bgp nexthop group are not tested
against the following route entry added:
- route entry resolved over default route
- route entry resolved over itself
- route entry resolved over blackhole route
- route entry connected (case mplsvpn)
- route entry using ipv4 mapped ipv6 address

This test suite covers the above use cases.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When using BGP nexthop-groups, BGP does not add inactive nexthops to
zebra routes. This is the case in the bgp_peer_type_multipath test,
where step8 fails to match the 203.0.113.8/30 prefix.

> # show bgp ipv4
> [..]
> *= 203.0.113.8/30   198.51.100.10                          0 64503 i
> *>                  10.0.3.2                               0 64502 i

With BGP nexthop-groups, BGP installs the valid nexthop.

> # show ip route
> [..]
> B>* 203.0.113.8/30 [20/0] via 10.0.3.2, r1-eth1, weight 1, 00:22:49

Without BGP nexthop-group, BGP adds 2 nexthops in zebra, including
the inactive one.

> # show ip route
> [..]
> B>* 203.0.113.8/30 [20/0] via 10.0.3.2, r1-eth1, weight 1, 00:00:21
>  *                       via 198.51.100.10 inactive, weight 1, 00:00:21

The option to permit sending inactive nexthops has been tested, and
leads to zebra error messages, which result in having non deterministic
output:

> 2024/07/03 16:22:20.137637 ZEBRA: [RTQGY-B7JCM] zapi_read_nexthop_group: nhgroup=75757592)
> 2024/07/03 16:22:20.137639 ZEBRA: [P3R7K-NKGH6][EC 4043309131] zapi_read_nexthop_group: Nexthops Groups Specified: 3(75757592) not found
> 2024/07/03 16:22:20.137688 ZEBRA: [NTQ7Y-ATGH0][EC 4043309131] zread_nhg_child_add: Nexthop Group Creation failed

An other option to modify the test to ignore the invalid nexthop entry
is preferred. This is what this commit will propose.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Modify the current test, to add recursive BGP routes: those routes
will resolve over BGP routes, similar to what happens when a full
route is received.

The IGP failure test checks that the NHIDs from those recursive
routes is not changed, and that a nexthop group message update to
ZEBRA is enough to change the routing to all incoming BGP routes.

The BFD failover test is implemented, and warns that the nexthop
group ID has changed.

The check_route test is modified to ignore the duplicated nexthops
in the FIB. Counting the number of duplicated nexthops is a pain
with recursive BGP routes, as per the below output:

> r1# show ip route 172.18.1.100
> Routing entry for 172.18.1.100/32
>   Known via "bgp", distance 200, metric 0, best
>   Last update 00:00:13 ago
>     172.16.0.200 (recursive), weight 1
>   *   172.31.0.3, via r1-eth1, label 16055, weight 1
>   *   172.31.2.4, via r1-eth2, label 16055, weight 1
>   *   172.31.0.3, via r1-eth1, label 16006, weight 1
>   *   172.31.2.4, via r1-eth2, label 16006, weight 1
>   *   172.31.8.7, via r1-eth4, label 16008, weight 1
>     172.16.0.200 (duplicate nexthop removed) (recursive), weight 1
>       172.31.0.3, via r1-eth1 (duplicate nexthop removed), label 16055, weight 1
>       172.31.2.4, via r1-eth2 (duplicate nexthop removed), label 16055, weight 1
>       172.31.0.3, via r1-eth1 (duplicate nexthop removed), label 16006, weight 1
>       172.31.2.4, via r1-eth2 (duplicate nexthop removed), label 16006, weight 1
>       172.31.8.7, via r1-eth4 (duplicate nexthop removed), label 16008, weight 1
>
> r1# show ip route 172.22.1.100
> Routing entry for 172.22.1.100/32
>   Known via "bgp", distance 200, metric 0, best
>   Last update 00:00:17 ago
>     172.18.1.200 (recursive), weight 1
>   *   172.31.0.3, via r1-eth1, label 16055, weight 1
>   *   172.31.2.4, via r1-eth2, label 16055, weight 1
>   *   172.31.0.3, via r1-eth1, label 16006, weight 1
>   *   172.31.2.4, via r1-eth2, label 16006, weight 1
>   *   172.31.8.7, via r1-eth4, label 16008, weight 1
>     172.18.1.200 (duplicate nexthop removed) (recursive), weight 1
>       172.31.0.3, via r1-eth1 (duplicate nexthop removed), label 16055, weight 1
>       172.31.2.4, via r1-eth2 (duplicate nexthop removed), label 16055, weight 1
>       172.31.0.3, via r1-eth1 (duplicate nexthop removed), label 16006, weight 1
>       172.31.2.4, via r1-eth2 (duplicate nexthop removed), label 16006, weight 1
>       172.31.8.7, via r1-eth4 (duplicate nexthop removed), label 16008, weight 1
>     172.18.1.200 (duplicate nexthop removed) (recursive), weight 1
>       172.31.0.3, via r1-eth1 (duplicate nexthop removed), label 16055, weight 1
>       172.31.2.4, via r1-eth2 (duplicate nexthop removed), label 16055, weight 1
>       172.31.0.3, via r1-eth1 (duplicate nexthop removed), label 16006, weight 1
>       172.31.2.4, via r1-eth2 (duplicate nexthop removed), label 16006, weight 1
>       172.31.8.7, via r1-eth4 (duplicate nexthop removed), label 16008, weight 1
>
> r1#

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The nexthop group test suite is extended with multipath support.
The test ensures that on a given multipath configuration, when a
failover happens, the new resulting paths use the same nexthop
group without the failed nexthop.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants