prefix yang'ification of code performance is beyond awful #6658

donaldsharp · 2020-06-30T22:24:58Z

On 7.3.1, loading a 65k bgpd.conf that has prefix-lists and route-maps takes approximately 42 seconds of CPU time. When running against master with the YANG code, I gave up on the read-in after more than 2 hours of run time. This is completely unacceptable.

I cannot attach the config file, as it was given to me in confidence:

sharpd@eva:/frr4$ grep "prefix-list" /home/sharpd/frr.conf | wc
25958 201502 1542898
sharpd@eva:/frr4$ grep "route-map" /home/sharpd/frr.conf | wc
7699 33219 296423
sharpd@eva:~/frr4$

I think we should be able to come up with a configuration that shows this.
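One way to build such a reproducer is to generate a synthetic configuration of comparable size. The sketch below is only an illustration: the names `generate_config`, `PL-<n>` and `RM-<n>` are made up here, the prefixes are arbitrary, and the real configuration may stress different code paths. It emits `ip prefix-list` entries plus one `route-map` per list that matches on it.

```python
import ipaddress


def generate_config(num_prefix_lists: int, entries_per_list: int = 1) -> str:
    """Return a synthetic FRR-style config with prefix-lists and route-maps.

    All names and prefixes are hypothetical; only the overall shape
    (many prefix-lists, route-maps referencing them) follows the report.
    """
    lines = []
    base = int(ipaddress.IPv4Address("10.0.0.0"))
    n = 0
    for i in range(num_prefix_lists):
        for j in range(entries_per_list):
            # Give every entry its own /24 so no two lines are identical.
            net = ipaddress.IPv4Network((base + n * 256, 24))
            lines.append(
                f"ip prefix-list PL-{i} seq {(j + 1) * 5} permit {net}"
            )
            n += 1
    for i in range(num_prefix_lists):
        lines.append(f"route-map RM-{i} permit 10")
        lines.append(f" match ip address prefix-list PL-{i}")
        lines.append("!")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Sizes are arbitrary; tune them until the slowdown shows up.
    print(generate_config(num_prefix_lists=7000, entries_per_list=3), end="")
```

Redirecting the output to a file and loading it on both 7.3.1 and master should give numbers that can be shared publicly, assuming the slowdown depends mainly on the number of commands rather than on their contents.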

donaldsharp · 2020-06-30T22:28:29Z

donaldsharp · 2020-06-30T22:28:43Z

30.05% libyang.so.1.8.4 [.] moveto_node_check
4.90% libc-2.30.so [.] __memmove_sse2_unaligned_erms
4.37% libyang.so.1.8.4 [.] moveto_node
3.65% libyang.so.1.8.4 [.] lyd_node_module
2.79% libc-2.30.so [.] __memset_avx2_unaligned_erms
1.24% libpthread-2.30.so [.] __pthread_getspecific
1.09% libyang.so.1.8.4 [.] lyd_node_module@plt
1.05% libasan.so.5.0.0 [.] __interceptor_free
0.78% slack [.] v8::internal::ConcurrentMarking::Run
0.56% libasan.so.5.0.0 [.] 0x000000000002c188
0.55% libasan.so.5.0.0 [.] 0x000000000010b360

When using the default CLI mode, the northbound layer needs to create a separate transaction to process each YANG-modeled command, since commands are supposed to be applied immediately (there is no candidate configuration or "commit" command as in the transactional CLI). The problem is that configuration transactions have an overhead associated with them, in large part because of the use of heavy libyang functions like `lyd_validate()` and `lyd_diff()`. As of now this overhead is substantial and doesn't scale well when large numbers of transactions need to be performed in sequence. As an example, loading 50k prefix-lists using a single transaction takes about 2 seconds on a modern CPU. Loading the same 50k prefix-lists using 50k transactions can take more than an hour to complete (which is unacceptable by any standard).

To fix this problem, some heavy optimization work needs to be done on libyang and on the FRR northbound itself (e.g. performing partial configuration diffs whenever possible). This, however, should be a long-term effort, since these optimizations are unlikely to be trivial to implement and we're far from having the performance numbers we need.

In the meantime, this commit introduces a simple but efficient workaround to alleviate the issue. In short, a new back-off timer was introduced in the CLI to detect when too many YANG-modeled commands are being received at the same time. When a certain threshold is reached (100 YANG-modeled commands within one second), the northbound starts to group all subsequent commands into a single large transaction, which allows them to be processed much faster (e.g. in seconds rather than hours). It's essentially a protection mechanism that creates dynamically-sized transactions when necessary to prevent performance issues. This mechanism is enabled both when parsing configuration files and when reading commands from a terminal.

The downside of this optimization is that, if several YANG-modeled commands are grouped into the same transaction and at least one of them fails, the whole transaction is rejected. This is undesirable, since users don't expect transactional behavior when it is not enabled explicitly. To minimize this issue, the CLI logs all commands that were rejected whenever that happens, so the user is aware of what happened and has enough information to fix the problem. Commands that fail due to parsing errors or CLI-level validations in general are rejected separately.

Again, this proposed workaround is intended to be temporary. The goal is to provide a quick fix for issues like FRRouting#6658 while we work on better long-term solutions.

Signed-off-by: Renato Westphal <renato@opensourcerouting.org>
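The batching behavior described in this commit message can be sketched roughly as follows. This is not the FRR implementation: `BackoffBatcher`, its fields, and the choice to flush when the one-second window rolls over are assumptions made for illustration. Only the threshold (100 commands within one second) and the all-or-nothing rejection of a grouped transaction come from the text above.

```python
from dataclasses import dataclass, field
from typing import Callable, List

THRESHOLD = 100  # commands per window, as stated in the commit message
WINDOW = 1.0     # window length in seconds, as stated in the commit message


@dataclass
class BackoffBatcher:
    """Hypothetical model: one transaction per command until the threshold
    is exceeded, then one grouped transaction for everything that follows."""

    # Applies one transaction; returns False if the transaction is rejected.
    commit: Callable[[List[str]], bool]
    window_start: float = float("-inf")
    count: int = 0
    pending: List[str] = field(default_factory=list)
    rejected: List[str] = field(default_factory=list)
    transactions: int = 0

    def submit(self, cmd: str, now: float) -> None:
        if now - self.window_start >= WINDOW:
            # Window rolled over: apply anything batched, then start counting again.
            self.flush()
            self.window_start = now
            self.count = 0
        self.count += 1
        if self.count <= THRESHOLD:
            self._run([cmd])          # normal mode: applied immediately
        else:
            self.pending.append(cmd)  # back-off mode: grouped for later

    def flush(self) -> None:
        if self.pending:
            self._run(self.pending)
            self.pending = []

    def _run(self, cmds: List[str]) -> None:
        self.transactions += 1
        if not self.commit(list(cmds)):
            # The whole group is rejected, so record every command in it.
            self.rejected.extend(cmds)


if __name__ == "__main__":
    batcher = BackoffBatcher(commit=lambda cmds: True)
    for i in range(50_000):
        batcher.submit(f"ip prefix-list PL-{i} seq 5 permit any", now=0.0)
    batcher.flush()
    print(batcher.transactions, "transactions instead of 50000")
```

In this model a burst of 50k commands costs 101 transactions rather than 50k. It also shows the trade-off: a single failing command inside the grouped transaction puts every command from that group on the `rejected` list, which is why logging the rejected commands matters.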

donaldsharp added the triage Needs further investigation label Jun 30, 2020

rwestphal mentioned this issue Jul 13, 2020

lib: introduce configuration back-off timer for YANG-modeled commands #6727

Merged

qlyoung closed this as completed Aug 2, 2021

qlyoung added libyang performance and removed triage Needs further investigation labels Aug 2, 2021

pguibert6WIND mentioned this issue Apr 18, 2024

BGP CPU issue during route-map / community-list configuration / yang issue #15790

Closed

2 tasks
