Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[quagga bgp] set quagga graceful restart timeout to 180 seconds #2362

Merged
merged 1 commit into from
Dec 8, 2018

Conversation

yxieca
Copy link
Contributor

@yxieca yxieca commented Dec 7, 2018

- What I did
set quagga graceful restart timeout to 180 seconds

We need graceful restart timeout of 180 for warm reboot.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
@rodnymolina
Copy link
Contributor

@yxieca Can you please also adjust the default timer we are using at fpmsyncd level? Both timers should ideally match, at least till we have a synchronization mechanism between bgpd and fpmsyncd.

@zhenggen-xu
Copy link
Collaborator

@rodnymolina Since this only change quagga default timer, it might not be exactly right to change the fpmsyncd timer alone.
If we want to keep everything in sync, we should change FRR configuration to 120s too, and then fpmsyncd changes in a different PR against sonic-swss.

Or we can have timer configuration in the config-db per platform.

@nikos-github
Copy link
Collaborator

@rodnymolina @zhenggen-xu This is changing the timer for the peer. That's completely unrelated to the local fpmsyncd timer. The two timers must not have a requirement to match nor should they be compared against each other. The synchronization you are referring to has no bearing or influence to the peer restart timer the local system has sent.

@zhenggen-xu
Copy link
Collaborator

zhenggen-xu commented Dec 7, 2018

@nikos-github I think you are right, the two timers are not strongly related. I was thinking about FRR/Quagga consistence. The fpmsyncd timer was more for the convergence time not bgp down time. That should be tuned based on worst case of the routes learnt etc not necessarily the grace restart timer.

@rodnymolina
Copy link
Contributor

@nikos-github i don't agree with that. As you know, this timer is used as an estimation of the amount of time required (by each node) to re-establish the sessions with its peers. In this case, we are increasing this value to 180 secs, so this is the time that it may take (in the worst case scenario) for the restarting router to re-learn state back from helper nodes. In this case we don't want fpmsyncd reconciliation process to kick off before we have had a chance to receive all the pending state. This artificial correlation between bgp-gr timer and fpmsyncd-timer is only needed today coz, at fpmsyncd level, we don't have a deterministic way to identify when the "re-learning" phase has concluded. Once we have this missing glue i agree that both timers can run independently.

@zhenggen-xu Not sure i fully got your point, but looks like having separated per-platform/per-routing-stack values won't help in this case, as there seems to be a system fast-reboot limitation that is forcing us to increase this timer, and that will impact FRR in the same way that it affects Quagga. And yes, i agree that we will also need to change FRR values to be fully consistent.

@nikos-github
Copy link
Collaborator

@rodnymolina The artificial correlation you are making between the timers is not correct irrespective of a signal or not for EoR. We can discuss offline.

@sonic-net sonic-net deleted a comment from lguohan Dec 7, 2018
@lguohan
Copy link
Collaborator

lguohan commented Dec 8, 2018

@rodnymolina , I feel the two timers are not related. the gr timer is between the bgp shutdown and bgp session setup after reboot, the local timer starts after the bgp session setup after reboot.

@lguohan lguohan merged commit d9c076d into sonic-net:master Dec 8, 2018
@yxieca yxieca deleted the bgpd_conf branch December 8, 2018 21:33
@rodnymolina
Copy link
Contributor

@lguohan I agree/understand that both timers can potentially measure different things, but in the absence of a mechanism to sync-up bgp and fpmsyncd (through a EoR/EOIU message), i feel it's a good idea to have both timers being more and less in-sync. My point is specially valid for typical warm-reboot use-cases (daemon/docker restart), as in these scenarios both bgp-gr-timer and fpmsyncd-timer are going to be measuring similar things. On the other hand, i understand that this correlation is much weaker on the system-warm-reboot case, as both timers can/will diverge.

Question is, which case should we optimize for? I feel that warm-restart scenarios (daemon/docker restart) are much more frequent than system-warm-reboot ones, so i'd rather cover the first scenario as best as we can. And perhaps we could go even further: forget about the system-warm-reboot case altogether, so that we can set more reasonable bgp-gr timers (~30 secs) and reduce the suboptimal-routing window. If we are interesting in optimizing for the warm-restart case, bgp-gr-timer and fpmsyncd-timer values would need to be (more and less) in-sync.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants