Changes to heartbeat broke test TestLimitNumberOfConcurrentRemoteBootstraps #4328

Closed
hectorgcr opened this issue Apr 28, 2020 · 0 comments
Labels: area/docdb (YugabyteDB core features), kind/bug (This issue is a bug)

Comments

@hectorgcr
Contributor

hectorgcr commented Apr 28, 2020

Changes from #2236 caused the test TestLimitNumberOfConcurrentRemoteBootstraps to start failing because the number of concurrent remote bootstrap sessions was going above the number specified by the flag load_balancer_max_concurrent_tablet_remote_bootstraps.

After further investigation, we determined that the load balancer kept issuing AddServer requests even though one of the tservers was down. This was caused by a misconfigured flag in the test (tserver_unresponsive_timeout_ms). It's unclear why the test passed before the changes from #2236, but we know that the behavior after the changes is correct. So only the test needs to be fixed, by setting the flag tserver_unresponsive_timeout_ms to a value lower than follower_unavailable_considered_failed_sec.
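Note that the two flags use different units (milliseconds and seconds, going by their suffixes), so the comparison needs a conversion. A minimal sketch of the ordering the test relies on, assuming a hypothetical helper that is not part of YugabyteDB and using made-up values rather than the real defaults:

```python
# Hypothetical helper for illustration only; it does not exist in YugabyteDB.
def master_notices_before_raft_eviction(
    tserver_unresponsive_timeout_ms: int,
    follower_unavailable_considered_failed_sec: int,
) -> bool:
    """Return True if the master would consider the tserver unresponsive
    before its Raft peers consider it failed."""
    # Convert the seconds-based flag to milliseconds before comparing.
    failed_after_ms = follower_unavailable_considered_failed_sec * 1000
    return tserver_unresponsive_timeout_ms < failed_after_ms


# Illustrative values only; not the defaults and not the values used in the test.
print(master_notices_before_raft_eviction(3_000, 10))   # True
print(master_notices_before_raft_eviction(60_000, 10))  # False
```

The comparison is strict because the flag must be lower than the follower timeout, not equal to it.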

hectorgcr added the kind/bug (This issue is a bug) label Apr 28, 2020
hectorgcr self-assigned this Apr 28, 2020
rao-vasireddy added the area/docdb (YugabyteDB core features) label Apr 28, 2020
hectorgcr changed the title from "Changes to heartbeat mechanism caused a regression on the load balancer" to "Changes to heartbeat broke test TestLimitNumberOfConcurrentRemoteBootstraps" Apr 29, 2020
hectorgcr added a commit that referenced this issue May 6, 2020
Summary:
Before the changes from #2236, this test was succeeding even though the master
had not marked the paused tserver as failed. After the changes from #2236, the test
started failing. Why it passed before is unknown, but the new behavior is correct.
This diff addresses the issue by setting the flag tserver_unresponsive_timeout_ms
to be less than follower_unavailable_considered_failed_sec. It needs to be less
because otherwise, if the peer gets removed from the raft config before the master
considers this peer unresponsive, the balancer will start assigning tablets to the
peer even though it's considered failed by its peers. This will cause several
remote bootstraps (RBS) to accumulate if the peer were to come online soon after that.

Test Plan:
ybd release --cxx-test integration-tests_remote_bootstrap-itest --gtest_filter "*TestLimitNumberOfConcurrentRemoteBootstraps*" -n 100

Reviewers: nicolas, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8402

3 participants