Changes to heartbeat broke test TestLimitNumberOfConcurrentRemoteBootstraps #4328

hectorgcr · 2020-04-28T20:02:57Z

Changes from #2236 caused the test TestLimitNumberOfConcurrentRemoteBootstraps to start failing because the number of concurrent remote bootstrap sessions was going above the number specified by the flag load_balancer_max_concurrent_tablet_remote_bootstraps.

After further investigation, we determined that the load balancer kept issuing AddServer requests even though one of the tservers was down. This was caused by a misconfigured flag (tserver_unresponsive_timeout_ms). It's unclear why the test passed before the changes from #2236, but we know that the behavior after the changes is correct. So only the test needs to be fixed, by setting the flag tserver_unresponsive_timeout_ms to a value lower than follower_unavailable_considered_failed_sec.
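The two flags use different units (milliseconds and seconds, going by their suffixes), so the comparison needs a conversion. Below is a minimal sketch of the intended relationship. The helper name `master_detects_failure_first` and the sample values are hypothetical and used only for illustration; this is not code from the YugabyteDB tree.

```python
def master_detects_failure_first(
    tserver_unresponsive_timeout_ms: int,
    follower_unavailable_considered_failed_sec: int,
) -> bool:
    """Return True when the master's unresponsive timeout is shorter than the
    time after which Raft peers consider an unavailable follower failed.

    Hypothetical helper: it only compares the two flag values after
    converting them to the same unit (milliseconds).
    """
    follower_failed_ms = follower_unavailable_considered_failed_sec * 1000
    return tserver_unresponsive_timeout_ms < follower_failed_ms


# Illustrative values only, not the real defaults.
ok_config = master_detects_failure_first(5_000, 10)    # 5 s < 10 s
bad_config = master_detects_failure_first(10_000, 10)  # 10 s is not < 10 s
```

With a configuration for which the helper returns False, the follower can be evicted from the Raft config while the master still treats the tserver as live, which is the situation described in this issue.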


Summary: Before the changes from #2236, this test was succeeding even though the master had not marked the paused tserver as failed. After the changes from #2236, the test started failing. The reason is still unknown, but the new behavior is correct. This diff addresses the issue by setting the flag tserver_unresponsive_timeout_ms to be less than follower_unavailable_considered_failed_sec. It needs to be less because otherwise, if the peer gets removed from the Raft config before the master considers this peer unresponsive, the balancer will start assigning tablets to the peer even though its peers consider it failed. This would cause several remote bootstrap sessions (RBS) to accumulate if the peer were to come online soon after that.

Test Plan: ybd release --cxx-test integration-tests_remote_bootstrap-itest --gtest_filter "*TestLimitNumberOfConcurrentRemoteBootstraps*" -n 100

Reviewers: nicolas, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8402

hectorgcr added the kind/bug label Apr 28, 2020

hectorgcr self-assigned this Apr 28, 2020

rao-vasireddy added the area/docdb label Apr 28, 2020

hectorgcr changed the title ~~Changes to heartbeat mechanism caused a regression on the load balancer~~ Changes to heartbeat broke test TestLimitNumberOfConcurrentRemoteBootstraps Apr 29, 2020

bmatican closed this as completed Jun 11, 2020

