[DocDB] PITR: Restoration races with Index Backfill #12672

sanketkedia · 2022-05-26T16:50:15Z

Jira Link: DB-564

Description

Consider the following scenario:

Take a snapshot while an index backfill is in progress, so the table is in the SysTablesEntryPB::ALTERING state.
Restore to this snapshot. During the restore, the tablets of this table race between applying the restore and continuing the backfill.
For instance, in one case the RESTORE_ON_TABLET RPC (sent by the master to ask tservers to restore their tablets) carried the old schema, while the tablet had just been updated to a newer schema by backfill.

This issue also reproduces intermittently in our unit test YbAdminSnapshotScheduleTest.UndeleteIndex
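
To make the ordering problem concrete, here is a toy model of the race, not YugabyteDB code. The `Tablet` class, the `index_running` flag, the schema version numbers, and both `apply_*` functions are hypothetical names introduced for illustration. It assumes the restore request installs whatever schema it carries, even when the tablet already holds a newer one.

```python
from dataclasses import dataclass


@dataclass
class Tablet:
    """Toy stand-in for a tablet; tracks only what the race affects."""
    schema_version: int
    index_running: bool = False


def apply_backfill_schema(tablet: Tablet, new_version: int) -> None:
    # Backfill moves the tablet to a newer schema.
    tablet.schema_version = new_version
    tablet.index_running = True


def apply_restore_rpc(tablet: Tablet, restored_version: int) -> None:
    # Assumption: the restore request overwrites the schema unconditionally.
    tablet.schema_version = restored_version
    tablet.index_running = False


def run(order: list[str]) -> Tablet:
    """Apply the two events in the given order to a fresh tablet."""
    tablet = Tablet(schema_version=1)
    for step in order:
        if step == "restore":
            apply_restore_rpc(tablet, restored_version=1)
        elif step == "backfill":
            apply_backfill_schema(tablet, new_version=2)
    return tablet


good = run(["restore", "backfill"])  # restore lands first, backfill wins
bad = run(["backfill", "restore"])   # stale restore overwrites backfill
print(good)
print(bad)
```

In the second ordering the tablet ends up on the old schema with the index not running, which matches the symptom described above.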

…n backfill was in progress (for YCQL)

Summary: Currently, if we restore to a time when backfill was in progress, then after the sys catalog restore completes, index backfill resumes but races with the RESTORE_ON_TABLET RPCs over the schema. This can lead to a RESTORE_ON_TABLET RPC overwriting the schema that was updated by backfill, leaving the index in a non-running state forever.

This diff fixes it in the following manner:

1. Index backfill is resumed only after the RESTORE_ON_TABLET RPCs have finished. To resume, the master snapshot coordinator enqueues the tables for which backfill needs to resume, and catalog_manager_bg_tasks picks them up in its cycle and resumes backfill.
2. The restored schema is now sent only once, as part of the RESTORE_ON_TABLET RPC. Previously, we were also sending the schema in the HB path when processing tablet reports. That second RPC could still be in flight after all the RESTORE_ON_TABLET RPCs had finished, so it could race with backfill even if backfill is resumed only after the restore completes.
3. For colocated tables, we were not sending the restored schema in the RESTORE_ON_TABLET RPC and relied only on the HB path. While this works, it could race with backfill as outlined in (2). This diff fixes that as well.

Test Plan: ybd --cxx_test yb-admin-snapshot-schedule-test --gtest-filter YbAdminSnapshotScheduleTest.UndeleteIndexToBackfillTime

Reviewers: slingam, amitanand, sergei

Reviewed By: sergei

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D17841
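
The deferred-resume idea in item 1 of the fix can be sketched roughly as follows. This is a simplified model, not the actual master code: `SnapshotCoordinator`, `on_restore_rpc_done`, `bg_task_cycle`, and the tablet and table IDs are all hypothetical names. It only shows the shape of the approach, where tables are enqueued once the last restore request completes and a background cycle drains the queue.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Set


class SnapshotCoordinator:
    """Hypothetical model: tracks outstanding restore requests per tablet."""

    def __init__(self, tablets: Iterable[str], tables: Iterable[str],
                 resume_queue: Deque[str]) -> None:
        self.pending: Set[str] = set(tablets)
        self.tables: List[str] = list(tables)
        self.resume_queue = resume_queue

    def on_restore_rpc_done(self, tablet_id: str) -> None:
        self.pending.discard(tablet_id)
        # Enqueue tables only once every restore request has finished.
        if not self.pending and self.tables:
            self.resume_queue.extend(self.tables)
            self.tables = []  # guard against enqueueing twice


def bg_task_cycle(resume_queue: Deque[str],
                  resume_backfill: Callable[[str], None]) -> None:
    """One pass of a background task: resume backfill for queued tables."""
    while resume_queue:
        resume_backfill(resume_queue.popleft())


queue: Deque[str] = deque()
resumed: List[str] = []
coordinator = SnapshotCoordinator(["tablet-1", "tablet-2"], ["table-a"], queue)

coordinator.on_restore_rpc_done("tablet-1")
bg_task_cycle(queue, resumed.append)   # nothing queued yet
coordinator.on_restore_rpc_done("tablet-2")
bg_task_cycle(queue, resumed.append)   # backfill resumes for table-a
print(resumed)
```

Because backfill is only triggered from the queue, it cannot start while any restore request is still outstanding, provided no other path sends the restored schema (items 2 and 3).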

sanketkedia added the area/docdb (YugabyteDB core features) and status/awaiting-triage (Issue awaiting triage) labels May 26, 2022

sanketkedia self-assigned this May 26, 2022

yugabyte-ci added the kind/bug (This issue is a bug) and priority/medium (Medium priority issue) labels and removed the status/awaiting-triage (Issue awaiting triage) label May 26, 2022

vkulichenko mentioned this issue Jun 14, 2022

[docdb] PITR: Tracking issue #7120

Closed

sanketkedia closed this as completed Aug 25, 2022
