[DocDB] PITR: Restoration races with Index Backfill #12672

Closed
sanketkedia opened this issue May 26, 2022 · 0 comments
Labels: area/docdb (YugabyteDB core features), kind/bug (This issue is a bug), priority/medium (Medium priority issue)

sanketkedia commented May 26, 2022

Jira Link: DB-564

Description

Consider the following scenario:

  1. Take a snapshot while a backfill is in progress, so the table is in the SysTablesEntryPB::ALTERING state.
  2. Restore to this snapshot. During the restore, the tablets of this table race between performing the restore and backfilling.
  3. For instance, in one case the RESTORE_ON_TABLET RPC (sent by the master to ask tservers to restore their tablets) carried the old schema, while the tablet had just been updated to a newer schema by the backfill.

This issue also reproduces intermittently in our unit test YbAdminSnapshotScheduleTest.UndeleteIndex.
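
The failing interleaving in the scenario above can be illustrated with a toy model. This is only a sketch: `Tablet`, `apply_backfill_alter`, `apply_restore_rpc`, and `run` are hypothetical names invented for the illustration, and the model assumes the restore request installs its schema unconditionally. It is not YugabyteDB code.

```python
from dataclasses import dataclass


@dataclass
class Tablet:
    """Toy stand-in for a tablet that tracks only its schema version."""
    schema_version: int


def apply_backfill_alter(tablet: Tablet, new_version: int) -> None:
    # Backfill moves the tablet forward to a newer schema version.
    tablet.schema_version = max(tablet.schema_version, new_version)


def apply_restore_rpc(tablet: Tablet, restored_version: int) -> None:
    # Assumption: the restore request carries the schema captured at
    # snapshot time and installs it without checking the current version.
    tablet.schema_version = restored_version


def run(order):
    """Apply the two operations in the given order; return the final version."""
    tablet = Tablet(schema_version=1)
    for step in order:
        if step == "backfill":
            apply_backfill_alter(tablet, 2)
        elif step == "restore":
            apply_restore_rpc(tablet, 1)
    return tablet.schema_version
```

In this model, `run(["restore", "backfill"])` ends at the newer schema, while `run(["backfill", "restore"])` leaves the tablet on the old schema even though backfill already ran, which corresponds to the bad outcome described in step 3.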

sanketkedia added the area/docdb (YugabyteDB core features) and status/awaiting-triage (Issue awaiting triage) labels May 26, 2022
sanketkedia self-assigned this May 26, 2022
yugabyte-ci added the kind/bug (This issue is a bug) and priority/medium (Medium priority issue) labels and removed the status/awaiting-triage (Issue awaiting triage) label May 26, 2022
sanketkedia added a commit that referenced this issue Aug 25, 2022
…n backfill was in progress (for YCQL)

Summary:
Currently, if we restore to a time when a backfill was in progress, then after the sys catalog restore completes,
index backfill resumes, but it races with the RESTORE_ON_TABLET RPCs over the schema. A RESTORE_ON_TABLET RPC
could overwrite the schema that was updated by the backfill, leaving the index in a non-running state forever.
This diff fixes it in the following manner:

1. Index backfill is resumed only after the RESTORE_ON_TABLET RPCs have finished. To resume it, the
master snapshot coordinator enqueues the tables whose backfill needs to resume, and
catalog_manager_bg_tasks picks them up in its cycle and resumes the backfill.
2. The restored schema is now sent only once, as part of the RESTORE_ON_TABLET RPC. Previously, we
also sent the schema in the HB (heartbeat) path when processing tablet reports. That second RPC could
therefore still be in flight after all the RESTORE_ON_TABLET RPCs had finished, and it could race with
backfill even if backfill is resumed only after the restore completes.
3. For colocated tables, we were not sending the restored schema in the RESTORE_ON_TABLET RPC and
relied only on the HB path. While this works, it could race with backfill as outlined in (2). This diff fixes that.

Test Plan:
ybd --cxx_test yb-admin-snapshot-schedule-test --gtest-filter YbAdminSnapshotScheduleTest.UndeleteIndexToBackfillTime

Reviewers: slingam, amitanand, sergei

Reviewed By: sergei

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D17841
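
The ordering described in point (1) of the commit summary can be sketched as a small model: backfill tables are enqueued only once every restore request has completed, and a periodic background cycle drains the queue. All names here (`RestoreCoordinator`, `on_restore_rpc_done`, `run_bg_task_cycle`) are hypothetical and chosen for illustration. The actual master snapshot coordinator and catalog_manager_bg_tasks code is not shown in this issue.

```python
from collections import deque


class RestoreCoordinator:
    """Hypothetical model of the ordering in the fix, not YugabyteDB code."""

    def __init__(self, tablet_ids, tables_with_pending_backfill):
        # Tablets whose restore request has not completed yet.
        self.pending_restores = set(tablet_ids)
        # Tables that were mid-backfill at the restore time.
        self.tables_with_pending_backfill = list(tables_with_pending_backfill)
        # Queue that the background task drains on each cycle.
        self.backfill_queue = deque()

    def on_restore_rpc_done(self, tablet_id):
        self.pending_restores.discard(tablet_id)
        # Enqueue backfill work only after the last restore request finishes.
        if not self.pending_restores and self.tables_with_pending_backfill:
            self.backfill_queue.extend(self.tables_with_pending_backfill)
            self.tables_with_pending_backfill = []

    def run_bg_task_cycle(self):
        """Return the tables whose backfill is resumed in this cycle."""
        resumed = []
        while self.backfill_queue:
            resumed.append(self.backfill_queue.popleft())
        return resumed
```

With two tablets and one table, the background cycle resumes nothing until both tablets have reported completion. After that, the next cycle picks up the table exactly once.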