
[YSQL] Provide user with option to avoid kReadRestart error with extra cost even if statement's output exceeds ysql_output_buffer_size #20336

Closed
1 task done
pkj415 opened this issue Dec 18, 2023 · 4 comments
Assignees
Labels
2024.1 Backport Required area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue

Comments

@pkj415
Contributor

pkj415 commented Dec 18, 2023

Jira Link: DB-9323

Description

Provide the user with the ability to ensure no kReadRestart errors are thrown in a READ COMMITTED txn even if statement's output exceeds ysql_output_buffer_size (gflag with default of 256KB).

The distributed nature of YugabyteDB means that clock skew can be present between nodes. This clock skew can sometimes result in an unresolvable ambiguity as to whether a version of data should or should not be part of a read in snapshot-based transaction isolation levels (i.e., repeatable read & read committed). The database works around this ambiguity by retrying the read with a newer snapshot whenever possible. Such retries are present both in the tserver and the query layer. However, there are situations where a retry can't be done (say, if part of the query's response has already been sent to the user while the query is still reading more data from docdb); in such cases we can't transparently move the read's snapshot (i.e., the read time) ahead, and YugabyteDB throws a kReadRestart error to the external client (see more details in https://docs.yugabyte.com/preview/architecture/transactions/read-committed/#read-restart-errors).

Some customers use read committed isolation and have reads whose response size exceeds the query-layer-to-client output buffer size. This leads to situations where the database cannot retry internally. See #11572, which is similar to this issue.

Solution

We should provide the user with a syntactic option to pay some extra cost in exchange for a guarantee that the kReadRestart error doesn't occur.

There are various ways to pay the extra cost. We are going forward with providing a GUC to relax the guarantee (option 1 below); this GitHub issue tracks that work:

  1. Relax the guarantee that read restart ambiguity checking provides: a read will be able to see all data that was committed before it as per global wall-clock time (or true time). Relaxing this guarantee means we could have the following situation: user X makes a post on a social media app, makes a phone call to user Y to tell them about the post, and user Y tries to read the post within 500 ms. If clock skew is large, user Y might not see user X's post.
  2. Don't relax the guarantee, but pay an extra latency penalty (see the three options in the alternative solutions below)

Extra enhancements:

  1. Don't relax the guarantee for small reads that fit within ysql_output_buffer_size; relax it only once the output exceeds the buffer. ([YSQL] Add an option to defer long reads in high precision clock scenarios #21725)
  2. Pick the ceiling of the uncertainty window (i.e., the global limit) as soon as a query enters YSQL. This can slightly reduce the probability of hitting read restart ambiguity windows ([YSQL] Pick global limit as soon as the query arrives at the YSQL backend process. #21961).

Ways to pay the penalty without relaxing the guarantee:

  1. Pick the read time as the maximum of the current hybrid time across all nodes. This requires the query layer to issue one round of RPCs to all nodes.
    Counter point: This solution is similar to having a timestamp oracle. Coordinating all nodes in the cluster leads to unacceptably high latency.

  2. Pick the global limit as current hybrid time + max_clock_skew_usec, then sleep until the current hybrid time crosses the global limit, and set the read time to this global limit. We already do something like this for SERIALIZABLE READ ONLY DEFERRABLE.
    Counter point: Maximum clock skew is quite high at the moment, so waiting it out incurs a high latency, which is again unacceptable.
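The wait in option 2 can be sketched as follows. This is a minimal toy model, not YugabyteDB code: `now_usec` stands in for the local hybrid clock, and the 500 ms constant mirrors the default `max_clock_skew_usec` only for illustration.

```python
import time

MAX_CLOCK_SKEW_USEC = 500_000  # illustrative; mirrors the default max_clock_skew_usec

def now_usec() -> int:
    return time.monotonic_ns() // 1_000

def pick_read_time_with_wait(max_skew_usec: int = MAX_CLOCK_SKEW_USEC) -> int:
    """Option 2: fix the global limit first, then wait until the clock crosses it."""
    global_limit = now_usec() + max_skew_usec
    while now_usec() < global_limit:
        # Sleep out the remaining gap; once the clock passes global_limit, any
        # write committed before the query arrived has a timestamp <= read time.
        time.sleep((global_limit - now_usec()) / 1_000_000)
    return global_limit  # read time == global limit: empty uncertainty window
```

With the default skew this adds up to 500 ms of latency per read, which is why the counter point above calls it unacceptable.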

2a. To avoid waiting for max_clock_skew_usec, we can keep real-time track of the clock skew in the cluster and compute a tighter upper bound on the skew at any instant by assuming a safe, user-configurable max_clock_drift_usec per time unit. Call this upper bound current_clock_skew_upper_bound. Then, using the same logic as option 2, pick the global limit as current hybrid time + current_clock_skew_upper_bound and wait out this interval before picking the read time.
Counter point: Can be done in a later stage; this solution requires plenty more experimentation.
GH Issue: #21962
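The bound in option 2a might look like the following sketch. The function name and the linear drift model are assumptions for illustration; the actual experiment tracked in #21962 may differ.

```python
# Hypothetical sketch of option 2a: grow the last measured cluster-wide skew
# by an assumed maximum drift rate, instead of always assuming the worst-case
# max_clock_skew_usec.
def clock_skew_upper_bound(last_measured_skew_usec: int,
                           usec_since_measurement: int,
                           max_drift_usec_per_sec: int) -> int:
    # The longer since the last measurement, the more the clocks may have
    # drifted apart; add that worst-case drift to the measured skew.
    drift = max_drift_usec_per_sec * usec_since_measurement // 1_000_000
    return last_measured_skew_usec + drift
```

The wait from option 2 would then use this (usually much smaller) bound in place of max_clock_skew_usec.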

2b. Perhaps, for AWS clusters that use the AWS Time Sync Service and achieve microsecond-level clock skew, we could set the gflag max_clock_skew_usec to a much smaller number and wait out the clock skew.
GH Issue: #21963

Issue Type

kind/enhancement

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@pkj415 pkj415 added area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage labels Dec 18, 2023
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue labels Dec 18, 2023
@pkj415 pkj415 changed the title [YSQL] Provide user with option to avoid kReadRestart error with extra cost even if statement's output exceeds ysql_output_buffer_size (gflag with default of 256KB). [YSQL] Provide user with option to avoid kReadRestart error with extra cost even if statement's output exceeds ysql_output_buffer_size Dec 18, 2023
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Jan 3, 2024
@mbautin
Collaborator

mbautin commented Jan 18, 2024

Pick the global limit as current hybrid time + max_clock_skew_usec, then sleep until current hybrid time crosses global limit and set the read time as this global limit. We already do something like this for SERIALIZABLE READ ONLY DEFERRABLE.

Typically we don't have to wait until current hybrid time crosses that threshold -- we could issue the reads and the waiting will happen as part of waiting for a particular safe time on tablet servers. This way we can overlap the wait for read time with sending RPCs.

@basavaraj29
Contributor

Proposed implementation approach: while picking the read time (ReadHybridTime), if the global_limit is set to the read time of the picked ReadHybridTime, the underlying db code won't throw a read restart error (since the ambiguity interval shrinks to 0).

@rthallamko3
Contributor

Assigning to @sushantrmishra to identify the assignee, as the recently brainstormed approach doesn't involve any DocDB work.

pao214 added a commit that referenced this issue May 28, 2024
…s Guarantees

Summary:
### Objective

Read restart errors are a distributed-database-specific error scenario that does not occur in a single-node database such as PostgreSQL. These errors occur due to clock skew, usually when there are reads with simultaneous writes to the same data (refer to https://docs.yugabyte.com/preview/architecture/transactions/read-restart-error/ for details).

Read restart errors are thrown to maintain the "read-after-commit-visibility" guarantee: any client issued read should see all data that was committed before the read request was issued (even in the presence of clock skew between nodes). In other words, the following example should always work:
(1) User X commits some data (for which the db picks a commit timestamp say ht1)
(2) Then user X communicates to user Y to inform about the commit via a channel outside the database (say a phone call)
(3) Then user Y issues a read to some YB node which picks a read time (less than ht1 due to clock skew)
(4) Then it should not happen that user Y gets an output without the data that user Y was informed about.

To ensure this guarantee, when the database performs a read at read_time, it picks a global_limit (= read_time + max_clock_skew_us). If it finds any matching records for the read in the window (read_time, global_limit], there is a chance for the above guarantee to be broken. In this case docdb throws a kReadRestart error.
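The check above can be modeled in a few lines. This is a toy model, not docdb code: records whose commit times fall in the window (read_time, global_limit] are ambiguous under clock skew, so the read cannot decide whether they committed before the client's request.

```python
# Toy model of the read-restart check: scan commit times against the
# uncertainty window (read_time, global_limit].
def read_visible(record_commit_times, read_time, max_clock_skew):
    global_limit = read_time + max_clock_skew
    for ct in record_commit_times:
        if read_time < ct <= global_limit:
            # Ambiguous record: it may or may not have committed before the
            # client issued the read, so give up rather than guess.
            raise RuntimeError("kReadRestart")
    # Only records at or below the read time are visible.
    return sorted(ct for ct in record_commit_times if ct <= read_time)
```

In the real system the error is usually absorbed by an internal retry at a newer read time; it only reaches the client when such a retry is impossible.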

However, users migrating from PostgreSQL are surprised by this error. Moreover, some users may not be in a position to change their application code to handle this new error scenario.

The number of kReadRestart errors thrown to the external client is currently reduced by retrying the transaction/statement at the query layer or at the docdb layer. A retry at the docdb layer is possible when this is the first RPC in a transaction/statement and no read time has been picked yet in the query layer.

The query layer retries have the following limitations:
- They are limited by `ysql_output_buffer_size`: if the YSQL-to-client buffer fills up and some data was already sent to the client, YSQL can't retry the whole query at a new read point.
- They sometimes have higher tail latency, leading to statement timeouts or retry exhaustion.
- kReadRestart errors are not retried for statements other than the first one in a transaction block in Repeatable Read isolation.

This change aims to provide users with the opposite tradeoff: sacrificing the read-after-commit-visibility guarantee for ease of use.

### Design

Minimizing read restart errors is a multi-stage plan and here we take the first step.

Provide users with a GUC `yb_read_after_commit_visibility` to relax the guarantee.

Configurable Options:
1. strict
   * Default.
   * Same behavior as current.
2. relaxed
   * Ignores clock skew and sets the global_limit to be the same as read_time, thus ignoring the uncertainty window.
   * Pick read time on the query layer and not on the storage layer. This is necessary so that users do not miss commits from their own session. That would be bad.

For simplicity, the relaxed option does not affect transactions unless they are marked **READ ONLY**. Handling generic transactions is more involved because of read-write operations. This may be handled in a future change. Moreover, we ignore DDLs and catalog requests for the purposes of this revision.
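The effect of the GUC on the uncertainty window can be summarized in one function. This is an illustrative sketch with hypothetical names, not the actual implementation; it only encodes the rules stated above (relaxed mode collapses the window, and only for READ ONLY work).

```python
# Sketch of how yb_read_after_commit_visibility affects the window ceiling.
def compute_global_limit(read_time: int, max_clock_skew: int,
                         visibility: str, is_read_only: bool) -> int:
    if visibility == "relaxed" and is_read_only:
        # Window (read_time, read_time] is empty: no kReadRestart possible.
        return read_time
    # strict mode, or any read-write work: keep the full uncertainty window.
    return read_time + max_clock_skew
```

Note how a relaxed read-write transaction still gets the full window, matching the READ ONLY restriction described in the paragraph above.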

In the next section, we discuss the semantics of the relaxed option.

### Relaxed Semantics

In this section, we discuss what guarantees can be retained even in the relaxed mode.

(1) Same Session Guarantees

Reads never miss writes from their own session.

|//conn1//|
|--------|
|INSERT |
|SELECT |     <--- always observes the preceding DML statements.

Providing this guarantee is less obvious than one would think.

(1a) The read time of SELECT should not be lower than the commit time of the preceding INSERT operation.

The insert itself may pick its commit time at any node in the distributed database. However, the hybrid time is propagated back to the local proxy. As a result, the SELECT statement's read time will be higher than the preceding commit time as long as the read time is picked on local proxy, i.e. we do not pick the read time on some remote docdb.

Corollary 1a: read time of read only queries must be picked on local proxy whenever we relax the yb_read_after_commit_visibility guarantee.

Tested in **PgReadAfterCommitVisibilityTest.SameSessionRecency**.
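The propagation argument in 1a can be sketched with a toy clock. This is a simplified model (hybrid time reduced to a single integer, names invented): every RPC response carries the remote node's hybrid time and the local proxy advances to the max, so a read time picked on the proxy cannot precede the commit time of the session's own earlier write.

```python
# Toy model of hybrid time propagation back to the local proxy.
class HybridClock:
    def __init__(self, physical: int):
        self.now = physical

    def update(self, remote_ht: int):
        # Hybrid time propagation on an RPC response: never move backwards.
        self.now = max(self.now, remote_ht)

    def pick_read_time(self) -> int:
        return self.now

proxy = HybridClock(physical=100)
proxy.update(remote_ht=150)           # INSERT committed at ht 150 on a remote node
assert proxy.pick_read_time() >= 150  # SELECT on this proxy sees the INSERT
```

This is exactly why corollary 1a requires the read time to be picked on the local proxy: a remote docdb node's safe time has no such relationship to the session's history.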

(1b) Even if, hypothetically, we were to pick a read time on DocDB despite corollary 1a, that would lead to another problem: DocDB picks safe time as the read time. This is potentially a time in the past and might be smaller than the commit time of the INSERT before the SELECT. So ignoring the uncertainty window on docdb might lead to the SELECT not seeing the prior INSERT from the same connection.

Corollary 1b: Do not ignore the uncertainty window when the read time is picked on the storage layer.
This cannot happen with read-only statements & transactions since we always pick read time on the local proxy.

Tested in **PgSingleTServerTest.NoSafeTimeClampingInRelaxedReadAfterCommit**.

(1c) Server side connection pooling should not sacrifice the above same session guarantee.

Since
- server side connection pooling multiplexes connections only within the same node, and
- there is a common proxy tserver across all pg connections on the node,
we are guaranteed to see commits within the same session even with server-side connection pooling in effect.

Tested in **PgReadAfterCommitVisibilityTest.SamePgNodeRecency**.

Client-side connection pooling is out of scope for discussion (especially in the case of node failures, smart drivers, etc).

(2) Different Session guarantees

Relaxed mode does not provide read-after-commit-visibility guarantee with writes from a different session. We still have good consistency guarantees, nonetheless.

(2a) The first guarantee is consistent prefix.

| //conn1// | //conn2// |
|--------|--------|
| ...  | INSERT 1 |
| ... | INSERT 2 |
| SELECT | ... |

First things first, the SELECT statement on conn1 need not observe the `INSERT 2` statement on conn2 even though the insert happens before the SELECT in real time. This may happen in a distributed database because of clock skew between different machines (and no uncertainty window).

Next, if SELECT does observe INSERT 2, it must also observe INSERT 1 (and all the preceding statements). This is the consistent prefix guarantee and is maintained by the fact that INSERT 2 will always have a higher commit time than INSERT 1.

(2b) Monotonic Reads

| //conn1// | //conn2// |
|--------|--------|
| ... | INSERT 1 |
| ... | INSERT 2 |
| SELECT 1 | ... |
| SELECT 2 | ... |

A closely related consistency guarantee is monotonic reads. If SELECT 1 observes INSERT 1, then SELECT 2 also observes INSERT 1 (and possibly more, such as INSERT 2). This is because SELECT 2 has a higher read time than SELECT 1: read time increases monotonically within a session. Note that this would not be the case if we let SELECT pick the read time on the storage layer instead of forcing it to be picked on the proxy. Explanation: safe time is not //necessarily// affected by the most recent hybrid time propagation, since it is potentially a time in the past.

(2c) Bounded Staleness

| //conn1// | //conn2// |
|--------|--------|
| ... | INSERT 1 |    <--- 500ms old
| ... | INSERT 2 |
| SELECT 1 | ... |

This is the most intuitive property. Since physical clocks do not skew by more than max_clock_skew_usec, the SELECTs always see INSERTs that are older than max_clock_skew_usec.

In practice, the staleness bound is even lower, since the skew between hybrid times (not physical times) across machines is the more relevant metric here. Hybrid times stay close across nodes because of the regular exchange of messages among yb-tservers and yb-master.

Tested in **PgReadAfterCommitVisibilityTest.SessionOnDifferentNodeStaleRead** and **PgReadAfterCommitVisibilityTest.SessionOnDifferentNodeBoundedStaleness**.
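The bounded-staleness argument in 2c can be checked with simple arithmetic. This sketch uses invented names and an illustrative skew constant: with physical clock skew at most max_skew, a write committed more than max_skew ago has a commit time below every node's current clock, so even a relaxed read (whose global_limit equals its read time) observes it.

```python
# Sketch of the bounded-staleness bound under relaxed visibility.
def relaxed_read_sees(commit_time_usec: int, reader_clock_usec: int) -> bool:
    # Relaxed mode: global_limit == read_time == the reader's current clock,
    # so visibility is a plain comparison against the reader's clock.
    return commit_time_usec <= reader_clock_usec

MAX_SKEW_USEC = 500_000                        # illustrative skew bound
writer_clock = 10_000_000
reader_clock = writer_clock - MAX_SKEW_USEC    # reader lags by the full skew
old_commit = writer_clock - MAX_SKEW_USEC - 1  # committed more than max_skew ago
assert relaxed_read_sees(old_commit, reader_clock)
```

A commit younger than the skew bound, by contrast, may land above the lagging reader's clock and be missed, which is precisely the relaxation being sold here.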

(3) Thematic worst-case scenario

Here, we discuss the type of workload that is most susceptible to stale reads.

For a stale read to occur,
- The read must touch a node with a higher time than the pg connection's. This is more likely when the read touches many nodes.
- The writes don't touch enough nodes to ensure hybrid time is propagated to the query layer of the node that performs the read. This happens when the writes are single-row inserts/updates.

Therefore, thematically, the relaxed option is most susceptible to missing recent writes when high-throughput single-row DML runs concurrently with a long read that touches many rows.

Backport-through: 2024.1

**Upgrade/Rollback safety:**

Fortunately, the only change in proto files is in pg_client.proto.
pg_client.proto is used exclusively for communication between postgres and local tserver proxy layer.

During upgrades, once a node is upgraded, both Pg and the local tserver are upgraded; therefore, both understand this new field.

Moreover, even though the read behavior is changed in the new relaxed mode, it is only changed for upgraded nodes. Non upgraded nodes do not require any knowledge of changes in the upgraded nodes because the existing interface between the query and storage layers works well to support this new feature.

No auto flags are necessary.
Jira: DB-9323

Test Plan:
Jenkins

### Basic Functional Tests

**In TestPgTransparentRestarts**

1. When yb_read_after_commit_visibility is strict,

Long reads that exceed the ysql_output_buffer_size threshold raise a read restart error to the client since they cannot be handled transparently.

```
./yb_build.sh --java-test TestPgTransparentRestarts#selectStarLong
```

2. When yb_read_after_commit_visibility is relaxed,

For read only txns/stmts, we silently ignore read restart errors.

```
./yb_build.sh --java-test TestPgTransparentRestarts#selectStarLong_relaxedReadAfterCommitVisibility
```

3. For execution of prepared statements, relaxed mode must be set before the execute command and not necessarily before the prepare command.

```
./yb_build.sh --java-test TestPgTransparentRestarts#selectStarLongExecute_relaxedReadAfterCommitVisibility
```

4. We raise no read restart errors even after transaction restarts due to conflicts. This is because we relax the guarantee only for read-only queries/transactions, and read-only ops do not run into transaction conflicts.

### Semantics Tests

**In PgReadAfterCommitVisibilityTest**

1. Same session read-after-commit-visibility guarantee.

```
./yb_build.sh  --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.SameSessionRecency
```

2. Same node read-after-commit-visibility guarantee.

```
./yb_build.sh  --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.SamePgNodeRecency
```

3. Sessions connecting to Pg on different nodes - bounded staleness guarantee.

```
./yb_build.sh  --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.SessionOnDifferentNodeStaleRead
./yb_build.sh  --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.SessionOnDifferentNodeBoundedStaleness
```

### Sanity Tests

1. Guard ourselves against this scenario

- Read time is picked on docdb.
- Read uncertainty window is ignored.
- The picked time is safe time, which is a time before the previous statement in the same session.
- We miss recent updates from the same session because the read time is in the past.

```
./yb_build.sh --java-test PgSingleTServerTest.NoSafeTimeClampingInRelaxedReadAfterCommit
```

2. We never ignore read uncertainty window (thus do not relax read-after-commit-visibility guarantee) with

- INSERT/UPDATE/DELETE
- inserts/updates in WITH clause

```
./yb_build.sh  --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.NewSessionDuplicateInsertCheck
./yb_build.sh  --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.NewSessionUpdateKeyCheck
./yb_build.sh  --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.NewSessionDeleteKeyCheck
./yb_build.sh  --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.NewSessionDmlHidden
```

3. We also avoid this scenario (because relaxed mode does not affect INSERTs)

- Two concurrent inserts to the same key - both single shard.
- The read time of the insert is picked on the local proxy (because this is in relaxed mode).
- This read time is used for checking transaction conflicts happening to the same key.
- However, the conflict resolution step on the RegularDB is skipped in single-shard inserts, see GitHub issue #19407.

### Read Time Tests

```
./yb_build.sh --cxx-test pgwrapper_pg_read_time-test --gtest_filter PgReadTimeTest.CheckRelaxedReadAfterCommitVisibility
```

Reviewers: pjain, smishra, rthallam

Reviewed By: pjain

Subscribers: rthallam, hsunder, tnayak, yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D34002
pao214 added a commit that referenced this issue Jun 27, 2024
… Bounded Staleness Guarantees

Summary:
Original commit: 2724346 / D34002

Reviewers: pjain, smishra, rthallam

Reviewed By: pjain

Subscribers: ybase, yql, tnayak, hsunder, rthallam

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36197
@pao214
Contributor

pao214 commented Jun 27, 2024

Landed the change in 2724346 on master and 025c231 on 2024.1

@pao214 pao214 closed this as completed Jun 27, 2024