Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DocDB] LightweightMessage sets max size incorrectly when parsing #21352

Closed
1 task done
es1024 opened this issue Mar 7, 2024 · 0 comments
Closed
1 task done

[DocDB] LightweightMessage sets max size incorrectly when parsing #21352

es1024 opened this issue Mar 7, 2024 · 0 comments
Assignees
Labels
2.20 Backport Required 2024.1 Backport Required area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue

Comments

@es1024
Copy link
Contributor

es1024 commented Mar 7, 2024

Jira Link: DB-10251

Description

LightweightMessage currently sets a maximum size for reading to rpc_max_message_size (default 255 MB), but Preparer batches based on protobuf_message_total_bytes_limit (default 511 MB). This results in cases where under default settings, we may have protobufs between 255 MB and 511 MB, which LightweightMessage::ParseFromSlice is unable to read.

When these protobufs are written to the WAL, errors like the following may occur afterwards:

Found a corruption in a closed log segment: OK
Error: Corruption (yb/consensus/log_util.cc:965): Log file corruption detected.: Failed to parse PB at offset: 26013423, length: 303149529. Cause: Corruption (yb/rpc/lightweight_message.cc:376): Failed to parse ‘entry’: Failed trying to read batch #5 at offset 26013423 for log segment /mnt/d0/yb-data/tserver/wals/table-12345678901234567890123456789012/tablet-12345678901234567890123456789012/wal-000003000: ...

where the length in the error is larger than rpc_max_message_size.

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@es1024 es1024 added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Mar 7, 2024
@es1024 es1024 self-assigned this Mar 7, 2024
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue status/awaiting-triage Issue awaiting triage and removed status/awaiting-triage Issue awaiting triage labels Mar 7, 2024
es1024 added a commit that referenced this issue Sep 4, 2024
Summary:
LightweightMessage currently sets a maximum size for reading to `rpc_max_message_size`
(default 255 MB), but Preparer batches based on `protobuf_message_total_bytes_limit` (default
511 MB). This results in cases where under default settings, we may have protobufs between 255 MB
and 511 MB, which `LightweightMessage::ParseFromSlice` is unable to read.

This diff changes the limit to use `protobuf_message_total_bytes_limit`, so that protobufs between
`rpc_max_message_size` and `protobuf_message_total_bytes_limit` can be parsed properly.

This addresses errors such as:
```
Found a corruption in a closed log segment: OK
Error: Corruption (yb/consensus/log_util.cc:965): Log file corruption detected.: Failed to parse PB at offset: 26013423, length: 303149529. Cause: Corruption (yb/rpc/lightweight_message.cc:376): Failed to parse ‘entry’: Failed trying to read batch #5 at offset 26013423 for log segment /mnt/d0/yb-data/tserver/wals/table-12345678901234567890123456789012/tablet-12345678901234567890123456789012/wal-000003000: ...
```
(length larger than 255 MB) when such protobufs are written to WALs.

This also fixes cause of flakiness for TabletPeerTest.MaxRaftBatchProtobufLimit in TSAN builds.

Jira: DB-7654, DB-10251

Test Plan:
Jenkins.

Added test:
```
yb_build.sh --cxx-test rpc_lwproto-test --gtest_filter LWProtoTest.BigMessage
```

Also ran TabletPeerTest.MaxRaftBatchProtobufLimit 100x on Jenkins.

Reviewers: sergei, qhu

Reviewed By: qhu

Subscribers: yyan, bogdan, rthallam, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D33041
jasonyb pushed a commit that referenced this issue Sep 6, 2024
Summary:
 c587efd [docs] minor edit (#23796)
 31e09f3 [PLAT-15029] yba installer split data and software directory setup
 f5ba17d [PLAT-15039]: Fix bootstrap on bi-directional xCluster config creation
 578248a [#23770] YSQL: Deterministically populate catalog cache in tests with Connection Manager enabled
 7b1f22a [#23799] test: Fixed PgTableSizeTest.SharedTableSize test for pg15
 788434a [#18771, #21352] docdb: Fix LightweightMessage max size when parsing
 02ced43 [#22821] YSQL: Preserve local limit in a multi-page read
 50ff737 [#23741] docdb: Fix cloning of colocated databases with only parent table
 Excluded: 9889df7 [#23706] YSQL: Add table-level catcache Prometheus metrics
 c770d79 [#23747] MetaCache: Callback should not be called while holding the lock
 Excluded: 40689bc [#22150] YSQL, QueryDiagnostics:  EXPLAIN (ANALYZE, DIST) support for queryDiagnostics
 1655e69 [PLAT-15148]: Set XCluster Table Status to DroppedFromTarget if table in replication is dropped from target only
 16262f7 [#22519] YSQL: Simplify API of the ExplicitRowLockBuffer class
 6614afb [PLAT-14958][PLAT-14959] Make ssh fields optional if skipProvisioning is true
 1153b56 [PLAT-14867] Make sure restart alerts don't trigger for small time updates during NTP sync
 bf1c7bc [PLAT-12226] Add connection pooling status to universe health check
 38d8ae8 [PLAT-14805]Support adding EAR configs
 a180bef [#19134] YSQL, ASH: Setting ASH circular buffer size based on the number of cores
 7d8fc76 Adjust heading link (#23807)
 4c6cf5a [PLAT-4899]Basic validation of certificates
 f24eb10 [#23787] YSQL: Avoid executing conn mgr guc variables hooks for parallel workers
 ee18df8 [PLAT-13921] [K8] [UI] Universe action tasks are disabled after a failed shrink rr node task
 Excluded: dcf1821 [#23797] YSQL: Modify some tests to run in single connection mode with Connection Manager
 0ac22cd [Docs] Remove Drift chat bot (#23802)
 0e91003 [#22825] DocDB: Vector Index General Read Path with DummyANN
 e8f09b5 [PLAT-15175] Make runtime conf for skipping cluster consistency check public
 ee479ee Versionwarning (#23781)
 a05c6a3 [DB-12681] yugabyted-ui: Add Voyager commands to different Voyager phases in the UI.
 cc80d59 [#23777] yugabyted: updating the pg parity testcase to reflect the new gflags enabled for the pg parity feature.

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D37822
es1024 added a commit that referenced this issue Sep 6, 2024
…ize when parsing

Summary:
Original commit: 788434a / D33041
LightweightMessage currently sets a maximum size for reading to `rpc_max_message_size`
(default 255 MB), but Preparer batches based on `protobuf_message_total_bytes_limit` (default
511 MB). This results in cases where under default settings, we may have protobufs between 255 MB
and 511 MB, which `LightweightMessage::ParseFromSlice` is unable to read.

This diff changes the limit to use `protobuf_message_total_bytes_limit`, so that protobufs between
`rpc_max_message_size` and `protobuf_message_total_bytes_limit` can be parsed properly.

This addresses errors such as:
```
Found a corruption in a closed log segment: OK
Error: Corruption (yb/consensus/log_util.cc:965): Log file corruption detected.: Failed to parse PB at offset: 26013423, length: 303149529. Cause: Corruption (yb/rpc/lightweight_message.cc:376): Failed to parse ‘entry’: Failed trying to read batch #5 at offset 26013423 for log segment /mnt/d0/yb-data/tserver/wals/table-12345678901234567890123456789012/tablet-12345678901234567890123456789012/wal-000003000: ...
```
(length larger than 255 MB) when such protobufs are written to WALs.

This also fixes cause of flakiness for TabletPeerTest.MaxRaftBatchProtobufLimit in TSAN builds.

Jira: DB-7654, DB-10251

Test Plan:
Jenkins.

Added test:
```
yb_build.sh --cxx-test rpc_lwproto-test --gtest_filter LWProtoTest.BigMessage
```

Also ran TabletPeerTest.MaxRaftBatchProtobufLimit 100x on Jenkins.

Reviewers: sergei, qhu

Reviewed By: qhu

Subscribers: yyan, bogdan, rthallam, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D37781
es1024 added a commit that referenced this issue Sep 6, 2024
…e when parsing

Summary:
Original commit: 788434a / D33041
LightweightMessage currently sets a maximum size for reading to `rpc_max_message_size`
(default 255 MB), but Preparer batches based on `protobuf_message_total_bytes_limit` (default
511 MB). This results in cases where under default settings, we may have protobufs between 255 MB
and 511 MB, which `LightweightMessage::ParseFromSlice` is unable to read.

This diff changes the limit to use `protobuf_message_total_bytes_limit`, so that protobufs between
`rpc_max_message_size` and `protobuf_message_total_bytes_limit` can be parsed properly.

This addresses errors such as:
```
Found a corruption in a closed log segment: OK
Error: Corruption (yb/consensus/log_util.cc:965): Log file corruption detected.: Failed to parse PB at offset: 26013423, length: 303149529. Cause: Corruption (yb/rpc/lightweight_message.cc:376): Failed to parse ‘entry’: Failed trying to read batch #5 at offset 26013423 for log segment /mnt/d0/yb-data/tserver/wals/table-12345678901234567890123456789012/tablet-12345678901234567890123456789012/wal-000003000: ...
```
(length larger than 255 MB) when such protobufs are written to WALs.

This also fixes cause of flakiness for TabletPeerTest.MaxRaftBatchProtobufLimit in TSAN builds.

Jira: DB-7654, DB-10251

Test Plan:
Jenkins.

Added test:
```
yb_build.sh --cxx-test rpc_lwproto-test --gtest_filter LWProtoTest.BigMessage
```

Also ran TabletPeerTest.MaxRaftBatchProtobufLimit 100x on Jenkins.

Reviewers: sergei, qhu

Reviewed By: qhu

Subscribers: ybase, rthallam, bogdan, yyan

Differential Revision: https://phorge.dev.yugabyte.com/D37780
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.20 Backport Required 2024.1 Backport Required area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue
Projects
None yet
Development

No branches or pull requests

3 participants