Use raft log to patch operation history in Robustness tests #15598

serathius · 2023-03-31T09:12:19Z

What would you like to be added?

Robustness tests execute traffic against a cluster and record their history so they can be validated for linearizability. Unfortunately linearizing history is an NP-Hard problem, meaning it has exponential complexity by default. We maneuver around the problem by special algorithms (https://github.com/anishathalye/porcupine#porcupine-- uses https://arxiv.org/pdf/1504.00204.pdf) and history patching.

History patching is needed to remove operations that cause linearization to require exponential complexity. Those operations are client requests that returned error for any reason, connection timeout, request timeout etc. In those cases we don't know if the request was persisted or not, and if it was persisted what revision it got. Without knowing this we need to consider each possibility and any order of requests. For sequence of requests of length n, we already get 2^n states (persisted or not) that is also multiplied by n! (order of operations).

History patching uses independent history of events collected via watches to either remove failed requests from the operation history or provides them with their revision. This can be done for PUT and TXN as they are unique because robustness tests write unique value in each request.

Problem

Watch events only include transactions that succeeded. Delete request for a non-present key will not be present in watch history. Same for TXN transaction that failed it's condition and had empty onFail operations. Those operations where still executed by etcd and their condition checked, however current patching marks them as never existing.

Using watch to patch operation history means that etcd could still have an linearization issue on condition checking and we would never know about this. We need a way to not only collect operations that succeeded by operations that were applied to backend.

Proposed solution

Validate best way to access raft log in cluster, either by WAL files or connecting to raft as a learner
Use raft log to decide whether operation was executed or not.
Still use watch history, but only to provide revision for operations peristed.

Why is this needed?

Patching operation history is critical to make robustness tests performant, however in it's current state limits our ability to validate correctness. We should be aware of this flaw and when needed we should fix it.

Not a priority for now, but an important thing to have in mind.

The text was updated successfully, but these errors were encountered:

stale · 2023-08-12T07:31:32Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

serathius · 2024-02-14T15:56:33Z

I should consider prioritizing this to implement Compaction for Kubernetes. With more aggressive compaction watches will be interrupted, which would lead to gaps in our event history. With gaps we could not patch operation history, which could lead to linearization timeouts.

ahrtr · 2024-04-09T09:25:14Z

Overall it seems not a good idea to verify etcd's correctness depending on WAL data. It's a little strange that you verify etcd's correctness, but you trust the WAL data generated by etcd.

We should regard the whole etcd as a black box, and interactive & verify it using its public APIs. The WAL data should be transparent to the robustness test, and it's supposed to be used by etcdserver itself only.

A related topic: should we patch the history using the watch events? My thoughts:

I tend to keep them totally separated to simplify the robustness test's implementation. We verify the linearizability using porcupine & model, and verify watch doesn't break any invariant properties; they are two different & separate verifications. But on the other hand, it indeed doesn't help to reduce the exponential complexity caused by the failed client requests for any reason. So I don't have very strong opinion on this for now, although I tend to keep them totally separated.
At least, we should consider patching the history using watch events, only after we have verified that the watch events do not break any invariant properties.

serathius · 2024-04-09T10:23:47Z

Overall it seems not a good idea to verify etcd's correctness depending on WAL data.

I don't depend on WAL data for checking correctness, this is done using client requests that treat etcd as black box. I use it to remove dead ends from linearization, it should not impact the main linearization path. If there would be any error in etcd with regards to WAL, the robustness tests would still catch it.

We should regard the whole etcd as a black box, and interactive & verify it using its public APIs. The WAL data should be transparent to the robustness test, and it's supposed to be used by etcdserver itself only.

If I had a quantum computer then yes, but unfortunately we are still far from solving NP-Complete problems. We cannot do this fully in a black box approach and we need some tricks to succeed.

A related topic: should we patch the history using the watch events?

We already do that, that's why we could not implemented compaction as with compaction we no longer can guarantee that we will observe all events.

serathius · 2024-06-04T12:18:00Z

Done

serathius added type/feature area/robustness-testing labels Mar 31, 2023

This was referenced May 5, 2023

tests/robustness: Split model code into deterministic and non-deterministic #15819

Merged

tests/robustness: Add safeguards to client and history #15873

Merged

serathius added the stage/tracked label Aug 17, 2023

This was referenced Oct 15, 2023

Robustness wal #16761

Merged

Limit a timeout in testing robustness validation #16764

Merged

serathius mentioned this issue Feb 14, 2024

Codify the the kubernetes-etcd contract into etcd robustness tests #15820

Open

14 tasks

serathius closed this as completed Jun 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use raft log to patch operation history in Robustness tests #15598

Use raft log to patch operation history in Robustness tests #15598

serathius commented Mar 31, 2023

stale bot commented Aug 12, 2023

serathius commented Feb 14, 2024

ahrtr commented Apr 9, 2024

serathius commented Apr 9, 2024

serathius commented Jun 4, 2024

Use raft log to patch operation history in Robustness tests #15598

Use raft log to patch operation history in Robustness tests #15598

Comments

serathius commented Mar 31, 2023

What would you like to be added?

Problem

Proposed solution

Why is this needed?

stale bot commented Aug 12, 2023

serathius commented Feb 14, 2024

ahrtr commented Apr 9, 2024

serathius commented Apr 9, 2024

serathius commented Jun 4, 2024