
Reproduce issue 13766 in linearizability tests #14682

Merged
merged 1 commit into main from issue13766 on Jan 18, 2023

Conversation

@serathius (Member) commented Nov 3, 2022

Decided to use the original approach to reproducing the issue. I was not able to reproduce it reliably via SIGKILL with lower QPS or fewer clients. The downside of this approach is that validating linearizability for so many clients is too costly; memory usage for porcupine just exploded and would definitely not fit in a GitHub Actions worker.

Decided to use the initial data inconsistency validation for reproduction instead.

@serathius force-pushed the issue13766 branch 3 times, most recently from ce4dfe9 to 0f06e47, on November 3, 2022 20:57
@codecov-commenter commented Nov 3, 2022

Codecov Report

Merging #14682 (85f72e9) into main (26c0627) will decrease coverage by 0.16%.
The diff coverage is n/a.

❗ Current head 85f72e9 differs from the pull request's most recent head 3f22b0a. Consider uploading reports for commit 3f22b0a to get more accurate results.

@@            Coverage Diff             @@
##             main   #14682      +/-   ##
==========================================
- Coverage   74.77%   74.61%   -0.17%     
==========================================
  Files         415      415              
  Lines       34335    34328       -7     
==========================================
- Hits        25674    25613      -61     
- Misses       7025     7081      +56     
+ Partials     1636     1634       -2     
| Flag | Coverage Δ |
| --- | --- |
| all | 74.61% <ø> (-0.17%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| server/proxy/grpcproxy/register.go | 69.76% <0.00%> (-20.94%) ⬇️ |
| server/etcdserver/api/rafthttp/peer_status.go | 87.87% <0.00%> (-12.13%) ⬇️ |
| server/etcdserver/api/rafthttp/peer.go | 87.01% <0.00%> (-8.45%) ⬇️ |
| client/pkg/v3/testutil/leak.go | 62.83% <0.00%> (-4.43%) ⬇️ |
| server/etcdserver/txn/util.go | 75.47% <0.00%> (-3.78%) ⬇️ |
| server/etcdserver/api/v3rpc/watch.go | 84.76% <0.00%> (-3.50%) ⬇️ |
| client/v3/leasing/txn.go | 88.09% <0.00%> (-3.18%) ⬇️ |
| client/v3/experimental/recipes/key.go | 75.34% <0.00%> (-2.74%) ⬇️ |
| client/v3/experimental/recipes/double_barrier.go | 68.83% <0.00%> (-2.60%) ⬇️ |
| client/v3/concurrency/election.go | 79.68% <0.00%> (-2.35%) ⬇️ |
| ... and 10 more | |


@serathius force-pushed the issue13766 branch 2 times, most recently from 4ad9cb1 to 9bd84c9, on November 4, 2022 10:29
@serathius (Member Author)

cc @ahrtr @spzala @chaochn47 @ptabor

ClusterSize: 3,
InitialCorruptCheck: true,
},
skipValidation: true,
Member

If you skip the validation, then it isn't a linearizability test anymore, so it doesn't make sense to include the test in the linearizability test suite.

Also, shouldn't we check HashKV after the test? The existing functional test already covers this case. The linearizability test is good, but please do not reinvent the wheel. Instead, I would suggest enhancing both the linearizability and functional tests.
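For context, here is a minimal, hedged sketch (not code from this PR) of what a post-test HashKV comparison could look like, assuming the clientv3 Maintenance API; the endpoints are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Illustrative endpoints for a local 3-member cluster.
	endpoints := []string{"localhost:2379", "localhost:22379", "localhost:32379"}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// With rev=0 each member hashes its keyspace up to its own latest revision,
	// so this comparison assumes traffic has already stopped.
	hashes := map[string]uint32{}
	for _, ep := range endpoints {
		resp, err := cli.HashKV(ctx, ep, 0)
		if err != nil {
			panic(err)
		}
		hashes[ep] = resp.Hash
	}
	fmt.Println("per-member HashKV:", hashes)
}
```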

Member Author

> The linearizability test is good, but please do not reinvent the wheel. Instead, I would suggest enhancing both the linearizability and functional tests.

As long as the functional tests are flaky, they are useless. I see using the initial hash check in the linearizability tests as a shortcut that I hope to remove in the future; however, I don't see the same possibility for the functional tests.

We are already not benefiting from running the functional tests, so once we reproduce all outstanding inconsistency issues and add network blackholing to the linearizability tests, I will propose removing them.

@chaochn47 (Member) commented Nov 4, 2022

> I was not able to reproduce it reliably via SIGKILL with lower QPS or fewer clients. The downside of this approach is that validating linearizability for so many clients is too costly; memory usage for porcupine just exploded and would definitely not fit in a GitHub Actions worker.

What's the machine requirement to reliably run this test and reproduce the issue #14682? Just in case someone wants to try it out locally.

@serathius force-pushed the issue13766 branch 2 times, most recently from 3d2693a to 7e5ab32, on November 7, 2022 20:07
@serathius (Member Author)

Working on making the test not depend on the data inconsistency check.

@serathius (Member Author)

> What's the machine requirement to reliably run this test and reproduce the issue #14682? Just in case someone wants to try it out locally.

The problem is that checking linearizability of a history is NP-hard. When we increase the number of clients and the QPS, the compute and memory required explode. I was not able to get any results from the linearization check, as I usually stop it after 5 minutes, when it has allocated over 100 GB of RAM.
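To illustrate what that check is doing, here is a minimal sketch using the porcupine checker mentioned above; the kvInput/kvOutput types and the single-key model are simplified for illustration and are not the test suite's actual model.

```go
package main

import (
	"fmt"

	"github.com/anishathalye/porcupine"
)

// Hypothetical operation types for this sketch.
type kvInput struct {
	op    string // "put" or "get"
	value string
}

type kvOutput struct {
	value string
}

func main() {
	// A sequential model of a single key: the state is its current value.
	model := porcupine.Model{
		Init: func() interface{} { return "" },
		Step: func(state, input, output interface{}) (bool, interface{}) {
			st := state.(string)
			in := input.(kvInput)
			out := output.(kvOutput)
			if in.op == "put" {
				return true, in.value // a put always succeeds and replaces the state
			}
			return out.value == st, st // a get is valid only if it observed the current state
		},
	}

	// Two clients with overlapping operations; timestamps are logical.
	history := []porcupine.Operation{
		{ClientId: 0, Input: kvInput{op: "put", value: "a"}, Call: 0, Output: kvOutput{}, Return: 10},
		{ClientId: 1, Input: kvInput{op: "get"}, Call: 5, Output: kvOutput{value: "a"}, Return: 15},
	}

	// The checker searches over possible interleavings; with many clients and
	// many concurrent (or lost) requests this search explodes.
	fmt.Println("linearizable:", porcupine.CheckOperations(model, history))
}
```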

@serathius (Member Author)

Marking as draft, as I'm still working on making the tests execute within a reasonable time.

@serathius (Member Author)

Got first success! cc @ahrtr @chaochn47

I managed to get around the issue of too many clients by looking up persisted requests via watch. Fixing this requires a lot of changes, so I will do it one by one, starting with #15044.

Reproducing #13766 requires a pretty high QPS load (>1000 QPS). To achieve that we need to increase the number of clients from 8 to 20. This caused a problem for linearizability verification, as its complexity is exponential in the number of clients: when we crash etcd, the ongoing request from each client is interrupted, so with 8 clients we lose 8 requests and with 20 clients we lose 20 requests.

The client doesn't know whether a lost request was persisted or not. If the request was lost on invocation (before it reached etcd), it will not be persisted. However, if it was lost on return (after etcd processed it), it will have been persisted without the client being notified. Consecutive lost requests cause a complexity explosion, because the current etcd model tries to determine both whether they were persisted and the exact order of their execution, so for n lost requests there are n! orderings and 2^n possible final states for each ordering. Pretty big!
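As a back-of-the-envelope illustration of that blow-up (not code from this PR), the search space for n lost requests grows roughly as n! * 2^n:

```go
package main

import (
	"fmt"
	"math/big"
)

// searchSpace returns n! * 2^n, a rough count of the combinations the checker
// has to consider for n requests whose fate and order are unknown.
func searchSpace(n int64) *big.Int {
	size := new(big.Int).MulRange(1, n) // n!
	return size.Lsh(size, uint(n))      // multiply by 2^n
}

func main() {
	fmt.Println("8 lost requests: ", searchSpace(8))  // 10,321,920 (~1e7)
	fmt.Println("20 lost requests:", searchSpace(20)) // ~2.6e24
}
```

Going from 8 to 20 lost requests takes the space from roughly 10^7 to roughly 10^24 combinations, which matches the memory blow-up described above.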

The solution was to minimize the number of lost requests by checking whether they were really persisted. There is an easy source of this information in etcd: watch. By collecting all the events in a separately running watch, I can verify whether a particular request was persisted (all Put requests have a unique id) and also get the revision at which it was persisted. This eliminates both problems: request ordering and checking persistence.
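A minimal sketch of that idea, assuming the clientv3 watch API; the key prefix, the convention of storing a unique request id in the value, and the requestID variable are hypothetical rather than the PR's actual helpers:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Map of unique request id (stored in the value) -> revision it was persisted at.
	persisted := map[string]int64{}

	// Watch the test key prefix from revision 1 to replay every persisted write
	// (assumes the history has not been compacted). The channel closes when the
	// context expires; a real test would read it until the cluster shuts down.
	for resp := range cli.Watch(ctx, "key", clientv3.WithPrefix(), clientv3.WithRev(1)) {
		for _, ev := range resp.Events {
			if ev.Type == clientv3.EventTypePut {
				persisted[string(ev.Kv.Value)] = ev.Kv.ModRevision
			}
		}
	}

	// For a request whose response was lost, check whether its unique id showed up.
	requestID := "put-1234" // hypothetical unique id written as the value
	if rev, ok := persisted[requestID]; ok {
		fmt.Printf("request %s was persisted at revision %d\n", requestID, rev)
	} else {
		fmt.Printf("request %s was never persisted\n", requestID)
	}
}
```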

I have draft code; I just need to split it into PRs for review. The first two are #15045 and #15044.

@serathius marked this pull request as ready for review on January 13, 2023 20:23
@serathius (Member Author)

Sending the code for review! We will possibly also need #15101, since with the increased QPS this issue happens more frequently.
cc @ahrtr @ptabor

Makefile (Outdated)
@@ -129,6 +129,60 @@ gofail-disable: install-gofail
install-gofail:
	cd tools/mod; go install go.etcd.io/gofail@${GOFAIL_VERSION}

# Reproduce historical issues
Contributor

How about moving this to a separate Makefile within the './tests/repros' directory?

I think it's too detailed for a top-level Makefile (and it has big potential to grow over time).

Member Author

Oops, didn't want this included with the PR.

@serathius force-pushed the issue13766 branch 4 times, most recently from a52b29f to 287e1b6, on January 17, 2023 13:21
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>