Storaged goes insane after being fed a ChainAddEdge request with a batch size of 409600 edges #3465

Closed
kikimo opened this issue Dec 14, 2021 · 4 comments
Labels
type/bug Type: something is unexpected
Milestone

Comments

@kikimo
Contributor

kikimo commented Dec 14, 2021


Storaged goes insane after being fed a ChainAddEdge request with a batch size of 409600 edges. After that, storaged responds to every subsequent add-edge request with Code:E_WRITE_WRITE_CONFLICT. The situation can be reproduced reliably:

root@graph1:~/nebula-chaos-cluster/scripts# ./tossEdge
I1214 16:27:13.847096    1546 raft.go:156] found leader of term: 370, leader: store2
putting kvs...
insert resp: ExecResponse({Result_:ResponseCommon({FailedParts:[PartitionResult_({Code:E_WRITE_WRITE_CONFLICT PartID:1 Leader:<nil>})] LatencyInUs:172 LatencyDetailUs:map[]})}), err: <nil>
done putting kvs...

From the storaged log:

W1214 16:34:38.944389    51 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:39.052822    46 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:39.161804    46 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:39.270409    46 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:39.378113    46 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:39.486645    51 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:39.596043    51 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:39.705327    51 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:39.814685    46 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:39.923729    51 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:40.032511    51 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:40.141647    46 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:40.250559    51 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:40.359755    46 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:40.468784    46 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:40.577255    51 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:40.685833    51 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:40.794581    46 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:40.903748    46 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:41.012645    46 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:41.121811    51 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again
W1214 16:34:41.231221    46 RaftPart.cpp:948] [Port: 9780, Space: 1, Part: 1] Only 0 hosts succeeded, Need to try again

We still cannot insert edges even after restarting the whole cluster:

root@graph1:~/nebula-chaos-cluster/scripts# ./tossEdge
I1214 16:55:22.207335     383 raft.go:156] found leader of term: 371, leader: store1
putting kvs...
insert resp: ExecResponse({Result_:ResponseCommon({FailedParts:[PartitionResult_({Code:E_WRITE_WRITE_CONFLICT PartID:1 Leader:<nil>})] LatencyInUs:115 LatencyDetailUs:map[]})}), err: <nil>
done putting kvs...
root@graph1:~/nebula-chaos-cluster/scripts# ./tossEdge
I1214 16:55:22.822024     393 raft.go:156] found leader of term: 371, leader: store1
putting kvs...
insert resp: ExecResponse({Result_:ResponseCommon({FailedParts:[PartitionResult_({Code:E_WRITE_WRITE_CONFLICT PartID:1 Leader:<nil>})] LatencyInUs:97 LatencyDetailUs:map[]})}), err: <nil>
done putting kvs...
root@graph1:~/nebula-chaos-cluster/scripts# ./tossEdge
I1214 16:55:23.505508     402 raft.go:156] found leader of term: 371, leader: store1
putting kvs...
insert resp: ExecResponse({Result_:ResponseCommon({FailedParts:[PartitionResult_({Code:E_WRITE_WRITE_CONFLICT PartID:1 Leader:<nil>})] LatencyInUs:98 LatencyDetailUs:map[]})}), err: <nil>
done putting kvs...

store1 is the new leader, and from its log we can see the following:

I1214 16:54:14.756846    62 StorageClientBase-inl.h:245] Send request to storage "store1":9777
I1214 16:54:14.757864    65 StorageClientBase-inl.h:245] Send request to storage "store1":9777
I1214 16:54:14.758860    68 StorageClientBase-inl.h:245] Send request to storage "store1":9777
I1214 16:54:14.760455    71 StorageClientBase-inl.h:245] Send request to storage "store1":9777
I1214 16:54:14.762053    57 StorageClientBase-inl.h:245] Send request to storage "store1":9777
I1214 16:54:14.763609    61 StorageClientBase-inl.h:245] Send request to storage "store1":9777
I1214 16:54:14.764801    64 StorageClientBase-inl.h:245] Send request to storage "store1":9777
I1214 16:54:14.765938    67 StorageClientBase-inl.h:245] Send request to storage "store1":9777
I1214 16:54:14.767606    70 StorageClientBase-inl.h:245] Send request to storage "store1":9777
I1214 16:54:14.768651    56 StorageClientBase-inl.h:245] Send request to storage "store1":9777
I1214 16:54:14.769855    60 StorageClientBase-inl.h:245] Send request to storage "store1":9777
I1214 16:54:14.770905    63 StorageClientBase-inl.h:245] Send request to storage "store1":9777


@kikimo kikimo added the type/bug Type: something is unexpected label Dec 14, 2021
@Sophie-Xie Sophie-Xie added this to the v3.0.0 milestone Dec 14, 2021
@kikimo kikimo changed the title from "Storaged go insane after feed with a ChainAddEdge request of 409600 batch size of edges" to "Storaged go insane after feeded with a ChainAddEdge request of 409600 batch size of edges" Dec 14, 2021
@liuyu85cn
Contributor

I guess this is because the log entry is so big that raft can't finish replicating it within 1 min (the default raft RPC timeout).

In that case, the leader thinks its replication to the followers failed, then goes into an infinite retry loop.

We will add some logging to check this later.
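
To make the suspected failure mode concrete, here is a minimal, self-contained Go simulation (not NebulaGraph code; the entry size and follower apply rate below are made-up numbers) of a leader that keeps retrying a single oversized log entry because it can never be replicated within the fixed RPC timeout:

package main

import (
	"fmt"
	"time"
)

// replicate pretends to ship one raft entry of sizeBytes to a follower that can
// absorb throughputBytesPerSec; it fails if that would take longer than timeout.
func replicate(sizeBytes, throughputBytesPerSec int, timeout time.Duration) error {
	needed := time.Duration(sizeBytes/throughputBytesPerSec) * time.Second
	if needed > timeout {
		return fmt.Errorf("only 0 hosts succeeded, need to try again (would take %v > %v)", needed, timeout)
	}
	return nil
}

func main() {
	const rpcTimeout = 60 * time.Second // default raft RPC timeout mentioned above
	entrySize := 409600 * 1000          // hypothetical: ~1 KB per edge plus its lock
	throughput := 5000000               // hypothetical follower apply rate, bytes/s

	// The real leader never gives up; the loop is capped so the sketch terminates.
	for attempt := 1; attempt <= 3; attempt++ {
		if err := replicate(entrySize, throughput, rpcTimeout); err != nil {
			fmt.Printf("attempt %d: %v\n", attempt, err)
			continue
		}
		fmt.Println("replicated")
		return
	}
	fmt.Println("leader would keep retrying the same oversized entry forever")
}

Seen from the client, the write never completes, which lines up with the repeated "Only 0 hosts succeeded, Need to try again" lines in the log above.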

@kikimo
Contributor Author

kikimo commented Dec 17, 2021

Tested with raft kvput and a batch size of 4096; we reproduced this problem in about 20 minutes.

@kikimo
Contributor Author

kikimo commented Dec 20, 2021

Regarding the earlier note that raft kvput with a batch size of 4096 reproduced this in about 20 minutes: we double-checked this and found that the problem only happens when we are inserting edges.
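
A stress loop along these lines can drive such a test (a hedged sketch; kvPut here is a hypothetical stand-in for the raft KV put RPC used by the chaos scripts, not the real client call):

package main

import "fmt"

// kvPut is a hypothetical stand-in for the raft KV put RPC issued by the
// chaos scripts; the real client call is not shown, so this always succeeds.
func kvPut(batch map[string]string) error {
	_ = batch
	return nil
}

func main() {
	// Capped at a few rounds so the sketch terminates; the real run kept
	// issuing 4096-key batches for roughly 20 minutes until writes failed.
	for round := 0; round < 100; round++ {
		batch := make(map[string]string, 4096)
		for i := 0; i < 4096; i++ {
			batch[fmt.Sprintf("key-%d-%d", round, i)] = "value"
		}
		if err := kvPut(batch); err != nil {
			fmt.Printf("writes started failing at round %d: %v\n", round, err)
			return
		}
	}
	fmt.Println("no failure observed in this bounded sketch")
}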

@kikimo kikimo assigned liuyu85cn and unassigned critical27 Dec 20, 2021
@liuyu85cn
Contributor

In this case, TOSS will build one batch consisting of the original request plus 409600 locks
(not a good design, I admit),
and it needs raft to finish writing this batch successfully within one minute, which is not possible for now.
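
For intuition, here is a rough Go sketch of the batch construction described above (hypothetical key layout and values; only the counts matter): each edge write gets paired with a lock write, so a 409600-edge ChainAddEdge becomes a single raft entry of roughly 819200 key-value writes that has to be committed atomically within the one-minute window.

package main

import "fmt"

// kv is a single write inside the atomic batch handed to raft.
type kv struct{ key, val string }

// buildChainAddEdgeBatch mimics the description above: every edge write is
// paired with a TOSS lock write, and everything is packed into one batch.
// (Hypothetical layout; not the real storage encoding.)
func buildChainAddEdgeBatch(edgeKeys []string) []kv {
	batch := make([]kv, 0, 2*len(edgeKeys))
	for _, k := range edgeKeys {
		batch = append(batch, kv{key: k, val: "edge-value"})
		batch = append(batch, kv{key: "__lock__" + k}) // one lock per edge
	}
	return batch
}

func main() {
	edges := make([]string, 409600)
	for i := range edges {
		edges[i] = fmt.Sprintf("edge-%06d", i)
	}
	fmt.Printf("one raft entry containing %d writes\n", len(buildChainAddEdgeBatch(edges)))
}

Whatever the exact encoding, an entry of that size only needs a modest per-write replication cost before the total crosses the 60-second raft RPC timeout, which matches the retry loop seen in the logs.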
