Optimize the write performance when host is down #5571

Merged: 5 commits merged into vesoft-inc:master on Jul 5, 2023

Conversation

@luyade (Contributor) commented on Jun 2, 2023

What type of PR is this?

  • enhancement

What problem(s) does this PR solve?

Issue(s) number: #5570

Description:

Optimize the write performance when a host is down.

How do you solve it?

Special notes for your reviewer, ex. impact of this fix, design document, etc:

Checklist:

Tests:

  • Unit test (positive and negative cases)
  • Function test
  • Performance test
  • N/A

Affects:

  • Documentation affected (Please add the label if documentation needs to be modified.)
  • Incompatibility (If it breaks the compatibility, please describe it and add the label.)
  • If it's needed to cherry-pick (If cherry-pick to some branches is required, please label the destination version(s).)
  • Performance impacted: Consumes more CPU/Memory

Release notes:

Please confirm whether this should be reflected in the release notes and how it should be described:

ex. Fixed the bug .....

@luyade requested a review from critical27 as a code owner on June 2, 2023 13:24
@critical27 added the ready-for-testing (PR: ready for the CI test) and ready for review labels on Jun 9, 2023
@vesoft-inc deleted a comment from critical27 on Jun 9, 2023
@codecov-commenter commented on Jun 12, 2023

Codecov Report

Patch coverage: 46.42% and project coverage change: +1.26% 🎉

Comparison is base (e469b32) 76.96% compared to head (ccaa956) 78.22%.


Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5571      +/-   ##
==========================================
+ Coverage   76.96%   78.22%   +1.26%     
==========================================
  Files        1130     1130              
  Lines       85345    85368      +23     
==========================================
+ Hits        65686    66783    +1097     
+ Misses      19659    18585    -1074     
Impacted Files                    Coverage Δ
src/kvstore/raftex/Host.h         87.80% <ø> (+4.87%) ⬆️
src/kvstore/raftex/RaftPart.cpp   70.67% <ø> (+0.47%) ⬆️
src/kvstore/raftex/Host.cpp       70.23% <46.42%> (-2.11%) ⬇️

... and 82 files with indirect coverage changes


@luyade (Contributor, Author) commented on Jun 26, 2023

@critical27 any additional thoughts?

@critical27 (Contributor) commented:

If a host is down, the leader still continues to send appendLog requests to the bad follower. As time goes on, the log gap grows, which degrades write performance.

The key point behind the performance regression is that, when the log gap becomes large, building the WAL iterator from a specific logId is too slow (within a single WAL file, the target log is sought sequentially).

I'm OK with this PR for now. If you are interested in the problem I described, you could try to fix it. It is the biggest issue in the current implementation, and fixing it would make raft much more robust and faster in many scenarios.
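
For context, here is a minimal, self-contained sketch of the problem described above. The types are hypothetical, not the actual kvstore/raftex code: it shows why a sequential seek inside a WAL file costs time proportional to the follower's log gap, and one common mitigation (a sparse logId index), which is an assumption for illustration rather than what this PR or the codebase implements.

```cpp
// Hypothetical sketch: sequential WAL seek vs. a sparse index checkpoint seek.
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

struct WalRecord {
  int64_t logId;
  std::vector<char> payload;  // variable-length entry
};

// Sequential seek: walk records from the head of the file until the target
// logId is found. Cost is O(records before the target), so serving an old
// logId for a lagging follower gets slower the further it falls behind.
const WalRecord* seekSequential(const std::vector<WalRecord>& wal, int64_t target) {
  for (const auto& rec : wal) {
    if (rec.logId == target) {
      return &rec;
    }
  }
  return nullptr;
}

// Sparse-index seek (illustrative mitigation): jump to the last checkpoint
// at or before the target, then scan only a short tail.
const WalRecord* seekWithSparseIndex(const std::vector<WalRecord>& wal,
                                     const std::vector<std::pair<int64_t, size_t>>& index,
                                     int64_t target) {
  size_t start = 0;
  for (const auto& [logId, pos] : index) {
    if (logId <= target) {
      start = pos;
    } else {
      break;
    }
  }
  for (size_t i = start; i < wal.size(); ++i) {
    if (wal[i].logId == target) {
      return &wal[i];
    }
  }
  return nullptr;
}

int main() {
  std::vector<WalRecord> wal;
  std::vector<std::pair<int64_t, size_t>> index;
  for (int64_t id = 1; id <= 100000; ++id) {
    wal.push_back({id, {}});
    if (id % 1000 == 0) {
      index.emplace_back(id, wal.size() - 1);  // checkpoint every 1000 entries
    }
  }
  const WalRecord* slow = seekSequential(wal, 99990);       // scans ~100k records
  const WalRecord* fast = seekWithSparseIndex(wal, index, 99990);  // scans < 1000
  std::printf("sequential found %lld, indexed found %lld\n",
              static_cast<long long>(slow->logId),
              static_cast<long long>(fast->logId));
  return 0;
}
```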

@critical27 (Contributor) commented:

One point you need to check: when a node is elected as leader, should we reset paused_ to false on all of its hosts? WDYT?

@luyade (Contributor, Author) commented on Jun 27, 2023

> If a host is down, the leader still continues to send appendLog requests to the bad follower. As time goes on, the log gap grows, which degrades write performance.
>
> The key point behind the performance regression is that, when the log gap becomes large, building the WAL iterator from a specific logId is too slow (within a single WAL file, the target log is sought sequentially).
>
> I'm OK with this PR for now. If you are interested in the problem I described, you could try to fix it. It is the biggest issue in the current implementation, and fixing it would make raft much more robust and faster in many scenarios.

Actually, I fixed the issue you mentioned (at least from my point of view) in our implementation, and it had almost no performance regression even when the log gap is very large. ^_^

@luyade (Contributor, Author) commented on Jun 27, 2023

> One point you need to check: when a node is elected as leader, should we reset paused_ to false on all of its hosts? WDYT?

Yes, the current implementation calls host->resume to reset paused_ to false, so I think it should be fine. At the beginning, the new leader doesn't know whether a peer is down; it eventually learns the peer's real status by sending heartbeat (HB) requests.
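
A minimal sketch of the pause/resume idea discussed above, assuming paused_ is a per-host flag the leader checks before shipping logs. The class and functions below are hypothetical and simplified, not the actual Host.cpp code.

```cpp
// Hypothetical sketch: leader skips appendLog to a host it believes is down,
// and a successful heartbeat (or a fresh leader election) resumes it.
#include <atomic>
#include <cstdio>

class PeerHost {
 public:
  // Called when appendLog to this peer keeps failing: stop piling up
  // replication requests to it.
  void pause() { paused_.store(true, std::memory_order_relaxed); }

  // Called on a successful heartbeat response, or when this node is newly
  // elected leader and does not yet know the peer's real status.
  void resume() { paused_.store(false, std::memory_order_relaxed); }

  bool paused() const { return paused_.load(std::memory_order_relaxed); }

 private:
  std::atomic<bool> paused_{false};
};

// Replication step sketch: heartbeats always go out so a recovered peer can
// be detected, but bulk appendLog is skipped while the peer is paused.
void replicateTo(PeerHost& host, bool heartbeatOk) {
  if (heartbeatOk) {
    host.resume();  // peer answered, start shipping logs again
  }
  if (host.paused()) {
    std::printf("peer paused: skip appendLog, keep heartbeating\n");
    return;
  }
  std::printf("send appendLog batch to peer\n");
}

int main() {
  PeerHost host;
  host.pause();              // peer looked dead
  replicateTo(host, false);  // still down: only heartbeats go out
  replicateTo(host, true);   // heartbeat succeeded: resume appendLog
  return 0;
}
```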

@Sophie-Xie merged commit 429e474 into vesoft-inc:master on Jul 5, 2023. 3 of 4 checks passed.
Sophie-Xie added a commit that referenced this pull request on Jul 20, 2023
* Optimize the write performance when host is down

* fix the comments

---------

Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>
Sophie-Xie added a commit that referenced this pull request on Jul 20, 2023
* fix traverse build path memory tracker (#5619)

* Optimize the write performance when host is down (#5571)

* Optimize the write performance when host is down

* fix the comments

---------

Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>

* Fix too many logs print when listing sessions (#5618)

* Set min_level_for_custom_filter as 0 by default to delete expired d… (#5622)

Reset min_level_for_custom_filter as 0 by default to delete expired data during compaction

* Revert "Revert "Change ccache to sccache"" (#5623)

* Revert "Revert "Change ccache to sccache" (#5613)"

This reverts commit 08a5d90.

* Update pull_request.yml

* Revert  "Change ccache to sccache" (#5627)

Revert "Revert "Revert "Change ccache to sccache"" (#5623)"

This reverts commit c1b433d.

* fix all path memory tracker (#5621)

* fix all path memory tracker

* fix error

* Update pull_request.yml

enable sccache debug log

* Update pull_request.yml

add ninja -v

* Update pull_request.yml

* Update pull_request.yml

* Update pull_request.yml

* Update pull_request.yml

---------

Co-authored-by: George <58841610+Shinji-IkariG@users.noreply.github.com>
Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>

* Fix edge all predicate embedding when generating path (#5631)

* Fix edge all predicate embedding when generating path

* fmt

* Enhancement/eliminate invalid filter (#5634)

* Fix crash double free of expr.

* Change issue id.

* Eliminate invalid property filter.

* support find circular (#5636)

Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>

* fix allpath memory tracker (#5640)

fix allpath memory tracker

* fix_delete_validate (#5645)

add test case

---------

Co-authored-by: jimingquan <mingquan.ji@vesoft.com>
Co-authored-by: Ryan <ydlu1987@gmail.com>
Co-authored-by: Songqing Zhang <zhangsongqing164@163.com>
Co-authored-by: George <58841610+Shinji-IkariG@users.noreply.github.com>
Co-authored-by: kyle.cao <kyle.cao@vesoft.com>
Co-authored-by: shylock <33566796+Shylock-Hg@users.noreply.github.com>