Optimize the write performance when host is down #5571

Merged: 5 commits merged into vesoft-inc:master on Jul 5, 2023

Conversation

@luyade (Contributor) commented on Jun 2, 2023

What type of PR is this?

  • enhancement

What problem(s) does this PR solve?

Issue(s) number: #5570

Description:

Optimize the write performance when a host is down.

How do you solve it?

Special notes for your reviewer, ex. impact of this fix, design document, etc:

Checklist:

Tests:

  • Unit test (positive and negative cases)
  • Function test
  • Performance test
  • N/A

Affects:

  • Documentation affected (Please add the label if documentation needs to be modified.)
  • Incompatibility (If it breaks the compatibility, please describe it and add the label.)
  • If it's needed to cherry-pick (If cherry-pick to some branches is required, please label the destination version(s).)
  • Performance impacted: Consumes more CPU/Memory

Release notes:

Please confirm whether this should be reflected in the release notes and how it should be described:

ex. Fixed the bug .....

@luyade requested a review from critical27 as a code owner on June 2, 2023 13:24
@critical27 added the ready-for-testing (PR: ready for the CI test) and ready for review labels on Jun 9, 2023
@vesoft-inc deleted a comment from critical27 on Jun 9, 2023
@codecov-commenter commented on Jun 12, 2023

Codecov Report

Patch coverage: 46.42% and project coverage change: +1.26% 🎉

Comparison is base (e469b32) 76.96% compared to head (ccaa956) 78.22%.


Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5571      +/-   ##
==========================================
+ Coverage   76.96%   78.22%   +1.26%     
==========================================
  Files        1130     1130              
  Lines       85345    85368      +23     
==========================================
+ Hits        65686    66783    +1097     
+ Misses      19659    18585    -1074     
Impacted Files                    Coverage Δ
src/kvstore/raftex/Host.h         87.80% <ø> (+4.87%) ⬆️
src/kvstore/raftex/RaftPart.cpp   70.67% <ø> (+0.47%) ⬆️
src/kvstore/raftex/Host.cpp       70.23% <46.42%> (-2.11%) ⬇️

... and 82 files with indirect coverage changes


@luyade (Contributor, Author) commented on Jun 26, 2023

@critical27 any additional thoughts?

@critical27 (Contributor) commented:

If a host is down, the leader still continues to send appendLog requests to the bad follower. As time goes on, the log gap grows, which degrades write performance.

The key point behind the performance regression is that, when the log gap becomes large, building the WAL iterator from a specific logId is too slow (within a single WAL file, the target log is sought sequentially).

I'm OK with this PR for now. If you are interested in the problem I described, you could try to fix it. It is the biggest issue in the current implementation, and fixing it would make raft much more robust and faster in many scenarios.
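
For context, here is a minimal, self-contained sketch of the problem described above. The types are hypothetical, not the actual kvstore/raftex code: it shows why a sequential seek inside a WAL file costs time proportional to the follower's log gap, and one common mitigation (a sparse logId index), which is an assumption for illustration rather than what this PR or the codebase implements.

```cpp
// Hypothetical sketch: sequential WAL seek vs. a sparse index checkpoint seek.
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

struct WalRecord {
  int64_t logId;
  std::vector<char> payload;  // variable-length entry
};

// Sequential seek: walk records from the head of the file until the target
// logId is found. Cost is O(records before the target), so serving an old
// logId for a lagging follower gets slower the further it falls behind.
const WalRecord* seekSequential(const std::vector<WalRecord>& wal, int64_t target) {
  for (const auto& rec : wal) {
    if (rec.logId == target) {
      return &rec;
    }
  }
  return nullptr;
}

// Sparse-index seek (illustrative mitigation): jump to the last checkpoint
// at or before the target, then scan only a short tail.
const WalRecord* seekWithSparseIndex(const std::vector<WalRecord>& wal,
                                     const std::vector<std::pair<int64_t, size_t>>& index,
                                     int64_t target) {
  size_t start = 0;
  for (const auto& [logId, pos] : index) {
    if (logId <= target) {
      start = pos;
    } else {
      break;
    }
  }
  for (size_t i = start; i < wal.size(); ++i) {
    if (wal[i].logId == target) {
      return &wal[i];
    }
  }
  return nullptr;
}

int main() {
  std::vector<WalRecord> wal;
  std::vector<std::pair<int64_t, size_t>> index;
  for (int64_t id = 1; id <= 100000; ++id) {
    wal.push_back({id, {}});
    if (id % 1000 == 0) {
      index.emplace_back(id, wal.size() - 1);  // checkpoint every 1000 entries
    }
  }
  const WalRecord* slow = seekSequential(wal, 99990);       // scans ~100k records
  const WalRecord* fast = seekWithSparseIndex(wal, index, 99990);  // scans < 1000
  std::printf("sequential found %lld, indexed found %lld\n",
              static_cast<long long>(slow->logId),
              static_cast<long long>(fast->logId));
  return 0;
}
```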

@critical27 (Contributor) commented:

One point you need to check: when a node is elected as leader, should we reset paused_ to false on all of its hosts? WDYT?

@luyade (Contributor, Author) commented on Jun 27, 2023

> If a host is down, the leader still continues to send appendLog requests to the bad follower. As time goes on, the log gap grows, which degrades write performance.
>
> The key point behind the performance regression is that, when the log gap becomes large, building the WAL iterator from a specific logId is too slow (within a single WAL file, the target log is sought sequentially).
>
> I'm OK with this PR for now. If you are interested in the problem I described, you could try to fix it. It is the biggest issue in the current implementation, and fixing it would make raft much more robust and faster in many scenarios.

Actually, I fixed the issue you mentioned (at least from my point of view) in our implementation, and it had almost no performance regression even when the log gap is very large. ^_^

@luyade (Contributor, Author) commented on Jun 27, 2023

> One point you need to check: when a node is elected as leader, should we reset paused_ to false on all of its hosts? WDYT?

Yes, the current implementation calls host->resume to reset paused_ to false, so I think it should be fine. At the beginning, the new leader doesn't know whether a peer is down; it eventually learns the peer's real status by sending heartbeat (HB) requests.
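
A minimal sketch of the pause/resume idea discussed above, assuming paused_ is a per-host flag the leader checks before shipping logs. The class and functions below are hypothetical and simplified, not the actual Host.cpp code.

```cpp
// Hypothetical sketch: leader skips appendLog to a host it believes is down,
// and a successful heartbeat (or a fresh leader election) resumes it.
#include <atomic>
#include <cstdio>

class PeerHost {
 public:
  // Called when appendLog to this peer keeps failing: stop piling up
  // replication requests to it.
  void pause() { paused_.store(true, std::memory_order_relaxed); }

  // Called on a successful heartbeat response, or when this node is newly
  // elected leader and does not yet know the peer's real status.
  void resume() { paused_.store(false, std::memory_order_relaxed); }

  bool paused() const { return paused_.load(std::memory_order_relaxed); }

 private:
  std::atomic<bool> paused_{false};
};

// Replication step sketch: heartbeats always go out so a recovered peer can
// be detected, but bulk appendLog is skipped while the peer is paused.
void replicateTo(PeerHost& host, bool heartbeatOk) {
  if (heartbeatOk) {
    host.resume();  // peer answered, start shipping logs again
  }
  if (host.paused()) {
    std::printf("peer paused: skip appendLog, keep heartbeating\n");
    return;
  }
  std::printf("send appendLog batch to peer\n");
}

int main() {
  PeerHost host;
  host.pause();              // peer looked dead
  replicateTo(host, false);  // still down: only heartbeats go out
  replicateTo(host, true);   // heartbeat succeeded: resume appendLog
  return 0;
}
```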

@Sophie-Xie merged commit 429e474 into vesoft-inc:master on Jul 5, 2023. 3 of 4 checks passed.
Sophie-Xie added a commit that referenced this pull request on Jul 20, 2023
* Optimize the write performance when host is down

* fix the comments

---------

Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>
Sophie-Xie added a commit that referenced this pull request on Jul 20, 2023
* fix traverse build path memory tracker (#5619)

* Optimize the write performance when host is down (#5571)

* Optimize the write performance when host is down

* fix the comments

---------

Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>

* Fix too many logs print when listing sessions (#5618)

* Set min_level_for_custom_filter as 0 by default to delete expired d… (#5622)

Reset min_level_for_custom_filter as 0 by default to delete expired data during compaction

* Revert "Revert "Change ccache to sccache"" (#5623)

* Revert "Revert "Change ccache to sccache" (#5613)"

This reverts commit 08a5d90.

* Update pull_request.yml

* Revert  "Change ccache to sccache" (#5627)

Revert "Revert "Revert "Change ccache to sccache"" (#5623)"

This reverts commit c1b433d.

* fix all path memory tracker (#5621)

* fix all path memory tracker

* fix error

* Update pull_request.yml

enable sccache debug log

* Update pull_request.yml

add ninja -v

* Update pull_request.yml

* Update pull_request.yml

* Update pull_request.yml

* Update pull_request.yml

---------

Co-authored-by: George <58841610+Shinji-IkariG@users.noreply.github.com>
Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>

* Fix edge all predicate embedding when generating path (#5631)

* Fix edge all predicate embedding when generating path

* fmt

* Enhancement/eliminate invalid filter (#5634)

* Fix crash double free of expr.

* Change issue id.

* Eliminate invalid property filter.

* support find circular (#5636)

Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>

* fix allpath memory tracker (#5640)

fix allpath memory tracker

* fix_delete_validate (#5645)

add test case

---------

Co-authored-by: jimingquan <mingquan.ji@vesoft.com>
Co-authored-by: Ryan <ydlu1987@gmail.com>
Co-authored-by: Songqing Zhang <zhangsongqing164@163.com>
Co-authored-by: George <58841610+Shinji-IkariG@users.noreply.github.com>
Co-authored-by: kyle.cao <kyle.cao@vesoft.com>
Co-authored-by: shylock <33566796+Shylock-Hg@users.noreply.github.com>