
support scale in tikv simultaneously #3881

Merged: 3 commits merged into pingcap:master from the support-scale-in-simultaneously branch on Aug 15, 2022

Conversation

lizhemingi
Contributor

Signed-off-by: Zheming Li <nkdudu@126.com>

What problem does this PR solve?

Closes #3849

What is changed and how does it work?

Scale in multiple TiKV stores in one ScaleIn call.

Find as many candidate ordinals as possible, but no more than spec.tikv.maxScaleInReplicas, in scaleMulti.

Try every candidate ordinal found and aggregate all errors into the returned error.
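
To make the flow above concrete, here is a minimal Go sketch, not the actual operator code: the helper names pickOrdinals and scaleInStores and the use of k8s.io/apimachinery's error aggregation are assumptions for illustration. It picks at most maxScaleIn ordinals, tries each one, and aggregates every failure into a single returned error.

package main

import (
	"fmt"

	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

// pickOrdinals returns up to maxScaleIn candidate ordinals, highest first,
// between the desired and the current replica count (hypothetical helper).
func pickOrdinals(current, desired, maxScaleIn int32) []int32 {
	var ordinals []int32
	for ord := current - 1; ord >= desired && int32(len(ordinals)) < maxScaleIn; ord-- {
		ordinals = append(ordinals, ord)
	}
	return ordinals
}

// scaleInStores tries every candidate ordinal and aggregates all errors,
// so a failure on one store does not hide failures on the others.
func scaleInStores(current, desired, maxScaleIn int32, deleteStore func(int32) error) error {
	var errs []error
	for _, ord := range pickOrdinals(current, desired, maxScaleIn) {
		if err := deleteStore(ord); err != nil {
			errs = append(errs, fmt.Errorf("scale in store %d: %w", ord, err))
		}
	}
	return utilerrors.NewAggregate(errs) // nil when no ordinal failed
}

func main() {
	// replicas 5 -> 3 with at most 2 stores scaled in per call: ordinals 4 and 3.
	err := scaleInStores(5, 3, 2, func(ord int32) error {
		fmt.Println("deleting tikv store with ordinal", ord)
		return nil
	})
	fmt.Println("aggregated error:", err)
}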

Code changes

  • Has Go code change
  • Has CI related scripts change

Tests

  • Unit test
  • E2E test
  • Manual test
  • No code

Side effects

  • Breaking backward compatibility
  • Other side effects:

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release Notes

Please refer to Release Notes Language Style Guide before writing the release note.

None

@ti-chi-bot
Member

ti-chi-bot commented Mar 29, 2021

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • csuzhangxc
  • july2993

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@CLAassistant

CLAassistant commented Mar 29, 2021

CLA assistant check
All committers have signed the CLA.

lizhemingi force-pushed the support-scale-in-simultaneously branch 2 times, most recently from 6e1bda7 to eb862dd on March 29, 2021 12:20
@DanielZhangQD
Contributor

@duduainankai Thanks for the contribution!
Did you verify this change? Could you please update the test result in the PR description?
Here are the test cases in my mind:

  • Default case, scale in one by one.
  • spec.tikv.maxScaleInReplicas=2, spec.tikv.replicas=5, scale in from 5 to 3, two Pods are scaled in at the same time
  • spec.tikv.maxScaleInReplicas=3, spec.tikv.replicas=5, scale in from 5 to 2, two Pods are scaled in at the same time, tikv-2 cannot be scaled in
  • spec.tikv.maxScaleInReplicas=2, spec.tikv.replicas=6, scale in from 6 to 3, two Pods are scaled in at the same time
  • spec.tikv.maxScaleInReplicas=4, spec.tikv.replicas=5, scale in from 5 to 3, two Pods are scaled in at the same time

Please feel free to add more cases.

It would be better to add e2e cases for this change.
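
To make the scenarios listed above concrete, here is a hedged Go sketch of the spec changes such a test would make; it uses the scaleInParallelism field that this PR ultimately adds to spec.tikv (the scenarios above use the earlier working name maxScaleInReplicas), and the test wiring around it is only illustrative.

package main

import (
	"fmt"

	"github.com/pingcap/tidb-operator/pkg/apis/pingcap/v1alpha1"
)

func main() {
	// Allow at most 2 TiKV stores to be scaled in per reconcile.
	parallelism := int32(2)
	tc := &v1alpha1.TidbCluster{}
	tc.Spec.TiKV = &v1alpha1.TiKVSpec{Replicas: 5}
	tc.Spec.TiKV.ScaleInParallelism = &parallelism

	// Scale in from 5 to 3: tikv-4 and tikv-3 are expected to be removed in
	// the same ScaleIn call, while tikv-0, tikv-1, and tikv-2 stay untouched.
	tc.Spec.TiKV.Replicas = 3
	fmt.Println("replicas:", tc.Spec.TiKV.Replicas,
		"scaleInParallelism:", *tc.Spec.TiKV.ScaleInParallelism)
}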

@lizhemingi
Contributor Author

I have already covered some of these cases in the unit tests, and I will fill in the rest.

After that I will try to add e2e cases.

@codecov-io

Codecov Report

Merging #3881 (ee9aab0) into master (3761156) will decrease coverage by 5.43%.
The diff coverage is 79.00%.

@@            Coverage Diff             @@
##           master    #3881      +/-   ##
==========================================
- Coverage   67.91%   62.47%   -5.44%     
==========================================
  Files         175      171       -4     
  Lines       18612    18155     -457     
==========================================
- Hits        12641    11343    -1298     
- Misses       4877     5714     +837     
- Partials     1094     1098       +4     
Flag Coverage Δ
e2e ?
unittest 62.47% <79.00%> (+0.06%) ⬆️

@DanielZhangQD
Contributor

@duduainankai
Thanks for the contribution!
Could you please also help update the code to cover the following cases:

  • support scale in multiple TiFlash Pods at the same time
  • support scale out multiple TiFlash/TiKV Pods at the same time

@DanielZhangQD
Contributor

@duduainankai Just a friendly ping, are you available to continue working on this PR?

@lizhemingi
Contributor Author

@DanielZhangQD oh, sorry for the slow response.

I would like to continue working on this PR, but I've been a little busy these days, so I may need more time.

Is there an ETA for this feature to be merged? If I can't make it in time, I will let you know first.

@DanielZhangQD
Contributor

@duduainankai Thanks a lot!
It's not so urgent now, what about finishing it by 05/14 or 05/21?

@lizhemingi
Contributor Author

@DanielZhangQD
Got. I will finish it ASAP.

@lizhemingi
Contributor Author

And maybe you can review the current PR first; I have already filled in the cases in the unit tests. @DanielZhangQD

@DanielZhangQD
Contributor

OK. /cc @csuzhangxc

@lizhemingi
Contributor Author

@DanielZhangQD

Hi, I have already finished "support scale in multiple TiFlash Pods at the same time" and added tests for it, mainly copied from tikv_scaler_test.

But I found that scale-in behaves a little differently between TiKV and TiFlash (e.g., there is no preCheck for TiFlash), so I removed and modified some test cases. I want to confirm whether that is by design, because it seems quite reasonable to do the preCheck when scaling in.

@DanielZhangQD
Contributor

@duduainankai We need to keep the preCheck for TiKV; as for TiFlash, it's not done yet, please refer to #3170.

@DanielZhangQD
Contributor

BTW, there are a lot of changes in this PR, please add e2e tests to cover the changes.
You can do manual tests first to make sure that the changes work as expected, update the result in the PR description, and then add e2e test cases.
Thanks a lot!

@lizhemingi
Contributor Author

Got.

lizhemingi force-pushed the support-scale-in-simultaneously branch from 0d568f3 to 3b96df3 on May 7, 2021 11:40
pkg/apis/pingcap/v1alpha1/types.go (review thread, outdated)
pkg/manager/member/scaler.go (review thread, outdated)
pkg/manager/member/tiflash_member_manager_test.go (review thread, outdated)
  tc, ok := meta.(*v1alpha1.TidbCluster)
  if !ok {
-     return nil
+     return fmt.Errorf("tiflashScaler.ScaleIn: failed to convert cluster %s/%s", meta.GetNamespace(), meta.GetName())
Contributor

Returning an error here will cause the TC to be resynced; however, what's the point of resyncing if it's not a TC?

Contributor Author

Sorry, I didn't notice that TiFlash ignores this error while TiKV treats it as one.

So is the difference here by design?

Contributor

Actually, I think TiKV should also ignore this error.

Contributor Author

Got. I will correct it.
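
For reference, a sketch of the behaviour agreed on in this thread; the function name, signature, and logging are simplified assumptions rather than the merged code. When the object passed in is not a TidbCluster, the scaler skips the work and returns nil instead of an error, since resyncing a non-TC object achieves nothing.

package main

import (
	"fmt"

	apps "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/klog/v2"

	"github.com/pingcap/tidb-operator/pkg/apis/pingcap/v1alpha1"
)

// scaleInSketch ignores objects that are not a TidbCluster instead of
// returning an error (the behaviour agreed for both the TiKV and TiFlash scalers).
func scaleInSketch(meta metav1.Object, oldSet, newSet *apps.StatefulSet) error {
	tc, ok := meta.(*v1alpha1.TidbCluster)
	if !ok {
		// Resyncing a non-TidbCluster object is pointless, so log and move on.
		klog.Errorf("ScaleIn: %s/%s is not a TidbCluster, skipping", meta.GetNamespace(), meta.GetName())
		return nil
	}
	_ = tc // the real scaler selects and deletes the stores here
	return nil
}

func main() {
	// Passing something that is not a TidbCluster: the sketch logs and returns nil.
	err := scaleInSketch(&metav1.ObjectMeta{Namespace: "ns", Name: "not-a-tc"}, nil, nil)
	fmt.Println(err)
}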

} else {
    doSet = true
    if deleteSlots.Has(ordinal) {
        actualDeleteSlots.Insert(ordinal)
Contributor

Could you please help verify the cases in https://github.com/pingcap/tidb-operator/blob/master/pkg/manager/member/scaler_test.go#L343-L611?
And you have to enable the advanced statefulset to enable the deleteslots functionality, refer to https://docs.pingcap.com/zh/tidb-in-kubernetes/dev/advanced-statefulset.
Please also help verify the Kubernetes native statefulset cases.
Thanks!
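
A rough sketch of the delete-slots bookkeeping this hunk touches, assuming deleteSlots and actualDeleteSlots are sets.Int32 values from k8s.io/apimachinery; the variable names come from the diff above and the rest is illustrative. Ordinals listed in the delete-slots annotation are recorded in actualDeleteSlots, which only takes effect with the advanced StatefulSet enabled, while the highest ordinals are removed by simply lowering .spec.replicas.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/sets"
)

func main() {
	// Ordinals the user asked to remove via the delete-slots annotation.
	deleteSlots := sets.NewInt32(3)
	actualDeleteSlots := sets.NewInt32()

	// Ordinals chosen for scale-in in this reconcile (e.g. replicas 6 -> 4
	// with ordinal 3 marked in delete-slots).
	for _, ordinal := range []int32{5, 3} {
		if deleteSlots.Has(ordinal) {
			// Removing a middle ordinal needs the advanced StatefulSet's
			// delete-slots feature, so record it for the annotation.
			actualDeleteSlots.Insert(ordinal)
		}
		// The highest ordinal (5 here) is removed by decrementing replicas.
	}
	fmt.Println("actual delete slots:", actualDeleteSlots.List()) // [3]
}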

pkg/manager/member/tiflash_scaler.go (review thread, outdated)
pkg/manager/member/tikv_scaler_test.go (review thread)
@DanielZhangQD
Contributor

@duduainankai Please help cover the following cases during the test:

@DanielZhangQD
Contributor

@duduainankai Could you please join the slack channel of sig-k8s https://tidbcommunity.slack.com/archives/CHD0HA3LZ?
BTW, you can continue the testing now and @csuzhangxc will help prepare a test plan of this PR, we will share it with you when it's ready.
Thanks!

@lizhemingi
Contributor Author

I have already joined the channel; you can find me as 'lizheming'.

@lizhemingi
Contributor Author

/run-all-tests

lizhemingi force-pushed the support-scale-in-simultaneously branch from 0d756d0 to 966dfaa on August 8, 2022 03:39
@ti-chi-bot
Member

@liubog2008: Thanks for your review. The bot only counts approvals from reviewers and higher roles in list, but you're still welcome to leave your comments.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@DanielZhangQD
Contributor

/run-all-tests

Comment on lines +2608 to +2611
// ScaleInParallelism configures max scale in replicas for TiKV stores.
// +kubebuilder:default=1
// +optional
ScaleInParallelism *int32 `json:"scaleInParallelism,omitempty"`
Collaborator

KanShiori commented Aug 9, 2022

Does it support negative numbers? How about adding a validation?

Contributor Author

Sounds reasonable.

Contributor Author

lizhemingi commented Aug 9, 2022

@DanielZhangQD Should we treat 0 as an illegal argument too?

Contributor

OK
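
A hedged sketch of the validation discussed here; the helper name, error wording, and use of the field utilities are assumptions, not the merged validation code. Both zero and negative values of scaleInParallelism are rejected.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/validation/field"
)

// validateScaleInParallelism returns an error when the value is set and not positive.
func validateScaleInParallelism(p *int32, fldPath *field.Path) field.ErrorList {
	var errs field.ErrorList
	if p != nil && *p <= 0 {
		// 0 is treated as illegal too, as agreed above.
		errs = append(errs, field.Invalid(fldPath, *p, "scaleInParallelism must be greater than 0"))
	}
	return errs
}

func main() {
	bad := int32(0)
	path := field.NewPath("spec", "tikv", "scaleInParallelism")
	fmt.Println(validateScaleInParallelism(&bad, path).ToAggregate())
}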

@DanielZhangQD
Contributor

/run-all-tests

@DanielZhangQD
Contributor

/test pull-e2e-kind pull-e2e-kind-serial

@DanielZhangQD
Contributor

@lizhemingi could you please fix the CI?

@DanielZhangQD
Contributor

/test pull-e2e-kind pull-e2e-kind-serial

Signed-off-by: Zheming Li <nkdudu@126.com>
lizhemingi force-pushed the support-scale-in-simultaneously branch from e83e96f to a8d771c on August 10, 2022 13:02
@lizhemingi
Contributor Author

Done.

@DanielZhangQD
Contributor

/run-all-test

@DanielZhangQD
Contributor

/run-all-tests


@DanielZhangQD
Contributor

/merge

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: 3941104

@DanielZhangQD
Contributor

@lizhemingi

I am very sorry that it took so long to finish the review of this PR due to reasons on our side.
But you didn't give up; you kept pushing this PR forward and did a lot of testing to ensure its quality.
Thank you very much for your great effort on this PR.

@DanielZhangQD
Contributor

/test pull-e2e-kind pull-e2e-kind-br

@DanielZhangQD
Contributor

/test pull-e2e-kind

ti-chi-bot merged commit 45c6120 into pingcap:master on Aug 15, 2022
KanShiori pushed a commit to KanShiori/tidb-operator that referenced this pull request Aug 15, 2022
xhebox pushed a commit to KanShiori/tidb-operator that referenced this pull request Sep 16, 2022
csuzhangxc mentioned this pull request on Mar 12, 2024

Successfully merging this pull request may close these issues.

Support scaling in multiple TiKV/TiFlash instances simultaneously