Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-32012] Provide rollback feature on savepoint upgrade mode if HA missing and JM never started #622

Merged
merged 8 commits into from
Jun 27, 2023

Conversation

ashangit
Copy link
Contributor

What is the purpose of the change

This pull request will provide rollback feature If the JobManagers have never started and we rely on savepoint upgrade mode. The operator will safely rollback relying on the provided savepoint

Also benefit from this change to unify rollback and upgrade code path

Brief change log

  • Provide rollback feature on savepoint upgrade mode if HA missing and JM never started
  • Unify rollback and upgrade code path

Verifying this change

(Please pick either of the following options)

This change is already covered by existing tests, such as (please describe tests).

  • flink-kubernetes-operator/src/test/java/org/apache/flink/kubernetes/operator/controller/RollbackTest.java: Test different rollback cases

This change added tests and can be verified as follows:

  • Manually verified the change by running a 4 node cluster with 3 JobManagers and 1 TaskManagers in standalone deployment mode:
    • First deployed a working app
    • Redeploy the app changing the parallelism to get 1 more TM
    • Then deploy the app by setting a not existing service account leading the JM to never start and requiring the job to be rolled back
    • Once rollbacked redeploy app decrease of parallelism and appropriate service account leading to app being redeployed

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changes to the CustomResourceDescriptors: yes but in status only adding lastRollbackSpec
  • Core observer or reconciler logic that is regularly executed: yes

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@ashangit ashangit force-pushed the nfraison/FLINK-32012-2 branch 2 times, most recently from 6c056df to 070472a Compare June 23, 2023 09:52
@ashangit ashangit requested a review from gyfora June 23, 2023 09:54
Copy link
Contributor

@gyfora gyfora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this turned out to be a pretty nice and simple implementation piggybacking on the existing logic. However we should add some new test cases that cover the scenarios we wished to enable here in the first place.

@gyfora
Copy link
Contributor

gyfora commented Jun 26, 2023

Can you please rebase it to the latest changes on master so we can run the e2es?

@ashangit ashangit force-pushed the nfraison/FLINK-32012-2 branch 2 times, most recently from c22f02d to 5279182 Compare June 26, 2023 09:05
@ashangit
Copy link
Contributor Author

There was an issue with new set of commits due to deployConfig being created too early and missing the potential reset of spec in case of rolling_back
Unittest updated by checking that the submitted config has well roll back a specific conf

@ashangit ashangit requested a review from gyfora June 27, 2023 11:27
@gyfora gyfora merged commit d346ca9 into apache:main Jun 27, 2023
112 checks passed
@gyfora
Copy link
Contributor

gyfora commented Jun 27, 2023

Thank you @ashangit for this great contribution :) good stuff

@ashangit ashangit deleted the nfraison/FLINK-32012-2 branch June 27, 2023 15:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants