Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: add fields to migrate from 3.5 to 3.4 #15994

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

lavacat
Copy link

@lavacat lavacat commented Jun 1, 2023

Tested migrate from 3.5 to 3.4 manually.

$ ./bin/etcdutl migrate --data-dir <data-dir> --target-version 3.4
2023-06-01T10:47:51-07:00	info	schema/migration.go:65	updated storage version	{"new-storage-version": "3.4.0"}

$ bbolt get <data-dir>/member/snap/db meta confState
key not found
$ bbolt get <data-dir>/member/snap/db meta term
key not found

fixes: #15878

@lavacat lavacat force-pushed the main-migrate branch 3 times, most recently from 10f4667 to 387657e Compare June 2, 2023 05:47
@lavacat lavacat marked this pull request as ready for review June 2, 2023 06:17
@lavacat lavacat requested a review from serathius June 2, 2023 06:17
@serathius
Copy link
Member

The proposed approach looks good, please add e2e upgrade downgrade tests. We need v3.5 -> v3.4 -> v3.5 if we want to officially support this in migrate.

Note; current e2e framework supports downloading only the last etcd minor version. You might need to rewrite some e2e code to support downloading multiple old releases.

cc @ahrtr @ptabor to get feedback about adding downgrade support.

addNewField(Meta, MetaTermKeyName, emptyValue),
addNewField(Meta, MetaConfStateName, emptyValue),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add ClusterClusterVersionKeyName and ClusterDowngradeKeyName?

MetaTermKeyName = []byte("term")
MetaConfStateName = []byte("confState")
ClusterClusterVersionKeyName = []byte("clusterVersion")
ClusterDowngradeKeyName = []byte("downgrade")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those fields were added in v3.4 not v3.5

@ahrtr
Copy link
Member

ahrtr commented Jun 5, 2023

We need v3.5 -> v3.4 -> v3.5

Good point. We need e2e test for v3.6 -> v3.5 -> v3.6 as well.

@serathius
Copy link
Member

Good point. We need e2e test for v3.6 -> v3.5 -> v3.6 as well.

I think the current downgrade test already does this. It tests current main vs last release which now is v3.5. I would like to extend the current e2e release tests to test last 2 releases as that's our support window.

// Since v3.5
MetaTermKeyName = []byte("term")
MetaConfStateName = []byte("confState")
ScheduledCompactKeyName = []byte("scheduledCompactRev")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you create section Since v3.4 to match the schema?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've searched release-3.3 branch and compared to 3.4 and I don't think any new fields were added

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

name: "Downgrading v3.5 to v3.4 is supported",
version: version.V3_5,
targetVersion: version.V3_4,
// note: 3.4 doesn't have storageVersion, this field was added in 3.6
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, could we however also test that DetectSchemaVersion will properly recognize v3.4?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added this, but had to add another case to UnsafeDetectSchemaVersion, otherwise this won't work for Upgrading v3.4 to v3.5 should succeed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm little worried about implementing etcdutl migrate from v3.4 -> v3.5.

@serathius
Copy link
Member

Looks really promising! left couple of comments.

@lavacat
Copy link
Author

lavacat commented Jun 14, 2023

@serathius main change since last review is that I had to add better defaults for fields to make sure they can be read on upgrade.

@serathius
Copy link
Member

Looks almost done. Would like to also have a offline downgrade test. Basically start etcd v3.5, do some requests, use etcdutl do downgrade it, do couple of requests, upgrade it using etcdutl, do couple of requests. Do you think you can add it?

@lavacat
Copy link
Author

lavacat commented Jun 14, 2023

Would like to also have a offline downgrade test. Basically start etcd v3.5, do some requests, use etcdutl do downgrade it, do couple of requests, upgrade it using etcdutl, do couple of requests. Do you think you can add it?

Yes, I was thinking about adding start after migrate to TestEtctlutlMigrate test. I'll give it a go, if it makes TestEtctlutlMigrate unreadable, I'll move it to a separate test.

@lavacat lavacat force-pushed the main-migrate branch 2 times, most recently from 89a5dbf to 8182657 Compare June 21, 2023 20:08
be.Close()

if tc.delWal {
// clear out v2store to avoid "etcdserver/membership: cluster cannot be downgraded (current version: X is lower than determined cluster version: X)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think this is correct. Reason etcdserver/membership: cluster cannot be downgraded is written is because etcd server will still replay wal entries from since last snapshot. Proper solution would be to ensure that there is no SetClusterVersion wal entries since last snapshot.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quick update: working on adding snapshot save/restore for the case when /0/version entry is in wal log. I think that's the best way to go about it, since that's how users will need to handle this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quick update: working on adding snapshot save/restore for the case when /0/version entry is in wal log

Don't understand. You mean to use snapshot save/restore to workaround the issue. I don't think it's correct, the only reason it works is fact that it throws out the WAL completely. Don't think this is correct in 3 node clusters as there is no guarantee that snapshot will save the same data on different nodes.

Proper solution would be to modify/remove the SetClusterVersion WAL entry. Generally it can be risky, as this entry protects etcd from writing incompatible WAL entries that would be interpreted wrongly. There were some new fields introduced into etcd v3.5 WAL. List is available in https://github.com/etcd-io/etcd/blob/main/scripts/etcd_version_annotations.txt.

For v3.6 downgrades I implemented a mechanism to detect WAL version based on proto annotations https://github.com/etcd-io/etcd/blob/main/server/storage/wal/version.go. However maybe we could check what proto versions were between v3.4 and v3.4, maybe the differences are not critical so we could yolo it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@serathius you are correct, one has to use same snapshot. I was thinking of something like this for 3 node cluster. There will be cluster downtime if using this approach. There is probably a chance of data loss for entries that are in wal but not applied to backend (maybe there is a way to mitigate this).

Updating WAL is probably ideal. Here is the diff of raft_internal. From the first glance all new requests can be removed during migrate.ClusterVersionSetRequest, ClusterMemberAttrSetRequest, DowngradeInfoSetRequest and AuthStatusRequest. I still need to check every existing request for possible new non-optional fields.

I'll give it a shot and try to add wal update to migrate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that each member could be on different stage of applying WAL and storing db. We need to make sure that applying those entries works the same on both v3.5/v3.4.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@serathius
Copy link
Member

ping @lavacat

server/storage/schema/schema.go Outdated Show resolved Hide resolved
scripts/test.sh Outdated Show resolved Hide resolved
server/storage/schema/membership_test.go Outdated Show resolved Hide resolved
@lavacat lavacat force-pushed the main-migrate branch 4 times, most recently from 7fdfd99 to adb503f Compare September 11, 2023 07:30
Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
@jmhbnz
Copy link
Member

jmhbnz commented Dec 14, 2023

Hey @lavacat - When you have a moment can you please resolve conflicts on this? Marek, Benjamin, Wenjia and I discussed this pr today as still being key for upcoming release.

@lavacat
Copy link
Author

lavacat commented Jan 18, 2024

related doc Downgrade Support 3.5 to 3.4

@serathius
Copy link
Member

/retest

@k8s-ci-robot
Copy link

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Copy link

@lavacat: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-verify 859c0ee link true /test pull-etcd-verify
pull-etcd-unit-test-amd64 859c0ee link true /test pull-etcd-unit-test-amd64
pull-etcd-unit-test-arm64 859c0ee link true /test pull-etcd-unit-test-arm64

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Downgrade support from 3.5 to 3.4
5 participants