Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Wait shutdown-grace-seconds to flush async db writes to disk #17094

Merged
merged 21 commits into from
Sep 15, 2023

Conversation

chillyvee
Copy link
Contributor

@chillyvee chillyvee commented Jul 23, 2023

Description

server/start.go by default exits the main() routine before databases have a chance to flush async writes to disk (such as pebbledb)

Since there is no master registry of all open databases that can be used for orderly flushing of all open databases, databases should be allowed to catch interrupt signals (like the main daemon) and be given time to flush to disk to avoid data loss.

Add Close() for snapshot manager
Add application.db and snapshots/metadata.db to BaseApp.Close()


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • included the correct type prefix in the PR title
  • added ! to the type prefix if API or client breaking change
  • targeted the correct branch (see PR Targeting)
  • provided a link to the relevant issue or specification
  • followed the guidelines for building modules
  • included the necessary unit and integration tests
  • added a changelog entry to CHANGELOG.md
  • included comments for documenting Go code
  • updated the relevant documentation or specification
  • reviewed "Files changed" and left comments if necessary
  • run make lint and make test
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed ! in the type prefix if API or client breaking change
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic
  • reviewed API design and naming
  • reviewed documentation is accurate
  • reviewed tests and test coverage
  • manually tested (if applicable)

@chillyvee chillyvee requested a review from a team as a code owner July 23, 2023 12:39
server/start.go Fixed Show fixed Hide fixed
@yihuang
Copy link
Collaborator

yihuang commented Jul 23, 2023

You can probably implement the app.Close method, which is called in graceful shutdown.

@chillyvee
Copy link
Contributor Author

You can probably implement the app.Close method, which is called in graceful shutdown.

Thank you for the comment. Will try it.

@chillyvee
Copy link
Contributor Author

Do you know where the database handle to data/evidence.db is held?

store/snapshots/manager.go Fixed Show fixed Hide fixed
baseapp/baseapp.go Fixed Show fixed Hide fixed
Copy link
Contributor

@alexanderbez alexanderbez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the server package's signal capturing mechanism executes a callback function that performs all cleanup, instead the app should pass what it wants cleaned up to that call IMO

@chillyvee
Copy link
Contributor Author

chillyvee commented Jul 24, 2023

Since the server package's signal capturing mechanism executes a callback function that performs all cleanup, instead the app should pass what it wants cleaned up to that call IMO

Are you referring to cleanupFn from func startApp(svrCtx *Context, appCreator types.AppCreator, opts StartCmdOptions) (app types.Application, cleanupFn func(), err error) ?

A callback to add items to clean up makes sense, but since start.go/startApp opens db, should start.go clean itself up rather than relying on the chain app.go to ask?

cosmos-sdk/baseapp/baseapp.go creates the snapshotmanager, so is it better that cosmos-sdk clean that up rather than rely on a callback supplied by the application chain?

if app/app.go opens a new database, then cleanupFn should clean up that new database that cosmos-sdk does not manage

@chillyvee
Copy link
Contributor Author

@alexanderbez Confirming whether you agree that:

  1. it is okay for cosmos-sdk to close the database handles that it opens (rather than relying on the app.go chain logic)
  2. app specific databases be closed by request from the app

@alexanderbez
Copy link
Contributor

I would say this makes sense to me, yes!

@chillyvee
Copy link
Contributor Author

Following guidance, cometbft closes databases it opens OnStop:
cometbft/cometbft#1210

@chillyvee
Copy link
Contributor Author

chillyvee commented Aug 6, 2023

github dependency test does not see the new snapshot manager Close() method

local go test succeeds with manager.Close() call

app.snapshotManager.Close undefined (type *snapshots.Manager has no field or method Close)

defined here:
https://github.com/cosmos/cosmos-sdk/pull/17094/files#diff-127e2148fdc244ca5bc1bf693f0d9d2020a46b6bf32cc1377801b6c49ceb9205R555

@julienrbrt
Copy link
Member

github dependency test does not see the new snapshot manager Close() method

This is because store it is own go.mod. You could split the PR and scope one on store changes the the other one for integrating the changes (using then a pseudo version from main). That's better than adding replaces imho.

@@ -550,3 +550,9 @@ func (m *Manager) snapshot(height int64) {
m.logger.Debug("pruned state snapshots", "pruned", pruned)
}
}

// close snapshot database
func (m *Manager) Close() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add a changelog under store/CHANGELOG.md for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #17294

@chillyvee
Copy link
Contributor Author

snapshot manager Close() split into #17294

@chillyvee
Copy link
Contributor Author

chillyvee commented Aug 6, 2023

This PR is rebased on #17294

pseudo version for go.mod after #17294 merge?

@julienrbrt
Copy link
Member

This PR is rebased on #17294

pseudo version for go.mod after #17294 merge?

👍
Yes, pseudo version from main.

@chillyvee chillyvee force-pushed the cv_servergraceful branch 2 times, most recently from 6a91d0f to 398b357 Compare August 7, 2023 12:40
@chillyvee
Copy link
Contributor Author

Hi! Could you fix the conflicts and look at the last comment?

Conflict resolved.

Last comment from yihuang?

@julienrbrt
Copy link
Member

Hi! Could you fix the conflicts and look at the last comment?

Conflict resolved.

Last comment from yihuang?

Yes

@chillyvee
Copy link
Contributor Author

Hi! Could you fix the conflicts and look at the last comment?

Conflict resolved.
Last comment from yihuang?

Yes

Let's wait for an update from @yihuang to see how soon a root cause solution is available.

If a solution isn't available soon, it might be good to use the grace period now and reduce data loss on program exit (this is a current problem). We can remove the grace feature later.

baseapp/baseapp.go Outdated Show resolved Hide resolved
baseapp/baseapp.go Outdated Show resolved Hide resolved
@yihuang
Copy link
Collaborator

yihuang commented Sep 5, 2023

Hi! Could you fix the conflicts and look at the last comment?

Conflict resolved.
Last comment from yihuang?

Yes

Let's wait for an update from @yihuang to see how soon a root cause solution is available.

If a solution isn't available soon, it might be good to use the grace period now and reduce data loss on program exit (this is a current problem). We can remove the grace feature later.

sorry for the delay, our chain binary still uses sdk 0.46, which is known to be not good at graceful shutdown.
but I just tested with simd in latest cosmos-sdk, it's very stable that app.Close is always called when kill the node, I also added several seconds of time.Sleep in it, and it'll wait for it to return, there's big cleanup regarding the graceful shutdown has done in recent sdk.

@chillyvee
Copy link
Contributor Author

chillyvee commented Sep 5, 2023

I just tested with simd in latest cosmos-sdk, it's very stable that app.Close is always called when kill the node, I also added several seconds of time.Sleep in it, and it'll wait for it to return, there's big cleanup regarding the graceful shutdown has done in recent sdk.

@yihuang Do you think it's better to remove the new grace-seconds and only keep the DB close calls in baseapp/baseapp.go Close() ?

Is there any way to backport the Close()/Cleanup so that previous SDK versions can shutdown properly? Or must all chains upgrade to 0.49/0.50 to shutdown properly?

@yihuang
Copy link
Collaborator

yihuang commented Sep 5, 2023

I just tested with simd in latest cosmos-sdk, it's very stable that app.Close is always called when kill the node, I also added several seconds of time.Sleep in it, and it'll wait for it to return, there's big cleanup regarding the graceful shutdown has done in recent sdk.

@yihuang Do you think it's better to remove the new grace-seconds and only keep the DB close calls in baseapp/baseapp.go Close() ?

I think so, the issue probably don't reproduce in main branch, but I guess we can add this in 0.46 backport if is reproduce there?

Is there any way to backport the Close()/Cleanup so that previous SDK versions can shutdown properly? Or must all chains upgrade to 0.49/0.50 to shutdown properly?

I did an try before, but there are too many API breaking changes to be accepted: #16203

@chillyvee
Copy link
Contributor Author

I just tested with simd in latest cosmos-sdk, it's very stable that app.Close is always called when kill the node, I also added several seconds of time.Sleep in it, and it'll wait for it to return, there's big cleanup regarding the graceful shutdown has done in recent sdk.

@yihuang Do you think it's better to remove the new grace-seconds and only keep the DB close calls in baseapp/baseapp.go Close() ?

I think so, the issue probably don't reproduce in main branch, but I guess we can add this in 0.46 backport if is reproduce there?

How about we move the db close calls in baseapp.go to a new PR to close the databases on exit.

This PR can be just for the shutdown-grace backport for versions that don't shutdown gracefully like 0.45/0.46/0.47(juno)

@yihuang
Copy link
Collaborator

yihuang commented Sep 9, 2023

I just tested with simd in latest cosmos-sdk, it's very stable that app.Close is always called when kill the node, I also added several seconds of time.Sleep in it, and it'll wait for it to return, there's big cleanup regarding the graceful shutdown has done in recent sdk.

@yihuang Do you think it's better to remove the new grace-seconds and only keep the DB close calls in baseapp/baseapp.go Close() ?

I think so, the issue probably don't reproduce in main branch, but I guess we can add this in 0.46 backport if is reproduce there?

How about we move the db close calls in baseapp.go to a new PR to close the databases on exit.

This PR can be just for the shutdown-grace backport for versions that don't shutdown gracefully like 0.45/0.46/0.47(juno)

Yeah, sounds good to me.

@chillyvee
Copy link
Contributor Author

How about we move the db close calls in baseapp.go to a new PR to close the databases on exit.
This PR can be just for the shutdown-grace backport for versions that don't shutdown gracefully like 0.45/0.46/0.47(juno)

Yeah, sounds good to me.

Moved database close calls to #17667

@chillyvee
Copy link
Contributor Author

This PR can be just for the shutdown-grace backport for versions that don't shutdown gracefully like 0.45/0.46/0.47(juno)

@julienrbrt julienrbrt added this pull request to the merge queue Sep 15, 2023
@julienrbrt julienrbrt removed this pull request from the merge queue due to a manual request Sep 15, 2023
server/start.go Outdated Show resolved Hide resolved
server/start.go Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
@julienrbrt julienrbrt added this pull request to the merge queue Sep 15, 2023
Merged via the queue into cosmos:main with commit c92b257 Sep 15, 2023
44 of 45 checks passed
@julienrbrt
Copy link
Member

@Mergifyio backport release/v0.50.x

Copy link
Contributor

mergify bot commented Nov 10, 2023

backport release/v0.50.x

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Nov 10, 2023
…17094)

Co-authored-by: Julien Robert <julien@rbrt.fr>
(cherry picked from commit c92b257)

# Conflicts:
#	CHANGELOG.md
julienrbrt added a commit that referenced this pull request Nov 10, 2023
…ackport #17094) (#18431)

Co-authored-by: Chill Validation <92176880+chillyvee@users.noreply.github.com>
Co-authored-by: Julien Robert <julien@rbrt.fr>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants