Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: events: sqlite db improvements #12090

Merged
merged 4 commits into from
Jun 14, 2024
Merged

Conversation

rvagg
Copy link
Member

@rvagg rvagg commented Jun 14, 2024

  • fix two unclosed multi-row queries
  • tune options to limit wal growth

Ref: #12089

--

This is a draft for now while I test it out. I've manually truncated my current, massive wal file (sqlite3 /path/to/sqlite/events.db followed by PRAGMA wal_checkpoint(TRUNCATE);) so I'm starting from zero again and will watch its behaviour with this and report back.

* fix unclosed multi-row query
* tune options to limit wal growth

Ref: #12089
Copy link

github-actions bot commented Jun 14, 2024

All checks have passed

@rvagg
Copy link
Member Author

rvagg commented Jun 14, 2024

I think I've cracked it!

This line, right at the beginning of initialising the database checks whether we need to make the database or not:

q, err := db.QueryContext(ctx, "SELECT name FROM sqlite_master WHERE type='table' AND name='_meta';")

In normal operation, we end up at q.Next() which gives us a row that just contains "_meta", but because it's an *sql.Rows, which holds on to a read operation until closed, and we never call q.Next() again (which would return false and close it) or just q.Close(), we hold on to a read operation.

I believe this is the read operation that's left open for the entirety of the lifetime of the database so the checkpoint never gets to run and the WAL just grows and grows. This could also explain performance problems too if we have a mammoth WAL that we need to process when doing queries too.

The changes in here force a WAL truncation on checkpoint, and I've lowered the checkpoints to 256 pages. Our page size is unfortunately set quite large for this db (set on db create), so 256 gets us to ~8MiB WAL size before checkpoint and then truncate. The default is 1000, which would land us closer to 32MiB.

Watching the chain as I write this, I'm seeing it truncate every 3-4 epochs where it hits around the 8MiB mark. So it's ~2MiB of events per epoch, although it's extremely variable. There'll probably be epochs where we easily blow past the 8MiB mark.

So, we could increase this number, but compacting the write-ahead log every epoch or 4 doesn't seem like such a bad thing.

@rvagg rvagg marked this pull request as ready for review June 14, 2024 04:57
@rvagg
Copy link
Member Author

rvagg commented Jun 14, 2024

Here's the WAL for my mainnet node using this patch, every 30 seconds, compacting every 2-4 epochs.

Screenshot 2024-06-14 at 3 32 23 PM

Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@Stebalien Stebalien merged commit 2e6a5c3 into master Jun 14, 2024
77 checks passed
@Stebalien Stebalien deleted the rvagg/events-db-improvements branch June 14, 2024 15:09
@snissn
Copy link
Contributor

snissn commented Jun 14, 2024

@rvagg does this ticket close #12081 ?

ribasushi pushed a commit to ribasushi/ci-abusing-lotus-fork that referenced this pull request Jun 15, 2024
* fix: events: sqlite db improvements

* fix unclosed multi-row query
* tune options to limit wal growth

Ref: filecoin-project#12089

* fix: events: use correct context for CollectEvents transaction

* fix: events: close prepared read statement

* fix: events: close initial query; handle lint failures
ribasushi pushed a commit to ribasushi/ci-abusing-lotus-fork that referenced this pull request Jun 15, 2024
* fix: events: sqlite db improvements

* fix unclosed multi-row query
* tune options to limit wal growth

Ref: filecoin-project#12089

* fix: events: use correct context for CollectEvents transaction

* fix: events: close prepared read statement

* fix: events: close initial query; handle lint failures
@ZenGround0 ZenGround0 mentioned this pull request Jun 17, 2024
7 tasks
rjan90 pushed a commit that referenced this pull request Jun 17, 2024
* fix: events: sqlite db improvements

* fix unclosed multi-row query
* tune options to limit wal growth

Ref: #12089

* fix: events: use correct context for CollectEvents transaction

* fix: events: close prepared read statement

* fix: events: close initial query; handle lint failures
@rvagg
Copy link
Member Author

rvagg commented Jun 17, 2024

Continuing to monitor my wal file and the behaviour is not what I expected: mostly it seems to hover around what I showed above, but it has long stretches where it gets stuck at a particular, overly large, size. A few hours at 42MiB; the next day a stretch of ~5h at 44MiB, down again for 2 epochs then up to 37MiB for another ~6h, up from there to 41MiB for a couple of hours, then back down again; later the same day it grew up to 140MiB for a few hours but now it's running normal again.

The most important thing is that it eventually shrinks. However the behaviour of growing and getting stuck at a size well above the set maximum is strange because I can't reconcile it with what the docs suggest and what I know of the access patterns. Like we have long-running read locks that eventually free up after hours; but that doesn't square with how we run this thing. Perhaps this is just internal sqlite behaviour that's not clearly documented.

dumikau pushed a commit to protofire/lotus that referenced this pull request Jun 18, 2024
* fix: events: sqlite db improvements

* fix unclosed multi-row query
* tune options to limit wal growth

Ref: filecoin-project#12089

* fix: events: use correct context for CollectEvents transaction

* fix: events: close prepared read statement

* fix: events: close initial query; handle lint failures
jennijuju pushed a commit that referenced this pull request Jun 19, 2024
* fix: ci: do not use deprecated --debug goreleaser flag (#12086)

* chore: deals: remove forgotten graphsync references (#12084)

* chore: types: remove more items forgotten after markets (#12095)

* chore: cleanup: remove more items forgotten after markets

* .gz somehow reappeared after #11625

* fix: ETH RPC API: ETH Call should use the parent state root of the subsequent tipset (#11905)

* fix eth call

* tests

* changes as per review

* changes as per review

* Update node/impl/full/eth.go

Co-authored-by: Rod Vagg <rod@vagg.org>

* fix as per review

---------

Co-authored-by: Rod Vagg <rod@vagg.org>

* Update changelog to RC2

Update changelog to RC2

* Make gen / make docsgen-cli

Make gen / make docsgen-cli

* chore: api: the Net API/CLI now remains only on daemon

The only part of this repository that does lp2p is now lotus-daemon

Remove the CommonNet type, used exclusively bu the CLI stack

Adjust the rest of struct-memebership to match what went where

End result best seen in diff of `documentation/en/api-v0-methods-miner.md`

* Update changelog

Update changelog

* fix: events: sqlite db improvements (#12090)

* fix: events: sqlite db improvements

* fix unclosed multi-row query
* tune options to limit wal growth

Ref: #12089

* fix: events: use correct context for CollectEvents transaction

* fix: events: close prepared read statement

* fix: events: close initial query; handle lint failures

* Update CHANGELOG.md

---------

Co-authored-by: Piotr Galar <piotr.galar@gmail.com>
Co-authored-by: Peter Rabbitson <ribasushi@protocol.ai>
Co-authored-by: Aarsh Shah <aarshkshah1992@gmail.com>
Co-authored-by: Rod Vagg <rod@vagg.org>
Co-authored-by: Peter Rabbitson <ribasushi@leporine.io>
jennijuju added a commit that referenced this pull request Jun 25, 2024
* release: v1.26.3 (#11908) (#11915)

* deps: update dependencies to address migration memory bloat

to address memory concerns during a heavy migration

Ref: filecoin-project/go-state-types#260
Ref: whyrusleeping/cbor-gen#96
Ref: filecoin-project/go-amt-ipld#90

* release: prep v1.26.3 patch

Prep v1.26.3 patch release:
- Update changelog, version and make gen + make docsgen-cli

* deps: update cbor-gen to tagged version

deps: update cbor-gen to tagged version

* deps: update go-state-types to tagged version

deps: update go-state-types to tagged version v0.13.2

* chore: deps: update go-state-types to v0.13.3

Fixes a panic when we have fewer than 1k proposals.

---------

Co-authored-by: Rod Vagg <rod@vagg.org>
Co-authored-by: Steven Allen <steven@stebalien.com>

* build: release: v1.27.0-rc1 (#11947)

* chore: Set version as v1.27.0-rc1

Set version as v1.27.0-rc1, run make gen & make docsgen-cli

* Update changelog

Update changelog

* Update changelog

Update changelog based on feedback

* Bump pubsub-dep

Bump pubsub-dep

* Prep v1.27.0-rc2

Prep v1.27.0-rc2

* Typo fixes, and more changelog updates

Typo fixes, and more changelog updates

* chore: remove unmaintained bootstrappers (#11983)

* chore: remove unmaintained bootstrappers

chore: remove unmaintained bootstrappers

* Update mainnet.pi fixing typoed domain

fixing typo for 1475.io 'bootstarp' -> 'bootstrap'

* Update mainnet.pi

apparently the actual hostname is typoed. so bootstarp it is.

---------

Co-authored-by: smagdali <stefan@fil.org>

* chore: update go-data-transfer and go-graphsync

* add ETH addrs API to Gateway (#11979)

* fix: copy Flags field from SectorOnChainInfo

Fixes: #11962

* feat: libp2p: Lotus stream cleanup (#11993)

* set stream deadlines in Lotus

* reduce timeout

* whitelist bootstrappers

* fix tests

* Update changelog and version

Update changelog and version

* ci: deprecate circle ci in favour of github actions (#11786)

* Update changelog

Update changelog with the deprecate circle-ci

* chore: update drand (#12021)

* Update changelog / make docsgen

Update changelog / make docsgen

* chore: lint: update golangci lint config

* remove and replace some linters
* remove some exclusions
* make all exclusions more explicit matches

* chore: lint: fix lint errors with new linting config

Ref: #11967

* chore: lint: address feedback from reviews

* doc: eth: restore comment lost in linter cleanup

Ref: #11968

* chore: libp2p: update to v0.34.1 (#12027)

* update libp2p to v0.34.0

* fix libp2p err

* fix imports

* update go mod

* update go mod

* Update changelog

Update changelog

* go mod tidy

go mod tidy

* revert go version change (#12050)

* Update changelog

Update changelog

* chore: backport #12054 to release/v1.27.0 branch (#12056)

* chore: pin golanglint-ci to v1.58.2 (#12054)

Fixes: #12044

* Add backport to changelog

Add backport to changelog

---------

Co-authored-by: Rod Vagg <rod@vagg.org>

* Bump version - make gen/make docsgen

Bump version - make gen/make docsgen

* Update changelog

Update changelog

* Bump NodeBuildVersion to v1.27.1-rc1

Bump NodeBuildVersion to v1.27.1-rc1

* Add Lotus-Miner / Curio related changes

Add Lotus-Miner / Curio related changes

* Update date and upgrade warnings

Update date and upgrade warnings

* fix: ci: do not use deprecated --debug goreleaser flag (#12086)

* chore: deals: remove forgotten graphsync references (#12084)

* chore: types: remove more items forgotten after markets (#12095)

* chore: cleanup: remove more items forgotten after markets

* .gz somehow reappeared after #11625

* fix: ETH RPC API: ETH Call should use the parent state root of the subsequent tipset (#11905)

* fix eth call

* tests

* changes as per review

* changes as per review

* Update node/impl/full/eth.go

Co-authored-by: Rod Vagg <rod@vagg.org>

* fix as per review

---------

Co-authored-by: Rod Vagg <rod@vagg.org>

* Update changelog to RC2

Update changelog to RC2

* Make gen / make docsgen-cli

Make gen / make docsgen-cli

* chore: api: the Net API/CLI now remains only on daemon

The only part of this repository that does lp2p is now lotus-daemon

Remove the CommonNet type, used exclusively bu the CLI stack

Adjust the rest of struct-memebership to match what went where

End result best seen in diff of `documentation/en/api-v0-methods-miner.md`

* Update changelog

Update changelog

* fix: events: sqlite db improvements (#12090)

* fix: events: sqlite db improvements

* fix unclosed multi-row query
* tune options to limit wal growth

Ref: #12089

* fix: events: use correct context for CollectEvents transaction

* fix: events: close prepared read statement

* fix: events: close initial query; handle lint failures

* Update CHANGELOG.md

* build: release: v1.27.1-rc2 (#12101)

* fix: ci: do not use deprecated --debug goreleaser flag (#12086)

* chore: deals: remove forgotten graphsync references (#12084)

* chore: types: remove more items forgotten after markets (#12095)

* chore: cleanup: remove more items forgotten after markets

* .gz somehow reappeared after #11625

* fix: ETH RPC API: ETH Call should use the parent state root of the subsequent tipset (#11905)

* fix eth call

* tests

* changes as per review

* changes as per review

* Update node/impl/full/eth.go

Co-authored-by: Rod Vagg <rod@vagg.org>

* fix as per review

---------

Co-authored-by: Rod Vagg <rod@vagg.org>

* Update changelog to RC2

Update changelog to RC2

* Make gen / make docsgen-cli

Make gen / make docsgen-cli

* chore: api: the Net API/CLI now remains only on daemon

The only part of this repository that does lp2p is now lotus-daemon

Remove the CommonNet type, used exclusively bu the CLI stack

Adjust the rest of struct-memebership to match what went where

End result best seen in diff of `documentation/en/api-v0-methods-miner.md`

* Update changelog

Update changelog

* fix: events: sqlite db improvements (#12090)

* fix: events: sqlite db improvements

* fix unclosed multi-row query
* tune options to limit wal growth

Ref: #12089

* fix: events: use correct context for CollectEvents transaction

* fix: events: close prepared read statement

* fix: events: close initial query; handle lint failures

* Update CHANGELOG.md

---------

Co-authored-by: Piotr Galar <piotr.galar@gmail.com>
Co-authored-by: Peter Rabbitson <ribasushi@protocol.ai>
Co-authored-by: Aarsh Shah <aarshkshah1992@gmail.com>
Co-authored-by: Rod Vagg <rod@vagg.org>
Co-authored-by: Peter Rabbitson <ribasushi@leporine.io>

* small fix in changelog

* fix: release: update goreleaser config file

Fixes: #12120

* fix go releaser and test with rc3

* Update CHANGELOG.md

* lotus v1.27.1 prep

* address review
- resolve one more conflicts
- revert 2 new line added

* doc: events: note events db migration impact

---------

Co-authored-by: Phi-rjan <orjan.roren@gmail.com>
Co-authored-by: Rod Vagg <rod@vagg.org>
Co-authored-by: Steven Allen <steven@stebalien.com>
Co-authored-by: smagdali <stefan@fil.org>
Co-authored-by: Aarsh Shah <aarshkshah1992@gmail.com>
Co-authored-by: Piotr Galar <piotr.galar@gmail.com>
Co-authored-by: Peter Rabbitson <ribasushi@protocol.ai>
Co-authored-by: Peter Rabbitson <ribasushi@leporine.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants