Change sqlite3_busy_timeout to 60 seconds, add warnings re: litestream. #4867

Merged: 2 commits, Oct 17, 2021
114 changes: 83 additions & 31 deletions doc/BACKUP.md
@@ -242,31 +242,6 @@ three or four storage devices.
BTRFS would probably work better if you were purchasing an entire set
of new storage devices to set up a new node.

-## SQLite Litestream Replication
-`/!\` WHO SHOULD DO THIS: Casual users
-
-One of the simpler things on any system is to use Litestream to replicate the SQLite database.
-It continuously streams SQLite changes to file or external storage - the cloud storage option
-should not be used.
-Backups/replication should not be on the same disk as the original SQLite DB.
-
-/etc/litestream.yml :
-
-dbs:
-- path: /home/bitcoin/.lightning/bitcoin/lightningd.sqlite3
-replicas:
-- path: /media/storage/lightning_backup
-
-and start the service using systemctl:
-
-$ sudo systemctl start litestream
-
-Restore:
-
-$ litestream restore -o /media/storage/lightning_backup /home/bitcoin/restore_lightningd.sqlite3
-
-
-
## PostgreSQL Cluster

`/!\` WHO SHOULD DO THIS: Enterprise users, whales.
@@ -353,6 +328,59 @@ This can be difficult to create remote replicas due to the latency.

[pqsyncreplication]: https://www.postgresql.org/docs/13/warm-standby.html#SYNCHRONOUS-REPLICATION

## SQLite Litestream Replication
`/!\` WHO SHOULD DO THIS: Casual users

`/!\` **CAUTION** `/!\` This technique is only safe on `lightningd` 0.10.2
or later.
Earlier versions will crash regularly with "database is locked" errors,
as Litestream puts a read-shared lock on the database.
0.10.2 adds a 60-second timeout for locking.

One of the simpler options on any system is to use Litestream to replicate the SQLite database.
It continuously streams SQLite changes to a file or to external storage; the cloud storage option
should not be used.
Backups/replicas should not be on the same disk as the original SQLite DB.

You need to enable WAL mode on your database.
To do so, first stop `lightningd`, then:

$ sqlite3 lightningd.sqlite3
sqlite3> PRAGMA journal_mode = WAL;
sqlite3> .quit

Then just restart `lightningd`.
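
The same switch can be made programmatically. Below is a minimal sketch using Python's standard `sqlite3` module; the helper name `enable_wal` is an assumption made for illustration and is not part of `lightningd` or Litestream. As with the manual steps, `lightningd` should be stopped first.

```python
import sqlite3


def enable_wal(db_path: str) -> str:
    """Switch the database at db_path to WAL mode.

    Returns the journal mode SQLite reports after the change
    ("wal" on success for an ordinary on-disk database).
    """
    conn = sqlite3.connect(db_path)
    try:
        # The PRAGMA answers with a single row holding the mode now in effect.
        (mode,) = conn.execute("PRAGMA journal_mode = WAL;").fetchone()
        return mode
    finally:
        conn.close()
```

Checking the returned value is worthwhile because SQLite reports the previous mode instead of raising an error when it cannot switch.
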

/etc/litestream.yml :

dbs:
- path: /home/bitcoin/.lightning/bitcoin/lightningd.sqlite3
replicas:
- path: /media/storage/lightning_backup

and start the service using systemctl:

$ sudo systemctl start litestream

Restore:

$ litestream restore -o /media/storage/lightning_backup /home/bitcoin/restore_lightningd.sqlite3

Because Litestream only copies small changes and not the entire
database (holding a read lock on the file while doing so), the
60-second timeout on locking should not be reached unless
something has made your backup medium very very slow.
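
To illustrate what the lock timeout changes, here is a small sketch using Python's standard `sqlite3` module, whose `timeout` argument to `connect()` sets the same SQLite busy timeout that `lightningd` sets in C. The table name `demo` and the helper `try_write` are hypothetical, and any lock held by another connection stands in for the lock held by another process.

```python
import sqlite3


def try_write(db_path: str, timeout_s: float) -> bool:
    """Attempt one INSERT, waiting up to timeout_s for locks to clear.

    Returns True if the write committed, False if SQLite gave up with
    "database is locked" once the busy timeout expired.
    """
    conn = sqlite3.connect(db_path, timeout=timeout_s)
    try:
        # `demo` is a hypothetical table used only for this illustration.
        conn.execute("INSERT INTO demo(v) VALUES (1)")
        conn.commit()
        return True
    except sqlite3.OperationalError as exc:
        # SQLITE_BUSY surfaces in Python as "database is locked".
        if "locked" in str(exc):
            return False
        raise
    finally:
        conn.close()
```

With a short timeout the call returns `False` while another connection holds a conflicting lock; with a timeout longer than the time the lock is held, the same call waits and then succeeds.
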

Litestream has its own timer, so there is a tiny (but
non-negligible) probability that `lightningd` updates the
database, then irrevocably commits to the update by sending
revocation keys to the counterparty, and *then* your main
storage media crashes before Litestream can replicate the
update.
Treat this as a superior version of the "Database File Backups"
section below, and prefer recovering via other backup methods
first.

## Database File Backups

`/!\` WHO SHOULD DO THIS: Those who already have at least one of the
@@ -458,9 +486,33 @@ Even if the backup is not corrupted, take note that this backup
strategy should still be a last resort; recovery of all funds is
still not assured with this backup strategy.

-You might be tempted to use `sqlite3` `.dump` or `VACUUM INTO`.
-Unfortunately, these commands exclusive-lock the database.
-A race condition between your `.dump` or `VACUUM INTO` and
-`lightningd` accessing the database can cause `lightningd` to
-crash, so you might as well just cleanly shut down `lightningd`
-and copy the file at rest.
### `sqlite3` `.dump` or `VACUUM INTO` Commands

`/!\` **CAUTION** `/!\` This technique is only safe on `lightningd` 0.10.2
or later.
Earlier versions will crash regularly with "database is locked"
errors, as `.dump` and `VACUUM INTO` put a read-shared lock on the
database.
0.10.2 adds a 60-second timeout for locking.

Use the `sqlite3` command on the `lightningd.sqlite3` file, and
feed it with `.dump "/path/to/backup.sqlite3"` or `VACUUM INTO
"/path/to/backup.sqlite3";`.

These create a snapshot copy that, unlike the previous technique,
is assuredly uncorrupted (barring any corruption caused by your
backup media).

However, if the copying process takes a long time (approaching the
timeout of 60 seconds) then you run the risk of `lightningd`
attempting to grab a write lock, waiting up to 60 seconds, and
then failing with a "database is locked" error.
Your backup system could `.dump` to a fast `tmpfs` RAMDISK or
local media, and *then* copy to the final backup media on a remote
system accessed via slow network, for example, to reduce this
risk.

It is recommended that you use `.dump` instead of `VACUUM INTO`,
as that is assuredly faster; you can just open the backup copy
in a new `sqlite3` session and `VACUUM;` to reduce the size
of the backup.
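
The "snapshot to fast local storage, then copy to the slow medium" idea can be sketched as follows with Python's standard `sqlite3` module. This sketch uses `VACUUM INTO` because it can be issued as plain SQL (it needs SQLite 3.27 or later); it is not the `.dump` variant recommended above. The function name and all paths are assumptions for illustration only.

```python
import shutil
import sqlite3
from pathlib import Path


def snapshot_then_copy(db_path, fast_tmp_path, final_path, timeout_s=60.0):
    """Snapshot db_path onto fast local storage, then copy it elsewhere.

    The database is only locked during the fast local snapshot; the slow
    copy to final_path happens afterwards, on a file nobody else is using.
    """
    fast_tmp = Path(fast_tmp_path)
    if fast_tmp.exists():
        fast_tmp.unlink()  # VACUUM INTO will not overwrite an existing database

    # isolation_level=None gives autocommit mode: VACUUM cannot run inside
    # a transaction.
    conn = sqlite3.connect(db_path, timeout=timeout_s, isolation_level=None)
    try:
        conn.execute("VACUUM INTO ?", (str(fast_tmp),))
    finally:
        conn.close()

    # Slow step: the source database is no longer locked at this point.
    shutil.copy2(fast_tmp, final_path)
    fast_tmp.unlink()
```

For example, `fast_tmp_path` could point into a `tmpfs` mount and `final_path` at a network share, so that only the first step competes with `lightningd` for the database lock.
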
9 changes: 6 additions & 3 deletions wallet/db_sqlite3.c
@@ -41,10 +41,13 @@ static bool db_sqlite3_setup(struct db *db)
 	}
 	db->conn = sql;
 
-	/* In case another writer (litestream?) grabs a lock, we don't
+	/* In case another process (litestream?) grabs a lock, we don't
 	 * want to return SQLITE_BUSY immediately (which will cause a
-	 * fatal error): give it 5 seconds. */
-	sqlite3_busy_timeout(db->conn, 5000);
+	 * fatal error): give it 60 seconds.
+	 * We *could* make this an option, but surely the user prefers a
+	 * long timeout over an outright crash.
+	 */
+	sqlite3_busy_timeout(db->conn, 60000);
 
 	sqlite3_prepare_v2(db->conn, "PRAGMA foreign_keys = ON;", -1, &stmt, NULL);
 	err = sqlite3_step(stmt);