
Sliding sync: The Two Sync Loops #1928

Closed · 10 of 11 tasks · 0 comments · Fixed by #2204

poljar commented May 16, 2023

Push notifications

Push notifications are used to deliver information about messages relevant to the user in a timely manner, in the form of an OS notification.

The mobile device receives a room ID and event ID from the push service. The Client on the device then determines whether to convert this data into a visible OS notification.

Since the room ID and event ID are not enough to display a notification, more data needs to be fetched from the homeserver first; to display all the relevant data we need to sync with the server. On Apple devices, push notifications are handled by a separate, resource-limited process, from now on called the "Notification process". It cannot wake up the main process and must independently retrieve any additional data.

In cases where an encrypted event triggers the notification, room keys may be necessary to decrypt the event. Room keys are delivered as to-device events to each recipient device.

To fulfill this requirement, the Notification process must initiate a /sync request of its own. This request fetches the missing data, including the room display name calculated by the SS proxy, the member event of the sender, and the event itself.
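As a rough illustration of this data flow, here's a minimal sketch; the types and the `resolve_push` function below are made up for the example and are not part of the SDK's API.

```rust
/// What the push gateway delivers: just enough to identify the event. These
/// types and `resolve_push` are illustrative, not SDK API.
struct PushPayload {
    room_id: String,
    event_id: String,
}

/// What the OS notification actually needs, and which therefore has to be
/// fetched with a one-shot sync.
struct NotificationItem {
    /// Room display name, as computed by the SS proxy.
    room_display_name: String,
    /// Sender display name, taken from the sender's member event.
    sender_display_name: String,
    /// The (possibly decrypted) content of the event itself.
    body: String,
}

/// Resolve a push into something displayable: sync once (including to-device
/// events, in case a room key is needed for decryption), then look up the
/// event identified by the payload. Returning `None` drops the push.
async fn resolve_push(payload: PushPayload) -> Option<NotificationItem> {
    let _ = payload;
    None // stub: the real flow is modelled in the flowcharts below
}
```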

Client setup for multiple process sync support

At present, there is no support for multiple processes making changes to the Client database simultaneously. The following flowchart models modifications to our Client and its sync loop setup which aim to allow this.

```mermaid
flowchart TD
    homeserver[Homeserver]

    subgraph client[Main process]
     direction LR
     main[Main sync]
     to-device[Extension sync]
    end

    nse[Notification process]

    main <==> homeserver
    to-device <==> homeserver
    nse <==> homeserver
```

The main process will split its sync loop into two parts: the main part, which is responsible for syncing room data and events the client wants to display right away, and the extensions (or to-device-only) part, which mainly syncs to-device events.

This solves three problems:

  1. The main sync, which gets reset whenever user interaction changes the sync settings, no longer blocks to-device and room key delivery.
  2. If another process is doing the extensions sync, the main sync does not need to wait for that process to finish.
  3. It prevents the situation where two sync loops fight over to-device events, which are supposed to be delivered only once.

Main process client flow

The following flow chart models how the main process should behave when setting up a sync loop.

```mermaid
flowchart TD
    A[Main process] --> B(Start the sync streams)
    B --> C(Main Sync stream)
    C --> E(Sync once)
    E --> E
    B --> F(Extension sync stream)
    F --> G{Acquire the extensions sync lock}
    G -->|fa:fa-lock-open Lock acquired| H(Sync once - extensions)
    G -->|fa:fa-lock Can't acquire lock| I(Wait)
    I --> F
    H --> H
```

It would probably be nice, or even required, to stop the extensions sync, and thereby release the "Extensions sync lock", when the main process gets suspended by the OS.
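As a sketch of this setup, assuming a hypothetical `ExtensionsSyncLock` type and `sync_once_*` helpers (none of these names are the SDK's actual API):

```rust
use std::time::Duration;

struct ExtensionsSyncLock;

impl ExtensionsSyncLock {
    /// Try to take the cross-process "Extensions sync lock"; `true` on success.
    async fn try_acquire(&self) -> bool {
        true // stub: the real thing would go through the store
    }
}

async fn sync_once_main() { /* sync room data and events for the UI */ }
async fn sync_once_extensions() { /* sync to-device events and other extensions */ }

#[tokio::main]
async fn main() {
    // Main sync stream: room data and events the client wants to display right
    // away. Resetting it on user interaction no longer affects to-device delivery.
    let main_sync = tokio::spawn(async {
        loop {
            sync_once_main().await;
        }
    });

    // Extension sync stream: guarded by the cross-process lock so that the
    // Notification process can take over while this process isn't syncing.
    let extension_sync = tokio::spawn(async {
        let lock = ExtensionsSyncLock;
        loop {
            if lock.try_acquire().await {
                // Lock acquired: keep syncing extensions. On OS suspension we
                // would break out of this loop and release the lock.
                loop {
                    sync_once_extensions().await;
                }
            } else {
                // Another process holds the lock: wait, then try again.
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
        }
    });

    let _ = tokio::join!(main_sync, extension_sync);
}
```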

Notification client flow

The notification client, on the other hand, should behave a bit differently. Most importantly, it should not set up a long-running sync loop; it should sync only to fetch the data necessary to display the notification.

```mermaid
flowchart TD
    A[Notification process] --> B(Received a push)
    B --> C(Start up the Client)
    C --> D{Acquire the extensions sync lock}
    D -->|fa:fa-lock-open Lock acquired| E(Sync once with to-device enabled)
    D -->|fa:fa-lock Can't acquire lock| F(Sync once with to-device disabled)
```
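A corresponding sketch for the Notification process, again with hypothetical names (`ExtensionsSyncLock`, `sync_once`), not the SDK's actual API:

```rust
struct ExtensionsSyncLock;

impl ExtensionsSyncLock {
    async fn try_acquire(&self) -> bool { true } // stub
    async fn release(&self) {}                   // stub
}

/// Sync exactly once; `with_to_device` toggles the to-device extension.
async fn sync_once(with_to_device: bool) {
    let _ = with_to_device;
}

/// Invoked by the Notification process after it received a push and started
/// up the Client.
async fn handle_push(lock: &ExtensionsSyncLock) {
    if lock.try_acquire().await {
        // We own the to-device stream: fetch room keys alongside room data,
        // then give the lock back so the main process can resume.
        sync_once(true).await;
        lock.release().await;
    } else {
        // The main process is handling to-device events: only fetch room data
        // and rely on the store for any room keys we might need.
        sync_once(false).await;
    }
}
```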

Tasks

Related but not mandatory for this

@bnjbvr bnjbvr self-assigned this May 29, 2023
bnjbvr added a commit that referenced this issue Jun 16, 2023
This implements a value-based lock in the crypto stores. The intent is to use that for multiple processes to be able to make writes into the store concurrently, while still cooperating on who does them. In particular, we need this for #1928, since we may have up to two different processes trying to write into the crypto store at the same time.

## New methods in the `CryptoStore` trait

The idea is to introduce two new methods touching **custom values** in the crypto store:

- one to atomically insert a value, but only if it was missing (so, not following the `upsert` semantics used by `set_custom_value`)
- one to atomically remove a custom value

Those two operations match the semantics we want:

- take the lock only if it ain't taken already == insert an entry only if it was missing
- release the lock = remove the entry

By looking at the number of rows affected by the query, we can infer whether the insert/remove happened, that is, whether we managed to take (or release) the lock.
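As a rough illustration, the two methods could look something like the sketch below; the exact names, signatures, and error type are assumptions, not necessarily what the PR merged.

```rust
/// Placeholder error type for the sketch.
#[derive(Debug)]
struct StoreError;

#[async_trait::async_trait]
trait CryptoStoreLockingExt {
    /// Atomically insert `value` under `key`, but only if no entry exists yet
    /// (deliberately *not* the `upsert` semantics of `set_custom_value`).
    /// Returns `true` if the insert happened, i.e. the lock was taken.
    async fn insert_custom_value_if_missing(
        &self,
        key: &str,
        value: Vec<u8>,
    ) -> Result<bool, StoreError>;

    /// Atomically remove the entry under `key`, i.e. release the lock.
    async fn remove_custom_value(&self, key: &str) -> Result<(), StoreError>;
}
```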

## High-level APIs

I've also added a high-level API, `CryptoStoreLock`, that helps manage such a lock, and adds some niceties on top of that:

- exponential backoff to retry attempts at acquiring the lock, when it was already taken
- attempt to gracefully recover when the lock has been taken by an app that's been killed by the environment
- full configuration of the key / value / backoff parameters

While it'd be nice to have something like a `CryptoStoreLockGuard`, it's hard to implement without being racy, because of the async calls that would have to happen in the `Drop` implementation (and async drop isn't stable yet).
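To illustrate just the retry strategy (not the actual `CryptoStoreLock` implementation), here is a self-contained sketch of lock acquisition with exponential backoff; the constants and the `try_take_lock` stub are made up for the example.

```rust
use std::time::Duration;

/// Stand-in for the atomic "insert the lock entry only if missing" operation.
async fn try_take_lock() -> bool {
    false // stub: pretend someone else holds the lock
}

/// Retry taking the lock with exponentially growing waits, giving up once the
/// backoff exceeds a maximum.
async fn acquire_with_backoff() -> bool {
    let mut wait = Duration::from_millis(50);
    let max_wait = Duration::from_secs(2);

    loop {
        if try_take_lock().await {
            return true; // lock acquired
        }
        if wait > max_wait {
            return false; // give up; the caller decides how to recover
        }
        tokio::time::sleep(wait).await;
        wait *= 2; // exponential backoff
    }
}
```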

## Test program

There's also a test program in which I shamelessly show my rudimentary unix skills; I've put it in the `labs/` directory but this could as well be a large integration test. A parent program initially fills a custom crypto store, then creates a `pipe()` for one-way communication with a child created with `fork()`; the parent then sends commands to the child. These commands consist of reading from and writing into the crypto store, using a lock. While the child attempts to perform these operations, the parent tries hard to get the lock at the same time. This helped figure out a few issues and make sure that cross-process locking works as intended.
bnjbvr added a commit that referenced this issue Jun 29, 2023
This implements a new time-lease-based lock for the `CryptoStore` that doesn't require explicit unlocking, which makes it more robust in the context of #1928, where any process may die because the device is running out of battery, or unexpected flows may cause a lock to not be released properly in one process or the other.

```
//! This is a per-process lock that may be used only for very specific use
//! cases, where multiple processes might concurrently write to the same
//! database at the same time; this would invalidate crypto store caches, so
//! that should be done mindfully. Such a lock can be acquired multiple times by
//! the same process, and it remains active as long as there's at least one user
//! in a given process.
//!
//! The lock is implemented using time-based leases to values inserted in a
//! crypto store. The store maintains the lock identifier (key), who's the
//! current holder (value), and an expiration timestamp on the side; see also
//! `CryptoStore::try_take_leased_lock` for more details.
//!
//! The lock is initially acquired for a certain period of time (namely, the
//! duration of a lease, aka `LEASE_DURATION_MS`), and then a "heartbeat" task
//! renews the lease to extend its duration, every so often (namely, every
//! `EXTEND_LEASE_EVERY_MS`). Since the tokio scheduler might be busy, the
//! extension request should happen way more frequently than the duration of a
//! lease, in case a deadline is missed. The current values have been chosen to
//! reflect that, with a ratio of 1:10 as of 2023-06-23.
//!
//! Releasing the lock happens naturally, by not renewing a lease. It happens
//! automatically after the duration of the last lease, at most.
```
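As an illustration of the lease mechanism the doc comment describes, here is a self-contained sketch: acquire a lease, then renew it from a heartbeat task much more often than it expires. The constants and the `try_take_leased_lock` stub are illustrative only; see the actual `CryptoStore` implementation for the real behaviour.

```rust
use std::time::Duration;

const LEASE_DURATION_MS: u64 = 500;
const EXTEND_LEASE_EVERY_MS: u64 = 50; // much shorter than the lease (~1:10)

/// Stand-in for the store operation that takes or renews the lease if we are
/// already the holder, or if the previous lease has expired.
async fn try_take_leased_lock(lease_ms: u64, key: &str, holder: &str) -> bool {
    let _ = (lease_ms, key, holder);
    true
}

/// Acquire the lock and spawn the heartbeat that keeps renewing the lease.
/// "Releasing" the lock just means aborting the heartbeat and letting the last
/// lease expire on its own.
async fn acquire(key: &str, holder: &str) -> Option<tokio::task::JoinHandle<()>> {
    if !try_take_leased_lock(LEASE_DURATION_MS, key, holder).await {
        return None;
    }

    let (key, holder) = (key.to_owned(), holder.to_owned());
    let renew_task = tokio::spawn(async move {
        loop {
            tokio::time::sleep(Duration::from_millis(EXTEND_LEASE_EVERY_MS)).await;
            try_take_leased_lock(LEASE_DURATION_MS, &key, &holder).await;
        }
    });

    Some(renew_task)
}
```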

---

* feat: implement a time lease based lock for the crypto store
* feat: switch the crypto-store lock a time-leased based one
* chore: fix CI, don't use unixepoch in sqlite and do time math in rust
* chore: dummy implementation in indexeddb, don't run lease locks tests there
* feat: in NSE, wait the duration of a lease if first attempt to unlock failed
* feat: immediately release the lock when there are no more holders
* chore: clippy
* chore: add comment about atomic sanity
* chore: increase sleeps in timeline queue tests?
* feat: lower lease and renew durations
* feat: keep track of the extend-lease task
* fix: increment num_holders when acquiring the lock for the first time
* chore: reduce indent + abort prev renew task on non-wasm + add logs