Bug: Tile38 followers can get out of sync with leaders #740

Open
danwit-at-lytx opened this issue May 24, 2024 · 2 comments

@danwit-at-lytx

As originally reported in the Tile38 Slack Channel

Describe the bug
I recently noticed an issue with Tile38 leaders and followers. We run one leader and two followers, each in its own AWS ECS Fargate instance, connected via DNS entries. AWS periodically stops and redeploys running instances for maintenance, and because the leader and followers are separate containers, they are not redeployed at the same time. When the leader is redeployed, the existing followers connect to the new leader and start storing its data. However, they do not remove the old leader's data first, so they end up holding two sets of data, one old and one new. This causes a lot of odd behavior and breaks the customer experience.

Note: Our data is ephemeral because it is highly time sensitive. A newly deployed leader therefore starts from scratch rather than restarting from an existing external copy of the AOF file.

To Reproduce
I've had this happen multiple times now. As best I can tell, when a follower is connected to the leader via a DNS host name and a new leader instance (i.e. a new DB) is stood up at the same host name, the follower starts following the new leader without first clearing its existing data, so it ends up out of sync with the leader. The follower reports that it is caught up with the leader, but it is not in sync.

Expected behavior
When a follower connects to a leader, it should clear out anything already in its own DB so that it contains only a copy of the leader's DB.
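
Until the follower does this on its own, the expected behavior could be approximated from the outside by detaching the follower, flushing it, and re-attaching it. The following is only a sketch under assumptions: the host name `tile38-leader.internal` is hypothetical, and it assumes that `FLUSHDB` is accepted once the follower has been sent `FOLLOW no one`. It builds the command sequence and its RESP encoding without opening a connection.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// encodeRESP renders one command as a RESP array of bulk strings,
// the wire format used by Redis-compatible clients.
func encodeRESP(args ...string) string {
	var b strings.Builder
	b.WriteString("*" + strconv.Itoa(len(args)) + "\r\n")
	for _, a := range args {
		b.WriteString("$" + strconv.Itoa(len(a)) + "\r\n" + a + "\r\n")
	}
	return b.String()
}

// resetAndFollow lists, in order, the commands a follower would be sent.
// Assumption: FLUSHDB succeeds after the follower has stopped following.
func resetAndFollow(leaderHost string, leaderPort int) [][]string {
	return [][]string{
		{"FOLLOW", "no", "one"}, // detach from the old leader
		{"FLUSHDB"},             // drop the stale data
		{"FOLLOW", leaderHost, strconv.Itoa(leaderPort)}, // attach to the new leader
	}
}

func main() {
	// Hypothetical host name; 9851 is the default Tile38 port.
	for _, cmd := range resetAndFollow("tile38-leader.internal", 9851) {
		fmt.Printf("%q\n", encodeRESP(cmd...))
	}
}
```

The encoded strings could be written to a TCP connection to the follower, or the same three commands could be issued through any Redis-compatible client.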

Operating System (please complete the following information):

  • AWS ECS Fargate deployed via docker
  • Tile38 version 1.30.2
@iwpnd
Contributor

iwpnd commented May 24, 2024

If your data is ephemeral, tie the health of your followers to the health of your leader. Then if your leader restarts, so do your followers. That's an option if your architecture can tolerate a few seconds of downtime. As your leader loses its AOF file, so should your followers.

That is not to say that I would consider this expected behavior.

We host our Tile38 instances on a separate node pool to avoid unnecessary restarts like that. Also, while our leader has its own private volume, the followers use the node volume and therefore have ephemeral storage.
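
One way to tie follower health to leader health, sketched here under assumptions, is a small watchdog that polls the leader's `SERVER` output and restarts the followers when the leader's id changes. The JSON shape below (`ok`, `stats.id`) and the premise that a leader started with an empty data directory reports a new id are assumptions to verify against your Tile38 version. The function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// serverReply mirrors only the fields this sketch reads from a JSON
// SERVER reply. The shape is an assumption, not a documented contract.
type serverReply struct {
	OK    bool `json:"ok"`
	Stats struct {
		ID string `json:"id"`
	} `json:"stats"`
}

// leaderID extracts the server id from a JSON SERVER reply.
func leaderID(raw []byte) (string, error) {
	var r serverReply
	if err := json.Unmarshal(raw, &r); err != nil {
		return "", err
	}
	if !r.OK || r.Stats.ID == "" {
		return "", fmt.Errorf("reply carries no server id")
	}
	return r.Stats.ID, nil
}

// shouldRestartFollower reports whether the leader behind the DNS name
// appears to have been replaced since the last successful poll.
func shouldRestartFollower(lastSeenID, currentID string) bool {
	return lastSeenID != "" && lastSeenID != currentID
}

func main() {
	// Made-up replies standing in for two polls around a redeploy.
	before := []byte(`{"ok":true,"stats":{"id":"aaa111"}}`)
	after := []byte(`{"ok":true,"stats":{"id":"bbb222"}}`)

	prev, _ := leaderID(before)
	cur, _ := leaderID(after)
	fmt.Println("restart follower:", shouldRestartFollower(prev, cur))
}
```

In a container setup, the result could feed the follower's health check so that the orchestrator replaces the follower whenever the leader is replaced.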

edit:
I tried to take a look and tracked it down to this:

if pos == 0 {

There is no complete checksum between leader and follower. Still, I wonder why the follower AOF is not recreated in your case, since there is unlikely to be even a partial match between the old leader AOF/current follower AOF and the new leader AOF.

Can you please try to replicate this behaviour with Tile38 1.32.2? @danwit-at-lytx

@Kilowhisky
Contributor

I've been seeing this behavior too, though for it to happen for me, the leader has to go down and come back up empty. At that point the followers still have their data, start following the leader, and store its data as well.
