Track last seen post timestamp per wiki #100

Merged May 4, 2024 (6 commits)

@rossjrw rossjrw commented May 3, 2024

2a85c2c fixed a bug where a post would be skipped during download if any stored post had a later timestamp than it. Because posts are downloaded ordered by wiki and not by timestamp, this was causing almost all posts to be skipped. The fix limited the 'latest timestamp' search to posts from the given wiki.

While effective, this is not efficient. Most downloaded posts are deleted during the very same run, so on the next run many of those same posts are downloaded again, wasting time and network resources.

This PR fixes this by recording, in the context_wiki table, the last time that posts were downloaded from each wiki. The recorded value is permanent, even if the corresponding post is deleted.
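
A minimal sketch of the idea, not the code in this PR. It assumes an SQLite stand-in for the real database, Unix-second timestamps, and hypothetical column names (`wiki_id`, `latest_post_timestamp`, `posted_timestamp`) and function names; only the `context_wiki` and `notifiable_posts` table names come from this conversation. The cutoff lives on the wiki row, so deleting posts does not reset it, and a wiki with nothing recorded falls back to 0 instead of a missing value.

```python
import sqlite3

# Hypothetical schema: only the two table names come from the PR discussion.
SCHEMA = """
CREATE TABLE context_wiki (
    wiki_id TEXT PRIMARY KEY,
    latest_post_timestamp INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE notifiable_posts (
    post_id TEXT PRIMARY KEY,
    wiki_id TEXT NOT NULL,
    posted_timestamp INTEGER NOT NULL
);
"""


def open_db():
    """Create an in-memory database with the assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def get_cutoff(conn, wiki_id):
    """Return the last seen post timestamp for a wiki, or 0 if none is recorded."""
    row = conn.execute(
        "SELECT latest_post_timestamp FROM context_wiki WHERE wiki_id = ?",
        (wiki_id,),
    ).fetchone()
    return row[0] if row is not None else 0


def store_new_posts(conn, wiki_id, fetched):
    """Store posts newer than the wiki's cutoff, then advance the cutoff.

    `fetched` is a list of (post_id, timestamp) pairs.
    """
    cutoff = get_cutoff(conn, wiki_id)
    new_posts = [(post_id, ts) for post_id, ts in fetched if ts > cutoff]
    if not new_posts:
        return []
    conn.executemany(
        "INSERT OR REPLACE INTO notifiable_posts "
        "(post_id, wiki_id, posted_timestamp) VALUES (?, ?, ?)",
        [(post_id, wiki_id, ts) for post_id, ts in new_posts],
    )
    newest = max(ts for _, ts in new_posts)
    # Make sure the wiki row exists, then record the newest timestamp on it.
    conn.execute(
        "INSERT OR IGNORE INTO context_wiki (wiki_id) VALUES (?)", (wiki_id,)
    )
    conn.execute(
        "UPDATE context_wiki SET latest_post_timestamp = ? WHERE wiki_id = ?",
        (newest, wiki_id),
    )
    return new_posts
```

In this sketch, calling `store_new_posts` a second time with the same posts stores nothing, even if every row in `notifiable_posts` was deleted in between.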

@rossjrw rossjrw added the optimisation Make an existing feature faster or smaller label May 3, 2024

rossjrw commented May 4, 2024

Coincidentally, it looks like this also fixes a bug where the latest timestamp would be none if there were no posts stored for a given wiki, causing an error that prevented any posts from being stored for it at all.

I messed up the migration by initialising all wikis to 0 when I should have calculated the value from the last recorded post for each. Because of that, the new posts getter is temporarily completely deoptimised, and the next few runs will overshoot as it tries to download every post it can get its grubby little mitts on. So long as no single wiki takes more than 15 mins to download, it'll fix itself over the next few runs; it just means my uptime takes a hit for the next month.
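
For illustration, a backfill along the lines described above might look like the following. This is a sketch under assumptions, not the migration that was run: it uses SQLite, and the column names (`wiki_id`, `latest_post_timestamp`, `posted_timestamp`) and function names are hypothetical. Each wiki's value is seeded from its latest stored post, falling back to 0 only when the wiki has no stored posts.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with an assumed schema and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE context_wiki (
            wiki_id TEXT PRIMARY KEY,
            latest_post_timestamp INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE notifiable_posts (
            post_id TEXT PRIMARY KEY,
            wiki_id TEXT NOT NULL,
            posted_timestamp INTEGER NOT NULL
        );
        INSERT INTO context_wiki (wiki_id) VALUES ('w1'), ('w2');
        INSERT INTO notifiable_posts VALUES ('p1', 'w1', 5), ('p2', 'w1', 9);
        """
    )
    return conn


def backfill_latest_post_timestamps(conn):
    """Seed each wiki's timestamp from its latest stored post (0 if it has none)."""
    conn.execute(
        """
        UPDATE context_wiki
        SET latest_post_timestamp = COALESCE(
            (SELECT MAX(posted_timestamp)
             FROM notifiable_posts
             WHERE notifiable_posts.wiki_id = context_wiki.wiki_id),
            0
        )
        """
    )


def latest_for(conn, wiki_id):
    """Read back the recorded timestamp for one wiki."""
    return conn.execute(
        "SELECT latest_post_timestamp FROM context_wiki WHERE wiki_id = ?",
        (wiki_id,),
    ).fetchone()[0]
```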

It can never be reliable to use the notifiable_posts table for any sort
of aggregate data, because any and all of it is liable to be deleted.
Therefore, using the latest post's timestamp as the 'now' check is
actually deleting posts that are a year older than the latest queued
notification in my database - which could be anything.

Conversely, the max of the post check timestamps across all registered
wikis is simply the timestamp of the latest post on Wikidot that the
service has ever seen, which is updated every hour and is much more
reliable.
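
A rough sketch of that 'now' check, assuming SQLite, Unix-second timestamps, a one-year window of 365 days, and hypothetical column and function names (only the `notifiable_posts` table name and the one-year cutoff come from the message above; `context_wiki` is the table named earlier in this PR). The reference time is the maximum of the per-wiki timestamps rather than anything derived from `notifiable_posts`.

```python
import sqlite3

# Assumption: timestamps are Unix seconds and "a year" means 365 days.
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60


def make_demo_db():
    """Build an in-memory database with an assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE context_wiki (
            wiki_id TEXT PRIMARY KEY,
            latest_post_timestamp INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE notifiable_posts (
            post_id TEXT PRIMARY KEY,
            wiki_id TEXT NOT NULL,
            posted_timestamp INTEGER NOT NULL
        );
        """
    )
    return conn


def latest_seen_timestamp(conn):
    """Return the newest post timestamp seen on any registered wiki, or 0."""
    row = conn.execute(
        "SELECT MAX(latest_post_timestamp) FROM context_wiki"
    ).fetchone()
    return row[0] if row[0] is not None else 0


def delete_stale_posts(conn):
    """Delete posts more than a year older than the newest post ever seen."""
    now = latest_seen_timestamp(conn)
    cursor = conn.execute(
        "DELETE FROM notifiable_posts WHERE posted_timestamp < ?",
        (now - ONE_YEAR_SECONDS,),
    )
    return cursor.rowcount
```

Because the reference time comes from `context_wiki`, it does not move when queued posts are deleted.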
@rossjrw rossjrw merged commit a4d48f5 into main May 4, 2024
1 check passed
@rossjrw rossjrw deleted the reopt-post-download branch May 4, 2024 14:34