
Incremental fetchers #91 (Draft)

wants to merge 2 commits into base: master
Conversation

@berekuk (Collaborator) commented Jun 3, 2022

This is a draft for #35 and #36, and it's not ready yet, but the changes are significant and I want to braindump my thoughts on it.

So, currently all platform modules fetch all questions and then store a huge array in the DB (and then on Algolia).

As I mentioned in #36, I'd like to change that.


Sidenote: I spent several hours today fighting the new metaculus fetcher, which failed for one reason or another (mostly because of excessive validation, but also once because one question was on the frontpage and ON DELETE was set to restrict instead of cascade). Every time I had to wait until it got past the last point of failure, only to have it fail again further down the road.

I really don't like having such a long feedback loop to get some initial results; the current architecture also gets in the way when I want to get some questions into my dev DB. Though I've recently implemented the npm run cli metaculus -- --id=12345 command, what I really want is to say "fetch some stuff for this platform" without waiting several hours for the script to finish.

Of course, there are also other reasons for doing this: getting us closer to real-time capabilities, etc.


The basic idea is: we crawl the graph of urls; there are some leaf nodes (question page urls, or graphql endpoints with question data, or whatever) and some intermediate nodes which let us discover leaf nodes, e.g. /api2/questions/ on metaculus, which doesn't give us full data but gives us urls for other api pages with full data.

To store the progress, we can use a table (Robot) with jobs as rows; each job includes a url, a json context, and some metadata for when the job was created and whether it was completed. Then we can encapsulate the common pattern of "keep fetching while there's stuff to process" behind a common API.
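One possible shape for such a job row, written as a TypeScript type; the field names here are my assumption, nothing is finalized:

```typescript
// Hypothetical shape of one row in the Robot jobs table (names are illustrative).
interface RobotJob {
  platform: string; // e.g. "metaculus"
  url: string; // what to fetch
  context: unknown; // platform-specific JSON, e.g. { type: "apiIndex" }
  createdAt: Date; // when the job was scheduled
  completedAt: Date | null; // null while the job is still pending
}
```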

Here's a draft which uses this approach:

export const myPlatform = {
  ...,
  async fetcher({ robot, storage }) {
    await robot.schedule({
      url: 'https://www.metaculus.com/api2/questions/',
      context: {
        type: 'apiIndex',
      },
      maxAge: 3600 * 24, // don't schedule if the previous fetch happened recently
    });

    for (let job; (job = await robot.nextJob()); ) {
      const result = await job.fetch();
      if (job.context.type === 'apiIndex') {
        const data = validate(result);
        for (const item of data.results) {
          await robot.schedule({
            url: item.url,
            context: {
              type: 'apiSingle',
            },
          });
        }
        if (data.next) {
          await robot.schedule({
            url: data.next,
            context: {
              type: 'apiIndex',
            },
          });
        }
      } else if (job.context.type === 'apiSingle') {
        const validated = validate(result);
        const question = resultToQuestion(validated);
        await storage.save(question);

        // complete the job and create a new one; this is excessive in this case
        // since we also crawl the index api pages, but can be helpful in other cases
        await job.done({ repeatAfter: 86400 });
      } else {
        throw new Error('Unknown job type');
      }
    }
  },
};

Notes on this example:

  • fetcher doesn't return anything; it calls storage.save instead
  • fetcher is completely interruptible: you should be able to ctrl+c it, restart it, and it'll continue from the same point
  • fetcher exits when there are no jobs queued up, but you can just restart it every minute and it'll queue up new stuff when that becomes necessary (previous discussion: Independent update schedules for different platforms #35)
  • fetcher could add different indices to the queue with different maxAge values; e.g., it's easy to schedule the metaculus frontpage with a small maxAge and crawl urls from it more frequently
  • robot API could encapsulate sleeps and other common logic (not implemented yet)
  • storage.save will also update history and Algolia synchronously, so there's no need to do it in a separate step
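To make the schedule/nextJob/done semantics concrete, here's a minimal in-memory sketch of the robot API. The method names follow the draft above; everything else (field names, dedup-by-maxAge logic) is illustrative — the real version would be backed by the Robot table:

```typescript
type Context = { type: string };

interface Job {
  url: string;
  context: Context;
  fetchedAt?: number; // epoch seconds of the last completed fetch
  pending: boolean;
}

// In-memory stand-in for the DB-backed Robot table.
class InMemoryRobot {
  private jobs = new Map<string, Job>();

  // Schedule a url, unless it was fetched more recently than maxAge seconds ago.
  async schedule(opts: { url: string; context: Context; maxAge?: number }) {
    const existing = this.jobs.get(opts.url);
    const now = Date.now() / 1000;
    if (
      existing?.fetchedAt !== undefined &&
      opts.maxAge !== undefined &&
      now - existing.fetchedAt < opts.maxAge
    ) {
      return; // fetched recently enough, skip
    }
    this.jobs.set(opts.url, { url: opts.url, context: opts.context, pending: true });
  }

  // Return some pending job, or undefined when the queue is drained.
  // Note: the caller is expected to call done() afterwards, otherwise
  // a restart (or the next nextJob call) will hand out the same job again.
  async nextJob(): Promise<Job | undefined> {
    for (const job of this.jobs.values()) {
      if (job.pending) return job;
    }
    return undefined;
  }

  // Mark a job as completed so a restart won't redo it.
  async done(job: Job) {
    job.pending = false;
    job.fetchedAt = Date.now() / 1000;
  }
}
```

This also shows why the fetcher is interruptible: all the state lives in the jobs store, so killing the process between nextJob and done just means that job gets retried on the next run.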

In the future, we could also:

  • separate the robot and the platform code further, so that it'll be possible to run "re-process all the fetched data which is already cached"
  • pass a different storage to the fetcher, e.g. for debugging purposes you could console.log questions instead of storing them in the DB
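For the second point, the debug storage could be as simple as the following sketch (the storage interface is assumed to be just save, matching how the fetcher draft uses it):

```typescript
// Hypothetical minimal storage interface, inferred from the fetcher draft above.
interface QuestionStorage<Q> {
  save(question: Q): Promise<void>;
}

// Debug storage: logs questions instead of writing them to the DB / Algolia.
class ConsoleStorage<Q> implements QuestionStorage<Q> {
  public saved: Q[] = []; // kept around for inspection in tests / REPL

  async save(question: Q) {
    this.saved.push(question);
    console.log("would save:", JSON.stringify(question));
  }
}
```

Since the fetcher only sees the interface, swapping in ConsoleStorage (or a fixture-writing storage for tests) requires no changes to platform code.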

Stuff I'm still figuring out:

  • need to pass platform-specific credentials to the robot somehow
  • not sure if maxAge and repeatAfter are the right approach, still experimenting with this
  • need a mode for force-fetching even if a url was fetched recently; this can be hacked with DELETE FROM "Robot" WHERE platform = 'myplatform', not sure if we need anything more
  • how to deploy this (maybe it's a good time to move away from Heroku to a separate DO instance)
  • deletions are problematic (I'll expand on this in a comment)

@berekuk (Collaborator, Author) commented Jun 3, 2022

Note on deletions.

Several possible solutions to consider:

  1. check every question once in a while to confirm that it's still alive; maybe have a separate function in the Platform API for that, checkQuestion
  2. have a separate function, listAllQuestions, which returns the list of all questions that shouldn't be deleted, and delete everything else; this is problematic because for some platforms this function could be as expensive as crawling everything
  3. delete all questions which weren't updated recently; this is easy but too dangerous, we might delete too much stuff by accident

I lean towards (1), though I don't like that it'll require a significant amount of new code.
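For illustration, option (1) could look roughly like this. checkQuestion is the name proposed above; everything else, including how stale questions are selected, is hypothetical:

```typescript
type CheckResult = "alive" | "gone";

// Hypothetical extension of the Platform API for option (1).
interface PlatformChecker {
  // Platform-specific: e.g. re-fetch the question url and map a 404 to "gone".
  checkQuestion(url: string): Promise<CheckResult>;
}

// Re-check questions that haven't been confirmed alive recently
// and delete the ones the platform no longer knows about.
async function sweepDeleted(
  platform: PlatformChecker,
  staleQuestions: { url: string }[], // e.g. not checked in the last N days
  deleteQuestion: (url: string) => Promise<void>,
) {
  for (const q of staleQuestions) {
    if ((await platform.checkQuestion(q.url)) === "gone") {
      await deleteQuestion(q.url);
    }
  }
}
```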

@berekuk (Collaborator, Author) commented Jun 3, 2022

@NunoSempere I'd appreciate any feedback you have on this. I might be missing some corner cases, since I still haven't read the code for all the platforms carefully.

@NunoSempere (Collaborator) commented

Ok, looking at this, I don't understand what type of pattern the following is:

async fetcher({ robot, storage }) {
  ...
}

Should this be something like: async function fetcher?

No comments for now while I understand what the code is doing.

@berekuk (Collaborator, Author) commented Jun 3, 2022

It's a shorthand:

const obj = {
  foo: async () => {
  },
};

is nearly the same as

const obj = {
  async foo() {
  },
};

(strictly, the method shorthand is equivalent to foo: async function () {}; the arrow version differs only in its this binding)

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Functions/Method_definitions

@NunoSempere (Collaborator) commented

Ok, so looking through this I think I would have tended to do something much hackier, like saving the page for apis that implement pagination. Overall not sure how to judge this though; the approach is a bit more complicated and, as you mention, it will take some tweaks to make the robot conform to the different APIs of all the platforms.

@NunoSempere (Collaborator) commented

On question deletion, note that we do want to keep questions after they resolve, even if we don't show them on the frontpage.

@berekuk (Collaborator, Author) commented Jun 4, 2022

I would have tended to do something much hackier, like saving the page for apis that implement pagination

That would help with the interruptible metaculus fetcher, but the main reason for this PR is the future near-real-time capabilities, which are impossible to get with the current "once in 24 hours" approach.

On question deletion, note that we do want to keep questions after they resolve, even if we don't show them on the frontpage.

Right. The scenarios I can think of where deletion is necessary are:

  • we identify a question by its url and the platform changes the url due to a typo or something (causes a duplicate, not a big deal if we clean it up eventually)
  • platform posts a question by accident and revokes it
  • someone on an open platform posts something inappropriate and the platform admins delete it

@NunoSempere (Collaborator) commented

Makes sense
