Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Version file tree diff: design doc #11507

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

stsewd
Copy link
Member

@stsewd stsewd commented Jul 30, 2024

ref #11319

Manual linking to the preview since we don't have this feature yet :D https://dev--11507.org.readthedocs.build/en/11507/design/file-tree-diff.html


📚 Documentation previews 📚

@stsewd stsewd requested a review from a team as a code owner July 30, 2024 21:59
@stsewd stsewd requested a review from ericholscher July 30, 2024 21:59
# ['+ ore', '- one', '- two', ' three', ' four', '+ five']

A good thing of using Python is that we don't need to write the files to disk,
and the result is easier to parse.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably slower as well, but I think I'm liking the idea of doing the diff in Python.

$ rclone check --combined=- /usr/src/app/checkouts/readthedocs.org/a /usr/src/app/checkouts/readthedocs.org/b
+ new.txt
- deleted.txt
= unchanged.txt
Copy link
Member Author

@stsewd stsewd Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't find an option to exclude the unchanged files from the list, we could log each type to three different files if we want, but I like no having to deal with files...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, that would be nice, especially for very large projects since most files won't have changed. Could always do something like grep -v "= " or similar, but that still returns it, but we can avoid it in the Python process at least.

Comment on lines 158 to 161
A combination of using the DB and S3 would be really useful,
as the model will serve as the lock for the API,
and will be easier to keep track of which build was used to generate the diff
without having to download the diff file from S3.
Copy link
Member Author

@stsewd stsewd Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm liking the idea of using both, or even just store everything in DB, maybe automatically clean diffs that haven't been accessed in a while if we are worried about this taking a lot of space.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I would go with the DB-only approach. Dealing with static data on S3 is complicated, in particular if we need to perform queries on them or update its structure in the future.

The VersionDiff data is temporal so we can clean up it pretty often if we are worried about the size.

Comment on lines +208 to +209
- Expose this feature in the dashboard for now, and use our GitHub action to simply link to the dashboard.
Maybe don't even expose the API to the public, just use it internally.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like having this feature in the dashboard, but the "link me to relevant files changed in this PR" feature would be great for authors as well, which may not have access to the RTD dashboard.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we initially expose this feature in the Addons PR Build Warning message, which will only be shown on completed builds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that, but also is something we have 100% control over and we can play with it before going live and let users to use it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed we should use addons UI to start.

I think the build detail view could be another place eventually, as this is another place that maintainers will interact with RTD as they are reviewing. This could be for the next iteration though.

- Detect changes in sections of HTML files.
- Expand to other file types?
- Integrate with addons?
- Allow doing a diff between versions of different projects?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this would be useful, hmm.


The endpoint will be:

GET /api/v3/projects/{project_slug}/versions/{version_slug}/diff/{other_version_slug}/
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This endpoint could also be:

  • /api/v3/projects/project-slug/diff?version_a=one&version_b=foo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm liking more this endpoint. It's easier for clients to just change a query param than a path.

Comment on lines +210 to +212
- Use a custom `repository dispatch event <https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#repository_dispatch>`__
to trigger the action from our webhook. This requires the user to do some additional setup,
and for our webhooks to support custom headers.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want, we could automate this a little, and just ask the user for a GitHub token, and we create the webhook behind the scenes, and then ask them to use our GitHub action as a final step.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we're integrating this feature more, we should build a platform feature for it, and remove the GitHub Action approach. That way we can just have the comment update once the build is successful.

Copy link
Member Author

@stsewd stsewd Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will require using a GitHub app :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does seem like a good next step, but it doesn't sound like we need to start there either. Using addons to show this information is a good first step.

For a next iteration, we would need to decide what it would take to surface this information in GitHub -- using GitHub Actions or not. This doesn't feel immediately necessary though.

Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting all of this together. This is going in a great direction 💯

I left a few questions (that could be clarified in the document as well) and suggestions about the proposed implementation that I'd like to discuss more before moving forward with this.

It would be good to receive feedback from other folks as well, since this feature could be resources/costs intensive.

docs/dev/design/file-tree-diff.rst Show resolved Hide resolved
docs/dev/design/file-tree-diff.rst Show resolved Hide resolved
Lines changed between two files
-------------------------------

Once we have the list of files that changed, we can use a tool like ``diff`` to get the lines that changed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we have the list of changed files, will we download all those files and execute the diff over each of them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we looking to do with these line numbers/source diffs? How are we planning on making this diff something users get value from?

I feel like adding line numbers in the diffs is at very least not yet necessary, though I would probably say this is a part of this feature we don't quite need either.

I think more applicable is making sure that visual diffing has enough UI/UX to be highly useful in review -- easier navigation between chunks, etc. I don't see line numbers or source diffs needed for something like this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the count of lines changed could be useful to sort the changed files, so it's easy to know which files require the most attention.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, if the current date or commit is injected into each page, all pages will be marked as changed.

Copy link
Contributor

@agjohnson agjohnson Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, another thought, maybe our detection of changes should be some combination of user configuration and HTML parsing. Users give us the project's main content selector(s), or we have a reliable default, and we store hashes of this content inside this selector at the end of the build process? However even this might only help filter out file changes where dynamic content comes from the theme templates (commit id, date changed, etc in the footer/header).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we could use our search indexing code to get a better diff, that already parses all sections on the main content. We already have that information in ES (well, not for PR previews).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will still give the full list of files that changed, but the sorting will help the user to identify which files are more relevant.

A little bit, but in that UX we're still giving the user a huge list of files that changed, putting the burden on the user to figure out what is important. We should be aiming to give the user a minimal, accurate list of files changed by a PR. If a PR only makes small changes to some files, our UX should point the user to those files. Sorting only seems to be a work around for a number of cases, not a UX we actually want to give to users.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking something like "quick list of maybe relevant changes" (10 top files deleted/changed/added) and then a "view all changed files" link below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with that, if we can't distinguish between a small change an author makes and a small change that the build tool makes, sorting won't accurately give the top 10 files changed. Legitimate small changes wouldn't sort any higher up than many dynamic content changes.

I don't know if parsing HTML and only hashing nested content gets around this issue, but rclone will definitely be subject to this.

Comment on lines 130 to 139
{
"version_a": {"id": 1, "build": {"id": 1}},
"version_b": {"id": 2, "build": {"id": 2}},
"diff": {
"added": [{"file": "new.txt"}],
"removed": [{"file": "deleted.txt"}],
"modified": [{"file": "changed.txt", "lines": {"added": 1, "removed": 1}}]
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting to use a JSON field here on the database?

If that's the case, I'd propose to use a JSON field only for the diff attribute since we will want to perform queries quickly over the version that are compared:

class VersionDiff:
  version_a = models.ForeignKey(Version)
  version_b = models.ForeignKey(Version)
  diff = models.JSONField()

That will simplify the queries a lot in the future and make the faster as well, due to the indexes created for the version fields.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example for the storing this in S3, for the DB version it will be FKs, we also need to link to the builds of each version.

docs/dev/design/file-tree-diff.rst Outdated Show resolved Hide resolved
}
}

The version and build can be the full objects, or just the IDs and slugs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the complete Version objects to keep consistency with all the other endpoints, yes 👍🏼

docs/dev/design/file-tree-diff.rst Show resolved Hide resolved
The version and build can be the full objects, or just the IDs and slugs.

We will generate a lock on this request, to avoid multiple calls to the API for the same versions.
We can reply with a ``202 Accepted`` if the diff is being calculated in another request.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "being calculated in another request"? I understand the VersionDiff data is going to be pre-calculated after build succeeds, so the data will be always ready for the endpoint.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We won't run the diff on every build, only on demand. Running on every build will be expensive if not all projects need or use this feature.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't start with the complexity added of running it on demand. That would also degrade the user experience that have to wait for it to finish after opening the PR preview.

We should start building this feature to be run after each successful build in a Celery task for those project with PR builds enabled and grow from there. That will give us a better understanding of the costs of this feature. We had a similar thought some time ago when we built PR builds and it ended not being that expensive and being one of our top features.

I'd apply the same thought logic here and keep the initial implementation simple and user friendly, providing the best UX we can.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't start with the complexity added of running it on demand. That would also degrade the user experience that have to wait for it to finish after opening the PR preview.

Don't think complexity is a problem here, triggering a task after each build or calling that task when the API is requested are not a complex task.

We should start building this feature to be run after each successful build in a Celery task for those project with PR builds enabled and grow from there. That will give us a better understanding of the costs of this feature.

If we do that, we are starting big, running this task on each PR preview, for everyone.

I'd apply the same thought logic here and keep the initial implementation simple and user friendly, providing the best UX we can.

Depending on how this will integrate with the GitHub action, users will still need to wait for the build to finish in order to be able to retrieve the diff, there is no downside by doing this on demand if used with the GH pooling this API every x seconds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was picturing we'd do what @humitos is describing here. This feels like it should be a build time artifact, mainly because there doesn't really feel like a use case that benefits from this being async.

If we're worried about performance and cost of s3 calls, I would suggest we consider storing a file manifest for each build with file hashes instead of using rclone at all.

Copy link
Member

@humitos humitos Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're worried about performance and cost of s3 calls, I would suggest we consider storing a file manifest for each build with file hashes instead of using rclone at all.

This is, actually, a great idea and not only for cost of s3 calls. We already have the modeling for this: we can expand the HTMLFile model to add a hash512 or hashmd5 field there. Then use that field to compare against versions when hitting the API endpoint. All data will be in the database and we won't require hitting S3 at all -- that should be fast.

Having all this data will allow us to compare any version against any other version in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aren't tracking all HTML files in the DB anymore, that was taking a lot of space in our DB. We are only tracking the existence of index.html files. Saving a manifest seems good, maybe save it as a field in the DB for the latest build only, but also having that S3 should be ok as well.

Comment on lines +192 to +199
Integrations
------------

You may be thinking that once we have an API, it will be just a matter of calling that API from a GitHub action. Wrong!

Doing the API call is easy, but knowing *when* to call it is hard.
We need to call the API after the build has finished successfully,
or we will be comparing the files of an incomplete or stale build.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't worry too much about this yet. Jumping into this terrain at this stage will make everything more complex from the beginning. I think the first step here is having the diff data stored in the database and the API endpoint that returns the diff if the data exists.

The PR preview only works if the build has succeed. So, we will always hit the endpoint when the data is available already. That's not a problem.

The GitHub action could use a simple polling technique for now: polling the build API every X seconds to know if the build has finished. In case it succeed, call the diff API to update the message.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR preview only works if the build has succeed. So, we will always hit the endpoint when the data is available already. That's not a problem.

The action doesn't know when the build finished, that's the problem.

The GitHub action could use a simple polling technique for now: polling the build API every X seconds to know if the build has finished. In case it succeed, call the diff API to update the message.

Could be an option, but it may take a while for some builds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably we could just pass a commit to the API, and confirm we're getting data from that commit, and then poll if not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API will return the build object, which has the commit. But polling will still be an issue for long builds, the action will need to keep running until it times out or a diff is returned.

Comment on lines +218 to +219
But since we are doing this on demand, and we can cache the results, we can minimize the costs
(maybe is not that much).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, are you considering to trigger a task on the background when the API endpoint is hit the first time?

I thought we wanted to pre- calculate the diff only for projects with PR build enabled after a build for the PR succeeded. I think that all (most) users using PR builds will use this feature; so it makes sense to me to do it automatically at this point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd agree with @humitos here, but also, let's optimize for cost savings later. We can gauge if this additional usage is really meaningful compared to engineering and maintaining a complex system.

@ericholscher ericholscher linked an issue Jul 31, 2024 that may be closed by this pull request
Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a lot of thinking about the logistics of managing file line count diffing, but I think we really just want to start with the actual files changed. The tradeoff between the features provided by knowing the number of lines changes (smarter sorting) vs. complexity of the feature feels like it's not worth it to me.

I think v1 of this is just using rclone, and showing the files changed. If we keep the functionality lightweight a ton of the complexity here seems like it goes away, we can run it on all PR builds, and we can figure out other approaches in the future for file diffing.

I'd recommend that we investigate mapping the changes files to the git repo changes before we do any of the proposed solutions where we download files from S3, but that should definitely be in a v2 after we do the initial work here of making a useful feature without sorting.

docs/dev/design/file-tree-diff.rst Show resolved Hide resolved
Git providers may already offer a way to compare file trees, but again,
they work on the source files, and not on the generated files.

All hope was lost for having nice features like this, until now.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥼 Feels like the start of a Sci-Fi plot :D

docs/dev/design/file-tree-diff.rst Outdated Show resolved Hide resolved
$ rclone check --combined=- /usr/src/app/checkouts/readthedocs.org/a /usr/src/app/checkouts/readthedocs.org/b
+ new.txt
- deleted.txt
= unchanged.txt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, that would be nice, especially for very large projects since most files won't have changed. Could always do something like grep -v "= " or similar, but that still returns it, but we can avoid it in the Python process at least.


To start, we will only consider HTML files (``--include=*.html``).

Lines changed between two files
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced we need the lines changed in v1. I'd like to avoid this overhead if we can -- I think the start should just be the rclone approach, since sorting the files is a "nice to have" and not required for any of the functionality we build. It would be an obvious v2 feature once we see how much people are using it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, though I'm not really sure where line differences really fits in on RTD, even in a next iteration. I'm not sure there is a good UX we can give with line numbers in a diffs, especially where our diffs will not align with changes from a PR -- because a PR base branch might not be the same as our default version.

I think the source file diffing is best left to GitHub and we should focus solely on the rendered diff experience.

API
---

This operation is expensive, so it won't be available to unauthenticated users.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree making this authed seems like a bad place to start. I'd much rather make the endpoint be less expensive. I don't really understand why this would be slow if we already have VersionDiff objects in the DB? It seems like it should just be a DB query.

Comment on lines +192 to +199
Integrations
------------

You may be thinking that once we have an API, it will be just a matter of calling that API from a GitHub action. Wrong!

Doing the API call is easy, but knowing *when* to call it is hard.
We need to call the API after the build has finished successfully,
or we will be comparing the files of an incomplete or stale build.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably we could just pass a commit to the API, and confirm we're getting data from that commit, and then poll if not?

Comment on lines +208 to +209
- Expose this feature in the dashboard for now, and use our GitHub action to simply link to the dashboard.
Maybe don't even expose the API to the public, just use it internally.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we initially expose this feature in the Addons PR Build Warning message, which will only be shown on completed builds.

Comment on lines +210 to +212
- Use a custom `repository dispatch event <https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#repository_dispatch>`__
to trigger the action from our webhook. This requires the user to do some additional setup,
and for our webhooks to support custom headers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we're integrating this feature more, we should build a platform feature for it, and remove the GitHub Action approach. That way we can just have the comment update once the build is successful.

docs/dev/design/file-tree-diff.rst Show resolved Hide resolved
@humitos
Copy link
Member

humitos commented Aug 6, 2024

I just checked the latest commit with the updates and I feel there are still some things I'd like to discuss and to be reflected in this document:

  • Discard the idea of using S3 to save the data and only use the database.
  • Mention potential sorting of files changed based on PageViews.
  • Discard the sorting based on lines changed for the v1 implementation.
  • Use a VersionDiff database model that points to two different Build objects.
  • Keep the endpoint un-authed and leave the authed version for v2.
  • Change the endpoint URL to use query parameters.
  • Always generate the VersionDiff on successful builds instead of in a background task pattern the first time the API endpoint is hit ("lock on request").
  • Limit the API endpoint to only work over external versions for v1.
  • Always return full objects in the API response instead of only IDs/slugs (following the APIv3 pattern).
  • Add it under a feature flag so we can roll it out slowly and start testing the new feature in our own projects first. That may help us to estimate the impact into S3 queries.

@stsewd
Copy link
Member Author

stsewd commented Aug 6, 2024

Discard the idea of using S3 to save the data and only use the database.

I'm +1 on that, but wanted to see other's opinions

Mention potential sorting of files changed based on PageViews.

I'm not really sold on this idea, the order will be kind of random and unexpected for the changes done on the PR.

Discard the sorting based on lines changed for the v1 implementation.

I'm okay leaving that for later

Use a VersionDiff database model that points to two different Build objects.

It should also point to the versions, since builds aren't deleted when a version is deleted. Or we can manually delete diff objects once a version is deleted.

Keep the endpoint un-authed and leave the authed version for v2.

That will be a breaking change, we should decide if we want to have it under auth or not.

Always generate the VersionDiff on successful builds instead of in a background task pattern the first time the API endpoint is hit ("lock on request").

This is assuming we will only run a diff between two sets of versions, PR preview and latest? or stable? I still think it's useful to allow diffing between any versions on demand.


From Eric's comments, I think we don't want to expose this as an API yet anyway, but only from the addons endpoint. Which again, requires us to choose which version we will make the comparison over, I guess we will choose the one that the default branch points to.

@humitos
Copy link
Member

humitos commented Aug 6, 2024

Keep the endpoint un-authed and leave the authed version for v2.

That will be a breaking change, we should decide if we want to have it under auth or not.

I'm saying that we can implement the authed version only for Read the Docs for Business to access private projects. For now, we can start implementing this feature on Read the Docs for Community to get some experience/feedback and grow from there for the v2.

requires us to choose which version we will make the comparison over, I guess we will choose the one that the default branch points to.

We should use the same setting we are already using for DocDiff, which is currently LATEST. In the future, we should allow to configure the version to compare against from the addons configuration.

This is assuming we will only run a diff between two sets of versions, PR preview and latest? or stable? I still think it's useful to allow diffing between any versions on demand.

There are good points for each of the different approaches. I think this decision is important since it will dictate how we and other people will use this feature. @ericholscher @agjohnson what are your thoughts on each of these different approaches?

@ericholscher
Copy link
Member

ericholscher commented Aug 6, 2024

It doesn't really seem like we need to make a decision about diffing different versions now? We just ship with the initial implementation that diffs the PR build against the latest version? We can always add some way to trigger diffing of arbitrary versions in the future, but it sounds like we don't need to do that in the initial implementation ("v1")?

@humitos
Copy link
Member

humitos commented Aug 7, 2024

@ericholscher what's your position about 1) implementing the diffing as a background task and requiring hitting the API twice to get the data, or 2) always perform the diffing at build time and hit the API endpoint only once always getting the same response?

@stsewd
Copy link
Member Author

stsewd commented Aug 7, 2024

  1. implementing the diffing as a background task and requiring hitting the API twice to get the data

For clarification, the first call will trigger the task and wait for the result.

@ericholscher
Copy link
Member

It seems pretty straightforward to support both the situation where we have the files on disk, or need to query S3 for a set of them. I think to start we do it during build time while we have the files on disk for PR builds?

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes to the doc look good to me 👍

Comment on lines 135 to 136
build_a = models.ForeignKey(Build, on_delete=models.CASCADE, related_name='diff_a')
build_b = models.ForeignKey(Build, on_delete=models.CASCADE, related_name='diff_b')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do wonder if this should be a build object, or just a commit hash? I guess they work similarly, but there could be multiple different versions on the same commit, so it would be nice to not have to duplicate that work. However, we can also pretty easily get the commit via the Build.

Not 100% sure what is best here, but perhaps something to consider?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A build can be triggered for the same commit and still change the output, say it takes into consideration the time or an env variable.

Copy link
Contributor

@agjohnson agjohnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The immediate plan here seems reasonable 👍

It seems we have the most questions on future plans for next iterations, but this seems like discussion we can come back to. There are some technical implementations in this document that can be dialed back a good deal to match what we want to build initially, and for the medium term -- for example, async execution of diffing and line diffing incur the most complexity that we don't yet need. Perhaps it's worth clarifying here before we get to building this.


To start, we will only consider HTML files (``--include=*.html``).

Lines changed between two files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, though I'm not really sure where line differences really fits in on RTD, even in a next iteration. I'm not sure there is a good UX we can give with line numbers in a diffs, especially where our diffs will not align with changes from a PR -- because a PR base branch might not be the same as our default version.

I think the source file diffing is best left to GitHub and we should focus solely on the rendered diff experience.

Lines changed between two files
-------------------------------

Once we have the list of files that changed, we can use a tool like ``diff`` to get the lines that changed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we looking to do with these line numbers/source diffs? How are we planning on making this diff something users get value from?

I feel like adding line numbers in the diffs is at very least not yet necessary, though I would probably say this is a part of this feature we don't quite need either.

I think more applicable is making sure that visual diffing has enough UI/UX to be highly useful in review -- easier navigation between chunks, etc. I don't see line numbers or source diffs needed for something like this.

Comment on lines +146 to +147
- Once we have the diff between versions ``A`` and ``B``, we can infer the diff between ``B`` and ``A``.
We can store that information as well, or just calculate it on the fly.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some edge cases where users might want to compare two non-default versions, but this will be a rare use case.

This sort of comparison feels like a better tool for git/GitHub than something we maintain. I'd agree with the above that comparing against only the default version is definitely where we should start, but I suspect there isn't much incentive for us to complicate this feature too much more than that.

Either way, the UI for linking to files that changed is only going to be visible on PR builds -- either as a comment/description in the PR or via a hovering UI in the PR built documentation. We don't have either option for non-PR build versions.

The version and build can be the full objects, or just the IDs and slugs.

We will generate a lock on this request, to avoid multiple calls to the API for the same versions.
We can reply with a ``202 Accepted`` if the diff is being calculated in another request.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was picturing we'd do what @humitos is describing here. This feels like it should be a build time artifact, mainly because there doesn't really feel like a use case that benefits from this being async.

If we're worried about performance and cost of s3 calls, I would suggest we consider storing a file manifest for each build with file hashes instead of using rclone at all.

Comment on lines +208 to +209
- Expose this feature in the dashboard for now, and use our GitHub action to simply link to the dashboard.
Maybe don't even expose the API to the public, just use it internally.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed we should use addons UI to start.

I think the build detail view could be another place eventually, as this is another place that maintainers will interact with RTD as they are reviewing. This could be for the next iteration though.

Comment on lines +210 to +212
- Use a custom `repository dispatch event <https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#repository_dispatch>`__
to trigger the action from our webhook. This requires the user to do some additional setup,
and for our webhooks to support custom headers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does seem like a good next step, but it doesn't sound like we need to start there either. Using addons to show this information is a good first step.

For a next iteration, we would need to decide what it would take to surface this information in GitHub -- using GitHub Actions or not. This doesn't feel immediately necessary though.

- Expose this feature only via the addons feature.
- Allow to diff an external version against the version that points to the default branch/tag of the project only.
- Use a feature flag to enable this feature on projects.
- Run the diff while we have the files on disk (end of the build), if possible.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, regarding all the discussion above, this does match where we've discussed starting with this feature 👍

I think this is a medium term plan too. Your ideas on how to build on top of this feature sounds like we have some ways to iterate on this feature if we want, but I think this plan gives users the most useful feature we can without building something that is too complex for us to maintain well.

Comment on lines +218 to +219
But since we are doing this on demand, and we can cache the results, we can minimize the costs
(maybe is not that much).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd agree with @humitos here, but also, let's optimize for cost savings later. We can gauge if this additional usage is really meaningful compared to engineering and maintaining a complex system.

@stsewd
Copy link
Member Author

stsewd commented Aug 14, 2024

If we're worried about performance and cost of s3 calls, I would suggest we consider storing a file manifest for each build with file hashes instead of using rclone at all.

I like this idea

@ericholscher
Copy link
Member

ericholscher commented Aug 15, 2024

If we're worried about performance and cost of s3 calls, I would suggest we consider storing a file manifest for each build with file hashes instead of using rclone at all.

I like this idea

We already built this, and then ripped it out :)

#8111

@agjohnson
Copy link
Contributor

We already built this, and then ripped it out :)

Heh yeah, though we ripped it out because it was expensive and we also weren't using it. I don't feel we'd need a per-file model for this either way, a single manifest file seems reasonable.

I am starting to feel that rclone is the wrong solution to build on top of. Rclone only does comparison at the file metadata level, which is only the bare minimum we need. It still requires us to do secondary comparison to avoid giving the user a long list of files that haven't actually changed.

This may or may not require something like a hash manifest file, but definitely requires we do something on top of rclone.


Some prior art to look at here is sitediff:

A few different options it has for avoiding false positive diffs are:

So, really quite advanced, and obviously way more than we'd build. But it shows the difficulty of comparing two sets of HTML and some of the solutions necessary. Selector targeting seems like the easiest to accomplish.


I think it would help to see actual output of rclone between two versions of our docs, maybe another project too. We've only discussed many of the plans here theoretically so far, it would help to figure out exactly what we're going to be working with.

@agjohnson
Copy link
Contributor

I gave a quick test to sitediff, it's a neat tool. I didn't want to crawl our prod docs, so just did the dev docs. I'd expect no changes between these two versions.

Unconfigured, the output is what I expected from rclone -- quite noisy:

image

Diffs show a lot of noise outside of content changes in all these files, any place latest, stable, or a commit is compared. This would be comparable to using rclone to detect changes in files.

Using a configuration with selector: "div.document[role='main']" the results are far cleaner:

image

The noisy changes with version names and commit hashs are larger avoided, but the diffs here are mostly from intersphinx altering the link titles. Unfortunately, these are also invisible changes to the user, so the UX of telling users these files changed and them linking them to a doc page where nothing looks changed is going to be something we need to work around.

The second is better, but I am still expecting no changes from this at all, so not ideal still.

@humitos
Copy link
Member

humitos commented Aug 16, 2024

@agjohnson the research around sitediff you've done is great! It does almost exactly what we want to build here. There are really good ideas and thoughts to consider there.

I did a quick test by myself as well and it worked great, finding differences only on one file after defining the selector 👍🏼

The noisy changes with version names and commit hashs are larger avoided, but the diffs here are mostly from intersphinx altering the link titles.

They mention similar issues on their README file and solve them by using sanitization config.


That said, I think we should continue discussing what's the path we want to follow here since there are a few possible implementations opened already and it seems there are more research we can do before making the next step.

@agjohnson
Copy link
Contributor

Yeah, and I wouldn't build exactly what sitediff built but testing with it does highlight what we can expect from similar approaches we're discussing. We'll hit these hurdles too, and might find ourselves needing similar features -- not that I think we should build something this complex.

Agreed we should continue discussing this more. I don't have a great feeling for where we do go, but it's feeling like we'll need to give more user control than file-based comparison.

I suppose to echo what we're doing with output paths, we could expect the build to output a manifest. We could make this a job step that the user can override and apply advanced transform rules if they want.

A left field option could be using docdiff to do the comparisons, and crawling each site. This is back to the async issue though and could be slow. However, it would be much better UX in that the list of files would align with what docdiff would give the user visually.

@humitos
Copy link
Member

humitos commented Aug 20, 2024

I thought more about this. I started thinking about "build an integration with sitediff and use it behind the scenes to get the list of files changed". I took a look at this and it seems it's an application on itself and doesn't allow to be used "as a library" or as an "external command" to grab all the metadata/data it generates and uses to show those nice pages. Besides, it's written in Ruby and it will require setting up another environment just for this and also find a way to interact with our Python code... I discarded this idea after this quick research.


Then, I went back to the origins of all this discussion: "we need a way to map a source file into a URL" 💡

If we are thinking of giving users the ability to somehow configure the "file tree diff" feature to be able to detect what are the changes between two rendered HTML by defining what's the main selector, and potentially allowing them to define other selectors to ignore some chunks, and... blah blah -- I would like to propose some a lot easier for everybody: why not just ask them "how to map a source file into a URL" 🪄 ?

Knowing that, we can use the right tool designed to compare files that we all know: Git. With that information, we can then run the mapping the user provided us 1 over each of the changed files to get the URL. Git will tell us:

  • what are the files that changed
  • how many lines have changed on each file (useful for sorting)
  • ... a lot more useful data -- it's Git

There are two cons that I'm seeing here:

  • we need to ask the user for a mapping 2
  • we won't be able to find changes outside source files (e.g. template changes) 3

In my opinion, this looks like a good idea to discuss and explore more compared to the alternatives we have been discussing. It may not be perfect, but looks a lot easier to understand, to implement, and to explain to users.

I'd love to know your feedback about this. Where do you see potential issues? What do you like? What do you dislike? How do you compare it with the alternatives?

Footnotes

  1. we could have pre-defined regex for documentation tools we already know MkDocs, Sphinx, etc or let the user to create their custom one. We could add more pre-defined values for those new tools as we grow.

  2. not a problem if we have a list of pre-defined ones, tho

  3. even with something like sitediff this would communicate changes everywhere

@agjohnson
Copy link
Contributor

I discarded this idea after this quick research.

Agreed. And sorry if I wasn't clear above but I was just testing sitediff, I also did not want to build on top of this tool. It just already does much of what we described building, so worth comparing at least.

why not just ask them "how to map a source file into a URL" 🪄 ?

I also had this same thought path. Though I was also considering building on top of what files PRs list as changed. I realized that this is not a good option as the project might not build the PR base branch or the version might not match.

Relying on Git to detect changes is a closer possibility. I would still vote to keep this as a backup option for now. It requires us maintaining more custom Git operations and the user has to give us a mapping.

I could see us requiring a mapping file in the build, or we might even use a <meta name="source-file" value="docs/install/index.rst" /> in the built HTML.

However, after thinking through source file mapping, and requiring the user tell us how to map changes, I came all the way back to just asking the user to tell us what changed:

I suppose to echo what we're doing with output paths, we could expect the build to output a manifest. We could make this a job step that the user can override and apply advanced transform rules if they want.

That is, projects would have to output a hash manifest file of changes to HTML files in their build:

  • To start, our build job would just hash whole files. This is just an easy way to move forward for now and gives us noisy output similar to rclone.
  • Our default job would eventually support Sphinx/Mkdocs output generically. We hash just the contents of parsed HTML div.document[role="main"] etc -- we have some good defaults for the majority of projects.
  • Projects can override this job and hash content however they please, performing transform steps to content, etc. There just needs to be a _readthedocs/html/checksum.md5/etc at the end of the build.

The plan above for describing differences is still just pulling the two checksum files and noting what doesn't match. We have the page URLs in the checksum file, can note page deletions/additions, and when our defaults don't work the user has full control.

It does seem like we should involve the user at some point, at least for some portion of projects.

@ericholscher
Copy link
Member

ericholscher commented Sep 4, 2024

We discussed this in the call today. Takeaways:

  • We need to diff HTML, because there are a number of edge cases where content can change based on outside files (eg. include files, code API rendering, etc.)
  • Doing file diffs is too noisy because of metadata changes (eg. dates, commit hashes, etc.)
  • We should be doing content diffs, where content is defined by a content CSS selector. This should be user configurable in v2, but for v1, we can use our existing content guessing logic from search and other places.
  • We should store the content hash on the ImportedFile on build, as well as support the build outputting them into a file ($READTHEDOCS_OUTPUT/readthedocs-hashes.json).
  • When we want to compare 2 verisons, we can compare content hashes, and don't need to re-read the full file contents from S3.

@agjohnson
Copy link
Contributor

For a first pass, I feel we should do whatever is quickest so that we can work on the UX here too. This might still be file based hashing just to start experimenting even. Relying on our search code to break down a DOM to text only representation feels like a second or third pass maybe?

Thinking about text representation vs DOM comparison more, DOM comparison seems like where we should be ultimately though. Users should be alerted on changes like image URLs, HTML classes, and non-text DOM changes, not just text changes. Maybe text-only mode is a configurable option later though.

@stsewd
Copy link
Member Author

stsewd commented Sep 4, 2024

This might still be file based hashing just to start experimenting even. Relying on our search code to break down a DOM to text only representation feels like a second or third pass maybe?

I think we can just start with extracting the main content of the file (we already do this for search) and just hash it as is.

I have updated the design doc to include things that we discussed, I left the old ideas in case they are useful, but I'm fine removing them.

@agjohnson
Copy link
Contributor

I think we can just start with extracting the main content of the file (we already do this for search) and just hash it as is.

Yeah, I think we're talking about the same thing. I'm saying we avoid rendering the DOM to text and instead hash the inside contents of an element inside the HTML document. I'm not familiar with what we are technically doing with search indexing.

Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update looks good to me. I think there are more explicit descriptions to be done still because I'm still not sure we are all in the same page. I left some suggestion to explicitly mention what we discuss in the meeting as way to be sure of that.

Besides those comments in the PR,

  • The document mentions rclone, but I understand we are not going to use it at all for this. If we want to keep that section, we should add a note or similar saying that this was considered originally but we are not going to implement it.
  • Same for "Lines changed between two files" section, since we won't implement that approach.
  • VersionDiff object from "Storing results" is not required anymore; since we will use ImportedFile with some modifications if required. So, that section may need to be re-written/updated/removed.

Also, with the new implementation discussed for v1; how would be the API response? Do we need to adjust anything there?

Comment on lines +57 to +58
When a build finishes, we generate this manifest for all HTML files, and store it.
When we need to compare two versions, we can just compare the manifests.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand these manifest will hash only the "main content" of the document. We should mention that here.

When a build finishes, we generate this manifest for all HTML files, and store it.
When we need to compare two versions, we can just compare the manifests.

This doesn't require downloading the files, but it requires building a version to generate the manifest.

Using rclone
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need rclone at all now with the new proposal? I understand we won't use rclose for anything and only base our diffing using the manifest.

Comment on lines +276 to +279
- Generate a manifest of all HTML files from the versions that we want to compare.
This will be done at the end of the build.
- Generate the hash based on the main content of the file,
not the whole file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going to create two different manifest files? 🤔

Comment on lines +282 to +283
- Don't store the results in the DB,
we can store the results in a next iteration.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What results are you referencing here? Didn't we say to store the hash in ImportFile or similar object?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

File tree diff API
4 participants