Version file tree diff: design doc #11507

stsewd · 2024-07-30T21:59:50Z

Manual linking to the preview since we don't have this feature yet :D https://dev--11507.org.readthedocs.build/en/11507/design/file-tree-diff.html

📚 Documentation previews 📚

User's documentation (docs): https://docs--11507.org.readthedocs.build/en/11507/

Developer's documentation (dev): https://dev--11507.org.readthedocs.build/en/11507/

ref #11319

stsewd · 2024-07-30T22:15:56Z

docs/dev/design/file-tree-diff.rst

+   # ['+ ore', '- one', '- two', '  three', '  four', '+ five']
+
+A good thing of using Python is that we don't need to write the files to disk,
+and the result is easier to parse.


It's probably slower as well, but I think I'm liking the idea of doing the diff in Python.

stsewd · 2024-07-30T22:17:49Z

docs/dev/design/file-tree-diff.rst

+   $ rclone check --combined=- /usr/src/app/checkouts/readthedocs.org/a /usr/src/app/checkouts/readthedocs.org/b
+   + new.txt
+   - deleted.txt
+   = unchanged.txt


Didn't find an option to exclude the unchanged files from the list, we could log each type to three different files if we want, but I like no having to deal with files...

Yea, that would be nice, especially for very large projects since most files won't have changed. Could always do something like grep -v "= " or similar, but that still returns it, but we can avoid it in the Python process at least.

stsewd · 2024-07-30T22:19:48Z

docs/dev/design/file-tree-diff.rst

+A combination of using the DB and S3 would be really useful,
+as the model will serve as the lock for the API,
+and will be easier to keep track of which build was used to generate the diff
+without having to download the diff file from S3.


I'm liking the idea of using both, or even just store everything in DB, maybe automatically clean diffs that haven't been accessed in a while if we are worried about this taking a lot of space.

Yeah, I would go with the DB-only approach. Dealing with static data on S3 is complicated, in particular if we need to perform queries on them or update its structure in the future.

The VersionDiff data is temporal so we can clean up it pretty often if we are worried about the size.

stsewd · 2024-07-30T22:22:12Z

docs/dev/design/file-tree-diff.rst

+- Expose this feature in the dashboard for now, and use our GitHub action to simply link to the dashboard.
+  Maybe don't even expose the API to the public, just use it internally.


I like having this feature in the dashboard, but the "link me to relevant files changed in this PR" feature would be great for authors as well, which may not have access to the RTD dashboard.

I think we initially expose this feature in the Addons PR Build Warning message, which will only be shown on completed builds.

Yeah, that, but also is something we have 100% control over and we can play with it before going live and let users to use it.

Agreed we should use addons UI to start.

I think the build detail view could be another place eventually, as this is another place that maintainers will interact with RTD as they are reviewing. This could be for the next iteration though.

docs/dev/design/file-tree-diff.rst

stsewd · 2024-07-30T22:26:50Z

docs/dev/design/file-tree-diff.rst

+- Detect changes in sections of HTML files.
+- Expand to other file types?
+- Integrate with addons?
+- Allow doing a diff between versions of different projects?


Not sure if this would be useful, hmm.

stsewd · 2024-07-30T22:28:13Z

docs/dev/design/file-tree-diff.rst

+
+The endpoint will be:
+
+   GET /api/v3/projects/{project_slug}/versions/{version_slug}/diff/{other_version_slug}/


This endpoint could also be:

/api/v3/projects/project-slug/diff?version_a=one&version_b=foo

I'm liking more this endpoint. It's easier for clients to just change a query param than a path.

stsewd · 2024-07-30T22:31:25Z

docs/dev/design/file-tree-diff.rst

+- Use a custom `repository dispatch event <https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#repository_dispatch>`__
+  to trigger the action from our webhook. This requires the user to do some additional setup,
+  and for our webhooks to support custom headers.


If we want, we could automate this a little, and just ask the user for a GitHub token, and we create the webhook behind the scenes, and then ask them to use our GitHub action as a final step.

I think if we're integrating this feature more, we should build a platform feature for it, and remove the GitHub Action approach. That way we can just have the comment update once the build is successful.

That will require using a GitHub app :)

That does seem like a good next step, but it doesn't sound like we need to start there either. Using addons to show this information is a good first step.

For a next iteration, we would need to decide what it would take to surface this information in GitHub -- using GitHub Actions or not. This doesn't feel immediately necessary though.

humitos

Thanks for putting all of this together. This is going in a great direction 💯

I left a few questions (that could be clarified in the document as well) and suggestions about the proposed implementation that I'd like to discuss more before moving forward with this.

It would be good to receive feedback from other folks as well, since this feature could be resources/costs intensive.

docs/dev/design/file-tree-diff.rst

humitos · 2024-07-31T10:32:21Z

docs/dev/design/file-tree-diff.rst

+Lines changed between two files
+-------------------------------
+
+Once we have the list of files that changed, we can use a tool like ``diff`` to get the lines that changed.


Once we have the list of changed files, will we download all those files and execute the diff over each of them?

What are we looking to do with these line numbers/source diffs? How are we planning on making this diff something users get value from?

I feel like adding line numbers in the diffs is at very least not yet necessary, though I would probably say this is a part of this feature we don't quite need either.

I think more applicable is making sure that visual diffing has enough UI/UX to be highly useful in review -- easier navigation between chunks, etc. I don't see line numbers or source diffs needed for something like this.

Having the count of lines changed could be useful to sort the changed files, so it's easy to know which files require the most attention.

For example, if the current date or commit is injected into each page, all pages will be marked as changed.

Also, another thought, maybe our detection of changes should be some combination of user configuration and HTML parsing. Users give us the project's main content selector(s), or we have a reliable default, and we store hashes of this content inside this selector at the end of the build process? However even this might only help filter out file changes where dynamic content comes from the theme templates (commit id, date changed, etc in the footer/header).

Yeah, we could use our search indexing code to get a better diff, that already parses all sections on the main content. We already have that information in ES (well, not for PR previews).

We will still give the full list of files that changed, but the sorting will help the user to identify which files are more relevant.

A little bit, but in that UX we're still giving the user a huge list of files that changed, putting the burden on the user to figure out what is important. We should be aiming to give the user a minimal, accurate list of files changed by a PR. If a PR only makes small changes to some files, our UX should point the user to those files. Sorting only seems to be a work around for a number of cases, not a UX we actually want to give to users.

I'm thinking something like "quick list of maybe relevant changes" (10 top files deleted/changed/added) and then a "view all changed files" link below.

Even with that, if we can't distinguish between a small change an author makes and a small change that the build tool makes, sorting won't accurately give the top 10 files changed. Legitimate small changes wouldn't sort any higher up than many dynamic content changes.

I don't know if parsing HTML and only hashing nested content gets around this issue, but rclone will definitely be subject to this.

humitos · 2024-07-31T10:36:09Z

docs/dev/design/file-tree-diff.rst

+   {
+       "version_a": {"id": 1, "build": {"id": 1}},
+       "version_b": {"id": 2, "build": {"id": 2}},
+       "diff": {
+           "added": [{"file": "new.txt"}],
+           "removed": [{"file": "deleted.txt"}],
+           "modified": [{"file": "changed.txt", "lines": {"added": 1, "removed": 1}}]
+       }
+   }
+


Are you suggesting to use a JSON field here on the database?

If that's the case, I'd propose to use a JSON field only for the diff attribute since we will want to perform queries quickly over the version that are compared:

class VersionDiff: version_a = models.ForeignKey(Version) version_b = models.ForeignKey(Version) diff = models.JSONField()

That will simplify the queries a lot in the future and make the faster as well, due to the indexes created for the version fields.

This is an example for the storing this in S3, for the DB version it will be FKs, we also need to link to the builds of each version.

docs/dev/design/file-tree-diff.rst

humitos · 2024-07-31T10:57:13Z

docs/dev/design/file-tree-diff.rst

+       }
+   }
+
+The version and build can be the full objects, or just the IDs and slugs.


We should use the complete Version objects to keep consistency with all the other endpoints, yes 👍🏼

docs/dev/design/file-tree-diff.rst

humitos · 2024-07-31T11:02:43Z

docs/dev/design/file-tree-diff.rst

+The version and build can be the full objects, or just the IDs and slugs.
+
+We will generate a lock on this request, to avoid multiple calls to the API for the same versions.
+We can reply with a ``202 Accepted`` if the diff is being calculated in another request.


What do you mean by "being calculated in another request"? I understand the VersionDiff data is going to be pre-calculated after build succeeds, so the data will be always ready for the endpoint.

We won't run the diff on every build, only on demand. Running on every build will be expensive if not all projects need or use this feature.

I wouldn't start with the complexity added of running it on demand. That would also degrade the user experience that have to wait for it to finish after opening the PR preview.

We should start building this feature to be run after each successful build in a Celery task for those project with PR builds enabled and grow from there. That will give us a better understanding of the costs of this feature. We had a similar thought some time ago when we built PR builds and it ended not being that expensive and being one of our top features.

I'd apply the same thought logic here and keep the initial implementation simple and user friendly, providing the best UX we can.

I wouldn't start with the complexity added of running it on demand. That would also degrade the user experience that have to wait for it to finish after opening the PR preview.

Don't think complexity is a problem here, triggering a task after each build or calling that task when the API is requested are not a complex task.

We should start building this feature to be run after each successful build in a Celery task for those project with PR builds enabled and grow from there. That will give us a better understanding of the costs of this feature.

If we do that, we are starting big, running this task on each PR preview, for everyone.

I'd apply the same thought logic here and keep the initial implementation simple and user friendly, providing the best UX we can.

Depending on how this will integrate with the GitHub action, users will still need to wait for the build to finish in order to be able to retrieve the diff, there is no downside by doing this on demand if used with the GH pooling this API every x seconds.

I was picturing we'd do what @humitos is describing here. This feels like it should be a build time artifact, mainly because there doesn't really feel like a use case that benefits from this being async.

If we're worried about performance and cost of s3 calls, I would suggest we consider storing a file manifest for each build with file hashes instead of using rclone at all.

If we're worried about performance and cost of s3 calls, I would suggest we consider storing a file manifest for each build with file hashes instead of using rclone at all.

This is, actually, a great idea and not only for cost of s3 calls. We already have the modeling for this: we can expand the HTMLFile model to add a hash512 or hashmd5 field there. Then use that field to compare against versions when hitting the API endpoint. All data will be in the database and we won't require hitting S3 at all -- that should be fast.

Having all this data will allow us to compare any version against any other version in the future.

We aren't tracking all HTML files in the DB anymore, that was taking a lot of space in our DB. We are only tracking the existence of index.html files. Saving a manifest seems good, maybe save it as a field in the DB for the latest build only, but also having that S3 should be ok as well.

humitos · 2024-07-31T11:13:56Z

docs/dev/design/file-tree-diff.rst

+Integrations
+------------
+
+You may be thinking that once we have an API, it will be just a matter of calling that API from a GitHub action. Wrong!
+
+Doing the API call is easy, but knowing *when* to call it is hard.
+We need to call the API after the build has finished successfully,
+or we will be comparing the files of an incomplete or stale build.


I wouldn't worry too much about this yet. Jumping into this terrain at this stage will make everything more complex from the beginning. I think the first step here is having the diff data stored in the database and the API endpoint that returns the diff if the data exists.

The PR preview only works if the build has succeed. So, we will always hit the endpoint when the data is available already. That's not a problem.

The GitHub action could use a simple polling technique for now: polling the build API every X seconds to know if the build has finished. In case it succeed, call the diff API to update the message.

The PR preview only works if the build has succeed. So, we will always hit the endpoint when the data is available already. That's not a problem.

The action doesn't know when the build finished, that's the problem.

The GitHub action could use a simple polling technique for now: polling the build API every X seconds to know if the build has finished. In case it succeed, call the diff API to update the message.

Could be an option, but it may take a while for some builds.

Presumably we could just pass a commit to the API, and confirm we're getting data from that commit, and then poll if not?

The API will return the build object, which has the commit. But polling will still be an issue for long builds, the action will need to keep running until it times out or a diff is returned.

humitos · 2024-07-31T11:21:19Z

docs/dev/design/file-tree-diff.rst

+But since we are doing this on demand, and we can cache the results, we can minimize the costs
+(maybe is not that much).


Hrm, are you considering to trigger a task on the background when the API endpoint is hit the first time?

I thought we wanted to pre- calculate the diff only for projects with PR build enabled after a build for the PR succeeded. I think that all (most) users using PR builds will use this feature; so it makes sense to me to do it automatically at this point.

see #11507 (comment)

I'd agree with @humitos here, but also, let's optimize for cost savings later. We can gauge if this additional usage is really meaningful compared to engineering and maintaining a complex system.

ericholscher

This seems like a lot of thinking about the logistics of managing file line count diffing, but I think we really just want to start with the actual files changed. The tradeoff between the features provided by knowing the number of lines changes (smarter sorting) vs. complexity of the feature feels like it's not worth it to me.

I think v1 of this is just using rclone, and showing the files changed. If we keep the functionality lightweight a ton of the complexity here seems like it goes away, we can run it on all PR builds, and we can figure out other approaches in the future for file diffing.

I'd recommend that we investigate mapping the changes files to the git repo changes before we do any of the proposed solutions where we download files from S3, but that should definitely be in a v2 after we do the initial work here of making a useful feature without sorting.

docs/dev/design/file-tree-diff.rst

ericholscher · 2024-08-01T15:34:24Z

docs/dev/design/file-tree-diff.rst

+Git providers may already offer a way to compare file trees, but again,
+they work on the source files, and not on the generated files.
+
+All hope was lost for having nice features like this, until now.


🥼 Feels like the start of a Sci-Fi plot :D

docs/dev/design/file-tree-diff.rst

ericholscher · 2024-08-01T15:36:19Z

docs/dev/design/file-tree-diff.rst

+   $ rclone check --combined=- /usr/src/app/checkouts/readthedocs.org/a /usr/src/app/checkouts/readthedocs.org/b
+   + new.txt
+   - deleted.txt
+   = unchanged.txt


Yea, that would be nice, especially for very large projects since most files won't have changed. Could always do something like grep -v "= " or similar, but that still returns it, but we can avoid it in the Python process at least.

ericholscher · 2024-08-01T15:38:30Z

docs/dev/design/file-tree-diff.rst

+
+To start, we will only consider HTML files (``--include=*.html``).
+
+Lines changed between two files


I'm not convinced we need the lines changed in v1. I'd like to avoid this overhead if we can -- I think the start should just be the rclone approach, since sorting the files is a "nice to have" and not required for any of the functionality we build. It would be an obvious v2 feature once we see how much people are using it.

Agreed, though I'm not really sure where line differences really fits in on RTD, even in a next iteration. I'm not sure there is a good UX we can give with line numbers in a diffs, especially where our diffs will not align with changes from a PR -- because a PR base branch might not be the same as our default version.

I think the source file diffing is best left to GitHub and we should focus solely on the rendered diff experience.

ericholscher · 2024-08-01T15:47:12Z

docs/dev/design/file-tree-diff.rst

+API
+---
+
+This operation is expensive, so it won't be available to unauthenticated users.


I agree making this authed seems like a bad place to start. I'd much rather make the endpoint be less expensive. I don't really understand why this would be slow if we already have VersionDiff objects in the DB? It seems like it should just be a DB query.

ericholscher · 2024-08-01T15:48:47Z

docs/dev/design/file-tree-diff.rst

+Integrations
+------------
+
+You may be thinking that once we have an API, it will be just a matter of calling that API from a GitHub action. Wrong!
+
+Doing the API call is easy, but knowing *when* to call it is hard.
+We need to call the API after the build has finished successfully,
+or we will be comparing the files of an incomplete or stale build.


Presumably we could just pass a commit to the API, and confirm we're getting data from that commit, and then poll if not?

ericholscher · 2024-08-01T15:49:21Z

docs/dev/design/file-tree-diff.rst

+- Expose this feature in the dashboard for now, and use our GitHub action to simply link to the dashboard.
+  Maybe don't even expose the API to the public, just use it internally.


I think we initially expose this feature in the Addons PR Build Warning message, which will only be shown on completed builds.

ericholscher · 2024-08-01T15:50:30Z

docs/dev/design/file-tree-diff.rst

+- Use a custom `repository dispatch event <https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#repository_dispatch>`__
+  to trigger the action from our webhook. This requires the user to do some additional setup,
+  and for our webhooks to support custom headers.


I think if we're integrating this feature more, we should build a platform feature for it, and remove the GitHub Action approach. That way we can just have the comment update once the build is successful.

docs/dev/design/file-tree-diff.rst

humitos · 2024-08-06T10:12:51Z

I just checked the latest commit with the updates and I feel there are still some things I'd like to discuss and to be reflected in this document:

Discard the idea of using S3 to save the data and only use the database.
Mention potential sorting of files changed based on PageViews.
Discard the sorting based on lines changed for the v1 implementation.
Use a VersionDiff database model that points to two different Build objects.
Keep the endpoint un-authed and leave the authed version for v2.
Change the endpoint URL to use query parameters.
Always generate the VersionDiff on successful builds instead of in a background task pattern the first time the API endpoint is hit ("lock on request").
Limit the API endpoint to only work over external versions for v1.
Always return full objects in the API response instead of only IDs/slugs (following the APIv3 pattern).
Add it under a feature flag so we can roll it out slowly and start testing the new feature in our own projects first. That may help us to estimate the impact into S3 queries.

stsewd · 2024-08-06T14:39:47Z

Discard the idea of using S3 to save the data and only use the database.

I'm +1 on that, but wanted to see other's opinions

Mention potential sorting of files changed based on PageViews.

I'm not really sold on this idea, the order will be kind of random and unexpected for the changes done on the PR.

Discard the sorting based on lines changed for the v1 implementation.

I'm okay leaving that for later

Use a VersionDiff database model that points to two different Build objects.

It should also point to the versions, since builds aren't deleted when a version is deleted. Or we can manually delete diff objects once a version is deleted.

Keep the endpoint un-authed and leave the authed version for v2.

That will be a breaking change, we should decide if we want to have it under auth or not.

Always generate the VersionDiff on successful builds instead of in a background task pattern the first time the API endpoint is hit ("lock on request").

This is assuming we will only run a diff between two sets of versions, PR preview and latest? or stable? I still think it's useful to allow diffing between any versions on demand.

From Eric's comments, I think we don't want to expose this as an API yet anyway, but only from the addons endpoint. Which again, requires us to choose which version we will make the comparison over, I guess we will choose the one that the default branch points to.

humitos · 2024-08-06T14:49:39Z

Keep the endpoint un-authed and leave the authed version for v2.

That will be a breaking change, we should decide if we want to have it under auth or not.

I'm saying that we can implement the authed version only for Read the Docs for Business to access private projects. For now, we can start implementing this feature on Read the Docs for Community to get some experience/feedback and grow from there for the v2.

requires us to choose which version we will make the comparison over, I guess we will choose the one that the default branch points to.

We should use the same setting we are already using for DocDiff, which is currently LATEST. In the future, we should allow to configure the version to compare against from the addons configuration.

This is assuming we will only run a diff between two sets of versions, PR preview and latest? or stable? I still think it's useful to allow diffing between any versions on demand.

There are good points for each of the different approaches. I think this decision is important since it will dictate how we and other people will use this feature. @ericholscher @agjohnson what are your thoughts on each of these different approaches?

ericholscher · 2024-08-06T17:19:34Z

It doesn't really seem like we need to make a decision about diffing different versions now? We just ship with the initial implementation that diffs the PR build against the latest version? We can always add some way to trigger diffing of arbitrary versions in the future, but it sounds like we don't need to do that in the initial implementation ("v1")?

humitos · 2024-08-07T09:10:32Z

@ericholscher what's your position about 1) implementing the diffing as a background task and requiring hitting the API twice to get the data, or 2) always perform the diffing at build time and hit the API endpoint only once always getting the same response?

stsewd · 2024-08-07T14:03:42Z

implementing the diffing as a background task and requiring hitting the API twice to get the data

For clarification, the first call will trigger the task and wait for the result.

ericholscher · 2024-08-07T15:30:48Z

It seems pretty straightforward to support both the situation where we have the files on disk, or need to query S3 for a set of them. I think to start we do it during build time while we have the files on disk for PR builds?

ericholscher

These changes to the doc look good to me 👍

ericholscher · 2024-08-13T20:03:02Z

docs/dev/design/file-tree-diff.rst

+       build_a = models.ForeignKey(Build, on_delete=models.CASCADE, related_name='diff_a')
+       build_b = models.ForeignKey(Build, on_delete=models.CASCADE, related_name='diff_b')


I do wonder if this should be a build object, or just a commit hash? I guess they work similarly, but there could be multiple different versions on the same commit, so it would be nice to not have to duplicate that work. However, we can also pretty easily get the commit via the Build.

Not 100% sure what is best here, but perhaps something to consider?

A build can be triggered for the same commit and still change the output, say it takes into consideration the time or an env variable.

agjohnson

The immediate plan here seems reasonable 👍

It seems we have the most questions on future plans for next iterations, but this seems like discussion we can come back to. There are some technical implementations in this document that can be dialed back a good deal to match what we want to build initially, and for the medium term -- for example, async execution of diffing and line diffing incur the most complexity that we don't yet need. Perhaps it's worth clarifying here before we get to building this.

agjohnson · 2024-08-13T20:55:43Z

docs/dev/design/file-tree-diff.rst

+
+To start, we will only consider HTML files (``--include=*.html``).
+
+Lines changed between two files


Agreed, though I'm not really sure where line differences really fits in on RTD, even in a next iteration. I'm not sure there is a good UX we can give with line numbers in a diffs, especially where our diffs will not align with changes from a PR -- because a PR base branch might not be the same as our default version.

I think the source file diffing is best left to GitHub and we should focus solely on the rendered diff experience.

agjohnson · 2024-08-13T20:58:15Z

docs/dev/design/file-tree-diff.rst

+Lines changed between two files
+-------------------------------
+
+Once we have the list of files that changed, we can use a tool like ``diff`` to get the lines that changed.


What are we looking to do with these line numbers/source diffs? How are we planning on making this diff something users get value from?

I feel like adding line numbers in the diffs is at very least not yet necessary, though I would probably say this is a part of this feature we don't quite need either.

I think more applicable is making sure that visual diffing has enough UI/UX to be highly useful in review -- easier navigation between chunks, etc. I don't see line numbers or source diffs needed for something like this.

agjohnson · 2024-08-13T21:16:04Z

docs/dev/design/file-tree-diff.rst

+- Once we have the diff between versions ``A`` and ``B``, we can infer the diff between ``B`` and ``A``.
+  We can store that information as well, or just calculate it on the fly.


There are some edge cases where users might want to compare two non-default versions, but this will be a rare use case.

This sort of comparison feels like a better tool for git/GitHub than something we maintain. I'd agree with the above that comparing against only the default version is definitely where we should start, but I suspect there isn't much incentive for us to complicate this feature too much more than that.

Either way, the UI for linking to files that changed is only going to be visible on PR builds -- either as a comment/description in the PR or via a hovering UI in the PR built documentation. We don't have either option for non-PR build versions.

agjohnson · 2024-08-13T23:30:51Z

docs/dev/design/file-tree-diff.rst

+The version and build can be the full objects, or just the IDs and slugs.
+
+We will generate a lock on this request, to avoid multiple calls to the API for the same versions.
+We can reply with a ``202 Accepted`` if the diff is being calculated in another request.


I was picturing we'd do what @humitos is describing here. This feels like it should be a build time artifact, mainly because there doesn't really feel like a use case that benefits from this being async.

If we're worried about performance and cost of s3 calls, I would suggest we consider storing a file manifest for each build with file hashes instead of using rclone at all.

agjohnson · 2024-08-13T23:35:00Z

docs/dev/design/file-tree-diff.rst

+- Expose this feature in the dashboard for now, and use our GitHub action to simply link to the dashboard.
+  Maybe don't even expose the API to the public, just use it internally.


Agreed we should use addons UI to start.

I think the build detail view could be another place eventually, as this is another place that maintainers will interact with RTD as they are reviewing. This could be for the next iteration though.

agjohnson · 2024-08-13T23:38:04Z

docs/dev/design/file-tree-diff.rst

+- Use a custom `repository dispatch event <https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#repository_dispatch>`__
+  to trigger the action from our webhook. This requires the user to do some additional setup,
+  and for our webhooks to support custom headers.


That does seem like a good next step, but it doesn't sound like we need to start there either. Using addons to show this information is a good first step.

For a next iteration, we would need to decide what it would take to surface this information in GitHub -- using GitHub Actions or not. This doesn't feel immediately necessary though.

agjohnson · 2024-08-13T23:44:10Z

docs/dev/design/file-tree-diff.rst

+- Expose this feature only via the addons feature.
+- Allow to diff an external version against the version that points to the default branch/tag of the project only.
+- Use a feature flag to enable this feature on projects.
+- Run the diff while we have the files on disk (end of the build), if possible.


However, regarding all the discussion above, this does match where we've discussed starting with this feature 👍

I think this is a medium term plan too. Your ideas on how to build on top of this feature sounds like we have some ways to iterate on this feature if we want, but I think this plan gives users the most useful feature we can without building something that is too complex for us to maintain well.

agjohnson · 2024-08-13T23:45:16Z

docs/dev/design/file-tree-diff.rst

+But since we are doing this on demand, and we can cache the results, we can minimize the costs
+(maybe is not that much).


I'd agree with @humitos here, but also, let's optimize for cost savings later. We can gauge if this additional usage is really meaningful compared to engineering and maintaining a complex system.

stsewd · 2024-08-14T01:30:11Z

If we're worried about performance and cost of s3 calls, I would suggest we consider storing a file manifest for each build with file hashes instead of using rclone at all.

I like this idea

ericholscher · 2024-08-15T17:19:03Z

If we're worried about performance and cost of s3 calls, I would suggest we consider storing a file manifest for each build with file hashes instead of using rclone at all.

I like this idea

We already built this, and then ripped it out :)

#8111

agjohnson · 2024-08-15T20:59:42Z

We already built this, and then ripped it out :)

Heh yeah, though we ripped it out because it was expensive and we also weren't using it. I don't feel we'd need a per-file model for this either way, a single manifest file seems reasonable.

I am starting to feel that rclone is the wrong solution to build on top of. Rclone only does comparison at the file metadata level, which is only the bare minimum we need. It still requires us to do secondary comparison to avoid giving the user a long list of files that haven't actually changed.

This may or may not require something like a hash manifest file, but definitely requires we do something on top of rclone.

Some prior art to look at here is sitediff:

https://github.com/evolvingweb/sitediff

A few different options it has for avoiding false positive diffs are:

Selector targeting https://github.com/evolvingweb/sitediff?tab=readme-ov-file#selector
Rule based excludes https://github.com/evolvingweb/sitediff?tab=readme-ov-file#sanitization
And others like DOM transforms etc

So, really quite advanced, and obviously way more than we'd build. But it shows the difficulty of comparing two sets of HTML and some of the solutions necessary. Selector targeting seems like the easiest to accomplish.

I think it would help to see actual output of rclone between two versions of our docs, maybe another project too. We've only discussed many of the plans here theoretically so far, it would help to figure out exactly what we're going to be working with.

agjohnson · 2024-08-15T21:50:47Z

I gave a quick test to sitediff, it's a neat tool. I didn't want to crawl our prod docs, so just did the dev docs. I'd expect no changes between these two versions.

Unconfigured, the output is what I expected from rclone -- quite noisy:

Diffs show a lot of noise outside of content changes in all these files, any place latest, stable, or a commit is compared. This would be comparable to using rclone to detect changes in files.

Using a configuration with selector: "div.document[role='main']" the results are far cleaner:

The noisy changes with version names and commit hashs are larger avoided, but the diffs here are mostly from intersphinx altering the link titles. Unfortunately, these are also invisible changes to the user, so the UX of telling users these files changed and them linking them to a doc page where nothing looks changed is going to be something we need to work around.

The second is better, but I am still expecting no changes from this at all, so not ideal still.

humitos · 2024-08-16T16:48:04Z

@agjohnson the research around sitediff you've done is great! It does almost exactly what we want to build here. There are really good ideas and thoughts to consider there.

I did a quick test by myself as well and it worked great, finding differences only on one file after defining the selector 👍🏼

The noisy changes with version names and commit hashs are larger avoided, but the diffs here are mostly from intersphinx altering the link titles.

They mention similar issues on their README file and solve them by using sanitization config.

That said, I think we should continue discussing what's the path we want to follow here since there are a few possible implementations opened already and it seems there are more research we can do before making the next step.

agjohnson · 2024-08-16T18:00:42Z

Yeah, and I wouldn't build exactly what sitediff built but testing with it does highlight what we can expect from similar approaches we're discussing. We'll hit these hurdles too, and might find ourselves needing similar features -- not that I think we should build something this complex.

Agreed we should continue discussing this more. I don't have a great feeling for where we do go, but it's feeling like we'll need to give more user control than file-based comparison.

I suppose to echo what we're doing with output paths, we could expect the build to output a manifest. We could make this a job step that the user can override and apply advanced transform rules if they want.

A left field option could be using docdiff to do the comparisons, and crawling each site. This is back to the async issue though and could be slow. However, it would be much better UX in that the list of files would align with what docdiff would give the user visually.

humitos · 2024-08-20T16:43:56Z

I thought more about this. I started thinking about "build an integration with sitediff and use it behind the scenes to get the list of files changed". I took a look at this and it seems it's an application on itself and doesn't allow to be used "as a library" or as an "external command" to grab all the metadata/data it generates and uses to show those nice pages. Besides, it's written in Ruby and it will require setting up another environment just for this and also find a way to interact with our Python code... I discarded this idea after this quick research.

Then, I went back to the origins of all this discussion: "we need a way to map a source file into a URL" 💡

If we are thinking of giving users the ability to somehow configure the "file tree diff" feature to be able to detect what are the changes between two rendered HTML by defining what's the main selector, and potentially allowing them to define other selectors to ignore some chunks, and... blah blah -- I would like to propose some a lot easier for everybody: why not just ask them "how to map a source file into a URL" 🪄 ?

Knowing that, we can use the right tool designed to compare files that we all know: Git. With that information, we can then run the mapping the user provided us ¹ over each of the changed files to get the URL. Git will tell us:

what are the files that changed
how many lines have changed on each file (useful for sorting)
... a lot more useful data -- it's Git

There are two cons that I'm seeing here:

we need to ask the user for a mapping ²
we won't be able to find changes outside source files (e.g. template changes) ³

In my opinion, this looks like a good idea to discuss and explore more compared to the alternatives we have been discussing. It may not be perfect, but looks a lot easier to understand, to implement, and to explain to users.

I'd love to know your feedback about this. Where do you see potential issues? What do you like? What do you dislike? How do you compare it with the alternatives?

we could have pre-defined regex for documentation tools we already know MkDocs, Sphinx, etc or let the user to create their custom one. We could add more pre-defined values for those new tools as we grow. ↩
not a problem if we have a list of pre-defined ones, tho ↩
even with something like sitediff this would communicate changes everywhere ↩

agjohnson · 2024-08-20T17:35:03Z

I discarded this idea after this quick research.

Agreed. And sorry if I wasn't clear above but I was just testing sitediff, I also did not want to build on top of this tool. It just already does much of what we described building, so worth comparing at least.

why not just ask them "how to map a source file into a URL" 🪄 ?

I also had this same thought path. Though I was also considering building on top of what files PRs list as changed. I realized that this is not a good option as the project might not build the PR base branch or the version might not match.

Relying on Git to detect changes is a closer possibility. I would still vote to keep this as a backup option for now. It requires us maintaining more custom Git operations and the user has to give us a mapping.

I could see us requiring a mapping file in the build, or we might even use a <meta name="source-file" value="docs/install/index.rst" /> in the built HTML.

However, after thinking through source file mapping, and requiring the user tell us how to map changes, I came all the way back to just asking the user to tell us what changed:

I suppose to echo what we're doing with output paths, we could expect the build to output a manifest. We could make this a job step that the user can override and apply advanced transform rules if they want.

That is, projects would have to output a hash manifest file of changes to HTML files in their build:

To start, our build job would just hash whole files. This is just an easy way to move forward for now and gives us noisy output similar to rclone.
Our default job would eventually support Sphinx/Mkdocs output generically. We hash just the contents of parsed HTML div.document[role="main"] etc -- we have some good defaults for the majority of projects.
Projects can override this job and hash content however they please, performing transform steps to content, etc. There just needs to be a _readthedocs/html/checksum.md5/etc at the end of the build.

The plan above for describing differences is still just pulling the two checksum files and noting what doesn't match. We have the page URLs in the checksum file, can note page deletions/additions, and when our defaults don't work the user has full control.

It does seem like we should involve the user at some point, at least for some portion of projects.

ericholscher · 2024-09-04T17:25:18Z

We discussed this in the call today. Takeaways:

We need to diff HTML, because there are a number of edge cases where content can change based on outside files (eg. include files, code API rendering, etc.)
Doing file diffs is too noisy because of metadata changes (eg. dates, commit hashes, etc.)
We should be doing content diffs, where content is defined by a content CSS selector. This should be user configurable in v2, but for v1, we can use our existing content guessing logic from search and other places.
We should store the content hash on the ImportedFile on build, as well as support the build outputting them into a file ($READTHEDOCS_OUTPUT/readthedocs-hashes.json).
When we want to compare 2 verisons, we can compare content hashes, and don't need to re-read the full file contents from S3.

agjohnson · 2024-09-04T19:54:49Z

For a first pass, I feel we should do whatever is quickest so that we can work on the UX here too. This might still be file based hashing just to start experimenting even. Relying on our search code to break down a DOM to text only representation feels like a second or third pass maybe?

Thinking about text representation vs DOM comparison more, DOM comparison seems like where we should be ultimately though. Users should be alerted on changes like image URLs, HTML classes, and non-text DOM changes, not just text changes. Maybe text-only mode is a configurable option later though.

stsewd · 2024-09-04T21:40:24Z

This might still be file based hashing just to start experimenting even. Relying on our search code to break down a DOM to text only representation feels like a second or third pass maybe?

I think we can just start with extracting the main content of the file (we already do this for search) and just hash it as is.

I have updated the design doc to include things that we discussed, I left the old ideas in case they are useful, but I'm fine removing them.

agjohnson · 2024-09-04T22:02:57Z

I think we can just start with extracting the main content of the file (we already do this for search) and just hash it as is.

Yeah, I think we're talking about the same thing. I'm saying we avoid rendering the DOM to text and instead hash the inside contents of an element inside the HTML document. I'm not familiar with what we are technically doing with search indexing.

humitos

Update looks good to me. I think there are more explicit descriptions to be done still because I'm still not sure we are all in the same page. I left some suggestion to explicitly mention what we discuss in the meeting as way to be sure of that.

Besides those comments in the PR,

The document mentions rclone, but I understand we are not going to use it at all for this. If we want to keep that section, we should add a note or similar saying that this was considered originally but we are not going to implement it.
Same for "Lines changed between two files" section, since we won't implement that approach.
VersionDiff object from "Storing results" is not required anymore; since we will use ImportedFile with some modifications if required. So, that section may need to be re-written/updated/removed.

Also, with the new implementation discussed for v1; how would be the API response? Do we need to adjust anything there?

humitos · 2024-09-09T11:46:15Z

docs/dev/design/file-tree-diff.rst

+When a build finishes, we generate this manifest for all HTML files, and store it.
+When we need to compare two versions, we can just compare the manifests.


I understand these manifest will hash only the "main content" of the document. We should mention that here.

humitos · 2024-09-09T11:47:07Z

docs/dev/design/file-tree-diff.rst

+When a build finishes, we generate this manifest for all HTML files, and store it.
+When we need to compare two versions, we can just compare the manifests.
+
+This doesn't require downloading the files, but it requires building a version to generate the manifest.

 Using rclone


Do we need rclone at all now with the new proposal? I understand we won't use rclose for anything and only base our diffing using the manifest.

humitos · 2024-09-09T11:49:12Z

docs/dev/design/file-tree-diff.rst

+- Generate a manifest of all HTML files from the versions that we want to compare.
+  This will be done at the end of the build.
+- Generate the hash based on the main content of the file,
+  not the whole file.


Are we going to create two different manifest files? 🤔

humitos · 2024-09-09T11:50:10Z

docs/dev/design/file-tree-diff.rst

+- Don't store the results in the DB,
+  we can store the results in a next iteration.


What results are you referencing here? Didn't we say to store the hash in ImportFile or similar object?

Version file tree diff: design doc

2ba73fa

ref #11319

stsewd requested a review from a team as a code owner July 30, 2024 21:59

stsewd requested a review from ericholscher July 30, 2024 21:59

auto-assign bot assigned stsewd Jul 30, 2024

stsewd added 3 commits July 30, 2024 17:00

format

07e2dba

This is just plain text

8fb3941

After thoughts

1f2b58a

stsewd commented Jul 30, 2024

View reviewed changes

docs/dev/design/file-tree-diff.rst Show resolved Hide resolved

stsewd commented Jul 30, 2024

View reviewed changes

docs/dev/design/file-tree-diff.rst Show resolved Hide resolved

stsewd commented Jul 30, 2024

View reviewed changes

humitos mentioned this pull request Jul 31, 2024

Insert URL to specific pages readthedocs/actions#4

Open

humitos reviewed Jul 31, 2024

View reviewed changes

ericholscher linked an issue Jul 31, 2024 that may be closed by this pull request

File tree diff API #11319

Open

ericholscher reviewed Aug 1, 2024

View reviewed changes

Updates from review

299e000

stsewd added 2 commits August 13, 2024 14:22

Update doc

71f5b18

Merge branch 'main' into version-file-tree-diff-desing-doc

32ce43d

Linter

e41c720

ericholscher reviewed Aug 13, 2024

View reviewed changes

agjohnson reviewed Aug 14, 2024

View reviewed changes

Updates from comments

173c56d

Updates from recent conversations

de82f28

humitos reviewed Sep 9, 2024

View reviewed changes

		- Expose this feature in the dashboard for now, and use our GitHub action to simply link to the dashboard.
		Maybe don't even expose the API to the public, just use it internally.


		The endpoint will be:

		GET /api/v3/projects/{project_slug}/versions/{version_slug}/diff/{other_version_slug}/

		But since we are doing this on demand, and we can cache the results, we can minimize the costs
		(maybe is not that much).


		To start, we will only consider HTML files (``--include=*.html``).

		Lines changed between two files

		build_a = models.ForeignKey(Build, on_delete=models.CASCADE, related_name='diff_a')
		build_b = models.ForeignKey(Build, on_delete=models.CASCADE, related_name='diff_b')

		- Once we have the diff between versions ``A`` and ``B``, we can infer the diff between ``B`` and ``A``.
		We can store that information as well, or just calculate it on the fly.

		When a build finishes, we generate this manifest for all HTML files, and store it.
		When we need to compare two versions, we can just compare the manifests.

		- Don't store the results in the DB,
		we can store the results in a next iteration.

Version file tree diff: design doc #11507

Are you sure you want to change the base?

Version file tree diff: design doc #11507

Conversation

stsewd commented Jul 30, 2024 • edited Loading

Choose a reason for hiding this comment

stsewd Jul 30, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stsewd Jul 30, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stsewd Aug 1, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

humitos left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

agjohnson Aug 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

humitos Aug 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ericholscher left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

humitos commented Aug 6, 2024 • edited Loading

stsewd commented Aug 6, 2024

humitos commented Aug 6, 2024

ericholscher commented Aug 6, 2024 • edited Loading

humitos commented Aug 7, 2024

stsewd commented Aug 7, 2024

ericholscher commented Aug 7, 2024

ericholscher left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

agjohnson left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stsewd commented Jul 30, 2024 •

edited

Loading

stsewd Jul 30, 2024 •

edited

Loading

stsewd Jul 30, 2024 •

edited

Loading

stsewd Aug 1, 2024 •

edited

Loading

agjohnson Aug 14, 2024 •

edited

Loading

humitos Aug 14, 2024 •

edited

Loading

ericholscher left a comment •

edited

Loading

humitos commented Aug 6, 2024 •

edited

Loading

ericholscher commented Aug 6, 2024 •

edited

Loading

ericholscher commented Aug 15, 2024 •

edited

Loading