Build: implement build output contract #9088

agjohnson · 2022-04-07T18:40:42Z

This is what so far has been called "metadata.yaml".

@humitos has the most background on what this work is.

Originally discussed in a draft PR here: #8190 and some notes on hackmd from a meeting.

agjohnson · 2022-04-07T19:22:29Z

I think structure is mostly figured out here.

We do need this build contract file eventually, as we'll also use this file to support upload of HTML tarballs. So, regardless of whether these file paths need to be customized by users during build time, we do need some dedicated file here eventually.

A conversation for a second implementation, but I'd love to see some way to bring the contract file and readthedocs.yaml together. I can't think of a reason why that's needed up front, but if someone feels otherwise, I'm describing {build: {jobs: {collect_artifacts: cat metadata.yaml}}} as the effective command users could override.

I would advocate for a clearer and more unique file name than metadata.yaml though -- doc-output.yaml, docs.yaml, artifacts.yaml, etc.

humitos · 2022-04-20T11:22:33Z

A conversation for a second implementation, but I'd love to see some way to bring the contract file and readthedocs.yaml together. I can't think of a reason why that's needed up front

I wrote about why I think it's a good idea to have this in a separate file in #8190 (comment)

agjohnson · 2022-04-20T21:54:02Z

That discussion seems limited to whether or not the configuration file is better in readthedocs.yaml or as a separate file. I'm not describing keeping this as static configuration in readthedocs.yaml

I'm describing {build: {jobs: {collect_artifacts: cat metadata.yaml}}} as the effective command users could override.

We'd treat this as a build step, to surface the interaction between our build process and the contract file to users. Otherwise, it's a magical file that does something if we encounter it.

humitos · 2022-04-21T07:19:38Z

Oh, I think I misunderstood your point originally then. I agree with collect_artifacts being a build step 👍🏼 --I'm sorry for the noise 😄

agjohnson · 2022-11-23T01:06:25Z

It might be worth reviewing this plan and discussing more, but I think we lean towards starting with communicating/documenting the output paths and anything else strictly necessary for build.jobs/commands usage and maybe pausing on the rest of the contract file.

@humitos is there anything else in the contract file besides the output paths the you feel is necessary?

ericholscher · 2022-11-23T01:08:55Z

To clarify, it feels like we're defining the contract incrementally each time we implement things, instead of all at once. It feels like we're mostly defining things by convention, and using YAML config data for the rest currently.

humitos · 2022-11-23T09:29:44Z

I'm not sure to understand what you are asking here. Can you expand?

Currently, we only have implemented the "output path" (just _readthedocs/html). There is no way to interact with the build process/hosting itself in any other way than just putting files there.

The original idea of this was:

input.yaml -> build process -> output.yaml -> (post processing HTML) -> hosting

The only thing implemented currently is the readthedocs.html_output from the output.yaml.

agjohnson · 2022-11-23T19:11:32Z

What I'm ultimately asking is: do we even need a contract file?

Output paths are required, but I don't see a need to make them configurable. Convention here seems like it's plenty, as long as we communicate these somewhere for users.

HTML output goes in _readthedocs/html
PDF output goes in _readthedocs/pdf
EPUB output goes in _readthedocs/epub
HTMLzip output goes in _readthedocs/htmlzip

(We might need to specify a filename for other formats though, as we only support one file per output type, but that's another conversation)

Past that, it's not really clear we have a need for a separate contract file or additional configuration at all perhaps.

For example, you originally highlighted some ideas here:

https://github.com/readthedocs/readthedocs.org/pull/8190/files#diff-88e870e051679647ed5d25f90e929304026661c768ad12165e3e6f1da7276d6cR118-R124

But these seem like they could either be in .readthedocs.yaml, or may not even need to be configuration options at all. Or might be JS integration points in our library.

Are there any configuration options besides output paths that you feel are necessary @humitos ?

Jackenmen · 2022-11-23T19:20:24Z

Could it be determined whether a format will be built at runtime? For example, I could want to only build HTML for PRs to make the build take less time and allow to see the preview quicker but build all formats when building from the main branch or tag so that they're actually built when it matters.

I imagine it could be possible simply by not creating the _readthedocs/pdf file/directory during the build based on the condition I specified above but I'm not sure if you envision it like that or if not creating such a file would cause the whole build to fail.

humitos · 2022-11-24T10:31:30Z

@agjohnson

What I'm ultimately asking is: do we even need a contract file?

We need a contract file because we want to output a lot more information than just "where the output files are".

If you are asking "do wee need a contract file for this particular case?". I'd say we may not --we are already doing this for _readthedocs/html.

(We might need to specify a filename for other formats though, as we only support one file per output type, but that's another conversation)

If we are using a convention for the PDF output's path, we can use a convention for this as well: "if you have multiple .pdf files in _readthedocs/pdf, we will serve them all".

Are there any configuration options besides output paths that you feel are necessary @humitos ?

I mean, yes. That's why I wrote the original design doc ¹. I feel that we are recovering that old long conversation we had here. If that's the plan, I need to sit down again and re-read all the documents and notes we have around; but I'd ask other people to do the same so we are all in the same page because this "simple" question does not have a quick answer.

I'm fine changing directions here if needed, since some things have changed already. However, I'd suggest to not do that without reviewing all those conversations and documents.

Also, we are again breaking this conversation in multiple issues which makes everything more confusing. We are also talking about _readthedocs/pdf in https://github.com/readthedocs/meta/issues/77

the build contract not only tells us where the files are, or what's the doctool used to build the docs; it opens the door to be documentation tool agnostic while keeping the resulting documentation 100% integrated with Read the Docs ↩

agjohnson · 2022-11-28T18:59:09Z

"if you have multiple .pdf files in _readthedocs/pdf, we will serve them all"

Well, eventually 😄

We have to make the rest of RTD aware of multiple PDF files first. I wouldn't take that work on yet, nor would I pre-optimize our implementation here for this constraint yet. It seems fine to start with support for single files first in our convention -- _readthedocs/pdf/output.pdf or whatever seems easy to communicate.

We can update our convention to support multiple files later, but it's still not clear we need secondary configuration here.

I'm fine changing directions here if needed, since some things have changed already. However, I'd suggest to not do that without reviewing all those conversations and documents.

I'm mostly asking for clarification, not a change in direction. We started a change in direction already by moving towards convention over configuration. This provides users with the same functionality in a much easier to maintain way for us, and likely is easier for users to use as well. I think is an important point.

I was asking about the other configuration options specifically above. I noted that it's not clear anymore why those need to be in a secondary configuration file.

We already have trouble with one configuration file, two configuration files seems like it will slow us down too much. But with part of this contract already defined by convention, we'd need a strong reason to implement the remaining options in a secondary configuration file and take on this work in our roadmap.

Also, we are again breaking this conversation in multiple issues which makes everything more confusing.

I'm discussing this at a top level here as the rest of the contract file is something we need to place/remove from our roadmap. I think discussion around _readthedocs/pdf convention addition is separate and already ongoing and don't want to derail conversation at readthedocs/meta#77 with this.

humitos · 2022-11-29T15:50:01Z

@agjohnson

We already have trouble with one configuration file, two configuration files seems like it will slow us down too much. But with part of this contract already defined by convention, we'd need a strong reason to implement the remaining options in a secondary configuration file and take on this work in our roadmap.

I'm fine using conventions where it's possible. However, if we want more people/doctools to integrate with us, we need to give them flexibility instead of forcing them to do things exactly as we want. There is a trade off here, of course.

I don't have a strong preference for a separate file. I think it's enough with your idea of using a build.jobs command to collect the contract data. What I really want is to have a well-defined contract for:

hosting integrations
telemetry

An example of what I'm thinking here:

build:
  jobs:
    # Override default jobs:
    # It won't install "readthedocs-sphinx-ext" for example
    install:
      - pip install -r docs/requirements.txt

    # It won't run the default "sphinx-build" command for HTML nor PDF,
    # but a modified one by the user.
    # It uses conventions for the output directory.
    html:
      - python -m sphinx -b html docs/ _readthedocs/html
    pdf:
      - sphinx-build -b simplepdf docs/ _readthedocs/pdf/

    pre_metadata: null
    metadata:
      # Telemetry
      doctool:
       name: sphinx
       version: 5.0.1
       builder: html
       extensions:
         - autoapi
         - hoverxref
       theme: furo
      
      # Hosting integrations
      readthedocs:
       # It uses a convention to inject it in the following div
       #   <div id="readthedocs-embed-flyout"></div>
       flyout: true

       # Here we need to figure it out how we are going to implement
       # all the integrations we have in `readthedocs-sphinx-ext` since
       # people is not going to install this extension anymore.
       # These integrations have to be migrated into Javascript and enabled
       # and configured from here.
    post_metadata: null

Note that I'm using build.jobs.metadata to communicate all these things to Read the Docs. Here we could have all the items statically defined for the contract itself, or it can be a command that loads the YAML (as you suggested in a previous comment: cat metadata.yaml or python metadata.py). There are some information that requires being dynamically collected (like Telemetry data) --we are currently outputting a telemetry.json from inside the Sphinx build's process; but we need to migrate this outside the readthedocs-sphinx-ext in a similar way we are migrating all the other integrations.

The exact contract structure will be defined from the research we are doing in https://github.com/readthedocs/meta/issues/85#issuecomment-1329920589 and also in #9755 (comment). IMO, the first step here is migrating all the injections performed at build time to live in a Javascript library; being it the only file we inject on all the pages.

agjohnson · 2022-11-29T20:18:50Z

However, if we want more people/doctools to integrate with us, we need to give them flexibility instead of forcing them to do things exactly as we want. There is a trade off here, of course.

Looking very far out, I agree that configuration could be a polished experience for users/extension developers. However, what we've implemented so far certainly works as well.

The next places to invest our time for supporting third party tools will be in documentation UI/UX parity -- build jobs/commands parity, flyout injection, etc. Compared to a build contract concept, these pieces are much more limiting for users. If the contract isn't required for these pieces, then focusing on a contract file prematurely will delay working on these more important pieces.

We could still take on the contract file later in the roadmap, but I'd probably want to see more signs of adoption of third party tooling before we worry about first class support for tooling developers.

This was why I was advocating for experimenting with third party build tooling as much as possible, and regularly highlighting and promoting the results. Experimenting for our purposes is important, but we can't expect users to run with exploration here without some guidance. I doubt most users even know this is possible still. The work you were doing with https://github.com/readthedocs/meta/discussions/35 was great in this regard. I suppose I don't really share the vision in halting that work.

I don't have a strong preference for a separate file. I think it's enough with your idea of using a build.jobs command to collect the contract data.

I am not necessarily arguing for either option here, I'm probably arguing for neither in fact. At least for the short term, and until we know more about what uptake will look like. Currently, we aren't seeing much uptake on third party tools, so we do have some growth here still.

However, there are some strong points for implementing what we can from our build process as isolated, independent scripts/commands, and probably using build.jobs as much as possible. That is search indexing and flyout injection (or whatever we want) might all be candidates for implementing as an isolated script and separate build.job. This gives users potential control over these steps if they need.

If we tuck this implementation back in the application, we usually get bogged down in premature polish discussion and fiddly implementations, and the implementation isn't extensible. This is just a half-baked idea on my part though, not a serious constraint, but perhaps warrants some thought.

There are some information that requires being dynamically collected (like Telemetry data) --we are currently outputting a telemetry.json from inside the Sphinx build's process

Perhaps the current file is still a good place for this data, and this is a pattern to bring to other implementations. However, we are still a ways off from having community maintained support for third party tools, and we'd have to consider if we want community provided telemetry even.

For now, we're still relying on users to maintain their own integrations, and we almost certainly can't rely on users to give us our own telemetry data. So I think customization of telemetry data collected might also be a good implementation to save for later?

humitos · 2022-11-30T09:45:39Z

@agjohnson

The next places to invest our time for supporting third party tools will be in documentation UI/UX parity -- build jobs/commands parity, flyout injection, etc. Compared to a build contract concept, these pieces are much more limiting for users. If the contract isn't required for these pieces, then focusing on a contract file prematurely will delay working on these more important pieces.

100% agree on this. Hosting integrations will give users a way better experience than a build contract just for supporting other tools¹. Besides, I also think these integrations will guide the development of the build contract and make it a lot easier to design and implement.

To clarify, the more configuration/conventions we can move to the Javascript library required (+ HTML structure) for the hosting integration, the better in my opinion. At this point, we need to define if there are configurations/values we have to inject at build time for the Javascript library to use when serving the page. That's probably "the smallest contract" we have to discuss now --and that's the work I was trying to do at #9755 which is kind of blocked on "defining how this Javascript library will work"

Maybe it's time for another long meeting where you sit down in your porche, I prepare some mate and we talk for hours? 😄

For now, we're still relying on users to maintain their own integrations, and we almost certainly can't rely on users to give us our own telemetry data. So I think customization of telemetry data collected might also be a good implementation to save for later?

Sure, I'm not saying we have to implement it now or anything similar, just to keep in mind when working on the design because this is something that we will eventually want to have.

I'm seeing supporting "other doctools" as a company's growing goal that will bring new users, compared to "better UI/UX" given by the hosting integrations to current users ↩

agjohnson · 2022-11-30T22:03:49Z

At this point, we need to define if there are configurations/values we have to inject at build time for the Javascript library to use when serving the page.

I'm with you here. I feel like we've seeded this conversation, but I feel like we need a map of all of these pieces at some point. I get a bit overwhelmed considering everything.

Maybe it's time for another long meeting where you sit down in your porche, I prepare some mate and we talk for hours?

Hah, well my porch is covered in snow now, but my office is cozy! 😆

This would be a great place to go deep on pairing. I personally don't have this new build architecture model visualized yet and I think we'd all benefit from having this mapped out.

Relating to our conversation this morning, we need to start defining what build pieces are moving, where they are moving to, and how we'll support configuration for them. We've thrown around some really good ideas here, but it seems we're all unsure of the next steps.

Maybe to seed this conversation, I could share a few code examples of what front end configuration might look like. It would be similar to the EA client I think, in a lot of ways. This might help give some way to discuss technical patterns, and maybe we can work from what patterns we think users would like (or what configuration should live in our configuration, in build conventions, or in a metadata contract).

Sure, I'm not saying we have to implement it now or anything similar, just to keep in mind when working on the design because this is something that we will eventually want to have.

Cool, I think we're on the same page 🤝

And to clarify, I'm not at all opposed to a contract file. I do think it's a great plan, but I'm also excited to work on the implementation rearchitecture first, if possible. The contract file seems like it will be a natural progression, and we very well might just suddenly find it required, and that would still be a great outcome.

humitos · 2022-12-01T12:09:12Z

@agjohnson

Maybe to seed this conversation, I could share a few code examples of what front end configuration might look like. It would be similar to the EA client I think, in a lot of ways. This might help give some way to discuss technical patterns, and maybe we can work from what patterns we think users would like (or what configuration should live in our configuration, in build conventions, or in a metadata contract).

Cool 💯 . I think this is a great next step. Knowing this will let us know (me, in particular) where those configuration has to come from (e.g. static in the YAML or dynamic at build time)

humitos · 2023-03-11T11:19:04Z

I'm with you here. I feel like we've seeded this conversation, but I feel like we need a map of all of these pieces at some point. I get a bit overwhelmed considering everything.

An initial implementation putting all these pieces together is happening at #10127. I'm closing this issue to avoid confusions for now. We should continue the conversation there in the new PR and come back here if we need some extra old context.

After re-reading this issue completely again, it seems the most important doubt yet is if there is required a new YAML file generated during the build process or not. In case it's required, there are some ideas here about how to get this data:

Output all the data in readthedocs-build.yaml
Use a build.jobs.metadata config key in .readthedocs.yaml that can collect this metadata statically or dynamically by running a command

This "is this file required?" question was risen again in #10127. So, I think it's gonna be better to continue that discussion there instead.

agjohnson added the Needed: design decision A core team decision is required label Apr 7, 2022

This was referenced Apr 7, 2022

Build: implement build.commands override of build commands #9062

Closed

Design doc: forward path to a future builder #8190

Merged

agjohnson assigned agjohnson and unassigned agjohnson Apr 7, 2022

This was referenced Jun 13, 2022

A way to disable installation of default requirements #8208

Closed

Docs: explain how to enable Read the Docs integrations when using build.commands #9342

Closed

This was referenced Oct 11, 2022

build.commands shall support custom package installation (build.apt_packages) #9599

Open

Use dirhtml builder for Read the Docs python/devguide#965

Merged

humitos mentioned this issue Nov 29, 2022

PDF builder: tectonic #9598

Closed

humitos mentioned this issue Feb 22, 2023

Build: expose VCS-related environment variables #9423

Closed

agjohnson mentioned this issue Mar 10, 2023

Hosting: manual integrations via build contract #10127

Merged

humitos closed this as completed Mar 11, 2023

humitos mentioned this issue Jun 8, 2023

Multiple pdf files for single project #2045

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Build: implement build output contract #9088

Build: implement build output contract #9088

agjohnson commented Apr 7, 2022

agjohnson commented Apr 7, 2022 •

edited

Loading

humitos commented Apr 20, 2022

agjohnson commented Apr 20, 2022

humitos commented Apr 21, 2022

agjohnson commented Nov 23, 2022

ericholscher commented Nov 23, 2022

humitos commented Nov 23, 2022 •

edited

Loading

agjohnson commented Nov 23, 2022

Jackenmen commented Nov 23, 2022

humitos commented Nov 24, 2022

agjohnson commented Nov 28, 2022 •

edited

Loading

humitos commented Nov 29, 2022

agjohnson commented Nov 29, 2022

humitos commented Nov 30, 2022

agjohnson commented Nov 30, 2022

humitos commented Dec 1, 2022

humitos commented Mar 11, 2023

Build: implement build output contract #9088

Build: implement build output contract #9088

Comments

agjohnson commented Apr 7, 2022

agjohnson commented Apr 7, 2022 • edited Loading

humitos commented Apr 20, 2022

agjohnson commented Apr 20, 2022

humitos commented Apr 21, 2022

agjohnson commented Nov 23, 2022

ericholscher commented Nov 23, 2022

humitos commented Nov 23, 2022 • edited Loading

agjohnson commented Nov 23, 2022

Jackenmen commented Nov 23, 2022

humitos commented Nov 24, 2022

Footnotes

agjohnson commented Nov 28, 2022 • edited Loading

humitos commented Nov 29, 2022

agjohnson commented Nov 29, 2022

humitos commented Nov 30, 2022

Footnotes

agjohnson commented Nov 30, 2022

humitos commented Dec 1, 2022

humitos commented Mar 11, 2023

agjohnson commented Apr 7, 2022 •

edited

Loading

humitos commented Nov 23, 2022 •

edited

Loading

agjohnson commented Nov 28, 2022 •

edited

Loading