Documentation is too big #3479

Closed
odow opened this issue Sep 2, 2023 · 21 comments

Comments

@odow
Member

odow commented Sep 2, 2023

The documentation preview from #3478 failed to build because it's very big:
https://github.com/jump-dev/JuMP.jl/actions/runs/6054461407/job/16431976144

image

I've gone and removed some old previews, but we should consider removing some old versions.

Do we need https://jump.dev/JuMP.jl/v0.19.2/ or as far back as https://jump.dev/JuMP.jl/v0.12/?

Perhaps we could just keep the latest patch release for each minor and redirect.
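
As a rough sketch of the "latest patch release for each minor" idea, the following picks which version directories to keep from a list of directory names. The function name `latest_patch_per_minor` and the sample names are hypothetical, and it assumes that versioned directories are named `vX.Y.Z`.

```julia
# Hypothetical helper: keep only the newest patch release of each minor version.
function latest_patch_per_minor(names::AbstractVector{<:AbstractString})
    latest = Dict{Tuple{Int,Int},VersionNumber}()
    for name in names
        # Only consider full `vX.Y.Z` names; skip `dev`, `stable`, `v1.14`, etc.
        m = match(r"^v(\d+)\.(\d+)\.(\d+)$", name)
        m === nothing && continue
        v = VersionNumber(parse.(Int, m.captures)...)
        key = (Int(v.major), Int(v.minor))
        if !haskey(latest, key) || v > latest[key]
            latest[key] = v
        end
    end
    return sort!(collect(values(latest)))
end

# Illustrative directory names, not the real contents of gh-pages.
names = ["dev", "stable", "v0.19.0", "v0.19.2", "v1.13.0", "v1.14.0", "v1.14.1", "v1.14"]
keep = latest_patch_per_minor(names)
println(keep)
```

Everything that matches `vX.Y.Z` but is not in `keep` would be a candidate for deletion or for a redirect.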

Each copy of the documentation is ~21 MB, so we can have ~50 copies before we hit 1 GB.

(base) oscar@Oscars-MBP JuMP % du -sh v1.14.0
 21M	v1.14.0
(base) oscar@Oscars-MBP JuMP % du -sh v1.14.0/*
4.2M	v1.14.0/JuMP.pdf
608K	v1.14.0/api
1.5M	v1.14.0/assets
 40K	v1.14.0/background
156K	v1.14.0/changelog
260K	v1.14.0/developers
 64K	v1.14.0/extensions
 36K	v1.14.0/index.html
 56K	v1.14.0/installation
588K	v1.14.0/manual
2.7M	v1.14.0/moi
1.8M	v1.14.0/packages
228K	v1.14.0/release_notes
3.0M	v1.14.0/search_index.js
 36K	v1.14.0/should_i_use
4.0K	v1.14.0/siteinfo.js
5.5M	v1.14.0/tutorials

We could also redirect all of the PDFs to the latest copy, rather than storing a 4 MB PDF for every version.

cc @mortenpi

@odow
Member Author

odow commented Sep 3, 2023

I looked at de-duplicating files between versions. But every HTML file has a timestamp of when it was compiled 😢. They also contain the navigation tree, so any change to navigation (new page, header, etc.) will prevent de-duplication.

@odow
Member Author

odow commented Sep 4, 2023

Thoughts @mortenpi?

@mortenpi
Contributor

mortenpi commented Sep 4, 2023

Could look into just retaining the latest patch version per minor maybe?

I don't have any amazing ideas though. I guess the "official" recommendation here would be to look at deploying to self-hosted S3 buckets or something if the docs get too big?

@odow
Member Author

odow commented Sep 4, 2023

It's a larger change, but ideally:

  • We wouldn't have the date in the footer
  • The <nav> wouldn't be hard-coded into every page, so a top-level change wouldn't require a rebuild of every page
  • Assets like plots would be separate files
  • Files would be de-duplicated to point to the oldest version of the file at the same path with the same sha256.

There are still problems, like the fact that search_index.js is 3 MB. (I get the motivation, but it's still quite large, especially if we want to have 50 copies of the docs lying around.)

Here's a script I tried, but I ran into the nav issue.

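import SHA

# Hash a file. For HTML, strip the build timestamp first so that otherwise
# identical pages from different builds compare equal. The footer looks like:
# <span class="colophon-date" title="Saturday 2 September 2023 03:04">Saturday 2 September 2023</span>
function sha256(filename)
    if endswith(filename, ".html")
        contents = read(filename, String)
        stripped = replace(contents, r"<span class=\"colophon-date\".+?</span>" => "")
        return bytes2hex(SHA.sha256(stripped))
    else
        return bytes2hex(open(SHA.sha256, filename))
    end
end

# Match only full `vX.Y.Z` directories (dots escaped, whole name anchored).
dirs = filter!(!isnothing, match.(r"^v\d+\.\d+\.\d+$", readdir(".")))
versions = sort([VersionNumber(m.match) for m in dirs])

# Maps (sha, path relative to the version directory) to the oldest version
# that contains a file with that content at that path.
sha_file_to_version = Dict{Tuple{String,String},String}()
for v in versions
    version = "v$v"
    for (root, _, files) in walkdir(version)
        for file in files
            path = joinpath(root, file)
            islink(path) && continue  # already de-duplicated by a previous run
            sha = sha256(path)
            filename = relpath(path, version)
            first_version = get(sha_file_to_version, (sha, filename), nothing)
            if first_version !== nothing
                # A relative symlink target is resolved from the link's own
                # directory, so compute the target relative to dirname(path).
                target = relpath(joinpath(first_version, filename), dirname(path))
                rm(path)
                symlink(target, path)
            else
                sha_file_to_version[(sha, filename)] = version
            end
        end
    end
end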

The SciML docs are going to have the same problem at some point. Their new versions are 16 MB (although they have a 3.9 GB .git folder for some reason) (cc @ChrisRackauckas)

(base) oscar@Oscars-MBP SciMLDocs % du -sh *
 25M	dev
1.2M	docs
4.0K	index.html
  0B	stable
1.2M	v0.1
  0B	v0.1.0
1.8M	v0.2
  0B	v0.2.0
  0B	v0.3
1.9M	v0.3.0
  0B	v1
  0B	v1.0
6.4M	v1.0.0
  0B	v1.1
6.4M	v1.1.0
  0B	v1.2
 16M	v1.2.0
 16M	v1.2.1
4.0K	versions.js
(base) oscar@Oscars-MBP SciMLDocs % du -sh .
4.0G	.
(base) oscar@Oscars-MBP SciMLDocs % du -sh .git
3.9G	.git

@mortenpi
Contributor

mortenpi commented Sep 4, 2023

Just to double check -- does the deduplication even work? I wouldn't be surprised if gh-pages resolves the symlinks when it bundles up the tarball.

The timestamp could have an option to disable it (or maybe even populate it dynamically from siteinfo.js / siteinfo.json). But I'm very skeptical of the dynamic navigation -- I think for that to work properly, we'd need a whole different approach to the front end.

And also, I'm not sure we want the de-duplication complexity in Documenter. I understand that JuMP is hitting this edge case, but it is an edge case. That said, I guess it would have to be a deploydocs-level post-processing step anyway, and so it can pretty easily be prototyped here.

SciML deploys to S3 by the way: https://github.com/SciML/SciMLDocs/blob/b9b5008c1fdca03b9365ec78ff220cefab48b632/.buildkite/aggregate.yml#L26-L31

@mortenpi
Contributor

mortenpi commented Sep 4, 2023

One more note: if your gh-pages is hitting 1+ GiB, you may want to consider moving the docs to a separate repo. I think it should make it much faster for people who are cloning the repo (since right now they're also downloading all the docs when they clone).

@odow
Member Author

odow commented Sep 4, 2023

Aren't the symlinks how we're doing the stable links anyway? I assumed if folders work then files should too. I haven't tested.

> you may want to consider moving the docs to a separate repo

Fresh clones are not very big

(base) oscar@Oscars-MBP /tmp % git clone https://github.com/jump-dev/JuMP.jl
Cloning into 'JuMP.jl'...
remote: Enumerating objects: 39893, done.
remote: Counting objects: 100% (5587/5587), done.
remote: Compressing objects: 100% (1218/1218), done.
remote: Total 39893 (delta 3764), reused 4660 (delta 2963), pack-reused 34306
Receiving objects: 100% (39893/39893), 132.23 MiB | 14.00 MiB/s, done.
Resolving deltas: 100% (26193/26193), done.
(base) oscar@Oscars-MBP /tmp % cd JuMP.jl 
(base) oscar@Oscars-MBP JuMP.jl % du -sh . 
149M	.
(base) oscar@Oscars-MBP JuMP.jl % ls
CITATION.bib		LICENSE.md		README.md		src
CODE_OF_CONDUCT.md	NEWS.md			docs			test
CONTRIBUTING.md		Project.toml		ext
(base) oscar@Oscars-MBP JuMP.jl % du -sh .git
145M	.git

I think the overhead of managing a separate doc repo outweighs the benefit.

@mortenpi
Contributor

mortenpi commented Sep 4, 2023

That's fair, it clones reasonably quickly, on a fast connection anyhow. But just as a datapoint, 90% of the time and disk space is spent on gh-pages:

$ git clone https://github.com/jump-dev/JuMP.jl JuMP.jl-full
Cloning into 'JuMP.jl-full'...
remote: Enumerating objects: 39893, done.
remote: Counting objects: 100% (5589/5589), done.
remote: Compressing objects: 100% (1220/1220), done.
remote: Total 39893 (delta 3765), reused 4664 (delta 2963), pack-reused 34304
Receiving objects: 100% (39893/39893), 132.33 MiB | 14.19 MiB/s, done.
Resolving deltas: 100% (26189/26189), done.

$ du -sh JuMP.jl-full/.git
134M	JuMP.jl-full/.git



$ time git clone --single-branch -b gh-pages https://github.com/jump-dev/JuMP.jl JuMP.jl-gh-pages/
Cloning into 'JuMP.jl-gh-pages'...
remote: Enumerating objects: 12161, done.
remote: Counting objects: 100% (5470/5470), done.
remote: Compressing objects: 100% (1177/1177), done.
remote: Total 12161 (delta 3722), reused 4508 (delta 2891), pack-reused 6691
Receiving objects: 100% (12161/12161), 120.37 MiB | 11.64 MiB/s, done.
Resolving deltas: 100% (5771/5771), done.
Updating files: 100% (8465/8465), done.

real	0m13.270s
user	0m4.573s
sys	0m1.051s


$ du -sh JuMP.jl-gh-pages/.git
122M	JuMP.jl-gh-pages/.git



$ time git clone --single-branch -b master https://github.com/jump-dev/JuMP.jl
Cloning into 'JuMP.jl'...
remote: Enumerating objects: 26840, done.
remote: Counting objects: 100% (328/328), done.
remote: Compressing objects: 100% (261/261), done.
remote: Total 26840 (delta 142), reused 204 (delta 63), pack-reused 26512
Receiving objects: 100% (26840/26840), 12.56 MiB | 3.51 MiB/s, done.
Resolving deltas: 100% (19778/19778), done.

real	0m1.963s
user	0m1.071s
sys	0m0.162s

$ du -sh JuMP.jl/.git
14M	JuMP.jl/.git

@mortenpi
Contributor

mortenpi commented Sep 5, 2023

> Aren't the symlinks how we're doing the stable links anyway? I assumed if folders work then files should too. I haven't tested.

I mean, they definitely work, as in they get deployed correctly. But are we sure they actually reduce the size of the tarball that GitHub tries to upload? I guess if they didn't, then the tarball would be much larger though, and so maybe they do work...

@mortenpi
Contributor

mortenpi commented Sep 5, 2023

Actually, I'm pretty sure it doesn't. Currently, the whole gh-pages branch is about 700 MB:

$ du -sh JuMP.jl-gh-pages/
715M	JuMP.jl-gh-pages/

But the artifact is 900+ MB

image

And you can actually download it. It's a zip archive (of a .tar archive...), so I expect it's even bigger once unpacked. And yeah, looking into the zip/tar, it seems like the symlinks have been replaced with copies:

image
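
A minimal sketch of how the effect could be estimated locally, assuming the archiver replaces every symlink with a full copy of its target, is to compare the size of a tree with and without following symlinks. The helper name `tree_size` and the scratch files are hypothetical. This is not how GitHub builds the artifact, and a dangling symlink would raise an error when symlinks are resolved.

```julia
# Hypothetical helper: total size in bytes of all files below `root`.
# With `resolve_symlinks = true`, a symlink counts as a full copy of its
# target; otherwise it counts only as the (tiny) link itself.
function tree_size(root::AbstractString; resolve_symlinks::Bool)
    total = 0
    for (dir, _, files) in walkdir(root; follow_symlinks = resolve_symlinks)
        for file in files
            path = joinpath(dir, file)
            total += resolve_symlinks ? filesize(path) : lstat(path).size
        end
    end
    return total
end

# Illustrative usage in a scratch directory: one real file and one symlink to it.
dir = mktempdir()
write(joinpath(dir, "original.html"), "x"^1000)
symlink("original.html", joinpath(dir, "copy.html"))
on_disk = tree_size(dir; resolve_symlinks = false)
in_archive = tree_size(dir; resolve_symlinks = true)
println("on disk: ", on_disk, " bytes; with symlinks resolved: ", in_archive, " bytes")
```

If the two numbers are far apart for a checkout of gh-pages, symlink-based de-duplication would save space in the repository but not in the deployed artifact.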

@odow
Member Author

odow commented Sep 5, 2023

Oh 😢

I guess we could replace a bunch with explicit re-directs then, like we did for https://github.com/JuliaOpt/juliaopt.github.io/blob/master/JuMP.jl/dev/index.html
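
For illustration, an explicit redirect can be a small HTML stub that uses a meta refresh. The sketch below is an assumption about what such a stub could look like, not a copy of the linked file. The names `redirect_html` and `write_redirect` are hypothetical, and the `v0.19.0` directory is only an example.

```julia
# Hypothetical helper: HTML for a page that immediately redirects to `target`.
function redirect_html(target::AbstractString)
    return """
    <!DOCTYPE html>
    <html>
    <head>
      <meta charset="utf-8">
      <meta http-equiv="refresh" content="0; url=$(target)">
      <link rel="canonical" href="$(target)">
    </head>
    <body>
      <p>This page has moved to <a href="$(target)">$(target)</a>.</p>
    </body>
    </html>
    """
end

# Hypothetical helper: write the stub to `path`, creating parent directories.
function write_redirect(path::AbstractString, target::AbstractString)
    mkpath(dirname(path))
    write(path, redirect_html(target))
    return path
end

# Illustrative usage in a scratch directory.
dir = mktempdir()
stub = write_redirect(
    joinpath(dir, "v0.19.0", "index.html"),
    "https://jump.dev/JuMP.jl/v0.19.2/",
)
print(read(stub, String))
```

A stub like this is a few hundred bytes, but it only helps if the same page still exists at the target.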

@odow
Member Author

odow commented Sep 5, 2023

Anyway, I get your point that this is something we can fix. I don't know if Documenter needs to do anything.

@mortenpi
Contributor

mortenpi commented Sep 5, 2023

For redirects, I have been working on some shared tooling in DocumenterTools: JuliaDocs/DocumenterTools.jl#76

It's still a draft though, since I am not sure what API we want exactly, so would be happy to have feedback there if it's something that you could use 🙂

@odow
Member Author

odow commented Sep 6, 2023

So one problem with the re-directs is that pages can move or be deleted between patch releases. There are actually quite a few changes.

@odow
Member Author

odow commented Sep 6, 2023

One thing that would save a lot is to remove the search index: odow/SDDP.jl#661, https://odow.github.io/SDDP.jl/v0.3.13/search/

It saves quite a lot of space, and the only downside is that someone can't search through old versions of the docs.

I didn't realize that search_index.js is literally an uncompressed copy of every text string in the entire docs.
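
To get a feel for how much the search index costs across versions, a sketch like the following could total the size of `search_index.js` in each `vX.Y.Z` directory. The helper name `search_index_sizes` is hypothetical, and the scratch directory stands in for a checkout of gh-pages.

```julia
# Hypothetical helper: size in bytes of search_index.js for each `vX.Y.Z`
# directory directly under `root`. Other directories (dev, stable, v1.14)
# are ignored.
function search_index_sizes(root::AbstractString)
    sizes = Dict{String,Int}()
    for name in readdir(root)
        occursin(r"^v\d+\.\d+\.\d+$", name) || continue
        path = joinpath(root, name, "search_index.js")
        if isfile(path)
            sizes[name] = filesize(path)
        end
    end
    return sizes
end

# Illustrative usage with fake data in a scratch directory.
root = mktempdir()
for (name, n) in [("v1.13.0", 300), ("v1.14.0", 500)]
    mkpath(joinpath(root, name))
    write(joinpath(root, name, "search_index.js"), "x"^n)
end
mkpath(joinpath(root, "dev"))
sizes = search_index_sizes(root)
println("search_index.js total: ", sum(values(sizes); init = 0), " bytes")
```

Removing the index from all but the newest version would then save roughly the total minus the largest entry.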

@odow
Member Author

odow commented Sep 6, 2023

I also wonder about just removing all of the v1.X symlinked folders. Do we really need to be able to link to JuMP.jl/v1.3/ and JuMP.jl/v1.3.0/?

@mortenpi
Contributor

mortenpi commented Sep 6, 2023

> I also wonder about just removing all of the v1.X symlinked folders. Do we really need to be able to link to JuMP.jl/v1.3/ and JuMP.jl/v1.3.0/?

What about retaining just the v1.x directories (i.e. one version of docs per minor)? Or do you want to keep all minor versions?

Potentially, this type of post-processing is also something that can maybe be done in the GitHub workflow that actually deploys to GitHub Pages: https://docs.github.com/en/pages/getting-started-with-github-pages/configuring-a-publishing-source-for-your-github-pages-site#creating-a-custom-github-actions-workflow-to-publish-your-site

@odow
Member Author

odow commented Sep 6, 2023

I guess the question is to what extent we want to support permalinks. If the docs can change between patch releases, then v1.X isn't stable and links may break.

I'm okay having a script that we periodically run to post-process things.

@odow
Member Author

odow commented Sep 21, 2023

This has come up before: JuliaDocs/Documenter.jl#1914

@odow
Member Author

odow commented Sep 21, 2023

This was actually slightly non-trivial to get working.

First up, we can change the doc build to use a GitHub action:
image
We also need to enable gh-pages to deploy to the environment
image

Then because pushes from GITHUB_TOKEN cannot trigger a GitHub action:
https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow

We need to upload with an SSH key:
https://github.com/odow/SDDP.jl/blob/30874e11e91dbd118816070b34498112fe8d209d/.github/workflows/documentation.yml#L18-L22

So it can trigger the deploy action:
https://github.com/odow/SDDP.jl/blob/gh-pages/.github/workflows/static-deploy.yml

We still need GITHUB_TOKEN for previews: odow/SDDP.jl#679

I haven't tried yet, but this would be the place to delete or re-direct any pages that we want to remove.

@odow
Member Author

odow commented Mar 26, 2024

Closing because the warnings have gone away, and we haven't seen a failure recently. I'll re-open if this becomes a problem in future.

image

odow closed this as completed on Mar 26, 2024