Build Tooling: Optimize build by caching results #15967

Open

aduth opened this issue Jun 3, 2019 · 6 comments
Labels
[Type] Build Tooling · [Type] Enhancement · [Type] Performance

Comments

@aduth
Member

aduth commented Jun 3, 2019

Previously: #15230, #13124

As explored previously in #13124 and discussed in #15230, we may be able to improve build times by leveraging a persisted cache of previously-built artifacts.

Implementation Notes:

@aduth
Member Author

aduth commented Jun 5, 2019

A complication here is that one source file will have multiple output files, so it's not quite straightforward to detect a cached result by the presence of a hash.

A couple ideas:

  • We could refactor the build worker to expose a function which produces the set of filenames that would result from a build, used both for checking for a cached result and for the build itself (see the sketch after this list).
  • The cache lookup result could be a directory of files rather than a single file.
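
A rough sketch of that first idea, in case it helps whoever picks this up. The getBuildOutputs name and the build/ and build-module/ output layout are assumptions for illustration, not the actual build-worker API:

const fs = require( 'fs' );
const path = require( 'path' );

// Given a source file, return every artifact a build of it would emit.
function getBuildOutputs( srcFile ) {
	const relative = path.relative( 'src', srcFile );
	return [
		path.join( 'build', relative ),
		path.join( 'build', relative + '.map' ),
		path.join( 'build-module', relative ),
		path.join( 'build-module', relative + '.map' ),
	];
}

// The cache check and the build itself can then share the same list:
function isFullyCached( srcFile ) {
	return getBuildOutputs( srcFile ).every( ( file ) => fs.existsSync( file ) );
}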

@noisysocks
Member

noisysocks commented Jun 6, 2019

Last week I started experimenting with an approach where we cache objects instead of files by serialising the object as a JSON string. You can see where I got to in the try/build-caching branch.

It's not finished, as there were errors when running the build. Still, if anyone would like to pick this up: go for it!

The assumption I made, which we will have to verify, is that these JSON.parse() calls are faster than Babel's transform() function.
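
For whoever verifies that, a quick micro-benchmark sketch (assumes @babel/core and @babel/preset-env are installed; src/example.js is a placeholder path):

const fs = require( 'fs' );
const { transformSync } = require( '@babel/core' );

const source = fs.readFileSync( 'src/example.js', 'utf8' );

console.time( 'transform' );
const result = transformSync( source, { presets: [ '@babel/preset-env' ] } );
console.timeEnd( 'transform' );

// Cache only the serialisable parts of the result, then time re-reading them.
const serialized = JSON.stringify( { code: result.code, map: result.map } );
console.time( 'JSON.parse' );
JSON.parse( serialized );
console.timeEnd( 'JSON.parse' );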

To keep the Travis cache download time reasonable, we must prune the cache, ideally automatically.

I'm thinking that we just recursively traverse the cache directory and delete files that are more than 3 months old. If traversal itself is too slow, then we could only perform this operation when Math.random() < 0.1.
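
Something like this minimal sketch, say (the cache location is an assumption; the age cutoff and sampling rate are as proposed above):

const fs = require( 'fs' );
const path = require( 'path' );

const CACHE_DIR = path.join( 'node_modules', '.cache', 'gutenbuild' );
const MAX_AGE_MS = 90 * 24 * 60 * 60 * 1000; // ~3 months

function prune( dir ) {
	for ( const entry of fs.readdirSync( dir, { withFileTypes: true } ) ) {
		const file = path.join( dir, entry.name );
		if ( entry.isDirectory() ) {
			prune( file );
		} else if ( Date.now() - fs.statSync( file ).mtimeMs > MAX_AGE_MS ) {
			fs.unlinkSync( file );
		}
	}
}

// If traversal proves too slow, only prune on ~10% of runs.
if ( Math.random() < 0.1 ) {
	prune( CACHE_DIR );
}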

@aduth
Member Author

aduth commented Jun 6, 2019

Last week I started experimenting with an approach where we cache objects instead of files by serialising the object as a JSON string. You can see where I got to in the try/build-caching branch.

As written, are you assuming that we'd have multiple identical source files which could benefit from the memoization?

I could also imagine writing this as a single file containing the stringified JSON object, as an alternative to the directory of files I considered in my previous comment, where the keys in the JSON object are equivalent to files in a directory. It's unclear whether this would make a performance difference, e.g. whether a file system would be faster at copying a file than at parsing JSON and writing a new file from the object contents.

I'm thinking that we just recursively traverse the cache directory and delete files that are more than 3 months old. If traversal itself is too slow, then we could only perform this operation when Math.random() < 0.1.

This makes sense, but I'm not sure when this would occur. For me, ideally it'd be some "fire and forget" (non-blocking) action which occurs after a build completes.

There are various lifecycle steps like after_success and after_script (documentation).

If it's not possible to use these as "non-blocking" (i.e. assuming the container shuts down after the build completes), maybe the cleanup could instead be a separate task of the build, since presumably it wouldn't take long and could run in parallel with other tasks.

@noisysocks
Member

noisysocks commented Jun 7, 2019

As written, are you assuming that we'd have multiple identical source files which could benefit from the memoization?

This isn't an explicit objective of what I wrote, no.

We could add the path of the input file to the string that is hashed, to prevent two identical source files from using the same cache entry. I figured this isn't necessary, though, since if two source files are identical, they will compile to the same result.

@noisysocks
Member

It's unclear whether this would make a performance difference, e.g. whether a file system would be faster at copying a file than at parsing JSON and writing a new file from the object contents.

Yeah, you're right that it's likely that a mv is faster.

Perhaps something like this for, e.g., a .cache/manifest.json:

{
	"src/example.js": {
		"hash": "1a79a4d60de6718e8e5b326e338ae533",
		"output": {
			"build/example.js": ".cache/1a79a4d60de6718e8e5b326e338ae533.js",
			"build/example.js.map": ".cache/1a79a4d60de6718e8e5b326e338ae533.js.map"
		}
	}
}
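
A sketch of how a lookup against that manifest might work (getCachedOutputs is a hypothetical helper; md5 matches the hash length in the example above):

const crypto = require( 'crypto' );
const fs = require( 'fs' );

function getCachedOutputs( manifest, srcFile ) {
	const entry = manifest[ srcFile ];
	if ( ! entry ) {
		return null;
	}
	const hash = crypto
		.createHash( 'md5' )
		.update( fs.readFileSync( srcFile ) )
		.digest( 'hex' );
	// Only reuse cached artifacts if the source is unchanged.
	return hash === entry.hash ? entry.output : null;
}

const manifest = JSON.parse( fs.readFileSync( '.cache/manifest.json', 'utf8' ) );
const outputs = getCachedOutputs( manifest, 'src/example.js' );
// → { 'build/example.js': '.cache/1a79….js', … } or null if stale/missing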

@aduth
Member Author

aduth commented Jun 10, 2019

I spent some time experimenting with this on Friday, and made some decent (but incomplete/buggy) progress. The work is currently reflected in the add/build-cache branch (specifically d2cf89e).

A few of my observations:

  • While it was natural to try to add caching as an enhancement of the current behavior in the build-worker.js file (the forked build process), and while it did mostly work and was fairly simple to implement, I quickly ran into race conditions. When multiple forked processes tried to copy over a cached result, they'd often try to create the recursive folder structure at the same time, resulting in errors (EEXIST: folder already exists). It might be possible to swallow these errors, but the tools I was trying to leverage (fs-extra, recursive-copy) did not handle them gracefully.
  • This led me instead to consider caching as part of a stream workflow in the managing parent process. This feels like a very appropriate use of streams, where the build is just one step in a pipeline of writing to / copying from the cache.
  • Cache invalidation rears its head, as tends to be the case. The implementation in d2cf89e maintains a list of files which are relevant to the build process, whose contents are incorporated into the cache checksum hash. I fear this is quite fragile and could easily fall out of sync (resulting in wrongful cache reuse).
  • The implementation of the cache is as the second point mentioned in Build Tooling: Optimize build by caching results #15967 (comment), where the result is a mapping of files which are written to a directory named by a checksum in node_modules/.cache/gutenbuild.
    • On reflection, I think there's a minor flaw which makes this suboptimal: when copying files from the cache, each copy must perform a recursive folder check. For example, considering edit.js and save.scss from block-library/src/paragraph, each would have to ensure that the destination directory (build/paragraph) exists or is created before copying the file. Ideally this is at least memoized, so a prior copy for the same directory is known to have guaranteed its existence.
  • We now have fs-extra as a dependency (from the recent release tool introduced in Bootstrap an automation tool #15699), which helps as a minor refactor to avoid promisification of fs functions (see build-worker.js changes in d2cf89e).
  • The necessary refactor to change the build tasks to return an object of file mappings (file name to file contents) proved beneficial regardless of the introduction of caching. In retrospect, having the build task manage the creation of its output directory added some unnecessary complexity (see the sketch after this list).
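
To illustrate that last point, a minimal sketch of a build task which returns a file mapping and leaves all writing (and directory creation) to the parent process. The function names and the sourceMaps option are illustrative, not the actual d2cf89e code:

const fs = require( 'fs' );
const path = require( 'path' );
const { transformFileAsync } = require( '@babel/core' );

// The task only computes outputs; it never touches the directory structure.
async function buildJS( srcFile ) {
	const { code, map } = await transformFileAsync( srcFile, { sourceMaps: true } );
	const dest = path.join( 'build', path.relative( 'src', srcFile ) );
	return {
		[ dest ]: code,
		[ dest + '.map' ]: JSON.stringify( map ),
	};
}

// The parent process owns all writes, so mkdir races between forked
// workers (the EEXIST errors above) can't occur.
function writeOutputs( outputs ) {
	for ( const [ file, contents ] of Object.entries( outputs ) ) {
		fs.mkdirSync( path.dirname( file ), { recursive: true } );
		fs.writeFileSync( file, contents );
	}
}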
