Refactored assembly to use in-memory model instead of IO model #464

fnqista · 2022-04-16T13:25:33Z

The advantage of this code is streaming IO, with minimal reliance on disk speed. References to files and jar entries are stored as () => InputStream. Unzip to disk is gone, which is slow on some systems/hard drives.

Everything retained:

all functionalities (shade, merge, caching)
reproducible build (entires built in a specific order, jar entry timestamps identical for all entries, hash consistent)

The following were new:

No unzip
I cannot rename directories, that is the advantage of the unzip function of the current implementation (since files are unzipped, parent renames will reflect automatically to children - I cannot do this efficiently with this implementation). For this implementation I can only rename files. If there is a project file that conflicts, the project name is appended (i.e. LICENSE_foo), where in the current implementation the folder named LICENSE is renamed instead. For dependencies, the old behavior was maintained (i.e.LICENSE_scala-library-2.13.3). Tests were updated to reflect this change
Corrected the reporting of file merge counts
I additionally check for conflicts of files with directories that have the same name and signal an error, which the current implementation does not (it only checks for LICENSE conflicts. The current implementation will only throw this error when the jar is being created. New tests were added to reflect this change
The option to create the jar in parallel which is faster! (1.5 seconds vs 5 seconds) The catch is the reproducible build and caching won't work if the jar is built in parallel. The new property is AssemblyOption.reproducibleBuild
Dropped cacheUnzip property in AssemblyOptions
Dropped excludedFiles in AssemblyOptions (?) this is not documented and seems unused
Caching is now implemented by checking the last update timestamps of all included dependencies, plus some of the properties from AssemblyOptions, and if the output jar exists already
Documentation updates
Merge strategy retains the same structure, but the implementation (inputs and outputs) have changed, to support the classes added (Dependency, Project, Library, MergedEntry, JarEntry)
Newline appending Input Stream (called AppendEofInputStream) added in AssemblyUtils to support MergeStrategy.concat with newlines

.scalafmt.conf

README.md

src/main/scala/sbtassembly/Assembly.scala

eed3si9n · 2022-04-16T17:52:51Z

@fnqista Thanks for this contribution. Looking forward to the perf improvements.
Since many Spark users use sbt-assembly and I am guessing that they use a lot of JARs, I wonder if we should do some before and after comparison to show case the difference in release notes etc.

src/main/scala/sbtassembly/Assembly.scala

fnqista · 2022-04-16T20:14:26Z

@fnqista Thanks for this contribution. Looking forward to the perf improvements. Since many Spark users use sbt-assembly and I am guessing that they use a lot of JARs, I wonder if we should do some before and after comparison to show case the difference in release notes etc.

When the computer is fast, like in my home computer which was upgraded recently, the difference between the old and new implementation is not that pronounced:

System: AMD Ryzen 5900X 12c/24t with NVME Gen4 drive (7000 MB/sec read)
Command: sbt clean assembly
Project: assembly jar ~90MB in size

Old implementation
[success] Total time: 22 s, completed Apr 16, 2022, 10:04:32 PM
New implementation
[success] Total time: 20 s, completed Apr 16, 2022, 10:13:26 PM

However in my work computer, I have pre-tested before and something that takes 15 minutes in the old implementation only takes 29 seconds now. I will gather evidence when I am back at work after the Easter.

README.md

eed3si9n · 2022-04-16T20:56:50Z

Here's a result from assembling a fresh Play app on my old mac laptop

sbt-assembly	assembly (median)	assembly (max out of 5)
Baseline (1.2.0)	10s	12s
This PR (default)	7s	9s
This PR (`ThisBuild / assemblyRepeatableBuild := false`)	5s	5s

Implemented review comments: - Fixed "getPackageName" API call that fails in JDK8, used "getPackage.getName" instead - Nixed scalafmt (need a future PR) - Changed unactionable user logging/timers to DEBUG - Removed the 2.0.0 features from the README (instead, need to include it later on as part of the RELEASE NOTES) - Updated README to show an example config of `ThisBuild / repeatableBuild` - Updated README to clarify that jar entries need to be built in a specific 'order' for a consistent hash Other fixes: - Removed scalactic dependency - Nixed extra spacing, inline some vals - Renamed some vals - Removed extra manifest checking - Nixed errant comment in custom merge strategy test - `transferAndClose(is, os)` changed to `transfer(is, os)` since the former doesn't close the os anyway, instead we rely on the wrapping `Using` - Changed `AssemblyOption.repeatableBuild` default to `true` in contraband (it should be fine as is since it will be set to `true` by the Plugin, but changed anyway to show the intent that it was meant to be defaulted to `true`)

Minor change: Print the path of the built jar

fnqista · 2022-04-18T11:36:20Z

@eed3si9n Release notes draft:
https://gist.github.com/fnqista/56fb6c5957302ba4af32c6c2b7dcb7c8

Bugfix: Renames should only be applied once (existing 1.x behavior) Found one bug while creating the release notes. The 1.x version correctly renames only once, so we adapt this behavior Added a test for this scenario Assigned the classes to a val for clarity

src/main/scala/sbtassembly/AssemblyPlugin.scala

fnqista · 2022-04-19T19:02:10Z

@eed3si9n The method signature of MergeStrategy.merge also has changed:

merge(conflicts: Vector[Dependency]): Either[String, Vector[JarEntry]] whereas the old one just uses standard primitive and collection types.

This is exposed to the user because they can create custom merge strategies. Shall we use contraband for these to at least guarantee some level of backward compatibility?

src/main/scala/sbtassembly/MergeStrategy.scala

Change the implementation of MergeStrategy for user convenience, as suggested by Eugene Yokota Namespaced built-in merge strategies from custom user-provided ones Rename of class files now reported as an error. Shade rules should be used instead Updated plug-in auto imports/created implicit for String -> Path for the convenience of creating a custom merge strategy Updated cache logic so renames are only processed if the cache is invalid Other minor code and test updates #418

fnqista · 2022-04-20T21:22:15Z

@eed3si9n
Major PR updates:

As agreed, I changed the implementation of MergeStrategy as per your suggestion:

val first: MergeStrategy = MergeStrategy("First") { conflicts =>
  Right(Vector(JarEntry(conflicts.head.target, conflicts.head.stream)))
}

For custom merge strategies (user-provided):

ThisBuild / assemblyMergeStrategy := {
...
case _ => CustomMergeStrategy("First") { conflicts =>
  Right(Vector(JarEntry(conflicts.head.target, conflicts.head.stream)))
}

I have to change the object for a custom merge strategy because:
- Primarily to namespace them. It is easy for a user to create a custom merge strategy that is also named First, but properly namespacing them makes us distinguish each other when needed in both the code and in the logs.
- To prevent future plugin contributors who will create/modify a built-in MergeStrategy in as a "custom" one, again breaking namespacing.
  You can browse the code under sbtassembly.MergeStrategy. I used a sealed trait that extends a Function1 plus two descendants to represent the namespacing. Companion objects are also created for them.
And another one - if the user includes .class files in a built-in rename strategy, I throw an error now (well, really, return a Left), and inform them that they should use shadeRules instead. In 1.2.0 this was ignored but also silently a no-op, without informing the user. I think it is important to be transparent with the plugin behavior regarding this. Code is in MergeStrategy, plus a new test is added for this.
I updated the README to clarify that a MergeStrategy doesn't only merge, it also transforms, such as in the case of a rename, where it probably matches only a single file. I think it is important that this is communicated to the users.
I exposed JarEntry and CustomMergeStrategy as an auto import in AssemblyPlugin, since I have tried this myself and found out that this increases the user's convenience.
I also created an implicit class conversion from String -> java.nio.file.Path since Path is used by JarEntry, again in AssemblyPlugin.
I implemented the other review comment about test in assembly := {},, which now properly returns Unit, again in AssemblyPlugin.
I type-aliased () => InputStream as LazyInputStream in Assembly
I finally moved the rename logic inside the cache block too. Now, to check the cache, nothing is every really done except to build the Dependency to check if their timestamps have changed and to build a Map of paths to merge strategies. Before, we always process the renames before building the merge strategy map/checking the cache. I don't know what I was thinking before, but it seems that we don't always need to process renames before checking the cache (I hope I am right). Anyway, I tested it quite a few times and it seems solid. Code is in Assembly.
I created the MergeStrategy.isBuiltIn property to namespace custom ones from, well, built-in ones - and make sure it is included in the cache checking. Also in Assembly.
Minor code and test updates due to the above changes.

src/main/scala/sbtassembly/Assembly.scala

Update case class member types from `java.nio.file.Path` to `String`, to communicate that it doesn't accept any kind of `Path`s (absolute or URI-based), but just plain relative `Path`s #418

eed3si9n

LGTM

fnqista · 2022-04-23T22:42:43Z

@eed3si9n
So, me and a few colleagues tested this at work over the past week and we found some bugs, all of them related to custom merge strategies.

The Custom rename strategy method signature must be changed so it can only return a String representing the renamed target, instead of Vector[JarEntry], to prevent users from corrupting the result, when they should only be renaming the target. So the correct signature should be CustomStrategy.rename(f: Dependency => String) instead of the old CustomStrategy.rename(f: Vector[Dependency] => Vector[Either[String, JarEntry]])
Renames should really be done before evaluating the cache, which I broke in my previous commit. Otherwise, if we do it after checking the cache, the assembly might not execute when it should, in cases where the rename might cause another conflict.
Custom merge strategies, other than rename, breaks the caching. In this PR, the only way to check if the merge strategies have changed is by creating a Map[String, String] representing the matched target and the merge strategy name. This is mostly ok for built-in strategies and the custom rename because their code/logic doesn't change. But with custom merge strategies, the name means nothing when caching since we cannot predict what the user will do, thereby it breaks caching again by the code thinking that nothing has changed when in fact the user updated the merge strategy logic and assembly should re-run.

I have prepared some fixes but I am still testing them, hopefully I can make a new push within a couple of days.

EDIT: Clarifying point no. 3 about custom renames being OK - If we fix no. 2 (rename before caching), then custom renames will happen before the cache.

Rename not reported due to `notifyThreshold` set to `2`, now set to `1` Prevent users from corrupting the `CustomMergeStrategy.rename` by restricting what they can do. The function now passes only a single `Dependency` and returns a `String` for the renamed path Fix caching bug - rename should be done before preparing the cache inputs, since renaming can cause another match on the second pass to rebuild the jar Fix caching bug - custom strategies other than rename, should always invalidate the cache because we cannot predict their output Others: Log shade result if on debug

fnqista · 2022-04-24T18:44:50Z

@eed3si9n So, me and a few colleagues tested this at work over the past week and we found some bugs, all of them related to custom merge strategies.

The Custom rename strategy method signature must be changed so it can only return a String representing the renamed target, instead of Vector[JarEntry], to prevent users from corrupting the result, when they should only be renaming the target. So the correct signature should be CustomStrategy.rename(f: Dependency => String) instead of the old CustomStrategy.rename(f: Vector[Dependency] => Vector[Either[String, JarEntry]])

Renames should really be done before evaluating the cache, which I broke in my previous commit. Otherwise, if we do it after checking the cache, the assembly might not execute when it should, in cases where the rename might cause another conflict.

Custom merge strategies, other than rename, breaks the caching. In this PR, the only way to check if the merge strategies have changed is by creating a Map[String, String] representing the matched target and the merge strategy name. This is mostly ok for built-in strategies and the custom rename because their code/logic doesn't change. But with custom merge strategies, the name means nothing when caching since we cannot predict what the user will do, thereby it breaks caching again by the code thinking that nothing has changed when in fact the user updated the merge strategy logic and assembly should re-run.

I have prepared some fixes but I am still testing them, hopefully I can make a new push within a couple of days.

EDIT: Clarifying point no. 3 about custom renames being OK - If we fix no. 2 (rename before caching), then custom renames will happen before the cache.

Fixes committed for the above...hopefully the last one.

EDIT: Also updated draft release notes: https://gist.github.com/fnqista/56fb6c5957302ba4af32c6c2b7dcb7c8

fnqista · 2022-04-29T08:03:04Z

@eed3si9n Can you review the pending changes and then create an RC1 candidate when you have time?

eed3si9n

If the test passes I am good here.

These two custom `MergeStrategy`s are no longer used: * `MergeLog4j2PluginCachesStrategy` : introduced with #161 and made obsolete with #682 . Note that, should you still need something like this elsewhere, https://github.com/idio/sbt-assembly-log4j2 also provides an implementation (but maybe better to just use logback?). * `MergeMimeTypesStrategy` : introduced with #292 and made obsolete with #412 Both classes extend `sbtassembly.MergeStrategy`, which was possible up until sbt/sbt-assembly#464 - see sbt/sbt-assembly#464 (comment) - where that type became a _sealed_ trait that could not be extended. While the existing build for this project specified an old version of `sbt-assembly`, it appears that the Scala Steward process brings in a newer version (>=v2.0.0) of `sbt-assembly`, that has the change introduced with sbt/sbt-assembly#464 . Consequently, our Scala Steward job for this repo is currently failing: https://github.com/guardian/scala-steward-public-repos/actions/runs/4513026623/jobs/7947284010#step:6:932 ...with errors like: ``` /home/runner/scala-steward/workspace/repos/guardian/mobile-n10n/project/MergeLog4j2PluginCachesStrategy.scala:10:47: illegal inheritance from sealed trait MergeStrategy class MergeLog4j2PluginCachesStrategy extends MergeStrategy { ``` Upgrading `mobile-n10n` to use the latest version of `sbt-assembly` makes this problem apparent with just a regular build (as opposed to the special Scala Steward build), and the easiest fix to the problem is just to remove the superfluous custom `MergeStrategy`s - Delete The Obsolete!

Refactored assembly to use in-memory model instead of IO model

6939a02