Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactored assembly to use in-memory model instead of IO model #464

Merged
merged 7 commits into from
May 1, 2022
Merged

Refactored assembly to use in-memory model instead of IO model #464

merged 7 commits into from
May 1, 2022

Conversation

fnqista
Copy link

@fnqista fnqista commented Apr 16, 2022

The advantage of this code is streaming IO, with minimal reliance on disk speed. References to files and jar entries are stored as () => InputStream. Unzip to disk is gone, which is slow on some systems/hard drives.

Everything retained:

  • all functionalities (shade, merge, caching)
  • reproducible build (entires built in a specific order, jar entry timestamps identical for all entries, hash consistent)

The following were new:

  • No unzip
  • I cannot rename directories, that is the advantage of the unzip function of the current implementation (since files are unzipped, parent renames will reflect automatically to children - I cannot do this efficiently with this implementation). For this implementation I can only rename files. If there is a project file that conflicts, the project name is appended (i.e. LICENSE_foo), where in the current implementation the folder named LICENSE is renamed instead. For dependencies, the old behavior was maintained (i.e.LICENSE_scala-library-2.13.3). Tests were updated to reflect this change
  • Corrected the reporting of file merge counts
  • I additionally check for conflicts of files with directories that have the same name and signal an error, which the current implementation does not (it only checks for LICENSE conflicts. The current implementation will only throw this error when the jar is being created. New tests were added to reflect this change
  • The option to create the jar in parallel which is faster! (1.5 seconds vs 5 seconds) The catch is the reproducible build and caching won't work if the jar is built in parallel. The new property is AssemblyOption.reproducibleBuild
  • Dropped cacheUnzip property in AssemblyOptions
  • Dropped excludedFiles in AssemblyOptions (?) this is not documented and seems unused
  • Caching is now implemented by checking the last update timestamps of all included dependencies, plus some of the properties from AssemblyOptions, and if the output jar exists already
  • Documentation updates
  • Merge strategy retains the same structure, but the implementation (inputs and outputs) have changed, to support the classes added (Dependency, Project, Library, MergedEntry, JarEntry)
  • Newline appending Input Stream (called AppendEofInputStream) added in AssemblyUtils to support MergeStrategy.concat with newlines

.scalafmt.conf Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@eed3si9n
Copy link
Member

@fnqista Thanks for this contribution. Looking forward to the perf improvements.
Since many Spark users use sbt-assembly and I am guessing that they use a lot of JARs, I wonder if we should do some before and after comparison to show case the difference in release notes etc.

@fnqista
Copy link
Author

fnqista commented Apr 16, 2022

@fnqista Thanks for this contribution. Looking forward to the perf improvements. Since many Spark users use sbt-assembly and I am guessing that they use a lot of JARs, I wonder if we should do some before and after comparison to show case the difference in release notes etc.

When the computer is fast, like in my home computer which was upgraded recently, the difference between the old and new implementation is not that pronounced:

System: AMD Ryzen 5900X 12c/24t with NVME Gen4 drive (7000 MB/sec read)
Command: sbt clean assembly
Project: assembly jar ~90MB in size

  • Old implementation
    [success] Total time: 22 s, completed Apr 16, 2022, 10:04:32 PM

  • New implementation
    [success] Total time: 20 s, completed Apr 16, 2022, 10:13:26 PM

However in my work computer, I have pre-tested before and something that takes 15 minutes in the old implementation only takes 29 seconds now. I will gather evidence when I am back at work after the Easter.

README.md Show resolved Hide resolved
@eed3si9n
Copy link
Member

eed3si9n commented Apr 16, 2022

Here's a result from assembling a fresh Play app on my old mac laptop

sbt-assembly assembly (median) assembly (max out of 5)
Baseline (1.2.0) 10s 12s
This PR (default) 7s 9s
This PR (ThisBuild / assemblyRepeatableBuild := false) 5s 5s

Implemented review comments:
- Fixed "getPackageName" API call that fails in JDK8, used "getPackage.getName" instead
- Nixed scalafmt (need a future PR)
- Changed unactionable user logging/timers to DEBUG
- Removed the 2.0.0 features from the README (instead, need to include it later on as part of the RELEASE NOTES)
- Updated README to show an example config of `ThisBuild / repeatableBuild`
- Updated README to clarify that jar entries need to be built in a specific 'order' for a consistent hash

Other fixes:
- Removed scalactic dependency
- Nixed extra spacing, inline some vals
- Renamed some vals
- Removed extra manifest checking
- Nixed errant comment in custom merge strategy test
- `transferAndClose(is, os)` changed to `transfer(is, os)` since the former doesn't close the os anyway, instead we rely on the wrapping `Using`
- Changed `AssemblyOption.repeatableBuild` default to `true` in contraband (it should be fine as is since it will be set to `true` by the Plugin, but changed anyway to show the intent that it was meant to be defaulted to `true`)
Minor change:
Print the path of the built jar
@fnqista
Copy link
Author

fnqista commented Apr 18, 2022

@fnqista fnqista requested a review from eed3si9n April 18, 2022 11:36
Bugfix:
Renames should only be applied once (existing 1.x behavior)

Found one bug while creating the release notes. The 1.x version correctly renames only once, so we adapt this behavior
Added a test for this scenario
Assigned the classes to a val for clarity
@fnqista
Copy link
Author

fnqista commented Apr 19, 2022

@eed3si9n The method signature of MergeStrategy.merge also has changed:

merge(conflicts: Vector[Dependency]): Either[String, Vector[JarEntry]] whereas the old one just uses standard primitive and collection types.

This is exposed to the user because they can create custom merge strategies. Shall we use contraband for these to at least guarantee some level of backward compatibility?

Change the implementation of MergeStrategy for user convenience, as suggested by Eugene Yokota
Namespaced built-in merge strategies from custom user-provided ones
Rename of class files now reported as an error. Shade rules should be used instead
Updated plug-in auto imports/created implicit for String -> Path for the convenience of creating a custom merge strategy
Updated cache logic so renames are only processed if the cache is invalid
Other minor code and test updates
#418
@fnqista
Copy link
Author

fnqista commented Apr 20, 2022

@eed3si9n
Major PR updates:

  • As agreed, I changed the implementation of MergeStrategy as per your suggestion:
val first: MergeStrategy = MergeStrategy("First") { conflicts =>
  Right(Vector(JarEntry(conflicts.head.target, conflicts.head.stream)))
}
  • For custom merge strategies (user-provided):
ThisBuild / assemblyMergeStrategy := {
...
case _ => CustomMergeStrategy("First") { conflicts =>
  Right(Vector(JarEntry(conflicts.head.target, conflicts.head.stream)))
}
  • I have to change the object for a custom merge strategy because:

    • Primarily to namespace them. It is easy for a user to create a custom merge strategy that is also named First, but properly namespacing them makes us distinguish each other when needed in both the code and in the logs.
    • To prevent future plugin contributors who will create/modify a built-in MergeStrategy in as a "custom" one, again breaking namespacing.
      You can browse the code under sbtassembly.MergeStrategy. I used a sealed trait that extends a Function1 plus two descendants to represent the namespacing. Companion objects are also created for them.
  • And another one - if the user includes .class files in a built-in rename strategy, I throw an error now (well, really, return a Left), and inform them that they should use shadeRules instead. In 1.2.0 this was ignored but also silently a no-op, without informing the user. I think it is important to be transparent with the plugin behavior regarding this. Code is in MergeStrategy, plus a new test is added for this.

  • I updated the README to clarify that a MergeStrategy doesn't only merge, it also transforms, such as in the case of a rename, where it probably matches only a single file. I think it is important that this is communicated to the users.

  • I exposed JarEntry and CustomMergeStrategy as an auto import in AssemblyPlugin, since I have tried this myself and found out that this increases the user's convenience.

  • I also created an implicit class conversion from String -> java.nio.file.Path since Path is used by JarEntry, again in AssemblyPlugin.

  • I implemented the other review comment about test in assembly := {},, which now properly returns Unit, again in AssemblyPlugin.

  • I type-aliased () => InputStream as LazyInputStream in Assembly

  • I finally moved the rename logic inside the cache block too. Now, to check the cache, nothing is every really done except to build the Dependency to check if their timestamps have changed and to build a Map of paths to merge strategies. Before, we always process the renames before building the merge strategy map/checking the cache. I don't know what I was thinking before, but it seems that we don't always need to process renames before checking the cache (I hope I am right). Anyway, I tested it quite a few times and it seems solid. Code is in Assembly.

  • I created the MergeStrategy.isBuiltIn property to namespace custom ones from, well, built-in ones - and make sure it is included in the cache checking. Also in Assembly.

  • Minor code and test updates due to the above changes.

@fnqista fnqista requested a review from eed3si9n April 20, 2022 21:22
Update case class member types from `java.nio.file.Path` to `String`, to communicate that it doesn't accept any kind of `Path`s (absolute or URI-based), but just plain relative `Path`s
#418
@fnqista fnqista requested a review from eed3si9n April 21, 2022 15:37
Copy link
Member

@eed3si9n eed3si9n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fnqista
Copy link
Author

fnqista commented Apr 23, 2022

@eed3si9n
So, me and a few colleagues tested this at work over the past week and we found some bugs, all of them related to custom merge strategies.

  • The Custom rename strategy method signature must be changed so it can only return a String representing the renamed target, instead of Vector[JarEntry], to prevent users from corrupting the result, when they should only be renaming the target. So the correct signature should be CustomStrategy.rename(f: Dependency => String) instead of the old CustomStrategy.rename(f: Vector[Dependency] => Vector[Either[String, JarEntry]])

  • Renames should really be done before evaluating the cache, which I broke in my previous commit. Otherwise, if we do it after checking the cache, the assembly might not execute when it should, in cases where the rename might cause another conflict.

  • Custom merge strategies, other than rename, breaks the caching. In this PR, the only way to check if the merge strategies have changed is by creating a Map[String, String] representing the matched target and the merge strategy name. This is mostly ok for built-in strategies and the custom rename because their code/logic doesn't change. But with custom merge strategies, the name means nothing when caching since we cannot predict what the user will do, thereby it breaks caching again by the code thinking that nothing has changed when in fact the user updated the merge strategy logic and assembly should re-run.

I have prepared some fixes but I am still testing them, hopefully I can make a new push within a couple of days.

EDIT: Clarifying point no. 3 about custom renames being OK - If we fix no. 2 (rename before caching), then custom renames will happen before the cache.

Rename not reported due to `notifyThreshold` set to `2`, now set to `1`
Prevent users from corrupting the `CustomMergeStrategy.rename` by restricting what they can do. The function now passes only a single `Dependency` and returns a `String` for the renamed path
Fix caching bug - rename should be done before preparing the cache inputs, since renaming can cause another match on the second pass to rebuild the jar
Fix caching bug - custom strategies other than rename, should always invalidate the cache because we cannot predict their output
Others:
Log shade result if on debug
@fnqista
Copy link
Author

fnqista commented Apr 24, 2022

@eed3si9n So, me and a few colleagues tested this at work over the past week and we found some bugs, all of them related to custom merge strategies.

  • The Custom rename strategy method signature must be changed so it can only return a String representing the renamed target, instead of Vector[JarEntry], to prevent users from corrupting the result, when they should only be renaming the target. So the correct signature should be CustomStrategy.rename(f: Dependency => String) instead of the old CustomStrategy.rename(f: Vector[Dependency] => Vector[Either[String, JarEntry]])
  • Renames should really be done before evaluating the cache, which I broke in my previous commit. Otherwise, if we do it after checking the cache, the assembly might not execute when it should, in cases where the rename might cause another conflict.
  • Custom merge strategies, other than rename, breaks the caching. In this PR, the only way to check if the merge strategies have changed is by creating a Map[String, String] representing the matched target and the merge strategy name. This is mostly ok for built-in strategies and the custom rename because their code/logic doesn't change. But with custom merge strategies, the name means nothing when caching since we cannot predict what the user will do, thereby it breaks caching again by the code thinking that nothing has changed when in fact the user updated the merge strategy logic and assembly should re-run.

I have prepared some fixes but I am still testing them, hopefully I can make a new push within a couple of days.

EDIT: Clarifying point no. 3 about custom renames being OK - If we fix no. 2 (rename before caching), then custom renames will happen before the cache.

Fixes committed for the above...hopefully the last one.

EDIT: Also updated draft release notes: https://gist.github.com/fnqista/56fb6c5957302ba4af32c6c2b7dcb7c8

@fnqista fnqista requested a review from eed3si9n April 24, 2022 18:45
@fnqista
Copy link
Author

fnqista commented Apr 29, 2022

@eed3si9n Can you review the pending changes and then create an RC1 candidate when you have time?

Copy link
Member

@eed3si9n eed3si9n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the test passes I am good here.

@eed3si9n eed3si9n merged commit 592fe87 into sbt:develop May 1, 2022
rtyley added a commit to guardian/mobile-n10n that referenced this pull request Mar 24, 2023
These two custom `MergeStrategy`s are no longer used:

* `MergeLog4j2PluginCachesStrategy` : introduced with
   #161 and made obsolete
   with #682 . Note that,
   should you still need something like this elsewhere,
   https://github.com/idio/sbt-assembly-log4j2 also provides an
   implementation (but maybe better to just use logback?).
* `MergeMimeTypesStrategy` : introduced with
   #292 and made obsolete
   with #412

Both classes extend `sbtassembly.MergeStrategy`, which was possible up
until sbt/sbt-assembly#464 - see
sbt/sbt-assembly#464 (comment) -
where that type became a _sealed_ trait that could not be extended.

While the existing build for this project specified an old version of
`sbt-assembly`, it appears that the Scala Steward process brings in
a newer version (>=v2.0.0) of `sbt-assembly`, that has the change
introduced with sbt/sbt-assembly#464 .
Consequently, our Scala Steward job for this repo is currently failing:

https://github.com/guardian/scala-steward-public-repos/actions/runs/4513026623/jobs/7947284010#step:6:932

...with errors like:

```
/home/runner/scala-steward/workspace/repos/guardian/mobile-n10n/project/MergeLog4j2PluginCachesStrategy.scala:10:47: illegal inheritance from sealed trait MergeStrategy
class MergeLog4j2PluginCachesStrategy extends MergeStrategy {
```

Upgrading `mobile-n10n` to use the latest version of `sbt-assembly`
makes this problem apparent with just a regular build (as opposed to
the special Scala Steward build), and the easiest fix to the problem
is just to remove the superfluous custom `MergeStrategy`s - Delete The
Obsolete!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants