Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to a generated license-classifications.yml based on ScanCode's License DB #64

Merged
merged 4 commits into from
Oct 10, 2022

Conversation

fviernau
Copy link
Member

Add a Gradle task which generates license-classifications.yml from ScanCode's license data hosted at aboutcode.org and make use of the generated file.

Fixes #59.

Note: As a follow up, policy rules will need to be extended to handle new license categories.
For now, the newly added categories will still be flagged by the evaluator rule UNHANDLED_LICENSE.

@fviernau fviernau force-pushed the generate-license-classificiations branch 5 times, most recently from b546063 to fc8f779 Compare September 29, 2022 11:37
@fviernau fviernau marked this pull request as ready for review September 29, 2022 11:37
@fviernau fviernau requested review from a team as code owners September 29, 2022 11:37
@fviernau fviernau force-pushed the generate-license-classificiations branch 2 times, most recently from dc22349 to ab86897 Compare September 30, 2022 07:26
- name: "permissive"
- name: "proprietary-free"
- name: "public-domain"
- name: "source-available"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does ScanCode have 2-3 description for each of the categories? Maybe ask @pombredanne.
Would be nice if we could add a category description as even I struggle to fully understand the source-available and unstated-license categories

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, it makes me feel a bit uneasy to solely rely on ScanCode's categorization here. We already have an IMO too tight coupling to ScanCode and its license texts in ORT's SPDX module, and I'd rather not introduce similar couplings elsewhere, but stay scanner-agnostic where we can.

So I'm wondering why we can't do as proposed here and simply use LDBcollector's classification, which at least does not rely on a single source / "opinion" only, and also nicely prefixes (at least some) category names with a hint about what the category semantically is about.

Copy link
Member Author

@fviernau fviernau Oct 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, it makes me feel a bit uneasy to solely rely on ScanCode's categorization here. We already have an IMO too tight coupling to ScanCode and its license texts in ORT's SPDX module, and I'd rather not introduce similar couplings elsewhere, but stay scanner-agnostic where we can.

I believe 'ort-config' is about providing a real world example, providing amongst others the following value

  1. Users can get started quickly.
  2. Users can see how a real world setup could look like.
  3. It allows building and sharing a vision of a good config we consider reasonable.
  4. Users can see how curations / package configurations and policy rules work together
  5. Users can instantly run all stages: analyzer, scanner, evaluator, advisor, reporter and
    see the report with the violations

Regarding point 5: When looking at the report of an arbitrary scan, it will be cluttered with a bunch of UNHANDLED_LICENSE issues. So, it's quite obfuscated how effort / issues is there to fix, as classifying the licenses "maybe" lead to further issues. Having a lot of licenses classified, which this PR does, improves that situation quite a bit, the report created by default will be more meaningful.

In terms of getting the user started quickly, I believe it makes sense to stick to the de-facto default scanner ScanCode for the following reasons:

  1. A setup for other scanners would look similar, so it does not illustrate really new cases.
    As it's similar. it's easy for the user to create if the ort-config is understood.
  2. Adding categories from mutliple scanners adds complexity. It makes it harder to understand and iterate on the configuration.
  3. Categories from mutliple scanners may overlap semantically
  4. To have the setup kind of complete, policy rules for all the categories have to be written.
    Who's gonna do that?

So, at the very minimum, I believe adding categories beyond ScanCode should be considered out of scope of this PR.
I'd vote against adding other categories due to how I see the effort / benefit ratio.

Comment on lines +3 to +7
import com.fasterxml.jackson.databind.PropertyNamingStrategies
import com.fasterxml.jackson.databind.json.JsonMapper
import com.fasterxml.jackson.module.kotlin.KotlinFeature
import com.fasterxml.jackson.module.kotlin.KotlinModule
import com.fasterxml.jackson.module.kotlin.readValue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we're trying to avoid Jackson in ORT Core (also see oss-review-toolkit/ort#3904), would you agree that it also makes sense to rather use kotlinx-serialization here?

Copy link
Member Author

@fviernau fviernau Oct 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could make sense yes. However, Jackson has been used in that module already before, so I went for local consistency. I don't want to put in the effort to refactor this PR to use 'kotlinx-serialization' now. I wouldn't object if someone did it on top, after my PR is merged.

Comment on lines 98 to 127
private fun downloadLicenseDetailsBatched(
licenses: Collection<License>,
threadCount: Int = 60
): Map<License, LicenseDetails> {
val queue = ConcurrentLinkedQueue(licenses)
val result = ConcurrentHashMap<License, LicenseDetails>()

val threads = List(threadCount) { _ ->
Thread {
var license = queue.poll()

while (license != null) {
result[license] = downloadLicenseDetails(license)
license = queue.poll()
}
}
}

val executionTimeMillis = measureTimeMillis {
threads.run {
forEach { it.start() }
forEach { it.join() }
}
}

LOGGER.quiet("Downloaded ${result.size} files in ${executionTimeMillis / 1000f} seconds on $threadCount threads.")

return result.toMap()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered using the download plugin instead to safe some custom code, similar to like we already do in the spdx-utils module?

In general, Gradle discourages plugins to manually create threads to parallelize work. Instead, the Worker API should be used. The download plugin already does that.

Copy link
Member Author

@fviernau fviernau Oct 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered using the download plugin instead to safe some custom code, similar to like we already do in the spdx-utils module?

I decided against it as it leads to a coupling with Gradle tasks, and seemed in the end a bit more complex.
This way the code is easily readable even for people who are not into Gradle tasks.
I found it nicer not to add a dependency for such small thing as parallelizing a download and instead have full control over parallelization.

In general, Gradle discourages plugins to manually create threads to parallelize work. Instead, the Worker API should be used. The download plugin already does that.

This task is run manually only very rarely. Given that I believe the manual creation of threads is just a theoretical problem in this case. Would you agree?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way the code is easily readable even for people who are not into Gradle tasks.

Actually, I found it harder to read (and to check for correctness) than a Download-task.

This task is run manually only very rarely. Given that I believe the manual creation of threads is just a theoretical problem in this case. Would you agree?

Independently of whether the problem is theoretical or real, I believe we should not introduce code that does not adhere to Gradle best practices. But I'm curious about what other @oss-review-toolkit/kotlin-devs say.

Copy link
Member Author

@fviernau fviernau Oct 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I found it harder to read (and to check for correctness) than a Download-task.

I believe that's subjective.

Independently of whether the problem is theoretical or real, I believe we should not introduce code that does not adhere to Gradle best practices.

In general I believe if there can be good reasons to not follow best practices, e.g. there's an exception to almost every rule. So, I believe using the fact that something is best practices as rationale why an exception wrt to the best practices should not be made, does not make much sense.

Apart from that, I was arguing that I would like to keep the implementation separate from Gradle.
So, why should Gradle best practice apply to code which is just plain Kotlin without Gradle dependencies?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I found it harder to read (and to check for correctness) than a Download-task.

I believe that's subjective.

Sure it is. Just like your statement before, that the code would be easier to read.

So, why should Gradle best practice apply to code which is just plain Kotlin without Gradle dependencies?

Because it's actually only being used within Gradle build files, and thus using Gradle infrastructure, which has its own rules. And I currently don't see the code ever being used elsewhere.

Apart from that, I was arguing that I would like to keep the implementation separate from Gradle.

Why? If you have concrete plans to use the code elsewhere later, please share them so we can make more provident reviews.

@sschuberth sschuberth dismissed their stale review October 10, 2022 08:45

Dismissing my review in favor of others.

@sschuberth sschuberth requested a review from a team October 10, 2022 08:46
Copy link
Member

@MarcelBochtler MarcelBochtler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I think that this comment should still be addressed:
#64 (comment)

This revision has the fix for serializing license classifications, see
oss-review-toolkit/ort#5882.

Signed-off-by: Frank Viernau <frank_viernau@epam.com>
The task downloads the "ScanCode License DB" [1] data from [2], derives
ORT license classifications from that data and finally writes that
result to `license-classifcations.yml` in the root directory of the
code repository.

The data isn't modified except for the following:

1. Exceptions are filtered out.
2. License IDs which do not refer to specific licenses, like the ones
   with `is_unknown == true` or `is_generic == true`, are re-assigned
   to separate categories.

[1] https://github.com/nexB/scancode-licensedb
[2] https://scancode-licensedb.aboutcode.org

Signed-off-by: Frank Viernau <frank_viernau@epam.com>
Update the file by running [1] which generates license classifications
based on [2]. That license data corresponds to ScanCode version
`31.1.1`, see also the bottom of [2].

Note: For all licenses which are assigned to a category which is new in
`license-classifications.yml` the `UNHANDLED_LICENSE` violations will
continue to be flagged. `UNHANDLED_LICENSE` violations for all licenses
which are now assigned to a non-offending category, like
e.g. "permissive", dissappear.

[1] `./gradlew generateLicenseClassifications`
[2] https://scancode-licensedb.aboutcode.org/

Signed-off-by: Frank Viernau <frank_viernau@epam.com>
Signed-off-by: Frank Viernau <frank_viernau@epam.com>
@fviernau fviernau force-pushed the generate-license-classificiations branch from ab86897 to be6d816 Compare October 10, 2022 09:39
@fviernau
Copy link
Member Author

fviernau commented Oct 10, 2022

LGTM, but I think that this comment should still be addressed:
#64 (comment)

I've addressed that comment (just now).

@fviernau fviernau merged commit 96cb835 into main Oct 10, 2022
@fviernau fviernau deleted the generate-license-classificiations branch October 10, 2022 09:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a Gradle task for generating a license-classifications.yml from ScanCode's license data
4 participants