Allow curations to be applied without rerunning the analyzer #6188

sschuberth · 2022-12-07T12:58:52Z

Currently, the turnaround times for testing (technical) curations (also see #6187) locally or on CI are rather high because curations are "baked" into Analyzer results. This means one needs to rerun the analyzer, even if the previous analysis was successful, just to get updated curation data applied to its results.

Just brainstorming some ideas on how to address this (without any ordering implied):

Do not store curations as part of Analyzer results at all, but create a new "Curator" tool that takes the Analyzer result as input and adds curation data to a separate section in the ORT result file, similar to how every tool other than the Analyzer works.
Create a helper command that can patch up analyzer results with updated / different curations. Actually, there already is the PackageCurationsCommand's SetCommand, which could probably be used for this.

sschuberth · 2022-12-12T09:19:12Z

BTW, depending on the implementation, this could also solve #5637.

mnonnenmacher · 2022-12-19T09:32:47Z

Suggested name for the new ORT result section that could contain package curations, package configurations and resolutions: "corrections"

mnonnenmacher · 2022-12-19T10:17:05Z

To ensure consistency within a pipeline of the ORT tools it is important that configuration is only applied or replaced before it is used as input for a tool. For example, replacing package curations after running the scanner could lead to inconsistencies if the provenance metadata of a package is changed and as a result the scan result for that package does not match the provenance anymore.

A summary of the order in which tools should be run and configuration should be added to the result file to ensure consistency in later tools in the pipeline (package curations are split into technical "metadata corrections" and "legal curations" based on the idea from #6187):

Analyzer
-> package metadata corrections
-> Scanner (uses provenance corrections), Advisor (potentially uses identifier corrections, e.g. purl)
-> package configurations, issue resolutions, vulnerability resolutions, legal curations
-> Evaluator
-> rule violation resolutions
-> Reporter
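To illustrate the ordering above, here is a minimal Python sketch that derives which tools would need to be re-run after a given kind of configuration changes. The names `PIPELINE`, `FIRST_CONSUMERS`, and `tools_to_rerun` are hypothetical and not part of ORT; the mapping simply mirrors the list above.

```python
# Tool order as listed above (scanner and advisor both follow the analyzer).
PIPELINE = ["analyzer", "scanner", "advisor", "evaluator", "reporter"]

# Hypothetical mapping: the earliest tools that consume each kind of configuration.
FIRST_CONSUMERS = {
    "package metadata corrections": ["scanner", "advisor"],
    "package configurations": ["evaluator"],
    "issue resolutions": ["evaluator"],
    "vulnerability resolutions": ["evaluator"],
    "legal curations": ["evaluator"],
    "rule violation resolutions": ["reporter"],
}


def tools_to_rerun(changed_kinds):
    """Return the tools to re-run, in pipeline order, after the given kinds changed."""
    earliest = min(
        (PIPELINE.index(tool) for kind in changed_kinds for tool in FIRST_CONSUMERS[kind]),
        default=len(PIPELINE),  # Nothing changed, so nothing needs to be re-run.
    )
    return PIPELINE[earliest:]
```

Under these assumptions, changing only legal curations would require re-running the evaluator and the reporter, while changing package metadata corrections would require re-running everything after the analyzer.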

mnonnenmacher · 2022-12-29T13:13:05Z

Based on my above comment I propose that we extend the OrtResult model to contain the configuration used by each tool. I see two options:

Extend the section for each tool to contain the configuration consumed by this tool. For example, the AnalyzerRun could contain the list of curations, or the EvaluatorRun the lists of PackageConfigurations and resolutions used during the run.
Extend the OrtResult to contain a separate section which contains all of the configuration values. For example, the analyzer adds curations to this model and later tools like the scanner or advisor only use curations from there but not from any other sources.

For both solutions it would be possible to add commands that replace the existing configuration, for example one command could update the curations in an OrtResult and also check if it already contains data that depends on them, like a scan result, and in this case either fail or print a warning that the result might be inconsistent.
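A minimal Python sketch of such a replace command, assuming the result is a plain dictionary. The keys `"scanner"` and `"curations"`, the function `replace_curations`, and the exception type are hypothetical stand-ins, not the actual OrtResult model.

```python
import warnings


class InconsistentResultError(Exception):
    """Raised when replacing curations could invalidate data already in the result."""


def replace_curations(ort_result, new_curations, strict=True):
    """Return a copy of ort_result with its curations replaced.

    If the result already holds data that depends on curations (here simplified
    to a non-empty "scanner" section), either fail (strict=True) or only warn.
    """
    if ort_result.get("scanner"):
        message = "Result already contains scan results that may not match the new curations."
        if strict:
            raise InconsistentResultError(message)
        warnings.warn(message)

    updated = dict(ort_result)  # Shallow copy, the input is left untouched.
    updated["curations"] = list(new_curations)
    return updated
```

Whether failing or warning is the better default is exactly the open question from the paragraph above; the `strict` flag only makes both behaviors visible.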

@oss-review-toolkit/core-devs Any preference which direction we should take?

sschuberth · 2022-12-30T09:41:46Z

I'm having a hard time deciding / forming an opinion:

> Extend the section for each tool to contain the configuration consumed by this tool.

On the one hand I like this as things that seemingly belong together are stored together, and we already have AnalyzerConfiguration etc. However, the question arises how to deal with configuration that might be consumed by multiple tools.

> Extend the OrtResult to contain a separate section which contains all of the configuration values.

This would work around the question of how to deal with configuration used by different tools, and better match the already existing RepositoryConfiguration (which in turn also contains an AnalyzerConfiguration). But what about the existing "top-level" AnalyzerConfiguration etc. then?

So we already have a mix of (primarily) storing configuration by origin, or by consumer. Maybe we should in this context also consider making the "origin" of configuration a property of the configuration itself.

mnonnenmacher · 2022-12-30T11:33:06Z

> I'm having a hard time deciding / forming an opinion:
>
> > Extend the section for each tool to contain the configuration consumed by this tool.
>
> On the one hand I like this as things that seemingly belong together are stored together, and we already have AnalyzerConfiguration etc. However, the question arises how to deal with configuration that might be consumed by multiple tools.
>
> > Extend the OrtResult to contain a separate section which contains all of the configuration values.
>
> This would work around the question of how to deal with configuration used by different tools, and better match the already existing RepositoryConfiguration (which in turn also contains an AnalyzerConfiguration). But what about the existing "top-level" AnalyzerConfiguration etc. then?

I have a tendency towards a top-level section for those properties, but I think we need a better name than "configuration" to separate it from the "tool configuration" in config.yml and .ort.yml. Above, "corrections" was suggested; this works for curations and package configurations but not for resolutions. So maybe two top-level sections, corrections and resolutions?

> So we already have a mix of (primarily) storing configuration by origin, or by consumer. Maybe we should in this context also consider making the "origin" of configuration a property of the configuration itself.

I agree, at least for corrections and resolutions. For example, for curations we should store the provider. But for the tool configurations this might not be possible, because it can come from multiple sources like environment variables, command line parameters or config.yml, and this could be different for each property.

To make a draft:

ort:
  ...
  corrections:
    packageCurations:
      ...
    packageConfigurations:
      ...
  resolutions:
    issues:
      ...
    ruleViolations:
      ...
    vulnerabilities:
      ...

fviernau · 2023-01-12T08:03:17Z

I want to bring to attention the following use case relevant for support workflows, e.g. when using orthw:

1. Make changes to curations in your config repository
2. Recompute the reports

This IMO should be as fast as possible, so I fear that if we

1. Add a separate command to patch up curations, the turnaround time significantly increases for large ORT result files.
2. Only have the possibility to replace all curations, including querying remote curation providers if one uses them. But to keep turnaround time low, one should be able to replace only the local curations but keep the ones queried from remote.

What do you think?

edit: I've decided to write this down in an issue, see oss-review-toolkit/orthw-shell#60.

mnonnenmacher · 2023-01-13T09:31:29Z

> I want to bring to attention the following use case relevant for support workflows, e.g. when using orthw:
>
> 1. Make changes of curations in your config repository
> 2. Recompute the reports

The problem with that workflow is that, depending on the curation changes, it might not be sufficient to only re-run the reporter; it might be required to also re-run the advisor, scanner, and evaluator to get correct results.

> This IMO should be as fast as possible, so I fear if we
>
> 1. Add a separate command to patch-up curations the turn-around time significantly increases for large ORT result files

I think the curations should still be fetched by the analyzer command so that an extra step to fetch curations is not always required. The benefit of the separate command is that it can be run independently of which other tool needs to be run afterwards, so it could be used to fix a purl before running the advisor, to fix a VCS URL before running the scanner, and so on. But yes, it introduces the overhead of an additional serialization and deserialization.

> 2. Only have the possibility to replace all curations including querying remote curation providers if one uses them. But to keep turn-around time low, one should be able to replace only the local curations but keep the ones queried from remote.

To implement this correctly #5668 needs to be done first.

fviernau · 2023-01-13T09:39:34Z

> The problem with that workflow is that depending on the curation changes it could not be sufficient to only re-run the reporter, but it might be required to also re-run the advisor, scanner, and evaluator to get correct results.

That's only the case for provenance curations. The workflow still provides a lot of value, even when these provenance curations don't take effect. It targets fixing issues using data which has already been (slowly) gathered, rather than re-gathering all data from scratch to be up-to-date.

mnonnenmacher · 2023-01-13T09:53:59Z

> > The problem with that workflow is that depending on the curation changes it could not be sufficient to only re-run the reporter, but it might be required to also re-run the advisor, scanner, and evaluator to get correct results.
>
> That's only for provenance curations. The workflow still provides a lot of value, even when these provenance curations don't take effect. It targets: fixing issues using data which has already slowly been gathered (not re-gathering all data from scratch again to be up-to-date).

It's not only about provenance: for the advisor, purl curations could be relevant, and for the evaluator basically any curated property could affect rule violations. I somehow assumed that when you wrote "Recompute the reports", this would also include at least re-running the evaluator, because the results of only re-running the reporter after changing curations should be completely predictable. My point is, I'm fine if orthw supports only a specific use case, but ORT should support all of them.

fviernau · 2023-01-13T10:00:12Z

The use case for re-applying curations in the evaluator stage would work in a general way with the following approach, which IMO would be quite nice:

1. The configuration contains an ID for each configured provider. E.g. the config for the providers gets extended.
2. The OrtResult file contains the ORT config, containing the providers configured on CI.
3. The user downloads scan-result.json from CI to fix issues with it.
4. The user makes local changes to the package curations dir.
5. The user runs the evaluator, specifying the IDs of the providers to re-apply:
   a. The evaluator command looks up the configuration corresponding to the provider ID in the local (not the CI) ORT configuration. This is necessary because the local configuration is different from the CI one, e.g. the path of the package curation dir differs. Credentials may differ as well.
   b. The evaluator fetches curations from all specified providers (IDs).
   c. The evaluator replaces the curations for the specified provider IDs, but keeps all curations for providers whose ID was not specified.
   d. The evaluation is executed as usual.

The above method can be implemented in a separate dedicated command as well as in the evaluator, which should be possible without any code duplication.

Requirement: Each curation in the ORT file must be associated with a provider ID where it came from.
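The replace-by-provider-ID step could be sketched in Python as follows. This is only an illustration under assumptions: curations are kept in a dictionary keyed by provider ID, and `local_providers` maps each ID to a callable that fetches curations using the local configuration. The function name and both data shapes are hypothetical.

```python
def reapply_curations(stored, local_providers, provider_ids):
    """Replace the curations of the given provider IDs and keep all others.

    stored: dict mapping provider ID -> list of curations from the ORT result file.
    local_providers: dict mapping provider ID -> zero-argument callable that
        fetches curations based on the *local* (not the CI) configuration.
    provider_ids: IDs of the providers whose curations should be re-applied.
    """
    missing = [pid for pid in provider_ids if pid not in local_providers]
    if missing:
        raise KeyError(f"No local configuration for provider IDs: {missing}")

    result = {}
    # Preserve the stored provider order; swap in fresh data where requested.
    for pid, curations in stored.items():
        result[pid] = list(local_providers[pid]()) if pid in provider_ids else list(curations)

    # Providers that were requested but not yet part of the result are appended.
    for pid in provider_ids:
        if pid not in result:
            result[pid] = list(local_providers[pid]())

    return result
```

Keeping the original provider order is a design choice made here in case the order of providers affects how curations are applied; the thread does not specify this.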

fviernau · 2023-01-13T10:09:56Z

> It's not only about provenance, for the advisor purl curations could be relevant and for the evaluator basically any curated property could affect rule violations. I somehow assumed that when you wrote "Recompute the reports" that this would also include at least re-running the evaluator, because the results of only re-running the reporter after changing curations should be completely predictable. My point is, I'm fine if orthw supports only a specific use-case, but ORT should support all of them.

Agreed that ORT should support all of them. My point was more that ORT should also keep supporting seeing the effect of local changes quickly, which has so far been possible with the --package-curations-dir option of the evaluator. Anyhow, my above comment proposes a solution which is generic for replacing curations.

fviernau · 2023-01-13T15:17:05Z

I believe we should come up with a solution that solves [1][2][3] altogether.
This could be done by the following changes (the primary idea is to introduce IDs for providers):

Change the package array in the analyzer result to store uncurated package metadata, store applicable curations separately, and apply them on-the-fly.
Extend the package curation provider configuration by an identifier.
Note: Introducing an ID makes sense for having different configurations for the same provider, e.g. on a developer machine and on CI: different clone paths of ort-config or different credentials. So, ORT files downloaded from CI can be seamlessly used on the local machine, e.g. download scan-result.json and re-apply curations for specified provider IDs.
The configuration of a provider in ~/.ort/config/config.yml could then look like this:

packageCurationProviders:
- name: File
  config:
    id: 'ort-ort-config'
    path: '~/devel/ort-config'
- name: File
  config:
    id: 'my-org-ort-config'
    path: '~/devel/my-org-ort-config'

ORT file keeps association between curations and provider ID

option 1: List of { providerId, entries[] }

ort:
  ...
  corrections:
    packageCurations:
    - providerId: my-org-ort-config
      entries:
      - 
      ...
    - providerId: ort-ort-config
      entries:
      -
      - 
      ...

option 2: Use map: providerId -> curation[]

ort:
  ...
  corrections:
    packageCurations:
    - my-org-ort-config:
      - 
      ...
    - ort-ort-config:
      -
      - 
      ...

option 3: Add a providerId property to all curation entries

ort:
  ...
  corrections:
    packageCurations:
    - id: "Maven:example:project:1.00"
      providerId: "my-org-ort-config"
      concludedLicense: "NONE"
    - id:
      ....

Add a dedicated command for re-applying curations:
1. Drop all curations for given IDs, keep the others
2. Look up providers by ID in ~/.ort/config/config.yml and create them
3. Query providers and add results to the ORT file

[1] oss-review-toolkit/orthw-shell#60
[2] #5668
[3] #5637

Note: I'm not certain whether "corrections" fits packageConfigurations.pathExcludes. We may consider dropping corrections and moving the children one level up.

mnonnenmacher · 2023-01-16T10:32:38Z

Proposed ORT result model as discussed in the developer meeting:

ort:
  resolvedConfiguration:
    packageCurations:
      providers:
      - id: clearly-defined
        metadata:
          serverUrl: https://...
          revision: abc
      - id: local-file
        metadata:
          filename: curations.yml
      data:
        clearly-defined:
        - curation1
        - curation2
        local-file:
        - curation1
    packageConfigurations:
      ...
    issueResolutions:
      ...
    ruleViolationResolutions:
      ...
    vulnerabilityResolutions:
      ...

The metadata would come from a new function PackageCurationProvider.getMetadata(): Map<String, String>.
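As a rough Python analog of this idea (the actual function would be Kotlin), the following sketch shows how provider metadata and curation data could be assembled into the `providers` and `data` entries of the proposed model. The class `FileProvider`, the method names, and `resolve_package_curations` are hypothetical illustrations, not ORT code.

```python
from abc import ABC, abstractmethod


class PackageCurationProvider(ABC):
    """Python stand-in for the provider interface, reduced to what is needed here."""

    def __init__(self, provider_id):
        self.id = provider_id

    @abstractmethod
    def get_metadata(self):
        """Return string key-value pairs describing where the curations came from."""

    @abstractmethod
    def get_curations(self):
        """Return the list of curations this provider offers."""


class FileProvider(PackageCurationProvider):
    """Hypothetical provider that serves curations read from a local file."""

    def __init__(self, provider_id, filename, curations):
        super().__init__(provider_id)
        self.filename = filename
        self._curations = curations

    def get_metadata(self):
        return {"filename": self.filename}

    def get_curations(self):
        return list(self._curations)


def resolve_package_curations(providers):
    """Build a packageCurations section with separate providers and data entries."""
    return {
        "providers": [{"id": p.id, "metadata": p.get_metadata()} for p in providers],
        "data": {p.id: p.get_curations() for p in providers},
    }
```

For example, `resolve_package_curations([FileProvider("local-file", "curations.yml", ["curation1"])])` yields a structure shaped like the `local-file` entries in the draft above.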

My preference for setting the id in config.yml would be:

packageCurationProviders:
- name: File
  id: 'ort-ort-config'
  config:
    path: '~/devel/ort-config'
- name: File
  id: 'my-org-ort-config'
  config:
    path: '~/devel/my-org-ort-config'

The `OrtResult` does not store the uncurated packages as part of the analyzer result, but only the curated packages along with the applied package curation data. This needlessly couples the curations tightly with the analyzer, because the analyzer does not need (to consume) any curations at all. Also, computing the respective uncurated package from each curated package is not always possible due to missing data [1]. So, curations currently cannot properly be (re-)applied without re-running the analyzer [2]. Furthermore, the current representation stores package curation data redundantly in case the curation applies to multiple packages.

Given that, it makes sense to store the curations separately from the uncurated package. So, utilize the new top-level `resolvedConfiguration` to store the package curations and change the analyzer result to contain uncurated instead of curated packages.

Note that this partially implements [1] and [2]. Adjusting the logic which turns curated into uncurated packages, e.g. `toUncuratedPackage()`, is left for a future change to limit the size of this change. Apart from that, [3] can be implemented relatively easily without redundantly encoding the provider (for each curation data).

[1] #5637
[2] #6188
[3] #5668

Signed-off-by: Frank Viernau <frank_viernau@epam.com>
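To make the "store uncurated packages, apply curations on-the-fly" idea concrete, here is a small Python sketch. It assumes packages and curations are plain dictionaries and that a curation matches a package by exact `id`; real curation matching may be more involved, and the function name `apply_curations` is hypothetical.

```python
def apply_curations(uncurated_packages, curations_by_provider):
    """Return curated copies of the packages without modifying the stored ones.

    uncurated_packages: list of dicts, each with at least an "id" key.
    curations_by_provider: dict mapping provider ID -> list of curation dicts,
        each holding the target package "id" plus the fields to override.
    Curations are applied in iteration order, so later ones win on conflicts.
    """
    curated = []
    for package in uncurated_packages:
        result = dict(package)  # Copy, so the uncurated package stays as stored.
        for curations in curations_by_provider.values():
            for curation in curations:
                if curation["id"] == package["id"]:
                    result.update({k: v for k, v in curation.items() if k != "id"})
        curated.append(result)
    return curated
```

Because the stored packages stay uncurated, swapping the curations and calling the function again is enough to see the effect of updated curations in this sketch.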

Extend `ResolvedConfiguration` to associate the curations with the ID of the package curation provider to enable traceability of curations back to the provider. The separate `ResolvedConfiguration.provider` list is introduced to align with the idea of adding provider metadata, as outlined in [^1] and also mentioned in [^2].

This implementation also is a first step towards use cases involving:

1. Replacing the curations for a given provider ID with the given ones.
2. Re-resolving curations only for a particular provider ID.

Both are left as TODO for future changes to limit the size of this change, while for 1. a TODO comment is left in the code.

Fixes #5668.

[^1]: #6188 (comment)
[^2]: #5668

Signed-off-by: Frank Viernau <frank_viernau@epam.com>


sschuberth · 2023-05-17T12:19:21Z

@mnonnenmacher @fviernau I've lost track of this a bit since we merged the resolved configuration stuff / the new way of storing curations in the ORT result.

But going forward, how exactly do we plan to apply updated curations? IMO the partly implemented current approach of curation-override options per tool does not scale. I'd much rather see a tool / command that can update the curations in a result file, so that the updated result file can be passed to other tools without specifying any override options.

This option can be used to rather quickly check whether packages from an analyzer result can be downloaded without actually running the scanner / downloader. As such, the option can also be used to more quickly verify curations after (re-)applying them to the analyzer result. For the latter, a proper solution still needs to be implemented, see [1].

Note that the implementation is not complete yet. E.g. not all cases where a real download would succeed can be verified, as guessing revisions while keeping downloads to a minimum is difficult to implement for a dry run.

[1]: #6188

Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>

mnonnenmacher · 2023-05-22T20:46:31Z

> @mnonnenmacher @fviernau I lost a bit track of this since we've merged the resolved configuration stuff / the new way of storing curations in the ORT result.
>
> But going forward, how exactly do we plan to apply updated curations? IMO the partly implemented current approach of curation-override options per tool does not scale. I'd much more like to see a tool / command that can update the curations in a result file, and then that updated result file can be passed to other tools without specifying any override options.

Currently my preferred approach would be to introduce a new ORT CLI command like resolve-configuration, but I'm open for other ideas. Such a command could provide options to re-resolve all contained resolved configurations or resolve only specific parts of the resolved configuration.
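The core of such a command could look roughly like the following Python sketch. It is purely illustrative: the part names are taken from the proposed `resolvedConfiguration` model above, while `resolve_configuration` and the `resolvers` mapping (part name to a callable that produces fresh data) are hypothetical and not an existing ORT API.

```python
# Parts of the resolved configuration, as named in the proposed result model.
RESOLVABLE_PARTS = (
    "packageCurations",
    "packageConfigurations",
    "issueResolutions",
    "ruleViolationResolutions",
    "vulnerabilityResolutions",
)


def resolve_configuration(resolved, resolvers, parts=None):
    """Re-resolve all parts (parts=None) or only the named parts.

    resolved: dict holding the current resolved configuration.
    resolvers: dict mapping part name -> zero-argument callable returning fresh data.
    """
    selected = RESOLVABLE_PARTS if parts is None else tuple(parts)
    unknown = [part for part in selected if part not in RESOLVABLE_PARTS]
    if unknown:
        raise ValueError(f"Unknown configuration parts: {unknown}")

    updated = dict(resolved)  # Parts that are not selected are kept as they are.
    for part in selected:
        updated[part] = resolvers[part]()
    return updated
```

Calling it with `parts=["packageCurations"]` would refresh only the curations, while omitting `parts` would re-resolve everything.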

sschuberth · 2023-05-23T06:44:09Z

> Currently my preferred approach would be to introduce a new ORT CLI command

Ok, good, so we agree on having a new command.

> but I'm open for other ideas.

We already have the config subcommand, and even if that currently only deals with global configuration, should we maybe also bundle resolved config stuff there to not get too many config-related subcommands?

mnonnenmacher · 2023-05-23T06:55:13Z

> We already have the config subcommand, and even if that currently only deals with global configuration, should we maybe also bundle resolved config stuff there to not get too many config-related subcommands?

It can make the command difficult to use and implement if it can be used for two different things. For example, which options are relevant for which use case?

sschuberth · 2023-05-23T07:10:40Z

> For example, which options are relevant for which use case?

That could be solved via prefixes to options (though I agree that might not be the nicest user experience), or we could use sub-subcommands, like the helper-cli already does.


sschuberth added the enhancement, analyzer, and model labels Dec 7, 2022

sschuberth mentioned this issue Dec 7, 2022: Split package curation data according to its use #6187 (Open)

sschuberth changed the title from "Allow curations to get applied without rerunning the analyzer" to "Allow curations to be applied without rerunning the analyzer" Dec 12, 2022

mnonnenmacher mentioned this issue Dec 29, 2022: Think about ways to "share" configuration oss-review-toolkit/ort-workbench#51 (Open)

mnonnenmacher mentioned this issue Jan 11, 2023: Package curation provider plugins #6308 (Merged)

fviernau self-assigned this Jan 16, 2023

fviernau mentioned this issue Jan 31, 2023: OrtConfiguration: Introduce package curation provider IDs #6416 (Merged)

fviernau mentioned this issue Feb 8, 2023: OrtResult: Associate package curations with provider IDs #6456 (Merged)

mnonnenmacher mentioned this issue Sep 4, 2023: Add missing values to resolved configuration #7453 (Open)

sschuberth mentioned this issue Aug 8, 2024: Deduplicate source artifacts to scan based on file name and hash #8127 (Open)
