ScanResult: Add a list of all files with checksum and directories #5681

fviernau · 2022-08-24T11:25:37Z

A complete list of all file and directory names in the repository along with the checksum can be valueable for various use cases.
For example:

Compute the package verification code on-the-fly
Implement basic file / dir existence checks without the need for cloning the source code
Auto generate excludes based on directory names

Moreover, the serialized scan result is redundant in terms of the file path. Each file path can be listed zero to multiple times, as it is listed once per license and copyright finding. That redundancy can be removed by changing the way license and copyright findings are serialized.

Goal:

A scan result should:

contain a full list of all files with checksums
A full list of all (empty) dirs, so that directory existance checks can be implemented on top
have the redundancy in terms of file paths eliminated

sschuberth · 2022-08-24T12:11:31Z

Some remarks from my side:

How do we know what "all" files are? Scanners might only report files with findings (this is also true for ScanCode when run with the --only-findings option), so we cannot generally just use the files from the scanner's raw result.
- If ORT should itself determine the list of "all" files, then ORT should probably not also calculate the hashes for all these files, as e.g. ScanCode already does that for files contained in its raw result. So as a performance optimization, ORT should only calculate the hashes of the remaining files.
We should be prepared to calculate multiple different hashes per file. SHA1 obviously makes sense, but so does SHA1GIT (for lookup at Software Heritage).
In general, starting to list all scanned files in the scan result, including those without findings, will make the scan result grow quite a bit. Does it still make sense to only add files with findings to the result, and only the hashes for those?

This aligns with the ScanCode result parser code. As noted in ece788a, strictly speaking passing `SpdxConstants.NONE` as the package verification code in not supported by the SPDX specification, but still doing so has clearer semantics than passing an empty string. Moreover, the package verification code will eventually anyway be removed in the context of [1]. [1]: #5681 Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>

fviernau · 2022-10-04T09:14:39Z

We've discussed my proposal with the core devs, here's the proposed changed to the OrtResult model,
and some notes from the meeting.

Planned changes in ORT result model

The following things I believe are somewhat on the agenda and will impact
the model, so these should be considered to avoid conflicts or going into the wrong direction:

Have scan results per provenance to drop redundancy, e.g. map package IDs to provenances.
Have a file list with hashes for all files for each resolved provenance
Have sniplet findings

Current data structure

OrtResult {
  - scanner: ScannerRun { 
    - startTime
    - endTime
    - environment
    - config
    - results: ScanRecord {
      - storageStats
      - scanResults: Map<Id, List<ScanResult>>
    }
  }     
}    
   
ScanResult {
  - provenance
  - scanner: ScannerDetails
  - additionalData
  - summary: ScanSummary {
    - startTime:
    - endTime:
    - packageVerificationCode
    - licenseFindings []
    - copyrightFindings []
    - issues []
  }
}

Target data structure

OrtResult {
  - scanner: ScannerRun { 
    - startTime
    - endTime
    - environment
    - config
    - provenances: Map<Id, Map<String, Provenance>> <-- resolved provenances
    - storageStats
    - scanResults: Map<Provenance (resolved), ScanResult>  <-- if no scan result, then no entry.
  }     
}

ScanResult {
  - provenance  (with resolved revision)
  - file listing of all files with hashes   
  - summaries[]: ScanSummary {
    - startTime:
    - endTime:
    - licenseFindings []     <--- references file listing
    - copyrightFindings []   <--- references file listing
    - snipletFindings []     <--- references file listing
    - issues []
    - scanner: ScannerDetails
    - additionalData
  }
}

some conclusions around above data structure:

Check if there is a reason in code why we have the ScannerRecord class, e.g. tests.
File listing must be stored in storage in MVP.
idea: do not require complete file lists in the first steps.
per provenance: implement only after removal of classic scanner
storage data structures should be considered separately, and can be de-coupled. Different models can be used.

sschuberth · 2023-03-15T16:43:55Z

FYI @mmurto, this is what would also enable "delta-scans" with ScanCode of only the modified files.

fviernau · 2023-05-02T12:06:47Z

Implementation will be covered by this epic: #6945

sschuberth · 2023-07-14T15:27:32Z

Implementation will be covered by this epic: #6945

As the epic was just closed, can this be closed now, too?

sschuberth · 2023-07-19T07:30:49Z

Implementation will be covered by this epic: #6945

As the epic was just closed, can this be closed now, too?

Ping @fviernau.

fviernau · 2023-07-19T08:33:03Z

Closing as done, as now all files are listed for all provenances in OrtResult, which are within the respective VCS path.
Note: empty directories are not listed (mentioned in ticket description). But that in most of the cases doesn't work anyway because Git does not have the concept of directories.

fviernau self-assigned this Aug 24, 2022

mnonnenmacher added the scanner About the scanner tool label Aug 24, 2022

mnonnenmacher mentioned this issue Sep 29, 2022

Replace legacy scanner with experimental scanner #5799

Merged

fviernau mentioned this issue Oct 14, 2022

OrtResult: Store scan results "by provenance" #5950

Closed

fviernau mentioned this issue Apr 25, 2023

Introduce and use OrtResult.getScanResults() #6921

Merged

jens-erdmann mentioned this issue May 2, 2023

OrtResult: Make the file listing complete #6946

Closed

fviernau closed this as completed Jul 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ScanResult: Add a list of all files with checksum and directories #5681

ScanResult: Add a list of all files with checksum and directories #5681

fviernau commented Aug 24, 2022 •

edited

Loading

sschuberth commented Aug 24, 2022

fviernau commented Oct 4, 2022

sschuberth commented Mar 15, 2023

fviernau commented May 2, 2023

sschuberth commented Jul 14, 2023

sschuberth commented Jul 19, 2023

fviernau commented Jul 19, 2023

ScanResult: Add a list of all files with checksum and directories #5681

ScanResult: Add a list of all files with checksum and directories #5681

Comments

fviernau commented Aug 24, 2022 • edited Loading

Goal:

sschuberth commented Aug 24, 2022

fviernau commented Oct 4, 2022

Planned changes in ORT result model

Current data structure

Target data structure

sschuberth commented Mar 15, 2023

fviernau commented May 2, 2023

sschuberth commented Jul 14, 2023

sschuberth commented Jul 19, 2023

fviernau commented Jul 19, 2023

fviernau commented Aug 24, 2022 •

edited

Loading