Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ScanResult: Add a list of all files with checksum and directories #5681

Closed
fviernau opened this issue Aug 24, 2022 · 7 comments
Closed

ScanResult: Add a list of all files with checksum and directories #5681

fviernau opened this issue Aug 24, 2022 · 7 comments
Assignees
Labels
scanner About the scanner tool

Comments

@fviernau
Copy link
Member

fviernau commented Aug 24, 2022

A complete list of all file and directory names in the repository along with the checksum can be valueable for various use cases.
For example:

  1. Compute the package verification code on-the-fly
  2. Implement basic file / dir existence checks without the need for cloning the source code
  3. Auto generate excludes based on directory names

Moreover, the serialized scan result is redundant in terms of the file path. Each file path can be listed zero to multiple times, as it is listed once per license and copyright finding. That redundancy can be removed by changing the way license and copyright findings are serialized.

Goal:

A scan result should:

  1. contain a full list of all files with checksums
  2. A full list of all (empty) dirs, so that directory existance checks can be implemented on top
  3. have the redundancy in terms of file paths eliminated
@fviernau fviernau self-assigned this Aug 24, 2022
@sschuberth
Copy link
Member

Some remarks from my side:

  • How do we know what "all" files are? Scanners might only report files with findings (this is also true for ScanCode when run with the --only-findings option), so we cannot generally just use the files from the scanner's raw result.
    • If ORT should itself determine the list of "all" files, then ORT should probably not also calculate the hashes for all these files, as e.g. ScanCode already does that for files contained in its raw result. So as a performance optimization, ORT should only calculate the hashes of the remaining files.
  • We should be prepared to calculate multiple different hashes per file. SHA1 obviously makes sense, but so does SHA1GIT (for lookup at Software Heritage).
  • In general, starting to list all scanned files in the scan result, including those without findings, will make the scan result grow quite a bit. Does it still make sense to only add files with findings to the result, and only the hashes for those?

@mnonnenmacher mnonnenmacher added the scanner About the scanner tool label Aug 24, 2022
sschuberth added a commit that referenced this issue Aug 26, 2022
This aligns with the ScanCode result parser code. As noted in ece788a,
strictly speaking passing `SpdxConstants.NONE` as the package
verification code in not supported by the SPDX specification, but still
doing so has clearer semantics than passing an empty string. Moreover,
the package verification code will eventually anyway be removed in the
context of [1].

[1]: #5681

Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>
sschuberth added a commit that referenced this issue Aug 29, 2022
This aligns with the ScanCode result parser code. As noted in ece788a,
strictly speaking passing `SpdxConstants.NONE` as the package
verification code in not supported by the SPDX specification, but still
doing so has clearer semantics than passing an empty string. Moreover,
the package verification code will eventually anyway be removed in the
context of [1].

[1]: #5681

Signed-off-by: Sebastian Schuberth <sschuberth@gmail.com>
@fviernau
Copy link
Member Author

fviernau commented Oct 4, 2022

We've discussed my proposal with the core devs, here's the proposed changed to the OrtResult model,
and some notes from the meeting.

Planned changes in ORT result model

The following things I believe are somewhat on the agenda and will impact
the model, so these should be considered to avoid conflicts or going into the wrong direction:

  1. Have scan results per provenance to drop redundancy, e.g. map package IDs to provenances.
  2. Have a file list with hashes for all files for each resolved provenance
  3. Have sniplet findings

Current data structure

OrtResult {
  - scanner: ScannerRun { 
    - startTime
    - endTime
    - environment
    - config
    - results: ScanRecord {
      - storageStats
      - scanResults: Map<Id, List<ScanResult>>
    }
  }     
}    
   
ScanResult {
  - provenance
  - scanner: ScannerDetails
  - additionalData
  - summary: ScanSummary {
    - startTime:
    - endTime:
    - packageVerificationCode
    - licenseFindings []
    - copyrightFindings []
    - issues []
  }
}   

Target data structure

OrtResult {
  - scanner: ScannerRun { 
    - startTime
    - endTime
    - environment
    - config
    - provenances: Map<Id, Map<String, Provenance>> <-- resolved provenances
    - storageStats
    - scanResults: Map<Provenance (resolved), ScanResult>  <-- if no scan result, then no entry.
  }     
}

ScanResult {
  - provenance  (with resolved revision)
  - file listing of all files with hashes   
  - summaries[]: ScanSummary {
    - startTime:
    - endTime:
    - licenseFindings []     <--- references file listing
    - copyrightFindings []   <--- references file listing
    - snipletFindings []     <--- references file listing
    - issues []
    - scanner: ScannerDetails
    - additionalData
  }
}   

some conclusions around above data structure:

  • Check if there is a reason in code why we have the ScannerRecord class, e.g. tests.
  • File listing must be stored in storage in MVP.
  • idea: do not require complete file lists in the first steps.
  • per provenance: implement only after removal of classic scanner
  • storage data structures should be considered separately, and can be de-coupled. Different models can be used.

@sschuberth
Copy link
Member

FYI @mmurto, this is what would also enable "delta-scans" with ScanCode of only the modified files.

@fviernau
Copy link
Member Author

fviernau commented May 2, 2023

Implementation will be covered by this epic: #6945

@sschuberth
Copy link
Member

Implementation will be covered by this epic: #6945

As the epic was just closed, can this be closed now, too?

@sschuberth
Copy link
Member

Implementation will be covered by this epic: #6945

As the epic was just closed, can this be closed now, too?

Ping @fviernau.

@fviernau
Copy link
Member Author

Closing as done, as now all files are listed for all provenances in OrtResult, which are within the respective VCS path.
Note: empty directories are not listed (mentioned in ticket description). But that in most of the cases doesn't work anyway because Git does not have the concept of directories.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
scanner About the scanner tool
Projects
None yet
Development

No branches or pull requests

3 participants