Version JSON output data format #2653

pombredanne · 2021-08-20T13:46:18Z

As part of #2601, this is an essential first step before we start modifying more things to improve package reporting.
This is also needed to support:

Copyright and URLs data structure: Data structure inconsistency in results #2350
Improve copyright data structure consistency #2350 #2381
Package license data structure: Improve tracing of license detection in package manifests #2389
License data structure: In output include SPDX expressions, report matched rule by expression and do not report individual licenses #2278

pombredanne · 2021-08-20T14:05:04Z

Here is a first take on the policy there:

the version string is using this format scancode-toolkit-data-format-1.1 where the last two segments represent a semver-like version.
the first segment is the major version of the data format; it is incremented when there are attributes that are removed, renamed, changed or moved.
the second segment is the minor version of the data format; it is incremented when there are attributes that are only added.
we store the version string in our JSON output and display that also in the help, using a data_format_version attribute
this data format versioning is strictly for the JSON, YAML and JSON lines formats. It does not apply to CSV or any other format. For these other formats there is no versioning and no guaranteed format stability
for now, the format version is incremented by hand, and only one increment per ScanCode tagged release is needed
We will document in the CHANGELOG the format changes in new format versions
In a given released code version, ScanCode TK may support only two data format versions: the default, current version, and the next experimental version. We will update the CLI and functions to accept a new flag to select the next, experimental data format version (may be --next-data-format or --experimental-data-format or another flag name TBD)
By using --from-json xxx --json yyy we should be able to convert data from the current, default data format to the next, experimental format.
For any version we should provide a doc on the format Create "data dictionary" for all SCTK fields #2008
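
The increment rules in this first draft can be sketched in Python as follows. This is only an illustration, not ScanCode code: the function name `next_format_version`, the change labels and the regular expression are hypothetical, and resetting the minor segment to 0 on a major bump is an assumption borrowed from semver. Only the `scancode-toolkit-data-format-1.1` string shape comes from the policy above.

```python
import re

# Hypothetical labels for the kinds of attribute changes named in the policy.
BREAKING = {"removed", "renamed", "changed", "moved"}
ADDITIVE = {"added"}

# Everything before the last dash is the prefix, then "major.minor".
VERSION_RE = re.compile(r"^(?P<prefix>.+)-(?P<major>\d+)\.(?P<minor>\d+)$")


def next_format_version(current, changes):
    """Return the version string that follows `current` for a set of change kinds."""
    match = VERSION_RE.match(current)
    if not match:
        raise ValueError(f"Unrecognized version string: {current!r}")
    changes = set(changes)
    unknown = changes - BREAKING - ADDITIVE
    if unknown:
        raise ValueError(f"Unknown change kinds: {sorted(unknown)}")

    major, minor = int(match["major"]), int(match["minor"])
    if changes & BREAKING:
        # Any removal, rename, change or move bumps the major segment.
        # Assumption: the minor segment restarts at 0, as in semver.
        major, minor = major + 1, 0
    elif changes & ADDITIVE:
        # Only additions: bump the minor segment.
        minor += 1
    return f"{match['prefix']}-{major}.{minor}"


print(next_format_version("scancode-toolkit-data-format-1.1", {"added"}))
print(next_format_version("scancode-toolkit-data-format-1.1", {"added", "renamed"}))
```

A mix of additions and breaking changes counts as breaking, which is why the second call moves to the next major version.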

pombredanne · 2021-08-20T14:06:45Z

@JonoYang @tdruez @AyanSinhaMahapatra @sschuberth @tsteenbe ping. Feedback welcomed. This is rather non-controversial.

sschuberth · 2021-08-20T14:35:06Z

it is incremented when there are attributes that are removed, renamed, changed or moved.

You should probably clarify that "moved" does not refer to changing the order at the same level, as that's not something that would break deserialization.

we store the version string in our JSON output and display that also in the help, using a data_format_version attribute

Maybe you should also make clear that the data_format_version attribute itself must never move, or that it always appears as the first attribute in the file, or something like that.

per ScanCode tagged release is needed

"if needed"

pombredanne · 2021-08-23T08:22:19Z

@mjherzog @DennisClark your comments are welcomed too.

mjherzog · 2021-08-23T15:38:17Z

This makes sense to me. It is a "nice" idea to have one versioning convention for a whole system, but we are all learning about the important differences in licensing between software and data. So this sounds like using the right tool for the job.

JonoYang · 2021-08-23T16:05:48Z

@pombredanne

Will the Codebase-Resource model schema from commoncode be versioned in the same way as the JSON output or is the output format independent of the Codebase-Resource model schema?

pombredanne · 2021-08-23T16:53:46Z

Will the Codebase-Resource model schema from commoncode be versioned in the same way as the JSON output or is the output format independent of the Codebase-Resource model schema?

@JonoYang that's a good point as these are tightly coupled. :|
This needs a bit of extra thinking.

pombredanne · 2021-08-23T16:54:56Z

@indirabhatt @maxhbr @soimkim ping too, FYI :)

Add output data format version numbers to the headers and version help text. Introduce new command line option to switch to the new experimental data format. Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

Adds a new attribute `output_format_version` to the scancode header to support output data format versioning. See aboutcode-org/scancode-toolkit#2653 for more details. Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

pombredanne · 2021-08-30T20:16:07Z

Here is the updated documentation based on the feedback above (@sschuberth ❤️ ):

Output format version policy

We version the JSON output from ScanCode-Toolkit using this approach:

The version string is using this format scancode-toolkit-output-format-1.1 where the last segments after the dash represent a semver-like version of 1.1.
The first segment is the major version of the output format; it is incremented when attributes are removed, renamed, changed or moved (but not reordered) in the JSON output. Reordering the attributes of a JSON object is not considered a change and does not trigger a version change.
The second segment is the minor version of the output format; it is incremented when the changes only add attributes to the JSON output.
We store the version string in the JSON output object as the first attribute and display that also in the help, using the new output_format_version attribute.
This output format versioning applies only to the JSON, pretty-printed JSON, YAML and JSON lines formats. It does not apply to CSV or any other format. For these other formats there is no versioning and no guaranteed format stability (or there is some other rationale and convention for versioning, as for SPDX).
For now, the output format version is incremented by hand, and only with a new ScanCode tagged release if needed by output format changes.
We document in the CHANGELOG the output format changes in any new format version.
In a given released code version, ScanCode supports two output format versions: the default, current version, and the future version. The command line and core API functions will accept a new flag to select the future output format version (using --future-format option name).
When using --from-json xxx.json --json yyy.json --future-format we will be able to convert data from the current, default JSON output format to the next, future JSON output format.
For any format version we will provide documentation on the format and its updates using JSON examples and a comprehensive and updated data dictionary. See Create "data dictionary" for all SCTK fields #2008 for details.
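
To illustrate how a consumer of the JSON output might use this version string, here is a minimal sketch. It is not part of ScanCode: the function names, the sample document and the placement of `output_format_version` as a top-level attribute are assumptions made for the example. The rule it applies follows from the policy: a different major version may break the reader, while a higher minor version only adds attributes the reader can ignore.

```python
import json

# String shape taken from the policy above.
PREFIX = "scancode-toolkit-output-format-"


def parse_output_format_version(version_string):
    """Split a string such as 'scancode-toolkit-output-format-1.1' into (1, 1)."""
    if not version_string.startswith(PREFIX):
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    major, minor = version_string[len(PREFIX):].split(".")
    return int(major), int(minor)


def can_read(version_string, reader_major, reader_minor):
    """
    Tell whether a reader written for reader_major.reader_minor can load a file.
    The major version must match. The file's minor version must be at least the
    reader's, otherwise attributes the reader expects may be missing.
    """
    major, minor = parse_output_format_version(version_string)
    return major == reader_major and minor >= reader_minor


# Hypothetical, heavily trimmed scan document.
document = json.loads(
    '{"output_format_version": "scancode-toolkit-output-format-1.1", "files": []}'
)
print(can_read(document["output_format_version"], reader_major=1, reader_minor=0))
```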

pombredanne · 2021-09-06T14:55:45Z

After extensive review, supporting multiple versions of the output data format at once is an immense task! Much simpler on paper than in practice... Therefore I think we will instead only track which version of the data format is in a given SCTK version, and we can commit to limiting the number of major data format version changes to possibly no more than once a quarter.

The data format version and the documentation should be enough for users IMHO. The effort to have the current and future versions would be similar to maintaining two branches in the same codebase and making continuous forward ports and backports to each branch. This is too much work for too little benefit.

See #2653 (comment) Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

Introduce output data format versioning #2653

AyanSinhaMahapatra · 2021-09-22T10:48:53Z

@pombredanne should this be closed as this is merged?

pombredanne · 2021-09-22T10:49:56Z

@AyanSinhaMahapatra not yet... we still need to re/write the documentation AND we need to add it to the docs

pombredanne · 2021-09-22T13:31:51Z

Here is an updated overall version policy. Because of the semver switch, and as discussed in the weekly community call, the next version will be 30.0.0. The initial data format version is at 1.0.0.

Versioning approach

ScanCode is composed of code and data (mostly license data used for license detection).

Historically we tried using calver to also convey that the data embedded in ScanCode was updated, but it proved to be not as effective as thought, so we are switching back to semver, which is more useful for users.

We are therefore now using this new versioning approach:

Code and data releases are versioned using semver as documented at https://semver.org/.
Significant changes in the license or copyright detection data are considered a major version change even if there are no code changes. The rationale is that in our case the data has the same impact as the code. Using outdated data is like using old code and means that several licenses may not be detected correctly.
We will signal separately with warning messages when ScanCode needs to be upgraded because its data and/or code are out of date.

In addition to the main code version, we also maintain a secondary output data format version, also using semver, with two segments. The versioning approach is adapted for data this way:

The first segment --the major version-- is incremented when data attributes are removed, renamed, changed or moved (but not reordered) in the JSON output. Reordering the attributes of a JSON object is not considered a change and does not trigger a version change.
The second segment --the minor version-- of the output format is incremented when attributes are added to the JSON output.
We store the output format version string in the JSON output object as the first attribute and display that also in the help.
This output format versioning applies only to the JSON, pretty-printed JSON, YAML and JSON lines formats. It does not apply to CSV or any other format. For these other formats there is no versioning and no guaranteed format stability (or there may be some other rationale and convention for versioning, as for SPDX).
The output format version is incremented by hand when a new ScanCode tagged release is published.
We document in the CHANGELOG the output format changes in any new format version.
For any format version change, we will provide documentation on the format and its updates using JSON examples and a comprehensive and updated data dictionary. See Create "data dictionary" for all SCTK fields #2008 for details.
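
The major/minor rules above can be illustrated by comparing the attribute names of two JSON samples. This is a rough sketch under stated assumptions, not the way ScanCode decides its version: the helper names and the sample attribute names are hypothetical, and it cannot detect an attribute whose meaning or type "changed" while keeping its name. A rename or a move shows up as one attribute disappearing and another appearing, so it is classified as major.

```python
def attribute_paths(value, prefix=""):
    """Collect the set of attribute paths found in nested dicts and lists."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}/{key}"
            paths.add(path)
            paths |= attribute_paths(child, path)
    elif isinstance(value, list):
        for child in value:
            # All items of a list share one path: only attribute names matter.
            paths |= attribute_paths(child, f"{prefix}[]")
    return paths


def classify_change(old, new):
    """Return 'major', 'minor' or 'none' for the change from `old` to `new`."""
    old_paths, new_paths = attribute_paths(old), attribute_paths(new)
    if old_paths - new_paths:
        # Something was removed, renamed or moved.
        return "major"
    if new_paths - old_paths:
        # Attributes were only added.
        return "minor"
    # Same attributes, possibly in a different order.
    return "none"


# Hypothetical sample data for illustration only.
old = {"files": [{"path": "a", "copyrights": []}]}
reordered = {"files": [{"copyrights": [], "path": "a"}]}
added = {"files": [{"path": "a", "copyrights": [], "holders": []}]}
renamed = {"files": [{"path": "a", "copyright": []}]}

print(classify_change(old, reordered))
print(classify_change(old, added))
print(classify_change(old, renamed))
```

Because the paths are held in sets, reordering attributes inside an object leaves the result unchanged, which matches the rule that reordering does not trigger a version change.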

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne · 2021-09-23T08:44:15Z

As part of this I am also adding a new outdated version notification: this will be listed in the scan headers when the version is either out of date on PyPI or 90 days old. Before that we were only displaying a warning in the CLI on stderr after doing a remote PyPI version check. Note that the remote PyPI check is now optional thanks to @yns88's patch.

pombredanne · 2021-09-23T08:49:42Z

This is the CHANGELOG entry:

The scan results and the CLI now display an outdated version warning when
the installed ScanCode version is older than 90 days. This is to warn users
that they are relying on outdated, likely buggy, insecure and inaccurate scan
results and encourage them to update to a newer version. This is made entirely
locally based on date comparisons.
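
A purely local date comparison of this kind can be sketched as below. This is an illustration rather than the ScanCode implementation: the names `OUTDATED_AFTER` and `is_outdated` and the dates are hypothetical, and "older than 90 days" is read here as strictly more than 90 days.

```python
from datetime import date, timedelta

# Threshold taken from the CHANGELOG entry above.
OUTDATED_AFTER = timedelta(days=90)


def is_outdated(release_date, today=None):
    """Return True when the release is more than 90 days old. No network access."""
    if today is None:
        today = date.today()
    return today - release_date > OUTDATED_AFTER


# Hypothetical release date for the example.
release = date(2021, 9, 23)
print(is_outdated(release, today=date(2021, 12, 22)))  # exactly 90 days later
print(is_outdated(release, today=date(2021, 12, 23)))  # 91 days later
```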

This shows up 90 days after a release date. We now also track the release date Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

AyanSinhaMahapatra · 2021-10-13T16:30:24Z

https://github.com/nexB/scancode-toolkit-reference-scans has been added to store scancode-toolkit reference scans and documentation on output version changes with diffs. Documentation: https://scancode-toolkit-reference-scans.readthedocs.io/en/latest/

This needs to be added to scancode-toolkit documentation.

pombredanne · 2022-08-11T12:03:58Z

This is now versioned and merged. Closing

pombredanne added the GUI and outputs label Aug 20, 2021

AyanSinhaMahapatra mentioned this issue Aug 30, 2021

Add output data format version to header aboutcode-org/commoncode#28

Merged

AyanSinhaMahapatra added a commit that referenced this issue Sep 6, 2021

Remove --future-format flag

e5da48f

See #2653 (comment) Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

AyanSinhaMahapatra added a commit that referenced this issue Sep 7, 2021

Remove --future-format flag

457f782

See #2653 (comment) Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

pombredanne added a commit that referenced this issue Sep 14, 2021

Merge pull request #2682 from nexB/2653-output-data-format-version

d8d272a

Introduce output data format versioning #2653

pombredanne added a commit that referenced this issue Sep 22, 2021

Add versioning approach documentation #2653

140f232

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne added a commit that referenced this issue Sep 22, 2021

Add versioning approach documentation #2653

bebd79e

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne added a commit that referenced this issue Sep 23, 2021

Display outdated version WARNING in scan results #2653

da1c361

This shows up 90 days after a release date. We now also track the release date Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

AyanSinhaMahapatra self-assigned this Oct 13, 2021

sschuberth mentioned this issue Jan 5, 2022

Upgrade ScanCode to 30.1.0 oss-review-toolkit/ort#4916

Merged

pombredanne added this to the v31.0 milestone Aug 5, 2022

pombredanne closed this as completed Aug 11, 2022
