Version JSON output data format #2653

Closed
pombredanne opened this issue Aug 20, 2021 · 17 comments

@pombredanne
Contributor Author

pombredanne commented Aug 20, 2021

Here is a first take on the policy there:

  1. the version string uses this format: scancode-toolkit-data-format-1.1, where the last two segments represent a semver-like version.
  2. the first segment is the major version of the data format; it is incremented when there are attributes that are removed, renamed, changed or moved.
  3. the second segment is the minor version of the data format; it is incremented when there are attributes that are only added.
  4. we store the version string in our JSON output and display that also in the help, using a data_format_version attribute
  5. this data format versioning is strictly for the JSON, YAML and JSON lines formats. It does not apply to CSV and any other formats. For these other formats there is no versioning and no guaranteed format stability
  6. for now, the format version is incremented by hand and only one increment per ScanCode tagged release is needed
  7. We will document in the CHANGELOG the format changes in new format versions
  8. In a given released code version, ScanCode TK may support only two data format versions: the default, current version, and the next experimental version. We will update the CLI and functions to accept a new flag to select the next, experimental data format version (may be --next-data-format or --experimental-data-format or another flag name TBD)
  9. By using --from-json xxx --json yyy we should be able to convert data from the current, default data format to the next, experimental format.
  10. For any version we should provide a doc on the format. See Create "data dictionary" for all SCTK fields #2008
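
As an illustration of items 1 to 3, here is a minimal sketch of how a consumer might split such a version string into its major and minor segments. The regular expression and the function name `parse_data_format_version` are assumptions made up for this example based on the sample string in item 1; they are not taken from the ScanCode codebase.

```python
import re

# Assumed pattern, derived only from the example string in item 1.
VERSION_RE = re.compile(r"^scancode-toolkit-data-format-(\d+)\.(\d+)$")


def parse_data_format_version(value):
    """Return (major, minor) as integers, or raise ValueError."""
    match = VERSION_RE.match(value)
    if not match:
        raise ValueError(f"Not a data format version string: {value!r}")
    major, minor = match.groups()
    return int(major), int(minor)


print(parse_data_format_version("scancode-toolkit-data-format-1.1"))
```

The first number returned is the segment that would change on removals, renames, changes or moves; the second is the one that would change on additions only.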

@pombredanne
Contributor Author

@JonoYang @tdruez @AyanSinhaMahapatra @sschuberth @tsteenbe ping. Feedback welcomed. This is rather non-controversial.

@sschuberth
Collaborator

it is incremented when there are attributes that are removed, renamed, changed or moved.

You should probably clarify that "moved" does not refer to changing the order at the same level, as that's not something that would break deserialization.

  1. we store the version string in our JSON output and display that also in the help, using a data_format_version attribute

Maybe you should also make clear that the data_format_version attribute itself must never move, or that it always appears as the first attribute in the file, or something like that.
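
To make the reordering point concrete, here is a small sketch using only Python's standard `json` module. The two documents are invented for this example and hold the same attributes in a different order; the attribute name `data_format_version` comes from the proposal above, and the `files` attribute is only a placeholder.

```python
import json

# Two made-up documents with the same attributes in a different order.
first = json.loads(
    '{"data_format_version": "scancode-toolkit-data-format-1.1", "files": []}'
)
second = json.loads(
    '{"files": [], "data_format_version": "scancode-toolkit-data-format-1.1"}'
)

# Python dicts compare equal regardless of key order, and a lookup by
# name works wherever the attribute sits in the object.
same_content = first == second
version = second["data_format_version"]
print(same_content, version)
```

A parser that looks attributes up by name is unaffected by the order. Keeping the version attribute first mainly helps people and tools that read only the start of a file.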

per ScanCode tagged release is needed

"if needed"

@pombredanne
Contributor Author

@mjherzog @DennisClark your comments are welcomed too.

@mjherzog
Member

This makes sense to me. It is a "nice" idea to have one versioning convention for a whole system, but we are all learning about the important differences in licensing between software and data. So this sounds like using the right tool for the job.

@JonoYang
Contributor

JonoYang commented Aug 23, 2021

@pombredanne

Will the Codebase-Resource model schema from commoncode be versioned in the same way as the JSON output or is the output format independent of the Codebase-Resource model schema?

@pombredanne
Contributor Author

Will the Codebase-Resource model schema from commoncode be versioned in the same way as the JSON output or is the output format independent of the Codebase-Resource model schema?

@JonoYang that's a good point as these are tightly coupled. :|
This needs a bit of extra thinking.

@pombredanne
Contributor Author

@indirabhatt @maxhbr @soimkim ping too, FYI :)

AyanSinhaMahapatra added a commit that referenced this issue Aug 30, 2021
Add output data format version numbers to the headers and version help
text. Introduce new command line option to switch to the new experimental
data format.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
AyanSinhaMahapatra added a commit to AyanSinhaMahapatra/commoncode that referenced this issue Aug 30, 2021
Adds a new attribute `output_format_version` to the scancode header
to support output data format versioning.
See aboutcode-org/scancode-toolkit#2653 for more details.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@pombredanne
Contributor Author

Here is the updated documentation based on the feedback above (@sschuberth ❤️ ):

Output format version policy

We version the JSON output from ScanCode-Toolkit using this approach:

  1. The version string uses this format: scancode-toolkit-output-format-1.1, where the last segment after the final dash is a semver-like version, here 1.1.

  2. The first segment is the major version of the output format; it is incremented when attributes are removed, renamed, changed or moved (but not reordered) in the JSON output. Reordering the attributes of a JSON object is not considered a change and does not trigger a version change.

  3. The second segment is the minor version of the output format; it is incremented when the only changes are additions of attributes to the JSON output.

  4. We store the version string in the JSON output object as the first attribute and display that also in the help, using the new output_format_version attribute.

  5. This output format versioning applies only to the JSON, pretty-printed JSON, YAML and JSON lines formats. It does not apply to CSV and any other formats. For these other formats there is no versioning and no guaranteed format stability (or there is some other rationale and convention for versioning, like for SPDX).

  6. For now, the output format version is incremented by hand, and only with a new ScanCode tagged release if output format changes require it.

  7. We document in the CHANGELOG the output format changes in any new format version.

  8. In a given released code version, ScanCode supports two output format versions: the default, current version, and the future version. The command line and core API functions will accept a new flag to select the future output format version (using --future-format option name).

  9. When using --from-json xxx.json --json yyy.json --future-format we will be able to convert data from the current, default JSON output format to the next, future JSON output format.

  10. For any format version we will provide documentation on the format and its updates using JSON examples and a comprehensive and updated data dictionary. See Create "data dictionary" for all SCTK fields #2008 for details.
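
The major/minor rules in items 2 and 3 can be sketched as a comparison of attribute paths between two outputs. This is only an illustration under simplifying assumptions: the helper names `attribute_paths` and `required_bump` and the sample data are hypothetical, and the sketch looks at attribute names only, so it cannot see an attribute whose type or meaning changed while its name stayed the same.

```python
def attribute_paths(data, prefix=""):
    """Return the set of dotted attribute paths found in nested dicts and lists."""
    paths = set()
    if isinstance(data, dict):
        for key, value in data.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= attribute_paths(value, path)
    elif isinstance(data, list):
        # List items share the path of the list itself.
        for item in data:
            paths |= attribute_paths(item, prefix)
    return paths


def required_bump(old, new):
    """Return "major", "minor" or "none" for a change from old to new."""
    old_paths = attribute_paths(old)
    new_paths = attribute_paths(new)
    if old_paths - new_paths:
        # Something disappeared: a removal, a rename or a move.
        return "major"
    if new_paths - old_paths:
        # Only additions.
        return "minor"
    return "none"


old = {"files": [{"path": "a", "type": "file"}]}
print(required_bump(old, {"files": [{"type": "file", "path": "a"}]}))
print(required_bump(old, {"files": [{"path": "a", "type": "file", "size": 1}]}))
print(required_bump(old, {"files": [{"location": "a", "type": "file"}]}))
```

Because sets ignore order, a pure reordering yields no bump, which matches the clarification in item 2.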

@pombredanne
Contributor Author

After extensive review, supporting multiple versions of the output data format at once is an immense task! Much simpler on paper than in practice... Therefore I think we will instead only track which version of the data format is in a given SCTK version, and we can commit to limiting the number of major data format version changes to possibly no more than once a quarter.

The data format version and the documentation should be enough for users IMHO. The effort to have the current and future version would be similar to maintaining two branches in the same codebase and making continuous forward ports and back ports to each branch. This is too much work for too little benefit.

AyanSinhaMahapatra added a commit that referenced this issue Sep 6, 2021
See #2653 (comment)

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
AyanSinhaMahapatra added a commit that referenced this issue Sep 7, 2021
See #2653 (comment)

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
pombredanne added a commit that referenced this issue Sep 14, 2021
@AyanSinhaMahapatra
Contributor

@pombredanne should this be closed as this is merged?

@pombredanne
Contributor Author

@AyanSinhaMahapatra not yet... we still need to re/write the documentation AND we need to add it to the docs

@pombredanne
Contributor Author

pombredanne commented Sep 22, 2021

Here is an updated overall version policy. Because of the semver switch, and as discussed in the weekly community call, the next version will be 30.0.0. The initial data format version is 1.0.0.

Versioning approach

ScanCode is composed of code and data (mostly license data used for license detection).

Historically we tried using calver to also convey that the data embedded in ScanCode was updated, but it proved to be less effective than we thought, so we are switching back to semver, which is more useful for users.

We are therefore now using this new versioning approach:

  • Code and data releases are versioned using semver as documented at https://semver.org/.

  • Significant changes in the license or copyright detection data are considered a major version change even if there are no code changes. The rationale is that in our case the data has the same impact as the code. Using outdated data is like using old code and means that several licenses may not be detected correctly.

  • We will signal separately with warning messages when ScanCode needs to be upgraded because its data and/or code are out of date.

In addition to the main code version, we also maintain a secondary output data format version, also using semver with two segments. The versioning approach is adapted for data this way:

  • The first segment --the major version-- is incremented when data attributes are removed, renamed, changed or moved (but not reordered) in the JSON output. Reordering the attributes of a JSON object is not considered a change and does not trigger a version change.

  • The second segment --the minor version-- of the output format is incremented when attributes are added to the JSON output.

  • We store the output format version string in the JSON output object as the first attribute and display that also in the help.

  • This output format versioning applies only to the JSON, pretty-printed JSON, YAML and JSON lines formats. It does not apply to CSV and any other formats. For these other formats there is no versioning and no guaranteed format stability (or there may be some other rationale and convention for versioning, like for SPDX).

  • The output format version is incremented by hand when a new ScanCode tagged release is published.

  • We document in the CHANGELOG the output format changes in any new format version.

  • For any format version changes, we will provide documentation on the format and its updates using JSON examples and a comprehensive and updated data dictionary. See Create "data dictionary" for all SCTK fields #2008 for details.
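
Under this policy, a tool reading scan results could decide whether it understands a given file by comparing version segments. The sketch below is hypothetical: the helper names `major_minor` and `can_read` are invented here, and it assumes the version is available as a plain dotted string such as "1.0.0", of which only the first two segments are used.

```python
def major_minor(version):
    """Return the first two dotted segments of a version string as integers."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def can_read(data_version, written_for):
    """
    Return True if a reader written for the `written_for` format version
    should be able to read data produced with `data_version`: same major
    version, and a minor version at least as recent, since minor versions
    only add attributes.
    """
    data_major, data_minor = major_minor(data_version)
    want_major, want_minor = major_minor(written_for)
    return data_major == want_major and data_minor >= want_minor


print(can_read("1.2.0", "1.0.0"))
print(can_read("2.0.0", "1.0.0"))
```

A reader that meets a different major version would normally stop and point the user to the CHANGELOG rather than guess at the new layout.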

pombredanne added a commit that referenced this issue Sep 22, 2021
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Sep 22, 2021
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Contributor Author

As part of this I am also adding a new outdated version notification: this will be listed in the scan headers when the version is either out of date on PyPI or 90 days old. Before that, we only displayed a warning on stderr in the CLI after doing a remote PyPI version check. Note that the remote PyPI check is now optional thanks to a patch from @yns88.

@pombredanne
Contributor Author

pombredanne commented Sep 23, 2021

This is the CHANGELOG entry:

The scan results and the CLI now display an outdated version warning when
the installed ScanCode version is older than 90 days. This is to warn users
that they are relying on outdated, likely buggy, insecure and inaccurate scan
results and encourage them to update to a newer version. This is made entirely
locally based on date comparisons.
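
A local, date-based check of this kind can be sketched as follows. This is an assumption-laden illustration and not the ScanCode implementation: the function name `is_outdated` and the sample dates are made up, and only the 90-day threshold comes from the entry above.

```python
from datetime import date, timedelta

# Threshold taken from the CHANGELOG entry; everything else is hypothetical.
OUTDATED_AFTER = timedelta(days=90)


def is_outdated(release_date, today=None):
    """Return True if release_date is more than 90 days before today."""
    today = today or date.today()
    return today - release_date > OUTDATED_AFTER


# Example with made-up dates: 91 days apart, then 8 days apart.
print(is_outdated(date(2021, 9, 23), today=date(2021, 12, 23)))
print(is_outdated(date(2021, 9, 23), today=date(2021, 10, 1)))
```

Since only the stored release date and the local clock are involved, no network access is needed for this check.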

pombredanne added a commit that referenced this issue Sep 23, 2021
This shows up 90 days after a release date. We now also track the
release date

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@AyanSinhaMahapatra
Contributor

AyanSinhaMahapatra commented Oct 13, 2021

https://github.com/nexB/scancode-toolkit-reference-scans has been added to store scancode-toolkit reference scans and documentation on output version changes with diffs. Documentation: https://scancode-toolkit-reference-scans.readthedocs.io/en/latest/

This needs to be added to scancode-toolkit documentation.

@pombredanne
Contributor Author

This is now versioned and merged. Closing
