Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for parsing YAML #7340

Open
wants to merge 27 commits into
base: master
Choose a base branch
from
Open

Add support for parsing YAML #7340

wants to merge 27 commits into from

Conversation

NaN-git
Copy link

@NaN-git NaN-git commented Nov 23, 2022

Why builtins.fromYAML?

YAML is widely used amongst other package managers and deployment tools. If we want better compatibility to these ecosystems, the ability to parse yaml efficiently in nix is useful, as it is for TOML and JSON, which we already support. Further discussion can be found, i.a., in #4910.

Description

YAML 1.2 is a complex standard and nix has a limited set of data types. Thus only a subset of YAML can be represented in nix. For example attribute sets require String keys, i.e. attribute sets can represent YAML maps with String keys only, and nix has no data types for binary data or dates. If builtins.fromYAML encounters YAML with incompatible data types, then it fails similar to builtins.fromTOML.

First, the implementation uses rapidyaml to parse the YAML string and afterwards the nix objects are created while traversing the YAML tree similar to builtins.fromTOML. Tags are mostly ignored and affect only scalars. Custom tags are always ignored and I don't see how custom tags could be handled by nix.

Why rapidyaml?

As part of nix robustness and safety of the implementation of builtins.fromYAML are important.

Some reasons why rapidyaml was chosen:

  • C++11 library with limited usage of dynamic memory allocations
  • tested with a wide range of compilers and on different platforms
  • extensive set of unit tests
  • heavy usage of assertions within the library
  • available as single header file, i.e. easy to bootstrap, so that no build tool could link another version of rapidyaml
  • good speed in comparison to yaml-cpp and other YAML libraries

limitations of rapidyaml

rapidyaml has a few limitations. Most of these limitations are not really relevant for the nix use case because of the limitations of nix.

Some comments with respect to the limitations:

  • The macro RYML_WITH_TAB_TOKENS is defined for builtins.fromYAML.
  • Anchor names terminated with a colon are not fully supported by rapidyaml. { &anchor: key: val, anchor: *anchor: } is parsed correctly, but this test case fails. The test case is ignored in the builtins.fromYAML-tests because it is an edge case and it's no valid YAML 1.3.
  • Otherwise the only issue that I found is that a string/block containing only tab-, space- and new line-characters might be parsed incorrectly as empty string (test case). I don't think that this limits the real world usage. UPDATE: This is fixed in the newest rapidyaml release.
  • ADDED: Rapidyaml has some issues parsing flow mappings entries with missing :-token and missing value, e.g. this example cannot be parsed.

Also rapidyaml parses some invalid YAML successfully, but that is actually helpful.
Otherwise the valid JSON {"a":"b"}, which is emitted by builtins.toJSON, could not be parsed by builtins.fromYAML because it is no valid YAML due to the missing separation white space after the :-token. UPDATE: Actually this is allowed in flow mappings.

Tested platforms

  • x86_64-linux
  • aarch64-linux

@DavHau
Copy link
Member

DavHau commented Nov 24, 2022

Like mentioned in #4910 (comment), we might want to consider to encode the yaml version into the function somehow. Instead of fromYAML it could be named fromYAML1_2. In case the YAML spec ever gets updated in an incompatible way, we could make a second function, like for example fromYAML2_0 and won't have to update/break the existing builtin.

@NaN-git
Copy link
Author

NaN-git commented Nov 24, 2022

Like mentioned in #4910 (comment), we might want to consider to encode the yaml version into the function somehow. Instead of fromYAML it could be named fromYAML1_2. In case the YAML spec ever gets updated in an incompatible way, we could make a second function, like for example fromYAML2_0 and won't have to update/break the existing builtin.

If this should be implemented, then a decision about the versioning scheme is needed and how the versioning should be done, i.e. whether the builtins should have different names or whether a function with additional version argument should be used instead. Having explicitly different function names would be advantageous with respect to reproducibility.

The versioning should apply to different code versions, not only to different YAML standards, so that bug compatibility can be retained. The YAML standard is ugly and we need to parse real world data, i.e. the "YAML" might be invalid, so that some tradeoffs are required...

I'm wondering how updates of builtins.fromJSON and builtins.fromTOML are handled. builtins.fromJSON uses nlohmann-json, but if nix isn't built by nix, then the linked library version isn't fixed. Did this cause any problems yet?

src/libexpr/primops/fromYAML.cc Outdated Show resolved Hide resolved
src/libexpr/primops/fromYAML.cc Outdated Show resolved Hide resolved
@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/announcing-stacklock2nix-easily-build-a-haskell-project-that-contains-a-stack-yaml-lock-file/23563/11

@fricklerhandwerk fricklerhandwerk added the feature Feature request or proposal label Dec 5, 2022
@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/how-to-convert-yaml-nix-object/23755/2

@DavHau
Copy link
Member

DavHau commented Dec 21, 2022

Like mentioned in #4910 (comment), we might want to consider to encode the yaml version into the function somehow. Instead of fromYAML it could be named fromYAML1_2. In case the YAML spec ever gets updated in an incompatible way, we could make a second function, like for example fromYAML2_0 and won't have to update/break the existing builtin.

I discussed this once more with @NaN-git and it doesn't seem like a good idea to encode the yaml spec version in the function name. rapidyaml doesn't distinguish between yaml versions. It's hard to forsee the reason for which we would have to introduce a new fromYaml in the future, if ever.

Whatever naming schema we would come up right now, there is a high chance that it won't be meaningful. Therefore I think the builtin should just be introduced as a plain fromYaml. If it ever needs to be updated then we can still make a fromYaml_2 or whatever makes sense in that situation.

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2023-01-02-nix-team-meeting-minutes-20/24403/1

@fricklerhandwerk
Copy link
Contributor

fricklerhandwerk commented Jan 10, 2023

Discussed in Nix team meeting on 2023-01-09:

  • general sentiment is in favor of merging this
  • don't vendor the dependency, do it like with nlohmann
  • we will wait for @edolstra's input for the final decision
Complete discussion
  • @roberth: vendoring the library is actually nice, because it doesn't change underneath
    • @thufschmitt: should follow the decision on the nlohmann JSON library
      • that was a special case though, because it was very uncommon yet easy to vendor
    • @roberth: the problem with vendoring is that it requires re-exporting the symbols such that names don't collide
      • John: in short, vendoring is only a good fit for private deps, and so so for a public dep one needs to do these things.
    • @thufschmitt: not vendoring makes packaging for third parties more painful. e.g. there is no rapidyaml on Debian
  • @roberth: in favor of the addition after being against it. In general adding more (external) code does not help with the guarantee for evaluation reproducibility 10 years down the line
  • @Ericson2314: see the practical reasons. it would be better for IFD to work efficiently so one can do the conversion independent of Nix proper, but I also don't see that happening in the near term.
    • @roberth: the main deficiency of IFD is system specificity
  • @thufschmitt: in favor of the feature. we should provide strong control of inputs to derivations. IFD breaks the separation between evaluation and build, and therefore breaks reproducibility guarantees (one can create any inputs at build time, which may be non-deterministic)
  • @fricklerhandwerk: generally agree with @Ericson2314, we should be really careful about growing the API surface. not against the feature though, because it's obviously useful and adds symmetry to the API. we should just make sure we know how to deal with such additions in the future, and have a clear stance to present to contributors.
  • agreement:
    • don't vendor the dependency, do it like with nlohmann
  • @tomberek: agree in principle, it's a natural parallel to the to/fromJSON builtins.
  • we will wait for @edolstra's input

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2023-01-09-nix-team-meeting-minutes-22/24577/1

@domenkozar
Copy link
Member

@edolstra any reason not to merge this?

@tomberek
Copy link
Contributor

Decisions

  • agreement on not vendoring
  • agreement on making this an experimental feature

Issues to clarify

  • Nix's compile time impact?
  • closure size?
  • static link?
  • show various features and how they map to Nix values
    • add tests for the features
  • testing framework improvements
    • add explanation of testing
    • convert bash to google-test approach?

Versioning

If the underlying library returns different results over time, this impacts reproducibility

@NaN-git
Copy link
Author

NaN-git commented Jan 31, 2023

Issues to clarify

[...]
* [ ] show various features and how they map to Nix values

  * [ ]  add tests for the features

* [ ]  testing framework improvements
  
  * [ ]  add explanation of testing
  * [ ]  convert bash to google-test approach?

Regarding the tests:

  • The tests from yaml-test-suite are converted to gtest tests by a bash script. The resulting test suite is part of the PR.
  • The tests contain a JSON representation ("json") of the YAML ("yaml") test cases. Thus the tests check whether builtins.fromJSON json == builtins.fromYAML json == builtins.fromYAML yaml holds true by serializing the nix expressions. Several hundred test cases are executed and compatibility between builtins.fromJSON and builtins.fromYAML is verified at least for these cases. If yaml should be invalid then the test case is skipped because rapidyaml will often parse this without error.
  • builtins.fromYAML converts a YAML file to the following nix types: maps, lists, strings, integers (int64) and floats (double). Conversion to custom data types or interpreting data as nix expression would be very dangerous.
    Alias nodes (references) are automatically resolved, e.g.
builtins.fromYAML ''
---
values:
  - &value someValue
--- 
# some YAML template
template: *value
''

evaluates to (I'm not very happy with the handling of multiple documents)

[
  {
    values = [ "someValue" ];
  }
  {
    template = "someValue";
  }
]

Build questions

rapidyaml uses cmake as build system. I don't think that this a good choice for nix because bootstrapping of nix has to be easy. The only dependency of rapidyaml is c4core from the same author.
Thus it should be easy to statically link the library. At the moment the library is included as single header file similar to builtins.fromTOML.
What is the best way to include this library into nix?

Even with rapidyaml as single header file the build time of builtins.fromYAML seems to be faster than the build time of builtins.fromTOML and the size of the object file is much smaller than the size of the builtins.fromTOML object file.

@Ericson2314
Copy link
Member

What is the best way to include this library into nix?

I would package it and its dependency in Nixpkgs. I am happy to help with that part if you want.


I personally am not to concerned about the bootstrapping because we could always make YAML a (compile time of Nix) optional feature.

@NaN-git
Copy link
Author

NaN-git commented Feb 1, 2023

I would package it and its dependency in Nixpkgs. I am happy to help with that part if you want.

I personally am not to concerned about the bootstrapping because we could always make YAML a (compile time of Nix) optional feature.

Packaging rapidyaml is rather easy:

{ cmake
, fetchFromGitHub
, git
, stdenv
, enableStatic ? true
}:
stdenv.mkDerivation rec {
  pname = "rapidyaml";
  version = "0.5.0";

  src = fetchFromGitHub {
    owner = "biojppm";
    repo = pname;
    rev = "v${version}";
    fetchSubmodules = true;
    hash = "sha256-1/P6Szgng94UU8cPFAtOKMS+EmiwfW/IJl2UTolDU5s=";
  };

  nativeBuildInputs = [ cmake git ];
  cmakeFlags = [
    "-DRYML_WITH_TAB_TOKENS=ON"
    "-DBUILD_SHARED_LIBS=${if enableStatic then "OFF" else "ON"}"
  ];
}

I prefer to statically link this because the static library is rather small (~500 KB).

In my opinion not only bootstrapping of nix has to be easy, but compiling it for other distributions should be easy, too. I don't know how to add rapidyaml as optional dependency to nix, so that builds without nix aren't messed up.

@NaN-git
Copy link
Author

NaN-git commented Feb 11, 2023

I removed rapidyaml as single header file and made it an optional library instead. If the library cannot be found then builtins.fromYAML is not available. Rapidyaml is not part of nixpkgs yet, so that I added the package to flake.nix and it will be linked statically.
With builtins.fromYAML the size of libnixexpr increases by 800KB and in my tests on an AMD 5800U the build time of nix increased by 4-5 seconds to ~176 seconds when compiling with 16 threads.

Regarding the testing framework: I need more information what's actually required/wanted. I commented some point in #7340 (comment). Of course it would be possible to preprocess the tests differently so that less logic is needed in the testing code and it would be more obvious which tests are executed actually.

@fricklerhandwerk
Copy link
Contributor

fricklerhandwerk commented Feb 13, 2023

If the library cannot be found then builtins.fromYAML is not available.

@edolstra what do you think about an optional built-in? Doesn't sound right to me.

@roberth
Copy link
Member

roberth commented Feb 13, 2023

If the library cannot be found then builtins.fromYAML is not available.

@edolstra what do you think about an optional built-in? Doesn't sound right to me.

Supporting a build without yaml library might be useful for some build-from-source bootstrapping process, but in this case it must be an explicit choice to remove the yaml feature and create an incomplete Nix. It must not happen by accident.

@roberth
Copy link
Member

roberth commented Feb 13, 2023

Whether that's even a worthwhile effort, I don't know. Personally I think bootstrapping Nix from source is not an important use case, but packaging by other distros is. Those distros must not package a Nix without yaml though! I'd rather require rapidyaml unconditionally.

@NaN-git
Copy link
Author

NaN-git commented Feb 13, 2023

Whether that's even a worthwhile effort, I don't know. Personally I think bootstrapping Nix from source is not an important use case, but packaging by other distros is. Those distros must not package a Nix without yaml though! I'd rather require rapidyaml unconditionally.

Ok, I see two possible solutions:

  1. Always link against libryml, but then it won't be easy to compile it without rapidyaml. This would enable the removal of some lines of code.
  2. Require rapidyaml unless --disable-ryml is specified, i.e. a few lines of configure.ac have to be changed.

Philipp Otterbein and others added 24 commits September 24, 2024 02:36
- failed assertion throws exception
- parse values correctly
- handle empty YAML
This reverts commit 82e4242.
also check consistency with fromJSON
Co-authored-by: Eelco Dolstra <edolstra@gmail.com>
update rapidyaml version

cleanup/fix parsing of yaml

make fromYAML experimental
cleanup test logic

don't ignore whole classes of tests
- add additional argument to fromYAML for optional parameters of the parser

- adhere to the YAML 1.2 core schema

- much stronger error checks and improved error messages

- proper conversion of null, floats, integers and booleans

- additional testcases and more checks for expected failures
- restrict patterns of floats and ints to patterns defined by YAML 1.2 core schema

- parse integers with tag !!float

- map: enforce key uniqueness
fix: parse "!!float -0" as -0.0
@NaN-git
Copy link
Author

NaN-git commented Sep 28, 2024

@roberth @Atry I think that the PR is now ready for merging. The experimental feature still lacks a tracking URL.

I made some larger changes:

  • builtins.fromYAML got a second argument. This attribute set can be used to control the behavior of the parser. For example many applications, e.g. Kubernetes, are still supporting YAML 1.1 features not supported by YAML 1.2 and for compatibility I added an option to enable YAML 1.1 booleans.
    Futhermore tag resolution can be application specific and in the future it might be possible to extend the tag resolution with nix code.
    Backward incompatible changes could be enabled by this argument, too.
  • Tag resolution is done according to the YAML 1.2 core schema (map, seq, null, bool, int, float and str). Any tag not defined by the YAML 1.2 core schema will result in an error. Any plain scalar, which neither matches the null, bool, int nor float regex, is parsed as string.
  • Strictness is improved and more error conditions are checked. Also the error messages are improved.
  • The tests are improved. Tests of the yaml-test-suite, which shall generate an error, are no longer ignored. Sadly, some invalid YAML strings are still parsed successfully by rapidyaml (these tests are skipped), but the situation has improved and it seems that only a handful of cases need to be fixed. The author of rapidyaml seems to be aware about this and there might be fixes in the next few months.
    I added tests for codepaths added in fromYAML.cc.

Binary size and performance

  • Static linking of rapidyaml is no problem.
  • Additional compile time due to the from-yaml feature: approx. +5s out of 6 minutes for autoreconf, configure and make measured on a Ryzen 7 Pro 5850U. Compiling the tests takes longer than compiling the builtin function.
  • Binary size overhead (with nix build): +110 KiB for the nix binary and less than +650 KiB for libnixexpr and nix.
  • Performance: similar to builtins.fromJSON for JSON inputs and sometimes even faster, but of course with larger memory footprint.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Feature request or proposal language The Nix expression language; parser, interpreter, primops, evaluation, etc with-tests Issues related to testing. PRs with tests have some priority
Projects
Status: To triage
Development

Successfully merging this pull request may close these issues.