Stricter formula syntax #986

merkys · 2021-10-25T12:50:29Z

Making formula syntax closer follow the specification by forbidding integer proportions of elements equal to 1 (can they be equal to 0?). Forbidding empty string formula as well, as they do not make much sense.

ml-evs

Thanks for this @merkys! The test failures are just that the proportion "1" was previously being caught by the Python validators and not the regexp, so we will just need to update the expected error message in the tests. This should allow us to remove some other code too, which I can commit on top of your PR (if that is okay with you!)

The regexp also needs to be updated in the OpenAPI spec which is the cause of the other failure in the openapi and pre-commit checks.

ml-evs · 2021-10-25T13:02:01Z

optimade/models/utils.py

@@ -202,7 +202,7 @@ def anonymous_element_generator():
 ANONYMOUS_ELEMENTS = tuple(itertools.islice(anonymous_element_generator(), 150))
 """ Returns the first 150 values of the anonymous element generator. """

-CHEMICAL_FORMULA_REGEXP = r"^([A-Z][a-z]?\d*)*$"
+CHEMICAL_FORMULA_REGEXP = r"^([A-Z][a-z]?([02-9]|[1-9]\d+)?)+$"


This also exclude "0"-proportioned elements right? e.g.

Suggested change

CHEMICAL_FORMULA_REGEXP = r"^([A-Z][a-z]?([02-9]|[1-9]\d+)?)+$"

CHEMICAL_FORMULA_REGEXP = r"^([A-Z][a-z]?([2-9]|[1-9]\d+)?)+$"

Good idea. I was not sure about 0-proportioned elements as per specification, but I do not think they make any sense.

If 0 is not allowed, we should mention this in the OPTIMADE standard.
Now it says that the API implementation can choose how to do the rounding. A compound like LaO_0.7F_0.3FeAs could thus be written as LaOF₀FeAs

I am not quite sure that having proportion of 0 is a reasonable approximation. In your example I would multiply all the proportions by 10 and thus preserve the original proportion. Rounding 0.3 to 0 loses a lot of information. But you are right about the specification allowing one to choose whichever approximation one wants.

It would indeed be more informative to multiply by 10. Rounding to the nearest integer can however be a very simple way to generate the chemical_formula_reduced field from the chemical_formula_descriptive. So I would not exclude the possibility at least some implementers have tried this.

ml-evs · 2021-10-25T13:08:17Z

It seems I can't push to your fork @merkys, would you mind giving me collaborator access? I have updated the test cases and the regexp to disallow 0 too

merkys · 2021-10-25T13:13:49Z

It seems I can't push to your fork @merkys, would you mind giving me collaborator access? I have updated the test cases and the regexp to disallow 0 too

I have just added you as a collaborator - could you please try pushing again?

ml-evs · 2021-10-25T13:17:33Z

It seems I can't push to your fork @merkys, would you mind giving me collaborator access? I have updated the test cases and the regexp to disallow 0 too

I have just added you as a collaborator - could you please try pushing again?

That worked, thanks! I'll let @CasperWA or @JPBergsma have a look over this too, since I have also made edits...

merkys · 2021-10-25T13:21:11Z

@ml-evs do you by the way know why chemical_formula_hill is not nullable while specification says that it "MAY be null"? Maybe I just do not understand the meaning of nullable...

codecov · 2021-10-25T13:23:51Z

Codecov Report

Merging #986 (d8470c9) into master (8d94df0) will decrease coverage by 0.03%.
The diff coverage is 100.00%.

❗ Current head d8470c9 differs from pull request most recent head 14cf5e2. Consider uploading reports for the commit 14cf5e2 to get more accurate results

@@            Coverage Diff             @@
##           master     #986      +/-   ##
==========================================
- Coverage   92.92%   92.88%   -0.04%     
==========================================
  Files          67       67              
  Lines        3787     3782       -5     
==========================================
- Hits         3519     3513       -6     
- Misses        268      269       +1

Flag	Coverage Δ
project	`92.88% <100.00%> (-0.04%)`	⬇️
validator	`92.88% <100.00%> (-0.04%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
optimade/models/structures.py	`95.38% <ø> (-0.47%)`	⬇️
optimade/models/utils.py	`91.56% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8d94df0...14cf5e2. Read the comment docs.

ml-evs · 2021-10-25T13:40:51Z

@ml-evs do you by the way know why chemical_formula_hill is not nullable while specification says that it "MAY be null"? Maybe I just do not understand the meaning of nullable...

This is a bit of an awkard quirk between OpenAPI, pydantic and the way we define "SHOULD"/"OPTIONAL" fields. SHOULD fields are listed under required but have nullable: true, whereas OPTIONAL fields are not listed under required but do not have nullable: true.

Perhaps it would be clearer if we additionally set all optional fields to be nullable too.

optimade/models/structures.py

JPBergsma · 2021-10-25T15:30:35Z

tests/models/test_structures.py

@@ -180,7 +180,7 @@ def test_bad_structures(bad_structures, mapper):
    ),
    (
        {"chemical_formula_anonymous": "A44B15C9D4E3F2GHI0J0K0L0"},
-        "chemical_formula_anonymous 'A44B15C9D4E3F2GHI0J0K0L0' cannot contain chemical proportion of 0.",
+        "string does not match regex",


It would have been nice if we could keep the old error messages. They are much more informative and easier to reas than a regular expression.

True, this is what we lose with this approach. One workaround could be to keep the pydantic validators and apply them before the regexp with pre=True but this adds some complexity and overhead. Any thoughts @CasperWA?

Perhaps you could use the pydantic validators after the check with the regular expression has found an error.
There is no use in checking the same thing twice.

I think it should be fine. While it's less informative, it's still as informative as it needs to be. One can, technically, copy paste our regular expression and try it out in any of the many online regex testers - or just do it offline locally to try and find the culprit sub-string.

In essence, while it would be nice to be extra helpful, I think the hassle of re-implementing that is much worse than the nice "cleanup" we're getting here.

CasperWA

This looks great to me.

Just to reiterate what I'm seeing happening (mainly):

You're elegantly moving the tests for 0 and 1 stoichiometries/"concentrations", whatever... to a regular expression. Neat!
Through the regular expression you're changing/removing the possibility for the values to be empty strings.
Is this technically correct? I could probably argue for it being correct, since in that case the specification probably intends for the value to be null? I guess this is your interpretation as well.

CasperWA · 2021-10-26T14:15:51Z

tests/models/test_structures.py

@@ -180,7 +180,7 @@ def test_bad_structures(bad_structures, mapper):
    ),
    (
        {"chemical_formula_anonymous": "A44B15C9D4E3F2GHI0J0K0L0"},
-        "chemical_formula_anonymous 'A44B15C9D4E3F2GHI0J0K0L0' cannot contain chemical proportion of 0.",
+        "string does not match regex",


I think it should be fine. While it's less informative, it's still as informative as it needs to be. One can, technically, copy paste our regular expression and try it out in any of the many online regex testers - or just do it offline locally to try and find the culprit sub-string.

In essence, while it would be nice to be extra helpful, I think the hassle of re-implementing that is much worse than the nice "cleanup" we're getting here.

ml-evs

The other added benefit is that external OpenAPI-based tools will now do a better job of validating responses outside of this package! Thanks @merkys!

- Formulas are unlikely to be empty strings. - Integer proportions cannot be equal to 1.

Co-authored-by: Johan Bergsma <JPBergsma@users.noreply.github.com>

ml-evs · 2021-10-26T14:34:06Z

I have rebased into two commits in case we want to revert to the old pydantic validators at a later date.

merkys · 2021-10-26T14:35:58Z

I mixed up requirements arising from the specification and from "common sense" (mine of course). Thus:

To stay true to the spec, only proportions of 1 should be forbidden via regular expressions;
To stay true to "common sense" (mine), empty formulas and proportions of 0 should be forbidden.

It should be much better to forward these "common sense" restrictions for the discussion and possible enshrinement in the specification. I will do that.

merkys requested a review from ml-evs October 25, 2021 12:50

merkys requested review from CasperWA and JPBergsma as code owners October 25, 2021 12:50

ml-evs reviewed Oct 25, 2021

View reviewed changes

ml-evs requested changes Oct 25, 2021

View reviewed changes

JPBergsma reviewed Oct 25, 2021

View reviewed changes

optimade/models/structures.py Show resolved Hide resolved

JPBergsma reviewed Oct 25, 2021

View reviewed changes

ml-evs force-pushed the stricter-formula-syntax branch from 72f5ce5 to 89e0d3f Compare October 26, 2021 10:00

CasperWA previously approved these changes Oct 26, 2021

View reviewed changes

ml-evs previously approved these changes Oct 26, 2021

View reviewed changes

merkys and others added 2 commits October 26, 2021 15:33

Use a stricter chemical formula regexp

ec7ee84

- Formulas are unlikely to be empty strings. - Integer proportions cannot be equal to 1.

Also disallow 0 proportion and clean up tests and validator code

14cf5e2

Co-authored-by: Johan Bergsma <JPBergsma@users.noreply.github.com>

ml-evs dismissed stale reviews from CasperWA and themself via 14cf5e2 October 26, 2021 14:33

ml-evs force-pushed the stricter-formula-syntax branch from 89e0d3f to 14cf5e2 Compare October 26, 2021 14:33

ml-evs approved these changes Oct 26, 2021

View reviewed changes

ml-evs enabled auto-merge (rebase) October 26, 2021 14:34

ml-evs merged commit 32d2cdf into Materials-Consortia:master Oct 26, 2021

merkys deleted the stricter-formula-syntax branch October 26, 2021 14:49

merkys mentioned this pull request Oct 27, 2021

Additional requirements on formulas Materials-Consortia/OPTIMADE#388

Open

ml-evs mentioned this pull request Nov 15, 2021

Stricter validation of chemical formulas in OpenAPI schema #708

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stricter formula syntax #986

Stricter formula syntax #986

merkys commented Oct 25, 2021

ml-evs left a comment •

edited

Loading

ml-evs Oct 25, 2021

merkys Oct 25, 2021

JPBergsma Oct 25, 2021 •

edited

Loading

merkys Oct 26, 2021

JPBergsma Oct 26, 2021

ml-evs commented Oct 25, 2021

merkys commented Oct 25, 2021

ml-evs commented Oct 25, 2021

merkys commented Oct 25, 2021

codecov bot commented Oct 25, 2021 •

edited

Loading

ml-evs commented Oct 25, 2021

JPBergsma Oct 25, 2021

ml-evs Oct 25, 2021

JPBergsma Oct 26, 2021

CasperWA Oct 26, 2021

CasperWA left a comment

CasperWA Oct 26, 2021

ml-evs left a comment

ml-evs commented Oct 26, 2021

merkys commented Oct 26, 2021

	CHEMICAL_FORMULA_REGEXP = r"^([A-Z][a-z]?([02-9]\|[1-9]\d+)?)+$"
	CHEMICAL_FORMULA_REGEXP = r"^([A-Z][a-z]?([2-9]\|[1-9]\d+)?)+$"

Stricter formula syntax #986

Stricter formula syntax #986

Conversation

merkys commented Oct 25, 2021

ml-evs left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JPBergsma Oct 25, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ml-evs commented Oct 25, 2021

merkys commented Oct 25, 2021

ml-evs commented Oct 25, 2021

merkys commented Oct 25, 2021

codecov bot commented Oct 25, 2021 • edited Loading

Codecov Report

ml-evs commented Oct 25, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

CasperWA left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ml-evs left a comment

Choose a reason for hiding this comment

ml-evs commented Oct 26, 2021

merkys commented Oct 26, 2021

ml-evs left a comment •

edited

Loading

JPBergsma Oct 25, 2021 •

edited

Loading

codecov bot commented Oct 25, 2021 •

edited

Loading