Refactor Ouranos folder structure #127

aulemahal · 2023-05-12T21:33:25Z

Pull Request Checklist:

This PR addresses an already opened issue (for bug fixes / features)
- This PR fixes #xyz
Tests for the changes have been added (for bug fixes / features)
- (If applicable) Documentation has been added / updated (for bug fixes / features)
HISTORY.rst has been updated (with summary of main changes)
- Link to issue (:issue:number) and pull request (:pull:number) has been added

What kind of change does this PR introduce?

~~Introduces new schema for ouranos's folder structure. Many of the points were discussed in meetings, but the implementation adds a few things and there are some questions left open.~~

Changes how the schema YAML is parsed to add more flexibility, and removes references to "Ouranos": miranda provides the methods and a simple example. The official Ouranos schema will be moved to ouranos-data-catalogs.

I guess only @Zeitsperre is concerned by the next section. @RondeauG and @juliettelavoie may skip to "Schema changes".

Schema mechanics

I modified the mechanics a bit; they are all explained in the "docstring" of the yml. Roughly:

the "option" element is not recursive
new "concat" element (a list) to build folder names from more than one facet.
new "text" element to inject raw strings
new "filename" element
the topmost "option" element for each structure is now a "with" element that can combine multiple "options".
meta-facets : "dates"

One big change here is that the filename is now built from the schema/facets; the original filename is no longer kept, as it was before. The "filename" element is more relaxed than the "structure" one, as a missing facet is simply skipped. This way, the folder depth is standardized, but the number of "_" in a filename is not. Since the only thing xscen needs to parse from the filename is the dates and they are always the last element, the rest of the filename doesn't matter anymore.
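To illustrate the relaxed filename rule, here is a minimal sketch. It is not the code of this PR: `build_filename`, the facet names, their order, and the "nc" extension are all assumptions chosen for the example. It only shows that skipping a missing facet changes the number of "_" while the dates stay last.

```python
def build_filename(facets: dict, order: list, sep: str = "_", ext: str = "nc") -> str:
    """Join the available facets in the given order; missing ones are skipped.

    Hypothetical helper for illustration only, not miranda's implementation.
    """
    # A facet that is absent (or empty) is simply left out of the name.
    parts = [str(facets[name]) for name in order if facets.get(name)]
    return f"{sep.join(parts)}.{ext}"


# Hypothetical facet order, with the dates always last.
ORDER = ["variable", "frequency", "source", "member", "dates"]

without_member = build_filename(
    {"variable": "tas", "frequency": "day", "source": "ERA5", "dates": "1950-2000"},
    ORDER,
)
with_member = build_filename(
    {
        "variable": "tas",
        "frequency": "day",
        "source": "ERA5",
        "member": "r1i1p1",
        "dates": "1950-2000",
    },
    ORDER,
)

# The dates can be recovered the same way in both cases.
dates = without_member.rsplit(".", 1)[0].split("_")[-1]
```

Both names have a different number of "_", but the last element before the extension is the dates string in each case.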

Meta-facet

dates (see _parse_dates) returns the "start-end" string as we usually want them at the end of the filenames. It tries to be smart and only prints a year if the data fits calendar years neatly. Same thing with months. A single date is printed if the whole period (and only that period) is contained in the file. It will not print at a resolution finer than a day. It prints dates as %Y, %Y%m or %Y%m%d. The hyphen separates the start and end dates. This would resolve Ouranosinc/data-requests#14.
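The rules above can be sketched as follows. This is not the actual `_parse_dates`: the function name `format_date_range` is hypothetical, and the sketch assumes the start and end are given as inclusive calendar dates.

```python
from datetime import date, timedelta


def format_date_range(start: date, end: date) -> str:
    """Return a compact "start-end" string for an inclusive date range.

    Hypothetical sketch of the behaviour described for the "dates" meta-facet.
    """
    next_day = end + timedelta(days=1)
    # The range fits calendar years if it starts on Jan 1 and ends on Dec 31.
    fits_years = (start.month, start.day) == (1, 1) and (
        next_day.month,
        next_day.day,
    ) == (1, 1)
    # The range fits calendar months if it starts on a 1st and ends on a month's last day.
    fits_months = start.day == 1 and next_day.day == 1

    if fits_years:
        fmt = "%Y"
    elif fits_months:
        fmt = "%Y%m"
    else:
        fmt = "%Y%m%d"  # never finer than a day

    first, last = start.strftime(fmt), end.strftime(fmt)
    # A single date is printed when the file covers exactly one period.
    return first if first == last else f"{first}-{last}"
```

For example, 1950-01-01 to 2000-12-31 gives "1950-2000", the year 2001 alone gives "2001", and 2001-03-05 to 2001-04-02 falls back to "20010305-20010402".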

Schema changes

The official schema will move away from miranda; this is a simple legacy summary.

station-obs and reconstruction : The "version" field may be appended to the "source" field.
simulation : The "version" field may be appended to the "bias_adjust_project" field.
reconstruction : Optional "member" folder below source and above domain.
New "derived" category of schemas (I'm not sure yet if this notion of categories is useful).
- It defines a different schema for non-raw simulations and reconstruction. I meant it for indicators or deltas, or anything monthly or coarser AND where processing_level is significant.
- I suggest we use "xrfreq" instead of "frequency" here. It is absent from miranda's facets, as miranda usually processes raw data. So it would need to be added, copying code from xscen.

~~@RondeauG I still need to include your suggestions for hydrology projects.~~

~~I may also need to update the "validator" module more. I don't have pyessv on this machine, so I have not tested it yet.~~

Does this PR introduce a breaking change?

Not in the usual sense, but yes, old paths aren't all compatible with the new paths.

Other information:

Simply reading about my suggestions might be confusing. See the new structure_tests.csv file for examples.

juliettelavoie

This is great!

I like that processing_level is deeper for the derived.
Yes, I think xrfreq is necessary for AS-JUL and the like. Would it be better to switch to xrfreq everywhere?
How do versions work? Do we drop all the dots and all the zeros? Do we drop them only in the path, or in the catalog and the attrs too? Do we always use 2 digits?
It would be nice to have xscen be able to create the path, so that we don't need to go through miranda for ESPO and other datasets we create. Though, I'm not sure I understand how xscen, a public package, can access data from ouranos_data_catalog? Would it be bad to just put it publicly in xscen?
I put some comments regarding this but I think the order should be as similar as possible for all types. I would base the order on simulation.

miranda/structure/data/ouranos_schema.yml

aulemahal · 2023-05-15T15:32:22Z

I'm not against using xrfreq everywhere. frequency has the advantage of being CMIP6-like...
I'm not sure I see a problem in putting dots in a folder name? Otherwise, we could convert them into hyphens (-)?
I was thinking of having that file somewhere only accessible from within ouranos's VPN or with a github account with the correct permissions.

juliettelavoie · 2023-05-15T15:45:31Z

I don't have a strong opinion. Just thought, if we wanted to change now would be a good time.
version is also in the filename. I think Trevor doesn't want that? I think on Pavics, Travis put periods in the filenames of ESPOs. Would you do the conversion only for the path and filename?
ok. I guess if miranda doesn't want to import xscen. This is the way to do it.

aulemahal · 2023-05-15T15:48:41Z

Ah true, in filenames dots are trickier. We could only convert in the filenames but we'd lose our nice symmetry...

aulemahal · 2023-05-15T18:44:46Z

Changes as suggested from review applied.

In the same way "processing_level" changes, I think the meaning of "domain" also changes between "datasets" and "derived". For raw data, "domain" usually denotes the domain over which the source is "defined"; it's kinda part of the source's definition. Like "CAN" is the domain of AHCCD or "global" is the domain of a GCM. With derived data, we usually subset the data, so the domain becomes more closely related to the processing done. As such, I sent it down in the tree, next to "processing_level".

See added comments in the yaml.

RondeauG · 2023-05-15T18:50:55Z

I suggest we use "xrfreq" instead of "frequency" here. It is absent from miranda's facets, as miranda usually processes raw data.

I'm all for it, since we might eventually have to deal with an indicator that is computed on different xrfreqs, but with the same frequency (AS-JAN and AS-DEC, for example).

Zeitsperre · 2023-05-15T19:25:05Z

One big change here is that the filename is now built from the schema/facets; the original filename is no longer kept, as it was before. The "filename" element is more relaxed than the "structure" one, as a missing facet is simply skipped. This way, the folder depth is standardized, but the number of "_" in a filename is not. Since the only thing xscen needs to parse from the filename is the dates and they are always the last element, the rest of the filename doesn't matter anymore.

This type of change is perfectly fine. Folder structure and parsable facets are much more important than how we name files. All good for me.

dates (see _parse_dates) returns the "start-end" string as we usually want them at the end of the filenames. It tries to be smart and only prints a year if the data fits calendar years neatly. Same thing with months. A single date is printed if the whole period (and only that period) is contained in the file. It will not print at a resolution finer than a day. It prints dates as %Y, %Y%m or %Y%m%d. The hyphen separates the start and end dates. This would resolve Ouranosinc/data-requests#14.

Sounds good to me. I don't foresee too many instances of hourly data needing hour start and hour end. Frequency is available within the facets, via xarray, and typically in the filename and folder structure already.

Is Miranda the best place for this file? At some point, I think we would want xscen to be able to read in the schema and create the structure too. Thus the specific Ouranos schema might be best kept somewhere else, accessible for both packages by Ouranos users? Like https://github.com/Ouranosinc/ouranos_data_catalog , side-by-side with the parser that creates the catalogs?

We should have a common place for some of these configurations so that changes needed in one project are easily ported to the other.

Both miranda and xscen would ship with some very simple versions in order to show external users how it works and to run tests, but the ouranos-specific version would be elsewhere.

That sounds great

aulemahal · 2023-05-16T15:42:44Z

@Zeitsperre is my understanding correct that the validator will already detect if a dataset is missing some facets? Like if a simulation file is missing member, for example?

If yes, I suggest relaxing the schema. We could rely on the validation to detect missing facets. The schema building code would then skip levels if a facet is missing. This would remove all patterns like:

- option: member
  is_true: member

making the yaml a bit cleaner.

aulemahal · 2023-05-17T20:13:49Z

Implemented some simplifying ideas in the latest commit:

No need for a "concat" element, a list has the same meaning.
Validation is already done elsewhere. A missing facet is simply skipped, in the structure as in the filenames. This allowed a massive simplification of the structure schema, by regrouping "station-obs", "forecast" and "reconstruction" together.
However, I did not have time yet to enable validation on my machine, so tests will fail for now.
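As a rough illustration of the simplified mechanics (a list meaning concatenation, and a missing facet being skipped), here is a minimal sketch. It is not the code from the commit: `build_folder`, the "_" separator inside a concatenated level, the example structure, and the facet values are all assumptions.

```python
from pathlib import PurePosixPath


def build_folder(facets: dict, structure: list) -> PurePosixPath:
    """Build a folder path from a list of schema elements.

    Hypothetical sketch: a string is a facet name, a list concatenates
    several facets into one folder name, and {"text": ...} injects a raw string.
    """
    levels = []
    for element in structure:
        if isinstance(element, dict):  # raw text element
            levels.append(element["text"])
        elif isinstance(element, list):  # a list has the meaning of "concat"
            parts = [str(facets[name]) for name in element if facets.get(name)]
            if parts:
                levels.append("_".join(parts))
        elif facets.get(element):  # plain facet; the level is skipped if missing
            levels.append(str(facets[element]))
    return PurePosixPath(*levels)


# Hypothetical structure: "version" appended to "source", optional "member".
STRUCTURE = [
    {"text": "datasets"},
    "type",
    ["source", "version"],
    "member",
    "domain",
    "frequency",
    "variable",
]

base = {
    "type": "reconstruction",
    "source": "ERA5",
    "domain": "NAM",
    "frequency": "day",
    "variable": "tas",
}
plain = build_folder(base, STRUCTURE)
full = build_folder({**base, "version": "v2", "member": "m01"}, STRUCTURE)
```

With this sketch, no `is_true`-style guard is needed in the schema: the "member" level and the "version" suffix simply disappear when the facets are absent, and detecting a facet that should have been there is left to the validation step.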

aulemahal · 2023-06-02T21:42:14Z

Obsolete.
This functionality (structure) will move to xscen. See Ouranosinc/xscen#205.

aulemahal added 4 commits May 11, 2023 17:03

WIP new schema

61c193e

revamp structure

321b017

Add tests

4be12b1

remove future syntax

0d7fad5

juliettelavoie reviewed May 15, 2023

aulemahal added 2 commits May 15, 2023 14:38

Upd schema after review

d924046

Upd schema after review bis

1debe8c

Simplify schema by relaxing validation - rm concat

505bcf9

aulemahal added 5 commits May 17, 2023 17:27

fix validation and tests

367800b

Remove ouranos schema in favor of basic raw-only

bab0525

remove derived datasets

6ca5f0d

filename typo time to go home

deac104

Merge branch 'main' into new-schema

bb1e12a

aulemahal requested a review from Zeitsperre May 23, 2023 19:09

aulemahal added 2 commits May 23, 2023 16:15

Rename top_folder to category

d02ee18

Merge branch 'new-schema' of github.com:Ouranosinc/miranda into new-s…

4f47e09

…chema

aulemahal closed this Jun 2, 2023
