Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Ouranos folder structure #127

Closed
wants to merge 14 commits into from
Closed

Refactor Ouranos folder structure #127

wants to merge 14 commits into from

Conversation

aulemahal
Copy link
Collaborator

@aulemahal aulemahal commented May 12, 2023

Pull Request Checklist:

  • This PR addresses an already opened issue (for bug fixes / features)
    • This PR fixes #xyz
  • Tests for the changes have been added (for bug fixes / features)
    • (If applicable) Documentation has been added / updated (for bug fixes / features)
  • HISTORY.rst has been updated (with summary of main changes)
    • Link to issue (:issue:number) and pull request (:pull:number) has been added

What kind of change does this PR introduce?

Introduces new schema for ouranos's folder structure. Many of the points were discussed in meetings, but the implementation adds a few things and there are some questions left open.

Changes how the schema YAML is parsed to add more flexibility, removes references to "Ouranos" : miranda provides the methods and a simple example. The official ouranos schema will be moved to ouranos-data-catalogs.

I guess only @Zeitsperre is concerned by the next section. @RondeauG and @juliettelavoie may skip to "Schema changes".

Schema mechanics

I modified the mechanics a bit, they are all explained in the "docstring" of the yml. Grosso modo:

  • the "option" element is not recursive
  • new "concat" element (a list) to build folder names from more than one facet.
  • new "text" element to inject raw strings
  • new "filename" element
  • the topmost "option" element for each structure is now a "with" element that can combine multiple "options".
  • meta-facets : "dates"

One big change here is that the filename is built from the schema/facets, the original is not kept (as before). The "filename" element is more relax that the "structure" as a missing facet is simply skipped. This way, the folder depth is standardized, but the number of "_" in a filename is not. Since the only thing xscen needs to parse from the filename is the dates and they are always the last element, the rest of the filename doesn't matter anymore.

Meta-facet

dates (see _parse_dates) returns the "start-end" string as we usually want them at the end of the filenames. It tries to be smart and only print a year is the data fits calendar years neatly. Same thing with months. A single date is printed if the whole period (and only that period) is contained in the file. It will not print at resolution finer than a day. It prints dates as %Y, &Y%m or %Y%m%d. The hyphen separates the start and end dates. This would resolve Ouranosinc/data-requests#14.

Schema changes

The official schema will move away from miranda, this is a simple legacy summary

  • station-obs and reconstruction : The "version" field may be appended to the "source" field.
  • simulation : The "version" field may be appended to the "bias_adjust_project" field.
  • reconstruction : Optional "member" folder below source and above domain.
  • New "derived" category of schemas (I'm not sure yet if this notion of categories is useful).
    • It defines a different schema for non-raw simulations and reconstruction. I meant it for indicators or deltas, or anything monthly or coarser AND where processing_level is significant.
    • I suggest we use "xrfreq" instead of "frequency" here. It is absent from miranda's facets, as miranda usually processes raw data. So it would need to be added, copying code from xscen.

@RondeauG I still need to include your suggestions for hydrology projects.

I may also need to update the "validator" module more. I don't have pyessv on this machine, so I have not tested it yet.

Does this PR introduce a breaking change?

Not in the usual sense, but yes, old paths aren't all compatible with the new paths.

Other information:

Simply reading about my suggestions might be confusing. See the new structure_tests.csv file for examples.

Copy link
Collaborator

@juliettelavoie juliettelavoie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great!

  1. I like that processing_level is deeper for the derived.
  2. yes, I think xrfreq is neccesary for AS-JUL et compagnie. would it be better to switch to xrfreq everywhere ?
  3. Comment ça marche les versions ? On drop tous les point et tous les 0 ? on drop dans le path ou dans le catalogue et les attrs aussi ? On met toujours 2 chiffres ?
  4. It would be nice to have xscen be able to create the path, so that we donèt need to go through miranda for ESPO and other datasets we create. Though, I"m not sure I understand how xscen, a public package,can access data from ouranos_data_catalog ? Would it be bad to just put it publicly in xscen ?
  5. I put some comments regarding this but I think the order should be as similar as possible for all types. I would base the order on simulation.

miranda/structure/data/ouranos_schema.yml Outdated Show resolved Hide resolved
miranda/structure/data/ouranos_schema.yml Outdated Show resolved Hide resolved
miranda/structure/data/ouranos_schema.yml Outdated Show resolved Hide resolved
miranda/structure/data/ouranos_schema.yml Outdated Show resolved Hide resolved
miranda/structure/data/ouranos_schema.yml Outdated Show resolved Hide resolved
@aulemahal
Copy link
Collaborator Author

aulemahal commented May 15, 2023

  1. I'm not against using xrfreq everwhere. frequency has the advantage of being CMIP6-like...
  2. I'm not sure I see a problem in putting dots in a folder name ? Otherwise, we could convert into hyphens - ?
  3. I was thinking of having that file somewhere only accessible from within ouranos's VPN or with a github account with the correct permissions.

@juliettelavoie
Copy link
Collaborator

  1. I don't have a strong opinion. Just thought, if we wanted to change now would be a good time.
  2. version is also in the filename. I think Trevor doesn't want that? I think on Pavics, Travis put period in the filenames of ESPOs. Would you do the conversion only for the path and filename ?
  3. ok. I guess if miranda doesn't want to import xscen. This is the way to do it.

@aulemahal
Copy link
Collaborator Author

  1. Ah true, in filenames dots are trickier. We could only convert in the filenames but we'd loose our nice symmetry...

@aulemahal
Copy link
Collaborator Author

Changes as suggested from review applied.

In the same way "processing_level" changes, I think the meaning of "domain" also changes between, "datasets" and "derived". For raw data, "domain" usually denotes the domain over which the source is "defined", it's kinda part of the source's definition. Like "CAN" is the domain of AHCCD or "global" is the domain of a GCM. With derived data, we usually subset the data so the domain becomes more closely related to the processing done. As such, I sent it down in the tree, next to "processing_level".

See added comments in the yaml.

@RondeauG
Copy link
Collaborator

  • I suggest we use "xrfreq" instead of "frequency" here. It is absent from miranda's facets, as miranda usually processes raw data.

I'm all for it, since we might eventually have to deal with an indicator that is computed on different xrfreqs, but with the same frequency (AS-JAN and AS-DEC, for example).

@Zeitsperre
Copy link
Collaborator

One big change here is that the filename is built from the schema/facets, the original is not kept (as before). The "filename" element is more relax that the "structure" as a missing facet is simply skipped. This way, the folder depth is standardized, but the number of "_" in a filename is not. Since the only thing xscen needs to parse from the filename is the dates and they are always the last element, the rest of the filename doesn't matter anymore.

This type of change is perfectly fine. Folder structure and parsable facets are much more important than how we name files. All good for me.

dates (see _parse_dates) returns the "start-end" string as we usually want them at the end of the filenames. It tries to be smart and only print a year is the data fits calendar years neatly. Same thing with months. A single date is printed if the whole period (and only that period) is contained in the file. It will not print at resolution finer than a day. It prints dates as %Y, &Y%m or %Y%m%d. The hyphen separates the start and end dates. This would resolve Ouranosinc/data-requests#14.

Sounds good to me. I don't foresee too many instances of hourly data needing hour start and hour end. Frequency is available within the facets, via xarray, and typically in the filename and folder structure already.

Is Miranda the best place for this file ? At one point, I think we would want xscen to be able to read in the schema and create the structure too. This the specific Ouranos schema might be best kept somewhere else, accessible for both packages by Ouranos users ? Like https://github.com/Ouranosinc/ouranos_data_catalog , side-by-side with the parser that creates the catalogs ?

We should have a common place for some of these configurations so that changes needed in one project are easily ported to the other.

Both miranda and xscen would ship with some very simple versions in order to show external users how it works and to run tests, but the ouranos-specific version would be elsewhere.

That sounds great

@aulemahal
Copy link
Collaborator Author

@Zeitsperre is it my comprehension that the validator will already detect if a dataset is missing some facets ? Like if a simulation file is missing member for example ?

If yes, I suggest relaxing the schema. We could rely on the validation to detect missing facets. The schema building code would then skip levels if a facet is missing. This would remove all patterns like:

- option: member
  is_true: member

making the yaml a bit cleaner.

@aulemahal
Copy link
Collaborator Author

Implemented some simplifying ideas in the latest commit:

  • No need for a "concat" element, a list has the same meaning.
  • Validation is already done elsewhere. A missing facet is simply skipped, in the structure as in the filenames. This allowed a massive simplification of the structure schema, by regrouping "station-obs", "forecast" and "reconstruction" together.
  • However, I did not have time yet to enable validation on my machine, so tests will fail for now.

@aulemahal aulemahal requested a review from Zeitsperre May 23, 2023 19:09
@aulemahal
Copy link
Collaborator Author

Caduc.
This functionality (structure) will move to xscen. See Ouranosinc/xscen#205.

@aulemahal aulemahal closed this Jun 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants