-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Ouranos folder structure #127
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great!
- I like that
processing_level
is deeper for the derived. - yes, I think xrfreq is neccesary for AS-JUL et compagnie. would it be better to switch to
xrfreq
everywhere ? - Comment ça marche les versions ? On drop tous les point et tous les 0 ? on drop dans le path ou dans le catalogue et les attrs aussi ? On met toujours 2 chiffres ?
- It would be nice to have xscen be able to create the path, so that we donèt need to go through miranda for ESPO and other datasets we create. Though, I"m not sure I understand how xscen, a public package,can access data from ouranos_data_catalog ? Would it be bad to just put it publicly in xscen ?
- I put some comments regarding this but I think the order should be as similar as possible for all types. I would base the order on
simulation
.
|
|
|
Changes as suggested from review applied. In the same way "processing_level" changes, I think the meaning of "domain" also changes between, "datasets" and "derived". For raw data, "domain" usually denotes the domain over which the source is "defined", it's kinda part of the source's definition. Like "CAN" is the domain of AHCCD or "global" is the domain of a GCM. With derived data, we usually subset the data so the domain becomes more closely related to the processing done. As such, I sent it down in the tree, next to "processing_level". See added comments in the yaml. |
I'm all for it, since we might eventually have to deal with an indicator that is computed on different xrfreqs, but with the same frequency (AS-JAN and AS-DEC, for example). |
This type of change is perfectly fine. Folder structure and parsable facets are much more important than how we name files. All good for me.
Sounds good to me. I don't foresee too many instances of hourly data needing hour start and hour end. Frequency is available within the facets, via xarray, and typically in the filename and folder structure already.
We should have a common place for some of these configurations so that changes needed in one project are easily ported to the other.
That sounds great |
@Zeitsperre is it my comprehension that the validator will already detect if a dataset is missing some facets ? Like if a If yes, I suggest relaxing the schema. We could rely on the validation to detect missing facets. The schema building code would then skip levels if a facet is missing. This would remove all patterns like:
making the yaml a bit cleaner. |
Implemented some simplifying ideas in the latest commit:
|
Caduc. |
Pull Request Checklist:
number
) and pull request (:pull:number
) has been addedWhat kind of change does this PR introduce?
Introduces new schema for ouranos's folder structure. Many of the points were discussed in meetings, but the implementation adds a few things and there are some questions left open.Changes how the schema YAML is parsed to add more flexibility, removes references to "Ouranos" : miranda provides the methods and a simple example. The official ouranos schema will be moved to
ouranos-data-catalogs
.I guess only @Zeitsperre is concerned by the next section. @RondeauG and @juliettelavoie may skip to "Schema changes".
Schema mechanics
I modified the mechanics a bit, they are all explained in the "docstring" of the yml. Grosso modo:
One big change here is that the filename is built from the schema/facets, the original is not kept (as before).
The "filename" element is more relax that the "structure" as a missing facet is simply skipped. This way, the folder depth is standardized, but the number of "_" in a filename is not. Since the only thing xscen needs to parse from the filename is the dates and they are always the last element, the rest of the filename doesn't matter anymore.Meta-facet
dates
(see_parse_dates
) returns the "start-end" string as we usually want them at the end of the filenames. It tries to be smart and only print a year is the data fits calendar years neatly. Same thing with months. A single date is printed if the whole period (and only that period) is contained in the file. It will not print at resolution finer than a day. It prints dates as%Y
,&Y%m
or%Y%m%d
. The hyphen separates the start and end dates. This would resolve Ouranosinc/data-requests#14.Schema changes
The official schema will move away from miranda, this is a simple legacy summary
station-obs
andreconstruction
: The "version" field may be appended to the "source" field.simulation
: The "version" field may be appended to the "bias_adjust_project" field.reconstruction
: Optional "member" folder belowsource
and abovedomain
.processing_level
is significant.xscen
.@RondeauG I still need to include your suggestions for hydrology projects.I may also need to update the "validator" module more. I don't havepyessv
on this machine, so I have not tested it yet.Does this PR introduce a breaking change?
Not in the usual sense, but yes, old paths aren't all compatible with the new paths.
Other information:
Simply reading about my suggestions might be confusing. See the new
structure_tests.csv
file for examples.