Refactor catalog utils (again) #205
This is great!
- I would update the init to be able to do `xs.build_path`.
- It might be good to add `build_path` at the end of the Getting Started notebook.
- There is some conflict with format and version:
```python
import xarray as xr
import xscen as xs

ds = xr.open_zarr('PATH_FINAL_ESPO')
ds = ds[['tasmax']]
ds.attrs['cat:variable'] = 'tasmax'
ds.attrs['cat:processing_level'] = 'biasadjusted'

xs.catutils.build_path(ds)
# --> PosixPath('simulation/biasadjusted/ESPO-G6_1.1.0/CMIP6/ScenarioMIP/NAM-rdrs/BCC/BCC-CSM2-MR/ssp245/r1i1p1f1/day/tasmax/tasmax_day_ESPO-G6_1.1.0_CMIP6_ScenarioMIP_NAM-rdrs_BCC_BCC-CSM2-MR_ssp245_r1i1p1f1_1950-2100')
xs.catutils.build_path(ds, format='zarr')
# --> PosixPath('simulation/ESPO-G6_1.1.0/CMIP6/ScenarioMIP/BCC/BCC-CSM2-MR/ssp245/r1i1p1f1/NAM-rdrs/final/D/tasmax/tasmax_D_ESPO-G6_1.1.zarr')
```
- In an issue, we discussed moving some of my ESPO utils functions into xscen. I think `save_move_update` doesn't have a place anymore with the new high quotas and the IO issues we are having. But I think a `save_and_update` with the possibility of using `build_path` would be nice. It would lighten workflows. I could push this on this PR or on a new one after.
Co-authored-by: juliettelavoie <juliette.lavoie@hotmail.ca>
…nto refac-parse-dir-again
Looks good! However, I'd ideally wait until @mccrayc has had a look at it before we merge.
Looks good to me! The new patterns seem much more intuitive and clean.
Last commit made a few changes. I tested the new code with ouranos_data_catalogs and it made a few bugs appear:
And I guess I now need to add tests!
+12% coverage 💪
Pull Request Checklist:
- … `number`) and pull request (:pull:`number`) has been added

What kind of change does this PR introduce?
- … (`parse_directory`) to a new `catutils.py` module
- `os.walk` instead of a subprocess calling `find`. The function should now be platform-independent and not Linux-specific.
- `parse` instead of `intake.source.utils.reverse_format`. It allows some interesting features, like format specifiers.
- … `cvs`
- `build_path`. Schema is `data/file_schema.yml`.
- `date_parser` has moved to `utils.py`.

Does this PR introduce a breaking change?
Yes.
- … `parse_directory` are now specified differently. See doc and example below.
- … `globpattern`. The extension is parsed from the patterns. Calls can now mix different extensions. A new `dirglob` arg takes care of the "folder filtering" feature.
- `parse_directory` does not use `dask` anymore. I believe the code is now easier to understand and speed is not affected too much. Parsing a local filesystem is not supposed to be faster in parallel, but we often use NFS-mounted disks, where this is less true. Testing on doris, parsing the full `/datasets/simulation/raw` folder takes the same time with the new and the old versions. See the note below on "code complexity".

Other information:
Example for the new patterns:
The "variable" field accepts underscores, so it uses the "_" format specifier. No more need to specify each part of the filename if only the last one is needed. Here the "DATES" special field can catch single dates or bounds.
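To make the behaviour described above concrete, here is a rough, stdlib-only sketch of the idea. It uses `re` rather than the `parse` package this PR relies on, so it is not the actual implementation. The pattern string, the file names, `FIELD_REGEX`, `compile_pattern` and `parse_name` are all hypothetical names made up for this illustration.

```python
import re

# Hypothetical field types for this sketch (not the PR's actual implementation).
FIELD_REGEX = {
    "": r"[^_/\\]+",                   # default field: no underscores, no path separators
    "_": r"[^/\\]+",                   # "_" specifier: underscores are allowed
    "DATES": r"\d{4,8}(?:-\d{4,8})?",  # a single date or "start-end" bounds
}


def compile_pattern(pattern):
    """Translate '{name}', '{name:_}' and '{DATES}' placeholders into a regex."""
    parts, pos = [], 0
    for m in re.finditer(r"\{(\w+)(?::(\w+))?\}", pattern):
        # Literal text between two placeholders is matched as-is.
        parts.append(re.escape(pattern[pos:m.start()]))
        name, spec = m.group(1), m.group(2) or ""
        kind = "DATES" if name == "DATES" else spec
        parts.append(f"(?P<{name}>{FIELD_REGEX[kind]})")
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts))


def parse_name(pattern, filename):
    """Return the parsed fields as a dict, or None if the name does not match."""
    m = compile_pattern(pattern).fullmatch(filename)
    return m.groupdict() if m else None


pattern = "{variable:_}_{frequency}_{DATES}.nc"
print(parse_name(pattern, "sea_ice_day_1950-2100.nc"))
# {'variable': 'sea_ice', 'frequency': 'day', 'DATES': '1950-2100'}
print(parse_name(pattern, "tasmax_fx_2000.nc"))
# {'variable': 'tasmax', 'frequency': 'fx', 'DATES': '2000'}
```

Without the `_` specifier on the first field, the first file name would not match at all in this sketch, since a default field stops at the first underscore.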
TODO:
- Update the notebooks

Code complexity

Lol. I thought this PR would simplify the `parse_directory` system.

The problem is the MRCC5. Or at least, it is the enormous size of this database and its scattering on slow-to-read disks. It is the only reason for all these complexities:
I wanted to make clear that the complexity of this code is NOT only because I like to optimize things. Last time I ran the MRCC5 catalog creation (with this code), it took 7 hours. Just imagine without the optimizations. (And this while missing a full disk).
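
For reference, the platform-independent traversal mentioned in the change list (`os.walk` in place of a `find` subprocess, a folder glob filter, mixed extensions in one call) can be sketched roughly as follows. This is a simplified illustration with none of the optimizations discussed above: `find_files` and its arguments are hypothetical, not the actual `parse_directory` signature, and it only looks at plain files (directory-based stores such as zarr would need extra handling).

```python
import fnmatch
import os
from pathlib import Path


def find_files(root, extensions, dirglob=None):
    """Yield files under `root` whose suffix is in `extensions`.

    If `dirglob` is given, it is matched (fnmatch-style) against each folder's
    path relative to `root`; files in non-matching folders are skipped.
    """
    root = Path(root)
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = Path(dirpath).relative_to(root).as_posix()
        if dirglob is not None and not fnmatch.fnmatch(rel, dirglob):
            continue
        for name in filenames:
            # Several extensions can be searched for in a single pass.
            if Path(name).suffix in extensions:
                yield Path(dirpath) / name
```

A call such as `list(find_files("/some/root", {".nc", ".csv"}, dirglob="raw*"))` would then return matching files from folders whose relative path starts with `raw`, on any operating system.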