Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor catalog utils (again) #205

Merged
merged 25 commits into from
Jun 13, 2023
Merged

Refactor catalog utils (again) #205

merged 25 commits into from
Jun 13, 2023

Conversation

aulemahal
Copy link
Collaborator

@aulemahal aulemahal commented May 23, 2023

Pull Request Checklist:

  • This PR addresses an already opened issue (for bug fixes / features)
  • (If applicable) Documentation has been added / updated (for bug fixes / features)
  • This PR does not seem to break the templates.
  • HISTORY.rst has been updated (with summary of main changes)
    • Link to issue (:issue:number) and pull request (:pull:number) has been added

What kind of change does this PR introduce?

  • Move catalog creation (parse_directory) to a new catutils.py module
  • Rewrite file parsing functions:
    • Use os.walk instead of subprocess calling find. The function should now be platform independent and not linux-specific.
    • Use parse instead of intake.source.utils.reverse_format. It allows some interesting features, like format specifiers.
    • Rethink parallelism.
    • More flexibility to cvs
  • Copy and adapt path building stuff from miranda. See build_path. Schema is data/file_schema.yml.
  • date_parser has moved to utils.py.

Does this PR introduce a breaking change?

Yes.

  • Functions have moved.
  • Patterns feeded to parse_directory are now specified differently. See doc and example below.
  • No more globpattern. The extension is parsed from the patterns. Calls can now mix different extensions. A new dirglob arg takes care of the "folder filtering" feature.
  • parse_directory does not use dask anymore. I believe the code is now easier to understand and speed is not affected too much. Parsing a local filesystem is not supposed to be faster in parallel, but we are often using nfs-mounted disk where this is less true. Testing on doris, parsing the full /datasets/simulation/raw folder takes the same time between the new and the old versions. See note below on "code complexity".

Other information:

Example for the new patterns:

OLD: {processing_level}/{mip_era}/{activity}/{domain}/{institution}/{source}/{experiment}/{member}/{frequency}/{_variable}/{?var}_{?freq}_{?src}_{?exp}_{?memb}_{*}_{date_start}-{date_end}.nc"

NEW : "{processing_level}/{mip_era}/{activity}/{domain}/{institution}/{source}/{experiment}/{member}/{frequency}/{variable:_}/{*}_{DATES}.nc"

The "variable" field accepts underscores, so it uses the "_" format specifier. No more need to specify each part of the filename if only the last one is needed. Here the "DATES" special field can catch single dates or bounds.

TODO:

  • Update the notebooks

Code complexity

Lol. I thought this PR would simplify the parse_directory system.

The problem is the MRCC5. Or at least, it is the enormous size of this database and its scattering on slow-to-read disks. It is the only reason for all these complexities:

  • Queues to multithread the file parsing
  • Access checks because there are non-readable files in the database.
  • Multiprocessing pool for parallelize the disks parsing
  • "read_file" by groups to minimize the number of file reads.
  • Parsing from file with fast-paths for netCDF and Zarr (instead of xarray).
  • Nested CVS (although this new feature could be useful in other databases)

I wanted to make clear that the complexity of this code is NOT only because I like to optimize things. Last time I ran the MRCC5 catalog creation (with this code), it took 7 hours. Just imagine without the optimizations. (And this while missing a full disk).

@RondeauG
Copy link
Contributor

If you could use this PR to also address #152, that would be great! I wrote the issue, but it originally came from @mccrayc, so you can check with him for the details.

@aulemahal aulemahal marked this pull request as ready for review June 2, 2023 21:36
Copy link
Contributor

@juliettelavoie juliettelavoie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great!

  1. I would update the init to be able to do xs.build_path.
  2. It might be good to add build_path at the end of the the Getting Started notebook.
  3. There is some conflict with format and version.
ds = xr.open_zarr('PATH_FINAL_ESPO')
ds = ds[['tasmax']]
ds.attrs['cat:variable']='tasmax'
ds.attrs['cat:processing_level']='biasadjusted'
xs.catutils.build_path(ds) #--> PosixPath('simulation/biasadjusted/ESPO-G6_1.1.0/CMIP6/ScenarioMIP/NAM-rdrs/BCC/BCC-CSM2-MR/ssp245/r1i1p1f1/day/tasmax/tasmax_day_ESPO-G6_1.1.0_CMIP6_ScenarioMIP_NAM-rdrs_BCC_BCC-CSM2-MR_ssp245_r1i1p1f1_1950-2100')
xs.catutils.build_path(ds,format='zarr') #-->PosixPath('simulation/ESPO-G6_1.1.0/CMIP6/ScenarioMIP/BCC/BCC-CSM2-MR/ssp245/r1i1p1f1/NAM-rdrs/final/D/tasmax/tasmax_D_ESPO-G6_1.1.zarr')
  1. In an issue, we discussed moving some of my ESPO utils fonction in xscen. I think the save_move_update doesn't have a place anymore with the new high quotas and io issues we are having. But I think a save_and_update with possibility of using build_path would be nice. It would lighten workflows. I could push this on this PR or a new one after.

xscen/catutils.py Outdated Show resolved Hide resolved
xscen/catutils.py Outdated Show resolved Hide resolved
xscen/catutils.py Show resolved Hide resolved
xscen/data/file_schema.yml Show resolved Hide resolved
Copy link
Contributor

@RondeauG RondeauG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! However, I'd ideally wait until @mccrayc has had a look at it before we merge.

xscen/catutils.py Outdated Show resolved Hide resolved
xscen/catutils.py Outdated Show resolved Hide resolved
xscen/catutils.py Outdated Show resolved Hide resolved
xscen/io.py Outdated Show resolved Hide resolved
@mccrayc
Copy link
Collaborator

mccrayc commented Jun 9, 2023

Looks good to me! The new patterns seem much more intuitive and clean.

@aulemahal
Copy link
Collaborator Author

Last commit made a few changes. I tested the new code with ouranos_data_catalogs and it made a few bugs appear:

  • _parse_from_zarr can now read the time coordinate. Before, xarray was needed, which was slowing down the process a lot.
  • Removed split_dataset, this is to be implemented in another PR.
  • Relaxed the "dates" regex to dates of any length.
  • Added a test in parse_directory to remove entries where date_start is a date, but frequency is "fx".

And I guess I now need to add tests!

@aulemahal
Copy link
Collaborator Author

+12% coverage 💪

@aulemahal aulemahal merged commit bb8aa45 into main Jun 13, 2023
@aulemahal aulemahal deleted the refac-parse-dir-again branch June 13, 2023 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

Better catalog-related documentation
4 participants