Refactor catalog utils (again) #205

aulemahal · 2023-05-23T21:28:09Z

Pull Request Checklist:

This PR addresses an already opened issue (for bug fixes / features)
- This PR fixes Better catalog-related documentation #152.
(If applicable) Documentation has been added / updated (for bug fixes / features)
This PR does not seem to break the templates.
HISTORY.rst has been updated (with summary of main changes)
- Link to issue (:issue:number) and pull request (:pull:number) has been added

What kind of change does this PR introduce?

Move catalog creation (parse_directory) to a new catutils.py module
Rewrite file parsing functions:
- Use os.walk instead of calling find through subprocess. The function should now be platform-independent rather than Linux-specific.
- Use parse instead of intake.source.utils.reverse_format. It allows some interesting features, like format specifiers.
- Rethink parallelism.
- More flexibility for cvs.
Copy and adapt path building stuff from miranda. See build_path. Schema is data/file_schema.yml.
date_parser has moved to utils.py.
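
To illustrate the move from find to os.walk, here is a minimal sketch of a platform-independent file listing. It is not the code in catutils.py: the function name walk_files, its arguments and the way dirglob is matched against the relative folder are assumptions made for this example only.

```python
import fnmatch
import os
from pathlib import Path


def walk_files(root, extensions, dirglob=None):
    """Yield files under root (as posix paths relative to root).

    Hypothetical helper: `extensions` is a set like {".nc"} and `dirglob`
    is an optional glob matched against the folder path relative to root.
    """
    root = Path(root)
    for dirpath, _dirnames, filenames in os.walk(root):
        rel_dir = Path(dirpath).relative_to(root)
        # Folder filtering: skip the files of folders that do not match the glob.
        if dirglob is not None and not fnmatch.fnmatchcase(rel_dir.as_posix(), dirglob):
            continue
        for name in sorted(filenames):
            # Several extensions can be mixed in a single call.
            if Path(name).suffix in extensions:
                yield (rel_dir / name).as_posix()
```

Because only os.walk and fnmatch are used, nothing here depends on a find executable being available.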

Does this PR introduce a breaking change?

Yes.

Functions have moved.
Patterns fed to parse_directory are now specified differently. See doc and example below.
No more globpattern. The extension is parsed from the patterns. Calls can now mix different extensions. A new dirglob arg takes care of the "folder filtering" feature.
parse_directory does not use dask anymore. I believe the code is now easier to understand and speed is not affected too much. Parsing a local filesystem is not supposed to be faster in parallel, but we often use NFS-mounted disks, where this is less true. Testing on doris, parsing the full /datasets/simulation/raw folder takes the same time with the new and the old versions. See note below on "code complexity".

Other information:

Example for the new patterns:

OLD: "{processing_level}/{mip_era}/{activity}/{domain}/{institution}/{source}/{experiment}/{member}/{frequency}/{_variable}/{?var}_{?freq}_{?src}_{?exp}_{?memb}_{*}_{date_start}-{date_end}.nc"

NEW: "{processing_level}/{mip_era}/{activity}/{domain}/{institution}/{source}/{experiment}/{member}/{frequency}/{variable:_}/{*}_{DATES}.nc"

The "variable" field accepts underscores, so it uses the "_" format specifier. No more need to specify each part of the filename if only the last one is needed. Here the "DATES" special field can catch single dates or bounds.
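
As a rough illustration of these semantics, the sketch below translates such a pattern into a regular expression. Note that it uses the standard re module, not the parse library that the PR relies on, and the translation rules (the helper pattern_to_regex, and the choice that plain fields reject underscores while DATES yields date_start and an optional date_end) are assumptions for the example, not the actual implementation.

```python
import re

_FIELD = re.compile(r"\{([^{}]*)\}")


def pattern_to_regex(pattern):
    """Translate a simplified path pattern into a compiled regex (illustrative only)."""
    parts = []
    pos = 0
    for match in _FIELD.finditer(pattern):
        # Literal text between fields is matched as-is.
        parts.append(re.escape(pattern[pos:match.start()]))
        name, _, spec = match.group(1).partition(":")
        if name == "*":
            # Wildcard: anything within one path level, not captured.
            parts.append(r"[^/]*")
        elif name == "DATES":
            # Special field: a single date or "start-end" bounds, of any length.
            parts.append(r"(?P<date_start>\d+)(?:-(?P<date_end>\d+))?")
        elif spec == "_":
            # "_" specifier: the field may contain underscores.
            parts.append(rf"(?P<{name}>[^/]+)")
        else:
            # Default field: no slashes and no underscores.
            parts.append(rf"(?P<{name}>[^/_]+)")
        pos = match.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts))


regex = pattern_to_regex("{source}/{variable:_}/{*}_{DATES}.nc")
bounds = regex.fullmatch("CRCM5/tas_max/tas_max_day_CRCM5_19500101-21001231.nc")
single = regex.fullmatch("CRCM5/orog/orog_fx_2000.nc")
print(bounds.groupdict())
print(single.groupdict())
```

In this sketch, only the fields of interest are named and the rest of the filename is absorbed by the wildcard, which is the point of the new pattern style.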

TODO:

~~Update the notebooks~~

Code complexity

Lol. I thought this PR would simplify the parse_directory system.

The problem is the MRCC5. Or at least, it is the enormous size of this database and its scattering on slow-to-read disks. It is the only reason for all these complexities:

Queues to multithread the file parsing
Access checks because there are non-readable files in the database.
Multiprocessing pool to parallelize the parsing across disks
"read_file" by groups to minimize the number of file reads.
Parsing from file with fast-paths for netCDF and Zarr (instead of xarray).
Nested CVS (although this new feature could be useful in other databases)
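
For readers unfamiliar with the first two items, here is a minimal sketch of a queue feeding worker threads, combined with an access check. The names parse_name and parse_all are hypothetical, and the parser is a trivial stand-in; the real code in catutils.py is more involved.

```python
import os
import queue
import threading


def parse_name(path):
    """Hypothetical stand-in parser: split the file stem on underscores."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return {"path": path, "parts": stem.split("_")}


def parse_all(paths, n_workers=4):
    """Parse paths in worker threads, skipping files that cannot be read."""
    todo = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            path = todo.get()
            if path is None:  # sentinel: no more work for this thread
                return
            # Access check: silently skip non-readable (or missing) files.
            if os.access(path, os.R_OK):
                parsed = parse_name(path)
                with lock:
                    results.append(parsed)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for path in paths:
        todo.put(path)
    for _ in threads:
        todo.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return sorted(results, key=lambda r: r["path"])
```

Threads fit here because the work is dominated by waiting on slow disks rather than by computation.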

I wanted to make clear that the complexity of this code is NOT only because I like to optimize things. Last time I ran the MRCC5 catalog creation (with this code), it took 7 hours. Just imagine without the optimizations. (And this while missing a full disk).

RondeauG · 2023-05-24T13:16:03Z

If you could use this PR to also address #152, that would be great! I wrote the issue, but it originally came from @mccrayc, so you can check with him for the details.

juliettelavoie

This is great!

I would update the init to be able to do xs.build_path.
It might be good to add build_path at the end of the Getting Started notebook.
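There is a conflict between format and version: in the example below, passing format='zarr' cuts the file name right after the "1.1" of version "1.1.0".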

import xarray as xr
import xscen as xs

ds = xr.open_zarr('PATH_FINAL_ESPO')
ds = ds[['tasmax']]
ds.attrs['cat:variable'] = 'tasmax'
ds.attrs['cat:processing_level'] = 'biasadjusted'
xs.catutils.build_path(ds)  # --> PosixPath('simulation/biasadjusted/ESPO-G6_1.1.0/CMIP6/ScenarioMIP/NAM-rdrs/BCC/BCC-CSM2-MR/ssp245/r1i1p1f1/day/tasmax/tasmax_day_ESPO-G6_1.1.0_CMIP6_ScenarioMIP_NAM-rdrs_BCC_BCC-CSM2-MR_ssp245_r1i1p1f1_1950-2100')
xs.catutils.build_path(ds, format='zarr')  # --> PosixPath('simulation/ESPO-G6_1.1.0/CMIP6/ScenarioMIP/BCC/BCC-CSM2-MR/ssp245/r1i1p1f1/NAM-rdrs/final/D/tasmax/tasmax_D_ESPO-G6_1.1.zarr')
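
One plausible explanation for the truncated name, not confirmed anywhere in this thread, is that the extension is set with something like pathlib's with_suffix, which treats everything after the last dot as the existing suffix. The sketch below reproduces that behaviour on a shortened, made-up file name; whether build_path actually does this is an assumption.

```python
from pathlib import PurePosixPath

# Shortened, made-up file name containing a dotted version ("1.1.0").
name = PurePosixPath("tasmax_D_ESPO-G6_1.1.0_CMIP6_ScenarioMIP_1950-2100")

# pathlib sees everything after the last dot as the suffix.
print(name.suffix)

# Replacing the suffix therefore drops the end of the version and all that follows.
naive = name.with_suffix(".zarr")
print(naive)

# Appending the extension to the full name keeps the version intact.
safe = name.with_name(name.name + ".zarr")
print(safe)
```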

In an issue, we discussed moving some of my ESPO utils functions into xscen. I think save_move_update doesn't have a place anymore with the new high quotas and the io issues we are having. But I think a save_and_update with the possibility of using build_path would be nice. It would lighten workflows. I could push this on this PR or in a new one after.

RondeauG

Looks good! However, I'd ideally wait until @mccrayc has had a look at it before we merge.

mccrayc · 2023-06-09T21:17:13Z

Looks good to me! The new patterns seem much more intuitive and clean.

aulemahal · 2023-06-12T18:20:57Z

Last commit made a few changes. I tested the new code with ouranos_data_catalogs and it made a few bugs appear:

_parse_from_zarr can now read the time coordinate. Before, xarray was needed, which was slowing down the process a lot.
Removed split_dataset, this is to be implemented in another PR.
Relaxed the "dates" regex to dates of any length.
Added a test in parse_directory to remove entries where date_start is a date, but frequency is "fx".

And I guess I now need to add tests!
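
For the last item in the list above, the check amounts to something like the following sketch. It works on plain dictionaries rather than on the dataframe that parse_directory returns, and the function name drop_dated_fx is made up for the example.

```python
def drop_dated_fx(entries):
    """Drop entries that have a start date although their frequency is 'fx' (fixed fields)."""
    return [
        entry
        for entry in entries
        if not (entry.get("frequency") == "fx" and entry.get("date_start") is not None)
    ]


entries = [
    {"id": "a", "frequency": "day", "date_start": "1950"},
    {"id": "b", "frequency": "fx", "date_start": None},
    {"id": "c", "frequency": "fx", "date_start": "1950"},  # inconsistent: dropped
]
kept = drop_dated_fx(entries)
print([entry["id"] for entry in kept])
```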

aulemahal · 2023-06-13T20:37:19Z

+12% coverage 💪

aulemahal added 2 commits May 23, 2023 17:08

Refactor catalog utils - copy miranda structure stuff - parse with parse

1f75eb1

Merge branch 'main' into refac-parse-dir-again

d7354aa

aulemahal added 10 commits May 30, 2023 13:52

WIP structure - notebooks

b754590

WIP Reappropriation of structure

2aeade0

WIP add schema

0aed7b4

Build path

c6500c9

WIP parser

2d9c1ac

Merge WIP struct

70fbd12

Cleaner (?), easier to understand (?) code.

d0192c5

precommit

79e1141

Working (?) but badly documented

fae1f5a

Merge branch 'main' into refac-parse-dir-again

dcbc280

aulemahal marked this pull request as ready for review June 2, 2023 21:36

aulemahal mentioned this pull request Jun 2, 2023

Refactor Ouranos folder structure Ouranosinc/miranda#127

Closed

aulemahal added 2 commits June 5, 2023 16:54

Update notebook 1

d47ca0a

merge

3f19e95

aulemahal requested review from RondeauG, juliettelavoie and Zeitsperre June 5, 2023 20:58

aulemahal added 2 commits June 5, 2023 17:21

rm local path

4918747

Output is dataframe

b00097d

juliettelavoie reviewed Jun 6, 2023

aulemahal and others added 3 commits June 6, 2023 12:53

Apply suggestions from code review

e2271b5

Co-authored-by: juliettelavoie <juliette.lavoie@hotmail.ca>

Other comments after review

fc06898

Merge branch 'refac-parse-dir-again' of github.com:Ouranosinc/xscen into refac-parse-dir-again

8b84f2b

RondeauG approved these changes Jun 8, 2023

juliettelavoie approved these changes Jun 9, 2023

Fixes for ouranos data cats - fixes from rev

3877e70

aulemahal added 5 commits June 12, 2023 14:21

Fix annotation

bfd2fed

Tests

25f8102

Allow missing facet

e9104f7

Revert "Allow missing facet"

aed0c2a

This reverts commit e9104f7.

Allow missing facet (prise deux)

5e79394

aulemahal merged commit bb8aa45 into main Jun 13, 2023

aulemahal deleted the refac-parse-dir-again branch June 13, 2023 20:37
