More thorough store handling during combine/append #488

Merged on Aug 2, 2024 (2 commits)

@martindurant (Member) commented Aug 1, 2024

Fixes #487

@martindurant
Member Author

In [1]: from kerchunk.combine import MultiZarrToZarr
   ...: ppath = "/Users/mdurant/Downloads/append.parquet"
   ...: from fsspec.implementations.reference import LazyReferenceMapper
   ...: out = LazyReferenceMapper(root=ppath)
   ...: import re
   ...: import datetime
   ...: def fn_to_time(index, fs, var, fn):
   ...:     match = re.search(r'CLDPROP_D3_VIIRS_SNPP\.A(\d{4})(\d{3})\.', fn)
   ...:     year = int(match.group(1))
   ...:     day_of_year = int(match.group(2))
   ...:     return datetime.datetime(year, 1, 1) + datetime.timedelta(days=day_of_year - 1)
   ...: import numpy as np

In [2]: MultiZarrToZarr.append(
   ...:             ["CLDPROP_D3_VIIRS_SNPP.A2024174.011.2024178005308.json"],
   ...:             original_refs=out,
   ...:             coo_map={'time': fn_to_time},
   ...:             coo_dtypes={'time': np.dtype('M8[s]')},
   ...:             concat_dims=['time'],
   ...:             remote_protocol="file"
   ...:         ).translate()
/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/xarray/backends/plugins.py:80: RuntimeWarning: Engine 'gribberish' loading failed:
No module named 'gribberish'
  warnings.warn(f"Engine {name!r} loading failed:\n{ex}", RuntimeWarning)
Out[2]: <fsspec.implementations.reference.LazyReferenceMapper at 0x102b78250>

In [3]: import xarray as xr

In [4]: ds = xr.open_dataset(ppath, engine='kerchunk', group = "Cloud_Optical_Thickness_1621_PCL_Log_Liquid")

In [5]: ds
Out[5]:
<xarray.Dataset>
Dimensions:             (time: 2, longitude: 360, latitude: 180)
Dimensions without coordinates: time, longitude, latitude
Data variables:
    Mean                (time, longitude, latitude) float64 ...
    Pixel_Counts        (time, longitude, latitude) float64 ...
    Standard_Deviation  (time, longitude, latitude) float64 ...
    Sum                 (time, longitude, latitude) float64 ...
    Sum_Squares         (time, longitude, latitude) float64 ...
Attributes:
    add_offset:    0.0
    long_name:     Cloud Optical Thickness Log10 for Liquid Water Clouds (1.6...
    scale_factor:  1.0
    units:         none
    valid_max:     2.176
    valid_min:     -2.0

@martindurant
Member Author

Also required the following in fsspec:

--- a/fsspec/implementations/reference.py
+++ b/fsspec/implementations/reference.py
@@ -1085,7 +1085,7 @@ class ReferenceFileSystem(AsyncFileSystem):
         if self.dircache:
             return path in self.dircache
         elif isinstance(self.references, LazyReferenceMapper):
-            return path in self.references.listdir("")
+            return path in self.references.listdir()
         else:

@martindurant
Member Author

cc @sreesanjeevkg

@sreesanjeevkg

@martindurant Can you please share the whole notebook if possible? My append function is not working as expected. Please see screenshot below.

The version of kerchunk I installed

kerchunk 0.0.9.post422

CODE:

Screenshot 2024-08-01 at 3 21 27 PM

@sreesanjeevkg

Also required the following in fsspec:

--- a/fsspec/implementations/reference.py
+++ b/fsspec/implementations/reference.py
@@ -1085,7 +1085,7 @@ class ReferenceFileSystem(AsyncFileSystem):
         if self.dircache:
             return path in self.dircache
         elif isinstance(self.references, LazyReferenceMapper):
-            return path in self.references.listdir("")
+            return path in self.references.listdir()
         else:

Are these changes pushed?

@martindurant
Member Author

> Are these changes pushed?

No, I haven't done that yet

@sreesanjeevkg

sreesanjeevkg commented Aug 1, 2024

Also, if possible, can you share your notebook? The append() functionality is not working for me, even after I made the changes to fsspec. I get the same error as in my earlier comment above (kerchunk 0.0.9.post422; see the screenshot there).

@sreesanjeevkg

sreesanjeevkg commented Aug 1, 2024

@martindurant

KerchunkError.md

Notebook for reference...

In your code you used ["CLDPROP_D3_VIIRS_SNPP.A2024174.011.2024178005308.json"], but all my JSON files are named like ["CLDPROP_D3_VIIRS_SNPP.A2024174.011.2024178005308.nc.json"].
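As a quick sanity check on the naming difference: the regular expression in `fn_to_time` only anchors on the `CLDPROP_D3_VIIRS_SNPP.A<year><day>.` prefix, so, assuming the function is used unchanged, a trailing `.nc.json` rather than `.json` should not change the time value it extracts. The sketch below reuses the function from this thread with placeholder values for `index`, `fs`, and `var` (the function does not use them). It only checks filename parsing and does not by itself explain the append failure.

```python
import datetime
import re


def fn_to_time(index, fs, var, fn):
    # Same function as posted in this thread: parse year and day-of-year
    match = re.search(r'CLDPROP_D3_VIIRS_SNPP\.A(\d{4})(\d{3})\.', fn)
    year = int(match.group(1))
    day_of_year = int(match.group(2))
    return datetime.datetime(year, 1, 1) + datetime.timedelta(days=day_of_year - 1)


json_name = "CLDPROP_D3_VIIRS_SNPP.A2024174.011.2024178005308.json"
nc_json_name = "CLDPROP_D3_VIIRS_SNPP.A2024174.011.2024178005308.nc.json"

# index=0, fs=None and var="time" are placeholders; fn_to_time ignores them
t_json = fn_to_time(0, None, "time", json_name)
t_nc_json = fn_to_time(0, None, "time", nc_json_name)
print(t_json, t_nc_json, t_json == t_nc_json)
```

Both names yield the same datetime, so the suffix alone is unlikely to be the cause as far as the time coordinate is concerned.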

@sreesanjeevkg

@martindurant Am I doing anything wrong in creating the reference files? I am not able to figure out why my parquet store is not being appended to.

@sreesanjeevkg

@martindurant Can you please share your example notebook, along with versions of kerchunk and fsspec installed?
The append() functionality is not working for me.

@martindurant
Member Author

I will copy here what I have later today.

martindurant added a commit to martindurant/filesystem_spec that referenced this pull request Aug 2, 2024
@martindurant
Member Author

Complete workflow:

In [3]: import ujson
In [5]: from kerchunk.combine import MultiZarrToZarr
In [6]: from kerchunk.hdf import SingleHdf5ToZarr
In [7]: with open("CLDPROP_D3_VIIRS_SNPP.A2024174.011.2024178005308.json", "wt") as f:
   ...:     ujson.dump(SingleHdf5ToZarr("CLDPROP_D3_VIIRS_SNPP.A2024174.011.2024178005308.nc", inline_threshold=300).translate(), f)
   ...:

In [8]: with open("CLDPROP_D3_VIIRS_SNPP.A2024173.011.2024177004727.json", "wt") as f:
   ...:     ujson.dump(SingleHdf5ToZarr("CLDPROP_D3_VIIRS_SNPP.A2024173.011.2024177004727.nc", inline_threshold=300).translate(), f)
   ...:

In [9]: ppath = "/Users/mdurant/Downloads/append.parquet"
In [11]: from fsspec.implementations.reference import LazyReferenceMapper
In [12]: out = LazyReferenceMapper.create(root=ppath)
In [14]: import re
    ...: import datetime
    ...: def fn_to_time(index, fs, var, fn):
    ...:     match = re.search(r'CLDPROP_D3_VIIRS_SNPP\.A(\d{4})(\d{3})\.', fn)
    ...:     year = int(match.group(1))
    ...:     day_of_year = int(match.group(2))
    ...:     return datetime.datetime(year, 1, 1) + datetime.timedelta(days=day_of_year - 1)
    ...:

In [16]: import numpy as np
In [17]: MultiZarrToZarr(
    ...:             ["CLDPROP_D3_VIIRS_SNPP.A2024173.011.2024177004727.json"],
    ...:             coo_map={'time': fn_to_time},
    ...:             coo_dtypes={'time': np.dtype('M8[s]')},
    ...:             concat_dims=['time'],
    ...:             out=out,
    ...:         ).translate()
/Users/mdurant/code/kerchunk/kerchunk/combine.py:374: UserWarning: Concatenated coordinate 'time' contains less than expectednumber of values across the datasets: ['2024-06-21T00:00:00']
  warnings.warn(
Out[17]: <fsspec.implementations.reference.LazyReferenceMapper at 0x1027a97b0>

In [18]: MultiZarrToZarr.append(
    ...:             ["CLDPROP_D3_VIIRS_SNPP.A2024174.011.2024178005308.json"],
    ...:             original_refs=out,
    ...:             coo_map={'time': fn_to_time},
    ...:             coo_dtypes={'time': np.dtype('M8[s]')},
    ...:             concat_dims=['time'],
    ...:             remote_protocol="file"
    ...:         ).translate()
/Users/mdurant/conda/envs/py310/lib/python3.10/site-packages/xarray/backends/plugins.py:80: RuntimeWarning: Engine 'gribberish' loading failed:
No module named 'gribberish'
  warnings.warn(f"Engine {name!r} loading failed:\n{ex}", RuntimeWarning)
Out[18]: <fsspec.implementations.reference.LazyReferenceMapper at 0x1027a97b0>
In [19]: import xarray as xr
In [20]: ds = xr.open_dataset(ppath, engine='kerchunk', group = "Cloud_Optical_Thickness_1621_PCL_Log_Liquid")
In [21]: ds
Out[21]:
<xarray.Dataset>
Dimensions:             (time: 2, longitude: 360, latitude: 180)
Dimensions without coordinates: time, longitude, latitude
Data variables:
    Mean                (time, longitude, latitude) float64 ...
    Pixel_Counts        (time, longitude, latitude) float64 ...
    Standard_Deviation  (time, longitude, latitude) float64 ...
    Sum                 (time, longitude, latitude) float64 ...
    Sum_Squares         (time, longitude, latitude) float64 ...
Attributes:
    add_offset:    0.0
    long_name:     Cloud Optical Thickness Log10 for Liquid Water Clouds (1.6...
    scale_factor:  1.0
    units:         none
    valid_max:     2.176
    valid_min:     -2.0

kerchunk at 429d1df (append_deep branch, this PR)
fsspec at c8fa00d (fsspec/filesystem_spec#1657)
zarr at 0855bd6e (after PR1915, current state of main/v2)
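For reference, the two time values that end up along the `time` dimension can be reproduced from the filenames alone. The sketch below uses a hypothetical helper, `granule_time` (not part of kerchunk), that applies the same regular expression as `fn_to_time` but raises a `ValueError` when a filename does not match, rather than failing on `match.group`. The first value agrees with the `2024-06-21T00:00:00` reported in the warning from the initial combine.

```python
import datetime
import re

# Same pattern as fn_to_time in the workflow above
PATTERN = re.compile(r'CLDPROP_D3_VIIRS_SNPP\.A(\d{4})(\d{3})\.')


def granule_time(fn):
    # Hypothetical helper for illustration; not part of kerchunk
    match = PATTERN.search(fn)
    if match is None:
        raise ValueError(f"unexpected filename: {fn!r}")
    year, day_of_year = int(match.group(1)), int(match.group(2))
    return datetime.datetime(year, 1, 1) + datetime.timedelta(days=day_of_year - 1)


files = [
    "CLDPROP_D3_VIIRS_SNPP.A2024173.011.2024177004727.json",  # initial combine
    "CLDPROP_D3_VIIRS_SNPP.A2024174.011.2024178005308.json",  # appended
]
times = [granule_time(fn) for fn in files]
for fn, t in zip(files, times):
    print(t.isoformat(), fn)
```

Failing loudly on an unexpected filename is a design choice for this sketch; the original `fn_to_time` would instead raise an `AttributeError` on a non-matching name.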

@sreesanjeevkg

I think it is working as expected @martindurant. You can go ahead and merge all the branches. Thank you.

I will do one more thorough check in my project and let you know if I have any questions.

@martindurant martindurant merged commit 3ae8939 into fsspec:main Aug 2, 2024
5 checks passed
@martindurant martindurant deleted the append_deep branch August 2, 2024 20:48
martindurant added a commit to fsspec/filesystem_spec that referenced this pull request Aug 2, 2024
Development

Successfully merging this pull request may close these issues.

Unable to append to already existing Kerchunk parquet store
2 participants