Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent behavior with fsspec.open with mode='r' vs. mode='rb' #579

Open
rabernat opened this issue Mar 23, 2021 · 13 comments
Open

Inconsistent behavior with fsspec.open with mode='r' vs. mode='rb' #579

rabernat opened this issue Mar 23, 2021 · 13 comments

Comments

@rabernat
Copy link
Contributor

It's unclear to me how to use fsspec.open as a contextmanager. The behavior of this function seems inconsistent. The inconsistency is somehow related to the mode argument. I will try to illustrate this with an example.

Create a test file

Bypass fsspec completely

fname = 'test.txt'
with open(fname, mode='w') as f:
    f.write('hello')

Open the file with a FileSystem instance

When I open the file via an instantiated FileSystem instance, everything works as I expect

fs = fsspec.implementations.local.LocalFileSystem()
with fs.open(fname, mode='r') as fp:
    print(type(fp))
# -> <class '_io.TextIOWrapper'>
with fs.open(fname, mode='rb') as fp:
    print(type(fp))
# -> <class '_io.BufferedReader'>

The objects yielded to the context manager look like standard open file objects and can be used as such throughout python.

Open the file via fsspec.open

with fsspec.open(fname, mode='r') as fp:
    print(type(fp))
# -> <class '_io.TextIOWrapper'>
with fsspec.open(fname, mode='rb') as fp:
    print(type(fp))
# -> <class 'fsspec.implementations.local.LocalFileOpener'>

With mode='r', the fsspec.open yields object is the same type of object as fs.open. But with mode='rb', we get a LocalFileOpener. ⚠️ this is the key problem. In order to get a BufferedReader, we need an additional context manager!

with fsspec.open(fname) as fp:
    with fp as fp2:
        print(type(fp2))
# -> <class '_io.BufferedReader'>

I can't figure out what this LocalFileOpener object is. It's not documented in the API docs. Most importantly for my purposes, xarray can't figure out what to do with it if I pass it to open_dataset. In contrast, it handles an _io.BufferedReader object fine.

Proposed resolution

I would propose to remove the inconsistency and have with fsspec.open(fname, mode='rb') yield an _io.BufferedReader object. However, I recognize this could be a breaking change for some applications that rely on the LocalFileOpener.

@martindurant
Copy link
Member

fsspec cares about serializability. Whilst OpenFile instances have explicit pickle support, we try to make the file-like objects (cloud)pickle-friendly. This is the job of LocalFileOpener.

What is it about LocalFileOpener that confuses xarray?

@martindurant
Copy link
Member

OK, so it's the line

elif isinstance(filename_or_obj, io.IOBase)

which LocalFileOpener is not. The reason is, that to derive from io.IOBase would mean either using __getattribute__ instead of __getattr__ or listing all of the methods we wish to pass on to the actual, not-serialisable local file. I would suggest that, instead, xarray check for the methods it needs (seek, read) rather than using an instance check.

@martindurant
Copy link
Member

fsspec might be a decent place to create utility functions like isfilelike; although I thought that methods like readable() and writable() were meant for this purpose.

@rabernat
Copy link
Contributor Author

But on a deeper level, don't you think it's weird that fs.open(url, mode='rb') returns one thing but fsspec.open(url, mode='rb') returns another?

@martindurant
Copy link
Member

Not really? fsspec.open isn't a method on a file system implementation, it's a convenience function that happens to have the same name and enables, for instance, caching and compression.

@rabernat
Copy link
Contributor Author

But why does does fsspec.open(url, mode='r') return an _io.TextIOWrapper? Don't the same arguments about serializability apply?

@martindurant
Copy link
Member

True, that could be fixed too.
As it stands, though, but the outputs of fsspec.open (OpenFile instances) and of fs.open (file-like objects) are pickleable. I don't know yet how to pickle the TextIOWrappers - maybe a subclass owuld be appropriate, since the only thing it actually needs to know is the current position and any newline characters.

@rabernat
Copy link
Contributor Author

I am sitting with @cisaacstern working on pangeo forge and we have hit this issue again. Let me show why it is problematic in the context of pangeo forge.

goal: open files from different locations (local, http, s3) using fsspec.open and feed them to xarray in a consistent way, without special case logic. We do NOT want to just pass paths around. We want to have one layer of our pipeline convert a path to an OpenFile object (with certain custom options related to secrets, etc.) and the next layer open this with xarray.

We will work with the following example dataset: https://www.unidata.ucar.edu/software/netcdf/examples/sresa1b_ncar_ccsm3-example.nc

We will try to access the file both locally and over http.

import fsspec
import xarray as xr

# open locally
of_local = fsspec.open("sresa1b_ncar_ccsm3-example.nc", mode="rb")
xr.open_dataset(of_local)  # works

# open over http
of_http = fsspec.open(url, mode="rb")
xr.open_dataset(of_http)  # does NOT work

with of_http as fp:
    xr.open_dataset(of_http) # works

The problem here is that we need this extra context manager only in the case of the http-based file, but not with the local file. This means that we need special case logic for different types of files. This feels wrong.

@martindurant, do you have any recommendations here? How can we achieve our goal? Should we be using fsspec differently.

@martindurant
Copy link
Member

The context manager or fsspec.open(url, mode="rb").open() do work for both cases. The fact that xr.open_dataset(of_local) works is accidental: xarray/h5py does not recognise OpenFile as a file-like, and tries to use fspath, which only works for local files, as intended.

@rabernat
Copy link
Contributor Author

rabernat commented May 27, 2022

Interesting. I tried that over in pangeo-forge/pangeo-forge-recipes#370 and it turned up some serialization issues with beam.

If I do

with of_http as fp:
    ds = xr.open_dataset(of_http)

is the resulting dataset object serializable?

@martindurant
Copy link
Member

is the resulting dataset object serializable?

I am not certain. I would have thought yes.

@rabernat
Copy link
Contributor Author

rabernat commented May 29, 2022

This issue has been a thorn in our side for a while time, so I did a deep dive. Warning, copious detail ahead ⚠️ 🤓 ⚠️

The following code tests all combinations of the following five different parameters

parameter options
file type netcdf4 / netcdf3
url_type http / local file
fsspec_open_method of = fsspec.open / create an fs and call of = fs.open
xr_open_method call xr.open_dataset(of) / `with of as fp: xr.open_dataset(fp)
engine all possible xarray engines for file type

We test two things:

  • open_status: whether Xarray can open the object
  • pickle_status: whether the resulting dataset can be pickled
code to generate table
from pickle import dumps

import fsspec
from fsspec.implementations.http import HTTPFileSystem
from fsspec.implementations.local import LocalFileSystem
import xarray as xr
import pandas as pd

def open_from_fs(path):
    if path.startswith("http"):
        fs = HTTPFileSystem()
    else:
        fs = LocalFileSystem()
    return fs.open(path)

def open_from_open(path):
    return fsspec.open(path)

def xr_open_directly(thing, **kwargs):
    ds = xr.open_dataset(thing, **kwargs)
    return thing

def xr_open_with_context_manager(thing, **kwargs):
    with thing as fp:
        with xr.open_dataset(fp, **kwargs) as ds:
            return ds
    

files = {
    'netcdf4': 'https://www.unidata.ucar.edu/software/netcdf/examples/OMI-Aura_L2-example.nc',
    'netcdf3': 'https://www.unidata.ucar.edu/software/netcdf/examples/sresa1b_ncar_ccsm3-example.nc'
}

engines = {
    'netcdf4': ('h5netcdf', 'netcdf4', None),
    'netcdf3': ('scipy', 'netcdf4', None)
}

fsspec_open_methods = {
    'fsspec.open': open_from_fs,
    'fs.open': open_from_open
}

xr_open_methods = {
    'direct_open': xr_open_directly,
    'context_manager': xr_open_with_context_manager
}

results = []
columns = ('file type', 'url_type', 'fssepc_open_method', 'xr_open_method', 'engine', 'open_status', 'pickle_status')
for file_type, url in files.items():
    local_path = url.split('/')[-1]
    for url_type, path in zip(('http', 'local'), (url, local_path)):
        for open_method, open_fn in fsspec_open_methods.items():
            for xr_open_method, xr_open_fn in xr_open_methods.items():
                for engine in engines[file_type]:
                    params = (file_type, url_type, open_method, xr_open_method, engine or "None")
                    pickle_status = open_status = ""
                    try:
                        open_file = open_fn(path)
                        ds = xr_open_fn(open_file, engine=engine)
                        open_status = "✅"
                        try:
                            _ = dumps(ds)
                            pickle_status = "✅"
                        except Exception as e1:
                            pickle_status = f"❌ {type(e1).__name__}: {e1}".replace('\n', ' ')
                    except Exception as e2:
                        open_status = f"❌ {type(e2).__name__}: {e2}".replace('\n', ' ')
                        pickle_status = "n/a"
                    finally:
                        open_file.close()
                    results.append(
                        params + (open_status, pickle_status)
                    )
    

df = pd.DataFrame(data=results, columns=columns)
display(df)
df.to_markdown("results.md", index=False, tablefmt="github")
huge table with all the parameters
file type url_type fssepc_open_method xr_open_method engine open_status pickle_status
netcdf4 http fsspec.open direct_open h5netcdf
netcdf4 http fsspec.open direct_open netcdf4 ❌ ValueError: can only read bytes or file-like objects with engine='scipy' or 'h5netcdf' n/a
netcdf4 http fsspec.open direct_open None
netcdf4 http fsspec.open context_manager h5netcdf
netcdf4 http fsspec.open context_manager netcdf4 ❌ ValueError: can only read bytes or file-like objects with engine='scipy' or 'h5netcdf' n/a
netcdf4 http fsspec.open context_manager None ❌ ValueError: I/O operation on closed file. n/a
netcdf4 http fs.open direct_open h5netcdf ❌ AttributeError: 'HTTPFile' object has no attribute 'fspath' n/a
netcdf4 http fs.open direct_open netcdf4 ❌ AttributeError: 'HTTPFile' object has no attribute 'fspath' n/a
netcdf4 http fs.open direct_open None ❌ ValueError: did not find a match in any of xarray's currently installed IO backends ['netcdf4', 'h5netcdf', 'scipy', 'cfgrib', 'pydap', 'rasterio', 'zarr']. Consider explicitly selecting one of the installed engines via the engine parameter, or installing additional IO dependencies, see: https://docs.xarray.dev/en/stable/getting-started-guide/installing.html https://docs.xarray.dev/en/stable/user-guide/io.html n/a
netcdf4 http fs.open context_manager h5netcdf ❌ ValueError: I/O operation on closed file. n/a
netcdf4 http fs.open context_manager netcdf4 ❌ ValueError: can only read bytes or file-like objects with engine='scipy' or 'h5netcdf' n/a
netcdf4 http fs.open context_manager None ❌ ValueError: I/O operation on closed file. n/a
netcdf4 local fsspec.open direct_open h5netcdf
netcdf4 local fsspec.open direct_open netcdf4
netcdf4 local fsspec.open direct_open None
netcdf4 local fsspec.open context_manager h5netcdf ❌ TypeError: cannot pickle '_io.BufferedReader' object
netcdf4 local fsspec.open context_manager netcdf4 ❌ ValueError: can only read bytes or file-like objects with engine='scipy' or 'h5netcdf' n/a
netcdf4 local fsspec.open context_manager None ❌ TypeError: cannot pickle '_io.BufferedReader' object
netcdf4 local fs.open direct_open h5netcdf
netcdf4 local fs.open direct_open netcdf4
netcdf4 local fs.open direct_open None
netcdf4 local fs.open context_manager h5netcdf
netcdf4 local fs.open context_manager netcdf4
netcdf4 local fs.open context_manager None
netcdf3 http fsspec.open direct_open scipy
netcdf3 http fsspec.open direct_open netcdf4 ❌ ValueError: can only read bytes or file-like objects with engine='scipy' or 'h5netcdf' n/a
netcdf3 http fsspec.open direct_open None
netcdf3 http fsspec.open context_manager scipy
netcdf3 http fsspec.open context_manager netcdf4 ❌ ValueError: can only read bytes or file-like objects with engine='scipy' or 'h5netcdf' n/a
netcdf3 http fsspec.open context_manager None
netcdf3 http fs.open direct_open scipy ❌ AttributeError: 'HTTPFile' object has no attribute 'fspath' n/a
netcdf3 http fs.open direct_open netcdf4 ❌ AttributeError: 'HTTPFile' object has no attribute 'fspath' n/a
netcdf3 http fs.open direct_open None ❌ ValueError: did not find a match in any of xarray's currently installed IO backends ['netcdf4', 'h5netcdf', 'scipy', 'cfgrib', 'pydap', 'rasterio', 'zarr']. Consider explicitly selecting one of the installed engines via the engine parameter, or installing additional IO dependencies, see: https://docs.xarray.dev/en/stable/getting-started-guide/installing.html https://docs.xarray.dev/en/stable/user-guide/io.html n/a
netcdf3 http fs.open context_manager scipy
netcdf3 http fs.open context_manager netcdf4 ❌ ValueError: can only read bytes or file-like objects with engine='scipy' or 'h5netcdf' n/a
netcdf3 http fs.open context_manager None
netcdf3 local fsspec.open direct_open scipy
netcdf3 local fsspec.open direct_open netcdf4
netcdf3 local fsspec.open direct_open None
netcdf3 local fsspec.open context_manager scipy ❌ TypeError: cannot pickle '_io.BufferedReader' object
netcdf3 local fsspec.open context_manager netcdf4 ❌ ValueError: can only read bytes or file-like objects with engine='scipy' or 'h5netcdf' n/a
netcdf3 local fsspec.open context_manager None ❌ TypeError: cannot pickle '_io.BufferedReader' object
netcdf3 local fs.open direct_open scipy
netcdf3 local fs.open direct_open netcdf4
netcdf3 local fs.open direct_open None
netcdf3 local fs.open context_manager scipy
netcdf3 local fs.open context_manager netcdf4
netcdf3 local fs.open context_manager None

There is a lot of information in that table. So Let's focus just on netcdf4 files and the h5netcdf engine (most commonly used in pangeo forge because it can open remote files via h5py).

Always use Context Manager

Always using the context manager (as recommend in #579 (comment)) narrows the results down to

url_type fssepc_open_method open_status pickle_status
http fsspec.open
http fs.open
local fsspec.open ❌ TypeError: cannot pickle '_io.BufferedReader' object
local fs.open

Here see that we can't use fsspec.open on local files and the context manager, because it puts an _io.BufferedReader object into the dataset which can't be pickled.

Don't Use Context Manager

We can disregard the advice of #579 (comment) and not use the context manager, passing the open_file object directly to xarray. (If this works it is "accidental".) Then we get the following results

url_type fssepc_open_method open_status pickle_status
http fsspec.open
http fs.open ❌ AttributeError: 'HTTPFile' object has no attribute 'fspath' n/a
local fsspec.open
local fs.open

Here we can't use fs.open where fs is an HTTPFileSystem, because it is not openable by xarray. Ultimately, the error is coming from h5netcdf, as illustrated by the following simple reproducer

url = 'https://www.unidata.ucar.edu/software/netcdf/examples/OMI-Aura_L2-example.nc' # netcdf
open_file = fsspec.open(url)
ds = xr.open_dataset(open_file, engine='h5netcdf')
traceback
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-71c2c6bbd0b7> in <module>
      3 open_file = fsspec.open(url)
      4 #with open_file as fp:
----> 5 ds = xr.open_dataset(open_file, engine='h5netcdf')
      6 #    _ = dumps(ds)

~/Code/xarray/xarray/backends/api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    493 
    494     overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 495     backend_ds = backend.open_dataset(
    496         filename_or_obj,
    497         drop_variables=drop_variables,

~/Code/xarray/xarray/backends/h5netcdf_.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, format, group, lock, invalid_netcdf, phony_dims, decode_vlen_strings)
    384     ):
    385 
--> 386         filename_or_obj = _normalize_path(filename_or_obj)
    387         store = H5NetCDFStore.open(
    388             filename_or_obj,

~/Code/xarray/xarray/backends/common.py in _normalize_path(path)
     21 def _normalize_path(path):
     22     if isinstance(path, os.PathLike):
---> 23         path = os.fspath(path)
     24 
     25     if isinstance(path, str) and not is_remote_uri(path):

~/Code/filesystem_spec/fsspec/core.py in __fspath__(self)
     96     def __fspath__(self):
     97         # may raise if cannot be resolved to local file
---> 98         return self.open().__fspath__()
     99 
    100     def __enter__(self):

AttributeError: 'HTTPFile' object has no attribute '__fspath__'

Why does this matter?

In Pangeo Forge, we need to open Xarray datasets and serialize them starting from all four of the following type of objects

  • open_file = fsspec.open(local_path) -> <OpenFile '/local/path.nc'>
    • ❌ doesn't work with context manager (TypeError: cannot pickle '_io.BufferedReader' object)
    • ✅ works without context manager
  • open_file = fsspec.open(http_path) -> <OpenFile 'https://path.nc'>
    • ✅ works with context manager
    • ✅ works without context manager
  • open_file = fs_local.open(local_path) -> <fsspec.implementations.local.LocalFileOpener at 0x18425a880>
    • ✅ works with context manager
    • ✅ works without context manager
  • open_file = fs_http.open(http_path) -> <File-like object HTTPFileSystem, https://path.nc>
    • ✅ works with context manager
    • ❌ doesn't work without context manager (AttributeError: 'HTTPFile' object has no attribute 'fspath')

As this list shows, we can't just decide to always use a context manager or never use a context manager. The only viable option today is what I do in https://github.com/pangeo-forge/pangeo-forge-recipes/pull/370/files#r884065015: use try / except logic to find something that works. I would prefer instead to have a stronger contract with fsspec.

What is the core problem in fsspec?

IMO the core issue is the fact that fsspec.open and fs.open do different things, and that the resulting objects do not have consistent behavior with xarray / h5netcdf. I wish that fsspec.open would give me back the exact same type of thing that fs.open does. Instead, fsspec.open creates fsspec.core.OpenFile objects, fs_local.open creates fsspec.implementations.local.LocalFileOpener , and fs_http.open creates fsspec.implementations.http.HTTPFile. Is an FileOpener supposed to behave the same as OpenFile? Which of the two APIs does HTTPFile implement? The situation is highly confusing.

We can continue working around this in Pangeo Forge, but having to try / except around this makes Pangeo Forge more fragile; what if a different, legitimate i/o error comes up inside the try block? It would be hard to detect. Furthermore, based on this analysis, we can only assume that more inconsistent behavior is lurking in all of the other implementations (gcsfs, s3fs, ftp, etc.), waiting to trip us up.

I hope I have managed to articulate this problem. Martin, would really appreciate your advice on where to go from here.

@rabernat
Copy link
Contributor Author

rabernat commented Jun 8, 2022

The resolution to this appears to be actually to not use the context manager. This is what we came up with in google/xarray-beam#49 (comment)

if hasattr(open_file, "open"):
    open_file = open_file.open()
ds = xr.open_dataset(open_file)

This seems to work with whatever fsspec can throw at us.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants