Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Understand NetCDF CF convention for projection info in .geobox property #837

Closed
Kirill888 opened this issue Dec 12, 2019 · 8 comments · Fixed by #899
Closed

Understand NetCDF CF convention for projection info in .geobox property #837

Kirill888 opened this issue Dec 12, 2019 · 8 comments · Fixed by #899
Assignees

Comments

@Kirill888
Copy link
Member

Expected behaviour

xr.open_dataset(<netcdf CF file>).geobox is not None

Actual behaviour

netcdf cf files loaded via xr.open_dataset are seen as not-georegistered by datcube .geobox extension. Even for files produced by datacube itself.

Also see comment thread in #835 (comment)

@snowman2
Copy link
Contributor

snowman2 commented Dec 12, 2019

I think demonstrating what this would look like would be the most useful I think. One benefit of writing the CRS this way is that it can be understood natively by software using GDAL such as QGIS or rasterio and it can also be understood by newer versions of ArcMap. More information about it can be found here: https://corteva.github.io/rioxarray/html/examples/crs_management.html

Here is what a netcdf dataset with CRS information written to the dataset/dataarray with rio.write_crs would look like when opened up using rasterio/GDAL:

>>> import rioxarray
>>> rds = rioxarray.open_rasterio("PLANET_SCOPE_3D.nc")
>>> rds
<xarray.Dataset>
Dimensions:      (time: 2, x: 10, y: 10)
Coordinates:
  * y            (y) float64 8.085e+06 8.085e+06 ... 8.085e+06 8.085e+06
  * x            (x) float64 4.663e+05 4.663e+05 ... 4.663e+05 4.663e+05
  * time         (time) object 2016-12-19 10:27:29 2016-12-29 12:52:41.659696
    spatial_ref  int64 0
Data variables:
    blue         (time, y, x) float64 ...
    green        (time, y, x) float64 ...
Attributes:
    grid_mapping:  spatial_ref

Here is what is inside the spatial_ref coordinate:

>>> rds.coords["spatial_ref"]
<xarray.DataArray 'spatial_ref' ()>
array(0)
Coordinates:
    spatial_ref  int64 0
Attributes:
    spatial_ref:  PROJCS["WGS 84 / UTM zone 22S",GEOGCS["WGS 84",DATUM["WGS_1...
    crs_wkt:      PROJCS["WGS 84 / UTM zone 22S",GEOGCS["WGS 84",DATUM["WGS_1...

When parsing the data_vars the CRS information should be consistent:

>>> rds.blue
<xarray.DataArray 'blue' (time: 2, y: 10, x: 10)>
array([[[6.611017, 5.580979, 0.399607, 2.052803, 5.479985, 4.760219,
...
         1.888442, 3.491315, 5.055704, 3.368395]]])
Coordinates:
  * y            (y) float64 8.085e+06 8.085e+06 ... 8.085e+06 8.085e+06
  * x            (x) float64 4.663e+05 4.663e+05 ... 4.663e+05 4.663e+05
  * time         (time) object 2016-12-19 10:27:29 2016-12-29 12:52:41.659696
    spatial_ref  int64 0
Attributes:
    grid_mapping:  spatial_ref
    nodata:        0
    units:         ('DN', 'DN')
    _FillValue:    nan
    transform:     (3.0, 0.0, 466266.0, 0.0, -3.0, 8084700.0)
    scale_factor:  1.0
    add_offset:    0.0
>>> rds.blue.coords["spatial_ref"]
<xarray.DataArray 'spatial_ref' ()>
array(0)
Coordinates:
    spatial_ref  int64 0
Attributes:
    spatial_ref:  PROJCS["WGS 84 / UTM zone 22S",GEOGCS["WGS 84",DATUM["WGS_1...
    crs_wkt:      PROJCS["WGS 84 / UTM zone 22S",GEOGCS["WGS 84",DATUM["WGS_1...

Also, since the CRS information is in the coordinates, when performing some xarray operations, the CRS information is preserved:

>>> rds.blue.astype("int")
<xarray.DataArray 'blue' (time: 2, y: 10, x: 10)>
array([[[6, 5, 0, 2, 5, 4, 5, 5, 0, 5],
,,,
        [1, 6, 2, 5, 0, 4, 1, 3, 5, 3]]])
Coordinates:
  * y            (y) float64 8.085e+06 8.085e+06 ... 8.085e+06 8.085e+06
  * x            (x) float64 4.663e+05 4.663e+05 ... 4.663e+05 4.663e+05
  * time         (time) object 2016-12-19 10:27:29 2016-12-29 12:52:41.659696
    spatial_ref  int64 0
>>> rds.blue.astype("int").coords["spatial_ref"]
<xarray.DataArray 'spatial_ref' ()>
array(0)
Coordinates:
    spatial_ref  int64 0
Attributes:
    spatial_ref:  PROJCS["WGS 84 / UTM zone 22S",GEOGCS["WGS 84",DATUM["WGS_1...
    crs_wkt:      PROJCS["WGS 84 / UTM zone 22S",GEOGCS["WGS 84",DATUM["WGS_1...

But, to be able to retrieve the CRS information, you would have to have a standard coordinate to look for when the grid_mapping attribute is lost/missing. That is how it is handled in the rio.crs property.

>>> rds.blue.astype("int").rio.crs
CRS.from_epsg(32722)

From #835 (comment)

So long as we can expect WKT string to be present somewhere, not too keen on supporting assembling of all possible WKT parameters stored as separate fields in the attribute dictionary as suggested by CF convention.

I agree. In rioxarray we just support the WKT version of the projection from the CF conventions at this time for read/write.

@Kirill888
Copy link
Member Author

@snowman2 thanks for detailed example. I didn't realise that spatial_ref is a coordinate rather than a data_var. Would you be willing to share an example netcdf file? I'd like to see how it looks internally. We do have code that generates NetCDF CF compliant files with spatial data, but with those variable with CRS info appears as data_var after loading with xr.dataset_open.

I must say that I do not fully understand the coordinates vs dimensions model of xarray, kinda assume one-to-one mapping, but it's obviously not that simple. Having a special "empty" coordinate that gets propagated through various xarray operations looks like a good approach.

I should play around with rioxarray properly.

I should probably ping @andrewdhicks and @omad on this.

@snowman2
Copy link
Contributor

Sounds good 👍. Here is the PLANET_SCOPE_3D.nc file. I think the main difference is that there is a global attribute called coordinates in the file for xarray to use to differentiate between variables and coordinates:

...
// global attributes:
		:coordinates = "spatial_ref" ;

If you write the netCDF file using .to_netcdf() and the spatial_ref is in the coords then it should read back in as coords.

Side note - I was using the latest (unreleased) version of rioxarray from GitHub as it supports loading in 1-D coordinates.

@Kirill888
Copy link
Member Author

Thanks @snowman2, I have some questions regarding this scheme.

  1. CF convention seems to be using crs_wkt name for WKT attribute not spatial_ref.
  2. CF convention demands presence of grid_mapping_name attribute and only allows a small set of defined names without any escape hatches that I can see in the spec.

Even if using CF supported projections, figuring out grid_mapping_name from a CRS is not an obvious calculation.

I feel like having a scheme that is similar to CF convention, but isn't actually, is worse than having a completely different representation.

I do like non-dimension coordinate approach to store CRS metadata, just not sure about whole CF convention which seems to be very limited as far as availability of projections go.

@snowman2
Copy link
Contributor

Some background on our choice for using the subset of CRS CF conventions. In the beginning of our search for a format to use in the netCDF file, we wanted to be able to write a netCDF that would be understood by GDAL and ArcGIS. The motivation for the format for the CRS being readable by GDAL is that it can be read in using rasterio, QGIS, and the datacube load() call.

So, we did some digging and found that if you add the spatial_ref attribute to the spatial_ref coordinate, GDAL could read it in. The code responsible for this is likely this bit here. But, the official spec states that it should be crs_wkt. So, we added both in for good measure and possibly future compatibility. We did some testing and were able to successfully load it in with rasterio, QGIS and ArcMap and so it met our minimum requirements.

Also since GDAL supported the WKT string representation, that meant that it would also support any projection GDAL can read in with the WKT representation. So, that worked for our needs as well. Not sure about ArcMap - didn't test it too extensively.

This is very useful for our system as users can load in the data from datacube, do some computation and write the projection to the newly computed netCDF file in such a way that we can add it back into the system as a custom product and it will be read in properly. Also, having it in a format that GDAL can read in means that users can use rasterio to open the raw file and get the CRS information they need.

You are indeed correct that the CF conventions for projections is limited. PROJ/GDAL developers say that WKT is the better format. Maybe this could be something to bring up for discussion here: https://github.com/cf-convention/cf-conventions.

Side note: pyproj.CRS has limited support for reading/writing CRS using CF conventions, but it has limitations and is strongly WKT oriented (see docstings and this issue).

@Kirill888
Copy link
Member Author

I see, so spatial_ref:spatial_ref is a GDAL way to geo-reference "not-quite CF compliant" netcdf files, like when you need projection other than one in the CF spec?

NetCDF CF files datacube produces don't use those names, instead we needed to follow CF spec more closely and actually run compliance checker tool on produced netcdf files. We still use GDAL (via rasterio) to read those, we do not do any special CRS processing, all of it is done by GDAL code. So, I assume GDAL does support CF convention for reading, but probably not for arbitrary WKT string.

We do write files using netCDF4 library rather than GDAL, not sure what was the original reason for it, probably to do with the fact that we store dataset yaml text in one of the netcdf variables, or due to our need to be CF compliant more than what GDAL could produce at the time?

Since CF spec does not mandate any particular name for a variable containing CRS info we can use spatial_ref instead of current crs. CF spec allows for more attributes than what is in the spec, so copying crs_wkt into spatial_ref won't hurt, I guess.

So, what we have so far (at run-time level as opposed to storage level):

  • Attribute named crs at Dataset and DataArray level, containing a string (set by rasterio)
  • Coordinate named spatial_ref with string attribute spatial_ref, some GDAL convention of unknown source for netcdf files.
  • Full on CF convention: grid_mapping string attribute at DataArray level pointing to a coordinate or data var of that name that contains grid_mapping_name string attribute + other info and possibly crs_wkt attribute.

what a mess.

@snowman2
Copy link
Contributor

snowman2 commented Dec 13, 2019

so spatial_ref:spatial_ref is a GDAL way to geo-reference "not-quite CF compliant" netcdf files, like when you need projection other than one in the CF spec?

I should clarify that this only works in GDAL when the spatial_ref attribute is in the attributes of the coordinate name specified by the grid_mapping attribute. So, you can put it in there along with crs_wkt in the grid mapping attributes. The coordinate name spatial_ref has no special meaning in GDAL, but it does in rioxarray as it is the default one we chose for when the grid_mapping attribute is lost.

So, I assume GDAL does support CF convention for reading, but probably not for arbitrary WKT string.

GDAL does support reading in an arbitrary WKT string in the spatial_ref attribute. In the GDAL code here it attempts to read in both the WKT and the CF grid mapping parameters. If they both exist and the CRS created are different, the WKT one is dropped. So, adding the non-WKT CF parameters could cause GDAL to read it in and potentially lose projection information as CF grid mapping parameters do not always fully represent the CRS.

We do write files using netCDF4 library rather than GDAL ...

We use xarray's to_netcdf() method to write them.

@snowman2
Copy link
Contributor

Brought this up on the CF conventions repo for discussion: cf-convention/cf-conventions#222

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants