Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CRS to the spatial dimensions. #835

Merged
merged 3 commits into from
Dec 11, 2019
Merged

Add CRS to the spatial dimensions. #835

merged 3 commits into from
Dec 11, 2019

Conversation

petewa
Copy link
Contributor

@petewa petewa commented Dec 11, 2019

Part of Bug squash 2019:

  • Added CRS to the spatial dimensions where they are more likely to persist and not get lost through xarray operations.
  • CRS is a string both in the dimensions and global attributes.
  • Adjusted tests.

'crs' is now a string in the spatial dimensions and attributes.
Tests adjusted to suit.
@petewa petewa self-assigned this Dec 11, 2019
spatial_dims = data_array.dims[-2:]

if 'crs' in data_array[data_array.dims[-1]].attrs:
if data_array[spatial_dims[0]].crs != data_array[spatial_dims[1]].crs:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can fail with IndexError, you check that crs is present in the attributes of the second spatial dimension, but then assume first one also has it, I would use something like

set(d.get('crs', None) for d in spatial_dims)

crs = data_array[data_array.dims[-1]].crs
else:
# fall back option
crs = obj.crs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can fail with AttributeError

fallback should look in data_array.attr.get('crs', None) and only then in obj.attrs.get('crs', None)


for measurement in measurements:
data = data_func(measurement)
attrs = measurement.dataarray_attrs()
attrs['crs'] = geobox.crs
attrs['crs'] = str(geobox.crs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for performance and safety reasons I would convert geobox.crs to string once, that way you also save on memory as you won;t be storing multiple identical but different strings.

@coveralls
Copy link

coveralls commented Dec 11, 2019

Coverage Status

Coverage decreased (-0.02%) to 88.605% when pulling 18388e3 on bugsquash/crs into 1b649b5 on develop.

@codecov
Copy link

codecov bot commented Dec 11, 2019

Codecov Report

Merging #835 into develop will decrease coverage by 0.01%.
The diff coverage is 85.18%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #835      +/-   ##
===========================================
- Coverage    88.66%   88.64%   -0.02%     
===========================================
  Files          112      112              
  Lines        10344    10364      +20     
===========================================
+ Hits          9171     9187      +16     
- Misses        1173     1177       +4
Impacted Files Coverage Δ
datacube/api/core.py 90.96% <100%> (ø) ⬆️
datacube/drivers/netcdf/_write.py 95.12% <75%> (-2.32%) ⬇️
datacube/utils/xarray_geoextensions.py 85.71% <85%> (-1.13%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1b649b5...18388e3. Read the comment docs.

@Kirill888 Kirill888 self-requested a review December 11, 2019 05:25
if len(crs_set) > 1:
raise ValueError('Spatial dimensions have different crs.')
elif len(crs_set) == 1 and None not in crs_set:
crs = data_array[data_array.dims[-1]].crs
Copy link
Member

@Kirill888 Kirill888 Dec 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably cleaner if it was just

elif len(crs_set) == 1:
    crs = crs_set.pop()

and then :

if crs is None:
   ..fallback..

@Kirill888 Kirill888 self-requested a review December 11, 2019 05:46
Copy link
Member

@Kirill888 Kirill888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, hopefully with this more xarray operations will preserve geo registration information

@mergify mergify bot merged commit bf0c9fd into develop Dec 11, 2019
@mergify mergify bot deleted the bugsquash/crs branch December 11, 2019 05:55
@snowman2
Copy link
Contributor

Thoughts on writing the CRS using CF grid_mapping conventions in the coordinates of the xarray dataset similar to what is done in rioxarray?

If interested, here are some potentially useful links:

  • Code for writing CRS - write_crs.
  • Code for retrieving CRS - crs.

@Kirill888
Copy link
Member

@snowman2 I think everyone would love to see various geo libraries converge to common representation of projections and the ways to attach them to raster data structures like xarray. Thank you Alan for pushing forward this effort.

Looks like xarray in particular is becoming a de-facto standard for dealing with geo-registered raster data in Python. Given it's origins with NetCDF, adopting NetCDF CF inspired convention makes sense. But maybe with some modifications.

My, limited, understanding of NetCDF CF is that all projection information is stored at the Dataset level in the attributes of the crs variable. Which, probably makes sense at the storage level but becomes very cumbersome at run-time. Moving that information into attributes of the Dataset itself is a step forward, but is still problematic as you can't extract single DataArray out of the Dataset and still have CRS data present. So we copy it into DataArray attributes as well. That works for more cases but it is still easy to drop CRS with a simple operation like type conversion, a very common thing people do is to convert to float32 after loading. This is very "surprising", in a bad way, to data users.

I believe the correct place to attach this information is to the X/Y coordinates themselves for a couple of reasons:

  1. Values stored in the coordinate arrays ARE interpreted in the context of the projection, so in a way it's a property of axis more than a property of the dataset/array.
  2. Pixel level operations, like type conversion, or band math on bands that share axis, will preserve coordinates and their attributes, and so won't drop CRS metadata.

So that's what this change was about. Also with this change datacube now stores CRS in a string form rather than datacube.utils.geometry.CRS object, so should help with things like .to_netcdf not working with datacube loaded data.

As far as changing from "string in an attribute named crs" to something more consistent with NetCDF CF convention, that should work for us, we should be able to make that change without major costs. So long as we can expect WKT string to be present somewhere, not too keen on supporting assembling of all possible WKT parameters stored as separate fields in the attribute dictionary as suggested by CF convention.

@Kirill888
Copy link
Member

I read up some more on CF model, correction to my previous post

  1. CRS info is stored in "some" Dataset variable, name is arbitrary and there could be several, it's just our code always generates one and calls it crs, but it can be anything.
  2. Projection used by a particular band is defined by an attribute grid_mapping that contains a string which is a name of the variable containing projection info. So pointer via name essentially.

I do think datacube code should be able to interpret that information and construct proper GeoBox from that. That way netcdf cf files loaded via xr.open_datasets would have proper .geobox property.

I also think this is a very inconvenient format to maintain at run-time, basically you can't detach individual band from the dataset and most operations will drop CRS data.

Attaching CRS info to coordinates makes most sense to me, for reasons outlined in the above post. However, having methods for moving between these representations more or less seamlessly would be nice to have.

storage representation (CF) -> 
  run time representation -> 
    some non-trivial band ops/mixing (maintains run-time representation, but not CF) -> 
       storage representation (CF)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants