
Common cloud-based time series data format #7

Closed · huhabla opened this issue Nov 28, 2017 · 9 comments

huhabla commented Nov 28, 2017

In openEO we handle large time series of satellite scenes. IMHO we need a data type to describe such data that reflects the cloud-based storage approach. This data type should be as simple yet expressive as possible, so that time series data generated by services or data searches is described sufficiently for further cloud processing. I would suggest a simple text/JSON-based description of time series or image collections that includes the URLs pointing to the actual data (JPEG 2000, GeoTIFF, XML in an object storage) and provides additional metadata such as time stamps, bounding boxes, and so on.

For example:
A search for Sentinel-2A scenes can return this data type, which includes all links to the scenes, bands and metadata. The result of the search can directly be used in processes that compute the NDVI for a time series of Sentinel-2A scenes. Such a process returns the same data type to describe the resulting image collection and its links in cloud storage. The result of the NDVI computation can then be used in other services that process time series data and accept the suggested time series data type as input.
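To make this concrete, a minimal sketch of what such a JSON description could look like (all field names and URLs here are hypothetical, not part of any openEO specification):

```json
{
  "type": "ImageCollection",
  "bbox": [7.0, 51.0, 7.5, 51.5],
  "scenes": [
    {
      "timestamp": "2017-06-01T10:30:00Z",
      "bands": {
        "B04": "https://storage.example.com/S2A_scene/B04.jp2",
        "B08": "https://storage.example.com/S2A_scene/B08.jp2"
      },
      "metadata": "https://storage.example.com/S2A_scene/MTD.xml"
    }
  ]
}
```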

Mec-iS commented Jan 8, 2018

Geospatial standards usually do not address time formats. I think a nice point to start from may be GeoJSON-events.

@m-mohr m-mohr added this to the v1.0.0 milestone Feb 5, 2018
GreatEmerald (Member) commented
I'm not sure if the definition of "time series" is clear enough in this case... What you're describing sounds like a collection to me. We already have a sort of definition of "time series" as a JSON response that includes timestamps and values for a single pixel (see "Getting the timeseries" in https://github.com/Open-EO/openeo-python-client/blob/master/examples/notebooks/Compositing.ipynb), though it's not an official definition yet as per #46.
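For reference, such a per-pixel response is essentially a mapping from timestamps to band values; a hypothetical shape (not the official definition, see #46):

```json
{
  "2017-06-01T10:30:00Z": [0.42],
  "2017-06-11T10:30:00Z": [0.57]
}
```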

@m-mohr m-mohr added the other label Mar 11, 2018
@m-mohr m-mohr removed the feature label Apr 5, 2018
@m-mohr m-mohr added the processes label Apr 27, 2018
m-mohr (Member) commented May 15, 2018

I agree with @GreatEmerald. To me the description sounds like a collection, too. But collections are handled like data cubes in openEO, and working on individual files as suggested in the first post doesn't seem suitable here. @huhabla Could you please explain what exactly you meant?
If it's just about linking to the data files, then we could simply use STAC, which we plan to use to some extent anyway. Not all back-ends might want to list their data files though, so this should be optional.

huhabla (Author) commented May 16, 2018

I am not referring to the data formats that are used in the openEO back-ends.
What I meant is a data exchange format that represents an image collection with time stamps for each image. The best case would be a cloud-optimized solution.

For example, how to download the image collection resulting from an NDVI computation? The input data of the computation was a time series of Landsat images (an image collection) with different time stamps and spatial extents, in the worst case scattered around the globe. Should this be handled as a data cube using netCDF? What is the best export format for this? I would suggest a cloud-compatible format like cloud-optimized GeoTIFF (COG) for the images. These images are stored in a directory that contains the time stamps of the GeoTIFF files in a text file. This directory can be compressed. The key is that HTTP GET range requests can be used to download parts of the GeoTIFF files, as well as parts of the archive, when hosted on an HTTP server or in an object storage.

My point is: what to do in case the user doesn't want to store the resulting image collections in the database of the back-end, but on an HTTP server or in an object storage, and wants to use them in further processing? Then we should use a data format that can be directly accessed by the openEO back-ends, using HTTP GET range requests or virtual file systems.

For this case, GDAL supports a virtual file system:
http://www.gdal.org/gdal_virtual_file_systems.html#gdal_virtual_file_systems_vsis3
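To illustrate, a minimal sketch of reading a COG directly from object storage via GDAL's virtual file systems (the bucket, key and window are made up; /vsis3/ requires a GDAL build with curl support):

```python
from osgeo import gdal

# Public buckets can be read without credentials (GDAL >= 2.3).
gdal.SetConfigOption("AWS_NO_SIGN_REQUEST", "YES")

# Open a hypothetical COG in S3 without downloading the whole file.
ds = gdal.Open("/vsis3/example-bucket/ndvi/2017-06-01.tif")
band = ds.GetRasterBand(1)

# Only the blocks covering this window are fetched, via HTTP range requests.
window = band.ReadAsArray(xoff=0, yoff=0, win_xsize=256, win_ysize=256)
print(ds.RasterXSize, ds.RasterYSize, window.shape)
```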

@m-mohr m-mohr added the result access label and removed the processes label Jun 27, 2018
@m-mohr m-mohr modified the milestones: v0.3, v1.0 Jul 5, 2018
@m-mohr m-mohr modified the milestones: v1.0, v0.5 Sep 12, 2018
@m-mohr m-mohr modified the milestones: v0.5, v1.0 Dec 6, 2018
pramitghosh commented
Yes, I agree with @huhabla that storing additional info on the GeoTIFFs in plaintext is a good idea.

As a side note, I am using a CSV file (later transformed to a JSON array) for storing information such as timestamps, band names and (relative) paths of GeoTIFFs organized in directories (as seen here: https://github.com/Open-EO/openeo-r-udf/blob/master/data/example_udf_in/legend.csv). Maybe something like this could be adopted for the generic problem of time-series data in openEO.

As this look-up table (a "legend" file) contains the relative local paths, it might be useful for transferring time-series data selectively (say, only the GeoTIFFs that are required) once the set of files is made available over HTTP through some endpoint.
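As a rough illustration of the idea, such a legend could be as simple as the following (the column names and values here are made up, not copied from the linked file):

```csv
timestamp,band,path
2017-06-01T10:30:00Z,B04,t0/B04.tif
2017-06-01T10:30:00Z,B08,t0/B08.tif
2017-06-11T10:30:00Z,B04,t1/B04.tif
2017-06-11T10:30:00Z,B08,t1/B08.tif
```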

Might be useful for Open-EO/openeo-udf#9

m-mohr (Member) commented Aug 21, 2019

I'm not sure whether we can really assume that every back-end supports such a format; currently it lacks back-end support entirely. We may make it for 1.0, but maybe this is something that's out of scope for now. The simplest approaches would probably be netCDF or a STAC catalog containing COGs, if we want to go for it soon. I'd like some feedback from the back-ends on what they think about this.

Edit, 21 days later: I still think a common basis to agree on could be a STAC catalog which contains cloud-optimized GeoTIFFs for raster data and another format for vector data. The summaries field in the collections could be used to describe and expose chunks of data, so that you don't need to download all the data. Maybe there's even a better way to index the data, which we could port back to STAC. All of that would probably be more a best practice than a part of the core openEO API.
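An abridged, hypothetical STAC item for one COG in such a catalog might look like this (the values are illustrative only; the media type is the registered one for cloud-optimized GeoTIFFs):

```json
{
  "type": "Feature",
  "id": "ndvi-2017-06-01",
  "bbox": [7.0, 51.0, 7.5, 51.5],
  "properties": { "datetime": "2017-06-01T10:30:00Z" },
  "assets": {
    "ndvi": {
      "href": "https://storage.example.com/ndvi/2017-06-01.tif",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized"
    }
  }
}
```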

m-mohr (Member) commented Sep 13, 2019

Somewhat related issue: Open-EO/openeo-processes#2

mkadunc (Member) commented Sep 13, 2019

When we decide how to encode data cubes in JSON, we could maybe store several 2D TIFFs and a single files-cube.json, where files-cube.json is a cube with the variable 'fileName' and all non-spatial dimensions. A 4D cube [x, y, t, b] would be saved as (see the sketch after the list):

  • t * b GeoTIFF files, each containing a single 2D [x, y] slice for one combination of t and b
  • a single data cube with shape [t, b] containing the file names (or URLs) of the above slices
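A hypothetical files-cube.json for a tiny cube with two times and two bands could then look like this (structure and field names are illustrative, not a proposed spec):

```json
{
  "dimensions": {
    "t": ["2017-06-01", "2017-06-11"],
    "b": ["B04", "B08"]
  },
  "variable": "fileName",
  "data": [
    ["t0_B04.tif", "t0_B08.tif"],
    ["t1_B04.tif", "t1_B08.tif"]
  ]
}
```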

Another option for cloud-optimized vector data might be Tapalcatl; for rasters, especially when used from Python, Zarr seems to be the dominant option right now, and it should interface nicely with xarray...
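For illustration, a minimal sketch of round-tripping a small [t, b, y, x] cube through Zarr with xarray (dimension names and the store path are made up; assumes the xarray, zarr and dask packages):

```python
import numpy as np
import xarray as xr

# A tiny hypothetical [t, b, y, x] cube.
cube = xr.DataArray(
    np.random.rand(2, 2, 256, 256).astype("float32"),
    dims=("t", "b", "y", "x"),
    coords={"t": ["2017-06-01", "2017-06-11"], "b": ["B04", "B08"]},
    name="reflectance",
)

# Write as a chunked Zarr store; each chunk becomes one object, so
# clients can fetch single [y, x] slices instead of the whole cube.
cube.chunk({"t": 1, "b": 1}).to_dataset().to_zarr("cube.zarr", mode="w")

# Lazily reopen and read back one time/band slice.
reopened = xr.open_zarr("cube.zarr")["reflectance"]
print(reopened.sel(t="2017-06-01", b="B04").values.shape)  # (256, 256)
```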

m-mohr (Member) commented Jul 20, 2020

This doesn't seem to be an API issue, but rather something that should be tackled in a separate specification. Overall, this is a new file format, and such formats should be kept separate from the API itself, although we may give recommendations or best practices here.

@m-mohr m-mohr closed this as completed Jul 20, 2020