
Common cloud-based time series data format #7

Closed · huhabla opened this issue Nov 28, 2017 · 9 comments

huhabla commented Nov 28, 2017

In openEO we handle large time series of satellite scenes. IMHO we need a data type to describe such data that reflects the cloud-based storage approach. This data type should be as simple yet expressive as possible, so that time series data generated by services or data searches is described sufficiently for further cloud processing. I would suggest a simple text/JSON-based description of time series or image collections that includes the URLs pointing to the actual data (JPEG 2000, GeoTIFF, XML in an object storage) and provides additional metadata such as time stamps, bounding boxes, and so on.

For example:
A search for Sentinel-2A scenes can return this data type, which includes all links to the scenes, bands and metadata. The result of the search can directly be used in processes that compute the NDVI for a time series of Sentinel-2A scenes. Such a process returns the same data type to describe the resulting image collection and its links in cloud storage. The result of the NDVI computation can then be used in other services that process time series data and accept the suggested time series data type as input.
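To make this concrete, a minimal sketch of what such a JSON description could look like (all field names and URLs here are hypothetical, not part of any openEO specification):

```json
{
  "type": "ImageCollection",
  "bbox": [7.0, 51.0, 7.5, 51.5],
  "scenes": [
    {
      "timestamp": "2017-06-01T10:30:00Z",
      "bands": {
        "B04": "https://storage.example.com/S2A_scene/B04.jp2",
        "B08": "https://storage.example.com/S2A_scene/B08.jp2"
      },
      "metadata": "https://storage.example.com/S2A_scene/MTD.xml"
    }
  ]
}
```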

Mec-iS commented Jan 8, 2018

Geospatial standards usually do not address time formats. I think a nice point to start from may be GeoJSON-events.

@m-mohr m-mohr added this to the v1.0.0 milestone Feb 5, 2018
GreatEmerald (Member) commented
I'm not sure if the definition of "time series" is clear enough in this case... What you're describing sounds like a collection to me. We already have a sort of definition of "time series" as a JSON response that includes timestamps and values for a single pixel (see "Getting the timeseries" in https://github.com/Open-EO/openeo-python-client/blob/master/examples/notebooks/Compositing.ipynb), though it's not an official definition yet as per #46.
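For reference, such a per-pixel response is essentially a mapping from timestamps to band values; a hypothetical shape (not the official definition, see #46):

```json
{
  "2017-06-01T10:30:00Z": [0.42],
  "2017-06-11T10:30:00Z": [0.57]
}
```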

@m-mohr m-mohr added the other label Mar 11, 2018
@m-mohr m-mohr removed the feature label Apr 5, 2018
@m-mohr m-mohr added the processes label Apr 27, 2018
m-mohr (Member) commented May 15, 2018

I agree with @GreatEmerald. To me the description sounds like a collection, too. But collections are handled like data cubes in openEO, and working on individual files as suggested in the first post doesn't seem suitable here. @huhabla Could you please explain what exactly you meant?
If it's just about linking to the data files, then we could simply use STAC, which we plan to use to some extent anyway. Not all back-ends might want to list their data files though, so this should be optional.

huhabla (Author) commented May 16, 2018

I am not referring to the data formats that are used in the openEO back-ends.
What I meant is a data exchange format that represents an image collection with time stamps for each image. The best case would be a cloud-optimized solution.

For example, how to download the image collection resulting from an NDVI computation? The input data of the computation was a time series of Landsat images (an image collection) with different time stamps and spatial extents, in the worst case scattered around the globe. Should this be handled as a data cube using netCDF? What is the best export format for this? I would suggest a cloud-compatible format like cloud-optimized GeoTIFF (COG) for the images. These images are stored in a directory that contains the time stamps of the GeoTIFF files in a text file. This directory can be compressed. The key is that HTTP GET range requests can be used to download parts of the GeoTIFF files, as well as parts of the archive, when hosted on an HTTP server or in an object storage.

My point is: what to do in case the user doesn't want to store the resulting image collections in the database of the back-end, but on an HTTP server or in an object storage, and wants to use them in further processing? Then we should use a data format that can be directly accessed by the openEO back-ends, using HTTP GET range requests or virtual file systems.

For this case, GDAL supports a virtual file system:
http://www.gdal.org/gdal_virtual_file_systems.html#gdal_virtual_file_systems_vsis3
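To illustrate, a minimal sketch of reading a COG directly from object storage via GDAL's virtual file systems (the bucket, key and window are made up; /vsis3/ requires a GDAL build with curl support):

```python
from osgeo import gdal

# Public buckets can be read without credentials (GDAL >= 2.3).
gdal.SetConfigOption("AWS_NO_SIGN_REQUEST", "YES")

# Open a hypothetical COG in S3 without downloading the whole file.
ds = gdal.Open("/vsis3/example-bucket/ndvi/2017-06-01.tif")
band = ds.GetRasterBand(1)

# Only the blocks covering this window are fetched, via HTTP range requests.
window = band.ReadAsArray(xoff=0, yoff=0, win_xsize=256, win_ysize=256)
print(ds.RasterXSize, ds.RasterYSize, window.shape)
```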

@m-mohr m-mohr added the result access label and removed the processes label Jun 27, 2018
@m-mohr m-mohr modified the milestones: v0.3, v1.0 Jul 5, 2018
@m-mohr m-mohr modified the milestones: v1.0, v0.5 Sep 12, 2018
@m-mohr m-mohr modified the milestones: v0.5, v1.0 Dec 6, 2018
pramitghosh commented
Yes, I agree with @huhabla that storing additional info on the GeoTIFFs in plaintext is a good idea.

As a side note, I am using a CSV file (later transformed to a JSON array) for storing information such as timestamps, band names and (relative) paths of GeoTIFFs organized in directories (as seen here: https://github.com/Open-EO/openeo-r-udf/blob/master/data/example_udf_in/legend.csv). Maybe something like this could be adopted for the generic problem of time-series data in openEO.

As this look-up table (a "legend" file) contains the relative local paths, it might be useful for transferring time-series data selectively (say, only the GeoTIFFs that are required) once the set of files is made available over HTTP through some endpoint.
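As a rough illustration of the idea, such a legend could be as simple as the following (the column names and values here are made up, not copied from the linked file):

```csv
timestamp,band,path
2017-06-01T10:30:00Z,B04,t0/B04.tif
2017-06-01T10:30:00Z,B08,t0/B08.tif
2017-06-11T10:30:00Z,B04,t1/B04.tif
2017-06-11T10:30:00Z,B08,t1/B08.tif
```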

Might be useful for Open-EO/openeo-udf#9

m-mohr (Member) commented Aug 21, 2019

I'm not sure whether we can really assume that every back-end supports such a format; currently it lacks back-end support entirely. We may make it for 1.0, but maybe this is something that's out of scope for now. The simplest approaches would probably be netCDF or a STAC catalog containing COGs, if we want to go for it soon. I'd like some feedback from the back-ends on what they think about this.

Edit, 21 days later: I still think a common basis to agree on could be a STAC catalog which contains cloud-optimized GeoTIFFs for raster data and another format for vector data. The summaries field in the collections could be used to describe and expose chunks of data, so that you don't need to download all the data. Maybe there's even a better way to index the data, which we could port back to STAC. All of that would probably be more a best practice than a part of the core openEO API.
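An abridged, hypothetical STAC item for one COG in such a catalog might look like this (the values are illustrative only; the media type is the registered one for cloud-optimized GeoTIFFs):

```json
{
  "type": "Feature",
  "id": "ndvi-2017-06-01",
  "bbox": [7.0, 51.0, 7.5, 51.5],
  "properties": { "datetime": "2017-06-01T10:30:00Z" },
  "assets": {
    "ndvi": {
      "href": "https://storage.example.com/ndvi/2017-06-01.tif",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized"
    }
  }
}
```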

m-mohr (Member) commented Sep 13, 2019

Somewhat related issue: Open-EO/openeo-processes#2

mkadunc (Member) commented Sep 13, 2019

When we decide how to encode data cubes in JSON, we could maybe store several 2D TIFFs and a single files-cube.json, where files-cube.json is a cube with the variable 'fileName' and all non-spatial dimensions. A 4D cube [x, y, t, b] would be saved as (see the sketch after the list):

  • t * b GeoTIFF files, each containing a single 2D [x, y] slice for one combination of t and b
  • a single data cube with shape [t, b] containing the file names (or URLs) of the above slices
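A hypothetical files-cube.json for a tiny cube with two times and two bands could then look like this (structure and field names are illustrative, not a proposed spec):

```json
{
  "dimensions": {
    "t": ["2017-06-01", "2017-06-11"],
    "b": ["B04", "B08"]
  },
  "variable": "fileName",
  "data": [
    ["t0_B04.tif", "t0_B08.tif"],
    ["t1_B04.tif", "t1_B08.tif"]
  ]
}
```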

Another option for cloud-optimized vector data might be Tapalcatl; for rasters, especially when used from Python, Zarr seems to be the dominant option right now, and it should interface nicely with xarray...
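For illustration, a minimal sketch of round-tripping a small [t, b, y, x] cube through Zarr with xarray (dimension names and the store path are made up; assumes the xarray, zarr and dask packages):

```python
import numpy as np
import xarray as xr

# A tiny hypothetical [t, b, y, x] cube.
cube = xr.DataArray(
    np.random.rand(2, 2, 256, 256).astype("float32"),
    dims=("t", "b", "y", "x"),
    coords={"t": ["2017-06-01", "2017-06-11"], "b": ["B04", "B08"]},
    name="reflectance",
)

# Write as a chunked Zarr store; each chunk becomes one object, so
# clients can fetch single [y, x] slices instead of the whole cube.
cube.chunk({"t": 1, "b": 1}).to_dataset().to_zarr("cube.zarr", mode="w")

# Lazily reopen and read back one time/band slice.
reopened = xr.open_zarr("cube.zarr")["reflectance"]
print(reopened.sel(t="2017-06-01", b="B04").values.shape)  # (256, 256)
```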

m-mohr (Member) commented Jul 20, 2020

This doesn't seem to be an API issue, but rather something that should be tackled in a separate specification. Overall, this is a new file format, and such formats should be kept separate from the API itself, although we may give recommendations or best practices here.

@m-mohr m-mohr closed this as completed Jul 20, 2020