Comparison of data ingest using Airflow vs Legacy (Step Functions) #7

slesaad opened this issue Mar 1, 2023 · 1 comment
slesaad commented Mar 1, 2023

Description

To validate ingests done with [1] the new Airflow-based pipeline, this issue runs the same ingestion through both [1] and [2] the legacy Step Functions-based pipeline.

Each ingest is initiated through the veda-stac-ingestor API endpoint /dataset/publish. The inputs are the same for both pipelines except for the collection id, as shown below:

For [2], the input was:

{
  "collection": "lis-global-da-tws-trend",
  "title": "Terrestrial Water Storage Trend - LIS 10km Global DA",
  "description": "Gridded trend in terrestrial water storage (theil-sen slope estimation in mm yr-1) from 10km global LIS with assimilation",
  "license": "CC0-1.0",
  "is_periodic": false,
  "time_density": null,
  "spatial_extent": {
    "xmin": -179.95,
    "ymin": -59.45,
    "xmax": 179.95,
    "ymax": 83.55
  },
  "temporal_extent": {
    "startdate": "2003-01-01T00:00:00Z",
    "enddate": "2021-12-31T23:59:59Z"
  },
  "sample_files": [
    "s3://veda-data-store-staging/EIS/COG/LIS_GLOBAL_DA/DA_Trends/DATWS_STL_based_trend.cog.tif"
  ],
  "discovery_items": [
    {
      "collection": "lis-global-da-tws-trend",
      "discovery": "s3",
      "cogify": false,
      "upload": false,
      "dry_run": false,
      "prefix": "EIS/COG/LIS_GLOBAL_DA/DA_Trends/",
      "bucket": "veda-data-store-staging",
      "filename_regex": "(.*)DATWS_STL_based_trend.cog.tif$",
      "start_datetime": "2003-01-01T00:00:00Z",
      "end_datetime": "2021-12-31T23:59:59Z"
    }
  ]
}

Similarly, for [1], the input was:

{
  "collection": "lis-global-da-tws-trend-airflow",
  "title": "Terrestrial Water Storage Trend - LIS 10km Global DA",
  "description": "Gridded trend in terrestrial water storage (theil-sen slope estimation in mm yr-1) from 10km global LIS with assimilation",
  "license": "CC0-1.0",
  "is_periodic": false,
  "time_density": null,
  "spatial_extent": {
    "xmin": -179.95,
    "ymin": -59.45,
    "xmax": 179.95,
    "ymax": 83.55
  },
  "temporal_extent": {
    "startdate": "2003-01-01T00:00:00Z",
    "enddate": "2021-12-31T23:59:59Z"
  },
  "sample_files": [
    "s3://veda-data-store-staging/EIS/COG/LIS_GLOBAL_DA/DA_Trends/DATWS_STL_based_trend.cog.tif"
  ],
  "discovery_items": [
    {
      "collection": "lis-global-da-tws-trend-airflow",
      "discovery": "s3",
      "cogify": false,
      "upload": false,
      "dry_run": false,
      "prefix": "EIS/COG/LIS_GLOBAL_DA/DA_Trends/",
      "bucket": "veda-data-store-staging",
      "filename_regex": "(.*)DATWS_STL_based_trend.cog.tif$",
      "start_datetime": "2003-01-01T00:00:00Z",
      "end_datetime": "2021-12-31T23:59:59Z"
    }
  ]
}
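Since the two inputs differ only in the collection id, they can be generated from one shared template. The following is a minimal sketch, not the tooling actually used for this issue. It assumes the discovery item's collection is meant to match the top-level collection id. `INGESTOR_URL` is a placeholder (the real veda-stac-ingestor host is not given here), `make_payload` and `build_publish_request` are hypothetical helper names, and authentication is not handled.

```python
import copy
import json
import urllib.request

# Placeholder host: the real veda-stac-ingestor URL is not stated in this issue.
INGESTOR_URL = "https://ingestor.example.com"

# Fields shared by both runs, copied from the inputs above.
SHARED_FIELDS = {
    "title": "Terrestrial Water Storage Trend - LIS 10km Global DA",
    "description": "Gridded trend in terrestrial water storage (theil-sen slope "
    "estimation in mm yr-1) from 10km global LIS with assimilation",
    "license": "CC0-1.0",
    "is_periodic": False,
    "time_density": None,
    "spatial_extent": {"xmin": -179.95, "ymin": -59.45, "xmax": 179.95, "ymax": 83.55},
    "temporal_extent": {
        "startdate": "2003-01-01T00:00:00Z",
        "enddate": "2021-12-31T23:59:59Z",
    },
    "sample_files": [
        "s3://veda-data-store-staging/EIS/COG/LIS_GLOBAL_DA/DA_Trends/DATWS_STL_based_trend.cog.tif"
    ],
    "discovery_items": [
        {
            "discovery": "s3",
            "cogify": False,
            "upload": False,
            "dry_run": False,
            "prefix": "EIS/COG/LIS_GLOBAL_DA/DA_Trends/",
            "bucket": "veda-data-store-staging",
            "filename_regex": "(.*)DATWS_STL_based_trend.cog.tif$",
            "start_datetime": "2003-01-01T00:00:00Z",
            "end_datetime": "2021-12-31T23:59:59Z",
        }
    ],
}


def make_payload(collection_id: str) -> dict:
    """Return the publish payload with the collection id filled in everywhere."""
    payload = copy.deepcopy(SHARED_FIELDS)
    payload["collection"] = collection_id
    for item in payload["discovery_items"]:
        # Assumption: discovery items target the same collection as the dataset.
        item["collection"] = collection_id
    return payload


def build_publish_request(collection_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to /dataset/publish."""
    body = json.dumps(make_payload(collection_id)).encode("utf-8")
    return urllib.request.Request(
        f"{INGESTOR_URL}/dataset/publish",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Calling `build_publish_request("lis-global-da-tws-trend")` and `build_publish_request("lis-global-da-tws-trend-airflow")` then gives two requests whose bodies differ only in the collection id. Sending them would be a separate `urllib.request.urlopen(...)` call with whatever credentials the API requires.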

After both ingestion runs finished, the STAC records from each were compared. They are shown below:

Collection

[1]

{
  "id": "lis-global-da-tws-trend-airflow",
  "type": "Collection",
  "links": [
    {
      "rel": "items",
      "type": "application/geo+json",
      "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend-airflow/items"
    },
    {
      "rel": "parent",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/"
    },
    {
      "rel": "root",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/"
    },
    {
      "rel": "self",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend-airflow"
    }
  ],
  "title": "Terrestrial Water Storage Trend - LIS 10km Global DA",
  "assets": null,
  "extent": {
    "spatial": {
      "bbox": [
        [
          -179.9500000157243, -59.98224871364589, 179.9973980503783,
          89.9999999874719
        ]
      ]
    },
    "temporal": {
      "interval": [["2003-01-01 00:00:00+00", "2003-01-01 00:00:00+00"]]
    }
  },
  "license": "CC0-1.0",
  "keywords": null,
  "providers": null,
  "summaries": {
    "datetime": ["2003-01-01T00:00:00Z"],
    "cog_default": {
      "max": 101.29833221435547,
      "min": -555
    }
  },
  "description": "Gridded trend in terrestrial water storage (theil-sen slope estimation in mm yr-1) from 10km global LIS with assimilation",
  "item_assets": {
    "cog_default": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": ["data", "layer"],
      "title": "Default COG Layer",
      "description": "Cloud optimized default layer to display on map"
    }
  },
  "stac_version": "1.0.0",
  "stac_extensions": null,
  "dashboard:is_periodic": false,
  "dashboard:time_density": null
}

[2]

{
  "id": "lis-global-da-tws-trend",
  "type": "Collection",
  "links": [
    {
      "rel": "items",
      "type": "application/geo+json",
      "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend/items"
    },
    {
      "rel": "parent",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/"
    },
    {
      "rel": "root",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/"
    },
    {
      "rel": "self",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend"
    }
  ],
  "title": "Terrestrial Water Storage Trend - LIS 10km Global DA",
  "assets": null,
  "extent": {
    "spatial": {
      "bbox": [
        [
          -179.9500000157243, -59.98224871364589, 179.9973980503783,
          89.9999999874719
        ]
      ]
    },
    "temporal": {
      "interval": [["2003-01-01 00:00:00+00", "2003-01-01 00:00:00+00"]]
    }
  },
  "license": "CC0-1.0",
  "keywords": null,
  "providers": null,
  "summaries": {
    "datetime": ["2003-01-01T00:00:00Z"],
    "cog_default": {
      "max": 101.29833221435547,
      "min": -555
    }
  },
  "description": "Gridded trend in terrestrial water storage (theil-sen slope estimation in mm yr-1) from 10km global LIS with assimilation",
  "item_assets": {
    "cog_default": {
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "roles": ["data", "layer"],
      "title": "Default COG Layer",
      "description": "Cloud optimized default layer to display on map"
    }
  },
  "stac_version": "1.0.0",
  "stac_extensions": null,
  "dashboard:is_periodic": false,
  "dashboard:time_density": null
}

Items

[1]

{
  "type": "FeatureCollection",
  "context": {
    "limit": 10,
    "matched": 0,
    "returned": 1
  },
  "features": [
    {
      "id": "DATWS_STL_based_trend.cog",
      "bbox": [
        -179.9500000157243, -59.98224871364589, 179.9973980503783,
        89.9999999874719
      ],
      "type": "Feature",
      "links": [
        {
          "rel": "collection",
          "type": "application/json",
          "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend-airflow"
        },
        {
          "rel": "parent",
          "type": "application/json",
          "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend-airflow"
        },
        {
          "rel": "root",
          "type": "application/json",
          "href": "https://dev-stac.delta-backend.com/"
        },
        {
          "rel": "self",
          "type": "application/geo+json",
          "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend-airflow/items/DATWS_STL_based_trend.cog"
        }
      ],
      "assets": {
        "cog_default": {
          "href": "s3://veda-data-store-staging/EIS/COG/LIS_GLOBAL_DA/DA_Trends/DATWS_STL_based_trend.cog.tif",
          "type": "image/tiff; application=geotiff; profile=cloud-optimized",
          "roles": ["data", "layer"],
          "title": "Default COG Layer",
          "description": "Cloud optimized default layer to display on map",
          "raster:bands": [
            {
              "scale": 1.0,
              "nodata": 0.0,
              "offset": 0.0,
              "sampling": "area",
              "data_type": "float64",
              "histogram": {
                "max": 101.29833221435547,
                "min": -555.0,
                "count": 11.0,
                "buckets": [
                  7843.0, 0.0, 2.0, 13.0, 24.0, 77.0, 353.0, 1228.0, 118651.0,
                  9.0
                ]
              },
              "statistics": {
                "mean": -36.01088186359726,
                "stddev": 133.02156258224915,
                "maximum": 101.29833221435547,
                "minimum": -555.0,
                "valid_percent": 29.319745316159253
              }
            }
          ]
        }
      },
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [-179.9500000157243, -59.98224871364589],
            [179.9973980503783, -59.98224871364589],
            [179.9973980503783, 89.9999999874719],
            [-179.9500000157243, 89.9999999874719],
            [-179.9500000157243, -59.98224871364589]
          ]
        ]
      },
      "collection": "lis-global-da-tws-trend-airflow",
      "properties": {
        "proj:bbox": [
          -179.9500000157243, -59.98224871364589, 179.9973980503783,
          89.9999999874719
        ],
        "proj:epsg": 4326.0,
        "proj:shape": [1500.0, 3600.0],
        "end_datetime": "2021-12-31T23:59:59+00:00",
        "proj:geometry": {
          "type": "Polygon",
          "coordinates": [
            [
              [-179.9500000157243, -59.98224871364589],
              [179.9973980503783, -59.98224871364589],
              [179.9973980503783, 89.9999999874719],
              [-179.9500000157243, 89.9999999874719],
              [-179.9500000157243, -59.98224871364589]
            ]
          ]
        },
        "proj:transform": [
          0.09998538835169517, 0.0, -179.9500000157243, 0.0,
          -0.09998816580074518, 89.9999999874719, 0.0, 0.0, 1.0
        ],
        "start_datetime": "2003-01-01T00:00:00+00:00"
      },
      "stac_version": "1.0.0",
      "stac_extensions": [
        "https://stac-extensions.github.io/projection/v1.0.0/schema.json",
        "https://stac-extensions.github.io/raster/v1.1.0/schema.json"
      ]
    }
  ],
  "links": [
    {
      "rel": "items",
      "type": "application/geo+json",
      "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend-airflow/items"
    },
    {
      "rel": "parent",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/"
    },
    {
      "rel": "root",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/"
    },
    {
      "rel": "self",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend-airflow"
    }
  ]
}

[2]

{
  "type": "FeatureCollection",
  "context": {
    "limit": 10,
    "matched": 1,
    "returned": 1
  },
  "features": [
    {
      "id": "DATWS_STL_based_trend.cog",
      "bbox": [
        -179.9500000157243, -59.98224871364589, 179.9973980503783,
        89.9999999874719
      ],
      "type": "Feature",
      "links": [
        {
          "rel": "collection",
          "type": "application/json",
          "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend"
        },
        {
          "rel": "parent",
          "type": "application/json",
          "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend"
        },
        {
          "rel": "root",
          "type": "application/json",
          "href": "https://dev-stac.delta-backend.com/"
        },
        {
          "rel": "self",
          "type": "application/geo+json",
          "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend/items/DATWS_STL_based_trend.cog"
        }
      ],
      "assets": {
        "cog_default": {
          "href": "s3://veda-data-store-staging/EIS/COG/LIS_GLOBAL_DA/DA_Trends/DATWS_STL_based_trend.cog.tif",
          "type": "image/tiff; application=geotiff; profile=cloud-optimized",
          "roles": ["data", "layer"],
          "title": "Default COG Layer",
          "description": "Cloud optimized default layer to display on map",
          "raster:bands": [
            {
              "scale": 1.0,
              "nodata": 0.0,
              "offset": 0.0,
              "sampling": "area",
              "data_type": "float64",
              "histogram": {
                "max": 101.29833221435547,
                "min": -555.0,
                "count": 11.0,
                "buckets": [
                  7843.0, 0.0, 2.0, 13.0, 24.0, 77.0, 353.0, 1228.0, 118651.0,
                  9.0
                ]
              },
              "statistics": {
                "mean": -36.01088186359726,
                "stddev": 133.02156258224915,
                "maximum": 101.29833221435547,
                "minimum": -555.0,
                "valid_percent": 29.319745316159253
              }
            }
          ]
        }
      },
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [-179.9500000157243, -59.98224871364589],
            [179.9973980503783, -59.98224871364589],
            [179.9973980503783, 89.9999999874719],
            [-179.9500000157243, 89.9999999874719],
            [-179.9500000157243, -59.98224871364589]
          ]
        ]
      },
      "collection": "lis-global-da-tws-trend",
      "properties": {
        "proj:bbox": [
          -179.9500000157243, -59.98224871364589, 179.9973980503783,
          89.9999999874719
        ],
        "proj:epsg": 4326.0,
        "proj:shape": [1500.0, 3600.0],
        "end_datetime": "2021-12-31T23:59:59+00:00",
        "proj:geometry": {
          "type": "Polygon",
          "coordinates": [
            [
              [-179.9500000157243, -59.98224871364589],
              [179.9973980503783, -59.98224871364589],
              [179.9973980503783, 89.9999999874719],
              [-179.9500000157243, 89.9999999874719],
              [-179.9500000157243, -59.98224871364589]
            ]
          ]
        },
        "proj:transform": [
          0.09998538835169517, 0.0, -179.9500000157243, 0.0,
          -0.09998816580074518, 89.9999999874719, 0.0, 0.0, 1.0
        ],
        "start_datetime": "2003-01-01T00:00:00+00:00"
      },
      "stac_version": "1.0.0",
      "stac_extensions": [
        "https://stac-extensions.github.io/projection/v1.0.0/schema.json",
        "https://stac-extensions.github.io/raster/v1.1.0/schema.json"
      ]
    }
  ],
  "links": [
    {
      "rel": "items",
      "type": "application/geo+json",
      "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend/items"
    },
    {
      "rel": "parent",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/"
    },
    {
      "rel": "root",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/"
    },
    {
      "rel": "self",
      "type": "application/json",
      "href": "https://dev-stac.delta-backend.com/collections/lis-global-da-tws-trend"
    }
  ]
}

Comparison

On comparison, the STAC records from both systems [1] and [2] are identical apart from the collection id and the link hrefs derived from it.

Note: I did notice one discrepancy: context["matched"] is 0 for the Airflow ingestion even though one item is returned. That value is auto-generated and not caused by an ingestion fault, right @anayeaye?
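A comparison like this can also be done programmatically rather than by eye. The following is a minimal sketch, not the method actually used here: `diff_paths` and `normalize` are hypothetical helpers, and the two records are heavily trimmed versions of the items responses above. Each record has its own collection id replaced by a placeholder before diffing, so that only unexpected differences remain.

```python
import json


def diff_paths(a, b, path=""):
    """Return the paths at which two JSON-like values differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        diffs = []
        for key in sorted(set(a) | set(b)):
            child = f"{path}.{key}" if path else key
            if key not in a or key not in b:
                diffs.append(child)
            else:
                diffs.extend(diff_paths(a[key], b[key], child))
        return diffs
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            return [path]
        diffs = []
        for i, (x, y) in enumerate(zip(a, b)):
            diffs.extend(diff_paths(x, y, f"{path}[{i}]"))
        return diffs
    return [] if a == b else [path]


def normalize(record, collection_id, placeholder="<collection>"):
    """Replace every occurrence of the record's collection id with a placeholder."""
    return json.loads(json.dumps(record).replace(collection_id, placeholder))


# Trimmed-down items responses, keeping only a few fields from the records above.
airflow_items = {
    "context": {"limit": 10, "matched": 0, "returned": 1},
    "features": [
        {"id": "DATWS_STL_based_trend.cog", "collection": "lis-global-da-tws-trend-airflow"}
    ],
}
legacy_items = {
    "context": {"limit": 10, "matched": 1, "returned": 1},
    "features": [
        {"id": "DATWS_STL_based_trend.cog", "collection": "lis-global-da-tws-trend"}
    ],
}

remaining = diff_paths(
    normalize(airflow_items, "lis-global-da-tws-trend-airflow"),
    normalize(legacy_items, "lis-global-da-tws-trend"),
)
print(remaining)  # ['context.matched']
```

On these trimmed records the only remaining difference is `context.matched`, which is the discrepancy described in the note.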

PI Objective

https://github.com/NASA-IMPACT/veda-architecture/issues/164

slesaad self-assigned this Mar 1, 2023

anayeaye commented Mar 1, 2023

@slesaad Thanks for posting this comparison. I'm actually not seeing a difference in the number matched for each collection. Which endpoint returned that matched 0 result? I tried both /collections/<collection-id>/items and the /search endpoint (calls below) and see one match for each collection (hopefully I didn't use the wrong collection ids, but I can't see it yet 🙃).

lis-global-da-tws-trend-airflow

curl -X 'GET' \
  'https://dev-stac.delta-backend.com/search?collections=lis-global-da-tws-trend-airflow&limit=10' \
  -H 'accept: application/geo+json' | jq '.context'

{
  "limit": 10,
  "matched": 1,
  "returned": 1
}

lis-global-da-tws-trend

curl -X 'GET' \
  'https://dev-stac.delta-backend.com/search?collections=lis-global-da-tws-trend&limit=10' \
  -H 'accept: application/geo+json' | jq '.context'

{
  "limit": 10,
  "matched": 1,
  "returned": 1
}
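The same check can be scripted for both collections. This is a rough Python equivalent of the curl calls above, using only the standard library; `search_url` and `fetch_context` are hypothetical helper names. It assumes the /search endpoint keeps returning a `context` object as shown.

```python
import json
import urllib.parse
import urllib.request

# STAC API base URL taken from the curl calls above.
STAC_URL = "https://dev-stac.delta-backend.com"


def search_url(collection_id: str, limit: int = 10) -> str:
    """Build the /search URL for one collection."""
    query = urllib.parse.urlencode({"collections": collection_id, "limit": limit})
    return f"{STAC_URL}/search?{query}"


def fetch_context(collection_id: str, limit: int = 10):
    """Fetch the search response and return its "context" object (or None)."""
    request = urllib.request.Request(
        search_url(collection_id, limit),
        headers={"Accept": "application/geo+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("context")
```

For example, `fetch_context("lis-global-da-tws-trend-airflow")` and `fetch_context("lis-global-da-tws-trend")` can be called one after the other to compare the `matched` values.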
