This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

Provide guidance for how agencies should designate Geospatial data #303

Open
gbinal opened this issue May 1, 2014 · 18 comments

Comments

@gbinal
Contributor

gbinal commented May 1, 2014

Since the 'geospatial' filter for data.gov pretty much drives what ends up on geoplatform.gov's data catalog, there's a need to enable agencies to trigger that.

This is a related but distinct issue from representing FGDC metadata in the schema and is more a question of simply how to trigger that flag.

@philipashlock philipashlock added this to the Next Version of Common Core Metadata Schema milestone May 8, 2014
@philipashlock
Contributor

Since the use of the spatial field is already required for all geospatial datasets, Kishore has proposed that any dataset where this field has been populated could be considered geospatial.
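Under that proposal, a dataset entry like the following (a minimal, hypothetical data.json record; only the relevant fields shown, with `spatial` given as a bounding-box string) would be treated as geospatial simply because `spatial` is populated:

```json
{
  "title": "Example Water Monitoring Stations",
  "description": "Hypothetical dataset used only to illustrate the proposal.",
  "spatial": "-124.7,24.5,-66.9,49.4"
}
```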

@torrin47
Contributor

That's a reasonable suggestion, but the specification (and ISO) allows "(4) a geographic feature from the GeoNames database." Would the GeoPlatform handle that gracefully?

The EPA's metadata catalog requires bounding boxes on all records, including non-geo ones, which default to the US extent.

What about adopting a standard keyword to denote geo records?

@mhogeweg
Contributor

Most geospatial datasets will use a bounding box instead of a name to indicate the (spatial) extent of the dataset/service. Since Data.gov harvests the vast majority of its content from existing agency catalogs that already have FGDC/ISO metadata, it would be preferable for the Data.gov catalog to use the existing elements in that metadata to decide whether the data should be considered spatial.

@smrgeoinfo
Contributor

I agree with Marten that looking for a bounding box makes the most sense, but there is a problem when systems default to some regional (US extent) or global (-90, -180, 90, 180) bounding box when one isn't specified. For the NGDS we try to get a keyword 'non-geographic' added to explicitly indicate resources that do not have a geospatial footprint. Without some explicit indication, distinguishing resources that have a meaningful extent from resources that really aren't associated with a specific location is a very difficult problem.
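A rough sketch of the problem being described here, assuming a harvester sees each record's bounding box and keywords. The placeholder extents and the 'non-geographic' keyword convention are illustrative, not an official list:

```python
# Bounding boxes as (west, south, east, north) tuples. Some systems fill
# in a default regional or global extent, so a populated box alone can't
# be trusted. These placeholder values are examples, not a real registry.
PLACEHOLDER_EXTENTS = {
    (-180.0, -90.0, 180.0, 90.0),   # whole-world default
    (-124.7, 24.5, -66.9, 49.4),    # a typical "US extent" default
}

def looks_geospatial(bbox, keywords):
    """Guess whether a record has a meaningful spatial footprint."""
    if bbox is None or "non-geographic" in keywords:
        return False
    # Treat well-known default extents as "no real footprint".
    return tuple(round(c, 1) for c in bbox) not in PLACEHOLDER_EXTENTS

print(looks_geospatial((-111.0, 31.3, -109.0, 37.0), []))   # True
print(looks_geospatial((-180.0, -90.0, 180.0, 90.0), []))   # False
```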

@gbinal
Contributor Author

gbinal commented Jul 17, 2014

It seems like there may not be a clear way of tying a metadata field from Project Open Data to whether FGDC wants to consider a dataset 'geospatial'. The status quo seems to be working, if imperfect, so there may not be a consensus to change this now.

@bsweezy

bsweezy commented Jul 17, 2014

It seems like there is a potential difference between two interpretations of "geospatial" in this conversation: (1) the dataset contains attributes which describe location, (2) the dataset is suitable to be included on geoplatform.gov. It seems to me that (2) is a subset of (1). It would be confusing to try to handle both scenarios with one field, especially since providers may be unfamiliar with what is included on (2).

Ideally, our navigation/aggregator/inventory software could look at the actual data and make a "guess" about its geospatialness.

@gbinal
Contributor Author

gbinal commented Jul 21, 2014

I think that I may have been taking a solution and going in search of a problem. I'm inclined to close this issue for now.

@kvuppala

@gbinal @philipashlock
This is still open, and there is no way of marking a particular dataset as geospatial in the data.json files. Some agencies and community/topic POCs have this question once in a while. How should we approach this?

@webmaven

Is there any documentation or guidance on what distinguishes (2) from (1)?

@mhogeweg
Contributor

I would not go and try to build a reasoning engine that reads the data and then decides whether it's geospatial or not. There are two things imho:

  • if the data.json has a footprint the data set should be discoverable using the spatial filter.
  • if the data has a location attribute it is a spatial data set and it will also have a footprint (the tools agencies use to manage spatial data all generate such a footprint automagically and insert it in their spatial metadata formats already).

@webmaven

Ah, spatial footprint (associated with the set) vs. spatial data (associated with the rows), is a useful distinction.

An edge case occurs to me: if a set consists of multiple files, each of which has its own different footprint (not location), a consolidated file will then have a footprint associated with each row. Would you consider the original unconsolidated set (which probably has a footprint that is the union of the footprints of its contained sets) to just have a footprint, or spatial data as well?

@mhogeweg
Contributor

One can have great discussions on this, but mostly it comes down to doing what's practical and what makes common sense.

After all: What is the extent of a point data set? The points? A convex hull around the points or a bounding box? As long as our monitors and paper are rectangular people typically use a minimum bounding box.

Each state in the US has its own county data sets (an arbitrary example), and then there are national county data sets that combine them all in one. If both types are considered spatial data, I would consider the collection of state county data sets a spatial data set as well.

@CurtTilmes

Bounding boxes break down when you have a swath of data (not global coverage) that happens to cross one or both poles.

@smrgeoinfo
Contributor

Unless you can specify a polar SRS for the bounding box coordinates. It breaks if the bounding boxes have to be WGS84 or Web Mercator.

@torrin47
Contributor

If we refer back to the original post, the context was what should be used to cause records to appear in geoplatform.gov's view of the data.gov listings. In that context, the extent or bounding box is a poor proxy, as many non-spatial datasets could be said to apply to a specific extent, but still lack (as webmaven put it) spatial data associated with the rows.

What about leveraging the "format" field? MIME type is a pretty useless designation in the geospatial world; it communicates very little that will help get most data onto a map. If we were to overload this parameter with a set of valid values that do make sense in the geospatial world, it might help. A good start for the valid value list might be here:
http://www.digitalpreservation.gov/formats/fdd/gis_fdd.shtml
Getting a concise list of geospatial APIs might be more challenging, but not impossible, especially if aiming for an 80/20 solution.
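An 80/20 version of this idea could be as simple as matching the "format" value against a short allow-list. The format names below are a hypothetical starting set drawn from common GIS formats, not an agreed controlled vocabulary:

```python
# Hypothetical allow-list of "format" values that would mark a
# distribution as geospatial; the entries are illustrative only.
GEO_FORMATS = {"shapefile", "geojson", "kml", "kmz", "geotiff", "gml"}

def is_geo_format(fmt):
    """True if a distribution's 'format' value matches the allow-list."""
    return fmt is not None and fmt.strip().lower() in GEO_FORMATS

print(is_geo_format("GeoJSON"))  # True
print(is_geo_format("CSV"))      # False
```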

@torrin47
Contributor

And now I see that's precisely what Marten advocated in this issue:
#272
so if we resolve the format/mediatype question, we might well resolve this issue as well.

@gbinal
Contributor Author

gbinal commented Aug 15, 2014

This has come up some with the FGDC/geoplatform.gov folks, who are the principal audience of this issue: namely, whether there was interest in having a means for datasets to be indicated as candidates for their curation. It's still an ongoing topic, but it looks like what is actually preferred is simply to include a theme of 'geospatial'. The FGDC/geoplatform.gov folks would then be able to take it from there.
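If the theme route is what FGDC prefers, flagging a dataset would presumably look something like this in data.json (a hypothetical entry; only the relevant fields shown, with `theme` carrying the category tag):

```json
{
  "title": "Example Land Cover Dataset",
  "theme": ["geospatial"],
  "spatial": "-124.7,24.5,-66.9,49.4"
}
```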

@philipashlock
Contributor

Our current guidance for this with regard to data.gov is outlined in http://www.digitalgov.gov/resources/how-to-get-your-open-data-on-data-gov/#federal-geospatial-data

Unfortunately this means that agencies need to then manage multiple versions of their data.json file. An alternative approach would be to have a flag in the data.json file that denotes the metadata is available from a preferred source (at least for data.gov's purposes). This might also include the solution for #308.

One issue we've encountered is that for agencies which have dozens or hundreds of disparate geospatial harvest sources, the combined version of that as data.json becomes a rather large and unwieldy file. If we were to use the data.json metadata as the filter for avoiding alternate/preferred duplicate sources then we may also want to ensure that metadata is provided as something more like JSON Lines (though ideally still as valid JSON) so that it can be parsed more easily, especially as these JSON files get into the hundreds of megabytes.
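The parsing win here is that a JSON Lines catalog can be consumed one record at a time instead of loading hundreds of megabytes into memory at once. A minimal sketch, assuming one dataset object per line (the file layout and field names are illustrative):

```python
import io
import json

def iter_datasets(fileobj):
    """Yield one dataset dict per line of a JSON Lines catalog,
    skipping blank lines, without reading the whole file at once."""
    for line in fileobj:
        line = line.strip()
        if line:
            yield json.loads(line)

# Stand-in for a very large catalog file.
catalog = io.StringIO(
    '{"title": "Dataset A", "spatial": "-124.7,24.5,-66.9,49.4"}\n'
    '{"title": "Dataset B"}\n'
)
titles = [d["title"] for d in iter_datasets(catalog)]
print(titles)  # ['Dataset A', 'Dataset B']
```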


10 participants