This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

Clarify ownership and provenance for datasets listed in data.json #296

Closed
philipashlock opened this issue Mar 14, 2014 · 25 comments

Comments

@philipashlock
Contributor

As @JoshData and @lilybradley have mentioned, the data.json from HHS includes data aggregated from State governments as well (here's an example from ny.gov). Does Project Open Data already have a clear requirement that datasets should only be those produced by the agency? If not, should that requirement be better specified?

I'm sure there are a lot of gray areas here where a local government has produced some data in partnership with a federal agency, but if we can provide better guidance on those scenarios (eg the data.json should only include data hosted on a federal .gov) then that would be helpful.

If other sources can be included, we'll want a better method to identify the source of these datasets. If each federal dataset listed the programCode and bureauCode as required it would be easy to filter out those that don't have them, but we'd still want a consistent and detailed way to identify those other sources.
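A minimal sketch of that filter, assuming the catalog is an array of dataset objects whose bureauCode and programCode fields are arrays of strings. The sample entries, code values, and helper names are hypothetical:

```javascript
// Hypothetical sample entries; real data.json records carry many more fields.
const datasets = [
  { title: "Federal dataset", bureauCode: ["009:00"], programCode: ["009:001"] },
  { title: "State dataset aggregated from ny.gov" },
  { title: "Partial entry", bureauCode: ["009:00"], programCode: [] },
];

// True when the field is a non-empty array of codes.
function hasCodes(value) {
  return Array.isArray(value) && value.length > 0;
}

// Keep only entries that declare both codes.
function withFederalCodes(entries) {
  return entries.filter((d) => hasCodes(d.bureauCode) && hasCodes(d.programCode));
}

const federal = withFederalCodes(datasets);
console.log(federal.map((d) => d.title));
```

This only separates entries that carry the codes from those that don't; it says nothing about who actually produced the entries it drops, which is why a consistent way to identify other sources would still be needed.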

@haleyvandyck
Contributor

Interesting point--thanks for raising it. Do we think this is something we could issue guidance on for the Publisher field? or are you suggesting we consider a new field in the schema?

@lilybradley

@philipashlock This is the other example: http://hub.healthdata.gov/dataset/supplemental-nutrition-assistance-program-snap-data-system
@feomike Do you have thoughts from FGDC/GIS world, on a standard for referencing state and local governments (legal entities) in the current schema? Potentially as Publishers as @haleyvandyck mentions above?

@feomike
Contributor

feomike commented Apr 24, 2014

right, great question. the short answer is no, a standard nomenclature/vocabulary for state/local government does not exist. there isn't really an authority to maintain such a lexicon (e.g. dept of ed would have lists of school districts, HHS would have local health agencies, etc., but there is not a comprehensive list). i can't even think of an association which would have the current understanding (perhaps american planning association?). all this being said, i can see this as an outreach opportunity if someone wants to dive in. strategy might be (a) just use Census geography as the nomenclature (eg Census Places as local cities, Census counties as all counties, etc.). this method is not comprehensive (it would miss some local tax areas, councils of governments, etc.), but it is a good start; (b) assign the task to FGDC to form a committee and make such a thing (this would be great for a standards body to take on, and FGDC could be one to ask; there could be others, but FGDC is a federal entity and i assume OMB might be able to direct them to do something); (c) ask an association for some in-kind help. in the geo world, NSGIC (National States Geographic Information Council) could be one, the american planning association could be one, might be some others. the short of it is, there isn't one today and building it is likely not a small task, depending on exactly how deep we want to go. sorry.

@gbinal
Contributor

gbinal commented May 5, 2014

@haleyvandyck - to your question:

Do we think this is something we could issue guidance on for the Publisher field? or are you suggesting we consider a new field in the schema?

I'd suggest the former. I think the schema is robust enough at the moment, so the next stage would be to outline a proposal for guidance on this topic.

@rebeccawilliams
Contributor

I am adding notes from Thursday's Common-Core Metadata Schema Review (see #325) where the group agreed there was value in accommodating more information about hierarchy (as well as primary source guidance) in the Publisher field.

The discussion noted that some Publisher information is less likely to change (e.g. Bureaus) and that due to potential Publisher changes flexibility should be favored. The discussion also noted that hierarchical information might be best collected with additional designated fields.

The group outlined three possible resolutions:

    1. Including guidance that called for additional hierarchical information in the Publisher field to be separated by commas (a string, but still human-readable)
    2. Collecting additional Publisher information through additional fields (as an array)
    3. A combination of both, with optional fields for additional information

@philipashlock philipashlock added this to the Next Version of Common Core Metadata Schema (1.0 -> 1.1.) milestone Jul 24, 2014
@gbinal
Contributor

gbinal commented Jul 31, 2014

One option for clarifying guidance would be to provide very specific guidance about what to put in the string, including a clear vision for how to provide multiple levels of publisher, separated by commas.

I'll take a pass at this.

@philipashlock
Contributor Author

If we were to align the publisher field to be compatible with DCAT, it's actually defined there in a way that allows this additional structure. In DCAT this property uses the foaf:Agent class, which includes an Organization class with properties like unitOf and subOrganizationOf; see http://www.w3.org/TR/vocab-org/

@gbinal
Contributor

gbinal commented Aug 13, 2014

Here's a suggestion for updated Usage Notes for the publisher field:

The plaintext name of the entity publishing this dataset. Where greater specificity is desired, include as many levels of publisher as is useful, in ascending order, separated by commas. For instance: team, office, bureau, agency, department.

The example can be updated to be:

{"publisher":"U.S. Department of Education"} or {"publisher":"Office of Duty Operations, International Trade Administration, U.S. Department of Commerce"}
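For consumers, reading such a string back into levels could look roughly like the sketch below. It assumes no organization name itself contains a comma, which the proposed usage notes would not guarantee; the function name is hypothetical:

```javascript
// Split a comma-separated publisher string into levels, most specific first.
// Assumes no organization name itself contains a comma.
function publisherLevels(publisher) {
  return publisher
    .split(",")
    .map((level) => level.trim())
    .filter((level) => level.length > 0);
}

const levels = publisherLevels(
  "Office of Duty Operations, International Trade Administration, U.S. Department of Commerce"
);
console.log(levels);
```

A single-level value such as "U.S. Department of Education" comes back as a one-element array, so both example forms go through the same code path.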

@gbinal
Contributor

gbinal commented Aug 13, 2014

Ah - sorry for missing your comment, Phil. I'd assumed that it had to stay a string. What would my above example (Office of Duty Operations) look like in that case?

@jpmckinney
Contributor

@gbinal Having a longer string for publisher is allowed by DCAT, since it uses the old Dublin Core publisher term which has no specific range. However, as @philipashlock points out, it's probably more maintainable to have the value of publisher be a foaf:Agent. That way, we don't end up in a situation with dozens of variations on the same publisher, like:

  • Office of Duty Operations
  • Office of Duty Operations, International Trade Administration
  • Office of Duty Operations, U.S. Department of Commerce
  • Office of Duty Operations, International Trade Administration, U.S. Department of Commerce

In fully expanded form, it can look like:

{
  "publisher": {
    "name": "Office of Duty Operations",
    "subOrganizationOf": {
      "name": "International Trade Administration",
      "subOrganizationOf": {
        "name": "U.S. Department of Commerce"
      }
    }
  }
}

However, you would probably never do that, as it has the same issue as the string version: there can be inconsistency in how the publisher value is serialized. Ideally, you would abbreviate to a URL:

{
  "publisher": "http://example.org/publishers/1.json"
}

Or include a name but use a URL for subOrganizationOf:

{
  "publisher": {
    "name": "Office of Duty Operations",
    "subOrganizationOf": "http://example.org/publishers/2.json"
  }
}
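A consumer that wants to accept all three forms could walk the value along these lines. This is only a sketch: the function name and the shape of the returned links are hypothetical, and it does not dereference URLs, it just records where the inline information stops:

```javascript
// Walk a publisher value and return its chain, most specific first.
// Each link is either a name (from an inline object) or a URL string
// that a consumer would have to dereference separately.
function publisherChain(publisher) {
  const chain = [];
  let current = publisher;
  while (current) {
    if (typeof current === "string") {
      chain.push({ url: current });
      break; // nothing more is known without fetching the URL
    }
    chain.push({ name: current.name });
    current = current.subOrganizationOf;
  }
  return chain;
}

const inlineChain = publisherChain({
  name: "Office of Duty Operations",
  subOrganizationOf: {
    name: "International Trade Administration",
    subOrganizationOf: { name: "U.S. Department of Commerce" },
  },
});

const mixedChain = publisherChain({
  name: "Office of Duty Operations",
  subOrganizationOf: "http://example.org/publishers/2.json",
});

console.log(inlineChain, mixedChain);
```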

@jpmckinney
Contributor

@feomike The UK did your option (b) and produced the RDF that powers this visualization: http://data.gov.uk/organogram/cabinet-office

For option (a), I'd look at OpenCivicData's Division Identifiers, which establish stable identifiers for geographies. Examples:

  • ocd-division/country:us
  • ocd-division/country:us/state:nc
  • ocd-division/country:us/state:nc/county:wake
  • ocd-division/country:us/state:nc/place:cary
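As a rough illustration of why these are easy to work with, here is a sketch that splits an identifier into its segments and derives the parent by dropping the last one. It assumes every segment has the type:id shape seen in the examples above; the function names are hypothetical:

```javascript
// Split an identifier like "ocd-division/country:us/state:nc" into
// { type, id } pairs. Assumes every segment is of the form "type:id".
function parseDivisionId(id) {
  const [prefix, ...segments] = id.split("/");
  if (prefix !== "ocd-division") {
    throw new Error("not an ocd-division identifier: " + id);
  }
  return segments.map((segment) => {
    const separator = segment.indexOf(":");
    return { type: segment.slice(0, separator), id: segment.slice(separator + 1) };
  });
}

// The parent is the same identifier minus its last segment;
// the country level has no parent in this sketch.
function parentDivisionId(id) {
  const parts = id.split("/");
  return parts.length > 2 ? parts.slice(0, -1).join("/") : null;
}

const wake = "ocd-division/country:us/state:nc/county:wake";
console.log(parseDivisionId(wake));
console.log(parentDivisionId(wake));
```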

@lilybradley

For a Linked Data route, I think there would be just two fields in our schema.
"publisher": {
  "name": "Office of Duty of Operations",
  "foaf:homepage"/"skos:exactMatch"/"schema.org:thing:sameAs": "http://www.trade.gov/enforcement/operations/index.asp/"
}
Per schema.org, the "sameAs" field is the "URL of a reference Web page that unambiguously indicates the item's identity. E.g. the URL of the item's Wikipedia page, Freebase page, or official website." This was a choice example because neither a Google search nor the trade.gov site search yields an "Office of Duty of Operations". There is an Antidumping and Countervailing Duty Operations Unit.

Where possible, I would be inclined to refer to the wikipedia page (there is not one for this office of duty ops) for a number of reasons:

  1. Wikipedia tracks name changes, and organizational restructuring in a way that our government websites do not yet.
  2. Wikipedia has a key activity/process called "disambiguation": http://en.wikipedia.org/wiki/Wikipedia:Disambiguation that seems highly leverage-able, so much so that Google bought Metaweb, the company behind Freebase, to compete with DBpedia, a project that publishes Wikipedia's content as structured data.
  3. By referencing the Wikipedia page, I believe you can access DBpedia (or Freebase) and find related ('linked') data about the organization. The Wikipedia page URL acts as the unique identifier.
  4. Another reason I would reference the Wikipedia URL is that it's at the center of this linked data cloud: http://en.wikipedia.org/wiki/Linked_Data#mediaviewer/File:LOD_Cloud_Diagram_as_of_September_2011.png - hence, a higher likelihood of fewer re-directs :)

@jpmckinney
Contributor

@lilybradley I think the question of linking the publisher to some common identifier (like a Wikipedia URL) is a separate question from how to model the organizational hierarchy that the publisher is part of.

For linking, foaf:homepage or schema:sameAs can be used, but the range of skos:exactMatch must be a skos:Concept, so it wouldn't work here. Also, confusingly, schema:sameAs is not the same as owl:sameAs (owl:sameAs needs to point to an RDF representation whereas schema:sameAs need not).

@philipashlock
Contributor Author

Thanks for the background and examples @jpmckinney, it's immensely helpful. For implementers of the Project Open Data schema, I think the fully expanded example you provided is more realistic than pointing to a meaningful URL, but we can also help maintain consistency by enforcing a little validation. I think we'd probably want to require the use of the name property rather than a literal string (DCAT does recommend foaf:Agent), but still discourage the expression of the full hierarchy as a string and instead encourage the use of subOrganizationOf for expressing the org structure.
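To make "a little validation" concrete, here is one possible sketch. The function name and messages are hypothetical, and the comma check is only a heuristic: it would also flag legitimate names that contain a comma, so it is probably better treated as a warning than an error:

```javascript
// Return a list of problems found in a publisher value (empty if none).
// A URL string as subOrganizationOf is accepted and ends the walk.
function checkPublisher(publisher) {
  const problems = [];
  if (publisher === null || typeof publisher !== "object") {
    problems.push("publisher should be an object with a name property");
    return problems;
  }
  let current = publisher;
  let depth = 0;
  while (current && typeof current === "object") {
    if (typeof current.name !== "string" || current.name.trim() === "") {
      problems.push(`level ${depth}: missing name`);
    } else if (current.name.includes(",")) {
      // Heuristic only: a comma may signal a hierarchy packed into one string.
      problems.push(`level ${depth}: name contains a comma; consider subOrganizationOf`);
    }
    current = current.subOrganizationOf;
    depth += 1;
  }
  return problems;
}

console.log(checkPublisher("U.S. Department of Education"));
console.log(checkPublisher({
  name: "Office of Duty Operations",
  subOrganizationOf: { name: "International Trade Administration" },
}));
```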

Another example of identifiers for government organizations is publicbodies.org which might have some overlap with OCD IDs, but is more focused on organizational units rather than geographies. @project-open-data, @opencivicdata, and @unitedstates might also be interested in doing more based on the various internal id mappings like the one from OMB - #341

@gbinal
Contributor

gbinal commented Aug 22, 2014

gbinal added a commit that referenced this issue Sep 4, 2014
Following through on with #296.
@gbinal
Contributor

gbinal commented Sep 4, 2014

This is addressed in 3789c65

rebeccawilliams pushed a commit that referenced this issue Oct 2, 2014
Changes that still need to be addressed are changes in structure and should we add usage notes additions here or no?:

* Adds optional describedByType field at the dataset and distribution level (#291, #332)
* Changes contactPoint field to an object that contains the name (fn) and email address (hasEmail) (#358)
* Adds fn field as part of contactPoint replacing earlier use of contactPoint (#358)
* Changes publisher field to an object that allows multiple levels of organizations (#296)
* Changes accessURL field to represent indirect access and to exist only within distribution (#217, #335) 
* Changes format field to a human readable description and to exist only within distribution (#272, #293)
* Adds optional description field for use within distribution (#248)
* Adds optional title field for use within distribution (#248)
* Changes accrualPeriodicity field to use ISO 8601 date syntax (#292)
* Changes distribution field to become required-if-applicable and to always contain the accessURL or downloadURL fields (#217)
* Changes license field to be a URL (#196)
@philipashlock
Contributor Author

This issue ended up focusing on providing more clarity for dataset provenance using the publisher field - and that's great, but the other thing that should go alongside this is more clear and prominent guidance that the data.json file should not include data published by other entities (like other agencies or governments).

I think there might have been discussion in another issue about some cases where there could be grey areas, like a cross-agency partnership where multiple agencies worked together on something that produced a dataset. I don't have a specific example of that, but if someone has one, perhaps we can test whether there would be justification for publishing the metadata in multiple agencies' data.json files rather than determining a single owner. That said, I think it's pretty clear that datasets published by other governments, like cities and states, should not be included in an agency's data.json. Likewise, if a city or state were to implement the data.json spec, it'd be best if they provided a version that was exclusively limited to their own datasets rather than including datasets aggregated from other levels of government.

@rrmishra

Thank you @philipashlock for pointing me to this thread. @gbinal has posted a link to example json with publisher tag with a nested subOrganization element.

My suggestion: It is easier to parse an array that represents a hierarchical relationship than an arbitrarily nested hierarchical relationship. Therefore, change the schema to represent the hierarchy as an array, as originally suggested by @rebeccawilliams on Jul 21 and @gbinal on Aug 13.

I believe a "model" class with the "publisher" element as an array can be parsed by any JSON parser with a single line of code, whereas a recursively nested subOrganization element will require "special handling".

@philipashlock
Contributor Author

Using a full publisher class provides a lot more information at minimal cost. I think you might be overestimating how much is needed for the "special handling".

Here's a comparison of what's needed to process the two approaches; it's not a huge difference:

Nested objects:

var json = {"publisher":{"name":"Office of Duty Operations","subOrganizationOf":{"name":"International Trade Administration","subOrganizationOf":{"name":"U.S. Department of Commerce"}}}};
var publisher = json.publisher;

// Walk up the hierarchy, printing each name, until no named parent is left.
while (publisher && publisher.name) {
    console.log(publisher.name);
    publisher = publisher.subOrganizationOf;
}

Or one simple array:

var json = {"publisher":["Office of Duty Operations","International Trade Administration","U.S. Department of Commerce"]};
var publisher = json.publisher;

// Each element is one level of the hierarchy, most specific first.
publisher.forEach(function (elem) {
    console.log(elem);
});

@smrgeoinfo
Contributor

There is also the minor detail of information loss in the array encoding approach...

@rrmishra

Thank you @philipashlock for the response with code example showing how to extract the attribute values.
As shown by your example, using the hierarchical array, the "model" class representing the json object is simpler to process. It will also be simple to serialize the "model" object and de-serialize into a "model" object.

Regarding loss of information: the hierarchical array does not have to be an array of strings; it can be an array of objects with any additional attributes for the organization.
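For illustration, converting between the nested form and an array of objects could be sketched as below. The function names are hypothetical, and the array form relies on an unstated convention that position encodes the parent relationship (most specific first); a URL-valued subOrganizationOf is not handled here:

```javascript
// Flatten nested publisher objects into an array, most specific first.
// Extra attributes on each level are kept; only the nesting is removed.
function toArray(publisher) {
  const levels = [];
  let current = publisher;
  while (current && typeof current === "object") {
    const { subOrganizationOf, ...attributes } = current;
    levels.push(attributes);
    current = subOrganizationOf;
  }
  return levels;
}

// Rebuild the nesting from the array, assuming order encodes the hierarchy.
function toNested(levels) {
  return levels.reduceRight(
    (parent, level) => (parent ? { ...level, subOrganizationOf: parent } : { ...level }),
    null
  );
}

const nestedPublisher = {
  name: "Office of Duty Operations",
  subOrganizationOf: {
    name: "International Trade Administration",
    subOrganizationOf: { name: "U.S. Department of Commerce" },
  },
};

const asArray = toArray(nestedPublisher);
console.log(asArray);
console.log(JSON.stringify(toNested(asArray)) === JSON.stringify(nestedPublisher));
```

The round trip only works because both functions agree on what the ordering means; the array itself does not say whether its elements are parent and child or simply several publishers.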

@philipashlock
Contributor Author

@rrmishra The current proposal is based on using FOAF and ORG, so we're not inventing our own new way to represent these terms and relationships. As @jpmckinney suggested, this could be done by referencing an external URI and JSON representation of each organization object but I don't think it's realistic to expect most federal agencies to do that in the near future and it's trivial to include the additional organization objects inline. If an agency did want to represent subOrganizationOf with a URI I think that would be ok too. We can encourage that more over time - especially to help with common identifiers.

It seems like what you're suggesting is to provide an array of organization objects. I think this would deviate from the DCAT standard we're trying to adhere to and based on the example you provided earlier it also wouldn't convey any relationships between the organizations. How do you suggest representing this as one publisher with a relationship to other organizations rather than multiple publishers?

Also, can you provide a real use case where it's non-trivially more burdensome to process nested organization objects rather than an array of organization objects?

@gbinal
Contributor

gbinal commented Nov 3, 2014

FWIW, I'm def. a fan of simplicity in the schema but am convinced of the greater benefit that comes by adhering to norms and standards, in this case, FOAF, ORG, and DCAT.

In addition to the followup questions @philipashlock poses above, I'd also ask whether there's a compelling example of the array setup following data standards better.

@philipashlock
Contributor Author

I think there are a few different issues that came out of this discussion:

  1. Providing more structure to define the publisher as an organization and its context within other organizations
  2. Providing unique identifiers for publishers
  3. Clarifying whether an agency can include data from other agencies, governments, and organizations as part of their Public Data Listing and Enterprise Data Inventory.

I think we've addressed the first issue here, but as I commented before, #296 (comment), I don't think we've fully addressed the third point. This issue isn't really part of the schema, but rather the broader guidance so I've gone ahead and created a new issue for it - #390

We also haven't addressed the second point on unique identifiers for the publisher (other than what's already accomplished by bureauCode and programCode) but we can create a new issue for that as needed.

@gbinal
Contributor

gbinal commented Nov 10, 2014

Sounds great. Thanks a bunch for parsing these issues. I'll go ahead and close this issue in the meantime.


9 participants