Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arbitrary saving location of child catalogs #1116

Closed
constantinius opened this issue May 9, 2023 · 10 comments · Fixed by #1117 or #1155
Closed

Arbitrary saving location of child catalogs #1116

constantinius opened this issue May 9, 2023 · 10 comments · Fixed by #1117 or #1155
Assignees
Milestone

Comments

@constantinius
Copy link

It is hard to control the actual storage location of child catalogs when they are child of more than one catalog. Consider the following example:

import pystac

root = pystac.Catalog("root", "root")
left = pystac.Catalog("left", "left")
right = pystac.Catalog("right", "right")

root.add_child(left)
root.add_child(right)

child = pystac.Catalog("child", "child")

left.add_child(child)
right.add_child(child)

root.normalize_and_save("out_test", pystac.CatalogType.ABSOLUTE_PUBLISHED)

Whether the "child" catalog is saved under left/child or right/child depends on the order "left" and "right" are added to "root" (right/child in this case, as "right" was added after "left").


Additionally, when I change the order of the catalog addition:

root.add_child(right)
root.add_child(left)

# ...

left.add_child(child)
right.add_child(child)

Then there is no parent link from "child" to "right" at all, even when this should be the case as the add_child on "right" was called last:

{
  "type": "Catalog",
  "id": "child",
  "stac_version": "1.0.0",
  "description": "child",
  "links": [
    {
      "rel": "root",
      "href": "/out_test/catalog.json",
      "type": "application/json"
    },
    {
      "rel": "parent",
      "href": "/out_test/left/catalog.json",
      "type": "application/json"
    },
    {
      "rel": "self",
      "href": "/out_test/left/child/catalog.json",
      "type": "application/json"
    }
  ],
  "stac_extensions": []
}
@m-mohr
Copy link
Contributor

m-mohr commented May 9, 2023

I'd agree that it should be possible to allow that multiple entities can refer to a catalog. So something like
Root <-> A <-> Z
Root <-> B -> Z
should be possible. Of course, Z should only have one parent (here: A) and I assume add_child adds this parent by default. So I'd propose that add_child gets a parameter where you can disable this. Something like add_child(catalog, add_parent_to_child = False) (default: True) or so.

@pjhartzell pjhartzell added this to the 1.8 milestone May 9, 2023
@m-mohr
Copy link
Contributor

m-mohr commented May 11, 2023

Related: radiantearth/stac-spec#1230

@m-mohr
Copy link
Contributor

m-mohr commented May 11, 2023

PR is at #1117

@m-mohr
Copy link
Contributor

m-mohr commented May 11, 2023

Sorry, I'm reopening here because #1117 actually doesn't fully solve the issue. It helps with the parent links, but the storage location is still showing unexpected results (i.e. it depends on the order of the root.add_child calls).

More specifically, I'd assume that this code:

import pystac

root = pystac.Catalog("root", "root")
left = pystac.Catalog("left", "left")
right = pystac.Catalog("right", "right")

root.add_child(left)
root.add_child(right)

child = pystac.Catalog("child", "child")

left.add_child(child)
right.add_child(child, keep_parent=True)

root.normalize_and_save("out_test", pystac.CatalogType.ABSOLUTE_PUBLISHED)

Stores the child in left (as the parent is left), but it stores it in right. I assume the serialization goes from "root" to the items and whichever catalog lists the child catalog first, in that location it is stored.

@m-mohr
Copy link
Contributor

m-mohr commented May 11, 2023

I've looked into the code a bit and it looks difficult ;-) Some findings:

  • resolve_stac_object overrides the parent that we have carefully set before (from left to right)
  • In normalize_href it sets the self links for a single catalog first to left and then to right, which decides the storage location. It doesn't take the parent into account, which I think it should.

I guess both issues need to be solved for this, but I'm not sure yet how to do this.

Edit: I started an attempt in PR #1118. It fixes the issue discussed here, but as a couple of tests fail on my machine also on main it's hard to further work on it as I need to rely on the CI. I'm not sure how to proceed. If someone has an idea, I'd be very happy. :-)

@gadomski
Copy link
Member

left.add_child(child)
right.add_child(child, keep_parent=True)

I'm starting to wonder if we're outrunning the spec here. My preferred solution would be to disallow multiple parents (radiantearth/stac-spec#1230), and then throw an error on the second add_child call (obviously, we'd remove the just-added keep_parent argument of add_child).

@m-mohr
Copy link
Contributor

m-mohr commented May 15, 2023

I don't think we do. Isn't that exactly what browsable encourages (in the API world)? And isn't it also a common usecase do combine multiple data sources into one catalog?

@gadomski
Copy link
Member

It seems strange to me to have a child link in one direction, but no parent link back. That feels very un-browseable. But maybe I'm misunderstanding the use case.

---
title: This seems bad
---
flowchart TD
    A <-- child+parent --> C
    B -- child --> C
Loading

@m-mohr
Copy link
Contributor

m-mohr commented May 15, 2023

Yes, but then we are looking at multiple parents again, see radiantearth/stac-spec#1230.

I think, the example mentioned in radiantearth/stac-spec#1230 (comment) is a very valid use case for a static catalog as it makes the data easier to find by users (imaging hundreds of collections) by providing different access paths. And I thought people are actually encouraged to do this (see Browsable Extension). At least @cholmes always tried to push people towards browsable catalogs and multiple access paths. In an API it's simpler of course, as you can update the parent depending on where you are coming from, but in static it's impossible. And in an API you may also argue that it's strange to have an item with an id have different (parent) links depending on the access path. You won't have one source of truth then. So if you are saying that this is invalid, we'd basically say Browsable should be removed as an extension as this won't work then (assuming we don't want to duplicate data).

By the way, I've seen this approach with links to one entity in more implementations (e.g. FedEO), but the issue never came up until now because they are all APIs and we are obviously the first to try this for static catalogs with PySTAC. The structure for us is set, but maybe PySTAC is not the right tool for us?

@gadomski
Copy link
Member

So if you are saying that this is invalid, we'd basically say Browsable should be removed as an extension as this won't work then (assuming we don't want to duplicate data).

Not saying this is invalid, just saying that it scans strange to me. But I see the use case and it seems reasonable enough.

The structure for us is set, but maybe PySTAC is not the right tool for us?

Nah, it should be simple enough to add a child w/o setting the parent link. I'll take another look at your draft PR and see if I can't untangle things.

jsignell added a commit that referenced this issue Jun 14, 2023
…arent override of href (#1155)

* Don't override parent link when determining hrefs #1116

* Add an attribute to the class to mark if it is an orphan

* Use @m-mohr's suggestion for when to set `self_href`

* Add test based on example

* Change `keep_parent` to `set_parent`

* Add note about self_link to docstrings

---------

Co-authored-by: Matthias Mohr <webmaster@mamo-net.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment