
merge_cubes: Clarify merging procedure #87

Closed
jdries opened this issue Nov 14, 2019 · 21 comments

@jdries
Contributor

jdries commented Nov 14, 2019

In our use case, we want to run two separate operations on a given datacube, and then combine those two results.
To do this, we either need a reducer that can reduce values coming from different cubes, or a process that first 'joins' the cubes into a single datacube with two bands. The latter is probably the most generic solution and is needed anyway.

@m-mohr
Member

m-mohr commented Nov 14, 2019

There's merge_cubes, which has been defined at the VITO sprint. If that's not working properly, we should probably improve that process instead of adding a new one.

@jdries
Contributor Author

jdries commented Nov 14, 2019

Indeed, I found that one as well. Perhaps we should then clarify how it handles bands?
Or is there a reducer that basically leaves the bands untouched? Like
merge([first_cube_band],[second_cube_band]) = [first_cube_band,second_cube_band]

@m-mohr
Member

m-mohr commented Nov 14, 2019

My understanding of the process is that if you pass two datacubes with dimensions x,y,t,b, where one has bands a,b and the other has bands c,d, the result will be a datacube with dimensions x,y,t,b and bands a,b,c,d. The same applies to t etc. If both cubes have a band with the same name, the overlap_resolver must be used so that the values of both bands are reduced to a single band.
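This band-union behavior can be illustrated with a toy Python sketch. The function name and the dict-based cube model are purely illustrative assumptions, not part of the openEO API:

```python
# Toy model: a cube's band dimension as a dict mapping band name -> list of values.
def merge_band_dimension(cube1, cube2, overlap_resolver=None):
    """Union the band dimensions; same-named bands need an overlap_resolver."""
    merged = dict(cube1)
    for band, values in cube2.items():
        if band in merged:
            if overlap_resolver is None:
                raise ValueError(f"band {band!r} exists in both cubes; an overlap_resolver is required")
            # reduce the two overlapping bands to a single band, value by value
            merged[band] = [overlap_resolver(v1, v2) for v1, v2 in zip(merged[band], values)]
        else:
            merged[band] = values
    return merged

a = {"a": [1, 2], "b": [3, 4]}
b = {"c": [5, 6], "d": [7, 8]}
print(sorted(merge_band_dimension(a, b)))  # bands a, b, c, d
```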

@mkadunc You were pushing that definition at the VITO sprint. Do you agree?

@jdries
Contributor Author

jdries commented Nov 14, 2019

That would make sense; adding that to the documentation would solve the issue.

@m-mohr m-mohr changed the title new process: join datacubes merge_cubes: Clarify merging procedure Nov 14, 2019
@m-mohr m-mohr added this to the v1.0 milestone Nov 14, 2019
jdries added a commit to Open-EO/openeo-python-client that referenced this issue Nov 14, 2019
@mkadunc
Member

mkadunc commented Nov 14, 2019

Yes, that was my understanding.

A caveat for Jeroen's scenario: if the two input cubes do not yet have a band dimension, the merge should fail (merge(A(x,y,t), B(x,y,t)) won't work whenever any cell is present in both cubes, at least not without an overlap resolver). To be able to use the merge process on such cubes, one should extend the cubes first:

    merge(
        add_dimension(A(x,y,t), "bands", "a"),
        add_dimension(B(x,y,t), "bands", "c")
    ) // returns cube with dimensions (x, y, t, bands)
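In a toy Python model (cubes as dicts from coordinate tuples to values; the names are illustrative assumptions, not openEO API), add_dimension amounts to appending a new label to every cell's coordinate, which makes previously colliding cells disjoint:

```python
def add_dimension(cube, label):
    """Append a new dimension value (e.g. a band name) to every cell coordinate."""
    return {coord + (label,): value for coord, value in cube.items()}

a = {(0, 0, "t0"): 1.0}
b = {(0, 0, "t0"): 2.0}
# same (x, y, t) cell in both cubes, but disjoint once each gets its own band label
merged = {**add_dimension(a, "a"), **add_dimension(b, "c")}
print(sorted(merged))  # [(0, 0, 't0', 'a'), (0, 0, 't0', 'c')]
```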

@m-mohr
Member

m-mohr commented Nov 14, 2019

@mkadunc Shouldn't merge(A(x,y,t), B(x,y,t)) still work? I would expect that the temporal dimension could be merged similarly. For example, if A has values for 2017 and B for 2018, why should that fail? It seems like a reasonable use case to me. The same goes for x and y, although resolving overlaps in those cases could be problematic.

@mkadunc
Member

mkadunc commented Nov 14, 2019

Shouldn't merge(A(x,y,t), B(x,y,t)) still work?

Yes, of course - in your example there are no overlapping cells, because the two cubes are separated in time.

I was commenting on Jeroen's scenario, in which the two cubes occupy the same space and time, and the introduction of an extra dimension makes them disjoint again.

@m-mohr
Member

m-mohr commented Nov 14, 2019

Thanks, got it.

@mkadunc
Member

mkadunc commented Nov 14, 2019

Also, the existing merge_cubes with a provided overlap_resolver might already be a sufficient answer to Jeroen's original question: merge is a process "that can reduce values coming from different cubes". E.g., if the combining function is combineAB(a::number, b::number):

    cubeA = customProcessingA(collection/x,y,t/);
    cubeB = customProcessingB(collection/x,y,t/);
    cubeR/x,y,t/ = merge_cubes(
        cube1=cubeA,
        cube2=cubeB,
        overlap_resolver = callback({ return combineAB(data[0], data[1]) })
    );

@m-mohr
Member

m-mohr commented Nov 14, 2019

Yes, indeed. That's basically what I meant in #87 (comment). I think the only ToDo here is to improve the documentation a bit.

@jdries
Contributor Author

jdries commented Nov 15, 2019

I agree with the above.
There's also the case where the bands in the two cubes have the same name. I guess that would mean you need an overlap_resolver?
And related to that: is there a way to rename bands?

@m-mohr
Member

m-mohr commented Nov 17, 2019

Yes, you'd need an overlap resolver.

No, there's no way to rename dimension values at the moment, but there's #50 to track it.

@jdries
Contributor Author

jdries commented Nov 21, 2019

Perhaps related: what about merging vector cubes, either through a spatial join or an attribute join?
Should we use a separate process for that, or extend this one?

@m-mohr
Member

m-mohr commented Nov 21, 2019

@jdries At the moment vector cubes are not supported at all. In the current draft for the 1.0 processes there's no vector-cube data type in any of the processes. There were some references to vector cubes in 0.4, but I removed them as we basically had no real use for them yet, and most processes could not handle them anyway. We need to discuss a useful approach later. Related: #68

@jdries
Contributor Author

jdries commented Nov 28, 2019

My first graph containing merge_cubes is ready, as created by the Python client:

I'm still a bit unsure about how to optimally specify the inputs to the callback. I now use a list of expressions that point back to the cubes to be merged.
I do feel that this is a little awkward. The docs make it fairly clear what the overlap_resolver should work on, but the API seems to give me different ways to specify it, which could lead to incompatibilities.

    "process_id": "merge_cubes",
    "arguments": {
      "cube1": {
        "from_node": "linearscalerange3"
      },
      "cube2": {
        "from_node": "linearscalerange4"
      },
      "overlap_resolver": {
        "callback": {
          "or1": {
            "process_id": "or",
            "arguments": {
              "expressions": [
                {
                  "from_argument": "cube2"
                },
                {
                  "from_argument": "cube2"
                }
              ]
            },
            "result": true
          }
        }
      }
    },

Full graph:
https://github.com/Open-EO/openeo-python-client/blob/master/tests/data/cube_merge_or.json

@mkadunc
Member

mkadunc commented Nov 28, 2019

I think that the overlap resolver would be applied on each pixel within the overlap, so the callback would take pixels, not cubes as arguments. (from_argument can't really access arguments of the merge_cubes process, so you're not pointing back to those; you can only access the parameters that the callback will be called with by the merge_cubes process)

Pseudo-code implementation of merge_cubes should look something like:

merge_cubes(cube1, cube2, overlap_resolver) {
    axesRet = cube1.axes.map(ax1 => Axis.createUnion(ax1, cube2.axes[ax1.name]) )
    cubeRet = createCube(axesRet)

    foreach(pxIndex in cubeRet.pixelIndices) {
        const pxCoord = cubeRet.getCoordinates(pxIndex)

        const val1 = cube1.containsCoord(pxCoord) ? cube1.getValueAt(pxCoord) : null
        const val2 = cube2.containsCoord(pxCoord) ? cube2.getValueAt(pxCoord) : null
        if (val1 == null) {
            if (val2 != null) cubeRet.values[pxIndex] = val2
        } else if (val2 == null) {
            cubeRet.values[pxIndex] = val1
        } else {
            // callback is called with two values, not two cubes
            cubeRet.values[pxIndex] = overlap_resolver(val1, val2)
        }
    }

    return cubeRet
}
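The same logic can be sketched as runnable Python, with cubes modeled as dicts from coordinate tuples to values (a toy model for illustration, not an actual backend implementation):

```python
def merge_cubes(cube1, cube2, overlap_resolver=None):
    """Merge two cubes; cells present in both are combined via overlap_resolver."""
    merged = {}
    for coord in set(cube1) | set(cube2):
        val1 = cube1.get(coord)
        val2 = cube2.get(coord)
        if val1 is None:
            merged[coord] = val2
        elif val2 is None:
            merged[coord] = val1
        else:
            if overlap_resolver is None:
                raise ValueError("overlapping cells require an overlap_resolver")
            # the resolver is called with two values, not two cubes
            merged[coord] = overlap_resolver(val1, val2)
    return merged

# disjoint in time: no resolver needed, both cells survive unchanged
a = {(0, 0, 2017): 1.0}
b = {(0, 0, 2018): 2.0}
print(merge_cubes(a, b))

# fully overlapping: the resolver combines the two values per cell
print(merge_cubes({(0, 0): 1.0}, {(0, 0): 3.0}, overlap_resolver=max))  # {(0, 0): 3.0}
```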

We've recently settled on using x and y as the common names for (mathematical) binary processes, so I'd change your graph into something like this:

{
    "process_id": "merge_cubes",
    "arguments": {
        "cube1": {"from_node": "linearscalerange3"},
        "cube2": {"from_node": "linearscalerange4"},
        "overlap_resolver": {"callback": {
            "_": {
                "process_id": "or",
                "arguments": {"expressions": [{"from_argument": "x"}, {"from_argument": "y"}] },
                "result": true
            }
        }}
    }
}

side-note: having callbacks with the same argument names as mathematical binary processes (x and y) will at some point allow us to simplify the specification of a callback to a simple pointer to a pre-defined process — e.g. the signature of the or process is or(x::number, y::number), which is compatible with overlap_resolver, if it is defined as function(x::number, y::number) - this allows specifying your graph as:

{
    "process_id": "merge_cubes",
    "arguments": {
        "cube1": {"from_node": "linearscalerange3"},
        "cube2": {"from_node": "linearscalerange4"},
        "overlap_resolver": {"process_id": "or"}
    }
}

(I'm not sure how well the possibility of having such callbacks is supported right now, but it should not be too difficult to implement support if we decide something like that would be valuable)

@jdries
Contributor Author

jdries commented Nov 28, 2019

Thanks, I agree with the pseudo code.
Actually, your second example process graph would be my preferred way of supporting this right now in the Python client and our backend. It is simpler, which reduces the opportunity for misinterpretation.
It's also the case that I don't support more complex reducers for merge_cubes right now.

@m-mohr
Member

m-mohr commented Nov 28, 2019

The merge_cubes overlap_resolver works exactly like the reduce function.

Here's the related documentation:
see: http://processes.openeo.org/#merge_cubes

If binary = false: An array with the overlapping pixels is passed in data.
If binary = true: Two values from the overlapping pixels are passed in x and y.
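The two calling conventions can be sketched in Python like this (the dispatcher name and signature are illustrative assumptions, not part of the spec):

```python
def resolve_overlap(pixel1, pixel2, resolver, binary):
    """Call the overlap resolver in either of the two documented styles."""
    if binary:
        # binary = true: the two overlapping values are passed as x and y
        return resolver(pixel1, pixel2)
    # binary = false: an array with the overlapping values is passed as data
    return resolver([pixel1, pixel2])

# "or" as the overlap resolver, expressed both ways
print(resolve_overlap(True, False, lambda x, y: x or y, binary=True))  # True
print(resolve_overlap(True, False, any, binary=False))                 # True
```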

So your initial part of the process graph would look like this:

    "process_id": "merge_cubes",
    "arguments": {
      "cube1": {
        "from_node": "linearscalerange3"
      },
      "cube2": {
        "from_node": "linearscalerange4"
      },
      "overlap_resolver": {
        "callback": {
          "or1": {
            "process_id": "or",
            "arguments": {
              "expressions": {
                "from_argument": "data"
              }
            },
            "result": true
          }
        }
      }
    }

The binary case would basically be what Miha posted.

re @mkadunc's side-note: The implicit mapping of parameters as proposed is currently not possible. I'm not sure whether it's really useful to solve this on the process graph level. I guess clients could simply do that for the user whenever parameters are "compatible" and come up with an "explicit" process graph. I think that would be the cleaner solution.

@jdries Could you clarify where this confusion comes from? For me the docs are quite clear, but I'm biased. ;-) Would help me to improve the docs.

@jdries
Contributor Author

jdries commented Nov 29, 2019

Thanks, I guess it was mostly me not reading the process docs carefully enough. The only thing that would make it even easier for lazy people would be to add the examples from this issue to the process docs.

@m-mohr
Member

m-mohr commented Dec 18, 2019

Clarified merge_cubes in PR #112

m-mohr added a commit that referenced this issue Jan 13, 2020
Clarified merge behavior of merge_cubes #87
@m-mohr m-mohr closed this as completed Jan 13, 2020
@clausmichele
Member

I would like to understand whether x and y of the overlap resolver refer to cube1 and cube2 (x -> cube1, y -> cube2), since this is not explicitly stated in the API. The doubt comes from some inconsistencies I observed when testing with the GEE backend, which I'm not sure are due to the implementation or to the definition of the process itself.
