
How to start up a complex task tree in an orderly way, given spawn's non-determinism? #284

Closed
njsmith opened this issue Aug 11, 2017 · 15 comments

Comments

@njsmith
Member

njsmith commented Aug 11, 2017

(Text below was posted by @matham as #32 (comment); I'm splitting it off into a new issue to keep the threads tidy)


So I kinda just ran into this. I have a parent and a bunch of children (ordered).

Naively I tried:

def init_start(self):
    init_things  # get things started

async def wait_to_finish_work(self):
    await do_things

async def start(self):
    self.init_start()
    async with trio.open_nursery() as nursery:
        for child in self.children:
            nursery.spawn(child.start)
        nursery.spawn(self.wait_to_finish_work)

>>> await root.start()

I was hoping that the children would be spawned deterministically in order and would at least run until the first await in wait_to_finish_work. That would start the tree in depth-first order. However, as I noticed both from the results and from the design described above, even the initial spawning is not in order.

Specifically, you mention that newly spawned tasks may inherit their parent's vtime. I wonder whether it might be better for the nursery to initially birth its children and run each until its first suspension, in deterministic order. spawn looks like it should be deterministic, and the docs don't really say much either way.

In current trio, the alternative for a case like mine, where you want to start off each spawnee in order, would be to rewrite my example as:

def init_start(self):
    init_things  # get things started

async def wait_to_finish_work(self):
    await do_things

async def start(self):
    for child in self.children:
        child.init_start()
    async with trio.open_nursery() as nursery:
        for child in self.children:
            nursery.spawn(child.start)
        nursery.spawn(self.wait_to_finish_work)

>>> root.init_start()
>>> await root.start()

The second version is not that different, but it is more complex, especially for cleanup, since the parent now starts things in two methods. But also, this is now breadth-first order, where at each level a node is picked at random to start off its children.

Edit: I'm thinking that maybe my model of spawn is wrong. E.g.

async with trio.open_nursery() as nursery:
    nursery.spawn(a)
    nursery.spawn(b)
    nursery.spawn(c)

Looks to me like

async with trio.open_nursery() as nursery:
    magic await a()
    magic await b()
    magic await c()

Where the magic collects the tasks at the with exit and runs them side by side. Perhaps I'm wrong and the better analogy is:

t1 = Thread(target=a)
t2 = Thread(target=b)
t3 = Thread(target=c)
for t in (t1, t2, t3):
    t.start()
for t in (t1, t2, t3):
    t.join()

because with threads order is completely undefined.

@njsmith
Member Author

njsmith commented Aug 11, 2017

Right, what nursery.spawn does is immediately create a task, graft it into the task tree at the given nursery, and then mark the task runnable. And the next time the scheduler runs – which might be when you reach the end of the nursery's async with, but it could be earlier, too, if you execute any checkpoints inside the async with body – then it cycles through all the runnable tasks in some random order. So it's much closer to the thread spawn model than the magic await.
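For example, a minimal sketch (worker and main are placeholders; nursery.spawn takes the function plus any positional args, like trio.run does):

import trio

async def worker(name):
    print(name, "running")

async def main():
    async with trio.open_nursery() as nursery:
        nursery.spawn(worker, "a")
        nursery.spawn(worker, "b")
        # Both tasks are runnable at this point, but neither has run yet.
        await trio.sleep(0)  # a checkpoint: the scheduler is now free to run
        # a and b, in an arbitrary order, before or after control returns here

trio.run(main)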

It looks like #136 will lead to some simplification in the nursery API, which may also make this a little easier to explain, because the body of the async with will stop being magical and become effectively just another task running inside the nursery.

In any case, your problem – wanting to run initialization code in a deterministic order – is an important one; it comes up all over the place. (Another common example: in a test suite, if you start a server and then start a client that connects to that server, you want to know that the server is fully started and ready to accept connections before you start the client.)

Here's one possible pattern: define a start function that takes a nursery, and it makes two guarantees: (a) it will spawn the appropriate thing into the nursery, and (b) it will not return until the thing is "ready", whatever that means. (For a server, it would mean: listening socket is bound and ready to accept connections.) In particular, point (b) is what gives you the ordering guarantees that raw spawn is lacking.

So in your case, if I'm understanding your pseudo-code right, this might look something like:

class Child:
    def start(self, nursery):  # or 'async def', if the initialization might be async
        self.initialize_this_object()
        nursery.spawn(self._main)

    async def _main(self):
        async with trio.open_nursery() as nursery:
            for child in self.children:
                child.start(nursery)

async with trio.open_nursery() as nursery:
    root.start(nursery)

The requirement to have an extra single-child nursery at the top level is a bit unfortunate. I guess it could be wrapped into a small helper.

Does that make sense? Does it actually address your problem? I'm not sure I fully understood the constraints.

@njsmith
Member Author

njsmith commented Aug 15, 2017

(@matham: not 100% sure github's notifications are working properly after my shenanigans -- I did reply to your comment here, in case you didn't see it.)

@matham
Contributor

matham commented Aug 15, 2017 via email

@njsmith
Member Author

njsmith commented Aug 15, 2017

Cool, no worries then :-)

@njsmith
Member Author

njsmith commented Aug 15, 2017

I had a better idea last night! This is pretty exciting, I've been frustrated about the waiting-for-task-to-start problem since well before trio even existed.

Idea:

  • Convention: functions that have a startup phase followed by a providing-services phase take a kwarg like ready=None
  • When they're ready, they notify the token somehow, like trio.notify(ready). Optionally they can pass a value, trio.notify(ready, value); default is None.
  • There's some nursery method like await nursery.start(async_fn, *args) that calls async_fn(*args, ready=token), blocks until the function notifies readiness, then returns the value passed to notify(). (Fortunately partial does exactly what we want here WRT kwarg handling; lambda would be more annoying.)

That's... basically it. It's very simple, actually. But if it's baked in as a convention then people will use it, and it's super easy to use. And much easier to understand than some mess like supervisor.start(fn1) which calls fn1(nursery) which calls nursery.spawn(fn2) which calls fn2, and more powerful, and... yeah. It's better.
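A rough sketch of how that might read in user code (serve, open_listener, and accept_connections_forever are made-up placeholders; ready= and trio.notify are the names proposed above, not an existing API):

async def serve(port, *, ready=None):
    listener = await open_listener(port)        # hypothetical setup step
    trio.notify(ready, listener)                # declare readiness, optionally passing a value back
    await accept_connections_forever(listener)  # hypothetical service loop

async with trio.open_nursery() as nursery:
    listener = await nursery.start(serve, 8080)
    # serve() has bound its listener by this point, so a client started
    # here can safely connect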

I thought for a bit about whether we could make this automatic – like if you want to spawn some function that doesn't notify readiness, it won't wait, but if you do then it does. (Unless you explicitly opt out of waiting somehow.) But on further thought I think this is both impossible and undesirable. The "impossible" part is that to do this we need to somehow immediately detect when a function is not going to provide any kind of readiness notification. This requires the ability to see into the future. To make this work we'd need to base our decision off things we can immediately observe, i.e. the signature or attributes of the function being called, and in a world full of decorators and lambdas and stuff, that just can't be done reliably. And the undesirable part is, well, generally you want to know whether something is going to block and what ordering guarantees your code is making. Plus the version that waits is async and the version that doesn't wait is sync, and we would want to have an explicit no-wait version anyway, so... yeah, not a big loss.

It has to be an explicitly passed token instead of something ambient (trio.current_ready_token().notify()) because if it were ambient, you could potentially set it off early by calling some other function. ...Which would be weird, since generally this is going to be used for functions that never return. But in any case I think this is less confusing. If you have a wrapper that does something and then calls the real function and you want to say that the wrapper's setup phase ends when the real function's setup phase ends, it's much clearer to write await real_function(..., ready=ready) than to leave it implicit. And seeing the ready= kwarg in the call signature is a signal about which functions support this.

Should it be ready=None, or ready=READY_DEFAULT? The above uses trio.notify(ready) instead of ready.notify() because None.notify() is an error; a module-level trio.notify() can just detect None and do nothing. But this is probably more confusing than it's worth. Let's use something like READY_DEFAULT: an explicit sentinel that means the caller doesn't want to know, and that also provides a no-op notification method so users can blindly call it without worrying about checking for None.
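A minimal sketch of what that sentinel could look like (these are the names proposed in this comment, not an existing API; _ReadyDefault is a made-up helper class):

class _ReadyDefault:
    # hypothetical sentinel type: accepts readiness reports and ignores them
    def notify(self, value=None):
        pass

READY_DEFAULT = _ReadyDefault()

async def serve(*, ready=READY_DEFAULT):
    ...  # setup
    ready.notify()  # safe to call unconditionally; no "is it None?" check needed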

Cute enhancement: if the new task raises an error before notifying, then we should raise that in the waiting spawner instead of propagating it into the nursery.

What should we do if the spawner is waiting for the child to declare readiness, and instead the child exits without raising an exception? Obviously the spawner should stop waiting. But there are a few options:

  • Raise an error ("never called notify")
  • Act like they did ready.notify(None)
  • Act like they did ready.notify(their_return_value)

If our general metaphor is that the new task is like a subroutine of the spawner until it calls notify, then, the last option makes the most sense. This also suggests that maybe the name should be like... token.split(), or something? token.diverge()?

In general names are really important here. Things that need names are:

  • The nursery method that spawns and then waits
  • The nursery method that spawns and doesn't wait
  • The kwarg
  • The kwarg default / "no-one cares" value
  • The notification method

If you accidentally use the no-wait version when you wanted the wait version, then it will seem to work, but you have unexpected concurrency. Unexpected concurrency is a pernicious bug; we don't want that. So if we pick bad names there's a danger of the no-wait version becoming a footgun (attractive nuisance variety). Plus the two methods are pretty similar, so we need to make sure it's easy to remember which is which or we'll get confused. For example, spawn and start are probably a bad pair of names, because we'll spend forever explaining to people which is which.

We can't do start and start_sync if we want to use sync as our convention for "this takes a sync function", like run_sync_in_thread. We could do something like start and start_nowait, though it's a bit of an abuse of terminology because everywhere else nowait means "try to do this blocking operation but if it actually blocks raise WouldBlock", and here it would mean "initiate this blocking operation but don't wait for it to finish". Maybe that's the kind of inconsistency you can get away with, though.

Probably worth considering this and #136 together since they both involve overhauling nursery semantics.

@njsmith
Member Author

njsmith commented Aug 15, 2017

Idea for naming: maybe make some connection between the spawn-and-don't-wait and the notification method, like start_released and release or something?

@njsmith
Member Author

njsmith commented Aug 17, 2017

Some ideas for names: spinoff vs startup (or just start), spinoff vs incubate, spawn vs incubate

I like how incubate emphasises that this is a blocking call. I'm not a big fan of it being 8 characters, given that it's the one people should maybe be using more often, or at least as often.

No idea whether other people would read spinoff as suggesting a lack of connection between the two tasks. Also no idea whether these weird names mean anything to non-native speakers, though this is a core enough api that it might be ok to spend a little bit of weirdness budget here.

@njsmith
Member Author

njsmith commented Aug 17, 2017

A library for making well-behaved services (#252) could reuse the "is it really started yet?" convention and hook it up to sd_notify, issuing a READY=1 message when the main task says it's started.
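For illustration, sd_notify itself is just a datagram sent to the socket named in the NOTIFY_SOCKET environment variable, so the systemd side could look roughly like this (only the protocol details are real; the idea of calling it from a readiness hook is hypothetical):

import os
import socket

def sd_notify(state="READY=1"):
    # systemd's notification protocol: send a datagram to $NOTIFY_SOCKET
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return  # not running under systemd; silently do nothing
    if addr.startswith("@"):
        addr = "\0" + addr[1:]  # abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(addr)
        sock.send(state.encode())

# a service wrapper could call sd_notify() at the same point where it
# reports readiness to its own spawner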

@njsmith
Member Author

njsmith commented Aug 19, 2017

So I'm leaning towards start and start_soon. This has the nice properties that (a) the one you could use accidentally is the one that's marked, and (b) in the tutorial we can teach start_soon first and it won't look weird and confusing the way something like start_nowait would, but it still jumps out at you enough that it makes sense to stop and explain what it means.

Regarding the API for tasks reporting back: I did a little more research on the status interfaces that service managers provide.

Systemd has statuses "READY", "RELOADING" (followed by "READY"), "STOPPING", plus you can set a free-form status string with "STATUS=".

Windows services have statuses START_PENDING, RUNNING, PAUSE_PENDING, PAUSED, CONTINUE_PENDING, STOP_PENDING, and STOPPED. No free-form text though. START_PENDING and CONTINUE_PENDING both indicate that the service is working on transitioning to RUNNING but hasn't gotten there yet.

I think this suggests that even if we want to start with just a "yeah I'm started" method, we should leave the door open to adding more methods like this in the future, both because this is evidence that they might be useful in general (the relationship between a task and its parent is fairly analogous to the relationship between a service and a service manager), and because in particular we might want to pun this interface to directly talk to service managers.

So I'm leaning towards making the kwarg task_status, used like:

async def awesome_server(*, task_status=trio.STATUS_IGNORED):
    ...
    task_status.started()

(and maybe in the future we will add task_status.stopping(), task_status.status("..."), etc.)
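On the caller's side, that would look something like this (awesome_server is the sketch above; client_that_connects_to_it is a made-up placeholder):

async with trio.open_nursery() as nursery:
    await nursery.start(awesome_server)
    # awesome_server has called task_status.started(), so whatever it set up
    # during initialization is guaranteed to exist by this point
    nursery.start_soon(client_that_connects_to_it)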

@njsmith
Member Author

njsmith commented Aug 19, 2017

Oh hmm. The interaction between cancellation and blocking-start is complicated.

The simple case is simple: if we're just doing

async with open_nursery() as nursery:
    await nursery.start(...)

then any cancellation that affects the call to start also affects the child equally. But in the general case, the nursery and the task calling start could be totally unrelated.

What if the call to start gets cancelled? What effect should this have on the child? Intuitively, we sort of want to think of the child as being supervised by the spawner until it calls .started(). This would argue that start being cancelled should propagate into the child (and any child that it's spawned in the meantime, etc.), and then start shouldn't return until it's exited. OTOH the naive implementation would just have cancelled start abandon the child to keep running (i.e. it turns into a start_soon). Right now we really don't have any good way to swap around the cancel scopes affecting a task...

Also, what if the nursery gets cancelled? What effect should this have on the spawner blocked in start? Generally, I think if the child crashes before calling started, then we want the exception to propagate out to the spawner. (There's some question about what exactly should happen with nurseries that do automatic restarting, but I guess we can worry about that later.) But if the nursery is cancelled, and throws Cancelled into the child, and that Cancelled propagates into the spawner... then suddenly we have this loose Cancelled that might refer to a totally different cancellation hierarchy, and nothing can catch it until it kills the whole program. This is really no good at all. One option would be to convert it into a NurseryCancelled exception or something like that when it crosses over to the spawner. That doesn't seem too bad – "sorry, spawning into this nursery didn't work, it's cancelled". Trying to spawn into a closed nursery raises a RuntimeError -- this is similar. Except... if the cancellation does encompass both the nursery and the spawner, then we really should keep it as a Cancelled error so that it can eventually be absorbed by a move_on_after or whatever, instead of converting it into a "real" error that might propagate out to crash the program. Hmm. The other option would be to say that the task simply isn't affected by the nursery's cancellation state until after it calls started(). This would be consistent with the concept suggested above, that before started() the child is a child of the spawner, and after started() it's a child of the nursery.

Maybe one way to do it would be to literally have an adopt method on nurseries, that can move a task from one nursery to another, and then do something like:

@attr.s
class _TaskStatus:
    _final_nursery = attr.ib()
    _child = attr.ib(default=None)
    _spawner = attr.ib(default=None)

async def start(self, async_fn):
    # remember that 'self' here is the eventual nursery
    task_status = _TaskStatus(self)
    async with open_nursery() as nursery:
        task = nursery.start_soon(partial(async_fn, task_status=task_status))
        task_status._child = task
        return await task_status._wait()

where task_status.started() does self._final_nursery.adopt(self._child) to pull the child out from underneath the spawner.
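Filled in, that might look roughly like the following (the _event/_value plumbing is an assumed detail for the sake of the sketch, not anything decided above):

    def started(self, value=None):
        # hand the child off from the spawner's temporary nursery to the
        # nursery that was originally passed to start()
        self._final_nursery.adopt(self._child)
        self._value = value
        self._event.set()  # assumed: a trio.Event that _wait() blocks on

    async def _wait(self):
        await self._event.wait()
        return self._value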

Would this adopt method have any other uses? It's conceptually sound, but seems pretty weird. We could keep it internal for now in any case.

This approach would also fix something that's a little tricky with the current nursery interface: if the task exits/crashes before calling .started(), do we notify the nursery? We kind of have to right now, but this can cause an exception to be raised twice. This won't be an issue after we remove the notification APIs, like we're planning to do in #136, but I was sort of thinking that part could wait until 0.3.0, and I wanted to get .start() into 0.2.0.

This proposed semantics also makes it even more unclear what .start should do on a "proper" supervisor. If it's the supervisor who takes care of restarting, then like... it shouldn't do that until it's taken possession of the child, right? Conceptually? But we don't want an intermittent failure during startup to break the whole startup sequence; we probably do want start to retry failures following the normal logic. Maybe that's just fine and we shouldn't worry about the conceptual issue – fancy_supervisor.start() certainly can implement some fancy logic if it wants. In fact in general it might need to do something like, wait for the previous task to finish starting before it can even get going. But what should the cancellation semantics be then? Fancy supervisors probably have a way to cancel individual tasks, so propagating from spawner→child is certainly doable. The other direction is less clear. But we certainly would need to do something if the supervisor guarantees that tasks will be started one at a time in serial, and then one of the precondition tasks never starts because it's cancelled. The strategy suggested above for plain nurseries of blithely letting the child continue on until it calls started() and then dropping the cancellation on it -- that doesn't work at all.

And it does seem a bit odd to allow tasks to be blithely spun all the way up to "started" and report success to their spawner and then immediately be cancelled.

Hmm.

@njsmith
Member Author

njsmith commented Aug 20, 2017

There's another issue to worry about with the "adoption" approach: before the adoption, only the spawner's cancellations can affect the child task, and after the adoption, only the new nursery's cancellations can affect the child task, so that's good. But at the moment of the adoption, we could potentially switch over a task that has a cancellation in flight.

I think the way this would need to work is: started always succeeds, from the perspective of its caller (which might be the child task, might be some grandchild of the child task, or who knows, could be anything in principle). But the adoption could fail, either because the new nursery is already closed, or (new point) because the old nursery is in a cancelled state – if the start has already been cancelled when the child calls started, then it doesn't actually get handed off, it stays in its original context to be cancelled. If the adoption fails because the new nursery is closed, then start should cancel the child and raise the nursery closed error. If the adoption fails because the old nursery is cancelled, then start should wait for the child to respond to that cancellation, and let the child's Cancelled propagate out to be caught in the normal way.


The other approach would be to do something complicated with a special cancel scope inside the new task whose only purpose is to let us manually propagate a cancel from start into the child, and some hack like if the child raises Cancelled, then start checks to see if it itself is cancelled and if so raises that, and otherwise raises some error. This would probably work... ok? It doesn't feel very natural though.


For "real" erlang-style supervisors that have restart policies etc., on further thought I think we're actually OK. First, there's nothing that says they need to have the same API as nurseries... I was thinking that at first, but then I realized that was partly a hold-over from when I was thinking that we needed a more elaborate way of getting serialization of task startup, where each task that supported this would need some custom code that took a duck-nursery and spawned stuff into it. But that constraint doesn't exist in this new design. Also, given that supervisors are just a different thing than regular nurseries, I'd feel pretty comfortable saying something like "the start method on this supervisor is slightly different from a regular nursery: it immediately adds the given item to the list of tasks to supervise, and then waits until either the task is successfully running or the supervisor itself has exited (e.g. due to exceeding its restart limit). Cancelling the start method does not remove the task from the list of tasks to supervise." Like, it'd be fine.

@njsmith
Member Author

njsmith commented Aug 21, 2017

More edge cases to worry about:

Edge case 1: I've gone back and forth on what to do when the child exits cleanly before calling started, e.g.:

async def child(*, task_status=trio.STATUS_IGNORED):
    return "hi"

async with trio.open_nursery() as nursery:
    await nursery.start(child)

The two obvious options are to either act like they called task_status.started("hi"), or to raise an error ("child never called started"). I was originally leaning towards the former, but on further thought I'm having a lot of trouble coming up with any situation where this makes sense. For start to be useful, you really need a task that has two phases, a before-started and after-started – otherwise you can just call the thing. Plus the raise-an-error version is a little simpler to implement, and if we change our minds it's easier to remove an error than add an error, so I guess I'll make it an error for now.

Edge case 2: normally, a nursery stays open until all of its tasks have exited, and then it becomes "closed", and it's impossible to spawn new tasks in it. So this sets up a weird possibility: what if the nursery is open when we call start, but is closed by the time we call started? Maybe a nursery should stay open until all its tasks have exited and all start calls have been resolved.

njsmith added a commit to njsmith/trio that referenced this issue Aug 21, 2017
start_soon is just a new name for spawn, except it doesn't return the
new task (in preparation for python-triogh-136, where we're going to stop
emphasizing task objects in the main api)

start is a major new feature: it provides a very simple way to start
up a long running task, while blocking until it's finished whatever
initialization it wants to do. At least... it's simple from the user's
point of view. Internally it's quite tricky indeed. The whole _run.py
file probably needs some refactoring and splitting up, but this is one
of those cases where I think it's best to first get the new
functionality working and nailed down, and then we can see what shape
the new abstractions should be.

Fixes python-triogh-284.
@njsmith
Member Author

njsmith commented Aug 21, 2017

Maybe a nursery should stay open until all its tasks have exited and all start calls have been resolved.

Getting this to work required a slightly annoying hack: before, the nursery monitor queue received each task as it died; now, it still receives those, but it also receives a None when each start call resolves. Fortunately we're getting rid of this API entirely in 0.3.0, so this isn't a big deal.

@matham
Contributor

matham commented Aug 21, 2017

This is perfect!!! Sorry I didn't get back until now, but I really like this approach as it makes things much easier to reason about and leads to cleaner code. I'm not sure what I could have added anyway as you seem to understand the issues involved like 100x more than I do :)

In my original issue I wanted to be able to start a tree of async tasks and do work on them in either depth-first pre-order or depth-first post-order. Following is an example of how it works using your newly added changes, and it's super simple to switch between pre and post. Using spawn, by comparison, I'd need to add a lot more complexity to make it work.

import trio


class Node(object):
    children = []
    parent = None
    depth = 'A'
    index = 0

    init_order = []
    compute_order = []

    def __init__(self, parent=None, index=0):
        super(Node, self).__init__()
        self.parent = parent
        self.children = []

        if parent is not None:
            self.depth = chr(ord(parent.depth) + 1)
            self.index = 2 * self.parent.index + index

    @property
    def name(self):
        return '{}{}'.format(self.depth, self.index)

    async def init(self, task_status=trio.STATUS_IGNORED):
        await trio.sleep(0)
        self.init_order.append(self.name)
        task_status.started()

    async def compute(self):
        await trio.sleep(0)
        self.compute_order.append(self.name)

    async def start_node_spawn_dfo_pre(self):
        await self.init()
        async with trio.open_nursery() as nursery:
            # problem here is that the children will init in random order
            # but our compute may happen before children all init
            for child in self.children:
                nursery.spawn(child.start_node_spawn_dfo_pre)
            nursery.spawn(self.compute)

    async def start_node_dfo_pre(self, task_status=trio.STATUS_IGNORED):
        await self.init()
        async with trio.open_nursery() as nursery:
            for child in self.children:
                await nursery.start(child.start_node_dfo_pre)
            task_status.started()
            nursery.start_soon(self.compute)

    async def start_node_dfo(self, task_status=trio.STATUS_IGNORED):
        async with trio.open_nursery() as nursery:
            for child in self.children:
                await nursery.start(child.start_node_dfo)
            await nursery.start(self.init)
            task_status.started()
            nursery.start_soon(self.compute)

if __name__ == '__main__':
    def create_children(root, n, root_depth):
        dfo_pre.append(root.name)
        if root_depth >= max_depth:
            dfo.append(root.name)
            return

        for i in range(n):
            node = Node(parent=root, index=i)
            root.children.append(node)
            create_children(node, n, root_depth + 1)
        dfo.append(root.name)

    max_depth = 4
    dfo_pre = []
    dfo = []
    root = Node()
    create_children(root, 2, 1)

    print('Depth first-post tree ordering is')
    print(dfo)
    print('Depth first-pre tree ordering is')
    print(dfo_pre)

    trio.run(root.start_node_spawn_dfo_pre)
    print('spawn dfs-pre init')
    print(root.init_order)
    Node.init_order = []

    trio.run(root.start_node_dfo_pre)
    print('start dfs-pre init')
    print(root.init_order)
    Node.init_order = []

    trio.run(root.start_node_dfo)
    print('start dfs-post init')
    print(root.init_order)

This prints:

Depth first-post tree ordering is
['D0', 'D1', 'C0', 'D2', 'D3', 'C1', 'B0', 'D4', 'D5', 'C2', 'D6', 'D7', 'C3', 'B1', 'A0']
Depth first-pre tree ordering is
['A0', 'B0', 'C0', 'D0', 'D1', 'C1', 'D2', 'D3', 'B1', 'C2', 'D4', 'D5', 'C3', 'D6', 'D7']
spawn dfs-pre init
['A0', 'B1', 'B0', 'C1', 'C0', 'C2', 'C3', 'D3', 'D2', 'D0', 'D5', 'D7', 'D1', 'D4', 'D6']
start dfs-pre init
['A0', 'B0', 'C0', 'D0', 'D1', 'C1', 'D2', 'D3', 'B1', 'C2', 'D4', 'D5', 'C3', 'D6', 'D7']
start dfs-post init
['D0', 'D1', 'C0', 'D2', 'D3', 'C1', 'B0', 'D4', 'D5', 'C2', 'D6', 'D7', 'C3', 'B1', 'A0']

@njsmith
Member Author

njsmith commented Aug 22, 2017

This is perfect!!!

Oh good! I meant to post a message here to check if this ended up being useful to you or if I'd just wandered off in some other direction entirely, but then I got distracted :-).
