
Disable cancel scope reuse #911

Merged (4 commits) Feb 7, 2019
Conversation

@oremanj (Member) commented Feb 7, 2019

To keep our options open for adding a smarter `was_cancelled` property, per discussion in #886.
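
For readers skimming the change, here is a minimal sketch of the user-visible effect, assuming trio at or after this PR (the exact error wording has varied between versions): entering the same cancel scope a second time now raises RuntimeError instead of being silently accepted.

```python
# Minimal sketch of the behavior this PR enforces (error wording is
# illustrative; it has changed across trio versions).
import trio

async def main():
    scope = trio.move_on_after(1)  # move_on_after() hands back a cancel scope
    with scope:
        await trio.sleep(0.01)
    try:
        with scope:                # reusing the same scope is now an error
            await trio.sleep(0.01)
    except RuntimeError as exc:
        print("reuse rejected:", exc)

trio.run(main)
```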
@oremanj oremanj requested a review from njsmith February 7, 2019 00:17
codecov bot commented Feb 7, 2019

Codecov Report

Merging #911 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master    #911      +/-   ##
=========================================
- Coverage    99.4%   99.4%   -0.01%     
=========================================
  Files         102     102              
  Lines       12681   12678       -3     
  Branches      927     927              
=========================================
- Hits        12605   12602       -3     
  Misses         56      56              
  Partials       20      20
Impacted Files Coverage Δ
trio/_core/tests/test_run.py 100% <100%> (ø) ⬆️
trio/_core/_run.py 99.71% <100%> (-0.01%) ⬇️

" in another task ({!r})".format(self._scope_task.name)
)
"cancel scopes may not be reused or reentered "
"(first used in task {!r})".format(self._scope_task.name)
)
self._scope_task = task

Member (reviewer):

Do we use self._scope_task for anything else now? If it's effectively a bool, it would probably be better to make it an actual bool, so we don't inadvertently hold onto any stack frames longer than we need to.

Member Author (oremanj):
We use it in _exc_filter (testing whether self._scope_task._pending_cancel_scope() is self). I also don't think holding onto a Task can actually leak much memory -- if the task is running, it's referenced by the runner so our reference doesn't change the memory usage, and if it's done, it doesn't have any stack frames so the memory usage is small. Is there something I'm missing?

Member (reviewer):
Maybe CancelScope._exc_filter should use current_task()? ...and maybe it should be refactored into CancelScope._close. And actually, now that all Cancelled exceptions are the same, we can hoist the check up out of _exc_filter entirely; no reason _close should call MultiError.filter at all if it's not going to catch anything. That's another matter though...

I guess you're right that coroutine objects drop their cr_frame reference when the coroutine exits. Huh, I didn't know that. That does reduce the likelihood of this reference causing problems. So it still makes me nervous to hold onto an unnecessary reference to a complex object, but only on general principles, not because of any specific issue I can point to :-).
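
The cr_frame point is easy to verify outside trio; a quick pure-Python check (not from the PR, just an illustration of the behavior described above):

```python
# A finished coroutine drops its frame (and with it, its locals), so keeping
# a reference to the coroutine object afterwards holds very little memory.
async def coro():
    big_local = list(range(1000))
    return len(big_local)

c = coro()
print(c.cr_frame is not None)  # True: the frame exists until the coroutine finishes
try:
    c.send(None)               # drive it to completion; the result rides on StopIteration
except StopIteration as exc:
    print(exc.value)           # 1000
print(c.cr_frame)              # None: the frame (and big_local) have been released
```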

@njsmith (Member) commented Feb 7, 2019

Huh, I would have added a self._has_been_entered = False field. It wouldn't have occurred to me to overload self._tasks to track this as well. Is there a reason for it that I'm missing? Sorry to nitpick; I just find it easier to read code where the state is as decoupled as possible...

@oremanj (Member Author) commented Feb 7, 2019

I figured it would be one fewer slot. I'll add the separate bool.
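
For reference, a self-contained toy (not trio's CancelScope, and not the diff that landed) showing the pattern being settled on here: guard reuse with a dedicated boolean rather than overloading another attribute, so the object does not keep extra references around.

```python
# Toy illustration of the "separate bool" approach; names and error wording
# are illustrative only.
class OneShotScope:
    def __init__(self):
        self._has_been_entered = False

    def __enter__(self):
        if self._has_been_entered:
            raise RuntimeError("scope may not be reused or reentered")
        self._has_been_entered = True
        return self

    def __exit__(self, *exc_info):
        return False  # don't swallow exceptions

scope = OneShotScope()
with scope:
    pass
try:
    with scope:  # second use is rejected
        pass
except RuntimeError as exc:
    print(exc)
```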

@oremanj (Member Author) commented Feb 7, 2019

Azure Ubuntu timeout again, same reason as last night.

@oremanj oremanj closed this Feb 7, 2019
@oremanj oremanj reopened this Feb 7, 2019
@oremanj (Member Author) commented Feb 7, 2019

Even slower on a rerun.

... are we getting any value out of the lone Linux test environment on Azure? Should we just disable it?

@njsmith (Member) commented Feb 7, 2019

> are we getting any value out of the lone Linux test environment on Azure?

Unfortunately yes, it's our only environment that has openssl 1.1.1, and so far our experience has been that 1.1.1 is cranky and easy to break.

We could move it to Travis, or bump up the timeout as a temporary workaround...

It's super weird that the two times this has happened were both at 10:00-11:00 PM west coast time. I wonder if that's accidental.

@njsmith njsmith closed this Feb 7, 2019
@njsmith njsmith reopened this Feb 7, 2019
@pquentin (Member) commented Feb 7, 2019

> It's super weird that the two times this has happened were both at 10:00-11:00 PM west coast time. I wonder if that's accidental.

I don't think so: I think Azure Ubuntu is getting popular, and this is going to happen every day.

@njsmith (Member) commented Feb 7, 2019

@pquentin my experience with other CI systems, though, is that they get overloaded in the middle of the USA working day. OTOH we're hosted in the Azure "Central US" region in Iowa, so the slowdown seems to start right at midnight local time. Suspicious...

@pquentin (Member) commented Feb 7, 2019

Ah, that explanation is better! 👍 Maybe they have cron jobs running at midnight.

@njsmith njsmith closed this Feb 7, 2019
@njsmith njsmith reopened this Feb 7, 2019
@njsmith (Member) commented Feb 7, 2019

Hmm, actually, it's not quite at midnight.

The first time we had this problem was 2019-02-05 (UTC):

  • The last quick build was 20190205.4 (ran in ~2 min), which started at 2019-02-05T05:57:15Z.
  • The first slow build was 20190205.5, which started at 2019-02-05T06:39:20Z.
  • Then we had a series of slow builds.
  • Build 20190205.10 was borderline – it didn't quite hit the timeout, but squeaked in at ~9 min 45 s. It started at 2019-02-05T07:07:27Z.
  • The next build, 20190205.11, was definitely fast again – ~2 min runtime, started at 2019-02-05T07:35:19Z.

So this episode started somewhere between 05:57 and 06:39 UTC, and ended somewhere between 07:07 and 07:35 UTC.

The next day, 2019-02-06 UTC, we didn't run any builds between 2019-02-05T23:51:12Z and 2019-02-06T08:09:57Z, so nowhere near the period where the slowdown occurred.

Then the day after that puts us at today, 2019-02-07 UTC:

  • Build 20190207.11 was quick (~2 min), and started at 2019-02-07T06:15:08Z. In Iowa, that's 00:15, so today it was still fast a few minutes after local midnight.
  • The next build, 20190207.12, was slow, and started at 2019-02-07T06:35:21Z.
  • Then we had a series of slow builds, the last one being 20190207.14, which started at 2019-02-07T07:04:15Z.
  • And then 20190207.15 was only slightly slow (~4 min), and started at 2019-02-07T07:15:17Z.

So this time the slowdown started between 06:15 and 06:35 UTC, and ended around 07:15 UTC. That is still remarkably consistent with the previous window.

I also peeked back one day into the past, and it turns out that we did do a build at 2019-02-04T06:42:28Z, which went fine. So either the temporal consistency is a total coincidence, or else this is a new problem that just started on 2019-02-05.

[Edit: it just happened again, this time at 2019-02-14 06:36 UTC.]
