
shutdown the weight proof process pool #10163

Merged · 9 commits · Feb 15, 2022

Conversation

@altendky (Contributor) commented Feb 9, 2022

Sometimes when we shut down with chia stop -d all, we get 4 chia_full_node processes left floating around, unclosed.

$ ps aux | grep chia_
altendky  591271 84.4  0.1 547924 124968 pts/2   S    09:58   0:20 chia_full_node
altendky  591284 22.2  0.2 634456 140436 pts/2   S    09:58   0:04 chia_full_node
altendky  591285 22.1  0.2 634200 140180 pts/2   S    09:58   0:04 chia_full_node
altendky  591286 22.5  0.2 634200 140180 pts/2   S    09:58   0:04 chia_full_node
altendky  591345  0.0  0.0   9688  2464 pts/2    S+   09:58   0:00 grep --color=auto chia_

Issues:

  • WeightProofHandler.validate_weight_proof had a bunch of sync code with no interleaved awaits, resulting in about 4 seconds of blocking all other activity (as timed on my laptop).
    • Added several await asyncio.sleep(0) calls to take turns. It is called cooperative concurrency after all. (A minimal sketch of the pattern follows this list.)
    • If the sync functions are well isolated it may make sense to push them into threads. This both involves an await that breaks up the sync stretches at those points and also reverts to regular OS/thread/GIL sharing of CPU time while the sync work is being done. But this depends on the functions being thread safe, hopefully just fully isolated functional input/output with no dependence on or modification of global state.
  • The executor was not being closed.
    • Added a context manager to close it (sketched below).
  • Executor subprocesses continue until they are done with their tasks; there is no built-in way to cancel them.
    • In this case the worker code had already been given a way to trigger cancellation via a temp file. It just needed to be leveraged from this second use of the worker functions (also sketched below).
  • The full node sync task was being cancelled late.
    • Moved the cancellation from FullNode._await_closed() to FullNode._close() (see the cancel-then-await sketch below).
  • The full node sync task was not being awaited.
    • Added awaiting of it in FullNode._await_closed(), along with consuming the cancellation at that point since we are already shutting down there.
  • Service start signal handler registration was using the signal.signal() function instead of integrating with asyncio signal handling via asyncio.get_running_loop().add_signal_handler(). It seems this would end up either overriding other signal handlers or being overridden by them, depending on order of execution. Neither seems good.
    • Switched (sketched below).
  • Various other shutdown cleanup will be submitted in separate PRs
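
Not code from this PR, just a minimal sketch of the yield-and-offload pattern described above; expensive_sync_validation and the chunk loop are made-up stand-ins.

```python
import asyncio

async def validate_chunks(chunks):
    results = []
    for chunk in chunks:
        # Each chunk is pure sync work; without the sleep(0) the whole loop
        # would run to completion before any other task gets the event loop.
        results.append(expensive_sync_validation(chunk))
        await asyncio.sleep(0)  # explicitly give other coroutines a turn
    return results

async def validate_chunks_in_threads(chunks):
    # If expensive_sync_validation is thread safe (isolated input/output, no
    # global state), the sync work could instead be pushed onto a thread: the
    # await yields to the event loop while the OS schedules the thread normally.
    loop = asyncio.get_running_loop()
    return [await loop.run_in_executor(None, expensive_sync_validation, chunk) for chunk in chunks]
```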
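
Similarly, a hedged sketch of the executor lifecycle and the file-based cancellation: the context manager guarantees the pool is shut down, and touching a temp file is how already-running workers get told to stop, since ProcessPoolExecutor cannot cancel a task once it has started. worker() and do_the_work() are hypothetical names, not the actual weight proof functions.

```python
import asyncio
import pathlib
import tempfile
from concurrent.futures import ProcessPoolExecutor

def worker(task, shutdown_path: str):
    # Runs in a subprocess; polls for the shutdown file and bails out early
    # if it appears.
    if pathlib.Path(shutdown_path).exists():
        return None
    return do_the_work(task)

async def validate(tasks, num_processes: int):
    loop = asyncio.get_running_loop()
    with tempfile.TemporaryDirectory() as tmp_dir:
        shutdown_path = pathlib.Path(tmp_dir) / "shutdown"
        # The context manager closes the pool even if validation fails, so the
        # worker processes do not outlive the node.
        with ProcessPoolExecutor(num_processes) as executor:
            try:
                futures = [loop.run_in_executor(executor, worker, task, str(shutdown_path)) for task in tasks]
                return await asyncio.gather(*futures)
            except asyncio.CancelledError:
                # Tell running workers to stop; the executor's exit waits for
                # them, so without this the shutdown would block until the
                # in-flight tasks finished on their own.
                shutdown_path.touch()
                raise
```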
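
The cancel-then-await split looks roughly like this; again a sketch rather than the actual FullNode code, with _sync_task as a stand-in attribute name.

```python
import asyncio
from typing import Optional

class FullNode:
    def __init__(self) -> None:
        self._sync_task: Optional[asyncio.Task] = None

    def _close(self) -> None:
        # Request cancellation early so the sync task starts unwinding while
        # the rest of the shutdown proceeds.
        if self._sync_task is not None:
            self._sync_task.cancel()

    async def _await_closed(self) -> None:
        # Actually wait for the task, and consume the CancelledError we caused
        # above since we are already shutting down here.
        if self._sync_task is not None:
            try:
                await self._sync_task
            except asyncio.CancelledError:
                pass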
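```

And the signal handling switch, sketched with a made-up service entry point: loop.add_signal_handler() cooperates with the running loop instead of installing a process-wide handler the way signal.signal() does (note it is only available on Unix event loops).

```python
import asyncio
import signal

async def run_service() -> None:
    stop_requested = asyncio.Event()
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGINT, signal.SIGTERM):
        # Registered through the loop, so it neither clobbers nor gets
        # clobbered by other handlers depending on registration order.
        loop.add_signal_handler(sig, stop_requested.set)
    await stop_requested.wait()
    # ...then proceed with the _close() / _await_closed() style shutdown...
```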

Draft for:

  • Self reflection
    • 🪞 👨‍🚀
  • Testing
    • Left it cycling overnight with no failures. It's not perfect, but a lot better in the test sequence I'm exposing it to.
    • set -vx; while true; do echo "about to start $(date --iso-8601=n)"; which chia; chia start node; sleep 30; echo "about to stop $(date --iso-8601=n)"; chia stop -d all; sleep 5; ps aux | grep chia_; pkill -9 chia_; done
  • We should explicitly try to shut down, but can we also adjust a setting so the processes would naturally die as well?
    • Leveraged the existing file-based shutdown approach already implemented for the wallet weight proof pool shutdown.
  • Consider an attribute and reusing the same pool across multiple calls
    • Mariano said this is used infrequently enough that just recreating the pool is ok. Creating it looks to take about 0.1s, measured from just before entering the context manager to just inside it. There may be other delays induced later that I didn't isolate.
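
For reference, that 0.1s figure comes from a measurement along these lines (a hand-rolled sketch, not code from the PR):

```python
import time
from concurrent.futures import ProcessPoolExecutor

before = time.monotonic()
with ProcessPoolExecutor(4) as executor:
    # Time from just before the context manager to just inside it.
    print(f"pool ready after {time.monotonic() - before:.3f}s")
    # ...submit work here...
```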

@mariano54 (Contributor) commented Feb 9, 2022

This code was recently changed as part of the new wallet, and it did not get much testing, so it makes sense that there might be a bug here.

arvidn previously approved these changes Feb 10, 2022
byte_chunks = []
for vdf_proof, classgroup, vdf_info in chunk:
    byte_chunks.append((bytes(vdf_proof), bytes(classgroup), bytes(vdf_info)))
with ProcessPoolExecutor(self._num_processes) as executor:
arvidn (Contributor) commented:

do we spin up these processes every time we validate a weight proof? It seems like we should just keep this ProcessPoolExecutor around, but maybe we only validate weight proofs very rarely.

Either way, probably not for this PR

@altendky (Contributor, Author) replied:

Yeah, I asked about this and Mariano said it was fine to just recreate the pool.

@altendky (Contributor, Author) commented:

Thanks for looking. There's a 'bunch' more coming on this. It's basically the same situation as with the weight proof process pool in #9050. This will get another temp file to trigger shutdown of the work in the subprocesses since the executor can't handle such cancellation.

@altendky altendky marked this pull request as ready for review February 10, 2022 18:44