Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

promises: add --abort-on-unhandled-rejection #15335

Closed

Conversation

addaleax
Copy link
Member

@addaleax addaleax commented Sep 11, 2017

Support creating core dumps of unhandled promise rejections on POSIX systems by keeping fork()s of the process alive until it is known whether they are handled synchronously or not.

This is a kind of crude approach, but with some basic experimentation it looks like it can work well enough.

@nodejs/diagnostics @nodejs/post-mortem

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included (to the degree to which that is possible? we can’t test core dumps themselves)
  • documentation is changed or added
  • commit message follows commit guidelines
Affected core subsystem(s)

promises

@addaleax addaleax added cli Issues and PRs related to the Node.js command line interface. post-mortem Issues and PRs related to the post-mortem diagnostics of Node.js. promises Issues and PRs related to ECMAScript promises. semver-minor PRs that contain new features and should be released in the next minor version. labels Sep 11, 2017
@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. process Issues and PRs related to the process subsystem. labels Sep 11, 2017
Copy link
Member

@bnoordhuis bnoordhuis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo comments.

doc/api/cli.md Outdated

This option only affects `Promise` rejections. It is only available on POSIX
systems. It is implemented by keeping `fork()`ed copies of the process alive
until it is known whether the `Promise` is handled within the same eventloop
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'event loop'

doc/api/cli.md Outdated
This option only affects `Promise` rejections. It is only available on POSIX
systems. It is implemented by keeping `fork()`ed copies of the process alive
until it is known whether the `Promise` is handled within the same eventloop
iteration or is left unhandled, which may come with performance implications.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current wording it's somewhat ambiguous if 'performance implications' applies to forking or unhandled rejections.

doc/node.1 Outdated
systems. It is implemented by keeping \fBfork()\fRed copies of the process alive
until it is known whether the \fIPromise\fR is handled within the same event
loop iteration or is left unhandled, which may come with performance
implications.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

@@ -1,11 +1,13 @@
'use strict';

const { safeToString } = process.binding('util');
const abort_regex = /^--abort[_-]on[_-]unhandled[_-]rejection$/;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

abortRegex?

if (wantAbort) {
// Defer core dump closing until the next tick since another Promise
// might become an unhandled rejection by adopting this promise's state.
process.nextTick(process._closeCoreDump, promise, false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd assign process._closeCoreDump to a const and delete it from process.

src/node.cc Outdated
}
if (pid == 0) {
bool do_abort;
int rc = read(pipes[0], &do_abort, sizeof(do_abort));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to retry on EINTR:

int rc;
do
  rc = read(...);
while (rc == -1 && errno == EINTR);

(And #include <errno.h> at the top.)

As well, I'd make do_abort a char, not a bool. sizeof(char) is always 1, sizeof(bool) need not be.

src/node.cc Outdated
if (do_abort) {
// This may happen when e.g. fork()ing itself failed due to resource
// constraints.
abort();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ABORT(), here and elsewhere? It prints a stack trace.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve added it here, but the other instances of abort(); are used for an “expected” abort after printing the Promise’s stack trace.

src/node.cc Outdated
if (info->refcount > 0) return;
}
env->promise_reject_reason_map()->Delete(promise->Result());
if (write(info->fd, &do_abort, sizeof(do_abort)) <= 0) abort();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retry on EINTR.

src/node.cc Outdated
if (write(info->fd, &do_abort, sizeof(do_abort)) <= 0) abort();
if (close(info->fd) == -1) abort();
int status = 0;
if (waitpid(info->pid, &status, 0) == -1) abort();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Copy link
Contributor

@cjihrig cjihrig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once @bnoordhuis nits are addressed.

@addaleax addaleax force-pushed the abort-on-unhandled-rejection branch 2 times, most recently from 60cff61 to bebefa8 Compare September 11, 2017 13:02
@addaleax
Copy link
Member Author

Updated with @bnoordhuis nits addressed

@@ -1,11 +1,13 @@
'use strict';

const { safeToString } = process.binding('util');
const abortRegex = /^--abort[_-]on[_-]unhandled[_-]rejection$/;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to support the _ variant? We do for --abort-on-uncaught-exception, but that's because it's a V8 option.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s here for parity with --abort_on_uncaught_exception but yes, we can choose to not support it if we like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't cost much to keep parity; I would suggest keeping it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm +0 either way but if we do support it then we should probably have a test variant for it.

@@ -3861,6 +4008,7 @@ static void CheckIfAllowedInEnv(const char* exe, bool is_env,
"--force-fips",
"--openssl-config",
"--icu-data-dir",
"--abort_on_unhandled_rejection",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add this to the documented list of options that are allowed in NODE_OPTIONS in doc/api/cli.md.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@richardlau done!

@addaleax addaleax force-pushed the abort-on-unhandled-rejection branch 2 times, most recently from e1271fb to 29fcd36 Compare September 11, 2017 16:33
Promise.reject(new Error('child error'));
} else {
run('', null);
if (!common.isWindows) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we common.skip() on Windows, or perhaps assert that the flag is rejected?

systems. It is implemented by keeping `fork()`ed copies of the process alive
until it is known whether the `Promise` is handled within the same event loop
iteration or is left unhandled.
This makes unhandled rejections a significantly more complex operation,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As scary as this already sounds, I think it remains a massive understatement. Fork, particularly at high frequency, has some pretty significant performance impact:

  • fork itself can take a bunch of time
  • after the fork completes, performance is impacted in both parent and child processes as subsequent writes require additional page faults, even if the memory is already resident
  • on systems that avoid overcommitting swap (including illumos and Linux, when configured to do so), the memory used by the child process is counted against the swap cap. In cloud environments with tuned caps, this can quickly exhaust swap, causing lots of other operations to fail.
  • If the child ends up writing to the shared pages, then the total memory used by the parent and child goes up, which can result in paging (again, particularly in environments with memory caps configured) that can dramatically reduce performance.

I think it's good that the documentation mentions the use of fork, and I think it would benefit from a firmer message about the possible impact.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fork, particularly at high frequency, has some pretty significant performance impact:

Yes, I am aware that that makes this option unsuitable depending on your application type. I can try to incorporate some of your points into the doc.

If the child ends up writing to the shared pages, then the total memory used by the parent and child goes up, which can result in paging (again, particularly in environments with memory caps configured) that can dramatically reduce performance.

Why would this happen (for more than one or two pages of the stack)? The child process really isn’t doing much here.

Copy link
Member

@bnoordhuis bnoordhuis Sep 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any pages marked copy-on-write by fork() get cloned when the parent mutates them. If the parent does (for example) a compacting GC, that's a significant number of pages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way we can limit GC or at least discourage it while the forked process lives?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be obvious but I wonder if reminding people they should ensure ulimit is set to allow core dumps would be helpful.


if (pipe(pipes) == -1) return;
pid = fork();
if (pid == -1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I reading this correctly that if we fail to fork (or do any of these operations), we just won't get a core file for that, and the user will never know? I think the documentation should probably reflect that this is a best-effort approach. It would also be great if there were some way for a user to know that a dump has been skipped because of a failure like this. I'm afraid that when combined with some of the problems I mentioned earlier (e.g., swap exhaustion, or exhaustion of some other cap -- like number of processes), users won't get a core file, and they won't know that they won't get one until it's too late.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, the original process itself will abort() once it finds that there is no promise_abort_info struct attached to the promise when it would otherwise signal the child process to abort:

https://github.com/addaleax/node/blob/29fcd36f789a2f9913661ba7f6b964362517c9b6/src/node.cc#L1369-L1373

It’s certainly not ideal, but what’s the alternative?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that aborting the original process if we can't be sure is the reasonable approach.

pid_t pid;

if (pipe(pipes) == -1) return;
pid = fork();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we avoid the problem where another thread is holding a lock that will remain held in the child process with no thread to unlock it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't matter, does it? The child does nothing but read() and _exit(), it doesn't take out any locks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fork() only clones the calling thread into the child process, if that answers your question. Plus, what Ben said.

Copy link
Contributor

@davepacheco davepacheco Sep 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, although it relies on nobody adding anything more sophisticated to the child process in the future.

That also raises something I missed the first time: this only captures stack information about the forking thread. State about thread pool threads or other threads created by libraries or other add-ons will be missing from the core file. That might be another caveat worth documenting.

(edit: Sorry for the redundant note. I hadn't caught up to Julien's comment.)

@davepacheco
Copy link
Contributor

I've made a couple of notes on the implementation, but just to summarize: this has come up several times, and major issues were brought up with the approach. I think it's pretty likely that end users won't appreciate the severity of the impact that this can have, or the limitations of the resulting core files, until it's too late.

@addaleax
Copy link
Member Author

@davepacheco I know this isn’t ideal, but I’ve spent a lot of time thinking about this issue too. I’ll try to write up a bit more advisory text tomorrow; but generally, since this is an opt-in feature that – as you mention – has been brought up a few times, I think it’s something worth trying out. Maybe we can get V8 to give better predictions of whether an rejection is unhandled, but if you want a core dump at failure time, I’ve come to the conclusion that this is as good as it gets.

Also, regarding the performance issues: In the end, for all suggested solutions that I know of, avoiding uncaught rejections as much as possible is the key to reasonable performance. One nice thing is that with the increasing adoption of async/await, that’s going to become slightly less of an issue over time.

@jasnell
Copy link
Member

jasnell commented Sep 12, 2017

I appreciate the work on this and will do a more thorough review later, but I have to say that I'm not too keen on approaches for this that only work on posix. Will definitely have to give this a lot of thought.

@benjamingr
Copy link
Member

but I have to say that I'm not too keen on approaches for this that only work on posix

All the postmortem analysis tools that care about core dumps for Node.js don't work on windows at the moment anyway IIRC ( @nodejs/platform-windows - please do correct me if there is news or I missed anything?).

In particular, I'd like to ping @misterdjules to assert that this solves the promises post-mortem issue once and for all.

The approach here is the one we discussed during nodejs/post-mortem#45

Copy link
Member

@benjamingr benjamingr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the best path forward in terms of making sure post mortem analysis tools are compatible with promises.

@misterdjules
Copy link

Is it correct that core dumps generated by such a mechanism would have information about only one thread, that is the thread where the promise reject callback is called?

If so, maybe that's one more caveat to add to the documentation.

@benjamingr

In particular, I'd like to ping @misterdjules to assert that this solves the promises post-mortem issue once and for all.

With all the potential problems that were mentioned in this GH issue, it doesn't seem to me that this would solve problems with promises and post-mortem debugging "once and for all".

@benjamingr
Copy link
Member

@misterdjules

Is it correct that core dumps generated by such a mechanism would have information about only one thread, that is the thread where the promise reject callback is called?

Since the whole process is forked I would assume this would contain all information about the process (fork is literally called).

Promise rejections are supposed to be rare, promise rejections that are handled asynchronously are supposed to be very rare so I'd assume the penalty for the average program would be low and I intend to run my own code with this flag when I can.

With all the potential problems that were mentioned in this GH issue, it doesn't seem to me that this would solve problems with promises and post-mortem debugging "once and for all".

Would you mind elaborating? I would like us to set up action items and a plan after which we can all agree that promises are compatible with post-mortem analysis tools.

In case it isn't abundantly clear - a lot of us (me included) are not particularly versed with these tools and take core dumps rarely and we could really use your knowledge on the topic.

With this pull request we don't lose any stack/heap information about the promise which I think is pretty neat and it's what I've been advocating for at nodejs/post-mortem#45

@misterdjules
Copy link

Is it correct that core dumps generated by such a mechanism would have information about only one thread, that is the thread where the promise reject callback is called?

Since the whole process is forked I would assume this would contain all information about the process (fork is literally called).

From man -s 2 fork on SmartOS:

   Threads
       A call to forkall() or forkallx() replicates in the child process all
       of the threads (see thr_create(3C) and pthread_create(3C)) in the
       parent process. A call to fork1() or forkx() replicates only the
       calling thread in the child process.


       A call to fork() is identical to a call to fork1(); only the calling
       thread is replicated in the child process. This is the POSIX-specified
       behavior for fork().

This is confirmed empirically by running the following (./node is a node binary with the changes in this GH PR applied to it):

[root@ff16a225-3889-4644-a2a7-b5b963bcdded ~/dev/node]# ls -lt /var/tmp/cores/
total 0
[root@ff16a225-3889-4644-a2a7-b5b963bcdded ~/dev/node]# ./node --abort-on-unhandled-rejection -e "Promise.reject(new Error('boom'));"
Unhandled promise rejection
Error: boom
    at [eval]:1:16
    at ContextifyScript.Script.runInThisContext (vm.js:50:33)
    at Object.runInThisContext (vm.js:139:38)
    at Object.<anonymous> ([eval]-wrapper:6:22)
    at Module._compile (module.js:600:30)
    at evalScript (bootstrap_node.js:480:27)
    at startup (bootstrap_node.js:177:9)
    at bootstrap_node.js:626:3
Abort (core dumped)
[root@ff16a225-3889-4644-a2a7-b5b963bcdded ~/dev/node]# ls -lt /var/tmp/cores/
total 48942
-rw------- 1 root root 59272149 Sep 12 14:59 core.node.28515
-rw------- 1 root root 58714981 Sep 12 14:59 core.node.28518
[root@ff16a225-3889-4644-a2a7-b5b963bcdded ~/dev/node]# pstack /var/tmp/cores/core.node.28515 
core '/var/tmp/cores/core.node.28515' of 28515: ./node --abort-on-unhandled-rejection -e Promise.reject(new Error('boo
-----------------  lwp# 1 / thread# 1  --------------------
 ffffbf7fff22184a _lwp_kill () + a
 ffffbf7fff1b5960 raise (6) + 20
 ffffbf7fff18faa8 abort () + 98
 0000000000cf682c _ZN4node12_GLOBAL__N_113CloseCoreDumpERKN2v820FunctionCallbackInfoINS1_5ValueEEE () + 22c
 0000000001260f4b _ZN2v88internal25FunctionCallbackArguments4CallEPFvRKNS_20FunctionCallbackInfoINS_5ValueEEEE () + 14b
 00000000012e26d5 _ZN2v88internal12_GLOBAL__N_119HandleApiCallHelperILb0EEENS0_11MaybeHandleINS0_6ObjectEEEPNS0_7IsolateENS0_6HandleINS0_10HeapObjectEEESA_NS8_INS0_20FunctionTemplateInfoEEENS8_IS4_EENS0_16BuiltinArgumentsE () + 155
 00000000012e2f8f _ZN2v88internal21Builtin_HandleApiCallEiPPNS0_6ObjectEPNS0_7IsolateE () + bf
 00000447682846fd ???????? ()
 000004476837aa82 ???????? ()
 000004476833f075 ???????? ()
 000004476837a7e1 ???????? ()
 000004476833f075 ???????? ()
 000004476837a7e1 ???????? ()
 000004476833f075 ???????? ()
 000004476837aa82 ???????? ()
 000004476833f075 ???????? ()
 0000044768379e29 ???????? ()
 000004476833f075 ???????? ()
 0000044768398052 ???????? ()
 000004476833f075 ???????? ()
 000004476837a7e1 ???????? ()
 000004476833f075 ???????? ()
 000004476833e339 ???????? ()
 000004476828412d ???????? ()
 0000000001662a3c _ZN2v88internal9Execution4CallEPNS0_7IsolateENS0_6HandleINS0_6ObjectEEES6_iPS6_ () + 11c
 000000000124418a _ZN2v88Function4CallENS_5LocalINS_7ContextEEENS1_INS_5ValueEEEiPS5_ () + 1fa
 0000000001252171 _ZN2v88Function4CallENS_5LocalINS_5ValueEEEiPS3_ () + 41
 0000000000cfc461 _ZN4node15LoadEnvironmentEPNS_11EnvironmentE () + 231
 0000000000d06dec _ZN4node5StartEP9uv_loop_siPKPKciS5_ () + 8ac
 0000000000cff286 _ZN4node5StartEiPPc () + 156
 0000000000cca36c _start () + 6c
-----------------  lwp# 2 / thread# 2  --------------------
 ffffbf7fff21a617 __lwp_park () + 17
 ffffbf7fff213b6b cond_wait_queue (2c74e98, 2c74e80, 0) + 5b
 ffffbf7fff214011 cond_wait_common (2c74e98, 2c74e80, 0) + 1d1
 ffffbf7fff2141c2 __cond_wait (2c74e98, 2c74e80) + 52
 ffffbf7fff21425a cond_wait (2c74e98, 2c74e80) + 2a
 ffffbf7fff214295 pthread_cond_wait (2c74e98, 2c74e80) + 15
 000000000115cd19 uv_cond_wait () + 9
 0000000000d4eedb _ZN4nodeL16BackgroundRunnerEPv () + ab
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e0240) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
-----------------  lwp# 3 / thread# 3  --------------------
 ffffbf7fff21a617 __lwp_park () + 17
 ffffbf7fff213b6b cond_wait_queue (2c74e98, 2c74e80, 0) + 5b
 ffffbf7fff214011 cond_wait_common (2c74e98, 2c74e80, 0) + 1d1
 ffffbf7fff2141c2 __cond_wait (2c74e98, 2c74e80) + 52
 ffffbf7fff21425a cond_wait (2c74e98, 2c74e80) + 2a
 ffffbf7fff214295 pthread_cond_wait (2c74e98, 2c74e80) + 15
 000000000115cd19 uv_cond_wait () + 9
 0000000000d4eedb _ZN4nodeL16BackgroundRunnerEPv () + ab
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e0a40) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
-----------------  lwp# 4 / thread# 4  --------------------
 ffffbf7fff21a617 __lwp_park () + 17
 ffffbf7fff213b6b cond_wait_queue (2c74e98, 2c74e80, 0) + 5b
 ffffbf7fff214011 cond_wait_common (2c74e98, 2c74e80, 0) + 1d1
 ffffbf7fff2141c2 __cond_wait (2c74e98, 2c74e80) + 52
 ffffbf7fff21425a cond_wait (2c74e98, 2c74e80) + 2a
 ffffbf7fff214295 pthread_cond_wait (2c74e98, 2c74e80) + 15
 000000000115cd19 uv_cond_wait () + 9
 0000000000d4eedb _ZN4nodeL16BackgroundRunnerEPv () + ab
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e1240) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
-----------------  lwp# 5 / thread# 5  --------------------
 ffffbf7fff21a617 __lwp_park () + 17
 ffffbf7fff213b6b cond_wait_queue (2c74e98, 2c74e80, 0) + 5b
 ffffbf7fff214011 cond_wait_common (2c74e98, 2c74e80, 0) + 1d1
 ffffbf7fff2141c2 __cond_wait (2c74e98, 2c74e80) + 52
 ffffbf7fff21425a cond_wait (2c74e98, 2c74e80) + 2a
 ffffbf7fff214295 pthread_cond_wait (2c74e98, 2c74e80) + 15
 000000000115cd19 uv_cond_wait () + 9
 0000000000d4eedb _ZN4nodeL16BackgroundRunnerEPv () + ab
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e1a40) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
-----------------  lwp# 6 / thread# 6  --------------------
 ffffbf7fff21a617 __lwp_park () + 17
 ffffbf7fff20cb63 sema_wait (2b8dd80) + 13
 ffffbf7fff200a78 sem_wait (2b8dd80) + 38
 000000000115cbb4 uv_sem_wait () + 24
 0000000000d95d3a _ZN4node9inspector12_GLOBAL__N_117StartIoThreadMainEPv () + 1a
 ffffbf7fff21a2ba _thrp_setup (ffffbf7ffe7e2240) + 8a
 ffffbf7fff21a5d0 _lwp_start ()
[root@ff16a225-3889-4644-a2a7-b5b963bcdded ~/dev/node]# pstack /var/tmp/cores/core.node.28518 
core '/var/tmp/cores/core.node.28518' of 28518: ./node --abort-on-unhandled-rejection -e Promise.reject(new Error('boo
 ffffbf7fff22184a _lwp_kill () + a
 ffffbf7fff1b5960 raise (6) + 20
 ffffbf7fff18faa8 abort () + 98
 0000000000cf4988 _ZN4node12_GLOBAL__N_121PromiseRejectCallbackEN2v820PromiseRejectMessageE () + 288
 000000000177fb35 _ZN2v88internal7Isolate19ReportPromiseRejectENS0_6HandleINS0_8JSObjectEEENS2_INS0_6ObjectEEENS_18PromiseRejectEventE () + 45
 000000000197640e _ZN2v88internal35Runtime_PromiseRejectEventFromStackEiPPNS0_6ObjectEPNS0_7IsolateE () + ce
 00000447682846fd ???????? ()
 00000447682fcb73 ???????? ()
 0000044768379f7c ???????? ()
 000004476833f075 ???????? ()
 000004476833e339 ???????? ()
 000004476828412d ???????? ()
 0000000001662a3c _ZN2v88internal9Execution4CallEPNS0_7IsolateENS0_6HandleINS0_6ObjectEEES6_iPS6_ () + 11c
 00000000012345cb _ZN2v86Script3RunENS_5LocalINS_7ContextEEE () + 21b
 00000000012522e2 _ZN2v86Script3RunEv () + 32
 0000000000d26868 _ZN4node12_GLOBAL__N_116ContextifyScript11EvalMachineEPNS_11EnvironmentElbbRKN2v820FunctionCallbackInfoINS4_5ValueEEEPNS4_8TryCatchE () + 388
 0000000000d28ba8 _ZN4node12_GLOBAL__N_116ContextifyScript16RunInThisContextERKN2v820FunctionCallbackInfoINS2_5ValueEEE () + 2d8
 0000000001260f4b _ZN2v88internal25FunctionCallbackArguments4CallEPFvRKNS_20FunctionCallbackInfoINS_5ValueEEEE () + 14b
 00000000012e26d5 _ZN2v88internal12_GLOBAL__N_119HandleApiCallHelperILb0EEENS0_11MaybeHandleINS0_6ObjectEEEPNS0_7IsolateENS0_6HandleINS0_10HeapObjectEEESA_NS8_INS0_20FunctionTemplateInfoEEENS8_IS4_EENS0_16BuiltinArgumentsE () + 155
 00000000012e2f8f _ZN2v88internal21Builtin_HandleApiCallEiPPNS0_6ObjectEPNS0_7IsolateE () + bf
 00000447682846fd ???????? ()
 000004476837a0ea ???????? ()
 000004476833f075 ???????? ()
 0000044768379f7c ???????? ()
 000004476833f075 ???????? ()
 000004476837a0ea ???????? ()
 000004476833f075 ???????? ()
 0000044768379948 ???????? ()
 000004476833f075 ???????? ()
 000004476837a0ea ???????? ()
 000004476833f075 ???????? ()
 0000044768398052 ???????? ()
 000004476833f075 ???????? ()
 000004476837a7e1 ???????? ()
 000004476833f075 ???????? ()
 000004476833e339 ???????? ()
 000004476828412d ???????? ()
 0000000001662a3c _ZN2v88internal9Execution4CallEPNS0_7IsolateENS0_6HandleINS0_6ObjectEEES6_iPS6_ () + 11c
 000000000124418a _ZN2v88Function4CallENS_5LocalINS_7ContextEEENS1_INS_5ValueEEEiPS5_ () + 1fa
 0000000001252171 _ZN2v88Function4CallENS_5LocalINS_5ValueEEEiPS3_ () + 41
 0000000000cfc461 _ZN4node15LoadEnvironmentEPNS_11EnvironmentE () + 231
 0000000000d06dec _ZN4node5StartEP9uv_loop_siPKPKciS5_ () + 8ac
 0000000000cff286 _ZN4node5StartEiPPc () + 156
 0000000000cca36c _start () + 6c
[root@ff16a225-3889-4644-a2a7-b5b963bcdded ~/dev/node]#

In the output above, we can see that two core files where generated (I thought only one was supposed to be generated after reading https://github.com/nodejs/node/pull/15335/files#diff-cd53544f44aab2c697bcd7b6a57f23ccR1395), and the one with the promise reject callback on the stack has only one thread, while the other has more than one thread, that is all the threads you'd expected from a node process.

Promise rejections are supposed to be rare, promise rejections that are handled asynchronously are supposed to be very rare so I'd assume the penalty for the average program would be low and I intend to run my own code with this flag when I can.

I'm not sure these are safe assumptions to make. At scale, in pathological cases, or in any other unforeseen circumstance what used to be rare and insignificant can become quite problematic. Moreover, when occurrences of a bug are rare, we want the tools we use to investigate them to be robust, reliable and as complete as possible.

With all the potential problems that were mentioned in this GH issue, it doesn't seem to me that this would solve problems with promises and post-mortem debugging "once and for all".

Would you mind elaborating?

@davepacheco listed many potential problems with this approach with his comments on this PR, and in many other GH issues/PRs.

With this pull request we don't lose any stack/heap information about the promise which I think is pretty neat and it's what I've been advocating for at nodejs/post-mortem#45

It's definitely an interesting experiment and contribution, but as a post-mortem debugging user I don't think I would use it in its current form in production if it landed because of, among other things, the potential problems that @davepacheco and myself presented in this PR.

@mhdawson
Copy link
Member

@uzun FYI as another approach being taken to capture additional info related to promises.

@rnchamberlain
Copy link

fork() only preserves the single calling thread, but since that is the main node loop you'll get the thread and JavaScript stack that you need to diagnose the problem, also the entire heap. Missing the other (e.g libuv) native threads is not a show-stopper. I agree we should mention this in the documentation.

@mhdawson
Copy link
Member

@uzun can you also run the test cases you were building in nodejs/post-mortem#45 with this to see what the results are.

@benjamingr
Copy link
Member

@davepacheco

I've made a couple of notes on the implementation, but just to summarize: this has come up several times, and major issues were brought up with the approach.

The idea in the thread you linked to there was to start the cluster and treat the first unhandled rejection differently. This is similar (by forking the process and relying on CoW) but I think the performance profile can be very different. I think in the vast majority of bugs this approach can work well.


@misterdjules

From man -s 2 fork on SmartOS:

You are absolutely right, I'm going to send a very angry mail to my old CS professor :D Thanks.

Promise rejections are supposed to be rare, promise rejections that are handled asynchronously are supposed to be very rare so I'd assume the penalty for the average program would be low and I intend to run my own code with this flag when I can.

I'm not sure these are safe assumptions to make. At scale, in pathological cases, or in any other unforeseen circumstance what used to be rare and insignificant can become quite problematic. Moreover, when occurrences of a bug are rare, we want the tools we use to investigate them to be robust, reliable and as complete as possible.

I agree with everything you've written. I think that at this point we have to make painful tradeoffs and Node.js has been reasonably good at making those unsafe assumptions responsibly. I think determinism when debugging is extremely important and I acknowledge the value of postmortem analysis.

I don't think there is a way to solve the fundamental issue with errors thrown asynchronously (would a util.throwSynchronouslyOnRejection(promise) (name pending) help?). It's likely that the majority of Node.js users are using promises in some way. This might not be some big company but I know several companies (including the one I'm working for) that use promises in large scale projects with concurrency in the millions. These people need a good post-mortem story.

Given all that - I would really like an actionable list of items from the post-mortem side of things that should happen. I know very little beyond a consumer perspective about post-mortem analysis but I've been involved with promise work for more than 4 years. Except for @mhdawson the post-mortem working group has been awfully silent about making post-mortem analysis better with promises in the last year.

By "once and for all" I meant achieving the best outcome given these constraints. I realize this might require a lot of work but I think it's worth exploring.


@rnchamberlain

fork() only preserves the single calling thread, but since that is the main node loop you'll get the thread and JavaScript stack that you need to diagnose the problem, also the entire heap. Missing the other (e.g libuv) native threads is not a show-stopper. I agree we should mention this in the documentation.

If this is sufficient in the general case it sounds good to me. I've only used core dumps a handful of times in order to debug things and it was always the JavaScript or Buffers that were problematic and by the point Buffers were problematic they were already in the JavaScript heap.

@davepacheco
Copy link
Contributor

Given all that - I would really like an actionable list of items from the post-mortem side of things that should happen. I know very little beyond a consumer perspective about post-mortem analysis but I've been involved with promise work for more than 4 years. Except for @mhdawson the post-mortem working group has been awfully silent about making post-mortem analysis better with promises in the last year.

First of all, I appreciate all the work people have done to try to satisfy both the postmortem debugging use-cases and people's desires to use promises. I know this PR represents a lot of work, and as a very heavy postmortem debugging user I'm grateful that Node is trying so hard to support this use-case.

I don't think it's fair to say that members of the postmortem working group have been silent on this issue. Several of us have weighed in on several different threads now explaining why we believe it's impossible to satisfy both use cases. We're not saying that it's hard, or that people just haven't figured out how to do it yet, but that we believe the requirements are mutually exclusive. I understand people want to explore the space anyway. That's great! But I can't give you "an actionable list of items from the post-mortem side of things that should happen" because I don't think there's anything that can be done to resolve the starting constraints. I think @addaleax is correct in concluding that this is likely as good as it gets. Unfortunately, it's still a non-starter for a lot of postmortem debugging users. I think the best we can hope for is that this approach is helpful for some users, and that the documentation warns them about the limitations and risks of this approach so that they're not caught off guard if they wind up with dozens of extra processes in production or core files that don't have some of the state they expect.

Relatedly, I think it would be useful to know if there are people who use core-file-based postmortem debugging today who are ready to deploy with the proposed --abort-on-unhandled-rejection. It would be especially useful to know if any of those people have already used this in its prototype form to debug real issues. I know I haven't waded into use of promises much myself yet because of these issues, and it sounds like some of the people interested in this proposal haven't used core files much before, so it would be good to know from somebody firmly in both camps that we're really addressing their issue. I apologize for not knowing already if there are end users consuming this work already!

For those new to these discussions, to summarize the problem again most concisely: postmortem debugging with --abort-on-uncaught-exception collects the entire JavaScript and native state of the program (including all threads) in the precise context of the failure (so that stack state is preserved) with zero runtime cost until the failure happens, while promises explicitly specify that the program must continue executing after the point of the failure, meaning that state is necessarily mutated. Forking is the closest thing to having it both ways, but it has all of the drawbacks brought up here -- it's a heavy cost and it doesn't include all of the same state.

@misterdjules
Copy link

Relatedly, I think it would be useful to know if there are people who use core-file-based postmortem debugging today who are ready to deploy with the proposed --abort-on-unhandled-rejection. It would be especially useful to know if any of those people have already used this in its prototype form to debug real issues. I know I haven't waded into use of promises much myself yet because of these issues, and it sounds like some of the people interested in this proposal haven't used core files much before, so it would be good to know from somebody firmly in both camps that we're really addressing their issue.

It seems @yunong would potentially be a good person to ask.

@rnchamberlain
Copy link

I tried the PR with a simple testcase, on Ubuntu 16.04. Running vanilla, as expected I get
(node:20138) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1): FAILED and the application continues.

With --abort-on-unhandled-rejection I get a core dump, but it is triggered from the assert in node.cc line 1306: CHECK_EQ(promise->State(), Promise::kRejected); so the dump is not of the saved state from the forked process. I'll do some more digging.

@addaleax
Copy link
Member Author

@rnchamberlain do you have the example code you used for that?

@rnchamberlain
Copy link

@BridgeAR
Copy link
Member

Ping @addaleax

@rnchamberlain
Copy link

The assert in node.cc line 1306: CHECK_EQ(promise->State(), Promise::kRejected); happens because promise->State() is zero (enum PromiseState { kPending, kFulfilled, kRejected }. Then we abort immediately. The calling code that supplies the promise is this:

void PromiseRejectCallback(PromiseRejectMessage message) {
  Local<Promise> promise = message.GetPromise();
  Isolate* isolate = promise->GetIsolate();
  Local<Value> value = message.GetValue();
  Local<Integer> event = Integer::New(isolate, message.GetEvent());

  Environment* env = Environment::GetCurrent(isolate);
  Local<Function> callback = env->promise_reject_function();

  if (message.GetEvent() == v8::kPromiseRejectWithNoHandler) {
    OpenCoreDump(env, promise);
  }

@addaleax
Copy link
Member Author

addaleax commented Oct 3, 2017

@rnchamberlain I still can’t reproduce this :( It sounds like a V8 bug though, because promise->State() == Promise::kRejected should be guaranteed by the V8 API when it calls its PromiseRejectCallback, if I understand it correctly? @gsathya am I mistaken about that?

@gsathya
Copy link
Member

gsathya commented Oct 3, 2017

https://cs.chromium.org/chromium/src/v8/src/builtins/builtins-promise-gen.cc?rcl=9c4f80520da791c712944265478d0272293e5d2b&l=980 does call out to PromiseRejectCallback before fulfilling the promise, so it is possible.

It does seem odd that this isn't guaranteed by the API. Filed https://bugs.chromium.org/p/v8/issues/detail?id=6880 to track this. Thanks @addaleax !

@uzun
Copy link

uzun commented Oct 4, 2017

We reproduced this in macOS 10.13.

/Users/mu/Downloads/node/out/Release/node[810]: ../src/node.cc:1306:void node::(anonymous namespace)::OpenCoreDump(node::Environment , Localv8::Promise): Assertion (promise->State()) == (Promise::kRejected)' failed.
1: node::Abort() [/Users/mu/Downloads/node/./node]
2: node::MakeCallback(v8::Isolate
, v8::Localv8::Object, char const*, int, v8::Localv8::Value, node::async_context) [/Users/mu/Downloads/node/./node]
3: node::(anonymous namespace)::PromiseRejectCallback(v8::PromiseRejectMessage) [/Users/mu/Downloads/node/./node]
4: v8::internal::Isolate::ReportPromiseReject(v8::internal::Handlev8::internal::JSObject, v8::internal::Handlev8::internal::Object, v8::PromiseRejectEvent) [/Users/mu/Downloads/node/./node]
5: v8::internal::Runtime_ReportPromiseReject(int, v8::internal::Object**, v8::internal::Isolate
) [/Users/mu/Downloads/node/./node]
6: 0x38667f2046fd
7: 0x38667f2853a5
8: 0x38667f284173
9: 0x38667f2be339
Abort trap: 6 (core dumped)

@addaleax
Copy link
Member Author

addaleax commented Oct 4, 2017

Fwiw it sounds like @gsathya already figured out what the issue in V8 is, so it seems like all we need is an upstream change that we’ll cherry-pick into this PR

Support creating core dumps of unhandled promise rejections on POSIX
systems by keeping `fork()`s of the process alive until it is known
whether they are handled synchronously or not.
@BridgeAR
Copy link
Member

Is there any update here? @gsathya Is there a fix in V8 that we can cherry-pick?

@addaleax
Copy link
Member Author

@BridgeAR I think for now I’m abandoning this in favor of looking into more creative userland solutions … if anybody else thinks this is a good idea and wants to push it over the line, please feel free to do that

@BridgeAR
Copy link
Member

@addaleax thanks for the info. I am closing this therefore. If anyone feels it should be reopened, please feel free to do so.

@BridgeAR BridgeAR closed this Nov 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ Issues and PRs that require attention from people who are familiar with C++. cli Issues and PRs related to the Node.js command line interface. post-mortem Issues and PRs related to the post-mortem diagnostics of Node.js. process Issues and PRs related to the process subsystem. promises Issues and PRs related to ECMAScript promises. semver-minor PRs that contain new features and should be released in the next minor version.
Projects
None yet
Development

Successfully merging this pull request may close these issues.