Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IH-533: Remove usage of forkexecd daemon to execute processes #5995

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

freddy77
Copy link
Collaborator

Forkexecd was written to avoid some issues with Ocaml and multi-threading.
Instead use C code to launch processes and avoid these issues.
Interface remains unchanged from Ocaml side but implementation rely entirely on C code.
vfork() is used to avoid performance memory issue.
Reap of the processes are done directly.
Code automatically reap child processes to avoid zombies.
One small helper is used to better separate Ocaml and C code and handling syslog redirection. This allows to better debug in case of issues.
Syslog handling is done in a separate process allowing to restart the toolstack and keep launched programs running; note that even with forkexecd daemon one process was used for this purpose.
Code tries to keep compatibility with forkexecd, in particular:

  • SIGPIPE is ignored in the parent;
  • /dev/null is open with O_WRONLY even for stdin;
  • file descriptors are limited to 1024.

We use close_range (if available) to reduce system calls to close file descriptors.
Cgroup is set to avoid systemd closing processes on toolstack restart.
There's a fuzzer program to check file remapping algorithm; for this reason the algorithm is in a separate file.

To turn internal debug on, you need to set FORKEXECD_DEBUG_LOGS C preprocessor macro to 1.

The syslog feature allows to run a program and redirect its output/error
stream to system log. It's triggered passing "syslog_stdout" argument to
"Forkhelpers" functions like "execute_command_get_output".
To test this feature use a small preload library to redirect syslog writing.
This allows to test program without changing it.
The log is redirected to /tmp/xyz instead of /dev/log. The name was chosen
to allow for future static build redirection.
The C program is used only for the test, so the code style take into account
this (specifically it does not try to handle all redirect situation and all
error paths).

Signed-off-by: Frediano Ziglio <frediano.ziglio@cloud.com>
// child
if (pipe_fds[0] >= 0)
close(pipe_fds[0]);
execve(args[0], args, envs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now a lot closer to vfork, and should avoid previous concerns about what functions can be safely called in-between :)

(I haven't reviewed the rest yet)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to reduce that part (C code in Ocaml) the only type passed is a list of string (2 parameters) and the code doing the Ocaml -> C transformation (and vice versa) is in a separated function from the C function doing the job, as suggested by Ocaml manual.

@andyhhp
Copy link
Contributor

andyhhp commented Sep 16, 2024

Very happy to see this making progress again. Two questions:

  1. When you say "SIGPIPE is ignored in the parent;", which process is "the parent" in this context? Xapi definitely wants to be ignoring SIGPIPE, but others probably spawned processes should not have SIGPIPIE played with behind their back.
  2. "/dev/null is open with O_WRONLY even for stdin;" This is a plain bug and wants fixing. Does it want doing before or after this change? (i.e. does it want backporting.)

}

/*
* Wait a process with a given timeout.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't this be implemented in OCaml? Is a binding to some function missing for this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could, but it was done in C for resource constraints. Launching a thread in higher level languages is not that light (note that there are additional code even to reduce C resource usage).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason about the thread is not that easy to grasp.

On newer Linux kernel (similarly to Windows) you could use pidfd and a simple poll to wait for a process easily (on Windows you use the handle to the process). But that would require a newer kernel with newer Glibc (and Linux, but that should not be an issue). Waiting a process without pidfd and threads would require some global setting (like a signal handler) which, done in a library, is not nice (library should not change global settings) and racy.

Start to think this would be nice as a comment in the code...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do these threads run? We wouldn't want to spawn additional C threads inside XAPI, if anything goes wrong there it could crash XAPI. Can we move the monitoring threads to forkexecd and query its status via the socket? (although if the spawned process is not a child of forkexec, monitoring its status might be more difficult).

In fact in the future we probably want to move most long running processes to systemd transient units (we just need a newer systemd that supports file descriptor passing), and we can leave the monitoring (and logging) to systemd.

We could also have a single dedicated OCaml thread in XAPI, that is responsible for waiting for all children and notifying the appropriate condition variables. In fact would calling Unix.wait in a loop on a dedicated thread achieve that? It wouldn't increase resource usage by much (a single thread should have a fixed cost), in fact with lots of processes it might even decrease the resources used compared to launching a C thread for each one.

This should then further reduce the size of the C code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, your comment does not make sense, the intention is to remove forkexed, it always has been, not to use it in different ways.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a dedicated thread in XAPI doesn't require retaining forkexecd, these were 2 separate suggestions:

  • keep the monitoring and vforking in forkexecd if that simplifies the code
  • use a dedicated thread in XAPI to avoid the need for extra C code

We can do the latter without doing the former.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That explanation makes sense. Your main target is reducing C code. Specifically, this part waiting for a process, potentially with a timeout. However, my Ocaml skills are not that great, so I would probably end up with pretty bad Ocaml code. Can you propose something?

Copy link
Contributor

@edwintorok edwintorok Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can look in more detail on Friday, in principle something like this (but untested, and probably needs a bit more error handling):

let wait_pid_with_timeout ~timeout pid =
  let need_to_kill = Atomic.make true in
  let kill_pid () =
     if Atomic.get need_to_kill then
     try Unix.kill Sys.SIGKILL pid
    with (Unix.Unix_error(Unix.ESRCH, _, _) -> ()
  in
  let some_unique_id = .... in
  let finally () =  Xapi_periodic_scheduler.remove_from_queue some_unique_id in
  Xapi_periodic_scheduler.add_to_queue some_unique_id Xapi_periodic_scheduler.OneShot timeout kill_pid;
  Fun.protect ~finally (fun () ->
    let pid, status = Unix.waitpid [] pid in
    Atomic.set need_to_kill false;
    status
  )

This has the race condition between a process exiting and the timeout kicking in just at the same time (although it is unlikely that we would've also went through all pids, rolled back and got something else meanwhile in that spot).
I see you improved that in your C code, so we could perhaps add a small binding to Unixext.waitid to waitid, so that we can use waitid instead of waitpid, and then do the 2 step:

let pid, status = Unixext.(waitid P_PID [WEXITED | WNOWAIT] pid) in
Atomic.set need_to_kill false;
let pid, status = Unix.waitpid [] pid in

@freddy77
Copy link
Collaborator Author

freddy77 commented Sep 17, 2024

Very happy to see this making progress again. Two questions:

1. When you say "SIGPIPE is ignored in the parent;", which process is "the parent" in this context?  Xapi definitely wants to be ignoring SIGPIPE, but others probably spawned processes should not have SIGPIPIE played with behind their back.

2. "/dev/null is open with O_WRONLY even for stdin;" This is a plain bug and wants fixing.  Does it want doing before or after this change?  (i.e. does it want backporting.)

Yes, normally you don't want these behaviors, however I would prefer to not introduce any regression that could cause frictions or issues. I bet that changing 2) won't cause any issue, but on 1) it's not exactly easy to say. For instance, thinking about migration we launch a program that uses sockets and at some point socket will be closed, potentially causing SIGPIPE to get raised. Forkexecd was there "protecting" SIGPIPE, removing this protection would cause crashes if the program itself forgot to handle that signal.

They can easily get fixed later.

Testing was the main reason this project was stuck for months. Now the lab is in a better shape, and we fixed some issue, but I would prefer trying avoiding that path again.


pthread_mutex_lock(&tm->mtx);
int res = tm->stop ? 0:
pthread_cond_timedwait(&tm->cond, &tm->mtx, &tm->deadline);
Copy link
Contributor

@edwintorok edwintorok Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a loop to check for tm->stop becoming true, or the deadline passing (ETIMEDOUT).

pthread_cond_(timed)wait can be woken up spuriously (and in fact one of the workarounds for the broken pthread in glibc >2.27 uses spurious wakeups to reduce the likelihood of hitting the glibc bug, so expect a lot more spurious wakeups on newer libc in XS9), so we always need to check that the change we've been waiting for has actually happened.

// Avoiding reaping the process will create a zombie process so
// the KILL would be directed to that.
siginfo_t info;
waitid(P_PID, pid, &info, WEXITED|WNOWAIT);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should handle the errors from this

} timeout_kill;

static void *
thread_proc_timeout_kill(void *arg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XAPI already has a singleshot timer where we can register something to happen after a given time (in OCaml), so we don't necessarily need to spawn a separate thread to do that.
See 'make_timeboxed_rpc' for an example on how to use it.

Forkexecd was written to avoid some issues with Ocaml and
multi-threading.
Instead use C code to launch processes and avoid these issues.
Interface remains unchanged from Ocaml side but implementation rely
entirely on C code.
vfork() is used to avoid performance memory issue.
Reap of the processes are done directly.
Code automatically reap child processes to avoid zombies.
One small helper is used to better separate Ocaml and C code and
handling syslog redirection. This allows to better debug in
case of issues.
Syslog handling is done in a separate process allowing to restart
the toolstack and keep launched programs running;
note that even with forkexecd daemon one process was used for this
purpose.
Code tries to keep compatibility with forkexecd, in particular:
- SIGPIPE is ignored in the parent;
- /dev/null is open with O_WRONLY even for stdin;
- file descriptors are limited to 1024.
We use close_range (if available) to reduce system calls to close
file descriptors.
Cgroup is set to avoid systemd closing processes on toolstack restart.
There's a fuzzer program to check file remapping algorithm; for this
reason the algorithm is in a separate file.

To turn internal debug on you need to set FORKEXECD_DEBUG_LOGS C
preprocessor macro to 1.

Signed-off-by: Frediano Ziglio <frediano.ziglio@cloud.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants