Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix zombie healtcheck threads #597

Closed
Tracked by #603 ...
josecelano opened this issue Jan 11, 2024 · 6 comments
Closed
Tracked by #603 ...

Fix zombie healtcheck threads #597

josecelano opened this issue Jan 11, 2024 · 6 comments
Assignees
Labels
Bug Incorrect Behavior

Comments

@josecelano
Copy link
Member

josecelano commented Jan 11, 2024

Parent issue: #582

The live demo is not working because the services crash periodically due to high memory consumption.

I'm not sure if the problem comes from having a lot of health check threads, but it's something we have to fix anyway.

live-demo-console-log.txt

In the past, we have a similar problem when you disable the cronjob to remove peerless torrents. See these related issues:

@josecelano josecelano added the Bug Incorrect Behavior label Jan 11, 2024
josecelano added a commit to josecelano/torrust-tracker that referenced this issue Jan 12, 2024
This should not be merged. The intention is only to document the bug.
@josecelano
Copy link
Member Author

josecelano commented Jan 12, 2024

Hi @da2ce7 @WarmBeer. The problem was solved after merging this PR.

The healthcheck does not have a timeout for the request, so it waits indefinitely. As we discussed in our meeting.

It happened that:

  • The UDP was actually not working on production, but working only on tests. Because the bootstrapping is not exactly the same.
  • The healthcheck was probably waiting because the socket was open but the server was not running. Binding the address and running the listener are not done in the same thread. That's the reason why the request was made and not refused but it did not get any response (I guess).

The following is the latest version but in the previous one binding and running the server are also done in different threads.

impl Udp {
    /// It starts the UDP server instance with graceful shutdown.
    ///
    /// # Panics
    ///
    /// It panics if unable to bind to udp socket, and get the address from the udp socket.
    /// It also panics if unable to send address of socket.
    async fn start_with_graceful_shutdown(
        tracker: Arc<Tracker>,
        bind_to: SocketAddr,
        tx_start: Sender<Started>,
        rx_halt: Receiver<Halted>,
    ) -> JoinHandle<()> {
        let socket = Arc::new(UdpSocket::bind(bind_to).await.expect("Could not bind to {self.socket}."));
        let address = socket.local_addr().expect("Could not get local_addr from {binding}.");

        info!(target: "UDP Tracker", "Starting on: udp://{}", address);

        let running = tokio::task::spawn(async move {
            let halt = tokio::task::spawn(async move {
                debug!(target: "UDP Tracker", "Waiting for halt signal for socket address: udp://{address}  ...");

                shutdown_signal_with_message(
                    rx_halt,
                    format!("Shutting down UDP server on socket address: udp://{address}"),
                )
                .await;
            });

            let listen = async move {
                debug!(target: "UDP Tracker", "Waiting for packets on socket address: udp://{address} ...");

                loop {
                    let mut data = [0; MAX_PACKET_SIZE];
                    let socket_clone = socket.clone();

                    match socket_clone.recv_from(&mut data).await {
                        Ok((valid_bytes, remote_addr)) => {
                            let payload = data[..valid_bytes].to_vec();

                            debug!(target: "UDP Tracker", "Received {} bytes", payload.len());
                            debug!(target: "UDP Tracker", "From: {}", &remote_addr);
                            debug!(target: "UDP Tracker", "Payload: {:?}", payload);

                            let response = handle_packet(remote_addr, payload, &tracker).await;

                            Udp::send_response(socket_clone, remote_addr, response).await;
                        }
                        Err(err) => {
                            error!("Error reading UDP datagram from socket. Error: {:?}", err);
                        }
                    }
                }
            };

            pin_mut!(halt);
            pin_mut!(listen);

            tx_start
                .send(Started { address })
                .expect("the UDP Tracker service should not be dropped");

            tokio::select! {
                _ = & mut halt => { debug!(target: "UDP Tracker", "Halt signal spawned task stopped on address: udp://{address}"); },
                () = & mut listen => { debug!(target: "UDP Tracker", "Socket listener stopped on address: udp://{address}"); },
            }
        });

        info!(target: "UDP Tracker", "Started on: udp://{}", address);

        running
    }

Just for the record, we did that because we wanted to send back the bound address when the configuration uses port 0 and it's assigned a free port dynamically.

I'm going to keep this issue to add a timeout to the http_health_check.rs binary.

When I implemented the http_health_check I did not add a timeout because I thought it was included in the HEALTCHECK instruction in the Containerfile.

HEALTHCHECK --interval=5s --timeout=5s --start-period=3s --retries=3 \  
  CMD /usr/bin/http_health_check http://localhost:${HEALTH_CHECK_API_PORT}/health_check \
    || exit 1

But docker does not abort the command. The timeout only means docker does not consider the service to be healthy if after 5 seconds it does not respond. So we need anyway the timeout in the http_health_check. I'm going to add a hardcoded timeout of 5 seconds. It could be a parameter in the future.

I've created a PR to reproduce and document the problem: #602

Finally, there are more places where not having timeouts could be a problem. Some of them are documented like this issue in the Index. But there could be more of them. For example, we do not have a timeout for handling the UDP request, if I'm not wrong.

I'm going to open a new issue to collect and track all places where we could have problems due to missing timeouts.

@josecelano
Copy link
Member Author

image

@josecelano
Copy link
Member Author

@josecelano
Copy link
Member Author

josecelano commented Jan 12, 2024

6 hours

image

@josecelano
Copy link
Member Author

24 hours

image

@josecelano josecelano self-assigned this Jan 16, 2024
@josecelano
Copy link
Member Author

7 days

image

Fixed via: #608

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Incorrect Behavior
Projects
Status: Done
Development

No branches or pull requests

1 participant