Fix zombie healthcheck threads #597
This should not be merged. The intention is only to document the bug.
Hi @da2ce7, @WarmBeer. The problem was solved after merging this PR. The healthcheck does not have a timeout for the request, so it waits indefinitely, as we discussed in our meeting. What happened was:
The following is the latest version, but in the previous one binding and running the server were also done in different threads.

```rust
impl Udp {
    /// It starts the UDP server instance with graceful shutdown.
    ///
    /// # Panics
    ///
    /// It panics if unable to bind to the UDP socket or to get the address from it.
    /// It also panics if unable to send the address of the socket.
    async fn start_with_graceful_shutdown(
        tracker: Arc<Tracker>,
        bind_to: SocketAddr,
        tx_start: Sender<Started>,
        rx_halt: Receiver<Halted>,
    ) -> JoinHandle<()> {
        let socket = Arc::new(UdpSocket::bind(bind_to).await.expect("Could not bind to {self.socket}."));
        let address = socket.local_addr().expect("Could not get local_addr from {binding}.");

        info!(target: "UDP Tracker", "Starting on: udp://{}", address);

        let running = tokio::task::spawn(async move {
            let halt = tokio::task::spawn(async move {
                debug!(target: "UDP Tracker", "Waiting for halt signal for socket address: udp://{address} ...");

                shutdown_signal_with_message(
                    rx_halt,
                    format!("Shutting down UDP server on socket address: udp://{address}"),
                )
                .await;
            });

            let listen = async move {
                debug!(target: "UDP Tracker", "Waiting for packets on socket address: udp://{address} ...");

                loop {
                    let mut data = [0; MAX_PACKET_SIZE];
                    let socket_clone = socket.clone();

                    match socket_clone.recv_from(&mut data).await {
                        Ok((valid_bytes, remote_addr)) => {
                            let payload = data[..valid_bytes].to_vec();

                            debug!(target: "UDP Tracker", "Received {} bytes", payload.len());
                            debug!(target: "UDP Tracker", "From: {}", &remote_addr);
                            debug!(target: "UDP Tracker", "Payload: {:?}", payload);

                            let response = handle_packet(remote_addr, payload, &tracker).await;

                            Udp::send_response(socket_clone, remote_addr, response).await;
                        }
                        Err(err) => {
                            error!("Error reading UDP datagram from socket. Error: {:?}", err);
                        }
                    }
                }
            };

            pin_mut!(halt);
            pin_mut!(listen);

            tx_start
                .send(Started { address })
                .expect("the UDP Tracker service should not be dropped");

            tokio::select! {
                _ = &mut halt => { debug!(target: "UDP Tracker", "Halt signal spawned task stopped on address: udp://{address}"); },
                () = &mut listen => { debug!(target: "UDP Tracker", "Socket listener stopped on address: udp://{address}"); },
            }
        });

        info!(target: "UDP Tracker", "Started on: udp://{}", address);

        running
    }
}
```

Just for the record, we did that because we wanted to send back the bound address when the configuration uses port 0 and a free port is assigned dynamically.

I'm going to keep this issue open to add a timeout to the `http_health_check.rs` binary. When I implemented the healthcheck, I assumed the `timeout` option in the Docker `HEALTHCHECK` instruction would abort the command:

```dockerfile
HEALTHCHECK --interval=5s --timeout=5s --start-period=3s --retries=3 \
    CMD /usr/bin/http_health_check http://localhost:${HEALTH_CHECK_API_PORT}/health_check \
    || exit 1
```

But docker does not abort the command. The timeout only means docker does not consider the service to be healthy if it does not respond within 5 seconds. So we need the timeout in the `http_health_check` binary anyway.

I've created a PR to reproduce and document the problem: #602

Finally, there are more places where not having timeouts could be a problem. Some of them are documented, like this issue in the Index, but there could be more. For example, we do not have a timeout for handling the UDP request, if I'm not wrong. I'm going to open a new issue to collect and track all the places where we could have problems due to missing timeouts.
Fixed via: #608
Parent issue: #582
The live demo is not working because the services crash periodically due to high memory consumption.
I'm not sure if the problem comes from having a lot of health check threads, but it's something we have to fix anyway.
live-demo-console-log.txt
In the past, we had a similar problem when the cronjob that removes peerless torrents was disabled. See these related issues: