Decouple the sending of probes from the latency reporting in the NodeLatencyMonitor #6570

antoninbas · 2024-07-29T18:42:40Z

At the moment, the NodeLatencyMonitor in the Agent reports latency measurements immediately after sending ICMP probes:

Lines 444 to 451 in 1907856

    case <-tickerCh:
        // Try to send pingAll signal
        m.pingAll(ipv4Socket, ipv6Socket)
        // We do not delete IPs from nodeIPLatencyMap as part of the Node delete event handler
        // to avoid consistency issues and because it would not be sufficient to avoid stale entries completely.
        // This means that we have to periodically invoke DeleteStaleNodeIPs to avoid stale entries in the map.
        m.latencyStore.DeleteStaleNodeIPs()
        m.report()

I believe this is not ideal, because when outputting the NodeLatencyStats, the values for the lastRecvTime and lastSendTime fields can be a bit confusing or misleading:

kubectl get nodelatencystats/kind-worker -o yaml

apiVersion: stats.antrea.io/v1alpha1
kind: NodeLatencyStats
metadata:
  creationTimestamp: null
  name: kind-worker
peerNodeLatencyStats:
- nodeName: kind-control-plane
  targetIPLatencyStats:
  - lastMeasuredRTTNanoseconds: 5837000
    lastRecvTime: "2024-07-26T22:40:03Z"
    lastSendTime: "2024-07-26T22:40:33Z"
    targetIP: 10.10.0.1
- nodeName: kind-worker2
  targetIPLatencyStats:
  - lastMeasuredRTTNanoseconds: 4704000
    lastRecvTime: "2024-07-26T22:40:03Z"
    lastSendTime: "2024-07-26T22:40:33Z"
    targetIP: 10.10.2.1

We are "always" going to have lastRecvTime < lastSendTime, because we always update NodeLatencyStats right after sending a new probe (before the response has had a chance to be received). Ideally, and especially with very low inter-Node latency like we have here (a few ms), we would observe timestamps which are very close to each other or identical most of the time. This can be achieved by giving the NodeLatencyMonitor enough time to receive and process the response before calling m.report().

Another advantage of decoupling the sending of probes from the latency reporting would be the ability to enforce a minimum time interval between two consecutive reports. At the moment it is possible for someone to set pingIntervalSeconds to 1s (minimum supported value in the NodeLatencyMonitor CRD). In turn, this would cause m.report() to be invoked every second. That may be a bit too frequent for a monitoring tool, especially for a large cluster. So we could consider enforcing a minimum interval of 10s (even though that would mean that values of pingIntervalSeconds under 10s are not very useful).


antoninbas added the "good first issue" label Jul 29, 2024

Yushmanth-reddy mentioned this issue Aug 13, 2024

Added delay before reporting and enforced a minimum interval of 10s #6608

Closed
