pawel_stenka
Dynatrace Contributor

Abstract

In many production environments, OneAgent retransmission metrics may diverge from those reported by other tools, such as Wireshark's tcp.analysis.retransmission filter or the output of netstat -s. There are several reasons for this. To interpret the differences, you need a good understanding of TCP retransmission, the types of retransmission that exist, and how they relate to the retransmission metrics calculated by the OneAgent network module.

 

Problem

This issue concerns the network retransmission metrics. For clarity, this article uses only the Grail metric identifiers and ignores the older Metrics API identifiers. The relevant Grail metrics are:
dt.process.network.packets.re_tx_aggr  - number of packets transmitted as retransmissions
dt.process.network.packets.re_rx_aggr  - number of packets received as retransmissions

Also, it is important to take into account what packets might be subject to retransmission: 

dt.process.network.packets.base_re_tx_aggr  - number of sent retransmission base packets
dt.process.network.packets.base_re_rx_aggr  - number of received retransmission base packets

The four metrics above are absolute packet counts, which can be awkward to compare. When comparing retransmissions between two points, it is better to analyze the percentage of retransmitted packets per process, PGI, host, or network interface. By definition, the percentage of retransmitted packets is given by the following expression.

retransmissions% = dt.process.network.packets.re_tx_aggr / dt.process.network.packets.base_re_tx_aggr * 100%

The full DQL expression that computes this for a given process group instance, for example PROCESS_GROUP_INSTANCE-54FF3ADD0B5EDB11, is fairly complex:

 

timeseries Tx=avg(dt.process.network.packets.re_tx_aggr), nonempty:true, timeframe:"00:30/04:30",
    filter: { matchesValue(dt.entity.process_group_instance, "PROCESS_GROUP_INSTANCE-54FF3ADD0B5EDB11") }
| join [
      timeseries TxBase=avg(dt.process.network.packets.base_re_tx_aggr), nonempty:true, union:true,
          filter: { matchesValue(dt.entity.process_group_instance, "PROCESS_GROUP_INSTANCE-54FF3ADD0B5EDB11") }
    ], kind:leftOuter, on:{timeframe}
| fieldsAdd {Txperc = 100 * (Tx[]/right.TxBase[])}
| fields Txperc, timeframe, interval
| fieldsAdd metricName = "Nginx retransmissions sent out"
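
The element-wise division that the query performs on the two series (Tx[] / right.TxBase[]) can be illustrated outside of DQL. The following is a minimal Python sketch, not part of OneAgent or Grail: the function name, the sample values, and the choice to return None for slots with missing data or a zero base are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def retransmission_percent(
    retransmitted: Sequence[Optional[float]],
    base: Sequence[Optional[float]],
) -> List[Optional[float]]:
    """Element-wise 100 * retransmitted / base, one value per time slot.

    A slot yields None when either input is missing or the base is zero
    (an assumption of this sketch, chosen to avoid division by zero).
    """
    if len(retransmitted) != len(base):
        raise ValueError("both series must cover the same time slots")
    result: List[Optional[float]] = []
    for re_tx, base_tx in zip(retransmitted, base):
        if re_tx is None or base_tx is None or base_tx == 0:
            result.append(None)
        else:
            result.append(100.0 * re_tx / base_tx)
    return result


# Hypothetical sample values, one entry per time slot.
tx = [2.0, 0.0, None, 5.0]              # stands in for re_tx_aggr
tx_base = [400.0, 250.0, 300.0, 0.0]    # stands in for base_re_tx_aggr
percent = retransmission_percent(tx, tx_base)
print(percent)
```

Comparing percentages computed this way is usually more meaningful than comparing the raw counters, because the base traffic volume differs between processes and hosts.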

 

Resolution

 

Clarification of which packets are recognized as retransmissions by OneAgent

  • A retransmitted TCP segment is a segment with a duplicated sequence number within a defined period of time (the maximal retransmission timeout), i.e. it has been sent more than once in a given direction. Duplicated packets are fully visible in the outgoing direction. In the incoming direction, duplicated packets are usually not visible, because the original packet was lost before it reached the destination host.
  • TCP segments without data (data length == 0) are not considered retransmissions. This excludes duplicate ACKs (SACK) and TCP fast retransmission. From a performance point of view, these events are not interesting because the application does not have to stop sending for a longer time and switch to a waiting state. OneAgent focuses only on timeout retransmissions, which reduce network throughput and usually slow down the application.
  • Duplicated TCP SYN and FIN segments are considered retransmissions.
  • TCP keep-alive segments (data size of 1 byte and a duplicated sequence number) are not considered retransmissions.
  • In the incoming direction, out-of-sequence packets can be considered retransmissions.
  • Retransmissions are calculated for incoming or outgoing communication channels.
  • Retransmissions are not calculated for local or forwarded TCP sessions.
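
To make the rules above easier to follow, here is a simplified Python sketch of how they could be applied to segments of a single connection in a single direction. It is only an illustration and not OneAgent's actual implementation: the Segment type, its field names, and the function are hypothetical, the retransmission timeout window and the incoming out-of-sequence case are left out, and every duplicated 1-byte segment is treated as a keep-alive probe.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Segment:
    """Hypothetical, minimal view of a TCP segment."""
    seq: int                          # TCP sequence number
    payload_len: int                  # bytes of TCP payload
    syn: bool = False
    fin: bool = False
    local_or_forwarded: bool = False  # local or forwarded TCP session


def counts_as_retransmission(seg: Segment, seen: Set[int]) -> bool:
    """Apply the listed rules; `seen` holds sequence numbers already
    observed for this connection and direction."""
    if seg.local_or_forwarded:
        return False  # local and forwarded sessions are not evaluated
    # Segments without data are ignored unless they are SYN or FIN.
    if seg.payload_len == 0 and not (seg.syn or seg.fin):
        return False
    duplicate = seg.seq in seen
    seen.add(seg.seq)
    if not duplicate:
        return False
    if seg.syn or seg.fin:
        return True   # duplicated SYN or FIN
    if seg.payload_len == 1:
        return False  # simplification: treated as a keep-alive probe
    return True       # duplicated data segment


seen: Set[int] = set()
segments = [
    Segment(seq=0, payload_len=0, syn=True),    # first SYN
    Segment(seq=0, payload_len=0, syn=True),    # duplicated SYN
    Segment(seq=1, payload_len=0),              # pure ACK, no data
    Segment(seq=1, payload_len=100),            # first data segment
    Segment(seq=1, payload_len=100),            # same data sent again
    Segment(seq=100, payload_len=1),            # keep-alive probe
    Segment(seq=100, payload_len=1),            # repeated keep-alive probe
    Segment(seq=1, payload_len=100, local_or_forwarded=True),
]
flags = [counts_as_retransmission(s, seen) for s in segments]
print(flags)
```

In this sketch only the duplicated SYN and the repeated data segment are counted, which mirrors why OneAgent figures can differ from tools that also flag fast retransmissions or duplicate ACKs.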

 

Reference to other tools

Wireshark with the filter tcp.analysis.retransmission is certainly a more sophisticated tool and offers more options, e.g. out-of-order or spurious retransmission classification. The Wireshark documentation gives more details. It's worth emphasizing that when the OneAgent network module and Wireshark run in parallel, each of them may see a slightly different set of packets. This happens because BPF (libpcap) doesn't guarantee that 100% of packets are captured, due to the limited size of the capture buffers.

As for netstat -s, this tool prints many TCP counters that are global to the TCP/IP stack. Among these counters is the number of retransmitted segments:

root@kpi-server:/var/log/dynatrace/oneagent/os# netstat -s | grep segments
102383510128 segments received
169869495375 segments sent out
425003509 segments retransmitted
5579 bad segments received

As you can see, neither netstat nor Wireshark aggregates retransmission metrics per process. These metrics are usually available only per host or per network adapter.
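
For a rough host-wide figure to set against the per-process percentages, the netstat counters printed above can be turned into a ratio. The Python sketch below is an illustration only: the function name is hypothetical, it assumes the Linux netstat -s wording shown above, and because these counters accumulate over the lifetime of the TCP/IP stack, the result is a long-term average rather than a value for a specific timeframe.

```python
import re


def tcp_retransmission_percent(netstat_output: str) -> float:
    """Return 100 * retransmitted / sent from `netstat -s` text output."""
    sent = re.search(r"(\d+) segments sent out", netstat_output)
    retransmitted = re.search(r"(\d+) segments retransmitted", netstat_output)
    if sent is None or retransmitted is None:
        raise ValueError("TCP segment counters not found in input")
    sent_count = int(sent.group(1))
    if sent_count == 0:
        return 0.0  # nothing sent, so nothing could be retransmitted
    return 100.0 * int(retransmitted.group(1)) / sent_count


# Counters copied from the netstat output shown above.
sample = """\
102383510128 segments received
169869495375 segments sent out
425003509 segments retransmitted
5579 bad segments received
"""
host_percent = tcp_retransmission_percent(sample)
print(f"{host_percent:.2f}%")  # roughly a quarter of one percent
```

Because this value covers every process on the host, a single process group can legitimately show a much higher or lower percentage in OneAgent.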

 

Troubleshooting steps

If the retransmissions reported by OneAgent are clearly higher than those reported by another tool, you can disable incoming retransmissions and check again. Incoming retransmissions can be disabled per host by having the support team set the runtime flag debugNetAgentDisableIncomingRetransmissionsNative to true, or by setting the environment variable:

DT_DEBUGFLAGS=debugNetAgentDisableIncomingRetransmissionsNative=true

 

Version history
Last update:
03 Apr 2025 11:28 AM