on 03 Apr 2025 11:28 AM
In many production environments, OneAgent retransmission metrics may diverge from those reported by other tools, such as Wireshark with the tcp.analysis.retransmission filter or the output of netstat -s. There are several reasons for this. To interpret the difference, you need a good understanding of TCP retransmission, its types, and how they relate to the retransmission metrics calculated by the OneAgent network module.
This article concerns network retransmission metrics. For clarity, it uses only the Grail metric identifiers and ignores the older Metrics API identifiers. The relevant Grail metric identifiers are:
dt.process.network.packets.re_tx_aggr - number of packets sent as retransmissions
dt.process.network.packets.re_rx_aggr - number of packets received as retransmissions
It is also important to take into account which packets might be subject to retransmission:
dt.process.network.packets.base_re_tx_aggr - number of sent retransmission base packets
dt.process.network.packets.base_re_rx_aggr - number of received retransmission base packets
The four metrics above are absolute packet counts, which are awkward to compare. When comparing retransmissions between two points, it is better to analyze the percentage of retransmitted packets per process, PGI, host, or network interface. By the definition of a percentage, the share of retransmitted sent packets is given by the expression below.
retransmissions% = dt.process.network.packets.re_tx_aggr / dt.process.network.packets.base_re_tx_aggr * 100%
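As a minimal sketch of this ratio (not OneAgent's implementation), the percentage can be computed per interval from two aligned series: sent retransmissions and sent base packets. The function name and sample values are hypothetical, and returning None for intervals without base packets is a choice made for this sketch.

```python
from typing import List, Optional


def retransmission_percent(
    retx: List[float], base: List[float]
) -> List[Optional[float]]:
    """Per-interval percentage of retransmitted packets.

    retx: retransmitted packets per interval (e.g. re_tx_aggr values)
    base: base packets per interval (e.g. base_re_tx_aggr values)
    """
    if len(retx) != len(base):
        raise ValueError("series must have the same number of intervals")
    # None marks intervals without base packets, where the ratio is undefined.
    return [100.0 * r / b if b else None for r, b in zip(retx, base)]


# Hypothetical sample values, not taken from a real environment.
print(retransmission_percent([5, 0, 12], [1000, 0, 4800]))
```

The same function applies to the received direction by passing the re_rx_aggr and base_re_rx_aggr series instead.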
The full DQL expression that computes this for a given process group instance, for example PROCESS_GROUP_INSTANCE-54FF3ADD0B5EDB11, is fairly complex:
timeseries Tx=avg(dt.process.network.packets.re_tx_aggr), nonempty:true, timeframe:"00:30/04:30",
filter: { matchesValue(dt.entity.process_group_instance, "PROCESS_GROUP_INSTANCE-54FF3ADD0B5EDB11") }
| join [
timeseries TxBase=avg(dt.process.network.packets.base_re_tx_aggr), nonempty:true, union:true,
filter: { matchesValue(dt.entity.process_group_instance, "PROCESS_GROUP_INSTANCE-54FF3ADD0B5EDB11") }
], kind:leftOuter, on:{timeframe}
| fieldsAdd {Txperc = 100 * (Tx[]/right.TxBase[])}
| fields Txperc, timeframe, interval
| fieldsAdd metricName = "Nginx retransmissions sent out"
Wireshark with the tcp.analysis.retransmission filter is certainly a more sophisticated tool and offers more options, for example out-of-order or spurious retransmission classification. The Wireshark documentation gives more details. It is worth emphasizing that when the OneAgent network module and Wireshark run in parallel, each may see a slightly different set of packets. This happens because BPF (libpcap) does not guarantee that 100% of packets are captured, due to the limited size of its capture buffers.
As for netstat -s: this tool prints many TCP counters that are global for the whole TCP/IP stack. Among these counters is the number of retransmitted segments:
root@kpi-server:/var/log/dynatrace/oneagent/os# netstat -s | grep segments
102383510128 segments received
169869495375 segments sent out
425003509 segments retransmitted
5579 bad segments received
As you can see, neither netstat nor Wireshark aggregates retransmissions per process. These figures are usually available only per host or per network adapter.
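For a rough host-wide figure to set against the OneAgent percentage, the netstat -s counters can be turned into a ratio. The sketch below parses output in the format shown above. The function name is hypothetical, and the line wording ("segments sent out", "segments retransmitted") is assumed to match the Linux output shown here. These counters are cumulative and cover the whole host, so the result is not directly comparable with a per-process value over a short timeframe.

```python
import re
from typing import Optional

# Counters copied from the netstat -s output shown above.
SAMPLE = """\
    102383510128 segments received
    169869495375 segments sent out
    425003509 segments retransmitted
    5579 bad segments received
"""

PATTERN = re.compile(r"\s*(\d+) (segments sent out|segments retransmitted)\s*$")


def host_retransmission_percent(netstat_output: str) -> Optional[float]:
    """Return retransmitted segments as a percentage of segments sent out."""
    counters = {}
    for line in netstat_output.splitlines():
        match = PATTERN.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1))
    sent = counters.get("segments sent out")
    retx = counters.get("segments retransmitted")
    # Without both counters (or with zero segments sent) the ratio is undefined.
    if not sent or retx is None:
        return None
    return 100.0 * retx / sent


print(f"{host_retransmission_percent(SAMPLE):.4f}%")
```

With the sample counters this gives roughly 0.25% of sent segments retransmitted across the host.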
If the retransmissions reported by OneAgent are clearly higher than those reported by another tool, you can disable incoming retransmissions and check again. Incoming retransmissions can be disabled per host by setting the runtime flag debugNetAgentDisableIncomingRetransmissionsNative to true, either through the support team or with the environment variable:
DT_DEBUGFLAGS=debugNetAgentDisableIncomingRetransmissionsNative=true