At my customer, we've set up a virtual 12.4.5 High Speed AMD on ESXi with auto-discovered traffic turned on. The AMD diagnostics are reporting a consistent sequence number gap rate, with 5-minute averages ranging between 5% and 17% during business hours. The AMD is monitoring two physical links, with each Tx line tapped separately, resulting in 4 capturing interfaces at the AMD. The reported traffic at peak is 50 Mbps. CPU usage is no more than 2%, and there are no dropped packets on the AMD.
I exported a 5-minute packet capture from the AMD into an old version of Wireshark. About 11% of packets have some TCP error or issue, and there's a range of servers and clients involved. Each of the 4 interface captures had a similar rate of TCP errors, suggesting that it's not a single tap at fault.
What are some of the things that we can do to determine the likely cause of the gaps: whether there's an issue with the tapping, an issue with the ESXi setup, or a wider network problem that we're observing?
Where are those links going?
I'd take a look at the other end too. Just like you, I have a hunch that not everything is perfect in the virtual world.
Did you do a Stevens graph (time-sequence graph) in Wireshark?
It'll tell you if there is some sort of regular pattern in the issue.
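If the graph is hard to read across many flows, a rough alternative is to bucket the timestamps of the flagged packets and look at the counts per interval. The sketch below is only an illustration: `events_per_bucket` and the 10-second bucket size are made-up names and values, and it assumes you have already exported the timestamps (in seconds) of the packets Wireshark flagged.

```python
from collections import Counter


def events_per_bucket(timestamps, bucket_seconds=10.0):
    """Count flagged packets per fixed-width time bucket.

    timestamps: capture-relative times in seconds of the packets of interest
    (hypothetical input, e.g. exported from a Wireshark display filter).
    Returns a list of counts, one per bucket, from the first bucket that
    contains an event to the last, with empty buckets reported as 0.
    """
    counts = Counter(int(t // bucket_seconds) for t in timestamps)
    if not counts:
        return []
    first, last = min(counts), max(counts)
    return [counts.get(b, 0) for b in range(first, last + 1)]


if __name__ == "__main__":
    # Made-up sample: three early events, then a quiet spell, then one more.
    print(events_per_bucket([0.5, 1.0, 12.0, 35.0], bucket_seconds=10.0))
```

Evenly spread counts would point towards something systematic (for example in the capture path), while bursts at regular intervals would suggest a periodic cause.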
Do the sequence gaps also result in degraded performance/RTT/latency?
To help determine whether this is a tap issue or a real issue, I'd try to get an independent capture. Can you get a capture off one of the worst servers that DCRUM is reporting a high loss rate for?
If the server capture comes back clean, then there may be a problem with unmatched duplicates getting through to the AMD, or not getting de-duplicated properly.
If the server capture is dirty, then there may be a more pervasive network problem.
When new SPANs/AMDs are set up, there always seems to be some verification of "bad things" that has to happen before we can pin it on an actual problem.
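To compare the AMD capture against the server capture, it can help to separate genuine gaps (bytes never seen) from duplicates or retransmissions within a single flow. Below is a minimal sketch of that idea, not how the AMD itself computes its gap rate. `Segment` and `classify_segments` are hypothetical names, the input is assumed to be one direction of one TCP flow in capture order with relative sequence numbers, and sequence number wraparound is ignored.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int     # relative TCP sequence number of the first payload byte
    length: int  # number of payload bytes in the segment


def classify_segments(segments):
    """Classify the segments of one flow direction, in capture order.

    in_order:          segment starts exactly where the previous data ended
    gap:               segment starts beyond the expected byte (data missing
                       from the capture at this point)
    duplicate_or_late: segment starts before the expected byte (duplicate,
                       retransmission, or late out-of-order arrival)
    Assumption: no sequence number wraparound within the capture.
    """
    counts = {"in_order": 0, "gap": 0, "duplicate_or_late": 0}
    expected = None  # next sequence number we expect to see
    for seg in segments:
        if seg.length == 0:
            continue  # pure ACKs consume no sequence space
        if expected is None or seg.seq == expected:
            counts["in_order"] += 1
        elif seg.seq > expected:
            counts["gap"] += 1
        else:
            counts["duplicate_or_late"] += 1
        end = seg.seq + seg.length
        expected = end if expected is None else max(expected, end)
    return counts


if __name__ == "__main__":
    # Made-up flow: one duplicated segment, then 100 bytes missing.
    flow = [Segment(0, 100), Segment(100, 100), Segment(100, 100),
            Segment(300, 100), Segment(400, 100)]
    print(classify_segments(flow))
```

Running the same classification on the AMD capture and on the server capture for the same flow should show whether the extra events on the AMD side are mostly duplicates (pointing at the tap or de-duplication) or gaps that the server also sees (pointing at the network).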