In this post I’d like to start a discussion about the metric documented at https://www.dynatrace.com/support/help/how-to-use-dynatrace/real-user-monitoring/basic-concepts/user... as Network time. You can find it on various dashboards of the solution. For example:
This metric could be read as the overall delay of the operation that is caused by the network, but that is far from the truth. I spent some time studying this subject and came to the conclusion that Dynatrace RUM, and possibly other solutions based on W3C Navigation Timing, are unable to identify how the network affects operation performance. But before going further into technical details, let me say a few words about why this metric is important.
Unfortunately, I found that we cannot rely on Dynatrace RUM in this case. The documentation says:
Network time = (requestStart - actionStart) + (responseEnd - responseStart)
That is far from enough to identify the overall network delay of the operation.
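To make the documented formula concrete, here is a minimal sketch of it as a plain function. The interface name, the function name and the sample numbers are my own and purely illustrative; only the four timestamps come from the formula, and I assume actionStart is the moment the user action began.

```typescript
// Timestamps in milliseconds, named after the fields in the documented formula.
// `actionStart` is assumed to be the moment the user action (e.g. page load) began.
interface ActionTiming {
  actionStart: number;
  requestStart: number;
  responseStart: number;
  responseEnd: number;
}

// Network time = (requestStart - actionStart) + (responseEnd - responseStart)
function documentedNetworkTime(t: ActionTiming): number {
  const beforeRequest = t.requestStart - t.actionStart;    // time until the request is sent
  const responseTransfer = t.responseEnd - t.responseStart; // time to receive the response
  return beforeRequest + responseTransfer;
}

// Hypothetical numbers, not measured values.
const sample: ActionTiming = {
  actionStart: 0,
  requestStart: 1200,
  responseStart: 2300,
  responseEnd: 2900,
};
const sampleNetworkTime = documentedNetworkTime(sample); // 1200 + 600 = 1800 ms
```

As far as the formula itself shows, only the timestamps of a single request enter the sum, so whatever is downloaded after responseEnd cannot change the result.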
Here is a simple experiment that can be reproduced with various front-end web applications. First, in Chrome DevTools, we artificially reduce the network speed, increase the latency and disable the cache:
Then we open one of the web pages of a business application that is instrumented with Dynatrace RUM. Here is what we see on the Dynatrace RUM dashboards:
We can see a lot of heavy static object downloads, and they were the main delay of the operation, yet Dynatrace attributes only 7.8 seconds to “Network time”! I use this artificial example because anyone can repeat it. In reality, we are unlikely to have 100 kbps with a 1 second RTT. However, I have many practical cases where a great deal of the front-end delay came from static object downloads. One example was described in my previous post, which has not yet received any attention from the Dynatrace gurus (see https://community.dynatrace.com/t5/Dynatrace-Open-Q-A/Dynatrace-RUM-metrics-meaning/m-p/171557).
The mistake can be avoided if we carefully go through the waterfall diagram of every request, but that doesn’t give the generalized picture that would be very helpful at the early stage of an investigation.
Technically, Dynatrace already has, or can have, all the numbers necessary to calculate Network time properly. Every static object has a Request and a Response time. We would need some algorithm to account for parallelization. We might also need the server-to-client RTT to account for network latency within the “Request” time. All of this can be done either in OneAgent or on the Dynatrace reporting side, if the additional calculation load cannot be afforded on the production servers. But maybe my approach is too simplistic and a proper calculation of Network time is simply impossible with the W3C Resource Timing API…
It would be nice to have an official comment from Dynatrace on this point.
…and, yes, I have tried to follow up on this issue by opening a ticket with Dynatrace support. One month has passed since I opened #8251 and I have not received a constructive reply. Now Dynatrace has closed the ticket, offering me a training session 😊
Also, I could not find anyone raising this topic, either in the Dynatrace forum or anywhere else. Maybe I missed something… How do you guys identify performance issues caused by large static objects or their inefficient handling in client web browsers?
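One possible way to account for parallelization is to merge the overlapping download intervals and sum only the wall-clock time during which at least one transfer was in flight. The sketch below is my own illustration, not how Dynatrace calculates anything; the Transfer type and the coveredTime function are hypothetical names. In a browser the intervals could be built from the requestStart and responseEnd fields of the entries returned by performance.getEntriesByType("resource"), keeping in mind that cross-origin resources without a Timing-Allow-Origin header report zero for those detailed timestamps.

```typescript
// One transfer, in milliseconds relative to the start of the action.
interface Transfer {
  start: number; // e.g. requestStart of a resource
  end: number;   // e.g. responseEnd of a resource
}

// Wall-clock time covered by at least one transfer (union of intervals).
function coveredTime(transfers: Transfer[]): number {
  // Drop empty or inverted intervals and sort by start time without mutating the input.
  const sorted = transfers
    .filter((t) => t.end > t.start)
    .sort((a, b) => a.start - b.start);

  let total = 0;
  let current: Transfer | null = null;
  for (const t of sorted) {
    if (current === null) {
      current = { start: t.start, end: t.end };
    } else if (t.start <= current.end) {
      // Overlaps the running interval: extend it instead of counting twice.
      current.end = Math.max(current.end, t.end);
    } else {
      // Gap: close the running interval and start a new one.
      total += current.end - current.start;
      current = { start: t.start, end: t.end };
    }
  }
  if (current !== null) {
    total += current.end - current.start;
  }
  return total;
}

// Hypothetical example: two overlapping downloads and one later download.
const exampleTransfers: Transfer[] = [
  { start: 0, end: 100 },
  { start: 50, end: 150 },
  { start: 200, end: 250 },
];
const naiveSum = exampleTransfers.reduce((s, t) => s + (t.end - t.start), 0); // 250
const merged = coveredTime(exampleTransfers); // 150 + 50 = 200
```

This still counts the server's processing time of each request as part of the interval, so it is an upper bound on the network share rather than the network share itself. Separating the two would need something like the RTT estimate mentioned above.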
I have had this discussion in the past as well, also in the synthetic context. First of all, let me congratulate you on such a thorough investigation. Throttling the speed is something I found quite interesting in this analysis.
Trying to figure out the network contribution is always particularly difficult. We have to consider the usual network latency, but then how should we count requests that are made in parallel, while connections to third parties are also being established? It gets highly complicated pretty fast, and putting that into a single number WILL always be debatable.
Hopefully, we might get more insight into how it is calculated, so we can understand the metric better first 😊
Really good point. During my troubleshooting I avoid using the "network consumption" metrics (TTL, TCP, DNS, connection failed, etc.). The only useful one is "size of loading", which can make sense if we don't have any cache.
I think that @Radoslaw_Szulgo just might be the right guy to shed some light on this story. What do you reckon, Radoslaw?
I'll try to bring some experts here to help you 🙂
Great, thanks! 🙂
> The problem is to aggregate values for multiple requests into one value. Requests run in parallel, so what do you consider server time and what network time when both happen at the same time? This is why we have a strange calculation in place, which confuses everyone.
We're aware of the problem, but we have not evaluated it yet and we don't know how best to solve it.