We've been troubleshooting some issues such as missing data and would appreciate some input from others who may have had more experience with this.
Above is a chart of 12 hours of server health. One of the things that stands out to me is, of course, the skipped events and skipped (non-analyzed) PurePaths. I have also been investigating the high, spiking PurePath lengths.
I am also interested in the spikes across measurements here that seem to occur every hour at about 40 minutes past. The spikes seem to correlate across CPU usage, MPS, memory usage, and to some extent suspension time.
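To confirm a pattern like this, one approach is to export the metric as minute-level samples and bucket the values by minute-of-hour; if the spikes really land at :40 every hour, that bucket's average will stand out. A minimal sketch (the data below is synthetic, and the `(epoch_minute, value)` row format is an assumption, not an actual Dynatrace export format):

```python
# Sketch: check whether metric spikes cluster at a particular minute past
# the hour by averaging samples per minute-of-hour. Synthetic data only.
from collections import defaultdict

def minute_of_hour_profile(samples):
    """samples: iterable of (epoch_minute, value).
    Returns average value keyed by minute-of-hour (0-59)."""
    buckets = defaultdict(list)
    for minute, value in samples:
        buckets[minute % 60].append(value)
    return {m: sum(vals) / len(vals) for m, vals in buckets.items()}

# Synthetic 12 hours of per-minute CPU data with a spike at :40 each hour
samples = [(t, 90.0 if t % 60 == 40 else 20.0) for t in range(12 * 60)]
profile = minute_of_hour_profile(samples)
peak_minute = max(profile, key=profile.get)
print(peak_minute)  # 40
```

If the peak bucket matches what you see on the chart, that strengthens the case for a scheduled hourly job rather than load-driven spikes.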
I would like some input on ways that I can drill into some of these issues, or, if anything jumps out, some ideas as to what is occurring.
One thing I have seen support do in the past is look at charts in the Dynatrace self-monitoring system profile. Then they chart the bytes per agent, which gives you an indication of where the traffic spikes are coming from. Then you look at the PurePaths for those agents to see what transactions are coming in at that point in time. There you will find clues about volume and potential configuration tweaks you can make to reduce some of these spikes you are seeing. This is just a start, I understand, but more will be unlocked from looking at this information.
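The "bytes per agent" step above can be sketched as a simple aggregation over exported rows, then ranking agents by total traffic. This is only an illustration; the `(agent, timestamp, bytes)` row shape is an assumption, not the real self-monitoring export format:

```python
# Sketch: rank agents by total bytes sent to find where traffic spikes
# originate. Row format (agent, timestamp, bytes) is assumed/synthetic.
from collections import Counter

def bytes_per_agent(rows):
    totals = Counter()
    for agent, _ts, nbytes in rows:
        totals[agent] += nbytes
    return totals.most_common()  # [(agent, total_bytes), ...] descending

rows = [
    ("app-agent-1", 0, 5_000),
    ("app-agent-2", 0, 120_000),
    ("app-agent-1", 1, 7_000),
    ("app-agent-2", 1, 150_000),
]
ranking = bytes_per_agent(rows)
print(ranking[0][0])  # app-agent-2
```

Once the noisiest agent is identified, filtering PurePaths to that agent around the spike window narrows the search considerably.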
That's a good idea, and I've been looking at that now. I don't know if it is related to what I was originally looking at, but using data from it we found a sensor configuration that was capturing EVERY method in a pretty big package.
However, I didn't see any traffic spikes that occurred every hour, so I'm inclined to think it might be a scheduled task on the server side. One thing we're investigating is excessive business transaction baselining, so possibly that calculation is occurring on an hourly basis.
Something I am noticing is that your server memory keeps climbing and then dropping when the CPU spikes come in, so it could be a memory leak issue. I can try to take a closer look at it this weekend; feel free to send me a session at firstname.lastname@example.org if you can. I would try to capture 10 minutes before and after the spike if possible.
Here is my quick analysis.
One of the reasons for the skipped events and skipped PurePaths is that you have PurePaths that exceed the 10k node limit. The agents will keep sending events for PurePaths that the Dynatrace server has already closed due to that limit, which explains some of your skipped events. These PurePaths also get skipped from analysis because they are truncated and not completed.
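If you can export PurePath summaries, flagging the ones over the node limit is a one-liner style filter. A minimal sketch, assuming a list of dicts with `name` and `nodes` fields (those field names are hypothetical, not the actual export schema):

```python
# Sketch: flag PurePaths whose node count exceeds the server's node limit
# (10,000 in this thread). Field names "name"/"nodes" are assumptions.
NODE_LIMIT = 10_000

def oversized_purepaths(purepaths, limit=NODE_LIMIT):
    """Return names of PurePaths that exceed the node limit."""
    return [pp["name"] for pp in purepaths if pp["nodes"] > limit]

purepaths = [
    {"name": "/checkout",   "nodes": 850},
    {"name": "BatchJob.run", "nodes": 42_000},
]
print(oversized_purepaths(purepaths))  # ['BatchJob.run']
```

The names that come back are usually the long-running jobs or over-instrumented entry points worth excluding or trimming first.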
As for the CPU spikes: that is just garbage collection time (= suspension time). You can easily see the correlation of CPU to suspension time to the drop in memory -> that's when the GC cleans up memory. That is totally normal and nothing to worry about. You would only worry if it happened much more frequently, consumed much more CPU, and stalled your server - OR - if memory steadily increased over a longer period and led to an out-of-memory situation. But I don't think it does.
I suggest you first focus on those PurePaths that exceed the 10k limit. Maybe these are PurePaths of transactions you are not interested in, e.g. long-running backend jobs, or maybe you have over-instrumentation that leads to that problem. I think if you solve this issue you will see these problems go away.
It is good to hear that the spikes and garbage collection are not out of the ordinary. I have indeed noticed that this skipped data comes up when there are big spikes in PurePath size, and I was able to make a chart that showed where these increases came from in certain cases. Over-instrumentation is definitely something we are looking at as we go through the process of cleaning up our Dynatrace deployment. It is also possible that some of this is related to my previous inquiry here: