We are struggling to understand why ALR impacts one Managed environment. CPU usage is consistently around 40% when ALR kicks in, and it stays that way. Only about 2/3 of the total RAM is in use.
There is little information about ALR, so I'm putting this in "Tips & Tricks": it is not really a question, but rather a thread where several of us can exchange ideas on how to get Managed to use the hardware in the most efficient way.
The main page where ALR is discussed in the documentation is:
Also, in a previous discussion in the Community some ideas have been exchanged:
Also, Dynatrace (and @Radoslaw_Szulgo in particular) have shed some light on what is going on below the surface, and on optimizing how resources are used:
Now, in this specific case, and as @Julius_Loman suggested previously, I have noticed the following points in the server.log files:
Now, one can just throw more hardware at the issue, or add several clusters. But that's not how I like to do it, and quite frankly, I believe a lot of you don't either. So, I'm putting down some ideas on how ALR could be approached:
I had the same question as your number 2 recently and got an answer from support.
Basically it's just bad/wrong wording in the UI, when the message is something like:
Server's service call limit has been reached. Processed service calls: 245612, Limit: 488833.
The number for Processed service calls is not actually what it says, but the projected number of service calls after sampling through ALR.
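For anyone who wants to pull these numbers out of server.log, here is a minimal sketch in Python. The pattern is built only from the message text quoted above. The names `parse_alr_limit` and `AlrLimitEvent` are hypothetical, and nothing is assumed about the rest of the log line (timestamp, log level and so on).

```python
import re
from typing import NamedTuple, Optional

# Pattern derived from the quoted message only; the surrounding
# server.log line layout is deliberately not assumed.
ALR_LIMIT_RE = re.compile(
    r"service call limit has been reached\.\s*"
    r"Processed service calls:\s*(?P<projected>\d+),\s*"
    r"Limit:\s*(?P<limit>\d+)"
)


class AlrLimitEvent(NamedTuple):
    # Per the support answer above, this is the projected number of
    # service calls after ALR sampling, not the raw processed count.
    projected_calls: int
    limit: int


def parse_alr_limit(line: str) -> Optional[AlrLimitEvent]:
    """Return the two numbers from a service call limit message, or None."""
    match = ALR_LIMIT_RE.search(line)
    if match is None:
        return None
    return AlrLimitEvent(int(match.group("projected")), int(match.group("limit")))
```

Running it line by line over a log file and collecting the non-`None` results should be enough to see when these messages start and stop.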
We keep ~50% of the memory available to the operating system so it can be used for disk caches (the OS keeps recently read/written data in caches for faster access – this is not visible as used memory). So 50% available memory does not mean that we only need/use half of the memory. It would be a big mistake to reduce the memory of the host…
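To see how much of the "unused" memory is actually serving as disk cache, one option on Linux is to look at `/proc/meminfo`, which reports `MemTotal`, `MemFree`, `MemAvailable` and `Cached` in kB. The following is a rough sketch, not a Dynatrace tool; the function names `parse_meminfo` and `memory_breakdown` are made up for illustration.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}; sizes are in kB."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_breakdown(info: Dict[str, int]) -> Dict[str, float]:
    """Express free, page-cache and available memory as a share of MemTotal."""
    total = info["MemTotal"]
    return {
        "free_pct": 100 * info["MemFree"] / total,
        "page_cache_pct": 100 * info["Cached"] / total,
        "available_pct": 100 * info["MemAvailable"] / total,
    }
```

On a cluster node this could be called as `memory_breakdown(parse_meminfo(open("/proc/meminfo").read()))`. A high `page_cache_pct` next to a low `free_pct` would be consistent with the explanation above that the "available" memory is in fact being put to work.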
We need to focus on the health of the processes that are running. One of the most important goals is that the cluster is responsive at all times. As we are using Java processes, this means that we need to avoid longer GC pauses. So the suspension times of our processes are one of the most important things we look at. Our goal is to not exceed around 5% suspension time. This limit can be hit before we have 90% CPU utilization on the host. Larger machines (especially more memory) give us more headroom for our processes and can significantly increase the traffic we can process before reaching critical suspension times.
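As a rough illustration of the 5% goal, the sketch below treats suspension time as the share of a wall-clock window that a process spent paused by GC. That definition is an assumption on my side, as is the idea that the pause durations have already been collected somewhere (for example from GC logs). The function names are hypothetical.

```python
from typing import Iterable

# "Around 5%" is the goal stated above; the exact threshold the
# cluster uses internally is not known here.
SUSPENSION_GOAL = 0.05


def suspension_ratio(pause_ms: Iterable[float], window_ms: float) -> float:
    """Fraction of the observation window spent in GC pauses (assumed definition)."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    return sum(pause_ms) / window_ms


def exceeds_goal(pause_ms: Iterable[float], window_ms: float,
                 goal: float = SUSPENSION_GOAL) -> bool:
    """True if the suspension ratio in the window is above the goal."""
    return suspension_ratio(pause_ms, window_ms) > goal
```

For example, pauses of 200, 150 and 250 ms within a 10 second window add up to 600 ms, i.e. 6% suspension, which is already above the goal no matter how idle the host CPU looks.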
There are lots of other factors as well. We have a lot of experience in sizing and operating Dynatrace clusters. It is important to listen to our recommendations. Wrongly sized clusters will lead to a lot of work and frustration for both you and us. Such clusters suffer from things like poor performance, not all data being processed,… and probably generate lots of support tickets that could be avoided.
Regarding points (2), (3) and (4): these limits were removed in version 1.222, so they should no longer be a problem, and docs are obviously not required any more. So the cluster processes all service calls coming from the agents as long as there are no hints that the cluster is unhealthy (e.g. from longer suspensions by GC or an overloaded correlation engine).
I think it is worth mentioning that we are working on a self-monitoring dashboard for Managed customers where the utilization of the cluster will be shown in a simplified form. Still, the initial point holds true: CPU and memory utilization at the OS level cannot be used for that.
Thanks for your input. A couple of considerations:
Thanks for the update. I was looking at data for the last 6 months and had not noticed that "Subpath traffic" reasons stopped happening in June. Regarding "service call" reasons, 1.222 was installed on Aug. 5th, but the messages continue until the end of the logs I've got, which only run until Aug. 26th, so they don't yet include the latest data, namely after 1.224. I'll check that out 😊