Re: Understanding and fighting ALR

AntonioSousa · ‎10 Sep 2021

We are struggling to understand why ALR impacts one Managed environment. It is consistently around 40% CPU usage when ALR kicks in and stays that way. Memory is only used in around 2/3 of the total amount of RAM.

There is little information about ALR, so I'm putting this in "Tips & Tricks", as this is not particularly a question, but might be a thread where several of us might exchange ideas on how to get Managed to use the hardware in the most efficient way.

The main page where ALR is discussed in the documentation is:

https://www.dynatrace.com/support/help/how-to-use-dynatrace/transactions-and-services/purepath-distr...

Also, in a previous discussion in the Community some ideas have been exchanged:
https://community.dynatrace.com/t5/Dynatrace-Open-Q-A/RFE-Why-ALR/m-p/112116

Also, Dynatrace (and @Radoslaw_Szulgo particularly) have showed us some light about what is going on below the surface, and optimizing how resources are used:

https://www.dynatrace.com/news/blog/process-more-with-less-using-smarter-cluster-overload-prevention...

https://www.dynatrace.com/news/blog/storage-handling-improvements-increase-retention-of-transaction-...

Now, in this specific case, and as @Julius_Loman suggested previously, I have noticed the following points in the server.log files:

About 50% are related to messages like the one below. It's GC kicking in...
Garbage collection's average time in percentage for the service calls throughput (9.23) is greater than the threshold limit (4.50)
About 25% are related to the message below. Of these, 80% are like the one below, where the traffic is lower, much lower, but apparently still triggers ALR:
Server's incoming sampled subpath traffic in bytes has been reached. Current incoming traffic: 235099314, Limit: 469762048.
The remaining 25% are related to the message below. In this case, 90% are like the one below, where the number of processed service calls is way below the limit, but still triggering ALR:
Server's service call limit has been reached. Processed service calls: 245612, Limit: 488833.

Now, one can just put more hardware on the issue, or several clusters. But that's not how I like to do it, and quite frankly, I believe a lot of you don't too. So, I'm putting some ideas on how tackling ALR could be approximated:

If we have GC, than it's essentially about CPU, and probably mainly memory. Since Managed is composed ofsome big monolithic java program, with fixed Xmx values, that uses a predefined percentage of system RAM to run, it seems that there is little control around that. But given that the servers run with almost 1/3 of RAM not used, why is this so? To leave room for upgrades or other programs running? Does it make sense to have more RAM, and maintain CPU, as CPU usage is not that high?
Why is "Subpath traffic" triggering ALR when apparently the limit isn't being breached? The same really applies to "service call limit".
"Subpath traffic" is something that seems undocumented.It seems related to traffic coming in through the network interface. In this case, servers are very near Managed, in terms of network latency. Some might be even in the same ESX server. Activegate usage was not deemed necessary in this case, but would somehow the compression that it does be useful here?
The "service call" limit seems to be associated with the "Adaptive capture control" referenced in the ALR documentation link above. I do have a great number of calls, and it seems that minimizing the ACC for certain process groups that have a lot of Purepaths that are not that needed, might be a way forward in my case.
How does specific hardware solve ALR? I believe CPU frequency is factored in, but maybe only indirectly, in the way it affects GC. Does the type of storage affect ALR? It is understandable that better hardware specs will help, but what would help more in the context of ALR?
Today, everywhere I run Managed, it's virtualized. There are some requirements/recommendations from Dynatrace in the link below, but might there be others?
https://www.dynatrace.com/support/help/setup-and-configuration/dynatrace-managed/installation/dynatr...

Antonio Sousa

pahofmann · ‎10 Sep 2021

I had the same question as your number 2 recently and got an answer from support.

Basically it's just bad/wrong wording in the UI, when the message is something like:

Server's service call limit has been reached. Processed service calls: 245612, Limit: 488833.

The number for Processed service calls is not actually what it says, but the projected number of services calls after sampling through ALR.

Dynatrace Certified Master - Dynatrace Partner - 360Performance.net

Radoslaw_Szulgo · ‎10 Sep 2021

We keep ~50% of the memory available to the operating system so it can be used for disk caches (the OS keeps recently read/written data in caches for faster access – this is not visible as used memory). So 50% available memory does not mean that we only need/use half of the memory, it would be a big mistake to reduce the memory of the host…

We need to focus on the health of the processes that are running. One of the most important goals is that the cluster is responsive at all times. As we are using Java processes, this means that we need to avoid longer GC pauses. So the suspension times of our processes is one of the most important things we look at. Our goal is to not exceed around 5% suspension time. This limit can be hit before we have 90% cpu utilization on the host. Larger machines (especially more memory) give use more headroom for our processes and can significantly increase the traffic we can process before reaching critical suspension times.

There are lots of other factors as well. We have a lot of experience in sizing and operating Dynatrace clusters. It is important to listen to our recommendations. Wrongly sized clusters will lead to a lot of work and frustration for both, You and us. They will experience things like poor performance, not all data being processed,… probably lots of support tickets that could be avoided.

Regarding points (2), (3) and (4) -> These limits were removed in version 1.222 - so should not longer be a problem, and docs are obviously not required any more. So the cluster processes all service calls coming from the agents as long as there are no hints that the cluster is unhealthy (e.g. from longer suspensions by GC or overloaded correlation engine).

I think it is worth mentioning that we are working on a self-monitoring dashboard for managed customers where the utilization of the cluster will be shown in a simplified form. Still, the initial point holds true. Looking at cpu and memory utilization on the OS level cannot be used for that.

(source: @markus_pfleger)

Senior Product Manager,
Dynatrace Managed expert

AntonioSousa · ‎10 Sep 2021

@Radoslaw_Szulgo,

Thanks for your input. Some couple of considerations:

I have confirmed through vmstat, in a 64GB server, that 20GB are being used at the OS level for file caching (cache column). While it does have the advantages that you refer to, this was truer with HDDs than it is now with SSDs. It seems that this is clearly excessive. Is this ratio in someway configurable?
What you have said goes along what I was imaging: in scenarios where GC is triggering ALR, especially with not high CPU usage as is the case I'm analyzing, it makes more sense to give the machine more RAM, and not invest in more cores?

Antonio Sousa

Radoslaw_Szulgo · ‎10 Sep 2021

That’s correct. More memory would help to solve ALR. I’ll check with the team on how configurable the buffer size is.

Senior Product Manager,
Dynatrace Managed expert

Radoslaw_Szulgo · ‎10 Sep 2021

See my updated message above.

Senior Product Manager,
Dynatrace Managed expert

AntonioSousa · ‎10 Sep 2021

Thanks for the update. I was looking at data for the last 6 months and had not noticed that "Subpath traffic" reasons stopped happening in June. Regarding "service call" reasons, 1.222 was installed Aug.5th, but there are messages there till the end of the logs I've got, that are only until Aug.26th, so they don't include yet the latest data, namely after 1.224. I'll check that out 😊

Antonio Sousa