30 Sep 2024 02:03 AM - edited 30 Sep 2024 02:07 AM
We’re getting Memory saturation alerts on File and MS SQL servers. The Windows Admin and Performance monitoring teams say page faults are normal and consider the Dynatrace alerts false positives. We have many servers, so these alerts put a strain on those teams. Any suggestions other than modifying the alerting profile or disabling the Memory saturation configuration are appreciated.
Thanks
Raj
30 Sep 2024 04:02 AM
Hi @mrc15816
Have you considered using Auto-adaptive thresholds for anomaly detection?
https://docs.dynatrace.com/docs/platform/davis-ai/anomaly-detection/auto-adaptive-threshold
02 Oct 2024 03:07 AM
@p_devulapalli thank you for your comment. I am not sure whether auto-adaptive alerting supports the OOTB setup; we are on a Managed deployment.
30 Sep 2024 05:34 AM
Hey Raj,
If you use host groups to group these similar hosts together, you could then modify the alerts to increase the thresholds across all those hosts in one easy configuration. If not, you could also modify the alerts on each of the hosts. Also, if the team considers them false alarms, reach out and ask what they'd consider to be a real problem, and use that to influence any changes made to thresholds. If these are short alerts, you could increase how long the value must stay over the threshold before an alert is raised, or, if the team knows how many page faults they'd consider an issue, you could increase that threshold.
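To make the "raise the threshold or lengthen the time over it" idea concrete, here is a minimal Python sketch of a sliding-window check. It is only an illustration of the general technique, not Dynatrace's implementation; the function names, the threshold of 500, and the "3 of 5 samples" window are all hypothetical values chosen for the example.

```python
from collections import deque


def make_detector(threshold, violating_samples, window):
    """Build a checker that fires once at least `violating_samples`
    of the last `window` samples are above `threshold`.
    All three parameters are illustrative, not product defaults."""
    recent = deque(maxlen=window)  # keeps only the newest `window` results

    def observe(value):
        recent.append(value > threshold)
        return sum(recent) >= violating_samples

    return observe


def alerts(values, threshold, violating_samples, window):
    """Return one True/False per sample: would an alert be active?"""
    observe = make_detector(threshold, violating_samples, window)
    return [observe(v) for v in values]


# A single short spike never reaches 3 violating samples out of 5.
short_spike = [10, 600, 10, 10, 10, 10]
# A sustained run of high values eventually does.
sustained = [10, 600, 10, 600, 600, 600]

print(alerts(short_spike, threshold=500, violating_samples=3, window=5))
print(alerts(sustained, threshold=500, violating_samples=3, window=5))
```

Raising `threshold` suppresses alerts for values the team considers normal, while raising `violating_samples` (or widening `window`) suppresses short bursts; which of the two fits depends on what the team says a real problem looks like.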
02 Oct 2024 03:01 AM
@Fin_Ubels We considered the host group option but are concerned about the complexity and operational challenges involved, especially with hundreds of host groups and multiple environments like cloud, on-premise, and co-location. We don’t alert immediately when we see memory saturation but wait to see if it resolves within 30 minutes. If the problem persists, we open an incident ticket. If this behavior is normal, we wonder whether every customer is changing these settings, or whether Dynatrace OneAgent can detect a server running MS SQL Server and adapt?
02 Oct 2024 03:15 AM
I don't believe the OneAgent adapts in these scenarios. From my experience, customers often develop an onboarding strategy for new OneAgent deployments. When OneAgents are deployed, they are assigned host groups, network zones, any custom tags/metadata required, custom alerting settings as required, and any other settings required. This helps prevent scenarios such as this one, where going back is difficult, and ensures that the alerts generated are accepted by the teams they are relevant to. This doesn't really help in the current scenario, but it would be good to consider in the future.
The auto adaptive thresholds that @p_devulapalli suggested would require you to create a custom alert and disable memory alerting on the hosts that the custom alert covered to ensure there weren't double ups on alerts. By the sounds of things this would be a considerable manual effort as well.
02 Oct 2024 03:39 AM
Thank you for your quick response @Fin_Ubels. I agree that gathering much of this information will help, but at a large scale, where many things come into play, it makes things complicated and operationally very challenging.
Custom metric events were not feasible, as Dynatrace doesn’t support combining two metrics, i.e., Memory Used and Page faults, in one event, so we didn’t use them.
My bad, and I agree that OneAgent doesn’t do much on the alerting part, but the cluster should be capable of adopting vendor best practices, etc.
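For readers following along, the two-metric combination mentioned above (Memory Used together with Page faults) can be sketched in a few lines of Python. This is a hedged illustration of the logic only, assuming both values have already been collected somewhere outside Dynatrace; the `Sample` type, the `is_saturated` function, and the limits of 80 and 100 are hypothetical and should be replaced with whatever the Windows and SQL teams consider a real problem.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One measurement for a host (hypothetical structure)."""
    memory_used_pct: float      # memory in use, 0-100
    page_faults_per_sec: float  # page fault rate


def is_saturated(sample, max_memory_pct, max_page_faults):
    """True only when BOTH metrics exceed their limits at the same time."""
    return (
        sample.memory_used_pct > max_memory_pct
        and sample.page_faults_per_sec > max_page_faults
    )


# High page faults alone (common on SQL Server hosts) do not trigger.
busy_sql_host = Sample(memory_used_pct=60.0, page_faults_per_sec=900.0)
# High memory use AND high page faults together do.
starved_host = Sample(memory_used_pct=95.0, page_faults_per_sec=900.0)

print(is_saturated(busy_sql_host, max_memory_pct=80.0, max_page_faults=100.0))
print(is_saturated(starved_host, max_memory_pct=80.0, max_page_faults=100.0))
```

Requiring both conditions is what keeps a host with normal, heavy paging but spare memory from being flagged.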