01 Jul 2022 10:29 AM - last edited on 10 Aug 2022 06:47 AM by Ana_Kuzmenchuk
Suppose a process group spans 3 different hosts and is set up for availability monitoring that raises a problem if a minimum threshold of 3 process instances is not met. Say one host gets shut down in preparation for decommissioning. A problem would then be created because the threshold of 3 is not met. However, there will never be 3 process instances in that group again, because one host has been shut down for decom.
How does everyone handle this situation to avoid getting problems/tickets created? One solution is to have a script run during the decom process that checks the process groups on the host and decreases the count by one, so that in this example it would be 2. This does not work, though, if the rule is set to 'if any process becomes unavailable'.
Generally, if a host or set of hosts is being decommissioned, we put in a maintenance window (MW) that matches any entity tagged with the given host names. As a result everything gets suppressed, and once the hosts are fully decommissioned the groups dissolve, so the settings wouldn't be valid anymore anyway.
Unfortunately Dynatrace Process Group Availability monitoring is not automatic or self-aware. It also unfortunately does not permit a percentage value (< 66%, < 50%, etc.) (Feel free to submit an RFE for that.)
As a result, the suggested practice is that whenever you maintain or modify your clustered environment, you work with your monitoring team to adjust the active cluster monitoring as well.
At first this bugged me, but then I came to terms with it: if you're taking on a major change such as altering the size of a clustered environment, it's reasonable to make re-evaluating the current monitoring configuration one of your steps. You may want to adjust more than just the Availability size.
Your script idea is fantastic, but the time to write and maintain the script is likely going to be more than just attaching a Task to the decom request to update the Availability size.
OS Service monitoring on the other hand may provide what you are looking for. First, the process you are monitoring needs to be represented as a Service (Windows Service or Linux Systemd). Then when configuring it, activate the "Monitor" toggle so that DT collects the metrics. Then create a Metric Event filtered on the OS Service Monitor and Hosts in the cluster.
Using the CODE mode you could apply some logic, such as getting a total across the hosts and alerting when it falls under 50% of the expected value, which likely means at least half of the cluster is down.
Note: OS Service monitoring captures a metric every 10 seconds, so that means every 1 minute there are 6 data points. When you evaluate the Availability you will get a number of 6 for each Service, meaning it was available 100% of the time for that 1 minute.
For example, with 5 hosts, an availability metric of 30 (6x5) means the OS Service was available 100% of the time across all 5 hosts. If you get 15, it means either all 5 hosts were down 50% of the time or, more likely, that 3 hosts are having issues. (DT obviously has some improvements to make here.)
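To make that arithmetic concrete, here is a rough Python sketch of the percentage logic. It is not Dynatrace code or its API. The function names are made up for illustration, and it assumes the availability value you read is simply the count of data points reported in the window (6 per host per minute, as described above).

```python
# Assumption from the note above: one data point every 10 seconds,
# so 6 data points per host per minute.
SAMPLES_PER_MINUTE = 6


def cluster_availability(observed_points: int, host_count: int, minutes: int = 1) -> float:
    """Return the fraction (0.0 to 1.0) of expected data points that were reported."""
    if host_count <= 0 or minutes <= 0:
        raise ValueError("host_count and minutes must be positive")
    expected = SAMPLES_PER_MINUTE * host_count * minutes
    return observed_points / expected


def should_alert(observed_points: int, host_count: int,
                 min_fraction: float = 0.5, minutes: int = 1) -> bool:
    """Hypothetical rule: alert when availability drops strictly below min_fraction."""
    return cluster_availability(observed_points, host_count, minutes) < min_fraction


if __name__ == "__main__":
    # 5 hosts, all healthy for one minute: 30 of 30 points.
    print(cluster_availability(30, 5))
    # 5 hosts, 15 of 30 points: half of the expected availability.
    print(cluster_availability(15, 5))
    print(should_alert(15, 5), should_alert(14, 5))
```

Note that a reading of exactly 15 on 5 hosts is 50%, so a strict "under 50%" rule would not fire until the total drops to 14 or lower. Also, the total alone cannot tell you whether every host was down half the time or a few hosts were fully down, which is the ambiguity mentioned above.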