With the release of OneAgent 1.261, a new and long-awaited metric is available!
The OneAgent OS module now reports the host availability state as a metric.
Metric key: builtin:host.availability.state
The resolution is 1 minute: states are sent every minute, always with the value 1, which means that the reported state occurred in the given minute. If there is no sample with the given state, it means that the state was not detected in this minute.
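To make the "value 1 or no sample" semantics concrete, here is a minimal Python sketch that expands sparse per-minute samples into a 0/1 timeline for one state. The function name, the tuple layout, and the sample data are assumptions for illustration only; they are not part of any Dynatrace API.

```python
from datetime import datetime, timedelta


def state_timeline(samples, state, start, minutes):
    """Return one flag per minute: 1 if `state` was reported, else 0.

    `samples` is a hypothetical list of (timestamp, state) tuples, each
    standing for a datapoint with value 1 in that minute.
    """
    seen = {ts for ts, s in samples if s == state}
    return [
        1 if start + timedelta(minutes=i) in seen else 0
        for i in range(minutes)
    ]


# Made-up data: the host is up for two minutes, then the agent stops.
start = datetime(2023, 1, 1, 12, 0)
samples = [
    (start, "up"),
    (start + timedelta(minutes=1), "up"),
    (start + timedelta(minutes=2), "unmonitored_agent_stopped"),
]

up_flags = state_timeline(samples, "up", start, 4)
stopped_flags = state_timeline(samples, "unmonitored_agent_stopped", start, 4)
print(up_flags, stopped_flags)
```

A missing sample is treated as "state not detected in this minute", which is why the gaps become 0 rather than being carried forward.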
Why is it important?
Because it allows us to generate alerts based on different use cases:
and thus improve the management we have over the deployed OneAgents.
Thanks Dynatrace Team!!!!
Thanks for this info, @DanielS. What should the threshold be if I use the metric key "builtin:host.availability.state" to create a metric event? I need a metric event that fires when the actual host is down.
Hello @JDS thanks, you could use the metric selector option with this text:
Also if you want to add more "down states" you could do it:
Thanks @DanielS. I tried to see additional options in Data Explorer, but I am only getting the "no_data" and "up" states. If I use this expression in the metric selector, what should the threshold value be to trigger an event when the host goes down?
I tried using the host availability percentage metric, but it doesn't scale because the dimension limit is only 5000. We have around 12K hosts in our non-production environment.
Hi @JDS, you didn't see the other states because no such events occurred in the selected time window. I listed all the states in my first post; try using them in the filters. When a state goes from 0 to 1, this will trigger the event. I guess this is the best approach.
Thanks @DanielS. I used this Metric selector expression "builtin:host.availability.state:filter(and(or(eq("availability.state","up")))):splitBy("dt.entity.host"):sort(value(auto,descending)):limit(100)" and received the data points shown in the screenshot.
I don't understand what the values 10, 6, 12, etc. mean. Also, what do the values on the Y axis stand for? The values on the X axis refer to the timeline, which is clear to me.
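For anyone who wants to try other states, the selector quoted in this post can be assembled from a list of state names. This is only a sketch: the helper name `availability_selector` is made up, and combining several `eq(...)` conditions inside `or(...)` is an assumption based on the selector shape used in this thread, so verify the result in Data Explorer before relying on it.

```python
def availability_selector(states, limit=100):
    """Build a metric selector string filtered to the given states.

    Hypothetical helper: it only formats text in the same shape as the
    selector quoted in this thread and does not call any Dynatrace API.
    """
    if not states:
        raise ValueError("at least one state is required")
    conditions = ",".join(
        f'eq("availability.state","{state}")' for state in states
    )
    return (
        "builtin:host.availability.state"
        f":filter(and(or({conditions})))"
        ':splitBy("dt.entity.host")'
        ":sort(value(auto,descending))"
        f":limit({limit})"
    )


up_selector = availability_selector(["up"])
down_selector = availability_selector(["unmonitored_agent_stopped", "no_data"])
print(up_selector)
print(down_selector)
```

With `["up"]` the function reproduces the selector quoted above character for character; the second call shows how additional states would be chained.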
Hi @JDS, if you use "up" as the state, you are going to see the count of all hosts that are in the up state. In that case you need to know the expected quantity and trigger the alert when the count decreases.
Ah, understood, @DanielS. If I monitor the "up" state and I have 1k hosts in our environment, then the threshold for this metric event will be 1000, so it will trigger an alert when the count is less than the threshold, correct?
I am looking for an alert to be triggered when any specific host in the environment is down, where the generated alert/Dynatrace problem tells me which specific host is down, so that I can send it to the server team through a problem notification.
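The difference between the two approaches discussed here can be sketched in a few lines of Python. This is an illustration of the reasoning only, with made-up host IDs and hypothetical function names; it is not how Dynatrace evaluates metric events.

```python
def up_count_breached(up_hosts, expected_count):
    """Aggregate check: True when fewer hosts report "up" than expected."""
    return len(set(up_hosts)) < expected_count


def hosts_not_up(expected_hosts, up_hosts):
    """Per-host check: return the expected hosts that did not report "up"."""
    return sorted(set(expected_hosts) - set(up_hosts))


# Made-up inventory and one minute of "up" samples.
expected_hosts = ["HOST-A", "HOST-B", "HOST-C"]
up_this_minute = ["HOST-A", "HOST-C"]

breached = up_count_breached(up_this_minute, expected_count=len(expected_hosts))
missing = hosts_not_up(expected_hosts, up_this_minute)
print(breached, missing)
```

The count-based check only says that something is down, while the per-host comparison names the host. That is why an alert split by host (or filtered on the down states) fits this use case better than a single threshold of 1000.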
Definitely a great improvement here. Have you been able to get a problem to remain open when the agent is not running? Creating a problem is no issue but it closes shortly after even when the agent is still not running. I assume it is because there is no continuous datapoint of 1 for the unmonitored_agent_stopped state, you only get the initial 1 datapoint when the agent stops.
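The assumption above can be illustrated with a deliberately simplified sliding-window model. The rule used here (open while at least `violating` of the last `window` one-minute samples equal 1) and both parameter names are assumptions made for this sketch, not Dynatrace's actual evaluation logic.

```python
def open_minutes(samples, window=3, violating=1):
    """Return the minute indexes at which the simplified alert is open.

    `samples` holds one value per minute: 1 if the state was reported,
    0 if there was no datapoint for that minute.
    """
    result = []
    for i in range(len(samples)):
        recent = samples[max(0, i - window + 1): i + 1]
        if sum(recent) >= violating:
            result.append(i)
    return result


# One datapoint when the agent stops, then nothing.
single_datapoint = [0, 0, 1, 0, 0, 0, 0, 0]
# What a continuously reported state would look like.
continuous = [0, 0, 1, 1, 1, 1, 1, 1]

single_open = open_minutes(single_datapoint)
continuous_open = open_minutes(continuous)
print(single_open, continuous_open)
```

Under this model the single datapoint keeps the alert open only while it is still inside the window, whereas a continuously reported state would keep it open for as long as the agent is stopped.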