
PRO TIP - New Metric Host Availability State

DanielS
DynaMight Guru

With the release of OneAgent 1.261, a new and long-awaited metric is available!

The OneAgent OS module now reports the host availability state as a metric.

Metric key: builtin:host.availability.state

DanielS_0-1679573991026.png

Reported states:

  • UP - host working, OneAgent active and sending data
  • NO_DATA - host working, OneAgent active but not sending data
  • NO_DATA_AGENT_INACTIVE - host working, OneAgent inactive (disabled manually in configuration) and not sending data
  • SHUTDOWN_HOST - host has been shut down
  • UNMONITORED_AGENT_STOPPED - host unmonitored: OneAgent stopped
  • UNMONITORED_AGENT_UPGRADE - host unmonitored: OneAgent in upgrade
  • UNMONITORED_AGENT_UNINSTALLED - host unmonitored: OneAgent uninstalled

The resolution is 1 minute: states are sent every minute, always with the value 1, which means that the reported state occurred in the given minute. If there is no sample with the given state, it means that the state was not detected in this minute.
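To make the "value 1 per state per minute" semantics concrete, here is a minimal sketch that counts how many minutes each state was reported. The `samples` layout and the `minutes_per_state` helper are hypothetical illustrations, not the format returned by any Dynatrace API.

```python
from collections import Counter

# Hypothetical sample layout: (minute_timestamp, state). Each tuple stands for
# one metric sample with value 1, i.e. "this state occurred in this minute".
samples = [
    ("2023-03-23T10:00", "UP"),
    ("2023-03-23T10:01", "UP"),
    ("2023-03-23T10:02", "SHUTDOWN_HOST"),
    ("2023-03-23T10:05", "UP"),
]

def minutes_per_state(samples):
    """Count the distinct minutes in which each state was reported."""
    # De-duplicate so a repeated (minute, state) pair is counted once.
    seen = {(minute, state) for minute, state in samples}
    return Counter(state for _, state in seen)

print(minutes_per_state(samples))
```

A state that never appears in the samples simply has a count of 0, which mirrors "no sample means the state was not detected in that minute".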

Why is it important?

Because it allows us to generate alerts for different use cases:

  • Uninstalled OneAgents
  • Stopped OneAgents
  • OneAgents that are running but not sending data
  • Manually disabled OneAgents

and thus improves our management of the deployed OneAgents.

DanielS_1-1679574031336.png

Thanks, Dynatrace Team!

 

The true delight is in the finding out rather than in the knowing.

radek_jasinski
DynaMight Guru

Great - it will be very useful!

Have a nice day!

JDS
Frequent Guest

Thanks for this info, @DanielS. What threshold should I use if I create a Metric Event with the metric key "builtin:host.availability.state"? I need a Metric Event that fires when the actual host is down.

Host Availbility State Metric Event.png

DanielS
DynaMight Guru

Hello @JDS, thanks. You could use the metric selector option with this expression:

builtin:host.availability.state:filter(and(or(eq("availability.state",shutdown_host)))):splitBy("dt.entity.host"):sort(value(auto,descending)):limit(100)

DanielS_0-1680199530213.png

If you want to cover more "down states", you can add them to the filter:

builtin:host.availability.state:filter(and(or(eq("availability.state",shutdown_host),eq("availability.state",unmonitored_agent_stopped)))):splitBy("dt.entity.host"):count:sort(value(avg,descending)):limit(100)
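If you need to maintain this expression for a changing list of states, a small helper can assemble it. This is only a sketch: `build_state_selector` is a hypothetical name, and the selector text it produces just copies the shape of the expression above.

```python
def build_state_selector(states, limit=100):
    """Assemble a metric selector that counts samples for the given states.

    Hypothetical helper: it only concatenates strings in the same shape as
    the selector posted in this thread and does not call any Dynatrace API.
    """
    # One eq(...) condition per state, joined inside a single or(...).
    conditions = ",".join(f'eq("availability.state",{state})' for state in states)
    return (
        "builtin:host.availability.state"
        f":filter(and(or({conditions})))"
        ':splitBy("dt.entity.host")'
        ":count"
        ":sort(value(avg,descending))"
        f":limit({limit})"
    )

print(build_state_selector(["shutdown_host", "unmonitored_agent_stopped"]))
```

Splitting by "dt.entity.host" keeps one series per host, so the result still tells you which host reported the state.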

 


JDS
Frequent Guest

Thanks @DanielS. I looked for the additional states in Data Explorer, but I am only getting the "no_data" and "up" states. If I use this expression in the metric selector, what threshold value will trigger an event when the host goes down?

I tried using the host availability percentage metric, but it doesn't scale because the allowed number of dimensions is only 5000. We have around 12K hosts in our non-production environment.

 

JDS_0-1680200663074.png

 

DanielS
DynaMight Guru

Hi @JDS, you don't see the other states because no such events occurred in the selected time window. I listed all the states in my first post; try using them in the filters. When the value for a state goes from 0 to 1, this will trigger the event. I think this is the best approach.


JDS
Frequent Guest

Thanks @DanielS. I used this Metric selector expression "builtin:host.availability.state:filter(and(or(eq("availability.state","up")))):splitBy("dt.entity.host"):sort(value(auto,descending)):limit(100)" and received the data points shown in the screenshot. 

I don't understand what the values 10, 6, 12, etc. mean. Also, what do the values on the Y axis stand for? The values on the X axis refer to the timeline, which is clear to me.

JDS_0-1680204535149.png

 

DanielS
DynaMight Guru

Hi @JDS, if you use up as the state, you will see the count of all hosts that are in the up state. In that case you need to know the expected number of hosts and trigger the alert when the count decreases.

 


JDS
Frequent Guest

Understood, @DanielS. If I monitor the UP state and have 1k hosts in our environment, then the threshold for this Metric Event would be 1000, so it triggers an alert when the count drops below the threshold, correct?

I am looking for an alert that triggers when any specific host in the environment is down, where the generated alert/Dynatrace problem names the specific host, so that I can send it to the server team through a problem notification.

sivart_89
Mentor

Definitely a great improvement here. Have you been able to get a problem to remain open when the agent is not running? Creating a problem is no issue but it closes shortly after even when the agent is still not running. I assume it is because there is no continuous datapoint of 1 for the unmonitored_agent_stopped state, you only get the initial 1 datapoint when the agent stops. 

Thanks @sivart_89, but you can also alert on missing data:

DanielS_0-1680279488524.png
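The idea behind alerting on missing data can be sketched roughly as follows: instead of waiting for a "down" sample, flag any host that has stopped reporting UP for a few minutes. This is an illustration only, not how Dynatrace evaluates metric events; the `hosts_missing_up` helper, the integer minute indexes, and the 3-minute window are all assumptions.

```python
def hosts_missing_up(up_minutes_by_host, now_minute, window=3):
    """Return hosts with no UP sample in the last `window` minutes.

    Hypothetical data layout: host id -> list of minute indexes (integers)
    in which an UP sample with value 1 was received.
    """
    # The minutes that must contain at least one UP sample.
    recent = set(range(now_minute - window + 1, now_minute + 1))
    return sorted(
        host
        for host, minutes in up_minutes_by_host.items()
        if not recent & set(minutes)
    )

up_minutes_by_host = {
    "HOST-A": [1, 2, 3, 4, 5],  # still reporting UP
    "HOST-B": [1, 2],           # stopped reporting after minute 2
}
print(hosts_missing_up(up_minutes_by_host, now_minute=5))
```

Because the check runs per host and keeps failing for as long as UP samples are absent, it stays "open" while the agent is stopped, unlike a single datapoint for unmonitored_agent_stopped.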

 


santruan
Participant

Hi experts, regarding this topic, is it possible to obtain the time in which a host was in the shutdown state?
