
Avoiding host unavailable alerts for GCP preemptible VMs?

r_weber
Pro

Hello,

I was wondering if anyone has worked with Google's preemptible nodes. These are basically discounted resources that have a maximum lifespan of 24 hours but can be terminated at any time.

This is quite useful to use with GKE (K8s) as K8s takes care of rescheduling workloads.

However, when these hosts are monitored with Dynatrace, they create host unavailable alerts whenever they get killed.
In the case of a K8s cluster, this should be handled differently and not alerted on.

First I thought of creating a customized anomaly detection rule based on tagging these preemptible nodes, but that is not possible in Dynatrace.
Then I thought of creating an alerting profile that excludes these tagged hosts, but that is also not possible (see here: https://answers.dynatrace.com/idea/241247/view.html).

Then I looked a bit deeper into Google's documentation and found that when a preemptible node is being shut down, GCP sends an ACPI G2 Soft Off signal. A user script should capture this signal to ensure a graceful shutdown of the services running on the preemptible host.
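As a rough illustration of how a user script could notice the preemption, here is a minimal sketch in Python. It assumes the script runs on the VM itself and queries the Compute Engine metadata server's `instance/preempted` value. The function names `parse_preempted` and `is_preempted` are hypothetical helpers made up for this example; this is not something Dynatrace or the OneAgent provides.

```python
import urllib.request

# Compute Engine metadata endpoint that reports whether this VM was preempted.
PREEMPTED_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
)


def parse_preempted(body: str) -> bool:
    """Interpret the metadata server's plain-text answer (TRUE or FALSE)."""
    return body.strip().upper() == "TRUE"


def is_preempted(timeout: float = 2.0) -> bool:
    """Ask the metadata server whether a preemption notice was issued.

    Only works from inside a GCE VM; the Metadata-Flavor header is required.
    """
    req = urllib.request.Request(
        PREEMPTED_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_preempted(resp.read().decode("utf-8"))
```

A shutdown script could call `is_preempted()` to tell a preemption apart from an ordinary shutdown before deciding how to react.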

To me, the logical step is to let the Dynatrace agent detect this G2 Soft Off signal and react to it by performing a graceful shutdown, which would then not lead to an alert in Dynatrace. Is this something that the OneAgent Operator could be enhanced with?
I will create an RFE if no other solution exists yet.

Reinhard

3 REPLIES

adam_gardner
Dynatrace Champion

I'd raise this as an RFE too, but in the short term, if your script can catch the G2 Soft Off signal, you can always use this:

POST to https://tenantID.live.dynatrace.com/api/v1/events
with a MARKED_FOR_TERMINATION event
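A minimal sketch of such a call in Python, using only the standard library. It assumes the Events API v1 accepts a JSON body with `eventType`, `attachRules.entityIds`, `source`, and `description`, and an `Authorization: Api-Token ...` header; check the field names against the API documentation for your tenant. The helper names, the `preemption-hook` source string, and the placeholder host entity ID are made up for illustration.

```python
import json
import urllib.request


def build_termination_event(host_entity_id: str,
                            source: str = "preemption-hook") -> dict:
    """Build the JSON payload for a MARKED_FOR_TERMINATION event.

    host_entity_id is the Dynatrace entity ID of the host (HOST-...).
    """
    return {
        "eventType": "MARKED_FOR_TERMINATION",
        "attachRules": {"entityIds": [host_entity_id]},
        "source": source,
        "description": "GCP preemptible VM is being shut down",
    }


def send_event(base_url: str, api_token: str, event: dict,
               timeout: float = 5.0) -> int:
    """POST the event to <base_url>/api/v1/events and return the HTTP status."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/events",
        data=json.dumps(event).encode("utf-8"),
        headers={
            "Authorization": "Api-Token " + api_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

From a shutdown script this would be called roughly as `send_event("https://tenantID.live.dynatrace.com", token, build_termination_event(host_id))`, where `token` and `host_id` have to be supplied by you. Preemption leaves very little time, so a short timeout is probably wise.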

I'm not sure, though, what the MARKED_FOR_TERMINATION event would do in the problem detection AI engine. If it ensures that the shutdown is treated as a graceful one, then great - I'd only have to ensure the DT event API is accessible.

I'd rather send this event to the OneAgent on the host directly instead of to the tenant event API.

Anonymous
Not applicable
Yeah, that's the idea, avoid raising a problem for that host.


In my old org, this was used when terminating short-lived EC2 instances: a Lambda made the API calls whenever such a termination event occurred.