I was wondering if anyone has worked with Google's preemptible nodes. These are basically discounted resources that have a maximum lifespan of 24 hours but can be terminated at any time.
This is quite useful to use with GKE (K8s) as K8s takes care of rescheduling workloads.
However, when these hosts are monitored with Dynatrace, every time one gets killed it creates a host unavailable alert.
In a K8s cluster a preempted node is expected behavior, so it should be handled differently and not alerted on.
First I thought of creating a customized anomaly detection setting based on tagging these preemptible nodes, but that is not possible in Dynatrace.
Then I thought of creating an alerting profile that excludes these tagged hosts, but that is also not possible (see here: https://answers.dynatrace.com/idea/241247/view.html).
Then I looked a bit deeper into Google's documentation and found that when a preemptible node is being shut down, GCP sends an ACPI G2 Soft Off signal. A user script should capture this signal to ensure a graceful shutdown of the services running on the preemptible host.
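For reference, here is a rough Python sketch of how a script on the node might notice the preemption. Note the assumption: instead of handling the ACPI signal itself, it asks the GCE metadata server for the instance's `preempted` value, which to my knowledge is exposed at the URL below. The function names and the `on_preempted` hook are hypothetical, so treat this as a starting point rather than a tested solution.

```python
import urllib.request

# Assumption: the GCE metadata server exposes the preemption flag at this path.
PREEMPTED_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
)


def build_preemption_request(wait_for_change=False):
    """Build the metadata request; wait_for_change makes the call block
    until the value changes instead of returning immediately."""
    url = PREEMPTED_URL + ("?wait_for_change=true" if wait_for_change else "")
    # The metadata server rejects requests without this header.
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def parse_preempted(body):
    """The endpoint answers with the text TRUE or FALSE."""
    return body.strip().upper() == "TRUE"


def wait_for_preemption(on_preempted, opener=urllib.request.urlopen):
    """Block on the metadata server and run the (hypothetical) hook once
    the instance is reported as preempted. Returns True if the hook ran."""
    request = build_preemption_request(wait_for_change=True)
    with opener(request) as response:
        if parse_preempted(response.read().decode("utf-8")):
            on_preempted()
            return True
    return False
```

The `opener` parameter is only there so the logic can be exercised without a real metadata server. On an actual node you would call `wait_for_preemption(my_hook)` in a loop or from a small daemon, keeping in mind that the time between the preemption notice and the hard stop is short.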
To me this seems to be the logical step: let the Dynatrace agent detect the G2 Soft Off signal and react to it by performing a graceful shutdown. This would then not lead to an alert in Dynatrace. Is this something that the OneAgent Operator can be enhanced with?
I will create an RFE if no other solution exists as of yet.
I'd raise this as an RFE too, but in the short term, if your script can catch the G2 Soft Off signal, you can always use this:
POST to https://tenantID.live.dynatrace.com/api/v1/events
with a MARKED_FOR_TERMINATION event
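A minimal sketch of what that call could look like from a Python shutdown script. The payload fields (`eventType`, `source`, `description`, `attachRules.entityIds`) and the `Api-Token` authorization header are how I remember the Events API v1, so please check them against the API documentation for your tenant. The helper names, the `source` value, and the host entity ID are placeholders, and how the script learns its own host entity ID is left open here.

```python
import json
import urllib.request


def build_termination_event(host_entity_id, source="preemption-hook"):
    """Build the JSON body for a MARKED_FOR_TERMINATION event.
    host_entity_id is the Dynatrace ID of the host being preempted."""
    return {
        "eventType": "MARKED_FOR_TERMINATION",
        "source": source,  # free-text name of the sender (placeholder)
        "description": "GCP preemptible node is being shut down",
        "attachRules": {"entityIds": [host_entity_id]},
    }


def build_event_request(tenant_url, api_token, event):
    """Prepare the POST to <tenant>/api/v1/events without sending it."""
    return urllib.request.Request(
        tenant_url.rstrip("/") + "/api/v1/events",
        data=json.dumps(event).encode("utf-8"),
        headers={
            "Authorization": "Api-Token " + api_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_event(request, opener=urllib.request.urlopen):
    """Send the request with a short timeout, since the node has little
    time left once preemption starts. Returns the HTTP status code."""
    with opener(request, timeout=5) as response:
        return response.status
```

Usage would be along the lines of `send_event(build_event_request("https://tenantID.live.dynatrace.com", token, build_termination_event(host_id)))`, where `token` is an API token allowed to push events and `host_id` is the entity ID of the node.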
I'm not sure, though, what the MARKED_FOR_TERMINATION event would then do in the problem detection AI engine. If it ensures that the shutdown is seen as a graceful one, then great. We would only have to make sure the Dynatrace events API is reachable from the nodes.
I'd rather send this event to the OneAgent on the host directly instead of to the tenant's events API.