23 Oct 2020 09:27 AM - last edited on 16 Oct 2023 03:17 PM by random_user
Hello,
I was wondering if anyone has worked with Google's preemptible nodes. These are basically discounted resources that have a maximum lifespan of 24 hours but can be terminated at any time.
This is quite useful to use with GKE (K8s) as K8s takes care of rescheduling workloads.
However, when these hosts are monitored with Dynatrace and get killed, they create host unavailable alerts.
In the case of a K8s cluster this should be handled differently and not alerted.
First I thought of just creating a customized anomaly detection based on tagging these preemptible nodes, but that is not possible in Dynatrace.
Then I thought of creating an alerting profile that excludes these tagged hosts but that is also not possible (see here: https://community.dynatrace.com/idea/241247/view.html)
Then I looked a bit deeper into Google's documentation and found that when a preemptible node is being shut down, GCP sends an ACPI G2 Soft Off signal that should be captured by a user script to ensure a graceful shutdown of the services running on the preemptible host.
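To illustrate what such a user script could look like, here is a minimal Python sketch. It assumes the soft-off makes the guest OS begin a normal shutdown and deliver SIGTERM to running processes, which is typical on Linux but not something confirmed in this thread. The names `shutdown_requested`, `handle_sigterm` and `install_handler` are hypothetical.

```python
import signal
import threading

# Set once the operating system has asked this process to stop.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the shutdown request; real cleanup work would be triggered here."""
    shutdown_requested.set()


def install_handler():
    """Register the handler for SIGTERM (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, handle_sigterm)
```

A long-running script would call `install_handler()` at startup and then wait on `shutdown_requested` before stopping services or notifying the monitoring side.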
To me this seems to be the logical step: let the Dynatrace agent detect this G2 Soft Off signal and react to it by performing a graceful shutdown. This would then not lead to an alert in Dynatrace. Is this something that the OneAgent operator can be enhanced with?
I will create an RFE if no other solution exists as of yet.
Reinhard
27 Oct 2020 11:10 AM - last edited on 27 Mar 2023 08:36 AM by MaciejNeumann
I'd raise this as an RFE too, but in the short term, if your script can catch the G2 Soft Off signal, you can always use this:
POST to https://tenantID.live.dynatrace.com/api/v1/events
with a MARKED_FOR_TERMINATION event
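A rough sketch of that call in Python, using only the standard library. The payload fields (`eventType`, `source`, `description`, `attachRules.entityIds`) and the `Api-Token` authorization header follow my understanding of the Events API v1, so please check them against the API documentation for your environment. The helper names, the source string and the host ID shown are placeholders, not values from this thread.

```python
import json
import urllib.request


def build_termination_event(host_id, source="preemption-shutdown-script"):
    """Build the event body; host_id is the Dynatrace entity ID of the host."""
    return {
        "eventType": "MARKED_FOR_TERMINATION",
        "source": source,  # free-text name of the sender (placeholder value)
        "description": "Preemptible node is being shut down by GCP",
        "attachRules": {"entityIds": [host_id]},
    }


def build_request(tenant_url, api_token, payload):
    """Prepare (but do not send) the POST request to the events endpoint."""
    return urllib.request.Request(
        url=f"{tenant_url.rstrip('/')}/api/v1/events",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Api-Token {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it from the shutdown script would then be `urllib.request.urlopen(build_request(url, token, build_termination_event("HOST-...")), timeout=10)`, with the real host entity ID filled in. Building the request separately from sending it keeps the payload easy to inspect before anything goes over the network.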
27 Oct 2020 02:03 PM
I'm not sure, though, what the MARKED_FOR_TERMINATION event would do in the problem detection AI engine. If it ensures that the shutdown is seen as a graceful one, then great; I would only have to ensure the DT event API is accessible.
I'd rather send this event to the oneagent on the host directly instead of the tenant event API.
27 Oct 2020 04:15 PM
In my old org this was used to signal termination of short-lived EC2 instances; a Lambda made the calls to the API when such events occurred.