15 Aug 2025 06:51 PM
Hi,
I have a case with one of our customers who claims that during an update of the Dynatrace Managed Cluster, they receive false-positive alarms — specifically Host monitoring unavailable.
While I still need to check their environment (and I doubt that these alarms are truly false positives), their request made me think about how Dynatrace actually handles problems and alerts during an upgrade.
When we create a Maintenance Window, it’s clear — for the given filters and time period, all alarms or problems are ignored. However, during a Managed Dynatrace Cluster update, it obviously doesn’t work the same way (which is absolutely expected).
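For reference, this is roughly how we script such a window today. Just a sketch: it assumes the Settings 2.0 endpoint /api/v2/settings/objects, a token with settings.write, and the builtin:alerting.maintenance-window schema; the exact field names inside `value` should be checked against the schema in your own environment:

```python
import requests

# Assumptions: Managed environment URL and a token with the settings.write scope.
BASE_URL = "https://<cluster>/e/<environment-id>/api/v2"
TOKEN = "<api-token>"

# Sketch of a one-off maintenance window. The field names below are my best
# guess at the builtin:alerting.maintenance-window schema; verify them with
# GET /api/v2/settings/schemas/builtin:alerting.maintenance-window first.
payload = [{
    "schemaId": "builtin:alerting.maintenance-window",
    "scope": "environment",
    "value": {
        "enabled": True,
        "generalProperties": {
            "name": "Planned cluster upgrade",
            "maintenanceType": "PLANNED",
            "suppression": "DETECT_PROBLEMS_DONT_ALERT",
        },
        "schedule": {
            "scheduleType": "ONCE",
            "onceRecurrence": {
                "startTime": "2025-09-15T01:00:00",
                "endTime": "2025-09-15T03:00:00",
                "timeZone": "UTC",
            },
        },
    },
}]

resp = requests.post(
    f"{BASE_URL}/settings/objects",
    headers={"Authorization": f"Api-Token {TOKEN}"},
    json=payload,
)
print(resp.status_code, resp.text)
```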
My question is: how does Dynatrace decide which alerts or problems to ignore during a cluster upgrade, and how does the upgrade actually work?
Thanks!
Regards, Deni
16 Aug 2025 08:52 AM
Hi @deni
If the customer's cluster is running on a single node, the unavailability alerts are expected, since the downtime of the cluster is longer than 3 minutes; once the cluster is upgraded and running again, it will open those "false" alarms.
If the cluster is running on 3 or more nodes, that's odd, and you will need to look at the logs / open a ticket.
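One quick way to confirm that is to pull the problems for the upgrade time range from the Problems API v2 and check whether the "unavailable" problems open right around the cluster downtime. A rough sketch (the title filtering is done client-side, since the exact problem title can vary; URL, token, and time range are placeholders):

```python
import requests

BASE_URL = "https://<cluster>/e/<environment-id>/api/v2"
TOKEN = "<api-token with problems.read>"

# Time range of the cluster upgrade (example values).
params = {"from": "2025-08-15T17:00:00Z", "to": "2025-08-15T20:00:00Z", "pageSize": 500}

resp = requests.get(
    f"{BASE_URL}/problems",
    headers={"Authorization": f"Api-Token {TOKEN}"},
    params=params,
)
resp.raise_for_status()

# Keep only host/monitoring-unavailable problems; startTime/endTime are epoch ms.
for p in resp.json().get("problems", []):
    if "unavailable" in p["title"].lower():
        print(p["displayId"], p["title"], p["startTime"], p.get("endTime"))
```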
HTH
Yos
20 Aug 2025 10:25 AM
Hi @Yosi_Neuman
Thanks a lot for your reply.
Actually, we already opened a ticket, but still don’t have a resolution.
This is a cluster with more than 3 nodes, and for the last 1–2 years everything was working fine. The client confirmed that this behavior only started about a month ago.
We were able to reproduce and confirm:
If we stop the automatic updates, the alarms also stop.
If we manually trigger the update, the alarms immediately start appearing.
I’ll continue digging into the logs, but would really appreciate any advice on what specifically to look for in this case.
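In the meantime I'm scanning the OneAgent/ActiveGate pod logs with a small script like the one below, to pair shutdown/reconnect lines and see how long each gap really is compared with the 3-minute threshold. The keywords, timestamp format, and log file name are only assumptions about how our logs look, not anything official from Dynatrace:

```python
import re
from datetime import datetime

# Placeholder patterns; the exact wording differs between OneAgent versions,
# so adjust the keywords to whatever your logs actually contain.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
DOWN_KEYWORDS = ("shutdown", "disconnect")
UP_KEYWORDS = ("connected", "started")

def gaps(log_path):
    """Yield (down_time, up_time, seconds) for each down/up pair found in the log."""
    down = None
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            m = TS_RE.match(line)
            if not m:
                continue
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            lower = line.lower()
            if any(k in lower for k in DOWN_KEYWORDS):
                down = ts
            elif down and any(k in lower for k in UP_KEYWORDS):
                yield down, ts, (ts - down).total_seconds()
                down = None

# "oneagent.log" is a placeholder path for the exported pod log.
for d, u, secs in gaps("oneagent.log"):
    flag = "  <-- exceeds 3 min" if secs > 180 else ""
    print(f"{d} -> {u}  ({secs:.0f}s){flag}")
```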
Regards, Deni
01 Sep 2025 07:09 AM
Hi @deni, we are starting to notice similar alerts for Host or Monitoring Unavailable during our AKS cluster upgrades. Please check if anything got changed on the Dynatrace side. We have upgraded to the latest version of the Dynatrace Operator, 1.6.1.
01 Sep 2025 08:22 AM
We opened a ticket for this — the Dynatrace team is still investigating.
Once we have some resolution I'll update this post.
Regards, Deni
01 Sep 2025 09:00 AM
Hello @deni
What is the Cluster version?
Please provide us with the findings from your support team, as we have not previously encountered this type of issue.
Regards,
Babar
01 Sep 2025 01:50 PM - edited 01 Sep 2025 01:58 PM
Hi @Babar_Qayyum,
I'm sharing the ticket, since there is a lot of information: https://one.dynatrace.com/hc/en-us/requests/532933
If you can't access it, I'll try to summarize what we have so far.
About the versions: I actually asked the customer to check when the problem started, i.e. to identify a working and a non-working version, but they couldn't tell me. It is NOT only in the last month as they said the first time; before that there were also opened problems, but they only started monitoring them a month ago. Since the problem is reproducible, I started digging into exactly when the pods receive the termination signal, when they start again, and how this can be mapped to the opened Problem times.
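To map those pod restarts to the Problem times I'm using a small sketch based on the kubernetes Python client (the "dynatrace" namespace is an assumption for where the OneAgent/ActiveGate pods run in this setup):

```python
from kubernetes import client, config

config.load_kube_config()          # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

# Assumption: operator-managed OneAgent / ActiveGate pods run in this namespace.
NAMESPACE = "dynatrace"

# Print, per container, when it was last terminated and since when it has been
# running, so the timestamps can be compared with the Problem open times.
for pod in v1.list_namespaced_pod(NAMESPACE).items:
    for cs in (pod.status.container_statuses or []):
        running_since = cs.state.running.started_at if cs.state.running else None
        last_terminated = (cs.last_state.terminated.finished_at
                           if cs.last_state and cs.last_state.terminated else None)
        print(pod.metadata.name, cs.name,
              "last terminated:", last_terminated,
              "running since:", running_since)
```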
Regards, Deni
02 Sep 2025 05:40 AM
Hello @deni
Regrettably, I was unable to access the ticket. If feasible, kindly provide a summary along with the version so that we can verify it on our end, as we upgrade the cluster regularly, at least every other month.
Regards,
Babar
02 Sep 2025 02:23 PM
I extracted most of the information from the ticket. I asked for the version and will post it soon (I don't have direct access to the customer's environment; I get the info while sitting with someone who has access).
We are able to reproduce the issue consistently. Regardless of whether we start a manual or scheduled update, false-positive “Host or monitoring unavailable” alarms appear during the process.
Our monitoring chain: Host → OneAgent → ActiveGate (Kubernetes) → ActiveGate (second) → Dynatrace Managed
Timeline of Reproduced Problem
Monitoring unavailable
“Monitoring unavailable status means that the Dynatrace server didn’t receive the heartbeats of your monitored hosts for more than 3 minutes.”
In our case, the disconnection was 1 minute, not 3+.
OneAgent sends an Unavailable signal only if it loses connection with the host for >5 minutes.
Here, the downtime is <2 minutes, but the alarm was still raised.
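Just to make the mismatch explicit, this is the comparison I keep coming back to (the numbers are the documented thresholds quoted above plus our observed downtime):

```python
# Observed downtime in our reproduction vs. the two documented thresholds.
observed_downtime_s = 90                 # roughly 1-2 minutes of lost communication
server_heartbeat_threshold_s = 180       # "no heartbeats for more than 3 minutes"
oneagent_unavailable_threshold_s = 300   # "connection lost for more than 5 minutes"

for rule, threshold in [
    ("server heartbeat rule (3 min)", server_heartbeat_threshold_s),
    ("OneAgent unavailable rule (5 min)", oneagent_unavailable_threshold_s),
]:
    print(f"{rule}: downtime {observed_downtime_s}s > {threshold}s ? "
          f"{observed_downtime_s > threshold}")
# Both print False, yet the "Host or monitoring unavailable" problem is raised.
```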
The support replied:
Q: Why does Dynatrace report a Host problem when the host was fully operational, and the issue was due to OneAgent / ActiveGate update communication loss?
A: Dynatrace reported a host problem because, during the update, no monitoring data was received.
Q: Shouldn't the alarm type indicate "Agent/ActiveGate communication interruption" instead of "Host unavailable"?
A: The problem page does clarify that the issue may be due to "Host or monitoring unavailable due to connectivity issues or server outage." While the label may suggest a host-level issue, it also encompasses scenarios where monitoring is temporarily interrupted, such as during updates.
Q: What are the expected steps during a OneAgent/ActiveGate update?
A: This process is similar for both OneAgent and ActiveGate. Below are timestamps from your environment:
Here support gave a list of timestamps, which differs from the ones I got:
gateway0: 17:59:24 - 17:59:33
agent: 18:01:10 - 18:03:06
Since both updates took more than 3 minutes, the temporary loss of communication triggered the host problem alert.
After that I left the following questions (still waiting for a response to them) and attached the logs plus screenshots from the Kubernetes commands:
1. Agent logs from our analysis vs agent logs from your comment:
While monitoring the agent connected to the host from the alarm, the logs show different times from the ones in your comment.
- Example: 17:58:05 → Shutdown signal received → update starts.
- kubectl confirms the pod was in Running state by 18:04 (running for 6 minutes).
This means the pod was already running at ~17:59.
However, you mentioned that the agent was down from 18:01 to 18:03.
What am I missing here? Which logs are you checking to reach this conclusion?
2. Gap between ActiveGate and OneAgent updates
According to your logs, the ActiveGate update finishes at 17:59 and the OneAgent update starts at 18:01. That leaves a 2-minute period where both should be running.
Why, during this time, does the Dynatrace cluster not receive host-alive events?
3. Alarm trigger vs running state
The alarm indicates the host lost connection at 17:56 and was fired at 18:58.
Yet, both your logs and mine show that OneAgent and ActiveGate were running at that time.
How is this possible?
4. Documentation mismatch
In the documentation at https://docs.dynatrace.com/docs/discover-dynatrace/platform/davis-ai/root-cause-analysis/concepts/ev... it says:
“Network connection to the monitored host is lost unexpectedly while OneAgent and host are still running. The connection must be lost for more than 5 minutes, before OneAgent starts sending signals again and cached metric data fills in the missing chart data.”
Based on this, when exactly should the alarm be fired? The behavior we observe doesn't fully align with this description. In my first comment I posted another link to the documentation, where it says 3 minutes. What is the difference between the two documents?
5. Update process behavior
If the combined OneAgent + ActiveGate update exceeds the timeout:
- Shouldn’t the process update OneAgent first, wait until it’s running, and only then update ActiveGate (or vice versa)?
- Alternatively, shouldn’t the system suspend alarms automatically (at least for the host undergoing the update) if such timeouts are expected during updates?
6. Next steps / Solution needed
We need a practical solution.
- We can’t simply use a maintenance window, since this cluster is not the only one being monitored.
- The update sequence is: Dynatrace cluster update (≈1h, variable) → OneAgent/ActiveGate update (which actually causes the problem).
- We know when the cluster update starts, but we don’t know when OneAgent/ActiveGate update will begin, so we can’t align a maintenance window reliably.
What are the recommended options in this case?
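One idea I'm currently experimenting with (just a sketch, not an official approach): watch the Dynatrace namespace for the operator terminating the OneAgent/ActiveGate pods, and only then open a short maintenance window via the API, so we don't have to guess when that phase begins. The namespace, the pod-name matching, and the create_maintenance_window helper are all assumptions about this particular setup:

```python
from kubernetes import client, config, watch

config.load_kube_config()          # or config.load_incluster_config() when run as a pod
v1 = client.CoreV1Api()

NAMESPACE = "dynatrace"            # assumption: operator-managed pods live here

# When a OneAgent / ActiveGate pod is deleted (the operator starting to roll it
# for the update), trigger the maintenance-window creation.
for event in watch.Watch().stream(v1.list_namespaced_pod, namespace=NAMESPACE):
    pod = event["object"]
    name = pod.metadata.name.lower()
    if event["type"] == "DELETED" and ("oneagent" in name or "activegate" in name):
        print(f"{pod.metadata.name} is being replaced, opening maintenance window")
        # create_maintenance_window(...)   # hypothetical helper wrapping the
        #                                  # Settings 2.0 call from my first post
        break
```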