
Maintenance window during Dynatrace Managed Cluster update

deni
Advisor

Hi,

I have a case with one of our customers who claims that during an update of the Dynatrace Managed Cluster, they receive false-positive alarms — specifically Host monitoring unavailable.

While I still need to check their environment (and I doubt that these alarms are truly false positives), their request made me think about how Dynatrace actually handles problems and alerts during an upgrade.

When we create a Maintenance Window, it’s clear — for the given filters and time period, all alarms or problems are ignored. However, during a Managed Dynatrace Cluster update, it obviously doesn’t work the same way (which is absolutely expected).

My question is: how does Dynatrace decide which alerts or problems to ignore during the upgrade, and how does the upgrade actually work?

Thanks!

Regards, Deni

 

Dynatrace Integration Engineer at CodeAttest
8 REPLIES

Yosi_Neuman
DynaMight Guru

Hi @deni 

If the customer's cluster is running on a single node, the unavailable alerts are to be expected: the cluster downtime is longer than 3 minutes, and once the cluster is upgraded and running again it will open those "false" alarms.

If the cluster is running on 3 nodes or more that's odd and you will need to look at the logs / open a ticket.

HTH

Yos 

dynatrace certificated professional - dynatrace master partner - Matrix Soft Ware Division - Israel

Hi @Yosi_Neuman 
Thanks a lot for your reply.
Actually, we already opened a ticket, but still don’t have a resolution.

This is a cluster with more than 3 nodes, and for the last 1–2 years everything was working fine. The client confirmed that this behavior only started about a month ago.

We were able to reproduce and confirm:

  • If we stop the automatic updates, the alarms also stop.

  • If we manually trigger the update, the alarms immediately start appearing.

I’ll continue digging into the logs, but would really appreciate any advice on what specifically to look for in this case.

Regards, Deni

Dynatrace Integration Engineer at CodeAttest

Ramprasath
Newcomer

Hi @deni, we are starting to notice similar alerts for Host or Monitoring Unavailable during our AKS cluster upgrade. Please check whether anything changed on the Dynatrace side. We have upgraded to the latest version of the Dynatrace Operator, 1.6.1.

We opened a ticket for this — the Dynatrace team is still investigating. What we know so far is:

  • Once the cluster update finishes successfully, the OneAgent and ActiveGate updates are triggered.
  • This update activity seems to cause the alarm.
  • In the logs, we can see they are never down for more than 2 minutes.

Once we have some resolution I'll update this post.

Regards, Deni

Dynatrace Integration Engineer at CodeAttest

Babar_Qayyum
DynaMight Guru

Hello @deni 

What is the Cluster version?

Please provide us with the findings from your support team, as we have not previously encountered this type of issue.

Regards,

Babar

Hi @Babar_Qayyum,

As a DynaMight Guru, I suppose you have access to the ticket?

I'm sharing the link because it contains a lot of information: https://one.dynatrace.com/hc/en-us/requests/532933

If not, I'll try to summarize what we have so far.

About the versions: I asked the customer to check when the problem started, i.e. to identify a working and a non-working version, but they couldn't tell me. It did NOT start in the last month, as they first said. Problems were being opened before that too, but they only started monitoring them a month ago. Since the problem is reproducible, I started digging into exactly when the pods receive the termination signal, when they start again, and how this maps to the times of the opened Problems.

Regards, Deni

Dynatrace Integration Engineer at CodeAttest

Hello @deni 

Regrettably, I was unable to access the ticket. If feasible, kindly provide a summary along with the version so that we can verify it on our end, as we are regularly, or at most every other month, upgrading the cluster. 


Regards,

Babar

 

@Babar_Qayyum 

I extracted most of the information from the ticket. I asked for the version and will post it soon (I don't have direct access to the customer's environment; I get information while sitting with someone who has access).

Problem Statement

We are able to reproduce the issue consistently. Regardless of whether we start a manual or scheduled update, false-positive “Host or monitoring unavailable” alarms appear during the process.

Observations

  1. Dynatrace Cluster Update - Rolling update across 3 nodes runs successfully, everything looks OK.
  2. After Cluster Update - ActiveGate and OneAgent updates begin. -> At this point, the false-positive alarms are triggered.

Architecture

Host → OneAgent → ActiveGate (Kubernetes) → ActiveGate (second) → Dynatrace Managed

Timeline of Reproduced Problem

OneAgent

  • 17:58:05 Shutdown signal received → update starts.
  • 17:59:43 Bootstrapping read-only deployment → update finished.
  • kubectl confirms pod was in Running state by 18:04 (running for 6 minutes).
  • OneAgent downtime due to update: ~1 minute (17:58 – 17:59).

ActiveGate (Kubernetes)

  • 17:58:06 Shutdown started.
  • 17:59:23 Restart initiated.
  • 17:59:33 State changed to RUNNING.
  • ActiveGate downtime due to update: ~1 minute (17:58 – 17:59).

ActiveGate (Second, external)

  • 18:04:20 Unregistered from cluster.
  • 18:05:25 Registered again.

Alarm Details

  • Alarm timeframe: 17:58 – 18:03 (open for ~4 minutes).
  • Host communication lost: 17:56 – 18:04. (At 17:56 according to the logs everything looks up and running?)
  • Host itself was never down → alarm is false positive.

Documentation References

Monitoring unavailable

“Monitoring unavailable status means that the Dynatrace server didn’t receive the heartbeats of your monitored hosts for more than 3 minutes.”

Source: https://docs.dynatrace.com/docs/ingest-from/dynatrace-oneagent/oneagent-troubleshooting/troubleshoot...

In our case, the disconnection was 1 minute, not 3+.

Host unavailable events

OneAgent sends an Unavailable signal only if it loses connection with the host for >5 minutes.

Source: https://docs.dynatrace.com/docs/discover-dynatrace/platform/davis-ai/root-cause-analysis/concepts/ev...

Here, downtime is <2 minutes, but alarm was still raised.
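
As a rough cross-check of the numbers above, here is a minimal Python sketch that computes each component's downtime from the log timestamps in my timeline and compares it with the two thresholds quoted from the documentation. The names `OUTAGES`, `downtime` and `report` are only illustrative, and the sketch assumes all timestamps fall on the same day.

```python
from datetime import datetime, timedelta

# Thresholds as quoted from the documentation above.
HEARTBEAT_LIMIT = timedelta(minutes=3)   # "Monitoring unavailable"
HOST_EVENT_LIMIT = timedelta(minutes=5)  # "Host unavailable" event

# (component, went down, came back) taken from the reproduced timeline.
OUTAGES = [
    ("OneAgent", "17:58:05", "17:59:43"),
    ("ActiveGate (Kubernetes)", "17:58:06", "17:59:33"),
    ("ActiveGate (second, external)", "18:04:20", "18:05:25"),
]


def downtime(stop: str, start: str) -> timedelta:
    """Time between two HH:MM:SS stamps (assumes the same day)."""
    fmt = "%H:%M:%S"
    return datetime.strptime(start, fmt) - datetime.strptime(stop, fmt)


def report(outages=OUTAGES):
    """Return (name, seconds down, over 3 min?, over 5 min?) per component."""
    rows = []
    for name, stop, start in outages:
        d = downtime(stop, start)
        rows.append((name, int(d.total_seconds()),
                     d > HEARTBEAT_LIMIT, d > HOST_EVENT_LIMIT))
    return rows


if __name__ == "__main__":
    for name, secs, over3, over5 in report():
        print(f"{name}: {secs}s down, >3 min: {over3}, >5 min: {over5}")
```

With these inputs no single component is down long enough to cross either threshold, which is why the alarm looks like a false positive to us.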

Open Questions

  1. Why does Dynatrace report a Host problem when the host was fully operational, and the issue was due to OneAgent / ActiveGate update communication loss?
  2. Shouldn’t the alarm type indicate “Agent/ActiveGate communication interruption” instead of “Host unavailable”?
  3. What are the expected steps during a OneAgent/ActiveGate update?
    In other words, what should the expected timeline of events look like when an agent goes briefly offline for update while the host itself remains up and running?

The support replied:

Why does Dynatrace report a Host problem when the host was fully operational, and the issue was due to OneAgent / ActiveGate update communication loss?

 
Dynatrace reported a host problem because, during the update, no monitoring data was received.
 

Shouldn’t the alarm type indicate “Agent/ActiveGate communication interruption” instead of “Host unavailable”?

 
The problem page does clarify that the issue may be due to "Host or monitoring unavailable due to connectivity issues or server outage." While the label may suggest a host-level issue, it also encompasses scenarios where monitoring is temporarily interrupted, such as during updates.
 

What are the expected steps during a OneAgent/ActiveGate update?

 

  • The updater process detects a new version.
  • It downloads the installer and performs pre-update checks.
  • The installer is executed, which stops the currently running agent.
  • After installation, the updated agent starts and re-establishes communication with the server.

 
This process is similar for both OneAgent and ActiveGate. Below are timestamps from your environment:

Here support listed timestamps that differ from the ones I got:

gateway0: 17:59:24 - 17:59:33
agent: 18:01:10 - 18:03:06

Since both updates took more than 3 minutes, the temporary loss of communication triggered the host problem alert.

After that I posted the following questions, attached the logs and screenshots of the Kubernetes commands, and I am still waiting for a response:

1. Agent logs from our analysis vs agent logs from your comment:
While monitoring the agent connected to the host from the alarm, the logs show different times from the ones in your comment.
- Example: 17:58:05 → Shutdown signal received → update starts.
- kubectl confirms the pod was in Running state by 18:04 (running for 6 minutes).
This means the pod was already running at ~17:59.
However, you mentioned that the agent was down from 18:01 to 18:03.
What am I missing here? Which logs are you checking to reach this conclusion?

2. Gap between ActiveGate and OneAgent updates
According to your logs, ActiveGate update finishes at 17:59 and OneAgent update starts at 18:01. That leaves a 2-minute period where both should be running.
Why, during this time, does the Dynatrace cluster not receive host-alive events?
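
To make this question concrete, here is a small Python sketch that merges overlapping outage intervals along the Host → OneAgent → ActiveGate → ActiveGate path and reports the longest continuous outage, once for my log timestamps and once for the list from support. It assumes the path counts as broken whenever any hop is down, that all timestamps are on the same day, and that the agent end time in support's list is 18:03:06 (consistent with "down from 18:01 to 18:03"). The names `iv`, `merge` and `longest_outage` are only illustrative.

```python
from datetime import datetime, timedelta


def iv(stop: str, start: str):
    """Turn two HH:MM:SS strings into a (down_from, up_again) pair."""
    fmt = "%H:%M:%S"
    return datetime.strptime(stop, fmt), datetime.strptime(start, fmt)


MY_LOGS = [
    iv("17:58:05", "17:59:43"),  # OneAgent
    iv("17:58:06", "17:59:33"),  # ActiveGate (Kubernetes)
    iv("18:04:20", "18:05:25"),  # ActiveGate (second, external)
]
SUPPORT_LOGS = [
    iv("17:59:24", "17:59:33"),  # gateway0
    iv("18:01:10", "18:03:06"),  # agent; end time ASSUMED to be 18:03:06
]


def merge(intervals):
    """Merge overlapping outages into disjoint ones, sorted by start."""
    merged = []
    for down, up in sorted(intervals):
        if merged and down <= merged[-1][1]:
            # Overlaps the previous outage: extend it if this one ends later.
            merged[-1] = (merged[-1][0], max(merged[-1][1], up))
        else:
            merged.append((down, up))
    return merged


def longest_outage(intervals) -> timedelta:
    """Longest continuous period in which at least one hop was down."""
    return max(up - down for down, up in merge(intervals))


if __name__ == "__main__":
    limit = timedelta(minutes=3)
    for label, data in (("my logs", MY_LOGS), ("support's list", SUPPORT_LOGS)):
        worst = longest_outage(data)
        print(f"{label}: longest outage {worst}, exceeds 3 min: {worst > limit}")
```

Under these assumptions neither data set contains a continuous outage of 3 minutes or more, so the missing host-alive events in between still need an explanation.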

3. Alarm trigger vs running state
The alarm indicates the host lost connection at 17:56 and was fired at 17:58.
Yet, both your logs and mine show that OneAgent and ActiveGate were running at that time.
How is this possible?

4. Documentation mismatch
The documentation at https://docs.dynatrace.com/docs/discover-dynatrace/platform/davis-ai/root-cause-analysis/concepts/ev...
says:

Network connection to the monitored host is lost unexpectedly while OneAgent and host are still running. The connection must be lost for more than 5 minutes, before OneAgent starts sending signals again and cached metric data fills in the missing chart data.

Based on this, when exactly should the alarm be fired? The behavior we observe doesn't fully align with this description. In my first comment I posted another documentation link that states 3 minutes. What is the difference between the two documents?

5. Update process behavior
If the combined OneAgent + ActiveGate update exceeds the timeout:
- Shouldn’t the process update OneAgent first, wait until it’s running, and only then update ActiveGate (or vice versa)?
- Alternatively, shouldn’t the system suspend alarms automatically (at least for the host undergoing the update) if such timeouts are expected during updates?

6. Next steps / Solution needed
We need a practical solution.
- We can’t simply use a maintenance window, since this cluster is not the only one being monitored.
- The update sequence is: Dynatrace cluster update (≈1h, variable) → OneAgent/ActiveGate update (which actually causes the problem).
- We know when the cluster update starts, but we don’t know when OneAgent/ActiveGate update will begin, so we can’t align a maintenance window reliably.
What are the recommended options in this case?

 

 

 

 

Dynatrace Integration Engineer at CodeAttest
