In a disaster scenario where a Dynatrace cluster and its security gateways go down during an outage, I have the following questions, some of which I've already gotten answered:
- How long do the OneAgents keep trying to send data back to the unavailable cluster or SGs?
- When the cluster becomes available again, will the agents automatically reconnect?
- How long must a cluster be unavailable before the agents will no longer automatically reconnect once it comes back?
- What is the OneAgent cache limit? = According to the PMs, the agent has a 10 MB backup queue that is transmitted to the cluster once communication is restored.
- How can the Dynatrace cluster be restored in a DR data center (cloud or on-premise)? = According to the PM, the Dynatrace Managed backup procedure is supported as of S140 and can be applied to a Dynatrace cluster with the same IP.
Just curious about the OneAgent cache - is this 10 MB documented somewhere? Based on my observation, it is non-persistent, and data is not stored on disk while the agent is unable to reach the cluster.
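For a rough sense of how much a 10 MB queue actually buys you, here's a back-of-the-envelope sketch. The per-agent data rate below is a pure assumption for illustration; real rates vary widely per host and workload:

```shell
# How long can a 10 MB backup queue buffer data while the cluster is down?
QUEUE_BYTES=10485760          # 10 MB backup queue (value seen in the agent log)
RATE_BYTES_PER_SEC=51200      # ASSUMED ~50 KB/s of monitoring data per agent

SECONDS_BUFFERED=$(( QUEUE_BYTES / RATE_BYTES_PER_SEC ))
echo "At ${RATE_BYTES_PER_SEC} B/s the queue fills in ~${SECONDS_BUFFERED}s"
```

So under that assumption an agent would only bridge a few minutes of outage before dropping data, which matches the observation that the queue is small and non-persistent.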
Hi @Julius L. - No, it's not documented anywhere that I could find. It was a comment posted by a PM on Slack.
It can be seen in the logs, although I'm not sure whether it is the backup queue or the async message queue. Some parameters are somewhat self-describing:
[native] Dispatcher buffersize ....... 25165824
[native] Backup queue size ........... 10485760
[native] Async message queue size .... 10485760
[native] Wait time for server ........ 20s
[native] Dispatcher taskinterval ..... 10s
[native] Switching URLs every ........ 5m 0s
[native] SQL string length Max .... 4096
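These lines can be pulled straight out of the agent logs. A minimal sketch, using a sample file as a stand-in for the real OneAgent log (the actual log location, e.g. under /var/log/dynatrace/oneagent/ on Linux, may differ per install):

```shell
# Sample excerpt standing in for the real OneAgent log file.
cat > /tmp/oneagent_sample.log <<'EOF'
[native] Dispatcher buffersize ....... 25165824
[native] Backup queue size ........... 10485760
[native] Async message queue size .... 10485760
EOF

# Pull out the queue-related settings.
grep -E "queue size" /tmp/oneagent_sample.log
```

Against a real install you would point the same grep at the agent's log directory instead of the sample file.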
It would certainly help to know the internals to be able to answer questions from customers.
Hello Eric,
As of now there is no DR solution provided by Dynatrace.
But we can have these two custom options:
1) Build the same number of DT Managed nodes in the DC and in DR
For example: 3 in the DC and 3 in DR, then add all 6 nodes to one cluster.
You should use a public SG so that OneAgents can load-balance to the next available SG.
Concerns: this involves a lot of data transfer between the two data centers and calls for a detailed evaluation of the available bandwidth between the two DCs.
Since OneAgent captures a lot of data, this DR solution may put significant load on the network pipeline between the two data centers.
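To make that bandwidth concern concrete, here is a rough sizing sketch. Every number in it (agent count, per-agent rate, replication factor, cross-DC share) is an assumption for illustration, not a Dynatrace figure:

```shell
# Back-of-the-envelope estimate of the cross-DC link load for option 1.
AGENTS=1000                    # ASSUMED fleet size
BYTES_PER_AGENT_PER_SEC=51200  # ASSUMED ~50 KB/s of data per agent
REPLICATION_FACTOR=2           # ASSUMED number of copies kept across nodes

TOTAL=$(( AGENTS * BYTES_PER_AGENT_PER_SEC * REPLICATION_FACTOR ))
CROSS_DC=$(( TOTAL / 2 ))      # ASSUME roughly half the traffic crosses the DC link
echo "Estimated cross-DC load: $(( CROSS_DC * 8 / 1000000 )) Mbit/s"
```

Even with these modest assumptions the steady-state cross-DC load lands in the hundreds of Mbit/s, which is why the bandwidth evaluation matters before committing to a stretched cluster.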
2) Back up the DT Managed cluster and restore it at the DR site in case of failover (as already mentioned by Eric)
This is possible via:
a) Restoring the DT Managed cluster on identical servers at the DR site (keeping the same IPs)
b) Restoring the complete server as-is at the DR site at the hardware level
Concerns: this has to be tested for infrastructure-level replication challenges, depending on the complexity of the environment.
But before you decide on any of these (or any other) DR scenarios, please answer the questions below about your application DR:
1) During DR, how do things fail over for your applications and other infrastructure: do all the applications fail over to DR, or just a few?
How many network VLAN segments fail over to DR: all or a few?
How is DR planned for such applications: automatic restore or manual?
2) How critical is DR for your environment: if your primary DC goes entirely down, is anybody really concerned about Dynatrace monitoring?
3) Are you also trying to make this DR handle component-level failures such as:
- all DT cluster nodes going down in the primary DC while application OneAgents and SGs remain up?
- all SGs going down in the DC while the DT cluster nodes are up?
If that is an assumption in your DR planning, then this can all be handled with infrastructure-level failover options such as vMotion and similar tools.
For your option 1), suppose the DR site has its OneAgents disabled, and they are only activated when the whole primary site goes down. In that case, do we really need to worry about network bandwidth, especially since transaction storage is not in the replication list? In other words, if the 3 nodes in the DR site are just active standbys with no agent traffic, does network latency have any impact on the 3 cluster nodes in the primary site?