
What happens if a software update fails?

nkobayashi1
Helper

Hi,

If a Dynatrace Managed cluster upgrade fails, will it roll back automatically?
Or is there some other recovery process?

Our customer is planning to provide Dynatrace Managed to their customers as a service.
So they want to minimize Dynatrace's downtime as much as possible.

Best Regards,
Noah Kobayashi

3 REPLIES

gautier_begin
Advisor

Noah,

It really depends on the failure. In most cases (about 90%), because you are running a cluster of at least 3 nodes, the end user doesn't notice any trouble. It's rare that the upgrade fails on every node on the same element (Cassandra, Elasticsearch, etc.) at the same time.

Regards,

nkobayashi1
Helper

Hi Gautier,

Thank you for your response.
But I cannot find any recovery plan in your post.

> It's rare that the upgrade fails on every node on the same element (Cassandra, Elasticsearch, etc.) at the same time.

I understand that this is a very rare case, but our customer wants to account for it.
Could you tell me whether you have a recovery plan for when the cluster crashes during an update?

I believe all of the cluster nodes are updated at the same time.
I got that answer in the following post: https://answers.dynatrace.com/spaces/482/dynatrac...

So I think that if something is wrong in the update file, it could stop all of the nodes.

Thanks, Noah

Radoslaw_Szulgo
Dynatrace Leader

Copy of the answer from:

https://answers.dynatrace.com/spaces/482/dynatrace...

During an upgrade, all nodes are shut down while the binaries are upgraded, so downtime is expected. It works like this:

1. We shut down all nodes.

2. We upgrade them one by one.

3. Once the upgrade is done, we start all nodes one by one.

As mentioned, it usually takes about 10 minutes until the first node starts. In a normal situation, the whole operation can take up to 30 minutes. This strongly depends on disk speed, network operations, and load.


To answer your question directly: yes, downtime is expected even for a multi-node cluster.

FYI: With version 148 we will start supporting rolling upgrades, which means zero downtime, as only one node is upgraded at a time. The Mission Control team will decide whether an upgrade is performed in rolling fashion or as a full upgrade with downtime. You'll be informed of the mode by e-mail notification before the upgrade.
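
To make the difference between the two upgrade modes concrete, here is a minimal simulation sketch. It is not Dynatrace code: the function names (`full_upgrade`, `rolling_upgrade`, `has_downtime`) and the node names are hypothetical, and the model only tracks which nodes are running at each step.

```python
# Illustrative model only; names and structure are assumptions, not Dynatrace internals.

def full_upgrade(nodes):
    """Full upgrade: shut down all nodes, upgrade one by one, start one by one."""
    timeline = []
    running = set()  # step 1: every node is shut down
    timeline.append(("shut down all nodes", sorted(running)))
    for node in nodes:  # step 2: upgrade binaries while the cluster is down
        timeline.append((f"upgrade {node}", sorted(running)))
    for node in nodes:  # step 3: start nodes one by one
        running.add(node)
        timeline.append((f"start {node}", sorted(running)))
    return timeline


def rolling_upgrade(nodes):
    """Rolling upgrade: only one node is out of service at any time."""
    timeline = []
    running = set(nodes)
    for node in nodes:
        running.discard(node)  # take a single node out
        timeline.append((f"upgrade {node}", sorted(running)))
        running.add(node)  # bring it back before touching the next one
        timeline.append((f"start {node}", sorted(running)))
    return timeline


def has_downtime(timeline):
    """True if at some step no node at all was running."""
    return any(len(running) == 0 for _, running in timeline)


if __name__ == "__main__":
    cluster = ["node-1", "node-2", "node-3"]
    for step, running in rolling_upgrade(cluster):
        print(f"{step:<16} running: {running}")
    print("full upgrade has downtime:   ", has_downtime(full_upgrade(cluster)))
    print("rolling upgrade has downtime:", has_downtime(rolling_upgrade(cluster)))
```

In this model, a rolling upgrade of a single-node cluster still reports downtime, since the only node has to stop while it is upgraded.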

Regarding recovery, it works as follows:

1. Whenever possible, we try to roll back to the previous version.

2. If that is not possible, or an unexpected crash happens (worst-case scenario), a restore from backup is possible.

3. Even if the upgrade fails, data is kept.

4. The Mission Control team proactively inspects all upgrade processes and acts on each failure to bring the cluster up as soon as possible.
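
The order of the recovery options above can be written down as a small decision sketch. This is only an illustration of the priority described in this post: the `Recovery` enum, the `choose_recovery` function, and the two boolean inputs are hypothetical and do not correspond to any Dynatrace API.

```python
from enum import Enum
from typing import Optional


class Recovery(Enum):
    ROLL_BACK = "roll back to the previous version"
    RESTORE_FROM_BACKUP = "restore from backup"


def choose_recovery(rollback_possible: bool, backup_available: bool) -> Optional[Recovery]:
    """Pick a recovery path in the order listed above (hypothetical inputs)."""
    if rollback_possible:
        # Option 1: preferred whenever it is possible.
        return Recovery.ROLL_BACK
    if backup_available:
        # Option 2: worst case, e.g. an unexpected crash.
        return Recovery.RESTORE_FROM_BACKUP
    # Neither automated path applies; in this sketch that means manual intervention.
    return None


if __name__ == "__main__":
    print(choose_recovery(rollback_possible=True, backup_available=True))
    print(choose_recovery(rollback_possible=False, backup_available=True))
```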

Can I answer any other questions?