Has anyone tested the high availability (HA) feature with DCRUM May 2017 SP3?
The documentation states that the HA deployment has been tested only with DCRUM May 2017 Service Pack 1.
Hello @Travis B.
Thank you for confirming that HA works with SP3.
Please let me know which issues you faced and how you handled them.
Please also share the HA design you adopted, especially for database HA and data replication. The documented design focuses on application HA and does not cover the databases, except for the Console, which replicates with the primary database.
I look forward to as much information as you can share so that we can start work on HA at the earliest.
Hey Babar. So we have three HA environments here at Optum: Dev, Stage, and Prod.
I'll stick to Stage since it's just a small version of our Prod environment setup.
We have 1 Farm, with 1 Cluster containing 4 Primary/Failover pairs.
Each CAS has its own database. I believe (and Krys can correct me if I'm wrong) that the Primary Node directs each pair to process the exact same data, so each of their databases contains the exact same data. If the main node becomes unavailable, the failover node steps up and provides the same data. I do not believe there is any specific 'database replication' going on, just that each CAS pair is always processing the same data on its respective primary and failover nodes.
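To make the pair behaviour described above concrete, here is a minimal sketch of the idea as I understand it. It is only a toy model, not how the CAS is implemented: the class names, node names, and the in-memory list standing in for a database are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CasNode:
    """Hypothetical stand-in for one CAS with its own private database."""
    name: str
    available: bool = True
    database: list = field(default_factory=list)

    def process(self, sample):
        # Each node stores what it processes in its own database.
        self.database.append(sample)


@dataclass
class HaPair:
    primary: CasNode
    failover: CasNode

    def ingest(self, sample):
        # Both nodes process the exact same data, so the two databases stay
        # identical without any database-level replication.
        # Simplification: both nodes are assumed to be up while ingesting.
        for node in (self.primary, self.failover):
            node.process(sample)

    def serving_node(self):
        # The failover node steps up only when the primary is unavailable.
        if self.primary.available:
            return self.primary
        if self.failover.available:
            return self.failover
        raise RuntimeError("both nodes of the pair are unavailable")


pair = HaPair(CasNode("cas1"), CasNode("cas1ha"))
for sample in ("interval-1", "interval-2"):
    pair.ingest(sample)

# Simulate losing the primary: the failover serves the same data.
pair.primary.available = False
print(pair.serving_node().name, pair.serving_node().database)
```

The point of the sketch is that nothing copies data from one database to the other. The two databases match only because both nodes were fed the same input.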
We have also set up our F5 load balancer so that the URL dcrum-stage.optum.com always points to the active master node: casstg1, or casstg1ha if casstg1 goes down. Customers only have to hit this URL, and the processing is automatically distributed among the primary/failover node pairs.
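The load balancer rule boils down to "send traffic to the first healthy master in priority order". The snippet below is a rough sketch of that selection logic only. It is not F5 configuration, and the function name and the simulated health table are assumptions made for illustration.

```python
def active_master(candidates, is_healthy):
    """Return the first healthy host in priority order, or None."""
    for host in candidates:
        if is_healthy(host):
            return host
    return None


# Priority order: the primary master first, then its HA partner.
MASTERS = ["casstg1", "casstg1ha"]

# Simulated health states instead of real load balancer monitors.
health = {"casstg1": True, "casstg1ha": True}
print(active_master(MASTERS, health.get))  # casstg1

health["casstg1"] = False
print(active_master(MASTERS, health.get))  # casstg1ha
```

Because the candidates are checked in order, traffic returns to casstg1 as soon as it reports healthy again.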
Our issues so far are dictionary inconsistency errors (which are fixed automatically in 17.x, so they do not cause outages) and cases where a cluster node falls behind and the entire farm then falls behind waiting on it. We are actively working with Krys and team on these issues.
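A simplified way to picture the "one node holds up the farm" problem: if the farm can only report up to the interval that every node has finished, its progress is the minimum across nodes. The sketch below assumes exactly that model. The node names and the minute values are made up.

```python
def farm_progress(last_processed):
    """The farm is only as far along as its slowest node."""
    return min(last_processed.values())


def lagging_nodes(last_processed, max_lag):
    """Nodes that trail the most advanced node by more than max_lag."""
    newest = max(last_processed.values())
    return sorted(
        node for node, done in last_processed.items() if newest - done > max_lag
    )


# Last fully processed interval per node, in minutes since midnight (made up).
progress = {"cas1": 600, "cas2": 600, "cas3": 570, "cas4": 595}

print(farm_progress(progress))              # 570
print(lagging_nodes(progress, max_lag=10))  # ['cas3']
```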
Hello @Travis B.
First of all, I apologize for the late reply. Thank you very much for sharing this information with me.
I spoke to our DBA about the Availability/Failover Zone and shared the deployment scenario from the Dynatrace documentation.
He said that, because of the customer's defined policies, we must have a DB HA/AG (High Availability/Availability Group), so the DBA team will provide 2 x VIPs for the CAS/ADS applications to connect to the databases.
What is your opinion about this?
Will this work as expected, or will we have to request exceptions from the customer because of limitations in the applications?
Is it mandatory for the OS and SQL Server versions to be the same across the Farm/Cluster, or can we have different OS/SQL Server versions?
I look forward to your reply, kind as always.
Hi Babar, I've been super busy too, so I haven't had time to check this.
The DB HA/AG sounds like it would work, but I'm not sure how the technology behind it works, so I can't say with 100% certainty. If the 2 VIPs they mention are behind a DB HA/AG GTM, you could point your CAS/ADS at the GTM, and in the event of a catastrophe it would automatically use the available DB in the HA/AG. I would make sure that the DB HA/AG is always 100% in sync. I've had major problems when a CAS switches to a new DB that doesn't have all the same information as the one it had been using.
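As a rough illustration of the "always in sync" check, the sketch below refuses to switch when the target database is behind the one currently in use. Everything here is hypothetical: the function name, the dictionary layout, and the idea of comparing a last-commit timestamp. Your DBA will know which real metric the HA/AG exposes for this.

```python
def safe_to_switch(current_db, target_db, max_lag_seconds=0):
    """Allow a switch only if the target is not behind the current DB.

    Both arguments are plain dicts with a 'last_commit' epoch timestamp.
    This is a stand-in for whatever sync metric the DB platform reports.
    """
    lag = current_db["last_commit"] - target_db["last_commit"]
    return lag <= max_lag_seconds


current = {"name": "db-a", "last_commit": 1_700_000_600}
target = {"name": "db-b", "last_commit": 1_700_000_300}

print(safe_to_switch(current, target))  # False: target is 300 seconds behind

target["last_commit"] = current["last_commit"]
print(safe_to_switch(current, target))  # True: both are at the same point
```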
I think you can have different versions of OS/SQL Server. Our 2017 instances all have the same version of OS and SQL Server currently, but our old 12.4 deployment has a mix of Windows versions and SQL Server versions. I would verify this with Krys first. 🙂
We have finished our SP4 deployment and are running stable on Version: 2017 May SP4, Build: 126.96.36.199. This is how our prod environment is set up:
Optum Prod Farm - 2 Clusters
High Speed AMD Cluster: 5 CAS (including Farm Master), 5 CASHA
Classic AMD Cluster: 8 CAS, 8 CASHA
Accessible via an internal VIP that goes to the primary farm master.
Primary farm master pulls all aggregate data from all other CAS/HA pairs for reporting.
We are currently in the process of migrating from our Classic AMDs to new High Speed AMDs.
We currently have 35 Classic AMDs and 3 High Speed AMDs up and running, with 15 more High Speed AMDs being built. We process ~60-90 TB of traffic a day with a ~2-5 minute processing delay. Over 1000 applications are being monitored.
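To put the daily volume in perspective, the quick calculation below converts TB per day into an average line rate. It assumes decimal terabytes (10^12 bytes) and an even spread over 24 hours, so real peaks will be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_TB = 10**12  # assumption: decimal terabytes


def average_gbit_per_second(tb_per_day):
    """Average throughput in Gbit/s for a given daily volume in TB."""
    bits_per_day = tb_per_day * BYTES_PER_TB * 8
    return bits_per_day / SECONDS_PER_DAY / 10**9


for tb in (60, 90):
    print(f"{tb} TB/day is about {average_gbit_per_second(tb):.2f} Gbit/s on average")
```

That works out to roughly 5.56 Gbit/s at 60 TB/day and 8.33 Gbit/s at 90 TB/day, sustained around the clock.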
We have seen instances where the HA Pair takes over processing when the main node gets behind, and then the main node takes back over afterwards. Overall it is running very smoothly.
Hello @Travis B.
Thank you for sharing the details of such a huge DCRUM setup. The amount of data processed per day is enormous 🙂
We are in the process of migrating to Version: 2017 May SP4 Build: 188.8.131.52 and will then start work on HA.
If I get stuck somewhere or need more help setting up a proper HA cluster, I will bother you once again.