
Best Practices for Setting Up DCRUM 12.4

genesius_jarom1
Organizer

Hi,

We have a DCRUM 12.3 environment.

  • 1 AMD in each of our two DCs
  • 16 applications monitored
  • 100 servers
  • 70 software services (another 50+ not in use, but configured in RUM Console)
  • Large site configuration; routinely exceeding the 10M operations limit (~16M)
  • Data retention for 10 days
  • Leaf-Spine topology

We are building a 12.4 environment and will migrate particular servers, apps, software services, etc. After the 12.4 environment is live, we will be increasing the number of applications to well over 50, spread across more than 30 locations, and doubling the number of servers. Besides the AMDs located in the DCs, we will be deploying 10 AMD Express units (5 permanent in strategic locations, and 5 floaters to support troubleshooting). We would like to retain data for 6 months to a year (if possible).

Question.

How are very large organizations configured?
I don't see how increasing the number of applications and software services would not increase the number of operations, and we already exceed the 10M limit every day. Would someone mind providing a network diagram (sanitized) of their large site (hundreds of apps, users, locations, etc.)? With Leaf-Spine, how are you monitoring each tier?

As a side question.

We have NetworkVantage probes and save 10 months' worth of detailed data (little or no aggregation that I am aware of). Is it possible to archive DCRUM 12.4 data to another device, and import it back into CAS/ADS later for capacity planning or other tasks?

If you need any other information, please advise.

Thanks and God bless,

8 REPLIES

genesius_jarom1
Organizer

Hello.

Would someone please advise?

Thanks and God bless,

Genesius

genesius_jarom1
Organizer

Is there ANYONE who is using a Spine/Leaf network architecture for DCRUM? Have you had any issues?

We are having major problems, and the techs at Dynatrace have not been able to resolve them.

For those who don't know Spine/Leaf, think of all 3 tiers of data coming into your AMD through ONE connection. Because of this we are no longer able to see many of our web servers.
This is a major problem that might cause us to stop using the product.

Thanks and God bless,

Genesius

Did you look into using network taps at the leaf and spine? These taps can aggregate to a packet broker, and the packet broker can then load balance the traffic across multiple AMDs. You can check with Ixia or Gigamon.

By the way, are you able to share what the issue is that Dynatrace cannot resolve? I am interested to know 🙂

@Chuan Sern W.

Apologies. I forgot to include the Gigamon GV-HC2 that is in the mix. When this was a 3-tier network, the AMD had a cable connected from the switch in each tier. Now the HC2 is capturing 3 SPANned VLANs from the spine switches and forwarding them to the AMD on a single cable.

Here's the issue.

When this was a 3-tier network, DCRUM would display the web and app servers in the Web Tier. With spine-leaf topology, DCRUM only displays the app servers. In 3-tier, when I ran tcpdump (either the rcon or RH version), traffic between the client PCs and the web servers was captured. In spine-leaf, the traffic tcpdump captures is between the client PCs and the NLB/NAT address (DCRUM 12.3 AMD) or between the client PCs and a main web server, aka portal server (DCRUM 12.4).

Later this week I will connect a Fluke directly to the SPAN (in front of the HC2) and to the connection from the HC2 to the AMDs (12.3 and 12.4), and analyze these captures.
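One rough way to compare the two capture points is to list the server endpoints seen in each and diff them. The sketch below is only an illustration, not a DCRUM or Gigamon tool: it assumes each capture has been exported as plain IPv4 text in the usual `tcpdump -nn` line format, and the function names and the default web ports (80, 443) are assumptions to adjust for the real environment.

```python
import re
from collections import Counter

# Matches the IPv4 endpoint pair in a `tcpdump -nn` text line, e.g.
# "12:00:01.000001 IP 10.1.1.5.51000 > 10.2.2.10.443: Flags [S], ..."
LINE_RE = re.compile(
    r"\bIP (\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+):"
)


def server_endpoints(lines, server_ports=(80, 443)):
    """Count packets per (server_ip, port).

    Assumption: whichever side of the packet uses one of `server_ports`
    is treated as the server. Lines that do not match are skipped.
    """
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        src_ip, src_port = m.group(1), int(m.group(2))
        dst_ip, dst_port = m.group(3), int(m.group(4))
        if dst_port in server_ports:
            counts[(dst_ip, dst_port)] += 1
        elif src_port in server_ports:
            counts[(src_ip, src_port)] += 1
    return counts


def missing_after_broker(span_lines, amd_lines, server_ports=(80, 443)):
    """Server endpoints present on the SPAN side but absent on the AMD side."""
    before = set(server_endpoints(span_lines, server_ports))
    after = set(server_endpoints(amd_lines, server_ports))
    return sorted(before - after)
```

If web server addresses show up in front of the HC2 but not behind it, the filtering or mapping on the HC2 is the first suspect; if they are missing in both captures, the SPAN itself is not carrying that traffic.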

I want to know if anyone has had issues when changing from 3-tier to spine-leaf.

Let me know if there is any other information you need.

Thanks and God bless,

Genesius

Oh, you are using Gigamon and SPAN. Based on my experience, SPAN is unreliable, and as for Gigamon, erhmm... you can get Ixia to do a PoC. They will advise you and run a comparison PoC to convince you why you should choose them. :-)

matthew_eisengr
Inactive

Genesius,

Lots of clients are using that type of architecture.

Chuan is quite right that using taps that feed a packet broker will give you flexibility around shaping/filtering larger amounts of traffic. You could also then use the packet broker to load balance across multiple ports heading to your AMD or AMDs, so the traffic isn't all running over one connection.

If you are running short on capacity on the CAS side, you could look at adding additional CASs and clustering them together. This will balance the workload, and you will be able to access all data from a single point (Primary CAS).
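To illustrate the load-balancing idea, here is a minimal sketch of session-aware distribution. It is not how Gigamon or Ixia implement it; the function names and the choice of SHA-256 are assumptions for illustration. The point it shows is that the hash key should be direction-independent, so the request and the response of one session land on the same output port, and therefore on the same AMD.

```python
import hashlib


def flow_key(src_ip, src_port, dst_ip, dst_port, proto="tcp"):
    """Build a direction-independent key for a session.

    Sorting the two endpoints means A->B and B->A produce the same key.
    """
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return f"{proto}|{a[0]}:{a[1]}|{b[0]}:{b[1]}"


def pick_output_port(src_ip, src_port, dst_ip, dst_port, n_ports, proto="tcp"):
    """Map a packet to one of `n_ports` outputs based on its session key."""
    key = flow_key(src_ip, src_port, dst_ip, dst_port, proto)
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_ports


# Both directions of the same (hypothetical) session map to the same output.
request_port = pick_output_port("10.1.1.5", 51000, "10.2.2.10", 443, n_ports=4)
response_port = pick_output_port("10.2.2.10", 443, "10.1.1.5", 51000, n_ports=4)
```

A plain per-packet round robin would split the two directions of a session across probes, which is why packet brokers generally offer a session-based mode for this kind of deployment.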

Hope that helps.

chris_v
Dynatrace Pro

@Genesius J.

For information, DCRUM can scale to much greater capacity than your environment.

For example, I look after one customer who has numerous feeds in the DC being sent to a packet broker that then load balances them out to 6 AMDs (to deal with the load). They'll soon be upgrading to multiple 40Gbps links, and more AMDs to process all that data. We're talking 350+ software services, thousands of servers, and ~30Gbps of monitored traffic, on a 12.3 system.

For the ADS, if it has sufficient hardware (RAM and database space/performance) you can increase the operation limit to avoid the error message.

NetworkVantage data, in my experience, isn't handled well by the CAS: it doesn't have all the required data available, so the CAS can't optimise it. To upgrade to 12.4+ you will have to replace the NV Probes with AMDs, as support for NV is long gone and isn't available in 12.4.

You will also find a significant increase in capacity by simply upgrading to 12.4.5+; the labs have done great work in optimising processing paths to handle more load on the same hardware.

This forum isn't the best place to be talking architecture. I suggest getting in contact with your Dynatrace rep and seeing about getting some services to help design the best way forward for your particular environment.

ulf_thornander3
Inactive

I concur with the others here - @Genesius J. - if you don't see something and you are, as you say, using SPAN, then most likely the SPAN is incorrect. This happens all the time, as control of the SPAN is usually something that we simple APM'ers don't have!

Network teams change VLANs and switches like others change socks 🙂

Often the SPAN is forgotten in the process and ends up showing something other than what it was intended for.

This is also one of the compelling reasons to have a physical/virtual TAP in place of the SPAN, as a TAP doesn't care about logic, only physics. A TAP will also give you the "TRUE" packets, including ones that might be broken or invalid, while a SPAN doesn't.