I'd like to raise a general question about deployment best practices.
I know the knowledge is spread across the SE/ES field, so it might finally be a good idea to bring it together.
Customers and Partners want to build this right from the start, rather than learn from mistakes as they go along.
Key questions from my side are:
- Setting up agent mappings and naming schemes?
- Automating administration processes, and what else is worth automating?
- How to use system profiles?
- Other do's & don'ts?
Currently I have the following topics on my best-practices radar.
I'd kindly ask you to add comments and other thoughts (or, best of all, lessons learnt from your Customers).
Automation & naming schemes:
1) Naming schemes largely depend on the specific app; there is no generic answer or scheme
- PCI DSS considerations, e.g. restrictions on captured values for payment components
- Performance considerations & managed config (e.g. level of captured detail triggered by an alert)
- Finally, keep agent groups defined per logical tier
- What else?
2) With 100+ agents most customers decide to invest into automating agent rollout.
- We have extensive documentation on that, keyword is unattended installation:
- What else?
3) Based on the numbers (500+ agents), 1 server instance should be enough, but it must be a powerful box: 24 physical cores (hyper-threaded cores not counted), Intel preferred, 2.8 GHz or higher.
4) Collector boxes and database box considerations.
- Do a proper sizing.
- Open a support ticket and attach a draft sizing proposal and a description of the setup, so Support can simply review it and avoid information-gathering round trips.
Some DO'S from a Support perspective:
5) Keep the number of system profiles as small as possible; this increases server throughput
6) Keep the length of PurePaths and the number of captured nodes small; this is good advice here since the system is at 250k MPS already and may grow or spike during peaks
- many issues we encounter are due to capturing too much data; the defaults do not suit everyone
7) Avoid wildcard captures (e.g. in servlet sensor properties)
8) For DotNet agents, max. agents per collector instance is 50, so you need 12+ instances (this could be 2 to n collector boxes depending on how powerful they are)
9) Collector needs fast storage, such as SSD
10) Database needs 6-8 cores, 8 GB RAM, fast storage for log/data, optimized for writing; Oracle/SQL Server preferred (though others are supported)
11) In virtual environments (VMware), please double-check whether the performance of the box is really as high as expected (= equivalent to physical) or whether cycles are being stolen
12) Also, VMware virtual network adapters sometimes have issues with handling the type and amount of traffic dynaTrace generates (I can tell you more if we have such an environment)
13) When using UEM, consider lowering the visit timeout from 30 minutes to 10 minutes since this frees up server memory
14) Use the latest version and fixpacks
15) Double check versions of monitored applications (compare with release notes) to see if they are supported
16) Avoid regular expressions in BTs if a simple string match can do the same
17) Avoid unsupported community plugins in production systems, and always use a dedicated collector for any monitoring plugins
18) When creating incident rules, set a suppression period to avoid floods of identical incidents (lesson learned from BNY recently)
19) Watch out for ticking time bombs: measure explosion (growing number of measures over time), class cache explosion (growing class cache over time)
20) After the initial run, I'd exclude static content (in the case of web apps)
21) I'd recommend using SQL aggregation (in combination with different configurations, triggered by particular alerts)
- this helps to reduce the # of nodes, which in PROD should be approx. 50-100
22) Accessors for method return values might be heavy
23) Same regarding session attributes
24) Webserver overhead (CPU & MEM) - check the UEM injection rules, zipped content, etc.
- there are nice PPT slides available (please contact Roman) that explain this topic in detail.
25) Do not connect an agent to a collector that is not in the same LAN (latency should be as low as possible, ideally < 1ms)
26) If a firewall is in between, it may interfere (e.g. blocking transfer of executable code over the network)
27) Do not use a database box that is shared with other applications; it must be dedicated to DT
28) Don’t put session storage on NFS or SMB
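To make the collector arithmetic from item 8 easier to reuse in a sizing draft, here is a minimal sketch in Python. The function name and the example agent counts are my own assumptions for illustration, not dynaTrace tooling; the only figure taken from the list is the limit of 50 DotNet agents per collector instance. It gives a lower bound on instances only and says nothing about how many boxes you need.

```python
# Limit quoted in item 8 for DotNet agents; other agent types may differ.
MAX_DOTNET_AGENTS_PER_COLLECTOR = 50


def collector_instances_needed(agent_count,
                               max_per_collector=MAX_DOTNET_AGENTS_PER_COLLECTOR):
    """Return the minimum number of collector instances for agent_count agents.

    Hypothetical helper: plain ceiling division, no headroom for growth.
    """
    if agent_count < 0:
        raise ValueError("agent_count must not be negative")
    if max_per_collector <= 0:
        raise ValueError("max_per_collector must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (agent_count + max_per_collector - 1) // max_per_collector


if __name__ == "__main__":
    # Example agent counts are assumptions, not customer data.
    for agents in (500, 501, 600):
        print(agents, "agents ->", collector_instances_needed(agents), "collector instances")
```

With exactly 500 agents this gives 10 instances; the "12+" in item 8 corresponds to roughly 600 agents, so plug in your real count and add headroom for growth.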
That's a great idea - thanks for sharing this. I also suggest you post this on the internal dynatrace forum and link back to this public forum post. That way, more ES/SE resources become aware of your effort to collect this type of best-practice data.