Before we moved to Dynatrace SaaS, we were using tools such as IBM OMNIbus, which had hundreds of rules to maintain. Each rule carried an Impact and Urgency value that mapped to a Priority Matrix, as ITIL calls out. This was extremely complicated to maintain and required a lot of administration. Because we have numerous monitoring tools, it also meant that each monitoring tool needed to be able to send a value for Impact and Urgency when showing or sending an event in order for an incident to be manually created by our operations team. This of course is not ideal, as you are reproducing work in multiple tools rather than sending event payloads to a single point.
I am curious how others who use Dynatrace are dealing with Impact and Urgency.
From what I know, it feels to me that Dynatrace should pass a payload from a "Problem", possibly with a CI for the CMDB, to the ITSM tool, which in turn would have an Event Manager (EM), and the EM would look up the CI to determine the Impact, Urgency, Owner, etc. Am I thinking about this correctly?
I am attempting to determine where that responsibility should sit. Should that information be within the CMDB that the ITSM tool uses or should it be the responsibility of the Dynatrace configuration to send an Impact, Urgency, Owner, and so on?
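As a rough illustration of the CMDB-side option, here is a minimal Python sketch of an Event Manager that receives only a CI name and a title from the monitoring tool and derives Impact, Urgency, Priority, and Owner itself. Everything in it is hypothetical: the `CMDB` dictionary, the CI names, the 3x3 matrix values, and the `enrich_event` function are assumptions made for illustration, not part of Dynatrace, EasyVista, or xMatters.

```python
from dataclasses import dataclass

# Hypothetical CMDB extract: CI name -> attributes maintained on the ITSM side.
CMDB = {
    "payments-db-01": {"impact": 1, "urgency": 1, "owner": "dba-oncall"},
    "intranet-web-02": {"impact": 3, "urgency": 2, "owner": "web-team"},
}

# Assumed ITIL-style priority matrix: (impact, urgency) -> priority, 1 = highest.
PRIORITY_MATRIX = {
    (1, 1): 1, (1, 2): 2, (1, 3): 3,
    (2, 1): 2, (2, 2): 3, (2, 3): 4,
    (3, 1): 3, (3, 2): 4, (3, 3): 5,
}

# Fallback used when the CI is not found in the CMDB (an assumption).
DEFAULT_CI = {"impact": 2, "urgency": 2, "owner": "operations"}


@dataclass
class Incident:
    ci: str
    title: str
    impact: int
    urgency: int
    priority: int
    owner: str


def enrich_event(event: dict) -> Incident:
    """Look up the event's CI and derive priority from the matrix."""
    ci_name = event.get("ci", "unknown")
    ci = CMDB.get(ci_name, DEFAULT_CI)
    priority = PRIORITY_MATRIX[(ci["impact"], ci["urgency"])]
    return Incident(
        ci=ci_name,
        title=event.get("title", ""),
        impact=ci["impact"],
        urgency=ci["urgency"],
        priority=priority,
        owner=ci["owner"],
    )
```

With this split, the monitoring tool only needs to send something like `{"ci": "payments-db-01", "title": "Response time degradation"}`, and the Impact and Urgency rules live in one place instead of in every monitoring tool.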
Currently, our notifications are handled by our ITSM tool, which is EasyVista. This in turn is integrated with xMatters.
Very curious to hear how most are dealing with this. Thanks!
Hey @Larry R., I know this is an older post but I wanted to share the structure we have in place.
First and foremost, we treat all Dynatrace alerts as critical, actionable alerts. Previously we had a monitoring platform that would send out alerts at low, medium, and high severity levels. What we found was that staff were ignoring the low and medium alerts, as they were not actionable.
As we moved to Dynatrace, we decided that we needed to cut out these low and medium alerts: Dynatrace should alert only on issues that are critical and need action. For example, on the old platform we used to have disk space alerts at 25% left = low alert, 10% left = medium alert, and 5% left = high alert. Management decided that 15% free space remaining is the lowest amount before it is actionable. 15% allows staff enough time to get online and grow the drives as needed.
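To show the rule itself, here is a small Python sketch of the single-threshold idea. It is not how Dynatrace evaluates disk alerts internally. The function names are hypothetical, and treating exactly 15% free as actionable is an assumption.

```python
# Single actionable threshold from the post: 15% free space remaining.
ACTIONABLE_FREE_PCT = 15.0


def free_percent(free_bytes: int, total_bytes: int) -> float:
    """Return free space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def should_page(free_bytes: int, total_bytes: int,
                threshold_pct: float = ACTIONABLE_FREE_PCT) -> bool:
    """Page on-call only once free space is at or below the threshold.

    There are no low/medium tiers: either the disk needs action or it does not.
    """
    return free_percent(free_bytes, total_bytes) <= threshold_pct
```

For a quick local check, the standard library's `shutil.disk_usage("/")` returns `total` and `free` fields that can be passed straight into `should_page`.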
In essence we cut out the "Crying Wolf".
We also made strides to simplify life: all alerts that get paged out are also sent via a webhook to our help desk system, where a ticket is created, filled out from the payload, and put in a queue. That way staff don't have to sit around and make a ticket for each alert.
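The webhook-to-ticket step might look roughly like the Python sketch below. It assumes the Dynatrace custom webhook payload was configured as JSON with keys named after the `{ProblemID}`, `{ProblemTitle}`, `{State}`, `{ImpactedEntity}`, and `{ProblemURL}` placeholders. The ticket fields, the `oncall` queue name, and the `ticket_from_payload` function are hypothetical and stand in for whatever your help desk system expects.

```python
import json


def ticket_from_payload(raw_body: str) -> dict:
    """Map a webhook body to a help desk ticket (hypothetical field names)."""
    payload = json.loads(raw_body)
    state = payload.get("State", "OPEN")
    title = payload.get("ProblemTitle", "Dynatrace problem")
    entity = payload.get("ImpactedEntity", "unknown entity")
    return {
        # Keep the problem ID so a later RESOLVED notification can find the ticket.
        "external_id": payload.get("ProblemID", ""),
        "summary": f"{title} on {entity}",
        "link": payload.get("ProblemURL", ""),
        "queue": "oncall",  # assumed queue name
        "status": "resolved" if state == "RESOLVED" else "new",
    }
```

Returning a plain dictionary keeps the mapping separate from the HTTP handling, so the same function can sit behind whichever web framework or integration layer receives the webhook.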
We also set up Dynatrace mobile app alerting and put the app on the on-call phone. Now staff get alerted via the app and can comment on the problem tile, which allows other staff and management to review those comments. Take low disk space as an example. John, who is on call, gets alerted of low disk space and comments, "Jumping online to grow drive". Now management can look at that and see that John is taking care of the issue. Once he is done, John notes that the drive has been grown and the alert gets closed. With the webhook integration, John also gets a ticket created, which he can close out when he gets into the office, and he can link that ticket number to his overtime report.
I think it is important to look at your old methods and see what can be cut out, what is annoying, and how you can make things better. I always ask people "What would you like to see?" and then work to provide solutions to those statements.