We're pleased to share a second challenge where you can also win a new and unique badge!
At Dynatrace, we continuously develop a product that can save you time and let your teams focus on innovation. We believe that Dynatrace alerting capabilities provide plenty of options to avoid delays in your customers' digital journey, but we're curious to hear about your personal experience.
Tell us about the alerts that give you the most precise solutions and improve your work. Maybe you have an “Alert saved the day” story? Or just your favorite, most time-saving alert configuration?
Share your expertise with your fellow Community members and let's learn from each other.
Everyone who shares their alert story or screenshot will get a new limited-edition badge and +100 bonus points, which can help you reach a higher rank in the Dynatrace Community.
Entries close on May 11th and on that day we will reward all participants with the new badge.
Can’t wait to see your favorite alerts!
There are alerts and alerts... Most of the alerts that occur in Dynatrace happen when there is a problem, and that's not normally good news. What's better, though, is that Dynatrace alerts normally give us an excellent MTTI (Mean Time To Identify), which in turn gives us an even better MTTR (Mean Time To Repair), since root-cause detection is usually available.
Given this, the types of alerts I like most are the ones that tell me bad things might be about to happen but haven't happened yet, like a disk getting almost full. That alert has probably saved the day, simply because the problem never happened! It's also at the top of my wish list for Dynatrace: give me an indication of what might happen in the future!
Over the years I have shared multiple use cases and methods with Dynatrace. One that we didn't share centered on 3rd party requests and a rogue firewall change.
Early in our Dynatrace journey we were alerted to an increase in failures to our 3rd-party vendor. Dynatrace alerted us to this issue because it deviated from our baseline. But keep in mind that we were also very new to Dynatrace. We immediately started looking into the failure rate increase alert and found that at 8am failures went from 0% to 100%. Dynatrace provided us with the URL, the port number the request was coming from, and its 3rd-party destination.
We immediately raised this issue to senior staff, who called out to Ecommerce to confirm or deny our findings. Senior staff asked Ecommerce to initiate a "synthetic" sale. Shortly thereafter, Ecommerce stated that their synthetic sale had completed without issue.
At this point we were all thinking that Dynatrace might have given us a false positive. It wasn't until AppDev came running, stating that they were also seeing failures... 30 minutes after Dynatrace detected the first failure. Working together with the Network Team, we discovered that another employee had made a rogue firewall change. Corrective action was taken immediately, and at 9am we saw failures drop to 0%.
There was still a question, though: "Why did the Ecommerce test pass without issue?" As it turned out, the system was designed to never show users any errors, but rather give them a "simulated response". This event sparked discussions and corrective action to ensure similar future scenarios would produce true test data.
Ultimately, Dynatrace did its job. We had a major issue that lasted 1 hour, but if we had trusted Dynatrace more and our tests had come back with true results, we could have cut that downtime in half! Because of this issue, not only was corrective action taken in the Ecommerce synthetics, but it also showed staff that they can trust Dynatrace and helped foster a growing relationship between AppDev and the Dynatrace team.
There were a few great alerts, but as a partner they are always harder to share when it's not "your" environment.
Some of my favorite alerts, though, were during PoCs where the customer already had a vague idea of what was causing problems but wasn't able to verify or prove it, e.g. to a third-party vendor.
Dynatrace picks up the problem and identifies the root cause automatically --> a great use case for the presentation.
Of course I can't imagine working without Dynatrace! One of the most common and most impactful issues we were hit by in one of our internal services is the classic "N+1 JPA" issue. It is related to retrieving data from a database: a SQL query is executed to fetch N records, and each of the N records needs an additional query to fetch some related records... and so on. As a result, one operation triggers hundreds or thousands of SQL queries... which of course is slow and impacts the end user. Dynatrace alerts on "Response time degradation" quickly and shows this at hand: it reduces MTTR (as @AntonioSousa mentions) to the max!
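For anyone who hasn't run into it, here is a minimal sketch of what the N+1 pattern looks like, using plain Python with an in-memory sqlite3 database (the schema and names are purely illustrative, not from any real service):

```python
import sqlite3

# Illustrative demo schema: orders and their line items.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY);
CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
""")
conn.executemany("INSERT INTO orders (id) VALUES (?)", [(i,) for i in range(100)])
conn.executemany("INSERT INTO items (order_id, sku) VALUES (?, ?)",
                 [(i, f"SKU-{i}") for i in range(100)])

def n_plus_one(conn):
    """1 query for the orders, then 1 extra query per order: 101 round trips."""
    queries = 0
    orders = conn.execute("SELECT id FROM orders").fetchall()
    queries += 1
    result = {}
    for (order_id,) in orders:
        rows = conn.execute(
            "SELECT sku FROM items WHERE order_id = ?", (order_id,)).fetchall()
        queries += 1  # this is the hidden per-record query JPA lazy loading issues
        result[order_id] = [r[0] for r in rows]
    return result, queries

def single_join(conn):
    """The same data in one query, as a JOIN FETCH would do it in JPA."""
    result = {}
    rows = conn.execute(
        "SELECT o.id, i.sku FROM orders o JOIN items i ON i.order_id = o.id").fetchall()
    for order_id, sku in rows:
        result.setdefault(order_id, []).append(sku)
    return result, 1

data_a, q_a = n_plus_one(conn)
data_b, q_b = single_join(conn)
print(q_a, q_b)  # 101 1 -> same result, 100x fewer round trips
```

With real network latency per round trip, that 101-vs-1 difference is exactly the response time degradation Dynatrace flags.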
The other one is small and quick, yet powerful: who remembers or cares every day whether the disk space on one of dozens or hundreds of hosts is sufficient? Dynatrace! A quick alert reminder and the problem is solved:
The alerts we use the most are custom alerts. They help us identify possible failures with greater sensitivity in the applications and services we are monitoring. For certain applications, we like to know much earlier that an error is just beginning to show. This way, we have much more time to respond and the impact on end customers is smaller. (For 100% critical applications.)
Additionally, our team works with IIS, where it is very important that an app pool crash is identified quickly.
>The Process group availability monitoring alert
>>if any process becomes unavailable
This alert is very useful.
I truly enjoy the variety of use cases shared in this challenge 🙂 It's interesting to see how we utilize the same functionality depending on our needs, teams we work in, business goals, etc.
Thanks for sharing!
Really useful ones are the alerts we receive from our HTTP monitors.
We have a situation where the application relies heavily on information from 3rd-party integrations to work. By creating some multi-request HTTP monitors, we simulate those integrations and notify the team about issues ahead of time.
These alerts have changed the way that the operations team works and reduced the MTTD and MTTR.
One of my favourite alerts recently was a case of calls to an external service, where the customer got charged by the volume of successful requests. Dynatrace alerted on the calling service because in 60% of cases it returned an HTTP 400 to the client due to a wrong payload.
However, the call to the external service was still being made (and "successful") and therefore charged. By fixing the wrong payload handling and returning immediately instead of performing the 3rd-party service call, the cost was reduced significantly.
Without the service dependency and traces this would have gone unnoticed or the cost impact would not have been discovered.
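The shape of that fix is simple to sketch. This is a hypothetical Python illustration (the handler, validation rule, and counter are all made up, not the customer's actual code): validate the payload and fail fast with the 400 *before* making the billable external call.

```python
external_calls = 0  # stands in for the per-request charge of the paid 3rd-party API

def call_external_service(payload):
    """Stub for the billable external service call."""
    global external_calls
    external_calls += 1
    return {"status": "ok"}

def handle_request(payload):
    # Fail fast: reject the bad payload without touching the billable service.
    # Before the fix, this check happened only *after* call_external_service().
    if "account_id" not in payload:  # illustrative validation rule
        return {"http_status": 400, "error": "invalid payload"}
    call_external_service(payload)
    return {"http_status": 200}

bad = handle_request({})                 # rejected, no charge incurred
good = handle_request({"account_id": 42})
print(bad["http_status"], good["http_status"], external_calls)  # 400 200 1
```

With 60% of traffic hitting the early return instead of the paid call, the billing drop follows directly.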
My favorite alerts are those where, in a PoC, a client discovers a problem they didn't know they had and gets an alert. Lots on the list: I/O, garbage collector, memory saturation. Of course I always have a little help from Davis 😜 I love seeing the expressions on their faces when they finally say that Davis is not just a pretty face.
My "Alert saved the day" story would be when Dynatrace alerted us about an outage of a third-party service responsible for processing customers' applications. Capturing specific log messages was the only indication that the third-party vendor was having issues.
“There were no other indications that this third-party service was having a problem apart from the errors in the log files. That is what made the log analysis valuable in this case. We configured a custom log event to look for the known error that can occur when the third-party service is not functioning properly.
Dynatrace alerted us to the problem when the error messages started appearing in the logs. We know that when this specific error appears, we need to contact the third party to investigate.
Without Dynatrace Log Analytics catching these errors, the problem would have continued until our back-office teams noticed, which could have taken days or weeks. We contacted the third party, and they confirmed an issue on their side, which was quickly resolved, saving us time and money.”
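Conceptually, a custom log event like the one described above is a pattern match on incoming log lines plus an occurrence threshold. A tiny Python sketch of that idea (the error message, function, and threshold are invented for illustration, not the actual Dynatrace configuration):

```python
import re

# Made-up "known error" pattern for the 3rd-party outage.
KNOWN_ERROR = re.compile(r"VendorGateway: (connection refused|timeout)")

def scan_log_lines(lines, threshold=3):
    """Mimic a custom log event: count matches of the known-error pattern
    and raise an alert once the occurrence threshold is reached."""
    hits = [line for line in lines if KNOWN_ERROR.search(line)]
    return {"matches": len(hits), "alert": len(hits) >= threshold}

log = [
    "2024-05-02 08:00:01 INFO  application submitted id=123",
    "2024-05-02 08:00:05 ERROR VendorGateway: timeout while processing id=124",
    "2024-05-02 08:00:09 ERROR VendorGateway: connection refused id=125",
    "2024-05-02 08:00:12 ERROR VendorGateway: connection refused id=126",
]
print(scan_log_lines(log))  # {'matches': 3, 'alert': True}
```

The value, as the story shows, is that the match fires the moment the vendor starts failing, instead of days later when a back-office team notices.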
Many thanks to you all for the wonderful alerting stories. It was a real pleasure to read about so many different experiences. By the end of the week, each of you will receive a new, special badge on your profile. Looking forward to your participation in our next challenge 🙂