We had a production issue with a host, but no alert was raised because Dynatrace classifies this kind of problem as a Frequent Issue.
Detection of frequent issues is quite an interesting topic, and it has tricked me a few times in the past. It is especially tricky when a problem appears in the "middle" of the problem. It seems confusing, and sometimes it is. I would recommend reading the following (several times):
In the end, it's an interesting solution to alert spam, but you have to understand it...
I found this thread while searching for "Frequent Issues" because we're having a problem with how DT alerts us. An issue occurs in the middle of the night; it's not critical, so we sit on it until morning. But by morning (2 hours later) DT has already auto-closed the problem as a Frequent Issue. Then during business hours the issue goes from bad to worse, but we no longer get alerts because it flipped to a Frequent Issue in the middle of the night. Other situations have baffled us as well. I feel the logic of "Frequent Issues" is just not working in a useful way.
If we turn it off, alerts go off everywhere and we are back to configuring the environment down to very minute details, only to run into cases where DT doesn't allow the flexibility we need. If we turn it on, we miss alerts for critical production events, again because of a lack of detailed configuration: Frequent Issue detection is either on or off for everything, with no option to adjust it per entity.
I read the article linked above. It feels outdated because I see no mention of the 2-hour mark at which a problem flips to Frequent Issue. Either way, the current logic is causing issues.
I agree with you on this. The logic for Frequent Issues is a bit weird, and more than once we have missed actual issues because of it.
Since you raised this topic, I actually have a question: what happens if I go to the Anomaly detection settings and turn "Detect Frequent Issues" off?
Will it raise a problem? Will it not? I'm not actually sure how this setting works.
I'll provide updates once I get more details. The online documentation does a good job of explaining this very complicated equation, but it doesn't go into enough detail to explain the behaviors we've experienced, so the outcome is difficult to predict.
One example I completely don't get is the situation where a problem that has been open for 2 hours automatically flips to Frequent Issue. That's not what frequent means: something happening once for a long period of time is not frequent. Sure, if you break that down into data points the argument might change, but the issue at hand isn't data points. It's that a single problem (1 occurrence) became marked as frequent.
A few months back we encountered similar issues, so we turned off Frequent Issues, and within minutes our environment blew up with problem cards. We had a rough night, and by morning we decided to just turn it back on, but it took about 20% of the 7-day window (roughly a day and a half) before it actually worked again.
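For anyone wondering how long that is in practice, the 20%-of-7-days figure above can be worked out with a minimal sketch. It assumes the figure means a fraction of a 7-day observation window; the names `WINDOW`, `FRACTION`, and `warmup_time` are made up for illustration and are not Dynatrace settings or API calls.

```python
from datetime import timedelta

# Assumption: "20% (of 7 days)" means 20% of a 7-day observation window.
WINDOW = timedelta(days=7)
FRACTION = 0.20


def warmup_time(window: timedelta = WINDOW, fraction: float = FRACTION) -> timedelta:
    """Return the portion of the window that has to elapse."""
    return window * fraction


if __name__ == "__main__":
    elapsed = warmup_time()
    # Express the result in hours for easier planning.
    print(f"{elapsed.total_seconds() / 3600:.1f} hours")
```

So after re-enabling the feature, plan for well over a day before it behaves as it did before.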
The reason it blew up, though, was that we hadn't configured the safety nets (configurations) in advance. This time around we're configuring the environment first, and only then turning it off.
Interesting to know!
In our situation, some problems that were marked as Frequent Issues just escalated and caused an impact. When investigating why there was no alert, we saw that it was due to this Frequent Issue feature, which masked the problem.
Waiting to hear if you have any news about that.