Is there any way to influence how long problems stay open?
I frequently see situations where a few requests fail over a period of a few minutes, but the problem itself stays open for almost 20 minutes after the last failed request.
As a result, in some environments we get alerts constantly, and the alerts are becoming quite useless because there are so many "false positives". I have tried setting a 15-minute delay under problem notifications, but since the problem seems to stay open for 20 minutes almost every time, the delay does not help; I would need to raise the limit to over 30 minutes.
Is this the default way the problem engine works, or do we have some issue on our Managed cluster that could cause this kind of delay?
When using automatic baselining (anomaly detection), failures are detected in sliding windows (5 min and 15 min). These windows cannot be changed. For details, consult the docs here: https://www.dynatrace.com/support/help/how-to-use-dynatrace/problem-detection-and-analysis/problem-d...
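To illustrate why a sliding window keeps a problem open after the failures have stopped, here is a rough toy model in Python. It is only a sketch and not Dynatrace's actual algorithm: the one-minute buckets, the request counts, and the function names are all assumptions made up for this example. It just shows that a trailing window still "sees" a failure burst until the burst has slid out of the window.

```python
from collections import deque


def windowed_failure_rates(samples, window_minutes):
    """Return the failure rate over a trailing window for each minute.

    samples: list of (total_requests, failed_requests), one entry per minute.
    This is a toy model, not the real Dynatrace detection logic.
    """
    window = deque()
    total = failed = 0
    rates = []
    for minute_total, minute_failed in samples:
        window.append((minute_total, minute_failed))
        total += minute_total
        failed += minute_failed
        # Drop the oldest minute once the window is full.
        if len(window) > window_minutes:
            old_total, old_failed = window.popleft()
            total -= old_total
            failed -= old_failed
        rates.append(failed / total if total else 0.0)
    return rates


def minutes_elevated_after_last_failure(samples, window_minutes):
    """Count minutes the windowed rate stays above zero after the last failing minute."""
    rates = windowed_failure_rates(samples, window_minutes)
    last_failure = max(m for m, (_, f) in enumerate(samples) if f > 0)
    last_elevated = max(m for m, r in enumerate(rates) if r > 0)
    return last_elevated - last_failure


# Hypothetical traffic: 10 requests per minute for an hour,
# with 5 failures per minute during minutes 10, 11 and 12 only.
samples = [(10, 5 if 10 <= m <= 12 else 0) for m in range(60)]

for window in (5, 15):
    extra = minutes_elevated_after_last_failure(samples, window)
    print(f"{window}-minute window: elevated for {extra} more minutes")
```

In this toy setup the 15-minute window keeps reporting a non-zero failure rate for 14 more minutes after the last failing minute, and the 5-minute window for 4 more. Any additional time before a real problem closes would come on top of that, which is in line with the roughly 20 minutes described in the question.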
Before changing or fine-tuning the anomaly detection, I would strongly suggest examining the detected errors to check whether each one really is a failed request. If it is not a failure of the service, first configure error detection for the service to ignore those errors so that the requests are not marked as failed.
If the requests really are failures, then you will have to fine-tune the detection of failures in the anomaly detection configuration of your service.
I'm currently going through all the problems in my environments case by case. The issues themselves are real problems, and they surface things that would really need some kind of solution, but the reality is that not all of them are going to be fixed.
But we need to somehow reduce the noise coming from the alerts while still keeping the ability to react when a bigger problem is building up.
We are trying to figure out how to get fewer alerts in cases where a single user had issues for a short period of time, or where some REST interface had problems for a 5-minute period that affected only a few requests at a quiet time of day.
And because every problem seems to stay open for 20 to 30 minutes, it is a bit complicated to filter out the noise and still get a notification when something bigger is building up.