Re: Davis AI not correlating between application HTTP error rate and database connect failures

kalle_lahtinen · ‎08 Apr 2021

Hi,

Today we had an issue with database connectivity, which resulted in 2 alerts coming in simultaneously:

1. Application HTTP error rate

2. Failed database connects

This caused a bit of alert spam, because the issue went away every now and then (as certain user actions accessing this database were not being executed), which again resulted in double alerts, because the two issues were not combined into one problem.

From the database PurePaths for the request "failed-connect", I can jump back to both the Service request and the Application user action. So I wonder why Dynatrace couldn't group this problem into one, when it's so obvious to human eyes?

One possible reason is that the Service in the middle did not generate a problem, even though its failure rate was around 60 %. That is because the request volume was below the threshold of 10 requests / minute. Could it be that if the middle part of the path "User action -> Request -> DB statement" does not generate a problem, Davis is unable to piece together that the user action issue and the DB issue are part of the same problem?
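
To illustrate the suspected behaviour, here is a minimal Python sketch of a low-load gate. It is not Dynatrace's actual logic: the function name `should_alert`, the 10 % failure-rate threshold, and the request volumes of 6 and 50 per minute are assumptions made up for illustration. Only the threshold of 10 requests / minute and the failure rate of around 60 % come from the incident described above.

```python
def should_alert(requests_per_min: float,
                 failure_rate: float,
                 min_requests_per_min: float = 10.0,
                 failure_rate_threshold: float = 0.10) -> bool:
    """Hypothetical low-load gate, not the real Davis implementation."""
    # Below the minimum volume, the service is treated as "low load"
    # and no problem is raised, however high the failure rate is.
    if requests_per_min < min_requests_per_min:
        return False
    return failure_rate >= failure_rate_threshold


# Assumed volumes: 6 req/min (below the gate) versus 50 req/min (above it).
low_load = should_alert(requests_per_min=6, failure_rate=0.60)
normal_load = should_alert(requests_per_min=50, failure_rate=0.60)
print(low_load, normal_load)
```

With the same 60 % failure rate, the low-volume service stays silent, which would leave a gap between the user action alert and the database alert.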

Julius_Loman · ‎08 Apr 2021

I think the problem is that the service (or services) in the middle did not raise any failure rate increase events.
I don't know your situation, but it's likely:

- you have a volatile failure rate, and you need to set up both failure detection and anomaly detection properly for those services
- the volume of failed requests was not that large compared to the overall service requests; in this case, selecting key requests would improve the detection.

Dynatrace Ambassador | Alanata a.s., Slovakia, Dynatrace Master Partner

kalle_lahtinen · ‎08 Apr 2021

Hi,

Thanks for the response, Julius. I don't think there's anything volatile about the failure rate really. It's normally 0 %, but during this incident certain user actions connecting to a certain DB instance were fully failing for an extended period of time, after which everything returned to normal again. As for the volume of failed requests under Services, it was indeed 60 % like I mentioned earlier, so even proportionally I'd say it was high enough.

I suppose the only problem was having the Service anomaly detection setting "To avoid over-alerting do not alert for low load services with less than X requests/min" enabled, which caused the AI to 1) not generate a problem for the affected services and 2) not understand that those user actions were failing due to the database calls. On the other hand, if we didn't have that setting active, we'd be getting tons of alerts at night, when only a handful of failing requests would result in an error. Maybe the best course of action would be to edit the anomaly detection settings of these specific services separately, while leaving the global setting as it is.
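
As a rough sketch of that idea, the Python below layers per-service overrides on top of a global default. This only models the concept and does not use the Dynatrace API: the class `AnomalySettings`, the service name `db-facing-service`, and all threshold values are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AnomalySettings:
    """Hypothetical stand-in for service anomaly detection settings."""
    min_requests_per_min: float
    failure_rate_threshold: float


# Global default: keep the low-load gate so quiet night-time traffic
# does not trigger alerts on every service.
GLOBAL_DEFAULTS = AnomalySettings(min_requests_per_min=10.0,
                                  failure_rate_threshold=0.10)

# Per-service overrides (assumed name and value): lower the gate only
# for the services that sit in front of the database.
SERVICE_OVERRIDES = {
    "db-facing-service": {"min_requests_per_min": 1.0},
}


def effective_settings(service_name: str) -> AnomalySettings:
    """Return the global defaults with any service-specific overrides applied."""
    return replace(GLOBAL_DEFAULTS, **SERVICE_OVERRIDES.get(service_name, {}))
```

Services without an entry in `SERVICE_OVERRIDES` keep the global behaviour, so only the selected services would alert at low volume.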