Solved: Healthcheck requests skewing stats

alianos · ‎19 Oct 2022

We recently realised that the multidimensional analysis stats are pretty wrong and misleading in terms of the real response time of the service. Here is the issue:

Imagine a service whose true average response time is about 500ms. Each server that runs this service has a healthcheck, but the healthcheck responds in 1ms. There is a liveness and a readiness healthcheck, each running every 6s, so with 3 servers that is 2req * 3ser * (60s/6s) = 60 healthcheck req/m. As you can imagine, the healthcheck traffic brings down the average response time and affects all the percentiles as well. This has significant implications when looking at the multidimensional analysis to draw conclusions.
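To make the percentile effect concrete, here is a minimal Python sketch. It assumes every real request takes exactly 500ms and every healthcheck exactly 1ms (a simplification, since real latencies vary), and it uses an illustrative mix of 20 real requests and 60 healthchecks in one minute. The function name is made up for this example.

```python
from statistics import mean, median

REAL_MS = 500  # assumed constant latency of a real request
HC_MS = 1      # assumed constant latency of a healthcheck

def one_minute_sample(real_rpm, hc_rpm):
    """Return the response times recorded in one minute of mixed traffic."""
    return [REAL_MS] * real_rpm + [HC_MS] * hc_rpm

sample = one_minute_sample(real_rpm=20, hc_rpm=60)
print(f"mean   = {mean(sample):.2f} ms")
print(f"median = {median(sample)} ms")
```

With this mix three quarters of the samples are healthchecks, so the median lands on the healthcheck latency rather than anywhere near 500ms.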

--------

Problem number 1:
During the day, the traffic pattern changes.
-Overnight: 20req/m real traffic + 60req/m healthcheck = average response time 125ms
-Start of day: 60req/m real traffic + 60req/m healthcheck = average response time 250ms
-Peak traffic: 600req/m real traffic + 60req/m healthcheck = average response time 450ms

The real response time is fixed at 500ms, but the conclusion drawn from the multidimensional analysis is that the service becomes slower as it gets more and more traffic.
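The averages above are request-weighted means of the two traffic types. A small sketch of that calculation, assuming constant latencies of 500ms and 1ms (the function name and defaults are only for illustration):

```python
def blended_average(real_rpm, hc_rpm, real_ms=500.0, hc_ms=1.0):
    """Request-weighted mean response time of real plus healthcheck traffic."""
    total_time = real_rpm * real_ms + hc_rpm * hc_ms
    return total_time / (real_rpm + hc_rpm)

scenarios = [("Overnight", 20), ("Start of day", 60), ("Peak traffic", 600)]
for label, real_rpm in scenarios:
    print(f"{label}: {blended_average(real_rpm, hc_rpm=60):.1f} ms")
```

The exact results sit slightly above the rounded figures in the list, because the 1ms healthchecks still contribute a little time.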

---------

Problem number 2:

We increase the number of servers from 3 to 10 to better support the traffic. We now have 200 healthcheck req/m.

-Before: 600req/m real traffic + 60req/m healthcheck = average response time 450ms
-After: 600req/m real traffic + 200req/m healthcheck = average response time 375ms

The conclusion drawn by looking at the multidimensional analysis is that increasing the number of servers improved the performance. In reality it improved nothing; the change is purely an artifact.

-----------

Problem number 3:

The service only dealt with 10% of the full possible load. We now put 50% load on the service.
-Before: 600req/m real traffic + 60req/m healthcheck = average response time 450ms
-After: 3000req/m real traffic + 60req/m healthcheck = average response time 490ms

The conclusion drawn by looking at the multidimensional analysis is that the service is now significantly slower and cannot support the extra load. In reality the response time has remained constant.
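If the healthcheck rate and latency are known, the real-traffic average can be backed out of a blended average by rearranging the weighted mean. This is only a sketch under those assumptions (the function name is hypothetical), and it works for averages only, not for percentiles:

```python
def real_average(blended_ms, real_rpm, hc_rpm, hc_ms=1.0):
    """Recover the real-traffic mean from a blended mean.

    Rearranges: blended = (real_rpm * real + hc_rpm * hc_ms) / (real_rpm + hc_rpm)
    """
    total_time = blended_ms * (real_rpm + hc_rpm)
    return (total_time - hc_rpm * hc_ms) / real_rpm

# Exact blended means for three of the mixes in the scenarios
# (real req/m, healthcheck req/m, blended mean in ms).
cases = [(20, 60, 125.75), (60, 60, 250.5), (600, 200, 375.25)]
for real_rpm, hc_rpm, blended in cases:
    print(f"{real_rpm} real + {hc_rpm} hc: {real_average(blended, real_rpm, hc_rpm):.1f} ms")
```

Each case comes back to the same 500ms, which is the point of the three problems: the service did not change, only the traffic mix did.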

------------

So I hope that I have convinced you of the importance of excluding that call. I confirmed with support that "Muting" the call does not affect the multidimensional analysis, only the alerts. There is a different suggestion to globally exclude the path from deep monitoring, but for big organisations "globally" is a difficult word, so I haven't been able to test that yet.

My question is, how is this not a problem for everyone?

Mizső · ‎19 Oct 2022

Hi @alianos,

I guess in a lot of use cases there are fewer hc requests (1/min) and much more real traffic, sometimes 10k - 50k / min, so the impact of hc requests is minimal.

Your use case is quite special from my point of view. Maybe you should raise a product idea about this problem.

Dynatrace product ideas - Dynatrace Community

Best regards,

Mizső

Dynatrace Community RockStar 2024, Certified Dynatrace Professional

Julius_Loman · ‎19 Oct 2022

@alianos if the health check is a web request, then you can exclude it from capturing. Go to Settings -> Server-side service monitoring -> Deep monitoring.

Dynatrace Ambassador | Alanata a.s., Slovakia, Dynatrace Master Partner

Mizső · ‎19 Oct 2022

Hi @alianos,

The truth is on @Julius_Loman's side. Thanks for the correction.

I used this feature some years ago, on only one tenant and in a different format.

So you do not need to raise a product idea if you can use this solution.

Best regards,

Mizső


alianos · ‎19 Oct 2022

Thanks both. This is what I meant when I mentioned:

@alianos wrote:
There is a different suggestion to globally exclude the path from deep monitoring, but for big organisations "globally" is a difficult word, so I haven't been able to test that yet.

Glad to know that this will work and that this is what people generally do. It helps build confidence.

mgome · ‎19 Oct 2022

We encountered two problems with global exclusion of healthcheck and readiness calls. First, for lightly used services these would often be the only calls over long periods of time, so excluding them made it hard to find those services in time picker windows where no other calls occurred. Second, I had a user who insisted on seeing these calls even though I was trying to reduce noise in our cluster and reduce trace creation in environments.

paul_hill · ‎22 Dec 2023

It's a common problem and a catch-22 situation.

If the health checks fail, then having a trace can be useful to determine why. It's also useful to see sporadic health check failures even though liveness/readiness failures are not reported. Most K8s containers will have a threshold of 3 consecutive health check failures before the probe actually fails. So in theory the health check failure rate could be 66% while the probe never fails. That can get really confusing in Dynatrace!
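The 66% figure can be checked with a small simulation. This is a sketch that assumes the probe only trips after 3 consecutive failures and uses a made-up worst-case pattern of fail, fail, pass; the function names are illustrative and not part of any Kubernetes or Dynatrace API.

```python
from itertools import cycle, islice

def failure_rate(results):
    """Fraction of health checks that failed (False means the check failed)."""
    return results.count(False) / len(results)

def probe_trips(results, failure_threshold=3):
    """Return True if `failure_threshold` consecutive checks ever fail."""
    streak = 0
    for ok in results:
        streak = 0 if ok else streak + 1
        if streak >= failure_threshold:
            return True
    return False

# Hypothetical worst case: fail, fail, pass, repeated for 300 checks.
results = list(islice(cycle([False, False, True]), 300))
print(f"failure rate: {failure_rate(results):.1%}")
print(f"probe trips:  {probe_trips(results)}")
```

In this pattern two out of every three checks fail, yet the streak of failures never reaches the threshold, so the probe itself never reports a failure.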

But merging health check metrics with real traffic metrics definitely hides the real performance of the service. To avoid merging, we ensure the name of the health check service is distinct from the real traffic service names.

To avoid alerts on the health check service we switch off failure rate detection for the named health check service.