
## Healthcheck requests skewing stats

Visitor

We recently realised that the multidimensional analysis stats are pretty wrong and misleading in terms of the real response time of the service. Here is the issue:

Imagine a service whose true average response time is about 500ms. Each server that runs this service has a healthcheck, but the healthcheck responds in 1ms. There is a liveness and a readiness healthcheck and they happen every 10s x 3 servers = 2req * 3ser * (60/10s) = 60 healthcheck req/m. As you can imagine, the healthcheck traffic brings down the average response time and affects all the percentiles as well. This has significant implications when you use the multidimensional analysis to draw conclusions.
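To make the skew concrete, here is a minimal sketch that builds one minute of response-time samples and compares the mean and the median. It takes the post's figures as given (500ms real requests, 1ms healthchecks, 60 healthcheck req/m); note that the expression as written, 2 × 3 × 6, evaluates to 36, so 60 is treated here as the author's working number. The function name and the assumption that every request takes exactly its nominal time are illustrative only.

```python
from statistics import mean, median

REAL_MS = 500.0         # true response time of the service (from the post)
HEALTHCHECK_MS = 1.0    # healthcheck response time (from the post)


def one_minute_of_samples(real_per_min, healthcheck_per_min):
    """Response times seen in one minute, assuming every request takes its nominal time."""
    return [REAL_MS] * real_per_min + [HEALTHCHECK_MS] * healthcheck_per_min


# Overnight traffic from the post: 20 real req/m plus 60 healthcheck req/m.
overnight = one_minute_of_samples(real_per_min=20, healthcheck_per_min=60)
print(mean(overnight))    # 125.75
print(median(overnight))  # 1.0
```

With healthchecks making up most of the samples, the median collapses to the healthcheck time, which is why the percentiles are affected as well as the average.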

--------

Problem number 1:
During the day, the traffic pattern changes.
-Overnight: 20req/m real traffic + 60req/m healthcheck = average response time 125ms
-Start of day: 60req/m real traffic + 60req/m healthcheck = average response time 250ms
-Peak traffic: 600req/m real traffic + 60req/m healthcheck = average response time 450ms

The real response time is fixed at 500ms, but the conclusion drawn from the multidimensional analysis is that the service becomes slower as it gets more and more traffic.
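The three figures above are just a traffic-weighted average. A small sketch of that arithmetic, assuming a hypothetical helper `blended_mean_ms` and the post's 60 healthcheck req/m:

```python
def blended_mean_ms(real_per_min, healthcheck_per_min, real_ms=500.0, healthcheck_ms=1.0):
    """Average response time over real and healthcheck requests combined."""
    total_requests = real_per_min + healthcheck_per_min
    total_time = real_per_min * real_ms + healthcheck_per_min * healthcheck_ms
    return total_time / total_requests


HEALTHCHECK_PER_MIN = 60  # figure used in the post

# Real traffic per minute at each time of day, as listed in the post.
traffic = {"overnight": 20, "start of day": 60, "peak": 600}
for period, real_per_min in traffic.items():
    reported = blended_mean_ms(real_per_min, HEALTHCHECK_PER_MIN)
    print(f"{period}: {reported:.1f} ms")
```

The results land close to the rounded 125ms, 250ms and 450ms quoted above, even though `real_ms` never changes.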

---------

Problem number 2:

Increasing the servers from 3 to 10 to better support the traffic. We now have 200 healthcheck req/m.

-Before: 600req/m real traffic + 60req/m healthcheck = average response time 450ms
-After: 600req/m real traffic + 200req/m healthcheck = average response time 375ms

The conclusion drawn by looking at the multidimensional analysis is that increasing the number of servers improved the performance. In reality it improved nothing; it is just an artifact.
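The same effect can be sketched as a function of server count. The rate of 20 healthchecks per server per minute is an assumption inferred from the post's figures (60 req/m for 3 servers, 200 req/m for 10), and `reported_mean_ms` is a hypothetical name:

```python
# Assumption: implied by the post's 60 req/m for 3 servers and 200 req/m for 10.
HEALTHCHECKS_PER_SERVER_PER_MIN = 20


def reported_mean_ms(real_per_min, servers, real_ms=500.0, healthcheck_ms=1.0):
    """Blended average when healthcheck volume grows with the number of servers."""
    healthcheck_per_min = servers * HEALTHCHECKS_PER_SERVER_PER_MIN
    total_time = real_per_min * real_ms + healthcheck_per_min * healthcheck_ms
    return total_time / (real_per_min + healthcheck_per_min)


before = reported_mean_ms(real_per_min=600, servers=3)
after = reported_mean_ms(real_per_min=600, servers=10)
print(f"3 servers: {before:.1f} ms, 10 servers: {after:.1f} ms")
```

Only `servers` changes between the two calls, so the apparent improvement comes entirely from the extra healthcheck requests.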

-----------

Problem number 3:

The service only dealt with 10% of the full possible load. We now put 50% load on the service.
-Before: 600req/m real traffic + 60req/m healthcheck = average response time 450ms
-After: 3000req/m real traffic + 60req/m healthcheck = average response time 490ms

The conclusion drawn by looking at the multidimensional analysis is that the service is now significantly slower and cannot support the extra load. In reality the response time has remained constant.
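If the healthcheck volume and response time are known, the blended figure can be inverted to recover the real average. This is only a sketch under that assumption; both function names are hypothetical, and it is not something the monitoring tool does for you:

```python
def blended_mean_ms(real_per_min, healthcheck_per_min, real_ms=500.0, healthcheck_ms=1.0):
    """Average response time over real and healthcheck requests combined."""
    total_time = real_per_min * real_ms + healthcheck_per_min * healthcheck_ms
    return total_time / (real_per_min + healthcheck_per_min)


def real_mean_ms(reported_ms, real_per_min, healthcheck_per_min, healthcheck_ms=1.0):
    """Remove the healthcheck contribution from a blended average."""
    total_time = reported_ms * (real_per_min + healthcheck_per_min)
    return (total_time - healthcheck_per_min * healthcheck_ms) / real_per_min


# Before and after the load increase described in the post.
for real_per_min in (600, 3000):
    reported = blended_mean_ms(real_per_min, 60)
    recovered = real_mean_ms(reported, real_per_min, 60)
    print(f"{real_per_min} req/m: reported {reported:.1f} ms, real {recovered:.1f} ms")
```

Both load levels recover roughly the same real average, which matches the point that the response time stayed constant while the reported number moved.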

------------

So I hope that I have convinced you of the importance of excluding that call. I confirmed with support that "Muting" the call does not affect the multidimensional analysis, only the alerts. There is a different suggestion to globally exclude the path from deep monitoring, but for big organisations "globally" is a difficult word, so I haven't been able to test that yet.
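As an offline illustration of what excluding the call achieves, the sketch below filters healthcheck paths out of a list of request records before averaging. The paths, the `/orders` endpoint and the record shape are all hypothetical, and this is not a Dynatrace API; it only shows the effect on the statistic.

```python
from statistics import mean

# Hypothetical healthcheck paths; the post does not name the real ones.
HEALTHCHECK_PATHS = {"/health/live", "/health/ready"}


def mean_excluding_healthchecks(requests):
    """Average response time over (path, response_time_ms) records, skipping healthchecks."""
    times = [ms for path, ms in requests if path not in HEALTHCHECK_PATHS]
    return mean(times) if times else None


# One minute of overnight traffic: 20 real requests and 60 healthchecks.
requests = (
    [("/orders", 500.0)] * 20
    + [("/health/live", 1.0)] * 30
    + [("/health/ready", 1.0)] * 30
)
print(mean(ms for _, ms in requests))         # 125.75
print(mean_excluding_healthchecks(requests))  # 500.0
```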

My question is, how is this not a problem for everyone?

5 REPLIES

Hi @alianos,

I guess that in a lot of use cases there are fewer healthcheck requests (1/min) and much more real traffic, sometimes 10k - 50k req/min, so the impact of the healthcheck requests is minimal.

Your use case is quite special from my point of view. Maybe you should raise a product idea about this problem.

Dynatrace product ideas - Dynatrace Community

Best regards,

Mizső

Certified Dynatrace Professional
DynaMight Guru

@alianos if the health check is a web request, then you can exclude it from capturing. Go to Settings -> Server side service monitoring -> Deep Monitoring.

Certified Dynatrace Master | Alanata a.s., Slovakia, Dynatrace Master Partner

Hi @alianos,

The truth is on @Julius_Loman's side. Thanks for the correction.

I used this feature some years ago, on only one tenant and in a different format.

So you do not need to raise a product idea if you can use this solution.

Best regards,

Mizső

Certified Dynatrace Professional
Visitor

Thanks both. This is what I meant when I mentioned:

@alianos wrote:

There is a different suggestion to globally exclude the path from deep monitoring, but for big organisations "globally" is a difficult word, so I haven't been able to test that yet.

Glad to know that this will work and that this is what people generally do. It helps build confidence.