Re: Server-side service monitoring: failure detection rules do not persist longer than 2h?

Ingrida · ‎21 Dec 2022

Hi dear community,

We have a user on our cluster who defines specific Failure detection parameters for server-side service monitoring and then applies them to a subset of services using Failure detection rules for server-side service monitoring.

and then:

It works fine as long as the services that the rules match are active and receiving traffic.

If a service is idle for more than 2(?) hours, it "loses" the rule: the rule is no longer visible in the settings of this service in the UI (although it is still returned as an effective setting by the settings API), and it does not work (despite being visible in the API).

When the service becomes active again, it takes ~10 min for the rule to reappear in the UI and more than 30 min until the rule actually starts working.
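For anyone who wants to compare what the UI shows with what the settings API reports, here is a minimal sketch of how the effective-values request could be built. The schema ID, the environment URL, and the service ID below are assumptions for illustration only; please check the schema list of your own environment before relying on them.

```python
from urllib.parse import urlencode

# ASSUMPTION: schema ID for failure detection rules. Verify it against the
# schema list of your own environment; it is not confirmed in this thread.
ASSUMED_RULES_SCHEMA = "builtin:failure-detection.environment.rules"


def effective_values_url(base_url: str, service_id: str,
                         schema_id: str = ASSUMED_RULES_SCHEMA) -> str:
    """Build the URL that asks for the effective settings values of one service."""
    query = urlencode({"schemaIds": schema_id, "scope": service_id})
    return f"{base_url.rstrip('/')}/api/v2/settings/effectiveValues?{query}"


def auth_headers(api_token: str) -> dict:
    """Header for token authentication; the token needs permission to read settings."""
    return {"Authorization": f"Api-Token {api_token}"}


if __name__ == "__main__":
    # Hypothetical environment URL and service ID, for illustration only.
    print(effective_values_url("https://example.invalid", "SERVICE-0000000000000001"))
```

Sending a GET request to this URL with the headers above, once while the service is active and once after it has been idle, should make it easier to tell whether only the UI drops the rule or whether the effective settings change as well.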

The rule catches specific business exceptions and qualifies them as failures, so we expect to get a Problem when this happens. We do get one, but only about 30 min after the service has started being used again and generating these exceptions.

We do not have issues with frequently used services.

Does anybody else have similar issues? Is this "working as designed"?

Why does a service "fall out" of a rule if it has not been used for a longer time? Is there a workaround for it? With such a "design" it is not really usable for us 😞

Thanks for any hints or experience sharing!

Ingrida

josef_schiessl · ‎02 Jan 2023

Hello,

Please get in contact with Support so we can investigate this and really look at the rules and why they do not match after the inactivity. IMO it should not happen, but maybe there are some conditions that are not stable.

Cheers,

Josef

Ingrida · ‎02 Jan 2023

Thanks @josef_schiessl for the feedback. I will try to contact Support then and will post the results here.

I.

Ingrida · ‎09 Jan 2023

As promised, an update from DT Support:

"Long story short, the current behaviour you observe is correct. You can see that after two h, in the UI, the information about PG is gone from unused service. Meanwhile, in 258+, if the rules are applied, it will be kept for seven days, not only two h."

This leaves me with two questions:

- I was unable to find any official Dynatrace documentation for this feature. Am I searching badly, or is there none? And it is not just this one feature; numerous others also lack proper official documentation 😞 Where does the community document such things so that others can find them?

- Is this something other people use, and should it then be followed up with an RFE? (Is it possible at all to change this behavior?)

I.