30 Mar 2021 11:35 PM - last edited on 31 Aug 2022 03:30 AM by MaciejNeumann
Which failure rate metric are you usually using on your dashboards for service monitoring?
In our case it's becoming quite an annoyance trying to figure out an actually usable metric, since the results our users get vary far too much depending on the view or metric they are using.
My current understanding is also that none of the failure rate metrics we can use on our custom dashboards would "respect" the exclusion rules we have defined at the service level?
Let's take Failure rate (any errors) as an example. We can see on our dashboard that there has been a short spike in the failure rate.
But when our users try to drill down further to analyze what has happened, they end up in the Multidimensional analysis view, which shows a failure rate of zero.
Support pointed out that the "Failed request count as seen by caller" metric would show the failed calls in the multidimensional view, which it actually does. But this starts to be quite a challenge: every user would always need to know that in this particular case they have to change this or that setting to reveal the calls. So it's more or less a question of which metric to use in our dashboard graphs so that the drilldown is an easy task for our users.
And in this case it was even more problematic when users drilled down to the actual PurePaths, which did not have any exceptions or failed calls... After some digging I found that some of the PurePaths had "internal" exceptions, and these were actually counted as failed calls on our custom dashboard (even though none of the filters picked them up as failed calls).
You reached the right location to see the exceptions/errors.
From this point onwards, you can create calculated service metrics for the required/important exceptions/errors, chart them, and use them for alerting purposes as well.
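As a rough sketch, a calculated service metric for a specific exception can be defined via the Dynatrace configuration API (`/api/config/v1/calculatedMetrics/service`). The field names below are from memory and the metric key, exception class, and condition values are purely illustrative, so verify the exact payload shape against the API docs for your environment:

```json
{
  "tsmMetricKey": "calc:service.important_exception_count",
  "name": "Important exception count",
  "enabled": true,
  "unit": "COUNT",
  "metricDefinition": {
    "metric": "REQUEST_COUNT"
  },
  "conditions": [
    {
      "attribute": "EXCEPTION_CLASS",
      "comparisonInfo": {
        "type": "STRING",
        "comparison": "CONTAINS",
        "value": "ImportantBusinessException",
        "negate": false
      }
    }
  ]
}
```

The idea is that the condition restricts the count to requests carrying the exception you care about, so the resulting `calc:service.…` metric can be charted and used in alerting like any built-in metric.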
I suppose the situation is still so illogical that there is no failure rate metric we could use on our custom dashboards that actually reflects the same values we see on the service level itself?
In my opinion it becomes quite unnecessary work if we first exclude every unnecessary or false-positive event at the service level, and then still need to create our "own" calculated failure rate to fit our needs.
These failure rate metrics cause quite a lot of annoyance and actually eat up a lot of my time, because I need to go through this again and again with teams: why they see rate X in one view, rate Y in another, and rate Z at the service level.
You are right, but at some point you will have to take this approach, as per my experience.
The generic approach is to chart the server-side and client-side errors separately, with dimensions to split by.
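On a custom dashboard, that split can be expressed with built-in metric selectors along these lines. The metric keys here are quoted from memory (server-side failure rate and 4xx counts), so double-check them in your environment's metric browser before relying on them:

```
builtin:service.errors.server.rate:splitBy("dt.entity.service")
builtin:service.errors.fourxx.count:splitBy("dt.entity.service")
```

Splitting by the service entity keeps the per-service breakdown visible on the chart instead of one aggregated line.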
Yep, it seems that I really need to consider this road, since the default failure rate metrics we can use on custom dashboards are more or less problematic. This solution of course has some limitations, since I would need to decide beforehand which required/important exceptions I should include in the calculated metrics.
We also have quite a big environment, so I have already had some challenges: a calculated metric can be tied to only 100 entities, and I have certain exception messages which I need to monitor system-wide on every instance.
Since we don't currently have Log Monitoring purchased, this makes certain definitions quite painful: we would basically need to split the entities into groups of at most 100 entities and create multiple calculated metrics for the same thing.
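The splitting workaround described above can be sketched as below. This is a hypothetical helper, assuming the 100-entity limit per calculated metric: the entity IDs and the `calc:service.…` naming scheme are made up for illustration, and the stubs would still need to be turned into full API payloads:

```python
# Sketch: work around the (assumed) 100-entity limit on a calculated
# service metric by splitting the monitored services into chunks of
# at most 100 and generating one metric-definition stub per chunk.

def chunk_entities(entity_ids, chunk_size=100):
    """Split a list of service entity IDs into chunks of at most chunk_size."""
    return [entity_ids[i:i + chunk_size]
            for i in range(0, len(entity_ids), chunk_size)]

def metric_definitions(entity_ids, base_key="calc:service.my_failure_rate"):
    """Build one calculated-metric config stub per chunk of entities.

    Each chunk gets its own metric key (base_key_1, base_key_2, ...),
    since one calculated metric cannot cover all entities at once.
    """
    return [
        {
            "tsmMetricKey": f"{base_key}_{n}",  # one metric per chunk
            "entities": chunk,                  # at most 100 entities here
        }
        for n, chunk in enumerate(chunk_entities(entity_ids), start=1)
    ]

# Example: 250 services end up as 3 calculated metrics (100 + 100 + 50).
services = [f"SERVICE-{i:04d}" for i in range(250)]
defs = metric_definitions(services)
print([len(d["entities"]) for d in defs])  # [100, 100, 50]
```

The obvious downside, as noted, is maintaining several metrics for the same logical thing, plus re-chunking whenever services are added or removed.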
Have you tried charting the failure rate for server-side errors instead of any errors? That approach at least works for me when doing similar reports.
Hi @kalle_lahtinen 🙂
I would say that in certain cases it works better, but then we would lose the 4xx-series failures, or we would need to chart them as a separate metric.
We also have quite a challenge with the server-side error metric, since the metric we can use on dashboards does not exclude the errors we have defined at the service level.
We make quite wide use of REST interfaces, and HTTP 500 is widely used as a generic error response, so it's quite a complication for us that the metrics we have on dashboards do not respect those exclusion rules.
I can live with this one, but it's quite hard to explain to every user why they see error rate X on their dashboards and error rate zero at the service level, and what kind of tricks they need to apply to actually filter out the requests that are counted as failed in their case.
Ok, gotcha. Your use case is then a bit more complicated; typically for me it's enough to tune the error detection based on exceptions, and the server-side errors graph does at least respect those. Is it so that you've defined some 5xx responses as client-side failures instead of server-side? Or even fully excluded them from error reporting..? Since the error graphs do take into account the exception-specific configs, I would assume they also do that for configs related to the HTTP codes; otherwise it's a bug? Overall, it sounds like Babar's suggestion of splitting the graph into client vs. server side should do the trick.