This is the first topic I have opened, and I have to apologize right away that it starts with a long explanation.
We monitor a webservice using the XML over HTTP analyzer. Because of the special conditions I am not happy with the DCRUM alerts that I am able to define.
It is possible that I am missing some options that could be used to improve the alert quality.
Installed versions: DCRUM (including ADS) 12.2.1
Description of the configuration: a tag in the response contains the status (OK/NOK). I defined it as operation attribute (3) and use a regex so that the value is reported only if the status is NOK (what a letter-saving abbreviation for “not ok”).
That way in CAS the metric “operation attributes (3)” will show the number of NOKs for the monitoring interval.
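The attribute definition described above can be sketched outside of DCRUM roughly as follows. This is only an illustration in Python, not the analyzer's configuration syntax: the tag name `status`, the sample responses and the helper `extract_nok` are assumptions, since the real tag name isn't shown in this post.

```python
import re
from typing import Optional

# Assumption: the status lives in a tag called <status>; the real tag name may differ.
NOK_PATTERN = re.compile(r"<status>\s*(NOK)\s*</status>")


def extract_nok(response_body: str) -> Optional[str]:
    """Return "NOK" if the response reports a failed status, otherwise None."""
    match = NOK_PATTERN.search(response_body)
    return match.group(1) if match else None


# Made-up sample responses for one monitoring interval.
responses = [
    "<response><status>OK</status></response>",
    "<response><status>NOK</status></response>",
    "<response><status> NOK </status></response>",
]

# Only NOK values are reported, so counting the matches gives the NOK count.
nok_count = sum(1 for body in responses if extract_nok(body) is not None)
print(nok_count)
```

Because OK responses never match, the count of reported values per interval is the number of NOKs.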
So far I have seen two different types of NOK responses:
a) Functional “errors”, meaning that plausibility checks fail, e.g. “start date for an insurance can’t be located in the past”. Those return HTTP code 200 and are reported and counted as “operation” in CAS.
b) Technical “errors”, mostly missing user rights to access the backend. Those return HTTP code 5xx and are NOT reported and counted as “operation” in CAS. Nevertheless, the NOK for that kind of error is counted in operation attributes (3), too.
Example 1 - CAS data for one monitoring interval
1) The number of operations in each line doesn’t show the total number of attempts. The operations with an HTTP error are not counted as operations.
This is clearly visible for the 3rd line for transaction ABN-Leben-getQuote. For this transaction there have been 12 attempts that all went wrong with an HTTP error (probably insufficient user rights).
2) The first line for the transaction ABN-Leben-getOffer delivers the following information:
By the way, this is not yet the actual question that made me open this topic, but: is there a way to calculate a metric in a DMI report as the result of an arithmetic operation on two or more other metrics? E.g. is there a way to show the “10” in a column as the result of [“operations” + “HTTP server errors (5xx)”]?
3) The 2nd line for the transaction ABN-Leben-getOrder delivers the following information (summary):
There have been 16 requests in total: 4 were successful, 10 failed with a technical error (HTTP 5xx), and 2 failed with a functional error.
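The way this summary follows from the three CAS counters can be written down as a small calculation. This is a sketch under two assumptions taken from the description above: every HTTP 5xx response also carries a NOK status, and every NOK with HTTP 200 is counted as an operation. The input values (6 operations, 10 HTTP 5xx, 12 NOKs) are back-derived from the summary, not read from the report, and the function name `breakdown` is made up.

```python
def breakdown(operations: int, http_5xx: int, nok_count: int) -> dict:
    """Split one interval's CAS counters into successful, functional and technical."""
    # CAS does not count 5xx responses as operations, so add them back.
    total_attempts = operations + http_5xx
    # Assumption: every 5xx response is also counted as a NOK in attribute (3).
    functional_errors = nok_count - http_5xx
    # Functional errors return HTTP 200 and are therefore part of "operations".
    successful = operations - functional_errors
    return {
        "total_attempts": total_attempts,
        "technical_errors": http_5xx,
        "functional_errors": functional_errors,
        "successful": successful,
    }


# Values back-derived from the getOrder summary (illustrative only).
result = breakdown(operations=6, http_5xx=10, nok_count=12)
print(result)
```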
Example 2 – ADS data for the same monitoring interval (showing all attempts)
1) Still not my actual question, but: how is the ADS able to count the requests that fail with HTTP 5xx as operations, even including an operation time, while the CAS isn’t?
Example 3 – ADS data filtered on transaction ABN-Leben-getOffer (column “operation attributes – messages (3)” was added)
1) The table contains 10 rows that show the getOffer requests in the monitoring interval of Example 1.
2) The six requests at the bottom with NOK and [HTTP errors = 1] show the technical errors. Those aren’t “counted” as operations by the CAS.
3) The first 4 lines with [HTTP errors = 0] are counted as operations by the CAS
Thank you for your patience so far. I thought the explanations might help, because now I will finally describe my actual problem and concern.
The average operation time for the monitoring interval for this transaction, which is compared to a threshold to trigger a DCRUM alert, is calculated from the first 4 lines in Example 3.
For this monitoring interval: (21,4s + 0,514s + 22,2s + 21,2s) / 4 = ~16,3s (see operation time for ABN-Leben-getOffer in Example 1)
The functional errors have a very, very fast operation time, so the higher the number of functional errors, the more unrealistic the average operation time will be.
Unfortunately I haven’t found a way yet to “filter out” the errors from the calculation.
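To make the effect concrete, here is the same calculation as a short Python sketch. It assumes that the 0,514s request is the functional error (the post doesn't say so explicitly), and the cut-off `FAST_ERROR_LIMIT_S` is a made-up value used only to separate it from the real calculations.

```python
from statistics import mean

# Operation times (seconds) of the 4 requests that CAS counts as operations.
operation_times_s = [21.4, 0.514, 22.2, 21.2]

# Hypothetical cut-off: anything faster is treated as a failed plausibility check.
FAST_ERROR_LIMIT_S = 1.0

avg_all = mean(operation_times_s)
avg_without_fast = mean([t for t in operation_times_s if t >= FAST_ERROR_LIMIT_S])

print(f"average over all operations:     {avg_all:.1f}s")
print(f"average without the fast request: {avg_without_fast:.1f}s")
```

With one fast failure among four operations, the average drops from about 21,6s to about 16,3s, which is enough to stay below a threshold that the real calculations would exceed.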
Not in the definition of the transaction, because operation attributes – messages (3) is an ADS metric.
Not in the alert definition, because the calculation is for the complete monitoring interval. That means that if I use something like [operation attributes (3) = 0] as auxiliary metric, the complete monitoring interval is skipped if there is only 1 error, no matter how many successful (but possibly slow) requests there have been.
An additional challenge is that the technical errors (HTTP 5xx) aren’t counted as operations but are counted in the value for “operation attributes (3)” in the CAS. That makes it even harder to identify the functional errors (maybe I will get rid of this in the future; application development will check whether it is possible to write different statuses for functional and technical errors).
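The problem with the strict auxiliary condition can be shown with a toy model. This is not how DCRUM evaluates alerts internally, only a sketch of the logic described above; the function name and the 20s threshold are made up.

```python
def alert_with_strict_filter(avg_time_s: float, nok_count: int,
                             threshold_s: float = 20.0) -> bool:
    """Toy model: alert only if the interval has no NOK at all and is too slow."""
    if nok_count != 0:
        # A single NOK makes the auxiliary condition fail for the whole interval.
        return False
    return avg_time_s > threshold_s


# An interval with slow requests and no errors triggers the alert ...
print(alert_with_strict_filter(avg_time_s=25.0, nok_count=0))
# ... but the same slow interval with just one NOK is skipped completely.
print(alert_with_strict_filter(avg_time_s=25.0, nok_count=1))
```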
The analyzer itself works great. The results that we get and are able to report in DMI are well received.
The alerting is the part I am worried about. At the moment there are a lot of functional errors, and I am afraid it will stay like this because the requests are created by real users of comparison portals … and they probably will keep on making errors.
With this distortion of the average operation time, caused by the very quick operation times of the functional errors, I don’t know how to create meaningful alerts.
That’s it … thanks again for anyone’s patience who made it to this point … any hints or suggestions are welcome.
I actually read your whole post but got lost a bit amongst all the 1)s and a)s.
AFAIK an operation with an error is not used to calculate a response time alert. If you do see that happening, I suggest you open a ticket.
I "think" the reason that you can measure a failed transaction in ADS but not in CAS is that the ADS contains object level details so you can see where in a transaction/page load the error happens, but in a CAS you will only know that it fails.
I also think that you might have slipped a bit with the intervals. In Example 1 the transaction takes place at 8.40, while the list in Example 3 says the transaction takes place at 8.39.59, so it should be accounted for in the previous interval.
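The interval point can be checked with a few lines of Python. The interval length of 5 minutes is an assumption (it isn't stated in this thread), and the date is an arbitrary placeholder because only the times of day are known.

```python
from datetime import datetime, timedelta

INTERVAL_MINUTES = 5  # assumption: length of one monitoring interval


def interval_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its monitoring interval."""
    return ts - timedelta(minutes=ts.minute % INTERVAL_MINUTES,
                          seconds=ts.second,
                          microseconds=ts.microsecond)


day = datetime(2000, 1, 1)  # placeholder date, not taken from the reports
from_list = interval_start(day.replace(hour=8, minute=39, second=59))
from_report = interval_start(day.replace(hour=8, minute=40))

print(from_list.time(), from_report.time())
```

Under that assumption the request at 8.39.59 belongs to the interval starting at 8.35, not to the one starting at 8.40.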
I apologize (again) for the length of the post ... do I have to mention that most people are frightened of my mails, too?
But I thought that people might waste time thinking about hints and suggestions that in the end don't fit my actual problem if I left out that detailed information and only posted a 2-line question.
I will try to summarize the challenge to create meaningful alerts for this special application/service:
If an internal plausibility check of the application fails, the operation time is very, very short compared to the operations that really do a calculation (200ms <-> 20s). The operation itself returns HTTP code 200 in both cases. The information about the failed check is found in the response and is reported as “operation attribute (3)” (content: NOK).
In CAS the count of “operation attribute (3)” can be reported in DMI as a metric.
It is the same in ADS but in addition in ADS the content can be reported in DMI, too, as dimension “operation attribute – messages (3)”.
Of course the average operation time for a monitoring interval that is calculated and can be used in the alert is way too short if there have been many failed checks.
So far my only approach is to build a compound metric as the ratio “operations / operation attributes (3)”. This auxiliary metric is used as a filter, along the lines of “only trigger an alert on the average operation time if there were more than 5 times as many operations as failed checks in this interval”.
I hoped to maybe get new ideas on how to improve the quality of the alerts.
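As a sketch, the filter works like this. It is a toy model in Python, not a DCRUM alert definition; the function name, the 20s threshold and the sample numbers are made up, and only the factor 5 comes from the description above. Keep in mind that in CAS the count of operation attributes (3) also contains the technical errors, so the ratio is only an approximation.

```python
def alert_with_ratio_filter(avg_time_s: float, operations: int, nok_count: int,
                            threshold_s: float = 20.0,
                            min_ratio: float = 5.0) -> bool:
    """Toy model: evaluate the threshold only if failed checks are rare enough."""
    if nok_count > 0 and operations / nok_count <= min_ratio:
        # Too many failed checks: the average is considered unreliable.
        return False
    return avg_time_s > threshold_s


# 12 operations, 2 failed checks: ratio 6, so the slow average is alerted on.
print(alert_with_ratio_filter(avg_time_s=22.0, operations=12, nok_count=2))
# 4 operations, 1 failed check: ratio 4, so the interval is skipped.
print(alert_with_ratio_filter(avg_time_s=22.0, operations=4, nok_count=1))
```

Compared with a strict [operation attributes (3) = 0] condition, a single failed check no longer suppresses the whole interval, but intervals with many failed checks are still ignored.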