I have configured an alert for messaging queues. It was giving false alerts, so I am trying to understand how the alerting works.
We need an alert such that whenever the error queue count reaches 1 it triggers a warning incident, and whenever it reaches 5 it triggers a severe incident. With these conditions I configured the alert with the aggregation set to MAX and the threshold set to severe or warning. I got an incident marked severe with a threshold of 1. But what I learned from the app team is that the count reached 95, yet no incident was triggered with the value 95; when I chart the count on a dashboard it shows 95, which satisfies the condition. Can someone help me understand this?
Firstly, it helps to distinguish between incidents and alerts. Incidents occur when the conditions you configure are violated; alerts are actions configured to occur when an incident starts, ends, etc. This is important because there are configurations (e.g. smart alerting) that will prevent an action such as an email from firing until past occurrences of the incident have been confirmed in the client.
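To make the incident/alert distinction concrete, here is a minimal sketch in plain Python (not Dynatrace internals; the confirmation count of 3 is a hypothetical value chosen for illustration). An incident is the violated-condition state; an alert is an action fired when that state changes. With smart alerting enabled, the action is held back until the violation has been confirmed over several consecutive evaluations.

```python
# Sketch only: an incident is state (condition violated or not); an alert
# is an action fired on a state change. Smart alerting delays the action
# until the violation persists for several consecutive evaluations.

CONFIRMATIONS_NEEDED = 3  # hypothetical smart-alerting confirmation count

def alerts_for(violations, smart_alerting):
    """Return the evaluation indices at which an email action would fire."""
    fired = []
    streak = 0       # consecutive violated evaluations so far
    alerted = False  # action already fired for this violation episode
    for i, violated in enumerate(violations):
        streak = streak + 1 if violated else 0
        if not violated:
            alerted = False  # episode ended; a new one may alert again
        needed = CONFIRMATIONS_NEEDED if smart_alerting else 1
        if streak >= needed and not alerted:
            fired.append(i)
            alerted = True
    return fired

# One violation episode spanning evaluations 1..4:
violations = [False, True, True, True, True, False]
print(alerts_for(violations, smart_alerting=False))  # fires at index 1
print(alerts_for(violations, smart_alerting=True))   # fires at index 3
```

This is why an incident can exist in the dashlet while no email ever arrives: the action was suppressed, not the incident.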
Regarding an alert with both warning and severe thresholds: this requires two incident rules. Notice that in your incident you set the incident severity at the top. You would want one rule with warning severity, with the measure condition set to use the warning threshold and the email action configured. Then you need a separate, similar incident rule with severe severity, with the measure looking at the severe threshold and again an action configured. With that setup it will deliver an email for the warning incident and for the severe incident. As-is, you're telling it to trigger one warning-level incident if either the severe or the warning threshold is violated; if the value then climbs from above warning to above severe, no new incident will occur because the rule is already in violation.
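The two-rule behavior above can be sketched in plain Python (again, a simulation, not the Dynatrace engine; the thresholds 1 and 5 are taken from the original question about the error queue count). Each rule tracks its own violation state, so crossing the severe threshold opens a new incident even while the warning incident is already open:

```python
# Sketch: two independent incident rules, each with its own severity,
# threshold, and violation state. A single combined rule would stay "in
# violation" after the warning fires and never open a severe incident.

WARNING_THRESHOLD = 1  # from the question: warn when count reaches 1
SEVERE_THRESHOLD = 5   # from the question: severe when count reaches 5

class IncidentRule:
    def __init__(self, severity, threshold):
        self.severity = severity
        self.threshold = threshold
        self.in_violation = False

    def evaluate(self, value):
        """Return an event string when the incident opens or closes."""
        if value >= self.threshold and not self.in_violation:
            self.in_violation = True
            return f"{self.severity} incident OPENED (value={value})"
        if value < self.threshold and self.in_violation:
            self.in_violation = False
            return f"{self.severity} incident CLOSED (value={value})"
        return None  # no state change -> no new incident, no alert

rules = [IncidentRule("warning", WARNING_THRESHOLD),
         IncidentRule("severe", SEVERE_THRESHOLD)]

events = []
for count in [0, 2, 95, 3, 0]:  # sample error-queue counts per interval
    for rule in rules:
        event = rule.evaluate(count)
        if event:
            events.append(event)
            print(event)
```

Note that the count of 95 opens the severe incident here even though the warning incident was already open at that point, which is the behavior the single combined rule cannot produce.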
As for no incidents occurring: first check the email actions to make sure that smart alerting is disabled, and you can see whether an incident occurred but simply didn't send an alert by looking at the incidents dashlet.
Hello @James K.
We came across a situation where we need your assistance to understand the scenario below about a triggered incident.
The configured thresholds are in the screenshot below. The incident was triggered with a value of 2253.25 ms, which is a peak value.
The upper warning threshold is 2110.45 ms, which was exceeded (upper warning); the peak value is 2253.25 ms.
The evaluation timeframe is 5 min and the incident aggregation is avg. The violation ended after 5 min 40 s (the screenshot below is the reference).
Now here is the situation: when we open the incident chart there is no reference to the 2253.25 ms value, so it is very hard to justify or correlate the triggered incident with the chart.
I don't know about those peak values you mentioned - I don't see them in the screenshots, and whatever they are, they are probably from a specific point in time, so you might not be able to recreate in the chart the exact view the incident saw.
Regarding the incident, it looks pretty straightforward. At 22:49 the measure shot up to almost 16 seconds, which put that 5-minute window easily into violation of its threshold. That spike was also high enough to keep the window in violation for the minutes afterward, while the incident remained open - when using averages, high spikes like that can do that.
To get a clearer view of what the incident is actually looking at, I would change the chart resolution to 5 minutes with the average aggregation; then you should see something similar (though not identical, since the windows will probably not be aligned perfectly) to what the incident sees. A bar chart is normally clearest to read, and it will show you the 5-minute average for each window. The bars for the minutes the incident was open should be above the threshold.
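The windowed-average behavior described above can be sketched in plain Python (the per-minute values are invented for illustration; only the ~16 s spike and the 2110.45 ms threshold come from the posts above). A single large sample lifts every 5-minute window that contains it above the threshold, which is how a brief spike keeps the incident open for several minutes:

```python
# Sketch: a rolling 5-minute average over per-minute samples. One 16 s
# sample pushes every window containing it far above the threshold, even
# though the surrounding per-minute values are well below it.

THRESHOLD_MS = 2110.45  # upper warning threshold from the screenshot
WINDOW = 5              # evaluation timeframe: 5 one-minute samples

# Hypothetical per-minute response times (ms); 16000 ms stands in for
# the ~16 s spike at 22:49 mentioned above.
samples = [800, 900, 850, 16000, 700, 750, 820, 780, 900, 860]

results = []
for i in range(len(samples) - WINDOW + 1):
    window = samples[i:i + WINDOW]
    avg = sum(window) / WINDOW
    violated = avg > THRESHOLD_MS
    results.append(violated)
    status = "VIOLATION" if violated else "ok"
    print(f"minutes {i}-{i + WINDOW - 1}: avg={avg:.2f} ms -> {status}")
```

Every window containing the spike averages well above 2110.45 ms, and the windows after it drop back below - matching the pattern of an incident that stays open for a few windows and then closes.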
The chart below uses the 5-min average aggregation for the same incident. I cannot see the value 2253.25 ms anywhere in the chart.
Our intention is that the incident recipient should be able to see the exact value they received; we don't want them to think this was a false positive.
I would imagine the number comes from the value it sees right at the moment the incident triggers, which is why it isn't the peak of the entire incident timeframe. I couldn't say why it appears to be below the threshold.
If you're concerned about that, I would open a ticket with support. But personally I wouldn't think much of it - it could be a minor statistical issue, given that there clearly was an incident at that time and all other aspects look correct. I don't pay much attention to that "peak value", since it only reflects the value at the moment the incident is detected.
Thank you for your thoughts.
As I shared earlier, for me (being a technical person) it doesn't matter, but when the incident reaches the business or the respective application team, they inquire and ask for evidence of the authenticity of the violation.
Nowadays we are configuring different severity levels for the alerts, e.g. email, SMS, and call; therefore, we are keen to provide exact information at the first stage and avoid any kind of false positive, to build trust in the alerts/incidents triggered by Dynatrace APM.