topic Percentage-based Thresholds for Site Reliability Guardian in Automations

Percentage-based Thresholds for Site Reliability Guardian

brianrutherford — Fri, 03 Jan 2025 17:53:04 GMT

When we used Keptn for Quality Gates, we were able to set SLIs that were percentage-based: things like "if response times have not risen more than 5% from previous values, pass" or "if failures haven't risen more than 3% or by 100 flat, pass". These are very similar to how we can define anomaly detection today.

We have traffic that significantly varies throughout the day and between different days of the week, so using static thresholds does not work effectively. I cannot determine how to do this with SLOs. I tried using the auto-adaptive feature of the SRG, but it's still using a static threshold as it learns.

This has created some issues. One is that it learned an error rate of 0%, but then the SRG ran, saw a single error in 10,000 requests, and failed, which I would not want (the interface still showed 0%; I had to look at the query to see that it was rounding 0.01%). Another is that it learned Saturation based on a couple of runs later in the day, then failed when run in the morning, when we are busier and the application is consuming more resources.
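
To make the rounding problem concrete, here is a minimal Python sketch of the arithmetic. It is not SRG code: the function name, the one-decimal display precision, and the "strictly greater than the learned value fails" rule are assumptions for illustration only. One error in 10,000 requests is a ratio of 0.0001, which is 0.01%.

```python
def error_rate_percent(errors: int, requests: int) -> float:
    """Return the error rate as a percentage of all requests."""
    if requests == 0:
        return 0.0
    return errors / requests * 100


rate = error_rate_percent(1, 10_000)   # ratio 0.0001, i.e. 0.01 %
displayed = round(rate, 1)             # assumed display precision: one decimal
learned_threshold = 0.0                # threshold learned from error-free runs

# The displayed value looks identical to the threshold, yet the raw value exceeds it.
print(f"actual={rate:.4f}% displayed={displayed}% fails={rate > learned_threshold}")
```

With a learned threshold of exactly 0%, any nonzero error count fails the objective even though the displayed value still reads as 0%.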

Being able to do this as a percentage comparison (how does the application compare to what was running just before the SRG ran) is much more useful in our context. How can I accomplish this?

Re: Percentage-based Thresholds for Site Reliability Guardian

Fin_Ubels — Mon, 06 Jan 2025 01:05:03 GMT

Hey @brianrutherford

Would something like the following be what you're looking for?

// Fetch timeseries for the validation timeframe
timeseries avg(dt.synthetic.browser.availability), avg(dt.synthetic.browser.duration), by:{dt.entity.synthetic_test}
| filter contains(entityName(dt.entity.synthetic_test), "nameFilter")
// Fetch timeseries for the comparison timeframe, in this example it's -7d
| lookup [
    timeseries avg(dt.synthetic.browser.duration), by:{dt.entity.synthetic_test}, shift:-7d
    | filter contains(entityName(dt.entity.synthetic_test), "nameFilter")
  ], sourceField:dt.entity.synthetic_test, lookupField:dt.entity.synthetic_test
// Find the percentage change in the average of the timeseries, relative to the value from 7 days ago
| fields `7d duration change` = ((arrayAvg(`avg(dt.synthetic.browser.duration)`) - arrayAvg(`lookup.avg(dt.synthetic.browser.duration)`)) / arrayAvg(`lookup.avg(dt.synthetic.browser.duration)`)) * 100
// If there is no data when the comparison happens it returns null, this line ensures it returns 0 instead of null
| fieldsAdd `7d duration change` = if(isNotNull(`7d duration change`), `7d duration change`, else:0)
// Return only the comparison field as that is all we need
| fields `7d duration change`

Re: Percentage-based Thresholds for Site Reliability Guardian

JohannesBraeuer — Tue, 07 Jan 2025 06:32:56 GMT

Hello @Fin_Ubels ,

very interesting approach. Thanks for sharing it.

Am I right that the accepted percentage (7d duration change) is then defined as a static threshold?
E.g.:
Warning if result: 3%
Fails if result: 5%

Re: Percentage-based Thresholds for Site Reliability Guardian

Fin_Ubels — Tue, 14 Jan 2025 03:43:09 GMT

Hey @JohannesBraeuer

If I understand the original post correctly then yes, you'd then define a static threshold. That static threshold would be the percentage change over time that is unacceptable. In my above DQL that would be over 7 days but it could be over any timeframe.
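
As a rough illustration of how a percentage change is then judged against static thresholds, here is a minimal Python sketch. The function names, the sample durations, and the "greater than or equal" comparison are assumptions; the 3% and 5% values are the ones from the example in the question above.

```python
def percent_change(current: float, baseline: float) -> float:
    """Percentage change of current relative to baseline (0 if there is no baseline)."""
    if not baseline:
        return 0.0
    return (current - baseline) / baseline * 100


def evaluate(change: float, warn_at: float = 3.0, fail_at: float = 5.0) -> str:
    """Classify a percentage change against static warning and failure thresholds."""
    if change >= fail_at:
        return "fail"
    if change >= warn_at:
        return "warning"
    return "pass"


# Hypothetical average durations in ms: this run vs. the same window 7 days ago.
print(evaluate(percent_change(1040.0, 1000.0)))  # 4 % slower than the baseline
```

The threshold itself stays fixed, but because it applies to a relative change, the absolute value it allows moves with whatever the comparison timeframe looked like.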

The downside to the above approach is that if it only gets 2% worse every time, but does so consistently, then that performance degradation could fly under the radar and compound. So alongside the above approach I would also recommend having a static threshold on the underlying timeseries, without a timeframe comparison, so that there is a hard limit.
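
A small Python sketch of that compounding effect, under made-up numbers: a starting duration of 1000 ms, a 3% relative threshold, and a hypothetical hard limit of 1200 ms. The `gate` function is illustrative only and not part of any Dynatrace API.

```python
def gate(current: float, previous: float,
         max_change_pct: float = 3.0, hard_limit: float = 1200.0) -> str:
    """Pass only if both the relative change and the absolute value are acceptable."""
    change = (current - previous) / previous * 100
    if change > max_change_pct:
        return "fail: relative change"
    if current > hard_limit:
        return "fail: hard limit"
    return "pass"


duration = 1000.0          # hypothetical starting average duration in ms
results = []
for week in range(1, 13):
    previous, duration = duration, duration * 1.02   # 2 % worse every week
    results.append((week, round(duration, 1), gate(duration, previous)))

for row in results:
    print(row)
```

In this run the relative check never fires, because each step is only about 2% worse than the one before it. The hard limit is what catches the drift once the duration has crept past 1200 ms.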