04 Jan 2023 05:16 PM - edited 04 Jan 2023 05:21 PM
I have two questions regarding the creation and alerting of SLOs within Dynatrace.
Context:
Dynatrace auto-detects services deployed in our environment, and we use environment/auto-generated tags to classify workload types and performance requirements.
Creation of SLOs on every service (e.g. Service Errors)
When creating an SLO, it seems we need to specify a "service filter" to tie it to a specific service; otherwise the SLO is applied without considering the ServiceName dimension. We'd rather apply the same SLO to every service matching specific tags, along the lines of the Prometheus example from the SRE Workbook (https://sre.google/workbook/alerting-on-slos/):
record: job:slo_errors_per_request:ratio_rate10m
expr: sum(rate(slo_errors[10m])) by (job) / sum(rate(slo_requests[10m])) by (job)
Would it be possible to achieve the same in Dynatrace?
Note: due to the large number of microservices, we are not keen on imperatively/programmatically creating an SLO for each new service deployed. We'd rather have a generic set of SLOs, calculated per serviceName, applied to all services matching specific tags.
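For reference, this is roughly the shape we'd want, expressed as a single metric expression split by service (just a sketch on our side; the tier:1 tag stands in for whatever tag set we'd actually match on, and we have not validated the syntax):

(builtin:service.errors.server.successCount:filter(
  in("dt.entity.service",entitySelector("type(~"SERVICE~"),tag(~"tier:1~")"))
):splitBy("dt.entity.service"))
/
(builtin:service.requestCount.server:filter(
  in("dt.entity.service",entitySelector("type(~"SERVICE~"),tag(~"tier:1~")"))
):splitBy("dt.entity.service"))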
Alerting
When creating an alert on an SLO, a metric event is created with a static threshold, using an alerting window of 60 minutes.
Would it be possible to create long/short (multiwindow) alerting windows, as explained in the multiwindow, multi-burn-rate section of the SRE Workbook chapter linked above?
Thanks
10 Jan 2023 07:05 AM
Hi Federico. I saw you also discussed this topic through the Dynatrace In-Product Chat. Just curious to learn whether you have tried the burn-rate custom alert, which should fire when the burn rate is violated per service. Also happy to jump on a call with you to discuss SLOs in general, as it is a topic I have been advocating for in the past.
10 Jan 2023 09:53 AM
Hi Andreas, not yet, but it's on my to-do list. I'll update this thread once I have some data to share. Thanks for following up!
23 Jan 2023 03:12 PM
Hi, unfortunately I have been notified by support that it is not possible to create custom metrics via an expression, either via the UI or the API. I was planning to create a custom metric with the following expression to calculate the burn rate for every service (similar to what the SLO functionality does for you in the background, but split per service via splitBy):
((100) - ((100)*(
  builtin:service.errors.server.successCount:filter(
    in("dt.entity.service",entitySelector("type(~"SERVICE~")"))
  ):splitBy("dt.entity.service")) /
  (builtin:service.requestCount.server:filter(
    in("dt.entity.service",entitySelector("type(~"SERVICE~")"))
  ):splitBy("dt.entity.service")))) / ((100) - (99.98))
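To make the numbers concrete: with a 99.98% target the error budget is 0.02 percentage points, so a service running at 99.7% success (a 0.3% error rate) would yield a burn rate of 0.3 / 0.02 = 15, i.e. it is consuming its error budget 15 times faster than the target allows.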
23 Jan 2023 10:22 PM
Upon further conversation, it's not possible to create a custom `func:` metric using a metric_selector; however, it is possible to put the metric selector directly into a custom metric event. Hence, we are going to trial the following (provisioned via Terraform):
resource "dynatrace_metric_events" "SLO_Burn_Rate" {
enabled = true
event_entity_dimension_key = "dt.entity.service"
summary = "SLO Burn Rate"
event_template {
description = "The {metricname} value was {alert_condition} normal behavior."
davis_merge = true
event_type = "CUSTOM_ALERT"
title = "{dims:dt.entity.service.name} Burn Rate"
}
model_properties {
type = "STATIC_THRESHOLD"
alert_condition = "ABOVE"
alert_on_no_data = false
dealerting_samples = 5
samples = 5
threshold = 14
violating_samples = 3
}
query_definition {
type = "METRIC_SELECTOR"
metric_key = ""
metric_selector = <<-EOT
((100) - ((100)*(
builtin:service.errors.server.successCount:filter(
in("dt.entity.service",entitySelector("type(~"SERVICE~"),tag(~"tier:1~")"))
):splitBy("dt.entity.service")) /
(builtin:service.requestCount.server:filter(
in("dt.entity.service",entitySelector("type(~"SERVICE~"),tag(~"tier:1~")"))
):splitBy("dt.entity.service")))) / ((100) - (99.98))
EOT
}
}
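This single event covers the fast-burn case. To approximate the long/short window idea from my original question, we are also considering pairing it with a slower, more conservative event. A sketch only: the threshold of 6 and the longer 30-sample window loosely follow the SRE Workbook's multi-burn-rate guidance and are not yet validated on our side.

resource "dynatrace_metric_events" "SLO_Burn_Rate_Slow" {
  enabled                    = true
  event_entity_dimension_key = "dt.entity.service"
  summary                    = "SLO Burn Rate (slow burn)"
  event_template {
    description = "The {metricname} value was {alert_condition} normal behavior."
    davis_merge = true
    event_type  = "CUSTOM_ALERT"
    title       = "{dims:dt.entity.service.name} Slow Burn Rate"
  }
  model_properties {
    type               = "STATIC_THRESHOLD"
    alert_condition    = "ABOVE"
    alert_on_no_data   = false
    dealerting_samples = 30
    samples            = 30 # longer window to catch slow, steady budget burn
    threshold          = 6  # lower burn-rate factor for the slow window
    violating_samples  = 20
  }
  query_definition {
    type       = "METRIC_SELECTOR"
    metric_key = ""
    # same per-service burn-rate selector as the fast-burn event above
    metric_selector = <<-EOT
      ((100) - ((100)*(
        builtin:service.errors.server.successCount:filter(
          in("dt.entity.service",entitySelector("type(~"SERVICE~"),tag(~"tier:1~")"))
        ):splitBy("dt.entity.service")) /
        (builtin:service.requestCount.server:filter(
          in("dt.entity.service",entitySelector("type(~"SERVICE~"),tag(~"tier:1~")"))
        ):splitBy("dt.entity.service")))) / ((100) - (99.98))
    EOT
  }
}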
25 Jan 2023 10:20 AM
Thanks for the update.
Is there a reason why you are creating an "error budget" metric vs. an "SLO metric"? What I mean is: instead of creating a metric that shows you, e.g., a 95% success rate, you are creating a metric that would say "5% of the error budget left", because you are subtracting your target (99.98 in the example above) as part of your metric expression.
Just curious to learn why you do it this way vs. the other way. I assume it is because you can then simply say that those numbers have to be above 0 and you are good, instead of always having to also know the target threshold.
Thanks - please keep us posted