All Services SLO Alerting

federicobarera · 04 Jan 2023

I have two questions regarding the creation of, and alerting on, SLOs within Dynatrace.

Context:

Dynatrace auto-detects services deployed in our environment, and we use environment/auto-generated tags to classify workload types and performance requirements.

Creation of SLOs on every service (e.g. Service Errors)

When creating an SLO, it seems that we need to specify a "service filter" to tie it to a specific service; otherwise the SLO is applied without considering the ServiceName dimension. We'd rather apply the same SLO to every service matching specific tags. Taking the Prometheus example from here (https://sre.google/workbook/alerting-on-slos/):

record: job:slo_errors_per_request:ratio_rate10m
expr:
  sum(rate(slo_errors[10m])) by (job)
    /
  sum(rate(slo_requests[10m])) by (job)

Would it be possible to achieve the same in Dynatrace?

Note: Due to the large number of microservices, we are not keen on imperatively/programmatically creating an SLO for each new service deployed. We'd rather have a generic set of SLOs calculated per serviceName and applied to all services matching specific tags.

Alerting

When creating an alert on an SLO, a metric event is created on a static threshold, using an alerting window of 60 minutes.

Would it be possible to create a long/short alerting window as explained here or here?
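
For reference, the long/short window idea from the SRE Workbook linked above can be sketched in a few lines. This is a minimal illustration in Python, not a Dynatrace feature or API: the names `BurnRateWindow`, `burn_rate_threshold`, and `should_alert` are hypothetical, and the window sizes and thresholds are the workbook's values for a 30-day SLO period.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateWindow:
    """One long/short window pair with its burn-rate threshold (hypothetical type)."""
    long_hours: float
    short_hours: float
    threshold: float


def burn_rate_threshold(budget_fraction: float, period_hours: float,
                        window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the error budget in `window_hours`."""
    return budget_fraction * period_hours / window_hours


# Window pairs from the SRE Workbook for a 30-day (720 h) period.
# These are NOT Dynatrace defaults.
WORKBOOK_WINDOWS = (
    BurnRateWindow(long_hours=1, short_hours=5 / 60, threshold=14.4),  # 2% of budget in 1 h
    BurnRateWindow(long_hours=6, short_hours=0.5, threshold=6.0),      # 5% of budget in 6 h
)


def should_alert(long_burn_rate: float, short_burn_rate: float,
                 window: BurnRateWindow) -> bool:
    """Fire only when BOTH the long and the short window exceed the threshold."""
    return (long_burn_rate > window.threshold
            and short_burn_rate > window.threshold)
```

The short window is what makes the alert reset quickly: once the error rate recovers, the short-window burn rate drops below the threshold even though the long window is still elevated.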

Thanks

andreas_grabner · 10 Jan 2023

Hi Federico. I saw you also discussed this topic through the Dynatrace In-Product Chat. Just curious to learn whether you have tried the burn down rate custom alert, which should alert when the burn down rate is violated per service. I'm also happy to jump on a call with you to discuss SLOs in general, as it is a topic I have been advocating for in the past.

Contact our DevRel team through devrel@dynatrace.com

federicobarera · 10 Jan 2023

Hi Andreas, not yet, but it's on my to-do list. I'll update this thread once I have some data to share. Thanks for following up!

federicobarera · 23 Jan 2023

Hi, unfortunately I have been notified by support that it is not possible to create custom metrics from an expression, either via the UI or the API. I was planning to create a custom metric from the following expression in order to calculate the burn rate for every service (similar to what the SLO functionality does in the background for you, but split by service via splitBy).

((100) - ((100)*(builtin:service.errors.server.successCount:filter(in("dt.entity.service",entitySelector("type(~"SERVICE~")"))):splitBy("dt.entity.service"))/(builtin:service.requestCount.server:filter(in("dt.entity.service",entitySelector("type(~"SERVICE~")"))):splitBy("dt.entity.service")))) / ((100) - (99.98))
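
To make the arithmetic in the expression above concrete, here is a minimal Python sketch of the same formula, (100 - success %) / (100 - target %), evaluated per service. The function names and the service names in the usage example are hypothetical and only serve as illustration; the 99.98 target is the one from the expression.

```python
import math


def burn_rate(success_count: float, request_count: float,
              target_percent: float = 99.98) -> float:
    """Mirror of the selector: (100 - success %) / (100 - target %)."""
    if request_count <= 0:
        raise ValueError("request_count must be positive")
    success_percent = 100.0 * success_count / request_count
    return (100.0 - success_percent) / (100.0 - target_percent)


def burn_rate_per_service(counts, target_percent: float = 99.98):
    """`counts` maps a service name to (success_count, request_count)."""
    return {name: burn_rate(ok, total, target_percent)
            for name, (ok, total) in counts.items()}


# Hypothetical sample data: "checkout" fails 10 of 10,000 requests.
rates = burn_rate_per_service({
    "checkout": (9_990, 10_000),
    "search": (10_000, 10_000),
})
```

With these numbers, "checkout" has a 0.1% error rate against a 0.02% budget, so its burn rate is roughly 5; "search" has no errors and a burn rate of 0. A burn rate of 1 means errors arrive exactly as fast as the budget allows.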

federicobarera · 23 Jan 2023

Upon further conversation: it's not possible to create a custom `func:` metric from a metric selector, but it is possible to put the metric selector directly into a custom metric event. Hence, we are going to trial the following (provisioned via Terraform):

resource "dynatrace_metric_events" "SLO_Burn_Rate" {
  enabled                    = true
  event_entity_dimension_key = "dt.entity.service"
  summary                    = "SLO Burn Rate"
  event_template {
    description = "The {metricname} value was {alert_condition} normal behavior."
    davis_merge = true
    event_type  = "CUSTOM_ALERT"
    title       = "{dims:dt.entity.service.name} Burn Rate"
  }
  model_properties {
    type               = "STATIC_THRESHOLD"
    alert_condition    = "ABOVE"
    alert_on_no_data   = false
    dealerting_samples = 5
    samples            = 5
    threshold          = 14
    violating_samples  = 3
  }
  query_definition {
    type            = "METRIC_SELECTOR"
    metric_key      = ""
    metric_selector = <<-EOT
      ((100) - ((100)*(
        builtin:service.errors.server.successCount:filter(
          in("dt.entity.service",entitySelector("type(~"SERVICE~"),tag(~"tier:1~")"))
        ):splitBy("dt.entity.service")) /
        (builtin:service.requestCount.server:filter(
          in("dt.entity.service",entitySelector("type(~"SERVICE~"),tag(~"tier:1~")"))
        ):splitBy("dt.entity.service")))) / ((100) - (99.98))
    EOT
  }
}
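
As a rough illustration of how the `samples`, `violating_samples`, and `dealerting_samples` settings above interact, here is a simplified Python model of a sliding-window static threshold. This is an assumption about the evaluation semantics made for illustration, not Dynatrace's actual implementation; the function name `alert_states` is hypothetical, and the Dynatrace documentation remains the authority on the real behavior.

```python
from collections import deque


def alert_states(values, threshold=14, samples=5,
                 violating_samples=3, dealerting_samples=5):
    """Return, per sample, whether the (modelled) event is open afterwards.

    Simplified model (assumption): the event opens when at least
    `violating_samples` of the last `samples` values are above the
    threshold, and closes when at least `dealerting_samples` of the
    last `samples` values are back at or below it.
    """
    window = deque(maxlen=samples)  # True = sample violated the threshold
    event_open = False
    states = []
    for value in values:
        window.append(value > threshold)
        violating = sum(window)
        normal = len(window) - violating
        if not event_open and violating >= violating_samples:
            event_open = True
        elif event_open and normal >= dealerting_samples:
            event_open = False
        states.append(event_open)
    return states


# Three violations within five samples open the event;
# five normal samples in a row close it again.
history = alert_states([20, 20, 1, 20, 1, 1, 1, 1, 1])
```

Under this model a single spike above 14 does not open an event, and an open event only closes once the whole five-sample window is back to normal.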

andreas_grabner · 25 Jan 2023

Thanks for the update.

Is there a reason why you are creating an "Error Budget" metric rather than an "SLO Metric"? What I mean is: instead of creating a metric that shows, e.g., a 95% success rate, you are creating a metric that would say 5% error budget left, because you are subtracting your target (99.98 in the example above) as part of your metric expression.

Just curious to learn why you do it this way rather than the other way. I assume it is because you can then just say that those numbers have to be above 0 and you are good, instead of always having to know the target threshold as well.

Thanks, and please keep us posted.
