
DQL on spans: variant for called_by analysis?

r_weber
DynaMight Champion

The "classic" service / PurePath analysis view has this nice feature of filtering by caller/callee of a service:

r_weber_0-1718708662291.png

This works on PurePaths/traces/spans.

I'm trying to do the same thing with a DQL query, to answer the question "how good is the response time of service B when service A is calling it?".
I'm unable to find a solution, because the DQL query on spans would need to perform sub-queries and joins, which are far too expensive and only allow a very short time period (an unsatisfying 1-2 minutes maybe, if it works at all).

Here is my query that would calculate the inter-service response time:

r_weber_1-1718709165172.png
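In text form, the idea is roughly the following (a sketch with placeholder service IDs, not my exact query; the lookup resolves each entry span's parent to find the calling service):

fetch spans
// placeholder ID for the called service B
| filter dt.entity.service == "SERVICE-B" and request.is_root_span
// resolve the parent span to find out which service called us
| lookup [
    fetch spans
    | fields span.id, dt.entity.service
  ], sourceField: span.parent_id, lookupField: span.id, prefix: "caller."
// placeholder ID for the calling service A
| filter caller.dt.entity.service == "SERVICE-A"
| summarize duration = avg(duration), by: { caller.dt.entity.service }

The lookup subquery has to cover all candidate parent spans, and that is exactly where the join/lookup record limits bite.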

While theoretically possible, the limitation lies in the restrictions on join calls, which make DQL unusable for this use case.

I think the use case is a valid one, though, e.g. if you want to build SLOs or guardians around it to validate the service quality for callers. It could also eliminate the need to define key requests (a key request being a specific endpoint used by a calling service).
Does anyone else think this would be a good use case? Or have you found a solution to this problem?

kr

 

Certified Dynatrace Master, Dynatrace Partner - 360Performance.net
6 REPLIES

Julius_Loman
DynaMight Legend

Very valid use case! A "topology filter" is a must for even more complex ones, such as the response time of a service where the initial caller was a particular web/mobile application and there are other services in between. This is easily doable in MDA (multidimensional analysis).

Certified Dynatrace Master | Alanata a.s., Slovakia, Dynatrace Master Partner

krzysztof_hoja
Dynatrace Champion

Here is a technique that is useful here and does not require join or lookup, as those have (for a reason) their limitations. In many cases you can successfully use summarize to bring records together, and it does not have such limitations.

Here is my query, which gets data for calls between two tiers only, for a selected service: it finds out which services called my service and what the response times of the called service were:

fetch spans
| filter isNotNull(dt.entity.service)
| fieldsAdd isServiceEntry = (dt.entity.service == "SERVICE-B41FA3B7CC1AD9A4" and request.is_root_span)
| fieldsAdd span.joining_id = if( isServiceEntry, span.parent_id, else: span.id  )
| summarize {
    cnt=count(),
    root_span_cnt=countIf(isServiceEntry),
    dt.entity.client_service=takeAny(if(not isServiceEntry, dt.entity.service)),
    duration=avg(if(isServiceEntry, duration))
} , by: {trace.id, span.joining_id}
| filter root_span_cnt>=1
| filter cnt==root_span_cnt+1
| summarize { duration=avg(duration), cnt=count() }, by: { dt.entity.client_service }
| fieldsAdd dt.entity.client_service_name = entityName(dt.entity.client_service, type:"dt.entity.service")
| sort cnt desc

What it does:

  • marks entry-point spans of the selected service
  • builds span.joining_id, which is used to construct the group: span.parent_id for the called service's entry spans and span.id for the rest (it does not have to be a dedicated field, the condition could go directly into the by:{} clause, but it helps with query readability)
  • creates a group for each trace.id and span.joining_id, which puts calling and called spans into the same group, and collects:
    • the span count and the called-span count,
    • the calling service ID,
    • and the average duration of the called spans
  • filters to groups with at least 1 span from the called service (a single client span may call the same service more than once) and exactly 1 span from the calling service:
    • if this were not done, spans calling other services would end up in the result set
    • if this were not done, server spans without a parent would end up in the result set
  • in the last step, just aggregates duration by calling service

Of course this query goes over all spans in the selected timeframe, but it can safely run on large sets.

krzysztof_hoja_0-1719053699366.png

Now that the calling services are known and we can include them in the query, we can easily go over even larger time spans to get more detail, for example:

fetch spans
| filter in(dt.entity.service, {"SERVICE-B41FA3B7CC1AD9A4", "SERVICE-65BDC31767096F4D", "SERVICE-DAD9C23562B70097", "SERVICE-28578726C5AAE5C2" } )
| fieldsAdd isServiceEntry = (dt.entity.service == "SERVICE-B41FA3B7CC1AD9A4" and request.is_root_span)
| fieldsAdd span.joining_id = if( isServiceEntry, span.parent_id, else: span.id  )
| summarize {
    cnt=count(),
    root_span_cnt=countIf(isServiceEntry),
    dt.entity.client_service=takeAny(if(not isServiceEntry, dt.entity.service)),
    duration=avg(if(isServiceEntry, duration)),
    timestamp = takeMin( if(isServiceEntry, start_time) )
} , by: {trace.id, span.joining_id}
| filter root_span_cnt>=1
| filter cnt==root_span_cnt+1
| makeTimeseries duration=avg(duration), by: { dt.entity.client_service }
| fieldsAdd dt.entity.client_service_name = entityName(dt.entity.client_service, type:"dt.entity.service")

 

krzysztof_hoja_1-1719054058813.png

 

Thanks @krzysztof_hoja! That is a creative approach, and of course I tested it; this could be really useful.
Not sure why, but the count numbers (service throughput) seem a bit low here compared to what I'd get from PurePath analysis:

The DQL gives me a root count of 255:

r_weber_0-1719080297944.png

And the PP analysis gives me a few thousand:

r_weber_1-1719080367878.png


Including the calling services in the filter is probably a really good idea; they could even be determined from the entity relationships on the fly, upfront.

 

Certified Dynatrace Master, Dynatrace Partner - 360Performance.net

Hi @r_weber

please keep in mind that span data is subject to sampling, which you have to factor into your calculations. See this example of how to bring in the sampling factor from our documentation, Service metrics migration guide - Dynatrace Docs:

fetch spans, samplingRatio:1

// get only database client span
| filter span.kind == "client" and isNotNull(db.statement)

// calculate how frequently each span is sampled
| fieldsAdd sampling.probability = (power(2, 56) - coalesce(sampling.threshold, 0)) * power(2, -56)
| fieldsAdd sampling.multiplicity = 1/sampling.probability

// calculate the number of database spans after sampling
| fieldsAdd multiplicity = coalesce(sampling.multiplicity, 1)
                         * coalesce(aggregation.count, 1)
                         * dt.system.sampling_ratio

// calculate the duration of database spans after sampling
| fieldsAdd duration = coalesce(aggregation.duration_sum / aggregation.count, duration)

// aggregate records with the same values, by service ID
| summarize {
    operation_count_extrapolated = sum(multiplicity),
    operation_duration_avg_extrapolated = sum(duration * multiplicity) / sum(multiplicity)
}, by: { entityName(dt.entity.service), db.name }
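
Applied to the grouping query from above, the extrapolation could look roughly like this (a sketch only, reusing the multiplicity fields from the docs snippet and the placeholder service ID; summing the multiplicities instead of counting rows should bring the throughput numbers much closer to the PurePath view):

fetch spans, samplingRatio:1
| filter isNotNull(dt.entity.service)
// per-span extrapolation factor, as in the docs snippet above
| fieldsAdd sampling.probability = (power(2, 56) - coalesce(sampling.threshold, 0)) * power(2, -56)
| fieldsAdd multiplicity = (1 / sampling.probability)
                         * coalesce(aggregation.count, 1)
                         * dt.system.sampling_ratio
| fieldsAdd isServiceEntry = (dt.entity.service == "SERVICE-B41FA3B7CC1AD9A4" and request.is_root_span)
| fieldsAdd span.joining_id = if(isServiceEntry, span.parent_id, else: span.id)
| summarize {
    cnt = count(),
    root_span_cnt = countIf(isServiceEntry),
    extrapolated_cnt = sum(if(isServiceEntry, multiplicity, else: 0)),
    dt.entity.client_service = takeAny(if(not isServiceEntry, dt.entity.service)),
    duration = avg(if(isServiceEntry, duration))
}, by: {trace.id, span.joining_id}
| filter root_span_cnt >= 1
| filter cnt == root_span_cnt + 1
// extrapolated call count per calling service
| summarize { duration = avg(duration), calls = sum(extrapolated_cnt) }, by: { dt.entity.client_service }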

sinisa_zubic
Dynatrace Champion

I have used a slightly different technique than the one described by Kris:

fetch spans
| filter dt.entity.service == "SERVICE-939BD79A70E3B49F" and isNotNull(span.parent_id)
| fieldsAdd child = record(
    end_time,
    start_time,
    response_time = end_time-start_time  
  )
| fieldsAdd key = span.parent_id

| append [
    fetch spans
    | filter dt.entity.service == "SERVICE-8C3C0F907E8AF45B"
    | fieldsAdd key = span.id
    | fieldsAdd {
      parent = record(
        span.id
      )
    }
  ]
| summarize {
      child = takeAny(child),
      parent = takeAny(parent)
    },
    by: { key }
| filter isNotNull(child[response_time]) and isNotNull(parent)
| makeTimeseries avg(child[response_time]), time:child[start_time]

 

You can try the query out in Discover Dynatrace.

 

Some explanation:

* Service A (SERVICE-8C3C0F907E8AF45B) calls service B (SERVICE-939BD79A70E3B49F)

* select all spans from service B

* append all spans from service A (union)

* summarize by the common key (span.id)

 

r_weber
DynaMight Champion

Thanks @krzysztof_hoja and @sinisa_zubic!
Since a typical end user usually doesn't want to fiddle with SERVICE-IDs (or even know them), I tried something different based on your queries. It's a bit of a hacky workaround, and it surfaces some limitations of dashboards that I will create an RFE for.

I'm now using variables on a dashboard to determine the service IDs from service names, and from those the calling service IDs from the entity model.
This allows me to dynamically create @krzysztof_hoja 's second query, which is a lot more performant on large span sets.
(I still have to try @sinisa_zubic 's solution with the union to see how that performs.)

The dashboard uses "cascading" variables to determine the service ID from a service name, and from that the calling services' IDs in a multiselect variable:

r_weber_1-1719259823312.png
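The variable queries have roughly this shape (a sketch; $ServiceName and $ServiceId stand for the dashboard variables, and I'm assuming the called_by relationship field is available on service entities):

// variable 1: resolve the service ID from the selected service name
fetch dt.entity.service
| filter entity.name == $ServiceName
| fields id

// variable 2 (multiselect): the IDs of the services calling the selected one
fetch dt.entity.service
| filter id == $ServiceId
| fieldsAdd callers = called_by[dt.entity.service]
| expand callers
| fields callers

Kris' second query then simply takes these variables in its in(...) filter instead of hard-coded service IDs.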

These variables can then be used in the DQL queries to filter spans:

r_weber_2-1719259903720.png

That query is a lot faster and scans about 1/6th of the data in Grail.

However, what I cannot explain (looks like a bug?) is that the makeTimeseries command creates multiple buckets for the same client_service_name (see below). The query is identical apart from an additional filter at the beginning, but shows a lot of splits.

 

r_weber_0-1719259693732.png

 

Enhancements regarding Dashboard variables:

  1. Support for hidden variables; no user wants to see the SERVICE-IDs and select based on them
  2. Support for a variable value and a variable display name (e.g. one could display the service name while the variable's value is the service ID)
  3. Support for auto-updating variables whose query uses another variable (a change of the service name automatically updates the dependent queries for caller service IDs)
  4. Option to auto-select variable values (e.g. automatically pick all caller services)
Certified Dynatrace Master, Dynatrace Partner - 360Performance.net
