18 Jun 2024 12:18 PM
The "classic" service / PurePath analysis view has this nice feature of filtering by the caller/callee of a service:
This works on PurePaths/traces/spans.
I'm trying to do the same thing with a DQL query to answer the question "how good is the response time of service B when service A is calling it".
I'm unable to find a solution, as a DQL query on spans would need to perform sub-queries and joins, which are far too expensive and only allow a very short time period (an unsatisfying 1-2 minutes, if it works at all).
Here is my query that would calculate the inter-service response time:
While theoretically possible, the restrictions on join calls make DQL unusable for this use case.
I think the use case is a valid one, though. For example, you might want to build SLOs or guardians around it that validate the service quality for callers. It could also eliminate the need to define key requests (a key request being a specific endpoint used by a calling service).
Does anyone else think this would be a good use case? Or has anyone found a solution to this problem?
kr
18 Jun 2024 01:45 PM
Very valid use case! "Topology filter" is a must for even more complex ones such as response time for a service where the initial caller was a particular web/mobile application and there are other services between those two. This is easily doable in MDA.
22 Jun 2024 12:02 PM
Here is a technique that is useful here and does not require join or lookup, which have (for good reason) their limitations. In many cases you can successfully use summarize to bring records together, and it does not have such limitations.
Here is my query, which gets data for calls between two tiers for a selected service only: it finds out which services called my service and what the response times of the called service were:
fetch spans
| filter isNotNull(dt.entity.service)
| fieldsAdd isServiceEntry = (dt.entity.service == "SERVICE-B41FA3B7CC1AD9A4" and request.is_root_span)
| fieldsAdd span.joining_id = if(isServiceEntry, span.parent_id, else: span.id)
| summarize {
    cnt = count(),
    root_span_cnt = countIf(isServiceEntry),
    dt.entity.client_service = takeAny(if(not isServiceEntry, dt.entity.service)),
    duration = avg(if(isServiceEntry, duration))
  }, by: { trace.id, span.joining_id }
| filter root_span_cnt >= 1
| filter cnt == root_span_cnt + 1
| summarize { duration = avg(duration), cnt = count() }, by: { dt.entity.client_service }
| fieldsAdd dt.entity.client_service_name = entityName(dt.entity.client_service, type: "dt.entity.service")
| sort cnt desc
What it does:
* entry spans of the selected service get span.parent_id as their joining id; all other spans use their own span.id
* the first summarize groups by {trace.id, span.joining_id}, which brings an entry span and its calling span together in one record
* the two filters keep only groups that contain exactly one entry span plus its caller
* the second summarize then aggregates the entry-span duration per calling service
Of course this query goes over all spans in the selected timeframe, but it can safely run on large sets.
Now that the calling services are known and we can include them in the query, we can easily go over even larger time spans to get more details, e.g.:
fetch spans
| filter in(dt.entity.service, { "SERVICE-B41FA3B7CC1AD9A4", "SERVICE-65BDC31767096F4D", "SERVICE-DAD9C23562B70097", "SERVICE-28578726C5AAE5C2" })
| fieldsAdd isServiceEntry = (dt.entity.service == "SERVICE-B41FA3B7CC1AD9A4" and request.is_root_span)
| fieldsAdd span.joining_id = if(isServiceEntry, span.parent_id, else: span.id)
| summarize {
    cnt = count(),
    root_span_cnt = countIf(isServiceEntry),
    dt.entity.client_service = takeAny(if(not isServiceEntry, dt.entity.service)),
    duration = avg(if(isServiceEntry, duration)),
    timestamp = takeMin(if(isServiceEntry, start_time))
  }, by: { trace.id, span.joining_id }
| filter root_span_cnt >= 1
| filter cnt == root_span_cnt + 1
| makeTimeseries duration = avg(duration), by: { dt.entity.client_service }
| fieldsAdd dt.entity.client_service_name = entityName(dt.entity.client_service, type: "dt.entity.service")
22 Jun 2024 07:22 PM
Thanks @krzysztof_hoja! That is a creative approach, and of course I tested it; this could be really useful.
Not sure why, but the count numbers (service throughput) seem a bit low here compared to what I'd get from PurePath analysis:
The DQL gives me 255 root count:
And the PP analysis gives me a few thousands:
Including the calling services in the filter is probably a really good idea; those could even be determined on the fly from the entity relationships upfront.
08 Jul 2024 03:23 PM
Hi @r_weber,
please keep in mind that span data is subject to sampling factors, which you have to include in your calculations. See an example of how to bring in the sampling factor in our documentation (Service metrics migration guide - Dynatrace Docs):
fetch spans, samplingRatio: 1
// get only database client spans
| filter span.kind == "client" and isNotNull(db.statement)
// calculate how frequently each span is sampled
| fieldsAdd sampling.probability = (power(2, 56) - coalesce(sampling.threshold, 0)) * power(2, -56)
| fieldsAdd sampling.multiplicity = 1 / sampling.probability
// calculate the number of database spans after sampling
| fieldsAdd multiplicity = coalesce(sampling.multiplicity, 1)
    * coalesce(aggregation.count, 1)
    * dt.system.sampling_ratio
// calculate the duration of database spans after sampling
| fieldsAdd duration = coalesce(aggregation.duration_sum / aggregation.count, duration)
// aggregate records with the same values, by service ID
| summarize {
    operation_count_extrapolated = sum(multiplicity),
    operation_duration_avg_extrapolated = sum(duration * multiplicity) / sum(multiplicity)
  }, by: { entityName(dt.entity.service), db.name }
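To make the extrapolation arithmetic from the query above concrete, here is a plain Python sketch of the same formulas (the threshold values are made up for illustration; the function names are mine, not Dynatrace APIs):

```python
# Sketch of the sampling extrapolation used in the DQL above.
# sampling.threshold encodes how aggressively a span was sampled:
# a threshold of 0 means the span was kept with probability 1.

def sampling_probability(threshold):
    # same formula as the DQL: (2^56 - threshold) * 2^-56
    return (2 ** 56 - threshold) * 2 ** -56

def multiplicity(threshold, aggregation_count=1, system_sampling_ratio=1):
    # how many real spans a single stored span record represents
    return (1 / sampling_probability(threshold)) * aggregation_count * system_sampling_ratio

print(multiplicity(0))        # unsampled span -> 1.0
print(multiplicity(2 ** 55))  # kept with probability 0.5 -> 2.0
```

A count aggregated as sum(multiplicity) extrapolates the stored spans back to the real request volume, which is also why raw counts without the sampling factor can look too low.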
24 Jun 2024 06:32 PM
I have used a somewhat different technique than the one described by Kris:
fetch spans
| filter dt.entity.service == "SERVICE-939BD79A70E3B49F" and isNotNull(span.parent_id)
| fieldsAdd child = record(
    end_time,
    start_time,
    response_time = end_time - start_time
  )
| fieldsAdd key = span.parent_id
| append [
    fetch spans
    | filter dt.entity.service == "SERVICE-8C3C0F907E8AF45B"
    | fieldsAdd key = span.id
    | fieldsAdd {
        parent = record(
          span.id
        )
      }
  ]
| summarize {
    child = takeAny(child),
    parent = takeAny(parent)
  },
  by: { key }
| filter isNotNull(child[response_time]) and isNotNull(parent)
| makeTimeseries avg(child[response_time]), time: child[start_time]
You can try the query out in Discover Dynatrace.
Some explanation:
* Service A (SERVICE-8C3C0F907E8AF45B) calls service B (SERVICE-939BD79A70E3B49F)
* select all spans from service B
* append all spans from service A (a union)
* summarize by the common key (service B's span.parent_id matches service A's span.id)
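The append-then-summarize pattern described above is effectively a join expressed as a union plus a group-by. A minimal Python sketch of the same idea (data and field names are invented for illustration, not taken from any real environment):

```python
# Child rows (service B spans) are keyed by their parent's span id;
# parent rows (service A spans) are keyed by their own span id.
child_rows = [
    {"key": "p1", "child": {"response_time": 120}},
    {"key": "p2", "child": {"response_time": 80}},
    {"key": "px", "child": {"response_time": 50}},  # caller is some other service
]
parent_rows = [
    {"key": "p1", "parent": {"span_id": "p1"}},
    {"key": "p2", "parent": {"span_id": "p2"}},
]

# "append" = union of both record sets, then group by the common key
groups = {}
for row in child_rows + parent_rows:
    groups.setdefault(row["key"], {}).update(row)

# keep only keys where both sides matched (the isNotNull filters)
joined = [g for g in groups.values() if "child" in g and "parent" in g]
times = [g["child"]["response_time"] for g in joined]
print(sum(times) / len(times))  # -> 100.0
```

Here takeAny in the DQL plays the role of picking the single child/parent record per key.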
24 Jun 2024 09:19 PM
Thanks @krzysztof_hoja and @sinisa_zubic !
Since a typical end user usually doesn't want to fiddle with SERVICE-IDs (or even know them), I tried something different based on your queries. It's a bit of a hacky workaround, and it surfaces some limitations of dashboards for which I will create an RFE.
I'm now using variables on a dashboard to determine the service IDs from service names, and from those the calling service IDs from the entity model.
This allows me to dynamically create @krzysztof_hoja 's second query, which is a lot more performant on large span sets.
(I still have to try @sinisa_zubic 's solution with the union to see how that performs.)
The dashboard uses "cascading" variables to determine the service ID from a service name and from that determines the calling service's IDs in a multiselect variable:
These variables then can be used in the DQL queries to filter spans:
That query is a lot faster and scans about 1/6th of the data in Grail.
However, what I cannot explain (a bug, perhaps?) is that the makeTimeseries command creates multiple buckets for the same client_service_name (see below). The query is identical apart from an additional filter at the beginning, but it shows a lot of splittings.
Enhancements regarding Dashboard variables: