12 Mar 2024 04:34 PM - last edited on 02 Oct 2024 01:31 PM by MaciejNeumann
Hey,
I'm attempting to optimize DQL queries for some SRG objectives as I have hit "execution timeouts" on the validation when triggered from workflows.
Though the query goal is not SRG related itself but could also be used for other purposes.
What I want to do
I want to use entry point spans (traces) and identify e.g. how much time is being spent on specific services down the path. e.g. how much time is spent on calls to a DB, an external service or any other service. For that I created a query that:
a) filters on the entry service by some criteria
b) performs various lookups using the trace.id and filters on other criteria for the called service
c) summarize duration of sub-calls
To do so I've created this DQL:
fetch spans
| filter (istio.canonical_service == "istio-egressgateway"
and startsWith(server.address,"api.p.eco")
and contains(http.url,"products/suggest") and code.namespace == "Envoy")
// get span info for hybris call
| lookup [fetch spans
| filter (dt.host_group.id == "lzapp"
and contains(endpoint.name,"getSearchResults"))
],
sourceField: trace.id, lookupField: trace.id, prefix: "hybris."
// get span info of proxy call
| lookup [fetch spans
| filter (contains(host.name,"nclutz")
and dt.host_group.id == "proxy")
and contains(endpoint.name,"products/suggest")
and contains(server.address,"api.p.eco")
],
sourceField: trace.id, lookupField: trace.id, prefix: "proxy."
// get span info of DB calls
| lookup [fetch spans
| filter contains(code.namespace,"de.hybris.platform.jdbcwrapper.")
and dt.host_group.id == "lzapp"
and contains(code.function,"execute")
| summarize duration=sum(duration), by: {trace.id}
],
sourceField: trace.id, lookupField: trace.id, prefix: "db."
| fields trace.id, duration, hybris.duration, proxy.duration, db.duration
This works (mostly), however it could happen that the query fails because the "lookup. table is too big", my other concern is that it is not performant? (thus the mentioned timeout on the guardian verification?
I think the usecase isn't that special.
How would one optimize this query? I tried with only one lookup statement and multiple filter parameters but that is even worse...
Any ideas?
20 Nov 2024 11:41 AM
Hello @r_weber,
I've got some information about this from our developers:
" The timeout is due to limitations of serverless functions and workflow actions. In short, a guardian would need to validate and ingest results within two minutes. There is information about this in the documentation: AppEngine Functions (Serverless Functions)". Our Devs also asked, what is the timeframe for this validation?
Some improvements around this topic are planned for the future, but we cannot share anything specific at the moment.
As for optimizing this query, I'll share this question and topic at the meeting with the DQL team, and we'll see if we can add something to the DQL Best Practices article.