30 Aug 2024 01:06 PM - last edited on 02 Sep 2024 07:17 AM by MaciejNeumann
We are ingesting traces from Nvidia Triton. As there is no way to do it via OneAgent on OpenShift, we have to use an ActiveGate. Thus, there is no Topology information to match the resulting service to a Process Group.
I've taken a look at the custom topology model, but for that, one of the entities has to be a generic one. The docs aren't that great on more complex examples.
There is also this docs page about enrichment, but it seems to be geared more towards metrics. Also, the file mentioned there only contains the host information, nothing about PGs.
There is also the Dynatrace/OpenTelemetry Collector, which might be an option via a processor?
If anyone has managed to assign a service from traces ingested via AG to a PG, any pointers would be much appreciated 🙂
17 Oct 2024 02:18 PM
Want to push this again, still haven't found a solution.
So, if anybody knows of a way to match a span service to its corresponding process group, any insights would be appreciated.
17 Oct 2024 02:26 PM
@pahofmann just add resource attributes (dt.entity.host and dt.entity.process_group_instance) to the traces, for example with the OTel Collector, and Dynatrace automatically creates the corresponding span service on the process group.
See for example my answer here. Is this what you need?
Depending on the implementation on the trace producer side, you can do that using environment variables too (if the sender is using the OTEL SDK properly 😁 ).
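To illustrate the environment variable route, here is a minimal sketch of how the value for OTEL_RESOURCE_ATTRIBUTES could be assembled. The variable takes comma-separated key=value pairs. The function name and the entity IDs below are placeholders for illustration only, not real IDs from any environment.

```python
from urllib.parse import quote


def build_resource_attributes(attrs: dict[str, str]) -> str:
    """Render attributes as the comma-separated key=value string
    expected in OTEL_RESOURCE_ATTRIBUTES."""
    # Percent-encode values so a comma or equals sign inside a value
    # cannot be mistaken for a separator.
    return ",".join(f"{key}={quote(value, safe='')}" for key, value in attrs.items())


# Placeholder entity IDs (assumptions, replace with the IDs of your own entities).
EXAMPLE = build_resource_attributes({
    "dt.entity.host": "HOST-0000000000000001",
    "dt.entity.process_group_instance": "PROCESS_GROUP_INSTANCE-0000000000000002",
})
```

The resulting string would then be set as the OTEL_RESOURCE_ATTRIBUTES environment variable of the process that emits the traces.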
17 Oct 2024 03:42 PM - edited 17 Oct 2024 04:14 PM
That looks promising, so it is possible to map the span services to a PG.
We are using an OTEL collector already, but I'm not sure how I could do the mapping in this use case.
Triton runs on OpenShift, and currently we have one OTEL collector per cluster. If we pinned one to each worker node, the mapping to a host would be possible. But there are multiple Triton instances per worker, so I can't really differentiate the process group instances.
But thanks a lot for the pointer, I'll dig deeper into the collector to check if I can do a mapping there.
I'll post results if I find anything.
17 Oct 2024 05:39 PM
The trick is just to find the right process group instance IDs. My post has an example of how to add those resource attributes in the collector. The best way is to grab them in the monitored process, but that might not be feasible. I don't think there is another way of doing it if the app cannot add them automatically to the resource attributes.
17 Oct 2024 05:48 PM
Sure, but the app doesn't really have any idea of the PGI IDs, so it can't add them directly. A lookup there would probably be too costly in terms of overhead.
So, I would have to add information there that the app has access to, like the pod name, and translate it to a PGI ID in the collector.
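As a rough sketch of that translation step, written in Python rather than as actual collector configuration: the function below takes the resource attributes of a span and a pod-name-to-PGI table and adds dt.entity.process_group_instance when the pod is known. It assumes the app sends the pod name as k8s.pod.name; the function name and the table contents are hypothetical.

```python
from typing import Mapping


def enrich_with_pgi(resource: Mapping[str, str],
                    pod_to_pgi: Mapping[str, str]) -> dict[str, str]:
    """Return a copy of the resource attributes with the PGI ID added
    when the pod name is found in the lookup table."""
    enriched = dict(resource)
    pod = resource.get("k8s.pod.name")
    pgi = pod_to_pgi.get(pod) if pod is not None else None
    if pgi is not None:
        enriched["dt.entity.process_group_instance"] = pgi
    # Unknown pods pass through unchanged instead of getting a wrong ID.
    return enriched


# Hypothetical table: pod names and PGI IDs are placeholders.
POD_TO_PGI = {"triton-0": "PROCESS_GROUP_INSTANCE-00000000000000AA"}
```

Leaving unknown pods untouched means a stale table results in an unmapped span service rather than a span attached to the wrong process group instance.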
17 Oct 2024 07:56 PM
Seems you can add it through OTEL_RESOURCE_ATTRIBUTES env variable / or the resource setting - https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/trace.html#o...
I'm not sure what the Triton deployment looks like. I assume you have no option to add your own code to the process to capture the PGI from the magic file for enrichment - see this for example.
Probably the safest method then is to periodically update the lookup in the collector.
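The periodic update could look roughly like the sketch below: a lookup that reloads its table once it is older than a given age. How the table is actually loaded (for example from the Dynatrace API) is left to the loader function, which is an assumption here, as are the class and parameter names.

```python
import time
from typing import Callable, Dict, Optional


class RefreshingLookup:
    """Pod-name-to-PGI lookup that reloads its table once it is too old."""

    def __init__(self, loader: Callable[[], Dict[str, str]],
                 max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader            # hypothetical: returns {pod name: PGI ID}
        self._max_age = max_age_seconds
        self._clock = clock
        self._table: Dict[str, str] = {}
        self._loaded_at: Optional[float] = None

    def get(self, pod_name: str) -> Optional[str]:
        now = self._clock()
        # Reload on first use and whenever the table has expired.
        if self._loaded_at is None or now - self._loaded_at >= self._max_age:
            self._table = self._loader()
            self._loaded_at = now
        return self._table.get(pod_name)
```

The refresh interval is a trade-off: a short one keeps up with restarted pods, a long one reduces load on whatever backs the loader.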
Btw, the Dynatrace Operator already does some metadata enrichment for you in the pod - see the files in /var/lib/dynatrace/enrichment (available in cloudNativeFullStack and applicationOnly). That gives you at least some of the metadata, such as the pod ID or cluster ID, which you can then use in the lookup. This can be useful in your case.
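A minimal sketch of reading those enrichment files, assuming the directory contains JSON files that each hold a flat object of key/value pairs (that file layout is an assumption, so check what your Operator version actually writes). The function name is hypothetical.

```python
import json
from pathlib import Path


def load_enrichment(directory: str = "/var/lib/dynatrace/enrichment") -> dict[str, str]:
    """Merge the key/value pairs of all JSON files found in the directory."""
    merged: dict[str, str] = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # Assumption: each file is a flat JSON object; anything else is skipped.
        if isinstance(data, dict):
            merged.update({str(key): str(value) for key, value in data.items()})
    return merged
```

The returned dictionary could then be turned into resource attributes or used as the key for a lookup.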
17 Oct 2024 09:02 PM
Yep, we already added other metadata to the traces via OTEL_RESOURCE_ATTRIBUTES, and that worked just fine.
The only problem now is getting and matching the right metadata. As you guessed, there is no way to add our own code on the application side here.
Unfortunately, the clusters are still in classic FullStack and will be for a while.
Thanks for all the input, now it's time for digging into the collector and finding a way for dynamic lookups.
18 Oct 2024 06:59 AM
I would also explore options for having a static PGI. If you have a "predictable" number of instances (like one pod per node) in the cluster, you can do that using process group detection rules or DT_CLUSTER_ID/DT_NODE_ID and you don't have to bother with updating the lookup for the collector.
22 Oct 2024 02:37 PM
Hey there,
Just to make sure: you aren't having trouble finding your process group ID, right? And you're able to add that process group ID as an env var attribute, as others have suggested?
If you haven't found the PGI yet, just go to your service in the Services tab of the UI and click on the process group. The PGI should then be in the URL of the next page.
23 Oct 2024 12:30 PM
No that's not the issue, thanks 🙂
The issue is mapping traces to the PGI IDs they come from when there are multiple pods per host and only one collector.
17 Dec 2024 03:09 PM
@_Alexander_ I heard today that there will be improvements on this topic in the future.
Could you share some information here already?