30 Aug 2024 01:06 PM - last edited on 02 Sep 2024 07:17 AM by MaciejNeumann
We are ingesting traces from Nvidia Triton. Since there is no way to do this via OneAgent on OpenShift, we have to use an ActiveGate. As a result, there is no topology information to match the resulting service to a process group.
I've taken a look at the custom topology model, but for this one of the entities has to be a generic one. The docs aren't that great on more complex examples.
There is also this docs page about enrichment, but it seems geared more towards metrics; also, the file mentioned there only contains the host information, nothing about PGs.
There is also the Dynatrace OpenTelemetry Collector, which might be an option via a processor?
If anyone has managed to assign a service from traces ingested via AG to a PG, any pointers would be much appreciated 🙂
17 Oct 2024 02:18 PM
Want to push this again, still haven't found a solution.
So, if anybody knows of an option to match a span service to its corresponding process group, any insights would be appreciated.
17 Oct 2024 02:26 PM
@pahofmann just add the resource attributes (dt.entity.host and dt.entity.process_group_instance) to your traces, for example with the OTel Collector, and Dynatrace will automatically create the corresponding span service on the process group.
See for example my answer here. Is this what you need?
Depending on the implementation on the trace producer side, you can do that using environment variables too (if the sender is using OTEL SDK in a proper way 😁 ).
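For illustration, a minimal collector sketch could look like the following. This assumes a static `resource` processor and an OTLP exporter towards Dynatrace; the entity IDs are placeholders you would have to replace with the real values from your environment.

```yaml
# Sketch only: stamp Dynatrace entity IDs onto all spans passing through the collector.
# The entity IDs below are placeholders - replace them with the values from your environment.
processors:
  resource:
    attributes:
      - key: dt.entity.host
        value: "HOST-0000000000000000"                       # placeholder host entity ID
        action: upsert
      - key: dt.entity.process_group_instance
        value: "PROCESS_GROUP_INSTANCE-0000000000000000"     # placeholder PGI entity ID
        action: upsert

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource]
      exporters: [otlphttp]
```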
17 Oct 2024 03:42 PM - edited 17 Oct 2024 04:14 PM
That looks promising, so it is possible to map the span services to a PG.
We are using an OTEL collector already, but I'm not sure how I could do the mapping in this use case.
Triton runs on OpenShift; currently we have one OTEL collector per cluster. If we pinned one to each worker node, the mapping to a host would be possible. But there are multiple Triton instances per worker, so I can't really differentiate the process group instances.
But thanks a lot for the pointer, I'll dig deeper into the collector to check if I can do a mapping there.
I'll post results if I find anything.
17 Oct 2024 05:39 PM
The trick is just to find the right process group instance IDs - there is an example in my post of how to add those resource attributes via the collector. The best way is to grab them in the monitored process, but that might not be feasible. I don't think there is another way of doing it if the app cannot add them automatically to the resource attributes.
17 Oct 2024 05:48 PM
Sure, but the app doesn't really have any idea of the PGI IDs, so it can't add them directly. A lookup there would probably be too costly in terms of overhead.
So I would have to add information there that the app has access to, like the pod name, and translate it to a PGI ID in the collector.
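Something like a transform processor in the collector might work for that translation, I guess. A rough sketch (the pod names and PGI IDs are made up, and the rules would need to be kept up to date whenever pods or PGIs change):

```yaml
# Rough sketch: translate the pod name (added by the app or a k8sattributes processor)
# into a Dynatrace PGI entity ID. Pod names and entity IDs are placeholders.
processors:
  transform:
    trace_statements:
      - context: resource
        statements:
          - set(attributes["dt.entity.process_group_instance"], "PROCESS_GROUP_INSTANCE-1111111111111111") where attributes["k8s.pod.name"] == "triton-worker-0"
          - set(attributes["dt.entity.process_group_instance"], "PROCESS_GROUP_INSTANCE-2222222222222222") where attributes["k8s.pod.name"] == "triton-worker-1"
```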
17 Oct 2024 07:56 PM
Seems you can add it through the OTEL_RESOURCE_ATTRIBUTES env variable or the resource setting - https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/trace.html#o...
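For instance, a sketch of what that could look like on the Triton container spec (the attribute values are just examples):

```yaml
# Sketch: pass resource attributes the app has access to, e.g. the pod name.
# Values are examples only.
env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "service.name=triton,k8s.pod.name=$(POD_NAME),k8s.namespace.name=$(POD_NAMESPACE)"
```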
Not sure what the deployment of Triton looks like; I assume you have no option to add your own code to the process to capture the PGI from the magic file for enrichment - see this for example.
Probably the safest method then is to periodically update the lookup in the collector.
Btw. the Dynatrace Operator already does some metadata enrichment for you in the pod - see the files in /var/lib/dynatrace/enrichment (available in cloudNativeFullStack and applicationOnly). You get at least some of the metadata, such as the pod ID or cluster ID, which you can then use in the lookup. Can be useful in your case.
17 Oct 2024 09:02 PM
Yep, we already added other metadata to the traces via OTEL_RESOURCE_ATTRIBUTES; that worked just fine.
The only problem now is getting/matching the right metadata, as you guessed there is no way to add our own code on the application side here.
Unfortunately, the clusters are still in classic FullStack and will be for a while.
Thanks for all the input, now it's time for digging into the collector and finding a way for dynamic lookups.
18 Oct 2024 06:59 AM
I would also explore options for having a static PGI. If you have a "predictable" number of instances (like one pod per node) in the cluster, you can do that using process group detection rules or DT_CLUSTER_ID/DT_NODE_ID and you don't have to bother with updating the lookup for the collector.
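As a sketch of that idea (the values are placeholders), process group detection could be pinned on the workload with something like:

```yaml
# Sketch: influence process group detection so the PGIs stay predictable.
# Values are placeholders.
env:
  - name: DT_CLUSTER_ID
    value: "triton-inference"
  - name: DT_NODE_ID
    value: "triton-worker-0"
```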
22 Oct 2024 02:37 PM
Hey there,
Just to make sure - you aren't having trouble finding your process group ID, right? And you're able to add that process group ID as an env var attribute, as others have suggested?
If you haven't found the PGI yet, just go to your service in the UI Services tab and click on the process group. Then the PGI should be in the URL on the next page.
23 Oct 2024 12:30 PM
No that's not the issue, thanks 🙂
Mapping traces to the PGI IDs they are coming from is the issue, since there are multiple pods per host and only one collector.
17 Dec 2024 03:09 PM
@_Alexander_ I heard today that there will be improvements on this topic in the future.
Could you share some information here already?
15 Jan 2025 01:23 PM - edited 15 Jan 2025 01:24 PM
@pahofmann , can you please provide more information about your use case so that we can better understand why a connection to the process is required in the case of OpenShift?
Thanks and best regards,
Alex
15 Jan 2025 07:40 PM
@_Alexander_ I can't speak for @pahofmann , but basically the issue has two aspects - analytics-wise (proper smartscape) and license-wise (costs).
I've also encountered a few cases already when:
Basically, then you are left with generic OTEL trace ingest. This makes it hard to establish a relation between the services and processes. However, a much larger issue is the licensing. If the app emits OpenTelemetry signals (traces) and you don't have a relation to the process group, the cost of ingesting such traces is noticeable. It does not help that you have a full-stack OneAgent running on the host if you don't have the relation to the PG/PGI. In this case, you can use an OTEL collector to add the resource attributes, but it's difficult to obtain the individual PGI entity values, especially in containerized environments where they are more or less dynamic.
16 Jan 2025 11:13 AM
@Julius_Loman did a good summary; those are our main concerns as well. Having a unified view of all the data is one of the main benefits of using Dynatrace, and currently that is not possible with the ingested OTEL traces on OpenShift.
20 Jan 2025 07:56 AM
We're currently preparing certain k8s-related improvements to create a better topology. The Dynatrace Operator will automatically provide additional configuration to instrumented OTEL applications.
PG/PGI are not part of these improvements, but k8s-related attributes, like k8s.pod.name, k8s.pod.uid, and k8s.namespace.name, will be added to provide the detailed context.
@pahofmann, do you think this could give you the needed context for multiple pods per host? Or do you specifically need PG and PGI for some reason?
21 Jan 2025 01:21 PM
For the reasons @Julius_Loman mentioned, as well as what I said about the unified view and correlation, we would still want a full integration with PG/PGIs.
22 Jan 2025 02:33 PM
Yes, PG/PGIs are essential. Without them, ingesting OTEL traces has an additional price tag (DDU or Custom traces classic).
@mviitanen can you share more details on how the prepared improvement is designed to work? Will this be about injecting this metadata information into pods in the metadata injection phase?
23 Jan 2025 09:10 AM
As @mviitanen said, we're working on improving the topology mapping for those scenarios but mapping spans from external sources to PG/PGI isn't planned atm as it requires more rework.
However, the licensing part for non-OneAgent sources will be fixed in 2 phases.
Phase 1: cloudNativeFullStack installations in K8s.
You need to define the metadata in the environment variable OTEL_RESOURCE_ATTRIBUTES to properly enrich the data. You can look into the dt_metadata.properties file in /var/lib/dynatrace/enrichment for this. Operator v1.3+ is needed to make the enrichment file available in the pod. Later this year, the Operator will also be able to set the environment variable directly.
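For illustration, one way to wire that up today could be a small startup wrapper in the pod spec, assuming the enrichment file contains one key=value pair per line and the app reads OTEL_RESOURCE_ATTRIBUTES at startup (the image and command below are placeholders):

```yaml
# Sketch: build OTEL_RESOURCE_ATTRIBUTES from the operator-provided enrichment file
# before starting the application. Image and command are placeholders.
containers:
  - name: triton
    image: nvcr.io/nvidia/tritonserver:<version>   # placeholder
    command:
      - /bin/sh
      - -c
      - |
        # Join the key=value lines into a comma-separated list for OTEL_RESOURCE_ATTRIBUTES
        export OTEL_RESOURCE_ATTRIBUTES="$(paste -sd, /var/lib/dynatrace/enrichment/dt_metadata.properties)"
        exec tritonserver --model-repository=/models
```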
Phase 2: classicFullStack and applicationMonitoring dynakube modes.
Phase 2 is planned for CQ2.