30 Aug 2024 01:06 PM - last edited on 02 Sep 2024 07:17 AM by MaciejNeumann
We are ingesting traces from Nvidia Triton. Since there is no way to do this via OneAgent on OpenShift, we have to use an ActiveGate. As a result, there is no topology information to match the resulting service to a process group.
I've taken a look at the custom topology model, but for this one of the entities has to be a generic one. The docs aren't that great on more complex examples.
There is also this docs page about enrichment, but it seems geared more towards metrics; also, the file mentioned there only contains the host information, nothing about PGs.
There is also the Dynatrace OpenTelemetry Collector, which might be an option via a processor?
If anyone has managed to assign a service from traces ingested via AG to a PG, any pointers would be much appreciated 🙂
17 Oct 2024 02:18 PM
Want to push this again, still haven't found a solution.
So, if anybody knows of an option to match a span service to its corresponding process group, any insights would be appreciated.
17 Oct 2024 02:26 PM
@pahofmann just add the resource attributes (dt.entity.host and dt.entity.process_group_instance) to your traces, for example with the OTel Collector, and Dynatrace will automatically create the corresponding span service on the process group.
See for example my answer here. Is this what you need?
Depending on the implementation on the trace producer side, you can do that using environment variables too (if the sender is using OTEL SDK in a proper way 😁 ).
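For illustration, a minimal collector sketch could look like the following. This assumes a static `resource` processor and an OTLP exporter towards Dynatrace; the entity IDs are placeholders you would have to replace with the real values from your environment.

```yaml
# Sketch only: stamp Dynatrace entity IDs onto all spans passing through the collector.
# The entity IDs below are placeholders - replace them with the values from your environment.
processors:
  resource:
    attributes:
      - key: dt.entity.host
        value: "HOST-0000000000000000"                       # placeholder host entity ID
        action: upsert
      - key: dt.entity.process_group_instance
        value: "PROCESS_GROUP_INSTANCE-0000000000000000"     # placeholder PGI entity ID
        action: upsert

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource]
      exporters: [otlphttp]
```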
17 Oct 2024 03:42 PM - edited 17 Oct 2024 04:14 PM
That looks promising, so it is possible to map the span services to a PG.
We are using an OTEL collector already, but I'm not sure how I could do the mapping in this use case.
Triton runs on OpenShift; currently we have one OTEL collector per cluster. If we pinned one to each worker node, the mapping to a host would be possible. But there are multiple Triton instances per worker, so I can't really differentiate the process group instances.
But thanks a lot for the pointer, I'll dig deeper into the collector to check if I can do a mapping there.
I'll post results if I find anything.
17 Oct 2024 05:39 PM
The trick is just to find the right process group instance IDs - there is an example in my post of how to add those resource attributes via the collector. The best way is to grab them in the monitored process, but that might not be feasible. I don't think there is another way of doing it if the app cannot add them automatically to the resource attributes.
17 Oct 2024 05:48 PM
Sure, but the app doesn't really have any idea of the PGI IDs, so it can't add them directly. A lookup there would probably be too costly in terms of overhead.
So I would have to add information there that the app has access to, like the pod name, and translate it to a PGI ID in the collector.
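Something like a transform processor in the collector might work for that translation, I guess. A rough sketch (the pod names and PGI IDs are made up, and the rules would need to be kept up to date whenever pods or PGIs change):

```yaml
# Rough sketch: translate the pod name (added by the app or a k8sattributes processor)
# into a Dynatrace PGI entity ID. Pod names and entity IDs are placeholders.
processors:
  transform:
    trace_statements:
      - context: resource
        statements:
          - set(attributes["dt.entity.process_group_instance"], "PROCESS_GROUP_INSTANCE-1111111111111111") where attributes["k8s.pod.name"] == "triton-worker-0"
          - set(attributes["dt.entity.process_group_instance"], "PROCESS_GROUP_INSTANCE-2222222222222222") where attributes["k8s.pod.name"] == "triton-worker-1"
```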
17 Oct 2024 07:56 PM
Seems you can add it through the OTEL_RESOURCE_ATTRIBUTES env variable or the resource setting - https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/trace.html#o...
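For instance, a sketch of what that could look like on the Triton container spec (the attribute values are just examples):

```yaml
# Sketch: pass resource attributes the app has access to, e.g. the pod name.
# Values are examples only.
env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "service.name=triton,k8s.pod.name=$(POD_NAME),k8s.namespace.name=$(POD_NAMESPACE)"
```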
Not sure what the deployment of Triton looks like; I assume you have no option to add your own code to the process to capture the PGI from the magic file for enrichment - see this for example.
Probably the safest method then is to periodically update the lookup in the collector.
Btw. the Dynatrace Operator already does some metadata enrichment for you in the pod - see the files in /var/lib/dynatrace/enrichment (available in cloudNativeFullStack and applicationOnly). You get at least some of the metadata, such as the pod ID or cluster ID, which you can then use in the lookup. Can be useful in your case.
17 Oct 2024 09:02 PM
Yep, we already added other metadata to the traces via OTEL_RESOURCE_ATTRIBUTES; that worked just fine.
The only problem now is getting/matching the right metadata, as you guessed there is no way to add our own code on the application side here.
Unfortunately, the clusters are still in classic FullStack and will be for a while.
Thanks for all the input, now it's time for digging into the collector and finding a way for dynamic lookups.
18 Oct 2024 06:59 AM
I would also explore options for having a static PGI. If you have a "predictable" number of instances (like one pod per node) in the cluster, you can do that using process group detection rules or DT_CLUSTER_ID/DT_NODE_ID and you don't have to bother with updating the lookup for the collector.
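As a sketch of that idea (the values are placeholders), process group detection could be pinned on the workload with something like:

```yaml
# Sketch: influence process group detection so the PGIs stay predictable.
# Values are placeholders.
env:
  - name: DT_CLUSTER_ID
    value: "triton-inference"
  - name: DT_NODE_ID
    value: "triton-worker-0"
```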
22 Oct 2024 02:37 PM
Hey there,
Just to make sure - you aren't having trouble finding your process group ID, right? And you're able to add that process group ID as an env var attribute, as others have suggested?
If you haven't found the PGI yet, just go to your service in the UI Services tab and click on the process group. Then the PGI should be in the URL on the next page.
23 Oct 2024 12:30 PM
No that's not the issue, thanks 🙂
Mapping traces to the PGI IDs they are coming from is the issue, since there are multiple pods per host and only one collector.
17 Dec 2024 03:09 PM
@_Alexander_ I heard today that there will be improvements on this topic in the future.
Could you share some information here already?
15 Jan 2025 01:23 PM - edited 15 Jan 2025 01:24 PM
@pahofmann , can you please provide more information about your use case so that we can better understand why a connection to the process is required in the case of OpenShift?
Thanks and best regards,
Alex
15 Jan 2025 07:40 PM
@_Alexander_ I can't speak for @pahofmann , but basically the issue has two aspects - analytics-wise (proper smartscape) and license-wise (costs).
I've also encountered a few cases already when:
Basically, then you are left with generic OTEL trace ingest. This makes it hard to establish a relation between the services and processes. However, a much larger issue is the licensing. If the app emits OpenTelemetry signals (traces) and you don't have a relation to the process group, the cost of ingesting such traces is noticeable. It does not help that you have a full-stack OneAgent running on the host if you don't have the relation to the PG/PGI. In this case, you can use an OTEL collector to add the resource attributes, but it's difficult to obtain the individual PGI entity values, especially in containerized environments where they are more or less dynamic.
16 Jan 2025 11:13 AM
@Julius_Loman did a good summary; those are our main concerns as well. Having a unified view of all the data is one of the main benefits of using Dynatrace, and currently that is not possible with the ingested OTEL traces on OpenShift.
20 Jan 2025 07:56 AM
We're currently preparing certain k8s-related improvements to create a better topology. The Dynatrace Operator will automatically provide additional configuration to instrumented OTEL applications.
PG/PGI are not part of these improvements, but k8s-related attributes, like k8s.pod.name, k8s.pod.uid, and k8s.namespace.name, will be added to provide the detailed context.
@pahofmann, do you think this could give you the needed context for multiple pods per host? Or do you specifically need PG and PGI for some reason?
21 Jan 2025 01:21 PM
For the reasons @Julius_Loman mentioned, as well as what I said about the unified view and correlation, we would still want a full integration with PG/PGIs.
22 Jan 2025 02:33 PM
Yes, PG/PGIs are essential. Without them, ingesting OTEL traces has an additional price tag (DDU or Custom traces classic).
@mviitanen can you share more details on how the prepared improvement is designed to work? Will this be about injecting this metadata information into pods in the metadata injection phase?
23 Jan 2025 09:10 AM
As @mviitanen said, we're working on improving the topology mapping for those scenarios but mapping spans from external sources to PG/PGI isn't planned atm as it requires more rework.
However, the licensing part for non-OneAgent sources will be fixed in 2 phases.
Phase 1: cloudNativeFullStack installations in K8s.
You need to define the metadata in the environment variable OTEL_RESOURCE_ATTRIBUTES to properly enrich the data. You can look into the dt_metadata.properties file in /var/lib/dynatrace/enrichment for this. Operator v1.3+ is needed to make the enrichment file available in the pod. Later this year, the Operator will also be able to set the environment variable directly.
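For illustration, one way to wire that up today could be a small startup wrapper in the pod spec, assuming the enrichment file contains one key=value pair per line and the app reads OTEL_RESOURCE_ATTRIBUTES at startup (the image and command below are placeholders):

```yaml
# Sketch: build OTEL_RESOURCE_ATTRIBUTES from the operator-provided enrichment file
# before starting the application. Image and command are placeholders.
containers:
  - name: triton
    image: nvcr.io/nvidia/tritonserver:<version>   # placeholder
    command:
      - /bin/sh
      - -c
      - |
        # Join the key=value lines into a comma-separated list for OTEL_RESOURCE_ATTRIBUTES
        export OTEL_RESOURCE_ATTRIBUTES="$(paste -sd, /var/lib/dynatrace/enrichment/dt_metadata.properties)"
        exec tritonserver --model-repository=/models
```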
Phase 2: classicFullStack and applicationMonitoring dynakube modes.
Phase 2 is planned for CQ2.