19 Nov 2024 01:34 PM - last edited on 20 Nov 2024 07:26 AM by MaciejNeumann
Hi,
I’m experiencing issues with Dynatrace instrumentation in my Python-based Kubernetes service.
I have implemented tracing using the autodynatrace SDK in my FastAPI application, which runs on Uvicorn.
However, I see a lot of “no response” errors for my Kubernetes service endpoints in Dynatrace.
As verified through application logs, these errors do not correspond to actual failures.
Additionally, I’m receiving the following diagnostic error in Dynatrace:
“Some data could not be collected or transmitted. This is most likely due to a resource congestion on network, host, or process level in your monitored environment
(Diagnostic codes: C1, A5)”
Here are some details of my Kubernetes Python service:
1. What could be causing these “no response” errors, given that the endpoints function correctly?
2. Could the diagnostic error codes (C1, A5) be related to this behavior?
3. Are there additional configurations or adjustments recommended for Python FastAPI services in Kubernetes to optimize tracing and prevent these issues?
Here is the OneAgent analysis:
[C/Python SDK] end() order wrong or unsupported async/gevent usage (736)
The end() calls on the SDK tracers are in the wrong order. This can result from usage of async/await or greenlets (gevent, green threads), e.g. with Python gunicorn; please see https://github.com/Dynatrace/OneAgent-SDK-for-Python/blob/master/README.md#tracers. It can also simply be forgetting to call start (tracers are not started at creation) or end, or skipping them (e.g. because of an exception).
Recommendation: Check that you are calling end in the reverse order of start and in the correct (OS) thread. In Python, prefer with blocks over explicit start/end calls.
Especially if you are using async/await, green threads, or other threading abstractions, check https://github.com/Dynatrace/OneAgent-SDK-for-Python/blob/master/README.md#tracers.
If you are using a server like gunicorn, you might be able to configure it not to use green threads (normal threads are fine), but be aware that this will have a performance impact.
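To make that ordering concrete, here is a minimal sketch of manual OneAgent SDK usage, assuming the oneagent package from the README linked above; the service and method names are made up for illustration. With nested with blocks, each tracer is started and ended on the same OS thread and ended in the reverse order of starting, which is what the analysis asks for:

```python
import oneagent

# Initialize the SDK once per process; oneagent.initialize() returns a result
# object you can inspect if initialization fails.
oneagent.initialize()
sdk = oneagent.get_sdk()

def handle_request():
    # Outer tracer starts first and ends last; the inner tracer is started
    # and ended while the outer one is still active, on the same OS thread.
    with sdk.trace_custom_service('handle_request', 'OrderService'):   # hypothetical names
        with sdk.trace_custom_service('load_order', 'OrderService'):   # hypothetical names
            pass  # actual work goes here

    # The equivalent explicit sequence would be:
    #   outer.start(); inner.start(); inner.end(); outer.end()
    # Ending outer before inner, or ending a tracer from a different
    # thread/coroutine, is what triggers the "end() order wrong" diagnostic.

handle_request()
oneagent.shutdown()
```

As for the gunicorn remark: the kind of configuration meant is switching the worker class away from green threads, e.g. worker_class = "gthread" or "sync" instead of "gevent". With FastAPI on Uvicorn this does not apply directly, because the async/await usage comes from the framework itself.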
20 Nov 2024 09:45 AM
@azacharov wrote:
1. What could be causing these “no response” errors, given that the endpoints function correctly?
Most likely, due to the async usage, the SDK loses track of which node is the current one and detects the trace as corrupted before it gets to report the response code. See the next question.
2. Could the diagnostic error codes (C1, A5) be related to this behavior?
Yes. Very likely these are caused by the issue the analysis pointed out. You can follow the link to the GitHub SDK README to get further background information.
3. Are there additional configurations or adjustments recommended for Python FastAPI services in Kubernetes to optimize tracing and prevent these issues?
Not really, no. The main purpose of using async in this case is to allow a single OS thread to handle multiple operations in an interleaved way: while one request is waiting for I/O, the thread can already start the next request and come back to the previous one later. But because the SDK uses OS-level thread-local variables to track the current trace, this interleaving breaks it; the sketch below illustrates the mechanism. This is explained in the GitHub SDK README linked above, which also describes a workaround that might be applicable when using the SDK manually, but it is probably not usable as-is with autodynatrace. (Note that autodynatrace is also not supported through support tickets; it only has open-source support via GitHub: https://github.com/dynatrace-oss/OneAgent-SDK-Python-AutoInstrumentation)
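To illustrate the mechanism (this is a toy model, not the SDK's actual internals): when two coroutines interleave on the same event-loop thread, state stored in an OS-level thread-local is shared between them and gets overwritten, whereas per-task context (contextvars) stays separate:

```python
import asyncio
import contextvars
import threading

# Toy stand-ins for "current trace node" storage; not the SDK's real code.
thread_local = threading.local()
current_node = contextvars.ContextVar("current_node", default="<none>")

async def handle(name: str) -> None:
    # Both coroutines run on the same OS thread, so they share thread_local.
    thread_local.node = name
    current_node.set(name)
    await asyncio.sleep(0)  # yield to the event loop; the other coroutine runs now
    # After resuming, the thread-local value may belong to the other request,
    # while the context variable still holds this coroutine's own value.
    print(f"{name}: thread_local={getattr(thread_local, 'node', None)!r}, "
          f"contextvar={current_node.get()!r}")

async def main() -> None:
    await asyncio.gather(handle("request-A"), handle("request-B"))

asyncio.run(main())
# Typical output:
#   request-A: thread_local='request-B', contextvar='request-A'
#   request-B: thread_local='request-B', contextvar='request-B'
```

In this toy model, the "current node" that request-A sees after the await actually belongs to request-B, which mirrors how the agent ends up treating the trace as corrupted and never reports a response code.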