StefanL
Dynatrace Participant

Self Service Summary

If you are running Dynatrace Operator version 1.3.0 or newer, you may see frequent restarts or crash loops of the CSI driver Pods in your cluster. This is caused by high system load, most likely originating from a large number of mounts per node.

Issue: The CSI driver Pod (server container) is crashing frequently because of a failing liveness probe.
Solution: Adjust the liveness probe parameters and resources of the server container.
Tasks: Change or remove the resource requests/limits of the CSI driver in the Helm values file of the Dynatrace Operator deployment. Adjust the parameters of the liveness probe in the cluster.
Alternative(s): Limit the number of simultaneous volume mounts through other means (e.g. by reducing the maximum number of pods per node).

 

Symptoms

You might see some or all of the following symptoms:

  • High system load.

  • High CPU usage of kubelet.

  • Many pods/containers are waiting for volume provisioning by the CSI driver.

  • CSI driver server container gets killed due to a failing liveness probe.

    • The CSI driver eventually ends up in a crash loop and its startup is deferred due to back-off.

  • csi.sock is missing.

    • This is visible in FailedMount events on the affected Pods.
    • And in the logs of the liveness-probe container of the CSI driver Pod (see the commands below).
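
To check for the last two symptoms, you can query the cluster directly. A minimal sketch, assuming the default dynatrace namespace and DaemonSet name:

# List FailedMount events across all namespaces
kubectl get events -A --field-selector reason=FailedMount

# Inspect the logs of the liveness-probe container of a CSI driver Pod
kubectl logs -n dynatrace daemonset/dynatrace-oneagent-csi-driver -c liveness-probe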

Cause

These symptoms are caused by a large number of simultaneous mount requests. Each mount puts load on the kubelet and the underlying system. When a certain threshold is reached, the CSI driver won't be able to set up the mounts for the pods due to resource contention on the node.
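
To get a sense of the mount pressure on a node, you can count the CSI volume mounts directly on the node. A rough sketch; the grep pattern matches the generic kubelet CSI mount path, not anything Dynatrace-specific:

# Run on the node itself (or in a privileged debug pod sharing the host mount namespace)
mount | grep 'kubernetes.io~csi' | wc -l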

In version 1.3.0 of the Dynatrace Operator, the internal configuration of the liveness probe was improved to be more effective. This change made the underlying problem more visible. In version 1.4.1, the parameters of the liveness probe were adjusted to further mitigate the problem.

Solution

Please consider updating the Dynatrace Operator to version 1.4.1 or higher, as this is the easiest way to mitigate the problem. If the problem still occurs, follow the guide below.
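
To check which Operator version is currently deployed before deciding on an upgrade, you can list the Helm release (assuming the release name and namespace used in the commands below):

# Show the deployed chart and app version of the Dynatrace Operator release
helm list -n dynatrace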

There are three things you can do to mitigate the problem:

  1. Adjust resource limits

  2. Tune timeouts of liveness probe

  3. Disable liveness probe (this is the last resort and not recommended)

Adjusting Resource Limits

You can adjust the resource limits for the server container of the CSI driver by adding the following to your Helm values.yaml file:

csidriver:
  server:
    resources:
      requests:
        cpu: 250m
        memory: 200Mi
      limits:
        cpu: 1000m
        memory: 200Mi

If the CSI driver still restarts frequently, you can try removing the resource limits completely:

csidriver:
  server:
    resources:
      requests:
        cpu: 250m
        memory: 200Mi

Apply the changes via Helm:

helm upgrade dynatrace-operator oci://public.ecr.aws/dynatrace/dynatrace-operator \
-n dynatrace --values <your-values-file.yaml>
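
After the upgrade, you can verify that the new resource settings were rolled out to the CSI driver DaemonSet:

# Print the resource requests/limits of the server container
kubectl get daemonset dynatrace-oneagent-csi-driver -n dynatrace \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="server")].resources}'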

Tune liveness probe timeouts

To tune the liveness probe timeouts, you need to adjust the DaemonSet directly, e.g. using kubectl. You also need to adjust the timeout argument of the liveness-probe container of the CSI driver Pod.

For GitOps use cases, you may need post-renderers or similar features, as adjusting the liveness probe via Helm values is not supported.
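
As an illustration, a common post-renderer setup wraps kustomize. The file names and the wrapper script below are assumptions, a sketch rather than an official recipe; the patch uses one of the fields shown in the excerpt further down:

#!/bin/sh
# kustomize-post-renderer.sh
# Helm pipes the fully rendered manifests to stdin; save them for kustomize.
cat > all.yaml
# Apply the patches from kustomization.yaml and print the result to stdout.
kustomize build .

# kustomization.yaml
resources:
  - all.yaml
patches:
  - target:
      kind: DaemonSet
      name: dynatrace-oneagent-csi-driver
    patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/livenessProbe/timeoutSeconds
        value: 10

The script is then passed to Helm with --post-renderer ./kustomize-post-renderer.sh on the helm upgrade command shown above.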

The following excerpt shows the relevant fields and values that need to be adjusted.

kind: DaemonSet
metadata:
  name: dynatrace-oneagent-csi-driver
spec:
  template:
    spec:
      containers:
      - name: server
        livenessProbe:
          initialDelaySeconds: 15
          periodSeconds: 15
          timeoutSeconds: 10
      - name: liveness-probe
        args:
        - --probe-timeout=9s

The following command adjusts these values using kubectl patch. Note that the container indices (0 for the server container, 3 for the liveness-probe container) and the argument index (2) refer to the default DaemonSet layout; verify them in your cluster before patching.
kubectl patch daemonset dynatrace-oneagent-csi-driver -n dynatrace --type='json' -p='[
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds",
    "value": 15
  },
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/0/livenessProbe/periodSeconds",
    "value": 15
  },
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds",
    "value": 10
  },
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/3/args/2",
    "value": "--probe-timeout=9s"
  }
]'
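
You can confirm that the patch took effect by printing the probe configuration of the server container:

kubectl get daemonset dynatrace-oneagent-csi-driver -n dynatrace \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'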

Removing the liveness probe

If increasing the timeouts isn't enough, you can disable the liveness probe entirely. We don't recommend this step, but it can resolve the problem. Keep in mind that manual changes to the DaemonSet may be reverted by a later helm upgrade of the Operator, so they might need to be re-applied.

kubectl patch daemonset dynatrace-oneagent-csi-driver -n dynatrace --type='json' -p='[
  {
    "op": "remove",
    "path": "/spec/template/spec/containers/0/livenessProbe"
  },
  {
    "op": "remove",
    "path": "/spec/template/spec/containers/3"
  }
]'
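
As before, you can verify the result; after this patch, the liveness-probe container should no longer appear in the container list:

kubectl get daemonset dynatrace-oneagent-csi-driver -n dynatrace \
  -o jsonpath='{.spec.template.spec.containers[*].name}'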

 

Further Support

If the problem could not be resolved by any of the measures in this guide, please contact our support and provide the Operator support archive.
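
The support archive can be generated with the Operator's support-archive subcommand. A sketch, assuming the default namespace and deployment name; please check the Dynatrace documentation for the exact flags of your Operator version:

# Generate a support archive and stream it to a local zip file
kubectl exec -n dynatrace deployment/dynatrace-operator -- \
  dynatrace-operator support-archive --stdout > operator-support-archive.zip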
