Published on 18 Mar 2025 02:02 PM, last edited on 31 Mar 2025 08:53 AM by Michal_Gebacki
If you are running Dynatrace Operator version 1.3.0 or newer, you may see frequent restarts or crash loops of the CSI driver pods in your cluster. This is caused by high system load, which most likely originates from a large number of mounts per node.
| Issue | Solution | Tasks | Alternative(s) |
| --- | --- | --- | --- |
| The CSI driver Pod (server container) crashes frequently because of a failing liveness probe. | Adjust the liveness probe parameters and resources of the server container. | Change or remove the resource requests/limits of the CSI driver in the Helm values file of the Dynatrace Operator deployment. Adjust the parameters of the liveness probe in the cluster. | Limit the number of simultaneous volume mounts through other means (e.g. by reducing the maximum number of pods per node). |
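The alternative listed in the table, reducing the maximum number of pods per node, depends on how your nodes are provisioned. As one illustration (not part of the Dynatrace configuration), a kubelet configuration file could cap the pod count like this; the value 80 is only an example:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 80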
You might see some or all of the following symptoms:
High system load.
High CPU usage of kubelet.
Many pods/containers are waiting for volume provisioning by the CSI driver.
The CSI driver server container gets killed due to a failing liveness probe.
The CSI driver eventually ends up in a crash loop and its startup is deferred due to back-off.
The csi.sock socket file is missing.
These symptoms are caused by a large number of simultaneous mount requests. Each mount puts load on the kubelet and the underlying system. When a certain threshold is reached, the CSI driver won't be able to set up the mounts for the pods due to resource contention on the node.
In version 1.3.0 of the Dynatrace Operator, the internal configuration of the liveness probe was improved to be more effective. This change made the underlying problem more visible. In version 1.4.1, the parameters of the liveness probe were adjusted to further mitigate the problem.
Please consider updating the Dynatrace Operator to version 1.4.1 or higher, as this is the easiest way to mitigate the problem. If the problem still occurs, follow the guide below.
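If the operator was installed with Helm, the update could look like the sketch below; without an explicit --version flag, Helm pulls the latest chart version. The namespace and values file name are placeholders from a default installation:

helm upgrade dynatrace-operator oci://public.ecr.aws/dynatrace/dynatrace-operator \
  -n dynatrace --values <your-values-file.yaml>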
There are three things you can do to mitigate the problem:
Adjust resource limits
Tune timeouts of liveness probe
Disable liveness probe (this is the last resort and not recommended)
You can adjust the resource limits for the server container of the CSI driver by adding the following to your Helm values.yaml file:
csidriver:
  server:
    resources:
      requests:
        cpu: 250m
        memory: 200Mi
      limits:
        cpu: 1000m
        memory: 200Mi
If the CSI driver still restarts frequently, you can try to remove the resource limits completely:
csidriver:
  server:
    resources:
      requests:
        cpu: 250m
        memory: 200Mi
Apply the changes via Helm:
helm upgrade dynatrace-operator oci://public.ecr.aws/dynatrace/dynatrace-operator \
-n dynatrace --values <your-values-file.yaml>
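Afterwards, you can check that the CSI driver DaemonSet rolls out cleanly; this assumes the default dynatrace namespace and the DaemonSet name used later in this guide:

kubectl rollout status daemonset dynatrace-oneagent-csi-driver -n dynatrace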
To tune the liveness probe timeouts, you need to adjust the DaemonSet directly, e.g. using kubectl. You will also need to adjust the --probe-timeout argument of the liveness-probe container of the CSI driver pod.
For the GitOps use case, you may need post-renderers or similar features, as adjusting the liveness probe via Helm values is not supported (see the post-renderer sketch after the patch example below).
The following excerpt shows the relevant fields and values that need to be adjusted.
kind: DaemonSet
metadata:
  name: dynatrace-oneagent-csi-driver
spec:
  template:
    spec:
      containers:
        - name: server
          livenessProbe:
            initialDelaySeconds: 15
            periodSeconds: 15
            timeoutSeconds: 10
        - name: liveness-probe
          args:
            - --probe-timeout=9s
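You can apply these values with a single kubectl patch. The patch below addresses the containers by index (0 for server, 3 for liveness-probe), matching a default deployment; since the order may differ in your cluster, it is worth confirming it first:

kubectl get daemonset dynatrace-oneagent-csi-driver -n dynatrace \
  -o jsonpath='{.spec.template.spec.containers[*].name}'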
kubectl patch daemonset dynatrace-oneagent-csi-driver -n dynatrace --type='json' -p='[
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds",
    "value": 15
  },
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/0/livenessProbe/periodSeconds",
    "value": 15
  },
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds",
    "value": 10
  },
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/3/args/2",
    "value": "--probe-timeout=9s"
  }
]'
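If you manage the deployment through GitOps as mentioned above, the same adjustment can be applied during rendering with a Helm post-renderer instead of patching the live DaemonSet. The following is a minimal sketch using kustomize; the file names and the wrapper script are illustrative and not part of the official chart. Extend the patch list with the remaining fields from the excerpt above as needed.

kustomization.yaml:

resources:
  - all.yaml
patches:
  - target:
      kind: DaemonSet
      name: dynatrace-oneagent-csi-driver
    patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/livenessProbe/timeoutSeconds
        value: 10

post-renderer.sh (must be executable):

#!/bin/sh
# Reads the rendered manifests from stdin, applies the kustomize patch,
# and writes the result to stdout for Helm to install.
cat > all.yaml
kubectl kustomize .

Pass the script to Helm when upgrading:

helm upgrade dynatrace-operator oci://public.ecr.aws/dynatrace/dynatrace-operator \
  -n dynatrace --values <your-values-file.yaml> --post-renderer ./post-renderer.sh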
If increasing the timeouts isn't enough, you can completely disable the liveness probe. We don't recommend this step, but it can solve the problem.
kubectl patch daemonset dynatrace-oneagent-csi-driver -n dynatrace --type='json' -p='[
  {
    "op": "remove",
    "path": "/spec/template/spec/containers/0/livenessProbe"
  },
  {
    "op": "remove",
    "path": "/spec/template/spec/containers/3"
  }
]'
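To confirm the probe was removed, you can inspect the server container; the command below should print nothing once the patch has been applied (assuming the default dynatrace namespace):

kubectl get daemonset dynatrace-oneagent-csi-driver -n dynatrace \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'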
If the problem could not be resolved by any of the measures in this guide, please contact our support and provide the operator support archive.