We had an outage last week, and the root cause was unfortunately identified as the Dynatrace OneAgent ☹, which raised a number of concerns among senior stakeholders. I've raised a support ticket, but any input that helps us understand whether Dynatrace really was the cause would be massively helpful.
We have Dynatrace enabled on our Kubernetes cluster to monitor a Strapi application with a MongoDB database. It broke last week at around 1:00 AM on Wednesday night, when Google did some patching and upgraded the cluster nodes. The cause was identified as the Dynatrace OneAgent preventing the Mongo Ops-Manager from starting. A chain of events caused an unexpected upgrade, which we suspect introduced the issue. We involved support from Google and MongoDB in the morning, and below is the error line (a segfault) that they found in the serial logs. The library file liboneagentgo.so appears to be part of the Dynatrace OneAgent monitoring service, so the segfault is likely related to that service.
[33454.936358] mmsconfiguratio: segfault at 4 ip 00007fd75cb01b1b sp 00007ffdf64b1c80 error 4 in liboneagentgo.so[7fd75c9be000+aaa000]
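For anyone who wants to pull the details out of that line, here is a minimal Python sketch that parses the kernel segfault message and computes the offset of the faulting instruction inside the library (instruction pointer minus the mapping base). The function name `parse_segfault` and the dictionary keys are my own, not part of any tool. It assumes the standard x86 Linux message format, where the `error` value is a page-fault bit mask (bit 0: page was present, bit 1: write access, bit 2: user mode). Note that the kernel truncates process names to 15 characters, which is why the name looks cut off.

```python
import re

# Pattern for the x86 Linux kernel segfault message, as seen in the log line above.
SEGFAULT_RE = re.compile(
    r"(?P<proc>\S+): segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in "
    r"(?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Parse one kernel segfault line; return a dict, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "process": m["proc"],                # truncated by the kernel to 15 chars
        "fault_address": int(m["addr"], 16), # address the process tried to access
        "library": m["lib"],
        "offset_in_library": ip - base,      # faulting instruction relative to mapping base
        "page_present": bool(err & 1),       # bit 0: protection fault vs. page not mapped
        "write_access": bool(err & 2),       # bit 1: write vs. read
        "user_mode": bool(err & 4),          # bit 2: fault happened in user space
    }


line = ("[33454.936358] mmsconfiguratio: segfault at 4 ip 00007fd75cb01b1b "
        "sp 00007ffdf64b1c80 error 4 in liboneagentgo.so[7fd75c9be000+aaa000]")
info = parse_segfault(line)
print(info["library"], hex(info["offset_in_library"]))
```

Read this way, the line describes a user-mode read of the unmapped address 0x4 by an instruction located inside the liboneagentgo.so mapping, which looks like a near-null pointer dereference. That tells you where the crash happened, not why, so it does not settle the root cause on its own; the offset is mainly something to hand to support.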
After this, Dynatrace was removed from the Kubernetes cluster, which allowed Ops-Manager to start successfully and brought our application back.
We don't know the exact root cause yet, but our DevOps team has a working theory: during the cluster node upgrade all the pods were recycled, and when the pods were restarted or rescheduled, a new OneAgent image (latest) was pulled in, which caused the whole impact.
I am not sure how true that is, but I don't have anything yet to prove them wrong, because disabling Dynatrace fixed the issue.
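One way to check whether the theory above is even possible is to look at the image references in the deployed manifests. Kubernetes defaults `imagePullPolicy` to `Always` when the tag is `latest` or missing, so a rescheduled pod pulls whatever that tag points to at the time. The sketch below flags such floating references. The helper names and the `registry.example.com` image names are hypothetical and only for illustration; they are not our real images.

```python
def image_tag(image):
    """Return the tag of a container image reference, or None if it has no tag."""
    name = image.split("@", 1)[0]       # drop a digest such as @sha256:...
    last = name.rsplit("/", 1)[-1]      # only the last path segment can carry a tag
    if ":" in last:
        return last.rsplit(":", 1)[1]
    return None


def is_floating(image):
    """True if the reference can resolve to a different image on the next pull."""
    if "@" in image:
        return False                    # pinned by digest
    tag = image_tag(image)
    return tag is None or tag == "latest"


# Hypothetical image references, for illustration only.
images = [
    "registry.example.com/monitoring/agent:latest",
    "registry.example.com:5000/monitoring/agent",
    "registry.example.com/monitoring/agent:1.2.3",
]
for image in images:
    print(image, "floating" if is_floating(image) else "pinned")
```

If the agent image turns out to be pinned to a fixed version or digest, the "new latest image" theory becomes much less likely and the node upgrade itself deserves a closer look.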
Does anyone have any idea what could have caused this, given that the evidence points at Dynatrace? Has anyone seen this before? Could it have been the latest image, or something else that triggered it?
Hello @agrawal_shashan ,
Without more information (logs, etc.) it is difficult to say why this is happening. This is certainly more of a support issue than a community post, though if we discover via support that there is a bug of some sort that might impact other users, we should definitely come back and respond so the community is aware.
But at first blush it's hard to say what's happening here. Hopefully engaging with the support team can reveal an answer.
We have faced this kind of issue on quite a few of our k8s deployments, mostly around OpenShift; however, it's not necessarily specific to Dynatrace.
What it is most likely related to is the way Dynatrace is deployed in the 'classic full stack' deployment method. This automatically injects into every namespace, and should any incompatibility occur, the affected application is going to fail to start.
There's no right answer here: Dynatrace is incompatible with their code, but they probably have an issue in their code that is being triggered by Dynatrace.
What we have also found in the past is that the SELinux policy module installed by the operator has a priority of 400, while most application-level SELinux policy modules have a priority of 100.
Questions were raised about why the Dynatrace module would come in at a priority of 400, as it takes over a lot of the default (kube-level) policies.
[ ~]# semodule --list-modules=full| head -10
400 dynatrace_oneagent_Policy pp
400 openvswitch-custom pp
200 container pp
100 abrt pp
100 accountsd pp
100 acct pp
100 afs pp
100 aiccu pp
100 aide pp
100 ajaxterm pp
[~]# semodule --list-modules=full| grep dynatrace
400 dynatrace_oneagent_Policy pp
This issue can potentially cause application-level and Kubernetes-level components to start failing during operations, upgrades, etc.
To get around this, we have had to add the following container exclusion rules:
- Do not monitor containers if Kubernetes namespace begins with twistlock (incompatible application)
- Do not monitor containers if Kubernetes namespace begins with smp
- Do not monitor containers if Kubernetes namespace begins with dnaenablement
- Do not monitor containers if Kubernetes namespace begins with kube
- Do not monitor containers if Kubernetes namespace begins with openshift
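To make the effect of these "begins with" rules concrete, here is a small Python sketch of the matching semantics only. It is not how Dynatrace evaluates the rules internally (they are configured in Dynatrace itself); the function name `is_monitored` and the sample namespaces are made up for illustration.

```python
# Prefixes taken from the exclusion rules listed above.
EXCLUDED_NAMESPACE_PREFIXES = ("twistlock", "smp", "dnaenablement", "kube", "openshift")


def is_monitored(namespace):
    """Return False if the namespace starts with any excluded prefix."""
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return not namespace.startswith(EXCLUDED_NAMESPACE_PREFIXES)


# Hypothetical namespaces, for illustration only.
for ns in ["kube-system", "openshift-monitoring", "twistlock", "strapi-prod"]:
    print(ns, "monitored" if is_monitored(ns) else "excluded")
```

Keep in mind that prefix rules are broad: a short prefix such as smp or kube also excludes any other namespace that happens to start with the same letters, so it is worth checking the namespace list before adding one.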
Overall, the best way to avoid this in future is to move to the cloud native (CSI driver) approach, where you specify which namespaces to monitor rather than monitoring everything by default. Otherwise, you really need to test in non-prod first, then move to prod.
Hope this helps.