14 Mar 2022 09:53 AM - edited 15 Mar 2022 04:06 PM
Reaching out to the community to maybe get feedback on an observation with OA 1.233:
Since OA was upgraded to 1.233 at one of my customers, we have been seeing erratic memory consumption on many hosts running Debian 10, virtualized on VMware.
What we observe is that OA locks an ever-increasing amount of Linux unreclaimable SLAB memory. In Dynatrace itself this is reported on the host as memory usage of "other processes". We can clearly track this down to OA 1.233, as we still have hosts running OA 1.227 where there is no such issue.
It manifests as a slow memory leak; it takes about 1-2 weeks until the host runs into memory problems.
The SLAB memory is freed immediately as soon as we terminate the "oneagentos" process, either via the OOM reaper or manually, and then grows back again.
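As a temporary workaround we restart the agent, which frees the SLAB memory until it grows back again. A rough sketch (assuming the systemd unit is oneagent.service, as it appears in the busctl output further down):
# systemctl restart oneagent.service
# grep SUnreclaim /proc/meminfo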
While analyzing the issue, we also found abnormalities in the system's dbus-daemon logs that look unhealthy (UID 998 is dtuser):
dbus-daemon[864]: [system] The maximum number of active connections for UID 998 has been reached (max_connections_per_user=256)
dbus-daemon[864]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 30000ms)
dbus-daemon[864]: [system] The maximum number of active connections for UID 998 has been reached (max_connections_per_user=256)
When running busctl, we also see that OneAgent seems to hold a large number of D-Bus connections (up to the default maximum of 256), which doesn't look good and could be a pointer to the source of the problem:
# busctl
NAME PID PROCESS USER CONNECTION UNIT SESSION DESCRIPTION
:1.0 1 systemd root :1.0 init.scope - -
:1.1 849 systemd-logind root :1.1 systemd-logind.service - -
:1.248267 4384 oneagentos dtuser :1.248267 oneagent.service - -
:1.248383 4384 oneagentos dtuser :1.248383 oneagent.service - -
:1.248497 4384 oneagentos dtuser :1.248497 oneagent.service - -
:1.248613 4384 oneagentos dtuser :1.248613 oneagent.service - -
:1.248726 4384 oneagentos dtuser :1.248726 oneagent.service - -
:1.248842 4384 oneagentos dtuser :1.248842 oneagent.service - -
:1.248955 4384 oneagentos dtuser :1.248955 oneagent.service - -
:1.249069 4384 oneagentos dtuser :1.249069 oneagent.service - -
:1.249184 4384 oneagentos dtuser :1.249184 oneagent.service - -
:1.249300 4384 oneagentos dtuser :1.249300 oneagent.service - -
:1.249413 4384 oneagentos dtuser :1.249413 oneagent.service - -
:1.249527 4384 oneagentos dtuser :1.249527 oneagent.service - -
:1.249645 4384 oneagentos dtuser :1.249645 oneagent.service - -
:1.249758 4384 oneagentos dtuser :1.249758 oneagent.service - -
:1.249872 4384 oneagentos dtuser :1.249872 oneagent.service - -
:1.249987 4384 oneagentos dtuser :1.249987 oneagent.service - -
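For reference, this is roughly how I counted the connections held by the agent (a sketch based on the column layout above, where USER is the fourth field):
# busctl --no-pager | awk '$4 == "dtuser"' | wc -l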
Has anyone observed something like this with OA 1.233? I have been trying to track this down with Dynatrace Support for three weeks now, but we haven't found a solution yet, so I'm trying swarm intelligence 🙂
Thanks!
14 Mar 2022 11:59 AM
We don't have Debian hosts among our customers, but I have a Debian 10 host with OneAgent running in our lab. I did not observe this behaviour there, but that host is not virtualized; it's a physical machine. It is currently already on 1.235, but I looked at the historical data and I don't see any pattern matching your scenario.
14 Mar 2022 12:05 PM
Thanks! We are not seeing this behavior on all Debian 10 machines either, but still on quite a few of them, and there is nothing obvious that they have in common. We see it in full-stack mode and in infra-only mode. What I do know is that there are two identical hosts (two MySQL DB nodes in infra-only mode): the one with 1.233 shows this issue, the one with 1.227 does not.
If we switch the agent versions between the machines, the issue appears on the previously fine host and disappears on the other.
So it definitely has to do with the agent.
14 Mar 2022 12:39 PM
I also run a MySQL DB, and I forgot about another host running on VMware, but that one runs Debian sid and isn't updated much, so I'd say it sits somewhere between Debian 10 and Debian 11. It also does not suffer from the situation you describe.
I'd check whether you are running some hardening tool. Debian 10 and newer have AppArmor enabled by default, and I have already had some issues with Dynatrace OneAgent and AppArmor.
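If you want to rule that out quickly, something like this shows whether AppArmor is active and whether it has denied anything recently (a hedged sketch; aa-status is part of the apparmor-utils package):
# aa-status
# journalctl -k | grep -i apparmor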
14 Mar 2022 05:19 PM - edited 14 Mar 2022 05:39 PM
We are probably seeing the same problem on some of our SLES Linux systems. The dbus-daemon uses a lot of memory (slowly increasing, up to about 7.7 GB) and up to about 2% CPU.
I have only seen the daemon using this much memory for about three weeks. Maybe it came with a recent Dynatrace update?
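If anyone wants to check their own hosts, something like this shows the daemon's resident memory and CPU usage (a hedged example, not an official check):
# ps -C dbus-daemon -o pid,rss,%cpu,etime,comm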
With the command 'journalctl -u dbus.service | less' we see output like the following on some hosts:
Mar 14 18:08:20 lnx52610 dbus[1337]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 30001ms)
Mar 14 18:09:00 lnx52610 dbus[1337]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 30007ms)
or
Mar 14 15:59:34 lnx52607 dbus[1234]: [system] Rejected: destination has a full message queue, 0 matched rules; type="signal", sender=":1.2467692" (uid=0 pid=1 comm="/usr/lib/systemd/systemd --system --deserialize 17") interface="org.freedesktop.systemd1.Manager" member="JobRemoved" error name="(unset)" requested_reply="0" destination="org.freedesktop.DBus" (uid=103 pid=12928 comm="oneagentos -Dcom.compuware.apm.WatchDogPort=50000 ")
Mar 14 15:59:34 lnx52607 dbus[1234]: [system] Rejected: destination has a full message queue, 0 matched rules; type="signal", sender=":1.2467692" (uid=0 pid=1 comm="/usr/lib/systemd/systemd --system --deserialize 17") interface="org.freedesktop.systemd1.Manager" member="JobRemoved" error name="(unset)" requested_reply="0" destination="org.freedesktop.DBus" (uid=103 pid=12928 comm="oneagentos -Dcom.compuware.apm.WatchDogPort=50000 ")
The comm="oneagentos -Dcom.compuware.apm.WatchDogPort=50000 " part shows that this comes from Dynatrace.
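To get a feeling for how often this happens on a host, a simple count over the journal works, for example over the last day (a hedged example, adjust the time window as needed):
# journalctl -u dbus.service --since "24 hours ago" | grep -c oneagentos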
14 Mar 2022 05:41 PM
Thanks @matthias_dillie for sharing! This indeed smells similar.
Do you maybe have the output of the “busctl” command (if available)? It would be interesting to see whether it also shows 256 (the default maximum) connections from the oneagent service.
Or the output of “cat /proc/meminfo”?
15 Mar 2022 04:53 PM
Hi,
we know about the issue and have a hotfix ready. If possible, please upgrade OneAgent to version 1.233.190.20220314-111259 or 1.235.200.20220314-114145.
Best Regards
Mateusz
15 Mar 2022 03:02 PM
Hmm, maybe we have a "me too" here.
We noticed a Dynatrace cluster node going down with an out-of-memory condition and not coming back up until the whole box was restarted. We couldn't free the memory. It looks like the self-monitoring agent, version 1.232.
Linux xxx 4.12.14-122.110-default #1 SMP Tue Feb 1 10:09:15 UTC 2022 (0369cb6) x86_64 x86_64 x86_64 GNU/Linux
# cat /proc/meminfo
MemTotal: 36859688 kB
MemFree: 1903028 kB
MemAvailable: 7438300 kB
Buffers: 509836 kB
Cached: 5726568 kB
SwapCached: 2380 kB
Active: 20100716 kB
Inactive: 5791508 kB
Active(anon): 16349704 kB
Inactive(anon): 3722240 kB
Active(file): 3751012 kB
Inactive(file): 2069268 kB
Unevictable: 4558152 kB
Mlocked: 4558152 kB
SwapTotal: 522236 kB
SwapFree: 501492 kB
Dirty: 2160 kB
Writeback: 0 kB
AnonPages: 24129376 kB
Mapped: 380568 kB
Shmem: 364936 kB
Slab: 3963752 kB
SReclaimable: 299132 kB
SUnreclaim: 3664620 kB
KernelStack: 42864 kB
PageTables: 66308 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 18952080 kB
Committed_AS: 31964524 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 22566912 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 4071232 kB
DirectMap2M: 25288704 kB
DirectMap1G: 10485760 kB
busctl shows 103 lines with oneagentos...
15 Mar 2022 04:54 PM
Hi,
we know about the issue and have a hotfix ready. If possible, please upgrade OneAgent to version 1.233.190.20220314-111259 or 1.235.200.20220314-114145.
Is this also Debian or a Debian-based distribution, as in the other cases, or a different distro?
Best Regards
Mateusz
15 Mar 2022 05:49 PM
It's SLES 12.5.
15 Mar 2022 03:29 PM
UPDATE
As we are seeing this on multiple independent occasions and in multiple deployments, I have changed the title of this post to a WARNING. I haven't found an official statement in the release notes or from Dynatrace Support yet.
But our observations and the feedback here seem to confirm that there is a problem. Please be careful when upgrading your agents, or perform these checks if you have already upgraded and see high memory usage. Note that these are not remediation steps; it's just how I verified the problem (a combined check is sketched after the steps below):
Check the unreclaimable SLAB memory:
# cat /proc/meminfo | grep SUnreclaim
If this is exceptionally high, check the D-Bus sessions:
# busctl
If you see lots of sessions from oneagentos, you might have the same issue.
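If you want to script these checks across hosts, a minimal sketch (the 1 GB threshold is my own arbitrary assumption, adjust it for your environment):
#!/bin/sh
# flag hosts where unreclaimable SLAB is suspiciously high and count agent D-Bus connections
sunreclaim_kb=$(awk '/^SUnreclaim:/ {print $2}' /proc/meminfo)
if [ "$sunreclaim_kb" -gt 1048576 ]; then
    echo "WARNING: SUnreclaim is ${sunreclaim_kb} kB"
    echo "oneagentos dbus connections: $(busctl --no-pager | grep -c oneagentos)"
fi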
Hope this prevents you from stumbling into issues.
15 Mar 2022 03:51 PM
I would be very interested to know if anyone has seen this issue occurring on RHEL 7 or 8.
15 Mar 2022 04:50 PM
Hi,
the issue is caused by OneAgent gathering data via sd-bus. A hotfix is already available (version 1.233.190.20220314-111259 or 1.235.200.20220314-114145). We have released it as a hotfix to a few customers and will release it globally as soon as possible.
If you provide your support ticket ID, I will check whether the version with the hotfix is already available for your environment.
Best Regards
Mateusz
16 Mar 2022 08:53 AM - last edited on 25 Mar 2022 08:55 AM by MaciejNeumann
Thanks Mateusz!
Dynatrace Support Team: it would be REALLY helpful to get such information more proactively. We reported this issue three weeks ago and got almost no information or details on our support ticket, although this is definitely a SEVERE case.
Even if it can't be solved right away, some communication like "we have indications that this was introduced with 1.xxx and is caused by yyyy; a hotfix is in progress..." would be helpful.
Instead we got silence for six days, and I had to use my personal contacts in upper Dynatrace management to get a (then almost immediate) answer, update, and hotfix. That is not very satisfying.
kr,
Reinhard
16 Mar 2022 10:41 AM - last edited on 25 Mar 2022 08:56 AM by MaciejNeumann
Thanks Reinhard!
Dynatrace Support Team: do we also have this problem with version 1.231.288?
If so, will a hotfix be provided for it as well?
kr
Christian
17 Mar 2022 11:43 AM
Hi Chris,
no, agent 1.231 isn't affected; only 1.233 and 1.235.
Best Regards
Mateusz
16 Mar 2022 10:11 AM
Agreed. We will contact our CSM; this is not acceptable and needs to be addressed. Information needs to be pushed to customers; we cannot read all the posts, blogs, etc. to be notified about this.
Greetings
Markus