[WARNING] OneAgent 1.232/1.233 can cause a slow memory leak on Linux - be careful when upgrading!

r_weber
DynaMight Champion

Reaching out to the community to maybe get feedback on an observation with OA 1.233:

 

Since OA was upgraded to 1.233 at one of my customers, we are seeing erratic memory consumption on many hosts that run Debian 10, virtualized on VMware.

What we observe is that OA locks an ever-increasing amount of unreclaimable Linux slab memory. In Dynatrace itself this is reported on a host as memory usage of "other processes". We can clearly trace this to OA 1.233, as we still have hosts running OA 1.227 where there is no such issue.

It manifests as a slow memory leak, as it takes about 1-2 weeks until the host runs into memory issues:

[Image: r_weber_0-1647249905514.png]

The slab memory is freed as soon as the "oneagentos" process is terminated, either by the OOM reaper or manually, and then starts growing again.

Upon analyzing the issue we also found abnormalities in the system's dbus-daemon logs that look unhealthy (UID 998 is dtuser):

dbus-daemon[864]: [system] The maximum number of active connections for UID 998 has been reached (max_connections_per_user=256)
dbus-daemon[864]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 30000ms)
dbus-daemon[864]: [system] The maximum number of active connections for UID 998 has been reached (max_connections_per_user=256)

 

 

 

 

Running busctl, we also see that oneagentos seems to hold a large number of D-Bus connections (up to the maximum of 256), which doesn't look good and could point to the source of the problem:

# busctl
NAME                             PID PROCESS         USER             CONNECTION    UNIT                      SESSION    DESCRIPTION        
:1.0                               1 systemd         root             :1.0          init.scope                -          -                  
:1.1                             849 systemd-logind  root             :1.1          systemd-logind.service    -          -                  
:1.248267                       4384 oneagentos      dtuser           :1.248267     oneagent.service          -          -                  
:1.248383                       4384 oneagentos      dtuser           :1.248383     oneagent.service          -          -                  
:1.248497                       4384 oneagentos      dtuser           :1.248497     oneagent.service          -          -                  
:1.248613                       4384 oneagentos      dtuser           :1.248613     oneagent.service          -          -                  
:1.248726                       4384 oneagentos      dtuser           :1.248726     oneagent.service          -          -                  
:1.248842                       4384 oneagentos      dtuser           :1.248842     oneagent.service          -          -                  
:1.248955                       4384 oneagentos      dtuser           :1.248955     oneagent.service          -          -                  
:1.249069                       4384 oneagentos      dtuser           :1.249069     oneagent.service          -          -                  
:1.249184                       4384 oneagentos      dtuser           :1.249184     oneagent.service          -          -                  
:1.249300                       4384 oneagentos      dtuser           :1.249300     oneagent.service          -          -                  
:1.249413                       4384 oneagentos      dtuser           :1.249413     oneagent.service          -          -                  
:1.249527                       4384 oneagentos      dtuser           :1.249527     oneagent.service          -          -                  
:1.249645                       4384 oneagentos      dtuser           :1.249645     oneagent.service          -          -                  
:1.249758                       4384 oneagentos      dtuser           :1.249758     oneagent.service          -          -                  
:1.249872                       4384 oneagentos      dtuser           :1.249872     oneagent.service          -          -                  
:1.249987                       4384 oneagentos      dtuser           :1.249987     oneagent.service          -          -                  
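
To see at a glance which process holds the connections in a listing like the one above, the rows can be grouped by process name. The following is a minimal sketch, not an official Dynatrace or systemd tool: the function name `count_bus_connections` is made up for illustration, and it assumes the default `busctl` column layout shown above (unique connection name in column 1, process name in column 3).

```shell
# Hypothetical helper: count D-Bus connections per process.
# Reads `busctl` output on stdin. Only rows whose NAME is a unique
# connection name (starting with ":") are counted, so the header
# line and well-known service names are skipped.
count_bus_connections() {
    awk '$1 ~ /^:/ { n[$3]++ }
         END { for (p in n) print n[p], p }' | sort -rn
}

# Usage (on a host with systemd):
#   busctl | count_bus_connections | head
```

The process with the most connections is printed first, so a leaking client should show up at the top of the list.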

 

 

 

 

Has anyone observed something like this with OA 1.233? I've been trying to track this down with DT support for 3 weeks now, but we haven't found a solution yet, so I'm trying swarm intelligence 🙂

Thanks!

Certified Dynatrace Master, Dynatrace Partner - 360Performance.net
16 REPLIES

Julius_Loman
DynaMight Guru

We don't have Debian hosts among our customers, but I have a Debian 10 host with OneAgent running in our lab. I did not observe this behaviour there, but that host is not virtualized; it's a physical host. It is already on 1.235, but I looked at the historical data and don't see any pattern matching your scenario.

Certified Dynatrace Master | TEMPEST a.s., Slovakia, Dynatrace Master Partner

Thanks! We are also not seeing this behavior on all Debian 10 machines, but still on quite a few of them, and there is nothing very obvious that they have in common. We see this in fullstack mode and in infra-only mode. What I know is that there are two identical hosts (2 MySQL DB nodes in infra mode): one with 1.233 showing this issue, the other one with 1.227 not showing it.
If we switch the agent versions on the machines, the issue becomes visible on the previously fine host and disappears on the other.
So it definitely has to do with the agent.

Certified Dynatrace Master, Dynatrace Partner - 360Performance.net

I also run MySQL DB, and I forgot about another host running on VMware. It runs Debian sid but is not updated much, so I'd say it's somewhere between Debian 10 and Debian 11. It also does not suffer from the situation you describe.

I'd check whether you are running some hardening tool. Debian 10 and newer have AppArmor enabled by default, and I have already had some issues with Dynatrace OneAgent and AppArmor.

Certified Dynatrace Master | TEMPEST a.s., Slovakia, Dynatrace Master Partner

matthias_dillie
Advisor

We are probably seeing the same problem on some of our SLES Linux systems. The dbus-daemon uses a lot of memory (up to about 7.7 GB, slowly increasing) and up to about 2% CPU.
I have only seen the daemon using this much memory for about three weeks. Maybe it came with a recent Dynatrace update?

With the command 'journalctl -u dbus.service | less' we see on some hosts output like:

Mar 14 18:08:20 lnx52610 dbus[1337]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 30001ms)
Mar 14 18:09:00 lnx52610 dbus[1337]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 30007ms)

or

Mar 14 15:59:34 lnx52607 dbus[1234]: [system] Rejected: destination has a full message queue, 0 matched rules; type="signal", sender=":1.2467692" (uid=0 pid=1 comm="/usr/lib/systemd/systemd --system --deserialize 17") interface="org.freedesktop.systemd1.Manager" member="JobRemoved" error name="(unset)" requested_reply="0" destination="org.freedesktop.DBus" (uid=103 pid=12928 comm="oneagentos -Dcom.compuware.apm.WatchDogPort=50000 ")
Mar 14 15:59:34 lnx52607 dbus[1234]: [system] Rejected: destination has a full message queue, 0 matched rules; type="signal", sender=":1.2467692" (uid=0 pid=1 comm="/usr/lib/systemd/systemd --system --deserialize 17") interface="org.freedesktop.systemd1.Manager" member="JobRemoved" error name="(unset)" requested_reply="0" destination="org.freedesktop.DBus" (uid=103 pid=12928 comm="oneagentos -Dcom.compuware.apm.WatchDogPort=50000 ")

The comm="oneagentos -Dcom.compuware.apm.WatchDogPort=50000 " field shows that this comes from Dynatrace.
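
The log messages quoted in this thread can be tallied to see how often the bus daemon complains. A rough sketch follows; the function name `summarize_dbus_warnings` is hypothetical, and the three patterns are simply the message texts quoted here, which may differ between D-Bus versions.

```shell
# Hypothetical helper: tally the dbus-daemon messages quoted in this thread.
# Reads journal text on stdin and prints one summary line.
summarize_dbus_warnings() {
    awk '
        /maximum number of active connections/         { limit++ }
        /Connection has not authenticated soon enough/ { auth++ }
        /destination has a full message queue/         { queue++ }
        END {
            # Uninitialised awk counters print as 0 with %d.
            printf "connection_limit=%d auth_timeout=%d full_queue=%d\n",
                   limit, auth, queue
        }'
}

# Usage:
#   journalctl -u dbus.service | summarize_dbus_warnings
```

Non-zero counters only show that the bus daemon is under pressure; which client causes it still has to be read from the individual log lines or from `busctl`.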

Thanks @matthias_dillie for sharing! This does indeed smell similar.

Do you maybe have the output of the “busctl” command (if available)? It would be interesting to see whether that also shows 256 (default max) connections by the oneagent service.

Or a “cat /proc/meminfo”?

Certified Dynatrace Master, Dynatrace Partner - 360Performance.net

Hi,

We know about the issue and have a hotfix ready. If possible, please upgrade OneAgent to version 1.233.190.20220314-111259 or 1.235.200.20220314-114145.

Best Regards

Mateusz

TorstenHellwig
Organizer

Hmm, maybe we have a "me too" here.

We noticed a Dynatrace cluster node going down with out of memory and not coming back up until the whole box was restarted. We couldn't free the memory. It looks like the self-monitoring agent. Version 1.232.

Linux xxx 4.12.14-122.110-default #1 SMP Tue Feb 1 10:09:15 UTC 2022 (0369cb6) x86_64 x86_64 x86_64 GNU/Linux

 

# cat /proc/meminfo
MemTotal: 36859688 kB
MemFree: 1903028 kB
MemAvailable: 7438300 kB
Buffers: 509836 kB
Cached: 5726568 kB
SwapCached: 2380 kB
Active: 20100716 kB
Inactive: 5791508 kB
Active(anon): 16349704 kB
Inactive(anon): 3722240 kB
Active(file): 3751012 kB
Inactive(file): 2069268 kB
Unevictable: 4558152 kB
Mlocked: 4558152 kB
SwapTotal: 522236 kB
SwapFree: 501492 kB
Dirty: 2160 kB
Writeback: 0 kB
AnonPages: 24129376 kB
Mapped: 380568 kB
Shmem: 364936 kB
Slab: 3963752 kB
SReclaimable: 299132 kB
SUnreclaim: 3664620 kB
KernelStack: 42864 kB
PageTables: 66308 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 18952080 kB
Committed_AS: 31964524 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 22566912 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 4071232 kB
DirectMap2M: 25288704 kB
DirectMap1G: 10485760 kB

 

busctl shows 103 lines with oneagentos...

 

Hi,

We know about the issue and have a hotfix ready. If possible, please upgrade OneAgent to version 1.233.190.20220314-111259 or 1.235.200.20220314-114145.

Is this also Debian or a Debian-based distribution, as in the other cases, or a different distro?

Best Regards

Mateusz

It's SLES 12.5.

r_weber
DynaMight Champion

UPDATE

 

As we are seeing this on multiple independent occasions and deployments, I changed the title of this post to a WARNING. I haven't found an official statement in the release notes or from Dynatrace Support yet.
But our observations and the feedback here seem to confirm that there is a problem. Please be careful when upgrading your agents, or perform these checks if you have already upgraded and see high memory usage. Note that these are not remediation steps; this is just how I have verified the problem:

Check unreclaimable SLAB Memory:

# cat /proc/meminfo | grep SUnreclaim

If this is exceptionally high, check the D-Bus connections:

# busctl

If you see lots of connections from oneagentos, then you might have the same issue.
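
The two checks above can be combined into one small function. This is only a sketch under assumptions: the name `leak_check` and both default thresholds (5% of MemTotal, 100 connections) are arbitrary placeholders rather than values from Dynatrace. Like the steps above, it only verifies the symptom and does not remediate anything.

```shell
# Hypothetical helper combining both checks.
# stdin:  output of `busctl`
# $1:     meminfo file (default /proc/meminfo)
# $2:     SUnreclaim threshold in percent of MemTotal (placeholder default: 5)
# $3:     oneagentos connection threshold (placeholder default: 100)
leak_check() {
    meminfo=${1:-/proc/meminfo}
    slab_pct_limit=${2:-5}
    conn_limit=${3:-100}

    slab_kb=$(awk '$1 == "SUnreclaim:" { print $2 }' "$meminfo")
    total_kb=$(awk '$1 == "MemTotal:" { print $2 }' "$meminfo")
    slab_pct=$(( ${slab_kb:-0} * 100 / ${total_kb:-1} ))

    # Count unique bus names (":1.x") owned by the oneagentos process.
    conns=$(awk '$1 ~ /^:/ && $3 == "oneagentos" { n++ } END { print n + 0 }')

    echo "SUnreclaim: ${slab_kb:-0} kB (${slab_pct}% of MemTotal), oneagentos D-Bus connections: ${conns}"
    if [ "$slab_pct" -ge "$slab_pct_limit" ] && [ "$conns" -ge "$conn_limit" ]; then
        echo "Both values are above the thresholds: this host may be affected."
    else
        echo "Pattern not matched with the current thresholds."
    fi
}

# Usage:
#   busctl | leak_check
```

Because the leak grows slowly, a single run says little on a freshly restarted agent; comparing the printed values over several days is more telling.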

 

Hope this prevents you from stumbling into issues.

Certified Dynatrace Master, Dynatrace Partner - 360Performance.net

Enrico_F
DynaMight Pro

I would be very interested to know if anyone has seen this issue occurring on RHEL 7 or 8.

mateusz_marek
Dynatrace Enthusiast

Hi,

 

The issue is caused by gathering data via sdbus in OneAgent. A hotfix is already available (version 1.233.190.20220314-111259 or 1.235.200.20220314-114145). We have released it as a hotfix to a few customers and will release it globally as soon as possible.
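
A quick way to compare an installed agent version with the hotfix builds named above is to look at the third version component. The sketch below is an illustration built on assumptions: the function name `oneagent_hotfix_status` is hypothetical, and it assumes that any 1.233 build with a third component of at least 190, or any 1.235 build of at least 200, contains the fix. Other versions are reported as unknown rather than guessed.

```shell
# Hypothetical helper: classify a OneAgent version string against the
# hotfix builds mentioned in this thread (1.233.190 and 1.235.200).
# Assumption: the third dot-separated component orders builds within a release.
oneagent_hotfix_status() {
    release=$(printf '%s\n' "$1" | cut -d. -f1-2)
    build=$(printf '%s\n' "$1" | cut -d. -f3)

    # Refuse to guess when the build component is missing or non-numeric.
    case "$build" in
        ''|*[!0-9]*) echo "unknown"; return 0 ;;
    esac

    case "$release" in
        1.233) fixed_from=190 ;;
        1.235) fixed_from=200 ;;
        *)     echo "unknown"; return 0 ;;
    esac

    if [ "$build" -ge "$fixed_from" ]; then
        echo "hotfix-or-later"
    else
        echo "pre-hotfix"
    fi
}

# Usage:
#   oneagent_hotfix_status 1.233.190.20220314-111259
```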

 

If you provide a support ticket ID, I will check whether the version with the hotfix is already available for that environment.

 

Best Regards

Mateusz

Thanks Mateusz!

Dynatrace Support Team

It would be REALLY helpful to get such information more proactively. We reported this issue 3 weeks ago and got almost no information or details on our support ticket, although this is definitely a SEVERE case.

Even if it can't be solved right away, some communication like "we have pointers that this was introduced with 1.xxx and is caused by yyyy; a hotfix is in progress..." would be helpful.

Instead we got silence for 6 days, and I had to use my personal contacts in upper Dynatrace management to get a (then almost immediate) answer, update, and hotfix. That is not very satisfying.

kr,

Reinhard

Certified Dynatrace Master, Dynatrace Partner - 360Performance.net

Thanks Reinhard!


Dynatrace Support Team

 

Do we also have this problem with version 1.231.288?
If so, will a hotfix be provided for it as well?


kr
Christian

Hi Chris,

 

No, agent 1.231 isn't affected. Only 1.233 and 1.235 are.

Best Regards

Mateusz

Markus
Visitor

Agreed. We will contact our CSM. This is not acceptable and needs to be addressed. Information needs to be pushed to customers; we cannot read all the posts, blogs, etc. to find out about this.

 

Greetings

Markus