Re: Sudden jump in unmonitored hosts causing purepath chain breakages - what happened?

john_kennedy1 · ‎25 Jan 2021

We recently ran into a situation where the requests to unmonitored hosts jumped and have stayed at that level. This was noticed by some folks in AppDev when they were chasing some PurePaths and found many broken chains. Some research indicated that "something" happened on 12/10/20 around 4:00 AM. See chart below. Has anyone run into this type of situation? Of course "nothing" has changed. 🙂

ChadTurner · ‎25 Jan 2021

Was any host monitoring turned off from the Dynatrace perspective? What about the level of monitoring? Was the OneAgent reduced to infrastructure-only monitoring on some hosts?

-Chad

john_kennedy1 · ‎25 Jan 2021

Good question. We have reviewed both situations and this has not been identified as the issue. We look good from a monitoring perspective as well as what we have in Full Stack vs. Infra Mode.

john_kennedy1 · ‎26 Jan 2021

Hey Chad. Thanks for the response. Full stack monitoring is enabled on all servers and in fact we did patching this past weekend and full reboots were done.

kalle_lahtinen · ‎25 Jan 2021

Could it be that there's a component (LB/proxy/integration service/etc.) that's stripping the x-dynatrace headers, in effect terminating the PurePaths and making those requests look like they're towards unmonitored hosts? So for example connections towards domain names which resolve to load balancer VIPs and as such do not correspond to any specific server with OneAgent running..?
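One way to test this theory is to send the same request directly to a backend and through the LB, then compare the headers the backend actually received. Below is a minimal Python sketch of that comparison. The function name and the header list are assumptions for illustration (the only header name taken from this thread is x-dynatrace), and it assumes you have some way to capture the received headers, for example a debug endpoint or an access log on the backend.

```python
# Hypothetical helper for illustration; not part of any Dynatrace tooling.
# The header list is an assumption based on the header named in this thread.
TRACE_HEADERS = ("x-dynatrace",)


def dropped_trace_headers(sent, received, trace_headers=TRACE_HEADERS):
    """Return tracing headers that were sent but did not reach the backend.

    `sent` and `received` are dicts of header name -> value.
    HTTP header names are case-insensitive, so compare in lowercase.
    """
    sent_names = {name.lower() for name in sent}
    received_names = {name.lower() for name in received}
    return sorted(
        header.lower()
        for header in trace_headers
        if header.lower() in sent_names and header.lower() not in received_names
    )


if __name__ == "__main__":
    # Made-up example values: what the caller sent vs. what the backend logged.
    sent = {"X-dynaTrace": "test-tag", "Accept": "*/*"}
    received_via_lb = {"accept": "*/*", "x-forwarded-for": "10.0.0.1"}
    print(dropped_trace_headers(sent, received_via_lb))
```

If the header arrives when calling the backend directly but is missing when going through the LB, the intermediary is the likely place where the PurePath is being cut.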

john_kennedy1 · ‎25 Jan 2021

Another solid question, and that is where we have been concentrating. The issue seems to surround calls in and out of our Web Service Managed infrastructure, which is front-ended by a LB. We are running fully in AWS on a mix of EC2 and native services, and the ELB in this case is a CLB. We have reviewed changes at the time of the spike (approx. 12/10 at 4:00 AM) and there were no scheduled changes by our organization at that time.

Anonymous · ‎26 Jan 2021

Might this be new traffic? If you go back and look at the backtrace before that time, is the number of calls the same? Did the calls that previously went to an injected service drop and now show up here instead?
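The comparison being suggested can be sketched roughly as follows, assuming you can export call counts for two comparable time windows (one before and one after the jump). The function name, the 10% tolerance, and the example numbers are all made up for illustration; they are not taken from this environment.

```python
# Hypothetical sketch; names, threshold and sample counts are assumptions.
def classify_shift(before, after, tolerance=0.10):
    """Guess whether a rise in unmonitored calls is new traffic or shifted traffic.

    `before` and `after` are dicts with "monitored" and "unmonitored" call
    counts taken from comparable time windows.
    """
    total_before = before["monitored"] + before["unmonitored"]
    total_after = after["monitored"] + after["unmonitored"]
    if total_before == 0:
        return "no baseline"

    total_change = (total_after - total_before) / total_before
    monitored_delta = after["monitored"] - before["monitored"]
    unmonitored_delta = after["unmonitored"] - before["unmonitored"]

    if unmonitored_delta <= 0:
        return "no increase in unmonitored calls"
    # Total roughly flat and monitored calls dropped: existing calls moved over.
    if abs(total_change) <= tolerance and monitored_delta < 0:
        return "shifted"
    # Total grew and monitored calls did not drop: additional traffic.
    if total_change > tolerance and monitored_delta >= 0:
        return "new traffic"
    return "inconclusive"


if __name__ == "__main__":
    before = {"monitored": 1000, "unmonitored": 50}
    after = {"monitored": 600, "unmonitored": 450}
    print(classify_shift(before, after))
```

A "shifted" result would point toward something breaking the tagging on existing calls, while "new traffic" would point toward a new caller or a new target.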

john_kennedy1 · ‎26 Jan 2021

Hi Dante. This is an interesting thought. I'll work with the AppDev team on this. We only keep 10 days of traces in Prod, so going back further for comparison's sake is problematic.

Babar_Qayyum · ‎26 Jan 2021

Hello John K.

Can you check the backtrace for this traffic to know who made these calls?

Regards,

Babar

john_kennedy1 · ‎26 Jan 2021

Hi Babar. Good thought. The usual and expected suspects are the callers in all of our situations. We don't see any outliers.

john_kennedy1 · ‎26 Jan 2021

Hi Sanders. Thanks for the response. Full stack monitoring is enabled on all servers, and in fact we did patching this past weekend and full reboots were done.

ChadTurner · ‎26 Jan 2021

I would recommend opening a support ticket for this. That way support can dig deeper into the issue and hopefully find a resolution or explanation for this.

-Chad

john_kennedy1 · ‎26 Jan 2021

This has been done.

SUP-64416