Solved: OA unable to connect to SaaS with app-only and CSI driver on OpenShift

Enrico_F · ‎25 Jan 2023

We have encountered a puzzling issue when using app-only + CSI driver on OpenShift with SaaS...

Context:

On our OpenShift cluster, the targeted namespaces are not allowed internet access by default. The injected OA module therefore has to establish its initial connection through the routing AG deployed in the operator namespace, as there is nothing else it could fall back on to reach the SaaS environment.

The DynaKube's routing AG is running fine and successfully connected with the environment through an egress HTTP proxy running exclusively in the operator namespace.

We have annotated the DynaKube with

    feature.dynatrace.com/oneagent-ignore-proxy: "true"

to make sure that the OA module doesn't attempt to reach the environment via the configured proxy in the DynaKube as there is no egress proxy available in the targeted app namespaces.

The issue we're facing is that the injected OA module does not know about the DynaKube's routing AG and is therefore unable to reach the environment for its initial connect (the agent log contains an info message "Initial connect: connection to initial gateways failed" and several recurring warnings containing the text "heartbeat failed").

We were able to confirm that the "serverAddress" property in the "standalone" OA config in file

    /opt/dynatrace/oneagent-paas/agent/ruxitagentproc.conf

does not contain the routing AG's address, only the public remote SaaS endpoints exposed to the internet. This is unexpected, as that file is injected into the container's file system during execution of the "install-oneagent" init container and is expected to contain all OA communication endpoints known to the environment, including all connected routing AGs.
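For anyone who wants to repeat this check, here is a minimal Python sketch that looks for a given gateway host among the endpoints on the "serverAddress" line. The exact syntax of that line is an assumption (the sketch only looks for URL-shaped tokens), and the helper names `endpoint_hosts` and `contains_gateway` are hypothetical:

```python
import re
from urllib.parse import urlparse

# Matches http(s) URLs up to the next separator. The exact serverAddress
# syntax is an assumption, so we only look for URL-shaped tokens.
URL_PATTERN = re.compile(r"https?://[^\s;,{}*\"']+")


def endpoint_hosts(conf_text: str) -> set:
    """Return the hostnames of all URLs found on serverAddress lines."""
    hosts = set()
    for line in conf_text.splitlines():
        if line.strip().startswith("serverAddress"):
            for url in URL_PATTERN.findall(line):
                host = urlparse(url).hostname  # lower-cased by urlparse
                if host:
                    hosts.add(host)
    return hosts


def contains_gateway(conf_text: str, gateway_host: str) -> bool:
    """True if gateway_host is one of the serverAddress endpoints."""
    return gateway_host.lower() in endpoint_hosts(conf_text)
```

Inside the monitored container, this could be run as `contains_gateway(open("/opt/dynatrace/oneagent-paas/agent/ruxitagentproc.conf").read(), "<routing AG host>")`, with the placeholder replaced by the host name of your own routing AG.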

We could also confirm that the routing AG's address is perfectly reachable directly from within the monitored container via curl (hence, not a networking issue).
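If curl is not available in the container image, a plain TCP connect gives a rough equivalent. This is only a sketch: the function name `can_connect` is hypothetical, and a successful TCP connect shows that the port is reachable, not that the AG accepts the agent's traffic:

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False
```

Call it with the host and port of the routing AG's service as seen from the app namespace.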

Last but not least, we could also confirm that the DynaKube object has a field "status.connectionInfo.connectionHosts" containing the correct address of the routing AG along with all the public endpoints on SaaS...

It's the same behavior with different operator versions (we tested 0.8.0 and 0.10.2) and also with different SaaS environments (e.g. confirmed on a brand new demo environment with no previous config) and there's really nothing immediately suspicious in the operator or CSI pod logs.

We are currently running OpenShift 4.10.

Dynatrace support has done an initial analysis of the situation and agreed the behavior seems unexpected but is currently reaching out to R&D for further feedback.

TL;DR I wanted to reach out to the community to see if others have encountered similar problems when deploying app-only + CSI on K8s/OpenShift clusters that don't allow egress connectivity to SaaS in the targeted namespaces. If so, I would appreciate it if you could come forward, as I think it would help track down the root cause - assuming it's not some serious bug in the operator itself (for now I will try to give it the benefit of the doubt 🙂 ).

Julius_Loman · ‎25 Jan 2023

Does your k8s ActiveGate by any chance have routing disabled? Based on your description, though, the issue is with propagating the server address to the downloaded PaaS agent. What does the Deployment API say about your connection info?

Something similar happened to us even with Classic Full Stack Deployment during disaster recovery. The OAs in K8S use classic ActiveGates. At the time of the DR procedure, the ActiveGates were not yet known to the cluster (they were shut down), but K8S had already been started. So the OAs in K8S only received destinations for the cluster itself, which they had no network connection to, and the ActiveGates were unknown to them. Deleting the OA pods and restarting the apps helped - after the AGs came back online.
I'm not familiar with the details of how exactly the init container handles the PaaS agent deployment. If it's cached somehow, maybe a race condition happened? For example, the agents might have been downloaded before the k8s ActiveGate was actually known to the cluster.

Dynatrace Ambassador | Alanata a.s., Slovakia, Dynatrace Master Partner

Enrico_F · ‎26 Jan 2023

Thanks for chiming in @Julius_Loman - appreciated!

However, I think the case you describe is fundamentally different, as in our case the routing AG in OpenShift is already successfully connected to the environment long before the app pods are auto-injected. Besides, the AG's communication endpoint seems to be known to the operator, judging from the information it writes back into the status field of the DynaKube. Also, I can confirm the connected AG on OpenShift has the routing module enabled (verified via deployment status in the UI).

Also, in another case (with Managed and additional classic AGs outside of OpenShift) we observed that the initial connection info did indeed contain the endpoints for the classic AGs but was also missing all routing AGs specified in the DynaKube and running in the operator namespace... Hence, in that case the OA module always connected to one of our classic AGs for its initial connection (as they were reachable) but then immediately switched to the AG that is part of the network zone provided in the DynaKube and running in the operator namespace after it received an updated/correct communication endpoint list from the environment...

It looks like the issue is specific to AGs governed by the operator, as their communication endpoints are not included in the standalone/initial config of the injected OA module for some unknown reason... And I doubt this is by design, as that would completely defeat the purpose of having routing AGs deployed via the DynaKube CR and associated with a network zone...

The API /deployment/installer/agent/connectioninfo/endpoints always provides the correct connection info, including all connected routing AGs running in the operator namespace, in the correct order as per the (optional) network zone parameter.
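For reference, a rough Python sketch of how that endpoint can be queried. Several details are assumptions rather than something verified in this thread: the "/api/v1" prefix, the "networkZone" query parameter name, the "Api-Token" authorization scheme, and a plain-text response with endpoints separated by semicolons, commas or whitespace. The function names are hypothetical:

```python
import re
import urllib.request
from urllib.parse import urlencode

# Assumed full path; the thread only names the part after /api/v1.
API_PATH = "/api/v1/deployment/installer/agent/connectioninfo/endpoints"


def build_request(base_url, api_token, network_zone=None):
    """Build the GET request for the connection info endpoints."""
    url = base_url.rstrip("/") + API_PATH
    if network_zone:
        # Assumed parameter name for the optional network zone.
        url += "?" + urlencode({"networkZone": network_zone})
    headers = {"Authorization": f"Api-Token {api_token}"}
    return urllib.request.Request(url, headers=headers)


def parse_endpoints(body):
    """Split the response body into endpoints, keeping their order."""
    return [tok for tok in re.split(r"[;,\s]+", body.strip()) if tok]


def fetch_endpoints(base_url, api_token, network_zone=None):
    """Call the API and return the ordered list of endpoints."""
    req = build_request(base_url, api_token, network_zone)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_endpoints(resp.read().decode("utf-8"))
```

The order of the returned list is preserved on purpose, since the API returns the endpoints ordered by network zone. Comparing this list with the "serverAddress" value inside the container makes the missing routing AG entries easy to spot.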

I doubt it's a race condition as we deleted and re-deployed everything several times and also did several restarts of all involved pods in varying order (operator, CSI, webhook, AG, apps).

One other thing we noticed after updating the operator from 0.8.0 to 0.10.2 is that in the operator namespace there are new ConfigMap objects created with names

    <dynakube-name>-activegate-connection-info
    <dynakube-name>-oneagent-connection-info

which seem to hold the communication endpoint info for each DynaKube instance... But there again, we could verify the info is correct, as it contains the routing AGs from the DynaKube. However, when deleting the DynaKube, the associated ConfigMaps are never cleaned up, so we had to do that step manually in order to ensure a clean uninstall/reinstall of the operator (UPDATE: Looks like this is being addressed with PR#1480 for the operator master branch).

Enrico_F · ‎08 Feb 2023

After a long discussion with support it turns out that app-only + CSI requires either egress connectivity to the environment/tenant from targeted namespaces or a separate, dockerized AG that isn't governed by the operator running on the same cluster in order for any auto-injected pods to establish an initial connection to the environment.

Why this isn't mentioned anywhere in the docs and why we seem to be the first ones stumbling over it is a bit beyond me....

Anyway, RFE raised:

RFE: Improve OA initial-connect strategy for auto-injected pods in air-gapped target namespaces - Dy...