We have encountered a puzzling issue when using app-only + CSI driver on OpenShift with SaaS...
On our OpenShift cluster, the targeted namespaces have no internet access by default. The injected OA module therefore has to establish its initial connection through the routing AG deployed in the operator namespace; there is nothing else it could fall back on to reach the SaaS environment.
The DynaKube's routing AG is running fine and successfully connected with the environment through an egress HTTP proxy running exclusively in the operator namespace.
We have annotated the DynaKube with
Does your k8s ActiveGate by any chance have routing disabled? Based on your description, though, the issue seems to be with propagating the server address to the downloaded PaaS agent. What does the Deployment API say about your connection info?
Something similar happened to us during disaster recovery, even with a Classic Full Stack deployment. The OAs in K8s use classic ActiveGates. At the time of the DR procedure, the ActiveGates were not yet known to the cluster (they were shut down), but K8s had already been started. So the OAs in K8s only received cluster endpoints they had no network connection to, and the ActiveGates were unknown to them. Deleting the OA pods and restarting the apps helped once the AGs were back online.
I'm not familiar with the details of how exactly the init-container handles the PaaS agent deployment. If it's cached somehow, maybe a race condition happened, with the agents being downloaded before the k8s ActiveGate was actually known to the cluster?
Thanks for chiming in @Julius_Loman - appreciated!
However, I think your described case is fundamentally different as in our case the routing AG in OpenShift is already successfully connected to the environment long before the app pods are auto-injected. Besides, the AG's communication endpoint seems to be known to the operator judging from the information it writes back into the status field of the DynaKube. Also, I can confirm the connected AG on OpenShift has the routing module enabled (verified via deployment status in the UI).
Also, in another case (with Managed and additional classic AGs outside of OpenShift) we observed that the initial connection info did contain the endpoints for the classic AGs but was missing all routing AGs specified in the DynaKube and running in the operator namespace. In that case the OA module always used one of our classic AGs for its initial connection (as they were reachable), then immediately switched to the AG that is part of the network zone provided in the DynaKube and running in the operator namespace once it received an updated, correct communication endpoint list from the environment.
It looks like the issue is specific to AGs governed by the operator: for some unknown reason, their communication endpoints are not included in the standalone/initial config of the injected OA module. I doubt this is by design, as that would completely defeat the purpose of having routing AGs deployed via the DynaKube CR and associated with a network zone.
The API /deployment/installer/agent/connectioninfo/endpoints always provides the correct connection info, including all connected routing AGs running in the operator namespace, in the correct order as per the (optional) network zone parameter.
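For anyone who wants to repeat that check, here is a minimal Python sketch of how the endpoint list could be queried and compared against the AG hosts you expect. It is only a sketch under assumptions: the `/api/v1` prefix, the `networkZone` query parameter, the `Api-Token` authorization header and the separator used in the plain-text response are how I understand the Deployment API, and all host names below are made up for illustration.

```python
import re
import urllib.request
from urllib.parse import urlencode, urlsplit

# Assumed full path of the endpoint mentioned above (the /api/v1 prefix is an assumption).
ENDPOINTS_PATH = "/api/v1/deployment/installer/agent/connectioninfo/endpoints"


def build_endpoints_url(env_url, network_zone=None):
    """Build the request URL, optionally scoped to a network zone."""
    url = env_url.rstrip("/") + ENDPOINTS_PATH
    if network_zone:
        # "networkZone" is the assumed name of the optional parameter.
        url += "?" + urlencode({"networkZone": network_zone})
    return url


def parse_endpoints(body):
    """Split a plain-text endpoint list; the separator is assumed to be
    ';', ',' or whitespace, so all three are accepted."""
    return [e for e in re.split(r"[;,\s]+", body.strip()) if e]


def fetch_endpoints(env_url, api_token, network_zone=None):
    """Fetch and parse the endpoint list (performs a real HTTP request)."""
    req = urllib.request.Request(
        build_endpoints_url(env_url, network_zone),
        headers={"Authorization": "Api-Token " + api_token},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_endpoints(resp.read().decode("utf-8"))


def missing_hosts(endpoints, expected_hosts):
    """Return the expected AG host names that do not appear in the list."""
    seen = {urlsplit(e).hostname for e in endpoints}
    return [h for h in expected_hosts if h not in seen]
```

Calling `missing_hosts(fetch_endpoints(env_url, token, zone), [...])` with the service names of the routing AGs should return an empty list if the API side is fine, which is what we saw; the same helper can be pointed at whatever endpoint list ends up in the injected OA module's config to spot the difference.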
I doubt it's a race condition as we deleted and re-deployed everything several times and also did several restarts of all involved pods in varying order (operator, CSI, webhook, AG, apps).
One other thing we noticed after updating the operator from 0.8.0 to 0.10.2 is that in the operator namespace there are new ConfigMap objects created with names
which seem to hold the communication endpoint info for each DynaKube instance. Here again, we could verify the info is correct, as it contains the routing AGs from the DynaKube. However, when deleting the DynaKube, the associated ConfigMaps are never cleaned up, so we had to do that step manually in order to ensure a clean uninstall/reinstall of the operator (UPDATE: Looks like this is being addressed with PR#1480 for the operator master branch).
After a long discussion with support, it turns out that app-only + CSI requires one of two things for any auto-injected pods to establish an initial connection to the environment: either egress connectivity to the environment/tenant from the targeted namespaces, or a separate, dockerized AG that isn't governed by the operator running on the same cluster.
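To see which of those endpoints a pod in a targeted namespace can actually reach, something like the following could be run from inside such a pod. It is a rough sketch, not an official tool: it only tests plain TCP reachability, ignores any HTTP proxy, and the function names are my own.

```python
import socket
from urllib.parse import urlsplit


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def check_endpoints(endpoints, timeout=3.0):
    """Map each endpoint URL to whether its host and port are reachable."""
    results = {}
    for ep in endpoints:
        parts = urlsplit(ep)
        # Fall back to the scheme's default port if none is given.
        port = parts.port or (443 if parts.scheme == "https" else 80)
        results[ep] = can_connect(parts.hostname, port, timeout)
    return results
```

If every entry comes back `False` for the environment/tenant URL and no AG outside the operator's control is listed, the injected OA module has nothing to make its initial connection to, which matches what we observed.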
Why this isn't mentioned anywhere in the docs and why we seem to be the first ones stumbling over it is a bit beyond me....
Anyway, RFE raised: