03 Jan 2023 07:46 AM - last edited on 03 Jan 2023 07:50 AM by MaciejNeumann
FYI, SaaS customers logging in via SSO are getting a 504 Gateway Time-out error.
Update: while SSO is broken, alert notifications are still being generated. User UI access is the only aspect affected.
The status portal is already showing the status of this issue:
We are aware of an issue causing an outage for logging in to our SaaS clusters. We're currently working to resolve this issue and will update here as soon as we have more information. This outage does not affect data processing, and there is no expected data loss.
The SSO team is aware of this issue and already working on a solution. As soon as we have more information, I'll post it here.
Here is the official information from Dynatrace:
The latest information from Dynatrace SaaS Status:
[Identified] Latest update: We are using all resources available to come to a resolution on this accessibility issue, however, the rebuild process is still working on completing. Data is still processing into respective tenants, and problems/notifications will still be triggered. If you have an API token already set you can reference this page below to access your problems list to not miss any important issues.
Access has been restored
Access is restored but dynatrace.status.io is still all red 😞
Not a good start to the year 😞
Dynatrace SaaS Status is green again. Here is the latest update:
[Monitoring] Services have been restored, and you should be able to log in to see your data again. We will continue to monitor this situation to ensure stability as we return to normal usage levels. We appreciate your patience while we worked to resolve this issue and apologize for the inconvenience it caused.
Login was unavailable during these times: 15:26 - 19:00 UTC on 1/3
Our web and mobile applications that are monitored by OneAgent experienced an authentication outage for the entire duration of the Dynatrace outage and only became available again once Dynatrace fixed their issue. We didn't expect an agent to impact our systems like this. Did anyone else experience issues with systems monitored by Dynatrace during the outage?
We are a Dynatrace partner and have access to multiple client tenants. We have not seen any problems caused by Dynatrace monitoring, at least so far.
Just to share in reply: we did not have any ActiveGate or OneAgent outages during this timeframe. We even went so far as to check the ActiveGate and OneAgent logs themselves, just to see if anything was throwing exceptions, retries, or errors. We didn't see any issues there at all. From the Kbps throughput on the ActiveGate egress, we knew data was still flowing.
I think the outage extended beyond the Dynatrace perimeter. I hope we hear the real root cause.
During the exact same period, the booking system of a European airline was unavailable.
Thanks for the response. Our outage window matched the Dynatrace window, so we are going to be pushing for detail on the root cause and also for confirmation of the real impact during their outage.
From the explanation available at dynatrace.status.io, and the emails received, it seems the problem originated in an update to the SSO service. I can't even imagine how that would relate to a problem at Ryanair, as Ryanair seems to be a New Relic client (I checked their RUM data).
BTW, the explanation is consistent with what we observed during the whole episode. Tenants were responding correctly, with multiple objects and XHR requests being served. We were also able to interact with data through our programs using the APIs, both exporting and ingesting data. Everything involving alarms also kept flowing, and the suggestion posted on dynatrace.status.io about using the problem API was very interesting...
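For anyone who wants to try the problem API suggestion the next time the UI login is down, here is a rough Python sketch, not an official Dynatrace example. It assumes the Problems API v2 endpoint (`/api/v2/problems`), an API token sent as `Authorization: Api-Token <token>`, a `problemSelector` of `status("open")`, and response fields named `displayId`, `title` and `status`; please check those against the API documentation for your tenant. The function names and the base URL placeholder are made up for illustration.

```python
import json
import urllib.parse
import urllib.request


def build_problems_request(base_url, api_token, problem_selector='status("open")'):
    """Build (but do not send) a GET request for the assumed Problems API v2 endpoint."""
    query = urllib.parse.urlencode({"problemSelector": problem_selector})
    url = f"{base_url.rstrip('/')}/api/v2/problems?{query}"
    # Assumption: the token is passed in the Authorization header with the Api-Token scheme.
    return urllib.request.Request(url, headers={"Authorization": f"Api-Token {api_token}"})


def summarize_problems(payload):
    """Return (displayId, title, status) for each entry in a decoded response body."""
    # Assumption: the response body has a top-level "problems" list.
    return [
        (p.get("displayId"), p.get("title"), p.get("status"))
        for p in payload.get("problems", [])
    ]


def fetch_open_problems(base_url, api_token):
    """Send the request and summarize the open problems (needs network access)."""
    request = build_problems_request(base_url, api_token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return summarize_problems(json.load(response))
```

Usage would be something like `fetch_open_problems("https://<your-environment>", "<your-token>")`, with both values replaced by your own. Building the request and parsing the response are kept separate so each part can be checked without calling the tenant.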
Must have been a bad coincidence (for me), or use of the same change/maintenance window 🙂
For a moment I had a flashback to the disruption caused by the Cloudflare outage in June.
There are really some strange coincidences out there...
But correlation is not causation 😂