03 Jan 2023 03:46 PM - last edited on 03 Jan 2023 03:50 PM by MaciejNeumann
FYI, SSO for SaaS customers is returning a 504 Gateway Time-out error.
03 Jan 2023 03:58 PM
Update: while SSO is broken, alert notifications are still being generated. User UI access is the only aspect affected.
03 Jan 2023 04:02 PM - edited 03 Jan 2023 04:03 PM
Will this be updated in the status portal?
Update: never mind.
03 Jan 2023 04:05 PM
For anyone who doesn't have the Status IO page: https://dynatrace.status.io/
03 Jan 2023 04:06 PM - edited 03 Jan 2023 04:26 PM
Hello everybody,
The status portal is already showing the status of this issue:
We are aware of an issue causing an outage for logging in to our SaaS clusters. We're currently working to resolve this issue and will update here as soon as we have more information. This outage does not affect data processing, and there is no expected data loss.
The SSO team is aware of this issue and already working on a solution. As soon as we have more information, I'll post it here.
03 Jan 2023 05:10 PM
Here is the official information from Dynatrace:
03 Jan 2023 06:39 PM - edited 03 Jan 2023 06:41 PM
The latest information from the Dynatrace SaaS status page:
[Identified] Latest update: We are using all resources available to come to a resolution on this accessibility issue; however, the rebuild process is still completing. Data is still processing into respective tenants, and problems/notifications will still be triggered. If you have an API token already set, you can reference the page below to access your problems list so that you do not miss any important issues.
dynatrace.com/support/help/dynatrace-api/environment-api/problems-v2/problems/get-problems-list
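For anyone who wants to script the suggestion above, here is a minimal Python sketch of pulling the problems list with an existing API token. It assumes a SaaS environment URL of the form `https://<your-environment-id>.live.dynatrace.com`, a token that is allowed to read problems, and the `/api/v2/problems` endpoint with `Api-Token` authorization as described on the linked page. The function names, the `now-2h` timeframe, and the response fields read here (`problems`, `displayId`, `status`, `title`) are illustrative assumptions, so check them against the documentation before relying on this.

```python
import json
import urllib.parse
import urllib.request


def build_problems_request(base_url, api_token, time_from="now-2h", page_size=50):
    """Build (but do not send) a GET request for the Problems API v2 list.

    base_url is assumed to look like https://<env-id>.live.dynatrace.com.
    """
    query = urllib.parse.urlencode({"from": time_from, "pageSize": page_size})
    url = f"{base_url.rstrip('/')}/api/v2/problems?{query}"
    headers = {
        "Authorization": f"Api-Token {api_token}",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers)


def summarize_problems(payload):
    """Turn a decoded response into one short line per problem.

    Field names are assumptions; missing fields fall back to placeholders.
    """
    return [
        f"{p.get('displayId', '?')} [{p.get('status', '?')}] {p.get('title', '')}"
        for p in payload.get("problems", [])
    ]


def fetch_problems(base_url, api_token, timeout=30):
    """Send the request and return the decoded JSON body."""
    request = build_problems_request(base_url, api_token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Usage would be something like `print("\n".join(summarize_problems(fetch_problems(base_url, token))))`. Since this only needs the tenant and a token, it keeps working while UI login through SSO is down.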
03 Jan 2023 06:42 PM
Access has been restored
03 Jan 2023 06:49 PM
Access is restored but dynatrace.status.io is still all red 😞
03 Jan 2023 06:55 PM
Not a good start to the year 😞
03 Jan 2023 07:35 PM
The Dynatrace SaaS status page is green again. Here is the latest update:
[Monitoring] Services have been restored, and you should be able to log in to see your data again. We will continue to monitor this situation to ensure stability as we return to normal usage levels. We appreciate your patience while we worked to resolve this issue and apologize for the inconvenience it caused.
Login was unavailable during these times: 15:26 - 19:00 UTC on 1/3
03 Jan 2023 08:03 PM
Our web and mobile applications that are monitored by OneAgent experienced an authentication outage for the entire duration of the Dynatrace outage and only became available again once Dynatrace fixed their issue. We didn't expect an agent to impact our systems like this. Did anyone else experience issues with systems monitored by Dynatrace during the outage?
03 Jan 2023 08:55 PM
We are a Dynatrace partner and have access to multiple client tenants. We have not seen any problems with Dynatrace monitoring, at least so far.
03 Jan 2023 08:57 PM
Just to share in reply: we did not have any ActiveGate or OneAgent outages during this timeframe. We even went so far as to check the ActiveGate and OneAgent logs themselves, just to see if anything was throwing exceptions, retries, or errors. We didn't see any issues there at all. From the Kbps throughput on the ActiveGate egress, we knew data was still flowing.
03 Jan 2023 08:38 PM - edited 04 Jan 2023 11:08 AM
I think the outage extended beyond the Dynatrace perimeter. I hope we hear the real root cause.
During the exact same period, the booking system of a European airline was unavailable.
03 Jan 2023 08:43 PM
Thanks for the response. Our outage window matched the Dynatrace window, so we are going to push for details on the root cause and for confirmation of the real impact during their outage.
03 Jan 2023 09:16 PM
From the explanation available at dynatrace.status.io, and the emails received, it seems the problem originated in an update to the SSO service. I can't even imagine how that would relate to a problem at Ryanair, as Ryanair seems to be a New Relic client (I checked their RUM data).
03 Jan 2023 09:20 PM
BTW, the explanation is consistent with what we observed during the whole episode. Tenants were responding correctly, with multiple objects and XHRs being served. We were also able to interact with data through our programs using the APIs, both exporting and ingesting data. Everything involving alarms also kept flowing, and the suggestion posted on dynatrace.status.io about using the problems API was very interesting...
04 Jan 2023 11:14 AM
It must have been a bad coincidence (for me), or use of the same change/maintenance window 🙂
For a moment I had a flashback to the disruption caused by the Cloudflare outage in June.
04 Jan 2023 02:17 PM
There are really some strange coincidences out there...
But correlation is not causation 😂