17 Sep 2024 12:53 PM - edited 17 Sep 2024 12:54 PM
Hello,
(First one replying, so reserve me a voucher!)
One of the big success stories happened with one of our customers, after many consultants had already tried to convince him and build something for him.
The request was to build a complex dashboard and some automation.
I was the last one to try, and the result was awesome.
We presented it to the CTO:
the dashboard was complete, complex and huge 😂.
It was displayed on all the screens, even in the monitoring area at the airport.
The second part was the automation, using the Python APIs...
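To give a flavour of that kind of automation, here's a minimal Python sketch that pulls open problems from the Environment API v2 (the tenant URL, token scope and field names are illustrative assumptions, not the exact script we built):

```python
import requests

# Illustrative values only - replace with your own tenant and API token
TENANT = "https://abc12345.live.dynatrace.com"  # hypothetical tenant URL
TOKEN = "dt0c01.XXXX"                           # token with problems.read scope (assumed)

def fetch_open_problems():
    """Return the list of currently open problems from the Environment API v2."""
    response = requests.get(
        f"{TENANT}/api/v2/problems",
        headers={"Authorization": f"Api-Token {TOKEN}"},
        params={"problemSelector": 'status("open")'},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("problems", [])

if __name__ == "__main__":
    # displayId and title are the fields we assume from the Problems API v2 response
    for problem in fetch_open_problems():
        print(problem["displayId"], problem["title"])
```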
This is just one of a long list of success stories.
17 Sep 2024 01:13 PM - edited 17 Sep 2024 01:25 PM
What a perfect Challenge!
Creative thinking - Success story 1: https://www.dynatrace.com/news/blog/monitoring-windows-workstations-with-dynatrace-an-it-helpdesk-ca...
Basically, the PC techs had been working on a crashing process for 2 weeks. So I tossed Dynatrace on the laptop, grabbed lunch, came back and reviewed the crash mini dump. I provided them the results: the process was running out of allocated memory. They raised the limit and the issue went away! Two weeks cut down to a one-hour lunch 🙂
Success story 2 - Using Synthetics to keep tabs on the competition 🙂 https://www.dynatrace.com/news/blog/staying-ahead-of-your-competition-with-dynatrace-synthetics/
Summary - we leveraged synthetics to test out the functionality of our direct competition. These tests provided a clear result as to who provided a faster customer experience. If we found our site to be slower, we then had the waterfall analysis as to how the competition did it 🙂
Success story 3 - Mainframe issues. https://www.dynatrace.com/news/blog/davis-diaries-mainframe-error-to-resolution-in-minutes/
Summary: Dynatrace detected the failures immediately, provided the data required to pinpoint the code that was causing the conflict, and allowed us to communicate within the problem card as the issue was reviewed and solved.
Success Story 4 - Dynatrace Davis is my Copilot!
As an employee of an insurance company, safe driving is paramount. Our commutes ranged from 30 minutes to an hour, from the PC techs up to the CEO. That got me thinking one day: with Davis and its voice interaction, why not build it into the car? That way, as I spend 30 minutes or more driving into work, I can be briefed on the events that transpired overnight. https://community.dynatrace.com/t5/Dynatrace-tips/Dynatrace-Davis-Power-in-your-Car/m-p/117000
These stories are just the tip, but are what made my name synonymous with Dynatrace.
17 Sep 2024 08:09 PM
This is always a great topic! I have given some testimonies in other challenges (here and here), so I will share some others. I'm going to focus on versatility and agility:
17 Sep 2024 10:18 PM - edited 25 Sep 2024 09:42 AM
So many to choose from, which one should I pick…
I’ll go with a story from 7 years ago, with Dynatrace version 140 or so. I was working with a company in the Netherlands, and there was a mission-critical application which had to be monitored: PeopleSoft Tuxedo.
As this application was written in a compiled language, we could not get transactional insight into it without something like our SDKs, and that was not possible to add to the application because the company did not own the code.
Using the CPU profiling in Dynatrace, I could see where the time was being spent and pinpoint which methods were responsible for the outgoing call. The next step was to get the Java libraries that communicated with PeopleSoft Tuxedo and cross-check their methods and parameters with what I saw in Dynatrace, to work out where the service call started, whether the call was successful, and which PeopleSoft Tuxedo service was called.
By entering this information into custom services, request attributes, error detection and service naming, it was possible to visualize the transactions within the service flow, and Davis could start baselining the calls.
The setup has since been reused by 10 or so customers to get insight into their calls to PeopleSoft Tuxedo. I've attached the configuration you need to make if you're using PeopleSoft Tuxedo and want to improve your insights.
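If you want a feel for what that kind of configuration looks like, here's a minimal Python sketch that creates a custom service entry point via the classic Configuration API (the tenant, token, class and method names are placeholders, not the actual Tuxedo library internals, and the payload fields should be double-checked against the API schema):

```python
import requests

TENANT = "https://abc12345.live.dynatrace.com"  # hypothetical tenant
TOKEN = "dt0c01.XXXX"                           # token with configuration write access (assumed)

# Hypothetical entry point: the Java method that starts the outgoing Tuxedo call.
custom_service = {
    "name": "PeopleSoft Tuxedo calls",
    "enabled": True,
    "queueEntryPoint": False,
    "rules": [
        {
            "enabled": True,
            "className": "com.example.tuxedo.TuxedoConnection",  # placeholder class
            "methodRules": [
                {
                    "methodName": "invokeService",                # placeholder method
                    "argumentTypes": ["java.lang.String"],
                    "returnType": "java.lang.Object",
                }
            ],
        }
    ],
}

# Custom services config endpoint for Java (field names assumed from the classic API schema)
resp = requests.post(
    f"{TENANT}/api/config/v1/service/customServices/java",
    headers={"Authorization": f"Api-Token {TOKEN}"},
    json=custom_service,
    timeout=30,
)
resp.raise_for_status()
print("Created custom service, HTTP", resp.status_code)
```

Request attributes, error detection and service naming would then be layered on top of this entry point, as described above.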
A bonus to the story: this took 2-3 hours to set up, after which the product owner told me that a large Dynatrace competitor at the time had spent 3 days attempting to provide any level of visibility, and then given up.
18 Sep 2024 07:08 AM
For me it is a story of how Dynatrace helped me with a private project. I was monitoring a mobile app 📱 and everything was running smoothly; there was no recent release and basically no changes in the app for some time. Then I got an unexpected crash rate increase notification 💥 in our Dynatrace Notification mobile app. I was curious what could cause a crash spike 📈 in my app without any changes... So I started up my laptop 👨💻, logged into Dynatrace and took a look at the crash reports sent in by the mobile agent 😎 monitoring the app. When analyzing the crash report and looking it up in the mobile app source code, I realized it was crashing because of a null value fetched from the backend database where it was not expected... So it seems that was some border case from data entered on the website that had never shown up until then and made all the apps crash. I was scared that I would need to rush and implement a mobile app fix immediately 😱, push it through review and have it released before all the users got annoyed and dropped bad reviews... But luckily I just fixed the bad DB entry in the backend and crashes went back to none - so I had enough time to hunt down the origin of the bug later and it did not ruin my weekend 🧘. The whole process, from notification to analyzing the crash and mitigating it, took me less than an hour. That's when I was really proud to have used something I helped develop.
18 Sep 2024 11:00 AM
A couple of years ago, a customer called me and the AE, alarmed that unusual activity had recently been detected within their network. The company had noticed that an unknown individual or group appeared to be entering their system, though no immediate harm had been identified. Despite attempts to investigate, the root cause or entry point remained elusive. The company was concerned about potential security breaches and needed a solution to identify and eliminate the vulnerability quickly.
We proposed the use of Dynatrace Application Security, a tool that integrates seamlessly with existing environments to detect vulnerabilities at the code and runtime level. The customer agreed to a trial of this solution, and our team provided immediate support to help them integrate and monitor their application security more effectively. Within a short period, Dynatrace Application Security was able to detect a previously unknown exploit in the customer’s application. This vulnerability was the likely entry point for the unauthorized network access. The customer was very grateful and still recalls the incident from time to time in our meetings.
27 Sep 2024 01:13 PM
One of the most impactful use cases I’ve handled was for a client with 40,000 employees, where generating payroll was becoming a major issue. The process of handling timesheets, leave, bonus payments, and taxes was taking up to 42 hours, which resulted in delays in payment processing. By leveraging Dynatrace, we quickly identified the bottlenecks.
Our first discovery showed that the Tomcat servers were not optimized, causing the heap to be exhausted within an hour of execution. This led to massive garbage collection (GC) pauses of up to 40 minutes, freezing the process repeatedly. With Davis AI, alerts were generated and the root issue was identified. The fix involved heap optimization, improving GC times and overall efficiency.
Next, by analyzing the distributed traces, we discovered that audit table updates during SQL operations were taking longer than the initial transactions themselves. We passed this to the DBAs, who optimized the queries, and Dynatrace also flagged that the audit and primary databases were sharing the same slow disk, compounding the issue.
After implementing further improvements—such as fine-tuning the JVMs, configuring the processes to avoid unnecessary security scans, reducing excessive logging, and rescheduling database and server backups—we reran the payroll. This time, it completed in just 8 hours, with no memory issues or audit-related delays.
02 Oct 2024 07:08 PM
I love to hear these stories. Here are my two stories that I can think of. I know there are more but these will always be remembered.
Story 1:
During the AppMon days, we noticed that VCT dropped by about 0.5 to 0.75 seconds. While this seemed like a positive change, the sudden speed increase without any changes was puzzling. Many teams collaborated to investigate the issue, but no one could pinpoint the cause.
I began digging into the data in AppMon, focusing on VCT and the reasons behind its drop. As I reviewed various dashboards, one caught my attention—it displayed 3rd party calls. I noticed that a specific 3rd party call coincided with the decline in VCT. This call was so frequent that it overwhelmed VCT, leading to the decrease.
I presented my findings to management, who then discussed the situation with the individual responsible for enabling the 3rd party call. To test my hypothesis, they disabled the 3rd party call, and VCT returned to normal levels. The team was impressed that Dynatrace (AppMon) had helped identify the solution to the VCT drop.
Story 2:
Some servers and processes were experiencing high CPU usage and had Problem cards. Teams worked tirelessly to identify the issue but had no luck. The application team pointed fingers at Dynatrace, claiming it was the source of the high CPU. Despite efforts from various Dynatrace experts, the problem remained elusive.
They brought me in to see if I could uncover the root cause. I didn’t believe Dynatrace was at fault, and I was determined to prove them wrong. I examined the process group responsible for the high CPU usage and accessed the “View detailed CPU breakdown” feature. By analyzing the thread causing the spikes through Method Hotspots, I quickly identified the culprit: a new logging feature that had been recently added, not Dynatrace as others had suggested. Once the logging was removed, CPU usage returned to normal.
In the end, it was Dynatrace and Full Stack for the win!
02 Oct 2024 07:17 PM
Recent success story:
We're moving from Splunk to Dynatrace for log ingestion. A lot of users were having trouble finding the logs they needed and were very used to the idea of indices that we've used in Splunk for a long time.
I managed to create some custom log buckets to match what the users were used to seeing in Splunk as indices to help solve the pain for them. Not only does this allow them to easily find the logs they're looking for, but it also allows me as a Dynatrace administrator to customize security and retention at the bucket level, and (my personal favorite) it reduces the amount of scanned data and therefore query costs associated with querying the logs they need to keep the platform moving forward.
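To make this a bit more concrete, here's a rough Python sketch of how such a custom bucket can be created through the Grail bucket-management API (the endpoint path, payload fields and retention value are assumptions for illustration; double-check the current docs and your token permissions):

```python
import requests

ENVIRONMENT = "https://abc12345.apps.dynatrace.com"  # hypothetical environment URL
TOKEN = "dt0c01.XXXX"                                # OAuth/platform token with bucket management permission (assumed)

# One custom bucket per former Splunk index, so users keep a familiar mental model.
bucket_definition = {
    "bucketName": "logs_payments_prod",   # placeholder name mirroring an old index
    "table": "logs",
    "displayName": "Payments production logs",
    "retentionDays": 35,
}

resp = requests.post(
    f"{ENVIRONMENT}/platform/storage/management/v1/bucket-definitions",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=bucket_definition,
    timeout=30,
)
resp.raise_for_status()
print("Bucket created, HTTP", resp.status_code)
```

Once users point their queries at the right bucket, only that bucket gets scanned, which is where the query-cost savings come from.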
I built them a simple Notebook as a guide on how to see which buckets were available and then how to use them in their queries for filtering. The new Logs app has also helped drastically decrease their time to convert from Splunk to Dynatrace.
I cannot stress the importance of data segregation with custom buckets enough!
04 Oct 2024 07:55 PM - edited 04 Oct 2024 07:56 PM
I love this approach for two reasons: first, it's always interesting to read about other uses and stories from members who solved challenges with Dynatrace; and second, there is never enough Dynatrace SWAG.
I have two stories that come to mind right away:
1. The first one is from a client whose challenge was to improve end users' perception of their interactions with their home banking. After ingesting survey data from those clients into Grail, we used that information to analyze the different user sessions and identify problems that could have affected the proper functioning of the site and prevented users from achieving the expected results.
The great thing about using Dynatrace is that we not only see the interaction with the frontend, but we can also find backend problems in the components that each session interacts with and uncover other root causes.
Thanks to this work, we were able to improve end users' perception of the home banking functionality and measure the UI changes and how they affect end customers.
2. This story is about Davis, and I never tire of telling it because it really is an AI that works. Specifically, the client had a DataPower appliance monitored with the IBM DataPower extension, and one of its metrics began to show a drop in performance. This was reported to the support team responsible for the DataPower, who evaluated the alerts with IBM. Finally, a few weeks later, the DataPower in question went out of service for scheduled maintenance. Dynatrace had identified the condition before anyone else.
04 Oct 2024 10:15 PM - edited 04 Oct 2024 10:15 PM
Story01:
I previously shared this last year in the Tracing challenge... I really love using Dynatrace to dig into traces. With trace analysis I saved a major release (a watermelon project - three weeks before the release it was green, two weeks before it was yellow, one week before it was fully red). So one week before the release, the BAs and developers found me because there were huge performance issues (huge response times) in a new greenfield Kubernetes app. I quickly checked a random individual trace between two meetings. I immediately looked at the method hotspots, and in the call hierarchy the bcrypto method was identified as the source of the huge response time for that individual request. Then I asked the project to generate load on the services, and my first observation (a Dynatrace observation) held up perfectly: on every service I could see code execution performance issues caused by the bcrypto method throughout the whole load test. After the bcrypto method was replaced by the developers, they ran a new performance test, and those results were quickly checked in DT again with the compare function. So the project and the release were saved and received the green light in time, within hours, thanks to DT (and to me 😉).
Story02:
This one, together with a colleague, demonstrated the power of Dynatrace for me. Because of the pandemic, the need for parcel services increased at that time. One of our clients installed 88 parcel delivery machines across the country. There was a major release change on the delivery machines (clients) and on the backend as well. They had three unsuccessful release changes on consecutive weekends (with a lot of human cost). Finally, they turned to us for help. We had very limited information about the infrastructure and the application. The symptom was very simple: after each release change and the first few transactions, the backend servers (JBoss) always hit CPU and memory saturation. During the 4th release change weekend we also participated in the process (together with the client's CIO 😂) and finally found that it was a program code issue (we all know that this can never happen, every code is perfect 😉). We investigated the problem with the trace function, and it turned out that one wrong database transaction loaded the whole database (millions of records) into memory and killed the JBoss instances. It was a huge coding mistake and it pointed out that there had been no proper performance testing at all. This case proved the capabilities of Dynatrace to the client.
05 Oct 2024 09:00 AM
Over the last 6 years, with more than 40 different customer tenants using Dynatrace, I can confidently say that Dynatrace has been successful across all implementations, but there are a few specific cases I want to highlight. They range from using *Request Attributes* to capture data from an error that crashed the application—without knowing exactly which values caused it—to the most recent advances in near real-time log analysis, detecting anomalies with *Davis Anomaly Detection* as if they were metrics we had always had configured. I would say my top 3 experiences are:
**Status Code 200 for Everything**
I had a client whose APIs all returned a 200 status code, even when there were errors. On the front end, errors were handled by showing an error message in the body of the response, making it very difficult to identify the issue if the response body was not available. With Dynatrace’s "new" features in *Business Events* and the help of *Grail*, we managed to build dashboards that allowed us to track the number of errors being generated within the APIs, as well as set up alerts for them.
[Documentation on Business Events]
[Documentation on Anomaly Detection]
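To make the idea concrete, here's a minimal, hypothetical Python sketch of pushing one of those "errors hidden inside a 200" as a business event through the bizevents ingest API (the field names and the manual-ingest approach are illustrative assumptions; OneAgent capture rules are another way to get these events in):

```python
import requests

TENANT = "https://abc12345.live.dynatrace.com"  # hypothetical tenant
TOKEN = "dt0c01.XXXX"                           # token with bizevents.ingest scope (assumed)

# Hypothetical event: the HTTP status was 200, but the response body carried a business error.
event = {
    "event.provider": "payments-api",           # placeholder provider name
    "event.type": "api.business-error",
    "http.status_code": 200,
    "error.code": "INSUFFICIENT_FUNDS",         # placeholder error taken from the response body
    "endpoint": "/v1/transfers",
}

resp = requests.post(
    f"{TENANT}/api/v2/bizevents/ingest",
    headers={
        "Authorization": f"Api-Token {TOKEN}",
        "Content-Type": "application/json",
    },
    json=event,
    timeout=30,
)
resp.raise_for_status()  # an accepted event is acknowledged without a payload
```

With the events in Grail, the error counts and alerts described above can be built on top of them.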
**API Management - Azure**
It's no secret that Azure API Management is a powerful tool, but it can become a challenge when you need to analyze your APIs, especially when you have more than 1000 APIs with numerous internal methods. While *Azure Monitor* or *App Insights* provide a general overview, they are quite limited in terms of the data you can extract. With Dynatrace integration, good processing (DPL), and custom metrics, we managed to get a much clearer view of the errors occurring in the APIs, their SLOs, and later, by adding *Grail*, we could segment each team’s APIs into more meaningful dashboards.
[Documentation on Log Processing]
[Documentation on OpenPipeline]
[Documentation on Anomaly Detection]
**Network I/O**
This is not my direct accomplishment, but I recall a client using network information from service analysis to identify that, when the application became slow, *Network I/O* spiked to abnormal levels. These fluctuations happened overnight, with no changes made to the application. Upon investigation, they discovered that the network team had reduced production bandwidth by 50% without notifying anyone, creating a bottleneck for clients with higher bandwidth. Once the bandwidth was restored, the application returned to normal.
[Documentation on Network I/O]
[Documentation on Service Analysis]
4.- **Extensions / Integrations**
Dynatrace’s ease of metric ingestion and extension creation has been key for many clients. It has allowed us to create views for JMX, IBM i, SQL, SNMP, NTP, and more. While there are now many ways to leverage these in the Hub, I remember how clients were amazed by the ease of integration.
For example, I recall a client where, with just one query into their synchronization database, we could see the current status of their ETL processes, knowing exactly which tables were affected, when their last update was, which were at 100%, and when the next synchronization process was scheduled. All of this was displayed in a single view, with status tiles in green or red.
I know these are more use cases outside of *war rooms*, and it may be more interesting when you find a problem and its solution. For example, I've dealt with *hung threads* in *WebSphere*, CPU vs *Garbage Collection* analysis to identify application locks, or how, with a database newly added to monitoring, we quickly resolved multiple DB issues using the *lock waits* view and *MDA* query analysis. But these types of problems have become quite common, and with similar technologies and developers making the same mistakes, they tend to repeat across clients.
That’s why I feel these three macro-level and long-term problems are among the most memorable.
08 Oct 2024 06:07 AM
Industry: Travel (Airline)
Issue: Global website slowness and unavailability
Observations:
- Unexpected surge in throughput (no marketing campaigns)
- Website slowdown and unavailability
- Deep analysis revealed traffic originating from a single geographic location
- Real User Monitoring (RUM) enabled identification of the issue
Root Cause: DDoS (Traffic Flood) attack on the website
Solution:
- Blocked traffic from the specific geographic location using Content Delivery Network (CDN)
- Mitigated global outage, allowing application recovery
This was before AppSec was available.
11 Oct 2024 04:25 PM
I thought for a while about how to tell my story and whether it fit this theme or not, but I ended up deciding to post it anyway because it was really meaningful to me.
I have worked with Dynatrace for a relatively short amount of time, but Dynatrace wasn't my first contact with the world of Observability.
During my Master's thesis I agreed to take on the massive (though I didn't know it at the time) challenge of implementing a SIEM system for my faculty's informatics department. The idea itself was very appealing to me, given the immense potential I saw in simply knowing what was going on in our systems. Some of my mates from the cyber security master's course found the concept kinda stale and perhaps even boring, but I personally really wanted to give it a shot, because up until that time there was virtually nothing in terms of observability going on within the organization. The idea was just... fascinating, to be honest. I kept wondering what I would see and how I could use that data to bring value to the organization.
Now the issue was that, while I found the idea itself quite appealing, the process of implementing it wasn't. It was a long, drawn-out process that took several months and achieved very limited results. I spent hours and hours configuring agents through IPs and ports, setting up log paths, processing and storage, managing disk space on a centralized server, and mostly just struggling in general to get significant resources assigned to the project, because people just didn't seem to value observability to the extent I did. To add to that, the tools I ended up using (with the resources that were allocated to me) just weren't cutting it. It felt like I was trying to cut down a tree with a hammer, if that makes sense?
In the end, although I did feel like the project I completed brought value to the university, it felt underwhelming. I felt like I could've done a lot more, but at that time was done with the world of observability. It was too much single-handed effort and, while I really did see value in it, others just didn't seem to see it. So I was pretty much done with Observability at the time: I was ready to ditch it and work on something else for the next 20 years 🤣
Fast forward a couple of months though, and a fairly interesting message finds itself in my inbox: A proposal, to work with a platform that was fairly unknown to me at the time: Dynatrace.
Now remember: I was pretty much done with observability at this point. Was I looking forward to deep diving into another wanna-be high end solution for observability? Hell naw. I wasn't about to agree to spend the next few years of my career configuring more yamls for more limited agents that could do 1/10th of what I actually needed them to do.
But the interview was scheduled anyway, so I figured the polite thing to do would be to show up and listen to their pitch.
And 'oh boy', was it a pitch.
The introduction and experience I had with Dynatrace from the start was just.... wildly and vastly different from my previous experiences. Deploying, seeing data, analyzing data, dashboarding, real user monitoring... It was all insane. Everything I wished I'd been able to do and things I didn't even consider doing, in an incredibly short timeframe. And it was, for the most part, smooth, efficient and easy to set up. It far outmatched my previous solutions in terms of versatility, agility and acquired value for the organization.
Now, I don't want to claim Dynatrace was always an easy platform to work with, especially when I was starting. It was and still is to me an incredibly complex platform with a lot of quirks and new things I need to learn about every day.
But I suppose that the "Success" part of this story was how it brought back that original excitement and feeling of wonder that got me to accept my thesis challenge of observability in the first place. Messing around with Dynatrace for the first time made me feel like yes, *it was* possible to elevate and do observability beyond the frustrations I'd found before.
Imagine how a kid would feel, tired of riding an old half-destroyed single-wheel bike held together with duct tape and prayers, ready to give up and just take the bus. And then that kid wakes up one morning and finds a new modern bike gifted to him downstairs. That's how I felt when I first started working with Dynatrace 🤣🎁
And then it just kept on improving when I realized the absolute value I could bring to extremely large organizations within my country, even with relatively short sprints of effort. Previous arguments I'd presented to my college colleagues proved to be true when I managed to use Dynatrace to, for example, find tons of processes on VMware machines that had been stuck for over a year, consuming the same amount of CPU at all times and doing nothing. I could find them because I used Dynatrace to do it. How much I achieved in such a short time, compared to my previous experiences, did make me feel like a hero. Or how we managed to organize, analyze and segregate a major organization's logs through the use of buckets, so that logs which had really only been considered clutter before became, within a single month, an essential and integral part of that organization's observability and performance universe.
So mine isn't a hugely impressive success story: I didn't save any client's puppy or big data center, or improve their servers' or database's performance by a gazillion %. Because the truth is that, realistically, Dynatrace is complex. Powerful, but complex. It takes time to learn, to study, to get certified, to know. Those success stories do exist, but they aren't mine. Not yet, and that's alright.
But I did want to contribute to this thread, and that means you instead get the story of how I used Dynatrace to regain those goosebumps and that feeling of excitement that made me want to work in this field in the first place! Of how I was ready to drop it all and, through Dynatrace, was brought back to this world of observability, and have been happily dwelling in it for a few months now. 😁