
Excessive Process Availability Alerts Due to Nightly Restarts – Need Recommendation

Sohel_Rashid
Visitor

Hello Dynatrace Community,

In our production environment, UPI and Mobile Banking application processes are restarting every night as part of a scheduled activity.

We have configured Process Availability alerts for these critical processes. However, due to the nightly restarts, we are receiving around 180–190 alerts daily, which is creating alert noise and unnecessary ticket generation.

If we enable a Maintenance Window, it suppresses alerts for the entire application, which is not ideal because we still want to receive other critical alerts during that time window.

Challenges:

  • Nightly process restarts generate large volumes of availability alerts.

  • Maintenance Window suppresses all alerts for the application, not only process availability.

  • Alert fatigue and unnecessary escalations are occurring.

What we are looking for:

  • Best practices to handle expected process restarts without losing visibility of real issues.

  • Is there a way to suppress only process availability alerts during a specific window?

  • Any recommendation on alert tuning, delay thresholds, alert correlation, or custom logic?

  • Has anyone handled similar scenarios in banking / batch restart environments?

Any guidance or suggestions would be highly appreciated.

Thank you in advance.

7 REPLIES

Julius_Loman
DynaMight Legend

@Sohel_Rashid Maintenance window is the preferred solution. Can you please share how your maintenance window is defined?

Dynatrace Ambassador | Alanata a.s., Slovakia, Dynatrace Master Partner

We are currently not using a Maintenance Window because these are business-critical UPI and Mobile Banking applications, and enabling a maintenance window would suppress genuine critical alerts. Additionally, every week new JARs are deployed, but alerts related to old JAR processes remain open, requiring manual closure of nearly 180–190 problems, which is operationally difficult. We are looking for a solution that reduces alert noise without losing visibility of real production issues and avoids manual cleanup.

andre_vdveen
DynaMight Leader

Hi @Sohel_Rashid 

I would create an auto tag for those processes only, then use that in a MW config to suppress problems for entities with that tag.
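A tag-scoped maintenance window like the one suggested above could be sketched as a Settings 2.0 object. This is only a rough illustration: the tag `nightly-restart`, the times, and the time zone are made-up values, and the field names under `value` are written from memory of the `builtin:alerting.maintenance-window` schema, so please check them against the schema in your own environment before posting anything to `/api/v2/settings/objects`.

```python
import json


def build_maintenance_window(tag: str, start: str, end: str, tz: str,
                             start_date: str, end_date: str) -> dict:
    """Build a daily maintenance window that only covers process groups with one tag.

    Field names are an assumption based on the maintenance-window settings schema;
    verify them against your environment before use.
    """
    return {
        "schemaId": "builtin:alerting.maintenance-window",
        "scope": "environment",
        "value": {
            "enabled": True,
            "generalProperties": {
                "name": f"Nightly restart ({tag})",
                "maintenanceType": "PLANNED",
                # Assumed enum value: no problems are detected for matching entities.
                "suppression": "DONT_DETECT_PROBLEMS",
                "disableSyntheticMonitorExecution": False,
            },
            "schedule": {
                "scheduleType": "DAILY",
                "dailyRecurrence": {
                    "timeWindow": {"startTime": start, "endTime": end, "timeZone": tz},
                    "recurrenceRange": {
                        "scheduleStartDate": start_date,
                        "scheduleEndDate": end_date,
                    },
                },
            },
            # The filter is what keeps the window from covering the whole application.
            "filters": [
                {"entityType": "PROCESS_GROUP", "entityTags": [tag], "managementZones": []}
            ],
        },
    }


if __name__ == "__main__":
    window = build_maintenance_window(
        "nightly-restart", "00:00:00", "01:00:00", "Asia/Kolkata",
        "2025-01-01", "2026-01-01",
    )
    # The settings objects endpoint expects a JSON array of objects.
    print(json.dumps([window], indent=2))
```

Because the filter lists only the tagged process groups, problems on services, hosts, and untagged processes would still be detected during the window.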


ChadTurner
DynaMight Legend

So you have critical processes, but they are recycled on a regular schedule, which is to be expected. I'm assuming your settings alert on availability if any one of these processes is offline, but again, they are expected to be offline at, for example, midnight each night.

Do these processes need to be up over the weekend? Holidays, etc.?

I agree a Maintenance Window might not be the best solution, since you may want to alert on aspects other than availability and not go completely blind to the process.

There are a few things you can do, but they rely on Workflows.

1 - Create a workflow that targets your identified process groups and disables the availability alerts at your defined start time. Then clone that workflow to do the inverse and turn the alerting back on at your defined end time.

While this feels manual, in the sense of a workflow looking at each individual process group, you could make it a bit more automated. For example, you could create a specific tag, apply it automatically to the process groups in question, and have the workflow act on that tag. Then, if a new critical process is onboarded, the workflow does not need to be altered; just add the tag to the new process group.

2 - Create Davis Anomaly Detectors. Depending on your desired alert state, you could turn off the built-in alert via the settings UI on the process groups, then set up a Davis Anomaly Detector (DAD) to alert on your criteria, which would let you require the condition to be observed for a full 60 minutes before an alert is raised. So if it's down for up to 1 hour you're fine, but past 1 hour there is an alert. Even then, you could pair it with a workflow to turn the DAD rule on/off during the desired hours.

There are a few other possible solutions, like a nightly NEGATE on your alerting profile rule combined with tags, but that might add unnecessary complication and would only apply at the alerting profile level, meaning that while it might not send an incident into ServiceNow, the problem may still show up in the Dynatrace UI.

-Chad
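To make the intent of options 1 and 2 in the reply above concrete, here is a small sketch of the decision they add up to: alert immediately outside the restart window, and inside it only once the outage outlasts a grace period. This is plain Python for illustration, not a Dynatrace API; the function names, the 23:30 to 00:30 window, and the 60-minute default are assumptions. In Dynatrace itself the window would come from the scheduled workflows and the grace period from the detector settings.

```python
from datetime import datetime, time, timedelta


def in_restart_window(ts: datetime, start: time, end: time) -> bool:
    """Return True if ts falls in the daily window [start, end).

    Handles windows that cross midnight, e.g. 23:30 to 00:30.
    """
    t = ts.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def should_alert(down_since: datetime, now: datetime, start: time, end: time,
                 grace: timedelta = timedelta(minutes=60)) -> bool:
    """Decide whether an unavailable process should raise an alert.

    Outages that begin outside the restart window alert right away.
    Outages that begin inside it alert only after the grace period has passed.
    """
    if not in_restart_window(down_since, start, end):
        return True
    return now - down_since > grace


if __name__ == "__main__":
    window_start, window_end = time(23, 30), time(0, 30)
    went_down = datetime(2025, 1, 1, 23, 45)
    # 30 minutes into an expected restart: stay quiet.
    print(should_alert(went_down, datetime(2025, 1, 2, 0, 15), window_start, window_end))
    # 75 minutes later the process is still down: this is a real issue.
    print(should_alert(went_down, datetime(2025, 1, 2, 1, 0), window_start, window_end))
```

The grace period is what keeps a restart that never comes back from being hidden by the nightly suppression.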

Thanks for the detailed suggestions — the workflow-based enable/disable approach and tagging strategy sound promising and we will definitely evaluate them. I also agree that a full maintenance window is not ideal since we still need visibility for other critical signals beyond availability. One additional challenge we face is that every week new JARs are deployed, but alerts related to old JAR processes remain open and require manual closure of nearly 180–190 problems, which is operationally difficult for the application team. If you have any recommendation on automatically handling or cleaning up stale process alerts during JAR upgrades, that would be very helpful.

So we had a similar use case, but it was centered on when a host was onboarded. This method could still work for you, though. Every week a new JAR process shows up and you alert on the old JARs, which is noisy and isn't a cause for alarm. So there are a few options.

1 - MONACO - When your teams do a deployment, or however that new JAR process kicks off, add a step that makes an API call to Dynatrace to tag the older processes/process groups with a tag that is in line with either a Maintenance Window or the workflow we discussed earlier. That way, as new things are released, the older ones are tagged as such while you keep a watchful eye on the new ones. This is similar to PaaS instances going up and down as needed. There was a support document on the same method for PaaS hosts that spin up and spin down, describing how to set a tag on the PaaS entities slated for shutdown.

2 - First Time Seen Tagging - This is a method we designed that lets us target when things were released. If you have a release but you don't have the API connection for metadata, you could use the "first time seen" metadata as your primary logic to say anything after Week X will be considered OLD. You can then put a tag on the entity and leverage that in your workflows, etc. https://community.dynatrace.com/t5/Dynatrace-tips/PRO-TIP-Setting-an-onboarding-date-stamp/m-p/20603... 

-Chad
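The two options above could be combined in a deployment step roughly like this. It is a hedged sketch, not the method from the linked tip: it assumes that "old" means first seen before a cutoff you choose, that the entity list comes from the Monitored entities API v2 with the `firstSeenTms` field requested, and that the tag is applied through the custom tags endpoint `POST /api/v2/tags`. The tag key `stale-jar`, the base URL, and the token are placeholders, and the request is only built here, never sent.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone


def stale_entity_ids(entities: list[dict], cutoff: datetime) -> list[str]:
    """Return IDs of entities first seen before the cutoff.

    Each entity dict is assumed to carry 'entityId' and 'firstSeenTms'
    (epoch milliseconds), as returned by the entities API when that field is requested.
    """
    cutoff_ms = int(cutoff.timestamp() * 1000)
    return [e["entityId"] for e in entities if e["firstSeenTms"] < cutoff_ms]


def build_tag_request(base_url: str, token: str, entity_ids: list[str],
                      tag_key: str) -> urllib.request.Request:
    """Build (but do not send) a request that adds a custom tag to the given entities."""
    if not entity_ids:
        raise ValueError("no entities to tag")
    selector = "entityId(" + ",".join(f'"{i}"' for i in entity_ids) + ")"
    query = urllib.parse.urlencode({"entitySelector": selector})
    body = json.dumps({"tags": [{"key": tag_key}]}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v2/tags?{query}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Api-Token {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Hypothetical sample data standing in for an entities API response.
    sample = [
        {"entityId": "PROCESS_GROUP_INSTANCE-OLD",
         "firstSeenTms": int(datetime(2025, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)},
        {"entityId": "PROCESS_GROUP_INSTANCE-NEW",
         "firstSeenTms": int(datetime(2025, 1, 8, tzinfo=timezone.utc).timestamp() * 1000)},
    ]
    release_day = datetime(2025, 1, 6, tzinfo=timezone.utc)
    old_ids = stale_entity_ids(sample, release_day)
    request = build_tag_request("https://example.invalid", "dummy-token", old_ids, "stale-jar")
    print(request.get_method(), request.full_url)
    # To actually apply the tag you would call urllib.request.urlopen(request).
```

Once the old processes carry the tag, the same tag can drive the Maintenance Window filter or the workflow discussed earlier, so nothing has to be listed by hand after each weekly release.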
