Alerting
Questions about alerting and problem detection in Dynatrace.

How to Automatically Detect Process Failures in Dynamic Infrastructure

VictorRuiz
Contributor

Hello community,

I am trying to configure a global availability alert for a specific infrastructure process, but I have hit a logical wall within Dynatrace SaaS.

The Goal:

I need to monitor the availability of several processes that are not running on all of my servers (example: p_ctmag).

  • It is deployed across approximately 140 Linux servers.
  • We do not have a static inventory of these servers, as the deployment is dynamic.
  • Requirement: Trigger a "Process Unavailable" alert indicating the affected host if the p_ctmag process crashes on any server that was running it.

The problem:

Our infrastructure is heavily divided using Host Groups. By design, Dynatrace does not merge processes belonging to different Host Groups into a single Process Group. Consequently, I do not have one global Process Group to monitor, but dozens of isolated ones.

Failed Strategies:

  1. Declarative Process Grouping:

I created a rule targeting the $eq(p_ctmag) executable. While the detection works perfectly, Dynatrace still separates the resulting Process Groups based on the underlying Host Groups. Because they remain fragmented, I cannot configure a centralized "Alert if any process becomes unavailable" anomaly detection rule.

  2. Advanced Detection Rules:

Not applicable. As per the documentation, these rules only apply to deep-monitored technologies (like Java or .NET) and do not work for native OS binaries like this one.

  3. Process Availability Rules (with Auto-Tags):

Since I lack a static inventory, I attempted to use Settings > Processes and containers > Process availability. I configured a Host Auto-tag rule to dynamically tag any server running p_ctmag. Then, I scoped the Process Availability rule to that specific host tag.

The logical flaw: If the process dies on a server, that server dynamically loses the tag. Once the tag is lost, the host falls out of the availability rule's scope, and the alert either never triggers or auto-closes instantly.
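To make the flaw concrete, here is a minimal Python sketch of the two scoping models. It does not call any Dynatrace API. The names `dynamic_tag_scope` and `StickyScope` and the host names are hypothetical, and the sets of running hosts are assumed to come from some periodic snapshot. The first function models the auto-tag approach, where the scope is recomputed from the hosts currently running the process. The second models what I am actually looking for: a scope that remembers hosts that ran the process earlier.

```python
from dataclasses import dataclass, field
from typing import Set


def dynamic_tag_scope(running_now: Set[str]) -> Set[str]:
    """Auto-tag model: the scope is rebuilt from hosts running the process now.

    Every host in scope is running the process by definition, so the set of
    hosts to alert on is always empty.
    """
    scope = set(running_now)      # tag is applied only while the process runs
    return scope - running_now    # hosts in scope without the process


@dataclass
class StickyScope:
    """Hypothetical model: remember every host ever seen running the process."""
    known_hosts: Set[str] = field(default_factory=set)

    def evaluate(self, running_now: Set[str]) -> Set[str]:
        """Return hosts that ran the process before but are not running it now."""
        missing = self.known_hosts - running_now
        self.known_hosts |= running_now   # learn newly discovered hosts
        return missing


if __name__ == "__main__":
    before = {"host-a", "host-b"}   # hypothetical snapshot: p_ctmag on both
    after = {"host-a"}              # p_ctmag has crashed on host-b

    print(dynamic_tag_scope(after))       # nothing to alert on

    sticky = StickyScope()
    sticky.evaluate(before)               # first snapshot only learns the hosts
    print(sticky.evaluate(after))         # host-b is reported as missing
```

In this sketch the sticky scope never forgets a host, so a real setup would also need a way to retire hosts where the process was removed on purpose.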

My Question:

Is there a native, fully dynamic way to monitor infrastructure OS process crashes distributed across multiple Host Groups without relying on manual tagging or maintaining static inventories?

Any insights or architectural workarounds (even utilizing the API or DQL) would be greatly appreciated.

Thanks in advance.

Victor
