Solved: What JVM metrics are captured by OneAgent by default and how these metrics and custom metrics are used by problem detection AI

thanes_bala · ‎07 Aug 2019

It's my understanding that DT will, by default, monitor JVM metrics and use AI to associate any out-of-baseline values with a problem.

For example, a GC pause might trigger a spike in CPU usage, and that spike might trip a custom alert. However, if the CPU spike does not impact service performance, the alert is essentially a false positive. How does the DT problem identification AI use custom metrics alongside default metrics to decide on the appropriate action and cut down on false positive alerts? Or are custom alerts and metrics not used by the AI at all?

The key question is: should we rely on Dynatrace’s ability to detect problems, or should we also import custom metrics using the JMX plugin to monitor JVM internal health for preventive monitoring? If we did the latter, would there be any conflicts with the custom metrics/alerts?
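For context, the kind of JVM-internal values a JMX-based extension would read are exposed by the JVM's standard platform MXBeans. The following is a minimal sketch using `java.lang.management`; the class name `JvmHealthSnapshot` and its method names are hypothetical, and it does not represent what OneAgent or the JMX plugin run internally.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class JvmHealthSnapshot {

    /** Accumulated GC time in milliseconds, summed over all collectors since JVM start. */
    public static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Heap memory currently in use, in bytes. */
    public static long heapUsedBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    /** Number of live threads, daemon and non-daemon. */
    public static int liveThreadCount() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    public static void main(String[] args) {
        System.out.println("GC time (ms):      " + totalGcTimeMillis());
        System.out.println("Heap used (bytes): " + heapUsedBytes());
        System.out.println("Live threads:      " + liveThreadCount());
    }
}
```

Sampling these values at a fixed interval gives the time series that a custom alert or baseline would be built on.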

@Andreas G.

@Michael K.

wolfgang_beer · ‎08 Aug 2019

Davis AI is triggered mostly by events that affect real users, such as increases in service and application errors, slowdowns, or process crashes. In those cases, Davis AI follows all the transactions running through the unhealthy service and automatically analyzes the thousands of individual metrics of the underlying infrastructure nodes (regardless of whether they are built-in metrics or custom metrics such as JMX or OneAgent Extension metrics).

If a metric shows abnormal behaviour just before the problem was detected, Davis highlights it in the root cause section of the problem.

Those metric anomalies do not trigger false positive alerts, because they are only analyzed while a problem is already open.

Best greetings,

Wolfgang

thanes_bala · ‎08 Aug 2019

Thanks Wolfgang!

Following up on the abnormal behavior identification for a problem: let's say we are collecting JVM thread pool metrics using the JMX plugin, and the pool becomes exhausted due to thread mutex issues.

Will the AI identify the thread pool metrics in the problem identification?
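To make the scenario concrete, here is a minimal sketch of how an application might expose thread pool metrics over JMX so that a JMX-based extension could collect them. The `example.app:type=ThreadPool` object name, the `PoolStatsMBean` interface, and the helper methods are all hypothetical; the thread does not say how the pool in question is instrumented.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

public class ThreadPoolJmxSketch {

    /** Management interface: each getter becomes a readable JMX attribute. */
    public interface PoolStatsMBean {
        int getActiveCount();
        int getMaximumPoolSize();
        int getQueueSize();
    }

    /** Reads live values from the wrapped executor on every attribute access. */
    public static class PoolStats implements PoolStatsMBean {
        private final ThreadPoolExecutor pool;
        public PoolStats(ThreadPoolExecutor pool) { this.pool = pool; }
        @Override public int getActiveCount() { return pool.getActiveCount(); }
        @Override public int getMaximumPoolSize() { return pool.getMaximumPoolSize(); }
        @Override public int getQueueSize() { return pool.getQueue().size(); }
    }

    /** Registers the pool on the platform MBean server under a hypothetical object name. */
    public static ObjectName register(ThreadPoolExecutor pool, String name) {
        try {
            ObjectName objectName = new ObjectName("example.app:type=ThreadPool,name=" + name);
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            server.registerMBean(new StandardMBean(new PoolStats(pool), PoolStatsMBean.class), objectName);
            return objectName;
        } catch (Exception e) {
            throw new IllegalStateException("Could not register pool MBean", e);
        }
    }

    /** Reads an int attribute the same way an external JMX client would. */
    public static int readInt(ObjectName objectName, String attribute) {
        try {
            return (Integer) ManagementFactory.getPlatformMBeanServer().getAttribute(objectName, attribute);
        } catch (Exception e) {
            throw new IllegalStateException("Could not read attribute " + attribute, e);
        }
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(2);
        ObjectName name = register(pool, "worker");

        // Simulate exhaustion: every worker blocks on the same latch, extra tasks queue up.
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> { release.await(); return null; });
        }
        Thread.sleep(200); // give the workers time to pick up tasks

        System.out.println("active=" + readInt(name, "ActiveCount")
                + " max=" + readInt(name, "MaximumPoolSize")
                + " queued=" + readInt(name, "QueueSize"));

        release.countDown();
        pool.shutdown();
    }
}
```

An active count pinned at the maximum together with a growing queue is the signature of the exhaustion described above.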

wolfgang_beer · ‎08 Aug 2019

If your JMX metric changes its behaviour right before the problem is raised (say, within 15 minutes before the problem starts), I would say yes. Of course, this very much depends on timing and on how significant the metric change is.
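The timing condition can be illustrated with a small sketch. This is not Davis's actual algorithm, which the thread does not describe; it only encodes the rough rule of thumb from the reply, with the 15-minute window passed in as a parameter. The class and method names are hypothetical, and the significance of the change is deliberately left out.

```java
import java.time.Duration;
import java.time.Instant;

public class PreProblemWindow {

    /**
     * Returns true when the metric change happened at or before the problem start,
     * and no earlier than `window` before it.
     */
    public static boolean changedShortlyBefore(Instant metricChange, Instant problemStart, Duration window) {
        if (metricChange.isAfter(problemStart)) {
            return false; // the change came after the problem was raised
        }
        Instant earliest = problemStart.minus(window);
        return !metricChange.isBefore(earliest);
    }

    public static void main(String[] args) {
        Instant problemStart = Instant.parse("2019-08-08T10:00:00Z");
        Duration window = Duration.ofMinutes(15); // rough figure taken from the reply above

        Instant poolExhausted = Instant.parse("2019-08-08T09:50:00Z");
        System.out.println(changedShortlyBefore(poolExhausted, problemStart, window));
    }
}
```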

thanes_bala · ‎09 Aug 2019

Thank you! @Wolfgang B.