I'm wondering if any of you have some ideas for this weird
Generic Execution Plugin + Alerting issue we're having.
Essentially we want to get the Generic Execution Plugin to
be run when an Incident threshold is reached (standard alerting stuff really).
The Generic Execution Plugin is installed on the DT server and can be
successfully tested as working by creating a monitor which runs the plugin from
the DT server and manually running it - it works.
However the trouble comes when the Generic Execution Plugin
is used as an Alert on an incident. The Incident itself is detected, because
email alerting added to it works, but the Generic Execution Plugin doesn't seem to get
run at all (nothing even appears in the DT logs for the Generic Execution
Plugin, even with logging set to 'finer').
We have two Dynatrace environments - a production and a
non-production - and annoyingly alerting works in non-production with the
Generic Execution Plugin (and it adds to the log files). As far as I can tell
both environments are pretty much identical.
I know for monitors the plugin can be specified to run on a
specific location, but as far as I can see for alerts it always runs from the
DT server. But could I be mistaken and it's trying to run somewhere it has no
permissions for, maybe?
I suppose what's really adding to my confusion is no logs
are written - as if the plugin is failing silently.
The platform is Windows and Dynatrace AppMon 7.0.17.
The Generic Execution Plugin version is 3.36 (though
this fault occurred with earlier versions also).
Any ideas or thoughts - much appreciated!
I can say that an incident action won't run from anywhere besides the server; incident actions only ever run from there. As for what the issue could be, I don't have any specific ideas, but a few years back we saw something similar where it appeared the plugin wasn't running at all. After looking at the server we found entries in the server logs indicating other server health issues, and once those were resolved the GEP became reliable. I can only recommend taking a look at the server itself to make sure it is completely healthy - it is possible the increased load on the production server is the difference between prod and pre-prod.
Thanks for the suggestion, James. I've had a look at the load on the server and it all appears
If there are any other metrics you can recommend checking
please let me know; I was just looking for anything that looked high.
Beyond this, are there any other paths you can suggest to
investigate this? It appears the problem can be described as:
"when GEP is called as an Action from an Incident either the call does not
happen or silently fails."
Is there any way of getting more detail on what is happening
in the code when an incident is fired and an Action should be actioned?
I don't have any recommendations at the moment. There are some people internally who are trying to resolve something similar so I'll update here if they find anything.
The thought at the moment is that it is a 'permissions' issue but I have no details beyond that.
An update for all who are watching this question...
Yesterday we upgraded the Dynatrace installation with this problem to version 7.0.20, and today I noticed that the Generic Execution Plugin appears to be run when an incident alerts. In some cases it actually runs (and runs a small PowerShell script), but at other times it fails. The good news is that either outcome is an improvement, as it's actually writing to the log file this time, so I can see that when the incident alert does fail it fails with the reason:
"2018-04-24 12:07:49 WARNING [UserPluginManager@com.dynatrace.diagnostics.plugin.extendedexecutor.GenericExecutionPlugin.action] [Channels] Incident Alert Test: java.lang.IndexOutOfBoundsException - Index: 1, Size: 1"
The other thing I noticed about these failures - there weren't any at all overnight. According to the logs from the Generic Execution Plugin it ran successfully (except for once) from after the upgrade yesterday afternoon until this morning, when I noticed it was working and attempted to create a new incident to properly use its functionality.
This in summary means either I've mis-configured it somehow this morning or maybe, as was previously suggested, load on the platform stops it working sometimes. Could anyone give any background on what the error "java.lang.IndexOutOfBoundsException" could mean?
The exception itself just means that in the code for the plugin at some point an index in an array or string or something is called that is out of range. E.g. calling the 5th element in a 4 element array.
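To make that concrete for your message: "Index: 1, Size: 1" is what a Java list reports when something asks for its second element (index 1) while it only holds one. The sketch below is purely illustrative and is not taken from the plugin's source; the class and the secondToken helper are hypothetical names, and the idea that a string gets split into parts is only an assumption chosen to produce the same kind of failure.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexOutOfBoundsDemo {

    // Hypothetical helper (not from the GEP): returns the second
    // whitespace-separated token of a line.
    static String secondToken(String line) {
        List<String> parts = new ArrayList<>();
        for (String p : line.trim().split("\\s+")) {
            parts.add(p);
        }
        // Valid indexes are 0..size-1. With a single token the list has
        // size 1, so index 1 is out of range and get(1) throws.
        return parts.get(1);
    }

    public static void main(String[] args) {
        // Two tokens: index 1 exists.
        System.out.println(secondToken("first second"));

        // One token: asking for index 1 raises IndexOutOfBoundsException.
        try {
            secondToken("first");
        } catch (IndexOutOfBoundsException e) {
            // The exact message wording depends on the Java version.
            System.out.println("Caught: " + e.getMessage());
        }
    }
}
```

The point is that the same code succeeds or fails depending on the data it is handed, which would be consistent with the plugin working for some incidents and not others.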
Unfortunately, especially in your case where it sometimes is working, it probably isn't something simple or a configuration issue but rather something that would take a lot of debugging of the code to figure out, and the GEP is quite a hefty plugin. A stacktrace could point to where in the code the issue shows up but then it would still need to be fixed.
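If you want to test the load theory in the meantime, one option is to tally the failures per hour and compare that against server load. The following is a rough sketch only: it assumes every failure line looks like the one you quoted (timestamp, WARNING, the GenericExecutionPlugin logger name, then the exception), and the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GepFailureTally {

    // Assumed line layout, based only on the single log line quoted in
    // this thread. An optional leading quote is tolerated.
    private static final Pattern FAILURE = Pattern.compile(
        "^\"?(\\d{4}-\\d{2}-\\d{2}) (\\d{2}):\\d{2}:\\d{2} WARNING "
        + ".*GenericExecutionPlugin.*IndexOutOfBoundsException");

    // Counts matching failure lines, keyed by "yyyy-MM-dd HH:00".
    static Map<String, Integer> failuresPerHour(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = FAILURE.matcher(line);
            if (m.find()) {
                String hour = m.group(1) + " " + m.group(2) + ":00";
                counts.merge(hour, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The only sample is the line quoted earlier in the thread.
        List<String> sample = Arrays.asList(
            "2018-04-24 12:07:49 WARNING [UserPluginManager@com.dynatrace."
            + "diagnostics.plugin.extendedexecutor.GenericExecutionPlugin.action] "
            + "[Channels] Incident Alert Test: "
            + "java.lang.IndexOutOfBoundsException - Index: 1, Size: 1");
        System.out.println(failuresPerHour(sample));
    }
}
```

In practice you would feed it the lines of the plugin's log file instead of the hard-coded sample. If the failures cluster in busy hours that would support the load explanation; if they line up with a particular incident instead, configuration looks more likely.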