Has anyone come across a 'Process unavailable' problem that is created when a process stops, but whose problem card then shows something like 'State of process xyz is unknown'? The problem never closes automatically, even after the process is started again. Manually closing it causes another problem to be created for the exact same issue, and it takes several rounds of closing before it stops coming back.
What I have seen is that this occurs when you enable process group monitoring for a process group, then change the host group of the agent and restart the process in that process group. I understand that changing host groups is typically something you try to steer away from, but it is sometimes unavoidable, and it seems that doing so makes Dynatrace lose track of its process group.
Below are the steps I took. After my original post, I was able to follow these steps to reproduce the same type of problem:
- Had 2 hosts with the same host group. The process group was reflected correctly and showed both processes as running
- Changed the host group on just 1 host via oneagentctl
- Stopped the process on the 1 host with the changed host group
- Received an open problem card that was worded correctly (Process xyz on host xyz has been shut down)
- Started the process again on the host where it had been shut down
- Waited some time to see if the above problem would close out. It had not closed after 20 mins, so I stopped the same process on the same host, trying to trigger the 'state of process is unknown' problem
- Never received a "State of process xyz is unknown" problem
- Started the process again almost 10 mins after stopping it
- The original problem above never closed automatically. I expected this, given that the host group had changed and a new process group was created. Even after 20+ hours it was still open, so I closed it manually
- A new problem with the text "State of process xyz is unknown" opened immediately after I closed the original problem
@travis_anderson No, as you said, this 'unknown' event is definitely not the 'normal' case. I can only explain it by the host group configuration change that was made at the same time. This agent-level config change splits the process group, which means that the original process group's state can no longer be retrieved and is considered unknown. That is also the reason why it does not resolve automatically.
This should be a rare case, as the host group should be a stable definition.
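To illustrate the split described above, here is a minimal toy model in Python. It assumes that a process group's identity includes the host group, so changing the host group yields a new identity and leaves the old one without any reporting agent. The names `ProcessGroupKey` and `ToyMonitor` are hypothetical and are not Dynatrace's real data model or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProcessGroupKey:
    """Hypothetical process group identity: process name plus host group."""
    process_name: str
    host_group: str


@dataclass
class ToyMonitor:
    """Toy stand-in for the monitoring server (assumption, not the real product)."""
    # Maps (key, host) to the last reported state: "running" or "stopped".
    last_report: dict = field(default_factory=dict)

    def report(self, key, host, state):
        self.last_report[(key, host)] = state

    def state_of(self, key, host, reporting_keys):
        # If the agent no longer reports under this key, nothing can
        # confirm the old instance's state, so it is treated as unknown.
        if key not in reporting_keys:
            return "unknown"
        return self.last_report.get((key, host), "unknown")


monitor = ToyMonitor()
old_key = ProcessGroupKey("xyz", "group-a")
new_key = ProcessGroupKey("xyz", "group-b")

monitor.report(old_key, "host1", "running")
# The host group on host1 changes: the agent now reports under new_key only.
monitor.report(new_key, "host1", "running")

old_state = monitor.state_of(old_key, "host1", reporting_keys={new_key})
new_state = monitor.state_of(new_key, "host1", reporting_keys={new_key})
print(old_state, new_state)
```

In this sketch the process is running the whole time, yet the old identity can never be confirmed as running again, which matches the behaviour of a problem that does not resolve on its own.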
No objection to that of course. I just meant that this opt-in alert is not the 90% normal case and its occurrence is very rare. What I could offer is to modify the text within the card a bit to include the information that this 'unknown' most likely is related to a reconfiguration of the host group. What do you think?
Dear @travis_anderson ,
Yes, process unavailable monitoring is an opt-in alert that comes in two flavors. Either you alert when ANY of the process group instances (PGIs) within a process group becomes unavailable (which in detail means that the OneAgent no longer sends any info updates for that identified process group instance),
or you choose the option to alert only if the number of instances drops below a given limit, for clustered cases.
The resulting problem and underlying event are either resolved again when the PGI comes back online or, in case it never comes back, have to be closed manually (which is also mentioned on the settings page where you activated the option).
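The two flavors can be sketched as two small predicates. This is only an illustration of the logic as described in this post, assuming each instance is represented by a boolean that says whether its agent is still sending updates. The function names are made up for the example and do not correspond to any Dynatrace setting or API.

```python
def any_instance_unavailable(instances):
    """Flavor 1: alert if any process group instance stopped reporting."""
    return any(not reporting for reporting in instances.values())


def below_minimum(instances, minimum):
    """Flavor 2: alert only if fewer than `minimum` instances are reporting."""
    reporting_count = sum(1 for reporting in instances.values() if reporting)
    return reporting_count < minimum


# Hypothetical three-node cluster where host2 has stopped reporting.
cluster = {"host1": True, "host2": False, "host3": True}

alert_any = any_instance_unavailable(cluster)
alert_min2 = below_minimum(cluster, 2)
alert_min3 = below_minimum(cluster, 3)
print(alert_any, alert_min2, alert_min3)
```

With one of three instances down, the first flavor alerts, while the threshold flavor alerts only if the configured limit is higher than the two instances still reporting.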
In your case, the change of PG detection for sure negatively interfered with that alerting logic. Best would be to disable the PG availability alerting before making any such changes.
I hope those details help to understand the logic behind it.
I am fully aware of PG monitoring, the 2 different options we have, and what they do. What I am really asking is whether it is normal to get this process state of unknown. If so, what is the reason for creating this 'state of process .... is unknown' problem card, and what value does it bring? This problem refers to the old host group and therefore the old process group, so why would we need to know about it? All it seems to do is create confusion. And if the solution is to turn off availability monitoring temporarily before changing the host group, then all that does is increase the risk of further issues if you forget to re-enable process group monitoring. Could we not avoid creating this type of problem altogether?
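If the disable-then-change workaround is scripted, the risk of forgetting to re-enable monitoring can be reduced by wrapping the change so that re-enabling always runs. The sketch below assumes you already have your own way to switch availability alerting off and on; the `disable` and `enable` callables are hypothetical placeholders, not Dynatrace functions.

```python
from contextlib import contextmanager


@contextmanager
def availability_alerting_paused(disable, enable):
    """Call disable() on entry and always call enable() on exit.

    Both callables are hypothetical placeholders for whatever mechanism
    you use to toggle process group availability alerting.
    """
    disable()
    try:
        yield
    finally:
        # Runs even if the host group change raises an exception.
        enable()


calls = []
with availability_alerting_paused(
    disable=lambda: calls.append("alerting disabled"),
    enable=lambda: calls.append("alerting enabled"),
):
    # Placeholder for the actual host group change on the agent.
    calls.append("host group changed")

print(calls)
```

This does not address the question of whether the 'unknown' problem should be created at all; it only keeps the workaround from leaving monitoring switched off.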