25 Oct 2022 07:53 AM - last edited on 26 Oct 2022 12:39 AM by MaciejNeumann
Hey everyone, we ran into a situation where no problem was created for a Windows OS service that was not running, even though we expected one. We did some testing and were able to reproduce the situation with the steps below:
- Put a Windows host into a maintenance window.
- Stopped a specific Windows OS service and saw that the availability event for the service was logged at the host level but no problem card was created. This is working as designed and as expected.
- Removed the host from the maintenance window.
- After the host was removed from the maintenance window, the service was still not running but no problem was created. This is not what we were expecting.
What seems to be happening is that Dynatrace sees the service stop during the maintenance window (and skips problem creation, per design). Because it never sees the service running again AFTER the maintenance window, the availability event is never cleared, so no problem card is created for the service not running.
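The hypothesis above can be illustrated with a toy model in Python. This is only a sketch and not Dynatrace's actual logic: `EdgeTriggeredMonitor`, `observe`, and the sample timeline are hypothetical, and the model assumes a problem is raised only on a running-to-stopped transition that happens outside a maintenance window.

```python
from dataclasses import dataclass, field


@dataclass
class EdgeTriggeredMonitor:
    """Toy model: a problem is raised only on a running -> stopped transition."""
    last_state: str = "running"
    problems: list = field(default_factory=list)

    def observe(self, label, state, in_maintenance):
        # A transition is the only moment at which a problem can be raised.
        transitioned = self.last_state == "running" and state == "stopped"
        if transitioned and not in_maintenance:
            self.problems.append(label)
        self.last_state = state
        return transitioned


monitor = EdgeTriggeredMonitor()
# Hypothetical timeline: (label, service state, host in maintenance window?)
timeline = [
    ("service running, host in window", "running", True),
    ("service stopped, host in window", "stopped", True),
    ("host removed from window", "stopped", False),
    ("later check", "stopped", False),
]
for label, state, in_maintenance in timeline:
    monitor.observe(label, state, in_maintenance)

print(monitor.problems)  # prints []
```

Under this assumption the only transition happens inside the window, so nothing is ever raised afterwards, which matches what was observed in the test.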
My question: is this the expected behavior? It seems to leave gaps where an issue could go unalerted after a maintenance window.
It depends on the type of maintenance window setting you have picked.
If you chose detect but don't alert, you will not get an alert once the window expires. The trigger only happens once. If the service stops at 8:10 and your maintenance window is from 8 to 9, Dynatrace will detect the problem but suppress the alert. Once the window expires and the service is still down, Dynatrace won't trigger another problem card, because that was already done during the window.
If you want to be alerted on any issue after the window expires, you need to select the option that ignores both the event and the alert; that way the problem isn't triggered during the window. Dynatrace turns a blind eye to the whole event. Then, once the window expires, it opens its eyes and will detect and alert on any violations.
@ChadTurner thanks for the response here, but I'm not entirely following you. It seems you are saying that what we want is possible; I'm just not sure how to make that happen. In the example above, a Windows service was not running after the host came out of a maintenance window, so in that situation we definitely do want a problem card created. How would I make that happen?
For the maintenance window that was applied to this host we have 'Disable problem detection during maintenance'. We definitely do not want to alert during the window, so I assume the option we should be using is 'Detect problems but don't alert'? Would this truly allow a problem to be created after the host is removed from the window if the service is still not running? Keep in mind that in the situation we are working with, the service was never found to be running from the time Dynatrace initially logged it as not running (during a maintenance window) to the time the host was removed from the maintenance window. It isn't as if Dynatrace saw it in a good state and then the service stopped again.
Make sure that the service or process is set to alert if it's offline. I don't think that is on by default.
@sivart_89 I hope this visual helps:
Basically, if you have it set to detect but not alert during the maintenance window, you don't have the functionality of alerting after the problem is detected. The two actions of detecting and alerting only happen once for the issue and (depending on alerting profile rule times) trigger shortly thereafter. But the maintenance window is set to suppress the alert. So the problem card is there, as the problem was detected, but your alert integration was skipped. It won't go back and alert on that detected problem once the window expires; the trigger fired once and was told to suppress.
The other option, Disable problem detection, preserves the sequence of detecting the issue and then alerting, because the one triggers the other. Once your maintenance window expires, Dynatrace will find any problems, such as that service not running. A problem is then detected, a card is created, and the alert integration is triggered to notify the users.
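The difference between the two options, as described in this reply, can be sketched in Python. This is a simplified model of the explanation only, not of Dynatrace internals: `Mode`, `simulate`, and the minute-based times are hypothetical, the window is assumed to expire on its own, and the service is assumed to stay stopped.

```python
from enum import Enum


class Mode(Enum):
    DETECT_NO_ALERT = "Detect problems but don't alert"
    DISABLE_DETECTION = "Disable problem detection during maintenance"


def simulate(mode, stopped_at, window_start, window_end, now):
    """Return (problem_card_time, alert_time) in minutes since midnight.

    None means "has not happened". Assumes the service is still stopped at `now`.
    """
    in_window = window_start <= stopped_at < window_end
    if not in_window:
        # Outside any window: detect and alert right away.
        return stopped_at, stopped_at
    if mode is Mode.DETECT_NO_ALERT:
        # Card is created during the window; the alert is suppressed for good.
        return stopped_at, None
    # DISABLE_DETECTION: nothing happens during the window.
    if now >= window_end:
        # After expiry the stopped service is detected and alerted on.
        return window_end, window_end
    return None, None


# Service stops at 8:10, window runs 8:00 to 9:00, checked at 9:05.
print(simulate(Mode.DETECT_NO_ALERT, 490, 480, 540, 545))    # prints (490, None)
print(simulate(Mode.DISABLE_DETECTION, 490, 480, 540, 545))  # prints (540, 540)
```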
@ChadTurner but our maintenance window is set to 'Disable problem detection during maintenance'. So if I am reading and understanding your comment correctly, it should have created a problem? Or maybe you had it flipped around in your comment? We even had the event logged in the Events section on the host (it was an availability event) for the service stopping, and it showed the maintenance window ID (dt.event_maintenance_windows_ids). This tells me that yes, Dynatrace saw the service stop, but the event was in a maintenance window. Please correct me if I am wrong.
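The check described here (does the logged event carry a maintenance window ID?) could be scripted roughly as below. This is a sketch under assumptions: the event is represented as a plain Python dict, the property name is copied from the post as written, and the sample payloads and the function name `logged_during_maintenance` are hypothetical.

```python
# Property name copied from the post as written; the real key may differ.
MAINTENANCE_KEY = "dt.event_maintenance_windows_ids"


def logged_during_maintenance(event_properties):
    """Return True if the event carries at least one maintenance window ID."""
    return bool(event_properties.get(MAINTENANCE_KEY))


# Hypothetical event payloads, for illustration only.
event_in_window = {"title": "service stopped", MAINTENANCE_KEY: ["hypothetical-window-id"]}
event_outside_window = {"title": "service stopped"}

print(logged_during_maintenance(event_in_window))       # prints True
print(logged_during_maintenance(event_outside_window))  # prints False
```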
@ChadTurner I've created the RFE below. Essentially, the issue is that we are removing the host from the maintenance window; the window itself does not expire. All works fine if the window expires, but not if you remove the entity from the window. The reason we are doing it that way, in a nutshell, comes down to a limitation on the number of windows you can create. All of this is outlined in my RFE.
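The behavior being asked for can be sketched as a state-based check: raise a problem whenever the service is stopped, the host is not in a maintenance window, and no problem is already open, regardless of whether the window expired or the host was removed from it. This is a hypothetical Python illustration of the request, not a description of how Dynatrace works; `problems_after_maintenance` and the sample data are made up.

```python
def problems_after_maintenance(samples):
    """Return the labels at which a problem would be raised.

    `samples` is a time-ordered list of (label, service_state, in_maintenance).
    A problem is raised when the service is stopped, the host is outside any
    maintenance window, and no problem is currently open.
    """
    raised = []
    problem_open = False
    for label, state, in_maintenance in samples:
        if state == "running":
            # A healthy service closes any open problem.
            problem_open = False
        elif not in_maintenance and not problem_open:
            raised.append(label)
            problem_open = True
    return raised


samples = [
    ("in window, running", "running", True),
    ("in window, stopped", "stopped", True),
    ("host removed from window", "stopped", False),
    ("later check", "stopped", False),
]
print(problems_after_maintenance(samples))  # prints ['host removed from window']
```

Because the check looks at the current state instead of waiting for a new transition, it does not matter how the host left the window.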