Solved: Problem Detail should mention Process instance along with affected service methods

praveen_begur · ‎06 Apr 2018

Hi,

This is for the use case of enabling auto-remediation or self-healing.

The problem detail REST API mentions the events that contributed to the root cause of a Problem. These Events can be potentially consumed by an external orchestration tool to undertake corrective actions if proper context and actionable details are provided.

When the root cause is an event at the level of a Service method, for example 'CheckDestination', the Dynatrace REST API does not specify which Process Instance is contributing to the slow execution of 'CheckDestination'.

If a Service is associated with a Process Group that has multiple Processes (i.e. Process Group Instances) all executing the code for 'CheckDestination', the third-party orchestration tool will not know which of the Process Group Instances needs to be examined further.

So I request Product Management to add the name and entity ID of the ProcessGroupInstance(s), along with the Service ID, to the problem detail REST API.

See the JSON below, which shows only the Service ID and does not mention which Process Group Instance needs to be fixed.

{
  "startTime": 1522994220000,
  "endTime": 1522996500000,
  "entityId": "SERVICE-EC0AEC20F017D197",
  "entityName": "CheckDestination",
  "severityLevel": "PERFORMANCE",
  "impactLevel": "SERVICE",
  "eventType": "SERVICE_RESPONSE_TIME_DEGRADED",
  "status": "CLOSED",
  "severities": [
    {
      "context": "FAILURE_RATE",
      "value": 9803186.0,
      "unit": "MicroSecond (µs)"
    },
    {
      "context": "RESPONSE_TIME_50TH_PERCENTILE",
      "value": 9803186.0,
      "unit": "MicroSecond (µs)"
    },
    {
      "context": "RESPONSE_TIME_90TH_PERCENTILE",
      "value": 2.2907056E7,
      "unit": "MicroSecond (µs)"
    }
  ],
  "isRootCause": true,
  "serviceMethodGroup": "Default service method group",
  "referenceResponseTime90thPercentile": 140563.8,
  "affectedRequestsPerMinute": 353.0,
  "referenceResponseTime50thPercentile": 121407.0,
  "service": "CheckDestination",
  "percentile": "50th"
}
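
To illustrate the gap, here is a minimal sketch of how an external orchestration tool might pick its remediation targets from such events. It assumes the events have already been fetched and parsed into Python dicts with the fields shown above. The function names are hypothetical, and deriving the entity type from the prefix of the entity ID is an assumption for illustration, not a documented part of the REST API.

```python
import json

# Sample event, trimmed from the JSON in this post.
SAMPLE_EVENT = json.loads("""
{
  "entityId": "SERVICE-EC0AEC20F017D197",
  "entityName": "CheckDestination",
  "eventType": "SERVICE_RESPONSE_TIME_DEGRADED",
  "isRootCause": true
}
""")


def entity_type(entity_id):
    """Return the type prefix of an entity ID, e.g. 'SERVICE'.

    Assumption: the type is everything before the last hyphen.
    """
    return entity_id.rpartition("-")[0]


def root_cause_targets(events):
    """Return (entity type, entity ID, entity name) for each root-cause event."""
    return [
        (entity_type(e["entityId"]), e["entityId"], e["entityName"])
        for e in events
        if e.get("isRootCause")
    ]


targets = root_cause_targets([SAMPLE_EVENT])
print(targets)
```

For the event above, the only target is of type SERVICE, so the tool has no Process Group Instance to act on.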

wolfgang_beer · ‎09 Apr 2018

The issue here is that we do not always have an event or incident on the underlying process. If a process event is also raised, it is of course part of the correlated problem as well.

So in an example where you have a cluster of 10 service instances and we detect a service degradation, this does not necessarily mean that any of the underlying service instances or processes shows outlying behavior.
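
The point can be illustrated with a small sketch. All numbers are made up, and the simple "more than twice the median" rule is a stand-in chosen for illustration, not how Dynatrace actually detects degradations or outliers.

```python
from statistics import median

# Hypothetical median response times in ms for 10 service instances.
baseline_ms = [120, 118, 125, 121, 119, 123, 122, 120, 124, 121]
# Every instance slows down by the same factor, e.g. due to a slow shared backend.
current_ms = [t * 80 for t in baseline_ms]


def outliers(values, factor=2.0):
    """Return indices of values more than `factor` times the median of all values."""
    mid = median(values)
    return [i for i, v in enumerate(values) if v > factor * mid]


# The service as a whole is clearly degraded against its baseline...
service_degraded = median(current_ms) > 2.0 * median(baseline_ms)
print(service_degraded)      # True
# ...yet no single instance stands out from the others.
print(outliers(current_ms))  # []
```

In this scenario there is a service-level event but nothing to flag on any individual instance, so there is no specific process to name in the problem detail.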