Ruben Dario Garzon Toro
Observability Specialist & Pre-Sales Consultant
Submitted for the MCP Server Community Challenge — April 2026
Modern SRE teams face "Analysis Paralysis": when a service degrades, the volume of logs, metrics, and vulnerabilities in Grail is too vast for immediate human correlation. We noticed that while Dynatrace provides the data, the chain of reasoning required to link a log error to a specific vulnerability or a metadata-driven entity relationship was still a manual task.
We needed a system that doesn't just show data, but reasons through it autonomously.
We engineered a web-based orchestration layer that uses the remote Dynatrace Model Context Protocol (MCP) server to power a three-stage autonomous agent pipeline:
Step 1: The Collector (Data Ingestion)
Uses DQL via the Dynatrace APIs to gather the "State of the Union": logs, metrics, and vulnerabilities, focusing on the top 10 log anomalies.
Step 2: The Analyzer (Contextual Reasoning)
The core "brain." It uses the MCP's entity metadata to determine if a log is a cause or an effect. It groups issues by Smartscape entities to identify the blast radius.
Step 3: The Reporter (Governance & Delivery)
Validates the findings against SRE best practices using an LLM, manages execution stages, and delivers a time-stamped executive report via email.
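In code, the Collector → Analyzer → Reporter (C.A.R.) hand-off is a simple sequential pipeline. Below is a minimal Python sketch of that flow; the `mcp.call` wrapper, the argument shapes, and the DQL statement are illustrative assumptions on our part, not the exact implementation (tool names follow the governance table below):

```python
# Minimal sketch of the C.A.R. hand-off. `mcp` stands in for a Dynatrace
# MCP client session; result shapes are illustrative.
TOP_PATTERNS_DQL = """
fetch logs
| filter loglevel == "ERROR"
| summarize count = count(), by: {content, dt.entity.process_group}
| sort count desc
| limit 10
"""

def collect(mcp) -> list[dict]:
    # Step 1 - Collector: read-only DQL sweep for the top 10 error patterns.
    return mcp.call("execute_dql", {"dqlStatement": TOP_PATTERNS_DQL})

def analyze(mcp, patterns: list[dict]) -> dict:
    # Step 2 - Analyzer: group patterns by Smartscape entity to estimate the
    # blast radius, then enrich each entity with its metadata.
    by_entity: dict[str, list[dict]] = {}
    for p in patterns:
        entity = p.get("dt.entity.process_group") or "unknown"
        by_entity.setdefault(entity, []).append(p)
    entities = {e: mcp.call("get_entity_details", {"entityId": e})
                for e in by_entity if e != "unknown"}
    return {"blast_radius": by_entity, "entities": entities}

def report(mcp, analysis: dict, confidence: float) -> None:
    # Step 3 - Reporter: governance gate first, then delivery.
    if confidence < 0.85:
        raise RuntimeError("Confidence below 85% - report withheld")
    mcp.call("send_email", {"subject": "[AI-ANALYSIS] SRE Master Report",
                            "body": str(analysis)})
```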
Each agent operates with a specific scope to ensure reliability and prevent "AI hallucinations":
| Agent | SDF Layer (Context) | Primary Toolset | Governance Rule |
| --- | --- | --- | --- |
| Collector | Signal | execute_dql, get_logs | Read-only. Must identify 10 distinct patterns before passing. |
| Analyzer | Defect / Failure | get_entity_details, list_vulnerabilities | Correlative only. Must link log to entity metadata. |
| Reporter | Reporting | send_email, status_tracker | Validation. Cannot send if Confidence Score < 85%. |
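These scopes can also be enforced mechanically rather than by prompt alone. One way to pin each agent to its row of the table is a thin allow-list wrapper around the MCP client; the `ScopedMcp` class below is our assumption, not an MCP feature:

```python
# Allow-lists mirroring the governance table; any call outside an agent's
# scope is rejected before it ever reaches the MCP server.
ALLOWED_TOOLS = {
    "collector": {"execute_dql", "get_logs"},
    "analyzer":  {"get_entity_details", "list_vulnerabilities"},
    "reporter":  {"send_email", "status_tracker"},
}

class ScopedMcp:
    def __init__(self, mcp, agent: str):
        self.mcp, self.agent = mcp, agent

    def call(self, tool: str, args: dict):
        if tool not in ALLOWED_TOOLS[self.agent]:
            raise PermissionError(f"{self.agent} may not call {tool}")
        return self.mcp.call(tool, args)
```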
Unlike a single-prompt approach, our solution uses multi-stage LLM validation:
Autonomous Re-launch: If the Analyzer finds insufficient data, it triggers the Collector for a deeper DQL sweep.
Stage-Aware Timing: Each agent tracks its own execution time, ensuring the system stays within the defined SRE response windows (SLAs).
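Continuing the sketch above, the re-launch loop and per-stage timing might look like this; the SLA budgets, the sufficiency check, and the confidence heuristic are all placeholders for the LLM-backed logic:

```python
import time

STAGE_SLA_SECONDS = {"collector": 15, "analyzer": 20, "reporter": 10}  # illustrative
MAX_RELAUNCHES = 2

def run_stage(name: str, fn, *args):
    # Stage-aware timing: each agent tracks its own execution time
    # against its SRE response window.
    start = time.monotonic()
    result = fn(*args)
    elapsed = time.monotonic() - start
    if elapsed > STAGE_SLA_SECONDS[name]:
        print(f"warning: {name} exceeded its window ({elapsed:.1f}s)")
    return result

def triage(mcp):
    patterns, analysis = [], {"entities": {}}
    for _ in range(1 + MAX_RELAUNCHES):
        patterns = run_stage("collector", collect, mcp)
        analysis = run_stage("analyzer", analyze, mcp, patterns)
        if len(patterns) >= 10:  # stand-in for the Analyzer's sufficiency judgment
            break
        # Autonomous re-launch: a real implementation would widen the
        # Collector's DQL scan window before the next sweep.
    confidence = min(1.0, len(analysis["entities"]) / max(len(patterns), 1))
    run_stage("reporter", report, mcp, analysis, confidence)
```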
Efficiency: Reduced initial incident triage time from 30 minutes to 45 seconds.
Accuracy: The grouping logic identified "hidden" dependencies that manual log searches often missed.
Scalability: By using the MCP Server, the agents understand the environment topology without hardcoded configurations.
The ultimate goal of the C.A.R. Framework is to move from "Data Noise" to "Actionable Wisdom." Once the Reporter Agent validates the analysis, it dispatches an automated SRE Master Technical Report via email.
This is not just a log dump; it is a structured diagnostic summary. Here is an example of the autonomous output generated after analyzing an Extensions Controller failure:
Subject: [AI-ANALYSIS] SRE Master Report: Extensions Controller Failure Analysis
1. EXECUTIVE SUMMARY: The primary bottleneck is a loss of connectivity with critical external data sources (SNMP and JDBC). The ActiveGate is reporting massive failures because dependent modules cannot establish connections, creating a cascade effect on the endpoint polling process.
2. DEPENDENCY & ROOT CAUSE ANALYSIS:
Root Cause (Level 1): Protocol-level connectivity failures: [ERROR] x10560 (SNMP) and [ERROR] x1071 (JDBC). These indicate network issues, credential expiration, or external service downtime.
Cascade Effect (Level 2): Error x714 (EndpointPollerFactory) is a symptom, not the cause. It occurs because the system is attempting to instantiate data sources with failed Level 1 dependencies.
3. PREVENTIVE ACTION PLAN:
Network Validation: Verify ActiveGate egress traffic and firewall rules for target SNMP devices and JDBC databases.
Credential Audit: Review security groups and access tokens for these external services.
Sequence: Only after stabilizing Level 1 connections will the EndpointPollerFactory error rate normalize.
Technical Context:
Entity: Dynatrace ActiveGate Extensions Controller
ID: PROCESS_GROUP-B7905CE3A929BE7F
Key Metrics: Memory: 1.5% max | CPU Stalls: 0.0% max
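Assembling that e-mail body is straightforward once the Analyzer's findings are structured. A hypothetical renderer, with section headings mirroring the example above (the signature and field names are illustrative):

```python
# Hypothetical renderer: turns the Analyzer's findings into the
# sectioned report shown above.
def render_report(summary: str, root_cause: list[str], actions: list[str],
                  context: dict[str, str]) -> str:
    lines = ["1. EXECUTIVE SUMMARY:", summary, "",
             "2. DEPENDENCY & ROOT CAUSE ANALYSIS:", *root_cause, "",
             "3. PREVENTIVE ACTION PLAN:", *actions, "",
             "Technical Context:"]
    lines += [f"{key}: {value}" for key, value in context.items()]
    return "\n".join(lines)
```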
28 Apr 2026 01:57 PM
Hey Ruben, very interesting framework. Can you elaborate on the toolset you use (execute_dql, get_logs, get_entity_details)?
29 Apr 2026 08:13 AM
Hi Hans,
Our framework leverages the Dynatrace MCP as a strategic orchestrator rather than a simple data bridge. By integrating native AI capabilities, we provide a 360° Ops + Security diagnostic through four key pillars:
Advanced Data Processing (execute-dql): We use Grail to aggregate millions of logs into real-time "Error Patterns." This reduces noise and reconstructs entity hierarchies (Host/Process).
Predictive Operations (timeseries-forecast & timeseries-novelty-detection): We’ve moved from reactive to proactive. By using novelty detection to filter out expected spikes and forecasting to project memory consumption (e.g., a "72-hour exhaustion warning"), our C.A.R. reports offer actionable foresight.
Davis AI Context (query-problems): Instead of isolated events, we query the Davis AI engine directly. This links log patterns to existing root-cause problems, enriching diagnostics with severity data while preventing alert duplication.
Full-Stack Security (get-vulnerabilities): We integrate AppSec by cross-referencing application errors with active CVEs. If a failing service has a critical vulnerability, the framework automatically elevates the incident priority.
The Strategy: Use execute-dql for massive data aggregation, but delegate the heavy intelligence to native tools like timeseries-forecast and get-vulnerabilities for a unified, high-context diagnostic.
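Concretely, one 360° diagnostic pass might chain those four pillars as below. This is a sketch under the same assumptions as before: `mcp.call` is a placeholder client, and the argument shapes for each tool are illustrative, not the server's exact schema:

```python
# One full-stack diagnostic pass chaining the four pillars.
def full_stack_diagnostic(mcp, entity_id: str) -> dict:
    # Pillar 1: aggregate raw logs into error patterns via Grail.
    patterns = mcp.call("execute-dql", {"dqlStatement":
        'fetch logs | filter loglevel == "ERROR" '
        '| summarize count = count(), by: {content} | sort count desc | limit 10'})

    # Pillar 2: proactive signals - filter expected spikes, project memory use.
    novelty = mcp.call("timeseries-novelty-detection", {"entityId": entity_id})
    forecast = mcp.call("timeseries-forecast", {"entityId": entity_id})

    # Pillar 3: link patterns to existing Davis AI problems to avoid duplicates.
    problems = mcp.call("query-problems", {"entityId": entity_id})

    # Pillar 4: cross-reference active CVEs; a critical finding raises priority.
    vulns = mcp.call("get-vulnerabilities", {"entityId": entity_id})
    priority = "P1" if any(v.get("severity") == "CRITICAL" for v in vulns) else "P3"

    return {"patterns": patterns, "novelty": novelty, "forecast": forecast,
            "problems": problems, "vulnerabilities": vulns, "priority": priority}
```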