Ruben Dario Garzon Toro
Observability Specialist & Pre-Sales Consultant
Submitted for the MCP Server Community Challenge — April 2026
Modern SRE teams face "Analysis Paralysis": when a service degrades, the volume of logs, metrics, and vulnerabilities in Grail is too vast for immediate human correlation. We noticed that while Dynatrace provides the data, the chain of reasoning required to link a log error to a specific vulnerability or a metadata-driven entity relationship was still a manual task.
We needed a system that doesn't just show data, but reasons through it autonomously.
We engineered a web-based orchestration layer that uses the remote Dynatrace Model Context Protocol (MCP) server to power a three-stage autonomous agent pipeline:
Step 1: The Collector (Data Ingestion)
Uses DQL via the Dynatrace APIs to gather the "State of the Union": logs, metrics, and vulnerabilities, focusing on the top 10 log anomalies.
Step 2: The Analyzer (Contextual Reasoning)
The core "brain." It uses the MCP's entity metadata to determine if a log is a cause or an effect. It groups issues by Smartscape entities to identify the blast radius.
Step 3: The Reporter (Governance & Delivery)
Validates the findings against SRE best practices using an LLM, manages execution stages, and delivers a time-stamped executive report via email.
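In code, the Collector → Analyzer → Reporter (C.A.R.) hand-off is a simple sequential pipeline. Below is a minimal Python sketch of that flow; the `mcp.call` wrapper, the argument shapes, and the DQL statement are illustrative assumptions on our part, not the exact implementation (tool names follow the governance table below):

```python
# Minimal sketch of the C.A.R. hand-off. `mcp` stands in for a Dynatrace
# MCP client session; result shapes are illustrative.
TOP_PATTERNS_DQL = """
fetch logs
| filter loglevel == "ERROR"
| summarize count = count(), by: {content, dt.entity.process_group}
| sort count desc
| limit 10
"""

def collect(mcp) -> list[dict]:
    # Step 1 - Collector: read-only DQL sweep for the top 10 error patterns.
    return mcp.call("execute_dql", {"dqlStatement": TOP_PATTERNS_DQL})

def analyze(mcp, patterns: list[dict]) -> dict:
    # Step 2 - Analyzer: group patterns by Smartscape entity to estimate the
    # blast radius, then enrich each entity with its metadata.
    by_entity: dict[str, list[dict]] = {}
    for p in patterns:
        entity = p.get("dt.entity.process_group") or "unknown"
        by_entity.setdefault(entity, []).append(p)
    entities = {e: mcp.call("get_entity_details", {"entityId": e})
                for e in by_entity if e != "unknown"}
    return {"blast_radius": by_entity, "entities": entities}

def report(mcp, analysis: dict, confidence: float) -> None:
    # Step 3 - Reporter: governance gate first, then delivery.
    if confidence < 0.85:
        raise RuntimeError("Confidence below 85% - report withheld")
    mcp.call("send_email", {"subject": "[AI-ANALYSIS] SRE Master Report",
                            "body": str(analysis)})
```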
Each agent operates with a specific scope to ensure reliability and prevent "AI hallucinations":
| Agent | SDF Layer (Context) | Primary Toolset | Governance Rule |
| --- | --- | --- | --- |
| Collector | Signal | execute_dql, get_logs | Read-only. Must identify 10 distinct patterns before passing. |
| Analyzer | Defect / Failure | get_entity_details, list_vulnerabilities | Correlative only. Must link log to entity metadata. |
| Reporter | Reporting | send_email, status_tracker | Validation. Cannot send if Confidence Score < 85%. |
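These scopes can also be enforced mechanically rather than by prompt alone. One way to pin each agent to its row of the table is a thin allow-list wrapper around the MCP client; the `ScopedMcp` class below is our assumption, not an MCP feature:

```python
# Allow-lists mirroring the governance table; any call outside an agent's
# scope is rejected before it ever reaches the MCP server.
ALLOWED_TOOLS = {
    "collector": {"execute_dql", "get_logs"},
    "analyzer":  {"get_entity_details", "list_vulnerabilities"},
    "reporter":  {"send_email", "status_tracker"},
}

class ScopedMcp:
    def __init__(self, mcp, agent: str):
        self.mcp, self.agent = mcp, agent

    def call(self, tool: str, args: dict):
        if tool not in ALLOWED_TOOLS[self.agent]:
            raise PermissionError(f"{self.agent} may not call {tool}")
        return self.mcp.call(tool, args)
```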
Unlike a single-prompt approach, our solution uses multi-stage LLM validation:
Autonomous Re-launch: If the Analyzer finds insufficient data, it triggers the Collector for a deeper DQL sweep.
Stage-Aware Timing: Each agent tracks its own execution time, ensuring the system stays within the defined SRE response windows (SLAs).
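Continuing the sketch above, the re-launch loop and per-stage timing might look like this; the SLA budgets, the sufficiency check, and the confidence heuristic are all placeholders for the LLM-backed logic:

```python
import time

STAGE_SLA_SECONDS = {"collector": 15, "analyzer": 20, "reporter": 10}  # illustrative
MAX_RELAUNCHES = 2

def run_stage(name: str, fn, *args):
    # Stage-aware timing: each agent tracks its own execution time
    # against its SRE response window.
    start = time.monotonic()
    result = fn(*args)
    elapsed = time.monotonic() - start
    if elapsed > STAGE_SLA_SECONDS[name]:
        print(f"warning: {name} exceeded its window ({elapsed:.1f}s)")
    return result

def triage(mcp):
    patterns, analysis = [], {"entities": {}}
    for _ in range(1 + MAX_RELAUNCHES):
        patterns = run_stage("collector", collect, mcp)
        analysis = run_stage("analyzer", analyze, mcp, patterns)
        if len(patterns) >= 10:  # stand-in for the Analyzer's sufficiency judgment
            break
        # Autonomous re-launch: a real implementation would widen the
        # Collector's DQL scan window before the next sweep.
    confidence = min(1.0, len(analysis["entities"]) / max(len(patterns), 1))
    run_stage("reporter", report, mcp, analysis, confidence)
```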
Efficiency: Reduced initial incident triage time from 30 minutes to 45 seconds.
Accuracy: The grouping logic identified "hidden" dependencies that manual log searches often missed.
Scalability: By using the MCP Server, the agents understand the environment topology without hardcoded configurations.
The ultimate goal of the C.A.R. Framework is to move from "Data Noise" to "Actionable Wisdom." Once the Reporter Agent validates the analysis, it dispatches an automated SRE Master Technical Report via email.
This is not just a log dump; it is a structured diagnostic summary. Here is an example of the autonomous output generated after analyzing an Extensions Controller failure:
Subject: [AI-ANALYSIS] SRE Master Report: Extensions Controller Failure Analysis
1. EXECUTIVE SUMMARY: The primary bottleneck is a loss of connectivity with critical external data sources (SNMP and JDBC). The ActiveGate is reporting massive failures because dependent modules cannot establish connections, creating a cascade effect on the endpoint polling process.
2. DEPENDENCY & ROOT CAUSE ANALYSIS:
Root Cause (Level 1): Protocol-level connectivity failures: [ERROR] x10560 (SNMP) and [ERROR] x1071 (JDBC). These indicate network issues, credential expiration, or external service downtime.
Cascade Effect (Level 2): Error x714 (EndpointPollerFactory) is a symptom, not the cause. It occurs because the system is attempting to instantiate data sources with failed Level 1 dependencies.
3. PREVENTIVE ACTION PLAN:
Network Validation: Verify ActiveGate egress traffic and firewall rules for target SNMP devices and JDBC databases.
Credential Audit: Review security groups and access tokens for these external services.
Sequence: Only after stabilizing Level 1 connections will the EndpointPollerFactory error rate normalize.
Technical Context:
Entity: Dynatrace ActiveGate Extensions Controller
ID: PROCESS_GROUP-B7905CE3A929BE7F
Key Metrics: Memory: 1.5% max | CPU Stalls: 0.0% max
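Assembling that e-mail body is straightforward once the Analyzer's findings are structured. A hypothetical renderer, with section headings mirroring the example above (the signature and field names are illustrative):

```python
# Hypothetical renderer: turns the Analyzer's findings into the
# sectioned report shown above.
def render_report(summary: str, root_cause: list[str], actions: list[str],
                  context: dict[str, str]) -> str:
    lines = ["1. EXECUTIVE SUMMARY:", summary, "",
             "2. DEPENDENCY & ROOT CAUSE ANALYSIS:", *root_cause, "",
             "3. PREVENTIVE ACTION PLAN:", *actions, "",
             "Technical Context:"]
    lines += [f"{key}: {value}" for key, value in context.items()]
    return "\n".join(lines)
```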
28 Apr 2026 01:57 PM
Hey Ruben, very interesting framework. Can you elaborate on the toolset you use (execute_dql, get_logs, get_entity_details)?
29 Apr 2026 08:13 AM
Hi Hans,
Our framework leverages the Dynatrace MCP as a strategic orchestrator rather than a simple data bridge. By integrating native AI capabilities, we provide a 360° Ops + Security diagnostic through four key pillars:
Advanced Data Processing (execute-dql): We use Grail to aggregate millions of logs into real-time "Error Patterns." This reduces noise and reconstructs entity hierarchies (Host/Process).
Predictive Operations (timeseries-forecast & timeseries-novelty-detection): We’ve moved from reactive to proactive. By using novelty detection to filter out expected spikes and forecasting to project memory consumption (e.g., a "72-hour exhaustion warning"), our C.A.R. reports offer actionable foresight.
Davis AI Context (query-problems): Instead of isolated events, we query the Davis AI engine directly. This links log patterns to existing root-cause problems, enriching diagnostics with severity data while preventing alert duplication.
Full-Stack Security (get-vulnerabilities): We integrate AppSec by cross-referencing application errors with active CVEs. If a failing service has a critical vulnerability, the framework automatically elevates the incident priority.
The Strategy: Use execute-dql for massive data aggregation, but delegate the heavy intelligence to native tools like timeseries-forecast and get-vulnerabilities for a unified, high-context diagnostic.
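Concretely, one 360° diagnostic pass might chain those four pillars as below. This is a sketch under the same assumptions as before: `mcp.call` is a placeholder client, and the argument shapes for each tool are illustrative, not the server's exact schema:

```python
# One full-stack diagnostic pass chaining the four pillars.
def full_stack_diagnostic(mcp, entity_id: str) -> dict:
    # Pillar 1: aggregate raw logs into error patterns via Grail.
    patterns = mcp.call("execute-dql", {"dqlStatement":
        'fetch logs | filter loglevel == "ERROR" '
        '| summarize count = count(), by: {content} | sort count desc | limit 10'})

    # Pillar 2: proactive signals - filter expected spikes, project memory use.
    novelty = mcp.call("timeseries-novelty-detection", {"entityId": entity_id})
    forecast = mcp.call("timeseries-forecast", {"entityId": entity_id})

    # Pillar 3: link patterns to existing Davis AI problems to avoid duplicates.
    problems = mcp.call("query-problems", {"entityId": entity_id})

    # Pillar 4: cross-reference active CVEs; a critical finding raises priority.
    vulns = mcp.call("get-vulnerabilities", {"entityId": entity_id})
    priority = "P1" if any(v.get("severity") == "CRITICAL" for v in vulns) else "P3"

    return {"patterns": patterns, "novelty": novelty, "forecast": forecast,
            "problems": problems, "vulnerabilities": vulns, "priority": priority}
```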