How We Built a Signal–Defect–Failure Classification Framework, and Extended It to Govern Remote Model Context Protocol (MCP) Server Interactions — Ready for Federal Scale
Author: Randy Chambers
Role: Dynatrace Practice Lead
Organization: Discipline Consulting Group LLC
Contact: rchambers@disciplineconsulting.com | 540-645-1149
Submitted for the MCP Server Challenge — April 2026
――――――――――――――――――――――――――――――――――――――――
When Dynatrace redesigned its certification exam around scenario-based diagnostic reasoning in October 2024, it exposed a gap we saw firsthand. Our candidates were knowledgeable — they understood OneAgent, Davis Artificial Intelligence (AI), Smartscape, Grail, and the platform's architecture. But they couldn't solve scenario-based problems consistently because they lacked a structured diagnostic methodology. The training pipeline taught what things are. The exam tested how to use them to solve problems. Nothing bridged that gap.
For organizations operating in federal environments — where every automated action must be auditable, classified, and authorized — ungoverned AI agent access is a deployment blocker. You can't hand an AI agent 14 tools and say "figure it out." You need a classification framework that determines what the agent can see, what it can do, and what it cannot do, based on the data it's working with. This isn't hypothetical. Dynatrace is Federal Risk and Authorization Management Program (FedRAMP) authorized and actively pursuing federal market expansion. Federal agencies are already deploying Dynatrace in production environments governed by NIST, FISMA, and FedRAMP continuous monitoring requirements. When these agencies adopt the MCP Server for agentic operations, every AI agent action must satisfy the same compliance requirements as every human operator action. The governance gap isn't a future problem — it's a deployment prerequisite.
We already understood this problem. For our federal customers operating hybrid scientific environments, we built the Integrated Continuous Security Methodology (CSM) — an operational cycle that keeps these environments defensible, auditable, and mission-ready. CSM treats security as an ongoing, measurable process rather than a one-time project. The cycle operates across five continuous phases: Detect, Respond, Remediate, Verify, Report. CSM connects telemetry, people, engineering, and governance so that threats are found quickly, handled consistently, and lessons are fed back into controls and documentation.
At the same time, we noticed something about Davis AI itself. Internally, the platform operates on a classification pipeline that nobody had named.
This classification pipeline — the baseline calculation, event correlation, and topology-aware root cause analysis engine that Wolfgang Beer has architected inside Davis AI — is the foundation that SDF formalizes. We didn't invent the classification logic. We named what was already there and extended it to govern the MCP Server boundary.
When the Dynatrace MCP Server launched at Perform 2026 — announced as "the connective tissue between agentic systems and Dynatrace Intelligence" and deployed internally as Customer Zero — it gave AI agents direct access to 14 tools and 6 agent-level capabilities. That's powerful. It's also ungoverned. When an AI agent connects to the MCP Server, it can call get_environment_info AND list_problems AND send_slack_message in the same session. Nothing in the MCP protocol itself distinguishes between reading baseline metrics and triggering remediation workflows. The agent sees tools. It doesn't see boundaries.
We set out to change that — first for certification, then for production operations, and now for AI agent governance through the MCP Server. To complement the CSM and extend governance to agentic AI, we built the SDF Governance Guard Framework.
| CSM Phase | What Happens Operationally | Davis AI Function | SDF Layer | MCP Agent Permission |
|---|---|---|---|---|
| Detect | Continuous monitoring identifies anomalies in the hybrid environment | OneAgent ingests telemetry; Davis establishes and monitors baselines | Signal | OBSERVE — read metrics, topology, logs |
| Respond | Qualified anomaly triggers investigation and stakeholder notification | Davis generates events when baselines are breached; anomaly qualified but impact not confirmed | Defect | INVESTIGATE — create tickets, send notifications, query related entities |
| Remediate | Confirmed problem triggers pre-approved corrective action | Davis correlates events into problems with confirmed root cause and affected entity chain | Failure | REMEDIATE — execute pre-approved playbooks within defined blast radius |
| Verify | Confirm resolution; validate metrics return to baseline | Davis monitors for auto-resolution; telemetry confirms baseline recovery | Signal (return) | OBSERVE — confirm baseline recovery, validate remediation effectiveness |
| Report | Document the full lifecycle; feed lessons into controls | Complete audit trail across all classification layers; governance artifacts updated | All layers | AUDIT — full chain: tool call → data → SDF classification → permission → action → outcome |
This alignment means the CSM cycle our federal customers already operate becomes enforceable through the MCP Server. When an AI agent calls an MCP tool, the SDF Governance Guard classifies the data, resolves the permission, and maps the action to the corresponding CSM phase — creating a single, unified governance model for both human operations and agentic AI.
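As a sketch, the resolution chain described above can be expressed as a simple lookup. The layer, phase, and permission names come from the alignment table; the function and dictionary names are illustrative, not part of any Dynatrace API.

```python
# Illustrative sketch: resolve the CSM phase and agent permission level
# from a data item's SDF classification. Names mirror the table above;
# the structure itself is an assumption about how a guard might be built.
SDF_TO_GOVERNANCE = {
    "Signal":  {"csm_phase": "Detect",    "permission": "OBSERVE"},
    "Defect":  {"csm_phase": "Respond",   "permission": "INVESTIGATE"},
    "Failure": {"csm_phase": "Remediate", "permission": "REMEDIATE"},
}

def resolve_governance(sdf_layer: str) -> dict:
    """Map a classified SDF layer to its CSM phase and permission level."""
    if sdf_layer not in SDF_TO_GOVERNANCE:
        raise ValueError(f"Unknown SDF layer: {sdf_layer}")
    return SDF_TO_GOVERNANCE[sdf_layer]
```

The point of the sketch is that the mapping is deterministic: the same classification always yields the same phase and permission, which is what makes the chain auditable.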
The National Institute of Standards and Technology Interagency Report (NIST IR) 8011 defines an automated security assessment methodology built on defect checks — systematic evaluations that determine whether a security control is operating as intended. NIST IR 8011's assessment pipeline follows a progression that is structurally isomorphic to SDF.
This structural isomorphism means SDF classification doesn't just align with Dynatrace's internal architecture — it aligns with the federal government's own methodology for automated security assessment.
Dynatrace deployed the MCP Server internally as Customer Zero. This is where three frameworks converge: the MCP Server provides the agent access layer, SDF governs data classification, and CSM governs operational process.
At the Customer Zero mesh point, every AI agent action flows through all three: the MCP tool call retrieves data, SDF classifies it, and the classification maps to a CSM phase — ensuring the agent operates within the same operational governance that human operators follow. This is what makes the framework ready for federal scale.
――――――――――――――――――――――――――――――――――――――――
Before the MCP Server existed, we built the CSM operational cycle for our federal customers' hybrid scientific environments. CSM established the governance baseline: Detect, Respond, Remediate, Verify, Report. Every automated action — whether performed by a human operator, a script, or an AI agent — must map to a CSM phase and produce an auditable record.
We formalized what Davis AI already does internally as a three-layer classification taxonomy:
- Signal: baseline telemetry behaving within expected bounds. Observation only.
- Defect: a qualified anomaly (a Davis event) whose impact is not yet confirmed. Investigation required.
- Failure: a confirmed problem with root cause and an affected entity chain. Guardrailed remediation.
Every piece of data accessible through the MCP Server sits on one of these three layers.
We then built the LOCATE diagnostic protocol, a six-step reasoning framework that mirrors how Davis AI root-causes problems.
LOCATE is the human-executable version of Davis AI's deterministic fault-tree analysis. We use it to train practitioners AND to validate the reasoning path an AI agent should follow when operating through the MCP Server. When an agent follows the LOCATE protocol through MCP tools, its investigation maps to the CSM Respond and Remediate phases — creating a traceable reasoning chain.
We didn't build a study guide — we built a systems-engineered training architecture with the same rigor Dynatrace applies to its own platform.
Supporting infrastructure: 27 unified taxonomies in a Master Reference Catalogue, 9 governance artifacts aligned to the Department of Homeland Security (DHS) Systems Engineering Lifecycle (SELC), and a Cross-Pillar Traceability Matrix.
We deployed the SDF/LOCATE ecosystem with Dynatrace Practitioner exam candidates through Discipline Consulting Group. Candidates followed the 30-day structured mastery path, working through the SDF classification framework and the 53 scenario drills. The key differentiator: instead of memorizing platform features, candidates learned to classify observability data (Signal, Defect, or Failure), then apply the LOCATE protocol to reason through scenarios diagnostically. Candidates achieved exam scores of 85 and above. The ecosystem runs without requiring the original architect to deliver every session — facilitator-independent deployment.
We mapped the SDF classification framework to the Dynatrace x ServiceNow strategic partnership (announced October 2025). Every one of the 6 certified ServiceNow integrations operates on SDF-classified data:
| ServiceNow Integration | SDF Layer | CSM Phase | Function |
|---|---|---|---|
| Service Graph Connector | Signal | Detect | Topology synchronization |
| Event Management Connector | Signal → Defect | Detect → Respond | Qualified anomalies cross the boundary |
| Incident Integration App | Defect → Failure | Respond → Remediate | Confirmed problems with root cause context |
| Dynatrace Workflows for ServiceNow | All layers | All phases | Orchestration across SDF/CSM spectrum |
| Service Observability Connector | Signal | Detect | Context enrichment |
| Analysis AI Agent Connector | Failure | Remediate | Agentic root cause analysis |
When the Dynatrace MCP Server launched at Perform 2026, we asked: does the same SDF classification framework that governs our training and the ServiceNow integration boundary also govern what AI agents can see and do through the MCP Server? The answer was yes. We mapped every one of the 14 MCP Server tools and 6 agent-level tools to their SDF classification layer and CSM phase, and built the SDF Governance Guard.
We walked through three end-to-end MCP interaction patterns — Signal monitoring (CSM Detect), Defect investigation (CSM Respond), and Failure remediation (CSM Remediate) — demonstrating that SDF governance prevents both over-action (remediating noise) and under-action (merely alerting on confirmed outages). Each scenario includes the Verify and Report phases to complete the CSM cycle.
――――――――――――――――――――――――――――――――――――――――
| SDF Layer | CSM Phase | What the Agent Sees | What the Agent Can Do | What It CANNOT Do |
|---|---|---|---|---|
| Signal (Observe) | Detect | Baseline metrics, entity topology, logs, environment info | Read data, generate reports, explain trends, compare to baselines | Cannot create alerts, cannot send notifications, cannot trigger workflows |
| Defect (Investigate) | Respond | Davis events, vulnerabilities, Kubernetes warning/error events | Create investigation tickets, send notifications, recommend actions, query related entities for root cause hypothesis | Cannot execute remediation, cannot modify infrastructure, cannot auto-resolve |
| Failure (Remediate — Guardrailed) | Remediate | Davis problems with confirmed root cause and affected entity chain | Execute pre-approved remediation playbooks, create P1 incidents, trigger notification workflows | Cannot execute novel remediation without human approval, cannot exceed defined blast radius |
Note: The Verify and Report phases close the CSM loop: after any Failure-level action, the agent re-queries metrics (Signal) to confirm baseline recovery (Verify), and the full classification chain is logged for audit (Report).
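The permission matrix above is, in effect, an allow-list keyed by SDF layer. A minimal sketch of how a guard might enforce it follows; the action names are illustrative shorthand for the capabilities listed in the matrix, not real API identifiers.

```python
# Sketch of the Agent Permission Matrix as an allow-list. An action is
# authorized only if it is in scope for the SDF layer of the data the
# agent is acting on. Action names are illustrative assumptions.
ALLOWED_ACTIONS = {
    "Signal":  {"read_data", "generate_report", "explain_trend",
                "compare_to_baseline"},
    "Defect":  {"create_ticket", "send_notification", "recommend_action",
                "query_related_entities"},
    "Failure": {"execute_preapproved_playbook", "create_p1_incident",
                "trigger_notification_workflow"},
}

def authorize(sdf_layer: str, action: str) -> bool:
    """Return True only if the action is in scope for the data's SDF layer."""
    return action in ALLOWED_ACTIONS.get(sdf_layer, set())
```

Anything not explicitly allowed is denied, which is how the "What It CANNOT Do" column is enforced without enumerating every forbidden action.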
――――――――――――――――――――――――――――――――――――――――
| MCP Tool | SDF Layer | CSM Phase | Permission Level | Governance Rule |
|---|---|---|---|---|
| get_environment_info | Signal | Detect | OBSERVE | Read-only. No action constraints. |
| get_entity_details | Signal | Detect | OBSERVE | Read-only. Returns topology context. |
| get_ownership | Signal | Detect | OBSERVE | Read-only. Returns ownership for notification routing. |
| get_logs_for_entity | Signal | Detect | OBSERVE | Read-only. Rate limiting recommended for large log volumes. |
| verify_dql | Signal (meta) | Detect | OBSERVE | Validates DQL syntax only. No data exposure. |
| execute_dql | Depends on query | Detect → Respond → Remediate | OBSERVE → REMEDIATE | Classification depends on query results. Metrics = Signal. Events = Defect. Problems = Failure. Agent must classify results before acting. |
| get_kubernetes_events | Signal + Defect | Detect + Respond | OBSERVE + INVESTIGATE | Normal K8s events = Signal. Warning/Error events = Defect. Agent must distinguish before acting. |
| list_vulnerabilities | Defect | Respond | INVESTIGATE | Returns CVEs — qualified anomalies requiring investigation, not immediate remediation. |
| get_vulnerability_details | Defect | Respond | INVESTIGATE | Deep vulnerability context. Investigation only — remediation requires change management. |
| list_problems | Failure | Remediate | INVESTIGATE + REMEDIATE | Returns confirmed problems with root cause. Agent can recommend and (if pre-approved) execute remediation. |
| get_problem_details | Failure | Remediate | INVESTIGATE + REMEDIATE | Deep problem context. Full governance rules apply. |
| send_slack_message | Defect + Failure | Respond + Remediate | NOTIFY | Channel routing must match SDF layer: Defect → investigation channel, Failure → incident channel. |
| create_workflow_for_notification | All layers | All phases | ORCHESTRATE | Created workflows must embed SDF classification checks. |
| update_workflow | Signal (meta) | Report | ADMINISTER | Administrative action — governance review recommended. |
| Agent Tool | SDF Layer | CSM Phase | Permission Level |
|---|---|---|---|
| Grail Query Agent | All (query-dependent) | All phases | OBSERVE → REMEDIATE — depends on what's queried |
| DQL Explanation Agent | Signal (meta) | Detect | OBSERVE — explains queries, no data exposure |
| Help Agent | Signal (meta) | Detect | OBSERVE — product information only |
| Data Analysis Agent | All (query-dependent) | All phases | OBSERVE → REMEDIATE — depends on results classification |
| Root Cause Agent | Failure | Remediate | INVESTIGATE + REMEDIATE — specifically designed for problem analysis |
| Forecasting Agent | Signal + Defect | Detect + Respond | OBSERVE + INVESTIGATE — predicts future anomalies |
――――――――――――――――――――――――――――――――――――――――
These three scenarios demonstrate how SDF classification governs real MCP Server interactions through the complete CSM cycle — preventing both over-action and under-action.
Scenario 1: Signal monitoring (CSM Detect). Why this matters: without SDF governance, an eager agent might flag 68% CPU as "high" and send an unnecessary Slack alert. SDF classification prevented that false positive from becoming operational noise. In the CSM model, the Detect phase completed cleanly — no escalation to Respond.
Scenario 2: Defect investigation (CSM Respond). Governance: INVESTIGATE permission. The agent alerts and investigates but cannot remediate. The LOCATE protocol guided the investigation path, and the CSM Respond phase ensures the investigation is documented and traceable.
Scenario 3: Failure remediation (CSM Remediate). Governance: REMEDIATE permission within pre-approved scope. The agent acts but stays within guardrails; novel remediation requires human approval. The complete CSM cycle executed through MCP tools, governed by SDF classification, and fully auditable.
These three scenarios demonstrate a principle that Wolfgang Heider will recognize from his work on progressive delivery and CI/CD pipeline architecture: classification-driven progression. Just as progressive delivery gates software releases through staged validation — ensuring each promotion is earned — SDF gates agent actions through staged classification. Signal → Defect → Failure. Each escalation is validated, each action is authorized, each outcome is auditable. The same engineering rigor that governs how code moves through delivery pipelines now governs how AI agents move through observability data.
――――――――――――――――――――――――――――――――――――――――
Every SDF-governed MCP interaction follows these rules. Each rule maps to a CSM principle — ensuring that agent governance and operational governance are unified.
Rule 1 — Classification Determines Permission. The SDF layer of the data determines the agent's authorized action scope. Signal = observe (CSM Detect). Defect = investigate (CSM Respond). Failure = remediate, guardrailed (CSM Remediate).
Rule 2 — No Escalation Without Classification. An agent cannot jump from observation to remediation without confirming the SDF classification changed. Every escalation must be traceable to a classification transition — mirroring the CSM requirement that every phase transition is documented.
Rule 3 — Guardrailed Remediation Only. Even at the Failure level, remediation is limited to pre-approved playbooks with a defined blast radius. Novel remediation requires human approval. This enforces the CSM principle that corrective actions must be authorized and bounded.
Rule 4 — Classification Auditability. Every agent action traces back to the SDF classification that authorized it. Full audit chain: tool call → data returned → SDF classification → permission resolved → action taken. This directly supports CSM Report phase requirements and federal audit compliance.
Rule 5 — Severity Mapping Consistency. Davis event severity maps consistently to notification routing. Defect-level events route to investigation channels (CSM Respond). Failure-level problems route to incident channels (CSM Remediate). No cross-routing.
Rule 6 — Topology-Aware Classification. SDF classification requires topology context. Agents must query entity relationships (via get_entity_details or Smartscape data) before making classification-dependent decisions. This ensures the CSM Respond phase includes full dependency analysis.
Rule 7 — Feedback Loop Integration. Every agent action outcome feeds back into classification refinement. If a Defect-classified event escalates to Failure, the classification history informs future pattern matching. This is the CSM Verify-to-Report feedback loop — lessons learned are fed back into controls and documentation. The framework self-improves.
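Rules 2 and 4 are the mechanical core of the guard: escalation is gated on a classification transition, and every action emits a full audit-chain record. A minimal sketch under those two rules, with illustrative names throughout:

```python
# Sketch of Rules 2 and 4. LAYER_RANK orders the SDF layers; an agent may
# escalate its action scope only when the classification itself escalated
# (Rule 2), and every action produces a complete audit-chain entry:
# tool call -> data -> SDF classification -> permission -> action -> outcome
# (Rule 4). All names are illustrative assumptions, not a Dynatrace API.
LAYER_RANK = {"Signal": 0, "Defect": 1, "Failure": 2}

def escalation_allowed(prior_layer: str, current_layer: str) -> bool:
    """Rule 2: scope may only escalate on a classification transition."""
    return LAYER_RANK[current_layer] > LAYER_RANK[prior_layer]

def audit_record(tool: str, data_summary: str, sdf_layer: str,
                 permission: str, action: str, outcome: str) -> dict:
    """Rule 4: capture the full classification chain for one agent action."""
    return {
        "tool": tool,
        "data": data_summary,
        "sdf_layer": sdf_layer,
        "permission": permission,
        "action": action,
        "outcome": outcome,
    }
```

Under this sketch, an agent that observed only Signal-layer data and then requested remediation would be denied by escalation_allowed, and every allowed action would leave a record that satisfies the CSM Report phase.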
――――――――――――――――――――――――――――――――――――――――
The CSM operational cycle and the SDF classification taxonomy are structurally isomorphic — they describe the same governance logic at different layers. CSM governs operational process. SDF governs data classification. The MCP Server provides the agent access layer. At the Customer Zero mesh point, all three converge: AI agents operate through MCP tools, SDF classifies the data to determine permissions, and every action maps to a CSM phase — creating a unified governance model that makes autonomous operations auditable, deterministic, and ready for federal deployment.
――――――――――――――――――――――――――――――――――――――――
Dynatrace Intelligence is fusing deterministic AI with agentic AI for autonomous operations. The MCP Server is the connective tissue that makes this accessible to every AI ecosystem. But autonomous operations without classification governance are like giving someone the keys to a car without traffic laws.
Andreas Grabner has championed observability-driven engineering and autonomous cloud operations for over a decade — the vision that organizations should be able to deploy, operate, and remediate with confidence because observability makes the invisible visible. SDF Governance Guard is the governance layer that makes that vision safe for agentic AI. When AI agents operate autonomously through the MCP Server, every action they take must be as observable, classified, and auditable as the telemetry they're analyzing. The autonomous cloud future that Andreas Grabner has been building toward doesn't just need observability — it needs classification governance. SDF provides it.
Signal–Defect–Failure classification IS those traffic laws. The Continuous Security Methodology IS the rules of the road.
What we built is not just a governance framework for the MCP Server — it's a unified operational model where human governance (CSM) and agent governance (SDF) operate on the same classification logic. The Customer Zero deployment proved it works. The 85+ exam scores proved the classification framework is learnable by humans. The three scenarios proved it governs agent behavior through MCP tools.
For federal environments — where FedRAMP authorization and NIST compliance require every automated action to be classified and auditable — the CSM + SDF Governance Guard makes the MCP Server deployable.
The platform already speaks Signal–Defect–Failure. The CSM cycle already governs how our federal customers operate. We unified both — and now the AI agents operate within the same governance model as the humans they support.
――――――――――――――――――――――――――――――――――――――――
Dynatrace is FedRAMP authorized. Federal agencies are deploying the platform in environments governed by NIST, FISMA, and continuous monitoring mandates. As these agencies adopt the MCP Server for agentic operations, a compliance requirement emerges: All observability data in federal environments should be classifiable using a taxonomy that maps to NIST IR 8011's defect check methodology. SDF provides that mapping.
This is not a theoretical position — it's a structural fact. SDF classification is isomorphic to NIST IR 8011's automated assessment pipeline. Every MCP tool call that returns SDF-classified data produces output that is directly assessable under federal compliance frameworks. No translation layer required. No compliance bolt-on needed. The governance is native to the classification.
The implication for Dynatrace is strategic: if federal customers begin requiring SDF classification as a deployment standard for MCP-connected environments, the platform has a business incentive to formalize it — in documentation, in certification, and in the MCP Server architecture itself. Federal compliance requirements create top-down institutional pressure that complements the bottom-up community adoption already underway through initiatives like this MCP Server Challenge.
SDF Governance Guard isn't asking Dynatrace to build something new. It's asking Dynatrace to name and formalize what Davis AI already does — the same way we did. The classification pipeline exists. The compliance mapping exists. The MCP Server exists. SDF connects them into a governance model that makes autonomous operations deployable at federal scale.
――――――――――――――――――――――――――――――――――――――――
This framework is designed to be adopted by any organization connecting AI agents to Dynatrace via the MCP Server:
The full ecosystem documentation — including the Integrated CSM Model, the SDF Connector Classification Map, the SDF for Agentic Operations Governance Framework, the NIST IR 8011 Compliance Crosswalk, the Seven Pillars Training Architecture, and the 53 Scenario Drill Library — is available upon request.
――――――――――――――――――――――――――――――――――――――――
To Wolfgang Beer, Wolfgang Heider, Gabriele HB, and Andreas Grabner — this submission is built on a conviction: the classification logic that already runs inside Davis AI is too important to remain unnamed and informal. Wolfgang Beer built the engine — the baseline calculation, event correlation, and root cause analysis pipeline — that makes SDF possible. Wolfgang Heider's work on progressive delivery and CI/CD architecture demonstrates that classification-driven progression is already a proven engineering pattern. Andreas Grabner's decade-long advocacy for observability-driven autonomous operations defines the exact future that needs classification governance. And Gabriele HB's commitment to product quality and community engagement is precisely the lens through which frameworks like SDF move from community innovation to platform capability.
We named what was already there. We formalized it. We proved it works — with 85+ exam scores, with a complete CSM operational cycle, with NIST IR 8011 structural isomorphism, and with three practical MCP scenarios that demonstrate classification-governed agent behavior. SDF Governance Guard is ready for the platform. We hope this submission demonstrates why.
――――――――――――――――――――――――――――――――――――――――
Randy Chambers
Dynatrace Practice Lead | Discipline Consulting Group LLC
rchambers@disciplineconsulting.com
Here's my submission for the MCP Server Challenge: [paste your post URL here]
At Discipline Consulting Group, we built a Signal–Defect–Failure (SDF) classification framework that started as a certification training methodology — our candidates now score 85+ on the Dynatrace Practitioner exam using it. We then discovered that SDF classification governs every integration boundary between Dynatrace and its strategic partners (all 6 ServiceNow connectors, all 17 workflow connectors). So we extended it to the MCP Server: we classified all 14 MCP tools and 6 agent-level tools by SDF layer, and built a governance framework — the SDF Governance Guard — that uses data classification to determine what AI agents can see and do. For review or a comment: @wolfgang_beer, @wolfgang_heider, @GabrieleHB, @andreas_grabner
Full write-up with the Agent Permission Matrix, MCP tool governance map, three practical scenarios, and seven governance rules in the post. Looking forward to feedback!
— Randy Chambers, Dynatrace Practice Lead, Discipline Consulting Group LLC