
MCP Server Challenge entry #9: R.E.A.D.Y. - Reliability Evidence Assessment for Dynatrace Readiness

MaximilianoML
Champion

R.E.A.D.Y. Project Use Case Write-Up

R.E.A.D.Y. is a Dynatrace-native app that uses Dynatrace Remote MCP, DQL, and Dynatrace platform APIs to transform observability data into two practical operator workflows:

  1. Problems Intelligence for real operational triage
  2. Ready Report Generation for fleet-level operational readiness

The goal is not to create another chatbot-first experience. Instead, R.E.A.D.Y. reduces manual observability work by collecting, normalizing, and scoring evidence before any AI explanation is generated.

The workflow follows a simple pattern:

Dynatrace MCP / DQL / APIs
-> normalized evidence
-> deterministic checks and scoring
-> optional OpenAI summarization
-> operator-ready insight
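
As a rough TypeScript sketch of that pattern (all names here are illustrative, not the app's actual code):

```typescript
// Minimal sketch of the R.E.A.D.Y. pipeline shape; all names are illustrative.
interface Evidence {
  entityId: string;
  signals: Record<string, unknown>;
}

interface Scored {
  score: number; // deterministic, computed before any AI call
  gaps: string[];
  evidence: Evidence[];
}

// Stubbed collector -- the real implementation calls MCP, DQL, and platform APIs.
async function collectEvidence(scope: string): Promise<Evidence[]> {
  return [];
}

// Deterministic checks run over normalized evidence, never over raw AI output.
function applyDeterministicChecks(evidence: Evidence[]): Scored {
  const gaps = evidence
    .filter((e) => !("owner" in e.signals))
    .map((e) => e.entityId);
  const score = evidence.length ? 1 - gaps.length / evidence.length : 0;
  return { score, gaps, evidence };
}

// AI summarization is optional and runs last, over already-scored evidence.
async function summarizeForOperator(scored: Scored): Promise<string> {
  return `score=${scored.score.toFixed(2)}, gaps=${scored.gaps.length}`;
}

async function runPipeline(scope: string): Promise<string> {
  const evidence = await collectEvidence(scope);
  return summarizeForOperator(applyDeterministicChecks(evidence));
}
```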

Problem We Wanted to Solve

In many environments, the signals already exist in Dynatrace, but answering operational questions still requires too many manual steps.

Operators often need to:

  • review recent Davis Problems
  • understand which services, applications, or entities are most affected
  • compare categories such as Error, Slowdown, Resource, Availability, or Custom
  • inspect duration and recurrence patterns
  • identify the next entity to investigate
  • assess whether a service or application fleet is operationally ready

The data is available, but the workflow is fragmented. R.E.A.D.Y. brings that evidence together into a structured, repeatable, and operator-friendly experience.

Tools Used

1. Dynatrace Remote MCP

Dynatrace Remote MCP is used as the main evidence and context bridge.

MCP tools used in the project include:

  • execute-dql
  • get-entity-id
  • get-entity-name
  • query-problems
  • get-problem-by-id
  • find-documents
  • find-troubleshooting-guides
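
For illustration, here is a minimal sketch of calling one of these tools from TypeScript, assuming the official @modelcontextprotocol/sdk client over a Streamable HTTP transport; the endpoint URL and the exact tool argument key are assumptions, not the app's actual code:

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

// Hypothetical endpoint -- the real Dynatrace Remote MCP URL comes from configuration.
const transport = new StreamableHTTPClientTransport(
  new URL("https://example.apps.dynatrace.com/mcp"),
);
const client = new Client({ name: "ready-app", version: "0.1.0" });
await client.connect(transport);

// Call the execute-dql tool; the argument key ("dql") is an assumption.
const result = await client.callTool({
  name: "execute-dql",
  arguments: { dql: "fetch dt.davis.problems | summarize count()" },
});
console.log(result.content);

await client.close();
```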

2. Dynatrace Platform APIs and App Functions

Dynatrace App Functions are used on the backend side of the app to orchestrate evidence collection and report generation securely.

This keeps sensitive configuration and tokens out of the browser and makes the flow more production-friendly.
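
As a hedged sketch of that server-side shape (the file location, function name, and payload fields are assumptions based on Dynatrace App Toolkit conventions):

```typescript
// api/problems-overview.ts -- illustrative sketch; names and payload shape are assumptions.
// App Functions run server-side, so the MCP endpoint and token never reach the browser.

interface OverviewRequest {
  timeframe: string; // e.g. "now()-2h"
}

// Stub -- the real implementation collects evidence via MCP, DQL, and platform APIs.
async function collectProblemsEvidence(timeframe: string, token: string) {
  return { timeframe, problems: [] as unknown[] };
}

export default async function (payload: OverviewRequest) {
  // Hypothetical configuration source; a production app would use secure secret handling.
  const token = process.env.MCP_TOKEN ?? "";
  return collectProblemsEvidence(payload.timeframe, token);
}
```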

3. OpenAI API

OpenAI is used only after the evidence has already been collected, normalized, and scored.

The AI layer is optional and is used to generate concise operator-facing explanations from the real evidence set, not to invent conclusions.
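
A minimal sketch of that final step, assuming the official openai Node SDK; the model name and prompt wording are placeholders:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// `scored` is the already-normalized, already-scored evidence set.
async function operatorInsight(scored: object): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini", // placeholder model name
    messages: [
      {
        role: "system",
        content:
          "Summarize this observability evidence for an operator. " +
          "Do not state anything that is not supported by the evidence.",
      },
      { role: "user", content: JSON.stringify(scored) },
    ],
  });
  return completion.choices[0]?.message?.content ?? "";
}
```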

4. Playbooks

The Playbooks layer provides structured guidance for the LLM, defining how it should use Dynatrace MCP tools, DQL, and platform evidence during the analysis flow. Instead of allowing the model to guess, each playbook orients the LLM to first collect evidence, resolve entities, query Problems or telemetry when needed, validate the available context, and only then generate an operator-facing explanation. This keeps the output grounded in real Dynatrace data, reduces hallucination risk, and makes the workflow repeatable across different operational scenarios.
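
For illustration, a playbook can be expressed as a small declarative structure that is injected into the LLM's instructions; this shape is entirely hypothetical:

```typescript
// Hypothetical playbook shape -- the real app's playbook format may differ.
interface PlaybookStep {
  action: "collect-evidence" | "resolve-entity" | "query-problems" | "validate" | "explain";
  tool?: string; // MCP tool to use, e.g. "execute-dql" or "get-entity-name"
  note: string;  // guidance injected into the LLM prompt
}

const slowdownPlaybook: PlaybookStep[] = [
  { action: "collect-evidence", tool: "execute-dql", note: "Fetch recent Slowdown problems first." },
  { action: "resolve-entity", tool: "get-entity-name", note: "Resolve affected entity IDs to names." },
  { action: "validate", note: "Confirm the evidence covers the requested time window." },
  { action: "explain", note: "Only now produce the operator-facing explanation." },
];
```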

Primary Use Case 1: Problems Intelligence for Real Operational Triage

Scenario

An operator wants to understand the health of a selected scope, such as services, applications, frontends, or infrastructure, over a time window like the last 2 hours, 24 hours, 7 days, or 30 days.

Instead of opening multiple Dynatrace views manually, the operator opens the Problems view in R.E.A.D.Y. and filters by:

  • time window
  • impact
  • status
  • category or type

What the App Does

  1. Queries recent Davis Problems through Dynatrace MCP.
  2. Normalizes the result set into a stable Problems overview payload.
  3. Aggregates total Problems, active vs. closed Problems, status, category, time trends, duration statistics, recurrent Problems, top affected entities, and slowdown-related endpoints or services.
  4. Resolves entity IDs into readable names using MCP.
  5. Links affected entities directly to Dynatrace topology views.
  6. Optionally generates one AI Operator Insight from the real evidence set.
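
As a rough illustration of the stable overview payload mentioned in step 2 (fields are assumptions, not the app's actual schema):

```typescript
// Illustrative shape of the normalized Problems overview payload; not the app's real schema.
interface ProblemsOverview {
  total: number;
  active: number;
  closed: number;
  byCategory: Record<string, number>; // Error, Slowdown, Resource, Availability, Custom
  durationStats: { p50Minutes: number; p90Minutes: number };
  recurrent: { entityId: string; entityName?: string; occurrences: number }[];
  topAffectedEntities: { entityId: string; entityName?: string; problemCount: number }[];
}
```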

Why MCP Matters

MCP makes the integration practical because the app can use stable, named tools instead of hardcoding every tenant-specific retrieval path into the UI.

It allows the app to:

  • discover available tools
  • query Problems consistently
  • resolve entity names
  • execute DQL
  • search related documentation and troubleshooting content
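
Tool discovery in particular is a single call with an MCP client; continuing the @modelcontextprotocol/sdk sketch from above:

```typescript
// `client` is the connected MCP client from the earlier execute-dql sketch.
const { tools } = await client.listTools();
console.log(tools.map((t) => t.name));
// Expected to include names such as "execute-dql" and "query-problems".
```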

Results Achieved

  • a live Problems dashboard backed by real Dynatrace data
  • filtering by scope, status, impact, category, and time window
  • readable entity names where resolution is possible
  • direct links to affected entities in Dynatrace
  • histograms showing how long Problems stay open
  • category-level duration distribution
  • recurrence and concentration signals
  • optional AI-generated operator insight based only on collected evidence

Practical Value

This helps operators answer questions such as:

  • Are current Problems active pressure or mostly historical noise?
  • Which category dominates this time window?
  • Are Slowdown Problems short-lived or staying open too long?
  • Is one service, application, or entity repeatedly involved?
  • Which entity should be inspected next?
  • Where is operational risk concentrated?

Primary Use Case 2: Fleet-Level Ready Reports for Operational Readiness

Scenario

A platform team or SRE team wants to assess whether a fleet is operationally ready.

This is different from simply checking whether telemetry exists. A service may have traces and metrics but still be missing important operational metadata, ownership, documentation, or governance signals.

R.E.A.D.Y. currently supports readiness generation for:

  • All Services
  • All Applications

What the App Does

The Ready Report workflow collects evidence for the selected scope and evaluates readiness using deterministic checks.

Example signals include:

  • ownership metadata
  • team tags
  • environment tags
  • runbook or contact metadata
  • dashboard evidence
  • documentation evidence
  • governance metadata
  • application or service operational context

The app then generates a structured report containing:

  • overall readiness score
  • domain-level results
  • detected gaps
  • recommendations
  • evidence status

The result clearly separates:

  • evidence present
  • evidence missing
  • evidence unknown or unavailable

That distinction is important. R.E.A.D.Y. does not pretend to know more than the data supports.
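
A hedged sketch of how that three-way distinction can be encoded in a deterministic score (names are hypothetical):

```typescript
// Hypothetical encoding of the present / missing / unknown distinction.
type EvidenceStatus = "present" | "missing" | "unknown";

interface Check {
  name: string;
  status: EvidenceStatus;
  weight: number;
}

// Only checks whose evidence is known (present or missing) affect the score;
// "unknown" is reported but never silently counted as a pass or a fail.
function readinessScore(checks: Check[]): number {
  const known = checks.filter((c) => c.status !== "unknown");
  const totalWeight = known.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight === 0) return 0;
  const earned = known
    .filter((c) => c.status === "present")
    .reduce((sum, c) => sum + c.weight, 0);
  return earned / totalWeight;
}
```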

Why This Is Useful

Readiness reviews are often manual and inconsistent. One team may consider a service ready because it has traffic. Another team may expect ownership, SLOs, dashboards, alerts, documentation, and runbooks.

R.E.A.D.Y. makes that conversation more explicit and repeatable by turning readiness into an evidence-based workflow.

Results Achieved

  • real fleet-level readiness generation for Services
  • real fleet-level readiness generation for Applications
  • deterministic scoring before AI explanation
  • clear visibility into missing operational metadata
  • structured recommendations based on the collected evidence
  • a repeatable report format that can be reused across environments

Repeatable Workflow Pattern

This project is not tied to one specific tenant. The same pattern can be reused in other Dynatrace environments.

  1. Configure the Dynatrace environment URL and platform token.
  2. Configure the Dynatrace MCP server endpoint.
  3. Discover available MCP tools.
  4. Collect evidence through MCP, DQL, and platform APIs.
  5. Normalize the evidence into a stable internal structure.
  6. Apply deterministic checks and scoring.
  7. Use AI only after the evidence is already structured.
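
Steps 1 and 2 boil down to a small configuration surface, sketched below with hypothetical field names:

```typescript
// Hypothetical configuration shape for reusing the pattern in another tenant.
interface ReadyConfig {
  environmentUrl: string; // e.g. "https://<tenant>.apps.dynatrace.com"
  platformToken: string;  // kept server-side, never shipped to the browser
  mcpEndpoint: string;    // Dynatrace Remote MCP server URL
}

const exampleConfig: ReadyConfig = {
  environmentUrl: "https://example.apps.dynatrace.com",
  platformToken: process.env.DT_PLATFORM_TOKEN ?? "",
  mcpEndpoint: "https://example.apps.dynatrace.com/mcp",
};
```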

This same pattern can be extended to:

  • Kubernetes workload readiness
  • synthetic monitor readiness
  • host readiness
  • deployment-change correlation workflows
  • fleet-wide governance audits
  • service ownership validation
  • operational metadata quality checks

Why This Is a Good Dynatrace MCP Use Case

This project demonstrates Dynatrace MCP beyond a simple chat interface.

R.E.A.D.Y. uses MCP as part of a real operator workflow for:

  • evidence discovery
  • DQL execution
  • entity resolution
  • Problems analytics
  • documentation search
  • troubleshooting context
  • readiness assessment

The key value is that MCP becomes an operational building block. It helps answer questions teams already care about:

  • Where should we investigate first?
  • What is recurring?
  • Which entities are driving risk?
  • What evidence is missing?
  • Is this fleet operationally ready?
  • What should be improved before production readiness is accepted?

Architecture Pattern

UI
-> Dynatrace App Function orchestration
-> Dynatrace MCP / DQL / Platform APIs
-> normalized evidence layer
-> deterministic rules and scoring
-> optional AI summarization
-> operator-facing result

This design keeps the system explainable, testable, auditable, extensible, and grounded in Dynatrace data.

What Makes It Creative

The creativity in R.E.A.D.Y. is not about replacing operators with AI.

The creative part is combining Dynatrace-native evidence collection, MCP-powered context retrieval, deterministic operational scoring, and optional AI explanation into one repeatable workflow.

Business and Operational Outcome

R.E.A.D.Y. helps teams:

  • shorten time to triage
  • standardize readiness reviews
  • identify recurring Problems
  • detect concentration of operational risk
  • highlight missing ownership or governance metadata
  • produce structured reports instead of relying on tribal knowledge
  • make operational readiness more evidence-based

In short, Dynatrace MCP is used here not as a novelty, but as a repeatable operational building block for observability-driven decision support.

Below are some screenshots of the app:

Problems Intelligence

MaximilianoML_3-1777970963661.png

MaximilianoML_1-1777970881063.png

MaximilianoML_4-1777971057463.png

Fleet-Level Ready Reports for Operational Readiness

MaximilianoML_5-1777971217464.png

MaximilianoML_6-1777971334014.png

MaximilianoML_7-1777971404041.png

 

Max Lopes
3 Replies

MaximilianoML
Champion

I made some improvements to the app; see below:

MaximilianoML_0-1778492071839.png

MaximilianoML_1-1778492247985.png

MaximilianoML_2-1778492345174.png

MaximilianoML_3-1778492415351.png

 

 

Max Lopes

andreas_grabner
Dynatrace Guru

Hi. Really cool. Got a couple of quick questions for you on this one:

* Does the App query the data ad-hoc when you open it? Or is the analysis run automatically and the result persisted in Grail?

* You talk about a Workflow. Also wondering if you have an actual Dynatrace Workflow that periodically executes the steps you explained and then persists the data in Grail, so that the app can visualize the insights more easily.

* You said that the MCP here is not used as a novelty. I am actually wondering why you use the MCP at all and do not just make calls to the Dynatrace API, since you are a native Dynatrace App anyway. I obviously like that you call the MCP, but architecturally it does not feel strictly necessary. Also wondering how you configure the MCP: do you need to configure a token in the App that is used to authenticate against the MCP? Can you connect to multiple MCPs?

 

Contact our DevRel team through devrel@dynatrace.com

Thanks @andreas_grabner, really good questions.

The short answer is that today R.E.A.D.Y. is mostly interactive and ad-hoc. When the user generates a report or uses the Problems tab, the app collects the evidence at that moment, runs the deterministic logic, and returns the result to the UI.

For the Ready Report, the current flow is:

  1. User selects scope
  2. App Function generates the report
  3. DQL / MCP / API evidence collection
  4. deterministic checks and scoring
  5. report returned to the UI

The current implementation reads from Dynatrace/Grail, but it does not yet persist readiness scores or historical report results back into Grail. The report exists as the App Function response and can be exported, for example as Markdown.

The Problems tab works in a similar way. It is also ad-hoc today. When the tab is opened or when filters change, the UI calls the problems-overview App Function through useAppFunction. That function queries current Dynatrace data for the selected time window. It does not read precomputed or persisted analysis results from Grail.
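
For context, the UI side of that call looks roughly like the sketch below; it assumes the useAppFunction hook from @dynatrace-sdk/react-hooks, and the exact hook signature and payload fields are assumptions:

```typescript
import { useAppFunction } from "@dynatrace-sdk/react-hooks";

// Sketch only -- the hook signature, payload fields, and result shape are assumptions.
function ProblemsTab({ timeframe }: { timeframe: string }) {
  const { data, isLoading, error } = useAppFunction({
    name: "problems-overview",
    data: { timeframe },
  });

  if (isLoading) return <span>Loading…</span>;
  if (error) return <span>Failed to load the Problems overview.</span>;
  return <pre>{JSON.stringify(data, null, 2)}</pre>;
}
```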

For the Problems tab, the current flow is closer to:

  1. Problems tab opens / filters change
  2. problems-overview App Function
  3. MCP execute-dql / entity lookup
  4. aggregate Problems, recurrence, blast radius, duration, deployments, etc.
  5. return visualization-ready result to the UI

So today:

  • There is no scheduled Dynatrace Workflow, and none is planned at the moment.
  • There is no persisted Problems Intelligence snapshot in Grail yet.
  • There is no historical precomputed trend store created by this app yet.
  • The app reads from Dynatrace/Grail data, computes the overview, and returns the result live.

The main difference between the two tabs is the current MCP usage pattern. The Problems tab currently leans more directly on MCP: the problems-overview function uses MCP helpers such as executeDql, getEntityName, and getMcpToolRegistry. The Ready Report side is designed around an evidence-provider approach, with MCP-based evidence today and direct DQL/API evidence providers planned later.

The AI part is also ad-hoc. For Problems, generate-problems-ai-insights takes the already-computed Problems overview data and asks AI to produce an operational insight. It does not discover Problems by itself, although it can use MCP if needed, and it does not persist the AI result.

Regarding the Workflow point: in my description I used “workflow” mostly in the process sense, meaning the sequence of steps the app follows to generate the readiness assessment or Problems overview. I do not yet have a scheduled Dynatrace Workflow periodically executing the full analysis and persisting the result into Grail, but it's a good idea.

That would be a very good next step. This would give two modes:

  • Ad-hoc mode: generate a fresh report or Problems overview for a selected service, application, or scope.
  • Scheduled mode: periodically scan important scopes and persist readiness or Problems Intelligence snapshots to Grail.

Regarding the MCP usage: I agree that, architecturally, the deterministic core could be implemented directly with DQL and Dynatrace APIs, especially because this is a native Dynatrace App.

The reason I use MCP here is not because the app cannot work without it, but because MCP adds value as a reusable evidence and context layer. It gives the app a stable tool interface for things like DQL execution, entity lookup, Problems lookup, documentation discovery, and future playbook-driven investigation.

In other words, I do not want MCP to replace the native app architecture. I want to use it where it adds value: as an evidence bridge that helps collect and resolve context in a repeatable way, without turning the app into a generic chatbot.

For configuration, the current PoC keeps the MCP connection server-side in the App Function layer. The browser does not call MCP directly. The App Function uses a configured MCP endpoint and token. For a production-ready version, I would tighten this further using Dynatrace-supported secure configuration or secret handling, so tokens are not exposed or hardcoded.

Today the implementation assumes one configured MCP endpoint. Supporting multiple MCP servers would be possible by adding a small server registry and selecting the MCP endpoint by environment or scope, but that is not implemented yet.

We can discuss this further; I'd be happy to clarify anything you ask 🤠

Max Lopes
