
MCP Server Challenge entry #6: My very first App - A Kubernetes Cluster Performance & Capacity Report

dannemca
DynaMight Guru

Inspired by this excellent video on the Dynatrace YouTube channel, I started putting into practice something I had always wanted to do but never had the guts to complete: creating my very first app in Dynatrace.

I am not even close to being a developer. My background has always been as a tech guy who solves problems by applying observability practices. But I am an SRE, and one of my competencies should be software engineering. I can read code, I can troubleshoot and sometimes fix bugs, but I cannot write code. It would not be fair to say that I can.

But thanks to AI, I can now vibe code. 

Last year I discovered the power of MCP servers during a presentation at SREDay in Brazil (https://sreday.com/2025-campinas-q4/), and then found out that Dynatrace has its own MCP server.

I first played with the MCP server and MS Copilot to ask simple questions about the tenants I administer, but I never thought I could use it for more complex tasks, like creating an entire app.

And then I saw the YouTube video where, as always, @andreas_grabner presents amazing features that people are creating within Dynatrace.

I had to try it myself and see if it would really be that easy. And it was.

 

I started by creating the default app, following this doc.

The idea was to replicate a dashboard I had already created, which provides a history of CPU and memory usage vs. requests and total capacity from two Kubernetes clusters, allowing the results to be filtered by node role. So I already had the required DQL queries.

// CPU usage, summed across containers
timeseries usage_result = sum(dt.kubernetes.container.cpu_usage, default: 0, rollup: sum, rate: 1m),
  ...
// CPU requests, appended as a second series
| append[timeseries request_result = sum(dt.kubernetes.container.requests_cpu, rollup: sum, rate: 1m),
  ...
// total allocatable CPU of the nodes, appended as a third series
| append[timeseries total_result = sum(dt.kubernetes.node.cpu_allocatable, rollup: sum, rate: 1m),
  ...
// enrich each record with the node role taken from the node labels,
// driven by the $NodeRole variable
| lookup [smartscapeNodes {K8S_NODE}
          | fields name,
                   role = `tags:k8s.labels`[`node-role/$NodeRole:noquote`]
          | filter isNotNull(role)
  ], sourceField:k8s.node.name, lookupField:name
// keep only records that matched a node with the selected role
| filterOut isNull(lookup.name)
| fieldsRemove lookup.name, k8s.node.name
| summarize { Usage = sum(usage_result[]), Request = sum(request_result[]), Total = sum(total_result[])}, by:{timeframe,interval}

In this DQL, I have some appends for the additional metrics and a lookup to allow filtering by node role.

dashboard.png

So I asked Copilot to create a new page that displays the result of this query, allowing the user to update the values based on the variable selection for the node role.

And just like that, I got the first custom page created:

app01.png

I am not totally sure whether Copilot could have produced the same or a similar query just from a prompt describing the need, since it has to add different metrics and filter by a node entity property while the metrics are scoped to the cluster entity instead. Maybe it could, if I explained it all in the prompt. But since I already had the query, I just used it.

 

Up to this point I just had a dashboard that is a bit hard to edit. Apps should be smarter.

So I asked Copilot to act as a Performance & Capacity specialist: analyze the data, compare it with the previous period, and generate a report with improvement recommendations.

And I got this:

app02.png

Now, when we load the app and choose a node role, we get the graph with the CPU and memory consumption history, and we can quickly understand whether the values we are seeing are healthy and what actions we can take to improve them, whether that means saving money on cloud provisioning or adjusting the requests to avoid future resource congestion.
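For the curious, the period-over-period part of the report boils down to a time-shifted query. A minimal DQL sketch, assuming the timeseries shift parameter is available on your tenant (the generated app code may achieve this differently):

// current CPU usage, summed across the cluster
timeseries usage_now = sum(dt.kubernetes.container.cpu_usage, rollup: sum, rate: 1m)
// the same series shifted back one week for the comparison
| append [timeseries usage_prev = sum(dt.kubernetes.container.cpu_usage, rollup: sum, rate: 1m), shift: -7d]
| summarize { Current = sum(usage_now[]), Previous = sum(usage_prev[]) }, by:{timeframe, interval}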

Of course, it was not as simple as asking and getting it. Sometimes I was not clear enough about my needs, and some hallucinations happened, but Copilot got back on track after a few tries.

I will now enhance this by adding recommendations per namespace and workload, with suggestions for limits and requests and even for pod counts. Let's see.

 

Here are some lessons I learned with that:

  • Make sure you understand the Dynatrace querying capabilities.
  • Make sure you understand the data context of your environment. 
  • Add one feature/function at a time. Be clear with your prompts.
  • Try to understand the app project structure, for minor and quick adjustments by hand.
  • Don't be afraid of AI.

 

RTFM is still required. WTFV (watch the f video) is a must.

Site Reliability Engineer @ Kyndryl
2 REPLIES

andreas_grabner
Dynatrace Guru

Hi. Really great to see that AI and our MCPs are enabling everyone to become builders. 

Did you have time to work on this further? Any more screenshots of additional features you have AI-engineered in the meantime?

I do have some questions though: as you said, it's the first version. What else do you think the app should be able to do in the future? Also, why do you think this use case needs an app and can't be done through dashboards (using variables to filter on e.g. nodes) and through workflows (to capture remediation suggestions)? Just curious to hear your thoughts on moving this app from first version to an MVP: what other use cases besides those that you mentioned do you think you will build?

Contact our DevRel team through devrel@dynatrace.com

Hi Andy, I have not added any additional features yet (the namespace and workload suggestions), since I was still rethinking the app's purpose; as you mention, the suggestions can be done in a different way.

And that was my "main mistake" here. I am a newbie at this, so my first thought was that the suggestions would always take the historical data and evaluate it using the "AI thing" to provide specific actions, when in reality it turns out to be a series of IFs: if the value is higher/lower than X, suggest this; else, suggest that. Still smart, but not what I expected.
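For illustration, that rule style can even be expressed straight in DQL with nested if()s; a hypothetical sketch with made-up thresholds:

// compare usage against requests and attach a rule-based suggestion
// (the 0.8 and 0.3 thresholds are made up for this sketch)
timeseries usage = sum(dt.kubernetes.container.cpu_usage, rollup: sum, rate: 1m)
| append [timeseries request = sum(dt.kubernetes.container.requests_cpu, rollup: sum, rate: 1m)]
| summarize { Usage = sum(usage[]), Request = sum(request[]) }
| fieldsAdd ratio = Usage / Request
| fieldsAdd suggestion = if(ratio > 0.8, "raise requests to avoid congestion",
                         else: if(ratio < 0.3, "lower requests to save provisioning cost",
                         else: "requests look healthy"))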

I stepped back from this app to work on a different one; I have already created v1, but it is still far from ready to present here. This new app handles the audit events to show how many users accessed the tenant in a custom timeframe, converting the user.id into user names and email domains, so app users can select the domain as a filter and check tenant usage.
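The data side of that is roughly this in DQL (my assumptions: audit events live in Grail as system events, and user.id looks like an email address for the parse pattern to work):

// count distinct users per email domain over the selected timeframe
fetch dt.system.events
| filter event.kind == "AUDIT_EVENT"
| parse user.id, "LD:user_name '@' LD:email_domain"
| summarize users = countDistinct(user.id), by:{email_domain}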

This new app also translates the user actions into natural language. For example: user.id XYZ123, name Danne, domain Kyndryl, edited the auto-tag rule for "OS", adding a new value "Solaris" when OS Type equals "Solaris", at 12/05/26 13:34.
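The sentence itself is just string assembly over the event fields; conceptually, in DQL terms (every field besides user.id and timestamp here is an illustrative placeholder, since the real translation logic lives in the app code):

// sketch: turn an audit event into a human-readable sentence
fetch dt.system.events
| filter event.kind == "AUDIT_EVENT"
| fieldsAdd summary = concat(user.id, " performed ", toString(event.type),
                             " at ", formatTimestamp(timestamp, format:"dd/MM/yy HH:mm"))
| fields summary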
Screenshot 2026-05-12 142627.png

That's what I got so far.

In this case, I am using the MCP server and Copilot again to help me with the coding. This time, I explain the audit event entries and the data I need to extract, and Copilot creates the code to make it happen.

For example, on the first try it showed me the changes character by character, and I had to explain that if something changed from 'f' to 't', 'a' to 'r', 'l' to 'u', 's' to 'e' and 'e' to '}', it is because something was 'false' and is now 'true'.

Again, this could be a dashboard, but I think it would be harder to create a dashboard with all that logic than an app built with Copilot and the MCP server.

This Audit "Translator" could be something that others clients could use as well (replicable) , since is not client related (as the node labels specifications from other app).

 

Site Reliability Engineer @ Kyndryl
