TL;DR: We open-sourced dt-evals, a CLI toolkit for evaluating LLM and agent quality using real GenAI traces in Dynatrace AI Observability. Run evaluations on live or recent gen_ai.* spans, score responses with an LLM judge, send results back to Dynatrace as business events, and use those scores for dashboards, alerts, drift detection, and CI/CD quality gates.
AI applications can fail silently. A response may be fast and error-free, but still inaccurate, ungrounded, unsafe, biased, incomplete, or unusable. That's why AI quality needs to be monitored alongside the telemetry that teams already use for latency, cost, errors, traces, and user behavior.
dt-evals helps close that gap. It evaluates real LLM and agent interactions from Dynatrace AI Observability, scores them with an LLM-as-judge, and writes structured evaluation results back into Dynatrace so quality becomes visible, queryable, trendable, and actionable.
npm install -g @dynatrace-oss/dt-evals
dt-evals configure
dt-evals doctor
dt-evals run --since 1h --sample 10
For CI/CD quality gates:
dt-evals run --since 6h --ci
In CI mode, dt-evals can fail the pipeline when configured quality thresholds are breached, helping teams catch regressions before they reach users.
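To make the idea of a quality gate concrete, here is a minimal Python sketch of the kind of check a gate performs. It is not dt-evals' implementation. The record shape (`metric`, `score`), the threshold values, and the function names are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical thresholds: minimum acceptable mean score per metric (0.0 to 1.0).
THRESHOLDS = {"relevance": 0.7, "faithfulness": 0.8}

# Hypothetical evaluation results, one record per scored response.
SAMPLE_RESULTS = [
    {"metric": "relevance", "score": 0.75},
    {"metric": "relevance", "score": 1.0},
    {"metric": "faithfulness", "score": 0.5},
    {"metric": "faithfulness", "score": 1.0},
]


def breached_metrics(results, thresholds):
    """Return {metric: mean_score} for each metric whose mean is below its threshold."""
    scores_by_metric = {}
    for record in results:
        scores_by_metric.setdefault(record["metric"], []).append(record["score"])

    breaches = {}
    for metric, minimum in thresholds.items():
        scores = scores_by_metric.get(metric)
        # A metric with no scores is skipped rather than treated as a failure.
        if scores and mean(scores) < minimum:
            breaches[metric] = mean(scores)
    return breaches


def gate_exit_code(results, thresholds):
    """Return 1 when any threshold is breached, else 0 (the usual CI convention)."""
    return 1 if breached_metrics(results, thresholds) else 0


print(breached_metrics(SAMPLE_RESULTS, THRESHOLDS))
print(gate_exit_code(SAMPLE_RESULTS, THRESHOLDS))
```

A CI job would pass the returned code to the process exit status, so a breached threshold marks the pipeline step as failed. Whether missing metrics should pass or fail the gate is a design choice; this sketch lets them pass.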

AI Evaluation & LLM App Performance.
When evaluation scores live in notebooks, spreadsheets, or standalone tools, it’s hard to connect a low score to the exact trace, prompt, model, retrieval context, tool call, or service that caused it.
With Dynatrace AI Observability and dt-evals, a failed faithfulness score is no longer just a number in a report. It becomes an operational signal connected to the full AI execution path.
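The value of keeping scores next to traces can be illustrated with a small Python sketch that joins low-scoring evaluations to the spans they came from. This is only an illustration under stated assumptions: the evaluation record fields (`trace_id`, `metric`, `score`), the 0.8 cutoff, and the function name are hypothetical, and in practice the join would happen through queries inside Dynatrace rather than in application code. The `gen_ai.request.model` and `service.name` attribute names follow OpenTelemetry conventions.

```python
def low_score_findings(evaluations, spans, minimum=0.8):
    """Pair each evaluation scoring below `minimum` with context from its span."""
    # Index spans by trace ID so each evaluation can be looked up directly.
    spans_by_trace = {span["trace_id"]: span for span in spans}

    findings = []
    for evaluation in evaluations:
        if evaluation["score"] >= minimum:
            continue
        # Fall back to an empty dict when no matching span was captured.
        span = spans_by_trace.get(evaluation["trace_id"], {})
        findings.append({
            "trace_id": evaluation["trace_id"],
            "metric": evaluation["metric"],
            "score": evaluation["score"],
            "model": span.get("gen_ai.request.model"),
            "service": span.get("service.name"),
        })
    return findings


# Hypothetical sample data for illustration only.
SPANS = [
    {"trace_id": "t1", "gen_ai.request.model": "model-a", "service.name": "chat-api"},
    {"trace_id": "t2", "gen_ai.request.model": "model-b", "service.name": "search-api"},
]
EVALUATIONS = [
    {"trace_id": "t1", "metric": "faithfulness", "score": 0.4},
    {"trace_id": "t2", "metric": "faithfulness", "score": 0.95},
]

for finding in low_score_findings(EVALUATIONS, SPANS):
    print(finding)
```

With the score and the trace in the same place, a low faithfulness result points straight at the model and service involved instead of requiring a manual lookup across tools.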

Prompt stream showing evaluation score badges such as relevance, faithfulness, fluency, and toxicity.
We're working on improvements for targeted and bulk trace evaluations, custom evaluation libraries, evaluator versioning, baseline comparisons, experiment views, native quality gates, and deeper visibility into online evaluations.
This is an early release, and we're actively shaping dt-evals based on real-world usage. Try it out, open an issue, suggest an evaluator, or contribute on GitHub:
github.com/dynatrace-oss/dt-evals
If dt-evals helps you catch your next "why did quality drop overnight?" issue, give the repo a star so other teams can find it too.