[OpenTelemetry Collector] Host Health Status Not Appearing in Dynatrace

patmis
Guide

Problem Statement

I'm monitoring a Windows Server 2025 host with OpenTelemetry Collector Contrib (v0.150.1) and sending metrics to Dynatrace via OTLP. Most metrics are flowing correctly (CPU, memory, disk, filesystem), but two key issues remain:

  1. Health column remains empty (shows -) in the Hosts view
    DT_Infra_OTEL_Health_Column_Empty.png
  2. Process Count and Processes Created metrics are not visible in Dynatrace despite being configured and exported
    DT_Infra_OTEL_Process_Count_Created.png

Configuration

  • Platform: Windows Server 2025 Standard
  • Collector: otelcontribcol v0.150.1 amd64
  • Backend: Dynatrace SaaS ({ENVIRONMENT_ID}.live.dynatrace.com/api/v2/otlp)
  • Protocol: OTLP HTTP

Receivers

  1. hostmetrics/10s — CPU utilization, memory utilization, network, processes count/created, load 1m
  2. hostmetrics/5m — CPU time, memory usage, disk I/O, filesystem, paging, per-process metrics, load 5m
  3. hostmetrics/1h — Memory limit, CPU logical/physical counts, uptime, load 15m, processes
  4. windowsperfcounters/30s — Processor, Memory, Process (7 instances), PhysicalDisk counters
  5. windows_event_log — System channel

Metrics Being Collected & Exported

  • system.cpu.utilization (10s)
  • system.memory.utilization (10s)
  • system.processes.count (10s, 5m, 1h)
  • system.processes.created (10s, 5m, 1h)
  • system.uptime (1h)
  • system.cpu.load_average.1m/5m/15m (across intervals)
  • system.memory.limit (1h)
  • system.cpu.logical.count & system.cpu.physical.count (1h)
  • All Windows perfcounter metrics (transformed to windows.* namespace)
  • All system resource attributes (host.name, host.arch, os.type, os.description, os.version, os.build.id, etc.)

Verified: Metrics are reaching Dynatrace (visible in [...])

  1. [...] (configured in 3 separate scraper intervals)
  2. Removed cardinality filtering to ensure all metrics pass through
  3. Verified both system.processes.count and system.processes.created metrics are in the config
  4. Verified resource detection is enabled with 13 attributes
  5. Confirmed sending_queue and batch processors are configured
  6. Checked that metrics have correct data types and values
  7. Ensured all metrics flow without errors in the collector log
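
One way to double-check the export side independently of Dynatrace would be to attach the collector's built-in debug exporter to a separate troubleshooting pipeline and look for system.processes.count and system.processes.created in the collector output. The fragment below is only a sketch and is not part of the configuration in use: the names debug/processes and metrics/debug are arbitrary labels chosen here for illustration, and the fragment would have to be merged into the existing exporters and service sections.

```yaml
exporters:
  # Built-in debug exporter. "detailed" verbosity prints every data point
  # together with its resource attributes and data point attributes.
  debug/processes:
    verbosity: detailed

service:
  pipelines:
    # Hypothetical extra pipeline used only for troubleshooting. It reuses
    # the existing hostmetrics/10s receiver and prints what that receiver
    # emits instead of sending it to Dynatrace.
    metrics/debug:
      receivers: [hostmetrics/10s]
      processors: [cumulativetodelta]
      exporters: [debug/processes]
```

If the two process metrics show up in this output with the expected names, values, and host.* resource attributes, the collector side is probably fine and the remaining question is how Dynatrace ingests or displays them.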

Current State

  • Collector runs without errors
  • All metrics export successfully to Dynatrace

Why are process count metrics and health status not appearing in Dynatrace?

  1. For Process metrics: system.processes.count and system.processes.created are configured, exported, and flow without errors. Why don't they appear in Dynatrace UI?

    • Missing a required attribute or dimension?
    • Wrong metric naming convention?
    • Requires a different field/resource mapping?
  2. For Health Status: What does Dynatrace need to calculate and display health status for a Windows host?

    • Is there a specific health status metric (system.status, host.health, etc.)?
    • Do I need additional metrics beyond what's currently collected?
    • Is there a minimum frequency or volume requirement for health calculation?
    • Should I be exporting collector internal telemetry (otelcol_* metrics)?
    • Is there a Dynatrace-specific configuration or entity type required?

Configuration Reference

Processors pipeline:

[batch] → [filter (idle CPU)] → [transform (cardinality cleanup)] → 
[filter/delete-metrics] → [cumulativetodelta] → [metricstransform (perfcounter renaming)] → 
[resourcedetection] → [otlp_http/dynatrace]

Exporter:

  • Endpoint: https://{environmentid}.live.dynatrace.com/api/v2/otlp
  • Sending queue: min_size=3000, max_size=3000, flush=60s

Environment

  • Dynatrace Version: Latest SaaS ({ENVIRONMENT_ID}.live.dynatrace.com)
  • OpenTelemetry Collector: v0.150.1 (Contrib, Windows amd64)
  • Collection Running Time: ~30+ minutes (continuous)

Complete Configuration File

otel_dynatrace_only.yaml

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  hostmetrics/10s:
    collection_interval: 10s
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
          system.cpu.time:
            enabled: false
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
          system.memory.usage:
            enabled: false
      network:
        metrics:
          system.network.io:
            enabled: true
          system.network.packets:
            enabled: true
          system.network.errors:
            enabled: true
          system.network.dropped:
            enabled: true
          system.network.connections:
            enabled: true
      processes:
        metrics:
          system.processes.count:
            enabled: true
          system.processes.created:
            enabled: true
      load:
        metrics:
          system.cpu.load_average.1m:
            enabled: true

  hostmetrics/5m:
    collection_interval: 5m
    scrapers:
      cpu:
        metrics:
          system.cpu.time:
            enabled: true
      memory:
        metrics:
          system.memory.usage:
            enabled: true
      disk:
        metrics:
          system.disk.io:
            enabled: true
          system.disk.operations:
            enabled: true
          system.disk.io_time:
            enabled: true
          system.disk.operation_time:
            enabled: true
      network:
        metrics:
          system.network.io:
            enabled: true
          system.network.packets:
            enabled: true
          system.network.errors:
            enabled: true
          system.network.connections:
            enabled: true
          system.network.dropped:
            enabled: true
      filesystem:
        include_devices:
          match_type: strict
          devices: ["C:", "D:"]
        metrics:
          system.filesystem.utilization:
            enabled: true
          system.filesystem.inodes.usage:
            enabled: true
      paging:
        metrics:
          system.paging.usage:
            enabled: true
          system.paging.operations:
            enabled: true
      process:
        mute_process_all_errors: true
        metrics:
          process.cpu.utilization:
            enabled: true
          process.cpu.time:
            enabled: true
          process.memory.usage:
            enabled: true
          process.memory.virtual:
            enabled: true
          process.disk.io:
            enabled: true
      processes:
        metrics:
          system.processes.count:
            enabled: true
          system.processes.created:
            enabled: true
      load:
        metrics:
          system.cpu.load_average.5m:
            enabled: true

  hostmetrics/1h:
    collection_interval: 1h
    scrapers:
      memory:
        metrics:
          system.memory.limit:
            enabled: true
      cpu:
        metrics:
          system.cpu.logical.count:
            enabled: true
          system.cpu.physical.count:
            enabled: true
      system:
        metrics:
          system.uptime:
            enabled: true
      load:
        metrics:
          system.cpu.load_average.15m:
            enabled: true
      processes:
        metrics:
          system.processes.count:
            enabled: true
          system.processes.created:
            enabled: true

  windowsperfcounters:
    collection_interval: 30s
    perfcounters:
      - object: "Processor"
        instances: ["_Total"]
        counters:
          - name: "% Processor Time"
          - name: "% Privileged Time"
          - name: "Interrupts/sec"
      - object: "Memory"
        counters:
          - name: "Available MBytes"
          - name: "% Committed Bytes In Use"
          - name: "Cache Bytes"
      - object: "Process"
        instances: 
          - "svchost"
          - "lsass"
          - "csrss"
          - "services"
          - "sqlservr"
          - "w3wp"
          - "otelcol-contrib"
        counters:
          - name: "% Processor Time"
          - name: "% Privileged Time"
          - name: "Working Set - Private"
          - name: "Private Bytes"
          - name: "Thread Count"
          - name: "Handle Count"
      - object: "PhysicalDisk"
        instances: ["_Total"]
        counters:
          - name: "% Disk Time"
          - name: "Disk Bytes/sec"
          - name: "Avg. Disk Queue Length"

  windows_event_log:
    channel: System
    start_at: end

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  
  filter:
    metrics:
      datapoint:
        - metric.name == "system.cpu.utilization" and attributes["state"] == "idle"
  
  transform:
    error_mode: ignore
    metric_statements:
      - context: datapoint
        statements:
          - delete_key(resource.attributes, "process.cgroup") where IsMatch(metric.name, "^process\\..*")
          - delete_key(resource.attributes, "process.command") where IsMatch(metric.name, "^process\\..*")
          - delete_key(resource.attributes, "process.executable.path") where IsMatch(metric.name, "^process\\..*")
          - delete_key(resource.attributes, "process.owner") where IsMatch(metric.name, "^process\\..*")
          - delete_key(resource.attributes, "process.parent_pid") where IsMatch(metric.name, "^process\\..*")
          - delete_key(resource.attributes, "process.command_args") where IsMatch(metric.name, "^process\\..*")
          - delete_key(datapoint.attributes, "device") where datapoint.attributes["device"] == ""
  
  filter/delete-metrics:
    metric_conditions:
      - datapoint.attributes["low-memory-process"] != nil
  
  cumulativetodelta:
    max_staleness: 25h
  
  metricstransform:
    transforms:
      - include: '^\\\\Processor\(_Total\)\\\\% Processor Time$'
        action: update
        new_name: windows.processor.time
      - include: '^\\\\Processor\(_Total\)\\\\% Privileged Time$'
        action: update
        new_name: windows.processor.privileged_time
      - include: '^\\\\Processor\(_Total\)\\\\Interrupts/sec$'
        action: update
        new_name: windows.processor.interrupts
      - include: '^\\\\Memory\\\\Available MBytes$'
        action: update
        new_name: windows.memory.available
      - include: '^\\\\Memory\\\\% Committed Bytes In Use$'
        action: update
        new_name: windows.memory.committed_usage
      - include: '^\\\\Memory\\\\Cache Bytes$'
        action: update
        new_name: windows.memory.cache
      - include: '^\\\\Process\(.*\)\\\\% Processor Time$'
        action: update
        new_name: windows.process.cpu_time
      - include: '^\\\\Process\(.*\)\\\\% Privileged Time$'
        action: update
        new_name: windows.process.privileged_time
      - include: '^\\\\Process\(.*\)\\\\Working Set - Private$'
        action: update
        new_name: windows.process.working_set_private
      - include: '^\\\\Process\(.*\)\\\\Private Bytes$'
        action: update
        new_name: windows.process.private_bytes
      - include: '^\\\\Process\(.*\)\\\\Thread Count$'
        action: update
        new_name: windows.process.threads
      - include: '^\\\\Process\(.*\)\\\\Handle Count$'
        action: update
        new_name: windows.process.handles
      - include: '^\\\\PhysicalDisk\(_Total\)\\\\% Disk Time$'
        action: update
        new_name: windows.disk.time
      - include: '^\\\\PhysicalDisk\(_Total\)\\\\Disk Bytes/sec$'
        action: update
        new_name: windows.disk.throughput
      - include: '^\\\\PhysicalDisk\(_Total\)\\\\Avg\. Disk Queue Length$'
        action: update
        new_name: windows.disk.queue_length
  
  resourcedetection:
    detectors: ["system"]
    system:
      resource_attributes:
        host.arch:
          enabled: true
        host.id:
          enabled: true
        host.name:
          enabled: true
        host.ip:
          enabled: true
        host.interface:
          enabled: true
        host.mac:
          enabled: true
        host.cpu.model.name:
          enabled: true
        os.type:
          enabled: true
        os.description:
          enabled: true
        os.name:
          enabled: true
        os.version:
          enabled: true
        os.build.id:
          enabled: true

exporters:
  otlp_http/dynatrace:
    endpoint: "https://{environmentid}.live.dynatrace.com/api/v2/otlp"
    headers:
      Authorization: "Api-Token {DYNATRACE_API_TOKEN}"
    sending_queue:
      batch:
        min_size: 3000
        max_size: 3000
        flush_timeout: 60s

service:
  extensions: [health_check]
  pipelines:
    metrics:
      receivers: [hostmetrics/10s, hostmetrics/5m, hostmetrics/1h, windowsperfcounters]
      processors: [batch, filter, transform, filter/delete-metrics, cumulativetodelta, metricstransform, resourcedetection]
      exporters: [otlp_http/dynatrace]
    
    logs:
      receivers: [windows_event_log, otlp]
      processors: [batch, resourcedetection]
      exporters: [otlp_http/dynatrace]

Additional Context

The host metrics themselves are healthy and complete. The question is whether Dynatrace requires additional signals, different metric frequencies, or specific configuration to calculate overall host health status.


Any guidance appreciated! 🙏

1 REPLY

patmis
Guide

UPDATE: I also tried the Dynatrace OpenTelemetry Collector Distribution. However, it seems that this distribution is not capable of collecting Windows Event Logs.
