Hi All,
Not a question - more a summary of what I've found.
I've been looking at getting OTEL working on our containerised platforms to address some issues with the ActiveGate.
For those that are interested, here are my thoughts and a rough guide.
Where did I have issues (what was I addressing)?
1. Resource utilisation of the ActiveGate when scraping Prometheus data.
2. Having to configure duplicate sets of Prometheus scrape settings.
3. Prometheus gaps on core / managed components in the cloud.
4. Istio tracing / monitoring.
5. The Dynatrace OTEL API on the ActiveGate (manual set-up of the service, having to use port 80, resource utilisation) ...
So, I thought I'd have a play with an OTEL Collector and see what I could get going.
Where to start:
1. GitHub - Dynatrace/dynatrace-otel-collector: Dynatrace distribution of the OpenTelemetry Collector.
Links to the Dynatrace deployment and configuration guides are in there.
- Do you need to use the Dynatrace one? No, you could also use the out-of-the-box OTEL Collector.
- Why the Dynatrace one? It has some handy plugins built in, and it's based on the upstream OTEL Collector.
- Dynatrace support it (within reason).
So what are we doing below?
1. Deploy and Configure an OTEL Collector using Helm.
2. Set up the OTEL Collector Receivers with HTTP & gRPC endpoints.
3. Configure your OTEL Collector Processors (transforms for attributes / trace enhancements).
4. Configure your OTEL Collector Exporters.
5. Add Prometheus scraping.
6. Configure your OTEL Collector Services.
7 (optional). Configure Istio to send traces to the collector using the default Istio configs for tracing: Istio / OpenTelemetry (see the sketch after this list).
All you need to do is replace the endpoint in the document with the endpoint for the deployed OTEL Collector.
- Nice and clean.
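For reference, this is roughly what step 7 looks like if you follow the Istio docs - the only change is pointing the extensionProvider at the collector Service. The provider name, sample rate, and the Service name / namespace below are my assumptions; use whatever your Helm release actually created.
# Mesh config fragment (e.g. applied via IstioOperator / istioctl install) - adds an OTLP tracing provider
meshConfig:
  extensionProviders:
    - name: otel-tracing
      opentelemetry:
        service: dynatrace-otel-collector.otel.svc.cluster.local   # assumed collector Service name & namespace
        port: 4317
---
# Telemetry resource enabling that provider mesh-wide
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: otel-tracing
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: otel-tracing
      randomSamplingPercentage: 10   # assumed sample rate - tune to taste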
Here is a Helm-based sample that does all of the above.
A couple of points on this:
- The below is configured for a medium-sized k8s cluster and is HA; it's overkill for a test lab.
- If you want to shrink it, reduce the replica counts and memory requests / limits to suit your environment.
- This has been configured for an internal repository; just adjust the image location as required.
- This has been configured to connect to Dynatrace via an authenticated proxy (adjust your proxy settings as required using the standard Unix proxy variables).
- I'm using secrets as environment variables; this can also be done with secrets mounted as files - read the OTEL docs (a rough sketch follows this list).
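If you'd rather mount the secrets than expose them as environment variables, something like the below should work with this chart's extraVolumes / extraVolumeMounts values (the volume name and mount path are my assumptions, and you'd still need to wire the mounted files into the collector config per the OTEL docs):
extraVolumes:
  - name: dt-api-credentials
    secret:
      secretName: dynatrace-otelcol-dt-api-credentials
extraVolumeMounts:
  - name: dt-api-credentials
    mountPath: /etc/otelcol/secrets   # assumed path - pick one that suits you
    readOnly: true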
The below example will deploy and configure a 3-replica OTEL Collector with node anti-affinity and autoscaling based on load. The OTEL Collector Service Account, Cluster Role & RBAC required for Prometheus are also included (if you don't want or need Prometheus, you can take these out).
I've included a couple of examples of common transformations for traces and for metric attributes from Prometheus scraping. The Prometheus scrape configs will need to be tailored to your environment (services, ports, namespaces ...); however, you can just reuse the vendor / open-source provided scrape settings with OTEL.
mode: deployment
# Replica count of pods to support sharding of Prometheus
replicaCount: 3
image:
  repository: "<my repo>/dynatrace/dynatrace-otel-collector/dynatrace-otel-collector"
  tag: latest
# Collector command name (the binary inside the Dynatrace image)
command:
  name: dynatrace-otel-collector
# Pod labels for support & enabling Istio if required
podLabels:
  Service: "Dynatrace Otel"
  Support: "my support team"
  #sidecar.istio.io/inject: "true" ### enable if you want the collector to run behind Istio - if you do, you'll also need the ServiceEntry & NetworkPolicy
extraEnvs:
  - name: HTTP_PROXY
    valueFrom:
      secretKeyRef:
        key: http_proxy
        name: otelproxy
  - name: HTTPS_PROXY
    valueFrom:
      secretKeyRef:
        key: https_proxy
        name: otelproxy
  - name: NO_PROXY
    valueFrom:
      secretKeyRef:
        key: no_proxy
        name: otelproxy
  - name: DT_API_TOKEN
    valueFrom:
      secretKeyRef:
        name: dynatrace-otelcol-dt-api-credentials
        key: dt-api-token
  - name: DT_ENDPOINT
    valueFrom:
      secretKeyRef:
        name: dynatrace-otelcol-dt-api-credentials
        key: dt-endpoint
  - name: SHARDS
    value: "3"
  - name: POD_NAME_PREFIX
    value: otel-prometheus-collector
  - name: POD_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.name
resources:
  requests:
    cpu: 750m
    memory: 4Gi
    ephemeral-storage: "1Gi"
  limits:
    cpu: 5
    memory: 8Gi
    ephemeral-storage: "2Gi"
podDisruptionBudget:
  enabled: true
  minAvailable: 2
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 180
      selectPolicy: Max
      policies:
        - type: Pods
          value: 5
          periodSeconds: 30
        - type: Percent
          value: 100
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 180
      selectPolicy: Min
      policies:
        - type: Pods
          value: 3
          periodSeconds: 30
        - type: Percent
          value: 50
          periodSeconds: 30
  targetCPUUtilizationPercentage: 90
  targetMemoryUtilizationPercentage: 95
rollout:
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
  strategy: RollingUpdate
dnsPolicy: "ClusterFirst"
# Additional settings for sharding
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values:
                - otel-collector
        topologyKey: "kubernetes.io/hostname"
presets:
  kubernetesAttributes:
    enabled: true
useGOMEMLIMIT: true
ports:
  jaeger-compact:
    enabled: false
  jaeger-thrift:
    enabled: false
  jaeger-grpc:
    enabled: false
  zipkin:
    enabled: false
  metrics:
    enabled: true
serviceAccount:
  create: true
  annotations: {}
  name: "k8s-otel-collector-sa"
clusterRole:
  create: true
  annotations: {}
  name: "k8s-otel-collector-role"
  rules:
    - apiGroups:
        - ""
      resources:
        - pods
        - services
        - endpoints
        - namespaces
      verbs:
        - get
        - list
        - watch
    - apiGroups:
        - extensions
      resources:
        - deployments
        - replicasets
      verbs:
        - get
        - list
        - watch
    - apiGroups:
        - apps
      resources:
        - daemonsets
        - statefulsets
        - replicasets
      verbs:
        - get
        - list
        - watch
    - apiGroups:
        - networking.k8s.io
      resources:
        - ingresses
      verbs:
        - get
        - list
        - watch
  clusterRoleBinding:
    annotations: {}
    name: "k8s-otel-collector-role-binding"
alternateConfig:
  exporters:
    otlphttp:
      endpoint: "${env:DT_ENDPOINT}"
      headers:
        Authorization: "Api-Token ${env:DT_API_TOKEN}"
  extensions:
    health_check:
      endpoint: ${env:MY_POD_IP}:13133
  processors:
    attributes:
      actions:
        - key: k8s.cluster.name
          value: '<my cluster name>'
          action: insert
    cumulativetodelta: {}
    filter:
      metrics:
        exclude:
          match_type: expr
          expressions:
            - MetricType == "Summary"
    memory_limiter:
      check_interval: 5s
      limit_percentage: 80
      spike_limit_percentage: 25
    batch/traces:
      send_batch_size: 5000
      send_batch_max_size: 5000
      timeout: 60s
    batch/metrics:
      send_batch_size: 3000
      send_batch_max_size: 3000
      timeout: 60s
    batch/logs:
      send_batch_size: 1800
      send_batch_max_size: 2000
      timeout: 60s
    k8sattributes:
      auth_type: serviceAccount
      passthrough: false
      extract:
        metadata:
          - k8s.pod.name
          - k8s.pod.uid
          - k8s.deployment.name
          - k8s.statefulset.name
          - k8s.daemonset.name
          - k8s.cronjob.name
          - k8s.namespace.name
          - k8s.node.name
          - k8s.cluster.uid
      pod_association:
        - sources:
            - from: resource_attribute
              name: k8s.pod.name
            - from: resource_attribute
              name: k8s.namespace.name
        - sources:
            - from: resource_attribute
              name: k8s.pod.ip
        - sources:
            - from: resource_attribute
              name: k8s.pod.uid
        - sources:
            - from: connection
    transform:
      error_mode: ignore
      trace_statements:
        - context: resource
          statements:
            - set(attributes["dt.kubernetes.workload.kind"], "statefulset") where IsString(attributes["k8s.statefulset.name"])
            - set(attributes["dt.kubernetes.workload.name"], attributes["k8s.statefulset.name"]) where IsString(attributes["k8s.statefulset.name"])
            - set(attributes["dt.kubernetes.workload.kind"], "deployment") where IsString(attributes["k8s.deployment.name"])
            - set(attributes["dt.kubernetes.workload.name"], attributes["k8s.deployment.name"]) where IsString(attributes["k8s.deployment.name"])
            - set(attributes["dt.kubernetes.workload.kind"], "daemonset") where IsString(attributes["k8s.daemonset.name"])
            - set(attributes["dt.kubernetes.workload.name"], attributes["k8s.daemonset.name"]) where IsString(attributes["k8s.daemonset.name"])
            - set(attributes["dt.kubernetes.cluster.id"], attributes["k8s.cluster.uid"]) where IsString(attributes["k8s.cluster.uid"])
      log_statements:
        - context: resource
          statements:
            - set(attributes["dt.kubernetes.workload.kind"], "statefulset") where IsString(attributes["k8s.statefulset.name"])
            - set(attributes["dt.kubernetes.workload.name"], attributes["k8s.statefulset.name"]) where IsString(attributes["k8s.statefulset.name"])
            - set(attributes["dt.kubernetes.workload.kind"], "deployment") where IsString(attributes["k8s.deployment.name"])
            - set(attributes["dt.kubernetes.workload.name"], attributes["k8s.deployment.name"]) where IsString(attributes["k8s.deployment.name"])
            - set(attributes["dt.kubernetes.workload.kind"], "daemonset") where IsString(attributes["k8s.daemonset.name"])
            - set(attributes["dt.kubernetes.workload.name"], attributes["k8s.daemonset.name"]) where IsString(attributes["k8s.daemonset.name"])
            - set(attributes["dt.kubernetes.cluster.id"], attributes["k8s.cluster.uid"]) where IsString(attributes["k8s.cluster.uid"])
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
        http:
          endpoint: ${env:MY_POD_IP}:4318
    ##################################################################
    ## PROMETHEUS SCRAPE SETTINGS GO HERE                           ##
    ##################################################################
    prometheus:
      config:
        scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 30s
            static_configs:
              - targets:
                  - ${env:MY_POD_IP}:8888
          - job_name: 'kube-dns'
            scrape_interval: 30s
            kubernetes_sd_configs:
              - role: pod
            relabel_configs:
              - source_labels: [__meta_kubernetes_namespace]
                action: keep
                regex: kube-system
              - source_labels: [__meta_kubernetes_pod_label_k8s_app]
                action: keep
                regex: kube-dns
              - source_labels: [__meta_kubernetes_pod_container_name]
                action: keep
                regex: sidecar
              - source_labels: [__meta_kubernetes_pod_container_port_number]
                action: keep
                regex: 10054
              - source_labels: [__address__]
                action: replace
                regex: (.*):\d+
                replacement: $$1:10054
                target_label: __address__
            metric_relabel_configs:
              - source_labels: [__name__]
                action: drop
                regex: ^go_.*
            metrics_path: /metrics
            scheme: http
          - job_name: 'istio-ingressgateway'
            scrape_interval: 15s
            metrics_path: /metrics
            scheme: http
            static_configs:
              - targets: ['otel-ingressgateway.istio-internal.svc.cluster.local:15020']
            relabel_configs:
              - source_labels: [__address__]
                action: replace
                regex: (.*):\d+
                target_label: __address__
                replacement: $$1:15020
            metric_relabel_configs:
              - source_labels: [__name__]
                action: drop
                regex: ^go_.*
          - job_name: 'istiod'
            scrape_interval: 15s
            metrics_path: /metrics
            scheme: http
            static_configs:
              - targets: ['istiod.istio-internal.svc.cluster.local:15014']
            metric_relabel_configs:
              - source_labels: [__name__]
                action: drop
                regex: ^go_.*
    ##################################################################
    ##################################################################
  service:
    telemetry:
      metrics:
        address: ${env:MY_POD_IP}:8888
      logs:
        level: debug
    extensions:
      - health_check
    pipelines:
      logs:
        exporters:
          - otlphttp
        processors:
          - attributes
          - k8sattributes
          - memory_limiter
          - batch/logs
        receivers:
          - otlp
      metrics:
        exporters:
          - otlphttp
        processors:
          - attributes
          - cumulativetodelta
          - memory_limiter
          - batch/metrics
          - k8sattributes
          - filter
        receivers:
          - prometheus
      traces:
        exporters:
          - otlphttp
        processors:
          - transform
          - memory_limiter
          - batch/traces
        receivers:
          - otlp
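Once the secrets below are in place, deployment is a standard install of the upstream opentelemetry-collector Helm chart with these values (the release name, namespace and values file name here are just my choices - adjust to suit):
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
# values.yaml = the sample above
helm upgrade --install dynatrace-otel-collector open-telemetry/opentelemetry-collector \
  --namespace otel --create-namespace \
  --values values.yaml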
The secrets in this are for the proxy and for the DT API (in the format from the Dynatrace docs); examples below.
* If you are doing Prometheus, you need to add the node, pod & service CIDRs to no_proxy.
apiVersion: v1
data:
  dt-api-token: ZHQwYzAxLm15YXBpdG9rZW4
  dt-endpoint: aHR0cHM6Ly90ZW5hbnRpZC5saXZlLmR5bmF0cmFjZS5jb20vYXBpL3YyL290bHAv
kind: Secret
metadata:
  annotations: {}
  name: dynatrace-otelcol-dt-api-credentials
type: Opaque
---
apiVersion: v1
data:
  http_proxy: aHR0cDovL3VzZXJuYW1lOnBhc3N3b3JkQHByb3h5OnBvcnQ
  https_proxy: aHR0cDovL3VzZXJuYW1lOnBhc3N3b3JkQHByb3h5OnBvcnQ
  no_proxy: MTI3LjAuMC4xLFBPRF9DSURSLE5PREVfQ0lEUixTRVJWSUNFX0NJRFI
kind: Secret
metadata:
  annotations: {}
  name: otelproxy
type: Opaque
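If you'd rather not hand-roll the base64, the same secrets can be created with kubectl - the token, endpoint, proxy values and namespace here are placeholders, so substitute your own:
kubectl create secret generic dynatrace-otelcol-dt-api-credentials \
  --namespace otel \
  --from-literal=dt-api-token='dt0c01.myapitoken' \
  --from-literal=dt-endpoint='https://tenantid.live.dynatrace.com/api/v2/otlp'
kubectl create secret generic otelproxy \
  --namespace otel \
  --from-literal=http_proxy='http://username:password@proxy:port' \
  --from-literal=https_proxy='http://username:password@proxy:port' \
  --from-literal=no_proxy='127.0.0.1,POD_CIDR,NODE_CIDR,SERVICE_CIDR'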
have fun, and hopefully this will help someone.