Troubleshooting
Articles about how to solve the most common problems
annazaionchkovs
Dynatrace Advocate

Summary

When a CronJob is configured with failedJobsHistoryLimit: 0 and a job fails after reaching its backoffLimit, Kubernetes emits a BackoffLimitExceeded event. However, the cloud application is missing on the event, so no problem is connected to the workload.

Problem

When failedJobsHistoryLimit is set to 0, the failed job is not retained, which prevents the problem from being connected to the workload.

Workaround

Increase the failedJobsHistoryLimit to 1 in your CronJob configuration:

failedJobsHistoryLimit: 1

Comprehensive Overview

How Is CronJob Data Handled in Dynatrace?

Let’s break it down with an example:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: zai-cronjob
spec:
  schedule: "* * * * *"
  failedJobsHistoryLimit: 0
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          containers:
          - name: zai-pod
            image: docker.io/library/bash:5
            command: ["sh", "-c", "sleep 120; exit 1"]
          restartPolicy: Never

What Happens Here?

The above CronJob runs every minute and creates the following Kubernetes entities:

  • CronJobs
  • Jobs
  • Pods

Timeline: Relation Between CronJob, Job, and Pod

To clarify the timing relationship between the CronJob, its Jobs, and their Pods, here is a detailed explanation with a timeline.

Example Scenario:

  • The CronJob runs indefinitely, triggering a new job every minute.
  • Each Job:
    • Starts immediately when triggered by the CronJob.
    • Fails after 2 minutes due to the sleep 120 command and exit 1.
    • Is retained in the Kubernetes API only as allowed by failedJobsHistoryLimit; with a limit of 0, it is deleted as soon as it fails.
  • Each Pod:
    • Is created by the Job and runs for 2 minutes (sleep 120).
    • Becomes unavailable once the Job is removed from the Kubernetes API.

Timeline Representation:

Below is an example timeline for a CronJob configured with failedJobsHistoryLimit: 0:

Entity    Visibility Timeline
CronJob   Runs indefinitely
Job 1     Visible from 12:00 to 12:02
Pod 1     Visible from 12:00 to 12:02
Job 2     Visible from 12:01 to 12:03
Pod 2     Visible from 12:01 to 12:03
Job 3     Visible from 12:02 to 12:04
Pod 3     Visible from 12:02 to 12:04
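The visibility windows above follow directly from the schedule interval and the pod runtime. As an illustration only (this is not part of any Dynatrace or Kubernetes tooling, and the function name is made up for this sketch), they can be derived in a few lines of Python:

```python
from datetime import datetime, timedelta

def visibility_windows(first_run, interval_min, runtime_sec, n_jobs):
    """With failedJobsHistoryLimit: 0, each Job (and its Pod) is visible
    only while the Pod is running: from the Job's start until the failure,
    after which the Job is deleted immediately."""
    windows = []
    for i in range(n_jobs):
        start = first_run + timedelta(minutes=i * interval_min)
        end = start + timedelta(seconds=runtime_sec)
        windows.append((start.strftime("%H:%M"), end.strftime("%H:%M")))
    return windows

# Reproduce the table: schedule "* * * * *" (every minute), sleep 120
print(visibility_windows(datetime(2024, 1, 1, 12, 0), 1, 120, 3))
# → [('12:00', '12:02'), ('12:01', '12:03'), ('12:02', '12:04')]
```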

Visibility of CronJobs, Jobs, and Pods in Dynatrace

In Dynatrace, CronJobs and Pods are saved, but Jobs are not visible in the GUI. Jobs are only available at runtime as long as they are:

  1. Accessible via the Kubernetes API.
  2. Represented in Dynatrace as Controllers.

These Controllers are updated every minute to reflect the current cluster state.

[Screenshot: kubectl-jobs.png]

[Screenshot: dt-cronjob-pod.png]

[Screenshot: event-with-workload.png]

Example Behavior

Each job will trigger the BackoffLimitExceeded event exactly 2 minutes after it starts, because:

  1. The job runs for 120 seconds (sleep 120).
  2. It exits with code 1 (failure).
  3. With backoffLimit: 0, the job fails immediately without retries.

Timeline:

  • A job starts at 12:00 → Fails at 12:02.
  • A job starts at 12:01 → Fails at 12:03.
  • And so on...

Why Does This Matter?

When failedJobsHistoryLimit: 0 is set, the failed job is not retained after failure, so the BackoffLimitExceeded event cannot be associated with the workload. As a result, no problem is shown at the workload level.

Resolution

To ensure that the failed job is available when the BackoffLimitExceeded event is triggered, update the CronJob configuration to include failedJobsHistoryLimit: 1:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: zai-cronjob
spec:
  schedule: "* * * * *"
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          containers:
          - name: zai-pod
            image: docker.io/library/bash:5
            command: ["sh", "-c", "sleep 120; exit 1"]
          restartPolicy: Never

Impact of failedJobsHistoryLimit

Configuration                Job Lifespan                       Pod Lifespan
failedJobsHistoryLimit: 0    Deleted immediately after failure  Deleted immediately after failure
failedJobsHistoryLimit: 1    Retained until the next failure    Retained until the next failure

Result

With this configuration:

  • The failed job will be retained even after failure.
  • The BackoffLimitExceeded event will be associated with the workload.
  • A problem will be raised in Dynatrace, ensuring visibility and proper monitoring.

[Screenshot: event-with-workload.png]

[Screenshot: job-failure-event.png]

[Screenshot: kubectl-jobs-history.png]

Additional Context on failedJobsHistoryLimit

According to the Kubernetes documentation, the default value for failedJobsHistoryLimit is 1. Therefore, an alternative solution is simply to omit the field instead of explicitly setting it to 1.
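For example, the same effect as the Resolution above can be achieved by leaving the field out entirely (a sketch of the relevant spec section only; the rest of the manifest is unchanged from the earlier example):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: zai-cronjob
spec:
  schedule: "* * * * *"
  # failedJobsHistoryLimit omitted -> Kubernetes defaults it to 1,
  # so the last failed Job is retained.
  jobTemplate:
    # unchanged from the example above
```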

Why Set failedJobsHistoryLimit: 0?

In some scenarios, failedJobsHistoryLimit: 0 may be configured to:

  • Minimize Resource Usage: By not retaining failed jobs, the number of objects stored in the Kubernetes API is reduced, which might be useful in environments with high workloads or limited resources.
  • Avoid Clutter: Retaining failed jobs can lead to unnecessary clutter, especially for jobs that fail frequently and are not critical to monitor after failure.

Pros and Cons of Setting failedJobsHistoryLimit: 0

Pros:

  • Reduces resource usage in the Kubernetes API.
  • Keeps the system clean by not retaining failed jobs that are not needed for further analysis.

Cons:

  • Prevents visibility into failed jobs and their associated events, such as BackoffLimitExceeded.
  • No problem will be raised in monitoring tools like Dynatrace, leading to a lack of awareness about potential issues.
  • Makes debugging and troubleshooting more difficult, as historical data about failed jobs is unavailable.

Recommendation

While setting failedJobsHistoryLimit: 0 may be suitable for certain use cases, it is generally recommended to retain at least one failed job by keeping the default value of 1. This ensures that important events like BackoffLimitExceeded are properly associated with workloads and visible in monitoring systems like Dynatrace.

If there are specific reasons for setting the limit to 0, it may be helpful to evaluate the trade-offs and consider whether retaining failed jobs would provide greater value for monitoring and troubleshooting.

What's next

If this article did not resolve your issue, we encourage you to open a support ticket for further assistance. When submitting your ticket, please reference this article to provide context, and include the YAML manifest of your CronJob configuration. This will help the support team understand your situation and provide more accurate and efficient assistance.

Version history
Last update: 02 Jan 2026 02:48 PM