Samples / Threshold / Oracle backup with status "FAILED"

mn_24
Contributor

Hi,

I want a problem to be raised whenever an Oracle database backup FAILS, and closed only once the next backup has successfully COMPLETED; otherwise I want to investigate further why it is failing.

This is the SQL statement behind the metric:

SELECT
    backup_job.SESSION_RECID,
    backup_job.SESSION_STAMP,
    backup_job.START_TIME,
    backup_job.END_TIME,
    backup_job.STATUS,
    backup_job.INPUT_TYPE,
    backup_job.ELAPSED_SECONDS,
    (backup_job.COMPRESSION_RATIO * 100) COMPRESSION_RATIO,
    backup_job.INPUT_BYTES,
    backup_job.OUTPUT_BYTES,
    backup_job.INPUT_BYTES_PER_SEC,
    backup_job.OUTPUT_BYTES_PER_SEC,
    backup_job.AUTOBACKUP_DONE,
    backup_job.OPTIMIZED,
    backup_job.AUTOBACKUP_COUNT,
    backup_set.DISTINCT_BACKUP_TYPES AS BACKUP_TYPES,
    CASE
        WHEN (backup_job.INPUT_TYPE = 'ARCHIVELOG') THEN 'NO'
        WHEN (
            INSTR(backup_set.DISTINCT_BACKUP_TYPES, 'D') > 0 AND
            COALESCE(MAX(backup_set.INCREMENTAL_LEVEL), 0) = 0
        ) THEN 'FULL'
        WHEN (
            INSTR(backup_set.DISTINCT_BACKUP_TYPES, 'D') > 0 AND
            MAX(backup_set.INCREMENTAL_LEVEL) = 1
        ) THEN 'INCREMENTAL'
        ELSE 'NO'
    END AS DATAFILES_INCLUDED,
    CASE
        WHEN INSTR(backup_set.DISTINCT_BACKUP_TYPES, 'L') > 0 THEN 'YES'
        ELSE 'NO'
    END AS LOGS_INCLUDED,
    CASE
        WHEN MAX(CASE WHEN backup_set.CONTROLFILE_INCLUDED = 'YES' THEN 1 ELSE 0 END) > 0 THEN 'YES'
        ELSE 'NO'
    END AS CONTROLFILE_INCLUDED,
    CASE
        WHEN MAX(CASE WHEN backup_set.INCREMENTAL_LEVEL = 0 THEN 1 WHEN backup_set.INCREMENTAL_LEVEL = 1 THEN 2 ELSE 0 END) = 1 THEN '0'
        WHEN MAX(CASE WHEN backup_set.INCREMENTAL_LEVEL = 0 THEN 1 WHEN backup_set.INCREMENTAL_LEVEL = 1 THEN 2 ELSE 0 END) = 2 THEN '1'
        ELSE 'N/A'
    END AS INCREMENTAL_LEVEL
FROM V$RMAN_BACKUP_JOB_DETAILS backup_job
LEFT JOIN
(
    SELECT
        distinct_backup_set.DISTINCT_BACKUP_TYPES,
        V$BACKUP_SET_DETAILS.SESSION_STAMP,
        V$BACKUP_SET_DETAILS.SESSION_RECID,
        V$BACKUP_SET_DETAILS.CONTROLFILE_INCLUDED,
        V$BACKUP_SET_DETAILS.INCREMENTAL_LEVEL
    FROM
        (
            SELECT
                LISTAGG(BACKUP_TYPE, '+') WITHIN GROUP (ORDER BY BACKUP_TYPE) AS DISTINCT_BACKUP_TYPES,
                SESSION_STAMP,
                SESSION_RECID
            FROM
                (
                    SELECT DISTINCT
                        BACKUP_TYPE,
                        SESSION_STAMP,
                        SESSION_RECID
                    FROM V$BACKUP_SET_DETAILS
                )
            GROUP BY
                SESSION_STAMP,
                SESSION_RECID
        ) distinct_backup_set
        JOIN V$BACKUP_SET_DETAILS
            ON V$BACKUP_SET_DETAILS.SESSION_RECID = distinct_backup_set.SESSION_RECID
           AND V$BACKUP_SET_DETAILS.SESSION_STAMP = distinct_backup_set.SESSION_STAMP
) backup_set
    ON backup_job.SESSION_RECID = backup_set.SESSION_RECID
   AND backup_job.SESSION_STAMP = backup_set.SESSION_STAMP
WHERE (
    (
        backup_job.END_TIME >= CURRENT_TIMESTAMP - NUMTODSINTERVAL(5, 'MINUTE')
        OR (backup_job.END_TIME IS NULL AND backup_job.START_TIME >= CURRENT_TIMESTAMP - NUMTODSINTERVAL(60, 'MINUTE'))
    )
    AND backup_job.STATUS NOT LIKE 'RUNNING%'
)
GROUP BY
    backup_job.SESSION_RECID,
    backup_job.SESSION_STAMP,
    backup_job.START_TIME,
    backup_job.END_TIME,
    backup_job.STATUS,
    backup_job.INPUT_TYPE,
    backup_job.ELAPSED_SECONDS,
    backup_job.COMPRESSION_RATIO,
    backup_job.INPUT_BYTES,
    backup_job.OUTPUT_BYTES,
    backup_job.INPUT_BYTES_PER_SEC,
    backup_job.OUTPUT_BYTES_PER_SEC,
    backup_job.AUTOBACKUP_DONE,
    backup_job.OPTIMIZED,
    backup_job.AUTOBACKUP_COUNT,
    backup_set.DISTINCT_BACKUP_TYPES
ORDER BY backup_job.SESSION_STAMP DESC;

 

 

The archive log backups of this database are scheduled to run every 3 hours. I noticed an auto-closed problem for a FAILED backup, started digging, and saw that the problem had been opened and closed within 2 minutes. I couldn't believe it!

Why did Davis close the problem before a backup actually completed successfully, instead of at least leaving the problem open?

Is it because of the sample configuration? What about the threshold: should it be 0 or 1? Or is it because of the SQL statement fetching the data with this WHERE clause:

"....
WHERE (
    (
        backup_job.END_TIME >= CURRENT_TIMESTAMP - NUMTODSINTERVAL(5, 'MINUTE')
        OR (backup_job.END_TIME IS NULL AND backup_job.START_TIME >= CURRENT_TIMESTAMP - NUMTODSINTERVAL(60, 'MINUTE'))
    )
    AND backup_job.STATUS NOT LIKE 'RUNNING%'
)
....."
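The 5-minute END_TIME window in that WHERE clause would explain the fast auto-close: once the FAILED job ages out of the window, the query returns no rows at all, so the metric simply stops reporting the failure. A minimal Python sketch of that timeline (assuming, as observed, that the metric only produces a sample for rows the query actually returns on each 1-minute run):

```python
from datetime import datetime, timedelta

# A FAILED backup job that finished at 10:00 (hypothetical timestamp).
FAILED_END_TIME = datetime(2024, 10, 3, 10, 0)
WINDOW = timedelta(minutes=5)

def row_visible(now: datetime) -> bool:
    """Mimic: END_TIME >= CURRENT_TIMESTAMP - NUMTODSINTERVAL(5, 'MINUTE')."""
    return FAILED_END_TIME >= now - WINDOW

# Simulate one metric collection per minute after the failure.
for minute in range(8):
    now = FAILED_END_TIME + timedelta(minutes=minute)
    state = "FAILED row returned" if row_visible(now) else "no rows -> no violating sample"
    print(now.strftime("%H:%M"), state)
```

From 10:06 onwards the query is empty, so from the alerting side the "violation" quietly disappears even though no successful backup ever ran.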

 

mn_24_0-1727964780346.png

 

mn_24_1-1727965355960.png

 

Here I am checking the last two backups directly in the database, and you can see that the last one is FAILED. A problem was generated for it but auto-closed after only 2 minutes. Why?!

mn_24_3-1727965437320.png

mn_24_2-1727965420569.png

 

3 Replies

I was just now struggling with a similar issue. I think the "dealerting samples" count includes any 1-minute sample where the condition is not breached, even minutes where there is no data at all.

So let's take a dummy example:

10:00 Condition breach

10:01 <no data>

10:02 <no data>

10:03 <no data>

10:04 <no data>

10:05 Condition breach

If your "dealerting samples" setting is 3, then 10:01, 10:02 and 10:03 count as three 1-minute slots where the alert condition is not breached, and the alert is closed (too early). The data is not indicating success either, which makes this counter-intuitive, for me at least. I guess the workaround is simply to set a very large number of "dealerting samples", so that by the time it is reached you will likely have further input telling you whether the issue is still ongoing.
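That counting behavior can be sketched in a few lines of Python (a model of the behavior observed above, not of documented Dynatrace internals): every quiet minute, with or without data, advances the dealerting counter.

```python
# Dealerting threshold from the example above.
DEALERTING_SAMPLES = 3

def minutes_until_close(samples):
    """samples: one entry per minute: 'breach', 'ok', or None (no data).
    Returns the minute count at which the alert closes, or None if it stays open."""
    quiet = 0
    for i, s in enumerate(samples):
        # Anything that is not an explicit breach counts as a dealerting sample.
        quiet = 0 if s == "breach" else quiet + 1
        if quiet >= DEALERTING_SAMPLES:
            return i + 1
    return None

# 10:00 breach, four no-data minutes, 10:05 breach:
timeline = ["breach", None, None, None, None, "breach"]
print(minutes_until_close(timeline))  # closes after minute 4, before the 10:05 breach
```

The alert closes on the strength of "no data" alone, which matches the 2-minute open/close the original poster saw.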

GosiaMurawska
Community Team

Hi @mn_24, did you manage to solve your issue? I'm sure the rest of the Community wants to know the solution 🙂 

Hi,

Well, I am not using the two backup-related metrics myself, because I don't want to overlap with the work of the assigned Backup Teams.

But if you create a separate alerting profile and, if you use ServiceNow for example, integrate it properly, you can enforce the following rule: a human must close the incident manually.

The data for this metric is fetched every 1 minute, so set the violating samples to 4 and the window to 5. That gives enough time to process the ticket and lock it in SNOW until someone closes it manually.

The problem in Dynatrace will still be auto-closed, and it will continue checking based on the SQL. That way it keeps monitoring only the latest backups and doesn't stay stuck on one FAILED backup that may have completed without errors on the next run.
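The "4 violating samples out of a 5-minute window" rule suggested above can be sketched like this (a simplified model of sliding-window evaluation, not Dynatrace's actual implementation):

```python
# Suggested settings from the reply above.
VIOLATING, WINDOW = 4, 5

def alert_raised(samples):
    """samples: per-minute booleans, True = a FAILED sample was observed.
    Returns True if any 5-minute window contains at least 4 violations."""
    for i in range(len(samples) - WINDOW + 1):
        if sum(samples[i:i + WINDOW]) >= VIOLATING:
            return True
    return False

print(alert_raised([True, True, True, True, False]))   # 4 of 5 violating -> raised
print(alert_raised([True, False, True, False, True]))  # only 3 of 5 -> not raised
```

Requiring 4 of 5 samples filters out a single stray FAILED row while still firing within about 5 minutes of a persistent failure.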

On top of that you can enable the second metric to monitor only the FULL backups:

 

com.dynatrace.extension.sql-oracle.backup.time_since:filter(and(not(eq("incremental.level","INCREMENTAL 1")),or(not(eq("incremental.level","N/A"))))):splitBy("incremental.level","input.type","dt.entity.sql:com_dynatrace_extension_sql-oracle_host","dt.entity.sql:com_dynatrace_extension_sql-oracle_instance"):sort(value(auto,descending))

 

Don't forget to adjust the threshold according to your environment's needs; maybe 1 week would be fine.

 

 

The other option for monitoring Oracle database backups, if you don't have a dedicated Backup Team watching them via a third-party technology (which could be Dynatrace as well) and you run RMAN backup scripts from crontab: those backups should produce logs. You can edit the backup scripts to check for keywords like "RMAN-", "WARNING" or "ORA-" and, if any appear, save the matching lines to a separate log. Then use Log Monitoring in Dynatrace to watch that log.
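The keyword-filtering step described above could be sketched in Python rather than inline shell edits (the file paths and the helper name here are placeholders, not part of any real setup):

```python
# Error keywords mentioned in the reply above.
KEYWORDS = ("RMAN-", "WARNING", "ORA-")

def extract_errors(src_path: str, dst_path: str) -> int:
    """Append lines containing any error keyword from the RMAN log at
    src_path to the separate error log at dst_path; return the match count."""
    count = 0
    with open(src_path) as src, open(dst_path, "a") as dst:
        for line in src:
            if any(keyword in line for keyword in KEYWORDS):
                dst.write(line)
                count += 1
    return count
```

Run it at the end of the backup script, e.g. `extract_errors("/path/to/rman_backup.log", "/path/to/rman_errors.log")` (hypothetical paths), and point Dynatrace Log Monitoring at the error log.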
