Re: Samples / Threshold / Oracle backup with status "FAILED"

mn_24 · ‎03 Oct 2024

Hi,

I want to receive a problem when there is a FAILED Oracle database backup and to be closed only if the next one is successfully COMPLETED otherwise I want to investigate further why it is failing.

This is the metric/SQL statement behind/ which is used:

SELECT

backup_job.SESSION_RECID,

backup_job.SESSION_STAMP,

backup_job.START_TIME,

backup_job.END_TIME,

backup_job.STATUS,

backup_job.INPUT_TYPE,

backup_job.ELAPSED_SECONDS,

(backup_job.COMPRESSION_RATIO * 100) COMPRESSION_RATIO,

backup_job.INPUT_BYTES,

backup_job.OUTPUT_BYTES,

backup_job.INPUT_BYTES_PER_SEC,

backup_job.OUTPUT_BYTES_PER_SEC,

backup_job.AUTOBACKUP_DONE,

backup_job.OPTIMIZED,

backup_job.AUTOBACKUP_COUNT,

backup_set.DISTINCT_BACKUP_TYPES AS BACKUP_TYPES,

CASE

WHEN (

backup_job.INPUT_TYPE = 'ARCHIVELOG'

) THEN 'NO'

WHEN (

INSTR(backup_set.DISTINCT_BACKUP_TYPES, 'D') > 0 AND

COALESCE(MAX(backup_set.INCREMENTAL_LEVEL), 0) = 0

) THEN 'FULL'

WHEN (

INSTR(backup_set.DISTINCT_BACKUP_TYPES, 'D') > 0 AND

MAX(backup_set.INCREMENTAL_LEVEL) = 1

) THEN 'INCREMENTAL'

ELSE 'NO'

END AS DATAFILES_INCLUDED,

CASE

WHEN INSTR(backup_set.DISTINCT_BACKUP_TYPES, 'L') > 0 THEN 'YES'

ELSE 'NO'

END AS logs_included,

CASE

WHEN MAX(CASE WHEN backup_set.CONTROLFILE_INCLUDED = 'YES' THEN 1 ELSE 0 END) > 0 THEN 'YES'

ELSE 'NO'

END AS CONTROLFILE_INCLUDED,

CASE

WHEN MAX(CASE WHEN backup_set.INCREMENTAL_LEVEL = 0 THEN 1 WHEN backup_set.INCREMENTAL_LEVEL = 1 THEN 2 ELSE 0 END) = 1 THEN '0'

WHEN MAX(CASE WHEN backup_set.INCREMENTAL_LEVEL = 0 THEN 1 WHEN backup_set.INCREMENTAL_LEVEL = 1 THEN 2 ELSE 0 END) = 2 THEN '1'

ELSE 'N/A'

END AS INCREMENTAL_LEVEL

FROM V$RMAN_BACKUP_JOB_DETAILS backup_job

LEFT JOIN

(

SELECT

distinct_backup_set.DISTINCT_BACKUP_TYPES,

V$BACKUP_SET_DETAILS.SESSION_STAMP,

V$BACKUP_SET_DETAILS.SESSION_RECID,

V$BACKUP_SET_DETAILS.CONTROLFILE_INCLUDED,

V$BACKUP_SET_DETAILS.INCREMENTAL_LEVEL

FROM

(

SELECT

LISTAGG(BACKUP_TYPE, '+') WITHIN GROUP(ORDER BY BACKUP_TYPE) as DISTINCT_BACKUP_TYPES,

SESSION_STAMP,

SESSION_RECID

FROM

(

SELECT

DISTINCT BACKUP_TYPE,

SESSION_STAMP,

SESSION_RECID

FROM V$BACKUP_SET_DETAILS

)

GROUP BY

SESSION_STAMP,

SESSION_RECID

) distinct_backup_set

JOIN

V$BACKUP_SET_DETAILS

ON V$BACKUP_SET_DETAILS.SESSION_RECID = distinct_backup_set.SESSION_RECID AND V$BACKUP_SET_DETAILS.SESSION_STAMP = distinct_backup_set.SESSION_STAMP

) backup_set
ON backup_job.SESSION_RECID = backup_set.SESSION_RECID AND backup_job.SESSION_STAMP = backup_set.SESSION_STAMP
WHERE (
(
backup_job.END_TIME >= CURRENT_TIMESTAMP - NUMTODSINTERVAL(5, 'MINUTE')
OR
(backup_job.END_TIME IS NULL AND backup_job.START_TIME >= CURRENT_TIMESTAMP - NUMTODSINTERVAL(60, 'MINUTE'))
)
AND backup_job.STATUS NOT LIKE 'RUNNING%'
)
GROUP BY

backup_job.SESSION_RECID,

backup_job.SESSION_STAMP,

backup_job.START_TIME,

backup_job.END_TIME,

backup_job.STATUS,

backup_job.INPUT_TYPE,

backup_job.ELAPSED_SECONDS,

backup_job.COMPRESSION_RATIO,

backup_job.INPUT_BYTES,

backup_job.OUTPUT_BYTES,

backup_job.INPUT_BYTES_PER_SEC,

backup_job.OUTPUT_BYTES_PER_SEC,

backup_job.AUTOBACKUP_DONE,

backup_job.OPTIMIZED,

backup_job.AUTOBACKUP_COUNT,

backup_set.DISTINCT_BACKUP_TYPES

ORDER BY backup_job.SESSION_STAMP DESC;

The archive log backups of this database are scheduled to run on every 3 hours as I saw an auto-closed problem for FAILED backup , start digging and saw also that the problem was opened and closed in 2min - couldn't believe!

Why Davis has closed the problem before there is a real successfully completed backup or at least leave the problem open?

Is it because of the configuration with the samples, what about the threshold - should it be 0 or 1? OR it is because of the SQL statement which is fetching the data with this WHERE clause:

"....

WHERE (
(
backup_job.END_TIME >= CURRENT_TIMESTAMP - NUMTODSINTERVAL(5, 'MINUTE')
OR
(backup_job.END_TIME IS NULL AND backup_job.START_TIME >= CURRENT_TIMESTAMP - NUMTODSINTERVAL(60, 'MINUTE'))
)
AND backup_job.STATUS NOT LIKE 'RUNNING%'
)

....."

Here I am checking from the database the last two backups now and you can see that the LAST one is FAILED - a problem was generated for that but auto-closed after only 2min - WHY?!:

kalle_lahtinen · ‎21 Oct 2024

I was just now struggling with a similar issue. I think the "dealerting samples" includes any 1-minute sample where the condition is not breached - even including times when there is no data at all.

So let's take a dummy example:

10:00 Condition breach

10:01 <no data>

10:02 <no data>

10:03 <no data>

10:04 <no data>

10:05 Condition breach

If your "dealerting samples" is 3, it means you have 10:01, 10:02 and 10:03 as 1-minute slots where the alert condition is not breached, and the alert is then closed (too early). The data is not indicating success either, which makes this a bit counter-intuitive for me at least. I guess the solution is simply to add a gigantic number in the "dealerting samples", to ensure that you'll likely get some further input by that time, to know if the issue is still ongoing.

GosiaMurawska · ‎08 Jan 2025

Hi @mn_24, did you manage to solve your issue? I'm sure the rest of the Community wants to know the solution 🙂

mn_24 · ‎13 Jan 2025

Hi,

well i am not using the two backup related metrics because don't want to overlap the work of the assigned Backup Teams.

But if you create a separate Alerting profile and than if you are using for example ServiceNow integrate it properly based on the following rule: a human must close the incident manually.

The data for this metric is fetched on every 1min so change the Violating Samples to 4 and the Window to 5 just to have enough time to process the ticket and to lock it in SNOW until someone close it manually.

The problem in Dynatrace will be auto-closed and it will continue to check based on the SQL.

This is the process how it will continue to monitor only the latest backups and not stuck on one FAILED but after it could have completed without any errors, right?

On top of that you can enable the second metric to monitor only the FULL backups:

com.dynatrace.extension.sql-oracle.backup.time_since:filter(and(not(eq("incremental.level","INCREMENTAL 1")),or(not(eq("incremental.level","N/A"))))):splitBy("incremental.level","input.type","dt.entity.sql:com_dynatrace_extension_sql-oracle_host","dt.entity.sql:com_dynatrace_extension_sql-oracle_instance"):sort(value(auto,descending))

don't forget to adjust the threshold according to your environmental needs - maybe 1 week would be fine.

The other option to monitor oracle database backups is and if you don't have a dedicated Backup team who is monitoring them via third party technology(could be Dynatrace as well) and if you are using rman backup scripts in crontab. Those backups should have logs. You can edit the backup scripts to check for any keywords like "RMAN-", "WARNRING", "ORA-" and if there are any to save them in a separate log. Than use the Log Monitoring in Dynatrace to monitor this log.