03 Oct 2024 03:34 PM - last edited on 04 Oct 2024 07:07 AM by MaciejNeumann
Hi,
I want to receive a problem when there is a FAILED Oracle database backup and to be closed only if the next one is successfully COMPLETED otherwise I want to investigate further why it is failing.
This is the metric/SQL statement behind/ which is used:
SELECT
backup_job.SESSION_RECID,
backup_job.SESSION_STAMP,
backup_job.START_TIME,
backup_job.END_TIME,
backup_job.STATUS,
backup_job.INPUT_TYPE,
backup_job.ELAPSED_SECONDS,
(backup_job.COMPRESSION_RATIO * 100) COMPRESSION_RATIO,
backup_job.INPUT_BYTES,
backup_job.OUTPUT_BYTES,
backup_job.INPUT_BYTES_PER_SEC,
backup_job.OUTPUT_BYTES_PER_SEC,
backup_job.AUTOBACKUP_DONE,
backup_job.OPTIMIZED,
backup_job.AUTOBACKUP_COUNT,
backup_set.DISTINCT_BACKUP_TYPES AS BACKUP_TYPES,
CASE
WHEN (
backup_job.INPUT_TYPE = 'ARCHIVELOG'
) THEN 'NO'
WHEN (
INSTR(backup_set.DISTINCT_BACKUP_TYPES, 'D') > 0 AND
COALESCE(MAX(backup_set.INCREMENTAL_LEVEL), 0) = 0
) THEN 'FULL'
WHEN (
INSTR(backup_set.DISTINCT_BACKUP_TYPES, 'D') > 0 AND
MAX(backup_set.INCREMENTAL_LEVEL) = 1
) THEN 'INCREMENTAL'
ELSE 'NO'
END AS DATAFILES_INCLUDED,
CASE
WHEN INSTR(backup_set.DISTINCT_BACKUP_TYPES, 'L') > 0 THEN 'YES'
ELSE 'NO'
END AS logs_included,
CASE
WHEN MAX(CASE WHEN backup_set.CONTROLFILE_INCLUDED = 'YES' THEN 1 ELSE 0 END) > 0 THEN 'YES'
ELSE 'NO'
END AS CONTROLFILE_INCLUDED,
CASE
WHEN MAX(CASE WHEN backup_set.INCREMENTAL_LEVEL = 0 THEN 1 WHEN backup_set.INCREMENTAL_LEVEL = 1 THEN 2 ELSE 0 END) = 1 THEN '0'
WHEN MAX(CASE WHEN backup_set.INCREMENTAL_LEVEL = 0 THEN 1 WHEN backup_set.INCREMENTAL_LEVEL = 1 THEN 2 ELSE 0 END) = 2 THEN '1'
ELSE 'N/A'
END AS INCREMENTAL_LEVEL
FROM V$RMAN_BACKUP_JOB_DETAILS backup_job
LEFT JOIN
(
SELECT
distinct_backup_set.DISTINCT_BACKUP_TYPES,
V$BACKUP_SET_DETAILS.SESSION_STAMP,
V$BACKUP_SET_DETAILS.SESSION_RECID,
V$BACKUP_SET_DETAILS.CONTROLFILE_INCLUDED,
V$BACKUP_SET_DETAILS.INCREMENTAL_LEVEL
FROM
(
SELECT
LISTAGG(BACKUP_TYPE, '+') WITHIN GROUP(ORDER BY BACKUP_TYPE) as DISTINCT_BACKUP_TYPES,
SESSION_STAMP,
SESSION_RECID
FROM
(
SELECT
DISTINCT BACKUP_TYPE,
SESSION_STAMP,
SESSION_RECID
FROM V$BACKUP_SET_DETAILS
)
GROUP BY
SESSION_STAMP,
SESSION_RECID
) distinct_backup_set
JOIN
V$BACKUP_SET_DETAILS
ON V$BACKUP_SET_DETAILS.SESSION_RECID = distinct_backup_set.SESSION_RECID AND V$BACKUP_SET_DETAILS.SESSION_STAMP = distinct_backup_set.SESSION_STAMP
) backup_set
ON backup_job.SESSION_RECID = backup_set.SESSION_RECID AND backup_job.SESSION_STAMP = backup_set.SESSION_STAMP
WHERE (
(
backup_job.END_TIME >= CURRENT_TIMESTAMP - NUMTODSINTERVAL(5, 'MINUTE')
OR
(backup_job.END_TIME IS NULL AND backup_job.START_TIME >= CURRENT_TIMESTAMP - NUMTODSINTERVAL(60, 'MINUTE'))
)
AND backup_job.STATUS NOT LIKE 'RUNNING%'
)
GROUP BY
backup_job.SESSION_RECID,
backup_job.SESSION_STAMP,
backup_job.START_TIME,
backup_job.END_TIME,
backup_job.STATUS,
backup_job.INPUT_TYPE,
backup_job.ELAPSED_SECONDS,
backup_job.COMPRESSION_RATIO,
backup_job.INPUT_BYTES,
backup_job.OUTPUT_BYTES,
backup_job.INPUT_BYTES_PER_SEC,
backup_job.OUTPUT_BYTES_PER_SEC,
backup_job.AUTOBACKUP_DONE,
backup_job.OPTIMIZED,
backup_job.AUTOBACKUP_COUNT,
backup_set.DISTINCT_BACKUP_TYPES
ORDER BY backup_job.SESSION_STAMP DESC;
The archive log backups of this database are scheduled to run on every 3 hours as I saw an auto-closed problem for FAILED backup , start digging and saw also that the problem was opened and closed in 2min - couldn't believe!
Why Davis has closed the problem before there is a real successfully completed backup or at least leave the problem open?
Is it because of the configuration with the samples, what about the threshold - should it be 0 or 1? OR it is because of the SQL statement which is fetching the data with this WHERE clause:
"....
WHERE (
(
backup_job.END_TIME >= CURRENT_TIMESTAMP - NUMTODSINTERVAL(5, 'MINUTE')
OR
(backup_job.END_TIME IS NULL AND backup_job.START_TIME >= CURRENT_TIMESTAMP - NUMTODSINTERVAL(60, 'MINUTE'))
)
AND backup_job.STATUS NOT LIKE 'RUNNING%'
)
....."
Here I am checking from the database the last two backups now and you can see that the LAST one is FAILED - a problem was generated for that but auto-closed after only 2min - WHY?!:
21 Oct 2024 01:05 PM
I was just now struggling with a similar issue. I think the "dealerting samples" includes any 1-minute sample where the condition is not breached - even including times when there is no data at all.
So let's take a dummy example:
10:00 Condition breach
10:01 <no data>
10:02 <no data>
10:03 <no data>
10:04 <no data>
10:05 Condition breach
If your "dealerting samples" is 3, it means you have 10:01, 10:02 and 10:03 as 1-minute slots where the alert condition is not breached, and the alert is then closed (too early). The data is not indicating success either, which makes this a bit counter-intuitive for me at least. I guess the solution is simply to add a gigantic number in the "dealerting samples", to ensure that you'll likely get some further input by that time, to know if the issue is still ongoing.