03 Oct 2024 03:34 PM - last edited on 04 Oct 2024 07:07 AM by MaciejNeumann
Hi,
I want to receive a problem when there is a FAILED Oracle database backup and to be closed only if the next one is successfully COMPLETED otherwise I want to investigate further why it is failing.
This is the metric/SQL statement behind/ which is used:
SELECT
backup_job.SESSION_RECID,
backup_job.SESSION_STAMP,
backup_job.START_TIME,
backup_job.END_TIME,
backup_job.STATUS,
backup_job.INPUT_TYPE,
backup_job.ELAPSED_SECONDS,
(backup_job.COMPRESSION_RATIO * 100) COMPRESSION_RATIO,
backup_job.INPUT_BYTES,
backup_job.OUTPUT_BYTES,
backup_job.INPUT_BYTES_PER_SEC,
backup_job.OUTPUT_BYTES_PER_SEC,
backup_job.AUTOBACKUP_DONE,
backup_job.OPTIMIZED,
backup_job.AUTOBACKUP_COUNT,
backup_set.DISTINCT_BACKUP_TYPES AS BACKUP_TYPES,
CASE
WHEN (
backup_job.INPUT_TYPE = 'ARCHIVELOG'
) THEN 'NO'
WHEN (
INSTR(backup_set.DISTINCT_BACKUP_TYPES, 'D') > 0 AND
COALESCE(MAX(backup_set.INCREMENTAL_LEVEL), 0) = 0
) THEN 'FULL'
WHEN (
INSTR(backup_set.DISTINCT_BACKUP_TYPES, 'D') > 0 AND
MAX(backup_set.INCREMENTAL_LEVEL) = 1
) THEN 'INCREMENTAL'
ELSE 'NO'
END AS DATAFILES_INCLUDED,
CASE
WHEN INSTR(backup_set.DISTINCT_BACKUP_TYPES, 'L') > 0 THEN 'YES'
ELSE 'NO'
END AS logs_included,
CASE
WHEN MAX(CASE WHEN backup_set.CONTROLFILE_INCLUDED = 'YES' THEN 1 ELSE 0 END) > 0 THEN 'YES'
ELSE 'NO'
END AS CONTROLFILE_INCLUDED,
CASE
WHEN MAX(CASE WHEN backup_set.INCREMENTAL_LEVEL = 0 THEN 1 WHEN backup_set.INCREMENTAL_LEVEL = 1 THEN 2 ELSE 0 END) = 1 THEN '0'
WHEN MAX(CASE WHEN backup_set.INCREMENTAL_LEVEL = 0 THEN 1 WHEN backup_set.INCREMENTAL_LEVEL = 1 THEN 2 ELSE 0 END) = 2 THEN '1'
ELSE 'N/A'
END AS INCREMENTAL_LEVEL
FROM V$RMAN_BACKUP_JOB_DETAILS backup_job
LEFT JOIN
(
SELECT
distinct_backup_set.DISTINCT_BACKUP_TYPES,
V$BACKUP_SET_DETAILS.SESSION_STAMP,
V$BACKUP_SET_DETAILS.SESSION_RECID,
V$BACKUP_SET_DETAILS.CONTROLFILE_INCLUDED,
V$BACKUP_SET_DETAILS.INCREMENTAL_LEVEL
FROM
(
SELECT
LISTAGG(BACKUP_TYPE, '+') WITHIN GROUP(ORDER BY BACKUP_TYPE) as DISTINCT_BACKUP_TYPES,
SESSION_STAMP,
SESSION_RECID
FROM
(
SELECT
DISTINCT BACKUP_TYPE,
SESSION_STAMP,
SESSION_RECID
FROM V$BACKUP_SET_DETAILS
)
GROUP BY
SESSION_STAMP,
SESSION_RECID
) distinct_backup_set
JOIN
V$BACKUP_SET_DETAILS
ON V$BACKUP_SET_DETAILS.SESSION_RECID = distinct_backup_set.SESSION_RECID AND V$BACKUP_SET_DETAILS.SESSION_STAMP = distinct_backup_set.SESSION_STAMP
) backup_set
ON backup_job.SESSION_RECID = backup_set.SESSION_RECID AND backup_job.SESSION_STAMP = backup_set.SESSION_STAMP
WHERE (
(
backup_job.END_TIME >= CURRENT_TIMESTAMP - NUMTODSINTERVAL(5, 'MINUTE')
OR
(backup_job.END_TIME IS NULL AND backup_job.START_TIME >= CURRENT_TIMESTAMP - NUMTODSINTERVAL(60, 'MINUTE'))
)
AND backup_job.STATUS NOT LIKE 'RUNNING%'
)
GROUP BY
backup_job.SESSION_RECID,
backup_job.SESSION_STAMP,
backup_job.START_TIME,
backup_job.END_TIME,
backup_job.STATUS,
backup_job.INPUT_TYPE,
backup_job.ELAPSED_SECONDS,
backup_job.COMPRESSION_RATIO,
backup_job.INPUT_BYTES,
backup_job.OUTPUT_BYTES,
backup_job.INPUT_BYTES_PER_SEC,
backup_job.OUTPUT_BYTES_PER_SEC,
backup_job.AUTOBACKUP_DONE,
backup_job.OPTIMIZED,
backup_job.AUTOBACKUP_COUNT,
backup_set.DISTINCT_BACKUP_TYPES
ORDER BY backup_job.SESSION_STAMP DESC;
The archive log backups of this database are scheduled to run on every 3 hours as I saw an auto-closed problem for FAILED backup , start digging and saw also that the problem was opened and closed in 2min - couldn't believe!
Why Davis has closed the problem before there is a real successfully completed backup or at least leave the problem open?
Is it because of the configuration with the samples, what about the threshold - should it be 0 or 1? OR it is because of the SQL statement which is fetching the data with this WHERE clause:
"....
WHERE (
(
backup_job.END_TIME >= CURRENT_TIMESTAMP - NUMTODSINTERVAL(5, 'MINUTE')
OR
(backup_job.END_TIME IS NULL AND backup_job.START_TIME >= CURRENT_TIMESTAMP - NUMTODSINTERVAL(60, 'MINUTE'))
)
AND backup_job.STATUS NOT LIKE 'RUNNING%'
)
....."
Here I am checking from the database the last two backups now and you can see that the LAST one is FAILED - a problem was generated for that but auto-closed after only 2min - WHY?!:
21 Oct 2024 01:05 PM
I was just now struggling with a similar issue. I think the "dealerting samples" includes any 1-minute sample where the condition is not breached - even including times when there is no data at all.
So let's take a dummy example:
10:00 Condition breach
10:01 <no data>
10:02 <no data>
10:03 <no data>
10:04 <no data>
10:05 Condition breach
If your "dealerting samples" is 3, it means you have 10:01, 10:02 and 10:03 as 1-minute slots where the alert condition is not breached, and the alert is then closed (too early). The data is not indicating success either, which makes this a bit counter-intuitive for me at least. I guess the solution is simply to add a gigantic number in the "dealerting samples", to ensure that you'll likely get some further input by that time, to know if the issue is still ongoing.
08 Jan 2025 04:15 PM
Hi @mn_24, did you manage to solve your issue? I'm sure the rest of the Community wants to know the solution 🙂
13 Jan 2025 04:34 PM
Hi,
well i am not using the two backup related metrics because don't want to overlap the work of the assigned Backup Teams.
But if you create a separate Alerting profile and than if you are using for example ServiceNow integrate it properly based on the following rule: a human must close the incident manually.
The data for this metric is fetched on every 1min so change the Violating Samples to 4 and the Window to 5 just to have enough time to process the ticket and to lock it in SNOW until someone close it manually.
The problem in Dynatrace will be auto-closed and it will continue to check based on the SQL.
This is the process how it will continue to monitor only the latest backups and not stuck on one FAILED but after it could have completed without any errors, right?
On top of that you can enable the second metric to monitor only the FULL backups:
com.dynatrace.extension.sql-oracle.backup.time_since:filter(and(not(eq("incremental.level","INCREMENTAL 1")),or(not(eq("incremental.level","N/A"))))):splitBy("incremental.level","input.type","dt.entity.sql:com_dynatrace_extension_sql-oracle_host","dt.entity.sql:com_dynatrace_extension_sql-oracle_instance"):sort(value(auto,descending))
don't forget to adjust the threshold according to your environmental needs - maybe 1 week would be fine.
The other option to monitor oracle database backups is and if you don't have a dedicated Backup team who is monitoring them via third party technology(could be Dynatrace as well) and if you are using rman backup scripts in crontab. Those backups should have logs. You can edit the backup scripts to check for any keywords like "RMAN-", "WARNRING", "ORA-" and if there are any to save them in a separate log. Than use the Log Monitoring in Dynatrace to monitor this log.