30 Jul 2025 03:11 PM - last edited on 31 Jul 2025 06:50 AM by MaciejNeumann
When migrating to Dynatrace, you will find that it has no simple out-of-the-box way to apply different disk space thresholds based on different metrics. A fairly common practice is to alert on % Free Space for smaller volumes (normally under 1 terabyte) and on Free Space Available for larger volumes (normally over 1 terabyte). Even if you used the same metric for both volume sizes, the thresholds would still have to differ by size, or you would get useless alerts: either for small volumes that will NEVER have the amount of free space a large volume has, or for large volumes that are below 10% free but still have over 100 GB free.
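The sizing rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the Dynatrace setup: the 1 TB cutoff comes from this post, the 10% and 100 GB values are borrowed from the example above, and the function name `is_low_on_space` is hypothetical.

```python
TB = 1000**4  # 1 terabyte in bytes (decimal), the size cutoff used in this post
GB = 1000**3  # 1 gigabyte in bytes (decimal)


def is_low_on_space(total_bytes, free_bytes,
                    pct_threshold=10.0, abs_threshold_bytes=100 * GB):
    """Return True if a volume should alert under the mixed rule.

    Volumes larger than 1 TB are judged on absolute free space,
    all other volumes on percent free. Threshold values are examples only.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    if total_bytes > TB:
        # Large volume: compare free bytes against an absolute threshold.
        return free_bytes < abs_threshold_bytes
    # Small volume: compare percent free against a percentage threshold.
    return free_bytes / total_bytes * 100 < pct_threshold


# 200 GB volume with 15 GB free: 7.5% free, so it alerts.
print(is_low_on_space(200 * GB, 15 * GB))   # True
# 4 TB volume with 300 GB free: also 7.5% free, but plenty of space, so no alert.
print(is_low_on_space(4 * TB, 300 * GB))    # False
```

Both example volumes are 7.5% free, yet only the small one is actually at risk, which is the gap a single percentage threshold cannot cover.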
So how do you address this? After looking at the solutions posted in various online forums and trying to make Disk Edge address this specific scenario, Davis Anomaly Detectors (DADs) seemed to be the logical solution. Why not metric events, you may ask? We tried that first and found that you cannot keep the dimension limits under control in metric events; if you filter the DQL in a DAD correctly, however, the dimension issue stays in check. Why not Disk Edge? It relies on host properties, and that is not the route we wanted to go. DADs provide a fairly simple, self-contained way to calculate a volume's size via DQL and then apply the correct threshold. Note that you will need one DAD for each type of threshold (% Free or Free Space Available) and for each severity level you wish to alert at (warning, critical, etc.).
Although there is a limit of 100 static threshold DADs, this is still the simplest and most practical way we have found to address this concern.
Enough about all that; here are the nuts and bolts. Let's start with the DQL that is the meat of the solution. The comments document the query, so be sure to read them. The query is set up for the LARGE VOLUME WARNING case; keep this in mind as you read the section that contains the thresholds. Also note that disk exclusions in Dynatrace can sometimes be slow to take effect, so we worked around that by filtering out any disks we had tagged with Alert:False (via API):
// query the disk for used space, available space, then calculate total size and percentage available
timeseries {usage=avg(dt.host.disk.used, rollup:avg), avail=avg(dt.host.disk.avail, rollup: avg)}, by:{dt.entity.host, dt.source_entity}, interval:1m
| fieldsAdd size = avail[] + usage[]
| fieldsAdd percentAvail = avail[] / size[]
| fieldsAdd DiskSize = arrayAvg(size), DiskAvail = arrayAvg(avail), DiskUsage = arrayAvg(usage)
| fieldsAdd DiskAvailPercent = (DiskAvail / DiskSize) * 100
// filter for large or small volume here
| filter DiskSize > 1000000000000
// filter for percent free or space available here
// Small volumes use DiskAvailPercent
// warning (the lower bound prevents warn/crit from firing at the same time)
//| filter DiskAvailPercent < 10 and DiskAvailPercent >= 5
// critical
//| filter DiskAvailPercent < 5
// Large volumes use DiskAvail
// warning (the lower bound prevents warn/crit from firing at the same time)
| filter DiskAvail < 91268055040 and DiskAvail >= 59055800320
// critical
//| filter DiskAvail < 59055800320
// add additional fields to remaining records for additional filtering
| lookup [
fetch dt.entity.disk
],
sourceField: dt.source_entity,
lookupField: id,
fields: {disk.entity.name = entity.name, tags}
| fieldsAdd strTags = toString(tags)
// remove any disks that are tagged Alert:False, are shared Windows volumes, or are mapped drives
| filterOut contains(strTags, "Alert:False")
| filterOut contains(disk.entity.name, "\\\\") and not contains(disk.entity.name, "windows_share")
| lookup [
fetch dt.entity.host
],
sourceField: dt.entity.host,
lookupField: id,
fields: {host.entity.name = entity.name, hypervisorType, networkZone}
// remove any Kubernetes drives or drives with null volume names
| filterOut isNull(disk.entity.name)
| filterOut contains(networkZone, "_k8s")
// remove unnecessary fields - keep only the array that matches the detector: avail for large volumes, percentAvail for small volumes
| fieldsRemove hypervisorType, networkZone, tags, strTags, usage, DiskUsage, size, DiskSize, DiskAvail, DiskAvailPercent, percentAvail
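The byte values in the large-volume filters are easier to check when converted. The Python sketch below shows that the two thresholds work out to 85 GiB and 55 GiB (binary units), while the 1000000000000 size cutoff is a decimal terabyte. The `severity` helper is hypothetical and only mirrors the intent of the warning and critical filters; it is not something Dynatrace runs.

```python
GIB = 1024**3  # bytes per gibibyte

WARN_BELOW = 91268055040  # warning threshold taken from the query
CRIT_BELOW = 59055800320  # critical threshold taken from the query


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / GIB


def severity(avail_bytes):
    """Classify a large volume's free space into non-overlapping bands."""
    if avail_bytes < CRIT_BELOW:
        return "critical"
    if avail_bytes < WARN_BELOW:
        return "warning"
    return None


print(to_gib(WARN_BELOW), to_gib(CRIT_BELOW))  # 85.0 55.0
print(severity(70 * GIB))  # warning
print(severity(40 * GIB))  # critical
```

To pick different thresholds, multiply the desired GiB value by 1024**3 and use the result in both the DQL filter and the detector's alert condition.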
OK, so that is the query. For the DAD, put the query in the SCOPE. For the alert condition (adjust as needed for your shop), be sure to use the same threshold as in the DQL:
And for the details (again, adjust as needed for your shop):
I hope this helps anyone else who is fighting disk space monitoring limitations in Dynatrace. Please comment with any issues or ideas for improvement.
31 Jul 2025 07:23 AM
Thank you!!!
04 Aug 2025 02:31 PM
Disk Edge alerting can be triggered on available space in % and in MiB as well. So if I understand correctly, your DQL contains tag-based filtering. I was also looking at an earlier post from you - https://community.dynatrace.com/t5/Product-ideas/Allow-the-use-of-tags-at-the-disk-level-in-disk-edg...
In case the edge alert's disk filter would accept tags ... would that provide a reasonable / final solution for this problem?
20 Aug 2025 03:27 PM
Disk Edge does not accept tags. I submitted an RFE to allow tags in Disk Edge, but it was rejected as not part of the roadmap.