I have created Custom Alert for container CPU and memory utilization in % of the limit.
It has triggered problems but in Problem details there is no container/containerGroup name printed,
please let me know whether its possible to print container/containerGroup name in Problems.
Or let me know if there is totally different approach to achieve this.
select "show Explorer" to show containers breaching threshold.
Solved! Go to Solution.
you need to use the "container-group-instance" for splitting. The "Container" is only the name, so it doesn't link to an entity. The container-group-instance is the unique identifier for your container and is thus linked to an entity.
Using a valid entity for splitting will result in the proper link in your problem card 🙂
You can always use our data-explorer for testing these links by hovering over some lines.
FYI - here is a collection with some frequently asked expressions for K8s. This also includes my suggestion for your use-case: Checking this on a workload/deployment level instead of every single container. Cause you likely want to know if this is a general problem (e.g., limits set too low), or just some container randomly hitting the limits.
Hope this helps 🙂
Hi @florian_g , I am also facing a similar issue where the Problem is not getting tracked under the correct entity and the alert I have created is to notify me when a container restarts on my AKS , I cross checked the alert I have with the sample provided in the github link you shared and it almost looks similar with some extra filters in my case . Below is the metric expression I use :
first of all: congratulations on coming up with this entity filter 🚀 great job. As a quick side note, in case you don't know: Our data explorer is of great help here for creating such filters - most of the time I use it for coming up with an initial filter, and later refine it in code mode.
Back to your question - the query looks good to me. So the problem is triggered, but not attached to the right entity? Can you share some screenshots? I would be interested in the problem screen of an active problem and the corresponding alert configuration.
Please find below the screenshot for one of the sample fired alerts , the problem is attaching itself to the Environment , in my case my Env ID is val72867
Below is my Alert configuration :
also the "The number of metric dimensions in the last 24 hours exceeds the limit of 1,000. Please refine the entity scope or the dimension filters above or choose a different monitoring strategy" error is something I am seeing only today
Thanks a ton , I have changed my alert type to Kubernetes workload to if it helps . Will cross check as soon as another alert fires .
Is there some recommendation to get around the "the dimension error is a limitation," issue ? Will this prevent us from getting alerted if there is a issue ? Also please do share any recommendation to avoid such in this case . Below is the query I use .
Should i avoid using some filters maybe ? I for sure need the namespace filter , but if it helps I can try avoiding the clustername filter .
Happy Friday @hari-m94 👋,
as you said, the limitation makes your alert unreliable. Basically, the problem is, that there's simply too much data per minute to keep track off. So actually narrowing down the scope of your alert (that is, adding filters), might help you to solve this. From what I know, there's one limit per alert and one limit for your entire environemnt. Meaning, that you won't be able to simply narrow down the scope of your alert and afterward create an endless number of alerts to cover all the other areas. At some point this limit is likely going to hit you. So that's likely not what you wanted to hear - so let's turn to the good news.
We're currently working on a variety of out-of-the-box alerts for K8s entities which will easily scale also to large environments - enabled by simply flipping a toggle (no metric expression needed 💓). Think of alerts for frequent container restarts, OOM kills, abnormal CPU throttling, namespaces hitting quotas, your k8s cluster running out of CPU/mem, ... you name it 😉 Of course, customizable and easy to only configure for a narrow scope (like you're doing with your filter) or your entire tenant.
We're currently testing internally a first MVP for that and hope to deliver this feature soon. The plan is to provide you with some first alerts during CQ4 this year. The roll-out will afterward happen in phases: We'll start with a smaller set of alerts and will add more with later releases.
Hope you love these great news as much as I do 🙂
Happy weekend 👋
just realized another very important thing @hari-m94 - your use-case only involves one metric. For such use-cases, I always recommend sticking with the "build" mode - this one comes without the mentioned limitations 🚀 See below how this could look like for your use-case - works perfectly in my environment.
Hi @florian_g ,
Unfortunately the builder way of adding the alerts is not working as all and I am seeing no alerts getting fired . Unsure if I am missing something . Please take a look and advise (I have added the alert screenshots below for you quick reference)
looks valid - only thing I notice: the query offset says "undefined" (second screenshot on top). Could you set this to the default of 2 pls?
If this does not help, please open up a support ticket as then I would want to have my support/engineering friends to have a look at it.