Azure Monitor getting smarter
If you have been working with Azure workloads, chances are that you have used Azure Monitor alerts to keep an eye on workload metrics and get notified in near real time whenever unexpected behavior is detected.
The main challenge here is that you need to define which metrics you want to capture, and you also need to define what constitutes the normal range for each of those metrics.
Needless to say, you will have to manually revise the metric selection and the alert thresholds throughout the resource's life cycle. While this may be acceptable in some cases, the task becomes really tedious and error-prone for newly deployed resources, for systems whose behavior changes frequently, and for systems with many resources.
Static Metric Thresholds
During the early phases of a project, I usually set up the initial metric thresholds for the target resource based on an initial estimate, sometimes a guess, of the expected metric values. After an initial period of monitoring, I get a sense of the actual values and re-adjust the thresholds accordingly.
In the screenshot below, I configured an alert rule for a simple logic app with an HTTP trigger, where I specified two conditions to be monitored, the number of failed runs and the average request latency of the logic app, so that I get notified whenever either of those conditions is exceeded.
A few things to note here: you can have up to two conditions in a single alert rule, and, as mentioned in the information note in the screenshot, if you have multiple conditions with different granularities in an alert rule, the larger one will be chosen for both. Makes sense!
Further, I configured the email notification option in the alert rule's configured actions, so whenever any of my evaluated conditions is met, I get a notification. It is important to note that a newly created alert rule may take up to 10 minutes to come into action.
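If you prefer scripting over the portal, here is a minimal sketch of an equivalent rule using the azure-mgmt-monitor Python SDK. The resource IDs, metric names, aggregations and threshold values below are illustrative assumptions rather than the exact values from my rule, so adjust them to your own logic app.

```python
# A rough sketch of the same static-threshold rule created with the
# azure-mgmt-monitor SDK instead of the portal. All IDs, names, metric
# names and threshold values below are illustrative placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    MetricAlertResource,
    MetricAlertSingleResourceMultipleMetricCriteria,
    MetricCriteria,
    MetricAlertAction,
)

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
logic_app_id = (
    f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
    "/providers/Microsoft.Logic/workflows/<logic-app-name>"
)
action_group_id = (
    f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
    "/providers/microsoft.insights/actionGroups/<email-action-group>"
)

client = MonitorManagementClient(DefaultAzureCredential(), subscription_id)

# Two static conditions in one rule: failed runs and average run latency.
criteria = MetricAlertSingleResourceMultipleMetricCriteria(
    all_of=[
        MetricCriteria(
            name="FailedRuns",
            metric_name="RunsFailed",        # logic app metric (placeholder)
            time_aggregation="Total",
            operator="GreaterThan",
            threshold=0,
        ),
        MetricCriteria(
            name="HighLatency",
            metric_name="RunLatency",        # logic app metric (placeholder)
            time_aggregation="Average",
            operator="GreaterThan",
            threshold=5,                     # illustrative latency threshold
        ),
    ]
)

client.metric_alerts.create_or_update(
    resource_group,
    "logicapp-static-alert",
    MetricAlertResource(
        location="global",
        description="Static thresholds for failed runs and run latency",
        severity=3,
        enabled=True,
        scopes=[logic_app_id],
        evaluation_frequency="PT1M",  # how often the conditions are checked
        window_size="PT5M",           # aggregation granularity
        criteria=criteria,
        actions=[MetricAlertAction(action_group_id=action_group_id)],
    ),
)
```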
I used the SoapUI load test feature to invoke my logic app and simulated a set of failed transactions by intentionally including an input that results in a failed logic app run.
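If you don't have SoapUI at hand, a simple Python loop against the HTTP trigger's callback URL does a similar job; the URL and the "bad" payload below are placeholders for whatever input makes your workflow fail.

```python
# A simple stand-in for the SoapUI load test: repeatedly post a payload that
# the workflow is known to reject, so a batch of failed runs shows up in the
# metrics. Both the callback URL and the payload are placeholders.
import requests

CALLBACK_URL = "https://<logic-app-http-trigger-callback-url>"
bad_payload = {"orderId": None}  # intentionally invalid input for this workflow

for _ in range(20):
    response = requests.post(CALLBACK_URL, json=bad_payload, timeout=30)
    print(response.status_code)
```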
And voilà! I got an alert informing me that a monitoring condition was triggered, along with the target resource and the triggered alert rule.
Importantly, for each triggered alert, Azure Monitor will automatically follow up in subsequent aggregation windows to check whether the triggering condition has been resolved, and it will inform you about that too, as in the screenshot below.
This was pretty much what I needed in terms of monitoring and notifications, and it did the required job. However, there was one challenge: as mentioned earlier, the thresholds I set were based on my initial expectations of the actual resource usage and overall system behavior, which are usually very hard to estimate, and sometimes based on a guess from me or from my customer, if I was lucky enough to get such input. Moreover, I had to manually revise these static thresholds frequently along the way.
Dynamic Metric Thresholds
I was really excited to learn that Microsoft introduced a new option in Azure Monitor: dynamic thresholds with upper and lower metric ranges.
This leverages Microsoft's proprietary machine learning algorithms, which analyze actual ingested metrics over a period of time, identify normal seasonal patterns, and send a notification whenever anomalies that form significant deviations from those patterns are detected.
In the screenshot below, I wanted to monitor any deviation, lower or higher, in the total number of my logic app's completed runs, so I chose the (Greater or Less than) operator for the alert condition. Alternatively, the operator can be configured as (Greater than) or (Less than) if you want to focus only on higher or lower deviations.
The generated (Condition preview), seen in the screenshot below, is also a nice feature that clearly explains, in plain English, how your condition will be evaluated.
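The dynamic condition can also be created from code. Below is a rough sketch using the azure-mgmt-monitor models, reusing the client and resource IDs from the earlier static-threshold sketch; "RunsCompleted" and the other values are assumptions for illustration. Notice that no fixed threshold appears anywhere, and that the operator alerts on deviations in either direction.

```python
# A sketch of the dynamic-threshold condition with the azure-mgmt-monitor
# models, reusing client, logic_app_id, resource_group and action_group_id
# from the earlier static-threshold sketch. "RunsCompleted" and the other
# values are assumptions, not confirmed settings from the article.
from azure.mgmt.monitor.models import (
    MetricAlertResource,
    MetricAlertMultipleResourceMultipleMetricCriteria,
    MetricAlertAction,
    DynamicMetricCriteria,
    DynamicThresholdFailingPeriods,
)

dynamic_criteria = MetricAlertMultipleResourceMultipleMetricCriteria(
    all_of=[
        DynamicMetricCriteria(
            name="CompletedRunsDeviation",
            metric_name="RunsCompleted",   # logic app metric (placeholder)
            time_aggregation="Total",
            operator="GreaterOrLessThan",  # deviations above or below the learned range
            alert_sensitivity="Medium",
            failing_periods=DynamicThresholdFailingPeriods(
                number_of_evaluation_periods=4,
                min_failing_periods_to_alert=4,
            ),
        )
    ]
)

client.metric_alerts.create_or_update(
    resource_group,
    "logicapp-dynamic-alert",
    MetricAlertResource(
        location="global",
        description="Dynamic threshold on the number of completed runs",
        severity=3,
        enabled=True,
        scopes=[logic_app_id],
        evaluation_frequency="PT30M",
        window_size="PT30M",
        criteria=dynamic_criteria,
        actions=[MetricAlertAction(action_group_id=action_group_id)],
    ),
)
```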
Further, the dynamic thresholds are… dynamic. That is, they are continuously improved and adapted according to the overall metrics data, both newly ingested metrics and older ones. This actually mimics what I was doing manually: getting a sense of the ingested metrics along the way and continuously re-adjusting the thresholds based on what is considered a normal pattern.
However, you still have some control over the extent of deviation that will trigger an alert, using a three-level threshold sensitivity setting: detecting only larger deviations (Low), a balanced option (Medium), or detecting the slightest deviation from normal patterns (High).
To further enhance the notifications and avoid noise from transient anomalies, you also have control over the number of consecutive deviations, over a specified period of time, that must occur before the alert is triggered. These transient spikes will still be displayed in the monitoring graph, although no alert will be created for them.
Moreover, you have the option to ignore older metrics. This is especially useful when a new system behavior is expected to be completely different from the previous one, for example when going live after an initial period of testing, adding a new upstream channel, or scaling the system resources, all of which could have an impact on the previously captured metrics and on what is considered a normal pattern.
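To make the mapping concrete, here is how those three settings, threshold sensitivity, consecutive deviations, and ignoring older metrics, would look on the dynamic criterion from the previous sketch. The field names follow the azure-mgmt-monitor models and the values are only illustrative.

```python
# How the three tuning settings map onto the dynamic criterion from the
# previous sketch (illustrative values only).
from datetime import datetime, timezone

from azure.mgmt.monitor.models import (
    DynamicMetricCriteria,
    DynamicThresholdFailingPeriods,
)

tuned_criterion = DynamicMetricCriteria(
    name="CompletedRunsDeviation",
    metric_name="RunsCompleted",
    time_aggregation="Total",
    operator="GreaterOrLessThan",
    # Threshold sensitivity: "Low", "Medium" or "High".
    alert_sensitivity="Low",
    # Require 3 deviating windows out of the last 4 before alerting,
    # so a single transient spike does not trigger a notification.
    failing_periods=DynamicThresholdFailingPeriods(
        number_of_evaluation_periods=4,
        min_failing_periods_to_alert=3,
    ),
    # Ignore metrics captured before go-live, since the test-phase
    # traffic no longer reflects the normal pattern (placeholder date).
    ignore_data_before=datetime(2020, 1, 1, tzinfo=timezone.utc),
)
```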
I also liked the fail-proof hints provided by Azure Monitor to guard against inadvertent mistakes in certain settings, some of which might produce false positives, like the example in the screenshot below.
Testing the Dynamic Thresholds
To test the dynamic threshold alerts, I used the same logic app with a scheduled trigger that runs every 30 minutes, to establish the normal pattern over a period of a few days.
In the first few days, while Azure Monitor is still ingesting and analyzing the system metrics, you will not get any useful information; in fact, you will get an error message indicating that the dynamic thresholds could not be retrieved yet, like the one below.
After the dynamic thresholds were determined and the normal metric patterns detected, I ran 5 load tests that invoked the logic app with 45 calls within a 30-minute period.
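If you want to script that burst instead of using a load-test tool, a rough sketch like the following would do; it assumes the workflow can still be reached through an HTTP trigger callback URL (a placeholder here) and simply spreads 45 calls over roughly half an hour, well above the baseline of one scheduled run every 30 minutes.

```python
# A rough Python equivalent of those load tests: 45 calls spread over about
# 30 minutes against a placeholder HTTP trigger callback URL.
import time
import requests

CALLBACK_URL = "https://<logic-app-http-trigger-callback-url>"

for i in range(45):
    requests.post(CALLBACK_URL, json={"orderId": i}, timeout=30)
    time.sleep(40)  # 45 calls at ~40 s intervals is roughly 30 minutes
```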
As expected, I got an email notifying me about this anomaly in the number of logic app completed runs. Again, just like with static alerts, you also get a deactivation alert for the original one when the anomaly occurrences subside.
Moreover, you are provided with a nice graph representing the occurrence of such anomalies outside the normal pattern and the corresponding time range.
In the screenshot below, the blue solid area represents the automatically generated threshold, based on the selected threshold sensitivity, and the red area indicates the range and period of the anomalies. As mentioned in the information note, you will be able to see these thresholds change over time.
Interestingly, if you change the (Threshold Sensitivity) configuration, you can see in real time on the graph how the new threshold limits relate to the captured data. For instance, you can see whether the same captured metrics lie within the normal threshold range or whether they would have been treated as anomalies, as seen below.

It is important to note that, currently, dynamic threshold alert conditions are created based on a single criterion; you can't add multiple conditions with different metrics in the same alert rule, so you will need to create a separate rule for each metric.
Final Thoughts
I have seen several cases where anticipating, configuring, and re-configuring static alert thresholds was a very tedious and error-prone task, in any monitoring platform.
With Azure Monitor's built-in smart machine learning algorithms, wrapped in user-friendly screens and useful hints, continuously analyzing metric patterns and detecting anomalies becomes much easier and more accurate.
I highly recommend that you learn more about Azure Monitor dynamic thresholds in the Microsoft documentation and try them yourself, alongside your static thresholds, to see their full potential and whether they add value to your solution's monitoring.