Author: Sarah Kreuzmann
Product Owner Rimscout
On July 30, 2024, Azure and Microsoft 365 services suffered significant incidents and delays that affected many organizations around the world. For administrators, one question is critical during such an outage: Which services are affected, and how will this impact my users and operations?
Learn how to use Rimscout to identify incidents and efficiently assess user impact in this blog post.
The July 30th Azure incident: an overview
Like many incidents, the Azure outage on July 30th was triggered by an attack, in this case a massive DDoS attack on core Azure services such as Azure Front Door. The defense systems should have handled the attack without trouble, but a faulty network configuration led to rapid network congestion instead. The result was a series of incidents in which users experienced slow performance or unavailability when accessing services. Services directly affected included Azure App Services, Application Insights, and the Azure Portal itself.
The fact that some Microsoft 365 services were also affected proved to be particularly problematic. Microsoft 365 uses Azure services such as Microsoft Entra ID (formerly Azure Active Directory) to manage user identities and authentication. As a result, the incident in Azure also caused problems in Microsoft 365, affecting services such as Admin Center, Intune, Entra ID, Power BI, and Power Platform.
Shortly after the incident began, at approximately 14:21 CEST, Microsoft notified administrators of problems accessing Microsoft 365 services. However, it was still unclear exactly which services and users were affected. To make matters worse, false reports circulated during the outage that Microsoft Teams, for example, was also affected. With this information alone, it is not easy for administrators to determine if users in their own organization are affected.
Client Monitoring: Your tool for rapid incident assessment
If you want to respond quickly and appropriately to such incidents, effective monitoring is essential. Only by monitoring their own infrastructure and users can internal IT quickly determine if there are any unusual delays or outages.
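At Net at Work we use two main tools for this purpose: PRTG and Rimscout. Each has a different use: PRTG checks our own servers and hosted applications from the server room, while Rimscout measures connection quality from the users' devices.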
Of course, our internal chat group also shared news about the global Azure outage during the incident. Thanks to solid monitoring data, our internal IT department was able to quickly assess the extent of the problem: they were aware of the incident, but could not detect any impact on our employees or on key services such as Microsoft Teams.
Rimscout in action: The incident from the user’s perspective
But what information did the monitoring system provide during the incident? How could we be sure that our users were not affected?
With PRTG you can first determine whether your own servers and hosted applications are running within normal parameters. The tested remote stations could still be reached “normally” from the server room. But when it comes to the users, this view from the server room is not enough. For this reason, Net at Work has installed Rimscout clients on the devices of all our employees. This allows us to assess not only the general performance of the network, but also the quality of the connection to various services, both in the office and at home. In particular, we use appropriately configured Rimscout tests to monitor Microsoft Teams, Outlook and Dynamics 365.
When the first incident report went out to administrators around 14:20, we were able to quickly check in the Rimscout portal whether our users’ Rimscout clients had reported any problems. A quick glance at the health overview showed that the connection quality for employees was still in the green zone. This remained unchanged for the entire duration of the incident (approx. 13:45 – 16:30). Throughout, the majority of clients reported no performance degradation, and Microsoft services remained available at all times.
In addition to an initial overview, Rimscout also provides a detailed look at individual locations. For example, the location overview for the Net at Work office showed that only a few clients were reporting minor performance problems with Microsoft Teams and Dynamics. However, when you filter the network environment data for these issues, as shown in the screenshot above, you can see that the affected clients were connected via VPN. These issues were therefore not related to the incident, but had a local cause in the VPN connection.
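The VPN check described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not Rimscout's data model or API: the `ClientReport` record, its field names, and the sample values are all hypothetical. The idea is to compare the share of degraded clients on VPN with the share on direct connections.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientReport:
    """One hypothetical per-client test result (field names are assumptions)."""
    client: str
    service: str
    degraded: bool
    via_vpn: bool


def degraded_share_by_connection(reports):
    """Return the share of degraded reports per connection type ("vpn"/"direct")."""
    total = Counter()
    degraded = Counter()
    for report in reports:
        key = "vpn" if report.via_vpn else "direct"
        total[key] += 1
        if report.degraded:
            degraded[key] += 1
    return {key: degraded[key] / total[key] for key in total}


# Invented sample data: the only degraded clients are the ones on VPN.
reports = [
    ClientReport("pc-01", "Microsoft Teams", True, True),
    ClientReport("pc-02", "Microsoft Teams", False, False),
    ClientReport("pc-03", "Dynamics 365", True, True),
    ClientReport("pc-04", "Dynamics 365", False, False),
    ClientReport("pc-05", "Microsoft Teams", False, False),
]

shares = degraded_share_by_connection(reports)
print(shares)
```

If the degraded share is high for VPN clients and close to zero for direct connections, a local cause in the VPN path is a more plausible explanation than a global incident.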
Review of the Microsoft 365 outage in January 2023
Services like Microsoft Teams have not always been spared in recent years. For example, the global incident on January 25, 2023 affected most Microsoft 365 services. Microsoft Teams and Outlook were temporarily unavailable.
When employees started reporting connection issues, Rimscout was already showing connection quality problems. The data quickly suggested that the individual problems had a common global cause, which ultimately proved to be the case. The latency data collected at the time shows how performance to the Microsoft Teams remote station gradually degraded until the service was finally unavailable. In contrast, latency to Microsoft Teams remained “normal” during the July incident.
The average latency for Microsoft Teams at the Net at Work office site during the Microsoft outage on January 25, 2023.
The average latency for Microsoft Teams at the Net at Work office site during the Azure incident on July 30, 2024.
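The difference between the two latency curves can be expressed as a simple rule: compare the average of the most recent measurements with a known baseline. The following Python sketch shows one way to do that. The function name, the window size, the factor of 2, and the sample values are assumptions chosen for illustration; they are not the measurements from the charts and not how Rimscout evaluates latency.

```python
from statistics import mean


def is_degrading(latencies_ms, baseline_ms, window=3, factor=2.0):
    """Return True if the mean of the last `window` samples exceeds factor * baseline."""
    if len(latencies_ms) < window:
        # Not enough data to judge; treat as not degrading.
        return False
    return mean(latencies_ms[-window:]) > factor * baseline_ms


# Invented series: one climbs steadily, the other stays flat.
rising_series = [40, 45, 60, 90, 150, 260]
flat_series = [40, 42, 39, 41, 43, 40]

print(is_degrading(rising_series, baseline_ms=40))
print(is_degrading(flat_series, baseline_ms=40))
```

A steadily climbing series like the first one trips the rule, while a flat series like the second one does not, which mirrors the contrast between the January and July incidents.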
Monitoring Azure with Rimscout
Since Azure services are not a frequently used resource for most of our employees, they are not monitored with Rimscout at Net at Work. However, this can be easily adjusted in Rimscout:
By creating relevant tests in the Rimscout test configuration, you can monitor the performance and accessibility of various Azure services. For example, to monitor the Azure portal, you can create an HTTP test for the remote station https://portal.azure.com. By running this test from the Rimscout clients in your own tenant, you can quickly get an overview of how good the connection to the Azure portal is at all employee locations.
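To make clear what such an HTTP test measures, here is a minimal Python sketch using only the standard library. It is not how the Rimscout client is implemented; the `probe` function, the `ProbeResult` type, and the injectable `opener` parameter are assumptions made for this example. The sketch sends one GET request and records whether the endpoint answered successfully and how long the request took.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProbeResult:
    ok: bool                # True if the endpoint answered with a 2xx/3xx status
    status: Optional[int]   # HTTP status code, or None if no response arrived
    elapsed_ms: float       # wall-clock duration of the request


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Send one GET request to `url` and time it.

    `opener` is injectable so the function can be exercised without network access.
    """
    start = time.perf_counter()
    status = None
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        status = err.code
    except OSError:
        # Covers URLError, DNS failures, refused connections, and timeouts.
        status = None
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    ok = status is not None and 200 <= status < 400
    return ProbeResult(ok=ok, status=status, elapsed_ms=elapsed_ms)
```

Calling `probe("https://portal.azure.com")` from a user's device returns a single measurement; repeating it at intervals from many devices is what turns individual probes into a picture of connection quality per location.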
More importantly, these tests can determine if performance is degrading, and if so, whether the problem affects an individual user, a whole site, or perhaps all sites as part of a global incident. When users report problems with the service, you can quickly determine whether the cause is in the network and, if so, whether it is a local issue, such as a poor connection to the provider.
In addition to the Azure Portal, other Azure services can also be monitored. Here are some examples with the corresponding remote stations:
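The distinction between an individual, site-wide, and global problem can be sketched as a small classification over per-client results. The Python example below is an assumption-laden illustration: the tuple layout, the 50% threshold, and the category names are invented for this post and do not describe Rimscout's internal logic.

```python
from collections import defaultdict


def classify_scope(results, threshold=0.5):
    """Classify degradation as 'none', 'individual', 'site', or 'global'.

    `results` is an iterable of (site, client, degraded) tuples. A site counts
    as affected when at least `threshold` of its clients are degraded.
    """
    per_site = defaultdict(list)
    for site, _client, degraded in results:
        per_site[site].append(bool(degraded))

    total_degraded = sum(sum(flags) for flags in per_site.values())
    if total_degraded == 0:
        return "none"

    affected_sites = [
        site for site, flags in per_site.items()
        if sum(flags) / len(flags) >= threshold
    ]
    if not affected_sites:
        return "individual"
    # Only call it global when more than one site exists and all are affected.
    if len(per_site) > 1 and len(affected_sites) == len(per_site):
        return "global"
    return "site"


# Invented sample: one of three office clients is degraded, home is fine.
sample = [
    ("office", "pc-01", True),
    ("office", "pc-02", False),
    ("office", "pc-03", False),
    ("home", "pc-04", False),
]
print(classify_scope(sample))
```

With a single degraded client among several healthy ones, the sketch reports an individual problem; when every site crosses the threshold, it points to a global cause.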
- Azure Storage: Monitor the availability and performance of storage resources at https://.blob.core.windows.net.
- Azure SQL Database: Monitor database connections and performance at https://.database.windows.net.
- Azure Virtual Machines: Monitor the availability and performance of virtual machines at https://.wvd.microsoft.com for Azure Virtual Desktop.
This comprehensive monitoring with Rimscout helps you respond to problems quickly and minimize the impact on end users.
Bottom Line
Monitoring services is a critical factor in detecting global IT incidents and responding quickly to outages. Monitoring tools such as Rimscout provide an efficient way to monitor performance and accessibility from the user’s perspective, allowing you to quickly assess the impact on your own users.
Even if there is no global outage at the moment, Rimscout can give you insight into your network and possible problems for your users. After all, the answer to a performance problem often lies in the local connectivity of users, not in a global outage of Microsoft or other cloud providers.