In a preliminary post-incident report, Microsoft has revealed that this week’s 5-hour-long Microsoft 365 worldwide outage was triggered by a faulty Enterprise Configuration Service (ECS) deployment that led to cascading failures and availability impact across multiple regions.
ECS is an internal central configuration repository designed to enable Microsoft services to make wide-scope dynamic changes across multiple services and features, as well as targeted ones such as specific configurations per tenant or user.
What started as a minor Microsoft Teams outage spread downstream to multiple Microsoft 365 services with Teams integration that also rely on ECS, including Exchange Online, Windows 365, and Office Online.
As a result, users worldwide began reporting that they could not use Microsoft Teams and multiple Microsoft 365 services or features.
“This issue affected the users’ ability to connect to the Microsoft Teams Desktop, Web and Mobile clients,” the company explained in its preliminary report.
“Telemetry indicated that approximately 300k calls were impacted by this event. The Asia Pacific (APAC) region was most affected due to business hours coinciding with the impact window. Additionally, Direct Routing and Skype MFA were mostly impacted service.”
According to Redmond’s report, the incident started on Thursday, July 21, at 1:05 AM UTC, with the company’s engineers remediating most of its impact within five hours, by 6:00 AM UTC.
However, there was also some isolated residual impact until 1:14 PM UTC the same day, matching customer reports on social media.
In the end, the incident affected users attempting to utilize one or more of the following Microsoft 365 services and features (all impacted to some degree by the outage):
Exchange Online (Delays sending mail)
Microsoft 365 admin center (Inability to access)
Microsoft Word within multiple services (Inability to load)
Microsoft Forms (Inability to use via Teams)
Microsoft Graph API (Any service relying on this API may have been affected)
Office Online (Microsoft Word access issues)
SharePoint Online (Microsoft Word access issues)
Project Online (Inability to access)
PowerPlatform and PowerAutomate (Inability to create an environment with a database)
Autopatches within Microsoft Managed Desktop
Yammer (Impact to Yammer flighting)
Windows 365 (Unable to provision Cloud PCs)
Preliminary root cause was an ECS failure
As Redmond describes in its incident report, the preliminary root cause of the outage was a faulty Enterprise Configuration Service (ECS) deployment.
“A deployment in the ECS service contained a code defect that affected backward compatibility with services that leverage ECS. The net result was that for services that utilize ECS it would return incorrect configurations to all its partners,” the company explained.
“This resulted in downstream services getting a ‘200’ status message (indicating the pull was successful), however, it actually contained a malformed JSON object.
“The extent of the impact depended on how individual Microsoft services utilize the malformed configuration provided by ECS. Impact ranged from services crashing such as Teams while other services experienced limited to no impact.”
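The failure mode Microsoft describes, a successful status code wrapping an unusable body, is why config consumers commonly validate the payload itself rather than trusting the status alone. The following is a minimal illustrative sketch, not Microsoft's code: the function name `parse_config` and the required keys `version` and `settings` are hypothetical, since the report does not describe the ECS schema.

```python
import json

# Hypothetical schema: the report does not say which fields ECS configs carry.
REQUIRED_KEYS = {"version", "settings"}


def parse_config(status_code: int, body: str) -> dict:
    """Return the parsed config, or raise ValueError if it is unusable."""
    if status_code != 200:
        raise ValueError(f"unexpected status {status_code}")
    try:
        config = json.loads(body)
    except json.JSONDecodeError as exc:
        # A 200 response can still carry a body that is not valid JSON.
        raise ValueError(f"malformed JSON: {exc}") from exc
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object")
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return config
```

A caller that receives `ValueError` here can decline to apply the new configuration instead of crashing on it later.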
As a result of this incident, Microsoft says it is working on improving the resiliency of the Microsoft Teams service so that it can fall back to a cached ECS configuration version in the event of a future ECS failure.
The company is also investing in additional fault isolation to limit the impact of an ECS failure and updating monitoring thresholds to better identify such low-grade failures.
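The cached-configuration fallback mentioned above is often implemented as a "last known good" pattern: keep the most recent configuration that parsed cleanly and serve it whenever a fresh pull is unusable. The sketch below only illustrates that general pattern under stated assumptions; the class name `CachedConfigClient` and the `fetch` callable are hypothetical, and Microsoft has not published how Teams will implement it.

```python
import json
from typing import Callable, Optional


class CachedConfigClient:
    """Serves the last successfully parsed config when a new pull is bad."""

    def __init__(self, fetch: Callable[[], str]):
        # `fetch` is a hypothetical callable returning the raw response body.
        self._fetch = fetch
        self._last_good: Optional[dict] = None

    def get(self) -> dict:
        try:
            config = json.loads(self._fetch())
            if not isinstance(config, dict):
                raise ValueError("config must be a JSON object")
        except ValueError:
            # json.JSONDecodeError is a subclass of ValueError.
            if self._last_good is None:
                raise  # nothing cached yet, so surface the failure
            return self._last_good
        self._last_good = config
        return config


# Usage: the second pull returns truncated JSON, so the cached copy is served.
responses = iter(['{"feature_x": true}', '{"feature_x": '])
client = CachedConfigClient(lambda: next(responses))
first = client.get()
second = client.get()
```

This sketch keeps the cache in memory only; a real client would likely persist it and record how stale the served configuration is.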