A misconfigured policy applied to Microsoft-managed storage accounts led to widespread failures affecting virtual machine operations, managed identity services, and developer pipelines.
Microsoft’s Azure cloud platform experienced a multi-hour service disruption starting Monday evening, severely impacting two crucial pieces of enterprise cloud operations: virtual machine provisioning and managed identity authentication. The outage persisted for more than 10 hours, commencing at 19:46 UTC on Monday and concluding by 06:05 UTC on Tuesday.
Initially, the event prevented customers from deploying or scaling virtual machines across several regions. Subsequently, a related platform problem arose with the Managed Identities for Azure Resources service in the East US and West US regions from 00:10 UTC to 06:05 UTC on Tuesday. GitHub Actions also faced a brief interruption as a result of the disruption.
Policy alteration caused the disruption
This outage stemmed from an accidental policy update applied to some Microsoft-managed storage accounts, including those essential for virtual machine extension packages. This modification restricted public read access, hindering operations like virtual machine extension package downloads, as detailed in Microsoft’s official status history.
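To make the failure mode concrete (this is an illustrative sketch, not Microsoft’s internal tooling), the short Python probe below performs an anonymous GET against a blob that agents would expect to fetch publicly; the storage account and blob path are hypothetical placeholders. When a policy change revokes public read access, a download that previously returned HTTP 200 starts returning 403 or 404, which is how extension-package retrieval would surface the misconfiguration.

# Minimal sketch: anonymous probe of a blob expected to be publicly readable.
# The storage account and blob path are hypothetical placeholders, not the
# actual Microsoft-managed accounts involved in the incident.
import requests

EXTENSION_PACKAGE_URL = (
    "https://exampleextensions.blob.core.windows.net/"  # hypothetical account
    "packages/linux-vm-agent-extension.zip"             # hypothetical blob
)

def probe_public_read(url: str, timeout: float = 10.0) -> None:
    resp = requests.get(url, timeout=timeout, stream=True)
    if resp.status_code == 200:
        print("Public read OK")
    elif resp.status_code in (401, 403, 404):
        # A policy change disabling anonymous/public access typically surfaces
        # as one of these codes on downloads that previously succeeded.
        print(f"Public read blocked (HTTP {resp.status_code})")
    else:
        print(f"Unexpected response: HTTP {resp.status_code}")

if __name__ == "__main__":
    probe_public_read(EXTENSION_PACKAGE_URL)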
Under tracking ID FNJ8-VQZ, numerous customers encountered issues deploying or scaling virtual machines, manifesting as errors during provisioning and routine operations. Other crucial services also felt the impact.
Users of Azure Kubernetes Service reported problems with node provisioning and extension setups. Similarly, Azure DevOps and GitHub Actions users experienced pipeline disruptions when tasks relied on virtual machine extensions or associated packages. Furthermore, any operations needing to download extension packages from Microsoft’s storage accounts suffered from reduced performance.
While an initial fix was implemented within roughly two hours, it inadvertently triggered a secondary platform problem with Managed Identities for Azure Resources. This led to authentication failures for customers trying to create, update, or delete Azure resources, or to obtain Managed Identity tokens.
Microsoft’s status history, under tracking ID M5B-9RZ, confirmed that the previous mitigation resulted in a significant surge of traffic, overloading the managed identities platform service in both East US and West US regions.
This surge affected the ability to create and utilize Azure resources linked to managed identities, encompassing services like Azure Synapse Analytics, Azure Databricks, Azure Stream Analytics, Azure Kubernetes Service, Microsoft Copilot Studio, Azure Chaos Studio, Azure Database for PostgreSQL Flexible Servers, Azure Container Apps, Azure Firewall, and Azure AI Video Indexer.
Despite several attempts to scale up infrastructure, the backlog and retry volumes remained unmanageable. Consequently, Microsoft redirected traffic away from the affected service so that repairs could be made to the core infrastructure without active load.
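Microsoft has not published client-side guidance tied to this specific incident, but the retry amplification it describes is a familiar pattern: clients that retry failed token requests without limits turn a degraded service into an overloaded one. The sketch below is a generic illustration, assuming the documented Azure instance metadata (IMDS) token endpoint and placeholder retry parameters; it caps attempts and applies exponential backoff with full jitter so retries spread out rather than arriving as a synchronized surge.

# Sketch of bounded retries with exponential backoff and jitter for managed
# identity token requests. Only works on an Azure VM, where the IMDS endpoint
# below is reachable; retry counts and delays are illustrative assumptions.
import random
import time
import requests

IMDS_TOKEN_URL = (
    "http://169.254.169.254/metadata/identity/oauth2/token"
    "?api-version=2018-02-01&resource=https://management.azure.com/"
)

def get_token_with_backoff(max_attempts: int = 5, base_delay: float = 1.0) -> str:
    for attempt in range(max_attempts):
        try:
            resp = requests.get(IMDS_TOKEN_URL, headers={"Metadata": "true"}, timeout=5)
            if resp.status_code == 200:
                return resp.json()["access_token"]
            # Non-200 (e.g., 429/5xx): the token service is throttling or degraded.
        except requests.RequestException:
            pass  # network-level failure; treat as retryable
        # Full jitter keeps retry traffic from synchronizing into a surge
        # against an already overloaded service.
        delay = random.uniform(0, base_delay * (2 ** attempt))
        time.sleep(delay)
    raise RuntimeError("Managed identity token unavailable after bounded retries")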
“This outage not only brought down websites but also paused critical development processes and interrupted day-to-day operations,” remarked Pareekh Jain, CEO of EIIRTrend & Pareekh Consulting.
Increasing frequency of cloud outages
Instances of cloud service disruptions have grown more common recently, with leading providers like AWS, Google Cloud, and IBM facing significant, publicized failures. AWS experienced severe impacts for over 15 hours due to a DNS issue that rendered the DynamoDB API unstable.
Last November, an incorrect configuration within Cloudflare’s Bot Management system caused sporadic service outages across various online platforms. Earlier, in June, a faulty automated update compromised Google’s identity and access management (IAM) system, preventing users from authenticating to third-party applications via Google.
“The modern data center architecture is evolving, influenced by the increasing demand for complex workloads driven by the pace and unpredictability of AI. This rapid growth introduces new complexities and strains existing interdependencies. Therefore, any misconfiguration or oversight at the control level can severely impact the entire environment,” explained Neil Shah, co-founder and VP at Counterpoint Research.
Strategies for future cloud outages
This occurrence is not unique. For Chief Information Officers, such events underscore the urgency of re-evaluating resilience frameworks.
When a hyperscale dependency fails, CIOs should avoid simply waiting. Instead, Jain advises adopting a strategy of stabilize, prioritize, and communicate. He elaborated, “Initially, stabilize the situation by formally declaring a cloud incident, appointing a single incident commander, promptly assessing if the problem impacts control-plane functions or active workloads, and halting all non-critical changes like deployments and infrastructure updates.”
Jain further suggested prioritizing restoration by safeguarding customer-facing operations such as traffic routing, payment processing, authentication, and support. If CI/CD is affected, critical pipelines should be moved to self-hosted or alternative runners, with releases queued behind a business-approved gate. The final step involves communication and containment: providing consistent internal updates detailing affected services, workarounds, and next update times, and deploying pre-approved customer communication templates if the public is impacted.
Shah highlighted that these outages serve as a stark reminder for enterprises and CIOs to diversify workloads across multiple cloud service providers (CSPs) or adopt hybrid solutions with essential redundancies. To mitigate future operational disruptions, it’s also crucial to maintain lean and modular CI/CD pipelines.
Furthermore, the scaling strategy for critical code and services must be planned meticulously, differentiating between real-time and non-real-time needs. CIOs also need comprehensive operational insight into latent dependencies so they can anticipate the impact of such incidents, and they must establish a strong mitigation strategy.

