Microsoft: Azure Delays Not Acknowledged For 5 Hours Because Manager Was Asleep

Microsoft has revealed it took five hours to acknowledge lengthy disruptions affecting European customers in late March because the task of informing customers relied on a US-based incident manager, who was asleep at the time. 

The delays affected customers in Europe and the UK for three days beginning around 9am UTC on March 24. However, at the outset, as customers struggled with extra-sluggish Azure services, Microsoft missed its 10-minute target for acknowledging issues by a wide margin. 

In a post mortem, Chad Kimes, director of engineering at Azure, admitted Microsoft's "communication during this incident was also problematic" and apologized for the frustration and confusion this caused the 6,136 customers affected.

The technical issue itself was caused by virtual-machine capacity constraints due to a surge in demand for Azure compute resources during the COVID-19 coronavirus pandemic. This resulted in 21-minute delays affecting Microsoft's Pipelines DevOps service for releasing new builds targeting Windows and Linux agents in Azure. The longest delay was nine hours, according to Kimes.

"The problem here is that our live-site processes have a gap for these types of incidents," Kimes said of the communication issue. 

"When incidents involve customer request failures or performance impacts, we have automated tooling that starts an incident and loops in both a DRI (designated responsible individual) and what we call a PIM (primary incident manager). The PIM is typically the person responsible for posting external communications acknowledging the incident," he added.

"Pipeline delays are detected by different tooling, and the PIM is not currently paged for these types of incidents. As a result, while the DRI was hard at work understanding the technical issues and looking for potential mitigations, the PIM was still asleep. Only when the PIM joined the incident bridge at roughly the beginning of business hours in the Eastern United States was the incident finally acknowledged."

Microsoft says it is planning to improve its live-site processes to "ensure that initial communication of pipeline delay incidents happens on the same schedule as other incident types".
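
The gap Kimes describes can be pictured as a paging table in which one incident type never reaches the person who posts external communications. The sketch below is illustrative only: the incident-type names, the table layout and both helper functions are assumptions made for this example, not Microsoft's actual tooling.

```python
# Hypothetical role labels, taken from the abbreviations in the post mortem.
DRI = "DRI"  # designated responsible individual
PIM = "PIM"  # primary incident manager, who posts external communications

# Assumed paging table mirroring the description: request failures and
# performance impacts page both roles, pipeline delays page only the DRI.
PAGING_BEFORE = {
    "request_failure": {DRI, PIM},
    "performance_impact": {DRI, PIM},
    "pipeline_delay": {DRI},
}


def types_without_comms_owner(rules):
    """Return the incident types for which the PIM is never paged."""
    return sorted(kind for kind, roles in rules.items() if PIM not in roles)


def close_gap(rules):
    """Return a new table in which every incident type also pages the PIM."""
    return {kind: roles | {PIM} for kind, roles in rules.items()}


if __name__ == "__main__":
    print(types_without_comms_owner(PAGING_BEFORE))
    print(types_without_comms_owner(close_gap(PAGING_BEFORE)))
```

Under these assumptions, the first check flags only pipeline delays, and the second comes back empty once every incident type pages the PIM, which is the outcome Microsoft's planned process change is aiming for.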

The company is also rolling out architectural changes to mitigate bottlenecks in spinning up new agents from its hosted agent pool. 
