Google's New Free Software Security Book: 'Wi-Fi Password Reset Crippled One Of Our Systems'
Even the biggest tech companies have major IT disasters because engineers fail to foresee that a seemingly minor event might overload a vulnerable system.
If your job involves protecting IT infrastructure, it could well be worth reading Google's new and free 500-page book detailing numerous failures affecting Google's internal systems and products like YouTube.
Importantly, the new book also reveals how its site-reliability engineering and security teams cooperate to protect key Google systems, from Android to Chrome, Gmail, Search, and Google Cloud.
Few companies in the world operate at Google's scale, but nonetheless there may be lessons to learn from Google's book, which comes as the COVID-19 coronavirus pandemic makes it more important than ever for online systems to remain reliable, available, and secure.
The book offers insights from teams that practice so-called site reliability engineering (SRE), Google's approach to coordinating the software engineers who develop its products and systems with the operations teams that keep those products running.
Google, which has used SRE principles for nearly two decades, defines it as "what you get when you treat operations as if it's a software problem".
The new book, titled 'Building Secure and Reliable Systems', focuses on how Google brings an SRE approach to security, and security's role in software product development and operations. Google's previous books on SRE covered best practices in SRE but didn't deal with the links between reliability and security.
"For good reasons, enterprise security teams have largely focused on confidentiality. However, organizations often recognize data integrity and availability to be equally important, and address these areas with different teams and different controls," explains Royal Hansen, an early SRE lead for Gmail and Google's current VP of security engineering.
"The SRE function is a best-in-class approach to reliability. However, it also plays a role in the real-time detection of and response to technical issues – including security-related attacks on privileged access or sensitive data. Ultimately, while engineering teams are often organizationally separated according to specialized skillsets, they have a common goal: ensuring the quality and safety of the system or application."
The book opens with two questions: "Can a system be considered truly reliable if it isn't fundamentally secure? Or can it be considered secure if it's unreliable?"
Google's first tale concerns a cascading failure in 2012, which began when its corporate transportation team announced that the Wi-Fi password for the buses connecting its San Francisco Bay Area campuses had changed.
The flood of employees trying to look up the new password overloaded Google's internal password manager and knocked both its primary and secondary replicas offline.
Restarting the system required a smart card. Google kept these in safes in multiple offices across the globe, but not in New York, where the on-call engineer was located. So the engineer reached out to a colleague in Australia for one there, which turned out to be locked in a safe whose code the colleague had not memorized.
And where was the code saved? Of course, in the now-offline password manager. But there were even more failures as engineers fumbled to restart the password manager.
"On that day in September, the corporate transportation team emailed an announcement to thousands of employees that the WiFi password had changed. The resulting spike in traffic was far larger than the password management system – which had been developed years earlier for a small audience of system administrators – could handle.
The load caused the primary replica of the password manager to become unresponsive, so the load balancer diverted traffic to the secondary replica, which promptly failed in the same way. At this point, the system paged the on-call engineer. The engineer had no experience responding to failures of the service: the password manager was supported on a best-effort basis, and had never suffered an outage in its five years of existence. The engineer attempted to restart the service, but did not know that a restart required a hardware security module (HSM) smart card.
These smart cards were stored in multiple safes in different Google offices across the globe, but not in New York City, where the on-call engineer was located. When the service failed to restart, the engineer contacted a colleague in Australia to retrieve a smart card. To their great dismay, the engineer in Australia could not open the safe because the combination was stored in the now-offline password manager. Fortunately, another colleague in California had memorized the combination to the on-site safe and was able to retrieve a smart card.
However, even after the engineer in California inserted the card into a reader, the service still failed to restart with the cryptic error, "The password could not load any of the cards protecting this key."
At this point, the engineers in Australia decided that a brute-force approach to their safe problem was warranted and applied a power drill to the task. An hour later, the safe was open – but even the newly retrieved cards triggered the same error message.
It took an additional hour for the team to realize that the green light on the smart card reader did not, in fact, indicate that the card had been inserted correctly. When the engineers flipped the card over, the service restarted and the outage ended."
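The failover pattern in Google's account, where a load balancer hands the full spike to the next replica and that replica fails the same way, can be illustrated with a toy simulation. This is a minimal sketch and not Google's code: the `Replica` class, the capacity of 100 requests, the spike of 5,000 requests, and the `shed_load` switch are all hypothetical values chosen for illustration. It contrasts unprotected replicas with replicas that reject excess requests so they stay responsive.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    capacity: int          # requests the replica can serve at once (hypothetical)
    healthy: bool = True

    def handle(self, requests: int, shed_load: bool) -> int:
        """Return the number of requests served; may mark the replica unhealthy."""
        if not self.healthy:
            return 0
        if requests <= self.capacity:
            return requests
        if shed_load:
            # Serve what fits and reject the rest, so the replica stays up.
            return self.capacity
        # No protection: the overload makes the replica unresponsive.
        self.healthy = False
        return 0


def route(replicas, requests, shed_load):
    """Send the whole spike to the first healthy replica, failing over in order."""
    for replica in replicas:
        if not replica.healthy:
            continue
        served = replica.handle(requests, shed_load)
        if replica.healthy:
            return served
        # The replica just failed; the same load moves to the next one.
    return 0  # every replica is down


def simulate(shed_load):
    """Run one traffic spike against a primary and a secondary replica."""
    replicas = [Replica("primary", 100), Replica("secondary", 100)]
    served = route(replicas, requests=5000, shed_load=shed_load)
    alive = [r.name for r in replicas if r.healthy]
    return served, alive


if __name__ == "__main__":
    print("no load shedding:", simulate(shed_load=False))
    print("with load shedding:", simulate(shed_load=True))
```

In this toy model the unprotected run ends with both replicas down and nothing served, while the load-shedding run serves a fraction of the spike and keeps both replicas up. Real systems combine several such protections, and the numbers here carry no meaning beyond the illustration.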