A Practical Demonstration Of The Difference Between 'resilient' And 'redundant'
Who, Me? Monday is upon us, and with it comes a cautionary tale of how one Register reader's overconfidence led to his undoing, thanks to an unexpected interfacing with a belt buckle, in today's edition of Who, Me?
Our story comes from "Dan", a lead system admin at what he described as a "rather large company" where the vast majority of the business went through a single suite of applications backed by a central database.
"The number of zeros on the 'dollars per minute' lost in unscheduled downtime was frightening," he told us.
The company liked Dan. He had rocked up after his predecessor departed under a cloud and had spent quite a while getting the environment into shape. He found Development, QA and Production all running with differing patch levels and occasionally even different OS versions. Configurations didn't match. Hardware architectures differed. And so on.
It took a while but, once Dan had lined everything up, downtime due to bugs being found in production dropped significantly. It's fair to say the company was very pleased. Perhaps a bit too pleased.
"There remained, however, one nasty fly in the ointment," he said. "The guys in sales have been promising our clients for years that our systems were resilient and redundant.
"Resilient, they had become. Redundant they weren't. Not in any sense of the word."
However, bit by bit, the company's systems were indeed slowly becoming redundant, aided by Dan "stalking the developer cubes with a bat to encourage the removal, or non-creation, of code that would not play nice with node failover."
The final piece of the redundancy jigsaw was the database. It ran on Sun hardware, and Dan's team could hotswap pretty much any hardware component without the system suffering any downtime. Resilient, for sure. But still not redundant despite the joy from management at the dramatic reduction in outages.
"They thought my team had learned to walk on water or something," recalled Dan, happily.
The next step was to do some hardware duplication and create a cluster capable of withstanding all manner of disaster. "The DBAs were practically salivating at the prospect," Dan recalled, but the numbers involved were large enough to invoke a "steady on, chaps" from management.
On the day in question, Dan received a routine trouble ticket. It looked like either an adapter card or Gigabit Interface Converter (GBIC) had died. No problem – he was already on site and there were plenty of spares in the data centre.
He asked a colleague to pop in a change request for a hotswap and headed into the computing sanctum to do the deed. He had just enough time to get it sorted before heading off for lunch.
It transpired it wasn't the GBIC at fault, but the adapter card.
Fine. He'd done this many times. It was a simple case of powering down the system, pulling it out on its rails, replacing the card and firing it back up. Simple.
Or not.
There was a slight wrinkle in the process, but not an unfamiliar one.
"You could reach ONE rail-lock easily from the front of the system," explained Dan. "Reaching both of them, however, you ended up hugging this massive and heavy box like a mother bear, flipping both locks and then starting the system back on its way into the rack with a little judicious hip pressure.
"Everyone on my team had done it dozens of times.
"This time, Murphy was watching and arranged for my belt buckle to occupy the exact same piece of the universe as the tiny, *unshielded* master power switch on the top panel at the front of the box...
"There was a click and microseconds later the pager on my belt went nuts as the only not-yet-redundant component of our entire business took a hard power outage."
Yes, Dan's lunch turned out to be very, very late that day. Still, the budget needed to add that last bit of redundancy arrived soon after.
Ever accidentally demonstrated just how stable (or not) your company's systems truly were? Or dropped some spectacles into the whirring blades of a PSU fan? Share your totally SFW tales of clothing or body parts causing IT chaos in an email to Who, Me? ®