There is no such thing as uptime…


Some things are just depressing. One of the most depressing conversations I have ever witnessed was between Homer Simpson and Bart Simpson. Bart is complaining about how, after a series of embarrassing events, this is the worst day of his life. Homer responds:

“It’s the worst day of your life, so far!”

The most depressing thing I’ve ever heard, though, has to be credited to Michael Savage (racist homophobe / conservative talk show host):

“There is no such thing as happiness, there are just brief moments of joy.”

One of the things you get to live with, if you want a career in IT, is a constant and perpetual state of contentment with uptime, followed by the raging depression of having to balance a few thousand things so it doesn’t all explode. And then it explodes. Son of a #@%#@.

What had happened was…

We have been talking about, working on, and planning a large-scale upgrade to OWN’s web hosting infrastructure. My last professional job, prior to starting OWN, was writing control panel software for a web hosting operation. Back then I had a few choices: use Webalizer, which has continued to blow consistently through the years; spend money I didn’t have on software I couldn’t justify financially… or write my own!

Over the years the software had been hacked, secured, patched and rewritten in spurts. The nice thing about it was the overall reliability; look at the uptime on the web hosting sister cluster:

09:13:16 up 517 days, 16:28,  3 users,  load average: 0.51, 0.35, 0.25

Look at the uptime on Shockey Monkey:

09:09:43 up 531 days,  7:29,  1 user,  load average: 0.00, 0.00, 0.00
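For anyone who hasn’t stared at these lines much: the three load-average figures are 1-, 5-, and 15-minute averages of runnable processes. A minimal sketch of pulling them out of an `uptime` line (a hypothetical helper for illustration, not part of any OWN tooling):

```python
# Parse the tail of a `uptime` line into its three load averages.
# Hypothetical helper for illustration; not part of any OWN tooling.
def load_averages(uptime_line):
    # Everything after "load average:" is three comma-separated values:
    # the 1-, 5-, and 15-minute averages.
    _, _, tail = uptime_line.partition("load average:")
    return [float(value) for value in tail.split(",")]

line = "09:09:43 up 531 days,  7:29,  1 user,  load average: 0.00, 0.00, 0.00"
print(load_averages(line))  # [0.0, 0.0, 0.0]
```

A box that has been up 531 days with load averages pinned at zero is, in other words, doing exactly what you want: nothing interesting.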

Looks similar, eh? Yep, if you scroll through the blog you can actually pinpoint when we did a massive power strip upgrade at OWN, because we couldn’t reboot the f’n Microsoft servers after a patch. 🙂

Over the years the infrastructure running OWN hosting had aged. New, better, more redundant ways of running massive hosting deployments (better than active-active clusters) came around. The old stuff, to its credit, was rock solid, reliable, and didn’t bother anybody. But the patch-around-the-big-elephant process was simply no longer possible.

On Friday, our RAID configuration drank some moonshine and blew up the journal. The process of rebuilding the journal would have taken about 8 hours. That is also the interval we had imagined would be required to move to the new “Web 2.0” platform for OWN. Since the new system had been in sync for the past few months (so we could adequately test it), the switch was rather easy. The support systems, on the other hand…


I looked around and said: This is the worst day of my life, so far. Looking back, this would have been a great time to just move some drives to the few new systems on the rack, reboot, go out to lunch, order every drink on the menu and float away from my problems.

I did something far dumber. I decided to face the 18-year-old Vlad who sucked at programming, along with the 20-year-old Vlad who obviously sucked at change management and the 23-year-old Vlad who documented changes… well… like a 23-year-old 🙂 The stuff that I recall took me two weeks to write originally was now rewritten in less than 45 minutes. Most of that time went to the “WTF???” moments, trying to decipher why some things were done the way they were.

Life in IT…

It is remarkable how much time, how many man hours, and how much knowledge and experience go into keeping stuff together. I don’t think explaining the complexity, beyond marketing terms, would ever leave anyone at ease about what they are trusting to run their business on. While the interfaces and controls have become so user friendly over time that this stuff appears easy, keeping the back end at 99.999% has evolved into something impossible to maintain without an army of highly trained folks who can support a decade of patches, changes, processes and needs. This is why most of your midmarket and enterprise systems, which can’t afford demolish-and-rebuild IT, run on dinosaur-like processes and technology. People like the uptime and reliability, and that comes at an expense.
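To put “99.999%” in perspective, the arithmetic is brutal: five nines leaves you roughly five minutes of downtime for an entire year. A quick back-of-the-envelope sketch:

```python
# Allowed downtime per year for a given availability percentage.
def downtime_minutes_per_year(availability_pct):
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes in a (non-leap) year
    return minutes_per_year * (1 - availability_pct / 100)

for nines in (99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_minutes_per_year(nines):.2f} minutes/year")
# 99.9%   -> about 525 minutes (almost 9 hours)
# 99.99%  -> about 53 minutes
# 99.999% -> about 5.3 minutes
```

One bad reboot, never mind a blown RAID journal, eats the whole year’s budget.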

Had this occurred in the SMB world (and we are talking high six figures of hardware), not only would all the data be gone, it wouldn’t be rebuilt for weeks. When the gamble of good enough holds back the technology needed to keep things reliable and secure, it’s only a matter of time until you painfully learn that you are an idiot.

Personally, my idiot day sucked and stretched to suck up most of the weekend. It forced my hand, moving a project up nearly 4 months ahead of schedule, without warning, and likely leaving a lot of people violently upset 🙁

But here is the beauty of the thing: only a dozen people noticed anything went wrong at all, and none of them had to deal with the technical plan shifting 4 months forward. So back to living the American dream: can’t someone else deal with it?

For what it’s worth, I’m not sure what I’d do with myself if my life were any less exciting.