I wrote about this a few months ago but we had to step back a bit and look at the picture a little bit differently. In just about 24 hours we’ll be shutting down the NOC, systems, networks, PDUs and effectively simulating what would happen to our network if Dallas disappeared from the face of the earth.
We get to do this twice, maybe three times a year if we’re lucky. You see, operating even the smallest of networks involves documentation nightmares out of this world. Even one server can change drastically over the course of one year. Not to mention new gear, new computers, new network equipment, new business processes. When you put those into action you usually do it quickly, do it on a best-effort basis, test it, check on it eventually and hope for the best or at least a good fallback. Despite our size, we do the same damn thing except with slightly more expensive gear (for example, my network cards probably cost more than your entire server, pardon the arrogance) – but we need to get it working – now – along with 50 other things. And about two to three times a year I try to take a breather and make sure we are doing it right.
Planning, projects, management… as Amy says, are not about the tools. They are about the approach, about the plan, and about the process. Karl, whose presentation I hope you guys go see if you’re in Redmond, is releasing a project workbook towards the end of the month. What he’ll undoubtedly tell you is that planning is easy. You’re not an idiot. You look at your facts, look at your options, you throw a bunch of shit at the wall and then organize it before it hits the floor. You’ve got a plan! Excellent. Easy. Now, execution… follow-through… evaluation… those are the details that make things either work or not. Those are the differences between a dead SBS server at 9 AM on a Monday morning and one that has been up and running for the last 5 months without a second of downtime.
Ever met someone who complained about how their computer or server was suddenly slow after change XYZ? You know my first question – slow as compared to what? How do you know? And 9 times out of 10 they don’t know; all they have is some anecdotal reference, which makes the problem impossible to diagnose. Because they never established a baseline. They never bothered to stop and evaluate the performance impact of the changes they made. That is the key to running a phenomenal IT company – that your baselines, your key performance indicators, your data and your intelligence on the network continue to evolve. I deal with a lot of folks who think that spending $20,000 on a tool will fix it and answer it – it never, ever does. And a year later they are stuck with a huge bill, no way out, a renewal fee – and the problem was in front of them, for free, all along. When you’re prepared for your failures you don’t need a tool to tell you about them – you can call them out before they happen. Quick, how many of you can predict total data loss on a server that has a single IDE hard drive? How do you know? Saw it happen – that’s how.
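To make the “slow as compared to what?” question concrete, here is a minimal sketch of a baseline check in Python. It is an illustration, not a description of any tool we actually run: the function names, the 3-standard-deviation tolerance and the sample numbers are all made up for the example, and it assumes you already have some way of collecting a measurement (say, response time in milliseconds) before and after a change.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize measurements taken BEFORE a change (hypothetical helper)."""
    return {"mean": mean(samples), "stdev": stdev(samples)}

def compare_to_baseline(baseline, samples, tolerance=3.0):
    """Compare new measurements to the baseline.

    Flags a regression when the new mean is higher than the baseline mean
    by more than `tolerance` baseline standard deviations. The threshold
    is an arbitrary assumption for this sketch; pick one that fits your data.
    """
    current = mean(samples)
    delta = current - baseline["mean"]
    threshold = tolerance * baseline["stdev"]
    return {
        "current_mean": current,
        "delta": delta,
        "regressed": delta > threshold,
    }

# Made-up response times in milliseconds, before and after "change XYZ".
before = [102.0, 98.0, 101.0, 99.0, 100.0]
after = [131.0, 128.0, 135.0, 130.0, 126.0]

baseline = build_baseline(before)
report = compare_to_baseline(baseline, after)
print(report)
```

With numbers like these the report flags a regression, and more importantly it tells you by how much. The point is not the arithmetic, it is that the “before” numbers exist at all.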
Here is how I justify these exercises. Pick a critical piece of your infrastructure. Go ahead, any piece. Server, switch, NAS. Pull the plug (if it’s dual, pull both). What happens next?
The answer to that question is what separates professionals from amateurs who can read the manual. I feel our job (and our goal) is to be able to prove we have an answer to that question. Not “we think”, not “we’re pretty sure”, not “I believe” – but “I know, here is why, and this is my evidence.”
You know that saying… been there, done that? I want my network to show a track record: we did this, this is what happened, and this is how we improved it. One of the biggest challenges of 2007 for Own Web Now Corp has been overcoming the stupid things that were in place and that we just couldn’t get over ourselves. Things have been changing, and I can see the results already.
I write this open letter as a CEO, as an IT guy, as your partner, as your vendor, as your client – you ought to expect me to fail. I expect to fail, I expect our decisions to eventually be proven wrong. That’s not pessimism, that’s just the worst-case scenario. And it’s not something I lose sleep over every night. What I lose sleep over every night is the things I think I know, the things I don’t expect the network to fail on, the things that I arrogantly believe we have solved – but where nobody bothered to update the baseline and validate it.
What is of absolute importance to your company? What do you do to establish, with absolute (and audited, confirmed) certainty, that what you’re doing has been tried and tested? When was the last time you established your timelines? When was the last time you looked at all your “quick fixes” and “get it to works” and really focused on delivering what you know in your heart you owe to your customers? That’s what we’re doing over the next 48 hours, and I hope it motivates you to do the same.