For the past month I have been embarking on a task of replacing myself with a shell script. This isn't as complicated as it seems, turns out I'm very predictable in terms of how I manage the business. The problem is that in the world where there are only 24 hours but I need to work 30 hours there is a need to hand off certain responsibilties to another monkey and hopefully with minimal training be able to delegate some of the responsibility without affecting the result (network up).
So about a month ago I set out to code the NOC management system for Own Web Now and man, did I underestimate just how much stuff happens transparently without really thinking or documenting it. For example, last night I spent nearly two hours writing the function to calculate ticket age. The function itself took about 10 minutes, the debug/test/what if took just shy of two hours. Why? Well, time zones. Leap years. SLA plans.
Here is the context. No client should be knocked out for more than 0.1% of the day — If the monitoring system notices a downed state it opens a ticket. So far so good. Now, the system calculates the time between the initial outage and service restore and if that period is less than 0.1% we're delivering on our SLA. Now let's say the system goes down at 11:58 PM but comes back at 12:04 AM. How is the outage logged? Uhh, it was just down Bob! But, the panel that shows the client flawless performance throughout the week will now have two orange fields instead of just one red one if we dropped the ball.
That makes me look bad even though they were only down for 6 minutes – maybe they were not even down, maybe it was just timing out on my monitor because we only do ping checks for this client. Hrm. Ok. Now, let me widen the range. If the time crosses two calendar days but we meet the SLA on the total outage I only count that as one down, not two separate incidents (as it appears in the overview panel). Ok, now calculate which day had a longer outage and stick the report into that calendar day. Woo! Oh, the field of when the outage happened? Ok, 11:58 PM. But wait, the other function calculates the time by subtracting the resolution time from the open time. Outage interval: -23:54? Ok, throw in another consistancy check.
So as a monkey I really only had the server out for 6 minutes in my head. On a logical evaluation and reporting mechanism I have 200ish lines of code to explain that to the computer. So who is smarter, the monkey or the computer? And people wonder why I keep on calling myself and my staff the monkey force. There was literally a moment last night / this morning where I felt like the gorilla at the opening of 2001 Space Odyssey, thank god there were no bones around me or that monolith would have been beaten. 🙂