I need a confidence patch

Security

Earlier today I posted a question on a mailing list to find out how other IT Solution Providers are dealing with the increasingly unreliable and costly Microsoft security patches.

Please don’t turn this into a security issue because it’s a business question:
 
I am depressed with Microsoft patching to the point that I might have to drop my SLA against all Windows-based servers at Own Web Now. Even on a day when the patch does not cause any problems at all, the reboots don’t happen as they should. Vanilla configurations just do not start all services. Make up the weirdest thing you can get a Windows server to do and we’ve seen it. Remember that this is on a good day, not on a bad day when the security patch locks out Blackberries one month, Macintosh the next, and crashes Dell boxes the month after that.
 
I am considering dropping all Windows servers into an automatic 8-hour maintenance cycle on Microsoft patch day to compensate for Microsoft’s lack of QA. We can no longer minimize issues through testing because even identical boxes (hardware and software, remember we virtualize the crap out of things) are not behaving the same. Reboots before the patch are fine, reboots after the patch... poof.
 
How is everyone else handling this? Drop the SLA? Lower confidence in Microsoft (who does that help?) Extended maintenance cycle?
 
Second Tuesday of the month is becoming a religious holiday at Vladville…

The Process

Our process and our ingredients are pretty simple. We do a flash backup every Tuesday afternoon (EST). Those backups are generally complete by 10PM. We do a flash reboot just to make sure there are no hardware/software issues. We proceed with the patches that passed quality control / quality analysis earlier that day. We push using a collection of tools, WSUS and other bits and pieces. Other bits and pieces are used instead of WSUS when we want to apply hotfixes without a reboot to critical infrastructure systems.

Either way, pretty standard stuff. Most Windows servers run a similar configuration (actually, most are identical in both software and hardware as they are mostly Virtual Server systems) so there is little reason to expect one to work while the others fail. 
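The ordering of that patch night can be sketched in a few lines. This is only an illustration under stated assumptions, not the actual tooling: the step names and the stub functions are hypothetical, and real steps would call whatever backup and patching tools are in use. The point of the reboot before patching is that a failure there cannot be blamed on the patch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A step is a name plus a function that takes a server name and
# returns True on success. Both are hypothetical placeholders.
Step = Tuple[str, Callable[[str], bool]]


@dataclass
class PatchRun:
    server: str
    completed: List[str] = field(default_factory=list)
    failed_step: str = ""

    @property
    def ok(self) -> bool:
        return self.failed_step == ""


def run_patch_night(server: str, steps: List[Step]) -> PatchRun:
    """Run the steps in order and stop at the first one that fails."""
    run = PatchRun(server)
    for name, action in steps:
        if not action(server):
            run.failed_step = name
            break
        run.completed.append(name)
    return run


if __name__ == "__main__":
    # Stubs standing in for real tools; every one of them "succeeds" here.
    steps: List[Step] = [
        ("backup", lambda s: True),
        ("pre_patch_reboot", lambda s: True),
        ("apply_approved_patches", lambda s: True),
        ("post_patch_reboot", lambda s: True),
    ]
    result = run_patch_night("example-server", steps)
    print(result.server, "ok" if result.ok else "failed at " + result.failed_step)
```

If the run stops at `pre_patch_reboot`, the box had a problem before any patch touched it; if it stops at `post_patch_reboot`, the patch is the prime suspect.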

The Costs

Do not let Microsoft WSUS and “Secure by Default, Design, Description…” fool you: patching is expensive, very expensive. There is no alternative to patching; we have to do it. With critical updates, we have to do it ASAP. No complaints there though, it’s just a part of business.

My complaint is with the unplanned costs related to patching. Costs that I and my customers have to pay because Microsoft produces unreliable and unstable patches. Let me explain what my definition of that is: “If a patch causes unexpected downtime or adversely impacts my system performance I do not consider it to be stable or reliable.” Simple as that. A patch is supposed to close a security hole in the software without affecting the rest of the system.

This is no longer the case. A few months ago a Microsoft patch knocked out Macintosh systems (Entourage) from connecting to Exchange. The month after that, a patch stopped Blackberry from operating properly. You may remember my post about the Dell problem.

My actual complaint is that I am on the verge of losing confidence in Microsoft’s ability to reliably and predictably patch the problems in their software. It is costing me a small fortune both financially and in terms of reputation. If I cannot stand behind my SLA (Service Level Agreement), which states just how often the server will be up, then what value am I providing? If I am put in the position of having to apologize for things that are not my fault to begin with, where does that put my reputation with my customers? Forget about the cost of overtime for employees, support calls, graveyard shifts, and the near cottage industry built around the patching tools, preparation process, reporting and follow-up just to make sure that the software we paid for continues to behave the way it was sold to us.
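For a sense of scale, here is a minimal sketch of the SLA arithmetic. The uptime targets (99.9% and 99.99%) and the 30-day month are illustrative assumptions, not the actual SLA terms; the 8-hour window is the maintenance cycle considered in the mailing list post above.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def downtime_budget_minutes(uptime_percent: float,
                            period_hours: float = HOURS_PER_MONTH) -> float:
    """Minutes of downtime allowed per period at a given uptime target."""
    return period_hours * 60 * (1 - uptime_percent / 100)


def uptime_with_window(window_hours: float,
                       period_hours: float = HOURS_PER_MONTH) -> float:
    """Best-case uptime percent if a window of this length is fully used."""
    return 100 * (1 - window_hours / period_hours)


if __name__ == "__main__":
    for target in (99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% uptime allows {budget:.2f} minutes down per month")
    print(f"An 8-hour window caps uptime at {uptime_with_window(8):.2f}%")
```

Under these assumptions a 99.9% target leaves about 43 minutes a month and a 99.99% target about 4.3 minutes, so a single reboot that hangs can use up the whole budget. An 8-hour window that is fully used caps a month at just under 98.9%.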

Forget about me

Now this is simply a blog post that will change… nothing. But it is an opportunity to review your SLA and consider how you deal with unreliable partners whose products and services you are supporting. I am on the verge of having to rewrite my SLA to put Microsoft patches into a maintenance cycle without any assurance on the time period. Here is one of the intriguing answers I got:

Vlad, we ran into the same issues as we started to scale and eventually had to build a lab for testing where, once approved, the patches would be put on our corporate network and when approved, we would roll them out to the clients. To resolve the reboot problems we put in “lights out” cards in all our servers. I agree it is not for the faint of heart.

Anyhow, something to consider…

14 Responses to I need a confidence patch

  1. WorkingHard says:

    Hi Vlad,
    We’ve been at this game for a long time and it is just a fact of IT life, and not only with Microsoft/IE or Dell but with any and all software/hardware vendors (mandatory upgrades to stay supported by HP that then wreck your ESL library storage controllers, which is fixed by “wishful” HP patching 7 months later, or that clean the configuration out of your RAID controllers, to name a few) … any vendor, any app, any OS … get big enough and complex enough … and you’re in the grinder. Just make sure you run along at your pace … and the time to do that is bought by defense in depth.
    It’s a good life and “easy” to be a great services/infrastructure provider when all goes well … but if all went well all the time they wouldn’t need us as much.
    By the way, what happened? You had a couple of hard knocks lately on that front but imho nothing has changed much over the last years except for one thing … the time you have to respond dropped drastically (hence defense in depth = buy time) and the growing focus on application exploits.
    Perhaps you’re being confronted more than ever with the issues of growing to a considerable scale lately?
    Do we have a lab? Yes, a couple of old systems running as our test environment. Anything else? Yes, small, less critical production systems that we image every night and can have up again in 2 hours … are patched first, and after some time with no issues we approve and use WSUS, scripts or manual patching for PCs and servers. What gives us that time: firewalls, packet inspection, anti-malware, reducing attack surfaces,…
    These are just a couple of thoughts on how we deal with this. Is it hard work? Yes. Is it bad, yup. Is it an MS problem? Yes. Is it an [INSERT VENDOR SOFTWARE/HARDWARE HERE]? Yes. The amount of software/hardware being produced at such a fast pace and that is so feature rich and so complex gives rise to so many FUBAR/SNAFU opportunities that I’m sometimes amazed that it all just keeps working anyway.
    One thing is for sure. When the shit hits the fan, you’re the man to have around. What makes people good? Solving loads of issues, i.e. experience … and that makes you a valuable service provider. Because issues will always remain; they might change in nature but they will never go away. On the ILO/DRAC statement: I’ll second that!

    Cheers

  2. indy says:

    Simple solution:
    Either run an all-Microsoft solution (so that you emulate their testing facility and won’t get burned by a patch) or don’t run an all-Microsoft solution. 3rd party is bad. Too complex.

  3. Adam says:

    I will disagree with indy that 3rd party is bad. In patching, the best solutions come from third parties like Shavlik.

  4. richwalkup says:

    @WorkingHard

    I seriously think the point that Vlad is making is a valid one. Microsoft has been releasing patches so haphazardly lately that it’s very hard to not point the finger (you know the one) at them for some of these severe failures.

    You have to step back from your SBS pedestal and ask yourself what do small businesses do when they rely on Microsoft to keep them stable, yet they take them to their knees on a regular basis? Remember now that most small businesses do not employ gurus to manage their infrastructure and I would estimate that half of the ones that do hire specialists unwittingly hire incompetent dolts. In effect, Microsoft is forcing small businesses (and large ones too) to invest in support specialists like Vlad who can actually recover major failures with the least possible downtime. Is it good for your market space? Yes. Is it ethical? Hell no.

    When I hear concerns like the ones Vlad has expressed here being dismissed by tech savvy people as “it’s just like that in IT”, it’s quite bothersome. If Microsoft wants to hand out free Beta software to the world and rush sh*t to market, then that’s how it should be done but when most of the world relies on their core software to run and build their businesses daily, there should be more liability placed on their shoulders for software that users have to pay for. I’m not a hater and I use MS in almost all my development efforts, but there has to be a measure of accountability in someone who is supposed to be your “trusted partner”.

  5. indy says:

    Adam

    I’m not talking about patching solutions, I’m talking about applications and compatibility. Microsoft tests these patches heavily. They are more likely to test their own products well before a 3rd party’s. You are just that much less likely to encounter difficulties when you stay the MS way.

  6. WorkingHard says:

    Hello richwalkup,
    No disrespect was ever intended. Vlad’s point is indeed valid. I’m not dismissing it, I’m not saying it should not improve or that it can’t improve. It should improve. But it is a fact of any profession that you will always have issues, concerns and problems that are sometimes maddening enough to bang your head against the wall, and perhaps 5 years from now they will be quite different from the ones we’re having now. That’s not the same as saying “it’s just like that”. I do know that it does not help to bang my head against the wall (unless for some reason I want a headache) so I try to do the best I can knowing I will always have shit to deal with, just like a trauma surgeon knows that no matter what he will always have casualties in his ward that could have been prevented. He makes good money from being a surgeon. That’s not unethical, he does not cause the casualties.
    I don’t think any software firm wants to produce shitty software, yet they all often do. Do they feel proud of or happy with that? I think they want to avoid it but probably often fail in the process just as much as I often fail at things I really wanted to get done better than I did. What keeps me going is the small and big successes I have.
    SBS 2003 on a pedestal? 🙂 I’m balancing between running 100 server environments, SAN storage, tape libs (huge and small ones), SQL Server, Exchange, GPS networks, coding, coaching a team … and I still help my helpdesk people out with “silly issues” or fix 2 server big “infrastructures”. So please do not take this as an attempt to minimize Vlad’s valid concerns or as an “SBS 2003 is untouchable” rant. We all feel his frustration at given times. All I’m saying is basically that frustration is all over the place in the IT world.
    As far as partnerships go, I put trust in persons I meet, work and fight fires with. People in suits who want to be my “strategic partners” … uhm no, I don’t trust them for one second, no matter what company they are from. Once they have your money that’s it. Suddenly you’re an annoying customer … not a partner … the “service oriented business” model is not as great as it is often portrayed or implemented.

    Cheers

  7. Jack Francis says:

    I agree, Microsoft SUCKS when it comes to quality updates. The one I could not believe was 898060 (MS05-019) when they broke the IP stack with a post SP4 Win2k security update and Win2k3 SP1. If you were running Outlook 2k and connecting to an Exchange server with this patch installed over a VPN connection, presto, no more connectivity. When I worked at Microsoft I saw this fix so many weird networking issues that it was the first thing I looked for, for a long while. I lost all faith in Microsoft updates after that. The problem with this update was that somehow some old NT 4 source code wound up in the patch. When it comes to updates there is no QA, they must have outsourced that to India as well. I am starting to think that Linux may wind up being the answer long term. As far as I know they never re-released SP1 with a corrected version of TCPIP.sys and the company I work for now may have this problem all over the place; they are terrified to patch, so we don’t unless we rebuild a box.

    So sad..

  8. Scott says:

    I’ve turned off all automatic updates.
    Choose detect only and I will approve the updates a week after Patch Tuesday.
    My test lab is our own production SBS R2 box and the rest of the community. I’ve seen two similar boxes, one patches and the other craps, so doing a test of one or two test boxes isn’t enough, IMHO. The world is my lab, I am sure I will hear about it. Last two months I’ve had issues, I hear about it hours after I encounter the patch issue. Next time I will wait.
    Scott

  9. richwalkup says:

    @WorkingHard

    I meant no disrespect either – and when I said get down from the SBS pedestal, I meant this community in general. It’s easy for us technical people to dismiss these issues as commonplace, but it’s also wrong. Unlike many others, I still don’t think Linux is a viable solution for most businesses and when the largest software development company in the world starts down this road as a matter of general practice, we’re all in trouble. And my general point was that we as technical people can fix these issues but Microsoft’s target audience (SMB) is the one who really gets the shaft. We endure headaches and sleepless nights to keep our customers happy but the ones who can’t run their business effectively for days and weeks on end are the ones that pay dearly. To that end, I think it is highly unethical for MS to release these patches the way they do. People who try to do the right thing and stay patched against malicious code are subjected to inadvertent malicious code brought to them by the makers of their OS on a consistent basis.

  10. Andy says:

    I know what you mean – I’ve had several servers for different customers take over 45 minutes to reboot after updating patches the past couple of months. A lot of the time I’ve had to use shutdown.exe to force a reboot after the reboots from the update program didn’t finish the shutdown process completely (but enough to stop services like Exchange, IIS, RDP etc).
    I really can’t see how people can publish 99.99% uptime SLAs unless they don’t patch at all (and just rely on servers being behind a firewall and, in the case of web servers, being the ONLY machine on that network so nothing else can hit it from behind the firewall).

  11. Pingback: E-Bitz - SBS MVP the Official Blog of the SBS "Diva" : The Reboot problem

  12. indy says:

    http://support.microsoft.com/kb/925308/

    Corruption possible, and Microsoft wants you to wait it out.

    Whoops.

  13. Nick says:

    I think remote access cards (ILO, DRAC, etc.) are great – maybe even a necessity today. But that doesn’t really address the root of the problem. We’ve developed an internal Patch KB for building out intelligence around new patches. All of our netadmins have their own individual profiles which they’ve used to virtually build-out the servers they support, and then they receive customized reports for their supported configurations. Reports include risk-assessments on individual patches, with risk status varying based on their server profiles.

    I have a post linked below which describes it in more detail. It’s worked for us – especially in terms of knowledge sharing. While it doesn’t address the root cause, it helps reduce pain-points for our customers.

    http://addicted-to-it.blogspot.com/2006/09/patching-risk-evaluation-of-patching.html

  14. Bob says:

    When I install server updates remotely, I always, always use tsshutdn to reboot the server. I haven’t had a problem since I started doing that.
