Tonight I wrap up the worst week of my professional life, bar none. I will gladly accept and live with my shortcomings in the educational, technical, business and other areas of my life; college prepared me for endless trials and failures. What I cannot accept is the sheer incompetence and the cascading layers of crap I’ve had to fight with this week. To sum it up in a picture:
I didn’t have a visor, I didn’t have an ExchangeDefender blade but man, this was an ugly week.
It all started with the Google buyout of Postini. Our sales exploded, our network had to double, quickly, and it all coincided with the v3.1 upgrade, which was far more than your average “cosmetic” touchup of the web interface. All is well now, which is why I can write about it, so here is the story:
Sunday, August 19th, 2007
Sunday was the day we decided to implement NDR failures for hosts that do not have a valid PTR record (reverse lookup). Simple thing; we had been testing it for a while and it showed great results. So we implemented it, and for a few hours all was good. The true mark of incompetence is truly underestimating just how dumb you are. You’ll see how and why in a minute.
Monday, August 20th, 2007
There are many ways to issue an NDR, and even more places and codes to handle it with. What did we pick? 4.7.1, a tempfail that stops the message at the MTA level (the SMTP server, basically). This isn’t really a reject; it’s a temporary non-delivery status issued during the SMTP conversation, and all well-behaved mail servers will attempt to deliver the message again later.
Roughly by mid-morning we started fielding a ton of calls about mail just not showing up. It’s not in our logs, it’s not in our system, it’s just vanishing. WTH? We eval the new rules: they check out. We eval the routing: all checks out. We run every component of ExchangeDefender in debug mode: all works.
Somewhere during the afternoon we finally look at the counters for DNS failures and realize.. man.. the reason things aren’t being seen is because we’re tempfailing them all. Wait, it gets MUCH worse.
Customers are livid at this point.
So we decide to do something brilliant, something so… genius… that we deserve a Darwin Award for it – we move the rules from the MTA to our Bayesian filtering. Sounds reasonable, right?
You generally want to handle these events at the MTA level because once the message is allowed into the system, the connection can be dropped and there is no temporary failure after that. You’re just dead in the water.
For the record, we had three failure codes, all 4.7.1 tempfails that would force the remote server to retry.
We used a temporary failure for the failure to look up the PTR record. We used a temporary failure for a forged PTR (the PTR does not match the A record, or the EHLO does not match the PTR). Finally, we had the failure handler that deferred messages if the IP lookup failed completely (no RDNS).
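A minimal sketch of those three checks, separated from the actual DNS lookups (the function name, sentinel values, and reply text are my assumptions, not the real ExchangeDefender code):

```python
def rdns_verdict(ptr_hostname, forward_ips, connecting_ip, helo_name):
    """Illustrative only. ptr_hostname is the result of the PTR lookup
    for connecting_ip: a hostname, None if no record exists, or the
    sentinel "SERVFAIL" if the lookup itself failed completely.
    forward_ips holds the A records returned when resolving the PTR
    hostname back to addresses."""
    # Failure 1: no PTR record -> 4.7.1 tempfail, sender retries later.
    if ptr_hostname is None:
        return "451 4.7.1 Could not resolve PTR record, deferring"
    # Failure 3: the reverse lookup failed entirely (no RDNS reachable).
    if ptr_hostname == "SERVFAIL":
        return "451 4.7.1 Reverse DNS lookup failed, deferring"
    # Failure 2: forged PTR -- the PTR does not resolve back to the
    # connecting IP, or the EHLO name does not match the PTR.
    if connecting_ip not in forward_ips or helo_name != ptr_hostname:
        return "451 4.7.1 PTR does not match A/EHLO, deferring"
    return "250 OK"
```

Because 451/4.7.1 is issued during the SMTP conversation, the sending server keeps the message in its queue and retries later; nothing is bounced outright.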
Monday Night, August 20th, 2007
Moved the RDNS checks from the MTA, where they should be, to the Bayesian scanner, where they shouldn’t be. The lack of logic at the time: at least if we accept the message but assign it a high enough SPAM score, nobody can complain about not receiving messages; we will just hold a hard stance on the fact that anyone without valid RDNS is a spammer. How could this go wrong, right?
This was the beginning of the end. We had rewritten the highly efficient MTA code into the ridiculously inefficient Bayesian filter, which basically did a crude pattern match against the message headers.
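The exact rule is long gone, but a sketch of that style of per-message header pattern match (the regex, score, and names are all assumptions) would look something like:

```python
import re

# Matches a Received header stamped for a host with no reverse DNS,
# e.g. "Received: from desktop1 (unknown [10.0.0.5])".
UNKNOWN_PTR = re.compile(r"^Received: from .*\(unknown \[[\d.]+\]\)",
                         re.IGNORECASE | re.MULTILINE)

def score_message(raw_headers, base_score=0.0):
    # The flaw: this scans EVERY Received header, including internal
    # hops stamped by the customer's own servers -- which never have
    # PTR records -- instead of only the external delivery hop.
    hits = len(UNKNOWN_PTR.findall(raw_headers))
    return base_score + 5.0 * hits
```

Run once per message instead of once per SMTP connection, and matching internal hops as well as external ones, this is both slower and more wrong than the MTA check it replaced.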
Tuesday, August 21st, 2007
Daily mail quarantine reports start to lag, but when we look into it, disk and database performance are running right where we expect them. So it’s not a system or disk bottleneck. What the heck is slowing it down so much?
Apparently far too many people use SMTP to route intranet mail to their servers from their desktops. The world does not revolve around Exchange. So each mail server stamps “Unknown: [18.104.22.168]” in the internal message headers – not the external ones. Even though the external interface had a valid PTR assigned to it, nobody keeps PTRs for internal clients. Result: 18x more false SPAM than usual.
But wait, it gets worse, much worse.
Having just been cried to by yet another account, I tell my guys to just yank the whole thing. I don’t want to hear about it, I don’t want to deal with it, just… enough. Start provisioning the new servers; we’ll quadruple the size of the network if we have to, but we are not bleeding away any more money.
Provisioning work starts, and new servers start spinning up all over Texas, California, Illinois and London. Life is good. Almost too good.
And it was. The new version of CentOS (base OS we use for ExchangeDefender) comes with the new version of Perl (5.8.8). The new version of Perl (5.8.8) is not compatible with the optimized DBI packages we have for database operations.
Just to sum up: At this point customers are cancelling, partners are ready to crucify me, our new servers aren’t spinning up. But there is so much more waiting that I am just not aware of… One of my old high school buddies used to say: Shit happens. Shit always happens. Shit is happening to me right now, I just am not aware of the specifics yet.
Wednesday, August 22nd, 2007
Causality is a wonderful thing. Until it goes against you.
RDNS caused us to tempfail a lot of legitimate mail, which caused customers not to receive mail. We then decided to accept the mail and instead just quarantine it. By doing so, however, we just transferred the problem from an invisible but frustrating place to a very visible but catastrophic place: now that the full volume of the PTR mess was flowing through our internal systems, it wrecked the auto-whitelist and IP reputation databases we maintain.
Result: Packed SPAM reports again. Kill me now. As my friend Rich says: Fuck me running. I am not sure how that applies, but I am sure it captures the degree of difficulty my day was about to acquire.
Wednesday afternoon, August 22nd, 2007
I’m out and about with Dave Sobel. We’re talking shop, and every now and then I glance at my PDA to watch my credibility and SLA slide into the ocean. Great. At about 3:40 I fire off this message to team@:
This ends tonight.
At about 9 PM after the dust settles and we have a game plan I fire this one off to our partners:
Dear ExchangeDefender Partners,
I am sad to have to write this message to you, but it has come to my attention that ExchangeDefender v3.1 has received perhaps the worst satisfaction of any product release in the history of my company. Considering that I have written some of these solutions from scratch, and that we are having issues at this stage in the product cycle, I am more than disappointed in all that has happened.
I have a stack of 84 printed tickets and emails of complaints and issues with ExchangeDefender v3.1, ranging from problems two weeks old to the ones we experienced with NDRs on Monday and the false positive ratios on Tuesday. I have assembled my senior team and every member of the ExchangeDefender team, I have taken my staff away from training, and we will work until the problems are resolved. You should expect 99% of functionality to be restored by 9 AM EST tomorrow. There are other changes being made to make sure this never happens again, but you will receive a separate communication regarding that.
We are on top of it, we are working on it and we will have it fixed. Expect another update at 9AM EST tomorrow morning.
I realize many of you put your reputation on the line with us and I regret that we have done anything to shake your customers confidence in our product and in your ability to deliver it. We have the entire company working to resolve the litany of annoying issues and we will have them resolved by the morning.
It’s Thursday now. The backend fixes and commits are flying all over the place. I am trying to keep on top of one thing after another, trying to spot all the problems and see what needs to be fixed. I take a little 10 minute power nap and at 9 AM start writing this:
My staff and I have worked overnight and here is a series of adjustments we have made to the backend, as well as the issues we noticed in many of the reported messages that were sent to us via the support portal. We were able to isolate a number of issues with the client base that identified problems, and we also uncovered a lot of problems with our infrastructure; this update is to let you know what we’ve done so far and what you can do.
We have resolved almost all of the backend infrastructure problems and are moving on to the GUI (web interface, email reporting) pieces of the code. Expect the next update at midnight, EST.
DNS Rejection Lists
In order to cope with the near endless number of misconfigured mail servers (mostly workstations) being used as botnets, we have been forced to only accept mail from valid mail servers on the Internet – those with reverse DNS entries. While we had to throttle it down quite a bit, the practice does help a lot in reducing the issues and the amount of float to deal with.
To be clear: We only reject messages from mail servers that do not have a reverse DNS (PTR) entry for the server the mail is coming from. We do not reject messages where the lookup failed (the remote DNS server for the zone is down when we run our query), we do not reject messages if the hostname that was returned is invalid, and we do not reject messages if the hostname is forged (mail coming from sbs.customer.com with a PTR record claiming it’s adsl-dynamic-22.214.171.124.lameisp.cn).
To be even more clear: This is not a blacklist – this is a network security policy. There is no whitelist, there is no “taking sender off the list” and there is no delay in it at all. As soon as they either start using their ISP’s smarthost as a relay server or get their ISP to issue a PTR record our systems will start relaying their mail.
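Reduced to a decision table, the narrowed policy rejects only the genuinely absent PTR and accepts everything else (the sentinel strings and function name are assumptions for illustration):

```python
def narrowed_policy(ptr_lookup_result, connecting_ip):
    """ptr_lookup_result is a hostname string, "NXDOMAIN" (no PTR record
    exists at all), or "SERVFAIL" (the lookup itself failed, e.g. the
    zone's DNS server is down)."""
    if ptr_lookup_result == "NXDOMAIN":
        # The ONLY rejected case -- and it is a permanent 5xx now, so
        # the sender gets an immediate, explicit bounce to act on.
        return ("553 5.3.0 Message rejected. Reason: (1) Could not "
                f"resolve PTR record for {connecting_ip}")
    # Failed lookups, invalid hostnames, and forged PTRs all pass:
    # only a truly absent PTR is refused under this policy.
    return "250 OK"
```

Note the shift from the 4.7.1 tempfail of Sunday’s rules to a permanent 553: the sender is told immediately and explicitly, instead of having mail silently sit in a retry queue.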
reject=553 5.3.0 Message rejected. The email you tried to contact does not accept mail from mail servers that are not configured properly, for more details see http://www.exchangedefender.com/help.asp and contact your ISP. Reason: (1) Could not resolve PTR record for x.x.x.x
If you drop to your command prompt and nslookup x.x.x.x, you will see why they are blocked: the query will fail.
We send the user to the following page with the clear instructions on what to do so you are not bothered with any troubleshooting of this.
It is worth noting that this practice has been around for years by major ISPs in USA – the fact that there are mail servers out there that still relay mail without DNS (any DNS) means they either just started or something changed in their network infrastructure recently and they are not aware of it yet. It is not your responsibility to fix remote networks.
New admin.exchangedefender.com mail Bridge
Lots and lots of issues with the ExchangeDefender daily and intraday reports. We are currently rewriting them and will have them implemented and closely monitored by midnight EST.
In the meantime, we have put in a new network “bridge” between us and the end users’ mail servers. When a report is generated it will not go through the ExchangeDefender scanning network; it will go directly to the user. This will simplify any troubleshooting and make the reporting more timely and efficient, even though we believe the issues are on this end.
As mentioned, we are rewriting this piece. Tonight you will get the ability to “resend” the SPAM report (last 24 hours) on demand as the Service Provider, Domain administrator and end user themselves. There will also be a global policy of not sending empty reports to users though we strongly urge you NOT to use it because when people expect reports and see them every day you will receive a phone call the first time that does not happen.
We also recommend loosening the reliance on the SPAM reports. Put a shortcut on the end user’s desktop to allow them to access the quarantine at any time if they have questions or are looking for mail. Releasing messages through the web site is far easier and more effective because you can release multiple messages at once. To build an IE shortcut in XP and Vista, right click on the desktop, select New, Shortcut and type in this URL
Replace theirusername and theirpassword with their data and save. They will be logged into your portal and given your branding.
More to follow on this later.
What you can do
It is important to have all the IP ranges we relay mail from in the access/relay list. If you do not have us in the relay list, we cannot deliver messages to that server in a timely fashion (messages that fail connection at one DC are streamed to another DC and then delivered that way, adding at least 30 minutes of latency). Please add and use at least the following IP address ranges:
126.96.36.199/24 (255.255.255.0 netmask; 188.8.131.52-255)
184.108.40.206/24 (255.255.255.0 netmask; 220.127.116.11-255)
18.104.22.168/24 (255.255.255.0 netmask; 22.214.171.124-255)
126.96.36.199/24 (255.255.255.0 netmask; 188.8.131.52-255)
184.108.40.206/24 (255.255.255.0 netmask; 220.127.116.11-255)
If you have end user complaints about untimely delivery, unreliable delivery or messages just “vanishing” it is related to the IP ranges not being allowed to properly connect to your server. Either program in the above IP restrictions or do not enforce IP restrictions at all (bad idea).
Configure and enforce your SPF record in your DNS zone. If the remote users are complaining that they are not receiving mail from your users it is likely that the messages are being stripped due to SPAM policies in place. One way to reduce this risk is to define and use an SPF record. You can use the template below and add in your own A records that relay mail on your behalf.
“v=spf1 a mx ip4:18.104.22.168/24 ip4:22.214.171.124/24 ip4:126.96.36.199/24 ip4:188.8.131.52/24 -all”
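How a receiving server evaluates the ip4 mechanisms in a record like that can be sketched with the standard library (the ranges below are RFC 5737 documentation addresses standing in for the real relay ranges; real SPF evaluation also resolves the a and mx mechanisms via DNS):

```python
import ipaddress

# Stand-in ip4: ranges; substitute the actual relay ranges from your record.
SPF_IP4_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def spf_ip4_result(sender_ip):
    """Check a connecting IP against the explicit ip4 mechanisms only."""
    ip = ipaddress.ip_address(sender_ip)
    if any(ip in net for net in SPF_IP4_RANGES):
        return "pass"
    # The record ends in -all, so anything not matched hard-fails and
    # the receiver is entitled to reject or strip the message.
    return "fail"
```

The -all at the end is what gives the record teeth: without it, a forged sender merely gets a softfail or neutral result that most SPAM policies ignore.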
Finally, I cannot begin to apologize enough for the issues this has caused some of you. All I can do is fix it, as fast as possible; my team is on it and we will not let the weekend start until all the issues are resolved. We have so far resolved the most critical pieces that would cause the end users to contact you (issues of non-delivery, mail vanishing in the cloud, undefined rejections, missing reports, etc). We have closed 60 out of the 84 issues I originally mentioned, and we aim to reduce that to a single digit by the next time you hear from me (midnight).
Please stand by.
The backend at this point works fine. The network works fine. The interface… blows. So we tackle that one next. The interface gets completely rewritten from scratch. I start my last minute desperation calls with folks all over the world, trying to replicate the issues, catch the errors, trap them and get the fixes in. I call Mark Taylor from lunch and ask him to help me with one of the issues. He hops onto IM, calls me on my UK number, walks me through the issue… first glimmer of hope: the bug he hit had already been closed. On to the fun part.
Friday, August 24th, 2007
At this point days and nights are blurring together. Going into the third all-nighter is not for the squeamish, especially when working under stressful conditions in a home office with no more Mountain Dew. Thankfully, though, things are rocking and rolling now and everything works.
I spend most of the day in phone calls, load testing and usability analysis. We rename half the stuff to make the interface look and feel a little more consistent, little less confusing. Finally, a week later, the system is back where it ought to be, kicking mail around.
Time for a little dose of humility:
Dear Partners and Clients,
We have addressed all the major issues related to ExchangeDefender that have been present in an isolated fashion across our systems. Everything from service delays, branding issues, missing emails, connections and interface work has been addressed. We have rewritten major sections of the ExchangeDefender frontend and have integrated v3.1 features in more areas than originally planned. The system is pretty much bullet-proof at the moment and everything is working as advertised.
My team has worked tirelessly over the past 3 days to address the issues and settings that have been in the system for years. We have also made changes to make sure this type of an event never occurs again and that we give you more and more control and insight into the system and how it operates.
Everything at ExchangeDefender currently works as advertised. We are taking a little breather today and will resume the development tomorrow and you will receive a new guide to ExchangeDefender v3.1 that explains all the features and how to use them.
I hope that we are fortunate enough to win back your business. We have learned a lot during the last week and a half and have changed our development methodology to eliminate future “point” releases and events such as the ones leading up to this. We took a risk rewriting major sections of ExchangeDefender that had been running truly ancient code I personally wrote nearly 8 years ago. I felt it would not be right to ask you to consider working with us in the future if everything was not perfect.
I believe we are there, I believe the feature set, price, scalability and the partnership support we offer is unique and I hope you consider working with us in the future. You will get a new guide over the weekend, details of all changes will be published at http://www.ownwebnow.com/blog and I just wanted to thank the many of you that have taken the time to help us troubleshoot these issues, get them into our feature and bug portal, work with me directly on replicating the issue and finally getting it all to where it needs to be. I am truly thankful for that.
Again, thank you for your patience and for your business.
Blog a fork in it Jr, we’re done.
After a nice dinner with some friends I get to get this off my chest and figure out just how and why things went so horribly wrong so quickly. It is not that we didn’t test these changes. It is not that we didn’t know the end result of these changes. It is not that we did not appreciate the complexity of what we’re dealing with – this isn’t some monkey banging at a set of checkboxes, these are very complex distributed systems with very complex management. So where did we fail?
We failed at step #2. When things fail, you roll back. This is where the roles of developer and sysadmin cross in a very bad way. Developer troubleshoots, fixes, and retries. Sysadmin rolls back, triages on a separate system and tries to replicate the problem until it can be eliminated.
Was it an emotional response? Probably. You don’t want people to be unhappy with your product. You want to react quickly and “just make it work” but “just making it work” tends to “just break everything else that touches it”. I call this the reverse-Midas effect, everything we touched turned to shi..
If there is anything I did right in this whole situation, it is the communication. Things go wrong. Most people expect things to go wrong. But when they do go wrong you have to admit it, apologize, and get to fixing right away. Most importantly, everyone else that’s being yelled at needs to be informed too. I tried as hard as I could to make sure everyone was up on it. Here is an email I got a few minutes ago:
Vlad and team, I cant express my sincere thanks for taking the issues addressed seriously, you have not turned me away as a client but rather made me a stronger partner in your business. All businesses encounter problems it is how they handle the problems that sets them apart from their peers, in my believe your commitment to resolution and communication is 2nd to none.
My job as a CEO is to shield my people from criticism, motivate them to do better, communicate clearly what’s going on with our partners and clients, and finally put the resources together to make everyone successful. I’m the f’n firewall. It’s not easy working 3 days straight. It’s not easy swallowing your pride and saying “Shit, we messed up.” It’s not easy watching things fall apart, one after another, on something you have built from scratch. It’s not easy; as Dana Epp says: if it were easy everyone would be doing it.
So yes, this was the worst professional week of my life… our product and job is not easy, but, I cannot imagine doing anything else. To my team, to ExchangeDefender, to Microsoft developers that get the third degree like this every day, to all of our partners and customers that stuck with us and helped us all get to this Friday: This Bud’s for you.