Do they know that we know that they know that we can’t find the server they are looking at us from?


At times it’s not the big complex problems that make you rip your hair out; it’s the tiny little mistakes which, compounded by complexity, create a catastrophic failure on a magnificent scale and make you feel no better than a 14-year-old in his first computer class. Ah, the joy of monitoring.

So within the ExchangeDefender architecture, two of the most important objects are admin servers and scanning nodes. Admin servers hold the configuration, which gets replicated to the scanning nodes that scan mail. This seems like a very trivial thing: the admin node generates configuration that the scanning node pulls, loads, and executes. How can this possibly go wrong?

(don’t laugh yet, there are over 30 monitoring indicators for just the replication-sync piece alone)
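To make the shape of the problem concrete, here is a minimal sketch of what a pull-style config sync can look like; the admin URL, paths, and checksum scheme below are hypothetical stand-ins, not the actual ExchangeDefender code.

    # Minimal sketch of a pull-style config sync, assuming a hypothetical
    # admin node that publishes config.tar.gz plus an MD5 of it over HTTP.
    import hashlib
    import urllib.request

    ADMIN_URL = "http://admin.example.com/config"   # hypothetical admin node

    def pull_config():
        config = urllib.request.urlopen(ADMIN_URL + "/config.tar.gz", timeout=30).read()
        expected = urllib.request.urlopen(ADMIN_URL + "/config.md5", timeout=30).read().decode().strip()
        # Refuse to load a config that did not transfer intact.
        if hashlib.md5(config).hexdigest() != expected:
            raise RuntimeError("config checksum mismatch, keeping current config")
        with open("/var/spool/scanner/config.tar.gz", "wb") as f:
            f.write(config)
        # The real node would unpack and reload the scanner here.

    if __name__ == "__main__":
        pull_config()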

So the monitoring nodes check the admin nodes and scanning nodes directly, either by pulling data (a web load of the local statistics, SNMP) or by having data pushed to them (SQL logging of activity and events).
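A rough sketch of the pull side only, with made-up hostnames and the standard sysUpTime OID queried through the net-snmp snmpget tool; the push side is simply the nodes logging rows into a central SQL database.

    # Rough sketch of the pull-style checks, with hypothetical hosts.
    import subprocess
    import urllib.request

    NODE = "node1.example.com"  # hypothetical scanning node

    def check_web_stats(node):
        # Pull the node's local statistics page and make sure it answers at all.
        with urllib.request.urlopen(f"http://{node}/stats", timeout=10) as resp:
            return resp.status == 200

    def check_snmp(node):
        # sysUpTime via snmpget; a hung box usually fails or times out here.
        result = subprocess.run(
            ["snmpget", "-v2c", "-c", "public", node, "1.3.6.1.2.1.1.3.0"],
            capture_output=True, timeout=15)
        return result.returncode == 0

    if __name__ == "__main__":
        print("web:", check_web_stats(NODE), "snmp:", check_snmp(NODE))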

Now, if you don’t understand how each of these components interacts and you try to optimize them on what was probably 2 hours of sleep, you miss one bubble in a 3-mile flowchart, and the catastrophic failure doesn’t just make you look like a little ass, it makes you look like a complete idiot. I’m not talking about messages getting delayed by minutes or seconds, but by 5 days in one instance. (faceplant)

So here is what happened. The web and SNMP checks reported everything was fine. As did the remote loads of the node statistics. As did the SQL inserts, because they were hardcoded with the system IP address.

Where Woody messed up by not going straight to the police was when he optimized the DNS infrastructure to use the local caching server, while one of the scripts used to report failures relied on a domain name rather than a hardcoded IP address.

So what had happened was… the caching nameserver we use to sync blacklist data from node to node failed and hung. The node reporting engine went to report the failure, but the lookup for the hostname of the monitoring server failed because it relied on the very component that had failed. The process scanner found the nameserver process alive and well in the process list. The fact that it didn’t actually resolve any queries was apparently beside the point, and because the node scanner and monitor were both in agreement that the system was live with current config and software, everything appeared to be moving along… uhhh… kill me.
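The trap is easy to reproduce in a few lines. The hostname and monitoring address below are hypothetical, but the shape is the same: if reporting a resolver failure requires that same resolver, the alert dies with it, and the except branch shows the hardcoded fallback that was missing.

    # Sketch of the trap: the failure report needs a DNS lookup that goes
    # through the very caching nameserver that just hung.
    import socket

    MONITOR_HOST = "monitor.example.com"   # hypothetical monitoring server
    MONITOR_IP = "192.0.2.10"              # hypothetical hardcoded fallback

    def report_failure(message):
        try:
            # This resolves through the local caching server; if that server
            # is hung, this blocks, then fails, and the alert never goes out.
            addr = socket.gethostbyname(MONITOR_HOST)
        except OSError:
            # The hardcoded fallback that would have saved the day.
            addr = MONITOR_IP
        # Send the alert over whatever channel the reporting script uses.
        with socket.create_connection((addr, 514), timeout=10) as sock:
            sock.sendall(message.encode())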

So now the monitor of the monitor for the on-node/off-node monitoring process also checks name resolution, connectivity, and overall DNS health.
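The fix boils down to testing the nameserver functionally instead of trusting the process list. A sketch of that kind of check, firing an actual query at the local caching server with dig (the query name is arbitrary):

    # Functional DNS health check: don't ask whether the daemon is in the
    # process list, ask whether it actually answers a query.
    import subprocess

    def dns_is_healthy(server="127.0.0.1", name="www.google.com"):
        try:
            result = subprocess.run(
                ["dig", f"@{server}", name, "A", "+short", "+time=2", "+tries=1"],
                capture_output=True, text=True, timeout=10)
        except subprocess.TimeoutExpired:
            return False
        # Healthy means dig exited cleanly and actually returned an answer.
        return result.returncode == 0 and result.stdout.strip() != ""

    if __name__ == "__main__":
        if not dns_is_healthy():
            print("local caching nameserver is not resolving, raise the alarm")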

What really hurts is that the header of the code shows a commit by one vmazek. Ouch.