Sometimes it’s not the big, complex problems that make you rip your hair out. It’s the tiny mistakes that, compounded by complexity, create a catastrophic failure on a magnificent scale and make you feel no better than a 14-year-old in his first computer class. Ah, the joy of monitoring.
So within the ExchangeDefender architecture, two of the most important objects are admin servers and scanning nodes. Admin servers hold the configuration, which gets replicated to the scanning nodes, which scan mail. This seems trivial: the admin node generates configuration that the scanning node pulls, loads and executes. How can this possibly go wrong?
(Don’t laugh yet: there are over 30 monitoring indicators for the replication-sync piece alone.)
So the monitoring nodes scan the admin nodes and scanning nodes directly, either by pulling data (web loads of the local statistics, SNMP) or by having data pushed to them (SQL logging of activity and events).
Now, if you don’t understand how each of these components interacts, and you try to optimize them on what was probably 2 hours of sleep, you miss one bubble in a 3-mile flowchart. And the resulting catastrophic failure doesn’t just make you look a little foolish, it makes you look like a complete idiot. I’m not talking about messages getting delayed by seconds or minutes, but by 5 days in one instance. (faceplant)
So here is what happened. The web checks and SNMP reported everything was fine. So did the remote loads of the node statistics. So did the SQL inserts, because they were hardcoded with the system IP address.
Where Woody messed up by not going straight to the police was when he optimized the DNS infrastructure to use the local caching server, while one of the scripts used to report failures relied on a domain name and not a hardcoded IP address.
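To make that dependency concrete, here is a minimal sketch of a reporting script that tries DNS first and falls back to a hardcoded IP when the lookup fails. This is not the actual ExchangeDefender code: the hostname, the fallback address and the `monitor_address` function are all made up for illustration.

```python
import socket

# Hypothetical values for illustration only; not from the real system.
MONITOR_HOSTNAME = "monitor.example.invalid"
MONITOR_FALLBACK_IP = "192.0.2.10"  # address from the documentation range


def monitor_address(hostname=MONITOR_HOSTNAME,
                    fallback_ip=MONITOR_FALLBACK_IP,
                    resolver=socket.gethostbyname):
    """Return (ip, used_fallback) for the monitoring server.

    Try DNS first. If the lookup fails, fall back to the hardcoded IP,
    so a dead resolver cannot silence the failure report itself.
    """
    try:
        return resolver(hostname), False
    except OSError:  # socket.gaierror is a subclass of OSError
        return fallback_ip, True


if __name__ == "__main__":
    ip, used_fallback = monitor_address()
    print(f"reporting to {ip} (fallback used: {used_fallback})")
```

One caveat: this only covers a lookup that fails with an error. A resolver that hangs can keep `socket.gethostbyname` blocked for a long time, so a real script would probably also want to bound the lookup time.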
So what had happened was… the caching nameserver we use for syncing blacklist data from node to node failed and hung. The node reporting engine went to report the failure, but the lookup for the hostname of the monitoring server failed because it relied on the very component that had failed. The process scanner found the nameserver process alive and working in the process list. The fact that it didn’t actually resolve any queries was apparently beside the point. But because the node scanner and the monitor both agreed that the system was live, with current config and software, and that everything was moving along…. uhhh… kill me.
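The gap between "the process is in the process list" and "the process answers queries" can be sketched roughly like this. It is an illustration under assumptions, not our actual checker: `dns_resolves`, `resolver_health`, the probe name and the status strings are all hypothetical.

```python
import socket
import threading


def dns_resolves(name, timeout=3.0, resolver=socket.gethostbyname):
    """Return True only if `name` resolves within `timeout` seconds."""
    result = {}

    def worker():
        try:
            result["ip"] = resolver(name)
        except OSError:
            pass  # a failed lookup counts as "did not resolve"

    # Daemon thread: a hung lookup cannot block the checker or its exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    return "ip" in result


def resolver_health(process_alive, probe_name, timeout=3.0,
                    resolver=socket.gethostbyname):
    """Combine the process-list check with a functional lookup."""
    if not process_alive:
        return "down"
    if not dns_resolves(probe_name, timeout, resolver):
        return "not_resolving"  # alive in the process list, but useless
    return "ok"
```

The point of the second status is that a hung nameserver shows up as `not_resolving` instead of passing as healthy just because its process still exists.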
So now the monitor of the monitor for the on-node/off-node monitoring process also monitors resolution, connectivity and DNS health.
What really hurts is that the header of the code shows a commit by one vmazek. Ouch.
Copyright © 2005-2010 Vlad Media, Inc. All Rights Reserved.
Content is provided AS-IS without warranty of any kind.