January 27, 2013: All systems functioning normally again

As of 2:53 PM on January 27, 2013, all systems are functioning normally again. We had intermittent issues across our network.

Here’s what happened. 

One of our four memcached servers ran out of memory and locked up in the process. As a result, our database servers were seeing 8x the average number of calls. Because our monitors alerted us to higher-than-normal database activity, we started investigating the issue there.

What did we learn?

It turns out our monitoring on the memcached systems isn’t as good as we thought it was. Had we known that one of the memcached servers was out of commission, we would’ve been able to identify the problem and fix it, rather than investigating what was causing the spike in database usage.
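
For readers who want to watch for this kind of failure, here is a minimal sketch of a per-server memcached check. It is an illustration, not the monitoring we actually run. It assumes each server speaks the standard memcached text protocol on the default port 11211, and the function names `parse_stats` and `check_memcached` are made up for this example. A locked-up server would typically fail to connect or to reply within the timeout, which the check reports as unhealthy.

```python
import socket


def parse_stats(raw: bytes) -> dict:
    """Parse the reply to the memcached text-protocol `stats` command.

    Each statistic arrives as a line of the form `STAT <name> <value>`,
    and the reply is terminated by a line containing `END`.
    """
    stats = {}
    for line in raw.decode("ascii", errors="replace").splitlines():
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def check_memcached(host: str, port: int = 11211, timeout: float = 2.0) -> dict:
    """Return a small health report for one server (hypothetical helper).

    Network errors and timeouts are reported as unhealthy, not raised.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(b"stats\r\n")
            raw = b""
            # Read until the END terminator or until the peer closes.
            while not raw.endswith(b"END\r\n"):
                chunk = sock.recv(4096)
                if not chunk:
                    break
                raw += chunk
    except OSError as exc:  # refused, unreachable, DNS failure, timeout
        return {"host": host, "ok": False, "reason": str(exc)}

    stats = parse_stats(raw)
    if "bytes" not in stats or "limit_maxbytes" not in stats:
        return {"host": host, "ok": False, "reason": "incomplete stats reply"}

    limit = int(stats["limit_maxbytes"])
    if limit <= 0:
        return {"host": host, "ok": False, "reason": "invalid memory limit"}

    # Fraction of the configured cache memory currently holding items.
    return {"host": host, "ok": True, "memory_used": int(stats["bytes"]) / limit}
```

Calling `check_memcached` once per server on a schedule and alerting whenever `ok` is false would flag a dead cache node directly, before the extra load shows up on the database. The `memory_used` figure is informational only: a healthy memcached instance normally fills its configured memory and evicts old items, so a high value alone is not a sign of trouble.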
