Author Archives: Pressable

Caching layer degradation – Fixed and Stable

We’ve found some issues with our memcached cluster, which we’re working to resolve as soon as possible. Symptoms include slower sites and, occasionally, pages that return a “504 Timeout” error.

We apologize for the disruption.

 

Update, February 12, 2013, 8:46 PM: The problem was fixed at 7:30 PM CST. We’ve been monitoring the situation for the past hour, and things have been stable. We consider the issue resolved.

Database and Service Interruptions

We’re aware of an issue in our database cluster that is causing some sites to be slow or unavailable. We’re working on it and will post more details as they become available.

 

UPDATE 10:45AM CST: The database connection issues are still ongoing and we’re continuing to investigate the cause.

Update on the botnet attack of February 7, 2013

We’re starting to get things under control. We’ve blocked 2,832 unique IP addresses so far. We’re continuing to monitor the situation and to isolate the affected customers from the customer who is being attacked.

What we know so far

  1. A customer’s website is under a botnet attack, and we are seeing 190,000 requests/second made to one IP address.
  2. These requests seem to be coming from about 3,000 unique IP addresses.
  3. Our firewall’s CPU usage was peaking at about 90% while this was happening; our alarms go off when it hits 51%.
  4. Blocking all 3,000 IPs on the firewall is not a good idea, so we’ve “null routed” the destination IP address.
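
To put the figures above in perspective, here is a minimal sketch of the arithmetic. The function name and the idea of feeding it a list of source addresses from a request log are assumptions made for illustration; this is not a description of the tooling we used during the attack.

```python
from collections import Counter

def summarize_sources(source_ips, window_seconds):
    """Summarize request volume for a list of source addresses.

    source_ips is a hypothetical list with one entry per request
    seen during a window of window_seconds seconds.
    """
    counts = Counter(source_ips)
    total = sum(counts.values())
    unique = len(counts)
    return {
        "unique_sources": unique,
        "requests_per_second": total / window_seconds,
        # Average load contributed by each distinct address.
        "avg_per_source_per_second": (total / window_seconds / unique) if unique else 0.0,
    }

# Figures quoted in the list above: 190,000 requests/second from about 3,000 addresses.
per_source = 190_000 / 3_000  # roughly 63 requests/second from each address
```

At roughly 63 requests per second per address, the traffic is spread thinly enough that the aggregate rate, not any single source, is what strains the firewall.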

What are we doing to bring customers back?

  1. We are assigning new IPs to the affected customers (several hundred) who shared an IP address with this customer.
  2. If we control your DNS, this change will happen within the next 30 minutes. If we don’t, we’ll contact you to let you know what the new IP address should be.

 

January 27, 2013 all systems functioning normally again

As of 2:53 PM on January 27, 2013, all systems are functioning normally again. We had intermittent issues across our network.

Here’s what happened. 

One of our 4 memcached servers ran out of memory and locked up in the process. As a result, our database servers were seeing 8x their average number of calls. Because our monitors were alerting us to higher-than-normal database activity, we started investigating the issue there.

What did we learn?

It turns out our monitoring of the memcached systems isn’t as good as we thought it was. Had we known that one of the memcached servers was out of commission, we would have been able to identify the problem and fix it, rather than investigating what was causing the spike in database usage.
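
As an illustration of the kind of check that would have caught this, here is a minimal sketch of a per-server memcached probe. It assumes the standard memcached text protocol, whose `stats` command reports `bytes` and `limit_maxbytes`. The function names and the 90% warning threshold are hypothetical, and this is not a description of our actual monitoring.

```python
import socket

def parse_stats(raw):
    """Parse the text reply to memcached's `stats` command into a dict."""
    stats = {}
    for line in raw.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats

def memory_usage(stats):
    """Fraction of the configured memory limit currently in use."""
    return int(stats["bytes"]) / int(stats["limit_maxbytes"])

def classify(usage, warn_at=0.9):
    """warn_at is a made-up threshold chosen for illustration."""
    return "WARNING" if usage >= warn_at else "OK"

def fetch_stats(host, port=11211, timeout=2.0):
    """Send `stats` to one server. Raises OSError if it is down or hung."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"stats\r\n")
        data = b""
        while not data.endswith(b"END\r\n"):
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("ascii", errors="replace")

def check_server(host):
    """Return a one-line status for a single memcached server."""
    try:
        usage = memory_usage(parse_stats(fetch_stats(host)))
    except (OSError, KeyError, ValueError, ZeroDivisionError) as exc:
        # A locked-up server shows up here as a timeout or refused connection.
        return f"CRITICAL {host}: no usable stats ({exc})"
    return f"{classify(usage)} {host}: {usage:.0%} of memory limit in use"
```

Running `check_server` against each of the 4 servers on a schedule would flag a hung server directly, instead of leaving it to be inferred from a rise in database load.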

 

 

Database Issue

We are currently experiencing an issue that is causing sites to display database connection errors.

We are working to resolve this as soon as possible and will post more information as it becomes available.

Update:

All sites are back up now. Total downtime was about 15 minutes.

If you have any questions, please send an email to help@zippykid.com and we can address your concerns via our help desk.