Howdy,
We have been flooded with help desk requests, tweets, emails, and phone calls requesting more information, and we have had a difficult time keeping up, getting responses out to everyone, and answering every question.
We are starting this status blog post to help answer as many of the most frequently asked questions as we can. You can find the latest updates at the bottom of this post.
For customers who have questions/concerns regarding the outage, please join us in our chat lounge to discuss: hipchat.com/g5gQ8vl9S
Exactly what happened and what is going on?
Yesterday morning, we encountered issues with caching servers that had previously been built and optimized to handle load from the outage early last week. We determined that this was the result of bandwidth limitations on our internal network traffic.
In response, we built new servers with double the network throughput, but continued to see the same issues as before. From there, we decided to subdivide caching traffic by cluster and saw significant improvement in the situation.
With that improvement in place, we began to see internal bandwidth limitations on our database servers, and we are currently adding more database servers to help with this.
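For anyone curious what subdividing caching traffic by cluster looks like in practice, here is a simplified sketch of the general idea. This is illustrative only and is not our actual configuration; the cluster names and hostnames below are placeholders.

    # Illustrative sketch only -- placeholder hostnames, not our production setup.
    import redis  # assumes the third-party "redis" client is installed

    # One dedicated cache pool per cluster, so heavy cache traffic from one
    # cluster can no longer saturate the internal links shared by the others.
    CACHE_POOLS = {
        "cluster-a": redis.Redis(host="cache-a.internal", port=6379),
        "cluster-b": redis.Redis(host="cache-b.internal", port=6379),
    }

    def cache_for(cluster):
        """Return the cache connection dedicated to the given cluster."""
        return CACHE_POOLS[cluster]

The important part is the isolation: a spike in one cluster's cache traffic now stays on that cluster's own connections instead of competing with everyone else's.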
What is being done to fix it?
The first step in addressing the original issue from yesterday morning was to subdivide internal caching traffic by cluster. Doing so has helped significantly and has exposed bottlenecks in other places, most importantly our database servers.
As a result, we are adding more database servers to address these internal bandwidth limitations as they pertain to our databases.
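To give a rough picture of how extra database servers relieve this kind of pressure, here is a generic sketch of spreading read traffic across several servers. Again, this is an illustration with placeholder hostnames, not a description of our exact architecture.

    # Generic illustration only -- placeholder hostnames.
    import itertools

    # With additional database servers, read traffic can be spread across
    # several machines instead of funneling through a single server's links.
    READ_SERVERS = itertools.cycle([
        "db-read-1.internal",
        "db-read-2.internal",
        "db-read-3.internal",  # newly provisioned server
    ])

    def next_read_server():
        """Pick the next database server, round-robin, for a read query."""
        return next(READ_SERVERS)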
What is the ETA on completing a fix?
Providing an ETA in this situation is very difficult. It relies on us knowing exactly how quickly we can get new servers up, optimized, and working reliably. Those times are hard to predict, because they also need to include time to monitor each change and verify improvement.
We are all hands on deck, working very hard to restore stability to the system.
We do not currently have a firm ETA and likely will not provide one in this situation. The best thing to do is to keep checking the current status at the bottom of this post.
That said, we expect things to be working normally within a matter of hours.
What is the current status?
As of 9:35 AM, Jan. 21, 2015: Sites are intermittently up and down. We are currently re-provisioning portions of our architecture, but the rate at which we can add servers is limited, so we are working with our provider to have our rate limits raised. Once the new servers are up, we expect sites to stay up consistently and run at normal speed.
UPDATE 10:45 AM, Jan. 21, 2015: Our provider has raised our rate limits and we are able to provision servers at a more rapid pace. We are still working on getting new database servers up and will update again as soon as we have more information to share.
UPDATE 1:50 PM, Jan. 21, 2015: We are finalizing the deployment of several new machines and monitoring them for improvements. We expect to see improvements in several clusters as soon as these deployments are finished. We will update again as soon as we have more information to share.
UPDATE 5:10 PM, Jan. 21, 2015: Our team is still working to fine-tune the new hardware deployments that have been brought online. We’re continuing to monitor the situation and make adjustments as needed.
UPDATE 8:00 PM, Jan. 21, 2015: Our team is still working to get the new hardware deployments brought online and into rotation. We expect to see improvements in several clusters as soon as these deployments are finished. We’re continuing to monitor the situation and make adjustments as needed.
UPDATE 11:30 PM, Jan. 21, 2015: We’re continuing to monitor the situation and make adjustments as needed. Sites will remain intermittently up and down until the new hardware deployments are brought online. We will update again as soon as we have more information to share.
UPDATE 9:15 AM, Jan. 22, 2015: After discussions with our provider, we are in the middle of rolling out changes that we hope will resolve this in the near future. Thank you for hanging in there with us as we work toward a resolution.
UPDATE 10:30 AM, Jan. 22, 2015: We have isolated a single cluster that was causing trouble for the others. For the time being, all clusters except that one are up and running “normally.” As we work on that cluster and test changes and fixes, though, the others may be affected. For customers on our bode cluster, we are working on a fix right now and looking to move customers off this cluster as soon as possible. We are still working on a finalized course of action for these customers.
UPDATE 1:35 PM, Jan. 22, 2015: We have created a new cluster and have moved a large chunk of customers off of bode and onto a new cluster named “hydra.” If you previously had a site on bode and it has been moved, you will likely see it begin working within the next 1–2 hours as DNS changes over. We are finalizing plans for customers who do not have DNS pointed at us and will communicate those plans as soon as we know what we will be doing for this set of customers.
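If you would like to check whether DNS has switched over for your own site, one quick way is to look up what your domain currently resolves to and compare it against the new IP for your site. The domain below is a placeholder; substitute your own.

    # Quick DNS check -- replace example.com with your own domain.
    import socket

    def current_ip(domain):
        """Return the IPv4 address the domain currently resolves to."""
        return socket.gethostbyname(domain)

    print(current_ip("example.com"))

Keep in mind that DNS caches on your local network may lag behind, so results can vary from one connection to another during the changeover window.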
For customers who have questions/concerns regarding the outage, please join us in our chat lounge to discuss: hipchat.com/g5gQ8vl9S