RESOLVED: Chicago Data Center Outage

We wanted to provide this post as a notification that issues related to our Chicago Data Center Outage have been resolved. Our team will be continuing to work through the emails and tickets related to this issue and we’ll be providing a full postmortem tomorrow.

We sincerely appreciate your patience and understanding during this truly trying experience. The kind emails, tweets and messages we’ve received have been a true blessing.

RESOLVED: Chicago Data Center Outage

We are currently experiencing an outage in our Chicago Datacenter. All sites hosted in this data center are currently unresponsive and not loading.

We are addressing this issue now and will have things back up and operational shortly.

UPDATE 5:30PM CST: We’re still experiencing issues with the network backbone at our Chicago DC. We’re working with our provider to determine what the cause is and how we can mitigate this traffic to restore services.

UPDATE 6:05PM CST: We’re still working with our provider to determine the cause of the increased traffic along our backbone. We apologize for the delay in getting things back up and operational.

UPDATE 6:30PM CST: We’re still working with our provider to determine mitigating steps we can take for our internal network. We’ll provide an update in 30 minutes.

UPDATE 7:00PM CST: We’re currently engaging more senior members of our provider’s team to help troubleshoot the issues. We’ll provide an update in 30 minutes.

UPDATE 7:35PM CST: We’re still working with our provider to diagnose the current issues. We’ll provide an update in 30 minutes.

UPDATE 8:30PM CST: Our team is still working to mitigate the effects of traffic on our backbone. We’ve started to push out configuration changes which we expect to help reduce the impact, but it will be a bit longer before we know the impact of these changes. We’ll provide an update in 30 minutes.

UPDATE 10:05PM CST: Our team is still working to restore services. We’ll provide an update in 30 minutes.

UPDATE 10:40PM CST: Our team is beginning some work to bring servers back online and then will continue to evaluate issues. We do not expect services to return to normal at this time and will provide another update in 30 minutes.

UPDATE Jan 10th, 2015 @ 12:30AM CST: Our team is still working on bringing servers back online and undergoing internal troubleshooting processes. We will continue to update the status blog as often as possible.

UPDATE Jan 10th, 2015 @ 1:25AM CST: We’re currently experiencing delays in the restoration of services as Rackspace is having provider line issues. We’ll continue our efforts as best as possible while Rackspace works to improve their provider issue. (https://status.rackspace.com/)

UPDATE Jan 10th, 2015 @ 2:45AM CST: The team is still working to bring services back online following issues from yesterday. We will continue to update the status blog as we make progress.

UPDATE Jan 10th, 2015 @ 6:05AM CST: Our team is still working to bring services back online. Some sites may have begun responding over the past hour; however, capacity is still not back to 100%. We’ll provide more updates as we progress.

UPDATE Jan 10th, 2015 @ 7:30AM CST: The team is still bringing services back online from the outage. We will provide more updates as we have them.

UPDATE Jan 10th, 2015 @ 9:00AM CST: We’re still working on bringing services back online related to this outage. We’ll provide more information as progress is made.

UPDATE Jan 10th, 2015 @ 10:00AM CST: Our team is still working to resolve issues related to connectivity and site availability. We’ll provide an update when we have more information.

UPDATE Jan 10th, 2015 @ 11:45AM CST: The team is making some configuration changes while we work to bring additional capacity online. We’ll provide an update when there is more progress.

UPDATE Jan 10th, 2015 @ 1:20PM CST: The team is making progress and we hope to have all systems back up and running soon. Some sites may have begun responding over the past hour; however, capacity is still not back to 100%. We’ll provide more updates as we progress.

UPDATE Jan 10th, 2015 @ 3:30PM CST: The team has made updates to the system that will help get things running and stable. Capacity is still not back to 100%. We’ll provide more updates as we progress.

UPDATE Jan 10th, 2015 @ 6:20PM CST: Sites are beginning to be served now, but you may encounter intermittent 502 errors while things continue to settle. We’ll provide more updates as we progress.

UPDATE Jan 10th, 2015 @ 9:20PM CST: Customers with sites on our “Thor” cluster should see their sites being served properly now. We’re finalizing some work on our “Bode” and “Galaxy01” clusters and expect those to be functional shortly. We’ll provide more updates as we progress.

UPDATE Jan 11th, 2015 @ 9:15AM CST: We wanted to get another note out to let everyone know that we’ve seen stability restored (as of last night) and are currently seeing all systems online. We sincerely regret the experience provided over the last several days, but do appreciate your continued patience. Tomorrow we’ll be providing a more detailed analysis of the issues and our steps to correct these issues moving forward.

High volume of bot traffic causing sites to not respond

We are currently experiencing a high volume of bot traffic affecting customers on the cluster “galaxy01”. This will result in some sites not being able to display or function. We are working on getting this resolved and will continue to update the status of this issue here.

If you have any questions, please submit a ticket via your my.pressable.com control panel.

UPDATE 12:58PM: We are continuing to investigate the issue causing some sites on different clusters to return 502 errors. If you have any questions, please submit a ticket via your my.pressable.com control panel.

UPDATE 3:05PM CST: At this time we’ve seen stability across the systems for the last 90 minutes. We’re still continuing our investigation into the cause of these issues, but wanted to let you know some normalcy has returned. Our apologies for the issues this week.

RESOLVED: Network Connectivity Issues Impacting Availability

We’re currently experiencing an issue impacting our edge devices that’s causing sites and internal systems to be inaccessible. We’re investigating with our provider and will provide an update when we have more information.

UPDATE 10:20PM CST: Our provider is investigating the issue and believes it may be related to a DDoS attack against another of their customers, which is saturating the internal networks at the datacenter. We’re continuing to work with our partners to restore services ASAP.

UPDATE 10:30PM CST: Our provider is informing us that the attack appears to have subsided for the time being. At this time our systems are beginning to return to normal and we’re continuing to monitor for any lingering issues.

UPDATE 10:50PM CST: At this time we’re seeing our systems operating at normal levels. We’ll continue to monitor for issues, but do not expect there to be any problems related to this incident.

Service Availability Issue on Galaxy01

We’re currently experiencing an issue with a cluster named “Galaxy01” that’s causing high load across the system, which results in sites being down or displaying 502 error messages. We’re investigating the root cause of this issue and will provide updates as we have them.

UPDATE 11:15AM CST: We’re continuing to investigate issues causing higher than normal load on the systems. We’re also working to bring new hardware online which we believe may help to alleviate some issues.

UPDATE 12:25PM CST: We’re currently in the process of bringing the new hardware online which is causing some downtime while it starts taking traffic. We apologize for the issue and expect the new hardware to be 100% functional shortly.

UPDATE 1:00PM CST: The new hardware is online and we’re starting to see usage return to normal. We’ll provide an update when things are more stable on the systems.

UPDATE 3:25PM CST: We’ve seen stability in the systems for the past hour and are continuing to monitor systems. If you notice any issues, please let us know.

Configuration Issue Causing High Load

We are currently experiencing a configuration issue that is causing higher than normal load on some clusters in our network. This is resulting in some sites being unavailable or extremely slow to load and is unrelated to the attacks from earlier.

We will update again as soon as this has been cleared up and begins subsiding.

If you have questions or concerns, please submit a ticket via the https://my.pressable.com control panel.

UPDATE 5:05PM CST: At this time we’re seeing stability return to our systems. However, we’re still investigating with partners to determine the root cause of issues earlier. We’ll provide an update when we’re confident in the stability of the systems.

UPDATE 5:53PM CST: We are continuing to see a configuration issue that is causing higher than normal load on some clusters in our network. This is resulting in some sites being unavailable or extremely slow to load and is unrelated to the attacks from earlier. We will update again as soon as this has been cleared up and begins subsiding.

UPDATE 6:40PM CST: Servers have again returned to normal levels, but until we’ve determined the root cause we won’t call this resolved. Please stay tuned to the status blog for updates.

UPDATE 8:55PM CST: Our team is still working with providers to determine the root cause of these issues. We’ve continued to see relatively normal operating levels across our machines, but until we’re able to determine the root cause we won’t be out of the woods. Thanks for your continued patience on a very trying day for all involved.

Botnet Attack Causing Issues Across Network

Howdy,

Our servers are currently under attack and this has caused some of our databases to run behind and result in sites not displaying, updating, or functioning properly.

We have already begun banning IPs before they hit our load balancers and are working toward reducing the impact and footprint of this.

We will update the status blog again once this attack has subsided.

If you have questions or concerns, please contact our helpdesk via your panel at https://my.pressable.com.

Thank you!

UPDATE – 01/05/2015 @ 11:56AM Central:

We are seeing the slave database servers as caught up and sites should be functioning properly now. If you are still experiencing issues, please contact us via https://my.pressable.com

UPDATE – 01/05/2015 @ 12:25AM Central:

The slave databases have fallen behind again and we are attempting to get them caught back up. Although they had originally caught back up, they are now behind once more and you may be experiencing some issues (logging in to your site, content updates not displaying, etc.). We will update once we have seen sustained stability and synchronization on these database servers.

Please let me know if you have any questions, thanks!

Emergency Maintenance: December 24, 2014 at 3:00AM CST

On December 23, 2014 an issue was discovered with a database cluster serving sites using the “cartwheel” hostname. The issue has been impacting performance of these sites as well as causing intermittent downtime. In order to correct the issue, we need to perform emergency maintenance on these servers. This maintenance will require a period of downtime while we work to repair the servers and restore service to normal levels.

MAINTENANCE INFORMATION: The maintenance window is expected to begin at 3:00AM CST on December 24, 2014 and expected to last approximately 2 hours. We DO NOT expect sites to be down during this entire period, but there may be periods of connectivity loss.

We will provide updates to the status of this maintenance in this post.

UPDATE 12/24/2014 @ 6:50AM CST: At this time we have completed the maintenance and are starting to see traffic return to sites. Our apologies about the short notice and extended periods of downtime. If you have any questions, please don’t hesitate to reach out.

Network Degradation Impacting Site Availability

We’re currently experiencing an issue with our network causing a complete degradation of services and loss of traffic. We’re working with our provider to determine the cause and source of the issues, but early signs point to a targeted attack against our systems. We’ll provide more details as they become available.

UPDATE 3:25PM CST: We’re still working with our provider to determine the source of increased traffic and to correct any issues.

UPDATE 3:55PM CST: At this time things appear to have stabilized. However, we’re still working with our provider to determine the root cause of the issues and put any necessary measures in place to prevent similar issues.

Site Availability issue on Galaxy01

We’re currently experiencing an issue with our Galaxy01 cluster of servers. This issue is causing intermittent 500/502 errors while processes fail to load. We’re currently evaluating the system and working to restore services to normal operational levels as quickly as possible. We’ll update this post with more information as we have it.

UPDATE 10:00AM CST: The team is still working to restore this cluster of servers to 100%. We’re getting closer as some new firewall rules and additional capacity come online. We’ll provide another update shortly.

UPDATE 10:45AM CST: At this time services are beginning to return to normal across the affected servers. We’re still waiting on some new rules to finish processing, but things are trending in a positive direction.

Pressable Status

Updates and Notifications from the Pressable system operators
