Author Archives: Roberto

Network Connectivity Issues

UPDATE 11/20/15 @ 11:00AM Central:
The connectivity issue that Rackspace experienced between their ORD (Chicago) data center and Time Warner Cable has been resolved, and all services have returned to normal. At this time, all sites should be accessible. If you experience any issues with your sites, please open a support ticket (support@pressable.com) so that we may assist you.

INITIAL NOTICE:
We are currently experiencing network connectivity issues affecting sites based out of our ORD datacenter.

We are working with Rackspace to see when they expect these issues to be resolved and will update here as we know more.

If you have questions or concerns, please submit a ticket via https://my.pressable.com

UPDATE 11/20/15 @ 6:00AM Central:
Rackspace has informed us that they are experiencing connectivity and latency issues from Time Warner Cable. We are staying in close contact with them as this issue persists and will continue to update our status blog accordingly.

UPDATE 11/20/15 @ 7:15AM Central:
Service is returning to normal at this point. We are working on bringing all servers back to normal and, as a result, sites may still experience issues loading. We will update again as soon as service has been restored completely.

Emergency Security Maintenance

Howdy,

We received notice this afternoon that our provider, Rackspace, has identified a vulnerability in Xen Hypervisor. This vulnerability has been patched by Rackspace, but it requires a reboot of cloud servers in order for the patch to take effect.

You can read more about the vulnerability and requirements here:

https://community.rackspace.com/general/f/53/t/5187
http://venom.crowdstrike.com

We have elected to reboot cloud servers in our network at our discretion rather than allowing Rackspace to reboot using a maintenance window approach.

We will begin the process of rebooting cloud servers tonight, May 13th, at 10:00 PM Central time. We expect this process to take several hours and customers will see intermittent outages of varying length.

We will update this page when the maintenance is completed.

If you have any questions or concerns, please contact our help desk.

We apologize in advance for any inconvenience this may cause and are working toward addressing this as smoothly as possible.

Thank you!

UPDATE – May 14th, 2015 @ 2:57 AM Central: We have completed the reboots that are necessary as a part of this vulnerability patch. All systems are back up, running, and stable. Thank you for your patience and please let us know if you have any questions. Thanks!

XSS Vulnerability Affecting Multiple WordPress Plugins

The Sucuri Blog has notified users of multiple WordPress plugins that are vulnerable to Cross-site Scripting (XSS) attacks. Listed are some of the more popular plugins used in the WordPress community:

https://blog.sucuri.net/2015/04/security-advisory-xss-vulnerability-affecting-multiple-wordpress-plugins.html

The nature of this vulnerability makes it difficult to patch completely/comprehensively because so many plugins use the functions listed as being misused.

We highly recommend logging into your WordPress Dashboard and updating any plugins that have available updates.

If you have any questions or concerns, please contact our help desk by submitting a ticket via your https://my.pressable.com panel.

Chicago Datacenter Issue – RESOLVED

Howdy,

We just finished dealing with an issue in our Chicago datacenter that was causing several other clusters to experience instability. Our “Ursa” cluster was taking on an extreme amount of traffic that looks to be, in large part, a bot attack.

This happened because a set of tools on the “Ursa” cluster that detects and mitigates issues like this was not running properly.

We’ve cleared this up and all sites are now back up and running appropriately.

If you have any questions or concerns, please contact our help desk via your https://my.pressable.com control panel.

Thank you!

Rackspace Scheduled Critical Maintenance

This is a notice that Rackspace will be performing critical security related updates to many cloud server host machines in order to patch vulnerabilities in Xen Hypervisor.

You can read more about this maintenance here:

https://community.rackspace.com/general/f/53/t/4978

These patches/updates will require host machines to be rebooted, subsequently causing cloud servers hosted on them to require a reboot as well.

As it relates to our customers, here are the maintenance windows we have been provided, listed by cluster, during which we expect server reboots to occur:

  • Hyperion, Pegasus, Cartwheel Clusters
    • Tuesday, March 3rd 01:00 – Tuesday, March 3rd 05:00 EST COMPLETE
  • Galaxy01, Thor, Bode, Ursa, Hydra Clusters
    • Wednesday, March 4th 22:00 – Thursday, March 5th 06:00 CST
    • Thursday, March 5th 22:00 – Friday, March 6th 02:00 CST
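Because the windows above are given in two different time zones, it may help to see them on a single clock. The following is a minimal sketch, not an official tool, that converts each window to UTC. The year (2015) is an assumption, since the notice does not state it, and the fixed UTC-5 and UTC-6 offsets assume standard time, which holds for early March under current US daylight saving rules.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets: early March falls before US daylight saving time begins,
# so EST is UTC-5 and CST is UTC-6.
EST = timezone(timedelta(hours=-5), "EST")
CST = timezone(timedelta(hours=-6), "CST")

YEAR = 2015  # assumption: the notice itself does not state the year

# (clusters, window start, window end), copied from the list above
WINDOWS = [
    ("Hyperion, Pegasus, Cartwheel",
     datetime(YEAR, 3, 3, 1, 0, tzinfo=EST),
     datetime(YEAR, 3, 3, 5, 0, tzinfo=EST)),
    ("Galaxy01, Thor, Bode, Ursa, Hydra",
     datetime(YEAR, 3, 4, 22, 0, tzinfo=CST),
     datetime(YEAR, 3, 5, 6, 0, tzinfo=CST)),
    ("Galaxy01, Thor, Bode, Ursa, Hydra",
     datetime(YEAR, 3, 5, 22, 0, tzinfo=CST),
     datetime(YEAR, 3, 6, 2, 0, tzinfo=CST)),
]


def to_utc(windows):
    """Return (clusters, start_utc, end_utc) tuples for each window."""
    return [
        (clusters, start.astimezone(timezone.utc), end.astimezone(timezone.utc))
        for clusters, start, end in windows
    ]


if __name__ == "__main__":
    for clusters, start, end in to_utc(WINDOWS):
        print(f"{clusters}: {start:%a %b %d %H:%M} to {end:%a %b %d %H:%M} UTC")
```

To see the windows in your own time zone instead, replace `timezone.utc` with the offset that applies to you.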

To find out which cluster your sites are on, please reference our knowledge base article on identifying which cluster your site is on.

We definitely understand these kinds of outages are not ideal but we are hoping this early notice is helpful in the way of being able to notify your users, visitors, and customers.

If you have any questions, please feel free to contact the help desk via your my.pressable.com control panel.

Thank you!

IAD Datacenter Issues

We are experiencing an issue with sites in our IAD Datacenter that is causing them to not load appropriately.

We believe this is related to issues that Rackspace is currently having with their cloud block storage services. We are working with them directly and awaiting further details regarding this issue and will update the status blog with more information as it becomes available.

If you have questions or concerns, please create a help desk ticket via your https://my.pressable.com panel or join us in our community lounge at http://chat.pressable.com for updates while we await further details.

UPDATE Feb 27, 2015 @ 5:00 AM Central: We were able to confirm that this is an issue occurring at Rackspace with their Cloud Block Storage service. You can find more details and information on their status page: https://status.rackspace.com/index/viewincidents?start=1425013200

We will update again as more information is available.

UPDATE Feb 27, 2015 @ 5:55 AM Central: Rackspace has resolved the issue on their end and we are now working on re-establishing stability on our end. We will update again as soon as this is taken care of.

UPDATE Feb 27, 2015 @ 6:10 AM Central: We have now restored functionality across our IAD datacenter and all sites are now functioning normally. If you continue to experience problems, please submit a help desk ticket via your https://my.pressable.com panel.

Current Outage Breakdown and Full Information

Howdy,

We have been flooded with help desk requests, tweets, emails, and phone calls requesting more information, and have had a bit of a difficult time keeping up, getting responses out to everyone, and answering all questions.

We are starting a new status blog post to help answer as many of the most frequently asked questions as we can. You can continue to receive the latest updates at the bottom of this post.

For customers who have questions/concerns regarding the outage, please join us in our chat lounge to discuss: hipchat.com/g5gQ8vl9S

Exactly what happened and what is going on?

Yesterday morning, we encountered issues with caching servers that had previously been built and optimized to handle load from the outage early last week. It was determined that this was a result of bandwidth limitations on our internal network traffic.

As a result, we built new servers that had double the network throughput, but continued to have the same issues as before. From here, we decided to subdivide caching traffic based on cluster and saw significant improvement in the situation.

After seeing improvement, we began to see internal bandwidth limitations on database servers and are currently working on adding additional database servers to help with this.

What is being done to fix it?

The first step in addressing the original issue from yesterday morning was to subdivide internal caching traffic based on cluster. Doing so has helped significantly and has helped us see bottlenecks in other places, most importantly our database servers.

As a result, we are working to add additional database servers as a means of addressing these internal bandwidth limitations as they pertain to databases.

What is the ETA on completing a fix?

Providing an ETA in this situation is very difficult. It relies on us knowing exactly how quickly we can get new servers up, optimized, and working reliably. These times are unknown because they need to include some time to monitor the implementation and verify improvement.

We are all hands on deck and all working very hard to have stability restored to the system.

We do not currently have and will not likely provide an ETA in this situation. The best thing to do is to keep checking the current status at the bottom of this post.

We expect things to be working normally within a number of hours.

What is the current status?

As of 9:35AM, Jan. 21, 2015: Sites are currently up and down, intermittently. We are currently working to re-provision portions of our architecture but the rate at which we can add servers is currently limited and we are working with our provider to have our rate limits pushed up. Once new servers are up, we will begin seeing sites stay up consistently and running at normal speed.

UPDATE 10:45AM, Jan. 21, 2015: Our provider has raised our rate limits and we are able to provision servers at a more rapid pace. We are still working on getting new database servers up and will update again as soon as we have more information to share.

UPDATE 1:50PM, Jan. 21, 2015: We are currently finalizing the deployment of several new machines and are monitoring these for improvements. We are expecting to see improvements in several clusters as soon as these deploys are finished. We will update again as soon as we have more information to share.

UPDATE 5:10 PM, Jan. 21, 2015: Our team is still working to fine tune the new hardware deployments brought online. We’re continuing to monitor the situation and make adjustments as needed.

UPDATE 8:00 PM, Jan. 21, 2015: Our team is still working to get the new hardware deployments brought online and in rotation. We are expecting to see improvements in several clusters as soon as these deploys are finished. We’re continuing to monitor the situation and make adjustments as needed.

UPDATE 11:30 PM, Jan. 21, 2015: We’re continuing to monitor the situation and make adjustments as needed. Sites are currently up and down, intermittently until the new hardware deployments are brought online. We will update again as soon as we have more information to share.

UPDATE 9:15 AM, Jan. 22, 2015: After discussions with our provider, we are in the middle of rolling out changes that we are hoping will help resolve this in the near future. Thank you for hanging in there with us as we look for a resolution.

UPDATE 10:30 AM: We have isolated a single cluster that was causing trouble for the others. For the time being, all clusters except that one are up and running “normally.” As we work on that cluster and test changes/fixes, though, the others may be affected by it. For customers on our bode cluster, we are working on a fix right now and looking to possibly move customers off this cluster as soon as possible. We are still working on a finalized course of action for these customers.

UPDATE 1:35 PM: We have created a new cluster and have moved a large chunk of customers off of bode and onto a new cluster named “hydra.” If you previously had a site on bode and that has been moved, you will likely see it begin working within the next 1 – 2 hours as DNS changes over. We are finalizing plans for customers that do not have DNS pointed at us and will communicate this as soon as we know what we will be doing with this set of customers.

For customers who have questions/concerns regarding the outage, please join us in our chat lounge to discuss: hipchat.com/g5gQ8vl9S

Slow/Unresponsive Sites

Howdy,

We are investigating an issue causing slow/unresponsive sites and page loads resulting in 502s for some sites on our network.

As soon as we have further information we will update this status blog.

If you have questions or concerns, please contact us via your https://my.pressable.com control panel.
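If you would like to check whether your own site is affected, the following is a minimal sketch (not an official tool) that requests a URL several times and counts how many responses were successful, how many were 502s, and how many failed some other way. The function names, the number of attempts, and the delay are all illustrative assumptions, and `https://example.com` in the usage note is a placeholder for your own site.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses, including 502, land here
    except OSError:
        return None  # DNS failure, refused connection, timeout, and so on


def summarize(statuses):
    """Count checks that were OK, returned 502, or failed some other way."""
    ok = sum(1 for s in statuses if s is not None and 200 <= s < 400)
    bad_gateway = sum(1 for s in statuses if s == 502)
    return {"ok": ok, "502": bad_gateway,
            "other": len(statuses) - ok - bad_gateway}


def poll(url, attempts=5, delay=30, fetch=fetch_status):
    """Check url several times, pausing between attempts, then summarize."""
    statuses = []
    for i in range(attempts):
        statuses.append(fetch(url))
        if i < attempts - 1:
            time.sleep(delay)  # wait before the next check
    return summarize(statuses)
```

For example, `poll("https://example.com", attempts=5, delay=30)` makes five checks spread over two minutes. A mix of "ok" and "502" counts suggests the intermittent behavior described in this post, and including that summary in a support ticket may help us troubleshoot.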

UPDATE 9:54AM Central: We are still looking into a root cause for this issue. Sites will continue to go up/down while we troubleshoot and clear this up.

UPDATE 11:06AM Central: Service has returned to normal at this point and sites will be loading properly. We are still working to identify the root cause of the issue and ensure that it has been properly addressed. For now, sites are up and functional and we will keep an eye out for any further potential issues.

UPDATE 12:02 PM: We are continuing to investigate the issue from this morning to find a resolution. The team is working now to make sure this gets resolved as quickly as possible.

UPDATE 1:00 PM: Our team is still working to bring services back to normal functionality. Some sites may have begun responding over the past hour, however, our system is still not back to 100%. We’ll provide more updates as we progress.

UPDATE 2:10 PM: We are continuing to investigate the issue from this morning to find a resolution. The team is working now to make sure this gets resolved as quickly as possible. If you haven’t done so, please submit a support ticket via your my.pressable.com control panel or send an email to help@pressable.com and we can answer any other questions you may have.

UPDATE 6:50PM Central: We are continuing to see some sites function normally and others experience issues. Our team is working on addressing the issues across the network and is making progress. We hope to have things back up and running shortly. We will update again as soon as more information is available.

UPDATE 9:00PM Central: At this time we have new caching servers up for each galaxy in our network. These are helping with load but sites are still flipping between up and down intermittently. We will update again once these new caching servers are stable and returning site speeds to normal.

UPDATE 12:00AM Central, Jan. 21, 2015: At this time we are seeing almost all sites back up and running. If you are still experiencing issues, please let us know and we will address them accordingly. We are still working on maintaining stability and speed at the moment.

UPDATE 5:40AM Central, Jan. 21, 2015: Though most clusters remain up and running intermittently, sites are continuing to go in and out occasionally. We are continuing to work on the stability and speed issues that are currently at hand. Doing so will help bring sites back online consistently again.

UPDATE 9:36AM, Jan 21st, 2015: We have posted a full breakdown of the current situation and will be providing further updates on this status blog post:
http://status.pressable.com/2015/01/21/current-outage-breakdown-and-full-information/

Please see the link above for further updates to this situation

RESOLVED: Chicago Data Center Outage

We are currently experiencing an outage in our Chicago Datacenter. All sites hosted in this data center are currently unresponsive and not loading.

We are addressing this issue now and will have things back up and operational shortly.

UPDATE 5:30PM CST: We’re still experiencing issues with the network backbone at our Chicago DC. We’re working with our provider to determine what the cause is and how we can mitigate this traffic to restore services.

UPDATE 6:05PM CST: We’re still working with our provider to determine the cause of the increased traffic along our backbone. We apologize for the delay in getting things back up and operational.

UPDATE 6:30PM CST: We’re still working with our provider to determine mitigating steps we can take for our internal network. We’ll provide an update in 30 minutes.

UPDATE 7:00PM CST: We’re currently engaging more senior members of our provider’s team to help troubleshoot the issues. We’ll provide an update in 30 minutes.

UPDATE 7:35PM CST: We’re still working with our provider to diagnose the current issues. We’ll provide an update in 30 minutes.

UPDATE 8:30PM CST: Our team is still working to mitigate the effects of traffic on our backbone. We’ve started to push out configuration changes which we expect to help reduce the impact, but it will be a bit longer before we know the impact of these changes. We’ll provide an update in 30 minutes.

UPDATE 10:05PM CST: Our team is still working to restore services. We’ll provide an update in 30 minutes.

UPDATE 10:40PM CST: Our team is beginning some work to bring servers back online and then will continue to evaluate issues. We do not expect services to return to normal at this time and will provide another update in 30 minutes.

UPDATE Jan 10th, 2015 @ 12:30AM CST: Our team is still working on bringing servers back online and undergoing internal troubleshooting processes. We will continue to update the status blog as often as possible.

UPDATE Jan 10th, 2015 @ 1:25AM CST: We’re currently experiencing delays in the restoration of services as Rackspace is having provider line issues. We’ll continue our efforts as best as possible while Rackspace works to improve their provider issue. (https://status.rackspace.com/)

UPDATE Jan 10th, 2015 @ 2:45AM CST: The team is still working to bring services back online following issues from yesterday. We will continue to update the status blog as we make progress.

UPDATE Jan 10th, 2015 @ 6:05AM CST: Our team is still working to bring services back online. Some sites may have begun responding over the past hour, however, capacity is still not back to 100%. We’ll provide more updates as we progress.

UPDATE Jan 10th, 2015 @ 7:30AM CST: The team is still bringing services back online from the outage. We will provide more updates as we have them.

UPDATE Jan 10th, 2015 @ 9:00AM CST: We’re still working on bringing services back online related to this outage. We’ll provide more information as progress is made.

UPDATE Jan 10th, 2015 @ 10:00AM CST: Our team is still working to resolve issues related to connectivity and site availability. We’ll provide an update when we have more information.

UPDATE Jan 10th, 2015 @ 11:45AM CST: The team is making some configuration changes while we work to bring additional capacity online. We’ll provide an update when there is more progress.

UPDATE Jan 10th, 2015 @ 1:20PM CST: The team is making progress and we hope to have all systems back up and running soon. Some sites may have begun responding over the past hour, however, capacity is still not back to 100%. We’ll provide more updates as we progress.

UPDATE Jan 10th, 2015 @ 3:30PM CST: The team has made updates to the system that will help get things running and stable. Capacity is still not back to 100%. We’ll provide more updates as we progress.

UPDATE Jan 10th, 2015 @ 6:20PM CST: Sites are beginning to be served now but you may encounter an intermittent 502 error while things continue to settle. We’ll provide more updates as we progress.

UPDATE Jan 10th, 2015 @ 9:20PM CST: Customers with sites on our “Thor” cluster should see their sites being served properly now. We’re finalizing some work on our “Bode” and “Galaxy01” clusters and expect those to be functional shortly. We’ll provide more updates as we progress.

UPDATE Jan 11th, 2015 @ 9:15AM CST: We wanted to get another note out to let everyone know that we’ve seen stability restored (as of last night) and are currently seeing all systems online. We sincerely regret the experience provided over the last several days, but do appreciate your continued patience. Tomorrow we’ll be providing a more detailed analysis of the issues and our steps to correct these issues moving forward.