Temporary Support Outage

Our upstream support provider reported two short outages totaling seven minutes on Tue Aug 29, from 22:03 – 22:05 and from 16:36 – 16:41.

If you sent a ticket to help@pressable.com during either window, it may need to be resent.

This issue affects only support tickets, not the Pressable infrastructure itself, which remains fully operational.

Our service provider is reporting all systems are up. We are monitoring the situation throughout the day and will report back if anything changes.

5/18 Partial Outage (Resolved)

4:00 PM UTC: The partial outage was caused by a DDoS attack on a specific customer’s site, which we have now mitigated. We will continue to monitor the situation, but as of now this outage has been resolved.

3:30 PM UTC: We are aware of and investigating a partial outage impacting some customer sites.

Investigating Outage

UPDATE: 2017-07-19 18:26 UTC

On Thursday, July 13, 2017, a subset of Pressable customer sites (including our own site, Pressable.com) experienced an outage caused by a failure in a database server. Customers with sites reliant on this database server experienced 42 minutes of downtime. A smaller subset of the impacted sites experienced a further 15 minutes of downtime 2.5 hours after the first outage was resolved.

Cause

The investigation into the underlying cause of the database server failure is ongoing. We know that several database queries were allowed to create temporary tables on disk that never completed, resulting in more than 1 TB of disk space being consumed in a very short period of time. The database server disk became 100% full, which led to the database server failing.
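As an illustration of the failure mode described above, a disk-usage check of the following kind can flag a volume before it reaches 100%. This is only a minimal sketch and not Pressable's monitoring: the thresholds, the function names, and the default path are all assumptions made for the example.

```python
import shutil

# Hypothetical thresholds; the real alerting values are not stated in this report.
WARN_FRACTION = 0.80
CRIT_FRACTION = 0.90


def disk_alert_level(used_bytes, total_bytes,
                     warn=WARN_FRACTION, crit=CRIT_FRACTION):
    """Classify disk usage as "ok", "warning", or "critical"."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    fraction = used_bytes / total_bytes
    if fraction >= crit:
        return "critical"
    if fraction >= warn:
        return "warning"
    return "ok"


def check_path(path="."):
    """Check the volume that holds `path` (the default path is an assumption)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return disk_alert_level(usage.used, usage.total)
```

A check like this only helps if it runs often enough to catch a disk that fills very quickly, and if a "critical" result pages someone who can act on it.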

Pressable has failover and redundant systems in place, but promoting a replicated database “slave” to become the database “master” is not an automated process.

While the underlying cause that led to the database failure may not have been avoidable, gaps in our alerting caused the outage to last far longer than it needed to or should have.

The 15-minute outage that occurred 2.5 hours later was our fault. Once the original database master failed, our engineers worked to reinitialize it as a replicated slave of the new master. Unfortunately, the engineer used the new master server to create the backup to import on the slave. This resulted in read/write locks against the databases and tables.

What We’re Doing

  • We’ve made updates to our alerting to ensure that the right engineers are paged when hosts and services that are critical to serving site traffic trigger monitoring.
  • We’ve also deployed updates that set a maximum allowed query execution time; queries that reach or exceed that limit are “killed”. When architecting our new platform, we chose not to add this limit because we wanted to provide an environment that was more flexible for working with larger datasets (importing, exporting, querying).
  • We’re reviewing our processes and procedures for recovering from situations like this and implementing tools that are more automated and remove the potential for error.
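The query time limit in the list above can be sketched as a small watchdog. This is not our actual implementation: it assumes a MySQL-compatible server, rows shaped like the output of `SHOW FULL PROCESSLIST` (with `Id`, `Command`, and `Time` columns), and a hypothetical limit of 300 seconds. The function names are invented for the example.

```python
from typing import Iterable, List, Mapping

# Hypothetical limit; the value actually deployed is not stated here.
MAX_QUERY_SECONDS = 300


def queries_to_kill(processlist: Iterable[Mapping],
                    max_seconds: int = MAX_QUERY_SECONDS) -> List[int]:
    """Return ids of running queries that reached or exceeded max_seconds."""
    ids = []
    for row in processlist:
        # Skip idle connections, which report Command == "Sleep".
        if row["Command"] != "Query":
            continue
        if row["Time"] >= max_seconds:
            ids.append(row["Id"])
    return ids


def kill_statements(ids: Iterable[int]) -> List[str]:
    """Build one KILL QUERY statement per id, leaving the connection open."""
    return [f"KILL QUERY {i};" for i in ids]
```

`KILL QUERY` ends the running statement but keeps the client connection, which is why the sketch uses it. The trade-off is the one noted in the list: a hard limit also cuts off legitimate long-running imports and exports.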

We’d like to apologize to our customers who were impacted by this outage.

This is the first failure of a database master server to cause downtime for customer sites since we launched our “v2” platform over 16 months ago. Several of the tools, safeguards, and features built into the platform worked in this scenario. Some, unfortunately, didn’t, failed in ways we didn’t think possible, or exposed gaps in process and alerting.


Around 3:00 am CST, we began receiving reports of users being unable to log in to the WordPress dashboard or of their sites returning a 500 error. Members of our systems team were notified immediately, and we resolved the issue after approximately 30 to 40 minutes.

We are still looking into the cause and will provide everyone with a postmortem once we have finished investigating the issue.

4th of July Help Desk Schedule

Our Help Desk will have reduced staffing for Independence Day (United States). Please note our holiday support hours and consider addressing any potential issues in advance:

  • Tuesday, July 4, 2017: Reduced staffing, expect longer response times

Our Help Desk will return to its regular schedule on Wednesday, July 5, 2017. If you have any questions or concerns about this notice, please contact our Help Desk. Thank you.

Fixed: Email Delivery Issue

An email-delivery issue with our help desk portal recently caused some emails to go undelivered. This issue affected a small number of customers who communicated via email. No issues are present at this time, and full functionality has been restored.

Issues with SSL Certificates (RESOLVED)

Update, 7:00 PM Central: The issue with our SSL provider has been resolved. New domains should have no issues being assigned SSL certificates moving forward.

8:00 AM Central: We are currently experiencing issues with issuing new SSL certificates from our third-party provider. We are actively working with them to resolve the issue so that new SSL certificates can be issued to your domains. Please contact the Help Desk for more information.