UPDATE: 2017-07-19 18:26 UTC
On Thursday, July 13, 2017, a subset of Pressable customer sites (including our own site, Pressable.com) experienced an outage caused by the failure of a database server. Customers with sites on this database server experienced 42 minutes of downtime. A smaller subset of the impacted sites experienced a further 15 minutes of downtime 2.5 hours after the first outage was resolved.
Cause
The investigation into the underlying cause of the database server failure is ongoing. We know that several database queries were allowed to create temporary tables on disk and never completed, consuming more than 1 TB of disk space in a very short period of time. The database server's disk reached 100% capacity, which caused the server to fail.
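For illustration only, here is a minimal sketch of the kind of disk-space watchdog that can catch this class of runaway growth early. The data directory path, threshold, and the page_oncall() hook are placeholders, not our actual tooling.

```python
# Minimal disk-space watchdog sketch: pages when free space on the
# database volume drops below a threshold. Paths and thresholds are
# hypothetical placeholders.
import shutil
import time

DATA_DIR = "/var/lib/mysql"   # assumed database data/tmp volume
FREE_THRESHOLD = 0.10         # page when less than 10% free
CHECK_INTERVAL = 30           # seconds between checks


def page_oncall(message: str) -> None:
    # Placeholder: integrate with the paging system in use.
    print(f"PAGE: {message}")


def main() -> None:
    while True:
        usage = shutil.disk_usage(DATA_DIR)
        free_fraction = usage.free / usage.total
        if free_fraction < FREE_THRESHOLD:
            page_oncall(
                f"{DATA_DIR} is {100 * (1 - free_fraction):.1f}% full "
                f"({usage.free // 2**30} GiB free)"
            )
        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    main()
```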
Pressable has failover and redundant systems in place, but promoting a replicated database “slave” to become the database “master” is not an automated process.
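As context for why manual promotion takes time, the sketch below outlines the kind of steps an automated failover tool would wrap: confirming the replica has caught up, stopping replication, clearing the replica configuration, and making the host writable. The hostnames, credentials, and repoint_traffic() step are hypothetical, and real failover needs more verification than shown here.

```python
# Rough sketch of manual replica promotion, the kind of sequence an
# automated failover tool would encapsulate. Not production code.
import os
import mysql.connector


def promote_replica(replica_host: str) -> None:
    conn = mysql.connector.connect(
        host=replica_host,
        user="admin",
        password=os.environ.get("DB_ADMIN_PASSWORD", ""),
    )
    cur = conn.cursor()

    # In real tooling, verify Seconds_Behind_Master and relay log
    # position from this result before promoting.
    cur.execute("SHOW SLAVE STATUS")
    cur.fetchall()

    cur.execute("STOP SLAVE")
    cur.execute("RESET SLAVE ALL")            # stop treating this host as a replica
    cur.execute("SET GLOBAL read_only = OFF") # allow writes on the new master
    conn.close()


def repoint_traffic(new_master: str) -> None:
    # Placeholder: update proxies / application config to use new_master.
    pass
```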
While the underlying cause of the database failure may not have been avoidable, gaps in our alerting caused the outage to last far longer than it should have.
The 15-minute outage that occurred 2.5 hours later was our fault. Once the original database master failed, our engineers worked to reinitialize it as a replicated slave of the new master. Unfortunately, the backup used to seed the slave was taken directly from the new master in a way that placed read/write locks on its databases and tables.
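We haven't detailed the exact backup commands involved, but as a hedged illustration, a logical dump taken with mysqldump's --single-transaction option uses a consistent snapshot for InnoDB tables rather than holding table locks, which avoids blocking traffic on the server being dumped. The hostname and output path below are hypothetical.

```python
# Sketch of taking a reseed backup without locking the source server.
# Assumes InnoDB tables; hostnames and paths are hypothetical.
import subprocess


def dump_for_reseed(master_host: str, outfile: str) -> None:
    cmd = [
        "mysqldump",
        f"--host={master_host}",
        "--all-databases",
        "--single-transaction",  # consistent snapshot, no table locks on InnoDB
        "--master-data=2",       # record binlog coordinates for setting up replication
    ]
    with open(outfile, "w") as f:
        subprocess.run(cmd, stdout=f, check=True)


dump_for_reseed("new-master.example.internal", "/backups/reseed.sql")
```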
What We’re Doing
- We’ve made updates to our alerting to ensure that the right engineers are paged when monitoring is triggered on hosts and services that are critical to serving site traffic.
- We’ve also deployed updates that set a maximum allowed query execution time; queries that reach or exceed that limit are killed (see the sketch after this list). When architecting our new platform, we left this limit out in favor of providing an environment that was more flexible for working with larger datasets (importing, exporting, querying).
- We’re reviewing our processes and procedures for recovering from situations like this, and implementing tools that are more automated and remove the potential for human error.
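As an illustration of what “killed” means here (a sketch, not our production tooling): a small watchdog can poll the process list and terminate queries that have run past the limit. MySQL 5.7 and later also offer a built-in max_execution_time setting for SELECT statements. The threshold and connection details below are assumptions.

```python
# Illustrative sketch of enforcing a maximum query execution time by
# killing long-running queries. Run periodically (e.g. from cron).
import os
import mysql.connector

MAX_QUERY_SECONDS = 300  # hypothetical limit


def kill_long_queries(host: str) -> None:
    conn = mysql.connector.connect(
        host=host,
        user="admin",
        password=os.environ.get("DB_ADMIN_PASSWORD", ""),
    )
    cur = conn.cursor()
    cur.execute(
        "SELECT id, time FROM information_schema.processlist "
        "WHERE command = 'Query' AND time >= %s",
        (MAX_QUERY_SECONDS,),
    )
    for query_id, elapsed in cur.fetchall():
        cur.execute(f"KILL {query_id}")  # terminate queries over the limit
    conn.close()
```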
We’d like to apologize to our customers who were impacted by this outage.
This is the first failure of a database master server to cause downtime for customer sites since we launched our “v2” platform over 16 months ago. Several of the tools, safeguards, and features built into the platform worked as intended in this scenario. Some, unfortunately, did not: they failed in ways we didn’t anticipate or exposed gaps in our processes and alerting.
Around 3:00 am CST we began getting reports of users being unable to log in to the WordPress dashboard, or of their sites returning a 500 error. Members of our systems team were notified immediately, and we were able to resolve the issue after approximately 30 to 40 minutes.
We will provide everyone with a post mortem after we have finished investigating the issue. We are still looking into the cause.