Database timeouts

Incident Report for UserVoice

Postmortem

On Friday, 11/2/18, from 5:00 AM to 5:13 AM PDT, UserVoice experienced downtime.

Business Impact

During the incident, end users and admins would have seen 500 errors. They wouldn’t have been able to load the admin console, use the API, interact with ideas on the front end, or use the widget or Contributor Sidebar. Email would have been delayed, but no emails were lost.

Root Cause

We saw an issue similar to last Friday’s incident, in which one of the servers in UserVoice's database cluster experienced an application stall event. This caused a pause in database writes. Our engineering team manually removed the affected node to allow the cluster to resume operation.

What We Are Doing to Prevent This

Our team has been focused since last week on finding the root issue that caused one of our database clusters to stall. This work is still in progress. Once the root issue is identified, we will implement a fix and update this report with what we have learned.

In the meantime, we have put increased alerting in place so that, should the issue repeat, we will identify it immediately.

We understand that UserVoice being down interrupts you and your team and impacts your workflows. We take this downtime seriously, and it is all hands on deck to get this issue fully addressed so we can prevent it from happening again.

If you have any questions or feedback for us about the incident, please don’t hesitate to contact me at claire.talbott@uservoice.com.

Claire Talbott

Support Manager

Posted Nov 02, 2018 - 11:43 EDT

Resolved

This incident has been resolved.

Posted Nov 02, 2018 - 11:41 EDT

Monitoring

A temporary fix has been implemented, and we are monitoring the database cluster.

Posted Nov 02, 2018 - 09:03 EDT