503 errors across UserVoice

Incident Report for UserVoice

Postmortem

On Friday, February 17, 2017, our Engineering Team deployed some major structural changes intended to improve the performance and reliability of our app. Unfortunately, a few issues related to these changes impacted customers between February 17 and February 24.

API V1 Failed Authorization

Impact: Starting February 17th at 8:08PM EST, any time an API client attempted to make an authorized request to API v1, we returned a 401 Unauthorized error. This issue was resolved on February 18th at 11:02AM EST.
Root Cause: We made some changes to our internal stats-collecting system, and a particular function call inside of our API authentication code caused a silent exception. Due to the way we abstracted the stats code in non-production environments, we failed to detect this issue in testing.
What we are Doing to Prevent this: We have already identified improved tests that will allow us to catch an error like this prior to deploy. We are also working on more sophisticated status checks that will test more functionality of our APIs.
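
As an illustration of the kind of test improvement described above, here is a minimal sketch in Python. It is not UserVoice's code: StatsClient, StrictNullStats, and authenticate are hypothetical names. The idea is that the stats stand-in used in non-production environments checks every call against the real client's interface, so a bad call fails loudly in tests instead of being silently absorbed.

```python
import inspect


class StatsClient:
    """Hypothetical production stats client (assumption, not a real API)."""

    def increment(self, metric: str, value: int = 1) -> None:
        ...  # in production this would send the metric to a stats backend


class StrictNullStats:
    """Test stand-in: sends nothing, but rejects any call the real client would reject."""

    def __init__(self, real_cls=StatsClient):
        self._real_cls = real_cls

    def __getattr__(self, name):
        # Only reached for attributes not found normally, i.e. stats methods.
        real = getattr(self._real_cls, name, None)
        if not callable(real):
            raise AttributeError(f"stats client has no method {name!r}")
        signature = inspect.signature(real)

        def checked_call(*args, **kwargs):
            # bind() raises TypeError if the arguments do not fit the real method.
            signature.bind(None, *args, **kwargs)
            return None

        return checked_call


def authenticate(token, stats, valid_tokens):
    """Hypothetical auth check that records a metric on every attempt."""
    ok = token in valid_tokens
    stats.increment("api.auth.success" if ok else "api.auth.failure")
    return ok
```

With a stand-in like this, a test that exercises authenticate would raise if the authentication code called a stats method that does not exist or passed it the wrong arguments.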

Delayed Emails, Web Hooks and Ticket Counts

Impact: Starting February 20 at 8AM EST, worker queues that process incoming and outgoing email, ticket counts, and web hooks became backed up. Admins would have noticed delays in sending and receiving emails, in web hooks being pushed, and in ticket counts updating on queues. The issue was resolved at 12:14PM EST on February 20th.
Root Cause: Monday morning is the busiest time of the week for our app. Our architectural changes meant that resources had to be allocated differently to worker processes. As users came online for the week, workers couldn’t keep up with the influx of queued jobs.
What we are Doing to Prevent this: We resolved the issue, but are continuing to work on infrastructure improvements that will make sure we can perform each of these jobs reliably and on time for our customers, and to performance test our workers and apps with production-level workloads.
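
The backlog behavior described above can be sketched with simple arithmetic: a queue grows whenever jobs arrive faster than the workers can process them. The Python below is only an illustration; the function names and all numbers are hypothetical and are not taken from our systems.

```python
import math


def backlog_after(minutes, arrivals_per_min, workers, jobs_per_worker_per_min,
                  start_backlog=0):
    """Queue length after `minutes`, given constant arrival and processing rates."""
    capacity_per_min = workers * jobs_per_worker_per_min
    backlog = start_backlog
    for _ in range(minutes):
        # The queue cannot go below zero once workers have caught up.
        backlog = max(0, backlog + arrivals_per_min - capacity_per_min)
    return backlog


def workers_needed(arrivals_per_min, jobs_per_worker_per_min, headroom=1.25):
    """Workers required to absorb the arrival rate with some spare capacity."""
    return math.ceil(arrivals_per_min * headroom / jobs_per_worker_per_min)


# Hypothetical Monday-morning load: 1,200 jobs/min arriving,
# 10 workers each processing 100 jobs/min.
print(backlog_after(60, 1200, 10, 100))  # 12000 jobs queued after one hour
print(workers_needed(1200, 100))         # 15 workers for 25% headroom
```

Running the same calculation against production-level arrival rates is one way to size worker pools before a busy period rather than during it.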

Downtime

Impact: We had two related incidents of downtime. On February 22 from 10:42AM to 10:44AM EST, and again on February 24 from 10:53AM to 11:17AM EST, the UserVoice app went down. During these periods, the front end and back end, as well as the widget, would have been inaccessible.
Root Cause: Both of these incidents were caused by similar issues with our deploy process that were introduced with our recent structural improvements.
What we are Doing to Prevent this: We have implemented several improvements to our deploy process to make each step fault-tolerant and smart enough to handle failures with our external build service.
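
To illustrate what a fault-tolerant deploy step can look like, here is a minimal Python sketch. It is an assumption-laden example, not our deploy code: this report does not describe the exact failure or fix, and BuildUnavailable, fetch_with_retry, and deploy are hypothetical names. The step retries a flaky external build service with exponential backoff and, if the build still cannot be fetched, leaves the currently running release in place.

```python
import time


class BuildUnavailable(Exception):
    """Raised by the (hypothetical) build-service client when a build cannot be fetched."""


def fetch_with_retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying with exponential backoff on BuildUnavailable."""
    for attempt in range(attempts):
        try:
            return fetch()
        except BuildUnavailable:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller decide what to do
            sleep(base_delay * 2 ** attempt)


def deploy(fetch, activate, current_release, **retry_options):
    """Activate a new build if it can be fetched; otherwise keep the current release."""
    try:
        artifact = fetch_with_retry(fetch, **retry_options)
    except BuildUnavailable:
        # Never tear down the running release when the new build is missing.
        return current_release
    activate(artifact)
    return artifact
```

The key design choice in this sketch is ordering: the running release is only replaced after the new build has been fetched successfully, so a build-service failure results in a skipped deploy rather than downtime.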

We take issues that impact your team’s workflow seriously. Ensuring uptime is the top priority of our Engineering team. While the structural improvements that introduced these bugs will help us provide a better experience in our app and to our customers long-term, it’s unacceptable that our work caused problems for you and your team.

We are working on improved tests, alerts and monitoring to ensure that our future improvements provide a better experience and greater reliability for you, our customers.

Joey Nelson

Engineering Manager, Platform Operations

Posted Feb 28, 2017 - 10:25 EST

Resolved

As mentioned in a few recent incidents, we've made some significant changes to our system architecture and deploy process. Unfortunately, a hiccup in a deploy today resulted in about 24 minutes of downtime for our app. Our ops team is working hard to add more fault tolerance to every level of our deploy pipeline to prevent corner cases that could cause issues like the one we saw today. Early next week we'll be providing a public postmortem to provide insight into these issues and the steps we're taking to prevent them.

Posted Feb 24, 2017 - 14:31 EST

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Feb 24, 2017 - 11:44 EST

Update

The app is back up and running and we are closely monitoring our systems. We'll continue our investigation and post more details here shortly.

Posted Feb 24, 2017 - 11:20 EST

Identified

We are working to restore connectivity to our app.

Posted Feb 24, 2017 - 11:02 EST

Investigating

Users are currently seeing failed requests on UserVoice. We are investigating.

Posted Feb 24, 2017 - 11:00 EST