Elevated request timeouts

Incident Report for UserVoice

Postmortem

On Friday, February 17, 2017, our Engineering Team deployed major structural changes intended to improve the performance and reliability of our app. Unfortunately, several issues related to these changes impacted customers between February 17 and February 24.

API v1 Failed Authorization

Impact: Starting February 17th at 8:08PM EST, any time an API client attempted to make an authorized request to API v1, we returned a 401 Unauthorized error. This issue was resolved on February 18th at 11:02AM EST.
Root Cause: We made some changes to our internal stats-collecting system, and as a result a particular function call inside our API authentication code caused a silent exception. Because of the way we abstracted the stats code in non-production environments, we failed to detect this issue in testing.
What we are Doing to Prevent this: We have already identified improved tests that will allow us to catch an error like this before we deploy. We are also working on more sophisticated status checks that will test more of the functionality of our APIs.
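
As an illustration only, the sketch below shows one way this class of bug can slip past tests. It is not UserVoice's actual code: `StatsClient`, `NullStats`, `authenticate`, and the `incr`/`increment` mismatch are all hypothetical names chosen for the example. A permissive stub used outside production accepts any call, so a bad call goes unnoticed, while a stub built from the real client's interface rejects it.

```python
from unittest import mock


class StatsClient:
    """Hypothetical production stats client (assumed for this example)."""

    def increment(self, name: str, value: int = 1) -> None:
        pass  # would send a counter to the stats backend


class NullStats:
    """Permissive non-production stub: silently accepts any method call."""

    def __getattr__(self, _name):
        return lambda *args, **kwargs: None


def authenticate(token: str, stats) -> bool:
    """Hypothetical auth path that calls a method the real client lacks."""
    try:
        stats.incr("api.v1.auth")  # bug: the real method is `increment`
    except Exception:
        return False  # the swallowed error surfaces to the caller as a 401
    return token == "valid-token"


def strict_stub():
    """Stub that rejects any attribute the real client does not define."""
    return mock.create_autospec(StatsClient, instance=True)
```

With `NullStats()` the call succeeds and a test passes, even though the same code fails against `StatsClient()`. Passing `strict_stub()` in tests makes the test environment fail the same way production would.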

Delayed Emails, Web Hooks and Ticket Counts

Impact: Starting February 20 at 8AM EST, worker queues that process incoming and outgoing email, ticket counts, and web hooks became backed up. Admins would have noticed delays in sending and receiving emails, in web hooks being pushed, and in ticket counts updating on queues. The issue was resolved at 12:14PM EST on February 20th.
Root Cause: Monday morning is the busiest time of the week for our app. Our architectural changes meant that resources had to be allocated differently to worker processes. As users came online for the week, workers couldn’t keep up with the influx of queued jobs.
What we are Doing to Prevent this: We resolved the issue, but are continuing to work on infrastructure improvements that will ensure we can perform each of these jobs reliably and on time for our customers. We will also performance test our workers and apps with production-level workloads.
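
To make the backlog mechanism concrete, here is a minimal sketch of how a queue grows when jobs arrive faster than workers can process them. The function name and every number in the usage note are made up for illustration and are not measurements from this incident.

```python
def backlog_after(minutes: int, arrivals_per_min: float,
                  workers: int, jobs_per_worker_per_min: float,
                  start_backlog: float = 0.0) -> float:
    """Return the queue depth after `minutes`, stepping one minute at a time."""
    capacity = workers * jobs_per_worker_per_min  # jobs processed per minute
    backlog = start_backlog
    for _ in range(minutes):
        # The queue grows by the shortfall, and never drops below empty.
        backlog = max(0.0, backlog + arrivals_per_min - capacity)
    return backlog
```

For example, with a hypothetical 500 jobs arriving per minute and 8 workers handling 50 jobs per minute each, `backlog_after(60, 500, 8, 50)` gives 6000 queued jobs after an hour, while 10 such workers keep the queue empty.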

Downtime

Impact: We had two related incidents of downtime. On February 22, from 10:42AM to 10:44AM EST, and again on February 24, from 10:53AM to 11:17AM EST, the UserVoice app went down. During these periods, the front end, the back end, and the widget would have been inaccessible.
Root Cause: Both of these incidents were caused by similar issues with our deploy process that were introduced with our recent structural improvements.
What we are Doing to Prevent this: We have implemented several improvements to our deploy process to make each step fault-tolerant and smart enough to handle failures with our external build service.
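
One common way to make a deploy step tolerate a flaky external dependency is to retry it with exponential backoff and abort cleanly if it keeps failing. The sketch below is a generic example of that technique, not a description of our deploy tooling. `run_step` and its parameters are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_step(step: Callable[[], T], attempts: int = 3,
             base_delay: float = 1.0,
             sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `step`, retrying on failure with exponentially growing delays."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: let the deploy abort on this error
            sleep(base_delay * 2 ** (attempt - 1))  # waits 1s, 2s, 4s, ...
    raise ValueError("attempts must be at least 1")
```

A step such as fetching a build artifact could be wrapped as `run_step(fetch_artifact)`, where `fetch_artifact` stands in for whatever call talks to the build service. The `sleep` parameter exists so tests can substitute a fake and avoid real waiting.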

We take issues that impact your team’s workflow seriously. Ensuring uptime is the top priority of our Engineering team. While the structural improvements that introduced these bugs will help us provide a better experience for our customers in the long term, it’s unacceptable that our work caused problems for you and your team.

We are working on improved tests, alerts and monitoring to ensure that our future improvements deliver a better, more reliable experience for you, our customers.

Joey Nelson

Engineering Manager, Platform Operations

Posted Feb 28, 2017 - 10:26 EST

Resolved

This incident has been resolved.

Posted Feb 22, 2017 - 18:39 EST

Monitoring

We've made some changes to our deploy process and will continue to monitor application deploys closely.

Posted Feb 22, 2017 - 14:00 EST

Investigating

During an application deploy at 10:41AM EST, a significant percentage of requests to the UserVoice site and API timed out. We're currently investigating.

Posted Feb 22, 2017 - 10:49 EST