503 errors in the UserVoice admin console
Incident Report for UserVoice
Postmortem

On October 19th, between 10:00 and 11:30 PDT, UserVoice experienced two infrastructure outages of approximately 10 minutes each, during which UserVoice was unavailable site-wide.

Business Impact

During the outages, end users and admins would have been unable to load or interact with UserVoice sites or widgets.

Email would have been delayed, but no emails were lost.

Root Cause

UserVoice uses an in-memory data store cluster (Redis) for asynchronous job management and transient data storage. A recent change to one of the libraries that use Redis caused a very sudden increase in Redis usage. That increase caused a system failure and prevented failover to the like-sized standby services.

What We Are Doing to Prevent This

  • Increased the size of our Redis cluster and added alerting so that we can detect usage spikes more quickly
  • Fixed the library that wasn’t interacting properly with Redis
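
As an illustration only, the kind of usage-spike alerting described above could be sketched as follows. This is not UserVoice's actual alerting. The class name, thresholds, and window size are hypothetical, and the samples are assumed to be memory readings in bytes (for example, the `used_memory` field reported by the Redis `INFO memory` command).

```python
from collections import deque
from statistics import mean


class UsageSpikeDetector:
    """Hypothetical detector: flags samples that jump above the recent
    average or approach the configured capacity."""

    def __init__(self, capacity_bytes, window=5, spike_ratio=2.0,
                 capacity_fraction=0.8):
        self.capacity_bytes = capacity_bytes
        self.spike_ratio = spike_ratio              # assumed threshold
        self.capacity_fraction = capacity_fraction  # assumed threshold
        self.samples = deque(maxlen=window)

    def observe(self, used_bytes):
        """Record one sample and return alert reasons (empty if healthy)."""
        alerts = []
        if used_bytes >= self.capacity_fraction * self.capacity_bytes:
            alerts.append("near_capacity")
        # Compare against earlier samples only once the window is full,
        # so a cold start does not trigger a false alarm.
        if len(self.samples) == self.samples.maxlen:
            baseline = mean(self.samples)
            if baseline > 0 and used_bytes >= self.spike_ratio * baseline:
                alerts.append("sudden_increase")
        self.samples.append(used_bytes)
        return alerts


detector = UsageSpikeDetector(capacity_bytes=1000, window=3)
for sample in (100, 110, 105, 120, 400, 900):
    print(sample, detector.observe(sample))
```

Checking both a relative jump and an absolute ceiling means a fast ramp can be flagged before the cluster is full, while slow growth toward capacity is still caught.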
Posted Oct 23, 2018 - 14:13 EDT

Resolved
This incident has been resolved.
Posted Oct 19, 2018 - 18:38 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 19, 2018 - 15:07 EDT
Investigating
We are currently investigating 503 errors on UserVoice web portals and the admin console. This also affects the API and widgets.
Posted Oct 19, 2018 - 13:05 EDT
This incident affected: Web Portal (subdomain), Admin Console, UserVoice API, Helpdesk API, and Widgets.