503 errors

Incident Report for UserVoice

Postmortem

On November 13th, between 13:38 and 14:14 PT, UserVoice experienced a networking infrastructure issue that caused a sitewide outage.

Business Impact

During the outage, end users and admins would have been unable to load or interact with UserVoice sites, widgets, or the API.
Email would have been delayed, but no emails were lost.

Root Cause

While cleaning up unused resources in the UserVoice infrastructure, we removed an old Kubernetes cluster from production. The automated cleanup of this cluster unintentionally removed a networking firewall rule that allowed our active application cluster to communicate with our backend infrastructure. Initial debugging incorrectly focused on in-cluster symptoms, so we did not immediately determine the actual cause of the issue. Manually restoring the firewall rule fully restored service.

What We Are Doing to Prevent This

Failover firewall rules are now managed through our infrastructure-as-code system, so automated cleanup of old resources can no longer remove them.
Infrastructure cleanup tasks will be scheduled during maintenance windows going forward.
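
As an illustration only, the kind of guard described above might look like the following minimal Python sketch. It is not UserVoice's actual tooling: the `FirewallRule` data model, the `safe_to_delete` function, and all rule and cluster names are hypothetical. The idea is that a cleanup job for a retired cluster skips any rule that is owned by infrastructure-as-code or that an active cluster still depends on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirewallRule:
    # Hypothetical model of a firewall rule; not a real cloud provider API.
    name: str
    created_by: str              # cluster whose setup created the rule
    target_clusters: frozenset   # clusters whose traffic relies on the rule


def safe_to_delete(rules, retired_cluster, active_clusters, iac_managed):
    """Return the names of rules a cleanup of `retired_cluster` may remove."""
    deletable = []
    for rule in rules:
        if rule.created_by != retired_cluster:
            continue  # not part of this cleanup at all
        if rule.name in iac_managed:
            continue  # owned by infrastructure-as-code, never auto-deleted
        if rule.target_clusters & active_clusters:
            continue  # an active cluster still depends on this rule
        deletable.append(rule.name)
    return deletable


# Hypothetical inventory: the second rule was created alongside the old
# cluster but is still needed by the active application cluster.
rules = [
    FirewallRule("old-internal", "old", frozenset({"old"})),
    FirewallRule("backend-access", "old", frozenset({"old", "app"})),
    FirewallRule("failover-access", "old", frozenset({"old"})),
]
print(safe_to_delete(rules, "old", {"app"}, {"failover-access"}))
```

In this sketch only `old-internal` is returned for deletion; the rule shared with the active cluster and the rule managed by infrastructure-as-code are both left in place.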

With this outage, we did not meet our own expectations or yours for using UserVoice. We apologize for the trouble this caused you and your team. If you have any questions or concerns, please reach out and let me know.

Claire Talbott

Support Manager

claire.talbott@uservoice.com

Posted Nov 16, 2018 - 16:24 EST

Resolved

The root cause of the issue (a firewall rule that was incorrectly removed by automation) has been identified and fixed. A full postmortem will follow.

Posted Nov 13, 2018 - 18:08 EST

Monitoring

We are seeing the application back up and working again. We are monitoring things closely. Our engineers are still digging into the root cause, and we will keep you updated.

If you use our support tools, incoming emails were delayed, but none were lost, and you will see those tickets being created shortly.

Posted Nov 13, 2018 - 17:21 EST

Update

We want to keep you all updated while we work to resolve this issue. The application is down, and we are all hands on deck to get this issue resolved and everything back up and working for you and your customers. We will post our next status update by 2:30PM PST.

Posted Nov 13, 2018 - 17:11 EST

Investigating

We are investigating 503 errors being returned in the UserVoice admin console and on web portals.

Posted Nov 13, 2018 - 16:44 EST

This incident affected: Web Portal (subdomain) and Admin Console.