Performance issues & timeouts
Incident Report for UserVoice
Postmortem

On September 12, 2017, from 7:12 AM to 4:00 PM EDT, 43% of requests to UserVoice failed.

Business Impact

During the incident, admins and end users would have observed the following:

  • Slow response when loading pages or parts of the application
  • 504 timeout errors

Root Cause

One of our log servers crashed, reducing our logging capacity; as a result, our application instances attempted to log at a volume that exceeded the remaining capacity.

When an application instance was unable to send its logs, it followed the default behavior and paused until log throughput could resume.

These widespread, sporadic pauses not only slowed down in-flight requests, they also caused individual instance health checks to fail, leading to frequent restarts and a further reduction in capacity.
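
To make the failure mode concrete, here is a simplified Python sketch, not our production code, of how a synchronous log write can pause the worker handling a request; the logs.internal endpoint and the surrounding names are placeholders:

    import socket

    LOG_HOST, LOG_PORT = "logs.internal", 9020   # placeholder log aggregation node

    def send_log(line: str) -> None:
        # Synchronous write: if the log node is down or over capacity, connect()
        # and sendall() can block for the full timeout, pausing the caller.
        with socket.create_connection((LOG_HOST, LOG_PORT), timeout=30) as sock:
            sock.sendall(line.encode("utf-8") + b"\n")

    def handle_request(payload: str) -> str:
        response = payload.upper()                 # stand-in for real application work
        send_log(f"handled request: {payload!r}")  # the request waits on the log write
        return response

In a setup like this, a stalled log write holds the worker hostage, which is how a logging backlog can surface as slow requests, failed health checks, and restarts.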

What We Are Doing to Prevent This

  • Fixed the failed log node that triggered the incident.
  • Updated our instance configuration to write logs asynchronously, so instances no longer pause when the log pipeline backs up (a simplified sketch follows this list).
  • Increased our log infrastructure capacity to maintain reliability if a similar failure happens again.
  • Identified alerting and process improvements that will help us not only prevent this specific issue in the future, but also debug unexpected system behavior more quickly.
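
As an example of the asynchronous approach, the sketch below uses Python's standard logging.handlers.QueueHandler and QueueListener; it stands in for our actual configuration, and the host, port, and queue size shown are placeholders:

    import logging
    import logging.handlers
    import queue

    # Bounded in-memory buffer between the application and the log transport.
    log_queue = queue.Queue(maxsize=10_000)

    # The handler the application sees: QueueHandler enqueues with put_nowait(),
    # so a full queue drops the record instead of pausing the request.
    queue_handler = logging.handlers.QueueHandler(log_queue)

    # The handler that actually ships logs to the aggregation cluster
    # (placeholder endpoint), driven by a background listener thread.
    socket_handler = logging.handlers.SocketHandler("logs.internal", 9020)
    listener = logging.handlers.QueueListener(log_queue, socket_handler)
    listener.start()

    logger = logging.getLogger("app")
    logger.setLevel(logging.INFO)
    logger.addHandler(queue_handler)

    logger.info("handled request")  # returns immediately, even if the log node is unreachable

The trade-off is that log records can be delayed or dropped when the pipeline backs up, which is part of why the alerting improvements above matter.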

We sincerely apologize for the problems this caused for you and your team. It is something we take very seriously as we work to provide you with the best service possible.

If you have any additional questions regarding this issue, please reach out to me directly at claire.talbott@uservoice.com.

Claire Talbott

Support Manager

Posted Sep 15, 2017 - 15:06 EDT

Resolved
This incident has been resolved.
Posted Sep 12, 2017 - 19:12 EDT
Monitoring
We've identified and fixed a configuration issue with our backend services. Some services would shut down if they were unable to connect to a node on our log aggregation cluster. Last night, one of our log nodes died, and as traffic (and log volume) peaked this morning, we began to see erratic behavior across our infrastructure. Users experienced intermittent periods of timeouts and degraded performance.

We'll continue to monitor performance and work to fill gaps in our metrics and alerting to mitigate issues like these in the future. We'll follow up with a public postmortem later this week.
Posted Sep 12, 2017 - 16:09 EDT
Update
We are continuing to investigate performance issues resulting in 5XX errors across UserVoice.
Posted Sep 12, 2017 - 12:41 EDT
Investigating
Our engineering team is focused on investigating performance issues across UserVoice. We're still seeing bursts of timeouts. Users may see Cloudflare-branded 502 or 503 errors.
Posted Sep 12, 2017 - 09:20 EDT