RCA: Capptions 1 Outage, 13 November 2024

Incident Overview

On Wednesday, 13 November 2024, at approximately 14:45 CET, our monitoring systems reported one member of the database replica set as unhealthy. At the same time, our support team began receiving customer reports that Capptions servers were inaccessible.

Our technical team immediately launched an investigation. Our cloud provider flagged a potential issue with one of their routers, but initial testing did not conclusively link that network event to the database problem. Restarting the affected database member resolved the problem by 15:25 CET, though only temporarily, and led us to believe the router issue might have been a contributing factor.

However, at 17:45 CET, similar issues recurred and caused a second outage. After exhausting the immediate recovery options, we decided to carry out a full cluster restore. Given the uncertainty surrounding the cloud provider’s routing issue, we switched to our backup cluster to protect data integrity and restore service stability.

This process required creating verified backups of the primary cluster, restoring data to the backup cluster, and conducting thorough testing. By 23:45 CET, all services were fully operational.
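For readers who want the downtime figures spelled out, the timeline above can be turned into durations with a few lines of Python. This is only a sketch of the arithmetic: the window boundaries are the approximate times quoted in this report, and it assumes that services were available between 15:25 and 17:45 CET, which the report implies but does not state explicitly. The names `WINDOWS`, `outage_durations`, and `total_downtime` are illustrative.

```python
from datetime import datetime, timedelta

# Timestamps as quoted in this report (all CET, 13 November 2024).
DAY = "2024-11-13 "
FMT = "%Y-%m-%d %H:%M"


def at(hhmm: str) -> datetime:
    """Parse an HH:MM time on the day of the incident."""
    return datetime.strptime(DAY + hhmm, FMT)


# (start, end) of each outage window; boundaries are approximate.
WINDOWS = {
    "first outage": (at("14:45"), at("15:25")),
    "second outage": (at("17:45"), at("23:45")),
}


def outage_durations(windows):
    """Return the length of each outage window."""
    return {name: end - start for name, (start, end) in windows.items()}


def total_downtime(windows):
    """Sum the lengths of all outage windows."""
    return sum(outage_durations(windows).values(), timedelta())


if __name__ == "__main__":
    for name, duration in outage_durations(WINDOWS).items():
        print(f"{name}: {duration}")
    print(f"total: {total_downtime(WINDOWS)}")
```

Under these assumptions the first window lasts about 40 minutes, the second about 6 hours, and the combined downtime is roughly 6 hours 40 minutes.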

Root Cause

Subsequent investigation showed that the routing issue caused a rare disruption in the database cluster, which led to intermittent timeouts. Although data integrity and database connectivity remained intact, these timeouts rendered our services unresponsive.
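The distinction between "connected" and "responsive" can be illustrated with a small sketch. It is not our actual monitoring code: the `ProbeResult` type, the `classify` function, the one-second latency budget, and the 20% threshold are all hypothetical values chosen for illustration. The point is that a check which only verifies connectivity can pass while a check that also looks at query latency flags the member as degraded.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    connected: bool    # did the connection attempt succeed?
    latency_s: float   # how long the test query took, in seconds


def connectivity_only(results: List[ProbeResult]) -> bool:
    """A naive check: healthy as long as every probe could connect."""
    return bool(results) and all(r.connected for r in results)


def classify(results: List[ProbeResult],
             budget_s: float = 1.0,
             max_slow_ratio: float = 0.2) -> str:
    """Classify a database member from recent probes (hypothetical thresholds)."""
    if not connectivity_only(results):
        return "unreachable"
    slow = sum(1 for r in results if r.latency_s > budget_s)
    if slow / len(results) > max_slow_ratio:
        # Connections succeed, but too many queries exceed the latency budget.
        return "degraded"
    return "healthy"


if __name__ == "__main__":
    # Simulated probes: every connection succeeds, half the queries are very slow.
    intermittent = [
        ProbeResult(True, 0.05),
        ProbeResult(True, 5.0),
        ProbeResult(True, 0.04),
        ProbeResult(True, 6.2),
    ]
    print(connectivity_only(intermittent))  # True
    print(classify(intermittent))           # degraded
```

In this simulated case, two of the four probes exceed the budget, so the member is classified as degraded even though every connection succeeds.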

Resolution

We mitigated the issue by transitioning to our backup cluster after ensuring data safety. Services were restored following rigorous testing.

To prevent recurrence, we are initiating a complete rebuild of the production cluster and plan to transition back from the backup cluster in the coming days under controlled conditions.

Learnings and Next Steps

Improved Incident Response:
- Although incidents of this kind are rare, this one highlighted areas for improvement in our incident handling. No data was lost, but the extended downtime is unacceptable. We are revising our incident response processes to shorten recovery times in similar scenarios.
Faster Backup Transition:
- We will implement optimizations to enable a quicker switch between production and backup clusters, reducing downtime during critical incidents.
Enhanced Communication:
- A more robust error/status page is being developed to keep users informed during service disruptions, providing clear updates and maintaining transparency.

Acknowledgments

We extend our gratitude to our engineering and support teams for their relentless efforts during this incident. Their commitment ensured a resolution with no data loss and full service restoration. We are equally thankful to our customers for their patience and understanding during this time. Your support inspires us to improve continuously.

Thank you for trusting us with your business. We remain committed to delivering the reliable service you expect and deserve.