Incident Report: Connectivity Issues
On February 19, 2021, WellSaid experienced connectivity issues due to a scaling error triggered by our infrastructure provider. As a result, some users saw rendering errors in the Studio Interface. The root cause was resolved at approximately 5:00 pm PT on February 19, 2021. On the morning of February 22, 2021, a subset of users experienced issues introduced by the infrastructure fix deployed on Friday. For those affected, we are so sorry for the disruption this caused.
We’ve spent the last week focusing on recovery and talking with affected customers. Here, we want to provide more information about what led to the issue, the ensuing events, our response, and how we’re changing our operations moving forward.
We know that this issue, along with the recovery time, was frustrating for some of our customers. As users of the platform ourselves, we experienced that pain first-hand across our own teams. Our solutions are informed by both your feedback and that first-hand experience. We're making continuous improvements to avoid scaling and connectivity issues in the future.
The general connectivity issue had surfaced sporadically as the user-facing "Internal service error" message displayed in Studio. The number of support tickets related to connectivity raised the priority and urgency of a long-term resolution. An engineering-led root cause analysis found that the underlying issue was exceeding the maximum number of active connections allowed on our Postgres instance.
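To illustrate the kind of limit described above, here is a minimal sketch of a connection budget. The function names and all numbers are hypothetical and are not WellSaid's actual configuration. The point is only that the per-instance pool size multiplied by the number of application instances has to stay under the server's connection limit.

```python
def per_instance_pool_size(max_connections: int, reserved: int, app_instances: int) -> int:
    """Largest per-instance pool size that keeps the total under the server limit.

    max_connections: server-side connection limit (hypothetical value in the example)
    reserved: connections held back for administration and maintenance
    app_instances: number of application processes that each hold a pool
    """
    if app_instances < 1:
        raise ValueError("app_instances must be at least 1")
    usable = max_connections - reserved
    if usable < app_instances:
        raise ValueError("not enough connections for one per instance")
    return usable // app_instances


def exceeds_limit(pool_size: int, app_instances: int, max_connections: int, reserved: int = 0) -> bool:
    """True if every instance filling its pool would go past the usable limit."""
    return pool_size * app_instances > max_connections - reserved


# Hypothetical numbers: a limit of 100, 3 reserved, 8 application instances.
print(per_instance_pool_size(100, 3, 8))   # 12
print(exceeds_limit(20, 8, 100, 3))        # True: 160 requested, 97 usable
```

With these illustrative numbers, a pool of 20 per instance fits comfortably at low instance counts and starts failing only once the service scales out, which is consistent with an error that appears sporadically.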
We took the following steps toward resolution:
- Optimized database client configurations
- Profiled database queries and checked for over-fetching
- Scaled up the machine size (increased CPU, RAM, and max connections)
- Improved horizontal scaling of the database via read replication
We optimized the client configurations before moving on to the scaling items. These changes were released by EOD Friday, February 19, 2021.
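One common form of client-side configuration is a hard cap on how many connections each application process may hold. The sketch below is an assumption about what such a cap can look like in general, not a description of the changes we shipped. `BoundedPool` and the `connect` factory are hypothetical names, and a real deployment would normally rely on the pooling built into its database driver or on an external pooler.

```python
import queue
from contextlib import contextmanager


class BoundedPool:
    """Hands out at most `max_size` connections; callers wait or time out instead of opening more."""

    def __init__(self, connect, max_size: int, timeout: float = 5.0):
        self._connect = connect          # hypothetical factory that returns a new connection
        self._timeout = timeout
        self._slots = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._slots.put(None)        # None marks a slot whose connection is not opened yet

    @contextmanager
    def connection(self):
        try:
            conn = self._slots.get(timeout=self._timeout)
        except queue.Empty:
            raise TimeoutError("no free connection in pool") from None
        if conn is None:
            try:
                conn = self._connect()   # open lazily, only when a slot is first used
            except Exception:
                self._slots.put(None)    # give the slot back if connecting failed
                raise
        try:
            yield conn
        finally:
            self._slots.put(conn)        # return the connection for reuse
```

A caller would wrap each unit of work in `with pool.connection() as conn:`. When the pool is exhausted, the request waits briefly and then fails fast on the client, rather than pushing the database past its connection limit.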
Our Customers team relayed a spike in support tickets on Feb 22, 2021. This prompted the following response:
- Merging and releasing a manual revert of the DB configurations
- Upgrading our managed Postgres instance
Our Engineering team then monitored to ensure the related errors and resource limitations were no longer happening. We received confirmation from several customers that they were no longer experiencing connectivity issues.
Our technical response is only one aspect of managing an incident like this one. We will also continue to provide transparency in our communication with you in the first moments of a technical disruption.
We understand that, during a technical disruption, our users are trying to determine what to do next. You need us to communicate information that helps you make critical decisions, and we are committed to keeping you up to date with the most relevant and accurate information about our system’s availability. For any issue spanning multiple hours, we also commit to providing status posts with a reliable and specific timeframe for the next update.
Every day you wake up and place your trust in WellSaid to run the tools that help voice your stories, and we’re so thankful for that. Your trust means so much to us, and we’re disappointed that last week technical issues got in the way of empowering you with the best voice AI technology. We are sorry. We will get better and grow better for you.