At a Glance
On Feb. 7 and Feb. 8, 2018, the Gong recording system experienced recording delays and missed calls due to global usage limits imposed by our cloud hosting provider.
Once we identified the root cause, we resolved the issues with our cloud provider's support team by increasing those limits.
We have also added monitoring and alerting to prevent such incidents from recurring.
Incident Details
On Feb. 7, 2018, we received alerts indicating that the Gong recording system was late in joining several calls.
- While investigating the issue, we discovered that some API calls we were making to our cloud provider were being throttled, i.e., served at a reduced rate because of increased volume on our side.
- We created an emergency software patch that accumulated multiple such calls into a single batched call. The change did not appear to resolve the issue.
- Because we had upgraded our recording system to newer infrastructure at our cloud provider a few days before the incident, we reverted the system to the older infrastructure.
- Subsequently, at around 2pm PST, the issue appeared to be resolved.
- We opened an emergency ticket with our cloud provider to investigate the issue further.
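The emergency patch described above accumulated multiple API calls into one batched call. The sketch below is a minimal illustration of that pattern; the `CallBatcher` class, its parameters, and the `send_batch` callback are hypothetical, not Gong's actual implementation:

```python
import time
from typing import Callable, List, Optional


class CallBatcher:
    """Accumulates individual requests and flushes them as a single
    batched call, so N logical requests count as one API call against
    the provider's rate limits. Illustrative sketch only."""

    def __init__(self, send_batch: Callable[[List[dict]], None],
                 max_batch_size: int = 25, max_wait_seconds: float = 1.0):
        self._send_batch = send_batch          # provider call accepting many items at once
        self._max_batch_size = max_batch_size
        self._max_wait = max_wait_seconds
        self._pending: List[dict] = []
        self._first_queued_at: Optional[float] = None

    def submit(self, request: dict) -> None:
        """Queue a request; flush automatically when the batch is full."""
        if self._first_queued_at is None:
            self._first_queued_at = time.monotonic()
        self._pending.append(request)
        if len(self._pending) >= self._max_batch_size:
            self.flush()

    def flush(self) -> None:
        """Send all pending requests in one API call."""
        if self._pending:
            self._send_batch(self._pending)    # one call instead of N
            self._pending = []
            self._first_queued_at = None

    def maybe_flush_on_timeout(self) -> None:
        """Flush early so queued requests never wait longer than max_wait."""
        if (self._first_queued_at is not None
                and time.monotonic() - self._first_queued_at >= self._max_wait):
            self.flush()
```

For example, submitting seven requests with `max_batch_size=3` and then flushing results in three API calls carrying 3, 3, and 1 requests, rather than seven separate calls.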
On Feb. 8, 2018, we received alerts indicating that the Gong recording system was again late joining several calls and was also missing some calls entirely. It became clear that the changes made on Feb. 7 had not addressed the root cause.
- Since the cloud provider's initial response did not point to a root cause, we connected live with its support staff and researched the problem together.
- After a few unsuccessful attempts at a resolution, our cloud provider's support staff determined that our recording system was blocked from launching new machines due to global service limits, i.e., our overall usage had grown beyond limits set to protect against inadvertent excessive use.
- Our cloud provider's support staff requested an urgent increase in global service limits internally, and the issue was resolved shortly thereafter.
Throughout the incident, we kept our customers' administrators informed via our status page.
Analysis and Response
After analyzing the incident, we realized that we had not been aware of our current service limits at our cloud provider and had not increased them ahead of time to accommodate growth.
We've subsequently taken the following actions:
- We've subscribed to a specialized tool that reports and monitors our cloud resource usage and compares it against the current limits. This tool is now accessible to our DevOps team and on-call staff.
- We've set up triggers that alert us when actual usage approaches a service limit. These alerts notify our on-call staff at high priority and guide them through requesting an increase to those limits.
- We've shared the details of the incident with our engineering, DevOps, and support teams to raise awareness of such issues going forward.
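The alerting action above amounts to a threshold check of current usage against the provider-imposed limit. A minimal sketch follows; the function name, alert levels, and 80% warning threshold are illustrative assumptions, not the actual tooling:

```python
def quota_alert_level(usage: int, limit: int, warn_ratio: float = 0.8) -> str:
    """Classify headroom between current resource usage and the
    provider's service limit. Thresholds here are illustrative:
    warn at 80% of the limit, go critical at or above it."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = usage / limit
    if ratio >= 1.0:
        return "CRITICAL"   # at or over the limit: new launches will be blocked
    if ratio >= warn_ratio:
        return "WARNING"    # page on-call staff to request a limit increase
    return "OK"
```

Checking the ratio against a warning threshold well below 100% is the key design choice: it gives the on-call staff time to request a limit increase before launches are actually blocked, which is exactly the failure mode seen in this incident.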
We take our customers' trust in us seriously, and would like to apologize for any inconvenience caused by this incident.