Ingestion lag due to storage scaling
Incident Report for IOpipe

What happened

On Saturday, December 15th, at 5:24am Eastern time, IOpipe received alarms that ingestion was slowing or halting. The incident was escalated; the responding engineer saw no signs of a health failure and resolved the incident. However, an hour later the alarms fired again. This time the responding engineer scaled the backend storage to see whether that would alleviate the pressure, and it did.

Actions taken to avoid this in the future

Scaling the storage alleviated the issue, but it should not have required manual intervention: we did not have alarms in place to catch the problem early and point directly to its solution.

We now have alarms set for low storage on every node in our cluster. We also re-enabled a high-CPU alarm that had been disabled for a version migration and never restored afterward.
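The kind of per-node low-storage check behind such an alarm can be sketched as follows. This is a minimal illustration only; the threshold value and the node-local polling approach are assumptions, as the report does not describe IOpipe's actual monitoring stack:

```python
import shutil

# Hypothetical threshold -- not taken from the incident report.
LOW_STORAGE_THRESHOLD = 0.15  # alarm when less than 15% of the disk is free


def storage_alarm(path="/", threshold=LOW_STORAGE_THRESHOLD):
    """Return True when the free-space fraction on `path` drops below `threshold`."""
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    return free_fraction < threshold


if __name__ == "__main__":
    # In practice a check like this would run on each cluster node and
    # feed an alerting system rather than print to stdout.
    if storage_alarm():
        print("ALARM: low storage")
    else:
        print("OK: storage within threshold")
```

A check like this would typically be wired into an external alerting service so that the alarm fires before ingestion stalls, rather than after.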

We will also be auditing our alerts and alarms, and scheduling a recurring review to ensure they stay current.

Dec 18, 2018 - 16:37 UTC

On Saturday, December 15th, 2018 at 5am Eastern, there were issues with our ingestion pipeline, and users may have seen alerts fire because, according to the ingested data, there were no invocations. We investigated and found that we needed to scale our storage; doing so resolved the issue.
Dec 18, 2018 - 16:29 UTC
This incident affected: Data Collection.