On Saturday, December 15th, at 5:24am Eastern time, IOpipe received alarms that ingestion was slowing or halting. The incident was escalated, but the responding engineer found no signs of failing health and resolved the incident. An hour later, however, the alarms fired again. This time the responding engineer scaled the backend storage to see whether that would relieve the pressure, and it did.
Scaling the storage alleviated the issue, but we did not have alarms in place to catch the underlying problem and point to its solution quickly, and avoiding an incident should not have required manual intervention.
We now have alarms set for low storage on every node in our cluster. We have also re-enabled a high-CPU alarm that was disabled for a version migration and had never been turned back on.
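As a rough illustration of the low-storage check now in place, here is a minimal sketch in Python. The function names and the 20% free-space threshold are illustrative assumptions for this post, not our actual alarm configuration or monitoring stack:

```python
import shutil

# Hypothetical threshold: alarm when free space drops below 20% (assumption).
LOW_STORAGE_THRESHOLD = 0.20

def storage_alarm(total_bytes: int, free_bytes: int,
                  threshold: float = LOW_STORAGE_THRESHOLD) -> bool:
    """Return True when the free-space fraction falls below the threshold."""
    return (free_bytes / total_bytes) < threshold

def check_node(path: str = "/") -> bool:
    """Check one node's local disk and report whether it should alarm."""
    usage = shutil.disk_usage(path)
    return storage_alarm(usage.total, usage.free)
```

In practice a check like this runs per node on a schedule, with the alarm wired into the paging system so a storage scale-up can happen before ingestion is affected.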
We will also audit our alerts and alarms, and schedule recurring reviews to ensure they stay current.