We're investigating a performance issue with a database
Incident Report for IOpipe
Postmortem

We narrowed the cause down to a large-scale report that runs on a weekly schedule. It fanned out too quickly and caused a slowdown in our database cluster, which affected API requests and data ingestion and delayed alerts (including triggering some false alerts).

Following those earlier events, we added instances to our database cluster. After further analysis, we have now adjusted the weekly report scheduler to match our current scale.

This adjustment reduces the impact of reports on our backend, and we expect it to prevent a recurrence this coming weekend.
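
For readers curious what that scheduler change looks like in practice, here is a minimal sketch of a concurrency-limited fan-out in TypeScript. It is illustrative only: the names (listAccounts, runReportForAccount) and the limit of 4 are hypothetical stand-ins, not our actual scheduler code.

// Hypothetical sketch: throttle the weekly report fan-out so the database
// cluster sees a bounded number of concurrent report jobs instead of all
// of them at once.

type Account = { id: string };

// Placeholder stubs standing in for the real data-access and report code.
async function listAccounts(): Promise<Account[]> {
  return []; // would query the accounts table
}

async function runReportForAccount(account: Account): Promise<void> {
  // would run the heavy weekly report queries for one account
}

// Run jobs with at most `limit` in flight at a time.
async function runWithConcurrencyLimit<T>(
  items: T[],
  limit: number,
  job: (item: T) => Promise<void>
): Promise<void> {
  let next = 0;
  const workers = Array.from({ length: Math.min(limit, items.length) }, async () => {
    while (next < items.length) {
      const item = items[next++]; // no await between check and increment, so no race
      await job(item);
    }
  });
  await Promise.all(workers);
}

async function weeklyReportRun(): Promise<void> {
  const accounts = await listAccounts();
  // cap at e.g. 4 concurrent reports; the right number depends on cluster headroom
  await runWithConcurrencyLimit(accounts, 4, runReportForAccount);
}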

We're also planning maintenance this week to expand our thread pool so it can handle more queries during periods of high traffic.
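
As a rough illustration of what bounding query concurrency can look like in application code, here is a sketch of a sized database connection pool (a pool of connections rather than the literal thread pool mentioned above), assuming node-postgres (pg). The library choice and every number are hypothetical, not our production configuration.

// Hypothetical sketch: size a connection pool so bursts of queries queue
// instead of overwhelming the cluster. Assumes node-postgres (`pg`).
import { Pool } from "pg";

const pool = new Pool({
  max: 50,                        // upper bound on concurrent connections per instance
  idleTimeoutMillis: 30_000,      // release idle connections back to the cluster
  connectionTimeoutMillis: 5_000  // fail fast rather than piling up waiters
});

export async function query(text: string, params: unknown[] = []) {
  // pool.query checks out a client, runs the query, then returns the client
  const result = await pool.query(text, params);
  return result.rows;
}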

Mar 04, 2019 - 20:34 UTC

Resolved
All systems are operational again. There was no data loss, but some alerts may have been delayed.
Mar 04, 2019 - 05:31 UTC
Monitoring
The database cluster is now recovering, and data ingestion is catching up. We're continuing to monitor it actively and will be looking into the root cause and possible solutions.
Mar 04, 2019 - 05:27 UTC
Investigating
We're seeing increased errors from a database, which is affecting the dashboard and delaying ingestion of new data as well as alerts. No data loss is expected. We're actively investigating the issue.
Mar 04, 2019 - 04:54 UTC
This incident affected: Alerts, Data Collection, and Dashboard.