We traced the cause to a large-scale report that runs on a weekly schedule. It fanned out too quickly and caused a slowdown in our database cluster, which affected API requests and data ingestion and delayed alerts (including some false alerts).
Following those earlier events, we added instances to our database cluster. After further analysis, we have also adjusted the weekly report scheduler to suit our current scale.
This adjustment will reduce the load that reports place on our backend, and we expect it to prevent a recurrence this coming weekend.
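For readers curious about what limiting fan-out can look like, here is a minimal sketch. It is not our actual scheduler: it only illustrates one common approach, capping how many reports run at the same time so a large batch cannot hit the database all at once. The names `run_reports_bounded`, `ConcurrencyTracker`, and `max_concurrent` are hypothetical, and the sleep stands in for a real report query.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_reports_bounded(report_ids, run_report, max_concurrent):
    """Run every report, but never more than max_concurrent at once.

    Hypothetical helper: the worker pool size is the cap on fan-out.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # pool.map returns results in the same order as report_ids.
        return list(pool.map(run_report, report_ids))


class ConcurrencyTracker:
    """Hypothetical stand-in for a report job; records peak parallelism."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def __call__(self, report_id):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(0.01)  # simulate time spent querying the database
        with self._lock:
            self._active -= 1
        return f"report-{report_id}-done"


if __name__ == "__main__":
    tracker = ConcurrencyTracker()
    results = run_reports_bounded(range(20), tracker, max_concurrent=4)
    print(len(results), "reports finished; peak concurrency:", tracker.peak)
```

With a cap like this, the total run takes longer, but the number of simultaneous report queries stays at or below the chosen limit regardless of how many reports are scheduled.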
We're also planning maintenance this week to expand our thread pool so that it can support more queries during periods of high traffic.