Timeline of events:
Around 8am EST, Customer began seeing drops in invocation count and high duration on their production Lambdas.
At 8:04am EST, IOpipe's SRE team received an alert about increased Lambda duration in us-east-1.
By 8:15am EST, Customer invocation counts dropped to near 50% of typical volume, while average duration doubled.
At 8:20am EST, IOpipe's SRE team began a system check for Lambda issues because duration was still increasing.
By 8:30am EST, Customer invocation counts had dropped to 25% of typical volume, while average duration doubled.
At 8:31am EST, IOpipe received an alarm for API Gateway S3 signers reporting timeouts in the us-west-1 region. The duration of the timeouts to the services was 17 seconds.
At 8:36am EST, IOpipe received an alarm for API Gateway S3 signers reporting timeouts in the us-east-2 region. The duration of the timeouts to the services was 11 seconds.
At 8:39am EST, IOpipe received an alarm for API Gateway S3 signers reporting timeouts in the us-west-2 region. The duration of the timeouts to the services was 13 seconds.
At 8:40am EST, Customer invocation counts started climbing upwards, and duration dropped.
At 8:56am EST, IOpipe received an alarm for API Gateway S3 signers reporting timeouts in the us-east-1 region. The duration of the timeouts to the services was 7 seconds.
By 9:00am EST, Customer invocation counts returned to normal and duration subsided.
At 9:06am EST, Customer contacted IOpipe via Slack to report an issue.
At 9:15am EST, IOpipe's SRE team opened a system-down ticket with AWS support.
At 9:27am EST, AWS informed IOpipe of Lambda-related issues.
At 9:45am EST, AWS informed IOpipe that they also had CloudFront Edge issues in multiple US regions. This was confirmed via chatter on Hacker News and other sites.
By 10:29am EST, AWS informed IOpipe of the following:
There is an ongoing issue with CF. We had an increase in latency accessing content through one of our CloudFront edge locations in Virginia. This issue with this edge location has been resolved and the service is operating normally. What happens is, if the API us EDGE, it might pass through the impacted region’s CF’s point of presence/location. So, even if API is in different region, it might route to CF present in impacted region and hence causing timeouts. The timeouts you saw is because of the issue and issues looks resolved now.
AWS experienced infrastructure issues affecting CloudFront distributions in us-east-1, which propagated to any deployments using edge-optimized endpoints. IOpipe uses edge-optimized endpoints for ingesting logs and profiles from users and was consequently affected.
On IOpipe's side, older versions of the NodeJS logger plugin had unbounded network timeouts. This allowed functions to wait as long as needed for a response from AWS.
Once AWS resolved the internal CloudFront issue, operations returned to normal.
Corrective and Preventative measures:
To prevent the potential issue altogether, it is recommended to disable the IOpipe NodeJS logger plugin for production functions.
If using the logger in production, it is recommended to use the latest version, which has a built-in network timeout, and to set aggressive timeout values so that the impact on the end user is minimal if AWS has CloudFront operational issues.
IOpipe has identified this as a weakness in the first iteration of its NodeJS logger plugin support for end customers in production, and has roadmap items to fix it in early 2019.
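
The bounded network timeout recommended above can be sketched as follows. This is a minimal illustration, not the IOpipe plugin's actual implementation: the helper names `withTimeout` and `shipLogsBestEffort`, and the `sendLogs` callback standing in for the log upload, are hypothetical. It assumes log shipping is best-effort, so a slow upload is dropped rather than allowed to extend the function's duration.

```javascript
// Hypothetical helper: race a promise against a timer so the caller never
// waits longer than `ms` milliseconds for the network call to settle.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  // Clear the timer either way so it does not linger after the race settles.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical wrapper: ship logs on a best-effort basis. A slow or failed
// upload is reported as a status string and never thrown into the handler.
async function shipLogsBestEffort(sendLogs, ms) {
  try {
    await withTimeout(sendLogs(), ms);
    return 'sent';
  } catch (err) {
    return 'dropped';
  }
}
```

Note that this only bounds how long the caller waits; it does not cancel the underlying request. A production client would likely also set a socket-level timeout or abort the request so that no connection is left pending.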