How I Fixed The Bottleneck That Killed 700+ Lambdas
Transcript
We ran a migration on our DynamoDB table that served over 700 Lambdas. One thing we didn't account for was delay in the change stream processor, which ended up taking down our game for thousands of players. Here's the pattern I implemented to fix it.
For context, we had an extreme single-table design. Over 700 Lambda functions, all pointing at a single DynamoDB table. DynamoDB Streams captured every change. A Lambda processor subscribed to that stream, pushing everything to SNS. Downstream systems subscribed to that SNS topic. When that processor fell behind, every downstream system felt it. This architecture worked fine under normal load. But it had no tolerance for spikes.
The trigger: an admin API with an AWS-enforced 30-second timeout. To beat the clock, it hammered DynamoDB with hundreds of thousands of writes. No throttling. No backoff. Just raw throughput. The stream processor couldn't keep up. Events backed up. Every downstream system was waiting on state that hadn't propagated yet. The blast radius was the entire platform. We needed a pattern that decoupled the request from the execution.
So we built one we called ActionRunner. The ActionRunner pattern has three parts. First, data sources. These are the things that trigger work. A scheduler for cron jobs. An admin API for manual operations. ActionRunner can also trigger itself. One action can spawn others, chaining work through the same queue. Second, an SQS FIFO queue. This is the buffer. Every action gets serialized as a message and dropped into the queue. Third, the processing Lambda. This is the ActionRunner itself. It pulls messages from the queue and executes the work. The key part of the pattern is that the data sources no longer do the work. They drop a message in the queue and return immediately. ActionRunner picks it up whenever it's ready.
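To make the enqueue side concrete, here's a minimal Python sketch of what a data source does: serialize the action and drop it on the FIFO queue. The function name `build_action_message` and the action shape are my own illustrations, not from the original implementation; the boto3 call is shown for context only.

```python
import json
import uuid

def build_action_message(action_type: str, payload: dict, group_id: str) -> dict:
    """Serialize an action as SQS FIFO send_message parameters.
    FIFO queues require a MessageGroupId and a deduplication ID."""
    body = json.dumps({"action": action_type, "payload": payload})
    return {
        "MessageBody": body,
        "MessageGroupId": group_id,                    # strict ordering within a group
        "MessageDeduplicationId": str(uuid.uuid4()),   # use a content hash instead for true dedup
    }

# Sending is then a single call and the data source returns immediately
# (requires AWS credentials and a real queue URL):
#
# import boto3
# sqs = boto3.client("sqs")
# sqs.send_message(
#     QueueUrl=QUEUE_URL,
#     **build_action_message("migrate-user", {"userId": "42"}, group_id="migrations"),
# )
```

Note the design choice: generating a random deduplication ID means every enqueue is treated as distinct; hashing the message body instead would let SQS drop accidental double-submissions within its five-minute deduplication window.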
This pattern solves the incident in multiple ways. First, no timeout pressure. ActionRunner isn't constrained by API Gateway's 30-second limit. It can take as long as it needs by breaking the work into smaller batches, spacing out the writes, and generally being a good citizen to the database. Second, controlled throughput. If your database is under stress, you can just throttle ActionRunner by dialing down its concurrency. Lambda has built-in concurrency controls, so you set a reasonable limit and no single action can accidentally consume all your capacity. Third, queue-based load leveling. Instead of hitting your systems with everything at once, the queue absorbs the spikes so work gets processed at whatever rate is sustainable. Your downstream systems never get overwhelmed. Fourth, ordered parallelism with FIFO. Each message group processes in order, but different groups can process in parallel. Depending on your workload, you can balance by operation, by user, or whatever makes sense. Other operations keep flowing while only the overloaded one gets held back.
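The ordered-parallelism point comes down to how you pick the message group ID. Here's one illustrative strategy (grouping per user is my example, not necessarily what we shipped): actions for the same user process in order, while different users' actions flow in parallel, so one overloaded group never blocks the rest.

```python
def message_group_for(action: dict) -> str:
    """Choose the SQS FIFO message group for an action.

    Same group  -> strict ordering (messages processed one at a time, in order).
    Other groups -> processed in parallel, unaffected by this group's backlog.
    """
    # Per-user ordering: a burst of actions for one user queues up behind
    # itself, while every other user's actions keep flowing.
    return f"user#{action['userId']}"
```

Grouping by operation type instead (e.g. `migration`, `payout`) is equally valid; the trade-off is just what you want serialized versus what you want concurrent.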
A few things to keep in mind. Design your actions to be idempotent. Running the same action twice should produce the same result. SQS can deliver messages more than once. Your code needs to handle that. Use the message group ID strategically. Same group gives you strict ordering, different groups give you parallel processing, so pick based on your use case. You'll also want a dead letter queue so that after a set number of retries, failed messages get moved there instead of blocking the queue. Hook up a CloudWatch alarm on it too. If you're not monitoring failures, you don't have a production system. Log everything. ActionRunner should emit structured logs for every action it processes. When something goes wrong, you'll thank yourself later.
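The idempotency advice above can be sketched as follows. This is a deliberately simplified in-memory version: the `ActionProcessor` class and `actionId` field are my own illustrative names, and in production the seen-set would live in a durable store (for example, a DynamoDB conditional put keyed on the action ID), not in process memory.

```python
import json

class ActionProcessor:
    """Idempotency sketch: remember processed action IDs so a redelivered
    SQS message becomes a no-op instead of a second side effect."""

    def __init__(self):
        self._seen = set()   # durable storage in production, not memory
        self.executed = []   # stand-in for the real side effects

    def process(self, message_body: str) -> bool:
        action = json.loads(message_body)
        action_id = action["actionId"]
        if action_id in self._seen:
            # SQS delivered this message before; running it again must
            # produce the same end state, so we simply skip it.
            return False
        self._seen.add(action_id)
        self.executed.append(action)
        return True
```

Running the same message through twice executes the work exactly once, which is the property you need once SQS's at-least-once delivery and your DLQ retries are in play.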
ActionRunner decouples request from execution. You submit a task, it gets queued, and a background processor handles it with proper throttling. No more migrations that cascade into platform-wide incidents. The weekly outages stopped. The implementation is straightforward: an SQS FIFO queue, a Lambda processor with concurrency limits, and idempotent actions. ActionRunner solved the biggest problem. We also started batch-publishing to SNS, parallelized the stream processor, and began the process of splitting our single table into a table per bounded context.
Serialization matters for this pattern. Watch the benchmarks next.
If you liked this, don't forget to subscribe and head over to codewithstu.tv to find me on socials.