
Untangling the Web: Optimizing a High-Volume Data Processing Pipeline

2025-11-11 • 14 min read
AWS · Kinesis · Lambda · Scalability · Performance · Databases · Go

In distributed systems development, complexities often manifest in production in unexpected ways. Recently, we faced a significant challenge in our platform: a system designed to process counts from physical devices was showing unacceptable latency, multiple timeouts, and—most critically—a recurring series of database deadlocks.
This post details the problem, our initial approaches, and the final solution that allowed us to scale efficiently.

The Original Scenario: A Critical Data Flow

Our system is responsible for reading counts from physical devices. This data flows through several microservices before being sent to an AWS Kinesis stream.
Subsequently, an AWS Lambda consumes these Kinesis events, makes calls to various internal and external APIs, builds structured rows, and finally persists them in a relational SQL database.

Diagram 1: Original Data Flow

The Production Alarm Symptoms

The problems we began observing were clear and harmful:

  1. Excessive Latency: Data taking hours to be processed, compromising freshness.
  2. Constant Timeouts: The processing Lambda, with a 30-second execution limit, often exceeded this threshold during API calls, row building, or database operations.
  3. Recurring Deadlocks: The most concerning symptom, indicating severe database contention that froze processing.

Unraveling the Main Culprit: The Deadlocks

We decided to prioritize resolving deadlocks, since fixing them would directly impact overall stability and latency.

Root Cause: Concurrent Database Operations

Our original strategy for saving, deleting, or updating rows in the database relied on transactions. The core logic was essentially:

Pre-existing Approach:

  1. Open a transaction.
  2. Select the rows involved.
  3. If a row exists, update it; otherwise, insert it.
  4. Use similar logic for deletions.

This operation was sometimes done row-by-row, other times in batches (bulk). Both approaches generated a high number of round trips to the database, consuming time and resources.
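
For illustration, here is a condensed sketch of that pre-existing pattern using Go's database/sql. The device_counts table, its columns, and the `?` placeholder style are hypothetical stand-ins, not our actual schema or driver:

```go
package main

import "database/sql"

// Pre-existing pattern: open a transaction, check whether the row exists,
// then UPDATE or INSERT accordingly. One round trip per step.
func upsertRow(db *sql.DB, deviceID string, reading int64) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	var count int
	if err := tx.QueryRow(
		`SELECT COUNT(*) FROM device_counts WHERE device_id = ?`, deviceID,
	).Scan(&count); err != nil {
		return err
	}

	if count > 0 {
		_, err = tx.Exec(`UPDATE device_counts SET reading = ? WHERE device_id = ?`, reading, deviceID)
	} else {
		_, err = tx.Exec(`INSERT INTO device_counts (device_id, reading) VALUES (?, ?)`, deviceID, reading)
	}
	if err != nil {
		return err
	}
	return tx.Commit()
}
```

Multiply this by every row in every batch and the round trips add up quickly, while each open transaction holds locks for the whole sequence.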

First Attempt: Simplification and Batching

Our first hypothesis was that the complexity of the “select, then insert/update” logic was contributing to the deadlocks. We proposed a simpler and more efficient operation:

Strategy:

  1. Delete all related rows unconditionally (whether or not they already exist).
  2. Insert all new rows in a single batch.
  3. Do everything within one transaction.
  4. If the batch was too large, split it into smaller chunks.

Diagram 2: First Database Optimization Attempt
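
Roughly, the persistence step became something like the following sketch (same hypothetical schema as above; the chunk size is arbitrary):

```go
package main

import (
	"database/sql"
	"strings"
	"time"
)

// Row is an illustrative stand-in for the rows built by the processor.
type Row struct {
	DeviceID string
	TS       time.Time
	Reading  int64
}

const chunkSize = 500 // arbitrary; tuned empirically in practice

// replaceRows deletes everything for the device up front, then re-inserts the
// fresh rows in bounded chunks, all inside a single transaction.
func replaceRows(db *sql.DB, deviceID string, rows []Row) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// Delete unconditionally, whether or not matching rows exist.
	if _, err := tx.Exec(`DELETE FROM device_counts WHERE device_id = ?`, deviceID); err != nil {
		return err
	}

	for start := 0; start < len(rows); start += chunkSize {
		end := start + chunkSize
		if end > len(rows) {
			end = len(rows)
		}
		// One multi-row INSERT per chunk instead of one statement per row.
		placeholders := make([]string, 0, end-start)
		args := make([]any, 0, (end-start)*3)
		for _, r := range rows[start:end] {
			placeholders = append(placeholders, "(?, ?, ?)")
			args = append(args, r.DeviceID, r.TS, r.Reading)
		}
		query := `INSERT INTO device_counts (device_id, ts, reading) VALUES ` + strings.Join(placeholders, ", ")
		if _, err := tx.Exec(query, args...); err != nil {
			return err
		}
	}
	return tx.Commit()
}
```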

Result: Although the logic was cleaner, deadlocks and timeouts persisted. Transactions left open after a timeout kept their locks active, creating a domino effect of waiting processes and further timeouts.

Second Attempt: Removing Explicit Transactions and Using MERGE

Since the issue persisted, we concluded that explicit transaction handling and multiple round trips were still the cause.

Strategy:

  1. Remove explicit transactions.
  2. Use database features such as ON CONFLICT (if available) or MERGE (as in our case) to perform “insert or update” in a single statement, reducing the need for multiple SELECT before INSERT/UPDATE.
  3. Keep processing in small batches (chunks).
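
A sketch of what that single statement looks like, assuming a SQL Server-style MERGE (on PostgreSQL, INSERT ... ON CONFLICT DO UPDATE plays the same role); table, column, and parameter names are again illustrative:

```go
package main

import (
	"database/sql"
	"time"
)

// Single-statement upsert: the database engine handles atomicity for the one
// statement, so the application no longer opens an explicit transaction.
// SQL Server-style MERGE with illustrative names; named parameters require a
// driver that supports them (e.g. go-mssqldb).
const mergeStmt = `
MERGE INTO device_counts AS target
USING (VALUES (@deviceID, @ts, @reading)) AS source (device_id, ts, reading)
    ON target.device_id = source.device_id AND target.ts = source.ts
WHEN MATCHED THEN
    UPDATE SET reading = source.reading
WHEN NOT MATCHED THEN
    INSERT (device_id, ts, reading) VALUES (source.device_id, source.ts, source.reading);`

func upsertReading(db *sql.DB, deviceID string, ts time.Time, reading int64) error {
	_, err := db.Exec(mergeStmt,
		sql.Named("deviceID", deviceID),
		sql.Named("ts", ts),
		sql.Named("reading", reading),
	)
	return err
}
```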

Result: There was a significant reduction in the frequency of deadlocks and timeouts, suggesting we were on the right track. However, they weren’t completely eliminated. Each MERGE query still implied an internal transaction at the database engine level, and concurrent operations on the same rows could still cause contention—though less frequently.

The Final Solution: Decoupling and Serialization with Kinesis Partition Keys

At this point, a brainstorming session led us to a radically different solution—one inspired by distributed system design patterns and leveraging AWS Kinesis capabilities.

The core idea was to guarantee that operations on the same entity (a database row with a unique key) were never processed concurrently. To achieve this, we redesigned the data flow:

  1. Separation of Responsibilities:
  • The original processor Lambda would only consume from the input Kinesis stream, call APIs, and generate the final data rows.
  • These generated rows would not be written directly to the database but instead sent to a second Kinesis stream.
  2. The “Persister” and Kinesis Partition Keys:
  • A new Lambda (Persister) would be solely responsible for consuming from the second Kinesis stream and persisting data to the database.
  • Crucially, when sending rows to the second Kinesis stream, we used each row’s unique key as the Kinesis Partition Key. Kinesis guarantees that all events with the same Partition Key go to the same shard and are therefore processed by the same Persister Lambda instance, in the exact order received.

Diagram 3: Optimized Architecture with Dual Kinesis Streams and Partition Keys
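
Roughly, the publishing side looks like the following sketch with the AWS SDK for Go v2; the stream name and the Row type here are hypothetical:

```go
package main

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/kinesis"
	"github.com/aws/aws-sdk-go-v2/service/kinesis/types"
)

// Row is an illustrative stand-in for a fully built database row.
type Row struct {
	Key     string // the row's unique key, reused as the Kinesis Partition Key
	Payload []byte // serialized row content
}

// publishRows sends the built rows to the second stream instead of writing to
// the database. Using the row key as PartitionKey keeps all events for the
// same row on the same shard, and therefore in order for the Persister.
func publishRows(ctx context.Context, kc *kinesis.Client, rows []Row) error {
	entries := make([]types.PutRecordsRequestEntry, 0, len(rows))
	for _, r := range rows {
		entries = append(entries, types.PutRecordsRequestEntry{
			Data:         r.Payload,
			PartitionKey: aws.String(r.Key),
		})
	}
	out, err := kc.PutRecords(ctx, &kinesis.PutRecordsInput{
		StreamName: aws.String("rows-to-persist"), // hypothetical stream name
		Records:    entries,
	})
	if err != nil {
		return err
	}
	// PutRecords can partially fail; a production version would inspect
	// FailedRecordCount and retry the failed entries.
	_ = out.FailedRecordCount
	return nil
}
```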

This approach allowed us to:

  • Eliminate Contention: By serializing operations per Partition Key, two Persister Lambda instances could never modify the same database row concurrently. Goodbye, deadlocks!
  • Optimize Batching: The Persister Lambda could process larger batches. If it received multiple events for the same key within a batch, it kept only the most recent version, reducing redundant INSERT/UPDATE operations (sketched below).
  • Increase Parallelism: We also discovered inefficient Partition Key distribution elsewhere, with just a few dominant keys. Using unique row keys achieved a much finer distribution, allowing Kinesis to spread load effectively across many shards and enabling our Persister Lambda to scale horizontally with ease.

Result: A resounding success. Deadlocks disappeared completely. Row processing became predictable and fast. We were able to increase batch size and parallelization while preserving entity-level order and drastically improving system performance.
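
The compaction step from the “Optimize Batching” point above fits in a few lines on the Persister side (using the aws-lambda-go event types; the surrounding persistence logic is omitted):

```go
package main

import "github.com/aws/aws-lambda-go/events"

// compactBatch keeps only the latest event per row key within one Kinesis
// batch. Because the partition key is the row's unique key, records for the
// same key arrive in order, so the last occurrence wins.
func compactBatch(records []events.KinesisEventRecord) map[string][]byte {
	latest := make(map[string][]byte, len(records))
	for _, rec := range records {
		latest[rec.Kinesis.PartitionKey] = rec.Kinesis.Data
	}
	return latest // one write to the database per key instead of per event
}
```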

Addressing API Call Timeouts

Once deadlocks were resolved, we focused on the recurring API timeouts that remained a bottleneck in the first processor Lambda.

API Optimization Strategies

  1. Reducing Redundancy (In-Memory Caching): We identified patterns of repeated identical API calls within a single batch. Implementing simple in-memory caching avoided duplicate calls, significantly reducing network requests.

    Before: 100 calls (50 unique, 50 repeated)
    After:  50 calls (only unique ones)
    Table 1: API Call Comparison
  2. Batch Endpoints: We collaborated with API teams to create endpoints that accepted lists of identifiers or parameters. This allowed us to fetch multiple resources with a single call instead of one per resource.

  3. Concurrency in Go: Leveraging Go’s lightweight concurrency (goroutines), we modified API calls to run in parallel instead of sequentially, reducing total processing time, especially when multiple APIs were involved (see the sketch after this list).

  4. Optimizing the APIs Themselves: During debugging, we found that the APIs being called also had issues:

  • Connection bottlenecks: Exhausted HTTP and database connection pools; optimized limits.

  • Inefficient HTTP clients: Centralized and improved client configurations.

  • Malformed requests: Fixed inefficient or incorrect HTTP requests.

  • Better visibility: Enhanced logging for easier problem tracing.

Result: Combined, these optimizations greatly reduced timeout occurrences. For rare cases where an API temporarily failed, we relied on Lambda’s native retry mechanisms and a Dead Letter Queue (DLQ) to ensure eventual successful processing without data loss.
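
To illustrate items 1 and 3 together, here is a minimal sketch that collapses repeated lookups within a batch and fans the remaining calls out over goroutines; the Resource type and fetchFromAPI are hypothetical stand-ins for our real API clients:

```go
package main

import (
	"context"
	"sync"
)

// Resource is an illustrative stand-in for whatever an internal API returns.
type Resource struct {
	ID   string
	Body []byte
}

// fetchAll collapses repeated IDs within a batch (the "in-memory cache" from
// item 1) and then fetches the unique ones concurrently (item 3).
func fetchAll(
	ctx context.Context,
	ids []string,
	fetchFromAPI func(context.Context, string) (Resource, error),
) (map[string]Resource, error) {
	unique := make(map[string]struct{}, len(ids))
	for _, id := range ids {
		unique[id] = struct{}{} // 100 requested IDs with 50 duplicates => 50 calls
	}

	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
		results  = make(map[string]Resource, len(unique))
	)
	for id := range unique {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			res, err := fetchFromAPI(ctx, id)
			if err != nil {
				once.Do(func() { firstErr = err })
				return
			}
			mu.Lock()
			results[id] = res
			mu.Unlock()
		}(id)
	}
	wg.Wait()
	return results, firstErr
}
```

In practice we also cap the fan-out (a bounded worker pool) so a large batch cannot overwhelm the downstream APIs.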

Conclusions: From Crisis to Scalability

The journey from recurring deadlocks and extreme latency to a stable, scalable system was a valuable lesson in distributed systems engineering. The key factors were:

  • Accurate diagnosis: Don’t assume the problem lies where the symptom appears.
  • Iterative approach: Test small changes and measure impact.
  • Bold re-architecture: Don’t fear redesigning key components when incremental fixes aren’t enough.
  • Leverage platform strengths: The strategic use of Kinesis Partition Keys was a true game-changer for serializing critical operations.

Today, our system not only processes device counts reliably and in real time but is also ready to scale to much higher data volumes—demonstrating the power of thoughtful architecture and continuous optimization.


Lessons Learned and Next Steps

While this intervention successfully resolved our critical production issues, the diagnostic and implementation process exposed deeper architectural and cultural challenges—our next frontiers for optimization and system maturity.

1. Kinesis Partition Management and the “Hot Shards” Problem

One key discovery was the suboptimal use of AWS Kinesis in several services. We found systems using very few Partition Keys (sometimes even a single static one) for large data volumes.

This creates “hot shards”—shards receiving a disproportionate share of traffic—negating Kinesis’s parallelism and scalability benefits. A crucial next step is auditing our data streams and applying fine-grained, evenly distributed partitioning across the platform, similar to the solution we implemented here.

2. The Proliferation of HTTP Clients and the Circuit Breaker Opportunity

We noticed a concerning pattern: many services create new HTTP client instances indiscriminately for each call. This not only wastes resources (no connection reuse) but also makes resilience nearly impossible to manage.

Our short-term fix was centralizing HTTP client use, but the long-term opportunity is greater. We plan to build a centralized, enriched HTTP client with built-in resilience patterns—most importantly, the Circuit Breaker.
A Circuit Breaker detects when a dependent service is failing (e.g., returning 5xx errors or timeouts) and “opens the circuit,” pausing requests for a short time. This gives the failing service breathing room to recover instead of being overwhelmed by retries—improving the stability of the entire ecosystem.
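
A minimal sketch of that direction, assuming a hand-rolled breaker around a single shared http.Client; the timeouts, thresholds, and cool-down are illustrative, and in practice we would likely adopt a maintained circuit-breaker library:

```go
package main

import (
	"errors"
	"net/http"
	"sync"
	"time"
)

// One shared client for the whole service, so TCP connections are reused
// instead of being re-dialed on every call. Limits are illustrative.
var sharedClient = &http.Client{
	Timeout: 5 * time.Second,
	Transport: &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 20,
	},
}

// Breaker is a deliberately tiny circuit breaker: after a run of failures it
// "opens" and rejects calls for a cool-down period, giving the dependency
// room to recover.
type Breaker struct {
	mu        sync.Mutex
	failures  int
	openUntil time.Time
}

var ErrCircuitOpen = errors.New("circuit open: dependency is cooling down")

func (b *Breaker) Do(req *http.Request) (*http.Response, error) {
	b.mu.Lock()
	if time.Now().Before(b.openUntil) {
		b.mu.Unlock()
		return nil, ErrCircuitOpen
	}
	b.mu.Unlock()

	resp, err := sharedClient.Do(req)
	failed := err != nil || (resp != nil && resp.StatusCode >= 500)

	b.mu.Lock()
	defer b.mu.Unlock()
	if failed {
		b.failures++
		if b.failures >= 5 { // open after 5 consecutive failures (illustrative)
			b.openUntil = time.Now().Add(30 * time.Second)
			b.failures = 0
		}
	} else {
		b.failures = 0
	}
	return resp, err
}
```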

3. Technical Debt and a Needed Mindset Shift

Finally, this incident exposed substantial technical debt. More importantly, it revealed a mindset gap: many components were not built with production scale in mind.

The absence of comprehensive load testing (simulating real production peaks) allowed these problems to grow silently. The next step is not just technical—it’s cultural.
We must foster a mindset that prioritizes scalability, resilience, and observability as first-class, non-functional requirements—not as afterthoughts. This means considering load testing, resilience patterns, and data distribution from the very inception of every new feature.