RCA & COE for Latency Spike and Timeouts on 30th Dec’25

Incident Summary

Date: 30th December 2025
Duration: ~2 minutes (17:04 to 17:06)
Impact: Increased latency and 504s on Juspay Order Create, Order Status and Transaction APIs
Detection: Internal monitoring alerts
Cause of Incident: Multiple database instances restarted at the same time
Severity Level: High

On 30 December 2025, Juspay's Order Create, Order Status and Transaction APIs experienced a spike in latency and 504 responses for approximately two minutes. The issue affected all merchants and was caused by a transient disruption in one of our Aurora database clusters.

Root Cause Analysis

The logs showed that all instances of our Aurora database cluster restarted, and our database queries had elevated latency for around 2 minutes.

We do not rely on the writer instance being available to serve merchant traffic, and when one or more reader instances fail, our application retries the queries against the remaining healthy readers. In this incident, however, all the reader instances restarted at the same time as the writer, leaving no healthy instance to retry against, which is why we observed an increase in latency in our APIs.
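
The reader-failover behavior described above can be sketched as follows. This is an illustrative sketch, not Juspay's actual connection-pool code; `ReaderUnavailable`, `readers`, and `run_query` are hypothetical stand-ins.

```python
class ReaderUnavailable(Exception):
    """Raised when a reader instance cannot serve the query."""

def query_with_failover(readers, run_query, sql):
    """Attempt the query against each reader until one succeeds."""
    last_error = None
    for reader in readers:
        try:
            return run_query(reader, sql)
        except ReaderUnavailable as exc:
            last_error = exc  # this reader is down; try the next one
    # All readers failed simultaneously -- the situation on 30 Dec,
    # when every instance in the cluster restarted at once.
    raise last_error
```

When at least one reader is healthy, the query succeeds with at most a small retry delay; the latency spike only appears in the case where the loop exhausts every reader.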

We raised a ticket with AWS to understand the root cause behind the restarts of the instances in our database cluster. AWS responded that the restarts happened because the VDL (Volume Durable LSN) was stuck, which can happen due to a spike in the write workload. However, our write workload is constant, and there was no change in traffic or query patterns before or during the issue period. We are still working with AWS to investigate the root cause of the stuck VDL.

Why did the incident occur?

All the instances (readers and writer) in one of our Aurora database clusters restarted at the same time, which caused latency spikes in our APIs.

Why do we need writer/reader database instances to serve traffic?

Our writes go through a KV cache layer, so we do not need the writer instance to be available to serve traffic. However, we still need to read some data from the database when it is not available in our KV layer.
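
The read/write paths above can be sketched roughly as below. This is a hypothetical illustration under the stated architecture, not Juspay's implementation; `KVBackedStore`, `db_reader`, and the in-memory dict standing in for the KV cache are all assumptions.

```python
class KVBackedStore:
    """Sketch of a store where writes land in a KV layer and reads
    fall back to the database only on a cache miss."""

    def __init__(self, db_reader):
        self.kv = {}             # stand-in for the KV cache
        self.pending = []        # writes buffered for later flush to the DB
        self.db_reader = db_reader

    def write(self, key, value):
        # Writes succeed against the KV layer alone; the writer DB
        # instance does not need to be up for this call.
        self.kv[key] = value
        self.pending.append((key, value))

    def read(self, key):
        # Serve from KV when possible; only a miss touches the database.
        if key in self.kv:
            return self.kv[key]
        return self.db_reader(key)
```

In this shape, a database outage only affects reads that miss the KV layer, which is why the simultaneous restart of all readers still mattered.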

Why did the instances restart?

According to the AWS response, the VDL was stuck, which triggered restarts in the instances. This is default behavior in AWS Aurora to ensure consistency in the cluster.

Why was the VDL stuck?

AWS suggested a high write workload as the likely trigger, but there was no change in our traffic or query volume before or during the incident.

Resolution & Corrective Actions

First, to rule out high write workload as a cause, we have decreased the rate of writes to our database as an immediate fix. We are able to do this because our KV cache layer can absorb write workload spikes.
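
One common way to cap the write rate the database sees is a token-bucket throttle, with over-budget writes remaining buffered in the KV layer. This is a minimal sketch under that assumption; `WriteThrottle` and its parameters are hypothetical, not our production mechanism.

```python
import time

class WriteThrottle:
    """Token bucket limiting database writes to `rate_per_sec`.
    Writes denied here would stay buffered in the KV layer."""

    def __init__(self, rate_per_sec, clock=time.monotonic):
        self.rate = rate_per_sec
        self.clock = clock
        self.tokens = rate_per_sec   # start with a full bucket
        self.last = clock()

    def try_acquire(self):
        now = self.clock()
        # Refill tokens for the elapsed interval, capped at the bucket size.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # flush this write to the database now
        return False      # keep it buffered in the KV layer for the moment
```

The `clock` parameter is injected so the behavior can be tested deterministically without real sleeps.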

Second, we will further optimize writes to our database by batching updates. Third, since most of our traffic can be served with recent data and our architecture already minimizes database queries via the KV cache layer, we will make changes to ensure that we can continue to serve traffic even if our database cluster is completely unavailable for short durations.
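
The batching idea from the second corrective action can be sketched as below; `UpdateBatcher`, `flush_fn`, and the batch size are illustrative assumptions, not the planned implementation.

```python
class UpdateBatcher:
    """Buffer individual updates and flush them to the database as one
    multi-row operation once the batch is full."""

    def __init__(self, flush_fn, batch_size=100):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.buffer = []

    def add(self, update):
        self.buffer.append(update)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            # One round trip for the whole batch instead of one per update,
            # reducing the write rate the database sees.
            self.flush_fn(self.buffer)
            self.buffer = []
```

A production version would also flush on a timer so a partially filled batch is not held indefinitely.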

1. Immediate: Decrease the rate of writes to our database (Done)
2. Medium Term: Optimize writes to the database by batching updates (ETA: Jan 30, 2026)
3. Medium Term: Serve traffic for recent/new orders even when the database cluster is completely unavailable for short durations (ETA: Feb 13, 2026)