A core capability for building low-latency platforms is quickly detecting and reacting to issues.
It sounds odd, how does resiliency factor into low latency?
Stay with me.
Automatic retries are common in modern platforms. I've shared my cautions about blind retries in financial systems, but in many cases, retries are a solid resiliency practice.
The problem?
Retries often become the only resiliency mechanism.
The overreliance on retries leads to shortcuts around key fundamentals.
✅ Readiness Probes
✅ Health Checks
✅ Graceful Shutdown
When faults occur, requests fail, get retried, and eventually succeed. New requests repeat this process until someone fixes the fault.
But every retry adds a delay.
This is especially true when multiple retries are needed. Every retry adds extra transit and error-handling time.
With multiple failures across a microservices-based platform, retries can add up.
Retries are great, but the full collection of resiliency practices powers reliable and fast platforms.