Reliability

Practical patterns for building systems that keep working when things go wrong — retries, timeouts, circuit breakers, graceful degradation, compensating transactions, and the operational habits that separate stable platforms from fragile ones.

14 posts

July 16, 2026 Sometimes the most resilient thing a system can do isn’t retry
July 9, 2026 Should retries and timeouts live in your application or your service mesh?
May 21, 2026 Health-check the listener your gRPC traffic actually uses
May 7, 2026 YOLO Is a Terrible Strategy for Validating Production Changes
April 16, 2026 Are you using traffic mirroring in production? If not, try it out.
March 19, 2026 When You Go to Production with gRPC, Make Sure You’ve Solved Load Distribution First
March 12, 2026 You may be building for availability, but are you building for resiliency?
December 19, 2025 Canary deployments are an operational superpower, but the complexity they bring isn’t for everyone.
November 28, 2025 Does resource usage within your application or database suddenly spike periodically? Does it cause system slowdown?
November 21, 2025 When you shut down an application instance, don't stop the listener immediately — that's how you end up with failed requests during every application rollout. 😢
November 14, 2025 A common issue I see when teams first adopt gRPC is managing persistent connections, especially during failovers.
October 27, 2025 Have you heard of Store and Forward? It’s a resiliency design prevalent in card & bank payments, telecommunications, and other industries.
September 5, 2025 A core capability for building low-latency platforms is quickly detecting and reacting to issues.
August 8, 2025 I can't count how often I've seen issues made worse by minor oversights—like not setting a timeout value. ⏱️