Benjamin Cane
November 21, 2025

When you shut down an application instance, don't stop the listener immediately — that's how you end up with failed requests during every application rollout. 😢

🛑 The Common Mistake:

I've seen many shutdown implementations that stop the listener as soon as the shutdown signal is received.

The assumption is usually:

“Stopping the listener will fail readiness probes, and traffic will be redirected.”

That's half right…

It will trigger traffic redirection, but not immediately.

⏱️ Probe Intervals Matter:

Readiness probes (Kubernetes), load balancer health checks, and service mesh probes all run at fixed intervals.

In Kubernetes, the default probe interval (periodSeconds) is 10 seconds.

That means it can take up to 10 seconds for the platform to detect an unhealthy status and adjust traffic.

It takes longer if the failure threshold is greater than 1, and the Kubernetes default (failureThreshold) is 3.
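
As a rough sketch of that arithmetic, the worst-case detection window is about the probe interval multiplied by the failure threshold. This is a simplified upper bound: it ignores probe timeouts and the time the platform needs to propagate the change, and the function name is hypothetical.

```python
def worst_case_detection_seconds(period_seconds: float, failure_threshold: int) -> float:
    """Approximate upper bound on how long an unhealthy instance keeps receiving traffic.

    Assumes the instance becomes unhealthy right after a successful probe,
    so `failure_threshold` full intervals pass before it is marked unready.
    """
    if period_seconds <= 0 or failure_threshold < 1:
        raise ValueError("period must be positive and threshold at least 1")
    return period_seconds * failure_threshold


# A 10-second interval with a threshold of 1, as in the text above.
single_failure = worst_case_detection_seconds(10, 1)
# The same interval with a threshold of 3.
three_failures = worst_case_detection_seconds(10, 3)
```

With a 10-second interval, a threshold of 1 gives a window of roughly 10 seconds, and a threshold of 3 stretches it to roughly 30.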

💥 What Happens During Those 10 Seconds?

New traffic still goes to the unhealthy instance.

And because you stopped the listener, every request to that instance fails until the platform notices, which can be up to 10 seconds.

Some clients retry and land on another instance.

Some will not.

Either way, every rollout will result in failed requests that could have been avoided.

✅ What You Should Do Instead

When shutting down an instance:

1️⃣ Keep the listener running: don’t slam the door shut.

2️⃣ Fail readiness probes: report failures from the readiness endpoint, but keep serving new requests on every other endpoint.

3️⃣ Wait for traffic to drain: let in-flight requests finish, and let the platform stop routing new requests.

4️⃣ Then stop the listener: only when it's safe.

This is a graceful shutdown.
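
The four steps above could be sketched like this, using only Python's standard library. It is a minimal illustration and not a production server: the `/ready` path, the `DrainingServer` and `graceful_shutdown` names, the port, and the 15-second drain are all assumptions made for the example.

```python
import signal
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set once shutdown begins; the readiness endpoint checks it.
shutting_down = threading.Event()


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # "/ready" is a hypothetical readiness path; use whatever your probe hits.
        if self.path == "/ready" and shutting_down.is_set():
            self._reply(503, b"shutting down\n")
        else:
            self._reply(200, b"ok\n")

    def _reply(self, status, body):
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


class DrainingServer(ThreadingHTTPServer):
    # Non-daemon handler threads let server_close() wait for in-flight requests.
    daemon_threads = False


def graceful_shutdown(server, drain_seconds):
    shutting_down.set()        # Step 2: fail readiness; the listener stays up (step 1).
    time.sleep(drain_seconds)  # Step 3: give the platform time to stop routing here.
    server.shutdown()          # Step 4: stop the accept loop...
    server.server_close()      # ...wait for in-flight handlers and close the socket.


def serve(port=8080, drain_seconds=15.0):
    server = DrainingServer(("", port), Handler)

    def on_sigterm(signum, frame):
        # shutdown() must not run on the thread that is inside serve_forever().
        threading.Thread(
            target=graceful_shutdown, args=(server, drain_seconds)
        ).start()

    signal.signal(signal.SIGTERM, on_sigterm)
    server.serve_forever()
```

Calling `serve()` from the main thread starts the server. The drain time is a placeholder; it should be longer than your probe interval multiplied by the failure threshold, so the platform has marked the instance unready before the listener stops.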

🧠 Final Thoughts

Resiliency isn't only about surviving failures; it's also about preventing them.

Handle shutdown properly, and you can roll out new code without ever failing a request.
