A common issue I see when teams first adopt gRPC is managing persistent connections, especially during failovers.
🤔 The Problem:
gRPC is fast thanks to Protocol Buffers and the way it handles connections, mainly:
- Persistent connections that avoid repeated TCP handshakes
- Multiplexing multiple requests over a single HTTP/2 connection
However, these performance optimizations are also a source of failover challenges.
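To make that connection reuse concrete, here's a minimal Go sketch using grpc-go. The address localhost:50051 is illustrative, and the standard health service stands in for your own RPCs so the example compiles without custom protos. One ClientConn is dialed once and shared; each concurrent call rides the same HTTP/2 connection as its own stream.

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	grpc_health_v1 "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Dial once; the resulting ClientConn is long-lived and safe for
	// concurrent use. There is no per-request TCP or TLS handshake.
	conn, err := grpc.Dial("localhost:50051", // illustrative address
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	client := grpc_health_v1.NewHealthClient(conn)

	// Fire several RPCs concurrently; HTTP/2 interleaves them as
	// separate streams over the same underlying connection.
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), time.Second)
			defer cancel()
			if _, err := client.Check(ctx, &grpc_health_v1.HealthCheckRequest{}); err != nil {
				log.Printf("check failed: %v", err)
			}
		}()
	}
	wg.Wait()
}
```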
😫 Challenges with Failover:
Let’s say you’ve just implemented gRPC and want to trigger a manual failover for your service.
For many, failover typically happens at the load-balancer level, which works fine for HTTP/1.
When you take an instance down, new requests go to another instance.
However, with gRPC over HTTP/2, connections stay open and are reused, so clients keep sending requests to the old instance over their existing connections even during failover.
Unless your load balancer understands HTTP/2 and gRPC, failover will not work the way it did with HTTP/1.
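You can watch this happen from the client side. Here's a minimal Go sketch (address and timeout are illustrative) that logs the channel's connectivity state while you trigger a failover; behind a plain L4 balancer the channel typically just sits in READY, still pinned to the old backend, until that backend actually closes the connection.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	conn, err := grpc.Dial("localhost:50051", // illustrative address
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	conn.Connect() // leave IDLE and establish the transport now

	// Log every state transition for a minute while you fail over at
	// the load balancer and watch what does (or doesn't) happen.
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	for {
		state := conn.GetState()
		log.Printf("channel state: %v", state)
		if !conn.WaitForStateChange(ctx, state) {
			return // context expired; the state never changed again
		}
	}
}
```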
🛠️ Failover with gRPC:
For proper failover, you’ve got two main options:
- Use a load balancer that understands HTTP/2 and gRPC, such as an AWS Application Load Balancer instead of a Network Load Balancer, or Envoy instead of HAProxy.
- Cycle connections periodically, forcing clients to reconnect and redistributing load (see the server-side sketch after this list).
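For the second option, grpc-go lets the server cap connection age via keepalive.ServerParameters. Here's a minimal sketch (port and durations are illustrative, and the health service again stands in for your own): once a connection hits MaxConnectionAge, the server sends an HTTP/2 GOAWAY, in-flight RPCs get a grace period, and the client transparently reconnects through whatever backend the load balancer picks next.

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	grpc_health_v1 "google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":50051") // illustrative port
	if err != nil {
		log.Fatalf("listen: %v", err)
	}

	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		// Force a reconnect roughly every 5 minutes; grpc-go adds
		// jitter so a whole fleet doesn't reconnect in lockstep.
		MaxConnectionAge: 5 * time.Minute,
		// Let in-flight RPCs drain before the connection is closed.
		MaxConnectionAgeGrace: 30 * time.Second,
	}))

	// Register the standard health service so the example runs end to
	// end (and so load balancers can probe the instance).
	grpc_health_v1.RegisterHealthServer(srv, health.NewServer())

	log.Println("serving on :50051")
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```

The grace period matters: without it, long-lived streams get cut off mid-flight every time a connection ages out.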
Both options get the job done, but the first is cleaner overall.
💡 Final Thoughts:
There is a lot to love about gRPC: strong contracts, outstanding performance, and client-server simplicity.
But it takes work to operationalize, and nobody tells you that upfront.