Benjamin Cane

#Bengineering 🧐

Practical notes from Benjamin Cane on distributed systems, reliability, architecture, and engineering leadership. Read the latest post, browse the archive, or subscribe by RSS.

41 posts, August 8, 2025 → May 21, 2026

Latest Post

Benjamin Cane
May 21, 2026

One of the easiest ways to break a gRPC service in production is health-checking the wrong listener.

A common issue I see teams run into when adopting gRPC is leaving readiness checks pointed at their HTTP listener while production traffic actually flows through gRPC.

Everything looks fine until it suddenly doesn’t.

🤔 The Problem

Many gRPC services run two listeners: one for HTTP and one for gRPC.

The HTTP listener often exists for metrics, liveness checks, and management APIs. Teams moving to gRPC often reuse the HTTP health checks they set up for their REST-based services.

It’s generally a good idea to reuse what you already have, but in this case, it can be misleading.

⚠️ Health-Check What Serves Traffic

If customers connect through gRPC, your first readiness check should too.

Your HTTP listener can be perfectly healthy while the gRPC listener is misconfigured, hung, or otherwise failing.

Meanwhile, Kubernetes, load balancers, and dashboards might all show green. ✅

This happens more often than people think.
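
To make the failure mode concrete, here is a small Python sketch using only the standard library. It is not a real gRPC setup: the `/healthz` path, the `HealthHandler` name, and the closed port standing in for a broken gRPC listener are all assumptions for illustration. A plain TCP connect stands in for a traffic-path check, since the standard library can't speak gRPC.

```python
import http.client
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical management listener: answers /healthz with 200."""

    def do_GET(self):
        self.send_response(200 if self.path == "/healthz" else 404)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def http_check(port: int) -> bool:
    """Readiness probe aimed at the HTTP management listener."""
    try:
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=2)
        conn.request("GET", "/healthz")
        ok = conn.getresponse().status == 200
        conn.close()
        return ok
    except (OSError, http.client.HTTPException):
        return False


def tcp_check(port: int) -> bool:
    """Weakest possible check of the traffic port: can we connect at all?"""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=2):
            return True
    except OSError:
        return False


def demo():
    # The management listener starts fine on a free port.
    mgmt = HTTPServer(("127.0.0.1", 0), HealthHandler)
    threading.Thread(target=mgmt.serve_forever, daemon=True).start()

    # Stand-in for the gRPC listener: reserve a free port, then close it
    # to simulate a listener that never came up.
    placeholder = socket.socket()
    placeholder.bind(("127.0.0.1", 0))
    traffic_port = placeholder.getsockname()[1]
    placeholder.close()

    try:
        return http_check(mgmt.server_address[1]), tcp_check(traffic_port)
    finally:
        mgmt.shutdown()
        mgmt.server_close()


if __name__ == "__main__":
    http_ok, traffic_ok = demo()
    print(f"HTTP probe healthy: {http_ok}, traffic port reachable: {traffic_ok}")
```

A readiness check wired to `http_check` alone would keep this instance in rotation even though nothing is listening where customers connect.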

🩺 Better Ways to Monitor gRPC

Three options are worth considering, depending on how much you want the check to verify.

gRPC Health Probe ✅

Use a real gRPC health check request against the listener.

This validates the actual serving path and confirms the service can respond over gRPC.

A strong default option.
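
As a rough sketch of what such a request looks like from Python, the helper below calls the standard gRPC health service. It assumes the `grpcio` and `grpcio-health-checking` packages are installed, that the server has registered the standard health service, and that the listener is plaintext. The `grpc_ready` name and the example target are mine, not part of any library.

```python
def grpc_ready(target: str, service: str = "", timeout: float = 2.0) -> bool:
    """Return True only if the gRPC listener at `target` reports SERVING.

    An empty `service` name asks about the server as a whole.
    """
    # Third-party imports (grpcio, grpcio-health-checking) stay local so this
    # helper can live in a module that also loads where gRPC is not installed.
    import grpc
    from grpc_health.v1 import health_pb2, health_pb2_grpc

    request = health_pb2.HealthCheckRequest(service=service)
    try:
        # Assumption: plaintext listener. A TLS listener would need
        # grpc.secure_channel with appropriate credentials instead.
        with grpc.insecure_channel(target) as channel:
            stub = health_pb2_grpc.HealthStub(channel)
            response = stub.Check(request, timeout=timeout)
    except grpc.RpcError:
        # Unreachable, timed out, unknown service name, or no health
        # service registered: treat all of these as not ready.
        return False
    return response.status == health_pb2.HealthCheckResponse.SERVING
```

Pointed at the gRPC port, for example `grpc_ready("127.0.0.1:50051")` (a hypothetical address), this returns False when the HTTP listener is fine but the gRPC one is not.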

Build a Status gRPC Service 📋

Expose an internal status method in your gRPC API.

This gives you flexibility to check deeper dependencies, such as database readiness, downstream systems, internal state, and maintenance toggles.

It’s more work, but more control.
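
Here is one way the logic behind a status method might look. It is a sketch only: the gRPC wiring is left out, and `StatusReport`, `evaluate_status`, and the check names are hypothetical. The idea is that the handler runs a set of named dependency checks, honors a maintenance toggle, and reports what is failing.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StatusReport:
    ready: bool
    failing: List[str] = field(default_factory=list)


def evaluate_status(
    checks: Dict[str, Callable[[], bool]],
    maintenance_mode: bool = False,
) -> StatusReport:
    """Run each named check and report readiness plus any failures."""
    # A maintenance toggle takes the instance out of rotation on purpose,
    # without running the dependency checks.
    if maintenance_mode:
        return StatusReport(ready=False, failing=["maintenance"])

    failing = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as a failure
        if not ok:
            failing.append(name)
    return StatusReport(ready=not failing, failing=failing)


if __name__ == "__main__":
    # Placeholder checks; real ones would ping the database, call a
    # downstream system, or inspect internal state.
    report = evaluate_status({"database": lambda: True, "downstream": lambda: False})
    print(report)
```

A status method in your gRPC API could call something like this and map the result onto its response message, so the check travels over the same listener as customer traffic.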

Use a Single Shared Listener ☝️

Because gRPC runs on top of HTTP/2, many languages and frameworks can serve HTTP and gRPC traffic on the same listener.

That means an HTTP health endpoint may be acceptable because it checks the same network path. It still does not fully validate gRPC behavior, but it is better than checking an entirely separate listener.

🧠 Final Thoughts

gRPC is awesome.

But making a service production-ready means revisiting configurations inherited from REST services.

  • Health checks
  • Load balancing behavior
  • Connection management
  • Contracts
  • Operational tooling

None of these changes are difficult. They’re just easy to miss.

All Posts

  • May 21, 2026 Health-check the listener your gRPC traffic actually uses
  • May 14, 2026 Weighted load balancing has saved me more times than I can count
  • May 7, 2026 YOLO Is a Terrible Strategy for Validating Production Changes
  • April 30, 2026 Deterministic routing is one of the most effective ways distributed systems reduce consistency problems at scale
  • April 23, 2026 When you think of microservices, you probably think of centralized shared services. But there's another valid pattern that is rarely discussed
  • April 16, 2026 Are you using traffic mirroring in production? If not, try it out.
  • April 9, 2026 Agent Skills Are Becoming the Best Way to Capture Institutional Knowledge
  • April 2, 2026 Saved Prompts Are Dead. Agent Skills Are the Future.
  • March 26, 2026 Generating Code Faster Is Only Valuable If You Can Validate Every Change With Confidence
  • March 19, 2026 When You Go to Production with gRPC, Make Sure You’ve Solved Load Distribution First
  • March 12, 2026 You may be building for availability, but are you building for resiliency?
  • March 5, 2026 When your coding agent doesn’t understand your project, you’ll get junk
  • February 26, 2026 You can have 100% Code Coverage and still have ticking time bombs in your code. 💣
  • February 19, 2026 Getting More Out of Agentic Coding Tools
  • February 12, 2026 Why is Infrastructure-as-Code so important? Hint: It's correctness
  • February 5, 2026 Optimizing the team’s workflow can be more impactful than building business features
  • January 29, 2026 I follow an architecture principle I call The Law of Collective Amnesia
  • January 22, 2026 Performance testing without a target is like running a race with no finish line
  • January 15, 2026 Many teams think performance testing means throwing traffic at a system until it breaks. That approach is fine, but it misses how systems are actually stressed in the real world.
  • January 8, 2026 Pre-populating caches is a “bolt-on” cache-optimization I've used successfully in many systems. It works, but it adds complexity
  • January 1, 2026 Don't be afraid to build a tool. Just don't become too attached to it.
  • December 26, 2025 One of the toughest engineering skills to develop is accepting a decision you disagree with. 😖
  • December 19, 2025 Canary deployments are an operational superpower, but the complexity they bring isn’t for everyone.
  • December 12, 2025 Everyone has bias, yes, even you. 🫵
  • December 5, 2025 Do you use Architecture Decision Records? I’m a big fan, and I think they’re a best practice every engineering org should adopt.
  • November 28, 2025 Does resource usage within your application or database suddenly spike periodically? Does it cause system slowdown?
  • November 21, 2025 When you shut down an application instance, don't stop the listener immediately — that's how you end up with failed requests during every application rollout. 😢
  • November 14, 2025 A common issue I see when teams first adopt gRPC is managing persistent connections, especially during failovers.
  • November 7, 2025 A dangerous mindset I’ve seen—and been guilty of—is assuming code doesn't change.
  • October 31, 2025 ⚡️Does saving 1 millisecond really matter? Answer: more than you’d think.
  • October 27, 2025 Have you heard of Store and Forward? It’s a resiliency design prevalent in card & bank payments, telecommunications, and other industries.
  • October 24, 2025 When Building Low-Latency, High-Scale Systems, Push as Much Processing as Possible to Later
  • October 10, 2025 Coding is a small part of software engineering.
  • October 3, 2025 Should I be an individual contributor or a people leader?
  • September 26, 2025 Improve performance and reduce chances of request failures with this one simple trick! Avoid cross-region calls.
  • September 19, 2025 Did you know Kube-proxy doesn’t perform load-balancing itself? It’s iptables (by default).
  • September 12, 2025 You’ve heard of feature flags, but what about operational flags? ⏯️
  • September 5, 2025 A core capability for building low-latency platforms is quickly detecting and reacting to issues.
  • August 22, 2025 Sometimes when I tell people that logging can impact a microservices response time, I get strange looks. 🤨
  • August 15, 2025 How many times have you seen analytics on an operational database create issues? I’ve seen it far too often.
  • August 8, 2025 I can't count how often I've seen issues made worse by minor oversights—like not setting a timeout value. ⏱️

Practical engineering notes by Benjamin Cane.