Benjamin Cane

#Bengineering 🧐

Practical notes from Benjamin Cane on distributed systems, reliability, architecture, and engineering leadership. Read the latest post, browse the archive, or subscribe by RSS.

39 posts August 8, 2025 → May 7, 2026

Latest Post

Benjamin Cane
May 7, 2026

YOLO is a terrible strategy for validating production changes.

How many times have you seen it?

Your platform is running smoothly. No alerts, no issues. Then suddenly, something breaks.

After digging in, you discover the cause: another system you depend on made a change, and that change broke your platform.

They didn’t notice it broke. You did, much too late…

How many times have you been the cause of another platform breaking?

🥶 Cold Reality

I wish the above scenario were rare, but it happens constantly across the technology industry.

It happens between internal teams, third-party integrations, and shared infrastructure teams.

These scenarios make you wonder, “How was that change validated?”

Maybe they tested it, and their validation had gaps. Maybe they did little validation, if any.

Either way, the result is the same: they validated their change with 100% of production traffic. Bad plan.

💡 Better Ways to Validate Changes

There are many ways teams can reduce production risk when rolling out changes, and the best teams combine the following approaches.

Canary Releases 🐤

I talk about canary deployments often.

Instead of moving 100% of traffic at once, move small percentages gradually and observe behavior closely.

The observation part matters. Watch error rates, latency changes (beyond normal platform warmup), resource spikes, and unexpected retries. Any of these can signal customer impact.

Canary deployments are one of the best ways to reduce the blast radius of changes, identify problems quickly, and self-correct.
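As a rough illustration of the idea above, here is a minimal Python sketch of a canary loop. Everything in it is an assumption made for the example: the `Metrics` fields, the thresholds, and the `set_traffic_percent` and `read_metrics` callbacks are hypothetical stand-ins for whatever your load balancer and monitoring stack actually provide. A real rollout would also wait between shifting traffic and reading metrics.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Signals observed for one deployment during a canary step (hypothetical shape)."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float   # 99th percentile latency in milliseconds


def canary_is_healthy(baseline: Metrics, canary: Metrics,
                      max_error_delta: float = 0.01,
                      max_latency_ratio: float = 1.2) -> bool:
    """Compare the canary against the baseline using assumed thresholds."""
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return False
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return False
    return True


def run_canary(steps, set_traffic_percent, read_metrics):
    """Shift traffic step by step and roll back to 0% on the first bad signal.

    Returns the traffic percentage the new version ends up serving.
    """
    for percent in steps:
        set_traffic_percent(percent)
        baseline, canary = read_metrics()
        if not canary_is_healthy(baseline, canary):
            set_traffic_percent(0)  # self-correct: send everything back to the old version
            return 0
    return steps[-1]
```

Calling `run_canary([1, 10, 50, 100], ...)` moves traffic in four steps and stops at the first unhealthy reading, which is what keeps the blast radius small.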

Shadow Traffic 🪞

Traffic mirroring sends a copy of production traffic to a new version before any live traffic is routed there.

The mirrored responses are discarded, but you can still observe behavior and monitor the same signals you would with a canary release, without putting a single customer request at risk.
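To make the mechanics concrete, here is a small Python sketch of mirroring at the application level. It is an assumption-heavy simplification: `primary`, `shadow`, and `record` are hypothetical callables, and the shadow call runs synchronously for readability. In practice mirroring is usually done asynchronously, often by a proxy or service mesh, so the shadow can never slow down the customer response.

```python
def handle_with_shadow(request, primary, shadow, record):
    """Serve the request from primary, mirror it to shadow, and record what was observed.

    The shadow response is never returned to the caller.
    """
    response = primary(request)
    try:
        shadow_response = shadow(request)
        record({
            "request": request,
            "error": None,
            "matches_primary": shadow_response == response,
        })
    except Exception as exc:
        # A failing shadow must not affect the customer; it only becomes a signal.
        record({
            "request": request,
            "error": repr(exc),
            "matches_primary": False,
        })
    return response
```

The recorded observations are where you would look for errors and mismatches before routing any live traffic to the new version.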

Synthetic Traffic 🤖

Synthetic traffic simulates user behavior continuously. It’s great for monitoring customer experience, and it’s also a great way to validate new deployments.

Route synthetic traffic to upgraded instances first and verify behavior before moving real traffic. If it fails with synthetic traffic, it likely won’t survive real traffic.
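One way to picture this gate is the Python sketch below. It assumes a hypothetical `send(instance, request)` function that delivers a synthetic request to a specific upgraded instance, and a hypothetical `enable_real_traffic(instance)` hook. Neither is a real API; substitute whatever your routing layer offers.

```python
def validate_with_synthetic(instances, scenarios, send):
    """Run every synthetic scenario against every upgraded instance.

    Each scenario is a (name, request, is_ok) tuple, where is_ok checks the response.
    Returns the (instance, scenario_name) pairs that failed.
    """
    failures = []
    for instance in instances:
        for name, request, is_ok in scenarios:
            try:
                ok = is_ok(send(instance, request))
            except Exception:
                ok = False  # an exception counts as a failed scenario
            if not ok:
                failures.append((instance, name))
    return failures


def promote_if_clean(instances, scenarios, send, enable_real_traffic):
    """Enable real traffic only when no synthetic scenario failed anywhere."""
    failures = validate_with_synthetic(instances, scenarios, send)
    if not failures:
        for instance in instances:
            enable_real_traffic(instance)
    return failures
```

An empty return value means the upgraded instances passed every scenario and now receive real traffic. Anything else is the list of what to investigate first.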

Smoke Tests 😶‍🌫️

The classic approach. After deployment, run a small set of fast tests to confirm the platform is fundamentally working.

Smoke tests don’t need to be fancy; they can be shell scripts, API calls, read-only requests, a test file, or full end-to-end validation.

Their purpose is simple: to quickly catch obvious breakage.
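A smoke test really can be this small. The Python sketch below sends read-only GET requests to a few paths and reports the ones that did not return HTTP 200. The paths in the usage note (`/healthz`, `/version`) and the base URL are made up for illustration. The `fetch` parameter exists only so the check can be exercised without a network.

```python
import urllib.request


def http_status(url, timeout=3.0):
    """Issue a read-only GET and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def run_smoke_tests(base_url, paths, fetch=http_status):
    """Return the paths that did not answer with HTTP 200."""
    failed = []
    for path in paths:
        try:
            if fetch(base_url + path) != 200:
                failed.append(path)
        except Exception:
            # Timeouts, refused connections, and HTTP errors all count as failures.
            failed.append(path)
    return failed
```

For example, `run_smoke_tests("http://localhost:8080", ["/healthz", "/version"])` returns an empty list when both endpoints respond with 200, and a deployment pipeline could fail the rollout on anything else.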

🧠 Final Thoughts

Don’t think of the above methods as mutually exclusive choices. Combine them.

Some platforms I work on combine canary releases, shadow traffic, and synthetic traffic. Others use smoke tests plus canary releases.

The more layers of validation you have, the more likely you are to catch issues before your customers do. Because having your customers validate changes for you is a poor strategy.

All Posts

  • May 7, 2026 YOLO Is a Terrible Strategy for Validating Production Changes
  • April 30, 2026 Deterministic routing is one of the most effective ways distributed systems reduce consistency problems at scale
  • April 23, 2026 When you think of microservices, you probably think of centralized shared services. But there's another valid pattern that is rarely discussed
  • April 16, 2026 Are you using traffic mirroring in production? If not, try it out.
  • April 9, 2026 Agent Skills Are Becoming the Best Way to Capture Institutional Knowledge
  • April 2, 2026 Saved Prompts Are Dead. Agent Skills Are the Future.
  • March 26, 2026 Generating Code Faster Is Only Valuable If You Can Validate Every Change With Confidence
  • March 19, 2026 When You Go to Production with gRPC, Make Sure You’ve Solved Load Distribution First
  • March 12, 2026 You may be building for availability, but are you building for resiliency?
  • March 5, 2026 When your coding agent doesn’t understand your project, you’ll get junk
  • February 26, 2026 You can have 100% Code Coverage and still have ticking time bombs in your code. 💣
  • February 19, 2026 Getting More Out of Agentic Coding Tools
  • February 12, 2026 Why is Infrastructure-as-Code so important? Hint: It's correctness
  • February 5, 2026 Optimizing the team’s workflow can be more impactful than building business features
  • January 29, 2026 I follow an architecture principle I call The Law of Collective Amnesia
  • January 22, 2026 Performance testing without a target is like running a race with no finish line
  • January 15, 2026 Many teams think performance testing means throwing traffic at a system until it breaks. That approach is fine, but it misses how systems are actually stressed in the real world.
  • January 8, 2026 Pre-populating caches is a “bolt-on” cache-optimization I've used successfully in many systems. It works, but it adds complexity
  • January 1, 2026 Don't be afraid to build a tool. Just don't become too attached to it.
  • December 26, 2025 One of the toughest engineering skills to develop is accepting a decision you disagree with. 😖
  • December 19, 2025 Canary deployments are an operational superpower, but the complexity they bring isn’t for everyone.
  • December 12, 2025 Everyone has bias, yes, even you. 🫵
  • December 5, 2025 Do you use Architecture Decision Records? I’m a big fan, and I think they’re a best practice every engineering org should adopt.
  • November 28, 2025 Does resource usage within your application or database suddenly spike periodically? Does it cause system slowdown?
  • November 21, 2025 When you shut down an application instance, don't stop the listener immediately — that's how you end up with failed requests during every application rollout. 😢
  • November 14, 2025 A common issue I see when teams first adopt gRPC is managing persistent connections, especially during failovers.
  • November 7, 2025 A dangerous mindset I’ve seen—and been guilty of—is assuming code doesn't change.
  • October 31, 2025 ⚡️Does saving 1 millisecond really matter? Answer: more than you’d think.
  • October 27, 2025 Have you heard of Store and Forward? It’s a resiliency design prevalent in card & bank payments, telecommunications, and other industries.
  • October 24, 2025 When Building Low-Latency, High-Scale Systems, Push as Much Processing as Possible to Later
  • October 10, 2025 Coding is a small part of software engineering.
  • October 3, 2025 Should I be an individual contributor or a people leader?
  • September 26, 2025 Improve performance and reduce chances of request failures with this one simple trick! Avoid cross-region calls.
  • September 19, 2025 Did you know Kube-proxy doesn’t perform load-balancing itself? It’s iptables (by default).
  • September 12, 2025 You’ve heard of feature flags, but what about operational flags? ⏯️
  • September 5, 2025 A core capability for building low-latency platforms is quickly detecting and reacting to issues.
  • August 22, 2025 Sometimes when I tell people that logging can impact a microservice’s response time, I get strange looks. 🤨
  • August 15, 2025 How many times have you seen analytics on an operational database create issues? I’ve seen it far too often.
  • August 8, 2025 I can't count how often I've seen issues made worse by minor oversights—like not setting a timeout value. ⏱️

Practical engineering notes by Benjamin Cane.