I can't count how often I've seen issues made worse by minor oversights—like not setting a timeout value. ⏱️
Timeouts, max lifetimes, retries, idle connection limits, pool sizes—whether you are making HTTP calls or talking to a Database, there are settings that determine the behavior of clients and servers.
These settings are not usually the cause of issues, but they can amplify them when left untouched.
A real example I've seen:
Service A calls Service B. If the request fails, Service A has logic to handle the error gracefully and respond to users.
The problem?
The HTTP request timeout value was left default. When Service B was experiencing issues, requests failed, but not for 5 minutes (the default)
By then, users have already timed out and retried, amplifying the problem by overwhelming Service A with a backlog of stalled HTTP requests to Service B.
A lack of timeout did not create the issue but worsened it, even preventing Service A from processing requests that didn't need to call Service B.
The fix? 🛠
Awareness of and tuning these settings is important, but let's be honest—they are easy to overlook—I've done it a million times.
Introducing a culture of testing failures is more effective:
🧪 Create failure scenarios with unit tests
⛓️💥 Embrace chaos testing with fault injection in functional tests
People forget things; tests catch what we forget.