Pre-populating caches is a “bolt-on” cache optimization I've used successfully in many systems.
It works, but it adds complexity, which is why most teams avoid it.
📖 Context
In this post, I’m talking about scenarios where one system requires data from another system, i.e., the source of record (SOR). The data is needed frequently, and the decision to cache has already been made.
A good traditional approach is the cache-aside pattern, which maintains a local cache of data.
That cache is populated organically by checking for records as needed, finding that the data is not cached, fetching it from the SOR, and storing the result.
A pro of this approach is that the cache is transient. If it's dropped, it's ok because you can always go back to the SOR, albeit with a performance penalty.
But slow is better than broken.
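To make the cache-aside flow above concrete, here is a minimal sketch. The `fetch_from_sor` callable is a hypothetical stand-in for whatever client talks to your SOR, and the plain dictionary stands in for a real cache.

```python
from typing import Any, Callable, Dict, Hashable


class CacheAside:
    """Local cache that falls back to the source of record (SOR) on a miss."""

    def __init__(self, fetch_from_sor: Callable[[Hashable], Any]) -> None:
        self._fetch_from_sor = fetch_from_sor  # hypothetical SOR client call
        self._store: Dict[Hashable, Any] = {}
        self.misses = 0

    def get(self, key: Hashable) -> Any:
        if key in self._store:
            return self._store[key]        # hit: no call to the SOR
        self.misses += 1
        value = self._fetch_from_sor(key)  # miss: pay the SOR latency once
        self._store[key] = value
        return value

    def clear(self) -> None:
        """Dropping the cache is safe; entries are refetched on demand."""
        self._store.clear()
```

Calling `clear()` costs only extra round trips to the SOR on the next reads, which is the "slow is better than broken" property.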
🤔 Why?
Calls to the SOR are problematic for low-latency or random-access workloads.
When 9 out of 10 requests all want the same data, you’ll have infrequent cache misses. But when 9 out of 10 requests all require different data, you’ll have more cache misses, which reduces the effectiveness of caching.
Pre-populating caches avoids those cache misses by accepting extra complexity in exchange for lower latency.
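A quick simulation shows how much the access pattern matters. This is only a sketch: the keyspace size, request count, and the fully uniform "random access" case are assumptions chosen for illustration, not measurements from a real system.

```python
import random


def hit_rate(keys):
    """Fraction of requests served from an initially empty cache-aside cache."""
    cached = set()
    hits = 0
    for key in keys:
        if key in cached:
            hits += 1
        else:
            cached.add(key)  # a miss: this would be a round trip to the SOR
    return hits / len(keys)


rng = random.Random(42)  # fixed seed so the run is repeatable
KEYSPACE = 10_000        # assumed number of distinct records
REQUESTS = 1_000         # assumed number of requests

# Skewed: roughly 9 in 10 requests ask for the same hot record.
skewed = [0 if rng.random() < 0.9 else rng.randrange(KEYSPACE) for _ in range(REQUESTS)]
# Random access: every request picks a record uniformly at random.
uniform = [rng.randrange(KEYSPACE) for _ in range(REQUESTS)]

skewed_rate = hit_rate(skewed)
uniform_rate = hit_rate(uniform)
print(f"skewed: {skewed_rate:.0%} hits, uniform: {uniform_rate:.0%} hits")
```

Under these assumptions the skewed workload is served almost entirely from cache, while the uniform workload misses on most requests. Those misses are the ones pre-population is meant to remove.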
⚙️ How?
Caveat: I use pre-population purely as a bolt-on optimization, not a core dependency.
Typically, I keep the cache-aside path as the primary mechanism.
If anything goes wrong (and it will), there is always the option to go to the SOR for data (slow > broken).
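One way to keep pre-population as a bolt-on is to make the bulk load best-effort and leave the per-key fallback untouched. The sketch below assumes two hypothetical callables: `fetch_one` (a single-record SOR lookup) and `fetch_all` (some bulk export that yields key/value pairs).

```python
from typing import Any, Callable, Dict, Hashable, Iterable, Tuple


class PrePopulatedCache:
    """Cache-aside cache with an optional, best-effort warm-up step."""

    def __init__(
        self,
        fetch_one: Callable[[Hashable], Any],                    # hypothetical SOR lookup
        fetch_all: Callable[[], Iterable[Tuple[Hashable, Any]]],  # hypothetical bulk export
    ) -> None:
        self._fetch_one = fetch_one
        self._fetch_all = fetch_all
        self._store: Dict[Hashable, Any] = {}

    def warm(self) -> bool:
        """Try to bulk-load the cache; a failure is reported, never fatal."""
        try:
            snapshot = dict(self._fetch_all())
        except Exception:
            return False  # cache-aside still works, just slower
        self._store.update(snapshot)
        return True

    def get(self, key: Hashable) -> Any:
        if key not in self._store:
            # Primary mechanism: fall back to the SOR on any miss.
            self._store[key] = self._fetch_one(key)
        return self._store[key]
```

If `warm()` fails or is never called, reads behave exactly like plain cache-aside.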
A key decision is whether to pull the data or listen for it.
I prefer that the SOR publish updates as they occur, but platform constraints or circumstances may require you to pull the data.
Pub/sub works great when the SOR publishes, but other options exist as well (webhooks, files) with their own trade-offs.
Use whatever makes sense for your environment.
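For the listening side, the handler can stay small regardless of transport. This sketch assumes the SOR stamps each change with an increasing `version` and signals deletes with a `None` value; both are conventions invented for the example, not features of any particular broker.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class UpdateEvent:
    key: str
    version: int          # assumed: increases with every change to the record
    value: Optional[Any]  # assumed convention: None means the record was deleted


class SubscribedCache:
    """Cache kept current by update events published by the SOR."""

    def __init__(self) -> None:
        # Deletes are kept as (version, None) tombstones so that a late,
        # older event cannot resurrect a deleted record.
        self._store: Dict[str, Tuple[int, Optional[Any]]] = {}

    def on_event(self, event: UpdateEvent) -> bool:
        """Apply one event; return False if it was a duplicate or out of order."""
        current = self._store.get(event.key)
        if current is not None and current[0] >= event.version:
            return False
        self._store[event.key] = (event.version, event.value)
        return True

    def get(self, key: str) -> Optional[Any]:
        """Return the cached value, or None so the caller can fall back to the SOR."""
        entry = self._store.get(key)
        return None if entry is None else entry[1]
```

The same `on_event` method could be called from a pub/sub consumer, a webhook endpoint, or a file import loop.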
⚠️ Why Not?
Pre-populating a cache is easier said than done, as a lot can go wrong.
What happens if you lose a message or two?
What happens when you’re rebuilding the cache (errors or new instances)? How do you repopulate?
The cache-aside path will cover records that never arrived, but a dropped update to an already-cached record leaves stale data behind, and implementing republish mechanisms is complicated.
You can’t rely solely on deltas; at some point, you'll need to republish the entire dataset.
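A sketch of why deltas alone are not enough: if each delta carries a sequence number, the consumer can at least detect a gap, but the only recovery is a full snapshot. The sequence numbers and the `apply_snapshot` entry point are assumptions for illustration; how the SOR produces a snapshot is the hard part and is not shown.

```python
from typing import Any, Dict, Mapping, Optional


class DeltaFedCache:
    """Applies sequenced deltas and flags when a full republish is required."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}
        self._last_seq = 0
        self.needs_snapshot = True  # a new or rebuilt instance starts empty

    def apply_delta(self, seq: int, key: str, value: Any) -> bool:
        if self.needs_snapshot:
            return False  # deltas are meaningless without a baseline
        if seq <= self._last_seq:
            return False  # duplicate delivery: already applied
        if seq != self._last_seq + 1:
            self.needs_snapshot = True  # gap: at least one message was lost
            return False
        self._store[key] = value
        self._last_seq = seq
        return True

    def apply_snapshot(self, seq: int, records: Mapping[str, Any]) -> None:
        """Replace the whole cache, which also drops keys deleted at the SOR."""
        self._store = dict(records)
        self._last_seq = seq
        self.needs_snapshot = False

    def get(self, key: str) -> Optional[Any]:
        return self._store.get(key)
```

While `needs_snapshot` is set, reads for anything uncertain should go through the cache-aside path until the republish lands.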
Building all of these systems is complicated; there's more to monitor, patch, and manage.
If the latency hit and traffic volume to the SOR are not a concern, then that complexity is not worth it.
🧠 Final Thoughts
Pre-populating caches can be a significant performance win, but it also adds operational overhead.
If your data is primarily static (changing infrequently), the overhead can be worthwhile.
If your data changes frequently, stick with cache-aside (and aggressive TTLs), or no cache at all.
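For the frequently changing case, cache-aside with a short TTL might look like the sketch below. The `ttl_seconds` value and the injectable `clock` are assumptions added to keep the example small and testable.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TtlCacheAside:
    """Cache-aside where every entry expires after a fixed time to live."""

    def __init__(
        self,
        fetch_from_sor: Callable[[Hashable], Any],  # hypothetical SOR client call
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_from_sor = fetch_from_sor
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[Hashable, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                # still fresh
        value = self._fetch_from_sor(key)  # missing or expired: refetch
        self._store[key] = (now + self._ttl, value)
        return value
```

A shorter TTL bounds how stale a read can be, at the cost of more calls to the SOR.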