Kubernetes War Stories: Lessons from Production
Real incidents, real fixes. What running 50+ microservices on EKS taught me about container orchestration.
Running Kubernetes in production is like commanding a fleet — everything looks orderly until the first real storm hits.
Over the past two years managing our payroll engine on AWS EKS, I've accumulated a collection of war stories that I wish someone had told me earlier.
The OOMKilled Cascade
It started on a Monday morning. One pod got OOMKilled. Then another. Then the entire namespace started thrashing.
The root cause? A memory leak in our PDF generation service that only manifested with documents over 200 pages. Our resource limits were set correctly, but we hadn't accounted for the burst pattern.
Lesson: Set resource requests conservatively but limits generously. Monitor the delta between the two. When they start converging, you have a problem brewing.
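As an illustration of that lesson, here is a minimal sketch of a container spec with conservative requests and generous limits (service name, image, and values are hypothetical, not our actual manifest):

```yaml
# Hypothetical container spec: requests reflect steady-state usage,
# limits leave headroom for bursts like 200-page PDF jobs.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pdf-generator            # hypothetical service name
spec:
  template:
    spec:
      containers:
        - name: pdf-generator
          image: registry.example.com/pdf-generator:1.4.2  # placeholder
          resources:
            requests:
              memory: "256Mi"    # typical steady-state footprint
              cpu: "250m"
            limits:
              memory: "1Gi"      # burst headroom; alert as usage approaches this
              cpu: "1"
```

The point of the wide gap between request and limit is that it gives you a signal: when observed usage starts converging on the limit, you see it in monitoring before the kernel sees it as an OOM.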
The DNS Resolution Bottleneck
Our services were experiencing random 5-second timeouts. Not consistently — just enough to make debugging maddening.
Turns out, CoreDNS was the bottleneck. With 50+ services all resolving each other's names, the default CoreDNS deployment was overwhelmed.
Fix: we added `ndots: 2` to our pod DNS config to reduce unnecessary search-domain lookups. Latency dropped by 40% overnight.
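The change is a pod-level `dnsConfig` entry; a sketch of what it looks like in a pod spec (surrounding fields omitted):

```yaml
# Pod-level DNS tuning. With ndots: 2, any name containing two or more
# dots is tried as an absolute name first, skipping the cluster
# search-domain suffixes for most external lookups.
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
```

The default `ndots: 5` means almost every external hostname is first tried against every cluster search domain, multiplying the query load on CoreDNS.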
Rolling Updates Gone Wrong
We had a deployment that passed all CI checks but caused a cascading failure in production. The new version changed a serialization format that was backward-incompatible.
What we implemented after: compatibility checks for serialization changes in CI, and rollout settings that stop a bad version at the first replica instead of taking down the fleet.
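One guardrail worth sketching here (field values and names are hypothetical, not our actual config): a conservative rolling-update strategy gated on a readiness probe, so an incompatible version fails on one new pod rather than cascading:

```yaml
# Hypothetical deployment strategy: surge one pod at a time with zero
# unavailable replicas, so a broken version is caught at the first pod.
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      containers:
        - name: api                                  # hypothetical
          image: registry.example.com/api:2.0.0      # placeholder
          readinessProbe:        # gate traffic on a real health check
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
```

Note the caveat: a readiness probe only catches failures the probe can see. A backward-incompatible serialization format can pass `/healthz` and still break consumers, which is why the compatibility check belongs in CI, not just in the rollout.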
Kubernetes gives you the tools. But the strategy is on you.