November 20, 2024 · 8 min

Kubernetes War Stories: Lessons from Production

Real incidents, real fixes. What running 50+ microservices on EKS taught me about container orchestration.

Kubernetes · DevOps · AWS

Running Kubernetes in production is like commanding a fleet — everything looks orderly until the first real storm hits.

Over the past two years managing our payroll engine on AWS EKS, I've accumulated a collection of war stories that I wish someone had told me earlier.

The OOMKilled Cascade

It started on a Monday morning. One pod got OOMKilled. Then another. Then the entire namespace started thrashing.

The root cause? A memory leak in our PDF generation service that only manifested with documents over 200 pages. Our resource limits were set correctly, but we hadn't accounted for the burst pattern.

Lesson: Set resource requests conservatively but limits generously, and monitor actual memory usage against the limit. When usage starts converging on the limit, you have a problem brewing.
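The shape we landed on looks roughly like this (the service name and numbers are illustrative, not our actual values):

```yaml
# Container spec fragment: modest request, generous limit.
# The gap between the two is your burst headroom — watch it in monitoring.
containers:
  - name: pdf-generator   # hypothetical service name
    image: registry.example.com/pdf-generator:1.4.2
    resources:
      requests:
        memory: "512Mi"   # what the scheduler reserves
        cpu: "250m"
      limits:
        memory: "2Gi"     # headroom for large-document bursts
        cpu: "1"
```

The request is what the scheduler bin-packs against; the limit is where the kernel OOM-kills you. Sizing the request off the steady state and the limit off the worst observed burst gave us room without overcommitting nodes.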

The DNS Resolution Bottleneck

Our services were experiencing random 5-second timeouts. Not consistently — just enough to make debugging maddening.

Turns out, CoreDNS was the bottleneck. With 50+ services all resolving each other's names, the default CoreDNS deployment was overwhelmed.

Fix:

- Scaled CoreDNS horizontally
- Enabled NodeLocal DNSCache
- Added ndots: 2 to our pod DNS config to reduce unnecessary search domain lookups
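The ndots change is a one-liner in the pod spec. Kubernetes defaults to ndots:5, which means any name with fewer than five dots gets tried against every search domain first. A sketch of the dnsConfig we applied (pod name is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod   # hypothetical
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"    # default is 5; cuts wasted search-domain lookups
  containers:
    - name: app
      image: registry.example.com/app:latest
```

With ndots:2, a name like `payments.prod.svc.cluster.local` resolves directly instead of being expanded through each entry in the search path, which is where most of our redundant CoreDNS queries came from.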

Latency dropped by 40% overnight.

Rolling Updates Gone Wrong

We had a deployment that passed all CI checks but caused a cascading failure in production. The new version changed a serialization format that was backward-incompatible.

What we implemented after:

- Canary deployments with automatic rollback via ArgoCD
- Contract testing between services
- A "shadow traffic" stage before full rollout

Kubernetes gives you the tools. But the strategy is on you.
