5 Lessons From Running Production Kubernetes for 3 Years

Three years ago, we migrated our first workload to Kubernetes. Today, we run 200+ microservices across multiple clusters. Here are the five lessons that would have saved us countless sleepless nights.

1. Set Resource Limits on Everything

The single most impactful thing you can do is set memory and CPU requests and limits on every container. Without them, one runaway process can exhaust a node's memory, trigger OOM kills, and cascade failures across unrelated services. We learned this during a holiday weekend incident that took down our payment processing for 47 minutes.
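As a minimal sketch of the advice above, the first manifest sets requests and limits on a single container, and the second adds a namespace-level LimitRange so that containers without their own values still get defaults. The names (payments-api, the payments namespace, the image path) and all of the numbers are hypothetical placeholders, not values from our clusters; size them from your own usage data.

```yaml
# Hypothetical Deployment: names, image, and numbers are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: app
          image: registry.example.com/payments-api:1.0.0
          resources:
            requests:          # what the scheduler reserves on the node
              cpu: 250m
              memory: 256Mi
            limits:            # hard ceiling; exceeding memory gets the container OOM-killed
              cpu: "1"
              memory: 512Mi
---
# Namespace-wide defaults for containers that do not declare their own.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
  namespace: payments
spec:
  limits:
    - type: Container
      defaultRequest:          # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:                 # applied when a container omits limits
        cpu: 500m
        memory: 512Mi
```

The LimitRange is one way to make "everything" enforceable rather than a convention that depends on every team remembering it.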

2. Pod Disruption Budgets Are Non-Negotiable

Node upgrades and spot instance reclaims will happen. Without PDBs, a node drain can evict every replica of a service at the same time. Set minAvailable to at least 1 for every production deployment, and run at least two replicas so that the budget does not block drains entirely. Your on-call engineers will thank you.
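A PodDisruptionBudget along these lines might look like the sketch below. The name and the app label are hypothetical and must match the labels on your own pods. Note that a PDB only constrains voluntary evictions such as node drains; it cannot stop a node from disappearing outright.

```yaml
# Hypothetical PDB: the name and label selector are placeholders.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api
  namespace: payments
spec:
  minAvailable: 1              # keep at least one matching pod running during drains
  selector:
    matchLabels:
      app: payments-api
```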

3. Network Policies Are Your Best Friend

By default, every pod can talk to every other pod. In a 200-service mesh, that’s a security nightmare. We implemented deny-all default policies and explicit allow rules. It took two weeks to roll out and caught three misconfigured services that were accidentally calling production databases from staging.
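The deny-all-plus-allow-list pattern can be sketched with two NetworkPolicy objects. The namespace, labels, and port below are hypothetical, and the policies only take effect if the cluster's network plugin enforces NetworkPolicy.

```yaml
# Default deny: selects every pod in the namespace and allows no traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}              # empty selector matches all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
---
# Explicit allow: only pods labeled app=checkout may reach payments-api on 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-checkout-to-payments
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payments-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: checkout
      ports:
        - protocol: TCP
          port: 8080
```

One thing to plan for: denying all egress also blocks DNS lookups, so a rollout like this usually needs an explicit egress rule for the cluster DNS service before anything else.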

4. Don’t Trust the HPA Defaults

The Horizontal Pod Autoscaler’s default settings are too aggressive for most workloads. The controller re-evaluates metrics every 15 seconds and, by default, scales up with no stabilization window, so your cluster will thrash during normal traffic spikes. We settled on 3-minute scale-up and 10-minute scale-down stabilization windows. Test your HPA with realistic load patterns, not synthetic benchmarks.
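Here is a rough sketch of how those windows could be expressed, assuming they map to the stabilizationWindowSeconds fields of the autoscaling/v2 API. The target name, replica bounds, and 70% CPU target are hypothetical placeholders.

```yaml
# Hypothetical HPA: target, replica bounds, and CPU target are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 180   # 3 minutes
    scaleDown:
      stabilizationWindowSeconds: 600   # 10 minutes
```

The scale-up window trades responsiveness for stability, so a value that suits one workload may be too slow for a latency-sensitive one.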

5. Observability Before Optimization

Before optimizing anything, make sure you can see what’s happening. Prometheus metrics, structured logging, and distributed tracing form the three pillars of observability. We invested six weeks in observability infrastructure before touching a single service, and it paid for itself within the first month when we identified a memory leak that had been silently degrading performance for months.
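As one illustration of the metrics pillar, a Prometheus alerting rule like the sketch below can surface a container whose memory stays close to its limit, which is often how a slow leak first becomes visible. It assumes the cluster exposes cAdvisor metrics (container_memory_working_set_bytes) and runs kube-state-metrics (kube_pod_container_resource_limits). The alert name, 90% threshold, and 15-minute duration are hypothetical, and this is not the rule we used to find our leak.

```yaml
# Hypothetical Prometheus rule: alert name, threshold, and duration are placeholders.
groups:
  - name: container-memory
    rules:
      - alert: ContainerMemoryNearLimit
        expr: |
          max by (namespace, pod, container) (
            container_memory_working_set_bytes{container!=""}
          )
            /
          max by (namespace, pod, container) (
            kube_pod_container_resource_limits{resource="memory"}
          )
            > 0.9
        for: 15m               # must hold for 15 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "Container memory has been above 90% of its limit for 15 minutes"
```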

The best infrastructure is the kind you don’t have to think about — until it fails. Then you need to understand it deeply, quickly.



mylife © 2026 Yasiga
