We write about Kubernetes operations, cloud-native infrastructure, GitOps workflows, and the real-world lessons we pick up while building and running platforms for our clients. No fluff, just the things we wish someone had told us earlier.
The scaling challenges that surfaced after the RayService migration: log volume from spiky autoscaling, storage exhaustion in Loki and Longhorn, runaway Redis GCS growth, and latency in a cost-optimized multi-cloud setup.
Why we finally migrated from the RayCluster CRD to RayService — and how serveConfigV2, an external GCS on a highly available Dragonfly store, and custom Docker images got us close to zero-downtime AI workloads.
The limitations and pitfalls of running Ray on Kubernetes with the RayCluster CRD: secret management headaches, GPU nodes that couldn't see their GPUs, broken downscaling, and the trade-offs of upgrades and advanced deployments.
How we lifted and shifted Mixedbread's Ray-based AI workloads onto multi-cloud Kubernetes with the KubeRay Operator — and the head-node fragility we had to engineer around with ArgoCD, Argo Events, and Argo Workflows.
Setting up AWS SSO with Google Workspace turned into a deep troubleshooting rabbit hole — until an aging, distant IAM Identity Center turned out to be the surprising culprit.
We measured the network traffic of multi-cloud Kubernetes clusters with Wireshark and did the math on egress fees — here's what cross-cloud worker nodes actually cost you.
Managed Kubernetes usually ties you to a single cloud vendor. Here's the architecture we use at Berops to run truly cloud-agnostic, multi-cloud clusters — without long-term lock-in.
Building a Kubernetes learning platform meant solving multi-tenancy: isolation, security, and resource fairness. Here's why I tried vCluster, abandoned it despite its brilliance, and landed on KubeVirt virtualization.
When your container is a scratch image with no shell and no tools, kubectl exec gets you nowhere. Here's how to debug a running Pod from the node itself — via cgroups, namespaces, strace, and nsenter.
An interview question — "you join a DevOps team with no security; what's your first step?" — sparked this take on why DevOps is already security-conscious, where it goes wrong, and how to fix a broken security culture.
We benchmarked Linkerd mTLS, WireGuard VPN, and Cilium IPsec on a hybrid Kubernetes cluster to see which encryption method costs you the most in latency, throughput, and CPU.