An effort to create a fully functional Kubernetes cluster with 1 million active nodes. The article details the challenges and solutions for scaling Kubernetes to this size, covering networking, state management (etcd), and the scheduler.
vLLM Production Stack provides a reference implementation on how to build an inference stack on top of vLLM, allowing for scalable, monitored, and performant LLM deployments using Kubernetes and Helm.