GPU Passthrough with Kata Containers
Running ML workloads securely on shared infrastructure requires proper isolation. Docker containers alone aren't enough—a malicious container could potentially escape to the host. Here's how we solved this with Kata Containers and VFIO GPU passthrough.
The Problem: Container Isolation Isn't Enough
Standard Docker containers share the host kernel. While Linux namespaces and cgroups provide reasonable isolation for most workloads, they're not suitable when running untrusted code on expensive hardware.
In Vectorlay's model, anyone can submit workloads to run on provider GPUs. We need:
- **Hardware-level isolation:** Workloads can't escape to the host or access other workloads
- **Direct GPU access:** Near-native performance, with no virtualization overhead on the GPU itself
- **Container ergonomics:** Users still push OCI images, using the same workflow they already know
Kata Containers: VMs with Container UX
Kata Containers run each container inside a lightweight VM. You get the security of virtualization with the developer experience of containers.
```
┌─────────────────────────────────────────────────────────┐
│                       Host System                       │
│                                                         │
│  ┌───────────────────────┐   ┌───────────────────────┐  │
│  │     Kata MicroVM      │   │     Kata MicroVM      │  │
│  │  ┌─────────────────┐  │   │  ┌─────────────────┐  │  │
│  │  │   Container A   │  │   │  │   Container B   │  │  │
│  │  │                 │  │   │  │                 │  │  │
│  │  │  Your workload  │  │   │  │ Other workload  │  │  │
│  │  └─────────────────┘  │   │  └─────────────────┘  │  │
│  │                       │   │                       │  │
│  │     Guest Kernel      │   │     Guest Kernel      │  │
│  └───────────────────────┘   └───────────────────────┘  │
│              │                           │              │
│          QEMU/KVM                    QEMU/KVM           │
│              │                           │              │
│   ───────────┴──────────────────────────┴──────────     │
│                       Host Kernel                       │
└─────────────────────────────────────────────────────────┘
```
Key characteristics:
- **Separate kernel:** Each container runs its own guest Linux kernel
- **Fast boot:** Optimized guest images boot in ~100ms
- **OCI compatible:** Works with any container image and integrates with containerd (see the sketch after this list)
- **Hardware passthrough:** PCIe devices can be passed directly to the VM
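Running a workload under Kata looks just like running any other container. A minimal sketch, assuming containerd is wired up with the stock `io.containerd.kata.v2` runtime handler (that wiring is shown under Host Configuration below):

```bash
# Pull an image and run it under the Kata runtime instead of runc
sudo ctr image pull docker.io/library/ubuntu:22.04
sudo ctr run --runtime io.containerd.kata.v2 --rm -t \
  docker.io/library/ubuntu:22.04 kata-demo uname -r

# Prints the guest kernel version, not the host's: proof that the
# container is running inside its own VM
```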
VFIO: Direct GPU Access
VFIO (Virtual Function I/O) is a Linux kernel framework that allows userspace programs to directly access PCIe devices. We use it to pass GPUs directly to Kata VMs.
```
┌────────────────────────────────────────────────────┐
│                    Host System                     │
│  ┌──────────────────────────────────────────────┐  │
│  │                 Kata MicroVM                 │  │
│  │  ┌────────────────────────────────────────┐  │  │
│  │  │             Your Container             │  │  │
│  │  │                                        │  │  │
│  │  │     nvidia-smi ────▶ RTX 4090          │  │  │
│  │  │     (direct access via VFIO)           │  │  │
│  │  └────────────────────────────────────────┘  │  │
│  │                                              │  │
│  │     Guest NVIDIA Driver (545, 550, etc.)     │  │
│  └──────────────────────────────────────────────┘  │
│                         │                          │
│                  VFIO passthrough                  │
│                 (IOMMU translation)                │
│                         │                          │
│  ┌──────────────────────────────────────────────┐  │
│  │                 Physical GPU                 │  │
│  │               NVIDIA RTX 4090                │  │
│  └──────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────┘
```
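We bind GPUs to `vfio-pci` at boot (shown under Host Configuration), but the binding can also be done by hand through sysfs. A sketch, assuming an RTX 4090 (vendor:device `10de:2684`) at the example PCI address `0000:01:00.0`; find your own with `lspci`:

```bash
# Load the VFIO PCI driver
sudo modprobe vfio-pci

# Identify the GPU's PCI address and vendor:device IDs
lspci -nn | grep -i nvidia

# Detach the GPU from whatever driver currently owns it
echo 0000:01:00.0 | sudo tee /sys/bus/pci/devices/0000:01:00.0/driver/unbind

# Tell vfio-pci to claim devices with these IDs
echo 10de 2684 | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
```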
Why Not nvidia-docker?
The traditional approach (nvidia-docker) shares the host's GPU drivers with containers. This has problems for our use case:
- ❌ **Shared kernel attack surface:** Container escapes can access host NVIDIA drivers and potentially other GPUs.
- ❌ **Driver version lock-in:** All containers must use the host's driver version, so you can't run different CUDA versions on the same machine.
- ❌ **Resource interference:** Containers can see and potentially interfere with each other's GPU processes.
With VFIO passthrough:
- ✓ **Complete isolation:** Each VM has exclusive access to its assigned GPU. No shared drivers, no shared memory.
- ✓ **Per-workload drivers:** Each VM runs its own NVIDIA driver version, so CUDA 11 and CUDA 12 can run on the same host.
- ✓ **Native performance:** The GPU isn't virtualized; it's passed directly to the VM, with the same performance as bare metal.
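The difference is easy to see from inside a workload: the guest sees exactly one GPU, driven by the guest's own driver. Illustrative output (driver versions will vary):

```bash
# Inside the Kata guest: only the passed-through GPU is visible
nvidia-smi --query-gpu=name,driver_version --format=csv
# name, driver_version
# NVIDIA GeForce RTX 4090, 550.54.14
```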
Custom Guest Images
Kata Containers need a guest kernel and rootfs image. We build custom images with everything needed for GPU workloads:
```
# Guest image contents
├── Ubuntu 22.04 (minimal)
├── NVIDIA Driver (545, 550, etc.)
├── containerd
├── nvidia-container-toolkit
└── virtiofsd (for filesystem passthrough)
```
Driver Version Matrix
Different GPU generations need different driver versions. We maintain images for multiple driver versions:
| Driver | CUDA | Supported GPUs |
|---|---|---|
| 535 | 12.2 | RTX 30xx, 40xx |
| 545 | 12.3 | RTX 30xx, 40xx |
| 550 | 12.4 | RTX 30xx, 40xx, 50xx |
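When a node comes online, its GPU model has to be mapped to one of these images. A hypothetical sketch of that mapping (`pick_guest_image` is illustrative, not our actual agent code):

```bash
# Hypothetical helper: choose a guest image for a detected GPU model
pick_guest_image() {
  case "$1" in
    *"RTX 50"*)            echo "kata-gpu-guest-550.raw" ;;  # 50xx rows need 550
    *"RTX 40"*|*"RTX 30"*) echo "kata-gpu-guest-545.raw" ;;
    *) echo "unsupported GPU: $1" >&2; return 1 ;;
  esac
}

pick_guest_image "NVIDIA GeForce RTX 4090"   # kata-gpu-guest-545.raw
```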
Build Process
We use libguestfs to build custom guest images:
```bash
#!/bin/bash
# build-guest.sh

# Driver version to bake in (e.g. 535, 545, 550); see the matrix above
DRIVER_VERSION=${DRIVER_VERSION:?set DRIVER_VERSION, e.g. 550}

# Start with an Ubuntu cloud image
wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img

# Resize for our needs
qemu-img resize jammy-server-cloudimg-amd64.img 15G

# Customize with virt-customize
virt-customize -a jammy-server-cloudimg-amd64.img \
  --install containerd,nvidia-driver-$DRIVER_VERSION \
  --run-command "nvidia-ctk runtime configure --runtime=containerd" \
  --selinux-relabel

# Output: kata-gpu-guest-$DRIVER_VERSION.raw
```
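Usage is then a one-liner, with the driver version selecting a row from the matrix above:

```bash
DRIVER_VERSION=550 ./build-guest.sh
```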
Host Configuration
Key insight: the host doesn't need NVIDIA drivers. GPUs are bound to VFIO at boot, and the guest VM runs its own drivers.
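Once the configuration below is in place, this is easy to confirm; a quick check (the `10de:` filter matches NVIDIA's PCI vendor ID):

```bash
# No NVIDIA kernel module should be loaded on the host
lsmod | grep nvidia
# (no output expected)

# The GPU should be claimed by vfio-pci instead
lspci -nnk -d 10de: | grep "Kernel driver in use"
# Kernel driver in use: vfio-pci
```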
IOMMU Setup
VFIO requires IOMMU (Intel VT-d or AMD-Vi) for memory translation:
```
# /etc/default/grub
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"

# Load VFIO modules before NVIDIA
# /etc/modules-load.d/vfio.conf
vfio
vfio_iommu_type1
vfio_pci

# Bind GPU to VFIO at boot
# /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:2684,10de:22ba
```
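After a reboot, it's worth verifying that the IOMMU is active and that the GPU sits in a clean IOMMU group (sharing one only with its own audio function). A sketch:

```bash
# Confirm the IOMMU came up (Intel prints DMAR, AMD prints AMD-Vi)
sudo dmesg | grep -i -e DMAR -e AMD-Vi | head

# List every PCI device by IOMMU group
for d in /sys/kernel/iommu_groups/*/devices/*; do
  g=${d#/sys/kernel/iommu_groups/}
  printf 'group %s: ' "${g%%/*}"
  lspci -nns "${d##*/}"
done
```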
Kata Configuration
```toml
# /etc/kata-containers/configuration.toml
[hypervisor.qemu]
path = "/usr/bin/qemu-system-x86_64"
kernel = "/usr/share/kata-containers/vmlinux.container"
image = "/opt/vectorlay/kata-gpu-guest-545.raw"

# Enable VFIO for GPU passthrough
hotplug_vfio_on_root_bus = true
pcie_root_port = 2

# Memory and CPU defaults
default_memory = 8192
default_vcpus = 4
```
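containerd then needs a runtime handler that points at this configuration. A minimal excerpt, assuming a stock containerd install (exact plugin paths can vary by containerd version):

```toml
# /etc/containerd/config.toml (excerpt)
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
  runtime_type = "io.containerd.kata.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata.options]
    ConfigPath = "/etc/kata-containers/configuration.toml"
```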
Performance Characteristics
With VFIO passthrough, GPU operations run at native speed. The overhead is only in:
- **VM boot:** ~100-200ms for the microVM to start
- **CPU operations:** ~2-5% overhead for virtualized CPU
- **Memory access:** Negligible with EPT/NPT
- **GPU compute:** Zero overhead; direct hardware access
For inference workloads where GPU time dominates, total overhead is typically under 1%: ~200ms of boot amortized over even a one-minute job is about 0.3%, and the GPU work itself runs at full speed.