Use Case • December 28, 2024 • 10 min read

Real-Time AI Inference

Some applications can't wait 2 seconds for a response. Gaming, live video processing, trading systems—they need AI that responds in milliseconds. Here's how to build them.

  • <50ms p99 latency
  • Global edge deployment
  • 10K+ requests/sec

The Latency Budget

Real-time means different things to different applications:

  • <16ms: Interactive games (60fps frame time)
  • <100ms: Live video, voice assistants, autocomplete
  • <200ms: Chat, recommendations, content moderation
  • <500ms: Search, analytics dashboards

Your latency budget is split across four components: network RTT, queue wait, model inference, and response serialization. On a distributed platform, minimizing each one matters.
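
As a quick sanity check, it helps to write the budget down and see how much headroom each component leaves. A minimal sketch with illustrative numbers (not measurements):

# latency_budget.py -- illustrative numbers only
BUDGET_MS = 100  # e.g. a voice-assistant style target

components_ms = {
    "network_rtt": 25,
    "queue_wait": 10,
    "model_inference": 15,
    "response_serialization": 2,
}

total = sum(components_ms.values())
print(f"total: {total}ms, headroom: {BUDGET_MS - total}ms")
# total: 52ms, headroom: 48ms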

Example: Real-Time Object Detection

Let's build a live video analysis system that detects objects in real time. The goal: process 30 frames per second, with each frame's detections returned before the next frame arrives (a budget of roughly 33ms per frame).

The Model: YOLOv8

YOLOv8 is well suited to real-time detection: on an RTX 4090, it processes a 640x640 image in ~5ms at FP16 precision. First, the container image:

FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# Python and the OpenCV system libraries are not included in the CUDA runtime image
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip libgl1 libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

RUN pip3 install ultralytics fastapi uvicorn python-multipart

WORKDIR /app
COPY server.py /app/server.py

# Pre-download model weights so startup doesn't hit the network
RUN python3 -c "from ultralytics import YOLO; YOLO('yolov8n.pt')"

EXPOSE 8080
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8080"]

And the inference server:

# server.py
from fastapi import FastAPI, File, UploadFile
from ultralytics import YOLO
import numpy as np
import cv2

app = FastAPI()

# Load model at startup
model = YOLO('yolov8n.pt')
model.to('cuda')

# Warmup: the first pass pays one-time CUDA/model init cost, so do it before traffic arrives
dummy = np.zeros((640, 640, 3), dtype=np.uint8)
model(dummy, verbose=False)

@app.post("/detect")
async def detect(file: UploadFile = File(...)):
    # Read image
    contents = await file.read()
    nparr = np.frombuffer(contents, np.uint8)
    img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    
    # Run inference
    results = model(img, verbose=False)[0]
    
    # Extract detections
    detections = []
    for box in results.boxes:
        detections.append({
            "class": results.names[int(box.cls)],
            "confidence": float(box.conf),
            "bbox": box.xyxy[0].tolist()
        })
    
    return {"detections": detections}

@app.get("/health")
async def health():
    return {"status": "healthy"}

Deployment

cluster = client.clusters.create(
    name="yolo-realtime",
    image="your-registry/yolov8-server:latest",
    gpu_type="rtx-4090",
    replicas=4,  # More replicas = lower queue wait
    port=8080,
    health_check_path="/health"
)
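
Once the cluster is live, a quick smoke test from a client verifies the endpoint end to end. A minimal sketch; the hostname below is a placeholder for your cluster's actual URL:

# smoke_test.py
import cv2
import requests

frame = cv2.imread("frame.jpg")
ok, buf = cv2.imencode(".jpg", frame)

resp = requests.post(
    "https://your-cluster.vectorlay.dev/detect",  # placeholder URL
    files={"file": ("frame.jpg", buf.tobytes(), "image/jpeg")},
    timeout=2,
)
print(resp.json()["detections"])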

Optimization Strategies

1. Model Optimization

  • Use TensorRT: NVIDIA's inference optimizer can 2-3x your throughput
  • Quantize to INT8: ~2x faster with minimal accuracy loss for many models
  • Use smaller model variants: YOLOv8n (nano) vs YOLOv8x (extra-large)
  • Batch requests: Process multiple frames together when possible (a micro-batching sketch follows the TensorRT export below)

# Export to TensorRT
from ultralytics import YOLO

model = YOLO('yolov8n.pt')
model.export(format='engine', half=True, device=0)

# Use the TensorRT model
model = YOLO('yolov8n.engine')
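
The "batch requests" point deserves a sketch: hold incoming frames for a few milliseconds, then run one forward pass over all of them. This is a minimal illustration, not a built-in feature; it assumes the model object from server.py, and the names and limits (infer, batch_worker, MAX_BATCH, MAX_WAIT_MS) are made up for the example:

# micro_batcher.py -- illustrative sketch
import asyncio

MAX_BATCH = 8     # largest batch sent to the GPU in one forward pass
MAX_WAIT_MS = 5   # never hold a frame longer than this waiting for peers

queue: asyncio.Queue = asyncio.Queue()

async def batch_worker():
    loop = asyncio.get_running_loop()
    while True:
        # Wait for the first frame, then briefly collect more
        batch = [await queue.get()]
        deadline = loop.time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        # One batched forward pass (YOLO accepts a list of images),
        # pushed off the event loop so it doesn't block other requests
        imgs = [img for img, _ in batch]
        results = await asyncio.to_thread(model, imgs, verbose=False)
        for (_, fut), res in zip(batch, results):
            fut.set_result(res)

async def infer(img):
    """Call from a request handler; resolves to one Results object."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((img, fut))
    return await fut

The /detect handler would then call await infer(img) instead of invoking the model directly, and batch_worker would be started once at app startup (for example with asyncio.create_task).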

2. Reduce Network Latency

  • Deploy close to users: VectorLay has nodes in multiple regions
  • Use efficient serialization: Protobuf or MessagePack instead of JSON
  • Compress payloads: Especially for image/video data
  • Keep connections alive: HTTP/2 or WebSockets eliminate connection overhead
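
A minimal client sketch combining two of those ideas, a long-lived HTTP/2 connection plus JPEG compression; the endpoint URL is a placeholder, and httpx needs its http2 extra (pip install httpx[http2]):

# low_latency_client.py
import cv2
import httpx

# One long-lived HTTP/2 client: no per-request TCP/TLS handshake
client = httpx.Client(http2=True, timeout=2.0)

frame = cv2.imread("frame.jpg")
# Lower JPEG quality = smaller payload = less time on the wire
ok, buf = cv2.imencode(".jpg", frame, [int(cv2.IMWRITE_JPEG_QUALITY), 70])

resp = client.post(
    "https://your-cluster.vectorlay.dev/detect",  # placeholder URL
    files={"file": ("frame.jpg", buf.tobytes(), "image/jpeg")},
)
print(resp.json())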

3. Eliminate Queue Wait

The biggest latency killer is waiting in a queue behind other requests. Solutions:

  • Overprovision: Run more replicas than your steady-state needs
  • Use smaller batches: Lower throughput but faster individual requests
  • Implement request priorities: Skip the queue for time-sensitive requests
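
The priority idea can be sketched with an asyncio.PriorityQueue: time-sensitive requests get a low number and jump ahead of background work. The names here (submit, priority_worker) are illustrative, and the model object is assumed from server.py:

# priority_scheduler.py -- illustrative sketch, not a platform feature
import asyncio
import itertools

pq: asyncio.PriorityQueue = asyncio.PriorityQueue()
_seq = itertools.count()  # tiebreaker keeps equal-priority requests FIFO

async def submit(img, priority: int = 10):
    """priority 0 = time-critical; larger numbers can wait."""
    fut = asyncio.get_running_loop().create_future()
    await pq.put((priority, next(_seq), img, fut))
    return await fut

async def priority_worker():
    while True:
        priority, _, img, fut = await pq.get()
        # Run the blocking model call off the event loop
        results = await asyncio.to_thread(model, img, verbose=False)
        fut.set_result(results[0])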

WebSocket Streaming

For continuous real-time applications (like live video), use WebSockets to eliminate per-request overhead:

# server.py with WebSocket support
from fastapi import WebSocket, WebSocketDisconnect

@app.websocket("/stream")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()

    try:
        while True:
            # Receive frame
            data = await websocket.receive_bytes()

            # Decode and process
            nparr = np.frombuffer(data, np.uint8)
            img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
            results = model(img, verbose=False)[0]

            # Extract detections (same shape as the /detect response)
            detections = [
                {
                    "class": results.names[int(box.cls)],
                    "confidence": float(box.conf),
                    "bbox": box.xyxy[0].tolist(),
                }
                for box in results.boxes
            ]

            # Send results
            await websocket.send_json({"detections": detections})
    except WebSocketDisconnect:
        # Client closed the stream; exit cleanly
        pass

# Client-side streaming
import asyncio
import json
import cv2
import websockets

async def stream_video():
    uri = "wss://your-cluster.vectorlay.dev/stream"
    async with websockets.connect(uri) as websocket:
        cap = cv2.VideoCapture(0)  # Webcam

        while True:
            ret, frame = cap.read()
            if not ret:
                break

            # Send frame as compressed JPEG
            _, buffer = cv2.imencode('.jpg', frame)
            await websocket.send(buffer.tobytes())

            # Receive detections
            response = await websocket.recv()
            detections = json.loads(response)

            # Draw bounding boxes and display
            for det in detections["detections"]:
                x1, y1, x2, y2 = map(int, det["bbox"])
                cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
                cv2.putText(frame, det["class"], (x1, y1 - 5),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
            cv2.imshow('Live Detection', frame)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break

        cap.release()
        cv2.destroyAllWindows()

asyncio.run(stream_video())

Latency Breakdown

Here's a typical latency breakdown for a real-time inference request:

Component                 Unoptimized    Optimized
Network RTT               50-200ms       10-30ms
Request parsing           5-10ms         1-2ms
Queue wait                0-500ms        0-10ms
Model inference           20-50ms        5-15ms
Response serialization    5-10ms         1-2ms
Total                     80-770ms       17-59ms

Use Cases

  • Gaming: Real-time object detection for player assistance, anti-cheat, or NPC behavior
  • Live video: Content moderation, face blur, automatic captions
  • Robotics: Vision processing for autonomous navigation
  • Financial: Real-time sentiment analysis, fraud detection
  • IoT: Edge inference for security cameras, industrial sensors

Monitoring Latency

Track these metrics to ensure you're hitting your latency targets:

  • p50, p95, p99 latency: Median and tail latencies
  • Queue depth: How many requests are waiting
  • Inference time: Pure model execution time
  • GPU utilization: Are you leaving performance on the table?

VectorLay's dashboard shows all these metrics in real-time, and you can set up alerts when latency exceeds your thresholds.
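
For a quick check outside the dashboard, you can measure percentiles from the client side. A minimal sketch; the endpoint URL is a placeholder, and it should run from the region your users are in:

# measure_latency.py
import time
import numpy as np
import requests

payload = open("frame.jpg", "rb").read()
latencies = []

for _ in range(200):
    start = time.perf_counter()
    requests.post(
        "https://your-cluster.vectorlay.dev/detect",  # placeholder URL
        files={"file": ("frame.jpg", payload, "image/jpeg")},
        timeout=2,
    )
    latencies.append((time.perf_counter() - start) * 1000)

p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
print(f"p50={p50:.1f}ms  p95={p95:.1f}ms  p99={p99:.1f}ms")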

Next Steps

Build real-time AI today

Start with $10 in free credits. No credit card required.

Get Started