LLM Inference at Scale
Run Llama 3, Mistral, Mixtral, and other open-source models on distributed GPUs. This guide covers deployment, optimization, and production patterns.
Why Self-Host LLMs?
OpenAI and Anthropic are great, but there are compelling reasons to self-host:
- Cost: At sustained volume, self-hosting can be 10-50x cheaper per token than commercial APIs
- Privacy: Your data never leaves your infrastructure
- Customization: Fine-tune models for your specific use case
- Control: No rate limits, no API changes, no vendor lock-in
- Latency: Deploy closer to your users
Choosing the Right Model
The LLM landscape moves fast. Here are our current recommendations:
| Model | VRAM (FP16 weights) | Best For | Recommended GPUs |
|---|---|---|---|
| Llama 3.1 8B | ~16GB | General purpose, fast | RTX 4090 |
| Llama 3.1 70B | ~140GB | Complex reasoning | 2-4x A100 |
| Mistral 7B | ~14GB | Speed + quality balance | RTX 4090 |
| Mixtral 8x7B | ~90GB | MoE efficiency | 2x A100 |
| Qwen 2.5 72B | ~145GB | Multilingual, coding | 2-4x A100 |
| DeepSeek Coder 33B | ~66GB | Code generation | 2x RTX 4090 |
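As a rough rule of thumb, FP16 weights need about 2 bytes per parameter, plus headroom for the KV cache and activations. Here is a back-of-the-envelope sketch of that estimate; the 20% overhead factor is an illustrative assumption, not a measured value:
def estimate_vram_gb(params_billion: float, bytes_per_param: float = 2.0, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights plus ~20% headroom for KV cache and activations."""
    weights_gb = params_billion * bytes_per_param  # 1B params x 2 bytes ~= 2 GB
    return weights_gb * overhead

# Example: Llama 3.1 8B in FP16 -> ~19 GB with headroom
# (the table's ~16 GB is weights only; both fit on a 24 GB RTX 4090)
print(f"{estimate_vram_gb(8):.0f} GB")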
Deployment with vLLM
vLLM is the gold standard for LLM serving. It provides PagedAttention for efficient memory management, continuous batching for high throughput, and tensor parallelism for large models.
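Before building an image, you can sanity-check a model locally with vLLM's offline API. This is a quick sketch, assuming vLLM is installed and a local GPU is available; it is not part of the VectorLay deployment itself:
from vllm import LLM, SamplingParams

# Load the model onto the local GPU and run one batched generation
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", max_model_len=8192)
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)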
Dockerfile
FROM vllm/vllm-openai:latest

# Pre-download model weights into the image so replicas start fast.
# Llama 3.1 is gated on Hugging Face, so the build needs a token
# (e.g. passed as a build secret).
RUN python -c "from huggingface_hub import snapshot_download; \
    snapshot_download('meta-llama/Meta-Llama-3.1-8B-Instruct')"

ENV MODEL_NAME="meta-llama/Meta-Llama-3.1-8B-Instruct"
ENV MAX_MODEL_LEN=8192
ENV GPU_MEMORY_UTILIZATION=0.95

EXPOSE 8080

# Reset any ENTRYPOINT from the base image and use shell-form CMD so the
# ENV values above are expanded at container start (exec-form CMD would
# pass "${MODEL_NAME}" through literally).
ENTRYPOINT []
CMD python -m vllm.entrypoints.openai.api_server \
    --model "${MODEL_NAME}" \
    --port 8080 \
    --max-model-len "${MAX_MODEL_LEN}" \
    --gpu-memory-utilization "${GPU_MEMORY_UTILIZATION}"
Deploying on VectorLay
import vectorlay

client = vectorlay.Client(api_key="YOUR_API_KEY")

cluster = client.clusters.create(
    name="llama-3-8b",
    image="your-registry/vllm-llama:latest",
    gpu_type="rtx-4090",
    replicas=3,
    port=8080,
    health_check_path="/health",
    env={
        "HUGGING_FACE_HUB_TOKEN": "hf_xxx",  # For gated models
    }
)

cluster.wait_until_ready()
Making Requests
vLLM exposes an OpenAI-compatible API, so you can use any OpenAI SDK:
from openai import OpenAI

client = OpenAI(
    base_url="https://your-cluster.vectorlay.dev/v1",
    api_key="YOUR_VECTORLAY_API_KEY"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
Streaming Responses
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a poem about distributed systems"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Performance Benchmarks
We benchmarked Llama 3.1 8B on VectorLay infrastructure:
| Metric | RTX 4090 | RTX 3090 | A100 40GB |
|---|---|---|---|
| Tokens/second | 85 | 52 | 110 |
| Time to first token | 42ms | 68ms | 35ms |
| Max concurrent users | 150+ | 80 | 200+ |
| Cost per 1M tokens | $0.10 | $0.12 | $0.25 |
Benchmarks: Llama 3.1 8B Instruct, vLLM 0.5.x, context length 4096, 50 concurrent requests
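To reproduce these numbers against your own cluster, time-to-first-token and throughput can be measured with the same streaming API shown above. A minimal single-request sketch; concurrency level, warmup, and prompt mix are up to you and will change the results:
import time
from openai import OpenAI

client = OpenAI(base_url="https://your-cluster.vectorlay.dev/v1", api_key="YOUR_VECTORLAY_API_KEY")

start = time.perf_counter()
first_token_at = None
completion_tokens = 0

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize the CAP theorem."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        completion_tokens += 1  # roughly one token per streamed chunk (approximation)

elapsed = time.perf_counter() - start
print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
print(f"Throughput: {completion_tokens / elapsed:.1f} tokens/s")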
Cost Comparison
Let's compare the cost of processing 10 million tokens (roughly 7.5 million words):
| Provider | Model | Cost (10M tokens) |
|---|---|---|
| OpenAI | GPT-4o | $50.00 |
| OpenAI | GPT-4o-mini | $1.50 |
| Anthropic | Claude 3 Sonnet | $30.00 |
| Together AI | Llama 3.1 8B | $1.80 |
| VectorLay | Llama 3.1 8B | $1.00 |
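The self-hosted figures above are just GPU time divided by aggregate throughput. A back-of-the-envelope version of that arithmetic; the hourly rate and throughput below are illustrative placeholders, not VectorLay pricing:
def cost_per_million_tokens(gpu_hourly_usd: float, aggregate_tokens_per_sec: float) -> float:
    """USD per 1M generated tokens for a single GPU replica."""
    tokens_per_hour = aggregate_tokens_per_sec * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# Illustrative: a ~$0.35/hr GPU sustaining ~1,000 tokens/s across batched requests
print(f"${cost_per_million_tokens(0.35, 1000):.2f} per 1M tokens")  # ~= $0.10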
Production Patterns
High Availability
For production workloads, run multiple replicas across different nodes:
cluster = client.clusters.create(
    name="llama-prod",
    image="your-registry/vllm-llama:latest",
    gpu_type="rtx-4090",
    replicas=5,  # 5 replicas for redundancy
    # VectorLay automatically spreads replicas across different physical nodes
)
Graceful Degradation
If a node fails, VectorLay automatically routes traffic to healthy replicas and schedules a replacement. Your application sees briefly elevated latency while traffic shifts; pair this with client-side retries (below) so any requests in flight on the failed node are resent rather than surfaced as errors.
Request Retries
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

API_KEY = "YOUR_VECTORLAY_API_KEY"

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
async def generate_completion(prompt: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://your-cluster.vectorlay.dev/v1/chat/completions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=60.0
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
Running Larger Models
For models that don't fit on a single GPU (like Llama 70B or Mixtral), use tensor parallelism. VectorLay supports multi-GPU deployments:
# Dockerfile for Llama 70B on 2x A100
FROM vllm/vllm-openai:latest

ENV MODEL_NAME="meta-llama/Meta-Llama-3.1-70B-Instruct"
ENV TENSOR_PARALLEL_SIZE=2

# Reset any base-image ENTRYPOINT and use shell form so the ENV values expand
ENTRYPOINT []
CMD python -m vllm.entrypoints.openai.api_server \
    --model "${MODEL_NAME}" \
    --tensor-parallel-size "${TENSOR_PARALLEL_SIZE}" \
    --port 8080
Then request two GPUs per replica so vLLM can shard the model across them:
cluster = client.clusters.create(
    name="llama-70b",
    image="your-registry/vllm-llama-70b:latest",
    gpu_type="a100-80gb",
    gpus_per_replica=2,  # 2 GPUs per replica for tensor parallelism
    replicas=2
)
Real-World Use Cases
- Customer support: AI agents that handle 90%+ of tickets
- Code assistants: IDE integrations for code completion and review
- Content generation: Blog posts, product descriptions, marketing copy
- Data extraction: Parse unstructured documents into structured data
- Search & RAG: Combine with vector databases for retrieval-augmented generation (see the sketch below)
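As a concrete illustration of the last item, here is a minimal RAG loop against the OpenAI-compatible endpoint shown earlier. The search_vector_db helper is a hypothetical stand-in for whatever vector store you use (pgvector, Qdrant, Pinecone, etc.):
from openai import OpenAI

client = OpenAI(base_url="https://your-cluster.vectorlay.dev/v1", api_key="YOUR_VECTORLAY_API_KEY")

def search_vector_db(query: str, k: int = 3) -> list[str]:
    # Hypothetical retrieval step: return the k most relevant document chunks
    raise NotImplementedError("wire this to your vector database")

def answer_with_rag(question: str) -> str:
    context = "\n\n".join(search_vector_db(question))
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content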
Next Steps
Start building with LLMs
Get $10 in free credits—enough to process ~100 million tokens.