Use Case · December 28, 2024 · 12 min read

LLM Inference at Scale

Run Llama 3, Mistral, Mixtral, and other open-source models on distributed GPUs. This guide covers deployment, optimization, and production patterns.

85 tokens/sec · 42 ms time to first token · 150+ concurrent users · $0.10 per 1M tokens

Why Self-Host LLMs?

OpenAI and Anthropic are great, but there are compelling reasons to self-host:

  • Cost: At scale, self-hosting is 10-50x cheaper per token
  • Privacy: Your data never leaves your infrastructure
  • Customization: Fine-tune models for your specific use case
  • Control: No rate limits, no API changes, no vendor lock-in
  • Latency: Deploy closer to your users

Choosing the Right Model

The LLM landscape moves fast. Here are our current recommendations:

| Model | VRAM | Best For | GPU Rec |
| --- | --- | --- | --- |
| Llama 3.1 8B | ~16GB | General purpose, fast | RTX 4090 |
| Llama 3.1 70B | ~140GB | Complex reasoning | 2-4x A100 |
| Mistral 7B | ~14GB | Speed + quality balance | RTX 4090 |
| Mixtral 8x7B | ~90GB | MoE efficiency | 2x A100 |
| Qwen 2.5 72B | ~145GB | Multilingual, coding | 2-4x A100 |
| DeepSeek Coder 33B | ~66GB | Code generation | 2x RTX 4090 |
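
The VRAM column above roughly tracks the fp16/bf16 weight footprint of about 2 bytes per parameter; the KV cache and activations need additional headroom on top, which is why a ~16GB model is paired with a 24GB RTX 4090. A quick sanity check of the table's estimates:

# Rough fp16/bf16 weight footprint: about 2 bytes per parameter.
# The KV cache, activations, and CUDA context need extra headroom on top,
# so pick a GPU (or GPUs) with comfortably more VRAM than this number.
def weight_vram_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    return params_billion * bytes_per_param

for name, params_b in [("Llama 3.1 8B", 8), ("Mistral 7B", 7), ("Llama 3.1 70B", 70), ("Qwen 2.5 72B", 72)]:
    print(f"{name}: ~{weight_vram_gb(params_b):.0f} GB of weights")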

Deployment with vLLM

vLLM is the gold standard for LLM serving. It provides PagedAttention for efficient memory management, continuous batching for high throughput, and tensor parallelism for large models.

Dockerfile

FROM vllm/vllm-openai:latest

# Pre-download model weights into the image so replicas start faster.
# Gated models like Llama require a HUGGING_FACE_HUB_TOKEN at build time.
RUN python -c "from huggingface_hub import snapshot_download; \
    snapshot_download('meta-llama/Meta-Llama-3.1-8B-Instruct')"

ENV MODEL_NAME="meta-llama/Meta-Llama-3.1-8B-Instruct"
ENV MAX_MODEL_LEN=8192
ENV GPU_MEMORY_UTILIZATION=0.95

EXPOSE 8080

# Shell form so the ${...} environment variables are expanded at runtime
# (the JSON exec form does not perform variable substitution)
CMD python -m vllm.entrypoints.openai.api_server \
    --model ${MODEL_NAME} \
    --port 8080 \
    --max-model-len ${MAX_MODEL_LEN} \
    --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION}

Deploying on VectorLay

import vectorlay

client = vectorlay.Client(api_key="YOUR_API_KEY")

cluster = client.clusters.create(
    name="llama-3-8b",
    image="your-registry/vllm-llama:latest",
    gpu_type="rtx-4090",
    replicas=3,
    port=8080,
    health_check_path="/health",
    env={
        "HUGGING_FACE_HUB_TOKEN": "hf_xxx",  # For gated models
    }
)

cluster.wait_until_ready()

Making Requests

vLLM exposes an OpenAI-compatible API, so you can use any OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-cluster.vectorlay.dev/v1",
    api_key="YOUR_VECTORLAY_API_KEY"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)

Streaming Responses

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a poem about distributed systems"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Performance Benchmarks

We benchmarked Llama 3.1 8B on VectorLay infrastructure:

| Metric | RTX 4090 | RTX 3090 | A100 40GB |
| --- | --- | --- | --- |
| Tokens/second | 85 | 52 | 110 |
| Time to first token | 42ms | 68ms | 35ms |
| Max concurrent users | 150+ | 80 | 200+ |
| Cost per 1M tokens | $0.10 | $0.12 | $0.25 |

Benchmarks: Llama 3.1 8B Instruct, vLLM 0.5.x, context length 4096, 50 concurrent requests
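
To reproduce numbers like these against your own cluster, you can run a small load test with the async OpenAI client, as sketched below. The endpoint URL, prompt, and concurrency are placeholders, and counting one token per streamed chunk is an approximation.

import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://your-cluster.vectorlay.dev/v1",
    api_key="YOUR_VECTORLAY_API_KEY",
)

async def measure_one(prompt: str) -> tuple[float, int]:
    """Stream one completion and return (time to first token in seconds, chunks received)."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    stream = await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1  # roughly one token per streamed chunk
    return first_token_at - start, chunks

async def main(concurrency: int = 50) -> None:
    start = time.perf_counter()
    results = await asyncio.gather(
        *[measure_one("Explain quantum computing in simple terms.") for _ in range(concurrency)]
    )
    elapsed = time.perf_counter() - start
    ttfts = sorted(ttft for ttft, _ in results)
    total_chunks = sum(chunks for _, chunks in results)
    print(f"median time to first token: {ttfts[len(ttfts) // 2] * 1000:.0f} ms")
    print(f"aggregate throughput: ~{total_chunks / elapsed:.0f} tokens/sec")

asyncio.run(main())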

Cost Comparison

Let's compare the cost of processing 10 million tokens (roughly 7.5 million words):

| Provider | Model | Cost (10M tokens) |
| --- | --- | --- |
| OpenAI | GPT-4o | $50.00 |
| OpenAI | GPT-4o-mini | $1.50 |
| Anthropic | Claude 3 Sonnet | $30.00 |
| Together AI | Llama 3.1 8B | $1.80 |
| VectorLay | Llama 3.1 8B | $1.00 |
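
The self-hosted figure comes from dividing the hourly GPU cost by the tokens a batched deployment actually serves in an hour. A back-of-the-envelope sketch, using an illustrative hourly rate and throughput rather than published VectorLay prices:

# Back-of-the-envelope cost per 1M tokens. Both inputs are illustrative
# assumptions, not published prices or guaranteed throughput.
gpu_cost_per_hour = 0.40          # assumed hourly rate for an RTX 4090, USD
aggregate_tokens_per_sec = 1100   # assumed batched throughput across all concurrent requests

tokens_per_hour = aggregate_tokens_per_sec * 3600
cost_per_million = gpu_cost_per_hour / tokens_per_hour * 1_000_000
print(f"${cost_per_million:.2f} per 1M tokens")  # ~$0.10 with these assumptions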

Production Patterns

High Availability

For production workloads, run multiple replicas across different nodes:

cluster = client.clusters.create(
    name="llama-prod",
    image="your-registry/vllm-llama:latest",
    gpu_type="rtx-4090",
    replicas=5,  # 5 replicas for redundancy
    # VectorLay automatically spreads across different physical nodes
)

Graceful Degradation

If a node fails, VectorLay automatically routes traffic to healthy replicas and schedules a replacement. Your application sees increased latency for a few seconds, but no errors.

Request Retries

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

API_KEY = "YOUR_VECTORLAY_API_KEY"

# Retry up to 3 times with exponential backoff on network errors and non-2xx responses
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
async def generate_completion(prompt: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://your-cluster.vectorlay.dev/v1/chat/completions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=60.0,
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
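
Calling it from asynchronous code is then a one-liner:

import asyncio

print(asyncio.run(generate_completion("Summarize the benefits of continuous batching.")))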

Running Larger Models

For models that don't fit on a single GPU (like Llama 70B or Mixtral), use tensor parallelism. VectorLay supports multi-GPU deployments:

# Dockerfile for Llama 70B on 2x A100
FROM vllm/vllm-openai:latest

ENV MODEL_NAME="meta-llama/Meta-Llama-3.1-70B-Instruct"
ENV TENSOR_PARALLEL_SIZE=2

CMD ["python", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "${MODEL_NAME}", \
     "--tensor-parallel-size", "${TENSOR_PARALLEL_SIZE}", \
     "--port", "8080"]
cluster = client.clusters.create(
    name="llama-70b",
    image="your-registry/vllm-llama-70b:latest",
    gpu_type="a100-80gb",
    gpus_per_replica=2,  # 2 GPUs per replica for tensor parallelism
    replicas=2
)

Real-World Use Cases

  • Customer support: AI agents that handle 90%+ of tickets
  • Code assistants: IDE integrations for code completion and review
  • Content generation: Blog posts, product descriptions, marketing copy
  • Data extraction: Parse unstructured documents into structured data
  • Search & RAG: Combine with vector databases for retrieval-augmented generation

Next Steps

Start building with LLMs

Get $10 in free credits—enough to process ~100 million tokens.
