
Sovereignty at Scale: Engineering Production-Grade RAG on Bare Metal

Ahad Khan, Agentic AI Engineer
October 24, 2024
11 min read
Infrastructure · LLM · On-Prem · vLLM · VectorDB

Most "Production RAG" guides assume you’re comfortable sending your company’s intellectual property to a third-party API. For those of us in FinTech, Healthcare, or GovTech, that’s a non-starter. We don’t just have "data privacy concerns"—we have air-gapped requirements and GDPR/HIPAA mandates that make SaaS LLMs a legal liability.

When we moved our RAG stack from an OpenAI-based pipeline to a sovereign, on-prem cluster, we weren't just looking for security. We were looking for latency determinism. We wanted to eliminate the "Internet Tax"—that variable 500ms to 2s delay caused by network hops and SaaS rate-limiting.


By the time we finished the migration, we were processing 300M tokens per month with a 40% reduction in end-to-end latency and a 70% reduction in long-term TCO.

The Sovereign Stack: Architectural Overview

Our architecture is designed for a "Dark Site" environment. Every component—from the inference engine to the vector database—runs on our own silicon. We treat AI not as a service, but as core infrastructure.


1. The Inference Engine: Why vLLM Wins

We benchmarked TGI (Text Generation Inference), Triton, and vLLM. We ultimately standardized on vLLM for one reason: PagedAttention.

In standard Transformers serving, KV cache memory is fragmented and over-reserved; studies show 60-80% of it is wasted. vLLM treats the KV cache like virtual memory in an OS, allocating it in fixed-size blocks on demand. This allowed us to increase our throughput by up to 24x compared to standard HuggingFace implementations.
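As a mental model, PagedAttention's block allocation behaves like a toy page allocator: sequences consume only the blocks they actually need instead of a pre-reserved maximum-length slot. The sketch below is purely illustrative and is not vLLM's implementation; only the 16-token block granularity matches vLLM's default.

```python
# Toy sketch of PagedAttention-style block allocation (NOT vLLM's code):
# the KV cache is carved into fixed-size blocks handed out on demand, so a
# 100-token sequence no longer pins memory for a full max-length context.

BLOCK_SIZE = 16  # tokens per block, vLLM's default granularity


class BlockAllocator:
    def __init__(self, total_blocks: int):
        self.free = list(range(total_blocks))

    def allocate(self, num_tokens: int) -> list[int]:
        """Reserve just enough blocks to hold num_tokens KV entries."""
        needed = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        if needed > len(self.free):
            raise MemoryError("KV cache exhausted")
        blocks, self.free = self.free[:needed], self.free[needed:]
        return blocks


allocator = BlockAllocator(total_blocks=1024)
seq_a = allocator.allocate(100)   # short sequence: 7 blocks, not a 16k slot
seq_b = allocator.allocate(2000)  # long sequence: 125 blocks
print(len(seq_a), len(seq_b), len(allocator.free))
```

Because freed blocks return to a shared pool, many concurrent sequences can share the same VRAM budget, which is where the throughput gain comes from.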


Deployment Pattern

We deploy our models as Dockerized services. Here is the configuration we shipped, using AWQ quantization to maximize our VRAM budget (the repo pinned below is casperhansen's Llama 3 8B AWQ build):

deploy_vllm.sh
#!/bin/bash
set -e

# Requirements: NVIDIA Container Toolkit installed on host
# Starts a vLLM instance serving an AWQ-quantized Llama 3 8B
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model casperhansen/llama-3-8b-instruct-awq \
  --quantization awq \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --enforce-eager

Senior Insight: We use --enforce-eager in on-prem setups to avoid the CUDA graph overhead during the initial warm-up. It saves us roughly 3 minutes of "hanging" during cold starts in our CI/CD pipeline.

2. Hardware Sizing & The VRAM Budget

Calculating your memory budget is the difference between a stable system and a constant Out of Memory (OOM) loop. We use a strict calculation to avoid over-provisioning expensive H100 nodes.

Total VRAM = (Model Weights) + (KV Cache) + (Activation Memory) + (System Overhead)
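The budget above can be sanity-checked in a few lines. Below is a minimal estimator using the standard KV-cache sizing formula (2 values, K and V, per layer per KV head per token); the Llama 3.1 8B shape constants (32 layers, 8 KV heads under GQA, head dimension 128) are the published configuration, while the activation and overhead figures are ballpark assumptions to tune per stack.

```python
# Back-of-envelope VRAM estimator for the budget above. The KV-cache term
# is 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value per token;
# activation/overhead figures are rough placeholders, not measurements.

def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                context_len: int, batch_size: int = 1,
                bytes_per_value: int = 2) -> float:
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_len * batch_size / 1024**3


def total_vram_gb(weights_gb: float, kv_gb: float,
                  activation_gb: float = 1.0, overhead_gb: float = 1.5) -> float:
    # activation_gb / overhead_gb are ballpark assumptions, tune per stack
    return weights_gb + kv_gb + activation_gb + overhead_gb


# Llama 3.1 8B (GQA): 32 layers, 8 KV heads, head_dim 128, FP16 KV cache
kv = kv_cache_gb(num_layers=32, num_kv_heads=8, head_dim=128, context_len=16384)
print(f"KV cache at 16k context: {kv:.1f} GB")
print(f"Total with AWQ weights (~5.5 GB): {total_vram_gb(5.5, kv):.1f} GB")
```

At 16k context the FP16 KV cache alone costs 2 GB per sequence, which is why concurrency, not weights, is usually what blows the budget.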


VRAM Footprint Table (Llama 3.1 Series)

| Model Size | Precision | Weights (GB) | Recommended GPU | Context Window |
|---|---|---|---|---|
| 8B | FP16 | 16 GB | 1x RTX 4090 (24GB) | 8k |
| 8B | INT4 (AWQ) | 5.5 GB | 1x RTX 3060 (12GB) | 32k |
| 70B | FP16 | 140 GB | 2x A100 (80GB) | 8k |
| 70B | INT4 (AWQ) | 40 GB | 1x A100 (80GB) | 128k |
| 405B | INT4 | 230 GB | 8x H100 (80GB) | 32k |

3. Local Embeddings: The TEI Advantage

In RAG, the embedding step is often the silent latency killer. Standard Python-based embedding servers (FastAPI + Transformers) are too slow, often clocking 150ms+ per request. We deployed Hugging Face's Text Embeddings Inference (TEI), a Rust-based server.

Using the BGE-M3 model on an A10G GPU, we achieved:

  • Python/Transformers: 120ms - 200ms per request.
  • TEI (Rust): 8ms - 12ms per request.
  • Throughput: Supports continuous batching, handling 1,000+ requests/sec on a single card.
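Calling a local TEI instance is a single POST to its /embed route. Below is a minimal client sketch; the localhost:8080 address is an assumption for a local container, and the request shape follows TEI's documented `inputs`/`truncate` fields.

```python
# Minimal TEI client sketch. TEI serves POST /embed taking {"inputs": ...};
# the host/port below are assumptions for a local deployment, adjust to yours.
import json
import urllib.request

TEI_URL = "http://localhost:8080/embed"  # assumed local TEI container


def build_embed_request(texts: list[str]) -> bytes:
    # TEI accepts a single string or a batch under the "inputs" key;
    # "truncate" clips inputs longer than the model's max sequence length
    return json.dumps({"inputs": texts, "truncate": True}).encode()


def embed(texts: list[str]) -> list[list[float]]:
    req = urllib.request.Request(
        TEI_URL,
        data=build_embed_request(texts),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())  # one vector per input text


if __name__ == "__main__":
    print(build_embed_request(["reset password procedure"]))
    # With a TEI container listening: embed(["reset password procedure"])
```

Because TEI does continuous batching server-side, the client stays this simple even under heavy concurrency.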

4. Vector DB: Choosing for On-Prem

For on-premise, the database must handle high-speed indexing and local NVMe optimization. We chose Qdrant for its Rust-based performance and its ability to handle on-disk HNSW indexing, which is significantly more memory-efficient than Milvus for smaller clusters.

Vector Performance (1M Vectors, 768-dim)

| Feature | Qdrant | Milvus | PGVector (Postgres) |
|---|---|---|---|
| Search Latency (P99) | 18ms | 22ms | 75ms |
| Indexing Speed | 15k vectors/sec | 20k vectors/sec | 4k vectors/sec |
| Architecture | Rust (Single Binary) | Go/Distributed | C (Postgres Plugin) |
| Memory Efficiency | High (On-disk HNSW) | Moderate | Low |
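A sketch of provisioning a collection with on-disk HNSW through Qdrant's REST API (PUT /collections/{name}): the tuning values (`m`, `ef_construct`) and the localhost:6333 address are illustrative assumptions, not our production settings.

```python
# Sketch: create a Qdrant collection tuned for on-disk HNSW via the REST
# API. Address and tuning values are assumptions; adjust for your cluster.
import json
import urllib.request

QDRANT_URL = "http://localhost:6333"  # Qdrant's default HTTP port


def collection_config(dim: int = 768) -> dict:
    return {
        # Store raw vectors on disk (NVMe) instead of holding them in RAM
        "vectors": {"size": dim, "distance": "Cosine", "on_disk": True},
        # Keep the HNSW graph itself on disk as well for small-RAM nodes
        "hnsw_config": {"m": 16, "ef_construct": 100, "on_disk": True},
    }


def create_collection(name: str, dim: int = 768) -> None:
    req = urllib.request.Request(
        f"{QDRANT_URL}/collections/{name}",
        data=json.dumps(collection_config(dim)).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)


if __name__ == "__main__":
    print(json.dumps(collection_config(), indent=2))
    # Against a live instance: create_collection("internal_docs")
```

The on-disk flags trade a few milliseconds of P99 latency for a much smaller RAM footprint, which is the right trade on NVMe-backed nodes.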

5. Security: The Air-Gapped Implementation

In a "Dark Site," pip install or huggingface-cli download will fail. We built a local model registry using Harbor and MinIO.

Before data is indexed, it must be scrubbed. We use Microsoft Presidio for local PII redaction. We don't just "mask" data; we anonymize it to ensure that even if the vector database is compromised, no identifying information is retrievable.

src/local_scrub.py
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Engines are expensive to construct; build them once at import time
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def local_scrub(text: str) -> str:
    """Scrub PII from text using Microsoft Presidio locally."""
    # Identify PII (names, phone numbers, email addresses) locally
    results = analyzer.analyze(
        text=text,
        language="en",
        entities=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS"],
    )
    # Anonymize using local logic
    anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
    return anonymized.text

if __name__ == "__main__":
    # "Contact John Doe at 555-0199" becomes "Contact <PERSON> at <PHONE_NUMBER>"
    sample_text = "Contact John Doe at 555-0199 or john.doe@example.com"
    print(f"Scrubbed: {local_scrub(sample_text)}")

6. Advanced Retrieval: The "Golden Path"

Naive RAG (Top-K vector search) is a toy. In production, we found that vector search alone only hit 71% accuracy on our internal documentation. By implementing a "Cross-Encoder" re-ranker, we effectively turned retrieval into a two-stage ranking problem.

  1. Stage 1: Retrieve 50 candidate documents via vector search (latency: ~20ms).
  2. Stage 2: Re-rank those 50 candidates with BGE-Reranker-v2-m3 and keep the top 5 (latency: ~60ms).

This increased our Hit Rate @ 1 by 22% and significantly reduced hallucinations because the LLM was only ever seeing the most relevant context.
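The two-stage shape is easy to express independently of any particular model. Here is a minimal sketch with pluggable scoring functions standing in for the vector index and the cross-encoder (in our stack, stage 1 is Qdrant and stage 2 is BGE-Reranker-v2-m3):

```python
# Two-stage retrieval sketch: a wide, cheap recall pass followed by an
# expensive pairwise re-rank over the small candidate set only. The scoring
# callables are stand-ins for a real vector index and cross-encoder.
from typing import Callable


def two_stage_retrieve(
    query: str,
    corpus: list[str],
    vector_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    fetch_k: int = 50,
    top_k: int = 5,
) -> list[str]:
    # Stage 1: approximate recall, fetch a wide candidate set
    candidates = sorted(
        corpus, key=lambda d: vector_score(query, d), reverse=True
    )[:fetch_k]
    # Stage 2: precise pairwise scoring, but only over fetch_k documents
    return sorted(
        candidates, key=lambda d: rerank_score(query, d), reverse=True
    )[:top_k]


if __name__ == "__main__":
    docs = ["password reset via portal", "vacation policy", "VPN setup guide"]
    # Toy scorer: word overlap stands in for embeddings / cross-encoder
    overlap = lambda q, d: len(set(q.split()) & set(d.split()))
    print(two_stage_retrieve("reset password", docs, overlap, overlap,
                             fetch_k=3, top_k=1))
```

The key design point is the asymmetry: the re-ranker's cost is quadratic in query-document pairs, so it only ever sees the `fetch_k` survivors of the cheap pass.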

7. Local Evaluation: LLM-as-a-Judge

You cannot use GPT-4 to evaluate an on-prem RAG pipeline if the data is sensitive. We deployed a Llama 3.1 70B (Quantized) instance as a "Judge" model. We use the Ragas framework to measure the "RAG Triad": Faithfulness, Answer Relevance, and Context Precision.

src/rag_eval.py
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy  # Ragas names it "answer_relevancy"
from langchain_community.llms import VLLM
from datasets import Dataset

# Point to our local 70B Judge model
judge_llm = VLLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-AWQ",
    tensor_parallel_size=2,  # Spanned across 2x A100s
    trust_remote_code=True,
    vllm_kwargs={"quantization": "awq"},
)

# Define evaluation dataset
data_samples = {
    "question": ["How do I reset my password?"],
    "answer": ["You can reset your password via the portal."],
    "contexts": [["The internal portal allows password resets under settings."]],
    "ground_truth": ["Users should use the internal portal settings for password resets."],
}
dataset = Dataset.from_dict(data_samples)

# Run evaluation using the local LLM as judge
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy],
    llm=judge_llm,
)

print(f"Faithfulness Score: {results['faithfulness']:.2f}")
print(f"Answer Relevancy Score: {results['answer_relevancy']:.2f}")

8. Agentic RAG: Self-Correction Loops

Instead of a linear pipeline, we use a loop where the LLM can "critique" its own retrieval. This is implemented using LangGraph. If the "Judge" determines the context is irrelevant, the agent rewrites the query and tries again.

src/agentic_loop.py
from typing import TypedDict, List, Literal
from langgraph.graph import StateGraph, END

class RAGState(TypedDict):
    query: str
    documents: List[str]
    is_relevant: bool

def retrieve_docs(state: RAGState):
    # Logic to call Qdrant or another local vector DB
    return {"documents": ["Context chunk 1"]}

def grade_documents(state: RAGState):
    # Logic to call a local 8B model to grade relevance
    # For demo purposes, assume the context is relevant
    return {"is_relevant": True}

def transform_query(state: RAGState):
    # Logic to rewrite the query using the LLM
    return {"query": f"Rephrased: {state['query']}"}

def generate_answer(state: RAGState):
    # Final generation logic
    return state

workflow = StateGraph(RAGState)

# Define nodes
workflow.add_node("retrieve", retrieve_docs)
workflow.add_node("grade", grade_documents)
workflow.add_node("transform", transform_query)
workflow.add_node("generate", generate_answer)

# Build graph
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade")

def decide_to_generate(state: RAGState) -> Literal["generate", "transform"]:
    return "generate" if state["is_relevant"] else "transform"

workflow.add_conditional_edges(
    "grade",
    decide_to_generate,
    {"generate": "generate", "transform": "transform"},
)
workflow.add_edge("transform", "retrieve")
workflow.add_edge("generate", END)

app = workflow.compile()

if __name__ == "__main__":
    inputs = {"query": "How to reset password?", "documents": [], "is_relevant": False}
    for output in app.stream(inputs):
        print(output)

9. Data Ingestion: Local Parsing with Unstructured.io

Ingesting complex PDFs and Excel files locally is a major hurdle. We use the unstructured library with local Tesseract and Poppler. Using "Chipper" (Unstructured's vision model) improved our table extraction accuracy by 40% over standard OCR.

deploy_unstructured.sh
#!/bin/bash
# Deploy the Unstructured API locally for document parsing
docker run -p 8001:8000 -d --name unstructured-api \
  downloads.unstructured.io/unstructured-io/unstructured-api:latest \
  --port 8000 --host 0.0.0.0

10. TCO Analysis: The Break-Even Point

The "Internet Tax" isn't just latency; it's cost.

  • Managed (GPT-4o + Pinecone): ~$15.00 per 1M tokens (input + output + storage).
  • On-Prem (1x NVIDIA H100): ~$35,000 CAPEX + $3,000/year OPEX (power/cooling).

At a volume of 250M tokens per month, the H100 cluster pays for itself in ~10.5 months. For organizations processing >1B tokens/month, on-prem costs are typically 75% lower than SaaS.
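The break-even math can be reproduced from the post's own figures; the blended $15/1M SaaS rate, $35k CAPEX, and $3k/year OPEX are this post's estimates, not vendor quotes.

```python
# Break-even sketch using the figures above: SaaS cost scales with token
# volume, on-prem is a fixed CAPEX plus flat monthly OPEX. All inputs are
# this post's estimates, not vendor pricing.

SAAS_PER_M_TOKENS = 15.00        # USD per 1M tokens (blended input/output/storage)
CAPEX = 35_000.00                # 1x NVIDIA H100 node
OPEX_PER_MONTH = 3_000.00 / 12   # power + cooling, annualized


def break_even_months(tokens_per_month_m: float) -> float:
    saas_monthly = tokens_per_month_m * SAAS_PER_M_TOKENS
    savings = saas_monthly - OPEX_PER_MONTH
    if savings <= 0:
        return float("inf")  # volume too low: SaaS stays cheaper forever
    return CAPEX / savings


print(f"Break-even at 250M tokens/month: {break_even_months(250):.1f} months")
```

With these inputs the simple model lands at roughly ten months, in line with the ~10.5-month figure above; the small gap comes from utilization and storage assumptions not captured here.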

Lessons Learned

  1. Quantization is not optional. Running FP16 in production is a waste of silicon. AWQ provides the best balance of speed and accuracy, maintaining 99% of FP16 performance on benchmarks like MMLU.
  2. The KV Cache is the bottleneck. If you aren't using an engine with PagedAttention (like vLLM), you are leaving 50% of your throughput on the table. We saw our requests per second jump from 4 to 96 just by switching engines.
  3. Embeddings are the silent killer. High embedding latency ruins the user experience. Move to TEI and run it on a small dedicated GPU (like an L4) to keep it separate from your main inference compute.
  4. Air-gapping is an operational tax. Everything takes 3x longer when you have to manually move Docker images and model weights via a jump box. Factor this into your sprint planning.

What's Next?

We are currently testing AMD ROCm 6.0 support on MI300X hardware. With 192GB of HBM3 memory, we can fit a Llama 3 70B (FP16) on a single card, potentially doubling our density per rack. In an on-prem setup, where rack space and power are fixed resources, these efficiency gains are the only way to scale without building a new data center. We are also looking into Intel Gaudi 3 for TCO optimization, as it offers a better price-to-performance ratio for RAG workloads that don't require the full NVIDIA software ecosystem.