Most "Production RAG" guides assume you’re comfortable sending your company’s intellectual property to a third-party API. For those of us in FinTech, Healthcare, or GovTech, that’s a non-starter. We don’t just have "data privacy concerns"—we have air-gapped requirements and GDPR/HIPAA mandates that make SaaS LLMs a legal liability.
When we moved our RAG stack from an OpenAI-based pipeline to a sovereign, on-prem cluster, we weren't just looking for security. We were looking for latency determinism. We wanted to eliminate the "Internet Tax"—that variable 500ms to 2s delay caused by network hops and SaaS rate-limiting.
By the time we finished the migration, we were processing 300M tokens per month with a 40% reduction in end-to-end latency and a 70% reduction in long-term TCO.
The Sovereign Stack: Architectural Overview
Our architecture is designed for a "Dark Site" environment. Every component—from the inference engine to the vector database—runs on our own silicon. We treat AI not as a service, but as core infrastructure.
1. The Inference Engine: Why vLLM Wins
We benchmarked TGI (Text Generation Inference), Triton, and vLLM. We ultimately standardized on vLLM for one reason: PagedAttention.
In standard Transformers, KV cache memory is fragmented and wasted—often up to 60-80% of VRAM. vLLM treats VRAM like virtual memory in an OS. This allowed us to increase our throughput by 24x compared to standard HuggingFace implementations.
Deployment Pattern
We deploy our models as Dockerized services. Here is the configuration we shipped for Llama 3.1 8B utilizing AWQ quantization to maximize our VRAM budget:
1#!/bin/bash2set -e3
4# Requirements: NVIDIA Container Toolkit installed on host5# This script starts a vLLM instance with Llama 3.1 8B AWQ6docker run --gpus all \7 -v ~/.cache/huggingface:/root/.cache/huggingface \8 -p 8000:8000 \9 --ipc=host \10 vllm/vllm-openai:latest \11 --model casperhansen/llama-3-8b-instruct-awq \12 --quantization awq \13 --max-model-len 16384 \14 --gpu-memory-utilization 0.90 \15 --enforce-eagerSenior Insight: We use --enforce-eager in on-prem setups to avoid the CUDA graph overhead during the initial warm-up. It saves us roughly 3 minutes of "hanging" during cold starts in our CI/CD pipeline.
2. Hardware Sizing & The VRAM Budget
Calculating your memory budget is the difference between a stable system and a constant Out of Memory (OOM) loop. We use a strict calculation to avoid over-provisioning expensive H100 nodes.
Total VRAM = (Model Weights) + (KV Cache) + (Activation Memory) + (System Overhead)
VRAM Footprint Table (Llama 3.1 Series)
| Model Size | Precision | Weights (GB) | Recommended GPU | Context Window |
|---|---|---|---|---|
| 8B | FP16 | 16 GB | 1x RTX 4090 (24GB) | 8k |
| 8B | INT4 (AWQ) | 5.5 GB | 1x RTX 3060 (12GB) | 32k |
| 70B | FP16 | 140 GB | 2x A100 (80GB) | 8k |
| 70B | INT4 (AWQ) | 40 GB | 1x A100 (80GB) | 128k |
| 405B | INT4 | 230 GB | 8x H100 (80GB) | 32k |
3. Local Embeddings: The TEI Advantage
In RAG, the embedding step is often the silent latency killer. Standard Python-based embedding servers (FastAPI + Transformers) are too slow, often clocking 150ms+ per request. We shipped HuggingFace Text-Embeddings-Inference (TEI), a Rust-based solution.
Using the BGE-M3 model on an A10G GPU, we achieved:
- Python/Transformers: 120ms - 200ms per request.
- TEI (Rust): 8ms - 12ms per request.
- Throughput: Supports continuous batching, handling 1,000+ requests/sec on a single card.
4. Vector DB: Choosing for On-Prem
For on-premise, the database must handle high-speed indexing and local NVMe optimization. We chose Qdrant for its Rust-based performance and its ability to handle on-disk HNSW indexing, which is significantly more memory-efficient than Milvus for smaller clusters.
Vector Performance (1M Vectors, 768-dim)
| Feature | Qdrant | Milvus | PGVector (Postgres) |
|---|---|---|---|
| Search Latency (P99) | 18ms | 22ms | 75ms |
| Indexing Speed | 15k vectors/sec | 20k vectors/sec | 4k vectors/sec |
| Architecture | Rust (Single Binary) | Go/Distributed | C (Postgres Plugin) |
| Memory Efficiency | High (On-disk HNSW) | Moderate | Low |
5. Security: The Air-Gapped Implementation
In a "Dark Site," pip install or huggingface-cli download will fail. We built a local model registry using Harbor and MinIO.
Before data is indexed, it must be scrubbed. We use Microsoft Presidio for local PII redaction. We don't just "mask" data; we anonymize it to ensure that even if the vector database is compromised, no identifying information is retrievable.
1from presidio_analyzer import AnalyzerEngine2from presidio_anonymizer import AnonymizerEngine3
4def local_scrub(text: str) -> str:5 """6 Scrub PII from text using Microsoft Presidio locally.7 """8 analyzer = AnalyzerEngine()9 anonymizer = AnonymizerEngine()10 11 # Identify PII (Names, SSNs, Phone Numbers) locally12 results = analyzer.analyze(13 text=text, 14 language='en', 15 entities=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS"]16 )17 18 # Anonymize using local logic19 anonymized = anonymizer.anonymize(text=text, analyzer_results=results)20 return anonymized.text21
22if __name__ == "__main__":23 # Example: "Contact John Doe at 555-0199" 24 # Becomes: "Contact <PERSON> at <PHONE_NUMBER>"25 sample_text = "Contact John Doe at 555-0199 or john.doe@example.com"26 print(f"Scrubbed: {local_scrub(sample_text)}")6. Advanced Retrieval: The "Golden Path"
Naive RAG (Top-K vector search) is a toy. In production, we found that vector search alone only hit 71% accuracy on our internal documentation. By implementing a "Cross-Encoder" re-ranker, we effectively turned retrieval into a two-stage ranking problem.
- Stage 1: Retrieve 50 documents via Vector Search (latency: 20ms).
- Stage 2: Re-rank top 5 via
BGE-Reranker-v2-m3(latency: 60ms).
This increased our Hit Rate @ 1 by 22% and significantly reduced hallucinations because the LLM was only ever seeing the most relevant context.
7. Local Evaluation: LLM-as-a-Judge
You cannot use GPT-4 to evaluate an on-prem RAG pipeline if the data is sensitive. We deployed a Llama 3.1 70B (Quantized) instance as a "Judge" model. We use the Ragas framework to measure the "RAG Triad": Faithfulness, Answer Relevance, and Context Precision.
1from ragas import evaluate2from ragas.metrics import faithfulness, answer_relevance3from langchain_community.llms import VLLM4from datasets import Dataset5
6# Point to our local 70B Judge model7judge_llm = VLLM(8 model="meta-llama/Meta-Llama-3.1-70B-Instruct-AWQ",9 tensor_parallel_size=2, # Spanned across 2x A100s10 trust_remote_code=True,11 vllm_kwargs={"quantization": "awq"}12)13
14# Define evaluation dataset15data_samples = {16 "question": ["How do I reset my password?"],17 "answer": ["You can reset your password via the portal."],18 "contexts": [["The internal portal allows password resets under settings."]],19 "ground_truth": ["Users should use the internal portal settings for password resets."]20}21dataset = Dataset.from_dict(data_samples)22
23# Run evaluation using local LLM24results = evaluate(25 dataset=dataset,26 metrics=[faithfulness, answer_relevance],27 llm=judge_llm28)29
30print(f"Faithfulness Score: {results['faithfulness']:.2f}")31print(f"Answer Relevance Score: {results['answer_relevance']:.2f}")8. Agentic RAG: Self-Correction Loops
Instead of a linear pipeline, we use a loop where the LLM can "critique" its own retrieval. This is implemented using LangGraph. If the "Judge" determines the context is irrelevant, the agent rewrites the query and tries again.
1from typing import TypedDict, List, Literal2from langgraph.graph import StateGraph, END3
4class RAGState(TypedDict):5 query: str6 documents: List[str]7 is_relevant: bool8
9def retrieve_docs(state: RAGState):10 # Logic to call Qdrant or local VectorDB11 return {"documents": ["Context chunk 1"]}12
13def grade_documents(state: RAGState):14 # Logic to call local 8B model to grade relevance15 # For demo, we'll assume it's relevant16 return {"is_relevant": True}17
18def transform_query(state: RAGState):19 # Logic to rewrite query using LLM20 return {"query": f"Rephrased: {state['query']}"}21
22def generate_answer(state: RAGState):23 # Final generation logic24 return state25
26workflow = StateGraph(RAGState)27
28# Define nodes29workflow.add_node("retrieve", retrieve_docs)30workflow.add_node("grade", grade_documents)31workflow.add_node("transform", transform_query)32workflow.add_node("generate", generate_answer)33
34# Build graph35workflow.set_entry_point("retrieve")36workflow.add_edge("retrieve", "grade")37
38def decide_to_generate(state: RAGState) -> Literal["generate", "transform"]:39 return "generate" if state["is_relevant"] else "transform"40
41workflow.add_conditional_edges(42 "grade", 43 decide_to_generate,44 {"generate": "generate", "transform": "transform"}45)46workflow.add_edge("transform", "retrieve")47workflow.add_edge("generate", END)48
49app = workflow.compile()50
51if __name__ == "__main__":52 inputs = {"query": "How to reset password?", "documents": [], "is_relevant": False}53 for output in app.stream(inputs):54 print(output)9. Data Ingestion: Local Parsing with Unstructured.io
Ingesting complex PDFs and Excel files locally is a major hurdle. We use the unstructured library with local Tesseract and Poppler. Using "Chipper" (Unstructured's vision model) improved our table extraction accuracy by 40% over standard OCR.
1#!/bin/bash2# Deploy Unstructured API locally for document parsing3docker run -p 8001:8000 -d --name unstructured-api \4 downloads.unstructured.io/unstructured-io/unstructured-api:latest \5 --port 8000 --host 0.0.0.010. TCO Analysis: The Break-Even Point
The "Internet Tax" isn't just latency; it's cost.
- Managed (GPT-4o + Pinecone): ~$15.00 per 1M tokens (input + output + storage).
- On-Prem (1x NVIDIA H100): ~$35,000 CAPEX + $3,000/year OPEX (power/cooling).
At a volume of 250M tokens per month, the H100 cluster pays for itself in ~10.5 months. For organizations processing >1B tokens/month, on-prem costs are typically 75% lower than SaaS.
Lessons Learned
- Quantization is not optional. Running FP16 in production is a waste of silicon. AWQ provides the best balance of speed and accuracy, maintaining 99% of FP16 performance on benchmarks like MMLU.
- The KV Cache is the bottleneck. If you aren't using an engine with PagedAttention (like vLLM), you are leaving 50% of your throughput on the table. We saw our requests per second jump from 4 to 96 just by switching engines.
- Embeddings are the silent killer. High embedding latency ruins the user experience. Move to TEI and run it on a small dedicated GPU (like an L4) to keep it separate from your main inference compute.
- Air-gapping is an operational tax. Everything takes 3x longer when you have to manually move Docker images and model weights via a jump box. Factor this into your sprint planning.
What's Next?
We are currently testing AMD ROCm 6.0 support on MI300X hardware. With 192GB of HBM3 memory, we can fit a Llama 3 70B (FP16) on a single card, potentially doubling our density per rack. In an on-prem setup, where rack space and power are fixed resources, these efficiency gains are the only way to scale without building a new data center. We are also looking into Intel Gaudi 3 for TCO optimization, as it offers a better price-to-performance ratio for RAG workloads that don't require the full NVIDIA software ecosystem.
Read Next
How I Set Up an On-Prem Agentic AI Stack with Open-Source Embeddings and Fully Local Inference
A practical guide to building a fully on-prem agentic AI system using open-source embeddings and local LLM inference — no APIs, no cloud, complete data control.
Qwen 3.5 in Production: Running with vLLM and Deploying Local Inference on Azure VM
A deep dive into deploying Qwen 3.5 with vLLM for high-throughput inference and running cost-efficient local inference on Azure VMs with GPU acceleration.
Moving Beyond Naive RAG: How We Built a 90% Hit-Rate Pipeline for Production
Basic vector search fails in production. Learn how we engineered a multi-stage RAG pipeline with hybrid search, re-ranking, and agentic loops to achieve 90%+ accuracy.