Building effective RAG systems for enterprise use cases requires moving beyond the naive "chunk-and-retrieve" pattern. In this post, I walk through the architecture of my Agentic RAG system, built specifically for manufacturing document Q&A.
Why "Agentic"?
Standard RAG follows a linear path: embed query → retrieve chunks → generate answer. The problem? Complex manufacturing questions like "Compare the LOTO procedures for Machine A and Machine B" require decomposition — breaking one query into multiple retrieval steps.
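To make decomposition concrete, here is a toy sketch of splitting one comparison question into per-entity retrieval queries. The regex heuristic is purely illustrative; in the actual system this classification is an LLM task, not pattern matching:

```python
import re

def decompose(query: str) -> list[str]:
    """Toy decomposition: split a 'Compare the X for A and B' question
    into one retrieval query per entity. A real router would use the LLM."""
    m = re.match(r"Compare the (.+) for (.+) and (.+)", query)
    if not m:
        return [query]  # not a comparison -> single retrieval step
    topic, a, b = m.groups()
    return [f"{topic} for {a}", f"{topic} for {b}"]

print(decompose("Compare the LOTO procedures for Machine A and Machine B"))
# -> ['LOTO procedures for Machine A', 'LOTO procedures for Machine B']
```

Each sub-query then gets its own retrieval pass, and the generator sees both result sets at once.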
The agentic approach introduces three specialized agents:
- Router Agent: Classifies the user's intent — is this a direct factual lookup, a retrieval task, or a multi-part comparison?
- Retriever Agent: Executes semantic search (or multiple searches for decomposed queries) against the Milvus Lite vector store
- Generator Agent: Synthesizes the retrieved context into a response with mandatory page-level citations
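To show how the three agents hand off to each other, here is a minimal sketch of the loop with hard-coded stubs standing in for the LLM and the vector store. The tiny corpus, the substring matching, and all names are illustrative assumptions, not the project's actual code:

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    citations: list  # mandatory page-level sources

# Hard-coded stand-in for the Milvus Lite index; entirely illustrative.
CORPUS = {
    "machine a": ("Machine A LOTO: isolate the main breaker, apply lock.", "p.12"),
    "machine b": ("Machine B LOTO: bleed the pneumatic line, apply lock.", "p.47"),
}

def route(query: str) -> str:
    """Router Agent (stub): the real system classifies intent with an LLM call."""
    q = query.lower()
    return "comparison" if q.startswith("compare") and " and " in q else "retrieval"

def retrieve(query: str) -> list:
    """Retriever Agent (stub): substring match instead of semantic search."""
    q = query.lower()
    return [chunk for key, chunk in CORPUS.items() if key in q]

def generate(chunks: list) -> Answer:
    """Generator Agent (stub): concatenation instead of LLM synthesis."""
    return Answer(" ".join(t for t, _ in chunks), [p for _, p in chunks])

def answer(query: str) -> Answer:
    sub_queries = query.split(" and ") if route(query) == "comparison" else [query]
    chunks = [c for sq in sub_queries for c in retrieve(sq)]
    return generate(chunks)
```

The key structural point survives the stubbing: a comparison question fans out into multiple retrievals, and the generator only ever sees retrieved chunks, each carrying its page reference.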
Going Offline-First
Manufacturing facilities often operate in air-gapped environments with no cloud connectivity. This system runs entirely on-premise:
- Ollama serves the LLM locally (Llama 2, Mistral, etc.)
- Milvus Lite acts as the vector database — no Docker required, runs as a local file
- FastAPI provides the REST interface
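A rough sketch of how the three pieces wire together follows. This is not the project's actual code: the collection name, embedding model, field names, and database filename are all assumptions, and the third-party imports live inside `create_app()` so the pure prompt helper runs anywhere:

```python
def build_prompt(question, chunks):
    """Pure helper: fold retrieved (text, page) chunks into a grounded prompt."""
    context = "\n".join(f"[page {page}] {text}" for text, page in chunks)
    return (
        "Answer ONLY from the context below and cite page numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def create_app():
    from fastapi import FastAPI          # REST interface
    from pymilvus import MilvusClient    # Milvus Lite: a local file, no Docker
    import ollama                        # talks to the local Ollama daemon

    app = FastAPI()
    store = MilvusClient("rag_local.db")  # assumed filename; created on first use

    @app.get("/ask")
    def ask(q: str):
        # Assumes a pre-populated "docs" collection with text/page fields.
        vector = ollama.embeddings(model="nomic-embed-text", prompt=q)["embedding"]
        hits = store.search(collection_name="docs", data=[vector],
                            limit=3, output_fields=["text", "page"])
        chunks = [(h["entity"]["text"], h["entity"]["page"]) for h in hits[0]]
        reply = ollama.chat(model="mistral",
                            messages=[{"role": "user",
                                       "content": build_prompt(q, chunks)}])
        return {"answer": reply["message"]["content"]}

    return app
```

Everything here reads and writes local resources only: the vector store is a file on disk and the model calls go to `localhost`, which is what makes the air-gapped deployment possible.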
What I Learned
The biggest challenge was getting citation accuracy right. Manufacturing documents use strict terminology (LOTO, PPE, pressure ratings), and the LLM needs to reproduce it exactly. Enforcing mandatory source attribution with page numbers solved this — hallucination rates dropped significantly once the model was constrained to cite only retrieved chunks.
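One cheap way to enforce that constraint is a post-generation check that rejects any answer citing a page outside the retrieved set. A minimal sketch, assuming (hypothetically) that citations are rendered as `[p. N]`:

```python
import re

def citations_valid(answer: str, retrieved_pages: set) -> bool:
    """Post-check: every '[p. N]' citation must point at a retrieved chunk;
    an answer with no citations at all is also rejected."""
    cited = {int(n) for n in re.findall(r"\[p\.\s*(\d+)\]", answer)}
    return bool(cited) and cited <= retrieved_pages

print(citations_valid("Apply LOTO before servicing [p. 12].", {12, 47}))  # True
print(citations_valid("Torque the flange to 90 Nm [p. 99].", {12, 47}))   # False
```

A failed check can trigger a retry with a sterner prompt, which is what turns "please cite sources" from a suggestion into a hard constraint.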
Try It
The full source code is on GitHub. You can have it running locally in under 5 minutes with the quick start guide.