Production-Grade Document RAG Backend
I built the backend for a document-intelligence assistant: upload PDFs, ask questions in plain language, and get answers grounded in the source with the exact passages cited. Underneath is an async retrieval pipeline — OCR and layout parsing, hybrid search, two-stage reranking, then generation — orchestrated across dedicated worker pools, instrumented end to end, and built behind a provider interface so the AI model layer is swappable rather than load-bearing.
Many documents, many fragments — one grounded, cited answer.
Ask questions of your own documents — and trust the answer.
Organisations sit on thousands of documents — contracts, reports, scanned forms in more than one language, with tables that carry half the meaning — and no good way to ask questions across them. A generic chatbot is a non-starter: an answer you can't trace back to a source can't be trusted, and a backend that hard-wires one AI provider is a liability the day that provider's price, latency, or availability changes. The hard part isn't calling a model; it's turning messy real-world documents into grounded, cited answers, keeping a heavy multi-stage pipeline responsive under load, and being able to see exactly why any given answer came out the way it did.
One async pipeline, from upload to cited answer.
An async API takes the upload and the question, then hands the work to a queue. A multi-stage pipeline ingests and indexes the document, retrieves and reranks the right passages, and generates a grounded answer — reading and writing a small set of stores underneath.
Six stages, each with a clean input and output.
The retrieval path is deliberately staged: recover the text, index it, route the query, retrieve broadly, rerank hard, then answer from what survives. Each stage does one job and hands off to the next.
PDFs go through OCR to recover clean text and tables — multilingual, including right-to-left scripts — with per-page metadata kept so every downstream step can cite an exact source.
Documents are split into structure-aware chunks that respect paragraph and line boundaries, embedded in batches, and written to the vector store alongside their metadata.
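A minimal sketch of structure-aware chunking, assuming paragraphs are separated by blank lines and that a character budget stands in for a token budget. The names `Chunk`, `chunk_page`, and `max_chars` are hypothetical, not taken from the project.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    page: int  # page number kept so the chunk can be cited later


def chunk_page(text: str, page: int, max_chars: int = 800) -> list[Chunk]:
    """Pack whole paragraphs into chunks of at most max_chars where possible."""
    units: list[str] = []
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_chars:
            units.append(para)  # a paragraph that fits is never split
        else:
            # An oversized paragraph falls back to its individual lines.
            units.extend(ln.strip() for ln in para.splitlines() if ln.strip())

    chunks: list[Chunk] = []
    current = ""
    for unit in units:
        candidate = f"{current}\n{unit}" if current else unit
        if current and len(candidate) > max_chars:
            chunks.append(Chunk(current, page))  # close the chunk at a boundary
            current = unit
        else:
            current = candidate
    if current:
        chunks.append(Chunk(current, page))
    return chunks
```

A single line longer than `max_chars` still becomes its own oversized chunk in this sketch; a real splitter would need a further fallback, for example on sentence boundaries.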
Before retrieval, each request is classified — a broad “summarise this” is served differently from a pointed question, so the pipeline doesn't run a needle-in-a-haystack search when the whole haystack is the answer.
Each query runs as hybrid search — semantic vectors balanced with keyword (BM25) — so domain terms and exact identifiers are found, not just things that sound similar. The top thirty candidates come back.
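One common way to balance the two signals is a weighted blend of normalised scores. The sketch below assumes that approach. In practice the vector store may do the fusion internally and differently, and `alpha`, `normalise`, and `hybrid_rank` are hypothetical names.

```python
def normalise(scores: dict[str, float]) -> dict[str, float]:
    """Min-max scale scores to [0, 1] so the two rankers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(
    vector_scores: dict[str, float],
    bm25_scores: dict[str, float],
    alpha: float = 0.5,  # 1.0 = pure semantic, 0.0 = pure keyword (assumed knob)
    top_k: int = 30,
) -> list[str]:
    """Blend semantic and keyword scores and return the top_k document ids."""
    vec, kw = normalise(vector_scores), normalise(bm25_scores)
    fused = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in vec.keys() | kw.keys()
    }
    # Sort by fused score, breaking ties by id so the order is deterministic.
    return sorted(fused, key=lambda d: (-fused[d], d))[:top_k]
```

A passage that scores moderately on both signals can outrank one that scores highly on only one, which is how an exact identifier match gets pulled up past passages that merely sound similar.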
Candidates are reordered by MMR to trade relevance against diversity, then graded for real relevance by the model — dropping passages that merely look related — until ten remain.
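The MMR step can be sketched in a few lines of plain Python. This is an illustration of the standard greedy algorithm, not the project's code. The `lam` weight and the function names are assumptions, and the model-graded relevance filter that follows MMR is not shown.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(
    query_vec: list[float],
    candidates: dict[str, list[float]],
    k: int = 10,
    lam: float = 0.7,  # 1.0 = relevance only, lower values favour diversity
) -> list[str]:
    """Greedily pick k ids, trading relevance against similarity to earlier picks."""
    remaining = dict(candidates)
    selected: list[str] = []

    def score(doc_id: str) -> float:
        relevance = cosine(query_vec, candidates[doc_id])
        # Redundancy is the closest match among passages already selected.
        redundancy = max(
            (cosine(candidates[doc_id], candidates[s]) for s in selected),
            default=0.0,
        )
        return lam * relevance - (1 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected
```

With two near-identical passages and one distinct passage, a lower `lam` picks the distinct one second, while `lam=1.0` degenerates to plain similarity order.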
The model answers from those passages plus recent conversation history, returns the source passages alongside the answer, and the run's token usage is recorded.
Heavy work and interactive work never share a lane.
A document upload can mean minutes of OCR; a chat question should feel instant. The two run in separate worker pools so the slow path can't starve the fast one — and the whole pipeline is wired to fail safely and queue under pressure rather than fall over.
OCR and document ingestion — CPU-bound, slow, and bursty. Low prefetch and long time limits so one big upload can't block anything else.
Work that's neither trivially fast nor heavyweight, kept on its own tier so it doesn't compete with interactive queries.
Classification, retrieval, and generation calls — I/O-bound work waiting on the model and the vector store. High prefetch keeps many requests in flight at once.
Ingestion and classification run in parallel and fan into a merge step; retrieval and generation then run in order. It is a chord feeding a chain, so independent work overlaps and dependent work waits.
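The shape of that workflow can be shown with the standard library alone. The sketch below uses a thread pool in place of a task queue, and every stage function is a hypothetical stub. Only the control flow (parallel header, merge, then sequential steps) is the point.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stage stubs; real stages would call OCR, the model, and the store.
def ingest(doc: str) -> dict:
    return {"chunks": [p for p in doc.split("\n\n") if p]}


def classify(question: str) -> dict:
    return {"route": "summary" if "summar" in question.lower() else "lookup"}


def merge(ingested: dict, routed: dict) -> dict:
    return {**ingested, **routed}


def retrieve(state: dict) -> dict:
    return {**state, "passages": state["chunks"][:10]}


def generate(state: dict) -> dict:
    count = len(state["passages"])
    return {
        "answer": f"{state['route']} answer from {count} passage(s)",
        "sources": state["passages"],
    }


def run_pipeline(doc: str, question: str) -> dict:
    # "Chord": independent tasks run in parallel, then a callback merges them.
    with ThreadPoolExecutor(max_workers=2) as pool:
        ingest_future = pool.submit(ingest, doc)
        classify_future = pool.submit(classify, question)
        state = merge(ingest_future.result(), classify_future.result())
    # "Chain": each dependent step waits for the previous step's output.
    for step in (retrieve, generate):
        state = step(state)
    return state
```

In a queue-based deployment the same shape would be expressed with the task framework's own workflow primitives, so the parallel branches can land on different worker pools.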
Each task has bounded retries with backoff and a priority, and a single failure handler marks the conversation failed instead of leaving it stuck half-done.
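A minimal sketch of bounded retries with exponential backoff and a single failure hook, assuming plain Python rather than any particular task framework. `run_with_retries`, `on_failure`, and the delay values are hypothetical, and task priority is left out.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    *,
    on_failure: Callable[[Exception], None],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
) -> T:
    """Run task, retrying up to max_retries times with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_retries:
                # Retries exhausted: record the failure once, then re-raise
                # so the caller never sees a half-done conversation.
                on_failure(exc)
                raise
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise AssertionError("unreachable")
```

The `on_failure` hook is where a conversation would be marked failed, for example by writing a status to the system of record.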
The HTTP layer is fully async to hold many concurrent connections cheaply; the workers run sync inside a process-based pool — the pragmatic fit for CPU-bound document work.
Bounded prefetch and per-pool concurrency mean a spike queues instead of melting the box — the queue absorbs the surge and workers drain it in priority order.
The model layer is an implementation detail.
Embeddings and generation sit behind one provider interface. The pipeline asks for an embedding or a completion; it never knows or cares which vendor answers. Today that's Gemini — tomorrow it can be another hosted provider or a model running on your own hardware, and nothing upstream changes. The single riskiest dependency in a RAG system — the AI vendor — becomes the easiest one to swap.
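A provider seam of this kind can be sketched with a `typing.Protocol`. Everything named here (`ModelProvider`, `FakeProvider`, `get_provider`, `answer`) is hypothetical and only illustrates the idea. A real adapter would wrap a vendor SDK behind the same two methods.

```python
from typing import Protocol


class ModelProvider(Protocol):
    """The only surface the pipeline is allowed to see."""

    def embed(self, texts: list[str]) -> list[list[float]]: ...

    def complete(self, prompt: str) -> str: ...


class FakeProvider:
    """Stand-in adapter; a vendor adapter would implement the same methods."""

    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t))] for t in texts]

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


# Switching vendors means registering another adapter and changing one name.
PROVIDERS: dict[str, type] = {"fake": FakeProvider}


def get_provider(name: str) -> ModelProvider:
    return PROVIDERS[name]()


def answer(provider: ModelProvider, question: str, passages: list[str]) -> str:
    # The pipeline depends on the protocol, never on a concrete vendor class.
    prompt = "\n".join(passages) + f"\n\nQuestion: {question}"
    return provider.complete(prompt)
```

Because `answer` takes the protocol type, the same call works unchanged whether the name in configuration points at a hosted provider or a self-hosted model.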
Third-party services, each behind a boundary.
External services do the heavy lifting — but each is reached through a seam the rest of the system can't see past, so any one of them can be replaced without a rewrite.
Embeddings and generation today — reached only through the provider interface, never called directly from the pipeline.
The same interface fronts other hosted providers or a self-hosted model; switching is configuration, not a rewrite.
Turns scanned and layout-heavy PDFs into clean, structured text and tables, across languages.
A self-hosted vector store doing the hybrid semantic + keyword search behind retrieval.
Durable storage for the original source documents the answers are grounded in.
Postgres as the system of record; Redis as the task broker and result backend.
You can answer “why was this answer slow, or wrong?”
A RAG answer is the end of a long chain of choices. When one comes out slow or off, the logs and metrics have to explain it — so observability went in from the first commit, not after the first incident.
Every service and task emits structured JSON logs with shared context — conversation id, stage, duration — shipped to a queryable store with dashboards.
Queue signals record when each task starts, succeeds, retries, or fails, with timing and outcome — so a stuck or slow stage is visible, not a mystery.
Each pipeline stage logs how long it took and, for model calls, how many tokens it cost — turning “why was this slow or expensive?” into a query.
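Per-stage timing with shared context can be sketched as a context manager that emits one JSON line per stage. The field names follow the ones mentioned above (conversation id, stage, duration), while `stage_timer` and the `tokens` field are assumptions for illustration.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("pipeline")


@contextmanager
def stage_timer(conversation_id: str, stage: str, **extra):
    """Time a pipeline stage and log one structured JSON record for it."""
    fields = {"conversation_id": conversation_id, "stage": stage, **extra}
    start = time.perf_counter()
    try:
        yield fields  # the stage can attach extra fields, e.g. token counts
        fields["outcome"] = "success"
    except Exception:
        fields["outcome"] = "failure"
        raise
    finally:
        fields["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
        logger.info(json.dumps(fields))
```

A stage would wrap its work in `with stage_timer("c-1", "generate") as fields:` and set `fields["tokens"]` after the model call, so slow or expensive stages can be found by filtering on `stage`, `duration_ms`, and `tokens`.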
A health endpoint and a build version make deploys and outages observable from the outside, not just from the logs.
The calls that shaped it.
The interesting part of a RAG backend isn't that it retrieves — it's the trade-offs. These are the ones I'd defend.
Pure vector search drifts on jargon and exact identifiers; blending in keyword search keeps it honest — and that matters most on multilingual, specialised documents.
Top-k by similarity isn't good enough. MMR removes near-duplicates and an LLM grade removes the merely-related, so the model reads ten strong passages, not thirty mediocre ones.
Heavy OCR and interactive queries live in separate worker pools on purpose, so a batch of big uploads can't make the chat feel broken.
An interface in front of embeddings and generation means no single vendor is load-bearing — the riskiest external dependency becomes a config line.
Structured logging and per-stage metrics went in from the start, because a RAG answer you can't trace is one you can't fix.
Every answer carries its sources. A passage that didn't make the cut can't be cited — grounding is enforced by the pipeline, not hoped for.
Have documents you wish you could just ask?
Tell me what you're sitting on, and I'll come back with whether retrieval is the right fit and what a first step looks like.