Project · RAG backend

Production-Grade Document RAG Backend

I built the backend for a document-intelligence assistant: upload PDFs, ask questions in plain language, and get answers grounded in the source with the exact passages cited. Underneath is an async retrieval pipeline — OCR and layout parsing, hybrid search, two-stage reranking, then generation — orchestrated across dedicated worker pools, instrumented end to end, and built behind a provider interface so the AI model layer is swappable rather than load-bearing.

RAG · FastAPI · Celery · Weaviate · PostgreSQL · Observability · Python

Many documents, many fragments — one grounded, cited answer.

Isolated worker pools

30 → 10

Passages retrieved → passages reranked through to the model

2-stage

Reranking — MMR + LLM grading

The brief

Ask questions of your own documents — and trust the answer.

Organisations sit on thousands of documents — contracts, reports, scanned forms in more than one language, with tables that carry half the meaning — and no good way to ask questions across them. A generic chatbot is a non-starter: an answer you can't trace back to a source can't be trusted, and a backend that hard-wires one AI provider is a liability the day that provider's price, latency, or availability changes. The hard part isn't calling a model; it's turning messy real-world documents into grounded, cited answers, keeping a heavy multi-stage pipeline responsive under load, and being able to see exactly why any given answer came out the way it did.

The architecture

One async pipeline, from upload to cited answer.

An async API takes the upload and the question, then hands the work to a queue. A multi-stage pipeline ingests and indexes the document, retrieves and reranks the right passages, and generates a grounded answer — reading and writing a small set of stores underneath.

Upload + question: PDF · prompt, via FastAPI (async)

Async pipeline: task queue + workers, ingest ∥ classify → merge

Retrieve + rerank: hybrid → MMR → grade, 30 → 10 passages

Grounded answer: with cited sources, saved to history

System of record: PostgreSQL (chats · messages · documents)

Vector store: Weaviate (hybrid: vector + BM25)

Broker + results: Redis (Celery queues)

Object storage: Google Cloud Storage (source documents)

The RAG pipeline

Six stages, each with a clean input and output.

The retrieval path is deliberately staged: recover the text, index it, route the query, retrieve broadly, rerank hard, then answer from what survives. Each stage does one job and hands off to the next.

ingest Ingestion & OCR

PDFs go through OCR to recover clean text and tables — multilingual, including right-to-left scripts — with per-page metadata kept so every downstream step can cite an exact source.

Raw PDF → Text + tables + metadata

Mistral OCR

index Chunk & embed

Documents are split into structure-aware chunks that respect paragraph and line boundaries, embedded in batches, and written to the vector store alongside their metadata.

Document text → Embedded chunks

~1,000-char chunks · batched
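As an illustration of structure-aware chunking, here is a minimal sketch that packs whole paragraphs into chunks of roughly 1,000 characters, falls back to line boundaries for oversized paragraphs, and only hard-cuts a line that is itself too long. The function names `chunk_text` and `batched` and the greedy packing strategy are assumptions made for this example, not the project's actual code.

```python
def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Greedy packer: paragraphs first, then lines, then hard cuts."""
    chunks: list[str] = []
    current = ""

    def flush() -> None:
        nonlocal current
        if current:
            chunks.append(current)
            current = ""

    def add(piece: str, sep: str) -> None:
        # Append the piece to the open chunk if it fits, else start a new one.
        nonlocal current
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            flush()
            current = piece

    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if len(para) <= max_chars:
            add(para, "\n\n")
            continue
        # Oversized paragraph: fall back to line boundaries.
        for line in para.split("\n"):
            while len(line) > max_chars:
                # A single line longer than the limit is cut as a last resort.
                flush()
                chunks.append(line[:max_chars])
                line = line[max_chars:]
            if line:
                add(line, "\n")
    flush()
    return chunks


def batched(items: list, size: int):
    """Yield fixed-size slices so chunks can be embedded in batches."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch from `batched(chunks, size)` would then go to the embedding call in one request; the batch size is a tuning choice that depends on the provider's limits.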

classify Query routing

Before retrieval, each request is classified — a broad “summarise this” is served differently from a pointed question, so the pipeline doesn't run a needle-in-a-haystack search when the whole haystack is the answer.

A request → A route

summarise vs. QA

retrieve Hybrid retrieval

Each query runs as hybrid search — semantic vectors balanced with keyword (BM25) — so domain terms and exact identifiers are found, not just things that sound similar. The top thirty candidates come back.

A question → 30 candidates

vector + BM25, α = 0.5
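To make the α = 0.5 balance concrete, the sketch below fuses a vector score and a BM25 score per document after min-max normalising each list, with α weighting the vector side (α = 1 is pure vector, α = 0 is pure keyword). It illustrates the idea only: the function names are hypothetical, and the vector store performs its own fusion internally, which may differ in detail.

```python
def normalise(scores: dict[str, float]) -> dict[str, float]:
    """Min-max scale scores into [0, 1] so the two signals are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (value - lo) / (hi - lo) for doc, value in scores.items()}


def hybrid_rank(
    vector_scores: dict[str, float],
    bm25_scores: dict[str, float],
    alpha: float = 0.5,
    limit: int = 30,
) -> list[tuple[str, float]]:
    """Blend semantic and keyword scores, then keep the top `limit` documents."""
    vec = normalise(vector_scores)
    kw = normalise(bm25_scores)
    fused = {
        doc: alpha * vec.get(doc, 0.0) + (1 - alpha) * kw.get(doc, 0.0)
        for doc in vec.keys() | kw.keys()
    }
    # Highest fused score first; document id breaks ties deterministically.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))[:limit]
```

A document that matches an exact identifier but sits far away in embedding space still earns up to `1 - alpha` from the keyword side, which is why exact terms are not lost.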

rerank Two-stage rerank

Candidates are reordered by MMR to trade relevance against diversity, then graded for real relevance by the model — dropping passages that merely look related — until ten remain.

30 candidates → 10 strongest

MMR (λ = 0.8) + LLM grade
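The MMR stage can be sketched as a greedy loop: at each step, pick the candidate that maximises λ times its similarity to the query minus (1 − λ) times its highest similarity to anything already selected. The version below is a plain-Python illustration using cosine similarity; the function names and the dictionary of candidate vectors are assumptions for this example. At λ = 0.8 the selection leans strongly towards relevance, with diversity acting as a tie-breaker against near-duplicates.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(
    query_vec: list[float],
    candidates: dict[str, list[float]],
    k: int = 10,
    lam: float = 0.8,
) -> list[str]:
    """Return up to k candidate ids in maximal-marginal-relevance order."""
    relevance = {doc: cosine(query_vec, vec) for doc, vec in candidates.items()}
    remaining = list(candidates)
    selected: list[str] = []

    while remaining and len(selected) < k:
        def score(doc: str) -> float:
            # Penalise similarity to the closest passage already chosen.
            redundancy = max(
                (cosine(candidates[doc], candidates[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[doc] - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The LLM grading pass would then run over this ordering and drop passages judged not actually relevant.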

generate Grounded generation

The model answers from those passages plus recent conversation history, returns the source passages alongside the answer, and the run's token usage is recorded.

Context + history → Cited answer

via the provider interface

Async orchestration

Heavy work and interactive work never share a lane.

A document upload can mean minutes of OCR; a chat question should feel instant. The two run in separate worker pools so the slow path can't starve the fast one — and the whole pipeline is wired to fail safely and queue under pressure rather than fall over.

Heavy pool

OCR and document ingestion — CPU-bound, slow, and bursty. Low prefetch and long time limits so one big upload can't block anything else.

Medium pool

Work that's neither trivially fast nor heavyweight, kept on its own tier so it doesn't compete with interactive queries.

Light pool

Classification, retrieval, and generation calls — I/O-bound work waiting on the model and the vector store. High prefetch keeps many requests in flight at once.
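One way to express the three tiers is as per-pool worker settings plus a task-to-queue routing table. In the sketch below every number, queue name, and task name is a hypothetical placeholder; the command-line flags (`-Q`, `--concurrency`, `--prefetch-multiplier`, `--time-limit`) are standard Celery worker options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    queue: str
    concurrency: int          # worker processes in this pool
    prefetch_multiplier: int  # tasks reserved per process
    time_limit_s: int         # hard limit before a task is killed


# Illustrative values only: low prefetch and long limits for heavy work,
# high prefetch and short limits for I/O-bound interactive work.
POOLS = {
    "heavy": Pool("heavy", concurrency=2, prefetch_multiplier=1, time_limit_s=1800),
    "medium": Pool("medium", concurrency=4, prefetch_multiplier=2, time_limit_s=300),
    "light": Pool("light", concurrency=8, prefetch_multiplier=8, time_limit_s=60),
}

# Same shape as Celery's `task_routes` setting; the task names are made up.
TASK_ROUTES = {
    "tasks.ingest_document": {"queue": "heavy"},
    "tasks.classify_query": {"queue": "light"},
    "tasks.retrieve": {"queue": "light"},
    "tasks.generate": {"queue": "light"},
}


def worker_command(app: str, pool: Pool) -> list[str]:
    """Build the argv for one dedicated worker process group."""
    return [
        "celery", "-A", app, "worker",
        "-Q", pool.queue,
        f"--concurrency={pool.concurrency}",
        f"--prefetch-multiplier={pool.prefetch_multiplier}",
        f"--time-limit={pool.time_limit_s}",
    ]
```

Because each worker only consumes its own queue, a burst of uploads fills the heavy queue without touching the capacity reserved for chat requests.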

Parallel, then sequential

Ingest and classify run in parallel and fan into a merge step; retrieval and generation then run in order. In Celery terms it is a chord feeding a chain, so independent work overlaps and dependent work waits.
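The shape of that workflow, fan out, merge, then run in sequence, can be shown without a broker. The sketch below is a standard-library analogue using a thread pool, not Celery code: in the real system the same structure would be built from Celery's group, chord, and chain primitives. All stage functions here are toy stand-ins invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def ingest(document: str) -> list[str]:
    # Stand-in for OCR + chunking: split on blank lines.
    return [part for part in document.split("\n\n") if part]


def classify(question: str) -> str:
    # Stand-in for the routing step.
    return "summarise" if question.lower().startswith("summarise") else "qa"


def merge(chunks: list[str], route: str) -> dict:
    # Fan-in: runs only once both parallel branches have finished.
    return {"chunks": chunks, "route": route}


def retrieve(state: dict) -> dict:
    return {**state, "passages": state["chunks"][:10]}


def generate(state: dict) -> str:
    return f"{state['route']}: {len(state['passages'])} passage(s)"


def run_pipeline(document: str, question: str) -> str:
    # Parallel part: the two branches do not depend on each other.
    with ThreadPoolExecutor(max_workers=2) as pool:
        chunks_future = pool.submit(ingest, document)
        route_future = pool.submit(classify, question)
        state = merge(chunks_future.result(), route_future.result())
    # Sequential part: each step needs the previous step's output.
    return generate(retrieve(state))
```

Calling `run_pipeline("alpha\n\nbeta", "Summarise this")` walks both branches, merges them, and then runs the dependent steps in order.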

Every stage fails safely

Each task has bounded retries with backoff and a priority, and a single failure handler marks the conversation failed instead of leaving it stuck half-done.
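The retry policy can be sketched independently of the task queue: retry a bounded number of times with exponentially growing, jittered delays, and call a single failure handler once the budget is spent. The helper name, defaults, and the `on_failure` hook below are assumptions for illustration; a task queue such as Celery provides equivalent behaviour through its own retry options.

```python
import random
import time


def run_with_retries(
    task,
    *,
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    on_failure=None,
    sleep=time.sleep,
):
    """Run `task()`; retry on any exception up to `max_retries` times."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_retries:
                # Retry budget exhausted: record the failure, then re-raise.
                if on_failure is not None:
                    on_failure(exc)
                raise
            # Exponential backoff capped at max_delay, with jitter so that
            # many failing tasks do not all retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
```

In the setting described above, `on_failure` is where the conversation would be marked as failed, so a request never stays stuck in a half-done state.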

Async API, sync workers

The HTTP layer is fully async to hold many concurrent connections cheaply; the workers run sync inside a process-based pool — the pragmatic fit for CPU-bound document work.

Back-pressure by design

Bounded prefetch and per-pool concurrency mean a spike queues instead of melting the box — the queue absorbs the surge and workers drain it in priority order.

The opinionated bet

The model layer is an implementation detail.

Embeddings and generation sit behind one provider interface. The pipeline asks for an embedding or a completion; it never knows or cares which vendor answers. Today that's Gemini — tomorrow it can be another hosted provider or a model running on your own hardware, and nothing upstream changes. The single riskiest dependency in a RAG system — the AI vendor — becomes the easiest one to swap.
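A provider seam of this kind can be sketched with a structural interface: the pipeline depends on two methods, and a registry maps a configuration value to a concrete class. Everything below is hypothetical, including the `ModelProvider` name, the method signatures, and the stub provider; a real adapter would wrap the vendor's SDK behind the same two methods.

```python
from typing import Protocol


class ModelProvider(Protocol):
    """The only surface the pipeline is allowed to depend on."""

    def embed(self, texts: list[str]) -> list[list[float]]: ...

    def complete(self, prompt: str) -> str: ...


class StubProvider:
    """Offline stand-in, useful for tests; makes no network calls."""

    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(text))] for text in texts]

    def complete(self, prompt: str) -> str:
        return "stub answer to: " + prompt.splitlines()[-1]


# Switching vendors means registering another class and changing one value.
PROVIDERS: dict[str, type] = {"stub": StubProvider}


def get_provider(name: str) -> ModelProvider:
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise ValueError(f"unknown provider: {name!r}") from None


def answer(provider: ModelProvider, question: str, passages: list[str]) -> dict:
    """Generate from the given passages and return them as the sources."""
    prompt = "\n".join(passages) + "\nQuestion: " + question
    return {"answer": provider.complete(prompt), "sources": passages}
```

Because `answer` only sees the protocol, the same call works against the stub in tests and against a hosted or self-hosted model in production.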

Integrations

Third-party services, each behind a boundary.

External services do the heavy lifting — but each is reached through a seam the rest of the system can't see past, so any one of them can be replaced without a rewrite.

Gemini

Embeddings and generation today — reached only through the provider interface, never called directly from the pipeline.

Any LLM provider

The same interface fronts other hosted providers or a self-hosted model; switching is configuration, not a rewrite.

Mistral OCR

Turns scanned and layout-heavy PDFs into clean, structured text and tables, across languages.

Weaviate

A self-hosted vector store doing the hybrid semantic + keyword search behind retrieval.

Google Cloud Storage

Durable storage for the original source documents the answers are grounded in.

PostgreSQL & Redis

Postgres as the system of record; Redis as the task broker and result backend.

Observability

You can answer “why was this answer slow, or wrong?”

A RAG answer is the end of a long chain of choices. When one comes out slow or off, the logs and metrics have to explain it — so observability went in from the first commit, not after the first incident.

Structured logs, not prints

Every service and task emits structured JSON logs with shared context — conversation id, stage, duration — shipped to a queryable store with dashboards.

Every task is instrumented

Queue signals record when each task starts, succeeds, retries, or fails, with timing and outcome — so a stuck or slow stage is visible, not a mystery.

Per-stage timing & tokens

Each pipeline stage logs how long it took and, for model calls, how many tokens it cost — turning “why was this slow or expensive?” into a query.
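A minimal sketch of per-stage structured logging, using only the standard library: a formatter that emits one JSON object per record, and a context manager that times a stage and attaches shared context. The field names (`conversation_id`, `stage`, `duration_ms`, `tokens`) and the helper names are assumptions for this example, not the project's actual schema.

```python
import json
import logging
import time
from contextlib import contextmanager


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        # Fields passed via `extra={"context": {...}}` land on the record.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    handler = logging.StreamHandler(stream)  # stream=None means stderr
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False
    return logger


@contextmanager
def stage_timer(logger: logging.Logger, conversation_id: str, stage: str):
    """Time one pipeline stage and log it, even if the stage raises."""
    context = {"conversation_id": conversation_id, "stage": stage}
    start = time.perf_counter()
    try:
        # The caller may add fields, for example context["tokens"] = 512.
        yield context
    finally:
        context["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
        logger.info("stage finished", extra={"context": context})
```

Wrapping each stage in `with stage_timer(logger, conversation_id, "retrieve") as ctx:` produces one queryable line per stage, so slow or expensive stages can be found by filtering on `stage` and sorting by `duration_ms` or `tokens`.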

Health checks & versioning

A health endpoint and a build version make deploys and outages observable from the outside, not just from the logs.

Decisions

The calls that shaped it.

The interesting part of a RAG backend isn't that it retrieves — it's the trade-offs. These are the ones I'd defend.

Hybrid retrieval by default

Pure vector search drifts on jargon and exact identifiers; blending in keyword search keeps it honest — and that matters most on multilingual, specialised documents.

Rerank in two stages

Top-k by similarity alone isn't good enough: it returns near-duplicates and passages that only look related. MMR pushes out the near-duplicates and an LLM grade removes the merely related, so the model reads ten strong passages, not thirty mediocre ones.

Isolate work by weight

Heavy OCR and interactive queries live in separate worker pools on purpose, so a batch of big uploads can't make the chat feel broken.

Make the model swappable

An interface in front of embeddings and generation means no single vendor is load-bearing — the riskiest external dependency becomes a config line.

Observability isn't optional

Structured logging and per-stage metrics went in from the start, because a RAG answer you can't trace is one you can't fix.

Citations are the contract

Every answer carries its sources. A passage that didn't make the cut can't be cited — grounding is enforced by the pipeline, not hoped for.

Have documents you wish you could just ask?

Tell me what you're sitting on, and I'll come back with whether retrieval is the right fit and what a first step looks like.

Get in touch