The standard RAG diagram has become liturgy. Embed. Chunk. Store in a vector database. Retrieve by cosine similarity. Stuff into context. Generate.

Reasonable default. Also, for most of the systems people actually build, the wrong architecture.

Not wrong in theory -- wrong in practice. Wrong in what it costs to operate, wrong in where it places intelligence, and wrong in what it asks you to maintain.

We built a document retrieval system for a Norwegian housing cooperative's board governance. The corpus: 48 documents, 3,800 text chunks, an 800-page legal reference, 86 board decisions across three years. The stack: SQLite FTS5, five MCP tools, zero embeddings, zero vector databases. Total footprint: 50 MB. Startup: under 100 milliseconds. Search: under 10 milliseconds.

It outperforms the vector alternative we evaluated. Not on benchmarks -- on the thing that matters: correct, cited answers to real questions, every time a user opens a session.

This is the architecture, the reasoning, and the limits.


I. The RAG Cost Stack

The canonical pipeline:

Documents --> Chunking --> Embedding Model --> Vector DB --> Top-K --> LLM --> Answer

Each box is a dependency. Each dependency has a cost:

  • Embedding model: 2-3 GB (PyTorch plus weights), GPU driver compatibility, CUDA version roulette
  • Vector database: 500 MB-2 GB, with its own failure modes -- index corruption, memory pressure, cold start
  • Chunking strategy: Weeks of tuning. Too small kills context, too large dilutes relevance
  • Embedding quality: Most models are English-first. Norwegian, Danish, Finnish get systematically worse results
  • Operational overhead: Embedding drift when models update, re-indexing costs, vector quality monitoring over time

For 50,000 multilingual documents, these costs pay for themselves. The vector database earns its keep.

But most teams aren't building that. They're building internal knowledge bases, compliance tools, board governance archives, legal research assistants, technical documentation search. Hundreds or low thousands of documents. Specific domains.

And the consumer at the end of the pipeline is an LLM.

That last point changes everything.


II. When the Consumer Is an LLM

Traditional search serves a human. One query, ten results, reformulate if nothing looks right. The system must be smart because the user has limited patience and no ability to process fifty results at once.

An LLM is a fundamentally different consumer.

It reformulates autonomously. If "varmtvannssituasjonen" (the hot-water situation) returns nothing, it tries "varmvannsberedere" (water heaters), then "varmtvann" (hot water), then "beredere" (heaters) -- in the same turn, without human intervention. A human needs semantic search to bridge vocabulary gaps. An LLM does this natively.

It synthesizes broadly. Feed it fifteen chunks from different documents and it weaves a coherent answer with citations. A human needs perfect ranking. An LLM needs broad recall.

It reasons over metadata. Give it dates, document types, decision numbers, and it filters, cross-references, and infers relationships that no embedding captures.

It tolerates noise. An irrelevant chunk in position three confuses a human. An LLM reads all ten results and discards what doesn't fit. The error tolerance is fundamentally different.

This leads to a counterintuitive design principle:

When the consumer is an LLM, optimize for speed, reliability, and recall breadth -- not ranking precision.
The LLM handles precision. You handle plumbing. Stop making the plumbing smart.

III. The Librarian Pattern

We call this the Librarian Pattern: a fast, reliable, intentionally simple retrieval layer. It finds documents quickly. It does not summarize, interpret, or validate. The LLM does all of that.

[Ingestion]
  Documents --> Extractors --> Chunks + Structured Metadata --> SQLite FTS5

[Runtime]
  LLM --> MCP Tool Call --> SQLite Query (<10ms) --> Cited Chunks --> LLM Reasoning

SQLite FTS5 handles full-text search with BM25 ranking. It ships with the SQLite build bundled in Python's standard library. No external database, no server process, no connection pool. Opens in milliseconds.
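The core of the pattern fits in a few lines. A minimal sketch (table and column names here are illustrative, not the project's actual schema) -- note that bm25() returns lower-is-better scores, so ascending order puts the best match first:

```python
import sqlite3

# FTS5 ships with the SQLite bundled in CPython: no third-party packages.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE VIRTUAL TABLE chunks_fts USING fts5(doc_id, doc_type, text);
    INSERT INTO chunks_fts VALUES
        ('ref-12',  'reference', 'Maintenance obligations of the owner ...'),
        ('prot-03', 'protocol',  'Decision 14/2025: roof maintenance approved.'),
        ('prot-07', 'protocol',  'Budget revision for the coming year.');
""")

# BM25-ranked full-text search across all indexed columns.
rows = db.execute(
    "SELECT doc_id, bm25(chunks_fts) AS score "
    "FROM chunks_fts WHERE chunks_fts MATCH ? "
    "ORDER BY score",
    ("maintenance",),
).fetchall()
```

That is the entire runtime dependency surface: one stdlib module, one file on disk.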

Structured metadata tables filter by document type, date, category, author. Board meeting protocols carry rich YAML frontmatter: meeting dates, agenda item IDs, decision numbers, attendee lists, document status. At our corpus size, this metadata is more valuable for retrieval than vector similarity.
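The frontmatter lands in ordinary SQL tables. A hedged sketch of the ingestion side -- the frontmatter is a plain dict here (a real ingester would parse the YAML header), and the table and field names are assumptions, not the project's schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE decisions (
        id           TEXT PRIMARY KEY,
        meeting_date TEXT,
        description  TEXT,
        status       TEXT
    )
""")

# Stand-in for parsed YAML frontmatter from two board protocols.
frontmatter = [
    {"id": "14/2025", "meeting_date": "2025-03-12",
     "description": "Vedlikehold av tak (roof maintenance)", "status": "approved"},
    {"id": "02/2024", "meeting_date": "2024-01-20",
     "description": "Budget revision", "status": "approved"},
]
db.executemany(
    "INSERT INTO decisions VALUES (:id, :meeting_date, :description, :status)",
    frontmatter,
)

# "What did we decide about maintenance in 2025?" as a deterministic
# structured query -- no ranking, no embeddings.
hits = db.execute(
    "SELECT id FROM decisions "
    "WHERE description LIKE '%vedlikehold%' "
    "AND meeting_date BETWEEN '2025-01-01' AND '2025-12-31'"
).fetchall()
```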

MCP tools provide the interface between the LLM and search. The model calls search(query="maintenance obligations", doc_type="reference") and gets back JSON with cited passages.

Separate search intents are encoded in tool descriptions. The legal reference book answers forward-looking questions: "What does the law say about X?" Meeting protocols answer backward-looking ones: "What did we decide about X?" The LLM reaches for the right source naturally.

What we didn't build:

  • No embedding model (saved 2-3 GB of dependencies)
  • No vector database (saved operational complexity)
  • No semantic matching layer
  • No validation engine
  • No recommendation system
  • No confidence scoring

Every component we didn't build is a component that cannot break, cannot drift, and cannot surprise us at 2 AM.


IV. The Semantic Search Question

The obvious objection: keyword search misses semantic connections. "Heating problems" doesn't match "varmvannsberedere" (hot water heater). True -- when a human types one query and stares at results.

In the Librarian Pattern, the LLM bridges this gap naturally:

Turn 1: search("heating problems")    --> 0 results
Turn 2: search("varmvannsberedere")   --> 3 results
Turn 3: search("varmtvann")           --> 5 results
Turn 4: Synthesize all results with citations

One conversation turn. Forty milliseconds for three searches. Better coverage than a single semantic search -- because the LLM explores vocabulary with domain knowledge that no 384-dimensional embedding captures.

We added one enhancement: automatic OR fallback. When a multi-word AND query returns zero results, the engine retries with OR logic. Six lines of code. Queries that previously returned nothing now surface relevant sections.
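The fallback described above can be sketched as follows (function and column names are illustrative, not the project's actual code):

```python
import sqlite3

def search_with_fallback(db: sqlite3.Connection, query: str) -> list:
    # FTS5 treats bare multi-word queries as implicit AND; when that
    # returns nothing, retry the same terms joined with OR.
    sql = "SELECT doc_id FROM chunks_fts WHERE chunks_fts MATCH ?"
    rows = db.execute(sql, (query,)).fetchall()
    terms = query.split()
    if not rows and len(terms) > 1:
        rows = db.execute(sql, (" OR ".join(terms),)).fetchall()
    return rows

# Tiny corpus to demonstrate the behavior.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE VIRTUAL TABLE chunks_fts USING fts5(doc_id, text);
    INSERT INTO chunks_fts VALUES
        ('prot-11', 'Heating system inspection scheduled.'),
        ('prot-12', 'Budget revision for the coming year.');
""")

# 'heating pump' as implicit AND matches nothing; the OR retry
# surfaces the heating document.
hits = search_with_fallback(db, "heating pump")
```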

When you actually need vectors:

  • Corpus past around 10,000 documents, where BM25 ranking degrades
  • Systematic vocabulary mismatch the LLM can't bridge
  • Cross-lingual retrieval, where embeddings handle language-switching natively
  • Human-facing results, where ranking precision is critical

For everything else -- and this is most internal knowledge systems -- the Librarian wins.


V. Metadata as Architecture

Vector similarity treats all text as geometry. A board meeting decision and a legal requirement are both points in embedding space. The relationship between them -- that one is a decision and the other is its legal basis -- is invisible to cosine similarity.

Structured metadata makes these relationships queryable:

-- "What did we decide about maintenance in 2025?"
SELECT * FROM decisions
WHERE description LIKE '%vedlikehold%'
AND date BETWEEN '2025-01-01' AND '2025-12-31';

-- "What does the law say about maintenance obligations?"
SELECT * FROM chunks_fts
WHERE chunks_fts MATCH 'vedlikeholdsplikt'
AND doc_type = 'reference';

Precise, fast, deterministic. No probabilistic ranking, no temperature, no "the model sometimes retrieves the wrong thing." The LLM decides which query to run. The database executes faithfully.

Our board protocols carry frontmatter with meeting dates, attendee lists, decision IDs, document status. This is richer than any embedding. The LLM can answer "List all decisions from 2025 mentioning budget" in ten milliseconds. No vector database matches this precision on structured queries.


VI. The Numbers

Metric               Librarian (FTS5)   Standard RAG
Dependencies         ~50 MB             2-3 GB
Cold start           <100ms             5-15 seconds
Search latency       <10ms              50-200ms
Memory footprint     <50 MB             500 MB-1 GB
Background services  None               Embedding model + vector DB
Re-indexing          Seconds            Minutes to hours
Failure modes        Disk full          Embedding drift, index corruption, OOM

The startup difference isn't academic. An MCP server that takes fifteen seconds to start is a tool that gets disabled. One that starts in 100 milliseconds gets used every session.


VII. The Decision Framework

                    Small corpus (<5K)    Large corpus (>10K)
                   +--------------------+--------------------+
  LLM consumer     |   LIBRARIAN        |   HYBRID           |
                   |   FTS5 + metadata  |   FTS5 + vectors   |
                   +--------------------+--------------------+
  Human consumer   |   FACETED SEARCH   |   FULL RAG         |
                   |   FTS5 + UI        |   Standard pipeline|
                   +--------------------+--------------------+

The Librarian Pattern's upgrade path: add an embedding column to the same SQLite schema, merge BM25 and vector scores. One to two hours of work, not a rewrite.
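Under those assumptions, the score merge might look like the sketch below. The toy bag-of-words embed() stands in for a real embedding model, alpha is an untuned blend weight, and the vectors live in a dict to keep the example short (in the real upgrade they would sit in an extra SQLite column):

```python
import math
import sqlite3

VOCAB = ["heating", "system", "inspection", "budget", "revision", "year"]

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model: L2-normalized bag of words
    # over a fixed toy vocabulary.
    toks = text.lower().split()
    v = [float(toks.count(w)) for w in VOCAB]
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def hybrid_search(db, vectors, query: str, alpha: float = 0.5) -> list[str]:
    # bm25() is lower-is-better, so negate it before blending with
    # cosine similarity against the stored per-document vectors.
    bm25 = {doc: -score for doc, score in db.execute(
        "SELECT doc_id, bm25(chunks_fts) FROM chunks_fts "
        "WHERE chunks_fts MATCH ?", (query,))}
    q = embed(query)
    merged = {
        doc: alpha * bm25.get(doc, 0.0)
             + (1 - alpha) * sum(a * b for a, b in zip(q, v))
        for doc, v in vectors.items()
    }
    return sorted(merged, key=merged.get, reverse=True)

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE VIRTUAL TABLE chunks_fts USING fts5(doc_id, text);
    INSERT INTO chunks_fts VALUES
        ('prot-11', 'Heating system inspection scheduled.'),
        ('prot-12', 'Budget revision for the coming year.');
""")
vectors = {
    "prot-11": embed("Heating system inspection scheduled."),
    "prot-12": embed("Budget revision for the coming year."),
}

ranking = hybrid_search(db, vectors, "heating inspection")
```

The point of the sketch is the shape, not the weights: the FTS5 table, the schema, and the tools all stay; only one scoring term is added.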

Walk away from this pattern when your corpus passes 10K documents and keyword search returns too much noise. Walk away when cross-lingual retrieval is your primary use case. Walk away when humans -- not models -- consume results directly and need polished ranking on the first try.

For everything else, start with the Librarian. You can always add vectors later. You cannot easily remove them.


VIII. Where Intelligence Should Live

The RAG industry spent three years building increasingly sophisticated retrieval: re-rankers, hypothetical document embeddings, query expansion, multi-stage chains. Each layer adds intelligence to the pipeline.

But the LLM at the end is already the most capable reasoning engine available. Every intelligence layer you add upstream competes with -- and usually loses to -- the thing consuming the results.

The Librarian Pattern inverts this. Make the pipeline simple, fast, and reliable. Give the LLM clean data, good metadata, and fast tools. Let it reason.

This is the same architectural clarity that says "put business logic in the application layer, not the database" or "put rendering in the client, not the server." Each component does what it does best. The database stores and retrieves. The LLM reasons and synthesizes.

When you stop making your retrieval pipeline smart, you get something better: a system that is fast, predictable, debuggable, and cheap to operate. A system where the smartest component has full access to everything it needs, delivered in milliseconds, with no probabilistic intermediaries between the question and the answer.

A librarian. Not a colleague. Fast, reliable, and exactly as smart as it needs to be.


Production-validated. 48 documents, 3,800 chunks, 86 decisions, one 800-page legal reference. Zero embeddings. 50 MB total. Answers in under 100 milliseconds.