A client asked me to add “ask the docs” to a Rails app last quarter. The instinctive answer was Pinecone or Weaviate — a managed vector database, a second piece of infrastructure, another monthly bill, another SDK to keep up to date. I tried it for an afternoon and then deleted the gem. Postgres was already sitting there on the same box, idle most of the day, with the pgvector extension one CREATE EXTENSION away. This post is the writeup of what replaced the managed service: a single-box retrieval setup using pgvector for storage, ruby_llm for embeddings and chat, and roughly 180 lines of Ruby. I’ll cover what the architecture actually looks like, the numbers I’ve measured on production traffic, the three pitfalls that cost me a weekend each, and the threshold at which I’d give up and pay for Pinecone after all.

What “RAG on Postgres” actually means

Retrieval-augmented generation is two boring database operations dressed up in an acronym. You take a chunk of source text, ask an embedding model to turn it into a 1,536-dimensional float vector, and store it in a column. At query time you embed the user’s question the same way, find the closest stored vectors by cosine similarity, and stuff the matching text into the prompt before calling the LLM.

The “Postgres RAG pgvector” stack means the column is a vector(1536), the similarity search is ORDER BY embedding <=> $1 LIMIT k, and there is no second database. You get transactions, backups, replicas, and your existing ActiveRecord models for free. The cost is that you are stuck with whatever performance pgvector gives you on the hardware you have — which, on a Hetzner CPX21 with 50,000 chunks, has been entirely fine.

ruby_llm is the glue. It hands you a consistent interface over OpenAI, Anthropic, Gemini, and Ollama embeddings, so you can swap providers without rewriting the indexer. I covered the integration shape in dropping AI into an existing Rails app; this post assumes that scaffolding is already in place.

The mechanic, end to end

The whole system has four moving parts. None of them is clever. The cleverness, such as it is, is in not adding a fifth.

StageWhat it doesWhere it lives
ChunkerSplits source docs into ~500-token windows with 50-token overlapapp/services/rag/chunker.rb
EmbedderCalls RubyLLM.embed per chunk, stores vector(1536)Document#embed! model method
RetrieverORDER BY embedding <=> ?::vector LIMIT 8Document.nearest(query_vec) scope
ComposerBuilds prompt with retrieved chunks, streams chat completionRagChat#answer(question)

The retriever scope is the only part worth showing in code, because everyone overcomplicates it the first time:

scope :nearest, ->(vec, k: 8) {
  select("documents.*, embedding <=> '#{vec}'::vector AS distance")
    .order(Arel.sql("embedding <=> '#{vec}'::vector"))
    .limit(k)
}

The index that makes this fast is HNSW, not IVFFlat. The difference matters more than the documentation suggests:

CREATE INDEX ON documents
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

On 50,000 chunks, an HNSW query returns in 4-9ms. The same query against an IVFFlat index with lists = 100 returns in 35-60ms and degrades as the corpus grows. The build is slower — about 8 minutes for 50k rows on the CPX21 — but you build once and query forever. Recall has been good enough that I have not yet had a complaint that a relevant chunk was missed.

Three pitfalls that cost me weekends

The first was chunk size. I started at 1,000 tokens because that’s what every tutorial uses, and recall was terrible. The model would retrieve a chunk that contained the right answer buried in 800 tokens of unrelated context, and the LLM would latch onto the wrong sentence. Dropping to 500 tokens with a 50-token overlap roughly doubled the precision of answers, measured by a small eval set of 40 questions I’d written ground-truth answers for.

The second was the embedding model mismatch. I indexed with text-embedding-3-small (1,536 dims) and then, six weeks later, decided to try text-embedding-3-large (3,072 dims) without re-embedding the corpus. The query vectors and stored vectors were in different spaces. pgvector did not refuse — it happily returned nearest neighbors by Euclidean math that meant nothing. The retrieval looked plausible and was completely wrong. I now write the model name into a embedding_model column and refuse to query if it doesn’t match the configured model.

The database will do exactly what you asked. That is usually the problem.

— Postgres adage

The third was re-embedding on every save. My first version called Document#embed! in an after_save callback. Editing a typo in a 300-page manual triggered 600 embedding API calls. I now diff the chunked content and only re-embed chunks whose text actually changed. That cut my OpenAI bill from $40 a month to about $3.

Two weeks, start to “ask the docs” in production

The realistic timeline for a solo dev doing this on top of an existing Rails app is shorter than people expect, because most of the work is plumbing you already know.

Day 1-2: install pgvector on the production Postgres, write the migration, add the vector(1536) column and the embedding_model column, wire up ruby_llm with an OpenAI key. Run a single embed call from the Rails console and verify a round-trip.

Day 3-4: write the chunker. Mine is 30 lines of Ruby using a simple recursive split on headings, then paragraphs, then sentences, then a hard token cap. I avoided langchain.rb here because the abstraction it provides is not worth the dependency for a function this small.

Day 5-7: build the indexer. A Sidekiq job per document, each spawning sub-jobs per chunk. Throttle to stay under the embedding provider’s rate limit. Add the embedding_model guard. Add a reindex! rake task that wipes and rebuilds, because you will need it.

Day 8-9: write the retriever and a tiny evaluation harness. Forty questions with hand-written expected source chunks. Measure recall@8. Tune chunk size and k against that number, not against vibes.

Day 10-12: integrate the chat endpoint. ruby_llm’s streaming interface to whatever model the client wants — I default to Claude Haiku for cost, GPT-4o-mini as a fallback. The retrieved chunks go into a system prompt with explicit “cite the chunk number” instructions. Stream the answer to the browser with Turbo Streams.

Day 13-14: deploy, watch logs, fix the three things you missed. In my case: timezone bug in the indexer cron, an embedding API timeout that wasn’t retried, and a chunk that contained the word “deprecated” so often it poisoned every result containing it.

Definition of done

The system is done when three numbers hold. Recall@8 on the eval set is above 0.85 — meaning the relevant chunk is in the top eight retrieved at least 34 times out of 40. Median end-to-end latency from user question to first token of the streamed answer is under 1.5 seconds, measured at the application log layer. Monthly cost — embeddings plus chat completions plus the marginal Postgres storage — is under $20 for a corpus of 50,000 chunks and roughly 500 questions a day.

If any of those three break, the system needs work, regardless of how good a given answer feels. I do not trust subjective evaluation of RAG quality; the eval set is what tells me whether I shipped a regression.

When pgvector on one box is the wrong answer

If your corpus is heading north of two million chunks, you will outgrow HNSW on a single Postgres node before you outgrow your patience. If you need multi-tenant isolation with hard per-tenant query budgets, pgvector’s planner does not give you that knob. If your team already runs a vector DB for unrelated reasons, do not add Postgres to spite them. And if your embedding model changes weekly because you are still in the research phase, the rebuild cost is a real tax that managed services partially hide.

For everything else — and “everything else” covers nearly every B2B SaaS I have shipped — pgvector on the existing Postgres is the path of least regret.

A claim I will retract if proven wrong

The number of SaaS products that genuinely need a dedicated vector database in their first two years is, in my client experience, zero. Every “we picked Pinecone because we’ll scale” story I have audited turned out to be running under 100,000 vectors against a $70-a-month plan. Postgres with pgvector would have served the same queries in single-digit milliseconds on a box they were already paying for. If you can show me a production Rails app under three years old where the vector workload genuinely exceeds what a CPX31 with HNSW can serve, I will revise this post and credit you in it.