What RAG Is and Why AI Apps Without It Stay Toys

Rustam Atai · 8 min read

When teams first connect an LLM (large language model) to a product, it often feels like the hard part is over. The model can answer questions, rewrite text, draft content, and carry a conversation. But as soon as that product meets real business workflows, an uncomfortable fact shows up: by itself, the model knows almost nothing about your contracts, internal policies, knowledge base, support tickets, code, product catalog, or newly uploaded documents.

That is why so many AI applications look impressive in demos and disappointing in production. They sound confident, but they cannot ground their answers in the company’s actual data. RAG did not emerge as a fashionable acronym. It emerged as a practical answer to that gap: how do you give a model access to the right data without retraining it and without manually rebuilding prompts every time?

Why an LLM without your data is worth very little

A base LLM has three structural weaknesses.

First, it has no access to an organization’s private data. It may explain general topics well, but it does not know what is written in your procurement PDF, your Confluence pages, your Notion workspace, your Zendesk tickets, or your contracts folder.

Second, even if the relevant information once existed on the open internet, the model is not required to remember it exactly. Without an external source of truth, it starts completing answers probabilistically rather than factually.

Third, model knowledge ages quickly. For applied AI systems, this is critical. Users are not asking abstract questions about physics. They are asking about the current pricing plan, the latest version of a policy, the status of an order, an internal rule, or the current API specification.

That is the core problem of applied AI: if the model is not connected to your data, it does not know the things the business actually cares about. NIST (the U.S. National Institute of Standards and Technology) makes the same point in its Generative AI profile: for these systems, the issue is not just model intelligence, but also risk management, source reliability, and trustworthy outputs. (1)

What RAG is in one paragraph

RAG stands for Retrieval-Augmented Generation. First, the system retrieves relevant fragments from your data. Then it injects those fragments into the model’s context. Only after that does the LLM generate an answer. In other words, the model is not answering “from memory.” It is answering on top of retrieved evidence. (2)

In a good RAG application, the real value is not the LLM call itself. The value is that the search step was designed properly before the model ever sees the prompt: what should be searched, where it should be searched, how irrelevant noise gets filtered out, how many chunks should be passed to the model, and how the user can see what sources the answer relied on.

Embeddings and vector databases, without the hype

To make RAG work, text is usually converted into embeddings, numerical vectors that capture the meaning of a phrase or paragraph. That makes it possible to compare a user’s query not only by literal word overlap, but also by semantic similarity. OpenAI’s Retrieval documentation explicitly shows that the most relevant result may share few or no keywords with the original query. (2)
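
To make that concrete, here is a minimal sketch of comparing a query to documents by meaning rather than by keywords. It assumes an OpenAI embedding model and an API key in the environment; the model name and example texts are illustrative, not a recommendation.

```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

texts = [
    "Refunds are processed within 14 days of the return request.",
    "Our office is closed on public holidays.",
]
query = "How long does it take to get my money back?"

# Embed the documents and the query with the same model.
resp = client.embeddings.create(model="text-embedding-3-small", input=texts + [query])
vectors = [np.array(item.embedding) for item in resp.data]
doc_vecs, query_vec = vectors[:-1], vectors[-1]

def cosine(a, b):
    # Cosine similarity: higher means closer in meaning, even with little word overlap.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for text, vec in zip(texts, doc_vecs):
    print(f"{cosine(query_vec, vec):.3f}  {text}")
```

Note that the refund sentence should score highest even though it shares almost no words with the query, which is exactly the point of semantic search.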

A vector database stores those representations and can quickly find nearby fragments. It is not a “magic memory for AI.” It is a specialized index for semantic search. In OpenAI’s hosted setup, a vector store automatically chunks, embeds, and indexes uploaded files. In Chroma, you can build the same logic locally or inside your own application, even with a custom embedding function. (2, 11)
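
For the local case, a minimal Chroma sketch looks roughly like this. Collection names, IDs, and metadata fields are made up for illustration; by default Chroma uses its built-in embedding function, and you would swap in your own for production.

```python
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to keep data on disk
collection = client.create_collection("policies")  # built-in embedding function by default

# Index a few chunks with metadata that can later be used for filtering.
collection.add(
    ids=["refund-1", "vacation-1"],
    documents=[
        "Refunds are processed within 14 days of the return request.",
        "Employees accrue 20 vacation days per year.",
    ],
    metadatas=[
        {"doc_type": "customer_policy", "tenant": "acme"},
        {"doc_type": "hr_policy", "tenant": "acme"},
    ],
)

# Semantic query, restricted by metadata so HR content never reaches a customer-facing bot.
results = collection.query(
    query_texts=["how long do refunds take?"],
    n_results=2,
    where={"doc_type": "customer_policy"},
)
print(results["documents"])
```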

But the role of the vector database is easy to overstate. It does not replace your primary transactional database, it does not solve access control, and it does not make retrieval high-quality by default. If your document parsing is bad, your chunks are weak, and you cannot filter by tenant or document type, expensive infrastructure will not rescue the system.

What a production RAG application actually looks like

In production, RAG almost never looks like “upload a PDF, ask a question, get an answer.” A real system usually has at least five layers: ingestion, parsing, chunking, retrieval, and generation.

Indexing path: Source Docs → Parse and Chunk → Embed Chunks → Vector DB, plus a BM25 Index for keyword search.

Query path: User Query → Embed Query → Hybrid Retrieval over the Vector DB and BM25 Index → Cross-Encoder Reranker → Augmented Prompt → LLM → Grounded Answer with Citations.

First, documents have to be turned into something searchable. Google’s Vertex AI RAG Engine describes this as a separate lifecycle: ingestion from local files, Cloud Storage, or Google Drive, then transformation, then indexing into a corpus. (7) That is an important signal in itself: the hardest part starts before the model.

Then comes chunking, meaning splitting a document into smaller fragments. Documents are almost never searched as a whole. They are broken into smaller units so retrieval can work more precisely. But this is also where many systems fail. If a chunk is too small, it loses context. If it is too large, you get topic soup: something that looks related, but is too broad to be useful. In OpenAI Retrieval, the default chunk size for vector stores is 800 tokens with 400 tokens of overlap, meaning neighboring chunks partially repeat each other. That is only a starting point, not a universal rule. (2)
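
As a reference point, here is what a naive token-window chunker with overlap looks like, using the tiktoken tokenizer. The 800/400 defaults mirror the OpenAI numbers above; real pipelines usually split on document structure (headings, paragraphs, tables) rather than raw token windows.

```python
import tiktoken

def chunk_text(text: str, chunk_tokens: int = 800, overlap_tokens: int = 400) -> list[str]:
    """Split text into token-based chunks where neighboring chunks share `overlap_tokens` tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_tokens]
        chunks.append(enc.decode(window))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```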

After that, you need more than raw vector search. In 2026, a strong production pattern is no longer dense-only retrieval, but a combination of semantic search and keyword search. OpenAI exposes semantic search, query rewriting, attribute filtering, and ranking options. Weaviate treats hybrid search as a first-class pattern, running vector search and BM25 (Best Matching 25, a classic ranking algorithm for keyword search) in parallel and then fusing the results. (2, 10)
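
Fusion itself is simple. Here is a dependency-free sketch of reciprocal rank fusion, one common way to merge a dense ranking and a BM25 ranking; it is close in spirit to rank-based fusion in engines like Weaviate, while score-based fusion normalizes and blends raw scores instead. The chunk IDs are placeholders.

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs: each hit contributes 1 / (k + rank) to its score."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Top IDs from dense (vector) search and from BM25, fused into one list.
dense_hits = ["c7", "c2", "c9", "c4"]
bm25_hits = ["c2", "c5", "c7", "c1"]
print(reciprocal_rank_fusion([dense_hits, bm25_hits]))  # c2 and c7 rise to the top
```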

The next layer is usually reranking, an extra step that reorders the retrieved fragments by relevance. This matters most when the first retrieval stage returns a broad candidate set. Anthropic’s Contextual Retrieval research points to the same production pattern: embeddings plus BM25 plus a reranker produce meaningfully fewer retrieval failures than embeddings alone. (5)
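
A reranking pass can be as small as this sketch, which uses an open cross-encoder from the sentence-transformers library; the model name is one common public choice, and any relevance-trained cross-encoder works the same way.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how long do refunds take?"
candidates = [
    "Refunds are processed within 14 days of the return request.",
    "Employees accrue 20 vacation days per year.",
    "Returned items must be unused and in original packaging.",
]

# The cross-encoder scores each (query, chunk) pair jointly, which is slower but more precise
# than comparing precomputed embeddings.
scores = reranker.predict([(query, c) for c in candidates])
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])
```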

Only after that does the retrieved context reach the LLM. If the product is serious, the output should include at least citations, source links, page-level references, or some clear trail showing where the answer came from. Otherwise the user gets polished text, but not something they can verify.

Chat with PDFs and knowledge bases: what changes in practice

Chat with PDF is often presented as the simplest RAG example. And for a demo, that is true enough: upload a file, split it into chunks, generate embeddings, find similar passages, pass them to the model. But in a real product, the hard parts show up immediately.

A PDF may be a scan. It may contain tables, columns, diagrams, footnotes, and broken reading order. The same paragraph, after poor parsing, turns into garbage that indexes perfectly and answers badly. That is why practical PDF RAG usually begins not with embeddings, but with reliable document extraction and structure recovery.

Corporate knowledge bases are even harder. Relevance is not the only issue. Permissions, freshness, and source quality matter just as much. One user should see their team’s internal runbooks, but not HR or legal documents. A single outdated file in the index can damage trust more than ten good answers can rebuild it.

That is why a working RAG system for a knowledge base almost always includes:

  • filtering by source, tenant, department, or document type;
  • reindexing when content changes;
  • logs of retrieved chunks;
  • offline evals, meaning test sets used to check retrieval and answer quality (see the sketch after this list);
  • an explicit fallback strategy when relevant evidence is not found.
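
To make the evals bullet concrete, here is a minimal retrieval-recall check. The labeled questions and the `retrieve(query, k)` function are placeholders for your own test set and pipeline.

```python
# A tiny hand-labeled eval set: each question lists the chunk IDs that should be retrieved.
EVAL_SET = [
    {"question": "how long do refunds take?", "relevant_ids": {"refund-1"}},
    {"question": "how many vacation days do employees get?", "relevant_ids": {"vacation-1"}},
]

def recall_at_k(retrieve, k: int = 5) -> float:
    """Share of eval questions where at least one relevant chunk appears in the top-k results."""
    hits = 0
    for case in EVAL_SET:
        retrieved_ids = {hit["id"] for hit in retrieve(case["question"], k)}
        if retrieved_ids & case["relevant_ids"]:
            hits += 1
    return hits / len(EVAL_SET)

# Usage: print(recall_at_k(my_pipeline.retrieve))  # re-run after every chunking or index change
```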

That is also why, in applied AI, the real work begins exactly where the demo ends.

Tools in 2026: Pinecone, Weaviate, Chroma, Qdrant, pgvector

Tool choice should be driven by the retrieval layer you are actually building, not by fashion.

Pinecone remains a strong managed option for teams that do not want to operate a separate search stack. Pinecone’s documentation emphasizes its serverless architecture, the separation between control plane and data plane, independent scaling for read and write paths, and the use of namespaces. That makes it attractive when you want a predictable managed service with minimal ops overhead. (9)

Weaviate is strong when you want hybrid search built in rather than bolted on later. In its documentation, hybrid search is not a workaround but a core pattern: vector search and BM25 run in parallel, then results are fused through relativeScoreFusion or rankedFusion. You can also tune alpha, which controls the balance between semantic and keyword signals. For knowledge-heavy systems, that is a very practical compromise between recall and precision. (10)
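
With the v4 Python client, a hybrid query is roughly one call; this sketch assumes a local Weaviate instance and an existing collection named "Docs", both of which are placeholders.

```python
import weaviate

client = weaviate.connect_to_local()
docs = client.collections.get("Docs")

# alpha=0 is pure BM25, alpha=1 is pure vector search; values in between blend both signals.
response = docs.query.hybrid(
    query="how long do refunds take?",
    alpha=0.5,
    limit=5,
)
for obj in response.objects:
    print(obj.properties)

client.close()
```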

Chroma is a convenient lightweight starting point, especially for prototypes, evaluation pipelines, and smaller applications. It lets you use a built-in embedding function based on all-MiniLM-L6-v2, and it can quickly connect to OpenAI, Google, Mistral, Ollama, Jina, and other providers when needed. It is a good way to stand up a retrieval layer quickly without heavy infrastructure. But teams tend to evaluate it much more strictly for high-load multi-tenant production use. (11)

Qdrant is often chosen when metadata filtering, hybrid search, and an independently evolving retrieval layer become the main priorities. It is telling that even in Qdrant’s own comparison with pgvector, the argument is not about “magic speed,” but about filterable HNSW (Hierarchical Navigable Small World, a popular index structure for approximate nearest-neighbor search), hybrid search, and scenarios where vectors become a first-class layer rather than just another field inside Postgres. (12)

pgvector still makes sense as a starting point if you already run Postgres, your corpus is relatively small, and your search logic is tightly coupled to relational data. But the practical rule is simple: if you expect heavy filtering, BM25 or hybrid retrieval, growing embedding volume, and a retrieval pipeline that will evolve independently, you should design for a possible move to a dedicated vector store before that migration becomes painful. (12)

Managed RAG from model vendors: OpenAI File Search and Vertex AI RAG Engine

One of the clearest shifts in recent years is hosted RAG directly from model vendors. That matters because vendors are no longer saying, “Here is the LLM, build everything else yourself.” They are gradually absorbing the retrieval layer too.

OpenAI offers file_search and vector stores for this purpose. You can upload files into a knowledge base, let the service chunk, embed, and index them automatically, then enable file search as a tool in the Responses API. You can limit result counts, apply metadata filters, rewrite queries, tune ranking options, and return citations in the answer. For many internal assistants, that creates a very fast path to production. (2, 4)
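
In code, the hosted path looks roughly like this sketch, assuming files have already been uploaded into a vector store; the vector store ID and model name are placeholders.

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4o-mini",
    input="What does our refund policy say about opened items?",
    tools=[{
        "type": "file_search",
        "vector_store_ids": ["vs_example_123"],
        "max_num_results": 5,
    }],
    # Also return the retrieved chunks, not just the final answer, so results can be inspected.
    include=["file_search_call.results"],
)
print(response.output_text)
```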

Google plays a similar role with Vertex AI RAG Engine. It is described as a data framework for context-augmented LLM applications: ingestion, transformation, indexing, retrieval, and generation. Google also puts special emphasis on grounding, meaning tying the model’s answer to verifiable sources, and on the Check grounding API, which helps verify that the answer is actually supported by the retrieved facts. (7)

The trade-off is obvious: less control over the retrieval pipeline, more platform dependency, and less freedom to tune parsing, indexing, and reranking in depth. But for many teams, that is still a perfectly rational trade if the goal is a useful application rather than a bespoke research stack.

What OpenAI, Anthropic, and Google actually recommend

If you ignore the marketing around RAG and look at what the vendors themselves recommend, the picture is surprisingly practical.

OpenAI’s position is basically that retrieval should be treated like a search system, not like “magic glue for an LLM.” Their documentation includes query rewriting, attribute filtering, ranker and score-threshold tuning, result limits, and batch file ingestion for higher throughput. That is a very engineering-heavy, not romantic, view of RAG. (2)

Anthropic goes even further and points directly at where classic RAG breaks down. Their argument is simple: a chunk without context often loses its connection to the source document. That is why they propose Contextual Retrieval: prepend a short situating explanation to each chunk before embedding it and before building the BM25 index. According to Anthropic, contextual embeddings reduce retrieval failures by 35%, contextual embeddings plus contextual BM25 reduce them by 49%, and the full stack with reranking reduces them by 67%. (5)
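
A bare-bones version of that idea fits in a few lines. This is a sketch, not Anthropic's exact implementation: the prompt paraphrases the one in their post, the model name is an assumption, and in practice they recommend prompt caching of the full document so the per-chunk calls stay cheap.

```python
import anthropic

client = anthropic.Anthropic()

CONTEXT_PROMPT = """<document>
{doc}
</document>
Here is a chunk from that document:
<chunk>
{chunk}
</chunk>
Write one short sentence situating this chunk within the overall document, to improve search
retrieval of the chunk. Answer with only that sentence."""

def contextualize(doc: str, chunk: str, model: str = "claude-3-5-haiku-latest") -> str:
    """Ask the model for a one-sentence context, then prepend it to the chunk before indexing."""
    msg = client.messages.create(
        model=model,
        max_tokens=100,
        messages=[{"role": "user", "content": CONTEXT_PROMPT.format(doc=doc, chunk=chunk)}],
    )
    return msg.content[0].text + "\n" + chunk
```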

Google, meanwhile, is very consistent about grounding. In practice, that means retrieval is not just there to “improve the answer.” It is there to tie generation to verifiable sources. That may be the healthiest framing of the whole RAG layer: retrieval matters not because it makes the text sound smarter, but because it makes the output more checkable. (7)

When RAG is overkill: long context vs retrieval

By 2026, you cannot discuss RAG seriously without discussing long context, meaning a very large model context window. Anthropic officially rolled out 1M tokens of context for Claude Sonnet 4, and in the Contextual Retrieval research the company also makes a separate point: if your knowledge base is smaller than 200,000 tokens, roughly 500 pages, it may be simpler to skip RAG entirely and put the whole corpus directly into the prompt. With prompt caching, meaning reusing a previously prepared large context, that can reduce latency by more than 2x and cut costs by up to 90%. (5)
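
For a small, stable corpus, the long-context alternative is almost embarrassingly simple: put the whole corpus in the system prompt and mark it for caching. A minimal sketch with the Anthropic SDK follows; the filename is a placeholder and the model ID is illustrative, so check the current docs before relying on it.

```python
import anthropic

client = anthropic.Anthropic()

with open("full_corpus.md") as f:  # a small, stable corpus
    corpus = f.read()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": "Answer only from the reference material below.\n\n" + corpus,
        # Mark the large, stable part of the prompt for caching so repeated questions reuse it.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "What does the latest pricing plan include?"}],
)
print(response.content[0].text)
```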

That is an important correction for the industry. Not every document set needs embeddings, a vector database, and a dedicated retrieval pipeline. If the corpus is small, stable, and rarely updated, long context may be simpler, more reliable, and cheaper to maintain.

But the other side is just as clear. As soon as the data grows, changes frequently, requires permissions, needs metadata filtering, spans multiple sources, or comes with latency constraints, long context stops being a universal answer. Stuffing the full corpus into every request is expensive, slow, and operationally awkward.

The practical rule looks like this:

  • small, stable corpus: test long context first;
  • large or fast-changing corpus: you will probably need RAG;
  • complex reasoning over a large corpus: the usual answer is hybrid, retrieval first and long context second.

RAG is a discipline, not a library

The biggest misconception around RAG sounds like this: “We’ll add Pinecone, Weaviate, or Chroma, and suddenly the AI will speak from our data.” In practice, it is almost the opposite. The vector database is important, but it is secondary. Answer quality usually breaks because of weak parsing, bad chunking, missing hybrid search, a weak reranker, dirty source data, and the total absence of evals, meaning evaluation sets and tests for retrieval and answer quality.

That is why it is more useful to think about RAG not as a module, but as a discipline for working with organizational knowledge. Good RAG finds the right context, enforces access boundaries, shows sources, survives document updates, and can honestly say “I don’t know” when no evidence was found.

That is the sense in which RAG remains central to applied AI. Not because an LLM literally cannot answer without it. But because without retrieval, grounding, and source verification, most AI applications remain a polished demo rather than a working tool.
