RAG AND KNOWLEDGE BASE SYSTEMS

AI that reads your docs before it answers.

RAG systems for internal knowledge, customer support, sales enablement, and product documentation. Built on your actual content. Every answer cited. No hallucinations dressed up as confidence.

QUICK PRIMER

What RAG is. In four steps.

RAG stands for Retrieval-Augmented Generation. The acronym is technical. The idea is simple. An AI model fetches the right passages from your documents before it answers, then composes an answer grounded in what it found. Here is the flow.

Question arrives.

Employee asks in Slack. Customer types into chat. Sales rep queries the assistant. The question comes in as natural language.

Retrieve relevant passages.

The system searches your indexed content. Vector similarity, plus keyword matching, plus reranking. Returns the most relevant passages from your actual docs.

Compose grounded answer.

The retrieved passages get handed to the language model along with the question. The model is instructed to write the answer using only what was retrieved, not its training data.

Answer with citations.

The user gets a clear answer with the source passages linked. They can click through to the original document. Trust comes from traceability.

WHAT RAG IS NOT

It is not a chatbot trained on your data. The model is not memorizing your documents. Nothing about your content goes into anyone's training set. Each query retrieves fresh from your actual indexed content, then the model answers based on what it sees in that moment.

WHAT RAG IS GOOD AT

Answering questions whose answers sit somewhere in your documents where nobody can find them. Customer support where the same questions repeat. Sales enablement where product information is scattered. Internal Q&A where employees keep asking managers what the wiki already says.

WHAT WE HAVE BUILT

Five RAG systems we ship most.

None of these come out of a box. Each one is built for the customer's content, stack, and audience. But the patterns repeat. If your use case looks like one of these, we already know what works and what to avoid.

Internal Q&A Assistant

An assistant inside Slack that answers employee questions about HR policy, IT procedures, product specs, and engineering runbooks. Pulls from your existing documentation and posts answers with linked sources. Reduces back-and-forth in shared channels and keeps managers from being asked the same five questions every week.

INDEXED ON

Notion, Confluence, Google Drive, GitHub Wiki

EXAMPLE QUERY

What is our remote work policy for international hires? I am asking for a candidate in Lisbon.

ANSWER

International remote hires require local employment compliance, typically handled through Deel or Remote.com. Approval needs sign-off from People Ops and the hiring manager. Standard process takes 3 to 5 business days. The full policy covers contractor versus EOR options.

Sources: remote-policy.md, international-hiring.md

Customer Support Self-Serve

A search and Q&A layer for your help center, product docs, and historical support tickets. Customers ask in plain language and get answers without filing a ticket. The system knows when it does not know and offers to create a ticket with the conversation already attached.

INDEXED ON

Intercom, Zendesk, help center, product docs

EXAMPLE QUERY

How do I export my data if my account is on the Starter plan?

ANSWER

On the Starter plan, you can export your data manually from Settings > Export. Bulk and scheduled exports require the Growth plan or higher. Exports run as CSV or JSON and are usually ready within 5 minutes. There is a 50,000 row limit on Starter exports.

Sources: data-export-guide.html, plan-comparison.html

Sales Enablement Assistant

An assistant for sales reps to query during and between calls. Product details, pricing, competitive positioning, objection handling, case studies by industry. Answers in seconds, with sources. Stops the team from misquoting product capabilities in front of prospects.

INDEXED ON

Notion, Salesforce, Highspot, product docs, recorded calls

EXAMPLE QUERY

How does our SOC 2 compliance compare to Acme Corp's, and what is our standard answer for enterprise security reviews?

ANSWER

We are SOC 2 Type 2 certified, audited annually by Drata. Acme is Type 1, audited every 18 months. For enterprise reviews, lead with our security overview document and the SOC 2 report. Standard turnaround for vendor security questionnaires is 5 business days, handled by Security.

Sources: security-overview.pdf, competitive-brief.md

Technical Documentation Search

A search layer for engineering teams over your internal code documentation, API specs, runbooks, and architecture decisions. Built into the developer environment, in Slack, or as a CLI tool. Returns code snippets, decision records, and links to the source files.

INDEXED ON

GitHub, GitLab, Confluence, internal wikis, ADRs

EXAMPLE QUERY

Show me how we handle webhook retries in the payments service.

ANSWER

Webhook retries in payments-service use exponential backoff with jitter, starting at 30 seconds and capping at 24 hours. Max retries is 16 over 72 hours. After that, webhooks move to the dead letter queue at retries:dead. Implementation is in workers/webhooks.ts.

Sources: ADR-014-webhook-retries.md, workers/webhooks.ts

Compliance and Policy Search

A retrieval system over your contracts, vendor agreements, compliance documentation, and legal precedent. Legal teams query in plain language and get back the exact clauses with document context. Built with strict access controls and audit logging.

INDEXED ON

DocuSign, contract repository, compliance docs, custom legal database

EXAMPLE QUERY

Find all vendor agreements that include data residency clauses requiring EU storage.

ANSWER

12 vendor agreements include EU data residency clauses. Most reference GDPR Article 44. Three include stricter clauses requiring data to stay within a specific member state. Two require deletion within 30 days of contract termination. Top vendors with these clauses are listed in order of contract value.

Sources: vendor-agreements/index, GDPR-summary.md

UNDER THE HOOD

Seven steps from question to cited answer.

The pipeline below is what makes a RAG system actually reliable. Skip any of these steps and you ship something that looks impressive in a demo and falls apart in production. We do all seven on every project.

Ingestion

We connect to your sources and pull the documents in. Notion pages, Confluence wikis, Google Drive files, support ticket archives, GitHub repos, custom databases. Each source has its own ingestion pipeline and update cadence.

TOOLS: Unstructured.io, LlamaHub, custom connectors

Chunking

Documents get split into passages that fit cleanly into the model's context window. Done with semantic chunking where structure matters, fixed-size where it does not, hierarchical where documents have nested sections. The chunking strategy is one of the most underrated decisions in RAG quality.

TOOLS: LangChain text splitters, Unstructured, custom chunkers
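To make the fixed-size option concrete, here is a minimal sketch of a word-window chunker with overlap. The function name chunk_words and the default sizes are illustrative assumptions, not values from any specific project. Production chunkers usually count tokens rather than words and respect headings and paragraph boundaries.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words; each window repeats the
    last `overlap` words of the previous one so context is not cut mid-thought."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once a window has reached the end of the text.
        if start + size >= len(words):
            break
    return chunks
```

The overlap is the design choice that matters here: without it, a sentence that straddles a chunk boundary is split across two passages and neither one retrieves well.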

Embedding

Each chunk gets converted into a high-dimensional vector that captures its semantic meaning. Same goes for the query at runtime. The embeddings are how the system finds 'related' content even when the wording is different.

TOOLS: OpenAI text-embedding-3, Voyage AI, Cohere Embed, sentence-transformers for self-hosted
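A small sketch of how "related" is measured once text has become vectors. The three-dimensional vectors and chunk names below are made up for illustration; real embedding models return hundreds or thousands of dimensions. Cosine similarity is one common measure, assumed here, not the only one.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the ids of the k chunks whose vectors are closest to the query."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]

# Hypothetical toy vectors standing in for real embeddings.
TOY_CHUNKS = {
    "pto-policy": [0.9, 0.1, 0.0],
    "vpn-setup": [0.0, 0.2, 0.9],
    "leave-faq": [0.8, 0.3, 0.1],
}
```

Calling top_k([1.0, 0.2, 0.0], TOY_CHUNKS, k=2) ranks the two leave-related chunks ahead of the VPN one, even though no keywords were compared.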

Storage

Embeddings get stored in a vector database alongside metadata for filtering (source, date, access permissions, document type). Choice of database depends on scale, cost, latency, and whether you need hybrid search.

TOOLS: Pinecone, Weaviate, Qdrant, pgvector for Postgres-native, Chroma for smaller scale

Retrieval

When a query comes in, we embed it, run similarity search against the vector store, optionally combine with keyword search (hybrid), and pull back the top-K most relevant chunks. The retrieval step is where most accuracy problems live.

TOOLS: Vector DB native retrieval, BM25 hybrid via Elasticsearch or Tantivy
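One common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below assumes RRF purely for illustration; the text above does not say which fusion method a given project uses. The constant k=60 is the value conventionally used with RRF, and the document ids are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
    """Merge several ranked id lists. Each list contributes 1 / (k + rank)
    per document, so items ranked well by multiple retrievers rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]
```

RRF only needs ranks, not raw scores, which is why it works across retrievers whose scores are not comparable (cosine similarity versus BM25).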

Reranking

The initial retrieval is fast but imperfect. A reranker re-scores the candidates against the original query using a more expensive model. The result is a tighter set of passages that actually answer the question, not just match the topic. This stage is the single biggest accuracy lever in RAG.

TOOLS: Cohere Rerank, Voyage Rerank, custom cross-encoder models
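The shape of the two-stage pattern can be sketched as follows. The overlap_score function is a deliberately crude stand-in for a real reranking model (it only counts shared words); in practice score_fn would call a hosted reranker or a cross-encoder. All names here are illustrative assumptions.

```python
from typing import Callable

def overlap_score(query: str, passage: str) -> float:
    """Placeholder scorer: fraction of query words that appear in the passage.
    A real reranker reads query and passage together and is far more accurate."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)

def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    keep: int = 3,
) -> list[str]:
    """Re-score first-stage candidates against the query and keep the best few."""
    return sorted(candidates, key=lambda p: score_fn(query, p), reverse=True)[:keep]
```

The first stage stays cheap and broad (for example, 50 candidates); the expensive scorer only ever sees that short list.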

Generation

The reranked passages get assembled into the model's context along with the original question and a system prompt that instructs the model to answer only from the provided context. The model writes the answer and cites which passages it used. If it cannot answer from the provided context, it says so.

TOOLS: Anthropic Claude, OpenAI GPT, Google Gemini, self-hosted Llama or Mistral
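A minimal sketch of the context assembly described above. The prompt wording, the build_prompt name, and the passage dictionary shape are assumptions for illustration; the returned system and user strings would be passed to whichever model API the project uses.

```python
def build_prompt(question: str, passages: list[dict]) -> dict:
    """Assemble a grounded prompt. Each passage is {"source": ..., "text": ...}.
    Passages are numbered so the model can cite them as [1], [2], ..."""
    context = "\n\n".join(
        f"[{i}] ({p['source']}) {p['text']}" for i, p in enumerate(passages, start=1)
    )
    system = (
        "Answer only from the numbered passages provided. "
        "Cite the passages you used, like [1]. "
        "If the passages do not contain the answer, reply: This is not in our docs."
    )
    user = f"Passages:\n{context}\n\nQuestion: {question}"
    return {"system": system, "user": user}
```

Numbering the passages is what makes citations checkable afterwards: each [n] in the answer maps back to a specific source document.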

WHERE THE KNOWLEDGE LIVES

We connect to whatever you already use.

Most companies have their knowledge scattered across six or seven systems. We do not ask you to migrate everything into a new one. We index what you have, keep it current, and add new sources as you go.

Documentation and wikis

Where most of your written knowledge already lives.

Notion, Confluence, Google Docs, GitBook, Coda, SharePoint, Tettra

Code and engineering docs

READMEs, ADRs, runbooks, and the wikis engineering actually maintains.

GitHub, GitLab, Bitbucket, internal wikis

Support and conversation history

Past tickets, resolutions, and macros that already have the answers.

Intercom, Zendesk, Help Scout, Front, HubSpot Service Hub

CRM and sales content

Account history, sales collateral, competitive intel.

Salesforce, HubSpot, Pipedrive, Highspot, Seismic

Files and shared drives

PDFs, slides, spreadsheets, and the messy reality of corporate file storage.

Google Drive, OneDrive, Dropbox, SharePoint, Box

Custom systems and databases

Internal apps, product databases, vertical software. Anything with an API or a query interface.

Postgres, MongoDB, Snowflake, REST APIs, custom connectors

If yours is not on this list, the answer is almost always yes anyway. The connector is the easy part.

HOW WE PICK THE STACK

The four decisions that shape every RAG project.

The tools that show up in every RAG agency's marketing are mostly interchangeable. The decisions that actually matter are about trade-offs. Here are the four we resolve on every project, and how we think about them.

Which vector database?

We tend to pick: Pinecone for managed scale, pgvector when you already use Postgres, Qdrant for self-hosted enterprise.

OPTIONS

Pinecone

Fully managed, fast, expensive at scale. Default choice for small and mid-size deployments.

Weaviate

Open source with hybrid search built in. Strong choice when you want self-hosted with native keyword + vector.

Qdrant

Open source, fast, good metadata filtering. Best for self-hosted at scale.

pgvector

Vector search inside Postgres. The right answer when you already run Postgres and do not need separate infrastructure.

Chroma

Lightweight, great for prototyping and smaller corpora. Less battle-tested at production scale.

Which embedding model?

We tend to pick: OpenAI text-embedding-3-large for managed deployments, Voyage when retrieval quality is critical, open weights for self-hosted.

OPTIONS

OpenAI text-embedding-3

Strong general-purpose embeddings. Easy to start. Cost-effective.

Voyage AI

Often outperforms OpenAI on domain-specific tasks. Worth testing on your actual corpus.

Cohere Embed

Good multilingual support, solid retrieval quality.

Open weights (BGE, Nomic, etc.)

Self-hosted embeddings when data cannot leave your infrastructure.

Do you need reranking?

We tend to pick: Almost always yes. Reranking is the single biggest accuracy lever.

OPTIONS

Cohere Rerank

Managed reranking via API. Easy to add, big quality lift on most corpora. Default choice.

Voyage Rerank

Strong alternative to Cohere, sometimes better on technical content.

Custom cross-encoder

Fine-tuned on your domain. Costly to build, but worth it when generic rerankers underperform.

No reranking

Acceptable for very small corpora or extremely simple use cases. Rare in practice.

Which language model for generation?

We tend to pick: Claude for reasoning depth, GPT for cost-sensitive deployments, open weights for self-hosted.

OPTIONS

Anthropic Claude

Strong instruction following, large context windows, excellent at staying grounded in retrieved context.

OpenAI GPT

Mature, cost-effective at scale, wide ecosystem support.

Google Gemini

Useful when other Google services are already in the stack.

Llama, Mistral, Qwen (self-hosted)

When data sovereignty or cost at scale demand on-prem deployment.

EVALUATION

How do you know it actually works?

The biggest difference between a RAG demo and a RAG product is whether anyone measured it. We run five evaluation layers on every project. The numbers are not perfect, but they are real. They tell you when the system gets worse so you can fix it before users notice.

Retrieval recall.

When the answer is somewhere in your documents, does the system actually retrieve it? Bad recall means the model never sees the right passage, so it cannot possibly produce the right answer. Recall is the foundation everything else sits on.

We build a golden dataset of question-and-passage pairs from your real content. Then we measure how often the right passage appears in the top 10 retrieved results. Target: above 90 percent before launch.
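The measurement itself is simple arithmetic. A minimal sketch, assuming the golden dataset is a list of question and passage_id pairs and that retrieve is any function returning ranked passage ids (both shapes are assumptions for illustration):

```python
from typing import Callable

def recall_at_k(
    golden: list[dict],
    retrieve: Callable[[str], list[str]],
    k: int = 10,
) -> float:
    """Fraction of golden questions whose expected passage appears
    among the first k retrieved passage ids."""
    if not golden:
        raise ValueError("golden dataset is empty")
    hits = sum(
        1 for example in golden
        if example["passage_id"] in retrieve(example["question"])[:k]
    )
    return hits / len(golden)
```

A result of 0.90 or higher on this function corresponds to the launch target stated above.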

Answer faithfulness.

When the system does answer, does the answer actually match the retrieved passages, or does the model invent things? Faithful answers cite what was actually retrieved. Unfaithful answers slip in plausible-sounding details that were not in the source.

We sample answers post-launch and check each claim in the answer against the cited sources. We also use automated faithfulness scoring from open evaluation libraries. Target: above 95 percent faithful.

Answer relevance.

Does the answer actually address what the user asked? A faithful answer can still be off-topic if the retrieval pulled in passages adjacent to the real question. Relevance catches the cases where the system answers a related question instead of the actual one.

Human review on sampled queries, plus automated relevance scoring. Target: above 92 percent relevant.

Refusal accuracy.

When the answer is not in your documents, does the system say so or does it make something up? Refusal accuracy is what separates a trustworthy RAG system from a confident liar. The system should know what it does not know.

We include adversarial questions in the golden dataset where the answer is not in the corpus. The system should refuse cleanly with a clear 'this is not in our docs' response. Target: above 98 percent refusal accuracy on out-of-scope questions.
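As a rough sketch, refusal accuracy is the share of out-of-scope questions that received a refusal. Detecting a refusal by matching a fixed phrase is a simplifying assumption made here; a real evaluation might use a structured flag on the response instead.

```python
REFUSAL_PHRASE = "this is not in our docs"  # assumed refusal wording

def refusal_accuracy(answers_to_out_of_scope: list[str]) -> float:
    """Given system answers to questions known to be outside the corpus,
    return the fraction that were clean refusals."""
    if not answers_to_out_of_scope:
        raise ValueError("no answers to score")
    refused = sum(
        1 for answer in answers_to_out_of_scope
        if REFUSAL_PHRASE in answer.lower()
    )
    return refused / len(answers_to_out_of_scope)
```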

Production drift.

Things change. Documents update. New product features ship. The system that was 95 percent accurate last month can be 80 percent accurate today if drift goes uncaught. Continuous evaluation catches drift before users do.

Automated evaluation runs weekly against the golden dataset. Real user queries get sampled for quality review monthly. Drift alerts fire when any metric drops more than 5 percentage points.
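The alert rule can be sketched in a few lines. Metrics are assumed to be stored as percentages keyed by name; the function and metric names are illustrative.

```python
def drift_alerts(
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop_points: float = 5.0,
) -> list[str]:
    """Return the names of metrics that fell by more than max_drop_points
    percentage points relative to the baseline run."""
    alerts = []
    for metric, base_value in baseline.items():
        # A metric missing from the current run is treated as 0 and will alert.
        drop = base_value - current.get(metric, 0.0)
        if drop > max_drop_points:
            alerts.append(metric)
    return alerts
```

Note the comparison is in percentage points, not relative percent: a fall from 93 to 87 is a 6-point drop and fires, while a fall from 90 to 85 is exactly 5 points and does not.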

OPERATIONAL CONCERNS

Where the data lives. How it stays current.

Two questions every serious buyer asks. Where does our content go, and how do we make sure the system reflects what is actually true today, not what was true six months ago.

SECURITY AND DATA

Self-hosted when it matters.

For regulated industries and sensitive content, we deploy the full RAG pipeline on your own infrastructure. Vector database, embedding models, language model, all running where you can audit them. No data leaves your environment.

Permission-aware retrieval.

Documents inherit your existing access controls. The retrieval system filters by user permissions at query time. If a user does not have access to a document, the system cannot retrieve from it, even if the content is technically indexed.
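A minimal sketch of query-time permission filtering, assuming each indexed chunk carries the set of groups allowed to read its source document. The Chunk shape and group names are hypothetical. In a real deployment this check is usually expressed as a metadata filter inside the vector database query, so restricted chunks are never scored at all.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset

def visible_chunks(chunks: list[Chunk], user_groups: set) -> list[Chunk]:
    """Keep only chunks whose source document is readable by at least
    one of the user's groups. Applied before similarity search."""
    return [c for c in chunks if c.allowed_groups & user_groups]
```

Filtering before retrieval, rather than after generation, is the safer order: a passage the user cannot read never reaches the model's context.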

PII redaction and audit logging.

Sensitive fields can be redacted from chunks before indexing. Every query and every retrieval gets logged with the user, timestamp, retrieved sources, and final answer. Logs live in your infrastructure.
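As an illustration of redaction before indexing, the sketch below masks email addresses with a regular expression. The pattern and the placeholder token are assumptions; real PII redaction covers many more field types (names, phone numbers, account ids) and often uses a dedicated detection library rather than a single regex.

```python
import re

# Simplified email pattern, for illustration only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact_emails(chunk_text: str) -> str:
    """Replace every email address in a chunk with a fixed placeholder
    so the address never reaches the index or the model."""
    return EMAIL_PATTERN.sub("[REDACTED_EMAIL]", chunk_text)
```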

Compliance-ready by design.

GDPR right-to-erasure flows for the documents in scope. HIPAA-compatible deployment patterns where required. SOC 2 considerations baked into the architecture. We are not your auditor, but we know what the auditors look for.

KEEPING IT FRESH

Incremental ingestion.

Documents do not get re-indexed from scratch every time. We track changes through each source's native change feeds (Notion's API, Google Drive's change tokens, Confluence webhooks) and update only what changed.
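Where a source has no change feed, a generic fallback is to compare content fingerprints. The sketch below shows that fallback, not the native change-feed integrations named above; the function names and dictionary shapes are assumptions for illustration.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Stable hash of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_updates(indexed: dict[str, str], current_docs: dict[str, str]) -> dict[str, list[str]]:
    """Decide what to re-index.
    indexed: doc_id -> fingerprint stored at the last indexing run.
    current_docs: doc_id -> current document text.
    Unchanged documents appear in neither list and are left alone."""
    to_upsert = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed.get(doc_id) != fingerprint(text)
    )
    to_delete = sorted(doc_id for doc_id in indexed if doc_id not in current_docs)
    return {"upsert": to_upsert, "delete": to_delete}
```

Only the documents in the upsert list need to be re-chunked and re-embedded, which is where most of the indexing cost sits.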

Configurable refresh cadence.

Some content needs to be current within seconds (support docs after a product change), some can refresh daily, some weekly. We set the cadence per source based on how often it actually changes.

Stale content alerts.

Documents that have not been updated in N months get flagged in the admin dashboard. Helps your team see what is rotting and decide whether to update or retire it. The system is only as good as the content it indexes.
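The flagging rule can be sketched as a date comparison. The six-month default stands in for the configurable N and is an arbitrary choice for this example; months are approximated as 30 days to keep the sketch dependency-free.

```python
from datetime import date, timedelta

def stale_documents(
    last_updated: dict[str, date],
    today: date,
    max_age_months: int = 6,
) -> list[str]:
    """Return ids of documents not updated within max_age_months,
    treating a month as 30 days."""
    cutoff = today - timedelta(days=30 * max_age_months)
    return sorted(
        doc_id for doc_id, updated in last_updated.items() if updated < cutoff
    )
```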

Version awareness.

When a document changes substantially, old answers that cited the previous version get flagged. Important when a customer asks a question and you want to make sure the answer reflects current policy, not last year's policy.

COMMON QUESTIONS

What people ask before they sign.

The questions we hear on most RAG discovery calls. Answered the way we actually answer them.

Q01

How long does a RAG project take to ship?

A basic RAG deployment on a clean corpus, one or two sources, single audience, lands in three to five weeks. More complex deployments involving multiple sources, permission inheritance, custom evaluation, or self-hosting run six to twelve weeks. The bottleneck is almost always the content (cleaning, deciding what to include, deciding what to exclude), not the building.

Q02

How is this different from buying an enterprise search tool?

Enterprise search tools (Glean, Coveo, Algolia) are good products. They are also generic by design. Custom RAG is the right choice when your content has structure those tools cannot model, when your retrieval needs domain-specific tuning, when you need tight integration with your own product, or when off-the-shelf pricing does not work for your scale. We will tell you when an off-the-shelf tool is actually the better answer.

Q03

What does it cost to build and run?

Build cost scales with scope. A single-source internal Q&A system is at the lower end of our pricing. Multi-source, permission-aware, evaluation-heavy systems cost more. Run cost has two components: model usage (per query, depends on model choice and volume) and infrastructure (vector database, embedding refreshes, monitoring). For a typical mid-size deployment handling a few thousand queries per day, monthly run cost lands in the low hundreds to low thousands of dollars.

Q04

Can we self-host the whole stack?

Yes. We deploy fully self-hosted RAG systems for regulated industries and customers with strict data residency requirements. Self-hosted means open-weight embedding models, open-weight language models, and a self-hosted vector database (typically Qdrant or pgvector), all running on your infrastructure. The trade-offs are higher infrastructure cost, slower iteration, and slightly lower model quality. Worth it when compliance demands it.

Q05

How do we keep the knowledge base from going stale?

Three layers. Automated incremental ingestion catches most changes within minutes to hours. Configurable refresh schedules handle anything not covered by change feeds. Stale content alerts surface documents that have not been updated in months so your team can decide whether to update them or retire them. The system is only as fresh as the source content, so we make staleness visible.

Q06

What if a question's answer is not in our documents?

The system should say so. We build RAG systems with explicit refusal behavior. If the retrieval step does not return relevant passages above a confidence threshold, the system responds with a clear 'this is not in our docs' message and offers to escalate, create a ticket, or surface the question to the team as a content gap. Refusal accuracy is one of the five quality metrics we measure.
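The threshold gate described above can be sketched as follows. The 0.5 threshold, the score scale, and the response shape are all assumptions for illustration; a sensible threshold has to be tuned against the golden dataset for each corpus and reranker.

```python
def answer_or_refuse(scored_passages: list[tuple[str, float]], threshold: float = 0.5) -> dict:
    """Gate generation on retrieval confidence.
    scored_passages: (passage_text, relevance_score) pairs from the reranker.
    If nothing clears the threshold, refuse instead of calling the model."""
    relevant = [passage for passage, score in scored_passages if score >= threshold]
    if not relevant:
        return {
            "action": "refuse",
            "message": "This is not in our docs.",
            "passages": [],
        }
    return {"action": "answer", "passages": relevant}
```

Refusing before generation means the model is never asked to answer from weak context, which is cheaper and safer than trying to detect an invented answer afterwards.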

Q07

Will the system get worse over time as our content changes?

Without monitoring, yes. With monitoring, you catch drift before users do. We set up automated weekly evaluation runs against a golden dataset of question-and-answer pairs. When any quality metric drops more than five percentage points, an alert fires. That is the difference between a RAG system that ages well and one that quietly degrades.

Q08

Who owns the system after launch?

You do. Code, embeddings, vector database, evaluation datasets, monitoring dashboards. Everything sits in your infrastructure or accounts. If you wanted to take maintenance in-house tomorrow, you could. We hand over everything documented and walk your team through it.

Q09

What if we want to add new sources later?

Most clients do. Adding a new source after launch is a smaller project than the initial build, usually two to four weeks depending on the source. We design every RAG system to accept new sources without rebuilding the pipeline. The first one is the hardest. Each one after is incremental.

Q10

Can the RAG system also generate content, not just retrieve?

Yes, but worth a separate conversation. Pure retrieval-and-answer is the simplest pattern and the one with the strongest reliability story. Generative content workflows (writing first drafts of replies, summarizing documents, composing outbound emails) are valuable but introduce different risks. We can build either. We will tell you which fits your use case before scoping.

FROM YOUR DOCS TO PRODUCTION

Most useful RAG systems start with one corpus and one audience. Tell us yours.

Forty-five minutes. We will look at where your content lives, who needs to query it, and what success looks like for you. If RAG is the right tool, we will scope it. If a simpler approach fits better (enterprise search, internal documentation cleanup, something else), we will say so.

No pressure. Just value.