RAG AND KNOWLEDGE BASE SYSTEMS
RAG systems for internal knowledge, customer support, sales enablement, and product documentation. Built on your actual content. Every answer cited. No hallucinations dressed up as confidence.
Example query a RAG system answers from your actual content, with the source documents linked.
QUICK PRIMER
RAG stands for Retrieval-Augmented Generation. The acronym is technical. The idea is simple. An AI model fetches the right passages from your documents before it answers, then composes an answer grounded in what it found. Here is the flow.
A question arrives as natural language. The system retrieves the most relevant passages from indexed content using vector similarity and keyword matching. The language model composes an answer using only the retrieved passages. The user receives the answer with source citations linked.
Question arrives.
Employee asks Slack. Customer types into chat. Sales rep queries the assistant. The question comes in as natural language.
Retrieve relevant passages.
The system searches your indexed content. Vector similarity, plus keyword matching, plus reranking. Returns the most relevant passages from your actual docs.
Compose grounded answer.
The retrieved passages get handed to the language model along with the question. The model writes the answer using only what was retrieved, not its training data.
Answer with citations.
The user gets a clear answer with the source passages linked. They can click through to the original document. Trust comes from traceability.
WHAT RAG IS NOT
It is not a chatbot trained on your data. The model is not memorizing your documents. Nothing about your content goes into anyone's training set. Each query retrieves fresh from your actual indexed content, then the model answers based on what it sees in that moment.
WHAT RAG IS GOOD AT
Answering questions that have answers somewhere in your documents but nobody can find them. Customer support where the same questions repeat. Sales enablement where product information is scattered. Internal Q&A where employees keep messaging managers what the wiki already says.
WHAT WE HAVE BUILT
None of these come out of a box. Each one is built for the customer's content, stack, and audience. But the patterns repeat. If your use case looks like one of these, we already know what works and what to avoid.
01
An assistant inside Slack that answers employee questions about HR policy, IT procedures, product specs, and engineering runbooks. Pulls from your existing documentation and posts answers with linked sources. Reduces back-and-forth in shared channels and keeps managers from being asked the same five questions every week.
INDEXED ON
EXAMPLE QUERY
What is our remote work policy for international hires? I am asking for a candidate in Lisbon.
ANSWER
International remote hires require local employment compliance, typically handled through Deel or Remote.com. Approval needs sign-off from People Ops and the hiring manager. Standard process takes 3 to 5 business days. The full policy covers contractor versus EOR options.
02
A search and Q&A layer for your help center, product docs, and historical support tickets. Customers ask in plain language and get answers without filing a ticket. The system knows when it does not know and offers to create a ticket with the conversation already attached.
INDEXED ON
EXAMPLE QUERY
How do I export my data if my account is on the Starter plan?
ANSWER
On the Starter plan, you can export your data manually from Settings > Export. Bulk and scheduled exports require the Growth plan or higher. Exports run as CSV or JSON and are usually ready within 5 minutes. There is a 50,000 row limit on Starter exports.
03
An assistant for sales reps to query during and between calls. Product details, pricing, competitive positioning, objection handling, case studies by industry. Answers in seconds, with sources. Stops the team from misquoting product capabilities in front of prospects.
INDEXED ON
EXAMPLE QUERY
How does our SOC 2 compliance compare to Acme Corp's, and what is our standard answer for enterprise security reviews?
ANSWER
We are SOC 2 Type 2 certified, audited annually by Drata. Acme is Type 1, audited every 18 months. For enterprise reviews, lead with our security overview document and the SOC 2 report. Standard turnaround for vendor security questionnaires is 5 business days, handled by Security.
04
A search layer for engineering teams over your internal code documentation, API specs, runbooks, and architecture decisions. Built into the developer environment, in Slack, or as a CLI tool. Returns code snippets, decision records, and links to the source files.
INDEXED ON
EXAMPLE QUERY
Show me how we handle webhook retries in the payments service.
ANSWER
Webhook retries in payments-service use exponential backoff with jitter, starting at 30 seconds and capping at 24 hours. Max retries is 16 over 72 hours. After that, webhooks move to the dead letter queue at retries:dead. Implementation is in workers/webhooks.ts.
05
A retrieval system over your contracts, vendor agreements, compliance documentation, and legal precedent. Legal teams query in plain language and get back the exact clauses with document context. Built with strict access controls and audit logging.
INDEXED ON
EXAMPLE QUERY
Find all vendor agreements that include data residency clauses requiring EU storage.
ANSWER
12 vendor agreements include EU data residency clauses. Most reference GDPR Article 44. Three include stricter clauses requiring data to stay within a specific member state. Two require deletion within 30 days of contract termination. Top vendors with these clauses are listed in order of contract value.
UNDER THE HOOD
The pipeline below is what makes a RAG system actually reliable. Skip any of these steps and you ship something that looks impressive in a demo and falls apart in production. We do all seven on every project.
The RAG retrieval pipeline has seven stages: ingestion from connected sources, chunking documents into passages, embedding those passages into vectors, storing vectors in a vector database, retrieving the most relevant passages at query time, reranking those passages for accuracy, and generating a cited answer grounded in the retrieved context.
We connect to your sources and pull the documents in. Notion pages, Confluence wikis, Google Drive files, support ticket archives, GitHub repos, custom databases. Each source has its own ingestion pipeline and update cadence.
Documents get split into passages that fit cleanly into the model's context window. Done with semantic chunking where structure matters, fixed-size where it does not, hierarchical where documents have nested sections. The chunking strategy is one of the most underrated decisions in RAG quality.
Each chunk gets converted into a high-dimensional vector that captures its semantic meaning. Same goes for the query at runtime. The embeddings are how the system finds 'related' content even when the wording is different.
Embeddings get stored in a vector database alongside metadata for filtering (source, date, access permissions, document type). Choice of database depends on scale, cost, latency, and whether you need hybrid search.
When a query comes in, we embed it, run similarity search against the vector store, optionally combine with keyword search (hybrid), and pull back the top-K most relevant chunks. The retrieval step is where most accuracy problems live.
The initial retrieval is fast but imperfect. A reranker re-scores the candidates against the original query using a more expensive model. The result is a tighter set of passages that actually answer the question, not just match the topic. This stage is the single biggest accuracy lever in RAG.
The reranked passages get assembled into the model's context along with the original question and a system prompt that instructs the model to answer only from the provided context. The model writes the answer and cites which passages it used. If it cannot answer from the provided context, it says so.
WHERE THE KNOWLEDGE LIVES
Most companies have their knowledge scattered across six or seven systems. We do not ask you to migrate everything into a new one. We index what you have, keep it current, and add new sources as you go.
Documentation and wikis
Where most of your written knowledge already lives.
Code and engineering docs
READMEs, ADRs, runbooks, and the wikis engineering actually maintains.
Support and conversation history
Past tickets, resolutions, and macros that already have the answers.
CRM and sales content
Account history, sales collateral, competitive intel.
Files and shared drives
PDFs, slides, spreadsheets, and the messy reality of corporate file storage.
Custom systems and databases
Internal apps, product databases, vertical software. Anything with an API or a query interface.
If yours is not on this list, the answer is almost always yes anyway. The connector is the easy part.
HOW WE PICK THE STACK
The tools that show up in every RAG agency's marketing are mostly interchangeable. The decisions that actually matter are about trade-offs. Here are the four we resolve on every project, and how we think about them.
01
We tend to pick: Pinecone for managed scale, pgvector when you already use Postgres, Qdrant for self-hosted enterprise.
OPTIONS
Pinecone
Fully managed, fast, expensive at scale. Default choice for small and mid-size deployments.
Weaviate
Open source with hybrid search built in. Strong choice when you want self-hosted with native keyword + vector.
Qdrant
Open source, fast, good metadata filtering. Best for self-hosted at scale.
pgvector
Vector search inside Postgres. The right answer when you already run Postgres and do not need separate infrastructure.
Chroma
Lightweight, great for prototyping and smaller corpuses. Less battle-tested at production scale.
02
We tend to pick: OpenAI text-embedding-3-large for managed deployments, Voyage when retrieval quality is critical, open weights for self-hosted.
OPTIONS
OpenAI text-embedding-3
Strong general-purpose embeddings. Easy to start. Cost-effective.
Voyage AI
Often outperforms OpenAI on domain-specific tasks. Worth testing on your actual corpus.
Cohere Embed
Good multilingual support, solid retrieval quality.
Open weights (BGE, Nomic, etc)
Self-hosted embeddings when data cannot leave your infrastructure.
03
We tend to pick: Almost always yes. Reranking is the single biggest accuracy lever.
OPTIONS
Cohere Rerank
Managed reranking via API. Easy to add, big quality lift on most corpuses. Default choice.
Voyage Rerank
Strong alternative to Cohere, sometimes better on technical content.
Custom cross-encoder
Fine-tuned on your domain. Costly to build, but possible when generic rerankers underperform.
No reranking
Acceptable for very small corpuses or extremely simple use cases. Rare in practice.
04
We tend to pick: Claude for reasoning depth, GPT for cost-sensitive deployments, open weights for self-hosted.
OPTIONS
Anthropic Claude
Strong instruction following, large context windows, excellent at staying grounded in retrieved context.
OpenAI GPT
Mature, cost-effective at scale, wide ecosystem support.
Google Gemini
Useful when other Google services are already in the stack.
Llama, Mistral, Qwen (self-hosted)
When data sovereignty or cost at scale demand on-prem deployment.
EVALUATION
The biggest difference between a RAG demo and a RAG product is whether anyone measured it. We run five evaluation layers on every project. The numbers are not perfect, but they are real. They tell you when the system gets worse so you can fix it before users notice.
When the answer is somewhere in your documents, does the system actually retrieve it? Bad recall means the model never sees the right passage, so it cannot possibly produce the right answer. Recall is the foundation everything else sits on.
We build a golden dataset of question-and-passage pairs from your real content. Then we measure how often the right passage appears in the top 10 retrieved results. Target: above 90 percent before launch.
When the system does answer, does the answer actually match the retrieved passages, or does the model invent things? Faithful answers cite what was actually retrieved. Unfaithful answers slip in plausible-sounding details that were not in the source.
We sample answers post-launch and check each claim in the answer against the cited sources. We also use automated faithfulness scoring from open evaluation libraries. Target: above 95 percent faithful.
Does the answer actually address what the user asked? A faithful answer can still be off-topic if the retrieval pulled in passages adjacent to the real question. Relevance catches the cases where the system answers a related question instead of the actual one.
Human review on sampled queries, plus automated relevance scoring. Target: above 92 percent relevant.
When the answer is not in your documents, does the system say so or does it make something up? Refusal accuracy is what separates a trustworthy RAG system from a confident liar. The system should know what it does not know.
We include adversarial questions in the golden dataset where the answer is not in the corpus. The system should refuse cleanly with a clear 'this is not in our docs' response. Target: above 98 percent refusal accuracy on out-of-scope questions.
Things change. Documents update. New product features ship. The system that was 95 percent accurate last month can be 80 percent accurate today if drift goes uncaught. Continuous evaluation catches drift before users do.
Automated evaluation runs weekly against the golden dataset. Real user queries get sampled for quality review monthly. Drift alerts fire when any metric drops more than 5 percentage points.
OPERATIONAL CONCERNS
Two questions every serious buyer asks. Where does our content go, and how do we make sure the system reflects what is actually true today, not what was true six months ago.
Self-hosted when it matters.
For regulated industries and sensitive content, we deploy the full RAG pipeline on your own infrastructure. Vector database, embedding models, language model, all running where you can audit them. No data leaves your environment.
Permission-aware retrieval.
Documents inherit your existing access controls. The retrieval system filters by user permissions at query time. If a user does not have access to a document, the system cannot retrieve from it, even if the content is technically indexed.
PII redaction and audit logging.
Sensitive fields can be redacted from chunks before indexing. Every query and every retrieval gets logged with the user, timestamp, retrieved sources, and final answer. Logs live in your infrastructure.
Compliance-ready by design.
GDPR right-to-erasure flows for the documents in scope. HIPAA-compatible deployment patterns where required. SOC 2 considerations baked into the architecture. We are not your auditor, but we know what the auditors look for.
Incremental ingestion.
Documents do not get re-indexed from scratch every time. We track changes through each source's native change feeds (Notion's API, Google Drive's change tokens, Confluence webhooks) and update only what changed.
Configurable refresh cadence.
Some content needs to be current within seconds (support docs after a product change), some can refresh daily, some weekly. We set the cadence per source based on how often it actually changes.
Stale content alerts.
Documents that have not been updated in N months get flagged in the admin dashboard. Helps your team see what is rotting and decide whether to update or retire it. The system is only as good as the content it indexes.
Version awareness.
When a document changes substantially, old answers that cited the previous version get flagged. Important when a customer asks a question and you want to make sure the answer reflects current policy, not last year's policy.
COMMON QUESTIONS
The questions we hear on most RAG discovery calls. Answered the way we actually answer them.
Q01
A basic RAG deployment on a clean corpus, one or two sources, single audience, lands in three to five weeks. More complex deployments involving multiple sources, permission inheritance, custom evaluation, or self-hosting run six to twelve weeks. The bottleneck is almost always the content (cleaning, deciding what to include, deciding what to exclude), not the building.
Q02
Enterprise search tools (Glean, Coveo, Algolia) are good products. They are also generic by design. Custom RAG is the right choice when your content has structure those tools cannot model, when your retrieval needs domain-specific tuning, when you need tight integration with your own product, or when off-the-shelf pricing does not work for your scale. We will tell you when an off-the-shelf tool is actually the better answer.
Q03
Build cost scales with scope. A single-source internal Q&A system is at the lower end of our pricing. Multi-source, permission-aware, evaluation-heavy systems cost more. Run cost has two components: model usage (per query, depends on model choice and volume) and infrastructure (vector database, embedding refreshes, monitoring). For a typical mid-size deployment handling a few thousand queries per day, monthly run cost lands in the low hundreds to low thousands of dollars.
Q04
Yes. We deploy fully self-hosted RAG systems for regulated industries and customers with strict data residency requirements. Self-hosted means open-source embeddings, open-source language models, self-hosted vector database (Qdrant or pgvector typically), all running on your infrastructure. The trade-offs are higher infrastructure cost, slower iteration, and slightly lower model quality. Worth it when compliance demands it.
Q05
Three layers. Automated incremental ingestion catches most changes within minutes to hours. Configurable refresh schedules handle anything not covered by change feeds. Stale content alerts surface documents that have not been updated in months so your team can decide whether to update them or retire them. The system is only as fresh as the source content, so we make staleness visible.
Q06
The system should say so. We build RAG systems with explicit refusal behavior. If the retrieval step does not return relevant passages above a confidence threshold, the system responds with a clear 'this is not in our docs' message and offers to escalate, create a ticket, or surface the question to the team as a content gap. Refusal accuracy is one of the five quality metrics we measure.
Q07
Without monitoring, yes. With monitoring, you catch drift before users do. We set up automated weekly evaluation runs against a golden dataset of question-and-answer pairs. When any quality metric drops more than five percentage points, an alert fires. That is the difference between a RAG system that ages well and one that quietly degrades.
Q08
You do. Code, embeddings, vector database, evaluation datasets, monitoring dashboards. Everything sits in your infrastructure or accounts. If you wanted to take maintenance in-house tomorrow, you could. We hand over everything documented and walk your team through it.
Q09
Most clients do. Adding a new source after launch is a smaller project than the initial build, usually two to four weeks depending on the source. We design every RAG system to accept new sources without rebuilding the pipeline. The first one is the hardest. Each one after is incremental.
Q10
Yes, but worth a separate conversation. Pure retrieval-and-answer is the simplest pattern and the one with the strongest reliability story. Generative content workflows (writing first drafts of replies, summarizing documents, composing outbound emails) are valuable but introduce different risks. We can build either. We will tell you which fits your use case before scoping.
FROM YOUR DOCS TO PRODUCTION
Forty-five minutes. We will look at where your content lives, who needs to query it, and what success looks like for you. If RAG is the right tool, we will scope it. If a simpler approach fits better (enterprise search, internal documentation cleanup, something else), we will say so.
No pressure. Just value.

Hi, I'm Ari 👋
I can help you automate tasks and answer questions about your business.