This guide is based on Olivier Duvelleroy's NEXUS project—a personal RAG system he built in a single weekend to solve the file-limit problem with AI tools. If you haven't read that story, start there for context on why this matters.
Here, we'll walk through exactly how to build your own version.
What You'll Need
- A computer with 16GB+ RAM (the model runs locally)
- ~10GB free disk space (for the model and index)
- Basic comfort with the terminal (you'll copy/paste commands)
- Documents to index (PDFs, Word docs, text files)
- 2-4 hours (mostly waiting for downloads)
The Architecture
Before we start, let's understand what we're building:
┌─────────────────────────────────────────────────────────┐
│                     YOUR DOCUMENTS                      │
│           (PDFs, Word docs, text files, etc.)           │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                  CHUNKING & EMBEDDING                   │
│       Split into paragraphs → Convert to vectors        │
│             (sentence-transformers library)             │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                   FAISS VECTOR INDEX                    │
│      Fast similarity search across all your chunks      │
└────────────────────────────┬────────────────────────────┘
                             │
           ┌─────────────────┴──────────────┐
           │                                │
           ▼                                ▼
┌─────────────────────┐   ┌───────────────────────────────────┐
│    YOUR QUESTION    │   │         RETRIEVED CHUNKS          │
│                     │──▶│     (Top 5-10 most relevant)      │
└─────────────────────┘   └─────────────────┬─────────────────┘
                                            │
                                            ▼
                          ┌───────────────────────────────────┐
                          │        LOCAL LLM (Ollama)         │
                          │   Question + Context → Answer     │
                          │            (Qwen3:8b)             │
                          └───────────────────────────────────┘
NEXUS Architecture: Index once, query instantly
The key insight: retrieval is fast (a few seconds), while generation is slower (1-2 minutes on local hardware). This is fine because you're trading speed for privacy and unlimited context.
Step 1: Install Ollama (~10 min)
Ollama is a local AI runtime that makes it dead simple to run open-source models. Think of it as "Docker for LLMs."
On Mac or Linux:
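A minimal sketch of the install commands. The one-line script is Ollama's documented Linux installer; on macOS you can use Homebrew or just download the app:

```shell
# Linux: official one-line installer from ollama.com
curl -fsSL https://ollama.com/install.sh | sh

# macOS: via Homebrew (or download the app from ollama.com/download)
brew install ollama
```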
On Windows: Download from ollama.com/download
Verify it's working:
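Two quick checks that the runtime is installed and the server is reachable:

```shell
ollama --version   # prints the installed version
ollama list        # lists downloaded models (empty for now)
```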
Step 2: Pull a Model (~20 min)
We'll use Qwen3:8b—a capable model that runs well on consumer hardware. It's 4-bit quantized, meaning it's compressed to use less memory while maintaining quality.
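Pull it with:

```shell
ollama pull qwen3:8b
```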
This downloads about 5GB. Go grab coffee.
Test it:
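A quick smoke test (the first run also loads the model into memory, so give it a minute):

```shell
ollama run qwen3:8b "Explain retrieval-augmented generation in two sentences."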
You should see a response about Retrieval-Augmented Generation. If your fan starts spinning, that's normal—your CPU is doing the work.
⚠️ Performance Note
Local inference is slower than cloud APIs. Expect 1-2 minutes per response on a typical laptop. This is the tradeoff for privacy and no API costs. If you need speed, you can swap in a cloud API later.
Step 3: Set Up Python Environment (~15 min)
Create a clean Python environment for the project:
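A sketch of the setup; the environment name `nexus-env` is my choice, and I've added `langchain-community`, where the document loaders and FAISS wrapper live in recent LangChain releases:

```shell
python3 -m venv nexus-env
source nexus-env/bin/activate   # Windows: nexus-env\Scripts\activate
pip install langchain langchain-community sentence-transformers faiss-cpu pypdf python-docx
```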
What we're installing:
- langchain: Orchestration framework for LLM pipelines
- sentence-transformers: Creates embeddings from text
- faiss-cpu: Facebook's vector similarity search
- pypdf, python-docx: Document parsers
Step 4: Create the Indexing Script (~30 min)
This script reads your documents, splits them into chunks, creates embeddings, and stores them in a FAISS index.
Create index_documents.py:
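A minimal sketch of the indexer. The chunk size, overlap, embedding model (`all-MiniLM-L6-v2`), and folder names are my assumptions, not necessarily NEXUS's exact choices, and LangChain import paths shift slightly between versions:

```python
"""index_documents.py - read local files, chunk them, embed them, save a FAISS index."""
from pathlib import Path

# Extensions we know how to parse (pypdf / python-docx / plain text).
SUPPORTED = {".pdf", ".docx", ".txt", ".md"}


def find_documents(root: str) -> list[Path]:
    """Recursively collect every indexable file under `root`."""
    return sorted(p for p in Path(root).rglob("*") if p.suffix.lower() in SUPPORTED)


def read_text(path: Path) -> str:
    """Extract plain text from one file, picking a parser by extension."""
    if path.suffix.lower() == ".pdf":
        from pypdf import PdfReader
        return "\n".join(page.extract_text() or "" for page in PdfReader(str(path)).pages)
    if path.suffix.lower() == ".docx":
        from docx import Document
        return "\n".join(p.text for p in Document(str(path)).paragraphs)
    return path.read_text(errors="ignore")


def main() -> None:
    # Heavy imports stay here so the helpers above import anywhere.
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS

    texts, metadatas = [], []
    for path in find_documents("documents"):
        texts.append(read_text(path))
        metadatas.append({"source": path.name})

    # Paragraph-sized chunks with a little overlap so ideas aren't cut mid-thought.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    chunks = splitter.create_documents(texts, metadatas=metadatas)

    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    FAISS.from_documents(chunks, embeddings).save_local("faiss_index")
    print(f"Indexed {len(chunks)} chunks from {len(texts)} files.")


if __name__ == "__main__":
    main()
```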
Step 5: Create the Query Script (~30 min)
This script loads the index, retrieves relevant chunks for your question, and sends them to the local LLM.
Create query.py:
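A sketch of the query side, using the same embedding model as the indexer (the prompt wording and `k=5` are my assumptions; `allow_dangerous_deserialization` is required by newer LangChain when loading a pickled local index you built yourself):

```python
"""query.py - retrieve relevant chunks and send them to the local Ollama model."""
import sys

PROMPT_TEMPLATE = """Answer the question using ONLY the context below.
Cite the source file for each claim. If the context is insufficient, say so.

Context:
{context}

Question: {question}
"""


def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Format retrieved (source, text) pairs into a grounded prompt."""
    context = "\n\n".join(f"[{src}]\n{text}" for src, text in chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


def main() -> None:
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.llms import Ollama
    from langchain_community.vectorstores import FAISS

    question = " ".join(sys.argv[1:]) or input("Question: ")

    # Must match the model used at indexing time, or the vectors won't line up.
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    index = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)

    hits = index.similarity_search(question, k=5)
    chunks = [(h.metadata.get("source", "unknown"), h.page_content) for h in hits]

    llm = Ollama(model="qwen3:8b")
    print(llm.invoke(build_prompt(question, chunks)))


if __name__ == "__main__":
    main()
```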
Step 6: Index Your Documents
Create a documents folder and add your files:
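For example (the source paths are placeholders, use your own):

```shell
mkdir -p documents
# Copy in whatever you want indexed, e.g.:
# cp ~/Notes/*.pdf ~/Reports/*.docx documents/
```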
Run the indexing script:
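With your virtual environment active:

```shell
python index_documents.py
```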
This will take a while depending on how many documents you have. For 100 documents, expect 10-20 minutes.
✓ Checkpoint
You should now have a faiss_index folder containing your vectorized knowledge base. This only needs to be rebuilt when you add new documents.
Step 7: Query Your Knowledge Base
Now the fun part. Ask questions:
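For example (the question is illustrative, ask about whatever you indexed):

```shell
python query.py "What did customers say about onboarding friction?"
```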
The system will:
- Convert your question to an embedding (instant)
- Find the 5 most relevant chunks (instant)
- Send question + context to the local LLM (1-2 min)
- Return a grounded answer with sources
Real Example: What NEXUS Actually Outputs
Here's a real example from Olivier's system. He asked NEXUS to find relevant quotes for each chapter of a book he's writing about the "context problem" in enterprise data:
Part 6 — Context problem
Data exists, but meaning is lost across systems
| Chapter | Expanded quote | Reference |
|---|---|---|
| Setup | "If a user is spending their time in Outlook or in CRM, how do you take that insight and make sure it gets to them in that application? Otherwise, it just stays disconnected from action." | Qual Transcript 2 |
| Exploration | "All these tools have siloed, contextual metadata. A lot of the understanding sits in one person's head. When they leave or switch roles, that context disappears." | Qual Transcript 1 |
| Practical | "We are very clear that Confluence is the source for documentation. If you want to understand definitions or logic, there is one place to go. That consistency matters when people start self-serving." | Qual Transcript 5 |
| Synthesis | "AI models don't understand business rules or why decisions were made. Without decision memory and context, agents will act quickly but incorrectly." | Qual Transcript 5 |
This is the magic: In 30 seconds, NEXUS retrieved the most relevant quotes from 40+ interview transcripts, matched them to book chapters, and cited the exact source. No manual searching. No missed documents.
What's Next
This is a working foundation. Here's how to extend it:
| Improvement | Difficulty | Impact |
|---|---|---|
| Add a simple web UI (Gradio/Streamlit) | Easy | Much better UX |
| Swap in GPT-4 for faster inference | Easy | 10x faster responses |
| Add incremental indexing | Medium | Faster updates |
| Support images/diagrams (vision model) | Hard | Much richer context |
| Add memory across sessions | Medium | Conversational queries |
The Hybrid Future
Olivier's insight: this doesn't have to be all-local forever. The architecture supports a hybrid approach:
- Local retrieval for governance (your documents never leave your machine)
- Optional cloud inference for speed/quality when content is non-sensitive
You control the tradeoff. That's the point.
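The routing policy above can be sketched in a few lines. This is illustrative, not part of NEXUS: the `sensitive` metadata flag and backend names are my assumptions, and note the default-closed stance, anything without an explicit sensitivity label stays local:

```python
# Hybrid routing sketch: retrieval is always local; generation may go to a
# cloud model only when every retrieved chunk is explicitly non-sensitive.
def pick_backend(chunks: list[dict]) -> str:
    """Return which LLM backend is allowed to see this context."""
    if any(c.get("metadata", {}).get("sensitive", True) for c in chunks):
        return "local"  # default-closed: unlabeled content never leaves the machine
    return "cloud"
```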
"At no time should we be afraid of diving into the details these days. AI tools can guide you through implementation; you are mostly limited by your curiosity, creativity, and intent."
— Olivier Duvelleroy
You now have your own context engine. What will you ask it?