This guide is based on Olivier Duvelleroy's NEXUS project—a personal RAG system he built in a single weekend to solve the file-limit problem with AI tools. If you haven't read that story, start there for context on why this matters.
Here, we'll walk through exactly how to build your own version.
What You'll Need
- A computer with 16GB+ RAM (the model runs locally)
- ~10GB free disk space (for the model and index)
- Basic comfort with the terminal (you'll copy/paste commands)
- Documents to index (PDFs, Word docs, text files)
- 2-4 hours (mostly waiting for downloads)
The Architecture
Before we start, let's understand what we're building:
┌─────────────────────────────────────────────────────────┐
│                     YOUR DOCUMENTS                      │
│           (PDFs, Word docs, text files, etc.)           │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                  CHUNKING & EMBEDDING                   │
│       Split into paragraphs → Convert to vectors        │
│             (sentence-transformers library)             │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                   FAISS VECTOR INDEX                    │
│      Fast similarity search across all your chunks      │
└────────────────────────────┬────────────────────────────┘
                             │
           ┌─────────────────┴──────────────┐
           │                                │
           ▼                                ▼
┌─────────────────────┐   ┌───────────────────────────────────┐
│    YOUR QUESTION    │   │         RETRIEVED CHUNKS          │
│                     │──▶│     (Top 5-10 most relevant)      │
└─────────────────────┘   └─────────────────┬─────────────────┘
                                            │
                                            ▼
                          ┌───────────────────────────────────┐
                          │        LOCAL LLM (Ollama)         │
                          │   Question + Context → Answer     │
                          │            (Qwen3:8b)             │
                          └───────────────────────────────────┘
NEXUS Architecture: Index once, query instantly
The key insight: retrieval is fast (a few seconds), while generation is slower (1-2 minutes on local hardware). This is fine because you're trading speed for privacy and unlimited context.
Step 1: Install Ollama (~10 min)
Ollama is a local AI runtime that makes it dead simple to run open-source models. Think of it as "Docker for LLMs."
On Mac or Linux:
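A minimal sketch of the install commands. The one-line script is Ollama's documented Linux installer; on macOS you can use Homebrew or just download the app:

```shell
# Linux: official one-line installer from ollama.com
curl -fsSL https://ollama.com/install.sh | sh

# macOS: via Homebrew (or download the app from ollama.com/download)
brew install ollama
```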
On Windows: Download from ollama.com/download
Verify it's working:
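Two quick checks that the runtime is installed and the server is reachable:

```shell
ollama --version   # prints the installed version
ollama list        # lists downloaded models (empty for now)
```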
Step 2: Pull a Model (~20 min)
We'll use Qwen3:8b—a capable model that runs well on consumer hardware. It's 4-bit quantized, meaning it's compressed to use less memory while maintaining quality.
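Pull it with:

```shell
ollama pull qwen3:8b
```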
This downloads about 5GB. Go grab coffee.
Test it:
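A quick smoke test (the first run also loads the model into memory, so give it a minute):

```shell
ollama run qwen3:8b "Explain retrieval-augmented generation in two sentences."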
You should see a response about Retrieval-Augmented Generation. If your fan starts spinning, that's normal—your CPU is doing the work.
⚠️ Performance Note
Local inference is slower than cloud APIs. Expect 1-2 minutes per response on a typical laptop. This is the tradeoff for privacy and no API costs. If you need speed, you can swap in a cloud API later.
Step 3: Set Up Python Environment (~15 min)
Create a clean Python environment for the project:
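A sketch of the setup; the environment name `nexus-env` is my choice, and I've added `langchain-community`, where the document loaders and FAISS wrapper live in recent LangChain releases:

```shell
python3 -m venv nexus-env
source nexus-env/bin/activate   # Windows: nexus-env\Scripts\activate
pip install langchain langchain-community sentence-transformers faiss-cpu pypdf python-docx
```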
What we're installing:
- langchain: Orchestration framework for LLM pipelines
- sentence-transformers: Creates embeddings from text
- faiss-cpu: Facebook's vector similarity search
- pypdf, python-docx: Document parsers
Step 4: Create the Indexing Script (~30 min)
This script reads your documents, splits them into chunks, creates embeddings, and stores them in a FAISS index.
Create index_documents.py:
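A minimal sketch of the indexer. The chunk size, overlap, embedding model (`all-MiniLM-L6-v2`), and folder names are my assumptions, not necessarily NEXUS's exact choices, and LangChain import paths shift slightly between versions:

```python
"""index_documents.py - read local files, chunk them, embed them, save a FAISS index."""
from pathlib import Path

# Extensions we know how to parse (pypdf / python-docx / plain text).
SUPPORTED = {".pdf", ".docx", ".txt", ".md"}


def find_documents(root: str) -> list[Path]:
    """Recursively collect every indexable file under `root`."""
    return sorted(p for p in Path(root).rglob("*") if p.suffix.lower() in SUPPORTED)


def read_text(path: Path) -> str:
    """Extract plain text from one file, picking a parser by extension."""
    if path.suffix.lower() == ".pdf":
        from pypdf import PdfReader
        return "\n".join(page.extract_text() or "" for page in PdfReader(str(path)).pages)
    if path.suffix.lower() == ".docx":
        from docx import Document
        return "\n".join(p.text for p in Document(str(path)).paragraphs)
    return path.read_text(errors="ignore")


def main() -> None:
    # Heavy imports stay here so the helpers above import anywhere.
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS

    texts, metadatas = [], []
    for path in find_documents("documents"):
        texts.append(read_text(path))
        metadatas.append({"source": path.name})

    # Paragraph-sized chunks with a little overlap so ideas aren't cut mid-thought.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    chunks = splitter.create_documents(texts, metadatas=metadatas)

    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    FAISS.from_documents(chunks, embeddings).save_local("faiss_index")
    print(f"Indexed {len(chunks)} chunks from {len(texts)} files.")


if __name__ == "__main__":
    main()
```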
Step 5: Create the Query Script (~30 min)
This script loads the index, retrieves relevant chunks for your question, and sends them to the local LLM.
Create query.py:
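A sketch of the query side, using the same embedding model as the indexer (the prompt wording and `k=5` are my assumptions; `allow_dangerous_deserialization` is required by newer LangChain when loading a pickled local index you built yourself):

```python
"""query.py - retrieve relevant chunks and send them to the local Ollama model."""
import sys

PROMPT_TEMPLATE = """Answer the question using ONLY the context below.
Cite the source file for each claim. If the context is insufficient, say so.

Context:
{context}

Question: {question}
"""


def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Format retrieved (source, text) pairs into a grounded prompt."""
    context = "\n\n".join(f"[{src}]\n{text}" for src, text in chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


def main() -> None:
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.llms import Ollama
    from langchain_community.vectorstores import FAISS

    question = " ".join(sys.argv[1:]) or input("Question: ")

    # Must match the model used at indexing time, or the vectors won't line up.
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    index = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)

    hits = index.similarity_search(question, k=5)
    chunks = [(h.metadata.get("source", "unknown"), h.page_content) for h in hits]

    llm = Ollama(model="qwen3:8b")
    print(llm.invoke(build_prompt(question, chunks)))


if __name__ == "__main__":
    main()
```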
Step 6: Index Your Documents
Create a documents folder and add your files:
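For example (the source paths are placeholders, use your own):

```shell
mkdir -p documents
# Copy in whatever you want indexed, e.g.:
# cp ~/Notes/*.pdf ~/Reports/*.docx documents/
```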
Run the indexing script:
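With your virtual environment active:

```shell
python index_documents.py
```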
This will take a while depending on how many documents you have. For 100 documents, expect 10-20 minutes.
✓ Checkpoint
You should now have a faiss_index folder containing your vectorized knowledge base. This only needs to be rebuilt when you add new documents.
Step 7: Query Your Knowledge Base
Now the fun part. Ask questions:
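For example (the question is illustrative, ask about whatever you indexed):

```shell
python query.py "What did customers say about onboarding friction?"
```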
The system will:
- Convert your question to an embedding (instant)
- Find the 5 most relevant chunks (instant)
- Send question + context to the local LLM (1-2 min)
- Return a grounded answer with sources
Real Example: What NEXUS Actually Outputs
Here's a real example from Olivier's system. He asked NEXUS to find relevant quotes for each chapter of a book he's writing about the "context problem" in enterprise data:
Part 6 — Context problem
Data exists, but meaning is lost across systems
| Chapter | Expanded quote | Reference |
|---|---|---|
| Setup | "If a user is spending their time in Outlook or in CRM, how do you take that insight and make sure it gets to them in that application? Otherwise, it just stays disconnected from action." | Qual Transcript 2 |
| Exploration | "All these tools have siloed, contextual metadata. A lot of the understanding sits in one person's head. When they leave or switch roles, that context disappears." | Qual Transcript 1 |
| Practical | "We are very clear that Confluence is the source for documentation. If you want to understand definitions or logic, there is one place to go. That consistency matters when people start self-serving." | Qual Transcript 5 |
| Synthesis | "AI models don't understand business rules or why decisions were made. Without decision memory and context, agents will act quickly but incorrectly." | Qual Transcript 5 |
This is the magic: In 30 seconds, NEXUS retrieved the most relevant quotes from 40+ interview transcripts, matched them to book chapters, and cited the exact source. No manual searching. No missed documents.
What's Next
This is a working foundation. Here's how to extend it:
| Improvement | Difficulty | Impact |
|---|---|---|
| Add a simple web UI (Gradio/Streamlit) | Easy | Much better UX |
| Swap in GPT-4 for faster inference | Easy | 10x faster responses |
| Add incremental indexing | Medium | Faster updates |
| Support images/diagrams (vision model) | Hard | Much richer context |
| Add memory across sessions | Medium | Conversational queries |
The Hybrid Future
Olivier's insight: this doesn't have to be all-local forever. The architecture supports a hybrid approach:
- Local retrieval for governance (your documents never leave your machine)
- Optional cloud inference for speed/quality when content is non-sensitive
You control the tradeoff. That's the point.
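The routing policy above can be sketched in a few lines. This is illustrative, not part of NEXUS: the `sensitive` metadata flag and backend names are my assumptions, and note the default-closed stance, anything without an explicit sensitivity label stays local:

```python
# Hybrid routing sketch: retrieval is always local; generation may go to a
# cloud model only when every retrieved chunk is explicitly non-sensitive.
def pick_backend(chunks: list[dict]) -> str:
    """Return which LLM backend is allowed to see this context."""
    if any(c.get("metadata", {}).get("sensitive", True) for c in chunks):
        return "local"  # default-closed: unlabeled content never leaves the machine
    return "cloud"
```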
"At no time should we be afraid of diving into the details these days. AI tools can guide you through implementation; you are mostly limited by your curiosity, creativity, and intent."
— Olivier Duvelleroy
You now have your own context engine. What will you ask it?