RAG-Forge
RAG Pipeline Benchmarking Tool
Solo project. Built to scratch my own itch after too many manual RAG evaluations at work.
The Problem
At Observe.AI I spent weeks manually testing different chunking strategies, embedding models, and retrieval methods for our RAG pipeline. There was no easy way to answer: given MY documents and MY questions, which configuration actually works best? I built this to automate that process.
How I Built It
Grid search over the RAG config space
Tests every combination of chunking (fixed, recursive, semantic) × embedding (BGE-small, E5-small, OpenAI) × retrieval (dense, BM25, hybrid) × reranking (cross-encoder, none). Each config gets indexed into ChromaDB, retrieves context for every QA pair, and gets scored.
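The exhaustive sweep is just a Cartesian product over the four dimensions. A minimal sketch of that grid (the names below are illustrative, not the tool's actual identifiers):

```python
from itertools import product

# Hypothetical config space mirroring the dimensions described above.
CHUNKERS = ["fixed", "recursive", "semantic"]
EMBEDDERS = ["bge-small-en", "e5-small-v2", "text-embedding-3-small"]
RETRIEVERS = ["dense", "bm25", "hybrid"]
RERANKERS = ["cross-encoder", None]

def config_grid():
    """Yield every chunker × embedder × retriever × reranker combination."""
    for chunker, embedder, retriever, reranker in product(
        CHUNKERS, EMBEDDERS, RETRIEVERS, RERANKERS
    ):
        yield {
            "chunker": chunker,
            "embedder": embedder,
            "retriever": retriever,
            "reranker": reranker,
        }

print(sum(1 for _ in config_grid()))  # 3 × 3 × 3 × 2 = 54 configs
```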
Lightweight evaluation metrics
Hit rate, MRR, and context precision — computed without needing an LLM judge. Optional RAGAS evaluation if you want the full suite (faithfulness, relevance) and have an OpenAI key. The lightweight metrics are fast enough to run on a laptop.
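These three metrics only need the ranked retrieval results and the gold chunks per question, which is why no LLM judge is required. A sketch of how they can be computed (function names and signatures are illustrative, not the tool's API):

```python
def hit_rate(retrieved, relevant, k=5):
    """Fraction of questions where a relevant chunk appears in the top-k."""
    hits = sum(
        1 for ranked, gold in zip(retrieved, relevant)
        if any(doc in gold for doc in ranked[:k])
    )
    return hits / len(retrieved)

def mrr(retrieved, relevant):
    """Mean reciprocal rank of the first relevant chunk (0 if never retrieved)."""
    total = 0.0
    for ranked, gold in zip(retrieved, relevant):
        for rank, doc in enumerate(ranked, start=1):
            if doc in gold:
                total += 1.0 / rank
                break
    return total / len(retrieved)

def context_precision(retrieved, relevant, k=5):
    """Average share of the top-k that is actually relevant."""
    scores = [
        sum(doc in gold for doc in ranked[:k]) / k
        for ranked, gold in zip(retrieved, relevant)
    ]
    return sum(scores) / len(scores)
```

Each is O(questions × k), so a full sweep stays laptop-friendly.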
Pareto analysis
No single config wins on every dimension. The tool generates a Pareto plot showing quality vs latency tradeoffs, so you can pick the config that matches your constraints — some deployments need speed, others need accuracy.
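The frontier itself is easy to compute: a config survives unless some other config is at least as good on both axes and strictly better on one. A minimal sketch under that definition (field names are hypothetical):

```python
def pareto_frontier(configs):
    """Return configs not dominated on (quality: higher better, latency: lower better)."""
    frontier = []
    for c in configs:
        dominated = any(
            o["quality"] >= c["quality"]
            and o["latency"] <= c["latency"]
            and (o["quality"] > c["quality"] or o["latency"] < c["latency"])
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    return frontier
```

Everything off the frontier can be discarded outright; the remaining configs are the only ones worth trading off by hand.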
Architecture
CLI (Typer)
├── Document Loader (txt, md, pdf)
├── Chunking
│   ├── fixed_chunk (token-based splits)
│   ├── recursive_chunk (paragraph-aware)
│   └── semantic_chunk (embedding similarity)
├── Embedding
│   ├── BGE-small-en (384d)
│   ├── E5-small-v2 (384d)
│   └── OpenAI text-embedding-3-small (1536d)
├── Indexing → ChromaDB (embedded, no server)
├── Retrieval
│   ├── Dense search (cosine similarity)
│   ├── BM25 (sparse, keyword-based)
│   └── Hybrid (0.7 dense + 0.3 BM25)
├── Reranking
│   └── Cross-encoder (ms-marco-MiniLM-L-6-v2)
├── Evaluation
│   ├── Hit rate, MRR, context precision (fast)
│   └── RAGAS suite (optional, needs OpenAI)
└── Report
    ├── Markdown ranked table
    └── Pareto plot (matplotlib)
Results
- Tests 36+ configuration combinations in a single run
- Generates ranked results table sorted by composite score
- Pareto plot identifies quality/latency tradeoff frontier
- 28 tests passing, CI on Python 3.10–3.12
- pip install rag-forge && rag-forge run --docs ./data --qa ./qa.csv
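The hybrid retrieval mode in the architecture above blends the two score sources with a fixed 0.7/0.3 weighting. A minimal sketch of that blend, assuming both score dicts are already min-max normalized to [0, 1] (names and signature are illustrative):

```python
def hybrid_scores(dense, bm25, w_dense=0.7, w_bm25=0.3):
    """Blend normalized dense and BM25 scores per document id.

    Documents missing from one retriever simply contribute 0 for that source,
    so a chunk found only by BM25 can still rank above a weak dense hit.
    """
    ids = set(dense) | set(bm25)
    return {
        doc: w_dense * dense.get(doc, 0.0) + w_bm25 * bm25.get(doc, 0.0)
        for doc in ids
    }
```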
I kept the scope deliberately small — 3 chunkers, 3 embedders, 3 retrievers. The temptation is to support every model on Hugging Face, but then it becomes a framework instead of a tool. Staying opinionated about the config space keeps the tool fast and useful.