RAG-Forge
RAG Pipeline Benchmarking Tool
Solo project. Built to scratch my own itch after too many manual RAG evaluations at work.
The Problem
At Observe.AI I spent weeks manually testing different chunking strategies, embedding models, and retrieval methods for our RAG pipeline. There was no easy way to answer: given MY documents and MY questions, which configuration actually works best? I built this to automate that process.
How I Built It
Grid search over the RAG config space
Tests every combination of chunking (fixed, recursive, semantic) × embedding (BGE-small, E5-small, OpenAI) × retrieval (dense, BM25, hybrid) × reranking (cross-encoder, none). Each config gets indexed into ChromaDB, retrieves context for every QA pair, and gets scored.
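The exhaustive sweep is just a Cartesian product over the four dimensions. A minimal sketch of that grid (the names below are illustrative, not the tool's actual identifiers):

```python
from itertools import product

# Hypothetical config space mirroring the dimensions described above.
CHUNKERS = ["fixed", "recursive", "semantic"]
EMBEDDERS = ["bge-small-en", "e5-small-v2", "text-embedding-3-small"]
RETRIEVERS = ["dense", "bm25", "hybrid"]
RERANKERS = ["cross-encoder", None]

def config_grid():
    """Yield every chunker × embedder × retriever × reranker combination."""
    for chunker, embedder, retriever, reranker in product(
        CHUNKERS, EMBEDDERS, RETRIEVERS, RERANKERS
    ):
        yield {
            "chunker": chunker,
            "embedder": embedder,
            "retriever": retriever,
            "reranker": reranker,
        }

print(sum(1 for _ in config_grid()))  # 3 × 3 × 3 × 2 = 54 configs
```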
Lightweight evaluation metrics
Hit rate, MRR, and context precision — computed without needing an LLM judge. Optional RAGAS evaluation if you want the full suite (faithfulness, relevance) and have an OpenAI key. The lightweight metrics are fast enough to run on a laptop.
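These three metrics only need the ranked retrieval results and the gold chunks per question, which is why no LLM judge is required. A sketch of how they can be computed (function names and signatures are illustrative, not the tool's API):

```python
def hit_rate(retrieved, relevant, k=5):
    """Fraction of questions where a relevant chunk appears in the top-k."""
    hits = sum(
        1 for ranked, gold in zip(retrieved, relevant)
        if any(doc in gold for doc in ranked[:k])
    )
    return hits / len(retrieved)

def mrr(retrieved, relevant):
    """Mean reciprocal rank of the first relevant chunk (0 if never retrieved)."""
    total = 0.0
    for ranked, gold in zip(retrieved, relevant):
        for rank, doc in enumerate(ranked, start=1):
            if doc in gold:
                total += 1.0 / rank
                break
    return total / len(retrieved)

def context_precision(retrieved, relevant, k=5):
    """Average share of the top-k that is actually relevant."""
    scores = [
        sum(doc in gold for doc in ranked[:k]) / k
        for ranked, gold in zip(retrieved, relevant)
    ]
    return sum(scores) / len(scores)
```

Each is O(questions × k), so a full sweep stays laptop-friendly.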
Pareto analysis
No single config wins on every dimension. The tool generates a Pareto plot showing quality vs latency tradeoffs, so you can pick the config that matches your constraints — some deployments need speed, others need accuracy.
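The frontier itself is easy to compute: a config survives unless some other config is at least as good on both axes and strictly better on one. A minimal sketch under that definition (field names are hypothetical):

```python
def pareto_frontier(configs):
    """Return configs not dominated on (quality: higher better, latency: lower better)."""
    frontier = []
    for c in configs:
        dominated = any(
            o["quality"] >= c["quality"]
            and o["latency"] <= c["latency"]
            and (o["quality"] > c["quality"] or o["latency"] < c["latency"])
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    return frontier
```

Everything off the frontier can be discarded outright; the remaining configs are the only ones worth trading off by hand.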
Architecture
CLI (Typer)
├── Document Loader (txt, md, pdf)
├── Chunking
│   ├── fixed_chunk (token-based splits)
│   ├── recursive_chunk (paragraph-aware)
│   └── semantic_chunk (embedding similarity)
├── Embedding
│   ├── BGE-small-en (384d)
│   ├── E5-small-v2 (384d)
│   └── OpenAI text-embedding-3-small (1536d)
├── Indexing → ChromaDB (embedded, no server)
├── Retrieval
│   ├── Dense search (cosine similarity)
│   ├── BM25 (sparse, keyword-based)
│   └── Hybrid (0.7 dense + 0.3 BM25)
├── Reranking
│   └── Cross-encoder (ms-marco-MiniLM-L-6-v2)
├── Evaluation
│   ├── Hit rate, MRR, context precision (fast)
│   └── RAGAS suite (optional, needs OpenAI)
└── Report
    ├── Markdown ranked table
    └── Pareto plot (matplotlib)
Results
- Tests 36+ configuration combinations in a single run
- Generates ranked results table sorted by composite score
- Pareto plot identifies quality/latency tradeoff frontier
- 28 tests passing, CI on Python 3.10–3.12
- pip install rag-forge && rag-forge run --docs ./data --qa ./qa.csv
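The hybrid retrieval mode in the architecture above blends the two score sources with a fixed 0.7/0.3 weighting. A minimal sketch of that blend, assuming both score dicts are already min-max normalized to [0, 1] (names and signature are illustrative):

```python
def hybrid_scores(dense, bm25, w_dense=0.7, w_bm25=0.3):
    """Blend normalized dense and BM25 scores per document id.

    Documents missing from one retriever simply contribute 0 for that source,
    so a chunk found only by BM25 can still rank above a weak dense hit.
    """
    ids = set(dense) | set(bm25)
    return {
        doc: w_dense * dense.get(doc, 0.0) + w_bm25 * bm25.get(doc, 0.0)
        for doc in ids
    }
```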
I kept the scope deliberately small — 3 chunkers, 3 embedders, 3 retrievers. The temptation is to support every model on Hugging Face, but then it becomes a framework instead of a tool. Staying opinionated about the config space keeps the tool fast and useful.