Evaluation harness for AI agents. Define structured test suites with rubrics, run your agent against them, get LLM-as-judge scores, inspect full execution traces, and diff runs over time to catch regressions.

Building an agent is easy. Knowing whether yesterday’s prompt tweak made it better or worse is hard.
Most teams ship agents with vibes-based testing — a few manual prompts, no structured rubric, no regression catching. RubricLab is the missing dev-loop tool: write test cases once, score every change automatically, and see exactly where behavior drifted.
M1: scaffold
- apps/api (FastAPI), apps/web (Next.js 15), packages/shared TypeScript types
- GET / and GET /health live; dev tooling (uv + ruff, pnpm + prettier + eslint)

M2: data model + storage
- Suite, Case, Run, CaseResult, Trace, RubricScore

M3: agent runner + trace capture
- AgentRunner protocol (apps/api/src/rubriclab/runner.py); ResearchAgent with web_search + calculator tools

M4: LLM-as-judge engine

M5: FastAPI routes
- /suites, /runs, /cases, /traces, /diff REST endpoints

M6: Next.js dashboard

M7: Trace viewer + diff UI

M8: demo packaging
- docker compose up --build runs everything end-to-end
- record_demo.sh triggers two runs with a system-prompt tweak between them, captures screenshots via Playwright (falls back to manual URLs), and saves them to docs/
- .env.example documents the only required secret (ANTHROPIC_API_KEY)

git clone https://github.com/your-org/rubriclab
cd rubriclab
cp .env.example .env # add your ANTHROPIC_API_KEY
docker compose up --build
To run the full demo and compare two runs:
./record_demo.sh
Prerequisites: Python 3.12+, uv, Node 22+, pnpm
# Install JS dependencies
pnpm install
# Start the API
cd apps/api
uv sync
uv run uvicorn rubriclab.main:app --reload --port 8000
# In another terminal, start the web app
cd apps/web
pnpm dev
M8 shipped end-to-end: all components from AgentRunner through dashboard and diff UI are live.
┌─────────────────────────┐    ┌──────────────────────────┐
│    Next.js Dashboard    │    │  CLI: rubriclab run ...  │
│ (suites, runs, traces)  │    └────────────┬─────────────┘
└────────────┬────────────┘                 │
             │ REST/JSON                    │
             ▼                              ▼
┌─────────────────────────────────────────────┐
│               FastAPI backend               │
│   /suites  /runs  /cases  /traces  /diff    │
└──┬───────────────┬────────────────┬─────────┘
   │               │                │
   ▼               ▼                ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ AgentRunner  │ │ JudgeEngine  │ │ SQLite       │
│ (calls user  │ │ (Anthropic   │ │ (SQLModel)   │
│  agent +     │ │  rubric      │ │ suites,      │
│  captures    │ │  scoring)    │ │ cases,       │
│  trace)      │ │              │ │ runs,        │
└──────┬───────┘ └──────┬───────┘ │ traces,      │
       │                │         │ scores       │
       ▼                ▼         └──────────────┘
┌──────────────────────────────┐
│    Sample Research Agent     │
│     (Anthropic + tools)      │
└──────────────────────────────┘
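
The AgentRunner box in the diagram (call the user's agent, capture a trace) could look roughly like the sketch below. This is an illustration only, not the contents of apps/api/src/rubriclab/runner.py: the `TraceStep` and `Trace` shapes, the `ToyAgent` class, the `execute_case` helper, and the toy `calculator` are all hypothetical stand-ins, and no LLM is called.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Protocol


@dataclass
class TraceStep:
    """One recorded tool call (hypothetical shape, not the real schema)."""
    tool: str
    args: dict[str, Any]
    output: Any


@dataclass
class Trace:
    """Ordered tool calls plus the agent's final answer."""
    steps: list[TraceStep] = field(default_factory=list)
    final_answer: str = ""


class AgentRunner(Protocol):
    """Any object with a matching run() method satisfies this protocol."""

    def run(self, prompt: str) -> Trace: ...


def calculator(expression: str) -> float:
    # Toy stand-in for a calculator tool: only handles "a + b".
    left, right = expression.split("+")
    return float(left) + float(right)


class ToyAgent:
    """Stand-in for a real agent; it always calls the calculator tool."""

    def __init__(self, tools: dict[str, Callable[..., Any]]) -> None:
        self.tools = tools

    def run(self, prompt: str) -> Trace:
        trace = Trace()
        args = {"expression": prompt}
        output = self.tools["calculator"](**args)
        # Record the call so a trace viewer could replay it later.
        trace.steps.append(TraceStep("calculator", args, output))
        trace.final_answer = str(output)
        return trace


def execute_case(runner: AgentRunner, prompt: str) -> Trace:
    # The harness depends only on the protocol, not on a concrete agent.
    return runner.run(prompt)


trace = execute_case(ToyAgent({"calculator": calculator}), "2 + 3")
print(trace.final_answer)
print([step.tool for step in trace.steps])
```

Typing the harness against a protocol rather than a base class means a user's agent needs no RubricLab import to be runnable, only a compatible `run` method.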
Core data model: Suite 1—* Case, Run 1—* CaseResult, CaseResult 1—1 Trace, CaseResult 1—* RubricScore. Diff = join two Runs on Case and compute per-dimension deltas.
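
The diff described above, joining two Runs on Case and computing per-dimension deltas, can be sketched in plain Python. This is a simplified illustration, not the project's SQLModel code: it flattens CaseResult and RubricScore into one hypothetical `RubricScore` record, and the `diff_runs` function, field names, and sample scores are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricScore:
    """Flattened score record (hypothetical; the real model links via CaseResult)."""
    case_id: str
    dimension: str
    score: float


def diff_runs(
    baseline: list[RubricScore], candidate: list[RubricScore]
) -> dict[tuple[str, str], float]:
    """Join two runs on (case, dimension) and return candidate minus baseline."""
    base = {(s.case_id, s.dimension): s.score for s in baseline}
    deltas: dict[tuple[str, str], float] = {}
    for s in candidate:
        key = (s.case_id, s.dimension)
        # Inner join: cases or dimensions missing from the baseline are skipped.
        if key in base:
            deltas[key] = s.score - base[key]
    return deltas


# Made-up scores for two runs of the same suite.
run_a = [
    RubricScore("case-1", "accuracy", 3.0),
    RubricScore("case-1", "citations", 4.0),
]
run_b = [
    RubricScore("case-1", "accuracy", 4.0),
    RubricScore("case-1", "citations", 2.0),
    RubricScore("case-2", "accuracy", 5.0),  # new case, no baseline to compare
]

deltas = diff_runs(run_a, run_b)
regressions = [key for key, delta in deltas.items() if delta < 0]
print(deltas)
print(regressions)
```

A negative delta marks a regression on that dimension. Whether cases present in only one run should be reported separately is a design choice this sketch leaves out.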

Fully runnable end-to-end as of M8. Every step below works with docker compose up --build or local dev.

- docker compose up (or local Python + pnpm dev)
- Edit agents/research/prompt.md, hit Run again

| Milestone | Description | Status |
|---|---|---|
| M1 | Scaffold + README | ✅ Done |
| M2 | SQLite data model (SQLModel) | ✅ Done |
| M3 | Agent runner + trace capture | ✅ Done |
| M4 | LLM-as-judge engine | ✅ Done |
| M5 | FastAPI routes (/suites, /runs, /cases, /traces, /diff) | ✅ Done |
| M6 | Next.js dashboard (suite browser, run trigger, results) | ✅ Done |
| M7 | Trace viewer + two-run diff UI | ✅ Done |
| M8 | Demo packaging (docker compose, record_demo.sh) | ✅ Done |
| M9 | CLI (rubriclab run --suite=demo) | ⬜ Planned |
Contributions welcome! Please open an issue before submitting a large PR.
- git checkout -b feat/my-feature
- git commit -m "feat: add thing"

MIT © 2024 RubricLab contributors. See LICENSE.