rubric-lab

RubricLab

Evaluation harness for AI agents. Define structured test suites with rubrics, run your agent against them, get LLM-as-judge scores, inspect full execution traces, and diff runs over time to catch regressions.


What is RubricLab?

Building an agent is easy. Knowing whether yesterday’s prompt tweak made it better or worse is hard.

Most teams ship agents with vibes-based testing: a few manual prompts, no structured rubric, and nothing to catch regressions. RubricLab fills that gap in the dev loop: write test cases once, score every change automatically, and see exactly where behavior drifted.

What works today (M8)

M1 — scaffold

M2 — data model + storage

M3 — agent runner + trace capture

M4 — LLM-as-judge engine

M5 — FastAPI routes

M6 — Next.js dashboard

M7 — Trace viewer + diff UI

M8 — demo packaging


Quickstart

git clone https://github.com/your-org/rubriclab
cd rubriclab
cp .env.example .env          # add your ANTHROPIC_API_KEY
docker compose up --build

To run the full demo and compare two runs:

./record_demo.sh

Local dev

Prerequisites: Python 3.12+, uv, Node 22+, pnpm

# Install JS dependencies
pnpm install

# Start the API
cd apps/api
uv sync
uv run uvicorn rubriclab.main:app --reload --port 8000

# In another terminal, start the web app
cd apps/web
pnpm dev

Architecture

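As of M8, the pipeline runs end-to-end: every component from AgentRunner through the dashboard and diff UI is live. The CLI entry point shown in the diagram is not implemented yet; it is planned for M9 (see Roadmap).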

┌─────────────────────────┐    ┌──────────────────────────┐
│  Next.js Dashboard      │    │  CLI: rubriclab run ...  │
│  (suites, runs, traces) │    └────────────┬─────────────┘
└────────────┬────────────┘                 │
             │  REST/JSON                   │
             ▼                              ▼
      ┌─────────────────────────────────────────────┐
      │            FastAPI backend                  │
      │  /suites  /runs  /cases  /traces  /diff     │
      └──┬───────────────┬────────────────┬─────────┘
         │               │                │
         ▼               ▼                ▼
  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
  │ AgentRunner  │ │ JudgeEngine  │ │  SQLite      │
  │ (calls user  │ │ (Anthropic   │ │  (SQLModel)  │
  │  agent +     │ │  rubric      │ │  suites,     │
  │  captures    │ │  scoring)    │ │  cases,      │
  │  trace)      │ │              │ │  runs,       │
  └──────┬───────┘ └──────┬───────┘ │  traces,     │
         │                │         │  scores      │
         ▼                ▼         └──────────────┘
  ┌──────────────────────────────┐
  │  Sample Research Agent       │
  │  (Anthropic + tools)         │
  └──────────────────────────────┘

Core data model: Suite 1—* Case, Run 1—* CaseResult, CaseResult 1—1 Trace, CaseResult 1—* RubricScore. Diff = join two Runs on Case and compute per-dimension deltas.
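
The diff described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's SQLModel schema: the class and field names (`case_id`, `dimension`, `score`), the `diff_runs` helper, and the dimension names in the usage lines are all hypothetical, and Trace is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class RubricScore:
    dimension: str  # hypothetical dimension name, e.g. "accuracy"
    score: float


@dataclass
class CaseResult:
    case_id: str  # the Case this result belongs to
    scores: list[RubricScore] = field(default_factory=list)


@dataclass
class Run:
    run_id: str
    results: list[CaseResult] = field(default_factory=list)


def diff_runs(base: Run, head: Run) -> dict[str, dict[str, float]]:
    """Join two runs on case_id; return head-minus-base deltas per dimension."""
    base_by_case = {r.case_id: r for r in base.results}
    deltas: dict[str, dict[str, float]] = {}
    for head_result in head.results:
        base_result = base_by_case.get(head_result.case_id)
        if base_result is None:
            continue  # case exists only in head, so there is nothing to compare
        base_scores = {s.dimension: s.score for s in base_result.scores}
        deltas[head_result.case_id] = {
            s.dimension: s.score - base_scores[s.dimension]
            for s in head_result.scores
            if s.dimension in base_scores
        }
    return deltas


# Usage with made-up scores: one case, two rubric dimensions.
v1 = Run("run-1", [CaseResult("case-a", [RubricScore("accuracy", 3.0), RubricScore("tool_use", 2.0)])])
v2 = Run("run-2", [CaseResult("case-a", [RubricScore("accuracy", 4.0), RubricScore("tool_use", 1.0)])])
print(diff_runs(v1, v2))  # {'case-a': {'accuracy': 1.0, 'tool_use': -1.0}}
```

A positive delta means the second run scored higher on that dimension. Cases or dimensions present in only one run are skipped in this sketch; a real diff might surface them as added or removed instead.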


Demo flow

Fully runnable end-to-end as of M8. Every step below works with docker compose up --build or local dev.

  1. docker compose up --build (or run the API and web app locally; see Local dev)
  2. Open dashboard → see preloaded “Research Agent v1” suite with 8 cases
  3. Click Run → cases stream in as they complete; pass/fail badges + scores appear live
  4. Open a failed case → trace viewer shows the agent making a wrong tool call; judge’s justification is shown inline
  5. Edit the agent’s system prompt in agents/research/prompt.md, hit Run again
  6. Open Compare runs → side-by-side diff highlights which 3 cases improved, which 1 regressed, with score deltas per rubric dimension

Roadmap

| Milestone | Description | Status |
|-----------|-------------|--------|
| M1 | Scaffold + README | ✅ Done |
| M2 | SQLite data model (SQLModel) | ✅ Done |
| M3 | Agent runner + trace capture | ✅ Done |
| M4 | LLM-as-judge engine | ✅ Done |
| M5 | FastAPI routes (/suites, /runs, /cases, /traces, /diff) | ✅ Done |
| M6 | Next.js dashboard (suite browser, run trigger, results) | ✅ Done |
| M7 | Trace viewer + two-run diff UI | ✅ Done |
| M8 | Demo packaging (docker compose, record_demo.sh) | ✅ Done |
| M9 | CLI (rubriclab run --suite=demo) | ⬜ Planned |

Contributing

Contributions welcome! Please open an issue before submitting a large PR.

  1. Fork the repo
  2. Create a feature branch: git checkout -b feat/my-feature
  3. Commit with Conventional Commits: git commit -m "feat: add thing"
  4. Open a PR

License

MIT © 2024 RubricLab contributors. See LICENSE.