AgentEval Lab

Video walkthrough: https://youtu.be/rVIvr8eDdVM
60-second overview: https://youtu.be/Rv5FMktWmzE

Open-source agent evaluation harness: run agents against datasets, capture traces, score them with rubric-based LLM judges, and view regressions in a web dashboard.

What it is

AgentEval Lab is a local-first harness for testing LLM agents against structured datasets. You define tasks in YAML, each with a prompt, an expected outcome, and a rubric, then point the CLI at an agent file. You get back a SQLite database of structured traces and per-axis rubric scores produced by a separate LLM judge. No external infrastructure is required beyond the Anthropic API key that the judge needs.
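To make the task shape concrete, here is a minimal sketch of how one dataset entry might look once the YAML has been parsed into Python. The field names (`id`, `prompt`, `expected`, `rubric`), the rubric axes, and the `Task` and `load_tasks` helpers are assumptions for illustration, not the project's actual schema; check the files under datasets/ for the real one.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Task:
    """One dataset entry. Field names are hypothetical, not the project's schema."""
    id: str
    prompt: str
    expected: str
    rubric: dict[str, str] = field(default_factory=dict)  # axis name -> what the judge checks


def load_tasks(raw: list[dict]) -> list[Task]:
    """Validate already-parsed YAML (a list of mappings) into Task objects."""
    required = ("id", "prompt", "expected", "rubric")
    tasks = []
    for i, entry in enumerate(raw):
        missing = set(required) - entry.keys()
        if missing:
            raise ValueError(f"task #{i} is missing fields: {sorted(missing)}")
        tasks.append(Task(**{k: entry[k] for k in required}))
    return tasks


# What a parsed entry might look like (illustrative values only).
example = [{
    "id": "t_001",
    "prompt": "What is 17 * 23?",
    "expected": "391",
    "rubric": {
        "correctness": "Final answer matches the expected outcome.",
        "tool_use": "Used the calculator instead of guessing.",
    },
}]
tasks = load_tasks(example)
```

Failing fast on a missing field keeps a malformed task from producing a confusing judge score later in the run.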

The web dashboard gives you three views: a runs list, a step-by-step trace inspector that shows every thought and tool call, and a compare view that diffs two runs side by side and flags any rubric axis that regressed by one point or more. The goal is to make “does this agent actually work?” a repeatable, auditable question rather than a vibe check.
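The regression rule described above (flag any axis that dropped by one point or more) can be sketched as a pair of pure functions. This is an illustration under assumptions, not the code in aelab/regression.py: the function names, the dict-of-scores shape, and the axis names are all hypothetical.

```python
def score_deltas(baseline: dict[str, int], candidate: dict[str, int]) -> dict[str, int]:
    """Per-axis delta (candidate minus baseline) for axes present in both runs."""
    return {axis: candidate[axis] - baseline[axis] for axis in baseline if axis in candidate}


def regressed_axes(baseline: dict[str, int], candidate: dict[str, int],
                   threshold: int = 1) -> list[str]:
    """Axes whose score dropped by at least `threshold` points."""
    deltas = score_deltas(baseline, candidate)
    return sorted(axis for axis, delta in deltas.items() if delta <= -threshold)


# Illustrative scores for one task in two runs (made-up numbers).
baseline = {"correctness": 5, "tool_use": 4, "clarity": 4}
candidate = {"correctness": 3, "tool_use": 4, "clarity": 5}
flags = regressed_axes(baseline, candidate)
```

Here only `correctness` is flagged: `tool_use` is unchanged and `clarity` improved. Keeping the logic free of database or network access makes it easy to unit test.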

Quickstart

git clone https://github.com/RitikPatill/agent-eval-lab.git
cd agent-eval-lab

# Install the Python package and dev tools (requires Python 3.11+)
pip install -e ".[dev]"

# Set your Anthropic API key — required for the LLM judge
export ANTHROPIC_API_KEY=sk-...

# Seed the DB with two pre-built demo runs and start both servers
# Requires Node 18+ for the Next.js dashboard
bash record_demo.sh

The script installs Node dependencies, seeds SQLite with a baseline run (r_001) and a regression run (r_002), and starts the FastAPI backend on port 8000 and the Next.js dashboard on port 3000. Press Ctrl-C to stop both servers.

Usage

Run any dataset against any agent from the CLI:

aelab run datasets/research_qa.yaml --agent agents/v1_researcher.py
# → prints a run id, e.g. r_abc123

Open http://localhost:3000/runs to see the new run. Click it to walk through the trace: each row is one agent step showing the thought, the tool called, and the judge’s score with a written justification.

To compare two runs, navigate to http://localhost:3000/compare?a=r_001&b=r_002. Tasks where the second run scored lower on any rubric axis receive a REGRESSION badge, with the judge’s justification shown inline so you know why the score dropped.
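If you compare runs often, a small helper can build the compare link for you. This sketch reuses the dashboard address and the `a` / `b` query parameters from the URL above; the `compare_url` function itself is a hypothetical convenience, not part of the package.

```python
from urllib.parse import urlencode


def compare_url(run_a: str, run_b: str, base: str = "http://localhost:3000") -> str:
    """Build the dashboard compare link for two run ids (first run as `a`, second as `b`)."""
    return f"{base}/compare?{urlencode({'a': run_a, 'b': run_b})}"


url = compare_url("r_001", "r_002")
```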

The FastAPI backend also exposes interactive docs at http://localhost:8000/docs.

Architecture

flowchart LR
    CLI[aelab CLI] --> Harness[Agent Harness]
    Harness -->|tool calls| Tools[calc · web-fetch · file-read]
    Harness -->|trace| DB[(SQLite)]
    Harness --> Judge[LLM Judge]
    Judge -->|rubric scores| DB
    DB --> API[FastAPI]
    API --> UI[Next.js Dashboard]
    UI -->|runs / traces / compare| User((User))
    Datasets[YAML datasets] --> Harness

See docs/architecture.md for a detailed design walkthrough.
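As a rough illustration of the data flow in the diagram (dataset, then agent, then judge, then SQLite), here is a self-contained sketch. The table layout, the `run_dataset` function, and the stub agent and judge are all assumptions made so the example runs offline; the real schema and orchestration live in aelab/db.py and aelab/runner.py.

```python
import json
import sqlite3
from typing import Callable

Agent = Callable[[str], list[dict]]           # prompt -> trace steps
Judge = Callable[[dict, list[dict]], dict]    # (task, trace) -> {axis: score}


def run_dataset(tasks: list[dict], agent: Agent, judge: Judge,
                conn: sqlite3.Connection, run_id: str) -> None:
    """Run every task through the agent, score the trace, and persist both."""
    # Hypothetical two-table schema: one row per trace, one row per rubric axis.
    conn.execute("CREATE TABLE IF NOT EXISTS traces (run_id TEXT, task_id TEXT, steps TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS scores "
                 "(run_id TEXT, task_id TEXT, axis TEXT, score INTEGER)")
    for task in tasks:
        steps = agent(task["prompt"])
        conn.execute("INSERT INTO traces VALUES (?, ?, ?)",
                     (run_id, task["id"], json.dumps(steps)))
        for axis, score in judge(task, steps).items():
            conn.execute("INSERT INTO scores VALUES (?, ?, ?, ?)",
                         (run_id, task["id"], axis, score))
    conn.commit()


# Stub agent and judge so the sketch needs no API key.
def echo_agent(prompt: str) -> list[dict]:
    return [{"thought": "answer directly", "tool": None, "observation": prompt}]


def constant_judge(task: dict, steps: list[dict]) -> dict:
    return {"correctness": 4}


conn = sqlite3.connect(":memory:")
run_dataset([{"id": "t_001", "prompt": "What is 2 + 2?"}],
            echo_agent, constant_judge, conn, "r_demo")
rows = conn.execute("SELECT task_id, axis, score FROM scores").fetchall()
```

Passing the agent and judge in as plain callables is one way to keep the orchestration testable without a live model.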

Project structure

agent-eval-lab/
  aelab/               # Python package — core logic
    cli.py             # `aelab` entry-point (Typer)
    agent.py           # ReAct loop: thought → tool call → observation
    tools/             # calculator, web-fetch, file-read implementations
    runner.py          # orchestrates dataset → agent → judge → db
    judge.py           # LLM-as-judge with structured JSON output
    regression.py      # pure score-delta and regression-flag logic
    db.py              # SQLite schema and query helpers
    api.py             # FastAPI app: /api/runs, /api/traces, /api/compare
  agents/              # example agent configs (v1 baseline, v2 regression)
  datasets/            # research_qa.yaml, tool_use.yaml
  scripts/             # seed_demo.py — idempotent DB seeder
  web/                 # Next.js 14 + Tailwind dashboard
    src/app/runs/      # runs list with regression badges
    src/app/compare/   # side-by-side score delta view
    src/components/    # CompareTable, ScoreDelta, RunSelectorForm
  docs/                # architecture.md, quickstart.md, screenshot.png, demo.gif
  tests/               # pytest: scaffold, regression logic, API integration
  record_demo.sh       # end-to-end demo: seed → start servers → print URLs
  pyproject.toml
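The tree describes agent.py as a ReAct loop (thought, then tool call, then observation). For readers new to the pattern, here is a minimal sketch of that loop. It is not the project's implementation: the step dictionary keys, the `react_loop` signature, the toy `calc` tool, and the scripted policy standing in for the model are all hypothetical.

```python
import operator
from typing import Callable

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def calc(expression: str) -> str:
    """Toy calculator: evaluates '<number> <op> <number>' separated by spaces."""
    left, op, right = expression.split()
    return str(OPS[op](float(left), float(right)))


def react_loop(policy: Callable[[list[dict]], dict], tools: dict,
               max_steps: int = 5) -> list[dict]:
    """Repeat thought -> tool call -> observation until the policy gives a final answer."""
    trace: list[dict] = []
    for _ in range(max_steps):
        step = policy(trace)  # in a real agent this would be a model call
        if step.get("tool"):
            step["observation"] = tools[step["tool"]](step["input"])
        trace.append(step)
        if "final" in step:
            break
    return trace


def scripted_policy(trace: list[dict]) -> dict:
    """Hard-coded stand-in for the model so the example runs offline."""
    if not trace:
        return {"thought": "I should multiply.", "tool": "calc", "input": "17 * 23"}
    return {"thought": "I have the product.", "final": trace[-1]["observation"]}


trace = react_loop(scripted_policy, {"calc": calc})
```

The returned list is the kind of step-by-step record a trace inspector can display: one entry per step, each with its thought and, where a tool ran, the observation.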

Roadmap

v0.2 — Multi-judge ensembling: run N independent judge calls per trace, aggregate scores by mean, and surface standard deviation in the dashboard to flag statistically uncertain results
v0.3 — OpenTelemetry export: map trace events to OTEL spans and add --otel-endpoint to aelab run so traces are visible in Jaeger, Zipkin, or Grafana Tempo
v0.4 — Langfuse sink: add --sink langfuse to post completed runs and rubric scores to a Langfuse instance alongside production traces
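For the v0.2 item, one way the aggregation could work is sketched below: collect N judge results per trace, then report the per-axis mean and sample standard deviation. This is only a sketch of the planned idea; the `aggregate` function, the input shape, and the scores are hypothetical, and the eventual implementation may differ.

```python
from statistics import mean, stdev


def aggregate(judge_scores: list[dict[str, int]]) -> dict[str, dict[str, float]]:
    """Combine N judge calls into a per-axis mean and sample standard deviation."""
    summary = {}
    for axis in judge_scores[0]:
        values = [scores[axis] for scores in judge_scores]
        summary[axis] = {
            "mean": mean(values),
            # stdev needs at least two samples; report 0.0 for a single call.
            "stdev": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


# Three made-up judge calls for the same trace.
calls = [
    {"correctness": 4, "clarity": 5},
    {"correctness": 2, "clarity": 5},
    {"correctness": 3, "clarity": 5},
]
summary = aggregate(calls)
```

In this example the judges agree on `clarity` (standard deviation 0) but disagree on `correctness` (mean 3, standard deviation 1), which is the kind of spread the dashboard could surface as an uncertain result.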

License

MIT — see LICENSE.

Built autonomously by autodev, a multi-agent orchestrator I designed. Each commit in this repo was authored by me; the implementation work was performed by Sonnet under the orchestrator’s control. Read the orchestrator’s README to see how.
