agent-eval-lab

Architecture

This document is a detailed design walkthrough. For a high-level overview and quickstart, see the README.

Overview

AgentEval Lab is a local-first pipeline that takes a YAML dataset of tasks, runs them through a ReAct-style agent, grades the resulting traces with an LLM judge, and stores everything in SQLite. A FastAPI backend exposes the data to a Next.js dashboard that supports browsing runs, inspecting traces step by step, and comparing two runs to detect regressions.

Data model

The SQLite database contains four tables.

runs records each top-level evaluation run. It stores a unique run_id (e.g. r_001), the dataset path, the agent module path, a created_at timestamp, and a computed avg_score that is written after all tasks are judged.

task_runs records each individual task execution within a run. It holds a foreign key to runs, the task_id from the YAML dataset, the raw prompt and expected_outcome, the agent’s final answer, and a status field (success or error).

trace_events records every step the agent took while solving a task. Each row belongs to a task_run and carries a step_index, a kind (thought, tool_call, or observation), the content text, and an optional tool_name and tool_input for tool-call steps. Rows are ordered by step_index to reconstruct the full reasoning chain.

scores records the judge’s rubric output for each task run. Each row holds a foreign key to task_runs, an axis name (e.g. correctness), a numeric score (1–5), and a justification string. There is one row per axis per task run.
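
As a rough illustration of the four tables, the sketch below creates them with Python's built-in sqlite3 module. The table names and most column names come from the descriptions above; the column types, the surrogate id columns, the task_run_id, final_answer, dataset, and agent column names, the CHECK and UNIQUE constraints, and the open_db and load_trace helpers are assumptions made for this example, not the project's actual DDL.

```python
import sqlite3

# Hypothetical DDL: types, key column names, and constraints are assumptions.
SCHEMA = """
CREATE TABLE runs (
    run_id     TEXT PRIMARY KEY,   -- e.g. r_001
    dataset    TEXT NOT NULL,      -- path to the YAML dataset
    agent      TEXT NOT NULL,      -- agent module path
    created_at TEXT NOT NULL,
    avg_score  REAL                -- NULL until all tasks are judged
);
CREATE TABLE task_runs (
    id               INTEGER PRIMARY KEY,
    run_id           TEXT NOT NULL REFERENCES runs(run_id),
    task_id          TEXT NOT NULL,
    prompt           TEXT NOT NULL,
    expected_outcome TEXT,
    final_answer     TEXT,
    status           TEXT NOT NULL CHECK (status IN ('success', 'error'))
);
CREATE TABLE trace_events (
    id          INTEGER PRIMARY KEY,
    task_run_id INTEGER NOT NULL REFERENCES task_runs(id),
    step_index  INTEGER NOT NULL,
    kind        TEXT NOT NULL
                CHECK (kind IN ('thought', 'tool_call', 'observation')),
    content     TEXT NOT NULL,
    tool_name   TEXT,              -- set only for tool_call rows
    tool_input  TEXT
);
CREATE TABLE scores (
    id            INTEGER PRIMARY KEY,
    task_run_id   INTEGER NOT NULL REFERENCES task_runs(id),
    axis          TEXT NOT NULL,
    score         INTEGER NOT NULL CHECK (score BETWEEN 1 AND 5),
    justification TEXT NOT NULL,
    UNIQUE (task_run_id, axis)     -- one row per axis per task run
);
"""


def open_db(path=":memory:"):
    """Open a database, enable foreign keys, and create the tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def load_trace(conn, task_run_id):
    """Return (step_index, kind, content) rows in reasoning order."""
    cursor = conn.execute(
        "SELECT step_index, kind, content FROM trace_events "
        "WHERE task_run_id = ? ORDER BY step_index",
        (task_run_id,),
    )
    return cursor.fetchall()
```

Sorting by step_index in the query, rather than relying on insertion order, keeps trace reconstruction correct even if events are written out of order.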

Agent harness

The agent loop in aelab/agent.py implements a minimal ReAct cycle:

Thought — the agent is prompted with the task and any prior observations; it responds with a reasoning step that is stored as a thought trace event.
Tool call — if the thought includes a tool invocation, the harness parses the tool name and input, executes the appropriate tool (calculator, web-fetch, or file-read), and stores a tool_call trace event.
Observation — the tool’s output is appended to context as an observation trace event and fed back into the next iteration.

The loop continues until the agent emits a final answer or a maximum step limit is reached. Every event is written to trace_events in order, so the full chain of reasoning is always recoverable.
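
The cycle described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in aelab/agent.py: the "Action: tool[input]" and "Final Answer:" text conventions, the run_agent signature, the default step limit of 8, and the in-memory event list (instead of writes to trace_events) are all hypothetical.

```python
import re

# Assumed text protocol; the real agent may mark tool calls differently.
ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*)\]", re.DOTALL)
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)


def run_agent(task, llm, tools, max_steps=8):
    """Run a ReAct loop; return (final_answer_or_None, trace_events)."""
    events = []
    context = [f"Task: {task}"]

    def record(kind, content, **extra):
        # step_index is the event's position in the trace.
        events.append({"step_index": len(events), "kind": kind,
                       "content": content, **extra})

    for _ in range(max_steps):
        # Thought: prompt with the task and everything seen so far.
        thought = llm("\n".join(context))
        record("thought", thought)
        context.append(thought)

        final = FINAL_RE.search(thought)
        if final:
            return final.group(1).strip(), events

        # Tool call: parse the tool name and input out of the thought.
        action = ACTION_RE.search(thought)
        if action is None:
            continue
        name, tool_input = action.group(1), action.group(2)
        record("tool_call", f"{name}[{tool_input}]",
               tool_name=name, tool_input=tool_input)

        # Observation: feed the tool output into the next iteration.
        tool = tools.get(name)
        observation = tool(tool_input) if tool else f"unknown tool: {name}"
        record("observation", observation)
        context.append(f"Observation: {observation}")

    return None, events  # step limit reached without a final answer
```

Here llm is any callable that maps a prompt string to a reply string, and tools maps tool names to callables, so the loop can be exercised with scripted replies and stub tools before a real model is wired in.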

LLM-as-judge

The judge in aelab/judge.py is a separate Claude API call that never shares context with the agent call, so the agent’s conversation state cannot leak into its own evaluation. The judge sees only the inputs listed here.

The judge is given:

A system prompt that defines its role as an objective evaluator
The original task prompt and expected_outcome
The full ordered trace (all thought, tool-call, and observation events)
The agent’s final answer
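
One way to assemble these four inputs into a single judge request is sketched below. The JUDGE_SYSTEM wording, the build_judge_input name, and the section labels in the user message are placeholders chosen for this example; the actual prompt text in aelab/judge.py is not shown in this document.

```python
# Placeholder wording; the project's real system prompt is not reproduced here.
JUDGE_SYSTEM = "You are an objective evaluator of AI agent runs."


def build_judge_input(prompt, expected_outcome, trace_events, final_answer):
    """Return (system_prompt, user_message) for one task run."""
    # Order events by step_index so the judge reads the chain as it happened.
    ordered = sorted(trace_events, key=lambda e: e["step_index"])
    trace_lines = [
        f"[{e['step_index']}] {e['kind']}: {e['content']}" for e in ordered
    ]
    user_message = "\n\n".join([
        f"Task prompt:\n{prompt}",
        f"Expected outcome:\n{expected_outcome}",
        "Trace:\n" + "\n".join(trace_lines),
        f"Final answer:\n{final_answer}",
    ])
    return JUDGE_SYSTEM, user_message
```

Because the message is built only from stored rows, a judge call can be replayed later from the database without re-running the agent.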

The rubric axes are: correctness (did the answer match the expected outcome?), tool efficiency (did the agent use the minimum necessary tool calls?), hallucination (did the agent fabricate facts not supported by tool outputs?), and format compliance (did the answer match the requested format?).

Scores are elicited via a tool schema that requires the model to return a structured JSON object with axis, score (integer 1–5), and justification per axis. This avoids free-text parsing and makes it easy to verify that every required axis is present in the output.
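
A tool definition along these lines might look like the sketch below, written in the name / description / input_schema shape that Anthropic tool definitions use. The tool name submit_scores, the snake_case axis identifiers, the wrapping scores array, and the validate_scores helper are assumptions for illustration; the real schema may be laid out differently.

```python
# Axis identifiers are assumed spellings of the four rubric axes.
AXES = ["correctness", "tool_efficiency", "hallucination", "format_compliance"]

# Hypothetical tool definition; only the field names axis, score, and
# justification come from the design description.
SCORE_TOOL = {
    "name": "submit_scores",
    "description": "Record one rubric score per axis for the task run.",
    "input_schema": {
        "type": "object",
        "properties": {
            "scores": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "axis": {"type": "string", "enum": AXES},
                        "score": {"type": "integer", "minimum": 1, "maximum": 5},
                        "justification": {"type": "string"},
                    },
                    "required": ["axis", "score", "justification"],
                },
            }
        },
        "required": ["scores"],
    },
}


def validate_scores(payload):
    """Check a tool-call payload: every axis exactly once, scores in 1..5."""
    rows = payload["scores"]
    seen = [row["axis"] for row in rows]
    unknown = sorted(set(seen) - set(AXES))
    if unknown:
        raise ValueError(f"unknown axes: {unknown}")
    missing = sorted(set(AXES) - set(seen))
    if missing:
        raise ValueError(f"missing axes: {missing}")
    if len(seen) != len(set(seen)):
        raise ValueError("duplicate axis in judge output")
    for row in rows:
        score = row["score"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"score out of range for {row['axis']}: {score!r}")
    return rows
```

An array-of-objects schema like this one constrains the shape of each entry but not which axes appear, which is why the sketch pairs it with an explicit check before rows are written to scores.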

Regression detection

aelab/regression.py implements two pure functions.

compute_deltas(scores_a, scores_b) takes two flat lists of score rows (each with task_id, axis, and score) and groups them by task_id × axis. For each group it computes delta = score_b − score_a. The result is a list of delta objects that the compare endpoint attaches to each task.

is_regression(deltas) returns True if any delta in the list is ≤ −1, meaning at least one axis dropped by a full point or more in run B relative to run A.
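
The two functions are small enough to sketch in full. The function names, the row fields, and the delta and threshold rules follow the description above; the dict-based row shape, the extra score_a and score_b fields on each delta, and the choice to skip task/axis pairs that appear in only one run are assumptions of this example.

```python
def compute_deltas(scores_a, scores_b):
    """Pair scores by (task_id, axis) and return delta = score_b - score_a."""
    a = {(row["task_id"], row["axis"]): row["score"] for row in scores_a}
    b = {(row["task_id"], row["axis"]): row["score"] for row in scores_b}
    deltas = []
    # Assumption: pairs missing from either run are skipped, not reported.
    for task_id, axis in sorted(a.keys() & b.keys()):
        score_a, score_b = a[(task_id, axis)], b[(task_id, axis)]
        deltas.append({
            "task_id": task_id,
            "axis": axis,
            "score_a": score_a,
            "score_b": score_b,
            "delta": score_b - score_a,
        })
    return deltas


def is_regression(deltas):
    """True if any axis dropped by a full point or more in run B."""
    return any(d["delta"] <= -1 for d in deltas)
```

For example, if a task scored 5 on correctness in run A and 3 in run B, its delta is −2 and is_regression returns True for that task even if every other axis improved.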

API surface

GET /api/runs — returns a list of all runs ordered by created_at descending. Each item includes run_id, dataset, agent, avg_score, created_at, and has_regression (a boolean computed by comparing to the immediately preceding run for the same dataset).
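
The has_regression flag on the run list could be derived roughly as follows. This is a sketch, not the endpoint's code: the annotate_runs and _dropped names, the scores_by_run mapping, and the assumption that created_at values are ISO-8601 strings (so that they sort chronologically as text) are all hypothetical. The drop rule mirrors the one-point threshold used by is_regression.

```python
def _dropped(prev_scores, cur_scores):
    """True if any (task_id, axis) score fell by one point or more."""
    before = {(s["task_id"], s["axis"]): s["score"] for s in prev_scores}
    return any(
        s["score"] - before[(s["task_id"], s["axis"])] <= -1
        for s in cur_scores
        if (s["task_id"], s["axis"]) in before
    )


def annotate_runs(runs, scores_by_run):
    """Return runs newest-first, each with a has_regression boolean."""
    last_seen = {}  # dataset -> run_id of the most recent earlier run
    annotated = []
    # Walk oldest to newest so each run sees its immediate predecessor.
    for run in sorted(runs, key=lambda r: r["created_at"]):
        prev_id = last_seen.get(run["dataset"])
        flag = prev_id is not None and _dropped(
            scores_by_run[prev_id], scores_by_run[run["run_id"]]
        )
        annotated.append({**run, "has_regression": flag})
        last_seen[run["dataset"]] = run["run_id"]
    annotated.reverse()  # the endpoint lists runs by created_at descending
    return annotated
```

The first run for a dataset has no predecessor, so in this sketch its flag is always False.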

GET /api/runs/{run_id} — returns full detail for a single run: the run metadata, a list of task_runs each with their trace_events and scores.

GET /api/compare?a={run_id}&b={run_id} — returns a side-by-side comparison. The response contains metadata for both runs and a tasks array. Each task entry includes the task prompt, the per-axis score deltas computed by compute_deltas, a has_regression boolean from is_regression, and the judge justifications from both runs for context.

Dashboard routing

/runs fetches GET /api/runs and renders a table of runs with their average scores and a ⚠ regression badge when has_regression is true. Each row links to the trace viewer and includes a “Compare →” link that pre-fills the compare form.

/runs/[id] fetches GET /api/runs/{run_id} and renders the trace viewer. Tasks are listed with their final answer and judge scores; each task is expandable to reveal the full ordered trace as a collapsible step tree showing thought, tool call, and observation events with their content.

/compare renders a two-dropdown RunSelectorForm when no query parameters are present. When ?a=<id>&b=<id> are present it fetches GET /api/compare and renders the CompareTable. Each task row shows per-axis score deltas via the ScoreDelta chip (red for negative, green for positive, gray for zero) and a REGRESSION badge when has_regression is true for that task. Judge justifications from both runs are shown inline in an expanded panel.
