Skip to content

@Eval + @Metric

Spec

Full specification: specs/eval.md · Item: AJ-4 (bundles @Metric, originally tracked as AJ-5)

Purpose

@Eval is a class decorator that turns a class into an evaluation suite over an @Agent or @Workflow target. @Metric is the companion method decorator that marks per-case scoring functions. The two ship together because a metric-less @Eval is meaningless.

EvalRunner.run(SuiteCls) resolves the dataset, instantiates the target, runs cases concurrently, aggregates per-metric scores via the configured aggregator (mean / min / max / p50 / p95 / count_passing), computes a weighted total, compares against threshold, and returns an EvalRun snapshot. EvalRun.save(...) + compare_runs(prev, curr) power regression detection (the AJ-27 CLI wraps this).

Reach for @Eval whenever you change a prompt, a model, or a tool — and you want a number to decide whether to ship.

Signature

def Eval(
    *,
    agent: type | None = None,
    workflow: type | None = None,
    dataset: str | os.PathLike[str] | Dataset | type[Dataset],
    threshold: float = 0.5,
    concurrency: int = 5,
) -> Callable[[type[T]], type[T]]: ...


@overload
def Metric(fn: Callable[..., float], /) -> Callable[..., float]: ...
@overload
def Metric(
    *,
    aggregator: Literal["mean", "min", "max", "p50", "p95", "count_passing"] = "mean",
    weight: float = 1.0,
    pass_threshold: float = 0.5,
) -> Callable[[Callable[..., float]], Callable[..., float]]: ...

Quick example

from ajolopy import Agent
from ajolopy.eval import Eval, EvalRunner, Metric


@Agent(model="claude-opus-4-7", system="You are Acme Support.")
class Support:
    """Top-level support agent."""


@Eval(agent=Support, dataset="evals/support.jsonl", threshold=0.85)
class SupportEval:
    """Smoke test for the Support agent."""

    @Metric
    def helpful(self, output, expected) -> float:
        return 1.0 if expected["intent"] in output.text.lower() else 0.0

    @Metric(aggregator="p95", weight=0.5)
    def latency(self, output, expected) -> float:
        return max(0.0, 1.0 - output.latency_ms / 5000)


async def main() -> None:
    run = await EvalRunner().run(SupportEval)
    run.save(".ajolopy/eval-runs/2026-05-14T22-30-00Z.json")

Kwargs

Decorator Kwarg Type Default Description
@Eval agent type \| None None An @Agent class. Exactly one of agent= / workflow= is required.
@Eval workflow type \| None None A @Workflow class. Mutually exclusive with agent=.
@Eval dataset str \| PathLike \| Dataset \| type[Dataset] required Anything resolve_dataset accepts (path to JSONL, custom Dataset, ...).
@Eval threshold float 0.5 Aggregate-score floor in [0.0, 1.0]. The suite passes when the weighted aggregate clears it.
@Eval concurrency int 5 Max in-flight cases per run(). Must be >= 1.
@Metric aggregator Literal["mean", "min", "max", "p50", "p95", "count_passing"] "mean" How per-case scores fold into the metric's aggregate.
@Metric weight float 1.0 Relative weight in the suite-wide weighted aggregate. Must be > 0.
@Metric pass_threshold float 0.5 Per-case pass bar AND per-metric aggregate pass bar (shared by design).

Escape hatches

  • Subclass EvalRunner to override _invoke_target (custom invocation flow) or _aggregate (custom score combinators).
  • Programmatic suites. Build _ajolopy_eval metadata by hand for runtime-generated suites that do not fit the decorator surface.
  • Custom datasets. Implement Dataset (see specs/dataset-loader.md) for non-JSONL sources (CSV, S3, in-memory generators, ...).
  • Async metrics. @Metric accepts async def for LLM-as-judge or network-bound scoring.

Common gotchas

  • Exactly one of agent= / workflow= must be set. Both or neither raises EvalConfigError at decoration time.
  • A suite without any @Metric method raises EvalConfigError — an eval with no metrics has nothing to measure.
  • Per-case isolation: a target error or metric exception fails just that case; the suite continues. EvalRun.passed reflects the aggregate.
  • compare_runs(prev, curr) rejects dataset sha256 mismatches with EvalComparisonError — comparisons are only meaningful for the same dataset.
  • EvalRun.load(path).cases[i].output.raw is the string repr of the original return, not the original object. Round-tripping full fidelity is out of scope; keep the in-memory EvalRun if you need the raw value.

See also