@Eval + @Metric¶
Spec
Full specification: specs/eval.md
· Item: AJ-4 (bundles @Metric, originally tracked as AJ-5)
Purpose¶
@Eval is a class decorator that turns a class into an evaluation suite
over an @Agent or @Workflow target.
@Metric is the companion method decorator that marks per-case
scoring functions. The two ship together because a metric-less @Eval
is meaningless.
EvalRunner.run(SuiteCls) resolves the dataset, instantiates the target,
runs cases concurrently, aggregates per-metric scores via the configured
aggregator (mean / min / max / p50 / p95 / count_passing),
computes a weighted total, compares against threshold, and returns an
EvalRun snapshot. EvalRun.save(...) + compare_runs(prev, curr)
power regression detection (the AJ-27 CLI wraps this).
Reach for @Eval whenever you change a prompt, a model, or a tool — and
you want a number to decide whether to ship.
Signature¶
def Eval(
*,
agent: type | None = None,
workflow: type | None = None,
dataset: str | os.PathLike[str] | Dataset | type[Dataset],
threshold: float = 0.5,
concurrency: int = 5,
) -> Callable[[type[T]], type[T]]: ...
@overload
def Metric(fn: Callable[..., float], /) -> Callable[..., float]: ...
@overload
def Metric(
*,
aggregator: Literal["mean", "min", "max", "p50", "p95", "count_passing"] = "mean",
weight: float = 1.0,
pass_threshold: float = 0.5,
) -> Callable[[Callable[..., float]], Callable[..., float]]: ...
Quick example¶
from ajolopy import Agent
from ajolopy.eval import Eval, EvalRunner, Metric
@Agent(model="claude-opus-4-7", system="You are Acme Support.")
class Support:
"""Top-level support agent."""
@Eval(agent=Support, dataset="evals/support.jsonl", threshold=0.85)
class SupportEval:
"""Smoke test for the Support agent."""
@Metric
def helpful(self, output, expected) -> float:
return 1.0 if expected["intent"] in output.text.lower() else 0.0
@Metric(aggregator="p95", weight=0.5)
def latency(self, output, expected) -> float:
return max(0.0, 1.0 - output.latency_ms / 5000)
async def main() -> None:
run = await EvalRunner().run(SupportEval)
run.save(".ajolopy/eval-runs/2026-05-14T22-30-00Z.json")
Kwargs¶
| Decorator | Kwarg | Type | Default | Description |
|---|---|---|---|---|
@Eval |
agent |
type \| None |
None |
An @Agent class. Exactly one of agent= / workflow= is required. |
@Eval |
workflow |
type \| None |
None |
A @Workflow class. Mutually exclusive with agent=. |
@Eval |
dataset |
str \| PathLike \| Dataset \| type[Dataset] |
required | Anything resolve_dataset accepts (path to JSONL, custom Dataset, ...). |
@Eval |
threshold |
float |
0.5 |
Aggregate-score floor in [0.0, 1.0]. The suite passes when the weighted aggregate clears it. |
@Eval |
concurrency |
int |
5 |
Max in-flight cases per run(). Must be >= 1. |
@Metric |
aggregator |
Literal["mean", "min", "max", "p50", "p95", "count_passing"] |
"mean" |
How per-case scores fold into the metric's aggregate. |
@Metric |
weight |
float |
1.0 |
Relative weight in the suite-wide weighted aggregate. Must be > 0. |
@Metric |
pass_threshold |
float |
0.5 |
Per-case pass bar AND per-metric aggregate pass bar (shared by design). |
Escape hatches¶
- Subclass
EvalRunnerto override_invoke_target(custom invocation flow) or_aggregate(custom score combinators). - Programmatic suites. Build
_ajolopy_evalmetadata by hand for runtime-generated suites that do not fit the decorator surface. - Custom datasets. Implement
Dataset(seespecs/dataset-loader.md) for non-JSONL sources (CSV, S3, in-memory generators, ...). - Async metrics.
@Metricacceptsasync deffor LLM-as-judge or network-bound scoring.
Common gotchas¶
- Exactly one of
agent=/workflow=must be set. Both or neither raisesEvalConfigErrorat decoration time. - A suite without any
@Metricmethod raisesEvalConfigError— an eval with no metrics has nothing to measure. - Per-case isolation: a target error or metric exception fails just
that case; the suite continues.
EvalRun.passedreflects the aggregate. compare_runs(prev, curr)rejects dataset sha256 mismatches withEvalComparisonError— comparisons are only meaningful for the same dataset.EvalRun.load(path).cases[i].output.rawis the string repr of the original return, not the original object. Round-tripping full fidelity is out of scope; keep the in-memoryEvalRunif you need the raw value.
See also¶
@Agent/@Workflow— the targets an eval suite scores.- Spec:
specs/eval.md. - Dataset format:
specs/dataset-loader.md.