Skip to content

Experiments

AgentV eval files are the only runnable authoring artifact. Use top-level experiment: inside eval.yaml for runtime choices: targets, workers, timeout, sandbox/runtime knobs, budgets, thresholds, and repeat-run policy.

name: support-regression
experiment:
targets: [codex-gpt5, claude-sonnet]
workers: 2
timeout_seconds: 720
repeat:
count: 4
strategy: pass_at_k
cost_limit_usd: 2.00
workspace:
hooks:
before_all:
command: ["bash", "-lc", "bun install && bun run build"]
tests:
- id: refund-eligibility
input: Can this customer get a refund?
criteria: Applies the refund policy correctly

execution: is accepted only as a legacy top-level alias for existing eval files. Do not use both experiment: and execution: in the same eval.

Use tests[] for composition, imports, and selection.

tests:
- include: evals/support/*.eval.yaml
type: suite
select:
test_ids:
- refund-*
- missing-order-date
tags: regression
metadata:
priority: high
run:
threshold: 1.0
repeat:
count: 2
strategy: pass_all
- include: cases/*.cases.yaml
type: tests
- include: cases/regression.jsonl
type: tests
- cases/smoke/*.cases.yaml

type: suite preserves the imported suite’s task contract: metadata, workspace, shared input, shared assertions, and tests. The child suite’s experiment: or legacy execution: runtime block is ignored; the parent eval’s runtime block controls the run.

type: tests imports only raw test entries. It intentionally drops shared context from an imported eval suite, so parent suite fields apply to those raw cases.

tests[].select.test_ids filters imported test IDs with glob patterns. tests[].select.tags filters each imported case’s effective metadata.tags. Effective case tags are suite-first and deduped: suite.tags + suite.metadata.tags + test.metadata.tags. Top-level suite tags still remain suite identity metadata for discovery and reporting; selection reads the merged case metadata view. tests[].select.metadata filters case metadata by key/value, where selector values may be scalars or lists. Globbed include paths are resolved in deterministic path order, then test order.

String-valued tests and string entries inside tests[] are raw-case import shorthand. They are equivalent to include with type: tests and may point at raw case files, directories, or globs. Importing another eval suite must use object form with include: and type: suite.

Suite imports are resolved as a deterministic include graph. Circular type: suite imports fail validation with the import chain; raw-case shorthand does not recursively load suite runtime blocks.

Imported suite artifacts are nested under the source suite name inside a wrapper eval result directory, for example .agentv/results/<wrapper-eval>/<timestamp>/<imported-suite>/<test-id>/.... Direct tests owned by the wrapper eval and raw case imports live directly under <test-id>/....

Use scoped run: blocks for result interpretation and scheduling policies that vary by include group or test case. Precedence is:

test.run > tests[].run > experiment
experiment:
target: agent
threshold: 0.8
repeat:
count: 3
strategy: pass_at_k
tests:
- include: ./evals/flaky-agentic/**/*.eval.yaml
type: suite
select:
tags: [agentic]
run:
repeat:
count: 3
strategy: pass_at_k
- include: ./evals/regression/**/*.eval.yaml
type: suite
select:
tags: [must-pass]
run:
threshold: 1.0
repeat:
count: 2
strategy: pass_all
- id: critical-case
input: "..."
criteria: Must pass exactly
run:
threshold: 1.0
repeat:
count: 1

Scoped run: supports threshold, repeat, timeout_seconds, and budget_usd. Candidate-changing fields such as target and targets stay parent-level under experiment:. Workspace mutation belongs in workspace.hooks, and runner-specific setup belongs in targets[].hooks.

experiment: configures evaluation policy. It does not own commands that prepare files, dependencies, repos, or target-specific runner state.

NeedPut it in
Install dependencies, build the repo, seed filesworkspace.hooks.before_all
Reset or apply per-case stateworkspace.hooks.before_each / workspace.hooks.after_each
Configure an agent runner or provider varianttargets[].hooks
Choose targets, repeats, pass policy, budget, thresholdexperiment
workspace:
hooks:
before_all:
command: ["bash", "-lc", "bun install && bun run build"]
targets:
- name: agent-with-skills
provider: codex
hooks:
before_each:
command: ["sh", "-c", "cp -R skills \"{{workspace_path}}/.codex/skills\""]
experiment:
target: agent-with-skills
repeat:
count: 3
strategy: pass_at_k

repeat supports the same core strategies as repeated attempts:

experiment:
repeat:
count: 3
strategy: mean
cost_limit_usd: 1.50

Supported strategies:

StrategyBehavior
pass_at_kUses the best passing attempt; early-exits by default unless early_exit: false is set
pass_allUses the weakest attempt score, so every repeated attempt must meet the threshold
meanAggregates repeated attempt scores by mean
confidence_intervalUses the lower bound of a 95% confidence interval as the conservative score

AgentV also accepts runs and early_exit under experiment: as shorthand for repeat-run policy:

experiment:
runs: 4
early_exit: true

Do not set both repeat and runs in the same runtime block.

Default eval runs write to:

.agentv/results/<eval-name>/<timestamp>/

Imported source suite metadata appears in index.jsonl rows and manifests. AgentV does not add a redundant suite directory when the result group is already the eval name.