Experiments

AgentV eval files are the only runnable authoring artifact. Use top-level experiment: inside eval.yaml for runtime choices: targets, workers, timeout, sandbox/runtime knobs, budgets, thresholds, and repeat-run policy.

name: support-regression

experiment:
  targets: [codex-gpt5, claude-sonnet]
  workers: 2
  timeout_seconds: 720
  repeat:
    count: 4
    strategy: pass_at_k
    cost_limit_usd: 2.00

workspace:
  hooks:
    before_all:
      command: ["bash", "-lc", "bun install && bun run build"]

tests:
  - id: refund-eligibility
    input: Can this customer get a refund?
    criteria: Applies the refund policy correctly

execution: is accepted only as a legacy top-level alias for existing eval files. Do not use both experiment: and execution: in the same eval.

Tests Imports

Use tests[] for composition, imports, and selection.

tests:
  - include: evals/support/*.eval.yaml
    type: suite
    select:
      test_ids:
        - refund-*
        - missing-order-date
      tags: regression
      metadata:
        priority: high
    run:
      threshold: 1.0
      repeat:
        count: 2
        strategy: pass_all
  - include: cases/*.cases.yaml
    type: tests
  - include: cases/regression.jsonl
    type: tests
  - cases/smoke/*.cases.yaml

type: suite preserves the imported suite’s task contract: metadata, workspace, shared input, shared assertions, and tests. The child suite’s experiment: or legacy execution: runtime block is ignored; the parent eval’s runtime block controls the run.

type: tests imports only raw test entries. It intentionally drops shared context from an imported eval suite, so parent suite fields apply to those raw cases.

tests[].select.test_ids filters imported test IDs with glob patterns. tests[].select.tags filters each imported case’s effective metadata.tags. Effective case tags are suite-first and deduped: suite.tags + suite.metadata.tags + test.metadata.tags. Top-level suite tags still remain suite identity metadata for discovery and reporting; selection reads the merged case metadata view. tests[].select.metadata filters case metadata by key/value, where selector values may be scalars or lists. Globbed include paths are resolved in deterministic path order, then test order.

String-valued tests and string entries inside tests[] are raw-case import shorthand. They are equivalent to include with type: tests and may point at raw case files, directories, or globs. Importing another eval suite must use object form with include: and type: suite.

Suite imports are resolved as a deterministic include graph. Circular type: suite imports fail validation with the import chain; raw-case shorthand does not recursively load suite runtime blocks.

Imported suite artifacts are nested under the source suite name inside a wrapper eval result directory, for example .agentv/results/<wrapper-eval>/<timestamp>/<imported-suite>/<test-id>/.... Direct tests owned by the wrapper eval and raw case imports live directly under <test-id>/....

Scoped Run Overrides

Use scoped run: blocks for result interpretation and scheduling policies that vary by include group or test case. Precedence is:

test.run > tests[].run > experiment

experiment:
  target: agent
  threshold: 0.8
  repeat:
    count: 3
    strategy: pass_at_k

tests:
  - include: ./evals/flaky-agentic/**/*.eval.yaml
    type: suite
    select:
      tags: [agentic]
    run:
      repeat:
        count: 3
        strategy: pass_at_k

  - include: ./evals/regression/**/*.eval.yaml
    type: suite
    select:
      tags: [must-pass]
    run:
      threshold: 1.0
      repeat:
        count: 2
        strategy: pass_all

  - id: critical-case
    input: "..."
    criteria: Must pass exactly
    run:
      threshold: 1.0
      repeat:
        count: 1

Scoped run: supports threshold, repeat, timeout_seconds, and budget_usd. Candidate-changing fields such as target and targets stay parent-level under experiment:. Workspace mutation belongs in workspace.hooks, and runner-specific setup belongs in targets[].hooks.

Lifecycle Ownership

experiment: configures evaluation policy. It does not own commands that prepare files, dependencies, repos, or target-specific runner state.

Need	Put it in
Install dependencies, build the repo, seed files	`workspace.hooks.before_all`
Reset or apply per-case state	`workspace.hooks.before_each` / `workspace.hooks.after_each`
Configure an agent runner or provider variant	`targets[].hooks`
Choose targets, repeats, pass policy, budget, threshold	`experiment`

workspace:
  hooks:
    before_all:
      command: ["bash", "-lc", "bun install && bun run build"]

targets:
  - name: agent-with-skills
    provider: codex
    hooks:
      before_each:
        command: ["sh", "-c", "cp -R skills \"{{workspace_path}}/.codex/skills\""]

experiment:
  target: agent-with-skills
  repeat:
    count: 3
    strategy: pass_at_k

Repeat Runs

repeat supports the same core strategies as repeated attempts:

experiment:
  repeat:
    count: 3
    strategy: mean
    cost_limit_usd: 1.50

Supported strategies:

Strategy	Behavior
`pass_at_k`	Uses the best passing attempt; early-exits by default unless `early_exit: false` is set
`pass_all`	Uses the weakest attempt score, so every repeated attempt must meet the threshold
`mean`	Aggregates repeated attempt scores by mean
`confidence_interval`	Uses the lower bound of a 95% confidence interval as the conservative score

AgentV also accepts runs and early_exit under experiment: as shorthand for repeat-run policy:

experiment:
  runs: 4
  early_exit: true

Do not set both repeat and runs in the same runtime block.

Result Layout

Default eval runs write to:

.agentv/results/<eval-name>/<timestamp>/

Imported source suite metadata appears in index.jsonl rows and manifests. AgentV does not add a redundant suite directory when the result group is already the eval name.