Prototype Bakeoff Plan

This page is a historical record of the prototype-selection experiment. The resulting decision is recorded in Backend Selection Decision: Wardwright is now BEAM-first, with Elixir owning runtime plumbing and Gleam owning correctness-heavy pure logic where the boundary is stable enough.

The old standalone harnesses and cross-backend executable probes have been removed from the live tree. New executable behavior belongs in native ExUnit/StreamData tests under app/test.

The original plan kept Go, Rust, Elixir, and a proposed Gleam-on-BEAM variant alive so the first durable implementation could be chosen from evidence instead of preference. The bakeoff turned that intention into a controlled experiment: three non-trivial governance features, implemented independently in each prototype and scored with a rubric defined before implementation began.

The goal is not to prove that one language is universally better. The goal is to identify which prototype gives Wardwright the best combination of correctness, authoring semantics, runtime behavior, testability, maintenance quality, and delivery cost for the product we are actually building.

Experiment Matrix

Each row is one bakeoff feature. The original matrix used Go, Rust, and Elixir for nine implementation attempts total. Add Gleam as a fourth evaluated variant for all three bakeoffs before selecting a primary prototype.

The Gleam variant should be evaluated as Elixir runtime shell plus Gleam typed business-logic core, not as a replacement for Elixir’s HTTP/application boundary. Elixir remains responsible for application supervision, dynamic process registries, provider/model/session lifecycle, and integration with existing Plug/Cowboy surfaces. Gleam owns policy/config data types, pure decision functions, validation, routing math, cache/event classification, and other logic where static exhaustiveness and type safety should reduce policy bugs.
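The shell/core split can be sketched generically. Below is a minimal Python analogue of the pattern, a sketch only: the names (`Decision`, `decide`, `handle_request`) are illustrative, not Wardwright APIs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Decision:
    allow: bool
    reason: str

def decide(policy: dict, request: dict) -> Decision:
    """Pure typed core: no IO, no clock, no process state.

    Everything the shell needs to act on comes back as data."""
    if request.get("tokens", 0) > policy.get("max_tokens", 0):
        return Decision(False, "token_budget_exceeded")
    return Decision(True, "ok")

def handle_request(policy: dict, request: dict, send) -> Decision:
    """Imperative shell: owns side effects, delegates judgment to the core."""
    decision = decide(policy, request)
    if decision.allow:
        send(request)  # provider calls, logging, supervision live out here
    return decision
```

The design point is that only `handle_request` touches the outside world, so the decision logic stays exhaustively testable without a running shell, which is the role Gleam is proposed to play on the BEAM.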

| Bakeoff | Go | Rust | Elixir | Gleam-on-BEAM | Primary signal |
| --- | --- | --- | --- | --- | --- |
| Portable structured-output governor | bakeoff/json-go | bakeoff/json-rust | bakeoff/json-elixir | bakeoff/json-gleam | Policy semantics, provider normalization, receipt quality |
| Concurrent recent-history governor | bakeoff/history-go | bakeoff/history-rust | bakeoff/history-elixir | bakeoff/history-gleam | Correctness under load, cache contention, deterministic eviction |
| Async alert sink with backpressure | bakeoff/alerts-go | bakeoff/alerts-rust | bakeoff/alerts-elixir | bakeoff/alerts-gleam | Supervision, latency isolation, retries, queue behavior |

Gleam may score poorly in areas where ecosystem libraries are immature or where Elixir has better operational ergonomics. That does not make the spike a failure: the evaluation should score ecosystem maturity and operational ergonomics separately from correctness and type-safety gains.

TTSR is not a fair first bakeoff because Rust already has a working spike and the starting line is not equivalent. It should remain a later stream-governance follow-up after the bakeoff chooses or narrows the primary implementation.

Feature Specs

1. Portable Structured-Output Governor

This is not just “validate JSON.” Providers increasingly support structured outputs, but they expose different schema subsets, strictness levels, streaming constraints, refusal behavior, and cache/compile latency. Wardwright’s differentiator is provider-portable output governance.

Required behavior:

The visible contract and held-out oracle should cover valid output, syntax-invalid output, schema-invalid output, semantically-invalid output, schema alternative selection, tolerated optional/missing fields, recovery after one or more guard events, and exhausted-loop failure.
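To make the failure classes concrete, here is a minimal Python sketch of the three-way split the oracle distinguishes. The tiny membership check and the `count >= 0` rule stand in for real JSON Schema validation and real semantic policy; they are illustrative assumptions, not the governor's actual rules.

```python
import json

def classify(raw: str, required: set[str]) -> str:
    """Classify a model response into the oracle's failure classes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "syntax_invalid"        # not parseable at all
    if not isinstance(data, dict) or not required <= data.keys():
        return "schema_invalid"        # parses, but missing required structure
    if data.get("count", 0) < 0:       # example semantic rule: counts are non-negative
        return "semantic_invalid"      # well-formed, but violates policy meaning
    return "valid"
```

A real governor would route each class to a different guard action (repair, retry, or block) rather than just labeling it, but the classification boundary is the part the held-out oracle exercises.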

2. Concurrent Recent-History Governor

This extends the policy cache into a realistic runtime governor. It is designed to expose cache architecture, concurrency, latency, and eviction differences. The initial bakeoff should limit history queries to a single session/run scope. Cross-session history search is deferred until the product model can specify which caller, tenant, project, consent, and retention boundaries make such queries safe and useful.

Required behavior:

The visible contract and held-out oracle should cover scope isolation, regex match counts, equivalent events arriving concurrently within one session, deterministic eviction, threshold non-trigger, threshold trigger, irrelevant in-scope events that do not match the rule, out-of-scope events that would match but must not count, and latency/load probes.
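A minimal Python sketch of the scoped threshold-plus-eviction behavior described above. Class and field names are illustrative, not the prototype API; each session keeps its own bounded FIFO window, so scope isolation and deterministic eviction fall out of the data structure.

```python
import re
from collections import defaultdict, deque

class HistoryGovernor:
    def __init__(self, pattern: str, threshold: int, window: int):
        self.pattern = re.compile(pattern)
        self.threshold = threshold
        # One bounded window per session: out-of-scope events can never count.
        self.windows: dict[str, deque] = defaultdict(lambda: deque(maxlen=window))

    def record(self, session: str, event: str) -> bool:
        """Record one event; return True when the threshold fires for this session."""
        self.windows[session].append(event)  # oldest event evicted deterministically
        matches = sum(1 for e in self.windows[session] if self.pattern.search(e))
        return matches >= self.threshold
```

The concurrency dimension (equivalent events arriving simultaneously within one session) is exactly what this naive sketch does not handle, which is why it is a primary signal in the bakeoff rather than a given.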

3. Async Alert Sink With Backpressure

Alerting is a first-class governance action and a good way to test runtime supervision and failure isolation. The request path should remain bounded even when an alert sink is slow or failing.

Required behavior:

The visible contract and held-out oracle should cover fast sink, slow sink, failing-then-recovering sink, full queue behavior, duplicate alert idempotency, and request-latency budget under sink pressure.
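The bounded-queue behavior can be sketched as follows. This Python analogue is illustrative, not the prototype API: the request path enqueues without blocking and reports `dropped` when the queue is full, so a stalled sink cannot stall request latency. Retries and idempotency are elided.

```python
import queue
import threading

class AlertSink:
    def __init__(self, deliver, maxsize: int = 8):
        self.q = queue.Queue(maxsize=maxsize)
        self.deliver = deliver
        threading.Thread(target=self._drain, daemon=True).start()

    def emit(self, alert: dict) -> str:
        try:
            self.q.put_nowait(alert)  # request path never blocks on the sink
            return "queued"
        except queue.Full:
            return "dropped"          # bounded full-queue behavior the oracle can assert

    def _drain(self):
        while True:
            self.deliver(self.q.get())  # a slow or failing sink is isolated here
            self.q.task_done()
```

Whether a full queue should drop, block with a deadline, or shed to a fallback channel is a policy decision; the bakeoff signal is that the choice is explicit and bounded.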

Guard Loop Semantics

TTSR-style rejection is a guard event, not necessarily a terminal outcome. A policy can stop current generation, preserve safe output, redact unsafe spans, inject validation feedback, restart from a different point, switch model or provider, or escalate to a terminal block. Tests should therefore avoid asserting only “rejected” or “accepted” when the behavior under test is a governed loop.

For governed-loop features, the shared contract should assert:

Mocked-model tests should drive canned sequences such as invalid output followed by valid repaired output, repeated invalid output until budget exhaustion, partial streaming output that is stopped mid-span, and multiple rules firing on one response. Canned scenarios should live in reviewable JSON fixtures so policy authors can inspect and extend the behavior corpus without reading test code. Live-LLM tests are useful during development for realism and input diversity, but they should be marked explicitly, excluded from default CI, and treated as exploratory unless their prompts, model, seed/config, and observed counterexamples are captured as regression fixtures.
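A Python sketch of the fixture-driven mocked model described above. The fixture shape and field names (`scenario`, `responses`) are assumptions for illustration; the point is that the canned sequence is plain JSON a policy author can review, replayed by a governed loop with a retry budget.

```python
import json

# Reviewable canned sequence: invalid output followed by valid repaired output.
FIXTURE = json.loads("""
{
  "scenario": "invalid_then_repaired",
  "responses": ["{\\"count\\":", "{\\"count\\": 2}"]
}
""")

def mocked_model(fixture):
    """Return a model stub that replays the fixture's responses in order."""
    responses = iter(fixture["responses"])
    return lambda _prompt: next(responses)

def governed_loop(model, budget: int):
    for attempt in range(1, budget + 1):
        raw = model("prompt")
        try:
            return {"outcome": "accepted", "attempts": attempt, "value": json.loads(raw)}
        except json.JSONDecodeError:
            continue  # guard event: the real loop would inject feedback here
    return {"outcome": "exhausted", "attempts": budget}
```

This is also where the "guard event, not terminal outcome" contract bites: the test asserts attempts and outcome, not merely accepted/rejected.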

Test-First Workflow

Every bakeoff starts with a reviewed visible contract and a separate held-out evaluation oracle before implementation branches are created.

  1. Write a visible contract pack under docs/bakeoff-contracts/<feature>.md. It should describe public API behavior, policy semantics, expected receipt shape, native-test translation requirements, and optional live-LLM discovery guidance. Agents may read this contract.
  2. Add visible example fixtures only when they clarify the contract. These fixtures are translation material, not the final judge.
  3. Write the final held-out backend oracle separately from the agent worktree. It should hit only public/prototype test APIs and assert externally visible behavior. Agents must not run this oracle during implementation.
  4. Confirm the held-out backend oracle fails against every evaluated backend/variant for the expected missing-feature reasons.
  5. Hold a human review gate on both the visible contract and held-out oracle. The review should judge whether scenarios, data generation, edge cases, and failure messages are strong enough to guide the bakeoff and catch shallow implementations.
  6. Create the implementation branches, one per backend/variant, from the same main commit, using worktrees that expose the visible contract but not the held-out oracle.
  7. In each branch, translate the visible contract into native tests first: Go testing, Rust unit/property tests, Elixir ExUnit/StreamData, and Gleam gleeunit for the typed core. The native translation is part of the evaluated work product.
  8. Implement until native tests pass. Agents are encouraged to add additional native tests when they observe vacuous passes, untested branches, or live counterexamples.
  9. Agents may run optional live_llm discovery tests during development when credentials or local models are available, including local Ollama. They may adapt those live tests for the backend they are building. Live tests are discovery and realism tools, not CI gates and not the final oracle.
  10. After each agent finishes, run the held-out backend oracle externally against the completed backend as the final correctness gate.
  11. Run the normal repo checks and collect metrics.

Native tests are part of the scoring. A backend should not get full credit for passing the held-out oracle if the native test translation is shallow or tests implementation details instead of behavior.

Useful live failures should be reduced into deterministic native regression fixtures before the implementation is considered complete.

Agents should also consider lightweight mutation testing before declaring a feature done: deliberately break a condition, branch, guard action, eviction rule, or receipt field and confirm the native tests fail for the intended reason. Held-out oracle failures are measured externally after the agent finishes.
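A minimal Python illustration of the mutation probe: flip one comparison and confirm a behavioral test fails for the boundary case it was meant to pin down. The guard and the test are hypothetical stand-ins, not bakeoff code.

```python
def over_threshold(count: int, threshold: int) -> bool:
    return count >= threshold   # original guard condition

def over_threshold_mutant(count: int, threshold: int) -> bool:
    return count > threshold    # deliberate mutation: boundary case now slips through

def behavioral_test(guard) -> bool:
    """True when the guard behaves as specified, including the exact boundary."""
    return guard(3, 3) and guard(4, 3) and not guard(2, 3)
```

If `behavioral_test` still passed against the mutant, the `guard(3, 3)` boundary assertion would be missing and the native suite would be vacuous on exactly the case policy thresholds care about.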

Contract And Oracle Quality Bar

The visible contract and held-out oracle are not smoke tests. Together they define and evaluate bakeoff behavior and should be reviewed before kickoff. A weak contract or oracle will produce misleading implementation scores.

Each bakeoff test suite should include:

Before writing each visible contract and held-out oracle, inspect relevant real-world open source test suites and provider examples for style and edge cases. Useful sources include JSON Schema test suites for structured-output behavior, webhook/retry queue tests for alert sinks, and cache/concurrency tests from production-grade libraries. Borrow test ideas and data shapes, not project-specific code, unless the license and attribution path are explicitly acceptable.

The kickoff checklist for each bakeoff contract and oracle:

| Gate | Requirement |
| --- | --- |
| Scenario coverage | At least one success, one retry/recovery, one hard failure, and one configuration rejection. |
| Generated coverage | Property/generative tests cover meaningful ranges rather than a few constants. |
| Cross-backend neutrality | Tests assert the public contract and receipts, not backend internals. |
| Reviewability | The user can read the fixtures and understand what behavior is being required. |
| Failure evidence | A failing case prints enough input, receipt, and policy context to diagnose the issue. |
| Rigor review | The suite is reviewed and accepted before any implementation branch starts. |

Controls

Metrics To Capture

Each implementation attempt writes a small result artifact: docs/bakeoff-results/<feature>-<backend>.json.

Suggested fields:

{
  "feature": "portable_structured_output_governor",
  "backend": "rust",
  "branch": "bakeoff/json-rust",
  "base_sha": "example",
  "start_time": "2026-05-14T00:00:00Z",
  "first_native_tests_passing_time": "2026-05-14T00:00:00Z",
  "held_out_oracle_passing_time": "2026-05-14T00:00:00Z",
  "review_ready_time": "2026-05-14T00:00:00Z",
  "input_tokens": null,
  "output_tokens": null,
  "cached_input_tokens": null,
  "uncached_input_tokens": null,
  "cache_hit_rate": null,
  "reasoning_output_tokens": null,
  "weighted_total_input_plus_5x_output": null,
  "weighted_uncached_input_plus_5x_output": null,
  "tool_calls": null,
  "files_changed": 0,
  "lines_added": 0,
  "lines_deleted": 0,
  "dependencies_added": [],
  "checks": {
    "native": "pass",
    "held_out_backend_oracle": "pass",
    "mise_check": "pass",
    "gitleaks": "pass"
  },
  "known_limitations": []
}

Token usage is useful when available. Capture total input, cached input, uncached input, output, and reasoning output separately. Output tokens are weighted more heavily than input tokens in the initial cost proxy. Cached input tokens should not be hidden inside total input: a high cache hit rate can make otherwise-expensive runs materially cheaper, and surfacing it rewards stable prompts, shared context, and repeated test-harness setup.

If exact usage is not available, use wall-clock time, tool calls, review iterations, dependency churn, and diff size as cost proxies.

Before real bakeoff branches start, run a tiny instrumentation probe to calibrate what the harness can actually capture. The probe should use deterministic static actions with known expected counts rather than an implementation task. Its job is to test the measurement system itself: wall-clock timing, command counts, input/output token estimates, weighted token cost, git diff effects, command output size, and expected-versus-observed comparisons.

The retired local probe used deterministic static actions and approximate token counting without a provider. If a future bakeoff is needed, rebuild that instrumentation around the active agent runner and native BEAM checks instead of restoring the old cross-backend scripts. When real agent runs are used, prefer direct provider or tool usage metadata. If an external runner such as opencode is used, capture its session export or stats output alongside the harness result. The initial cost proxy reports both total_input + 5 * output and uncached_input + 5 * output until provider-specific cached-input pricing is known.
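The two proxies can be computed directly from the result artifact. A Python sketch using the artifact's field names, where `None` fields stay `None` so missing accounting is visible rather than silently zeroed:

```python
def cost_proxies(usage: dict) -> dict:
    """Derive cache hit rate and both weighted cost proxies from a usage record."""
    total_in = usage.get("input_tokens")
    cached = usage.get("cached_input_tokens")
    out = usage.get("output_tokens")
    uncached = None if total_in is None or cached is None else total_in - cached
    return {
        "cache_hit_rate":
            None if not total_in or cached is None else cached / total_in,
        "weighted_total_input_plus_5x_output":
            None if total_in is None or out is None else total_in + 5 * out,
        "weighted_uncached_input_plus_5x_output":
            None if uncached is None or out is None else uncached + 5 * out,
    }
```

Reporting both weighted numbers keeps the comparison honest until provider-specific cached-input pricing replaces the 5x heuristic.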

The first probe is expected to be boring: it should run a few static commands, produce no repository state change, and match expected action counts. Real bakeoff runs should usually produce at least one new commit, so the harness tracks final HEAD, commits added, and diff stats from base to final in addition to dirty worktree status. If those expectations do not match, fix the harness before launching real bakeoff agents.

Use --artifact-dir for calibration and real bakeoff runs so full command outputs and the exact input blob are available for later tokenization and audit. The JSON summary keeps previews and paths; the artifact directory preserves the raw material.

The real-model Codex probe used codex exec --json usage events to derive cached/uncached input, cache hit rate, output, reasoning output, and weighted token proxies. That remains the preferred accounting shape if bakeoffs resume, but the old repository-local harness is no longer maintained.

A successful GPT-5.5 medium no-tool probe on 2026-05-14 captured:

Initial calibration results:

Scoring Rubric

Score each implementation out of 100 after the post-commit adversarial review.

| Dimension | Points | What good looks like |
| --- | --- | --- |
| Held-out correctness | 20 | Passes all held-out backend scenarios and properties without backend-specific exceptions. |
| Native test translation | 15 | Native tests express the same behavior and can fail for real regressions. |
| Feature completeness | 15 | Implements the full frozen spec, including edge cases and receipt fields. |
| Code hygiene and maintainability | 15 | Clear structure, small interfaces, low special-casing, idiomatic backend style. |
| Runtime behavior | 10 | Good latency, bounded resource use, clean shutdown/backpressure behavior where relevant. |
| Observability and receipts | 10 | Explains decisions, actions, retries, failures, and policy state clearly. |
| Security and safety | 10 | No secret leakage, fail-closed where appropriate, bounded untrusted inputs. |
| Delivery cost | 5 | Low wall-clock time, low cached-adjusted token/tool use, low dependency churn, few review fixes. |

Tie-breakers:

Decision Gates

After each bakeoff:

After all three bakeoffs:

Execution Sequence

Do not launch the full matrix of implementation jobs until the measurement process has proved itself on real work.

  1. Freeze the visible contract, held-out oracle, and base main SHA.
  2. Launch the first wave as four concurrent jobs for one bakeoff feature, one per backend/variant.
  3. Review the outputs, metrics, native test translations, held-out oracle failures/passes, and post-commit adversarial reviews.
  4. If the data is comparable and the instructions produced useful work, launch the remaining jobs.
  5. If the first wave exposes bad scoring, weak tests, unclear instructions, or untrustworthy instrumentation, revise the harness and rerun a smaller wave before spending the full bakeoff budget.

Baseline Parity Audit Template

Before the first bakeoff, fill this table from live code:

| Capability | Go | Rust | Elixir | Gleam-on-BEAM | Notes |
| --- | --- | --- | --- | --- | --- |
| OpenAI-compatible chat endpoint | | | | | |
| Synthetic simulate endpoint | | | | | |
| Config mutation endpoint | | | | | |
| Provider target config | | | | | |
| Env/fnox credential references | | | | | |
| Policy cache endpoints | | | | | |
| history_threshold request policy | | | | | |
| Stream governance/TTSR | | | | | |
| Native property tests | | | | | |
| Shared probes | | | | | |
| Load-test harness | | | | | |

BEAM Runtime Isolation Requirement

The Elixir and Gleam-on-BEAM variants should explicitly model Wardwright runtime isolation. The target process hierarchy is:

Required isolation behavior:

The bakeoff score should reward implementations that test this with observable behavior: deliberately crash a session worker, time out a policy/NIF/sidecar call, and confirm that another session under the same model, and another model runtime, continue to answer.
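The probe can be sketched outside the BEAM as well. This Python analogue uses OS processes as stand-ins for supervised session workers; it is illustrative only, and the real test would use OTP supervision trees rather than raw processes.

```python
import multiprocessing as mp

def session_worker(conn):
    # One OS process per "session", standing in for a BEAM session worker.
    while True:
        conn.send(f"ok:{conn.recv()}")

def start_session():
    parent, child = mp.Pipe()
    proc = mp.Process(target=session_worker, args=(child,), daemon=True)
    proc.start()
    return proc, parent

def isolation_probe() -> str:
    """Crash session A, then confirm session B still answers."""
    (proc_a, _), (proc_b, conn_b) = start_session(), start_session()
    proc_a.terminate()   # deliberate crash of one session worker
    proc_a.join()
    conn_b.send("ping")
    reply = conn_b.recv()  # sibling session keeps responding
    proc_b.terminate()
    return reply
```

The BEAM version is stronger than this sketch because supervision also restarts the crashed session; the sketch only demonstrates the blast-radius half of the requirement.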