Feature Spikes

This page is a working backlog for self-directed Wardwright experiments. The goal is not to chase every gateway feature. The goal is to find constrained agentic workflows where a synthetic model policy layer can measurably reduce failures, cost, latency, or diagnosis time.

Research Signals

Recent agent-observability writeups emphasize failures that ordinary request logs do not explain well: tool misuse, context loss, goal drift, retry loops, multi-agent cascading errors, and silent quality degradation. Latitude’s March 2026 failure taxonomy is especially relevant because it maps each failure mode to a detection strategy and argues for turning production traces into regression tests.

LangGraph’s human-in-the-loop docs draw a useful boundary: true human approval requires interrupting execution, persisting graph state, and resuming with an explicit decision. Wardwright’s MVP alerting is much smaller: record a receipt event and notify an operator without pausing the request.
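
As a concrete reference point, here is a minimal Python sketch of that smaller behavior: a policy trip appends a receipt event and fires a best-effort operator notification on a background thread, while the response itself is returned untouched. The function names, the in-memory receipt list, and the webhook URL are illustrative assumptions, not Wardwright's actual interface.

```python
# Minimal sketch, not Wardwright's real API: record a receipt event and notify
# an operator without pausing the request. WEBHOOK_URL, emit_receipt, and
# notify_operator are hypothetical names used only for illustration.
import json
import threading
import time
import urllib.request

WEBHOOK_URL = "https://example.invalid/wardwright-alerts"  # placeholder sink
RECEIPTS: list[dict] = []  # stand-in for a durable receipt store

def emit_receipt(session_id: str, rule: str, detail: dict) -> dict:
    """Append an auditable receipt event; never blocks the request path."""
    event = {
        "ts": time.time(),
        "session_id": session_id,
        "rule": rule,
        "detail": detail,
    }
    RECEIPTS.append(event)
    return event

def notify_operator(event: dict) -> None:
    """Best-effort webhook delivery on a background thread."""
    def _post():
        try:
            req = urllib.request.Request(
                WEBHOOK_URL,
                data=json.dumps(event).encode(),
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(req, timeout=5)
        except OSError:
            pass  # alerting must never take down the request

    threading.Thread(target=_post, daemon=True).start()

def on_policy_trip(session_id: str, rule: str, detail: dict, response: str) -> str:
    """Record and notify, then return the response unchanged (no interrupt/resume)."""
    notify_operator(emit_receipt(session_id, rule, detail))
    return response
```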

oh-my-pi’s Time Traveling Streamed Rules are the closest public analogue to Wardwright’s stream-policy idea: regex-triggered output stream rules that activate only when the model starts producing relevant content, abort the stream, inject a reminder, and retry once per session. Wardwright can generalize that into backend-neutral synthetic model policy with explicit receipt semantics and bounded release latency.
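
A rough sketch of that generalization in plain Python, not oh-my-pi's or Wardwright's implementation: the wrapper withholds a small horizon of the stream, checks a trigger pattern against the unreleased tail, and aborts before any violating bytes reach the consumer. `guarded_stream`, `StreamTripped`, and the 64-character default horizon are invented for illustration; the only property sketched is that a match no longer than the horizon is never released.

```python
# Backend-neutral TTSR-style guard: buffered horizon, regex trigger, abort
# before release. Assumes the trigger match always fits within `horizon` chars.
import re
from typing import Iterable, Iterator

class StreamTripped(Exception):
    """Signal to the caller: abort, inject the reminder, retry once per session."""
    def __init__(self, reminder: str):
        super().__init__(reminder)
        self.reminder = reminder

def guarded_stream(
    chunks: Iterable[str],
    trigger: re.Pattern,
    reminder: str,
    horizon: int = 64,
) -> Iterator[str]:
    """Yield chunks with the last `horizon` chars withheld so a match is never released."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if trigger.search(buffer):
            raise StreamTripped(reminder)
        if len(buffer) > horizon:
            release, buffer = buffer[:-horizon], buffer[-horizon:]
            yield release
    if trigger.search(buffer):
        raise StreamTripped(reminder)
    yield buffer

# Usage: wrap the model's chunk iterator; on StreamTripped, retry the request
# once with exc.reminder appended to the prompt, then stream without the guard.
```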

Experiment Matrix

| Spike | Why it might matter | MVP shape | Cost/risk | Success metric |
| --- | --- | --- | --- | --- |
| Structured output repair | JSON/XML drift is common and easy to test. | Final-output validator with retry-or-block. | Medium latency from retries; parser/schema design. | Lower invalid-output rate against fixtures. |
| Streaming TTSR | Known bad patterns can be stopped before consumers see them. | Buffered horizon, literal/regex trigger, retry with reminder. | Hard streaming semantics; visible latency. | Violating bytes never released in buffered mode. |
| Governance authoring assistant | Most users should describe desired behavior, not hand-write policy DSL. | Permissioned model selection, draft artifact generation, diff review, and simulation-guided revision. | Must avoid treating AI output as authoritative policy; prompt/data redaction matters. | User intent compiles to a valid rule with generated tests and human-approved artifact diff. |
| Policy graph and conflict workbench | Complex policies need visible ordering, parallel-safe groups, and conflict review. | Phase graph, rule cards, effect sets, arbitration badges, and conflict diagnostics. | UI can become too abstract unless grounded in examples and receipts. | Users can explain why a rule runs in parallel, ordered, or rejected before activation. |
| Starlark AST and trace workbench | Code-first policy might be viable if behavior can be visualized through AST projection and simulation traces. | Small Starlark policy editor, parsed branch/call graph, source-span highlights, scenario trace overlay, and opaque-branch warnings. | Static AST can imply more understanding than exists; runtime traces must be authoritative. | Users can identify which code branch/action caused a scenario difference without reading the full policy. |
| Structured-vs-code policy comparison | The primary authoring model is still an open product decision. | Build the same TTSR, cache-count, and model-switch policies in structured primitives and Starlark-first UIs against the same scenarios. | Parallel spikes can split focus unless they share scenarios and success criteria. | One approach clearly improves prediction, review, debugging, or trust for technical policy authors. |
| Tool-loop detector | Retry loops are expensive and diagnosable from session facts. | Session rolling counter keyed by tool/args/result hashes (see sketch below). | Needs agent/tool metadata standardization. | Fewer repeated equivalent calls in generated traces. |
| Async alert sinks | Makes policy value visible before full approval workflow exists. | Receipt event plus webhook/Telegram/Slack sink adapter. | Sink failure/backpressure semantics. | Policy trip reliably creates auditable delivery record. |
| Approval gate | Valuable for irreversible operations, but not just a notification. | Pending request state, approve/edit/reject, timeout. | Requires durable state and client UX contract. | Resumable approval tests pass without duplicate side effects. |
| Prompt variant receipts | Makes Wardwright useful as a prompt experiment boundary. | Versioned preamble/postscript transforms recorded in receipts. | Can become Helicone-style product sprawl. | Operators can compare outcomes by transform version. |
| Budget/context governor | Route decisions are central to synthetic models. | Run/session counters and context-threshold route actions. | Budget facts need deterministic cache semantics. | Generated threshold tests produce stable route/degrade/alert choices. |
| Trace-to-regression importer | Converts production failures into examples and tests. | Receipt fixture import to BDD scenario generator. | Needs stable receipt schema and redaction. | A labeled incident becomes a failing test before fix. |
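
To make one row concrete, here is a minimal sketch of the tool-loop detector: a per-session counter keyed by a hash of tool name, canonicalized arguments, and result, tripping after N equivalent calls. `ToolLoopDetector` and the threshold of three are illustrative assumptions, not Wardwright primitives.

```python
# Sketch of the tool-loop detector row under assumed session-fact shapes.
import hashlib
import json
from collections import Counter

class ToolLoopDetector:
    """Count equivalent tool calls per session and trip after N repeats."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.counts: dict[str, Counter] = {}

    @staticmethod
    def _key(tool: str, args: dict, result: str) -> str:
        # Hash tool name + canonicalized args + result so "equivalent" is explicit.
        payload = json.dumps({"tool": tool, "args": args, "result": result},
                             sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def observe(self, session_id: str, tool: str, args: dict, result: str) -> bool:
        """Record one call; return True when the repeat threshold is crossed."""
        counter = self.counts.setdefault(session_id, Counter())
        key = self._key(tool, args, result)
        counter[key] += 1
        return counter[key] >= self.max_repeats

# Usage: on the Nth equivalent call, inject a reminder or emit an alert receipt.
detector = ToolLoopDetector(max_repeats=3)
for _ in range(3):
    tripped = detector.observe("s1", "search_docs", {"q": "deploy"}, "no results")
print(tripped)  # True: same tool/args/result seen three times in session s1
```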

First Example Library

The example policies should be intentionally boring before they are clever:

  1. Ambiguous success: alert when an agent claims completion but required artifact metadata is absent.
  2. JSON contract: retry once with validation feedback, then block if JSON is still invalid or missing required semantic fields (see the sketch after this list).
  3. Deprecated API TTSR: withhold streamed code long enough to catch OldClient(, retry with a reminder, and prove the bad bytes were not released.
  4. Repeated tool call: count equivalent tool calls in a session and inject a reminder or alert after N repeats.
  5. Budget step-up: record when a route crosses from cheap/local to expensive/managed and alert when a session crosses a configured spend window.
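
A hedged sketch of example 2's retry-or-block flow: validate the final output against a JSON contract, retry once with the validation error as feedback, and block if it still fails. The `call_model` callable and the required-field set are assumptions made for illustration, not a real Wardwright interface.

```python
# Sketch of the JSON contract example: validate, retry once with feedback, block.
import json
from typing import Callable

REQUIRED_FIELDS = {"status", "artifact_id"}  # hypothetical contract fields

def validate(output: str) -> str | None:
    """Return an error message, or None if the output satisfies the contract."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc}"
    if not isinstance(data, dict):
        return "top-level value must be a JSON object"
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        return f"missing required fields: {sorted(missing)}"
    return None

def enforce_json_contract(prompt: str, call_model: Callable[[str], str]) -> dict:
    """Retry-or-block: one retry with validation feedback, then a blocked result."""
    output = call_model(prompt)
    error = validate(output)
    if error is None:
        return {"action": "pass", "output": output}
    fields = ", ".join(sorted(REQUIRED_FIELDS))
    retry_prompt = (
        f"{prompt}\n\nYour previous reply was rejected: {error}. "
        f"Reply with a single JSON object containing the fields: {fields}."
    )
    output = call_model(retry_prompt)
    error = validate(output)
    if error is None:
        return {"action": "pass_after_retry", "output": output}
    return {"action": "block", "reason": error}
```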

Each example needs:

What Not To Overbuild Yet