
chore(ai-gateway): remove unused MORPH provider and morph warp grep model #4004

Merged
chrarnoldus merged 1 commit into main from chore/remove-morph-provider on Jun 12, 2026
@chrarnoldus

Contributor

Summary

  • Removed the morph_warp_grep_free_model (morph-warp-grep-v2) Kilo-exclusive model and deleted providers/morph.ts.
  • Removed the MORPH gateway provider from provider-definitions.ts and dropped 'morph' from the ProviderId union — the only model routed through this gateway was the one removed above, so it is no longer used.
  • Added 'morph-warp-grep-v2' to forbiddenFreeModelIds so stale clients receive an appropriate error (per the AI gateway policy for removed free models).
  • Updated the OpenRouter suppression test to use the existing hidden gemma_4_26b_a4b_it_free_model instead of the removed morph model.
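
The tombstoning described in the third bullet can be sketched roughly as follows. This is a minimal illustration, not the gateway's actual code: only `forbiddenFreeModelIds` and the id `'morph-warp-grep-v2'` come from this PR, while `checkRequestedModel`, the result shape, the use of a `Set`, and the error text are hypothetical.

```typescript
// Only `forbiddenFreeModelIds` and the model id are taken from the PR;
// the Set representation and everything below are assumptions.
const forbiddenFreeModelIds: ReadonlySet<string> = new Set(['morph-warp-grep-v2']);

type ModelCheck = { ok: true } | { ok: false; error: string };

// Hypothetical helper: reject requests for removed free models up front, so a
// stale client gets an explicit error instead of falling through to routing.
function checkRequestedModel(modelId: string): ModelCheck {
  if (forbiddenFreeModelIds.has(modelId)) {
    return { ok: false, error: `Model '${modelId}' is no longer available.` };
  }
  return { ok: true };
}
```

Keeping the id in a deny list, rather than simply deleting the model, is what lets an old client that still sends it fail with a clear message.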

Verification

  • No manual verification performed. Changes are code-removal only against an already-hidden, unused model/provider.

Visual Changes

N/A

Reviewer Notes

  • The 'morph' entries in openrouter/inference-provider-id.ts (OpenRouterInferenceProviderIdSchema and VercelNonUserByokInferenceProviderIdSchema) are intentionally left in place — those are external upstream inference provider IDs for routing/BYOK on OpenRouter and Vercel, distinct from our internal MORPH gateway provider.

…odel

Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>
@chrarnoldus chrarnoldus self-assigned this Jun 12, 2026
@kilo-code-bot

kilo-code-bot Bot commented Jun 12, 2026

Contributor

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Executive Summary

Clean removal of the unused MORPH gateway provider and morph-warp-grep-v2 Kilo-exclusive model, with correct tombstoning in forbiddenFreeModelIds per the AI gateway policy.

Files Reviewed (5 files)
  • apps/web/src/lib/ai-gateway/forbidden-free-models.ts: 'morph-warp-grep-v2' correctly added to forbidden list
  • apps/web/src/lib/ai-gateway/models.ts: morph_warp_grep_free_model import and kiloExclusiveModels entry removed
  • apps/web/src/lib/ai-gateway/providers/morph.ts: file deleted
  • apps/web/src/lib/ai-gateway/providers/provider-definitions.ts: MORPH provider definition removed
  • apps/web/src/lib/ai-gateway/providers/types.ts: 'morph' removed from ProviderId union
  • apps/web/src/lib/ai-gateway/providers/openrouter/index.test.ts: suppression test updated to use gemma_4_26b_a4b_it_free_model (confirmed status: 'hidden', valid substitute)

Reviewed by claude-4.6-sonnet-20260217 · 504,825 tokens

Review guidance: REVIEW.md from base branch main

@lambertjosh

Contributor

Did we get rid of the experimental option already client side?

@chrarnoldus

Contributor Author

I don't know, but this stopped working a long time ago.

@chrarnoldus chrarnoldus merged commit 3b86a63 into main Jun 12, 2026
16 checks passed
@chrarnoldus chrarnoldus deleted the chore/remove-morph-provider branch June 12, 2026 19:53
@chrarnoldus

Contributor Author

I'll make a PR

iscekic added a commit that referenced this pull request Jun 13, 2026
…ests

Main merged PR #4004 which deleted the morph provider. The two test files
that exercised the rejection branch of modelServesAllGatewayChatApis used
morph as the only available Kilo-exclusive model on a chat_completions-only
gateway. With morph gone, no real catalog entry satisfies that condition.

Both test files now stub findKiloExclusiveModel via jest.mock/requireActual
so that the marker id 'test-exclusive/alibaba-only' returns a KiloExclusiveModel
with gateway: 'alibaba'. The real PROVIDERS.ALIBABA definition supports only
chat_completions, so the rejection path is exercised without relying on any
specific provider file being present in the catalog.
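
The idea behind that stub can be illustrated without Jest by passing the lookup in as a parameter. This is a sketch under stated assumptions, not the repository's code: the function signatures, the `PROVIDERS` shape, the `'openrouter'` default, and the API kinds other than `chat_completions` are hypothetical; only the marker id, the `'alibaba'` gateway, and its chat_completions-only restriction come from the commit message.

```typescript
// 'responses' and 'messages' are hypothetical placeholder API kinds.
type ApiKind = 'chat_completions' | 'responses' | 'messages';
const GATEWAY_CHAT_APIS: readonly ApiKind[] = ['chat_completions', 'responses', 'messages'];

type ProviderDefinition = { supportedApiKinds: readonly ApiKind[] };
type KiloExclusiveModel = { id: string; gateway: string };
type FindExclusive = (modelId: string) => KiloExclusiveModel | undefined;

// Stand-in catalog; only the alibaba restriction is taken from the source.
const PROVIDERS: Record<string, ProviderDefinition> = {
  alibaba: { supportedApiKinds: ['chat_completions'] },
  openrouter: { supportedApiKinds: GATEWAY_CHAT_APIS },
};

// Sketch of the check under test, with the model lookup injected.
function modelServesAllGatewayChatApis(modelId: string, findExclusive: FindExclusive): boolean {
  const exclusive = findExclusive(modelId);
  // Assumption: non-exclusive models go through a provider that speaks every API.
  const provider = PROVIDERS[exclusive?.gateway ?? 'openrouter'];
  if (!provider) return false;
  return GATEWAY_CHAT_APIS.every(kind => provider.supportedApiKinds.includes(kind));
}

// Plays the role of the mocked findKiloExclusiveModel: only the marker id
// resolves to an exclusive model, so no real catalog entry is needed.
const stubFind: FindExclusive = id =>
  id === 'test-exclusive/alibaba-only' ? { id, gateway: 'alibaba' } : undefined;
```

With the stub, `modelServesAllGatewayChatApis('test-exclusive/alibaba-only', stubFind)` takes the rejection branch regardless of which provider files exist in the catalog.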
iscekic added a commit that referenced this pull request Jun 15, 2026
…ficient (#3982)

* refactor(auto-routing): move classifier core into contracts package

* feat(auto-routing): add tier, routing-table, decision and benchmark contracts

* feat(auto-routing): add benchmark-driven decision engine and KV routing table

* feat(auto-routing): return routing decisions from /decide

* fix(auto-routing): log unparseable routing table JSON before falling back

* feat(auto-routing-benchmark): scaffold benchmark worker with D1 schema

* feat(auto-routing-benchmark): classifier golden dataset and grading

* style(auto-routing-benchmark): apply oxfmt formatting

* feat(auto-routing-benchmark): decider golden dataset with deterministic checkers

* fix(auto-routing-benchmark): unambiguous whitespace instruction in off-by-one case

* feat(auto-routing-benchmark): queue-driven benchmark runs with aggregation and table publish

* feat(auto-routing-benchmark): admin config, runs and routing-table endpoints

* feat(admin): proxy routes for auto-routing benchmark service

* feat(admin): benchmark config, runs and routing table panel

* fix(admin): stabilize benchmark runs polling interval dependencies

* feat(web): internal token mint endpoint for auto-routing benchmark

Mints a short-lived (6h) user API token for a given userId, guarded by the
shared internal secret over Authorization: Bearer. The decider benchmark uses
this to authenticate the kilo CLI against the gateway under a real user's
identity.
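
A guard of this kind might look roughly like the sketch below. It is an illustration only: the real route is not shown in this thread, `isAuthorizedInternalCall` is a hypothetical name, and the hand-rolled comparison is a stand-in for a vetted timing-safe primitive from a crypto library.

```typescript
// Pull the token out of an `Authorization: Bearer <token>` header value.
function extractBearerToken(header: string | null): string | null {
  if (!header) return null;
  const match = /^Bearer\s+(\S+)$/i.exec(header.trim());
  return match?.[1] ?? null;
}

// Stand-in comparison that always scans the full length of the longer input.
// Production code should prefer a library-provided timing-safe comparison.
function constantTimeEqual(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const len = Math.max(a.length, b.length);
  for (let i = 0; i < len; i++) {
    // charCodeAt returns NaN past the end; `|| 0` maps that to 0.
    diff |= (a.charCodeAt(i) || 0) ^ (b.charCodeAt(i) || 0);
  }
  return diff === 0;
}

// Hypothetical guard: true only when the header carries the shared secret.
function isAuthorizedInternalCall(header: string | null, sharedSecret: string): boolean {
  const token = extractBearerToken(header);
  return token !== null && constantTimeEqual(token, sharedSecret);
}
```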

* feat(auto-routing-benchmark): run decider cases through kilo CLI in a container

The decider benchmark now executes each case through the stable kilo CLI
(@kilocode/cli) running in a Cloudflare Container, instead of bare OpenRouter
chat completions, so it measures the real agent harness.

- Container (Dockerfile + dependency-free server.mjs) spawns `kilo run
  --format json --auto` per case; the kilo user token is injected only as a
  child-process env var, never logged or written to disk.
- BenchRunnerContainer DO + wrangler containers/durable_objects/migrations.
- kilo-events.ts: pure parser for the CLI JSON event stream (text + cost),
  tolerant of both part.* and flattened event shapes.
- cli-runner.ts: proxies a case to the container and parses the result.
- run.ts: chunks decider cases (10/chunk) into per-(model,chunk) queue
  messages; fetches a short-lived user token once per message; fails fast when
  benchmarkUserId is unset (plus a defensive per-case guard). Classifier path
  unchanged.
- New benchmarkUserId config field (nullable) on BenchmarkConfig.
- vitest aliases @cloudflare/containers to a node-safe stub so unit tests can
  import the worker entry without the cloudflare:workers chain.

* feat(admin): benchmark user id config field

Adds a Benchmark user id input to the benchmark config editor (empty -> null),
with help text noting decider runs fail until it is set. Round-trips through
configToFormState/formStateToConfig.

* feat(gateway): add kilo-auto/efficient with blocking auto-routing decisions

* chore(auto-routing): drop unused import in routing-table contracts

* fix(auto-routing-benchmark): harden decider CLI parsing, grading and retries

- accept step_finish (underscore) events so per-case cost is summed
- retry once when a CLI session ends with no assistant text
- exact checks also accept the last non-empty output line
- uniform final-answer suffix on decider prompts
- /admin/debug-cli endpoint returning raw CLI events for diagnosis

* fix(auto-routing-benchmark): warm up CLI container before concurrent decider cases

* fix(auto-routing-benchmark): faster container turnover to avoid instance exhaustion

* fix(auto-routing-benchmark): address review findings

- serialize CLI runs per container and run decider cases sequentially
  (the CLI sqlite migration is unsafe under concurrent sessions)
- add dead-letter queue and raise container instance ceiling
- redact the kilo token from captured stderr before it leaves the container
- timing-safe secret comparison and tokenSource audit field on minted tokens
- validate persisted routing tables before serving them from the admin API
- regenerate worker types with the production web base URL
- dedupe the routing-table response schema; tier boundary tests

* style(auto-routing-benchmark): format wrangler.jsonc

* fix(auto-routing-benchmark): guard against double finish on spawn failure

Also documents the queue handler's throw-to-retry contract.

* fix(auto-routing): break contracts module cycle and keep response schema client-safe

madge flagged tiers.ts -> index.ts (type-only but counted); tier derivation
now takes a structural subset of ClassifierOutput. The routing-table response
schema moves into contracts so the client component no longer pulls
config.server (server-only) through the admin client re-export.

* chore(admin): drop unused import after schema move

* feat(auto-routing): classifier model becomes an admin override over the benchmark winner

* feat(auto-routing): manual benchmark runs, classifier override, decider reasoning effort

- benchmark runs start only from the admin panel; models with existing
  results are skipped (latest summaries carried forward) unless forced
- classifier benchmark publishes a winner; the admin-set classifier model
  becomes an override on top of it (clearable from the panel)
- decider models accept a reasoning effort, forwarded to the kilo CLI as
  --variant and mirrored in the routing table and live decisions

* refactor(auto-routing): simplification pass

- benchmark worker: single run-state read per queue message; decider chunks
  require caseIds (legacy fallback removed); dead defensive branch and unused
  DeciderCase.maxTokens dropped; container owns CLI warmup via /warmup
  instead of a synthetic benchmark case; admin routes use zodJsonValidator
  like sibling services
- apps/web: parseAdminResponse and the worker-admin fetch wrapper are shared
  modules instead of per-file copies; BenchmarksSection.types re-export shim
  deleted; dead prevConfigRef guard removed; classifier-model sync effect
  keyed on stable primitives; tier sort order hoisted to module scope

* refactor(auto-routing-benchmark): use drizzle for all D1 access

* refactor(auto-routing-benchmark): normalize D1 schema and adopt drizzle-kit migrations

Eliminate all JSON blob columns from the benchmark worker's D1 database:
- Add drizzle-kit, drizzle.config.ts, and pnpm db:generate script
- Replace config_json/runtime_json blobs with dedicated tables
  (config_classifier_models, config_decider_models) and snapshot columns
  on benchmark_runs (min_accuracy, max_concurrency, benchmark_user_id)
- Replace detail_json blob in case_results with explicit diagnostic columns
  (fallback_reason, retried, exit_code, output_prefix, event_count,
  last_event_types)
- Add run_models table for per-run model config snapshots (enqueued flag,
  api kind flags, reasoning_effort)
- Add carried flag to model_summaries (true = prior-run summary copied in
  at startRun for skipped models)
- Explode routing_tables.table_json into routing_table_candidates rows
- Squash old migrations into a single baseline 0000 migration

Rewrite storage layer accordingly: apiKindsToFlags/flagsToApiKinds helpers,
getConfigRows/replaceConfig, insertRun(run, models, carried), getRunWithModels,
saveRoutingTable(table, publishedAt), getLatestRoutingTable returning RoutingTable
with safeParse, getClassifierWinner from D1 directly.

Move pickClassifierWinner to src/winner.ts (pure, no D1 dep).
Add GET /admin/classifier-winner endpoint.
Add ClassifierWinnerResponseSchema to contracts.
KV puts removed; finalizeRunIfComplete now only deletes KV keys so the
auto-routing worker repopulates as a read-through cache.

* fix(auto-routing-benchmark): preserve null candidate cost and type drizzle batches

Replace `avg_cost_usd ?? 0` with a transparent pass-through cast so a stored
NULL is not silently promoted to 0 (cheapest) in the ranking; the downstream
RoutingTableSchema.safeParse in getLatestRoutingTable will reject a corrupted
table rather than serve it with wrong costs. Add a round-trip test confirming
null is preserved through routingTableToRows → rowsToRoutingTable.

Replace the three `any[]` + `as unknown as Parameters<typeof orm.batch>[0]`
patterns in replaceConfig, insertRun, and saveRoutingTable with the typed
`BatchItem<'sqlite'>` tuple form from drizzle-orm/batch, removing the
eslint-disable suppressions.

* refactor(auto-routing-benchmark): make candidate cost non-null to match the contract

* feat(auto-routing): read-through KV cache backed by the benchmark service

On a KV miss (or corrupt value), fetch routing-table and classifier-winner
from the benchmark worker via a service binding, write the result back with
a 1h TTL, and return it. Corrupt cached values are treated as misses. The
existing 60s isolate-level ttlCached wrappers and fail-closed defaults are
unchanged.
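
The read-through flow described above can be sketched as follows. This is a minimal illustration, not the worker's code: `KvLike`, `readThrough`, and the `parse` callback are hypothetical names, and the origin fetch is abstracted to a plain async function rather than a service binding.

```typescript
// Minimal key-value surface with TTL support (assumed shape for illustration).
interface KvLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options: { expirationTtl: number }): Promise<void>;
}

const ONE_HOUR_SECONDS = 60 * 60;

async function readThrough<T>(
  kv: KvLike,
  key: string,
  parse: (raw: unknown) => T | null, // returns null when the value is corrupt
  fetchOrigin: () => Promise<T>,
): Promise<T> {
  const cached = await kv.get(key);
  if (cached !== null) {
    try {
      const parsed = parse(JSON.parse(cached));
      if (parsed !== null) return parsed;
    } catch {
      // Unparseable JSON: fall through and treat it as a miss.
    }
  }
  const fresh = await fetchOrigin();
  // Awaited so the write-back is not dropped when the request finishes.
  await kv.put(key, JSON.stringify(fresh), { expirationTtl: ONE_HOUR_SECONDS });
  return fresh;
}
```

Treating a corrupt value exactly like a miss means a bad cache entry repairs itself on the next read instead of failing every request until the TTL expires.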

* fix(auto-routing): await read-through cache writes and surface origin error bodies

* ci(workers): run worker predeploy scripts (D1 migrations) before deploy

* fix(auto-routing-benchmark): reuse loaded run state in finalize and build tables from the run snapshot

* refactor(auto-routing): share ttl cache, single-source schemas and drop dead exports

- Move TtlCache/ttlCached to @kilocode/worker-utils; delete the two
  identical service-local copies and update all import sites
- Single-source ReasoningEffortSchema in packages/auto-routing-contracts/tiers.ts;
  routing-table.ts and index.ts use it; benchmark.ts re-exports for compatibility
- Add BenchmarkRunStatus type to contracts; db-schema.ts uses it instead of
  the inline literal union
- Replace local ApiKind in benchmark db.ts with ClassifierApiKind from contracts
- Extract DecideBaseParams / buildDecidePayload shared helper from mirror into
  auto-routing-mirror.ts; auto-routing-decision.ts consumes it
- Delete AutoRoutingAdminResult<T> type alias from both admin client files
  (zero consumers); delete BenchmarkRoutingTableResponseSchema re-export from
  benchmark admin client (consumers import from contracts directly)
- Replace route.ts timingSafeStringEqual with timingSafeEqual from
  @kilocode/encryption; keep extractBearerToken local (jose/jest constraint)
- Replace inline 'classifier'|'decider' and api-kind array types in
  BenchmarksSection.tsx with BenchmarkKind and ClassifierApiKind from contracts

* docs(gateway): drop stale keep-in-sync comment on DecideBaseParams

* feat(gateway): bill classifier cost to the user for kilo-auto/efficient

* fix(gateway): fix type error and remove dead guard in classifier billing

* fix(auto-routing): apply decision reasoningEffort to efficient routing

* feat(auto-routing): align kilo-auto/efficient catalog with balanced, hide from listing

* fix(admin): correct run-summaries colspan in benchmarks section

* feat(admin): derive decider model API kinds from gateway provider definitions

* feat(auto-routing): drop default routing table; no table means no decision

* fix(auto-routing): keep classifier override when benchmark origin is unavailable

* docs(contracts): fix stale classifier-winner comment

* fix(benchmark): exclude no-cost-signal summaries from routing table ranking

* test(benchmark): fix expected ranking order in no-cost-signal test

* feat(benchmark): remove fabricated default config; runs require a saved config

* chore(benchmark): drop redundant case_results index, regenerate baseline migration

* docs(benchmark): fix stale KV comment in wrangler config

* feat(auto-routing-benchmark): grade subtaskType and riskLevel, expand classifier dataset to per-pair coverage

* feat(auto-routing-benchmark): expand decider dataset to per-pair taxonomy coverage

Grow the decider benchmark from 30 to 76 cases so every
(taskType, subtaskType) pair in the classifier taxonomy has at least
4 mechanically-checkable cases, with at least 20 cases per difficulty
tier (23 low / 31 medium / 22 high).

- DeciderCase gains subtaskType; ids follow the
  <taskType>-<subtype>-<topic> scheme used by the classifier dataset
- Existing cases retagged with subtypes where they genuinely fit
  (three system-behavior investigation cases moved to
  planning_design/system_design, the HTTP 201 lookup to
  investigation/external_research, and the let-closure case reframed
  as refactoring/migration)
- New agentic_execution cases are self-contained file/terminal tasks
  deterministic in the node:22-slim container
- Tests now enforce per-pair and per-tier quotas from the
  classifierTaxonomy export, subtype/taskType consistency, regex
  compilability, and json_equal round-tripping

* feat(auto-routing): session-sticky decisions with switch-cost factor

Remember the last served model per conversation in the decision-cache DO
and keep it while it meets the current tier's accuracy threshold, unless
the fresh pick is cheaper by more than the routing table's new
switchCostFactor. Switching models discards provider prompt caches, so a
session whose difficulty tier oscillates no longer ping-pongs between
models. Decisions report a sticky flag in the response and the
auto_routing_decision log line.
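
One plausible reading of that rule is sketched below. Everything here is an assumption made for illustration: the type and function names are hypothetical, and "cheaper by more than switchCostFactor" is interpreted as the fresh pick's cost multiplied by the factor still being below the sticky model's cost. The actual engine may define the comparison differently.

```typescript
type Candidate = { model: string; accuracy: number; avgCostUsd: number };
type Decision = { model: string; sticky: boolean };

function decideWithStickiness(
  fresh: Candidate,
  previous: Candidate | undefined,
  tierMinAccuracy: number,
  switchCostFactor: number,
): Decision {
  // No session history, or the fresh pick is already the served model.
  // (Reporting sticky: false in the second case is an assumption.)
  if (!previous || previous.model === fresh.model) {
    return { model: fresh.model, sticky: false };
  }
  const previousStillQualifies = previous.accuracy >= tierMinAccuracy;
  // Assumed interpretation: switch only when the saving outweighs the factor.
  const freshIsMuchCheaper = fresh.avgCostUsd * switchCostFactor < previous.avgCostUsd;
  if (previousStillQualifies && !freshIsMuchCheaper) {
    return { model: previous.model, sticky: true };
  }
  return { model: fresh.model, sticky: false };
}
```

Under this reading, with a factor of 1.5 a fresh pick costing 0.8 does not displace a sticky model costing 1.0, while one costing 0.5 does.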

* feat(auto-routing-benchmark): plumb switchCostFactor through config, runs, and routing table

Store the new BenchmarkConfig.switchCostFactor in the benchmark_config
singleton, snapshot it into benchmark_runs at startRun, and carry the
run's snapshotted value into published routing tables so the schema's
required RoutingTableSchema.switchCostFactor parses on read. Regenerate
the squashed D1 baseline migration, add a Switch cost factor field to
the admin config form, and update test fixtures (including the apps/web
decision fixtures missing the new required sticky flag).

* fix(ai-gateway): align efficient fallback with Qwen-for-all-APIs after main merge

* refactor(auto-routing): drop per-candidate API-kind plumbing, validate at config save

All decider candidates are served via providers that speak every gateway
chat API (in practice OpenRouter), so per-candidate supportedApiKinds was
dead weight in the contracts, decision engine, D1 schema, and routing
table. The one real failure mode - an admin configuring a model whose
serving provider is chat-completions-only - is now rejected at config
save time instead.

* fix(auto-routing): review-pass fixes

- never let a heuristic fallback classification re-anchor the session's
  sticky model (same trust rule as the classification cache)
- drop the dead ClassifierApiKindSchema export
- rename the decider pages-helper case so its id no longer collides with
  the classifier dataset's debug-fix-pagination-slice in shared telemetry
- trim a stale JSDoc in model-api-kinds.ts

* test(ai-gateway): add sticky field to decision fixture

* feat(dev): move auto-routing workers into their own opt-in dev group

* fix(auto-routing): make the decider benchmark runnable in local dev

- Inject KILO_API_URL into the benchmark container via a new
  KILO_CLI_API_URL worker var so the kilo CLI targets the same gateway
  the worker mints tokens against (prod default: api.kilo.ai).
- Add .dev.vars.example mapping both URLs to the local apps/web dev
  server (worker-side localhost, container-side host.docker.internal).
- Add AUTO_ROUTING_BENCHMARK_WORKER_URL to the apps/web env example so
  the admin panel proxies to the local benchmark worker instead of prod.
- Work around wrangler force-pulling the amd64 container egress proxy
  on Apple Silicon (its transparent-proxy setsockopt crashes under
  emulation, failing every local container start) by pinning the arm64
  manifest digest via MINIFLARE_CONTAINER_EGRESS_IMAGE in the dev
  runner.

* fix(auto-routing): kill the whole CLI process tree on decider case timeout

The kilo bin is a Node wrapper that spawns the real CLI binary as a
grandchild. SIGKILLing only the wrapper orphaned the grandchild on
timeout: it kept running (and spending) and held the stdout/stderr
pipes open, so 'close' never fired, the case promise never resolved,
and the chunk's queue message hung until the runtime cut it — then
retried from case 0 and eventually dead-lettered. Observed live: a
runaway agentic case ran 20+ minutes past the 180s cap and wedged the
whole run.

Spawn the CLI detached so it leads its own process group, kill the
group on timeout, and add an after-exit grace backstop so a stray
pipe-holder can never hang a case again.

* feat(auto-routing): benchmark repetitions, p95 latency, and classifier latency gate

- Config gains classifierRepetitions, deciderRepetitions (1-5), and
  classifierMaxP95LatencyMs (null = no constraint); run rows snapshot the
  active repetition count and latency budget at start time.
- case_results PK extended with rep column; timed_out column added.
- model_summaries gains p95_latency_ms (nearest-rank p95 over all rows)
  and timeouts count.
- pickClassifierWinner enforces an optional p95 latency budget: candidates
  meeting both accuracy and latency are ranked by cost; when none meet the
  budget, falls back to lowest-p95 among accuracy-meeting models.
- classifier_winner contract surfaces the winner's p95LatencyMs.
- DECIDER_CHUNK_SIZE reduced from 10 to 5 to stay well within queue
  consumer wall-clock limits.
- Container server propagates timedOut flag through ContainerRunResponse
  and CliRunResult so timed-out cases are recorded in D1.
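
The nearest-rank p95 and the latency-gated winner selection listed above can be sketched like this. It is a simplified illustration rather than the contents of src/winner.ts: `nearestRankPercentile`, `pickWinnerSketch`, and the `Summary` shape are hypothetical names.

```typescript
// Nearest-rank percentile: the value at rank ceil(p * n) of the sorted sample.
function nearestRankPercentile(samples: readonly number[], p: number): number | null {
  if (samples.length === 0) return null;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(p * sorted.length));
  return sorted[rank - 1] ?? null;
}

type Summary = { model: string; accuracy: number; avgCostUsd: number; p95LatencyMs: number };

function pickWinnerSketch(
  summaries: readonly Summary[],
  minAccuracy: number,
  maxP95LatencyMs: number | null, // null = no latency constraint
): Summary | null {
  const accurate = summaries.filter(s => s.accuracy >= minAccuracy);
  if (accurate.length === 0) return null;

  const withinBudget =
    maxP95LatencyMs === null ? accurate : accurate.filter(s => s.p95LatencyMs <= maxP95LatencyMs);

  if (withinBudget.length > 0) {
    // Accuracy and latency both satisfied: the cheapest model wins.
    return [...withinBudget].sort((a, b) => a.avgCostUsd - b.avgCostUsd)[0] ?? null;
  }
  // Nobody meets the budget: fall back to the fastest accurate model.
  return [...accurate].sort((a, b) => a.p95LatencyMs - b.p95LatencyMs)[0] ?? null;
}
```

The fallback keeps the classifier usable when the budget is set too tightly, at the price of exceeding it.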

* fix(auto-routing): correct case_results migration backfill and close test gaps

- Migration 0001: replace "rep"/"timed_out" column refs in INSERT...SELECT
  with literal 0,0 — old table lacks those columns; D1 silently degrades
  double-quoted unknowns to string literals, corrupting NOT NULL integer rows.
- Contracts: add BenchmarkConfigSchema defaults test (classifierRepetitions=1,
  deciderRepetitions=1, classifierMaxP95LatencyMs=1000 when omitted).
- Benchmark: extract buildDeciderMessages() pure function; add fan-out test
  asserting models × reps × ceil(76/5) messages each carrying the correct rep.

* feat(admin): benchmark repetitions, latency budget, and p95/timeout columns

Add classifier/decider repetitions (1–5) and classifierMaxP95LatencyMs
inputs to the Benchmark Config card; add p95 latency and Timeouts
columns to the run summaries table; update test fixtures with new fields.

* fix(admin): correct runs-table colSpan and cover config form round-trip

Set both RunSummariesTable colSpan values back to 6 to match the outer
BenchmarkRunsTable's 6-column header (chevron, Kind, Status, Started,
Completed, Error). Export configToFormState and formStateToConfig for
unit testing and add focused tests covering null-config defaults,
round-trip preservation of repetitions/latency fields, and empty-string
classifierMaxP95LatencyMs coercing to null.

* chore(auto-routing): squash benchmark D1 migrations into one baseline

* test(ai-gateway): stop depending on removed morph model in API-kind tests

Main merged PR #4004 which deleted the morph provider. The two test files
that exercised the rejection branch of modelServesAllGatewayChatApis used
morph as the only available Kilo-exclusive model on a chat_completions-only
gateway. With morph gone, no real catalog entry satisfies that condition.

Both test files now stub findKiloExclusiveModel via jest.mock/requireActual
so that the marker id 'test-exclusive/alibaba-only' returns a KiloExclusiveModel
with gateway: 'alibaba'. The real PROVIDERS.ALIBABA definition supports only
chat_completions, so the rejection path is exercised without relying on any
specific provider file being present in the catalog.

* fix(auto-routing-benchmark): return 400 when starting a run without config

The POST /admin/runs handler let startRun's "config not set" precondition
error propagate to the global error handler, surfacing a client-side
precondition as HTTP 500. Guard the null config in the route handler,
mirroring the /admin/debug-cli pattern, and return 400 instead.

* fix(auto-routing-benchmark): slice queue fan-out under sendBatch limit

Cloudflare Queues caps sendBatch at 100 messages; a decider fan-out is
models × reps × ceil(76/5) messages, which clears 100 with as few as two
models, so the dispatch is now sliced into <=100-message batches. A
mid-dispatch enqueue failure marks the run failed (surfacing in the admin
panel) instead of leaving a partially-enqueued run wedged in 'running'
until the stale sweep.
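
The fan-out arithmetic and the slicing can be sketched as follows. This is an illustrative sketch, not the worker's dispatch code: `chunk`, `buildMessagesSketch`, and the message shape are hypothetical, and 0-based repetition numbering is an assumption. The chunk size of 5, the 76 cases, and the 100-message cap are taken from the commit messages.

```typescript
const SEND_BATCH_LIMIT = 100; // sendBatch cap cited in the commit message
const DECIDER_CHUNK_SIZE = 5;

// Split a list into consecutive slices of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (size <= 0) throw new RangeError('size must be positive');
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

type DeciderMessage = { model: string; rep: number; caseIds: string[] };

// One message per (model, repetition, chunk of case ids).
function buildMessagesSketch(
  models: readonly string[],
  reps: number,
  caseIds: readonly string[],
): DeciderMessage[] {
  const chunks = chunk(caseIds, DECIDER_CHUNK_SIZE);
  const messages: DeciderMessage[] = [];
  for (const model of models) {
    for (let rep = 0; rep < reps; rep++) {
      for (const ids of chunks) {
        messages.push({ model, rep, caseIds: ids });
      }
    }
  }
  return messages;
}

// Each inner array is small enough for a single sendBatch call.
function sliceForSendBatch(messages: readonly DeciderMessage[]): DeciderMessage[][] {
  return chunk(messages, SEND_BATCH_LIMIT);
}
```

For example, 2 models at 5 repetitions over 76 cases give 2 × 5 × ceil(76/5) = 160 messages, which this slicing sends as one batch of 100 and one of 60.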

* fix(ai-gateway): suppress first-usage events for classifier overhead row

The internal auto-routing/classifier microdollar row reused the primary
request's posthog_distinct_id, so it could emit the generic first_usage /
first_microdollar_usage lifecycle events and race the primary usage row —
mis-attributing auto-routing/classifier as the user's first model. Drop the
distinct id on that row so the events stay gated to the primary usage; DB
billing is unaffected (it keys on kiloUserId).

* fix(ai-gateway): bill classifier cost regardless of final-provider BYOK

The auto-routing classifier always runs on Kilo's own OpenRouter
credential, so its cost is owed whether or not the final inference is
served via the user's BYOK key. The billing guard skipped the classifier
usage row whenever the final provider was BYOK, letting BYOK users incur
repeated Kilo-funded classification with no attribution. Bill on positive
classifier cost alone; the row stays is_byok:false / user_byok:false.

* fix(ai-gateway): make efficient classifier spend authenticated + exit-safe

Two leaks in the kilo-auto/efficient classifier-billing path:

- The paid /decide classifier ran before any access check, including for
  unauthenticated requests — which are then rejected (efficient resolves to
  a paid model), spending Kilo-funded inference with no user to attribute.
  The classifier is now skipped when the request has no authenticated user.
- Classifier billing was scheduled only at the end of the successful upstream
  path, so any intervening early return (abuse block, provider/api-kind
  rejection, balance/org checks, upstream 4xx) dropped the already-incurred
  cost. Billing is now registered via after() right after auth resolves, so
  the row persists regardless of how the request ends.

Adds tests for the unauthenticated-skip and downstream-rejection (abuse
block) paths.

* fix(auto-routing): reject duplicate benchmark model ids at validation

config_classifier_models.model and config_decider_models.model are D1
primary keys, but BenchmarkConfigSchema only validated per-entry shape and
minimum length. Duplicate ids passed validation and surfaced as an opaque
D1 constraint violation (HTTP 500) at replaceConfig. Add a superRefine that
flags duplicate (trim-normalized) ids with field-specific issues, so a
duplicate save returns an actionable 400. Adds contract + route tests.
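
The duplicate check itself is simple enough to sketch without the schema library. The function below is a hypothetical illustration of the logic such a refinement would run; `findDuplicateModelIssues` and the `Issue` shape are not taken from the codebase.

```typescript
type Issue = { path: (string | number)[]; message: string };

// Flag every entry whose trimmed model id was already seen earlier in the list.
function findDuplicateModelIssues(
  field: string,
  models: readonly { model: string }[],
): Issue[] {
  const firstIndexById = new Map<string, number>();
  const issues: Issue[] = [];
  models.forEach((entry, index) => {
    const id = entry.model.trim();
    const first = firstIndexById.get(id);
    if (first === undefined) {
      firstIndexById.set(id, index);
      return;
    }
    issues.push({
      path: [field, index, 'model'],
      message: `Duplicate model id '${id}' (first used at index ${first})`,
    });
  });
  return issues;
}
```

Attaching the path of the offending entry is what lets the admin form point at the specific field instead of reporting a generic failure.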

* fix(auto-routing): reject model-experiment ids as decider candidates

Per .specs/model-experiments.md, an experimented public_model_id is a
dedicated preview id users must explicitly select and MUST NOT enter
kilo-auto candidate sets. Benchmark-config save only validated gateway
chat-API support, so an experiment public id could be saved as a decider
candidate and then automatically selected for kilo-auto/efficient. Add a
status-independent ownership check (findExperimentReservedModelIds queries
all experiment statuses, not just the active|paused Redis membership) and
reject such ids with a 400. Adds a route test.

* fix(auto-routing-benchmark): invalidate carried summaries on identity change

A model's prior summaries were carried into a new run on model-id match
alone, so a run that changed reasoning effort, repetitions, the dataset, or
grading/CLI would publish a routing table pairing the current
run_models.reasoning_effort with measurements taken under different
conditions. Persist an engine_identity (dataset content hash + engine
version) per run and carry a prior result only when engine identity,
repetitions, AND the model's reasoning_effort all match; otherwise the model
is re-benchmarked. Adds the column (migration 0001), identity computation,
and carry/invalidation tests.
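
The carry condition amounts to a three-way equality check, sketched below. The names and shapes are hypothetical; only the three compared properties (engine identity, repetitions, and the model's reasoning effort) come from the commit message.

```typescript
type RunSnapshot = { engineIdentity: string; repetitions: number };
type ModelSnapshot = { model: string; reasoningEffort: string | null };

// A prior summary is reusable only if it was measured under the same conditions.
function canCarrySummary(
  previousRun: RunSnapshot,
  previousModel: ModelSnapshot,
  nextRun: RunSnapshot,
  nextModel: ModelSnapshot,
): boolean {
  return (
    previousModel.model === nextModel.model &&
    previousRun.engineIdentity === nextRun.engineIdentity &&
    previousRun.repetitions === nextRun.repetitions &&
    previousModel.reasoningEffort === nextModel.reasoningEffort
  );
}
```

Any mismatch means the model is benchmarked again rather than paired with stale measurements.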

* fix(auto-routing-benchmark): one active run per kind + stale recovery

Adds a coherent server-side run-admission state machine:
- A partial unique index (one running run per kind) is the atomic backstop;
  startRun pre-checks for an active run and throws RunAlreadyActiveError,
  which the admin route maps to 409 instead of creating overlapping runs.
- Stale runs are swept on GET /admin/runs (not only when starting a run), so
  a dead/wedged run is recovered without the UI deadlock where Start is
  disabled while a run shows 'running'.
- finalizeRunIfComplete skips publishing the routing table / classifier
  winner when a newer run of the same kind has already completed, so a slow
  older run can't overwrite newer published results.

Squashes the branch's D1 migrations into a single baseline now that this
schema isn't deployed to a used database.

* fix(auto-routing): harden benchmarks admin panel (a11y, overflow, dirty state)

Addresses the admin-panel review findings:
- Dirty-state tracking in the config editor: a background config refetch
  (poll / focus) no longer overwrites unsaved edits; the form syncs from
  server only while pristine, with an explicit "Discard & reload".
- Invalidate the routing-table / config queries on the running→terminal run
  transition so published output refreshes instead of showing stale data.
- Expandable run rows now expose a keyboard-accessible button with
  aria-expanded / aria-controls (row click kept as a mouse convenience).
- Wide nested summary + routing tables wrapped in overflow-x-auto.
- Full run error shown in the expanded row (plus a title tooltip on the
  truncated cell) instead of being permanently clipped.

* docs(auto-routing): add ADR and benchmark service README