LLM Benchmark: sequential test for MongoDB #5439
Open
bradleyshep wants to merge 103 commits into
Generalizes the cost-to-done benchmark harness from two backends (SpacetimeDB, PostgreSQL) to also support MongoDB, at parity with the existing PostgreSQL path. Standard MERN stack (Express + Mongoose + Socket.io), manual Socket.io real-time (no change streams), Vite on 6373.
run.sh: N-backend port allocation (Vite 6373, Mongo DB 6437), mongodb pre-flight + per-run database isolation, a `.benchmark-backend` marker written at generate time so fix/upgrade/grade reliably tell mongodb and postgres apart (both use a server/ dir), a mongodb arm in the minimal and standard CLAUDE.md assembly, and parallel-run sed patching for 6373 + mongodb:// connection strings.
reset-app.sh: marker-based detection + a mongodb reset arm (dropDatabase).
grade.sh: marker-based detection and Vite port resolved from metadata (fallbacks aligned to run.sh: 6173/6273/6373).
generate-report.mjs: clarify the server/ LOC branch covers postgres + mongodb.
GRADING.md / GRADING_WORKFLOW.md: add MongoDB URL/port; correct stale ports.
Verified end to end: `./run.sh --level 1 --backend mongodb` generates, builds, and deploys a working MERN chat app with zero build reprompts and cost telemetry in the standard format; reset and grade plumbing tested.
.gitignore: stop tracking generated run output (published to the external spacetimedb-ai-test-results repo instead).
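For reviewers who want the marker-based detection spelled out, here is a minimal sketch of the idea in TypeScript (the harness itself is shell; this is not the code in run.sh or grade.sh). Everything named here is an assumption made for illustration: the marker values `spacetime` and `postgres`, the pairing of 6173 and 6273 with those two backends, the rule that an app without a server/ dir is SpacetimeDB, and the helper names `detectBackend` and `resolveVitePort`. Only the `.benchmark-backend` file name, the `mongodb` backend name, and the 6173/6273/6373 fallback ports come from the commit message.

```typescript
// Hypothetical sketch of marker-based backend detection; not the harness code.
type Backend = "spacetime" | "postgres" | "mongodb";

const KNOWN_BACKENDS: Backend[] = ["spacetime", "postgres", "mongodb"];

// Assumed pairing: the commit lists 6173/6273/6373 but only ties 6373 to MongoDB.
const VITE_PORT_FALLBACK: Record<Backend, number> = {
  spacetime: 6173,
  postgres: 6273,
  mongodb: 6373,
};

interface AppDirInfo {
  // Contents of .benchmark-backend, or null when the marker file is absent.
  markerContents: string | null;
  // Whether the generated app has a server/ directory.
  hasServerDir: boolean;
}

// Prefer the marker. Without it, a server/ dir cannot tell postgres from
// mongodb, so report "ambiguous" instead of guessing.
function detectBackend(info: AppDirInfo): Backend | "ambiguous" {
  const marker = (info.markerContents ?? "").trim().toLowerCase();
  const fromMarker = KNOWN_BACKENDS.find((b) => b === marker);
  if (fromMarker !== undefined) {
    return fromMarker;
  }
  // Assumption: only the SpacetimeDB layout lacks a server/ directory.
  return info.hasServerDir ? "ambiguous" : "spacetime";
}

// Use the port recorded in run metadata when present, else the fallback.
function resolveVitePort(backend: Backend, metadataPort?: number): number {
  return metadataPort ?? VITE_PORT_FALLBACK[backend];
}

const backend = detectBackend({ markerContents: "mongodb\n", hasServerDir: true });
if (backend !== "ambiguous") {
  console.log(backend, resolveVitePort(backend));
}
```

Returning an explicit "ambiguous" result for a marker-less server/ app mirrors the motivation stated in the commit: the directory layout alone is not enough to separate the two server-based backends.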
…clockworklabs/SpacetimeDB into bradley/sequential-mongodb-test
Grading is manual (Chrome MCP / human in-browser), so the deterministic Playwright path is dead weight. Removes the --test/TEST_MODE plumbing and the run.sh auto-grade block that invoked the (now-deleted) Playwright scripts, making the harness self-consistent.
- run.sh: drop --test/TEST_MODE, the testMode metadata field, and the Playwright/agents auto-grade block; UI-contract stripping is now unconditional (it only mattered for automated UI assertions).
- benchmark.sh, run-loop.sh: drop --test/TEST_FLAG passthrough.
- reset-app.sh: reword comment (clean slate for grading, not Playwright).
- README.md, DEVELOP.md: drop Playwright references; note templates/ and the mongodb backend.
- .gitignore: drop the Playwright ignore entries.
The grade-playwright.sh, grade-agents.sh, and parse-playwright-results.mjs scripts were already removed.
Stop ignoring sequential-upgrade/ so generated app state is versioned and can be reverted between levels. Still excluded: node_modules/dist/.vite/drizzle, local .env, verbose run.log, and PII raw-telemetry.jsonl. Snapshot of the L1 MongoDB run (chat-app-20260616-100224): generate + 1 presence fix iteration (online-users ref-counting), model claude-sonnet-4-6.
Restore point for Level 1 before the L2 upgrade. Online-presence ref-counting fix confirmed; BUG_REPORT.md removed (resolved).
…emetry (failed L6 attempt excluded)
…presence bugs); preserve L1/L7 bug reports in snapshots
…read-reply leak); failed API-500 fix attempt excluded
…eaderboard through L9
…ctivity-decay bug); leaderboard through L10
… fixes; 1 publish attempt (vs 3 pre-migration-note); 5-min cache; L2 $1.03
…ntry
When tables (schema.ts) and reducers (index.ts) are split, the entry must re-export the
schema default (`export { default } from './schema'`) or publish aborts with
"haven't exported your schema". Recurred in 3/3 benchmark generates, so this is a fair SDK structure doc.
… fixes; 1 publish attempt, 0 errors; 5-min cache; L3 $0.80
…fixes; 1 publish attempt, 0 errors; 5-min cache; L4 $0.60
…ll 3/3), 0 fixes; 2 publishes (duplicate-index self-fix); 5-min cache; L5 $0.74
…I-only (banner over still-visible/received messages), not data-level revocation; BUG_REPORT filed, fix pending
…kick now revokes data access: client subscribes via membership semijoin so a kicked user's messages drop from cache (existing vanish + new ones never arrive); UI renders kicked card instead of overlay; client-only fix; L6 fix $1.04 (upgrade $0.92 + fix $1.04 = $1.97 to-done)
… 0 fixes; 1 publish, 0 tsc errors; 5-min cache; L7 $1.23
…0 fixes; 1 publish, 0 tsc errors; 5-min cache; L8 $1.09
…, 0 fixes; 1 publish, 0 tsc errors; 5-min cache; L9 $1.45
…l 3/3), 0 fixes; client-only (0 publishes, 0 tsc errors); 5-min cache; L10 $0.50
…(cross-session draft real-time sync — input now driven by reactive messageDraft table, not just room-switch); upgrade $0.78 + fix $0.47 = $1.25 to-done; client-only fix; 5-min cache
…), 0 fixes; identity-native (no migration code — persistent Identity preserves history on setName); 1 publish, 0 tsc errors; 5-min cache; L12 $1.33 — RUN COMPLETE: L1-12 $13.05 to-done, 2 fixes
…s) for run 20260618 — matches L1-L6 tracking
…bsolute paths); gitignore them repo-wide. Cost data remains in cost-summary.json + COST_REPORT.md
…ale for the official skill edits made during the 20260618 run)
…sults repo; stop tracking run dirs here (gitignore them; output dir unchanged)
…ript-server/client SKILL.md). Remove focused/fork alternate refs and STDB_SDK_REF switch
…), matching the issue log
…ranscript-capture blocks)
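The commit titles above report cost-to-done as the upgrade cost plus any fix costs. As a rough illustration of that arithmetic, here is a small TypeScript sketch. The type and function names are hypothetical, and working in integer cents is a choice made here to avoid floating-point drift, not something the harness is known to do. The dollar figures are the ones quoted in the draft-sync and L6 commit titles.

```typescript
// Hypothetical helper for the cost-to-done arithmetic quoted in the commit titles.
interface StepCost {
  label: string;
  upgradeCents: number; // cost of the upgrade pass, in cents
  fixCents: number[];   // cost of each fix iteration, in cents
}

// Cost-to-done for one step: the upgrade plus every fix iteration.
function toDoneCents(step: StepCost): number {
  return step.fixCents.reduce((sum, fix) => sum + fix, step.upgradeCents);
}

function formatUsd(cents: number): string {
  return `$${(cents / 100).toFixed(2)}`;
}

// Figures from the draft-sync commit title: upgrade $0.78 + fix $0.47.
const draftSync: StepCost = { label: "draft sync", upgradeCents: 78, fixCents: [47] };

// Figures from the L6 commit title: upgrade $0.92 + fix $1.04.
const l6: StepCost = { label: "L6", upgradeCents: 92, fixCents: [104] };

for (const step of [draftSync, l6]) {
  console.log(step.label, formatUsd(toDoneCents(step)));
}
```

The draft-sync step comes out at $1.25, matching its commit title. The rounded L6 components sum to $1.96, one cent under the $1.97 quoted in the title, which would be consistent with the reported total having been computed from unrounded telemetry values (an assumption, not something the page states).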
Description of Changes
Adds MongoDB as a third backend to the LLM cost-to-done benchmark (alongside
SpacetimeDB and PostgreSQL), plus benchmark-tooling cleanups and a few skill-doc
improvements surfaced by the runs.
Benchmark harness (`tools/llm-sequential-upgrade/`):
- … `skills/` files (drop the `focused`/`fork` alternates and the `STDB_SDK_REF` switch)
Skills (`skills/typescript-{server,client}/SKILL.md`), documentation-only additions from benchmark transcript analysis:
- … `useTable` rows, bigint in JSX, nullable `ctx.connectionId`)
API and ABI breaking changes
None.
Expected complexity level and risk
1 — nearly all changes live in an internal benchmark tool. The only changes to
shipped artifacts are documentation additions to two skill files; no SDK/runtime code
paths are touched.
Testing
- … `run.sh` syntax-checked)
- … `SKILL.md` additions for accuracy and tone