LLM Benchmark: sequential test for MongoDB #5439

Open
bradleyshep wants to merge 103 commits into master from bradley/sequential-mongodb-test

@bradleyshep (Contributor)

Description of Changes

Adds MongoDB as a third backend to the LLM cost-to-done benchmark (alongside
SpacetimeDB and PostgreSQL), plus benchmark-tooling cleanups and a few skill-doc
improvements surfaced by the runs.

Benchmark harness (tools/llm-sequential-upgrade/):

  • Add the MongoDB backend (Express + Mongoose + Socket.io) and a matching perf-benchmark client
  • Consolidate the SpacetimeDB SDK reference to the official skills/ files (drop the focused/fork alternates and the STDB_SDK_REF switch)
  • Pin the model and force the 5-min cache tier for cost parity
  • Remove the unused Playwright grading path

Skills (skills/typescript-{server,client}/SKILL.md) — documentation-only additions from benchmark transcript analysis:

  • Per-viewer access control via Views
  • TypeScript gotchas (readonly useTable rows, bigint in JSX, nullable ctx.connectionId)
  • Schema re-export note
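The three TypeScript gotchas listed above could be illustrated roughly as follows. This is a minimal sketch, not the text added to the skill files: `MessageRow`, `CtxLike`, and the helper names are hypothetical, and the only assumptions taken from the list are that table rows arrive as a readonly array, that some row fields are `bigint`, and that a connection id may be null.

```typescript
// Hypothetical row shape; real rows would come from the generated bindings.
interface MessageRow {
  id: bigint;
  text: string;
  sentAtMicros: bigint;
}

// Gotcha 1: rows are readonly, and sort() mutates in place,
// so copy the array before sorting.
function newestFirst(rows: readonly MessageRow[]): MessageRow[] {
  return [...rows].sort((a, b) => {
    if (a.sentAtMicros < b.sentAtMicros) return 1;
    if (a.sentAtMicros > b.sentAtMicros) return -1;
    return 0;
  });
}

// Gotcha 2: convert bigint to a string explicitly before it reaches JSX
// (or JSON), instead of rendering the bigint value directly.
function messageLabel(row: MessageRow): string {
  return `#${row.id.toString()}: ${row.text}`;
}

// Gotcha 3: the connection id may be null, so narrow it before use.
// CtxLike is a stand-in for the real context type (assumption).
interface CtxLike<T> {
  connectionId: T | null;
}

function requireConnectionId<T>(ctx: CtxLike<T>): T {
  if (ctx.connectionId === null) {
    throw new Error("no connection id on this context");
  }
  return ctx.connectionId;
}
```

In a component, `messageLabel(row)` would be rendered in place of `row.id`, and `newestFirst(rows)` would wrap whatever the table hook returns.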

API and ABI breaking changes

None.

Expected complexity level and risk

1 — nearly all changes live in an internal benchmark tool. The only changes to
shipped artifacts are documentation additions to two skill files; no SDK/runtime code
paths are touched.

Testing

  • Full SpacetimeDB L1–L12 benchmark sweep + the MongoDB comparison run completed clean on the updated harness (run.sh syntax-checked)
  • Reviewer: skim the two SKILL.md additions for accuracy and tone

Generalizes the cost-to-done benchmark harness from two backends
(SpacetimeDB, PostgreSQL) to three by adding MongoDB, at parity with the
existing PostgreSQL path. The MongoDB app is a standard MERN stack (Express +
Mongoose + Socket.io) with manual Socket.io real-time updates (no change
streams) and Vite on port 6373.

run.sh: N-backend port allocation (Vite 6373, MongoDB 6437), mongodb
pre-flight + per-run database isolation, a `.benchmark-backend` marker
written at generate time so fix/upgrade/grade reliably tell mongodb and
postgres apart (both use a server/ dir), a mongodb arm in the minimal and
standard CLAUDE.md assembly, and parallel-run sed patching for 6373 +
mongodb:// connection strings.
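The marker-based detection described above can be sketched as follows. This is an illustration in TypeScript rather than the harness's shell code, and it rests on assumptions: the marker is taken to hold a single backend name, `spacetime` is a hypothetical name for the SpacetimeDB backend, and the legacy fallback (a `server/` dir meaning postgres) is a guess at the pre-marker behavior.

```typescript
// Backend names: "mongodb" and "postgres" appear in the PR text;
// "spacetime" is a hypothetical label for the SpacetimeDB backend.
type Backend = "spacetime" | "postgres" | "mongodb";

const KNOWN_BACKENDS: readonly string[] = ["spacetime", "postgres", "mongodb"];

// markerContent: contents of .benchmark-backend, or null if the file is absent.
// hasServerDir: whether the app directory contains server/.
function detectBackend(markerContent: string | null, hasServerDir: boolean): Backend {
  const value = markerContent === null ? "" : markerContent.trim().toLowerCase();
  if (KNOWN_BACKENDS.indexOf(value) !== -1) {
    // The marker wins: it is the only way to tell mongodb from postgres,
    // because both layouts contain a server/ directory.
    return value as Backend;
  }
  // Assumed legacy heuristic for apps generated before the marker existed.
  return hasServerDir ? "postgres" : "spacetime";
}
```

The function is kept pure (no file access) so the same rule can be shared by whichever script reads the marker.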

reset-app.sh: marker-based detection + a mongodb reset arm (dropDatabase).
grade.sh: marker-based detection and Vite port resolved from metadata
(fallbacks aligned to run.sh: 6173/6273/6373).
generate-report.mjs: clarify the server/ LOC branch covers postgres + mongodb.
GRADING.md / GRADING_WORKFLOW.md: add MongoDB URL/port; correct stale ports.
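Resolving the Vite port from metadata with a per-backend fallback might look like the sketch below. The `vitePort` field name is hypothetical, and the pairing of 6173 and 6273 with specific backends is an assumption; the text above only states the three fallback values and that MongoDB uses 6373.

```typescript
type Backend = "spacetime" | "postgres" | "mongodb";

// Hypothetical shape of the per-run metadata file.
interface RunMetadata {
  vitePort?: number;
}

// Only mongodb -> 6373 is stated in the PR; the other two pairings are assumed.
const FALLBACK_VITE_PORT: Record<Backend, number> = {
  spacetime: 6173,
  postgres: 6273,
  mongodb: 6373,
};

function resolveVitePort(backend: Backend, metadata: RunMetadata | null): number {
  const port = metadata === null ? undefined : metadata.vitePort;
  // Prefer the recorded port when it is a valid TCP port number.
  if (typeof port === "number" && port % 1 === 0 && port > 0 && port < 65536) {
    return port;
  }
  return FALLBACK_VITE_PORT[backend];
}
```

Preferring metadata matters for parallel runs, where the recorded port can differ from the default for that backend.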

Verified end to end: `./run.sh --level 1 --backend mongodb` generates,
builds, and deploys a working MERN chat app with zero build reprompts and
cost telemetry in the standard format; reset and grade plumbing tested.

.gitignore: stop tracking generated run output (published to the external
spacetimedb-ai-test-results repo instead).

Grading is manual (Chrome MCP / human in-browser), so the deterministic
Playwright path is dead weight. Removes the --test/TEST_MODE plumbing and
the run.sh auto-grade block that invoked the (now-deleted) Playwright
scripts, making the harness self-consistent.

- run.sh: drop --test/TEST_MODE, the testMode metadata field, and the
  Playwright/agents auto-grade block; UI-contract stripping is now
  unconditional (it only mattered for automated UI assertions).
- benchmark.sh, run-loop.sh: drop --test/TEST_FLAG passthrough.
- reset-app.sh: reword comment (clean slate for grading, not Playwright).
- README.md, DEVELOP.md: drop Playwright references; note templates/ and
  the mongodb backend.
- .gitignore: drop the Playwright ignore entries.

The grade-playwright.sh, grade-agents.sh, and parse-playwright-results.mjs
scripts were already removed.

Stop ignoring sequential-upgrade/ so generated app state is versioned and can be
reverted between levels. Still excluded: node_modules, dist, .vite, and drizzle
directories, local .env files, the verbose run.log, and raw-telemetry.jsonl
(which contains PII).

Snapshot of the L1 MongoDB run (chat-app-20260616-100224): generate + 1 presence
fix iteration (online-users ref-counting), model claude-sonnet-4-6.

Restore point for Level 1 before the L2 upgrade. Online-presence ref-counting
fix confirmed; BUG_REPORT.md removed (resolved).
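The commits above do not show the fix itself. One common way to ref-count online presence is sketched below, under the assumption that a user can hold several connections at once (multiple tabs or devices) and should only go offline when the last one closes. `PresenceTracker` and its method names are hypothetical.

```typescript
// Tracks how many live connections each user has.
class PresenceTracker {
  private counts = new Map<string, number>();

  // Returns true only when this connection brings the user online.
  connect(userId: string): boolean {
    const next = (this.counts.get(userId) ?? 0) + 1;
    this.counts.set(userId, next);
    return next === 1;
  }

  // Returns true only when the user's last connection has closed.
  disconnect(userId: string): boolean {
    const current = this.counts.get(userId);
    if (current === undefined) return false; // unknown user: nothing to do
    if (current <= 1) {
      this.counts.delete(userId);
      return true;
    }
    this.counts.set(userId, current - 1);
    return false;
  }

  onlineUsers(): string[] {
    return Array.from(this.counts.keys()).sort();
  }
}
```

In a Socket.io server, `connect` and `disconnect` would be called from the connection and disconnect handlers, and the online-users list would be broadcast only when either call returns true.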
…presence bugs); preserve L1/L7 bug reports in snapshots
…read-reply leak); failed API-500 fix attempt excluded
… fixes; 1 publish attempt (vs 3 pre-migration-note); 5-min cache; L2 $1.03
…ntry

When tables (schema.ts) and reducers (index.ts) are split, the entry file must
re-export the schema default (export { default } from './schema'), or publish
aborts with "haven't exported your schema". This recurred in 3/3 benchmark
generates, so documenting the SDK structure is fair.
… fixes; 1 publish attempt, 0 errors; 5-min cache; L3 $0.80
…fixes; 1 publish attempt, 0 errors; 5-min cache; L4 $0.60
…ll 3/3), 0 fixes; 2 publishes (duplicate-index self-fix); 5-min cache; L5 $0.74
…I-only (banner over still-visible/received messages), not data-level revocation; BUG_REPORT filed, fix pending
…kick now revokes data access: client subscribes via membership semijoin so a kicked user's messages drop from cache (existing vanish + new ones never arrive); UI renders kicked card instead of overlay; client-only fix; L6 fix $1.04 (upgrade $0.92 + fix $1.04 = $1.97 to-done)
… 0 fixes; 1 publish, 0 tsc errors; 5-min cache; L7 $1.23
…0 fixes; 1 publish, 0 tsc errors; 5-min cache; L8 $1.09
…, 0 fixes; 1 publish, 0 tsc errors; 5-min cache; L9 $1.45
…l 3/3), 0 fixes; client-only (0 publishes, 0 tsc errors); 5-min cache; L10 $0.50
…(cross-session draft real-time sync — input now driven by reactive messageDraft table, not just room-switch); upgrade $0.78 + fix $0.47 = $1.25 to-done; client-only fix; 5-min cache
…), 0 fixes; identity-native (no migration code — persistent Identity preserves history on setName); 1 publish, 0 tsc errors; 5-min cache; L12 $1.33 — RUN COMPLETE: L1-12 $13.05 to-done, 2 fixes
…s) for run 20260618 — matches L1-L6 tracking
…bsolute paths); gitignore them repo-wide. Cost data remains in cost-summary.json + COST_REPORT.md
…ale for the official skill edits made during the 20260618 run)
…sults repo; stop tracking run dirs here (gitignore them; output dir unchanged)
…ript-server/client SKILL.md). Remove focused/fork alternate refs and STDB_SDK_REF switch