LLM Benchmark: sequential test for MongoDB by bradleyshep · Pull Request #5439 · clockworklabs/SpacetimeDB

bradleyshep · 2026-06-24T13:50:42Z

Description of Changes

Adds MongoDB as a third backend to the LLM cost-to-done benchmark (alongside
SpacetimeDB and PostgreSQL), plus benchmark-tooling cleanups and a few skill-doc
improvements surfaced by the runs.

Benchmark harness (tools/llm-sequential-upgrade/):

- Add the MongoDB backend (Express + Mongoose + Socket.io) and a matching perf-benchmark client
- Consolidate the SpacetimeDB SDK reference to the official skills/ files (drop the focused/fork alternates and the STDB_SDK_REF switch)
- Pin the model and force the 5-min cache tier for cost parity
- Remove the unused Playwright grading path

Skills (skills/typescript-{server,client}/SKILL.md) — documentation-only additions from benchmark transcript analysis:

- Per-viewer access control via Views
- TypeScript gotchas (readonly useTable rows, bigint in JSX, nullable ctx.connectionId)
- Schema re-export note
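
For reviewers unfamiliar with the three TypeScript gotchas listed above, the following is a minimal sketch of the general patterns involved. It does not use the SpacetimeDB SDK: `MessageRow`, `CtxLike`, and all three helper functions are hypothetical stand-ins, and the real `ctx.connectionId` type is assumed here to behave like a nullable value.

```typescript
// Stand-in types for illustration only; the real SDK types may differ.
interface MessageRow {
  readonly id: bigint;
  readonly text: string;
}

interface CtxLike {
  // Assumption: modelled as a nullable string for this sketch.
  readonly connectionId: string | null;
}

// Readonly rows cannot be sorted in place, so copy the array first.
function newestFirst(rows: readonly MessageRow[]): MessageRow[] {
  return rows
    .slice()
    .sort((a, b) => (a.id < b.id ? 1 : a.id > b.id ? -1 : 0));
}

// Convert a bigint to a string before handing it to the UI layer.
function idLabel(id: bigint): string {
  return `#${id.toString()}`;
}

// Narrow the nullable value once, then work with a plain string.
function requireConnectionId(ctx: CtxLike): string {
  if (ctx.connectionId === null) {
    throw new Error("connectionId is null for this call");
  }
  return ctx.connectionId;
}
```

Copying before sorting leaves the original row array untouched, which is what a `readonly` element type is meant to guarantee.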

API and ABI breaking changes

None.

Expected complexity level and risk

1 — nearly all changes live in an internal benchmark tool. The only changes to
shipped artifacts are documentation additions to two skill files; no SDK/runtime code
paths are touched.

Testing

- Full SpacetimeDB L1–L12 benchmark sweep + the MongoDB comparison run completed clean on the updated harness (run.sh syntax-checked)
- Reviewer: skim the two SKILL.md additions for accuracy and tone

Generalizes the cost-to-done benchmark harness from two backends (SpacetimeDB, PostgreSQL) to also support MongoDB, at parity with the existing PostgreSQL path. Standard MERN stack (Express + Mongoose + Socket.io), manual Socket.io real-time (no change streams), Vite on 6373.

- run.sh: N-backend port allocation (Vite 6373, Mongo DB 6437), mongodb pre-flight + per-run database isolation, a `.benchmark-backend` marker written at generate time so fix/upgrade/grade reliably tell mongodb and postgres apart (both use a server/ dir), a mongodb arm in the minimal and standard CLAUDE.md assembly, and parallel-run sed patching for 6373 + mongodb:// connection strings.
- reset-app.sh: marker-based detection + a mongodb reset arm (dropDatabase).
- grade.sh: marker-based detection and Vite port resolved from metadata (fallbacks aligned to run.sh: 6173/6273/6373).
- generate-report.mjs: clarify the server/ LOC branch covers postgres + mongodb.
- GRADING.md / GRADING_WORKFLOW.md: add MongoDB URL/port; correct stale ports.
- .gitignore: stop tracking generated run output (published to the external spacetimedb-ai-test-results repo instead).

Verified end to end: `./run.sh --level 1 --backend mongodb` generates, builds, and deploys a working MERN chat app with zero build reprompts and cost telemetry in the standard format; reset and grade plumbing tested.
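
The marker-based detection described above can be sketched roughly as follows. The actual logic lives in the shell scripts and is not reproduced here; this TypeScript version only illustrates the idea. The `spacetime` identifier, the function names, and the fallback rule (no server/ dir implies SpacetimeDB) are assumptions made for illustration, while `mongodb` and `postgres` follow the names used in the commit message.

```typescript
// Backend identifiers; "spacetime" is an assumed spelling.
type Backend = "spacetime" | "postgres" | "mongodb";

// Parse the contents of a .benchmark-backend marker file, if one was read.
function parseMarker(markerText: string | null): Backend | null {
  if (markerText === null) return null;
  switch (markerText.trim().toLowerCase()) {
    case "spacetime":
      return "spacetime";
    case "postgres":
      return "postgres";
    case "mongodb":
      return "mongodb";
    default:
      return null;
  }
}

// Prefer the marker. Without it, a server/ dir alone cannot distinguish
// postgres from mongodb, so report "unknown" instead of guessing.
function detectBackend(
  markerText: string | null,
  hasServerDir: boolean
): Backend | "unknown" {
  const fromMarker = parseMarker(markerText);
  if (fromMarker !== null) return fromMarker;
  // Assumption: only the non-SpacetimeDB backends create a server/ dir.
  return hasServerDir ? "unknown" : "spacetime";
}
```

Returning `"unknown"` rather than defaulting to one of the two server-based backends mirrors the reason the marker exists: directory layout alone is ambiguous.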

Grading is manual (Chrome MCP / human in-browser), so the deterministic Playwright path is dead weight. Removes the --test/TEST_MODE plumbing and the run.sh auto-grade block that invoked the (now-deleted) Playwright scripts, making the harness self-consistent.

- run.sh: drop --test/TEST_MODE, the testMode metadata field, and the Playwright/agents auto-grade block; UI-contract stripping is now unconditional (it only mattered for automated UI assertions).
- benchmark.sh, run-loop.sh: drop --test/TEST_FLAG passthrough.
- reset-app.sh: reword comment (clean slate for grading, not Playwright).
- README.md, DEVELOP.md: drop Playwright references; note templates/ and the mongodb backend.
- .gitignore: drop the Playwright ignore entries.

The grade-playwright.sh, grade-agents.sh, and parse-playwright-results.mjs scripts were already removed.

Stop ignoring sequential-upgrade/ so generated app state is versioned and can be reverted between levels. Still excluded: node_modules/dist/.vite/drizzle, local .env, verbose run.log, and PII raw-telemetry.jsonl. Snapshot of the L1 MongoDB run (chat-app-20260616-100224): generate + 1 presence fix iteration (online-users ref-counting), model claude-sonnet-4-6.
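
The commit message names the presence fix only as "online-users ref-counting" and does not show the code. As a minimal sketch of that general technique, assuming a user can hold several simultaneous connections (for example, multiple tabs), a tracker can count connections per user and report a user as offline only when the count reaches zero. `PresenceTracker` and its methods are hypothetical names, not taken from the generated app.

```typescript
// Ref-counted presence: one counter per user, one increment per connection.
class PresenceTracker {
  private readonly counts = new Map<string, number>();

  // Returns true when this connection brings the user online.
  connect(userId: string): boolean {
    const next = (this.counts.get(userId) ?? 0) + 1;
    this.counts.set(userId, next);
    return next === 1;
  }

  // Returns true when this disconnect takes the user offline.
  disconnect(userId: string): boolean {
    const current = this.counts.get(userId);
    if (current === undefined) return false; // unknown user, nothing to do
    if (current <= 1) {
      this.counts.delete(userId);
      return true;
    }
    this.counts.set(userId, current - 1);
    return false;
  }

  // Users with at least one open connection, sorted for stable output.
  onlineUsers(): string[] {
    return Array.from(this.counts.keys()).sort();
  }
}
```

The boolean return values let a caller broadcast presence changes only on the first connect and the last disconnect, rather than on every socket event.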

…ntry When tables (schema.ts) and reducers (index.ts) are split, the entry must re-export the schema default (export { default } from './schema') or publish aborts with "haven't exported your schema". Recurred 3/3 benchmark generates — fair SDK structure doc.

…kick now revokes data access: client subscribes via membership semijoin so a kicked user's messages drop from cache (existing vanish + new ones never arrive); UI renders kicked card instead of overlay; client-only fix; L6 fix $1.04 (upgrade $0.92 + fix $1.04 = $1.97 to-done)
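
The "to-done" figure above combines the upgrade cost with the cost of any fix iterations. A minimal sketch of that arithmetic, working in integer cents to avoid floating-point drift, might look like this. `LevelCost`, `costToDoneCents`, and `formatUsd` are hypothetical names; the harness's real telemetry format is not shown in this PR description.

```typescript
// Costs are held as integer cents; the field names are illustrative only.
interface LevelCost {
  readonly level: number;
  readonly upgradeCents: number;
  readonly fixCents: readonly number[];
}

// Cost to done = upgrade cost + the cost of every fix iteration.
function costToDoneCents(cost: LevelCost): number {
  return cost.fixCents.reduce((sum, fix) => sum + fix, cost.upgradeCents);
}

// Render integer cents as a dollar string such as "$1.25".
function formatUsd(cents: number): string {
  return `$${(cents / 100).toFixed(2)}`;
}
```

With the L11 figures quoted elsewhere in this PR (upgrade $0.78, fix $0.47), `formatUsd(costToDoneCents({ level: 11, upgradeCents: 78, fixCents: [47] }))` yields `$1.25`. Adding already-rounded per-step figures can land a cent away from a total rounded after summing, which is presumably why the L6 numbers above (0.92 + 1.04) are reported as $1.97; that explanation is a guess, not something the PR states.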

bradleyshep added 30 commits June 15, 2026 16:05

init

b7c93fd

Merge branch 'bradley/sequential-mongodb-test' of https://github.com/…

41bf822

…clockworklabs/SpacetimeDB into bradley/sequential-mongodb-test

some minor polish/fixes

f5ee5ec

model flag; runbook

bba5a5f

L1 MongoDB final — presence fix verified, bug report cleared

463fc24

Restore point for Level 1 before the L2 upgrade. Online-presence ref-counting fix confirmed; BUG_REPORT.md removed (resolved).

Record L1 MongoDB grades: 12/12 (Features 1-4 all 3/3), 1 fix iteration

272b2ff

L2 MongoDB generate — Scheduled Messages added (pre-grading restore p…

0bdc6cc

…oint)

L2 MongoDB final — 15/15 (Features 1-5 all 3/3), 0 fix iterations

5b4b666

L3 MongoDB generate — Ephemeral Messages added (pre-grading restore p…

74ab683

…oint)

L3 MongoDB final — 18/18 (Features 1-6 all 3/3), 0 fix iterations

91f058d

L4 MongoDB generate — Message Reactions added (pre-grading restore po…

8f21161

…int)

L4 MongoDB final — 21/21 (Features 1-7 all 3/3), 0 fix iterations

23ecd69

L5 MongoDB generate — Message Editing + history added (pre-grading re…

61660f1

…store point)

L5 MongoDB final — 24/24 (Features 1-8 all 3/3), 0 fix iterations

5c8e912

L6 MongoDB generate — Real-Time Permissions added; backfill L2-L6 tel…

8396e48

…emetry (failed L6 attempt excluded)

L6 MongoDB final — 27/27 (Features 1-9 all 3/3), 0 fix iterations

72cc015

L7 MongoDB generate — Rich Presence added (pre-grading restore point)

0d1d12c

L7 MongoDB final — 30/30 (Features 1-10 all 3/3), 1 fix iteration (3 …

ed0bf42

…presence bugs); preserve L1/L7 bug reports in snapshots

L8 MongoDB generate — Message Threading added (pre-grading restore po…

c528f7b

…int)

L8 MongoDB final — 33/33 (Features 1-11 all 3/3), 1 fix iteration (th…

9379d63

…read-reply leak); failed API-500 fix attempt excluded

L9 MongoDB generate — Private Rooms & DMs added (pre-grading restore …

f1550cb

…point)

Add cross-backend LEADERBOARD (cost/fixes/quality per level, through L8)

95433da

L9 MongoDB final — 36/36 (Features 1-12 all 3/3), 0 fix iterations; l…

eca79b7

…eaderboard through L9

L10 MongoDB generate — Room Activity Indicators added (pre-grading re…

cf66672

…store point)

L10 MongoDB final — 39/39 (Features 1-13 all 3/3), 1 fix iteration (a…

6478171

…ctivity-decay bug); leaderboard through L10

L11 MongoDB generate — Draft Sync added (pre-grading restore point)

b380fe1

bradleyshep added 29 commits June 18, 2026 10:09

L2 STDB upgrade + 15/15 (Scheduled Messages, Features 1-5 all 3/3), 0…

ebceee0

… fixes; 1 publish attempt (vs 3 pre-migration-note); 5-min cache; L2 $1.03

Update SKILL.md

efdf0bd

L3 STDB upgrade + 18/18 (Ephemeral Messages, Features 1-6 all 3/3), 0…

6bd9539

… fixes; 1 publish attempt, 0 errors; 5-min cache; L3 $0.80

L4 STDB upgrade + 21/21 (Message Reactions, Features 1-7 all 3/3), 0 …

e19edc2

…fixes; 1 publish attempt, 0 errors; 5-min cache; L4 $0.60

L5 STDB upgrade + 24/24 (Message Editing with History, Features 1-8 a…

bec904b

…ll 3/3), 0 fixes; 2 publishes (duplicate-index self-fix); 5-min cache; L5 $0.74

L6 STDB upgrade (Real-Time Permissions) — PRE-FIX snapshot: kick is U…

215e476

…I-only (banner over still-visible/received messages), not data-level revocation; BUG_REPORT filed, fix pending

cost summary

800edce

Update SKILL.md

24db0fa

L7 STDB upgrade (Rich User Presence) + 30/30 (Features 1-10 all 3/3),…

82ab523

… 0 fixes; 1 publish, 0 tsc errors; 5-min cache; L7 $1.23

L8 STDB upgrade (Message Threading) + 33/33 (Features 1-11 all 3/3), …

601ba13

…0 fixes; 1 publish, 0 tsc errors; 5-min cache; L8 $1.09

L9 STDB upgrade (Private Rooms & DMs) + 36/36 (Features 1-12 all 3/3)…

4dedc7c

…, 0 fixes; 1 publish, 0 tsc errors; 5-min cache; L9 $1.45

L10 STDB upgrade (Room Activity Indicators) + 39/39 (Features 1-13 al…

6258c5d

…l 3/3), 0 fixes; client-only (0 publishes, 0 tsc errors); 5-min cache; L10 $0.50

L11 STDB upgrade (Draft Sync) + 42/42 (Features 1-14 all 3/3), 1 fix …

f665a98

…(cross-session draft real-time sync — input now driven by reactive messageDraft table, not just room-switch); upgrade $0.78 + fix $0.47 = $1.25 to-done; client-only fix; 5-min cache

L12 STDB upgrade (Anonymous Migration) + 45/45 (Features 1-15 all 3/3…

aa6f4fe

…), 0 fixes; identity-native (no migration code — persistent Identity preserves history on setName); 1 publish, 0 tsc errors; 5-min cache; L12 $1.33 — RUN COMPLETE: L1-12 $13.05 to-done, 2 fixes

Add L7-L12 + L11-fix telemetry (cost-summary, COST_REPORT, transcript…

d794c82

…s) for run 20260618 — matches L1-L6 tracking

Stop tracking telemetry app-dir.txt/metadata.json (machine-specific a…

c4a99ec

…bsolute paths); gitignore them repo-wide. Cost data remains in cost-summary.json + COST_REPORT.md

Add STDB benchmark issue log & skills changelog (bug catalog + ration…

0f22d1a

…ale for the official skill edits made during the 20260618 run)

Issue log: drop Category column

703ec3d

Keep STDB issue log as untracked local working doc (not committed)

dd0d8d1

Move sequential-upgrade run output to external spacetimedb-ai-test-re…

4491c30

…sults repo; stop tracking run dirs here (gitignore them; output dir unchanged)

Merge branch 'master' into bradley/sequential-mongodb-test

7ce2941

Update SKILL.md

fe59eb4

Remove MONGODB_BACKEND_PLAN.md

0adfcc5

Single SDK reference: always use the official customer skills (typesc…

2c0c7ef

…ript-server/client SKILL.md). Remove focused/fork alternate refs and STDB_SDK_REF switch

Keep STDB cost tracking as untracked local working doc (not committed…

911d077

…), matching the issue log

Trim verbose comments (.gitignore run-output note, detect_backend + t…

e715c30

…ranscript-capture blocks)

Remove now fixed bug from skill

c739a38

bradleyshep requested a review from cloutiertyler June 24, 2026 13:50

bradleyshep wants to merge 103 commits into master from bradley/sequential-mongodb-test
