Open
103 commits
b7c93fd
init
bradleyshep Jun 15, 2026
647edd9
Add MongoDB backend to the sequential-upgrade benchmark harness
bradleyshep Jun 15, 2026
41ab48b
Add MongoDB backend to the sequential-upgrade benchmark harness
bradleyshep Jun 15, 2026
41bf822
Merge branch 'bradley/sequential-mongodb-test' of https://github.com/…
bradleyshep Jun 15, 2026
febe81f
Remove Playwright automated grading
bradleyshep Jun 15, 2026
f5ee5ec
some minor polish/fixes
bradleyshep Jun 16, 2026
bba5a5f
model flag; runbook
bradleyshep Jun 16, 2026
1c21ff9
Track benchmark run output for git-revert; L1 MongoDB baseline
bradleyshep Jun 16, 2026
463fc24
L1 MongoDB final — presence fix verified, bug report cleared
bradleyshep Jun 16, 2026
272b2ff
Record L1 MongoDB grades: 12/12 (Features 1-4 all 3/3), 1 fix iteration
bradleyshep Jun 16, 2026
0bdc6cc
L2 MongoDB generate — Scheduled Messages added (pre-grading restore p…
bradleyshep Jun 16, 2026
5b4b666
L2 MongoDB final — 15/15 (Features 1-5 all 3/3), 0 fix iterations
bradleyshep Jun 16, 2026
74ab683
L3 MongoDB generate — Ephemeral Messages added (pre-grading restore p…
bradleyshep Jun 16, 2026
91f058d
L3 MongoDB final — 18/18 (Features 1-6 all 3/3), 0 fix iterations
bradleyshep Jun 16, 2026
8f21161
L4 MongoDB generate — Message Reactions added (pre-grading restore po…
bradleyshep Jun 16, 2026
23ecd69
L4 MongoDB final — 21/21 (Features 1-7 all 3/3), 0 fix iterations
bradleyshep Jun 16, 2026
61660f1
L5 MongoDB generate — Message Editing + history added (pre-grading re…
bradleyshep Jun 16, 2026
5c8e912
L5 MongoDB final — 24/24 (Features 1-8 all 3/3), 0 fix iterations
bradleyshep Jun 16, 2026
8396e48
L6 MongoDB generate — Real-Time Permissions added; backfill L2-L6 tel…
bradleyshep Jun 16, 2026
72cc015
L6 MongoDB final — 27/27 (Features 1-9 all 3/3), 0 fix iterations
bradleyshep Jun 16, 2026
0d1d12c
L7 MongoDB generate — Rich Presence added (pre-grading restore point)
bradleyshep Jun 16, 2026
ed0bf42
L7 MongoDB final — 30/30 (Features 1-10 all 3/3), 1 fix iteration (3 …
bradleyshep Jun 16, 2026
c528f7b
L8 MongoDB generate — Message Threading added (pre-grading restore po…
bradleyshep Jun 16, 2026
9379d63
L8 MongoDB final — 33/33 (Features 1-11 all 3/3), 1 fix iteration (th…
bradleyshep Jun 16, 2026
f1550cb
L9 MongoDB generate — Private Rooms & DMs added (pre-grading restore …
bradleyshep Jun 16, 2026
95433da
Add cross-backend LEADERBOARD (cost/fixes/quality per level, through L8)
bradleyshep Jun 16, 2026
eca79b7
L9 MongoDB final — 36/36 (Features 1-12 all 3/3), 0 fix iterations; l…
bradleyshep Jun 16, 2026
cf66672
L10 MongoDB generate — Room Activity Indicators added (pre-grading re…
bradleyshep Jun 16, 2026
6478171
L10 MongoDB final — 39/39 (Features 1-13 all 3/3), 1 fix iteration (a…
bradleyshep Jun 16, 2026
b380fe1
L11 MongoDB generate — Draft Sync added (pre-grading restore point)
bradleyshep Jun 16, 2026
6c10efb
L11 MongoDB final — 42/42 (Features 1-14 all 3/3), 0 fix iterations; …
bradleyshep Jun 16, 2026
d7ab849
LEADERBOARD: add time-to-complete (wall-clock) section through L11, w…
bradleyshep Jun 16, 2026
ebb35b3
L12 MongoDB generate — Anonymous Migration added (pre-grading restore…
bradleyshep Jun 16, 2026
12223b0
update
bradleyshep Jun 16, 2026
b1c3de6
L12 MongoDB final — 45/45 (Features 1-15 all 3/3), 0 fix iterations; …
bradleyshep Jun 16, 2026
9df3a9f
perf-benchmark: add MongoDB backend (client adapter + stress/realisti…
bradleyshep Jun 16, 2026
4e4217e
perf-benchmark: add MongoDB optimized reference + results (clean sonn…
bradleyshep Jun 16, 2026
9707447
benchmark prompts: strip benchmark-revealing framing (Key Differences…
bradleyshep Jun 17, 2026
30b4192
CLAUDE.md: drop STDB-only 'bindings' from the generic phase summary (…
bradleyshep Jun 17, 2026
fce231c
CLAUDE.md: drop redundant phase enumeration — defer to the backend fi…
bradleyshep Jun 17, 2026
374d557
benchmark prompts: trim redundant filler (restated subtitle, Referenc…
bradleyshep Jun 17, 2026
fcb589e
spacetime.md: trim redundant subtitle for parity with mongo/pg cleanup
bradleyshep Jun 17, 2026
2095851
Merge branch 'master' into bradley/sequential-mongodb-test
bradleyshep Jun 17, 2026
ff3e7eb
run.sh: assemble STDB CLAUDE.md from official skills (typescript-serv…
bradleyshep Jun 17, 2026
731915f
L1 STDB generate + 12/12 (fresh 20260617 baseline: cleaned prompts + …
bradleyshep Jun 17, 2026
11202cc
L2 STDB upgrade + 15/15 (Scheduled Messages, Features 1-5 all 3/3), 0…
bradleyshep Jun 17, 2026
7dc5fe5
L3 STDB upgrade + 18/18 (Ephemeral Messages, Features 1-6 all 3/3), 0…
bradleyshep Jun 17, 2026
30fe67c
L4 STDB upgrade + 21/21 (Message Reactions, Features 1-7 all 3/3), 0 …
bradleyshep Jun 17, 2026
e022e34
L5 STDB upgrade + 24/24 (Message Editing, Features 1-8 all 3/3), 0 fi…
bradleyshep Jun 17, 2026
6613f2a
L6 STDB upgrade + 27/27 (Real-Time Permissions, Features 1-9 all 3/3)…
bradleyshep Jun 17, 2026
adbd57f
L7 STDB upgrade + 30/30 (Rich Presence, Features 1-10 all 3/3), 0 fix…
bradleyshep Jun 17, 2026
7f07405
L8 STDB upgrade + 33/33 (Message Threading, Features 1-11 all 3/3), 0…
bradleyshep Jun 17, 2026
417bbf8
L9 STDB upgrade + 36/36 (Private Rooms & DMs, Features 1-12 all 3/3),…
bradleyshep Jun 17, 2026
42b37c8
L10 STDB upgrade + 39/39 (Activity Indicators, Features 1-13 all 3/3)…
bradleyshep Jun 17, 2026
21b525e
L11 STDB upgrade + 42/42 (Draft Sync, Features 1-14 all 3/3), 0 fixes…
bradleyshep Jun 17, 2026
9c5ac4f
L12 STDB upgrade + 45/45 (Anon Migration, Features 1-15 all 3/3), 0 f…
bradleyshep Jun 17, 2026
c80328d
run.sh: add selectable STDB SDK reference (STDB_SDK_REF: focused defa…
bradleyshep Jun 17, 2026
1a7dd4e
Fix STDB backend template bugs + save session transcript per run
bradleyshep Jun 17, 2026
0caee57
L1 STDB generate + 12/12 (Basic Chat, Typing, Read Receipts, Unread —…
bradleyshep Jun 17, 2026
4532b0e
run.sh: default model to claude-sonnet-4-6 + array-based claude invoc…
bradleyshep Jun 17, 2026
3bb3ceb
L2 STDB upgrade + 15/15 (Scheduled Messages, Features 1-5 all 3/3), 0…
bradleyshep Jun 17, 2026
573ecff
L3 STDB upgrade + 18/18 (Ephemeral Messages, Features 1-6 all 3/3), 0…
bradleyshep Jun 17, 2026
e8e689e
L4 STDB upgrade + 21/21 (Message Reactions, Features 1-7 all 3/3), 0 …
bradleyshep Jun 17, 2026
8c1917e
L5 STDB upgrade + 24/24 (Message Editing with History, Features 1-8 a…
bradleyshep Jun 17, 2026
8f79661
L6 STDB upgrade + 27/27 (Real-Time Permissions, Features 1-9 all 3/3)…
bradleyshep Jun 17, 2026
84013c9
L7 STDB upgrade + 30/30 (Rich User Presence, Features 1-10 all 3/3), …
bradleyshep Jun 17, 2026
e08cf84
L8 STDB upgrade + 33/33 (Message Threading, Features 1-11 all 3/3), 0…
bradleyshep Jun 17, 2026
06b4cbe
L9 STDB upgrade + 36/36 (Private Rooms & DMs, Features 1-12 all 3/3),…
bradleyshep Jun 18, 2026
d70506c
L10 STDB upgrade + 39/39 (Room Activity Indicators, Features 1-13 all…
bradleyshep Jun 18, 2026
70241b5
L11 STDB upgrade + 42/42 (Draft Sync, Features 1-14 all 3/3), 0 fixes…
bradleyshep Jun 18, 2026
17c636e
L12 STDB upgrade + 45/45 (Anonymous Migration, Features 1-15 all 3/3)…
bradleyshep Jun 18, 2026
9b799d3
minor fixes before testing again
bradleyshep Jun 18, 2026
5ba3c85
skills: add SpacetimeDB TS gotchas from benchmark transcript analysis
bradleyshep Jun 18, 2026
0b7503b
L1 STDB generate + 12/12 (Basic Chat, Typing, Read Receipts, Unread —…
bradleyshep Jun 18, 2026
ebceee0
L2 STDB upgrade + 15/15 (Scheduled Messages, Features 1-5 all 3/3), 0…
bradleyshep Jun 18, 2026
b4b7dc2
skills/typescript-server: document schema re-export from the module e…
bradleyshep Jun 18, 2026
efdf0bd
Update SKILL.md
bradleyshep Jun 18, 2026
6bd9539
L3 STDB upgrade + 18/18 (Ephemeral Messages, Features 1-6 all 3/3), 0…
bradleyshep Jun 18, 2026
e19edc2
L4 STDB upgrade + 21/21 (Message Reactions, Features 1-7 all 3/3), 0 …
bradleyshep Jun 18, 2026
bec904b
L5 STDB upgrade + 24/24 (Message Editing with History, Features 1-8 a…
bradleyshep Jun 18, 2026
215e476
L6 STDB upgrade (Real-Time Permissions) — PRE-FIX snapshot: kick is U…
bradleyshep Jun 18, 2026
6ecb477
L6 STDB fix (Real-Time Permissions) + 27/27 (Features 1-9 all 3/3) — …
bradleyshep Jun 18, 2026
800edce
cost summary
bradleyshep Jun 18, 2026
24db0fa
Update SKILL.md
bradleyshep Jun 18, 2026
82ab523
L7 STDB upgrade (Rich User Presence) + 30/30 (Features 1-10 all 3/3),…
bradleyshep Jun 18, 2026
601ba13
L8 STDB upgrade (Message Threading) + 33/33 (Features 1-11 all 3/3), …
bradleyshep Jun 18, 2026
4dedc7c
L9 STDB upgrade (Private Rooms & DMs) + 36/36 (Features 1-12 all 3/3)…
bradleyshep Jun 18, 2026
6258c5d
L10 STDB upgrade (Room Activity Indicators) + 39/39 (Features 1-13 al…
bradleyshep Jun 18, 2026
f665a98
L11 STDB upgrade (Draft Sync) + 42/42 (Features 1-14 all 3/3), 1 fix …
bradleyshep Jun 18, 2026
aa6f4fe
L12 STDB upgrade (Anonymous Migration) + 45/45 (Features 1-15 all 3/3…
bradleyshep Jun 18, 2026
d794c82
Add L7-L12 + L11-fix telemetry (cost-summary, COST_REPORT, transcript…
bradleyshep Jun 18, 2026
c4a99ec
Stop tracking telemetry app-dir.txt/metadata.json (machine-specific a…
bradleyshep Jun 18, 2026
0f22d1a
Add STDB benchmark issue log & skills changelog (bug catalog + ration…
bradleyshep Jun 18, 2026
703ec3d
Issue log: drop Category column
bradleyshep Jun 18, 2026
dd0d8d1
Keep STDB issue log as untracked local working doc (not committed)
bradleyshep Jun 23, 2026
4491c30
Move sequential-upgrade run output to external spacetimedb-ai-test-re…
bradleyshep Jun 23, 2026
7ce2941
Merge branch 'master' into bradley/sequential-mongodb-test
bradleyshep Jun 24, 2026
fe59eb4
Update SKILL.md
bradleyshep Jun 24, 2026
0adfcc5
Remove MONGODB_BACKEND_PLAN.md
bradleyshep Jun 24, 2026
2c0c7ef
Single SDK reference: always use the official customer skills (typesc…
bradleyshep Jun 24, 2026
911d077
Keep STDB cost tracking as untracked local working doc (not committed…
bradleyshep Jun 24, 2026
e715c30
Trim verbose comments (.gitignore run-output note, detect_backend + t…
bradleyshep Jun 24, 2026
c739a38
Remove now fixed bug from skill
bradleyshep Jun 24, 2026
7 changes: 7 additions & 0 deletions skills/typescript-client/SKILL.md
@@ -106,3 +106,10 @@ conn.db.user.onInsert((ctx, user) => console.log('Joined:', user.name));
conn.db.user.onDelete((ctx, user) => console.log('Left:', user.name));
conn.db.user.onUpdate((ctx, oldUser, newUser) => console.log('Updated:', newUser.name));
```

## Gotchas

- **`useTable` rows are `readonly`.** Copy before sorting/mutating, or it fails to type-check:
`const [rows] = useTable(tables.message); const sorted = [...rows].sort(...)`.
- **bigint in JSX.** ids/counts from `t.u64()`/`t.i64()` columns are `bigint`, which React
cannot render. Wrap it: `{Number(row.id)}` or `{String(count)}`.
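
Both gotchas can be reproduced without React or the SpacetimeDB SDK. The following is a minimal sketch in plain TypeScript: `MessageRow`, `sortById`, and `displayId` are hypothetical names, and the `readonly` array only mimics the shape of the rows described above, not the real `useTable` hook.

```typescript
// Hypothetical row shape; `id` stands in for a t.u64() column, which surfaces as bigint.
interface MessageRow {
  id: bigint;
  text: string;
}

// Stand-in for the readonly rows described above (not the real useTable hook).
const rows: readonly MessageRow[] = [
  { id: 3n, text: 'third' },
  { id: 1n, text: 'first' },
  { id: 2n, text: 'second' },
];

// rows.sort(...) would not type-check: readonly arrays do not expose sort().
// Copy first, then sort. bigint values compare with < and >, but subtracting
// them yields a bigint, which is not a valid comparator result, so the
// comparator returns explicit numbers instead.
function sortById(input: readonly MessageRow[]): MessageRow[] {
  return [...input].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
}

// Convert a bigint into something renderable. String() is lossless;
// Number() loses precision above Number.MAX_SAFE_INTEGER.
function displayId(id: bigint): string {
  return String(id);
}

const sorted = sortById(rows);
console.log(sorted.map((r) => displayId(r.id)).join(','));
```

`String(id)` is probably the safer default for ids, since it never loses precision; `Number(id)` is convenient for small counts.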
18 changes: 18 additions & 0 deletions skills/typescript-server/SKILL.md
@@ -83,6 +83,14 @@ const spacetimedb = schema({ entity, record }); // ONE object, not spread args
export default spacetimedb;
```

The published module's **entry file must export the schema as default**. If you split tables
(`schema.ts`) from reducers/lifecycle (`index.ts`), re-export it from the entry:

```typescript
// index.ts
export { default } from './schema'; // re-export the schema for the module entry
```

## Reducers

Export name becomes the reducer name:
@@ -131,6 +139,10 @@ export const onDisconnect = spacetimedb.clientDisconnected((ctx) => { ... });
// Auth: ctx.sender is the caller's Identity
if (!row.owner.equals(ctx.sender)) throw new SenderError('unauthorized');

// ctx.connectionId: the per-connection id, NULLABLE (ConnectionId | null) — null-check before use.
// One Identity can hold several connections (multiple tabs/devices).
if (ctx.connectionId) { /* ... */ }

// Server timestamp (deterministic per reducer call)
ctx.db.item.insert({ id: 0n, createdAt: ctx.timestamp });

@@ -161,6 +173,8 @@ export const tick = spacetimedb.reducer(

// One-time: ScheduleAt.time(ctx.timestamp.microsSinceUnixEpoch + delayMicros)
// Repeating: ScheduleAt.interval(60_000_000n)
// Read time back from a scheduleAt value (tagged union):
// const micros = at.tag === 'time' ? at.value : at.value.microsSinceUnixEpoch; // bigint
```

## Custom Types
@@ -183,6 +197,10 @@ const Shape = t.enum('Shape', {

## Views

A client subscribing to a view receives only the rows it returns. Use a per-user view
(keyed on `ctx.sender`) for per-viewer access control: deleting a row it depends on
(e.g. a membership row) automatically drops the rows it was exposing from that client.
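
As a rough mental model only, the per-viewer behaviour described above can be imitated in plain TypeScript. Nothing here is a SpacetimeDB API: `Membership`, `Message`, and `visibleMessages` are hypothetical names, and identities are simplified to strings.

```typescript
// Plain-TypeScript model of a per-user view; none of these names are SpacetimeDB APIs.
interface Membership {
  roomId: number;
  member: string; // simplified stand-in for an Identity
}

interface Message {
  id: number;
  roomId: number;
  text: string;
}

// Recomputed per viewer: only messages in rooms where `sender` has a membership row.
function visibleMessages(
  sender: string,
  memberships: readonly Membership[],
  messages: readonly Message[],
): Message[] {
  const rooms = new Set(
    memberships.filter((m) => m.member === sender).map((m) => m.roomId),
  );
  return messages.filter((msg) => rooms.has(msg.roomId));
}

const messages: Message[] = [
  { id: 1, roomId: 1, text: 'hi' },
  { id: 2, roomId: 1, text: 'there' },
  { id: 3, roomId: 2, text: 'secret' },
];

let memberships: Membership[] = [
  { roomId: 1, member: 'alice' },
  { roomId: 1, member: 'bob' },
  { roomId: 2, member: 'bob' },
];

const before = visibleMessages('alice', memberships, messages).length;

// "Kick" alice: delete the membership row the view depends on.
memberships = memberships.filter((m) => !(m.member === 'alice' && m.roomId === 1));

const after = visibleMessages('alice', memberships, messages).length;
console.log({ before, after });
```

In this model the caller has to re-run the function after the membership change; the text above describes the view dropping the rows from the client automatically.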

```typescript
// Anonymous view (same for all clients):
export const activeUsers = spacetimedb.anonymousView(
37 changes: 37 additions & 0 deletions tools/llm-oneshot/apps/chat-app/prompts/base_mongodb.md
@@ -0,0 +1,37 @@
# MongoDB Chat App - Base Prompt

Create me a **real-time chat app** using **MongoDB as the backend**.

Project root is:

```
apps/chat-app/
```

Create the project under a **timestamped folder**:

```
apps/chat-app/mongodb/chat-app-YYYYMMDD-HHMMSS/
```

Use `chat-app` as the **database name** for MongoDB.

## Constraints

- Work **entirely inside** your timestamped folder. Do not touch any other existing code.
- Only create/modify code under:
- `apps/chat-app/mongodb/chat-app-YYYYMMDD-HHMMSS/server/` (server-side TypeScript)
- `apps/chat-app/mongodb/chat-app-YYYYMMDD-HHMMSS/client/` (client-side TypeScript/React)
- Keep it minimal and readable.

## UI Requirements

- Dark theme with consistent color palette
- Clear visual hierarchy — active states, hover effects, focus indicators
- Responsive layout that works on desktop (mobile optional)
- Loading and empty states for all data-dependent views
- Visual feedback for user actions (button states, success/error indicators)

## Features

<!-- Include feature files below this line -->
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
# Language: TypeScript + MongoDB

Create this app using **MongoDB as the backend** with **TypeScript**.

## Project Setup

```
apps/chat-app/staging/typescript/<LLM_MODEL>/mongodb/chat-app-YYYYMMDD-HHMMSS/
```

Database name: `chat-app`

## Architecture

**Backend:** Node.js + Express + Mongoose + Socket.io
**Client:** React + Vite + TypeScript

## Constraints

- Only create/modify code under:
- `.../server/` (server-side TypeScript)
- `.../client/` (client-side TypeScript/React)
- Keep it minimal and readable.

## Branding & Styling

- App title: **"MongoDB Chat"**
- Dark theme using official MongoDB brand colors:
- Primary: `#00ED64` (MongoDB green)
- Primary hover: `#00C957` (darker green)
- Secondary: `#00684A` (MongoDB forest green)
- Background: `#001E2B` (MongoDB dark slate)
- Surface: `#023430` (deep green-slate)
- Border: `#1C2D38` (muted slate border)
- Text: `#E8EDEB` (light gray)
- Text muted: `#889397` (MongoDB gray)
- Accent: `#00ED64` (MongoDB green)
- Success: `#00ED64` (green for online indicators)
- Warning: `#FFC010` (MongoDB amber)
- Danger: `#FF4F4F` (MongoDB red)
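
One possible way to carry this palette into a client is a typed constant that is emitted as CSS custom properties. This is a sketch under assumptions: the key names, the `--color-` prefix, and the `toCssVariables` helper are illustrative and are not required by the prompt.

```typescript
// Palette copied from the list above; the key names are illustrative.
const theme = {
  primary: '#00ED64',
  primaryHover: '#00C957',
  secondary: '#00684A',
  background: '#001E2B',
  surface: '#023430',
  border: '#1C2D38',
  text: '#E8EDEB',
  textMuted: '#889397',
  accent: '#00ED64',
  success: '#00ED64',
  warning: '#FFC010',
  danger: '#FF4F4F',
} as const;

// Hypothetical helper: turn camelCase keys into kebab-case CSS custom properties.
function toCssVariables(palette: Record<string, string>): string {
  return Object.entries(palette)
    .map(([key, value]) => {
      const name = key.replace(/[A-Z]/g, (c) => `-${c.toLowerCase()}`);
      return `  --color-${name}: ${value};`;
    })
    .join('\n');
}

// Emits a :root block that a stylesheet could include.
console.log(`:root {\n${toCssVariables(theme)}\n}`);
```

Components could then reference `var(--color-primary)` and similar, which keeps the hex values in one place.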

## Output

Return only code blocks with file headers for the files you create.
3 changes: 3 additions & 0 deletions tools/llm-sequential-upgrade/.gitattributes
@@ -0,0 +1,3 @@
# Shell scripts here are run under bash (git-bash on Windows). Force LF so they
# don't get CRLF-converted on checkout and break under stricter bash (WSL/CI).
*.sh text eol=lf
19 changes: 13 additions & 6 deletions tools/llm-sequential-upgrade/.gitignore
@@ -4,15 +4,12 @@
**/results/**/.vite/
**/results/**/drizzle/

# Local env files inside generated apps (not committed)
**/results/**/.env

# Telemetry backup files
**/telemetry/*.jsonl.bak


# Playwright
**/playwright/node_modules/
**/playwright/test-results/
**/playwright/playwright-report/

# Isolation git repos inside generated apps (created by run.sh, cleaned up after)
**/results/**/.git/
# OTel collector live dump - not tracked
@@ -21,3 +18,13 @@ telemetry/metrics.jsonl

# Raw telemetry contains PII (email, account IDs) - store privately
**/telemetry/**/raw-telemetry.jsonl
# Full Claude Code session transcript (large; contains absolute paths/PII) - store privately
**/telemetry/**/session-transcript.jsonl
# Verbose run transcripts (large, regenerable) - not tracked
**/telemetry/**/run.log
# Local absolute app paths (machine-specific)
**/telemetry/**/app-dir.txt
**/telemetry/**/metadata.json

# Sequential-upgrade run output lives in the external spacetimedb-ai-test-results repo
sequential-upgrade/sequential-upgrade-*/
24 changes: 14 additions & 10 deletions tools/llm-sequential-upgrade/CLAUDE.md
@@ -1,8 +1,6 @@
# Sequential Upgrade: LLM Cost-to-Done Benchmark
# Chat App: Build Instructions

You are running an automated benchmark that measures the **total cost to build a fully working chat app** — comparing SpacetimeDB vs PostgreSQL.

Your job is to **generate, build, deploy, and fix** the app. Grading happens in a separate manual session — you do NOT test in the browser.
Your job is to **generate, build, deploy, and fix** a fully working chat app. Verification happens in a separate session — you do NOT test in the browser.

---

@@ -30,10 +28,18 @@ Depending on the mode passed in the launch prompt:

---

## Shell Syntax

Windows host with both a Bash and a PowerShell tool — don't mix syntax. In the Bash tool use
POSIX: `mkdir -p` not `New-Item`, `sleep` not `Start-Sleep`, `2>/dev/null` not `2>$null`,
`VAR=x` not `$VAR=x`. PowerShell cmdlets in bash fail with "command not found".

---

## Anti-Contamination

Do NOT read any files under:
- `../llm-oneshot/apps/chat-app/typescript/` (graded reference implementations)
- `../llm-oneshot/apps/chat-app/typescript/` (reference implementations)
- `../llm-oneshot/apps/chat-app/staging/`
- Any other AI-generated app code in this workspace

@@ -46,7 +52,7 @@ Only read files you created, the backend instructions, and the feature prompts.
1. Read `backends/<backend>.md` for pre-flight checks, phases, and deploy steps
2. Read the language setup: `../llm-oneshot/apps/chat-app/prompts/language/typescript-<backend>.md`
3. Read the feature prompt: `../llm-oneshot/apps/chat-app/prompts/composed/<NN>_<name>.md`
4. Follow the phases in the backend file (generate backend → bindings → client → verify → deploy)
4. Follow the phases in the backend file, in order
5. Output `DEPLOY_COMPLETE` when the dev server is confirmed running

For **upgrade**: only add the NEW features from the target level. Do not rewrite existing working features.
@@ -62,8 +68,6 @@ For **upgrade**: only add the NEW features from the target level. Do not rewrite
5. Append to `ITERATION_LOG.md` (see format below)
6. Output `FIX_COMPLETE`

Do NOT do browser testing — that happens in the grading session.

---

## ITERATION_LOG.md
@@ -85,6 +89,6 @@ Append to this file after every fix. Never overwrite.

---

## Cost Tracking
## Telemetry

Cost is tracked automatically via OpenTelemetry — do NOT estimate tokens or produce a COST_REPORT.md. That is generated automatically after the session ends.
Do NOT estimate tokens or produce a COST_REPORT.md — that's captured automatically after the session ends.
6 changes: 3 additions & 3 deletions tools/llm-sequential-upgrade/DEVELOP.md
@@ -263,20 +263,20 @@ llm-sequential-upgrade/
DEVELOP.md # This file (for humans)
run.sh # Code Agent launcher (generate/fix/upgrade)
grade.sh # Grade Agent launcher (interactive Chrome MCP)
grade-playwright.sh # Grade via Playwright (optional, deterministic)
templates/ # BUG_REPORT.md / ITERATION_LOG.md formats
docker-compose.otel.yaml # OTel Collector container
otel-collector-config.yaml # Collector config (OTLP → JSON files)
parse-telemetry.mjs # Telemetry → COST_REPORT.md
backends/
spacetime.md # SpacetimeDB-specific phases
spacetime-sdk-rules.md # SpacetimeDB SDK patterns
spacetime-templates.md # Code templates
# SDK reference = the official skills/typescript-{server,client}/SKILL.md
postgres.md # PostgreSQL-specific phases
mongodb.md # MongoDB-specific phases
test-plans/
feature-01-basic-chat.md # Per-feature browser test scripts
...
feature-15-anonymous-migration.md
playwright/ # Optional Playwright test suite
telemetry/ # Shared OTel Collector output
sequential-upgrade/ # Sequential upgrade test variant
sequential-upgrade-YYYYMMDD/ # Dated run with results, telemetry, inputs
1 change: 1 addition & 0 deletions tools/llm-sequential-upgrade/GRADING.md
@@ -11,6 +11,7 @@ You need TWO Chrome browser profiles so each user gets completely separate ident
1. **Browser A (default profile):** Navigate to the app URL and register as "Alice"
- SpacetimeDB: `http://localhost:6173`
- PostgreSQL: `http://localhost:6273`
- MongoDB: `http://localhost:6373`

2. **Switch to Browser B:** Use `switch_browser` to switch to the second Chrome profile

8 changes: 5 additions & 3 deletions tools/llm-sequential-upgrade/GRADING_WORKFLOW.md
@@ -25,10 +25,12 @@ Code generation and fix iterations are token-tracked (the benchmark metric). Gra
```

After generation, apps are running at:
- **SpacetimeDB**: `http://localhost:5173` (run-index 0)
- **PostgreSQL**: `http://localhost:5274` (run-index 1)
- **SpacetimeDB**: `http://localhost:6173`
- **PostgreSQL**: `http://localhost:6273`
- **MongoDB**: `http://localhost:6373`

Port offsets for parallel runs: run-index N uses ports `5173 + N*100` (spacetime) and `5174 + N*100` (postgres).
Port offsets for parallel runs: run-index N adds N to the base port —
`6173 + N` (spacetime), `6273 + N` (postgres), `6373 + N` (mongodb).
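
The base-port-plus-run-index rule can be sketched as a small helper. The base ports come from the list above; `BASE_PORTS`, `appPort`, and `appUrl` are hypothetical names used for illustration and are not taken from `run.sh`.

```typescript
// Base dev-server ports per backend, as listed above.
const BASE_PORTS = {
  spacetime: 6173,
  postgres: 6273,
  mongodb: 6373,
} as const;

type Backend = keyof typeof BASE_PORTS;

// Hypothetical helper: run-index N adds N to the backend's base port.
function appPort(backend: Backend, runIndex = 0): number {
  if (!Number.isInteger(runIndex) || runIndex < 0) {
    throw new RangeError(`runIndex must be a non-negative integer, got ${runIndex}`);
  }
  return BASE_PORTS[backend] + runIndex;
}

// Hypothetical helper: the localhost URL for a backend and run index.
function appUrl(backend: Backend, runIndex = 0): string {
  return `http://localhost:${appPort(backend, runIndex)}`;
}

console.log(appUrl('spacetime'));
console.log(appUrl('mongodb', 2));
```

Because the base ports sit 100 apart, run indices 0 through 99 keep the three ranges disjoint under this rule.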

---

6 changes: 3 additions & 3 deletions tools/llm-sequential-upgrade/README.md
@@ -21,14 +21,14 @@ Side-by-side results give a direct comparison of AI-generation cost across backe
## Directory contents

- `run.sh`: orchestrates generation, upgrade, and fix sessions. Supports `--upgrade`, `--fix`, `--composed-prompt`, `--resume-session`.
- `grade.sh` / `grade-agents.sh` / `grade-playwright.sh`: grading harnesses (manual + automated)
- `grade.sh`: interactive grading harness (manual, Chrome MCP)
- `templates/`: canonical `BUG_REPORT.md` / `ITERATION_LOG.md` formats for grading
- `benchmark.sh` / `run-loop.sh`: batch runners for parallel or sequential benchmark execution
- `cleanup.sh` / `reset-app.sh`: dev utilities
- `benchmark-viewer.html`: local viewer for METRICS_DATA.json files (open in browser, drop JSON)
- `generate-report.mjs`: aggregate per-session cost-summary.json into a markdown report
- `parse-telemetry.mjs`: parse OTel log stream into per-session cost-summary.json
- `parse-playwright-results.mjs`: convert Playwright JSON output to grading markdown
- `docker-compose.otel.yaml` / `otel-collector-config.yaml`: OTel collector + PostgreSQL
- `docker-compose.otel.yaml` / `otel-collector-config.yaml`: OTel collector + PostgreSQL + MongoDB
- `backends/`: per-backend setup / SDK reference documents given to the AI
- `perf-benchmark/`: runtime throughput benchmark (msgs/sec) for the AI-generated apps
- `CLAUDE.md` / `DEVELOP.md` / `GRADING.md` / `GRADING_WORKFLOW.md`: process documentation