# Decisions — Easy-Multiplayer

## Decision Log

| # | Date | Decision | Rationale | Status |
|---|------|----------|-----------|--------|
| 1 | Pre-existing | P2P architecture (Trystero/WebTorrent) | No server needed; direct peer connections | Inherited |
| 2 | Pre-existing | Rollback netcode pattern | Low-latency input handling for real-time games | Inherited |
| 3 | Pre-existing | Pure ES6 modules, no bundler | Simplicity; direct browser imports | Inherited |
| 4 | Pre-existing | Three.js for optional rendering | 3D avatar visualization in ShownPlayer | Inherited |
| 5 | 2026-03-12 | ShownPlayer/VoiceChat classified as optional, moved to `optional/` | Too application-specific for core library. Core should be rendering-agnostic. | Active |
| 6 | 2026-03-12 | SoundManager classified as secondary | Rollback-aware audio is netcode-adjacent but not essential | Active |
| 7 | 2026-03-12 | Refactor before research | Current code can be cleaned up regardless of future design decisions | Superseded by #11 |
| 8 | 2026-03-12 | Separate "peers" from "input slots" | Enables spectator scaling | Subsumed by #15 (passive participation) |
| 9 | 2026-03-12 | Disconnect via attendance, not input silence | Silence-on-change-only requires separate liveness signal | Active (formalized as Goal B2) |
| 10 | 2026-03-12 | Input Query System for smart rollback avoidance | Predicate-based input evaluation reduces unnecessary rollbacks | Active (formalized as Goal B4) |
| 11 | 2026-05-25 | Pivot to semantic-relevance architecture per redesign doc | Substantially reduces rollback frequency; enables passive-participant scaling | Active |
| 12 | 2026-05-25 | Two-level terminology: public API uses game vocabulary ("player", "tick"), internal layers use simulation vocabulary ("participant", "step") | Library is broader than games; internal language should reflect that without confusing game devs | Active |
| 13 | 2026-05-25 | Abstract `Transport` interface; `MemoryTransport` is the test substrate of record | Required for deterministic multi-node testing; enables P2P/server-client/hybrid as interchangeable backends | Active |
| 14 | 2026-05-25 | English scenario catalog (`TEST_SCENARIOS.md`) precedes code tests | Many novel behaviors have undefined edge cases; English forces them out before implementation | Active |
| 15 | 2026-05-25 | Sparse change-only input protocol; silence = unchanged, not disconnect | Massive bandwidth reduction; required for passive participation; matches semantic-relevance philosophy | Active |
| 16 | 2026-05-25 | Predicate context: explicit `ctx` parameter (Option C); debug mode duplicates ctx and detects mutation, production mode trusts ctx for efficiency | Closure-over-live-state is the classic footgun; explicit ctx + debug-mode guard catches errors during dev without runtime cost in prod | Active |
| 17 | 2026-05-25 | ~~Disconnect agreement is NOT a special-cased algorithm~~ | Original reasoning: short partitions resolve via acceptance/grace windows; long partitions via severe-desync recovery + authority — avoid over-engineering. **Retired 2026-05-29:** this conflated "no heavyweight CONSENSUS protocol" (still true) with "no disconnect-specific mechanism at all" (wrong). Handling disconnect-tick agreement explicitly, decentrally, and sparsely is exactly what we want. Superseded by #30. | Retired (2026-05-29) |
| 18 | 2026-05-25 | Bootstrap-on-join: full state at grace-window edge + sparse input log since then + current per-participant input state. Serving peer chosen randomly. Receiving peer enters explicit "catching-up" state visible to Layer 3 | Random selection distributes serving cost; catching-up state lets the game show appropriate UI during re-simulation | Active |
| 19 | 2026-05-25 | Acceptance window, grace window, hash window cadence, attendance interval are tunable parameters, not fixed numbers | The 200ms/300ms in the design doc were illustrative; real values depend on game type | Active |
| 20 | 2026-05-25 | Bus-based rollback (forthcoming): re-simulation as parallel workers; visible simulation never freezes during rollback | Long rollbacks currently freeze the visible simulation; user-visible smoothness > correctness latency since visible state was already "false" | Proposed (Phase C+) |
| 21 | 2026-05-25 | Phase B implementations must keep "displayed state" decoupled from "authoritative latest state" | Required to slot bus-based rollback in later without restructuring | Active |
| 22 | 2026-05-28 | Transport interface scope finalized (Goal A1). Six resolutions: (a) best-effort is the baseline contract, optional `{reliable:true}` for bulk/critical one-shots (bootstrap B10, state transfer B8) only; (b) 16 KB structured-clone payload delivered without caller chunking; (c) no backpressure signal in v1; (d) self excluded everywhere (`getPeers`, peer events, `broadcast` loopback) — only `localId` refers to self; (e) liveness is transport-internal and message-independent — application silence never drops a peer; (f) clock sync stays out of Layer 1 — transport exposes only the `clockHint` RTT primitive, `SyncedClock` is composed alongside and exchanges ping/pong as ordinary Layer-2 messages | Keeps Layer 1 a trivially-implementable, deterministically-testable message+liveness primitive; disentangles the clock ping/pong currently baked into `WorldNetworkCommunicator.SendData`; all complexity (windows, redelivery, tick mapping) stays in Layer 2 where it belongs | Active |
| 23 | 2026-05-29 | Sparse input protocol implementation (Goal B1), in `SparseInput.js`: (a) a change carries the FULL intent snapshot, not a per-field delta — simpler, matches the `{tick, intent}` format, deferred optimization; (b) the decoder holds a caller-provided `defaultIntent` as the pre-first-change baseline (the canonical-default *agreement* question stays open — KNOWN_ISSUES #13 / S-003-05); (c) reconstruction is hold-last keyed by tick stamp (reorder-safe), with idempotent duplicates and equal-to-held redundant changes as no-ops; (d) stored/emitted intents are structurally deep-cloned so they are immune to later caller mutation; (e) the module is pure (no clock/transport/global) and is NOT yet wired into the v1 `RollbackNetcode` finalization path — that migration is staged behind B2 (liveness) and B9 (finalization), since sparse sending violates v1's confirmed-tick finalization assumptions | Lands the reusable, deterministically-testable core of the protocol first; avoids a half-broken in-engine migration before the goals that redefine finalization exist | Active |
| 24 | 2026-05-29 | Heartbeat liveness implementation (Goal B2): (a) a reusable pure, clock-driven `transports/HeartbeatLiveness.js` component (emit-on-cadence + last-seen map + periodic timeout sweep; `onJoined` on first beat/rejoin, `onLeft` on timeout) rather than baking liveness into each transport; (b) `MemoryTransport` gains it as an OPT-IN `{ heartbeat }` mode — default MemoryTransport is unchanged so the A2 conformance suite (idealized register/partition liveness) stays valid; (c) heartbeats route as a transport-internal `{__em_hb:true}` message that is consumed for liveness and never surfaced to `onMessage` — application traffic neither sustains nor revives a peer; (d) `NetworkSim` suppresses its idealized immediate join/left events for transports flagged `_managesOwnLiveness`, so a partition surfaces ONLY via heartbeat-silence timeout; (e) defaults intervalMs=500 / timeoutMs=2000 / sweepMs=interval, all tunable; the real detection bound is `timeoutMs + sweepMs` — the GOALS "≤2× interval" target holds only when `timeoutMs ≤ intervalMs`, but a larger timeout is preferred to tolerate a few lost beats (documented in `PROTOCOL_SPEC.md`) | Keeps liveness deterministically testable and decoupled from app message flow per #22e; opt-in avoids invalidating the A2 idealized-liveness conformance model; explicit detection-bound honesty over silently restating the GOALS bound | Active |
| 25 | 2026-05-29 | Context-aware intent construction (Goal B3), in `LocalIntent.js`: (a) a single `getLocalInputs(localGameState) => intent\|null` replaces v1's per-field `defineInput`; meaning is bound at sample time (button A = `{jump}` in game, `{confirm}` in dialog) so rollback cannot reinterpret it; (b) `null`/`undefined` is a FIRST-CLASS passive marker — NOT an active-but-neutral intent object. The reconstructed value for a passive tick is literally `null`, which input-bearing/query logic treats as "not a participant this tick" (resolves S-003-03's neutral-vs-excluded question in favor of EXCLUDED); (c) the canonical pre-history baseline is `null` (passive) — a never-heard-from participant is passive everywhere until its first intent arrives (resolves S-003-05 / KNOWN_ISSUES #13). A game MAY override `defaultIntent`, but all peers must agree; (d) a permanently-passive participant ships zero packets (no rollback pressure); an active→passive transition ships a single `{tick, intent:null}` change; (e) `fromFieldSamplers` is the mechanical migration shim from `defineInput` (composes zero-arg samplers, ignores the state arg, returns null when empty — mirrors v1 `GetPlayerInput`); (f) the public `EasyMultiplayer.getLocalInputs(fn)` is wired into the facade and takes precedence over `defineInput`; to make the facade importable/testable under a plain Node ESM loader, the default `TrysteroTransport` is now imported LAZILY in `start()` (only when no transport is injected) instead of at module top level. The pure module + harness prove all B3 success/verify criteria; the SPARSE-send + passive-rollback-pressure release remain the same staged input-path migration as B1 (ride Goal B9) | Single state-aware intent function is the redesign's core input model; null-as-passive is what enables zero-cost spectators and scalable passive participation; resolving the two open defaults (neutral-vs-excluded, pre-history baseline) unblocks deterministic reconstruction; lazy Trystero import removes the A5 https-import coupling that made the facade untestable in Node | Active |
| 26 | 2026-05-29 | Predicate context freezing (Goal B4), in `QueryContext.js`: (a) the query signature is `query(playerId, ctx, (input, ctx) => predicate)`; predicates must be PURE of ctx (result depends only on `input` + `ctx`, never closed-over live state); (b) DEBUG mode deep-clones ctx at query time — the clone is the FROZEN snapshot used for ALL later re-evaluation, so a caller mutating its live ctx afterwards cannot change a past result; after the predicate runs, the live ctx is compared to the snapshot and a mutation fires `onMutation` naming the offending predicate (name + truncated source); (c) PRODUCTION mode stores ctx by reference with no clone and no compare — the predicate call is the only work, structurally identical to a raw closure (proven by spying the clone fn: N calls in debug, 0 in production); (d) deep-clone choice (resolves KNOWN_ISSUES #5): prefer host `structuredClone` (Node ≥17 / modern browsers — preserves `undefined`, Date/Map/Set, key-order independent, unlike a JSON round-trip), with a hand-rolled cycle-safe recursive walker as the fallback for older hosts (the Node-12 selftest runner, which lacks `structuredClone`, exercises the walker); functions/symbols are carried by reference (a ctx carrying behavior is itself a smell); (e) `recheck({playerId, startTick, endTick, inputAt})` re-evaluates each frozen snapshot under corrected per-tick input and returns the earliest tick whose result flips — the rollback DECISION that consumes this is Goal B5; in debug, recheck re-clones the snapshot per call so a mutating predicate cannot compound across re-evaluations. The pure module + harness prove all B4 success/verify criteria; wiring the 3-arg signature into v1 `RollbackNetcode.Query` (currently 2-arg, closure-over-live-state, re-evaluated in `_checkQueriesForRollback`) is the same staged in-engine migration as B1/B3 — it rides the B5/B9 rollback-and-finalization rework, not B4 | Closure-over-live-state is the redesign's headline footgun; explicit ctx + a debug-mode freeze/compare makes nondeterministic re-evaluation impossible during development at zero production cost; resolving the clone implementation (#5) unblocks debug-overhead reasoning; keeping recheck pure-but-decisionless preserves the layer split (B4 = freezing primitive, B5 = rollback policy) | Active |
| 27 | 2026-05-29 | Hash window broadcasts + uncertainty-aware desync (Goal B5), in `HashWindow.js`: (a) the broadcast is `{oldestTick, interval, stateHashes[], usedInputs[]}` where `stateHashes` is POSITIONAL on a fixed checkpoint grid (index i ⇔ tick `oldestTick + i*interval`; gaps are `null`) and `usedInputs` lists the relevant *queried* inputs since `oldestTick`, each flagged `confirmed`/unconfirmed — "relevance" is encoded by only recording queried inputs; (b) `compareHashWindows(local, remote)` aligns on the OVERLAP of non-null checkpoints (same `interval` required, else `incomparable`), finds the EARLIEST diverging checkpoint D and the last agreed checkpoint L, then inspects the window `(L, D]`: if EITHER peer has an unconfirmed relevant input in that range → `wait` (a mismatch that may still be explained/corrected — NOT a desync), else → `desync` at D; all-match → `agree`; (c) inputs at/before L are already accounted for (hashes matched at L) so only `(L, D]` matters; an unconfirmed input outside that range does NOT suppress a real desync; (d) resolution is bounded: once the previously-uncertain inputs confirm and hashes still differ, the next comparison flips `wait`→`desync`; (e) the module is pure and DECISIONLESS about recovery — it only classifies a divergence; the recovery ACTION (re-bootstrap/authority) is a later goal, and the per-tick query re-evaluation that produces the hashes is B4's `QueryContext.recheck`. The pure module + harness (genuine cross-peer window exchange over MemoryTransport/VirtualClock) prove both B5 success criteria; wiring this into v1's eager per-tick `ReceiveStateHash`/`CheckStateHashes` (single `desiredStateHash` vs `engineFingerprint`, gated by the coarse `IsSyncedState`) is the same staged in-engine migration as B1/B3/B4 — it rides the B6/B9 windows-and-finalization rework | Eager per-tick single-hash comparison is the root of false-positive desyncs under sparse input sync; the windowed broadcast + "different state can be correct-given-different-knowledge" rule is the most important conceptual departure of v2; keeping the module decisionless preserves the layer split (B5 = classify, recovery = act) and lets B4's recheck feed it | Active |
| 28 | 2026-05-29 | Tunable acceptance + grace windows (Goal B6), in `AcceptanceWindows.js`: (a) a frozen, validated `WindowConfig` ({acceptanceWindowMs, graceWindowMs, snapshotIntervalTicks, attendanceIntervalMs}) centralizes the protocol's tunables (concretizing principle #19); validation enforces `graceWindowMs > acceptanceWindowMs > 0`, positive-integer `snapshotIntervalTicks`, positive `attendanceIntervalMs`; (b) documented defaults `acceptanceWindowMs=200` (~one inter-region RTT), `graceWindowMs=300` (acceptance + ~one relay hop), `snapshotIntervalTicks=20` (1s at the 50ms/20Hz tick), `attendanceIntervalMs=500` (matches the B2 default); (c) `classifyInput({nowMs, inputMs, relayed})` is the pure per-arrival decision over `age = nowMs - inputMs`: `age ≤ acceptanceWindowMs` → ACCEPT (raw or relayed); `acceptanceWindowMs < age ≤ graceWindowMs` → ACCEPT only if `relayed` (already accepted elsewhere), else REJECT_RAW_IN_GRACE (a raw late input is not accepted directly); `age > graceWindowMs` → REJECT_BEYOND_GRACE with `mayTriggerRecovery:true`; (d) boundaries are INCLUSIVE (`≤`) and tick conversion (`windowTicks`) FLOORS, so a window never over-accepts past its configured number; future-dated inputs (negative age) are accepted (buffered ahead); (e) the module is pure and STATELESS — dedup, the "already accepted" set, triggering the actual recovery, and the finalization sweep itself live in the engine (recovery = B8, finalization = B9). The pure module + harness (genuine cross-peer relay convergence: a peer partitioned from an origin still converges because a third peer relays the input into its grace window) prove all B6 success/verify criteria including ±1ms edge behavior; wiring into the v1 engine rides the B8/B9 rework | Hard-coded latency numbers are wrong for every game; centralizing the tunables with validated defaults concretizes #19; the raw-vs-relayed grace asymmetry is precisely what lets a briefly-disconnected peer reconverge without a special protocol; keeping classification pure/stateless preserves the layer split (B6 = classify one arrival, B8 = recover, B9 = finalize/bound memory) | Active |
| 29 | 2026-05-29 | `queryDisconnected` + disconnect-as-simulation-event (Goal B7), in `DisconnectTracker.js` — resolves S-005-06: when peers detect a participant's silence at different local moments, what tick does the network agree the disconnect happened on? Answer: (a) the canonical disconnect tick is a DETERMINISTIC function of SHARED data, `canonicalDisconnectTick(p) = lastAttendanceTick(p) + timeoutTicks`, NOT of any peer's local detection time — every peer that heard the same last tick-stamped attendance computes the same tick without negotiating; the cross-peer merge is grow-only-max (a beat only ever moves the tick FORWARD), so any agreement that does happen is deterministic and single-directional. **Scope:** B7 settles only this LOCAL math + the rollback signal; how peers that heard DIFFERENT last beats actually converge over a sparse network is #30 (probe fast path + recovery fallback), not B7; (b) attendance beats are tick-stamped and applied MONOTONICALLY — a stale/reordered older beat is ignored, so the disconnect tick only ever moves FORWARD; (c) a later attendance learned after the fact pushes the tick out, retroactively un-disconnecting ticks `[oldTick, newTick)` and, IF the peer already simulated past `oldTick`, forcing a disconnect-conditional rollback restarting at `oldTick` (fed to B4's `QueryContext.recheck`); learned in time (before `oldTick`), it just extends the alive period with no rollback; (d) a never-heard participant has NO disconnect tick and is NOT disconnected (that is passivity, B3 — distinct from disconnection); the `queryDisconnected(p, atTick)` boundary is INCLUSIVE (disconnected AT the canonical tick); (e) the module is PURE and decisionless — gating a late beat by the acceptance/grace window (B6) and performing the actual rollback / severe-desync recovery (B8) happen at the call site; this module owns only the deterministic tick math and the shift signal `{shifted, from, to, earliestAffectedTick}`. The pure module + harness (latency-independent agreement when both peers heard the beat directly; retroactive rollback only when the corrected beat is learned too late; and the HONEST LIMIT that a peer with no direct link holds `null`/never-connected, awaiting #30) prove the B7 LOCAL criteria; cross-network convergence is #30 and wiring into the v1 engine rides the B8/B9 rework | A disconnect must participate in rollback determinism the same way an input does; deriving the tick from the shared stamp rather than local detection time is what makes the merge a deterministic grow-only-max; monotonic forward-only movement makes reordered/late beats safe; keeping the module pure preserves the layer split (B7 = deterministic local tick math, #30 = sparse convergence, B6 = accept/gate, B8 = recover) | Active |
| 30 | 2026-05-29 | **SUPERSEDED 2026-06-05 → beat FORWARDING (grow-only-max gossip). The pull-on-suspicion probe FAST PATH is DELETED (code `DisconnectProbe.js`/`ProbeNode.js` + tests `disconnect-probe*.test.js`/`engine-l3-probe.test.js`/`selftest-b7.1.mjs` removed; engine unwired). Why: the probe rode the same reliable transport as an attendance beat yet cost a 2-trip request/response, so it can never rescue a convergence case that 1-trip reliable gossip cannot; and it was one-shot per `(playerId, tickY)`, so a single dropped correction caused a false disconnect. The premise "attendance beats are NEVER proactively forwarded" is itself REVERSED — beats are now forwarded grow-only-max (a lost beat self-heals via the next message). The slow-path B5→B8 fallback (#31 `mergeLastAttendanceTicks`) is RETAINED as the ultimate backstop; only the middle probe layer was removed. See `DESIGN_PARTICIPATION.md` §6.1 (replacement) + §6.2 (reductio).** ORIGINAL DECISION (kept for history): Sparse decentralized convergence on the disconnect tick — a HYBRID: (a) **fast path — pull-on-suspicion probe:** attendance beats are NEVER proactively forwarded (that would flood and defeat sparseness). Instead, when a peer's locally-computed disconnect tick Y for participant X is *imminent* (X's last beat is aging toward timeout) AND X's presence is *relevant* (queried), the peer broadcasts a small "I think X disconnects at tick Y" suspicion; any peer holding a STRICTLY NEWER beat from X replies with that beat (a bounded on-demand pull / anti-entropy read-repair, NOT a push). The suspecting peer's grow-only-max merge (#29a) pulls the tick forward BEFORE it crosses Y, so the disconnect is corrected pre-emptively with no rollback; (b) the probe is sent inside the B6 grace window (a few ticks before Y), giving one round-trip of slack — `reply-in-time` holds because `timeoutTicks` ≫ one RTT for sane configs; (c) reply amplification is suppressed: reply only with a strictly-newer beat, and back off if another peer's correction was already observed; (d) **fallback — when the reply misses its window** (slow link / genuine long partition), the wrong disconnect commits, surfaces as a B5 hash-window desync (or `wait` until inputs confirm), and is reconciled by B8 severe-desync recovery; the only datum that must travel during recovery is X's last-attendance-tick (one int per participant), piggybacked on the #18 bootstrap payload — bounded, on-demand, not flooded; (e) this yields EVENTUAL agreement within a connected component (canonical = max last-beat over currently-reachable peers); if the only peer that heard the latest beat dies before propagating, survivors legitimately re-canonicalize on their own max — the correct partition-tolerant behavior, honestly not an instantaneous global guarantee. Replaces the retired #17 with an explicit, sparse, decentralized mechanism. **FAST PATH implemented as Goal B7.1 (2026-05-29):** pure `DisconnectProbe.js` wraps `DisconnectTracker` — `suspicions(currentTick, relevantPlayerIds)` emits a deduped `{playerId, tickY}` only when the disconnect is imminent (`currentTick ∈ [Y-probeLeadTicks, Y)`) AND relevant; `onSuspicion` replies with our beat ONLY if strictly newer (our canonical > tickY) and we haven't observed a correction ≥ ours (amplification backoff); `onCorrection` applies grow-only-max via the tracker and records the observed beat. Re-emits after a correction shifts Y forward (dedupe keyed on the exact tickY). Proven via `tests/disconnect-probe.test.js` (24 unit) + `tests/disconnect-probe-harness.test.js` (4, `ProbeNode`) + `selftest-b7.1.mjs` (4, Node 12 too): fast-path convergence with no rollback; relevance-gate control; amplification suppression (a second newer holder backs off after seeing the first reply); and NO spurious un-disconnect (a genuinely dead peer still disconnects at the correct tick). The SLOW-PATH fallback landed in B8 (`mergeLastAttendanceTicks`). Honest limit: convergence is eventual within a connected component; a relevant peer with no path to a newer-beat holder stays at its local tick until the B5/B8 fallback. In-engine wiring rides B9 | Proactive attendance forwarding is the obvious-but-wrong fix (violates the sparseness constraint that defines v2); a relevance-gated suspicion probe converts a would-be rollback into a tiny pre-emptive pull at a fraction of recovery's cost, while the desync+recovery fallback guarantees convergence when the fast path can't win the race; the grace-window placement is what makes `reply-in-time` sound; being explicit beats #17's "do nothing special" because decentralized sparse agreement is precisely the hard problem worth solving directly | Superseded (2026-06-05) → beat forwarding |
| 31 | 2026-05-29 | Authority + severe-desync recovery + lagging-peer wake (Goal B8), in `Recovery.js`: (a) authority is ONLY a local tie-breaking rule for convergence, not ownership — `compareAuthority` is a deterministic TOTAL order over `(simulationAge, peerId)`: the OLDER simulation (higher `simulationAge`) wins, a LOWER `peerId` strictly breaks an age tie (never an undecided 0 for distinct peers). Because it is a function of data both peers already share, two peers independently reach the same verdict with no consensus round; (b) `resolveDesync({local, remote, divergeTick})` → the loser ADOPTS the winner's full-state transfer (`adopt-transfer`), the winner SERVES it (`serve-transfer`); convergence is monotonic because adopting a transfer also adopts the winner's authority PROVENANCE, so a connected component climbs to the single most-authoritative history; (c) the state transfer (`makeStateTransfer`) carries `{tick, snapshot (opaque, by reference — never cloned), lastAttendanceTicks}`; `mergeLastAttendanceTicks` folds the carried ticks in grow-only-max — this is the #30 SLOW-PATH FALLBACK that reconciles a disconnect-tick disagreement when the B7.1 probe missed its window, and grow-only-max guarantees a stale transfer can never roll an attendance beat (hence a disconnect tick) backward; (d) LAGGING-PEER WAKE: `shouldResetSimulationAge({localTick, networkTick, lagThresholdTicks})` (inclusive boundary) detects a peer that woke up >= threshold ticks behind; on a reset its authority age drops to `RESET_SIMULATION_AGE = 0` (youngest), so a stale isolated peer YIELDS instead of dominating ("outdated authority dominance") and adopts the live network state — proven load-bearing by a control test where the reset is disabled and the stale age-1000 peer pollutes the network; (e) the module is PURE and decisionless about WHEN to compare (the engine triggers on a B5 desync) and about the actual re-simulation; transfers use the transport's optional `{reliable:true}` point-to-point send (#22a) — pulled on demand, never flooded. The pure module + harness (two partitioned divergent groups converge to the older history after heal with age beating the id tiebreak; equal-age tie broken by lower id; lagging wake + the disabled-reset control; #30 attendance-tick merge end-to-end) prove both B8 success criteria; wiring into the v1 engine (which has no authority comparison or state-challenge flow yet) rides the B9 finalization rework | Convergence without a permanent master requires a deterministic shared total order, not a vote; older-sim-wins + lower-id-tiebreak is the design doc's lightweight rule; carrying provenance with the adopted history is what makes convergence monotonic rather than oscillating; the lagging reset is the specific guard against a stale peer's seniority dominating on rejoin; carrying last-attendance-tick in the transfer is the minimal datum that closes the #30 fallback; keeping the module pure preserves the layer split (B8 = authority/resolve/lag rules + transfer shape, B9 = finalize/bound) | Active |
| 33 | 2026-05-29 | Tick finalization + memory bounding (Goal B9), in `Finalization.js`: (a) a tick is FINALIZED (immutable) once it is strictly older than the finalization horizon `maxCurrentTick - graceWindowTicks` — justified because every legal late mutation is bounded by the B6 grace window: a late intent is accepted only within grace, a late/reordered attendance (and thus a disconnect-tick shift, #29/#30) is gated by that same window, and severe-desync recovery (#31) REPLACES history wholesale via a fresh transfer rather than reaching back into pruned per-tick data — so nothing can legally mutate a tick older than the horizon; (b) the horizon is GROW-ONLY-MAX off the maximum current tick ever observed, mirroring the #29 DisconnectTracker discipline — a recovery that jumps the local tick BACKWARD can never un-finalize already-collected ticks, while a forward jump (adopting a more-advanced authoritative state) legitimately advances it; before any `note()` the horizon is `-Infinity` so `isFinalized` is uniformly false; (c) TWO payload-agnostic tick-only GC policies (the caller maps ticks→payloads): `collectAnchored` for CARRY-FORWARD collections (snapshots, last-input-per-participant) keeps the latest entry at-or-before the horizon — the re-simulation anchor / last-known value — PLUS everything after, and collects the rest (dropping the anchor would make a still-mutable tick un-reconstructable; with no entry ≤ horizon it keeps all, since no finalized base exists yet); `collectBelow` for NO-carry-forward logs (query logs — a finalized tick can never be rechecked) collects everything strictly below the horizon and keeps the tick exactly AT the horizon (the oldest still-mutable tick); (d) the module is PURE (no clock/transport/globals) and decisionless about WHEN to run GC and the actual release — the engine owns both; per-participant sparse-input pruning is just `collectAnchored` applied per participant. The pure module + harness (`FinalizationNode` accumulates snapshots/query-logs/sparse-inputs per tick and runs the GC each tick) prove both B9 success criteria: a BOUNDED-GROWTH PLATEAU (retention is a function of phase, not session length — identical retainedCount at tick 600 and 1200) and a GC-DISABLED CONTROL that grows ~linearly (1121→2241) proving the GC is load-bearing; plus the rollback-anchor invariant and sparse carry-forward (a change-once-then-silent participant keeps exactly its last tick forever). Proven via `tests/finalization.test.js` (21 unit) + `tests/finalization-harness.test.js` (4) + `selftest-b9.mjs` (4, Node 12 too). In-engine wiring (the real release of retained data + the finalization sweep cadence) rides B-Integrate (#32) | A long-running session must not leak; tying finalization to the SAME grace-window bound that gates every late mutation means the GC can never collect a tick something could still legally touch — correctness and bounding fall out of one number; grow-only-max makes the horizon recovery-safe by construction; splitting the two retention semantics into pure tick-only policies makes the GC trivially testable and reusable (snapshots, per-participant inputs, query logs all map onto the two policies); keeping it decisionless preserves the layer split (B9 = what is finalized + what may be released, the engine = when) | Active |
| 32 | 2026-05-29 | The deferred in-engine wiring accumulated across Phase B is consolidated into ONE dedicated BIG-BANG integration goal, **B-Integrate**, sequenced AFTER all Phase B cores (through B9 finalization + B10 bootstrap) and BEFORE Phase C. Every B1–B8/B7.1 core landed PURE and harness-validated with its `RollbackNetcode`/`EasyMultiplayer` integration explicitly deferred (historically tagged "rides B9"); B-Integrate absorbs all of those deferrals at once. The gating FIRST step inside B-Integrate is cutting the wall-clock/`requestAnimationFrame` coupling (KNOWN_ISSUES #7): `EasyMultiplayer` must take an injected clock + a manual tick driver (as the harness nodes do) so the assembled engine is harness-testable; without that seam the integration would be unvalidated. **Correction:** B9 as scoped is ITSELF only a pure core (tick finalization + memory bounding) and does NOT contain the integration — the recurring "rides B9" phrasing across the docs was a placeholder and is superseded by "rides B-Integrate" (the per-goal notes are not all rewritten yet; this entry is the authoritative correction) | Concentrate integration risk in ONE validated step instead of smearing it across every goal; because each core is already individually harness-proven, the remaining unknown is COMPOSITION, best done once against a frozen, stable core set rather than as churn against a half-migrated, moving target; the clock/tick injection seam is a hard prerequisite for testing any assembly at all, so it leads | DONE (2026-05-29): assembled as `SimulationEngine.js` (a NEW module implementing the A3 SimulationNode contract — NOT a `RollbackNetcode`/`EasyMultiplayer` retrofit; the v1 files are untouched), composing all ten cores over injected Transport + clock with a manual tick. Built across five test-gated layers L1–L5; `tests/engine-l1..l4` + `tests/engine-integration.test.js` (end-to-end B1–B10 cross-section) + Node-12 `selftest-b-integrate.mjs`. Full suite 373/373 + all harness selftests green on Node 20 & 12. Deferred to Phase C: KNOWN_ISSUES #8 (real-transport bootstrap), #4 (`usedInputs`/"wait"-tier wiring, shipped lag/probe-lead defaults) |
| 34 | 2026-05-29 | Random-peer bootstrap + catching-up state (Goal B10), in `Bootstrap.js` — concretizes #18: (a) the SERVING PEER is chosen on the JOINER side, UNIFORMLY among eligible peers (`selectServingPeer(candidates, rng)`, `idx = floor(rng()*len)` clamped into range so a misbehaving rng can never index past the array); `rng` is INJECTED (`() => [0,1)`) so selection is deterministic under the seeded harness; over many joins this yields a uniform serving-peer histogram with no single load/pressure sink; (b) only a LIVE peer may serve (`eligibleServers` filters by `CatchUpStatus.LIVE` — a still-catching-up peer can't hand out a coherent present state); (c) the request is a single point-to-point RELIABLE send (`{reliable:true}`, #22a), NOT a broadcast that draws N replies — honoring the SPARSENESS constraint; (d) the BOOTSTRAP PAYLOAD (`makeBootstrapPayload`) pins the three pieces of #18: (1) a SNAPSHOT of full state at the grace-window edge `currentTick - graceWindowTicks` (a finalized tick — exactly the B9 `collectAnchored` anchor), carried BY REFERENCE (never cloned, mirroring `Recovery.makeStateTransfer`); (2) the SPARSE INPUT LOG of every change STRICTLY AFTER the edge up to the present (`validateInputLog` rejects entries at/before the edge — they're already baked into the snapshot+baseline, double-count); (3) the per-participant BASELINE intent IN EFFECT AT THE EDGE (B9's anchored last-input-per-participant) — this is the concretization of #18's vague "current per-participant input state"; the wrapper is frozen, the snapshot is not; (e) `reconstructInputs` rebuilds a `SparseInputDecoder` per participant (baseline-seeded; a log-only/never-heard participant defaults to `null` = passive, per B3) so re-sim reproduces every participant's intent at every tick >= the edge via hold-last; (f) the CATCHING-UP lifecycle (`CatchUpTracker`) is MONOTONIC and Layer-3-visible — a joiner constructs CATCHING_UP, fires `onEnterCatchingUp` on join, re-simulates 
edge→present, and `noteProgress(local, network)` flips it to LIVE ONCE (within `toleranceTicks`) firing `onLeaveCatchingUp`; it NEVER flaps back (a later lag is a B8 recovery concern, not a re-bootstrap); (g) the module is PURE and decisionless about WHEN to request a bootstrap, the actual transport send, and the re-sim loop — those belong to the engine. The pure module + harness (`BootstrapNode` server/joiner roles over MemoryTransport/VirtualClock) prove both B10 success criteria: UNIFORM SERVING-PEER DISTRIBUTION (100 joins over 5 fixed servers, chi-square < 13.28 df=4) and SPARSE single-server contact (exactly one server served per join), plus the catching-up lifecycle (one enter/leave pair, ends LIVE, adopts state at the captured present tick) and re-sim fidelity (baseline held + since-edge changes at the right ticks; never-heard => null). Proven via `tests/bootstrap.test.js` (21 unit) + `tests/bootstrap-harness.test.js` (4) + `selftest-b10.mjs` (4, Node 12 too). In-engine wiring (when to request, the real re-sim loop) rides B-Integrate (#32) | Random joiner-side selection is what spreads serving cost without a coordinator; pinning the payload to the B9 grace-window edge means a bootstrap and a finalization snapshot are the SAME anchor (one concept, not two); concretizing #18's "current input state" as the at-edge baseline is what composes the snapshot + since-edge log into a deterministic re-sim; the monotonic catching-up state is the Layer-3 hook a game needs to show a "joining…" UI; a single reliable point-to-point request (never a broadcast) is the SPARSENESS constraint applied to join; keeping the module decisionless preserves the layer split (B10 = select/payload/lifecycle, engine = when/send/re-sim) | Active |
| 35 | 2026-05-29 | Trystero behind the new Transport interface (Goal C1), in `transports/TrysteroTransport.js`: (a) the trystero binding (`{ joinRoom, selfId }`) is INJECTED via the constructor instead of a top-level `import ... from 'https://…'`; this removes the un-resolvable remote ESM import that previously broke `import`ing the module under a plain Node loader, making the ADAPTER deterministically testable in-process; (b) an async `createTrysteroTransport(opts)` factory is the SINGLE place that touches the remote URL — it does a DEFERRED dynamic `import(URL)` (evaluated only at call time, never at module load) and constructs the transport with the real binding; `EasyMultiplayer` now starts the default transport through this factory; (c) three spec/conformance gaps in the old adapter were fixed: pre-connect `send`/`broadcast` now THROW (was a silent no-op, conformance #9), post-disconnect is a no-op (tracked via `_disconnected`), unknown/departed-peer `send` is a no-op guarded by `getPeers()` (conformance #6), and `{ reliable:true }` maps onto a DISTINCT trystero action label (`TRYSTERO_LABELS.reliable` vs `.bestEffort`, conformance #1/#11) — trystero data channels are ordered+reliable, so the reliable label gets guaranteed in-order delivery; (d) testability mechanism: `test-harness/FakeTrysteroNetwork.js` mimics the trystero room API (`joinRoom`/`onPeerJoin`/`onPeerLeave`/`makeAction`/`getPeers`/`leave`, per-peer `selfId`) backed by the SAME `NetworkSim`/`VirtualClock` substrate the `MemoryTransport` conformance harness uses, with two logical channels so the reliable label bypasses loss; a member BUFFERS join/leave notices that arrive before its `onPeerJoin`/`onPeerLeave` is wired and flushes on registration (bridges NetworkSim's synchronous register events to trystero's async discovery without losing events); (e) the FULL 12-point transport conformance suite runs against the REAL `TrysteroTransport` over this fake — `tests/trystero-transport.test.js` (17 
Vitest) + `test-harness/selftest-trystero.mjs` (16 Node-12). HONEST LIMIT (CLAUDE.md §4): the fake cannot reproduce real WebRTC channels, NAT, the WebTorrent swarm, wire jitter, or auto-chunking — those are browser-only and ride Goal C4 (human-verified); conformance #11 (reliable-survives-loss) is a reliable-channel PLUMBING test, not a claim that real best-effort drops (real trystero channels are all reliable); the remote dynamic import is never invoked in Node (it rejects the https URL scheme) | A general Transport interface must be provable against a REAL implementation, not just `MemoryTransport`; injecting the binding + isolating the URL behind one deferred-import factory is the minimal change that makes the adapter Node-testable AND collapses the fragile-CDN risk (KNOWN_ISSUES #1/#5) to a single line; reusing `NetworkSim` for the fake means the same deterministic medium (seed, latency, loss, partition, delivery log) validates both transports with no new substrate; fixing the three conformance gaps is required for the engine's correctness assumptions to hold over the real transport; being explicit about what only C4 can verify keeps the AI-testable / human-verified boundary honest | Active |
| 37 | 2026-05-30 | **Synced-tick — derive the present tick from a SHARED epoch, not a per-peer count (opt-in `syncedTick`), in `SimulationEngine.js`.** Triggered by real-WebRTC jank: over a live link multiplayer "worked barely" with delay + desync on the late-joining screen. Root cause: `engine.tick` was a free-running +1 count started at each peer's own `connect()`, so two peers that came online at different wall-clock times were PERMANENTLY tick-offset (`late-join-scenario.test.js` pins this PRE-fix behavior: a residual `|A.tick - B.tick|` of ~the round-trip latency, explicitly asserted `<= 10`, "not 0"). (a) THE FIX: when `syncedTick` is enabled the present tick is `floor((clock.now() - epoch) / tickMs)`, where `epoch` is the wall-clock instant of THIS peer's tick 0. `_advanceTick` no longer blindly does `this._tick + 1`; it steps the simulation UP TO the synced target, catching up any intermediate ticks (without fresh local sampling, mirroring the `_onBoot` re-sim loop) so `this._tick` stays equal to what the shared clock says — count-based mode is byte-for-byte unchanged because its target is always `this._tick + 1` and the catch-up loop never runs. (b) EPOCH AGREEMENT (user decision, mirrors the #31 B8 age-beats-id authority order): the OLDEST `startTime` wins — a smaller/earlier epoch is "older" and is adopted by everyone (`_adoptEpoch` takes the strict min); a tie leaves the value unchanged (equal epochs yield identical ticks, so the id tie-break is moot for the tick VALUE). Epoch is carried in the attendance (`epoch` field, added only when `syncedTick`) and in the bootstrap `MSG_BOOT` payload; a late joiner adopts the server's older epoch in `_onBoot` and catches up to `max(presentTick, maxSeen, syncedTarget)` — using the synced target (not the latency-stale `presentTick` the server captured) is what COLLAPSES THE RESIDUAL to ~0. 
(c) MUTUAL-BOOTSTRAP DEADLOCK FIX: two peers that connect simultaneously are both `bootstrap:true` joiners — each asks the other for state, each declines (a catching-up peer can't serve), and pre-fix BOTH wedged at tick 0 forever. New `MSG_BOOT_DECLINE` (sent only when `syncedTick`) carries the decliner's epoch; `_onBootDecline` lets a joiner that finds NO older peer (decliner epoch not strictly smaller than its own) co-found from its own initial state via `_goLive()`, while a joiner that learns an OLDER peer exists waits for that peer's authoritative state rather than founding a divergent history. Declines are also routed inside the joining branch of `_onMessage` (a catching-up peer must still REPLY to a co-joiner's request). Proven by `tests/synced-tick.test.js` (3): late-joiner converges to the SAME tick as the early peer (`<= 1`, vs the count-based `<= 10` residual); B adopts A's older epoch (A does NOT collapse to B's younger one); simultaneous start both go live and converge with no deadlock. Full suite 438/438 + all Node-12 selftests green. (d) CLOCK SKEW — the engine is fed a SYNCED clock, not a raw one (this was a conceptual correction: the engine derives `tick` from `clock.now()`, so the skew must be removed UPSTREAM in the clock, not "accepted as a bounded residual"). `test-harness/SyncedPeerClock.js` is a deterministic harness model of the v1 `SyncedClock` + `WorldNetworkCommunicator` P2P mechanism: each peer pings every other and the responder answers with ITS OWN synced `now()` (not its raw clock), so samples estimate the offset to the SAME shared frame and the mesh converges transitively (Cristian's algorithm + median offset, monotonic-clamped non-decreasing once ready). `PeerHarness.addPeer({ synced:true, clockOffset, clockDrift })` wraps each peer's raw clock in one over a shared registry. 
The FOUNDER syncs to nobody first so its offset stays 0 and it ANCHORS the frame; the founder's frequent `now()` calls ratchet its monotonic floor, so a BEHIND joiner cannot drag it down — the joiner is pulled UP instead. TWO engine-side requirements made this work and are the real fix: (1) a JOINER seeds `epoch = +Infinity` (NOT its raw `now()`) and only ADOPTS epochs learned from peers — without this a behind clock's small raw epoch wins "oldest-wins" and HIJACKS the network's tick origin (a peer +1h merely rockets ITSELF, but a peer −1h dragged everyone — the asymmetry is the seed); a peer going live without ever adopting (founder/co-founder) anchors lazily in `_goLive`; (2) a joiner DEFERS go-live until `clock.isReady()` (`_clockReady`/`_awaitClockReady`), so it never derives a tick from the still-skewed raw clock — a plain clock without `isReady()` is treated as always ready, so count-based + shared-clock modes are unchanged. `tests/synced-tick-skew.test.js` (6) pins BOTH halves: WITHOUT a synced clock a +500ms raw offset leaks a stable ~10-tick residual and drift grows unbounded (the motivation); WITH `synced:true` a +500ms, +1h, −1h, and a drifting peer ALL converge to `|B.tick − A.tick| <= 1` and the −1h peer no longer hijacks A. HONEST LIMIT (CLAUDE.md §4): convergence is proven only for LATE-JOIN ordering (a founder anchors). STILL UNSOLVED (user-raised, follow-on increments): (i) two desynced peers cold-starting SIMULTANEOUSLY (no anchor → needs a tie-break); (ii) two internally-synced GROUPS at different frames merging and re-settling on one; (iii) an already-synced peer RECALIBRATING with simulation impact (the founder's monotonic clamp STALLS rather than jumps backward — good — but a large correction needs abandon-and-resync or a smooth startTime shift); (iv) "longest-PRESENT peer wins" (not oldest-time) when a late joiner ALSO arrives with a ready synced clock. 
| A free-running per-peer count can never agree across peers that boot at different times — the offset is baked in at `connect()` and only bootstrap papers over it once, after which the two counters drift by the latency again; the only way two peers agree on "what tick is it now" without a master is to compute it as a pure function of a SHARED time reference + a SHARED epoch. Oldest-epoch-wins reuses the existing #31 deterministic authority order (no new consensus round) and makes "whoever was in the room first defines tick 0" fall out for free; carrying epoch on the attendance/boot the peers already exchange honors the sparseness constraint (no new flood). The decline message is the minimal datum that breaks the symmetric deadlock without a master while still trailing a genuinely-older peer. Opt-in keeps the 429 count-based tests untouched so the change is provably non-regressive. | Active |
| 36 | 2026-05-29 | **Game-design contract for the engine — the engine constrains execution SEMANTICS, not code layout; colocating simulation and rendering on one object is BLESSED, not discouraged.** Triggered by C2: graph-pacman cannot run headlessly, and we initially mis-read that as "sim/draw coupling is bad." (a) The ONLY boundary the engine enforces is the Layer-2/3 API: serializable state + a pure deterministic `tick(state, query, …)` that may run 0..N times per tick (rollback re-sim) + `getLocalInputs`/`query`. The engine NEVER sees rendering, audio, or DOM — those are entirely Layer-3-private. So whether sim and draw live on the SAME object (pacman's `DrawableObject`: sim fields + a sprite handle, with separate `Tick`/`Draw` methods) or in SEPARATE sim/renderer modules is a free dev-ergonomics choice; both are valid as long as the contract holds. Colocation is the ergonomic default and is RECOMMENDED for small games (matches the "easy multiplayer" thesis); a mandatory ECS-style or sim/render split is explicitly NOT required. (b) Three disciplines make colocation safe and ARE the documented contract: (1) the serialized state is an explicit ALLOWLIST of sim-only fields (pacman's `exportVars` at `examples/graph-pacman-game.js:1162`) so display-only state — animation frame, interpolated draw position, sprite handles — stays OUT of snapshots (smaller snapshots; no false desyncs from render drift); (2) the step uses only engine-provided determinism sources (seeded `random()`, frozen `query`, `time = tick*tickMs`) — never `Math.random`/`Date.now`/DOM/hardware reads; (3) `draw` is READ-ONLY with respect to simulation state. (c) The ONE genuinely load-bearing portability rule: renderer construction must be LAZY / INJECTABLE, never hard-wired into entity construction, so a colocated game still boots HEADLESS (Node tests, server authority, scale runs) without a display. 
pacman/graph-pacman violate ONLY this — `DrawableObject`'s constructor does `new DrawableSprite(...)` from a remote browser-only module (`:1146`), so the game can't instantiate sim state without a browser. That is a single-player portability smell, independent of multiplayer and independent of the colocation choice — NOT a reason to restructure the game. (d) Because blessing colocation raises the odds a dev accidentally drops nondeterminism into the step, this RAISES the priority of Goal C5 (determinism enforcement / detector). | The product thesis is "easy"; forcing a sim/render split is exactly the ceremony the target audience bounces off, and the engine already gets a clean boundary for free because it only ever touches export/import/tick/query. The real failure we hit (can't run headless) was misdiagnosed — it is a construction-time hard-dependency on a browser-only module, fixable by injecting the renderer with zero game restructuring. Documenting the allowlist + determinism-source + read-only-draw disciplines makes the easy path safe; the injectable-renderer rule keeps it portable; all of it preserves the Layer-2/3 split. | Active |
| 38 | 2026-05-31 | **Late-join finalized-state convergence (Goal B11), in `SimulationEngine.js` + the `EasyMultiplayer` facade.** Triggered by the deployed graph-pacman "cursed" multiplayer desync, reproduced HEADLESS and split into two distinct problems (both user-predicted). **B11a — the bootstrap re-simulation desync (FIXED).** A late joiner (`syncedTick`) got a CORRECT bootstrap anchor (the grace-window-edge snapshot matched the founder) yet re-simulating edge→present produced a different state by the very next finalized checkpoint: `compareHashWindows` reported `lastAgreedTick=edge`, `divergeTick=edge+snapshotInterval`, `desync`. Frequency-dependent — steady inputs CONVERGE, dense/asymmetric inputs reliably DESYNC, so exposing it needs churning per-tick inputs on BOTH peers. ROOT CAUSE (a): the founder's input changes in the `(edge, present]` window were in the founder's FUTURE when `_onBootRequest` served the snapshot (so the boot inputLog can't contain them) AND were dropped before the joiner went live — `_onBoot` calls `reconstructInputs` which REPLACES the decoders with boot-payload knowledge (only up to edge), discarding any live founder-input broadcasts that arrived during bootstrap; the joiner then replayed the founder's input FROZEN at its tick-`edge` value across the whole window, so the next finalized checkpoint diverged. Ruled out by minimal repro: (b) edge-anchor off-by-one (the anchor MATCHED), (c) the joiner's own input (the frozen stream is the FOUNDER's, player '1'), (d) RNG (a no-RNG integer counter still desynced). THE FIX: buffer `MSG_INPUT` that arrives while `_joining && !_bootstrapped` into `this._preBootInputs` (instead of dropping it in the joining branch of `_onMessage`), and merge it into the reconstructed decoders at the TOP of `_catchUpAndGoLive` (filtered to `tick > edge`, before the re-sim loop). 
Merge location is `_catchUpAndGoLive` NOT `_onBoot` on purpose: in synced-tick mode `_onBoot` usually DEFERS via `_awaitClockReady` and the founder keeps broadcasting during that wait, so merging only in `_onBoot` would still miss them. Proven by `tests/late-join-bootstrap-resim.test.js` (minimal `{counter}` step, +1 per 'right'/-1 per 'left' player) + the pacman block in `tests/graph-pacman-late-join.test.js`: both go from `desync` (divergeTick=140) to `agree`; a NEGATIVE CONTROL with SEPARATE `NetworkSim` instances per peer (a shared NetworkSim broadcasts to all regardless of `setLink`, so peers would input-sync and converge — vacuous) still reports `desync`, proving the assertion is non-vacuous. **B11b — detect + repair a finalized split (WIRED).** Even with B11a, a split that finalizes (e.g. a transient partition) had NO repair: B8 severe-desync recovery was never enabled by the facade — `_finishStart` turned on only `attendance`+`bootstrap` for syncedTick, so `_onAssert` bailed at `if (!this._recovery) return`. 
THE FIX is three coupled changes: (1) FACADE — add `recovery: syncedTick || undefined` to the engine options (recovery is ON whenever `syncedTick`, no separate opt-in); (2) `_adoptTransfer` CROSS-TICK RE-ANCHOR — after adopting the snapshot, set `this._tick = transfer.tick` so `_advanceTick` re-simulates forward from the winner's tick to the present (B8's original `_adoptTransfer` assumed TICK-ALIGNED "joined-together" peers; without the re-anchor the loser grafts the winner's state onto its own counter and instantly re-diverges); (3) two B8-semantics fixes that the cross-tick case EXPOSED, both required for the repair to actually converge: (3a) `_onTransferRequest` now serves the latest FINALIZED checkpoint (snapshot at-or-before the finalization horizon) instead of live `this._state` — the live state is still inside the grace window and can be rolled back by a late input, so serving it hands the loser a TRANSIENT anchor the winner itself later corrects, freezing the split permanently (observed: the winner served its tick-240 state, then its own late-arriving input rolled back checkpoint 240, but the loser had already adopted the stale version; player '2's pacman differed direction down/up, edgeProgress .485/.515); falls back to live state when no `_finalizer` is configured, so the B8 tick-aligned recovery tests are unaffected; (3b) `_adoptTransfer` adopts the winner's `simulationAge` but KEEPS its own `peerId` — copying the winner's id verbatim made both peers authority-EQUAL `{age,'1'}`, so the once-only `_onTransfer` guard (`compareAuthority >= 0 return`) rejected every corrective retry and deadlocked the split (frozen at divergeTick=240 while checkpoints 260+ all agreed); keeping our id leaves us strictly below the winner in the authority order (age ties break on lower id), so a residual split re-runs `< 0` and a corrective transfer is allowed without ping-pong. 
Proven by `tests/graph-pacman-late-join.test.js` block "B11b": two synced peers, run linked ~8s (not desync), kill the link via a 1e9-latency `setLink` and run ~4s (each freezes the other's input and diverges), restore the link, assert `worstHashVerdict([A,B]) === 'desync'` IMMEDIATELY (detector not blind), step forward and assert it heals to non-`desync` and STAYS healed; a NEGATIVE CONTROL leaves the link dead and stays `desync` (proving the repair is the restored link doing real work, not a blind verdict). Full suite 455 passed / 11 skipped on Node 20; B8 `tests/engine-l3-recovery.test.js` adopt-once / winner-never-adopts assertions still hold (the finalized-snapshot serve is a no-op there — no finalizer configured — and convergence after a single adoption means no second adoption fires). HONEST LIMIT (CLAUDE.md §4): deterministic MemoryTransport + VirtualClock only — no real WebRTC/jitter/packet-loss; this proves the LOGIC holds headless (the repro is a sufficient witness that the gap could occur and is now closed, not that every real session hit it); real-transport timing is browser-only (Goal C4). STILL OPEN: the 6 distributed edge cases in `tests/synced-tick-distributed-spec.test.js` (describe.skip, #37) — clock-disagreement/group-merge cases SEPARATE from this single-founder/single-joiner bug. The fuzz/sweep matrix is intentionally deferred. | A correct bootstrap anchor is worthless if the re-sim that follows it replays a frozen input — the only inputs a joiner can be missing are exactly those the founder produced in the joiner's catch-up window AFTER the snapshot was cut, so buffering live input during the bootstrap wait and merging it before re-sim is the minimal datum that closes the gap, and doing it in `_catchUpAndGoLive` (not `_onBoot`) is what survives the synced-tick clock-ready deferral. 
For repair, recovery had to be wired (it was dead code behind the facade) and B8's tick-aligned assumption generalized to cross-tick; the two B8 fixes are not gold-plating but the precise reason the cross-tick repair otherwise can't converge — a transient served anchor and an authority-equality deadlock each independently freeze the split, and both surfaced only once recovery actually ran across ticks. Keeping our own peerId preserves the #31 loser-stays-loser invariant so re-adoption is corrective, not oscillating; serving the finalized snapshot reuses the B9 horizon as the immutable hand-off point (one concept, not two). | Active |
| 39 | 2026-06-02 | **Completeness-scored, zero-handshake severe-desync recovery + connected-network delta sync (SPEC ONLY — not yet built).** Upgrades the #31 PURE-AUTHORITY recovery model after the user flagged that authority-alone produces "stupid results": a briefly-isolated peer can force a whole agreeing group to adopt ITS state merely because it is older. Documented (not implemented) across `easy_multiplayer_redesign_concretized_architecture.md` (§ Connected-Network Delta Sync, § Desync Detection "Three Outcomes", § Severe Desync Recovery) and `PROTOCOL_SPEC.md` (Delta-Sync Envelope, B8 stateChallenge/stateTransfer). FIVE parts: **(a) Connected-network delta sync** — the network is CONNECTED, not complete; every layer-2 message carries an `{id, ackId}` envelope (monotonic per-sender id + highest id received FROM the recipient), a sender transmits only the delta above the recipient's ack, and each node keeps ONE GLOBAL `{id -> knowledge-diff}` watermark store (NOT per-peer — user-corrected: "just your global knowledge-diff"), bounding memory to the unacknowledged tail and per-peer bookkeeping to a single integer. **(b) Three-way desync classification** at the lowest shared finalized frame: KNOWLEDGE GAP (one peer missing relevant inputs → delta-sync, not a desync) / NON-DETERMINISM (a NON-finalized checkpoint differs under EQUAL inputs → the game step itself is non-deterministic; a transfer cannot fix it because re-running identical inputs re-diverges) / CATASTROPHIC (a FINALIZED checkpoint differs → full-state transfer). **(c) Completeness score** for catastrophic resolution, evaluated at `min(myFinalizedFrame, theirFinalizedFrame)`: `+1` per player I finalize that they don't, `-1` per player they finalize that I don't, `+1`/`-1` per shared player according to whose most-recent finalizing input is newer; `>0` win (do nothing), `<0` lose (request winner's full state), `=0` authority tiebreaker (older simulationAge then lower peerId). 
The most-recent input before the finalized frame is a SUFFICIENT completeness statistic because inputs are accepted only IN ORDER — no gaps (user-corrected my impossible C@10→C@50-skip counterexample). **(d) Zero-handshake** — both peers compute the SAME verdict from already-shared data, so the LOSER is the only actor: it requests state from the exact peer it lost to (not a broadcast, not a poll of others), the winner serves on request and otherwise does nothing ("the other node will find out themselves if they lost"). Lost-request liveness: the loser RE-REQUESTS on a rate-limited timeout (pinned RED by `tests/recovery-lost-request.test.js`; cadence open). **(e) Frame-hash broadcast (added 2026-06-02):** hashes exist only at CHECKPOINTS (the interval grid), never per-frame — so the recovery-relevant "finalized frame" is the latest checkpoint at-or-before the grace horizon, and a frame past grace that is not on the grid is never hashed nor reported as finalized (the finalized-hash stream stays sparse: one per interval). FINALIZED checkpoint hashes are ALWAYS broadcast (they feed the score); NON-finalized checkpoint hashes are sent only behind the opt-in `sendNonFinalizedHashes` flag (default false) since they can reveal only rollback-able / non-deterministic divergence a transfer cannot repair — value unproven, so off by default. 
PARKED for the build phase (see KNOWN_ISSUES): whether to SCRAP non-finalized desync healing entirely (it's non-determinism, unfixable by transfer); whether the flagged non-finalized hash path ever proves useful enough to default on; whether HashWindow's confirmed/UNCONFIRMED input tracking can be simplified away (unconfirmed is derivable from last-confirmed + "silence = no change") | Pure authority is too blunt for catastrophic merges — completeness (who actually has more of the agreed history) is the principled decider, with authority demoted to a tiebreaker; computing it from the most-recent finalizing input is exact given the no-skipped-inputs invariant, so it needs no extra state. Zero-handshake falls out of both peers sharing the inputs that feed the score: no consensus round, the loser acts alone, the winner is passive. The `{id, ackId}` envelope + single global knowledge-diff store is the minimal delta-sync substrate for a connected (non-complete) graph and honors the sparseness constraint (steady state costs nothing). Spec-first per CLAUDE.md §2 — documented and questioned before any code; the RED test already pins the lost-request liveness requirement | Spec (not implemented) |

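The completeness score in #39(c) is spec-only, so the following is a minimal JavaScript sketch under stated assumptions, not the engine's implementation. It assumes each peer can summarize its finalized history at the shared frame as a hypothetical `Map` from player id to the tick of that player's most-recent finalizing input. The names `completenessScore`, `resolveCatastrophic`, and `lastInputTick` are illustrative only. It further assumes that a larger `simulationAge` means "older" and that peer ids compare as strings. The sample data shows an older peer B losing to a more complete peer A, with both sides deriving complementary verdicts from the same data, which is the property the zero-handshake step in #39(d) relies on.

```javascript
// Score my finalized knowledge against a peer's at the shared finalized frame.
// ASSUMPTION: `mine` / `theirs` are Maps of playerId -> tick of that player's
// most-recent finalizing input (a hypothetical summary shape, not a real API).
function completenessScore(mine, theirs) {
  let score = 0;
  const players = new Set([...mine.keys(), ...theirs.keys()]);
  for (const id of players) {
    if (!theirs.has(id)) score += 1;                     // I finalize a player they don't
    else if (!mine.has(id)) score -= 1;                  // they finalize a player I don't
    else if (mine.get(id) > theirs.get(id)) score += 1;  // shared player, my input is newer
    else if (mine.get(id) < theirs.get(id)) score -= 1;  // shared player, their input is newer
  }
  return score;
}

// Resolve a catastrophic (finalized) split from my point of view.
// ASSUMPTION: larger simulationAge = older; lower peerId wins an age tie.
function resolveCatastrophic(me, them) {
  const score = completenessScore(me.lastInputTick, them.lastInputTick);
  if (score > 0) return 'win';   // do nothing
  if (score < 0) return 'lose';  // request the winner's full state
  if (me.simulationAge !== them.simulationAge) {
    return me.simulationAge > them.simulationAge ? 'win' : 'lose';
  }
  return me.peerId < them.peerId ? 'win' : 'lose';
}

// Peer B is older but was briefly isolated; peer A holds more of the history.
const peerA = {
  peerId: '1',
  simulationAge: 500,
  lastInputTick: new Map([['1', 120], ['2', 90], ['3', 40]]),
};
const peerB = {
  peerId: '2',
  simulationAge: 900,
  lastInputTick: new Map([['1', 100], ['2', 90]]),
};

const verdictA = resolveCatastrophic(peerA, peerB); // 'win'
const verdictB = resolveCatastrophic(peerB, peerA); // 'lose'
console.log(verdictA, verdictB);
```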
---

## Notes

- Pending decisions from the previous version (API pattern, determinism strategy, scaling topology) have each been resolved or absorbed into a tracked goal:
  - **API pattern** — callback-based, per `EasyMultiplayer.js` shape and recommended direction in the design doc
  - **Determinism strategy** — open, tracked as Goal C5 (Phase C)
  - **Scaling topology** — solved by the Transport interface abstraction (#13); concrete transports can be P2P, server-client, or hybrid without affecting Layer 2/3
- **Synced-tick follow-on research (#37 "STILL UNSOLVED" (i)–(iv)):** the distributed EDGE 1–6 cases left open by #37 — cold-start tie-break, group-merge re-settle, large recalibration, longest-present-wins — are worked in `research/synced-clock/` (plan, prior-art survey, per-EDGE design notes, and honest testing limits; see `research/synced-clock/README.md`). The executable specs are the EDGE blocks in `tests/synced-tick-distributed-spec.test.js`.

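As background for the synced-tick follow-on work, the core of #37 can be sketched in a few lines of JavaScript. This is an illustrative sketch, not the code in `SimulationEngine.js`: the function names `syncedTickAt`, `adoptEpoch`, and `advanceTo`, the `stepOnce` callback, and all numeric values are assumptions chosen for the example. It shows the tick formula `floor((clock.now() - epoch) / tickMs)`, the strict-min epoch adoption, a joiner seeding `epoch = +Infinity`, and a catch-up loop that steps up to the synced target.

```javascript
// Present tick as a pure function of a shared clock reading and a shared epoch.
function syncedTickAt(nowMs, epochMs, tickMs) {
  return Math.floor((nowMs - epochMs) / tickMs);
}

// Oldest-epoch-wins: adopt only a strictly smaller (earlier) epoch.
// A tie leaves the current value unchanged.
function adoptEpoch(currentEpoch, learnedEpoch) {
  return learnedEpoch < currentEpoch ? learnedEpoch : currentEpoch;
}

// Step up to the target tick one tick at a time.
// `stepOnce` is a hypothetical stand-in for the engine's simulation step.
function advanceTo(currentTick, targetTick, stepOnce) {
  let tick = currentTick;
  while (tick < targetTick) {
    tick += 1;
    stepOnce(tick);
  }
  return tick;
}

const tickMs = 50;
const founderEpoch = 1000;      // founder's tick-0 instant in ms (example value)
let joinerEpoch = Infinity;     // joiner seeds +Infinity, not its raw now()
joinerEpoch = adoptEpoch(joinerEpoch, founderEpoch); // learned from attendance or boot

const now = 3500;               // reading of the shared (synced) clock
const founderTick = syncedTickAt(now, founderEpoch, tickMs);
const joinerTick = syncedTickAt(now, joinerEpoch, tickMs);

// The joiner was at tick 47 and catches up through the intermediate ticks.
const stepped = [];
const joinerAfterCatchUp = advanceTo(47, joinerTick, (t) => stepped.push(t));
console.log(founderTick, joinerTick, stepped); // 50 50 [ 48, 49, 50 ]
```

Seeding the joiner with `Infinity` means it can only adopt an epoch, never impose one. A joiner that instead seeded a behind raw clock reading (say 400) would win the strict-min comparison against the founder's 1000 and shift the whole network's tick origin, which is the hijack described in #37(d). In count-based mode the target is always the current tick plus one, so the loop in `advanceTo` runs exactly once.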