# Architecture — Easy-Multiplayer

## Two-Level Architecture

The system has two distinct architectural pictures worth understanding separately:

1. The **target** v2 architecture (semantic distributed simulation, layered Transport / Rollback / Game)
2. The **current** v1 implementation (functional, but conflates the layers)

Most of this document describes the target. The section "Current State Mapping" near the end shows how the current code maps onto it.

---

## Target: Three-Layer Architecture

```
┌─────────────────────────────────────────────────────────┐
│  Layer 3 — Game Layer                                    │
│  • Game rules, rendering, audio                          │
│  • Local intent construction (getLocalInputs)            │
│  • Query predicates (semantics)                          │
└──────────────────────────┬──────────────────────────────┘
                           │ tick(state, query) /
                           │ getLocalInputs(localState)
┌──────────────────────────▼──────────────────────────────┐
│  Layer 2 — Rollback Simulation Layer                     │
│  • Deterministic stepping, snapshots, hashes             │
│  • Query log, predicate re-evaluation                    │
│  • Sparse input history, prediction                      │
│  • Rollback decision (semantic — query result changes)   │
│  • Hash window broadcasts, uncertainty-aware desync      │
│  • Bus-based re-simulation (forthcoming, Phase C+)       │
└──────────────────────────┬──────────────────────────────┘
                           │ Transport interface
                           │ (send, receive, peers, attendance)
┌──────────────────────────▼──────────────────────────────┐
│  Layer 1 — Transport Layer                               │
│  • Connections, NAT, routing, reliability                │
│  • Attendance, peer discovery, clock sync                │
│  • Implementations: MemoryTransport, TrysteroTransport,  │
│                     (future) ServerTransport             │
└──────────────────────────────────────────────────────────┘
```

The boundary between Layer 1 and Layer 2 is the **Transport interface** (see `TRANSPORT_SPEC.md`). The boundary between Layer 2 and Layer 3 is the **Easy-Multiplayer public API** (the `tick / query / getLocalInputs` contract).

### What each layer must NOT know

- **Layer 1 must not know:** game rules, query semantics, simulation state, rollback decisions
- **Layer 2 must not know:** transport topology, NAT, sockets, peer routing, packet reliability mechanics
- **Layer 3 must not know:** wire formats, peer IDs (except as opaque participant handles), tick timing details

This separation makes Layer 1 fully replaceable (P2P, client-server, hybrid) without touching Layer 2 or 3.

---

## Designing Games For The Engine

The engine constrains **execution semantics, not code layout.** The only boundary it enforces is the Layer-2/3 API: a serializable state, a pure deterministic `tick`, and `query` / `getLocalInputs`. It never sees rendering, audio, or DOM — those are entirely Layer-3-private.

### What the engine sees

- **State** — whatever you `exportState` / `importState` (or `manageState`). This is the *whole* simulation as far as the engine is concerned. It is snapshotted, transferred on bootstrap/recovery, and hashed for desync detection.
- **`tick(state, query, …)`** — a pure step that may run **0..N times per tick** (rollback re-simulates the same tick whenever a correction lands). It must be a function of state + inputs + engine-provided determinism sources only.
- **`query` / `getLocalInputs`** — how the step reads other participants' intents and how local input is sampled.

The engine does **not** see, and does not care about, how you draw, play sound, or read the keyboard.
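
As a rough illustration of that boundary, here is a minimal sketch of a serializable state and a pure `tick` for a hypothetical one-dimensional game. The names `PLAYERS`, `createInitialState`, and `makeFixedQuery` are assumptions made for this example and are not part of the engine; `makeFixedQuery` only imitates the `query(playerId, ctx, predicate)` shape described later in this document.

```js
// Hypothetical one-dimensional game; PLAYERS and the state shape are
// illustrative, not engine API.
const PLAYERS = ["p1", "p2"];

// Serializable state: plain data only, so it can be snapshotted and hashed.
function createInitialState() {
  return { positions: { p1: 0, p2: 0 } };
}

// Pure step. It may be re-run for the same tick during rollback, so it reads
// only `state` and the answers that `query` gives it.
function tick(state, query) {
  for (const id of PLAYERS) {
    const ctx = { position: state.positions[id] };
    const movesLeft = query(id, ctx, (input, c) =>
      Boolean(input && input.left) && c.position > -10);
    const movesRight = query(id, ctx, (input, c) =>
      Boolean(input && input.right) && c.position < 10);
    if (movesLeft) state.positions[id] -= 1;
    else if (movesRight) state.positions[id] += 1;
  }
  return state;
}

// Stand-in for the engine's query: answers predicates from a fixed input table.
function makeFixedQuery(inputsByPlayer) {
  return (playerId, ctx, predicate) =>
    predicate(inputsByPlayer[playerId] || null, ctx);
}
```

Running `tick` twice from the same initial state with the same input table gives identical results, which is the property rollback relies on.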

### Colocation is blessed

Putting simulation and rendering on the **same object** — e.g. one entity with both a `Tick()` (sim) and a `Draw()` (render), plus a sprite handle — is a fully valid, **recommended** default for small games. It matches the "easy multiplayer" thesis and keeps related code together. A sim/renderer split (ECS-style or otherwise) is also valid, but is **not required**. Choose by ergonomics.

`examples/graph-pacman-game.js` is the colocated style: pac-men and ghosts are `DrawableObject`s carrying sim fields *and* a `DrawableSprite`.

### The three disciplines that make colocation safe

These are the actual contract — follow them and either layout works:

1. **State is an explicit allowlist of sim-only fields.** Snapshot only what the simulation needs (positions, velocities, scores), never display-only state (animation frame, interpolated draw position, sprite handles). graph-pacman does this with `exportVars` (`examples/graph-pacman-game.js:1162`). Result: smaller snapshots and no false desyncs from render drift.
2. **The step uses only engine-provided determinism sources** — the seeded `random()`, the frozen `query`, and `time = tick * tickMs`. Never `Math.random`, `Date.now`, `performance.now`, DOM reads, or hardware/network reads inside `tick`. (Rollback re-runs the step on past ticks; any hidden nondeterminism desyncs peers.)
3. **`draw` is read-only with respect to simulation state.** Rendering may be nondeterministic and run once per animation frame, but it must never mutate state the simulation depends on.
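
The three disciplines can be sketched on one colocated entity. This is an illustration under assumptions, not the graph-pacman code: the `Ghost` class, the `SIM_FIELDS` allowlist, and the `mulberry32` generator (standing in for the engine's seeded `random()`) are all hypothetical.

```js
// Discipline 1: an explicit allowlist of sim-only fields (hypothetical names).
const SIM_FIELDS = ["x", "vx", "score"];

class Ghost {
  constructor() {
    this.x = 0; this.vx = 1; this.score = 0; // simulation state
    this.animFrame = 0; this.drawX = 0;      // display-only state
  }
  exportState() {
    const out = {};
    for (const key of SIM_FIELDS) out[key] = this[key];
    return out;
  }
  importState(saved) {
    for (const key of SIM_FIELDS) this[key] = saved[key];
  }
  // Discipline 2: randomness is passed in, never taken from Math.random.
  tick(random) {
    if (random() < 0.25) this.vx = -this.vx;
    this.x += this.vx;
  }
  // Discipline 3: draw touches display-only fields and reads sim fields.
  draw(alpha) {
    this.animFrame = (this.animFrame + 1) % 4;
    this.drawX = this.x - this.vx * (1 - alpha);
    return { x: this.drawX, frame: this.animFrame };
  }
}

// Small seeded PRNG (mulberry32-style) standing in for the engine's random().
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
```

Two `Ghost` instances driven by the same seed export identical state no matter how often either one is drawn, because `draw` never writes a field listed in `SIM_FIELDS`.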

### The one load-bearing portability rule

**Renderer construction must be lazy / injectable, never hard-wired into entity construction.** A colocated game must still *boot and tick headless* — for Node tests, server authority, and scale runs — without a display. If an entity's constructor unconditionally does `new SomeSprite(...)` against a browser-only module, the simulation can't be instantiated without a browser.

This is exactly (and *only*) where pacman / graph-pacman fall short today: `DrawableObject`'s constructor builds a `DrawableSprite` from a remote browser-only module (`examples/graph-pacman-game.js:1146`). That is a single-player portability smell — independent of multiplayer and independent of the colocation choice — not a reason to restructure the game. Injecting the renderer (a factory that the headless path can stub) fixes it with zero game-logic changes.
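
One possible shape for that injection, sketched with a hypothetical `Pacman` entity and a hypothetical `spriteFactory` argument (neither is taken from `examples/graph-pacman-game.js`):

```js
// Hypothetical entity: the sprite comes from an injected factory and is only
// built on first draw, so construction and ticking never need a display.
class Pacman {
  constructor(spriteFactory = null) {
    this.x = 0;
    this._spriteFactory = spriteFactory;
    this._sprite = null;
  }
  tick() {
    this.x += 1;
  }
  draw() {
    if (!this._spriteFactory) return null; // headless: nothing to draw
    if (!this._sprite) this._sprite = this._spriteFactory();
    this._sprite.moveTo(this.x);
    return this._sprite;
  }
}
```

A Node test can call `new Pacman()` and tick it freely, while the browser entry point passes a factory that wraps the real sprite class. The game logic in `tick` is the same in both cases.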

> **Caveat — determinism is dev discipline.** Blessing colocation makes it easier to accidentally drop nondeterminism into the step. Goal C5 (a determinism detector/enforcement tool) exists to catch this; until it ships, disciplines (1)–(3) are enforced by review and by the debug-mode hash/desync detection, not by the engine refusing bad code.

---

## Layer 2 — Key Concepts

### Sparse Input History

Inputs are change-only:

```
[ { tick: 1, intent: {left: true} }, { tick: 80, intent: {left: false} } ]
```

Silence between changes means "unchanged". Decoders reconstruct the continuous stream.
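
A minimal sketch of both directions, assuming entries are sorted by tick and intents are plain JSON-like objects. The function names `encodeChanges` and `intentAt` are illustrative and are not the `SparseInput.js` API.

```js
// Encoder: keep an entry only when the intent differs from the previous one.
// Comparison is by JSON text, so it assumes a stable key order (an assumption).
function encodeChanges(perTickIntents, startTick) {
  const changes = [];
  let lastKey;
  perTickIntents.forEach((intent, offset) => {
    const key = JSON.stringify(intent);
    if (key !== lastKey) {
      changes.push({ tick: startTick + offset, intent });
      lastKey = key;
    }
  });
  return changes;
}

// Decoder: the intent in effect at `tick` is the latest change at or before it.
// Before the first change nothing is known, so the baseline is null.
function intentAt(changes, tick) {
  let current = null;
  for (const entry of changes) {
    if (entry.tick > tick) break;
    current = entry.intent;
  }
  return current;
}
```

With the two-entry history shown above, `intentAt(history, 40)` yields `{left: true}` even though nothing was sent for tick 40.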

### Local Intent Construction

`getLocalInputs(localGameState)` runs locally and produces semantic intent based on locally visible state, *not* raw hardware. The same button A produces `{jump: true}` in gameplay and `{confirm: true}` in a dialog — bound at input time, immune to rollback reinterpretation. Returning `null` makes the participant passive (no inputs, no rollback pressure).
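
For illustration, a sketch of such a function. The `pad` object, its `isDown` method, and the `spectating` / `dialogOpen` fields are hypothetical stand-ins for whatever the game uses locally.

```js
// `pad` stands in for the game's own hardware sampling; it is hypothetical.
function makeGetLocalInputs(pad) {
  return function getLocalInputs(localGameState) {
    // Spectators send nothing: null marks the participant as passive.
    if (localGameState.spectating) return null;
    const aPressed = pad.isDown("A");
    // The meaning of button A is decided here, from locally visible state.
    if (localGameState.dialogOpen) return { confirm: aPressed };
    return { jump: aPressed };
  };
}
```

Because the intent already says `confirm` or `jump`, a later rollback that changes whether the dialog was open cannot turn a confirmation into a jump.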

### Query-Based Semantic Rollback

The game tick calls `query(playerId, ctx, (input, ctx) => predicate)`. The simulation layer records:

```
{ tick, playerId, predicateFn, ctxSnapshot, result }
```

On input correction: re-evaluate the predicate with the corrected input and the *frozen* ctxSnapshot. Rollback only fires if the result changes. Unqueried ticks cannot trigger rollback at all.
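
The re-evaluation rule can be sketched as follows. `makeQueryLog` and `earliestFlip` are hypothetical names for this example; they are not the `QueryContext.js` API.

```js
// Minimal query log: one entry per query call, in the shape shown above.
function makeQueryLog() {
  const entries = [];
  return {
    record(tick, playerId, predicateFn, ctxSnapshot, result) {
      entries.push({ tick, playerId, predicateFn, ctxSnapshot, result });
    },
    // Re-evaluate logged predicates against corrected inputs. Returns the
    // earliest tick whose result flips, or null when no rollback is needed.
    // `correctedInputAt(tick)` returns undefined for ticks that did not change.
    earliestFlip(playerId, correctedInputAt) {
      let earliest = null;
      for (const entry of entries) {
        if (entry.playerId !== playerId) continue;
        const corrected = correctedInputAt(entry.tick);
        if (corrected === undefined) continue;
        const now = entry.predicateFn(corrected, entry.ctxSnapshot);
        if (now !== entry.result && (earliest === null || entry.tick < earliest)) {
          earliest = entry.tick;
        }
      }
      return earliest;
    },
  };
}
```

A correction for a player that was never queried finds no matching entries, so it returns `null` and costs no re-simulation.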

**Predicate context freezing.** The `ctx` parameter must be frozen at query time. Implementation:

- **Debug mode** — ctx is duplicated (efficient deep-clone, not JSON round-trip) at query time, and later compared on re-evaluation; mutations produce a warning/error pointing at the offending predicate.
- **Production mode** — ctx is trusted as-is; zero runtime overhead.

Devs writing pure predicates pay nothing; devs writing accidentally-mutating predicates get loud diagnostics during development. The "efficient deep-clone" is itself an open implementation question (structured-clone vs hand-rolled walker — see `KNOWN_ISSUES.md`).
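
A sketch of the two modes, using a hand-rolled walker for plain JSON-like data. The helper names (`deepCloneSketch`, `deepEqualSketch`, `recordQuery`, `recheckQuery`) are assumptions for this example, and the walker is only one of the open options mentioned above.

```js
// Hand-rolled clone for plain objects, arrays, and primitives only.
function deepCloneSketch(value) {
  if (Array.isArray(value)) return value.map(deepCloneSketch);
  if (value !== null && typeof value === "object") {
    const out = {};
    for (const key of Object.keys(value)) out[key] = deepCloneSketch(value[key]);
    return out;
  }
  return value;
}

function deepEqualSketch(a, b) {
  if (a === b) return true;
  if (a === null || b === null || typeof a !== "object" || typeof b !== "object") return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const keysA = Object.keys(a);
  if (keysA.length !== Object.keys(b).length) return false;
  return keysA.every((key) =>
    Object.prototype.hasOwnProperty.call(b, key) && deepEqualSketch(a[key], b[key]));
}

// Debug mode clones ctx before the predicate runs; production keeps the reference.
function recordQuery(debug, ctx, predicateFn, input) {
  const frozen = debug ? deepCloneSketch(ctx) : null;
  return { ctx, frozen, predicateFn, result: predicateFn(input, ctx) };
}

// On re-evaluation, debug mode reports drift and evaluates against the clone.
function recheckQuery(entry, correctedInput, warn) {
  if (entry.frozen !== null && !deepEqualSketch(entry.ctx, entry.frozen)) {
    warn("query ctx was mutated after query time");
    return entry.predicateFn(correctedInput, deepCloneSketch(entry.frozen));
  }
  return entry.predicateFn(correctedInput, entry.ctx);
}
```

In production mode the entry holds only the original reference, so a mutated `ctx` silently changes the re-evaluation. That is the failure the debug-mode warning is meant to surface during development.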

### Hash Window Broadcasts

Periodic per-peer broadcast:

```js
{
  oldestTick,
  interval,
  stateHashes: [hash0, hash100, hash200, hash300],
  usedInputs: [/* relevant queried inputs since oldestTick */]
}
```

Cadence and window size are **tunable parameters**, not fixed protocol numbers.
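
A sketch of assembling such a window from locally stored checkpoint hashes. `buildHashWindow` and its parameters are hypothetical and are not the `HashWindowBuilder` API; the `tick` field on each used input is also an assumption.

```js
// hashByTick: Map from tick number to state hash for locally simulated ticks.
// count: how many checkpoints to include at most (a tunable, like interval).
function buildHashWindow(hashByTick, oldestTick, interval, count, usedInputs) {
  const stateHashes = [];
  for (let i = 0; i < count; i++) {
    const tick = oldestTick + i * interval;
    if (!hashByTick.has(tick)) break; // checkpoint not simulated yet
    stateHashes.push(hashByTick.get(tick));
  }
  return {
    oldestTick,
    interval,
    stateHashes,
    // Only inputs inside the window are relevant to the receiver.
    usedInputs: usedInputs.filter((entry) => entry.tick >= oldestTick),
  };
}
```

The position of a hash in `stateHashes` implies its tick (`oldestTick + index * interval`), so the ticks themselves do not need to be sent.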

### Uncertainty-Aware Desync Detection

A hash mismatch is **not** a desync if either peer still has unresolved relevant input uncertainty in the window between the last agreed checkpoint and the divergent one. Only with no uncertainty remaining does a mismatch become a true deterministic divergence triggering recovery.

This is the most important conceptual departure from traditional rollback: different state can be correct-given-different-knowledge, not broken.
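
The decision can be sketched as a comparison that returns a verdict instead of a boolean. `compareWindowsSketch` and the caller-supplied `hasUncertainty(lastAgreedTick, divergeTick)` callback are assumptions for this example; the callback is expected to report whether either peer still has unresolved relevant input in that tick range.

```js
function compareWindowsSketch(local, remote, hasUncertainty) {
  if (local.oldestTick !== remote.oldestTick || local.interval !== remote.interval) {
    return { verdict: "incomparable" };
  }
  const shared = Math.min(local.stateHashes.length, remote.stateHashes.length);
  let lastAgreed = null;
  for (let i = 0; i < shared; i++) {
    const tick = local.oldestTick + i * local.interval;
    if (local.stateHashes[i] === remote.stateHashes[i]) {
      lastAgreed = tick;
      continue;
    }
    // A mismatch is only a desync once no relevant uncertainty remains.
    const verdict = hasUncertainty(lastAgreed, tick) ? "wait" : "desync";
    return { verdict, lastAgreed, divergeTick: tick };
  }
  return { verdict: "agree", lastAgreed };
}
```

A `"wait"` verdict is expected to resolve into `"agree"` or `"desync"` on a later comparison, once the outstanding inputs have been confirmed.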

### Acceptance / Grace Windows

Tunable parameters (no hard-coded numbers in the protocol — the 200ms/300ms in the design doc were illustrative):

- **Acceptance window** — peers directly accept late inputs up to N1 ms into the past
- **Grace window** — peers accept relayed already-accepted inputs for up to N2 ms (N2 > N1)
- **Finalization** — ticks older than the grace window become immutable; their snapshots, query logs, and sparse input entries become eligible for collection
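
A sketch of the three rules as pure functions over millisecond timestamps. `makeWindowConfig`, `classifyLateInput`, and `isFinalized` are hypothetical names for this example and are not the `AcceptanceWindows.js` API.

```js
// Validated, frozen window configuration. N1 is acceptanceMs, N2 is graceMs.
function makeWindowConfig(acceptanceMs, graceMs) {
  if (!(acceptanceMs > 0) || !(graceMs > acceptanceMs)) {
    throw new RangeError("expected 0 < acceptanceMs < graceMs");
  }
  return Object.freeze({ acceptanceMs, graceMs });
}

// relayed: true when another peer already accepted this input and forwarded it.
function classifyLateInput(config, nowMs, inputMs, relayed) {
  const ageMs = nowMs - inputMs;
  if (ageMs <= config.acceptanceMs) return "accept";
  if (ageMs <= config.graceMs) return relayed ? "accept" : "reject";
  return "reject";
}

// A tick becomes immutable once it is older than the grace window.
function isFinalized(config, nowMs, tickTimeMs) {
  return nowMs - tickTimeMs > config.graceMs;
}
```

For example, with `makeWindowConfig(200, 300)` an input that is 250 ms old is rejected when it arrives directly but accepted when it arrives as a relay.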

### Disconnects As Simulation Events

Transport reports "peer unreachable"; simulation decides "disconnect becomes canonical at tick X". The game tick can ask `queryDisconnected(playerId)` alongside `query()`. Each peer computes the disconnect tick locally and deterministically as `lastAttendanceTick + timeoutTicks`, where `lastAttendanceTick` is grow-only-max (`DECISIONS.md` #29). Peers that heard different last beats converge sparsely (`DECISIONS.md` #30) by **forwarding beats**: this is grow-only-max gossip, in which a node re-broadcasts a beat when it advances its local max for a relevant player, and a lost beat self-heals via the next message. B5-desync → B8 severe-desync recovery is the fallback. (Supersedes the retired #17.)

> **SUPERSEDED 2026-06-05.** The original convergence mechanism here was a relevance-gated *pull-on-suspicion probe* (the never-forward-attendance rule), built as **B7.1** (`DisconnectProbe.js`). It has been **DELETED**: code and tests removed, engine unwired. A probe rode the same reliable transport as an attendance beat but cost a 2-trip request/response, so it could never rescue a case that 1-trip reliable gossip cannot. Because it was one-shot per `(playerId, tickY)`, a single dropped correction caused a false disconnect. Convergence is now beat forwarding; the B8 last-attendance-tick fallback (`mergeLastAttendanceTicks`) is retained as the backstop. See `DESIGN_PARTICIPATION.md` §6.1 (replacement) + §6.2 (reductio).
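
The grow-only-max rule for the disconnect tick can be sketched as below. `makeDisconnectTrackerSketch` is a hypothetical name and not the `DisconnectTracker.js` API; the boolean returned by `observeBeat` marks the case in which a beat would be forwarded.

```js
function makeDisconnectTrackerSketch(timeoutTicks) {
  const lastAttendanceTick = new Map(); // playerId -> highest beat tick seen

  // Returns true only when the local max advanced (the forwarding trigger).
  function observeBeat(playerId, beatTick) {
    const previous = lastAttendanceTick.has(playerId)
      ? lastAttendanceTick.get(playerId)
      : -Infinity;
    if (beatTick <= previous) return false; // stale or duplicate beat
    lastAttendanceTick.set(playerId, beatTick);
    return true;
  }

  // Canonical disconnect tick, or null if no beat was ever heard.
  function disconnectTick(playerId) {
    if (!lastAttendanceTick.has(playerId)) return null;
    return lastAttendanceTick.get(playerId) + timeoutTicks;
  }

  function isDisconnectedAt(playerId, tick) {
    const at = disconnectTick(playerId);
    return at !== null && tick >= at;
  }

  return { observeBeat, disconnectTick, isDisconnectedAt };
}
```

Because the stored value only ever grows, two peers that receive the same beats in different orders end up with the same disconnect tick.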

### Bootstrap For Joining Peers

A new peer joining at tick F needs:

1. The simulation state at the grace-window edge (e.g. tick F − grace)
2. Every known sparse input change since that tick
3. The current per-participant input state

To distribute load, the peer serving the bootstrap is selected **randomly** among eligible peers (not always the same peer). The receiving peer enters an explicit **catching-up state** while it re-simulates the grace window forward — this state must be visible to Layer 3 so the game can show appropriate UI ("connecting…", "synchronizing…"). The exact API for surfacing this state is open (see `KNOWN_ISSUES.md`).
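
A sketch of the joiner-side choice and of the three-part payload. `selectServingPeerSketch` and `makeBootstrapPayloadSketch` are hypothetical names (not the `Bootstrap.js` API), and `random` is an injected function returning a number in `[0, 1)`.

```js
// Uniform pick among eligible peers. Sorting makes the result depend only on
// the peer set and the random draw, not on discovery order.
function selectServingPeerSketch(eligiblePeerIds, random) {
  if (eligiblePeerIds.length === 0) return null;
  const sorted = [...eligiblePeerIds].sort();
  // Clamp so that a draw of exactly 1 cannot index past the end.
  const index = Math.min(sorted.length - 1, Math.floor(random() * sorted.length));
  return sorted[index];
}

// The three pieces listed above, gathered into one payload.
function makeBootstrapPayloadSketch(edgeTick, snapshot, sparseLog, inputState) {
  return {
    edgeTick,
    snapshot, // 1. simulation state at the grace-window edge
    inputsSinceEdge: sparseLog.filter((entry) => entry.tick > edgeTick), // 2.
    inputState, // 3. per-participant input state
  };
}
```

The joiner would then restore `snapshot`, replay `inputsSinceEdge` on top of `inputState`, and report the catching-up state to Layer 3 until it reaches the present.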

### Bus-Based Rollback (Forthcoming)

A rollback from the present tick N back to an earlier tick M is potentially expensive (N − M ticks of re-simulation). Currently this freezes the visible simulation while the re-simulation runs. The bus model decouples the two:

- A **bus** is a re-simulation worker traveling from a starting tick forward
- The **visible simulation** keeps running on the pre-rollback state ("known-false but smooth")
- When a bus catches up to the present, it replaces the visible state
- New rollbacks spawn additional buses without stopping prior ones; the system converges via whichever bus reaches the present
- Trade-off: visible simulation is "wrong" for longer, but never freezes. Since the pre-rollback state was already false from a correctness standpoint, this is a UX win.

Open (Phase C+): parallel vs time-sliced execution, max concurrent buses, min tick gap between buses, selection policy when multiple converge.

**Implication for Phase B:** simulation state must remain cleanly snapshottable, and the "displayed state" must stay decoupled from the "authoritative latest state". This constraint is captured in `DECISIONS.md` #21 and must be honored throughout Phase B.
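
Since the bus subsystem is not designed yet, the following is only a speculative sketch of one of the open options (time-sliced execution, first bus to arrive wins). `makeBus`, `runFrame`, and `step(state, tick)`, which is assumed to return the state at `tick + 1`, are all hypothetical.

```js
// A bus re-simulates forward from a corrected past state.
function makeBus(startTick, startState, step) {
  let tick = startTick;
  let state = startState;
  return {
    // Advance at most `budget` ticks, never past `presentTick`.
    // Returns true once the bus has caught up to the present.
    advance(budget, presentTick) {
      for (let i = 0; i < budget && tick < presentTick; i++) {
        state = step(state, tick);
        tick += 1;
      }
      return tick >= presentTick;
    },
    current() {
      return { tick, state };
    },
  };
}

// One visible frame: the displayed timeline keeps moving, each bus gets a
// slice of work, and the first bus to reach the present replaces the display.
function runFrame(visible, buses, step, budget) {
  visible.state = step(visible.state, visible.tick);
  visible.tick += 1;
  for (const bus of buses) {
    if (bus.advance(budget, visible.tick)) {
      visible.state = bus.current().state;
      return true; // assumption: first arrival wins
    }
  }
  return false;
}
```

The sketch keeps the displayed state (`visible`) separate from the state each bus carries, which is the decoupling that the Phase B implication asks for.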

### Authority

Authority is not a permanent master; it's a tie-break mechanism for convergence. Default: older simulations preferred, lower IDs break ties. A peer that fell drastically behind (e.g. several seconds) resets its simulation-age for authority comparison, so stale isolated peers can't dominate after rejoining.

**Core done (B8):** pure `Recovery.js`. `compareAuthority` is a deterministic total order over `(simulationAge, peerId)` (older wins, lower id breaks ties, never an undecided tie for distinct peers), so peers reach the same verdict without a consensus round. On a B5 desync, `resolveDesync` makes the loser ADOPT the winner's full-state transfer and the winner SERVE it; adopting also adopts the winner's provenance, so a connected component converges monotonically to one history. The transfer (`makeStateTransfer`, opaque snapshot by reference) carries per-participant `lastAttendanceTicks`, folded grow-only-max via `mergeLastAttendanceTicks`. This is the DECISIONS #30 slow-path fallback that reconciles a disconnect-tick disagreement when beat forwarding has not converged (it originally covered the since-deleted B7.1 probe missing its window). Lagging-peer wake: `shouldResetSimulationAge` detects a peer at least a threshold number of ticks behind and drops its age to 0 so it yields instead of dominating. The module is decisionless about when to compare and about the re-simulation itself; in-engine wiring (the v1 engine has no authority model) rides B9.
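
The ordering itself is small enough to sketch. The functions below are illustrations with hypothetical names and are not the code in `Recovery.js`; peers are assumed to be plain objects with `simulationAge`, `peerId`, and `currentTick` fields.

```js
// Total order: older simulation first, then lower peerId.
// Returns a negative number when `a` has authority over `b`.
function compareAuthoritySketch(a, b) {
  if (a.simulationAge !== b.simulationAge) {
    return a.simulationAge > b.simulationAge ? -1 : 1;
  }
  if (a.peerId === b.peerId) return 0;
  return a.peerId < b.peerId ? -1 : 1;
}

// Every peer sorts the same set the same way, so no consensus round is needed.
function pickAuthority(peers) {
  return [...peers].sort(compareAuthoritySketch)[0];
}

// A peer that fell far behind competes with age 0, so it yields.
function effectiveAge(peer, networkTick, thresholdTicks) {
  const behind = networkTick - peer.currentTick;
  return behind >= thresholdTicks ? 0 : peer.simulationAge;
}
```

Because the comparison never returns 0 for two distinct peer IDs, every peer that evaluates the same set arrives at the same winner.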

---

## Layer 1 — Transport Interface

See `TRANSPORT_SPEC.md` for the full contract. Summary:

- `send(peerId, message)` — unreliable, may drop / duplicate / reorder
- `broadcast(message)` — to all currently-connected peers
- `onMessage(callback)` — receive callback
- `onPeerJoined(callback)` / `onPeerLeft(callback)` — transport-level liveness
- `getPeers()` — current peer list
- `localId` — opaque identifier
- `clockHint()` — optional latency / RTT hint for clock sync
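
To show how little Layer 2 needs from this surface, here is a sketch of a synchronous in-process transport. It is not the `MemoryTransport` implementation: `makeLoopbackNetwork`, the `leave()` helper, and the `(fromPeerId, message)` callback arguments are assumptions for this example; `TRANSPORT_SPEC.md` defines the real contract.

```js
function makeLoopbackNetwork() {
  const nodes = new Map(); // peerId -> { handlers }

  function join(localId) {
    const handlers = { message: [], joined: [], left: [] };
    const transport = {
      localId,
      send(peerId, message) {
        const peer = nodes.get(peerId);
        if (!peer) return; // unknown peer: the message is dropped
        for (const callback of peer.handlers.message) callback(localId, message);
      },
      broadcast(message) {
        for (const peerId of transport.getPeers()) transport.send(peerId, message);
      },
      onMessage(callback) { handlers.message.push(callback); },
      onPeerJoined(callback) { handlers.joined.push(callback); },
      onPeerLeft(callback) { handlers.left.push(callback); },
      getPeers() {
        return [...nodes.keys()].filter((id) => id !== localId);
      },
      clockHint() { return null; }, // no latency information in-process
      leave() { // hypothetical helper, not part of the summary above
        nodes.delete(localId);
        for (const peer of nodes.values()) {
          for (const callback of peer.handlers.left) callback(localId);
        }
      },
    };
    for (const peer of nodes.values()) {
      for (const callback of peer.handlers.joined) callback(localId);
    }
    nodes.set(localId, { handlers });
    return transport;
  }

  return { join };
}
```

Nothing in the sketch refers to game state or rollback, which is the property that keeps Layer 1 replaceable.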

Implementations under `transports/`:

- `MemoryTransport` — in-process, deterministic, with controllable latency / loss / partition; the test substrate of record
- `TrysteroTransport` — real Trystero/WebTorrent P2P (Goal C1). Trystero binding (`{ joinRoom, selfId }`) is INJECTED via the constructor; the async `createTrysteroTransport()` factory is the single place that touches the remote URL (deferred dynamic `import()`). Reliable sends map onto a distinct trystero action label (`TRYSTERO_LABELS`). The adapter passes the full 12-point conformance suite over `FakeTrysteroNetwork` (Node-testable); real-WebRTC behavior rides Goal C4.
- (Future) `ServerTransport` — single authoritative hub

---

## Tick Lifecycle (Layer 2, target)

```
1. Get local intent: intent = getLocalInputs(localGameState)
   - if null: participant is passive this tick; skip 2–3
2. If intent differs from last sent: ship change-only packet
3. Decide remote intents (sparse reconstruction + prediction)
4. tick(state, query) — game logic runs; queries get logged
5. State hash + add to rolling window
6. Process inbound:
   a. Late inputs in acceptance window → check queries → rollback only if any query result changes
   b. Relayed inputs in grace window → same
   c. Hash window from peer → compare with uncertainty awareness
7. Periodic: broadcast hash window
8. Periodic: finalize ticks past grace window; collect old data
```

---

## Current State Mapping (v1 → v2)

| v2 Concept | v1 Implementation | Migration |
|---|---|---|
| Transport interface (Layer 1) | `transports/Transport.js` + `transports/TrysteroTransport.js`; `WorldNetworkCommunicator` is now Layer-2 glue over an injected Transport | **Done (A5 + C1):** interface extracted, Trystero is one implementation, clock glue stays in WNC. **C1 (2026-05-29):** trystero binding injected + `createTrysteroTransport()` factory isolates the remote URL; pre-connect-throw / unknown-peer-no-op / reliable-channel gaps fixed; full 12-point conformance runs against the real class over `FakeTrysteroNetwork` (Vitest 17 + Node-12 selftest 16); real-WebRTC subset rides C4 |
| Sparse change-only inputs | Per-tick inputs (`SendInputOverNetwork`); new protocol lives in `SparseInput.js` (encoder/decoder) | **Core done (B1):** pure module + harness node tested; in-engine `RollbackNetcode` finalization migration staged behind B2/B9 |
| Silence = unchanged | Silence interpreted ambiguously | **Done (B2):** liveness is a dedicated transport heartbeat (`transports/HeartbeatLiveness.js`), fully separate from input silence |
| `getLocalInputs(localState)` | `defineInput(name, sampler)` per field | **Core done (B3):** `LocalIntent.js` (`LocalIntentSource` + `fromFieldSamplers` shim); public `EasyMultiplayer.getLocalInputs(fn)` wired into the facade with precedence over `defineInput`. `null`=passive is first-class; pre-history baseline is `null`. Sparse-send/passive-rollback-pressure rides the B9 input-path migration. |
| Query API (predicate, ctx-frozen) | `Query(playerId, predicate)` (closure over live state); new `query(playerId, ctx, predicate)` core lives in `QueryContext.js` (`QueryLog` + `cloneCtx`/`ctxEqual`) | **Core done (B4):** pure module proves debug-mode freeze + mutation-detection and production zero-clone; re-eval (`recheck`) is decisionless. Wiring the 3-arg signature into v1 `RollbackNetcode.Query` + `_checkQueriesForRollback` rides the B5/B9 rollback-and-finalization migration |
| Hash window broadcasts | Per-tick hash exchange (`GetStateHashDataToSend`); new `{oldestTick, interval, stateHashes[], usedInputs[]}` window built by `HashWindow.js` (`HashWindowBuilder`) | **Core done (B5):** positional checkpoint-grid window + builder; wiring into v1's per-tick `ReceiveStateHash` rides B6/B9 |
| Uncertainty-aware desync | "Hash mismatch = desync" (eager `CheckStateHashes`, coarse `IsSyncedState` gate) | **Core done (B5):** `compareHashWindows` → agree/wait/desync/incomparable; mismatch + unresolved relevant uncertainty in `(lastAgreed, diverge]` → wait, else desync (bounded-time flip once inputs confirm). Decisionless about recovery |
| Acceptance / grace windows | Implicit, mixed with rollback logic; new tunable `WindowConfig` + pure `classifyInput` live in `AcceptanceWindows.js` | **Core done (B6):** frozen/validated config (defaults 200/300ms, 20-tick hash interval, 500ms attendance) + per-arrival classify (acceptance accepts raw/relayed; grace accepts relayed-only; beyond → reject + maybe-recover). In-engine wiring rides B8/B9 |
| `queryDisconnected` | `HandleDisconnect` procedural; new pure core in `DisconnectTracker.js` | **Core done (B7):** canonical disconnect tick = `lastAttendanceTick + timeoutTicks` (deterministic from the shared stamp, not local detection time — resolves S-005-06); monotonic forward-only, late beat retroactively un-disconnects `[old,new)` and triggers a disconnect-conditional rollback only if already simulated past `old`. Decisionless. Sparse cross-network convergence (peers that heard different last beats) = beat FORWARDING (grow-only-max gossip) + B8-recovery fallback (`mergeLastAttendanceTicks`). **(The #30 pull-on-suspicion probe / B7.1 `DisconnectProbe.js` is SUPERSEDED + DELETED 2026-06-05 — see §"Disconnects As Simulation Events" above and `DESIGN_PARTICIPATION.md` §6.2.)** |
| Authority + severe-desync recovery | No authority model; no state-challenge flow | **Core done (B8):** pure `Recovery.js` — `compareAuthority` total order (older sim wins, lower id ties), `resolveDesync` (loser adopts / winner serves; provenance travels with the adopted history → monotonic convergence), `shouldResetSimulationAge`/`resetSimulationAge` (lagging-peer wake), `makeStateTransfer` carrying `lastAttendanceTicks` + `mergeLastAttendanceTicks` grow-only-max (#30 slow-path fallback). Decisionless about trigger/re-sim; in-engine wiring rides B9 |
| Tick finalization + memory bounding | Per-tick data retained implicitly; finalization tied to confirmed-tick assumptions | **Core done (B9):** pure `Finalization.js` — `TickFinalizer` horizon = `maxCurrentTick - graceWindowTicks`, GROW-ONLY-MAX (recovery backward-jump can't un-finalize); two payload-agnostic GC policies — `collectAnchored` (carry-forward: snapshots, last-input-per-participant — keep latest entry ≤ horizon + everything after) and `collectBelow` (no-carry-forward: query logs — drop everything strictly below). Proven by a retention PLATEAU vs a GC-disabled growing CONTROL. The engine actually releasing retained data on the finalization sweep rides B-Integrate |
| Bus rollback | Synchronous rollback freezes simulation | New subsystem (Phase C+) |
| Bootstrap on join | `RequestFullState` | **Core done (B10):** pure `Bootstrap.js` — `selectServingPeer` (joiner-side uniform pick, injected seeded rng, clamped index → uniform serving-peer histogram), `eligibleServers` (LIVE-only), `makeBootstrapPayload`/`validateBootstrapPayload` (DECISIONS #18 three-piece: snapshot @ grace-window edge by reference + since-edge sparse log [strictly-after-edge] + per-participant baseline AT the edge), `reconstructInputs` (per-participant `SparseInputDecoder`, baseline-seeded, log-only=null passive), monotonic Layer-3-visible `CatchUpTracker`. Single reliable point-to-point request, never a broadcast (sparseness). Proven by uniform-distribution (chi² < 13.28) + sparse-contact + catching-up-lifecycle + re-sim-fidelity. WHEN-to-request/transport-send/re-sim loop ride B-Integrate |

**In-engine wiring — DONE (B-Integrate, 2026-05-29).** Every "rides B9 / rides B-Integrate" note above is now resolved in `SimulationEngine.js`, a NEW engine (DECISIONS #32) that composes all ten cores over an injected Transport (A5) + injected clock, driven by a manual tick (the KNOWN_ISSUES #7 seam). It is NOT a `RollbackNetcode`/`EasyMultiplayer` retrofit — the v1 files are untouched — and it implements the A3 SimulationNode contract so it runs under `PeerHarness` directly (`nodeFactory = (transport, opts) => new SimulationEngine(transport, opts)`). Validated across five test-gated layers (L1 sparse-input, L2 sim/rollback/query-freeze/hash, L3 acceptance/disconnect/probe/recovery, L4 finalization-GC/bootstrap, L5 end-to-end), with `tests/engine-l1..l4`, `tests/engine-integration.test.js`, and a Node-12 `selftest-b-integrate.mjs`. Recovery's `usedInputs`/"wait"-tier gating is not yet populated (the engine uses a sound, more conservative finalized-divergence gate instead) and three real-transport bootstrap gaps remain — both deferred to Phase C (KNOWN_ISSUES #4, #8).

Existing code that survives roughly unchanged: `imurmurhash.js`, `Utils.js`, `EventSystem.js`, `PresentationHints.js`, `SyncedClock.js`, and the high-level shape of `EasyMultiplayer.js`. Optional modules (`SoundManager`, `VoiceChat`, `PlayerVisualization`) are independent of the protocol changes.

---

## Defaults (current v1 — for reference)

- Tick rate: 30 Hz (RollbackNetcode), 2 Hz (SyncedScene)
- Input delay: 4 ticks
- Broadcast rate: 10 Hz
- Clock sync: 3+ samples required, 2s smooth shift
- Max rollback history: 4–2000 ticks

v2 windows (acceptance, grace, hash interval, attendance) will be tunable per-game with documented defaults.
