# Background Join — unifying cold-join and resync into one assemble-then-switch process

Status: SPEC (agreed 2026-06-02). Supersedes the in-place `_abandonAndResync` / `seedToNow` /
Phase-1-freeze / Phase-2-coast machinery for the synced-tick path.

## 1. The core mental model

> Cold join does not exist. It is just resync.

A node opens the page and immediately starts running **its own** simulation — founds its own frame
at age 0, goes live, ticks. It makes **no** assumptions that other peers or more-senior frames
exist. Soon it exchanges messages with peers and learns a game is already going on.

When a node A learns of a more-senior frame B (via attendance/assert), **nothing about A's live
simulation changes**. A keeps its own clock, its own tick, its own state, and keeps broadcasting on
its own frame exactly as before. The *only* new thing: A opens a **background** effort to assemble
B's frame.

That background effort gathers, **separately from A's live state**:

1. a **second clock** (its own `SyncedPeerClock`) syncing onto B's frame;
2. B's **state snapshot** (requested point-to-point from the peer through which A learned of B); and
3. from the moment the join opens, a **buffer of inbound messages** (inputs etc.) stamped for B's
   frame, so that none are lost between snapshot capture and the switch.

The switch to B happens in **one instant**, only once BOTH (1) the second clock has *settled*
(`isStable()`) AND (2) the snapshot has arrived. At that instant A atomically replaces its clock,
epoch, authority, state, and decoders with the assembled B frame and catches its sim up to B's
present. Because the second clock is already settled, there is **no post-switch convergence period
and therefore no freeze** — the thing that all the old seed/coast/clamp complexity existed to mask.
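
The three gathered parts and the two-condition switch gate can be sketched as a data shape plus a predicate. This is a minimal illustration, not the engine's actual types: `JoinClock`, `FrameSnapshot`, `BufferedInput`, `PendingJoin`'s field names, and `readyToCommit` are hypothetical; only `isStable()` and the clock/snapshot/buffer split come from the text above.

```typescript
// Hypothetical shapes for illustration. Only `isStable()` and the three gathered
// parts (second clock, snapshot, input buffer) are taken from the spec.
interface JoinClock {
  isStable(): boolean; // true once the offset sits in the settled band
  now(): number;       // current time on the target frame, in ms
}

interface FrameSnapshot<S> {
  tick: number; // tick at which the state was captured
  state: S;
}

interface BufferedInput {
  playerId: string;
  tick: number;
  intent: unknown;
}

interface PendingJoin<S> {
  targetEpoch: number;               // assumed to be a timestamp in ms
  targetPeer: string;                // peer the snapshot is requested from
  secondClock: JoinClock;            // built separately from the live clock
  snapshot: FrameSnapshot<S> | null; // null until the point-to-point reply arrives
  buffer: BufferedInput[];           // target-frame inputs seen since the join opened
}

// The switch may fire only when BOTH parts are in hand.
function readyToCommit<S>(join: PendingJoin<S>): boolean {
  return join.snapshot !== null && join.secondClock.isStable();
}
```

Because nothing in `PendingJoin` aliases the live clock or live state, discarding the object is the whole failure path.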

If A can never gather what it needs (B's peer becomes unreachable, or the request is lost too many
times), A **never switches** and simply stays on its own frame. Nothing was torn down, so there is
nothing to recover.

## 2. The PendingJoin state machine

A node holds at most one `PendingJoin` at a time.

```
            hear senior frame X (election says X outranks live frame)
   none ─────────────────────────────────────────────────────────────▶ GATHERING(X)
                                                                            │
   GATHERING(X):                                                            │
     • second clock syncing to X                                           │
     • request state from targetPeer(X); retry on a deadline               │
     • buffer inbound X-frame inputs                                       │
                                                                            │
     hear Y strictly-more-senior than X  ──▶ discard X, open GATHERING(Y)  │
     request deadline exceeded N times   ──▶ discard, back to `none`       │
     live frame itself becomes >= X      ──▶ discard, back to `none`       │
                                                                            ▼
     snapshot received AND secondClock.isStable()  ───────────▶  COMMIT (atomic) ──▶ none
```

States: `none` and `GATHERING`. `COMMIT` is an instantaneous transition, not a resting state.

### Transitions in detail

- **Open (`none → GATHERING`)**: on an attendance/assert whose frame strictly outranks our live frame
  under the *existing* election (`compareFrameRank`: support-weighted, maturity-gated, age/id
  tiebreak). The election rule is UNCHANGED; only the COMMIT mechanism changes. `targetPeer` = the
  peer the beat came from (a first-hand supporter of the target epoch, recorded in `_frameSupport`).
- **Re-target (`GATHERING(X) → GATHERING(Y)`)**: on hearing a frame Y that strictly outranks the
  current target X. Discard X's second clock + buffer wholesale; start fresh for Y. Always chase the
  most-senior-seen (no incremental climb).
- **Timeout/fail (`GATHERING → none`)**: the state request is point-to-point and may be lost. Track
  attempts; re-request on a deadline measured on the LIVE clock (always healthy). After `N`
  attempts with no snapshot, discard the PendingJoin. The next qualifying attendance re-opens it:
  attendance IS the retrigger, so there is no private retry timer beyond the in-gather deadline.
- **Obsolete (`GATHERING → none`)**: if our own live frame comes to rank at or above the target
  (e.g. the target group died and its support collapsed), drop the join.
- **Commit (`GATHERING → none`, atomic)**: when snapshot present AND `secondClock.isStable()`.
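
The non-commit transitions above can be sketched as a pure step function. This is an illustration under loud assumptions: ranks are modelled as plain numbers (higher means more senior) standing in for whatever `compareFrameRank` orders, `MAX_ATTEMPTS` is an arbitrary value for `N`, and `JoinState`, `JoinEvent`, and `step` are hypothetical names. Commit is left out because it is an atomic action rather than a resting state.

```typescript
// Stand-in for the ordering produced by compareFrameRank: higher = more senior.
type Rank = number;

interface Gathering {
  kind: "gathering";
  target: string;     // frame being assembled
  targetRank: Rank;
  targetPeer: string; // peer the beat came from
  attempts: number;   // state requests sent so far
}
type JoinState = { kind: "none" } | Gathering;

type JoinEvent =
  | { kind: "heardFrame"; frame: string; rank: Rank; fromPeer: string }
  | { kind: "requestDeadline" }  // live-clock deadline passed with no snapshot
  | { kind: "liveRankChanged" }; // re-check the target against our live frame

const MAX_ATTEMPTS = 3; // "N" in the spec; the value here is an assumption

function step(state: JoinState, liveRank: Rank, ev: JoinEvent): JoinState {
  if (ev.kind === "heardFrame") {
    // Open: must strictly outrank live. Re-target: must also strictly outrank X.
    const bar = state.kind === "gathering" ? Math.max(liveRank, state.targetRank) : liveRank;
    if (ev.rank <= bar) return state;
    // A fresh object models "discard X wholesale; start fresh for Y".
    return {
      kind: "gathering", target: ev.frame, targetRank: ev.rank,
      targetPeer: ev.fromPeer, attempts: 1,
    };
  }
  if (state.kind === "none") return state;
  if (ev.kind === "requestDeadline") {
    return state.attempts >= MAX_ATTEMPTS
      ? { kind: "none" }
      : { ...state, attempts: state.attempts + 1 };
  }
  // liveRankChanged: obsolete once the live frame ranks at or above the target.
  return liveRank >= state.targetRank ? { kind: "none" } : state;
}
```

Keeping the function pure makes each transition directly testable on the deterministic harness, without a transport or a clock.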

### Commit (the atomic switch)

In one synchronous block:

1. `_clock ← secondClock` (already stable on the target frame).
2. `_epoch ← targetEpoch`; `_frameAuthority ← targetAuthority`.
3. `_state ← clone(snapshot.state)`.
4. `_decoders ← reconstructInputs(snapshot)` (target frame's participants), then **re-overlay our
   own decoder** (keep our self-authored inputs, as `_adoptTransfer` does today) and
   `_intentSource.reset()` so our first post-switch sample re-announces us on the new frame.
5. **Replay buffered inputs** with `tick > snapshot.tick` into the decoders (drop `<= snapshot.tick`;
   the snapshot already finalized those).
6. `_tick ← snapshot.tick`; restart snapshot/hash/finalizer lineage; merge attendance ticks.
7. Catch up: `present = floor((secondClock.now() - targetEpoch)/tickMs)`; re-simulate
   `snapshot.tick → present`. Trustworthy immediately because the clock is settled.
8. `pendingJoin = null`. Continue normal `_advanceTick` on the new frame.
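
Steps 5 and 7 carry the only arithmetic in the commit, so a small worked sketch may help. The helper names (`inputsToReplay`, `presentTick`, `catchUp`) and the `advance` step function are hypothetical, and the sketch assumes that re-simulating `snapshot.tick → present` means applying ticks `snapshot.tick + 1` through `present` inclusive.

```typescript
interface BufferedInput {
  playerId: string;
  tick: number;
  intent: string;
}

// Step 5: keep only inputs newer than the snapshot; it already finalized the rest.
function inputsToReplay(buffer: BufferedInput[], snapshotTick: number): BufferedInput[] {
  return buffer.filter((m) => m.tick > snapshotTick);
}

// Step 7: the target frame's present tick, read off the settled second clock.
function presentTick(nowOnTargetMs: number, targetEpochMs: number, tickMs: number): number {
  return Math.floor((nowOnTargetMs - targetEpochMs) / tickMs);
}

// Step 7, continued: re-simulate snapshot.tick -> present. `advance` is a
// hypothetical pure step function; ticks snapshot.tick+1..present are applied.
function catchUp<S>(
  state: S,
  snapshotTick: number,
  present: number,
  advance: (s: S, tick: number) => S,
): S {
  let s = state;
  for (let t = snapshotTick + 1; t <= present; t++) s = advance(s, t);
  return s;
}

// Worked example: snapshot at tick 10, target epoch 1000 ms, 50 ms ticks,
// and the second clock reads 1730 ms at the instant of the switch.
const buffered: BufferedInput[] = [
  { playerId: "p1", tick: 9, intent: "left" },  // dropped: <= snapshot.tick
  { playerId: "p1", tick: 10, intent: "left" }, // dropped: <= snapshot.tick
  { playerId: "p2", tick: 12, intent: "jump" }, // replayed
];
const replay = inputsToReplay(buffered, 10);     // 1 input survives
const present = presentTick(1730, 1000, 50);     // floor(730 / 50) = 14
const ticksRun = catchUp(0, 10, present, (count) => count + 1); // ticks 11..14 -> 4
```

The strict `>` in `inputsToReplay` is what protects the snapshot baseline from inputs at or before `snapshot.tick`.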

## 3. What gets DELETED (now-unnecessary systems)

The whole apparatus that existed to survive the window between *committing to a frame* and *having
its data/clock* goes away:

- `SimulationEngine._abandonAndResync()` — replaced entirely by opening a PendingJoin.
- `SimulationEngine._onBoot` clock-wait branch + `_awaitClockReady` + `_pendingPresentTick` — the
  switch only happens when the second clock is already stable, so there is no "adopt then wait for
  clock" state.
- `_advanceTick` synced-tick guards: the `_abandonAndResync` large-gap trigger, the
  `target - _maxSeenTick` convergence HOLD, and the not-yet-converged deferral. The live primary
  clock is never dragged onto a foreign epoch, so its derived target never runs away.
- `SyncedPeerClock.seedToNow()` + `_coarseSeed` + the coarse-sample gate in `_addSample`. No one
  seeds the live clock to a guessed present any more.
- `SyncedPeerClock.resetSync()` — we never reset the live clock; we swap in a separately-built one.
- Phase-1 bounded-freeze / Phase-2 coast (task #22 disappears rather than being implemented).
- The bootstrap **decline-retry** path for resync becomes moot: the target peer is a known supporter
  of the target epoch and holds the state, so it serves rather than declines. (Cold-start co-founder
  declines may still exist in count-based mode; see §5.)

KEEP (still load-bearing):
- `compareFrameRank` / `_frameSupport` / maturity gate — still decide *which* frame to join.
- The seniority gate / owner gate in `SyncedPeerClock._addSample` — a settled frame's clock must
  still refuse to follow strictly-junior peers' pings.
- `reconstructInputs`, `makeStateTransfer`/bootstrap payload shape, finalized-floor own-input
  re-overlay.
- The genuinely-alone woken-tab re-anchor (self-owned, no corroborating network): keep the tick
  continuous locally; it will background-join the network once it hears it.

## 4. Formerly open design points, now settled

1. Commit gate = `isStable()` (settled band), not merely `isReady()` — committing onto a slewing
   offset would re-introduce a post-switch freeze.
2. Wire `MSG_INPUT` carries the **epoch** so the buffer can tell which frame an input belongs to.
   (It already carries `playerId, tick, intent`; add `epoch`.)
3. Failure = discard the PendingJoin; next attendance re-triggers.
4. Re-target = always jump to the most-senior frame seen.
5. Founding race = on an age tie the election prefers the lower id, so exactly one side opens a
   join; the other stays put and is joined.
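
Point 2 is what lets a receiver sort inbound inputs by frame. A minimal sketch of that routing, assuming epochs are comparable numbers; `MsgInput` as a type, `InputRoute`, and `routeInput` are hypothetical names, and ignoring inputs for any third frame is an assumption rather than something the spec states:

```typescript
interface MsgInput {
  epoch: number; // new field: the frame this input was stamped for
  playerId: string;
  tick: number;
  intent: unknown;
}

type InputRoute = "live" | "join-buffer" | "ignore";

function routeInput(
  msg: MsgInput,
  liveEpoch: number,
  joinTargetEpoch: number | null, // null when no PendingJoin is open
): InputRoute {
  if (msg.epoch === liveEpoch) return "live";
  if (joinTargetEpoch !== null && msg.epoch === joinTargetEpoch) return "join-buffer";
  return "ignore"; // assumption: inputs for any other frame are not kept
}
```

Without the `epoch` field, an input for tick 12 on B's frame would be indistinguishable from an input for tick 12 on A's own frame.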

## 4a. Two facts forced by "engine owns two clocks on one transport"

Discovered during wiring (the transport has a single clock-sync handler):

- **Clock-sync channel MUX.** Each `SyncedPeerClock` both sends and receives ping/pong on the
  transport's single clock-sync channel. Two clocks would clobber each other's handler and process
  each other's pongs. So the ENGINE owns the channel: it attaches one real handler and gives each
  clock a per-channel transport shim that tags outbound payloads (`ch:'live'|'join'`) and registers
  the clock's receiver under that tag; inbound payloads dispatch by tag.
- **The live clock must NOT slew to seniors.** Today a junior peer's clock follows senior peers —
  that is the old merge-by-slew and the very freeze we are deleting. In the new model the live frame
  must hold steady until the atomic switch, so the LIVE clock follows only same-frame (equal-rank)
  peers (new `followSeniors:false` mode). The JOIN clock keeps the existing follow-equal-and-senior
  behavior, with its `localRank` set to the TARGET authority so it converges onto the senior frame
  and ignores our own (now-junior) live group.
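
Both facts can be sketched in a few lines. The transport surface here (`SyncTransport` with `sendClockSync` / `onClockSync`), the `ClockSyncMux` class, and the shim shape are hypothetical stand-ins for the real transport API; only the `ch:'live'|'join'` tag and the `followSeniors` idea come from the text above. Ranks are again modelled as plain numbers, higher meaning more senior.

```typescript
type ClockChannel = "live" | "join";
interface TaggedPayload { ch: ClockChannel; body: unknown }
type SyncHandler = (from: string, payload: TaggedPayload) => void;
type ClockReceiver = (from: string, body: unknown) => void;

// Hypothetical transport surface with a single clock-sync handler slot.
interface SyncTransport {
  sendClockSync(to: string, payload: TaggedPayload): void;
  onClockSync(handler: SyncHandler): void;
}

// What each clock sees instead of the real transport.
interface ClockShim {
  send(to: string, body: unknown): void;
  onReceive(receiver: ClockReceiver): void;
}

class ClockSyncMux {
  private readonly transport: SyncTransport;
  private readonly receivers = new Map<ClockChannel, ClockReceiver>();

  constructor(transport: SyncTransport) {
    this.transport = transport;
    // The engine attaches the one real handler and dispatches by tag.
    transport.onClockSync((from, payload) => {
      const receiver = this.receivers.get(payload.ch);
      if (receiver) receiver(from, payload.body);
    });
  }

  shim(ch: ClockChannel): ClockShim {
    return {
      send: (to, body) => this.transport.sendClockSync(to, { ch, body }),
      onReceive: (receiver) => { this.receivers.set(ch, receiver); },
    };
  }
}

// Follow rule, with numeric ranks as a stand-in (higher = more senior).
// Live clock: followSeniors = false. Join clock: followSeniors = true, with
// localRank set to the target authority.
function shouldFollow(peerRank: number, localRank: number, followSeniors: boolean): boolean {
  return peerRank === localRank || (followSeniors && peerRank > localRank);
}
```

A payload tagged for a channel with no registered receiver is silently dropped in this sketch, which matches the case where a pong for a discarded join arrives late.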

## 5. Scope / honesty (CLAUDE.md §4)

- This redesign targets the **synced-tick** path. Count-based mode (`syncedTick:false`) keeps the
  existing B10 cold-join bootstrap and is untouched.
- All of this is validated on the deterministic `VirtualClock` + `MemoryTransport` harness. Real
  WebRTC loss/NAT/jitter timing of the second-clock convergence and the request retransmit remain
  browser-only (HONEST_LIMITS.md).
- Ways this can still be wrong, to test against: (a) re-target leaking the old second clock's
  samples into the new target; (b) buffered inputs with `tick <= snapshot.tick` corrupting the
  snapshot baseline; (c) two peers each opening a join to the other (election tie not strict);
  (d) committing while the second clock briefly reads `isStable` on a half-converged symmetric-merge
  offset; (e) a snapshot from a target that itself then re-targets, leaving us assembled on a frame
  nobody stands on.
