# Synced-Clock / Synced-Tick Research Plan

Branch: `timer-research` (cut from `refactor/ticks-only`)
Owner: Rob
Status: PLANNING — no mechanism designed yet
Linked context: `DECISIONS.md` #37 ("STILL UNSOLVED"), `tests/synced-tick-distributed-spec.test.js` (6 `describe.skip` EDGE blocks), `SyncedClock.js` (legacy v1, being retired — not under active development here), `test-harness/SyncedPeerClock.js` (the REAL v2 P2P clock-sync algorithm — Cristian + median — wired into the shipping `EasyMultiplayer` facade on the opt-in `syncedTick` path; production-viable despite its `test-harness/` home, pending promotion to a production module. NOT a model of v1; that earlier description was written at PLANNING time and is obsolete).

---

## 0. Purpose & framing

We want to upgrade the synced-clock / synced-tick subsystem so that a **distributed, peer-to-peer, sparse** mesh of peers can agree on "what tick is it right now" across the hard cases that the current `oldest-epoch-wins` rule cannot handle. These cases are captured as 6 executable, currently-skipped specs.

This document is a **research plan**, not a design and not an implementation. Per the working rules (CLAUDE.md §1–2): we assume our intuitions about distributed timing are probably wrong until falsified by tests, we research prior art before inventing, we write the falsifying test before the mechanism, and we actively try to break every approach we design.

### Scope

IN scope (initial focus):
- EDGE 1 — simultaneous cold start, no founder anchor → deterministic tie-break.
- EDGE 2 — two internally-synced groups merge → longest-present group wins (+ tie-break for equal age).
- EDGE 3 — already-synced peer recalibrates → small correction eased, large correction abandon-and-resync, tick never rewinds.
- EDGE 4 — late joiner with a "ready" clock → presence-seniority outranks clock-readiness.
- EDGE 6 — a rightful large correction must re-sync **simulation state**, not just the tick number.

OUT of scope for now (explicit user decision — "pure bonus, not initial focus"):
- EDGE 5b — lone/malicious peer with a **forged** presence claim. The honest, non-adversarial half of EDGE 5 (5a: a peer that genuinely **slept** must self-demote its stale clock on waking) IS in scope, because it is a correctness requirement, not an attack. We will design so that the seniority signal is *not trivially forgeable*, but we will not build Byzantine/cryptographic defenses yet.

### Constraints that bound every candidate mechanism

These come from the system this lives in (`SyncedPeerClock.js` header, `NetworkSim.js`, engine code):
- **No master / no server.** Pure P2P. No peer is privileged a priori.
- **Sparse messaging.** Sync rides existing attendance / ping-pong / bootstrap messages. A candidate that needs a new flood or a consensus round per tick is disqualified on cost. Reuse the attendance (`beat.epoch` today) and boot path.
- **Best-effort transport.** Messages drop, duplicate, reorder; partitions cut sync exactly like game traffic (no sidecar channel). A peer can only learn another's time by getting a message back.
- **Tick monotonicity is sacred.** A derived tick must never run backward in-game (it would re-display already-shown frames). Corrections slow/stall or abandon-and-resync; they never rewind.
- **Determinism for testability.** The harness runs on one master timeline with a seeded RNG. Any tie-break must be deterministic (lowest id, etc.), never a coin flip — EDGE 1 and EDGE 2-equal-age assert reproducibility.
- **Opt-in, non-regressive.** `syncedTick` is opt-in; count-based mode and the existing 438-test suite must stay byte-for-byte unchanged.

---

## 1. Problem decomposition (what each EDGE actually is, in distributed-systems terms)

The current mechanism agrees on a single number: `epoch = strict-min(all epochs seen)`, and `tick = floor((syncedClock.now() - epoch) / tickMs)`. "Oldest epoch wins" is a **min-register CRDT** over wall-clock instants, and the synced clock is **Cristian's algorithm + median offset**. That combination solves late-join (a founder anchors) and bounded skew. It fails the 6 cases because it has **no concept of**:

| EDGE | Underlying problem | Why min-epoch fails |
|------|--------------------|---------------------|
| 1 | Leaderless agreement / symmetry-breaking | No anchor; both reseed from raw clocks, smaller raw epoch hijacks by hours. |
| 2 | Group membership + view merge | Min-epoch can't express "the group that existed longest, as a group, wins"; it averages nothing but picks the numerically-smallest instant, which is arbitrary across groups. |
| 3 (small) | Clock discipline / slewing | Engine derives tick straight from `now()`; a refining offset can step the tick. |
| 3 (large) / 6 | State reconciliation after a jump | Fixing the tick number leaves simulation STATE computed for the wrong ticks. |
| 4 | Authority ≠ confidence | A clock can read "ready" off one sample; readiness is mistaken for the right to set the frame. |
| 5a | Failure detection / staleness | A long-absent peer keeps a stale "I'm oldest" claim; nothing demotes it. |
| 5b (OUT) | Byzantine agreement / quorum | One forged claim can move a healthy group; no corroboration requirement. |

**Key reframing to validate during research:** the problem is really *electing and maintaining a shared "frame authority"* (an epoch + the identity/seniority backing it), gossiped and merged over a sparse mesh — closely analogous to **spanning-tree root-bridge election** (lowest priority/id wins, gossiped, re-elected on topology change) and to **view-synchronous group membership**. The synced *clock* (offset estimation) is a mostly-solved sub-problem; the unsolved part is the *authority/seniority* layer on top of it, plus *state reconciliation* when authority changes.

---

## 2. Phase plan

### Phase R — Literature & prior-art research (do FIRST, before designing)

Goal: not reinvent badly. Produce a short annotated summary (`./PRIOR_ART.md`) mapping each algorithm family to the EDGEs it might serve, with the cost/assumption that makes it fit or not fit our sparse-P2P-no-master constraints.

Algorithm families to study and explicitly accept/reject:
- **Clock sync:** Cristian's algorithm (current), NTP (Marzullo's / intersection algorithm for fault-tolerant selection — directly relevant to EDGE 5b corroboration and to rejecting outlier clocks), Berkeley algorithm (averaging — note: we explicitly do NOT want averaging into a "third time", EDGE 2; understand why to argue against it).
- **Logical / hybrid time:** Lamport timestamps and vector clocks (causality, presence ordering, deterministic tie-break), **Hybrid Logical Clocks (HLC)** (physical+logical, monotonic by construction — strong candidate for the monotonic-tick requirement and for encoding seniority without a forgeable raw wall-clock).
- **Leader / root election:** Bully, Ring, and especially **Spanning Tree Protocol root-bridge election** (priority+id, gossiped, re-converges on merge/partition — the closest analogue to EDGE 1/2/4).
- **Group membership / view change:** virtual synchrony / view-synchronous communication, gossip/epidemic anti-entropy convergence (for sparse merge, EDGE 2).
- **Consensus (study to reject as too heavy, or to borrow ideas):** Paxos/Raft (leader lease, terms-as-seniority) — almost certainly too chatty for per-tick P2P, but "term number that only increases" and "lease/attendance liveness" are reusable ideas for EDGE 4/5a.
- **Quorum / corroboration:** majority intersection, read/write quorums — for "don't move a healthy group on one dissenter" (EDGE 5b, OUT, but informs the corroboration rule).
- **Failure detection:** attendance timeouts, **phi-accrual failure detector** — for EDGE 5a self-demotion ("I haven't confirmed my clock with anyone for T, treat myself as unsynced").
- **Clock disciplining:** PLL/FLL loop filters, slewing vs stepping (how OSes apply `adjtime`) — for EDGE 3 smooth correction without rewind.
- **State reconciliation:** snapshot/anti-entropy, log-replay re-bootstrap (we already have a bootstrap path) — for EDGE 3-large / EDGE 6.

Deliverable check: each EDGE has at least one named candidate mechanism and one named rejected alternative, with reasons tied to our constraints.

### Phase D — Design candidate mechanisms (on paper first)

For each EDGE (or cluster of EDGEs that share a mechanism), write a short design note in `./DESIGN_<edge>.md` containing: the rule, the message/field it rides on (must reuse attendance/boot — name it), the monotonicity argument, the determinism argument, and the failure modes you already suspect.

Likely design spine to evaluate (a hypothesis to falsify, not a decision):
1. **Replace "min raw epoch" with a composite, gossiped frame-authority token**: e.g. `(presenceAge, groupWitnessCount, founderId)` compared lexicographically, where `presenceAge` is measured as elapsed *confirmed-synced* time (not raw wall-clock, so it is harder to forge and self-demotes when stale), `founderId` is the deterministic tie-break. This directly targets EDGE 1 (tie-break), EDGE 2 (longest-present group wins + equal-age tie-break), EDGE 4 (seniority beats readiness).
2. **Staleness self-demotion** (phi-accrual / simple timeout): a peer that hasn't corroborated its clock with anyone for T downgrades its own authority token to "unsynced", so it cannot win on stale seniority (EDGE 5a).
3. **Corroboration gate**: adopting a *different* frame while already in a healthy multi-peer agreement requires ≥k corroborating peers, so a lone token cannot move a group (EDGE 5b, designed-for even though OUT of initial test scope).
4. **Correction policy on the derived tick**: small delta → slew (stall/ease, already half-present via the clock's monotonic clamp); large delta → **abandon-and-resync**: return `catchUpStatus` to `catching-up`, re-bootstrap simulation state from a peer snapshot, then go live on the corrected frame (EDGE 3-large, EDGE 6). Define the small/large threshold and justify it.

Each design note MUST include an explicit "how this stays sparse" paragraph and a "how a peer could exploit/break this" paragraph.

### Phase I — Implement increment-by-increment, test-FIRST

Order by dependency and risk (cheapest falsifier first):
1. EDGE 3-small (slew) — smallest, validates the monotonic-correction primitive in the engine.
2. EDGE 1 (cold-start tie-break) — pure authority-token logic, 2–3 peers.
3. EDGE 4 (seniority > readiness) — adds the presence-age dimension.
4. EDGE 2 (group merge) — adds membership/witness count + partition/heal.
5. EDGE 3-large + EDGE 6 (abandon-and-resync) — adds state reconciliation; do together (same mechanism, EDGE 6 is the state-equality teeth of EDGE 3-large).
6. (Stretch / OUT) EDGE 5a then 5b.

Per increment, the loop is strictly:
1. **Un-skip exactly one `describe` block** in `tests/synced-tick-distributed-spec.test.js` (change `describe.skip` → `describe`). It is now a falsifiable target. Run it, watch it FAIL, and confirm it fails for the *expected* reason (not a setup error).
2. Add narrower unit tests around the new rule (the spec's budgets are deliberately loose; we add tight tests that pin the exact rule, including the off-by-one / tie / equal-age boundaries).
3. Implement the minimal engine change. Keep count-based mode untouched; run the FULL existing suite (438 + selftests, Node 20 AND Node 12 per #37) to prove non-regression.
4. Only then move to the next block.

### Phase B — Adversarial / break-it testing (the part we must not skip)

Per CLAUDE.md §3, for every mechanism we ship we will actively try to break it. Concrete attacks/stressors to build as tests (these go BEYOND the loose specs):
- **Tie-break symmetry:** EDGE 1 with ids that sort differently (numeric vs string), 3-way and 4-way exact-simultaneous starts, and reversed add-order — must converge to the SAME peer's frame regardless of insertion order.
- **Partition flapping:** heal/partition repeatedly during EDGE 2/4 settling — does authority oscillate? Does the tick ever rewind? Does it converge or thrash?
- **Three-way merge:** three groups of different ages meeting near-simultaneously (not just two) — does "longest present" stay transitive and acyclic, or can A>B>C>A?
- **Boundary corrections:** a correction exactly at the small/large threshold; a sequence of many small corrections that sum to a large one — does it slew forever or eventually abandon-and-resync?
- **Monotonicity fuzz:** randomized offsets/drifts/latencies across a seeded sweep; assert the derived tick series is non-decreasing for EVERY peer in EVERY run (this is the invariant most likely to be violated subtly).
- **Stale-but-present:** EDGE 5a with varying sleep durations around the self-demotion threshold T.
- **Lone-dissenter sweep:** EDGE 5b with the dissenter offset swept from tiny to huge — find the exact point (if any) where corroboration fails to hold the group.
- **Re-sync correctness teeth (EDGE 6):** after abandon-and-resync, assert STATE deep-equality (the spec already does), plus that no ticks were double-applied or skipped in the tally.

### Phase V — Honest limits & minimal human verification

Per CLAUDE.md §4–4.1, document explicitly:
- What the deterministic harness CAN prove: convergence within tolerance, tick monotonicity, state equality, determinism of tie-breaks, non-regression — all without a human.
- What it CANNOT faithfully model yet and why: a true "sleep" that freezes a peer's tick driver (EDGE 5a NOTE — partition stops sync but the tick loop keeps free-running); real wall-clock drift characteristics; real WebRTC jitter/asymmetry beyond the seeded model; a forged authority-claim channel (EDGE 5b NOTE). For each, state what we approximate and the residual gap.
- Any genuinely human step (e.g. a real two-browser WebRTC smoke test of the late-join jank that motivated #37) gets a numbered, non-expert checklist with expected outcomes and failure signatures — and we first push as much as possible into headless `graph-pacman` tests (the facade tests already exist).

---

## 3. Success criteria / Definition of Done (per mechanism)

A given EDGE is "done" only when:
1. Its spec block is un-skipped and GREEN, AND tighter unit tests around the exact rule are green.
2. At least 3 concrete ways it could still be wrong have been written down and tested against (CLAUDE.md §3).
3. The full pre-existing suite is still green on Node 20 and Node 12 (non-regression).
4. Tick monotonicity holds across the adversarial fuzz sweep for that case.
5. The honest-limits note for that case is written.
6. A `DECISIONS.md` entry records the rule, why, and the rejected alternatives.

The overall research effort is done when EDGE 1, 2, 3, 4, 6 meet the above and EDGE 5 is either done or explicitly deferred with its limits recorded.

---

## 4. Open questions to resolve during research (not yet answered)

- Is presence-seniority best measured in *confirmed-synced elapsed time*, *number of distinct corroborating peers seen*, or a lexicographic combination? (Forgeability vs. simplicity trade-off.)
- Can the authority token be a CRDT (so merge is associative/commutative/idempotent and partition-merge is trivially correct), or does "longest-present-wins" inherently need a non-commutative election? (STP root election is effectively a CRDT join over `(priority,id)` — can we frame ours the same way and get partition-merge correctness for free?)
- What is the right small/large correction threshold, and should it be absolute (ms), relative (fraction of presence age), or adaptive?
- Does abandon-and-resync risk a livelock under repeated large corrections (flapping authority)? Need a hysteresis / backoff rule.
- How do we keep all of this rideable on the existing attendance without growing message size unacceptably for large meshes?

---

## 5. Working agreements

- Research notes live under `research/`; specs stay in `tests/`; rules-of-record go in `DECISIONS.md`.
- One EDGE in flight at a time; never un-skip more than the block being actively implemented.
- Every claim of "this works" is backed by an executed test, or is flagged as unverified (CLAUDE.md §7).

## 6. Adjacent concern — handshake, protocol version & session id (noted, NOT yet scoped)

Raised by Rob: today any peers on the same network just join each other. In reality (a) different software versions (v47 vs v48) may not be compatible and need to refuse or gate each other, and (b) when peers join they effectively adopt one peer's **session id**. This is not a clock-algorithm change, but it is the same **membership / handshake layer** this research has to introduce, so we record the relationship and leave design room — without expanding committed scope yet.

How it connects:
- **Session id ≈ a generalization of the epoch / frame-authority token.** `_adoptEpoch` already makes a joiner adopt a single number (oldest epoch). Widening that from a bare `epoch` to a session identity that carries `(epoch + seniority/authority token + agreed simulation origin)` means the Phase-D "frame-authority token" and "the session id" can be the SAME object — a unification, not extra mechanism.
- **EDGE 2 / 4 reframed:** "longest-present group wins" is literally "which session id survives the merge."
- **EDGE 6 reframed:** "re-sync the simulation state, not just the tick" becomes "adopting the winning session's id IMPLIES re-bootstrapping its state" — the session id gives the state-resync a natural trigger.
- **Protocol version is a gate one layer UP, upstream of clock math.** Incompatible peers must never clock-sync or merge sessions — equivalent to a permanent partition for that pair. Orthogonal to the timing algorithm, but it rides the SAME handshake: one exchange does version → (if compatible) session id + authority token → (then) epoch/clock sync.

Action for the research: when designing the Phase-D authority token, evaluate framing it AS a session id and shaping the handshake message so a protocol-version field can gate it later. Do not build version negotiation now; just don't design it out.