# Honest Testing Limits — Synced-Tick / Synced-Clock System

This document records, per CLAUDE.md §4–4.1, exactly what the deterministic test harness
**can** prove on its own, what it **cannot** faithfully model and why, and the minimal residual
that genuinely needs a human. Nothing here implies certainty where the harness only approximates.

All test paths are relative to the project root and live under `tests/`. For the live pass/skip
count, run `vitest run`; the suite is the source of truth. The only intentionally **skipped** tests
are the EDGE 5 pair (documented in §2(a)/(d)); everything else is expected green.

---

## Section 1 — What the deterministic harness CAN prove (without any human)

The harness (`test-harness/` — `VirtualClock`, `NetworkSim`, `PeerHarness`, `SyncedPeerClock`)
runs every peer on one deterministic master timeline. Given a fixed seed, the entire run —
message order, latency, loss, tick derivation — is reproducible, so these five properties are
machine-decidable end to end:

- **Convergence within tolerance** — all peers reach the same present tick within a bounded
  number of ticks after joining or reconnecting.
  - Tests: `tests/synced-tick-distributed-spec.test.js` (EDGE 1 "simultaneous cold start with
    desynced clocks" → eventually agree; EDGE 2 "two internally-synced groups meet and settle"),
    `tests/graph-pacman-late-join.test.js` (CONTROL simultaneous-start converge; "late joiner
    converges to the founder", tick agreement to within the one-tick boundary straddle),
    `tests/synced-tick.test.js`.

- **Tick monotonicity** — no peer's displayed tick ever goes backward across a state transition
  (join, catch-up, abandon-and-resync, partition heal).
  - Tests: `tests/adversarial-break-it.test.js` (Attack 5 "Monotonicity fuzz", 20 seeds with
    randomized offset/drift/latency/loss; Attack 2 "Partition flapping" — no tick rewinds across
    5 heal cycles), `tests/edge3large-abandon-resync-unit.test.js` (trace `[0,0,2400]`,
    non-decreasing across the whole abandon→resync cycle),
    `tests/synced-peer-clock-unit.test.js` (clamp non-decreasing once ready).

- **State equality** — peer game-states are byte-equal at the same tick, or differ only within
  the accepted rollback re-simulation window.
  - Tests: `tests/graph-pacman-late-join.test.js` (`equalTickConvergence` — serialized state is
    byte-identical on the instants where `A.tick === B.tick`, plus a NON-VACUOUSNESS guard that
    proves the snapshot actually tracks motion), `tests/adversarial-break-it.test.js` (Attack 8
    "Re-sync correctness teeth EDGE 6" — C's `state.tally` deep-equals the majority after
    abandon-and-resync; replays accounted against `rollbackCount()`),
    `tests/rollback-netcode.test.js`, `tests/engine-l2-sim-rollback.test.js`.

- **Tie-break determinism** — given identical inputs and seeded RNG, authority selection is fully
  reproducible and order-independent (the lower `simulationAge` loses; ties break on lexicographic
  `peerId`).
  - Tests: `tests/adversarial-break-it.test.js` (Attack 1 "Tie-break symmetry" — all permutations
    elect the same peer, lexicographic ids `"2","10","9"` stable; Attack 3 "Three-way merge" —
    `compareAuthority` is a strict, transitive, antisymmetric order, no Condorcet cycle),
    `tests/edge1-authority-unit.test.js`, `tests/edge2-group-merge-unit.test.js`,
    `tests/edge4-seniority-unit.test.js`, `tests/recovery.test.js` (`compareAuthority`).

- **Non-regression** — all prior edge-case behavior (EDGE 1–4) continues to pass with the
  syncedTick path wired in.
  - Tests: `tests/synced-tick-distributed-spec.test.js` (EDGE 1, EDGE 2, EDGE 3-small,
    EDGE 3-large, EDGE 4 all un-skipped and green), `tests/edge1-authority-unit.test.js`,
    `tests/edge2-group-merge-unit.test.js`, `tests/edge4-seniority-unit.test.js`,
    `tests/edge3large-abandon-resync-unit.test.js`, plus the full suite (green except the two
    intentionally-skipped EDGE 5 tests; run `vitest run` for the live count).
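
The order-independence claim in the tie-break bullet is the kind of property a permutation check
settles mechanically. The following is a minimal sketch, not the project's `compareAuthority`: the
token shape `{ peerId, simulationAge }`, the choice that the lexicographically *smaller* id wins a
tie, and the helpers `compareAuthoritySketch`, `electAuthority`, and `permutations` are all
assumptions made for illustration.

```javascript
// Hypothetical comparator: a negative result means `a` outranks `b`.
function compareAuthoritySketch(a, b) {
  if (a.simulationAge !== b.simulationAge) {
    // The lower simulationAge loses, so the higher one sorts first.
    return b.simulationAge - a.simulationAge;
  }
  // Plain string comparison, not numeric: "10" < "2" < "9".
  if (a.peerId < b.peerId) return -1;
  if (a.peerId > b.peerId) return 1;
  return 0;
}

// Pick the top-ranked token from a list, in whatever order it arrives.
function electAuthority(tokens) {
  return tokens.reduce((best, t) => (compareAuthoritySketch(t, best) < 0 ? t : best));
}

// All orderings of a small array.
function permutations(items) {
  if (items.length <= 1) return [items.slice()];
  const out = [];
  items.forEach((item, i) => {
    const rest = items.slice(0, i).concat(items.slice(i + 1));
    for (const tail of permutations(rest)) out.push([item, ...tail]);
  });
  return out;
}

const tokens = [
  { peerId: "2", simulationAge: 100 },
  { peerId: "10", simulationAge: 100 },
  { peerId: "9", simulationAge: 40 },
];

// Every arrival order must elect the same peer.
const winners = new Set(permutations(tokens).map((p) => electAuthority(p).peerId));
console.log([...winners]); // a single winner: "10"
```

Under these assumptions `"10"` beats `"2"` because the comparison is on strings, which is why ids
such as `"2","10","9"` make a useful fixture: a comparator that silently compared numerically would
elect a different peer.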

---

## Section 2 — What the harness CANNOT faithfully model (and why)

Each gap below is real. The harness substitutes a tractable approximation; the residual is the
part a real deployment can still break that no green test here would catch.

**(a) Partition that freezes the tick driver (EDGE 5a NOTE)**
- **Approximation:** A "sleeping"/absent peer is modeled by partitioning it on `NetworkSim` —
  message delivery (game traffic *and* clock-sync ping/pong) is genuinely cut. But the peer's
  tick loop is driven by `VirtualClock.schedule`, which the master timeline keeps advancing: the
  simulated tick **keeps counting** during the partition. The harness approximates "asleep" as
  "isolated but still ticking." (Documented inline at
  `tests/synced-tick-distributed-spec.test.js` EDGE 5 NOTE; the engine-side defense — now
  support-weighting in `compareFrameRank`, which keeps an established session when a stale waker
  returns regardless of its claimed age — is exercised by `tests/adversarial-break-it.test.js`
  Attack 6. The earlier opt-in `isolationDemoteMs` self-demotion knob was removed as redundant.)
- **Residual gap:** A real frozen driver (minimized tab, `setInterval`/`requestAnimationFrame`
  throttled to ~0 Hz, device sleep) stops the tick loop entirely, so the peer wakes with a tick
  that is *stale by the full sleep duration* and a raw clock that may have jumped (suspend/resume).
  The harness's "kept counting" peer never exhibits that exact stale-tick-plus-clock-jump combo,
  so the precise wake-up correction magnitude — and any UI lurch at the instant of re-entry — is
  not what the real browser would produce. This matters because the abandon-and-resync threshold
  is tuned against correction size.
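
The difference between "isolated but still ticking" and a truly frozen driver can be put in
numbers. This is a deliberately simplified sketch: the tick period `TICK_MS` and the helpers
`tickAfterAbsence` and `wakeCorrection` are hypothetical, and clock drift accumulated during the
absence is ignored.

```javascript
const TICK_MS = 50; // assumed tick period, not taken from the project

// Local tick of a peer after being absent for `absentMs`.
// driverKeepsRunning = true  : harness-style peer, isolated but still ticking.
// driverKeepsRunning = false : real frozen driver (minimized tab, device sleep).
function tickAfterAbsence(startTick, absentMs, driverKeepsRunning) {
  const elapsedTicks = Math.floor(absentMs / TICK_MS);
  return driverKeepsRunning ? startTick + elapsedTicks : startTick;
}

// Size of the tick correction the peer faces when it rejoins the live group.
function wakeCorrection(startTick, absentMs, driverKeepsRunning) {
  const groupTick = startTick + Math.floor(absentMs / TICK_MS); // the group never stopped
  const peerTick = tickAfterAbsence(startTick, absentMs, driverKeepsRunning);
  return groupTick - peerTick;
}

const harnessLike = wakeCorrection(1000, 60000, true);
const frozenDriver = wakeCorrection(1000, 60000, false);
console.log(harnessLike, frozenDriver); // 0 1200
```

With these assumed numbers a 60-second sleep leaves the frozen peer 1200 ticks behind, while the
harness-style peer needs no tick correction at all, which is the magnitude gap described above.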

**(b) Real wall-clock drift characteristics**
- **Approximation:** Per-peer skew is a **fixed offset** (`clockOffset`) plus an optional constant
  `clockDrift` rate; inter-message latency is drawn from a seeded RNG. Drift is therefore linear
  and stationary.
- **Residual gap:** A real hardware clock has temperature-dependent frequency error, is
  corrected by **discrete NTP steps** (the clock can jump forward or backward in a single event),
  and carries jitter that is **not Gaussian**. A real peer can see its `now()` step discontinuously
  mid-session, momentarily crossing the abandon threshold or defeating the monotonic clamp's
  assumptions in ways a constant drift never reproduces. The model cannot show a mid-session NTP
  step triggering a spurious resync, nor drift that accelerates under thermal load.
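
To make the contrast concrete, the sketch below compares the linear model with a clock that takes
one discrete step. It is illustrative only: `linearClock` mirrors the offset-plus-constant-drift
description above, while `steppedClock`, `maxErrorJump`, the sample values, and
`ABANDON_THRESHOLD_MS` are hypothetical and not taken from the project.

```javascript
// Linear model: fixed offset plus a constant drift rate (e.g. 50e-6 = 50 ppm).
function linearClock(trueMs, clockOffset, clockDrift) {
  return trueMs + clockOffset + clockDrift * trueMs;
}

// Same clock with a single discrete step of `stepMs` applied at `stepAtMs`,
// standing in for an NTP correction or a suspend/resume jump.
function steppedClock(trueMs, clockOffset, clockDrift, stepAtMs, stepMs) {
  const base = linearClock(trueMs, clockOffset, clockDrift);
  return trueMs >= stepAtMs ? base + stepMs : base;
}

// Largest change in clock error between two consecutive samples.
function maxErrorJump(clockFn, durationMs, sampleMs) {
  let worst = 0;
  let prevErr = clockFn(0);
  for (let t = sampleMs; t <= durationMs; t += sampleMs) {
    const err = clockFn(t) - t;
    worst = Math.max(worst, Math.abs(err - prevErr));
    prevErr = err;
  }
  return worst;
}

const ABANDON_THRESHOLD_MS = 250; // hypothetical threshold for illustration

const linearJump = maxErrorJump((t) => linearClock(t, 120, 50e-6), 60000, 100);
const steppedJump = maxErrorJump(
  (t) => steppedClock(t, 120, 50e-6, 30000, -400),
  60000,
  100
);

console.log(linearJump < ABANDON_THRESHOLD_MS); // true: drift alone moves ~0.005 ms per sample
console.log(steppedJump > ABANDON_THRESHOLD_MS); // true: the step moves ~400 ms in one sample
```

Under the linear model the error between samples changes by a few microseconds, so no choice of
seed can produce a threshold crossing of this kind; a stepped clock crosses it in a single sample.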

**(c) Real WebRTC jitter and asymmetry**
- **Approximation:** `NetworkSim` links are configurable but, as used, give **symmetric, IID**
  latency with independent per-link loss. RTT is effectively `latency*2`, so the
  Cristian `rtt/2` offset estimate is near-exact.
- **Residual gap:** Real WebRTC has **asymmetric one-way delays** (uplink ≠ downlink), **bursty
  correlated loss** under congestion (not IID), and **ICE renegotiation pauses** and **DTLS
  rehandshake delays** that stall a link for hundreds of ms at a time. Under asymmetry the `rtt/2`
  estimate is biased by half the uplink/downlink difference, so the converged offset carries a
  residual the harness's symmetric model never injects. Burst
  loss can also starve the median sample window, delaying `ready` or widening rollback windows
  beyond the ~2-tick band the harness observes.
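
The `rtt/2` bias can be worked through directly. The sketch below applies the textbook Cristian
estimate to a single ping/pong, assuming zero processing time on the responder;
`cristianEstimate` and its parameters are hypothetical names, not the project's clock-sync code.

```javascript
// One ping/pong exchange. `trueOffsetMs` is (responder clock - requester clock).
// Returns the offset the requester would estimate using rtt/2.
function cristianEstimate(trueOffsetMs, uplinkMs, downlinkMs, sendLocalMs = 0) {
  // Responder stamps its own clock when the ping arrives.
  const responderStamp = sendLocalMs + uplinkMs + trueOffsetMs;
  // Requester sees the pong after the full round trip.
  const recvLocalMs = sendLocalMs + uplinkMs + downlinkMs;
  const rtt = recvLocalMs - sendLocalMs;
  // Assume the stamp is rtt/2 old when it arrives.
  return responderStamp + rtt / 2 - recvLocalMs;
}

const symmetric = cristianEstimate(500, 40, 40);
const asymmetric = cristianEstimate(500, 70, 10);

console.log(symmetric); // 500: exact when uplink equals downlink
console.log(asymmetric); // 530: off by (70 - 10) / 2 = 30 ms
```

Both exchanges have the same 80 ms RTT, so the requester cannot tell them apart from its own
measurements; the 30 ms error in the asymmetric case is invisible to it, and no amount of
median filtering over similarly skewed samples removes it.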

**(d) Forged authority-claim channel (EDGE 5b NOTE)**
- **Approximation:** The corroboration/seniority compare logic is tested **in isolation** —
  `compareAuthority` and `_adoptEpoch` are driven directly with hand-built tokens
  (`tests/edge1-authority-unit.test.js`, `tests/edge2-group-merge-unit.test.js`,
  `tests/adversarial-break-it.test.js` Attack 7 "Lone-dissenter sweep" sweeps deviant clock
  offsets through the *normal* attendance path). The unit logic correctly refuses a lone dissenter.
- **Residual gap:** There is no way in the harness to inject a **forged `beat.authority`** from
  outside the legitimate attendance path — a malicious peer fabricating a high-`simulationAge`
  token it never earned. End-to-end channel injection (a hostile frame on the wire, malformed or
  spoofed) is **not covered**: EDGE 5b's requirement that "a forged-seniority token must not
  override a live group of ≥3" is accounted for in the design but has no end-to-end test. This
  matters because the in-isolation unit test assumes the token *arrived through the normal path*;
  it cannot prove the receive path validates a token whose provenance is hostile.

---

## Section 3 — Human verification (minimal, numbered checklist)

**What is already headless (no human needed):** The correctness core of the late-join feature —
the very scenario whose *visual jank* motivated syncedTick — is fully covered by
`tests/graph-pacman-late-join.test.js`: a late joiner converges to the founder's present tick
(within the one-tick boundary straddle) and its serialized game-state is **byte-identical** at
every instant where both peers report the same tick, including under movement on both peers, with
a non-vacuousness guard proving the snapshot tracks real motion. So *"does the late joiner end up
in the same place"* needs **no human** — it is asserted on data.

**What genuinely needs a human:** only the **perceptual** layer — pixel-level smoothness across
two *real* browsers over a *real* WebRTC connection. Byte-equal state at equal ticks does not
prove the frames *between* ticks render without a visible jump, and the headless run has no real
ICE/DTLS timing (see §2(c)). Each step below states its **expected outcome** and a
**failure signature** for the tester to report.

1. [Open the graph-pacman demo in two separate browser windows (e.g. Chrome + Firefox, or two
   profiles). In window A, start a game and let it run for ~30 seconds. Then, in window B, join
   the same room.]
   Expected outcome: Window B's view "snaps" once to the live game and from then on both pac-men
   move smoothly and in step; window A shows no stutter or backward jump when B joins.
   Failure signature: Window A's characters visibly rubber-band, freeze, or jump backward at the
   moment B joins; or window B stays visibly behind/ahead and never lines up. Report which window,
   roughly how many seconds after B joined, and whether it self-corrected.

2. [With both windows joined and playing, move both pac-men continuously in different directions
   for ~60 seconds.]
   Expected outcome: Motion stays smooth on both screens; the two views agree on positions with no
   creeping divergence or periodic "teleport" corrections.
   Failure signature: Periodic visible snapping/teleporting (a correction firing too often), or a
   slow drift where the two screens disagree more and more. Report the rough period between snaps,
   or the direction of the drift.

3. [Simulate a frozen tick driver: minimize window A (or switch to another tab) for ~60 seconds
   while window B keeps playing, then bring window A back to the foreground.]
   Expected outcome: On returning, window A re-syncs to B's current state with at most a single
   brief snap, then resumes smooth motion; window B is undisturbed throughout.
   Failure signature: Window A lurches far forward/backward more than once, takes many seconds to
   settle, drags window B with it, or never reconverges. Report how long A took to settle and
   whether B was affected. (This is the real frozen-driver case the harness only approximates —
   §2(a).)
