# DESIGN — Correction Policy: Slew vs Abandon-and-Resync

> **SHIPPED vs DESIGNED (read first).** Two deltas from the text below:
> - **The boundary is expressed in SECONDS, derived from the tick rate — not a fixed `τ ≈ 750 ms`.**
>   The shipped knob is `correctionThresholdSeconds` (default **5 s**), turned into a tick count as
>   `ceil(seconds * 1000 / tickMs)` (= 100 ticks at 20 Hz). It is the wall-clock duration an *ahead*
>   peer would otherwise freeze while the rest catch up. `correctionThresholdTicks` survives as an
>   expert override. (The earlier shipped default was a raw `5000` ticks ≈ 250 s, far too high — a
>   moderately-ahead peer could freeze for minutes; it was lowered to a seconds-based 5 s.)
> - **Small backward corrections HOLD, they do not rate-slew.** The shipped small-case path takes the
>   frame in place and lets the derived tick *hold* (the new present climbs to meet it) rather than
>   easing via a `_slewResidual` term. The monotonicity guarantee is the same; the mechanism is
>   simpler than the eased slew described here.
> Treat `τ ≈ 750 ms`, `_correctionThresholdMs`, and `_slewResidual` below as design-record names.

**Covers: EDGE 3 (3a small recalibration eased; 3b large recalibration without a 72,000-tick lurch) and EDGE 6 (a large legitimate correction must re-sync simulation STATE, not just the tick counter).**

EDGE 3 and EDGE 6 share one mechanism and are designed together: both are about *how a peer applies a
clock/frame correction* once the authority spine (`DESIGN_EDGE1_EDGE2_EDGE4.md`) has told it the
correct frame. The decision splits on magnitude — a small correction is *slewed* locally, a large one
triggers *abandon-and-resync*. EDGE 6 is the *teeth* of the large case: it asserts state deep-equality,
which only passes if a large correction re-bootstraps **state**, not merely snaps the tick number.

## Rule
When adopting a frame yields a tick correction of magnitude `Δ` (ms between the old and new derived
tick), if `Δ ≤ τ` the peer **slews** — it eases the derived tick toward the target by shaping its
*rate* (running slightly slow, or fully stalling for a downward correction) so the tick never jumps and
never rewinds; if `Δ > τ` the peer **abandons and resyncs** — it sets `catchUpStatus()` back to
`CATCHING_UP`, pulls a fresh state snapshot from a peer via the existing bootstrap path, re-simulates to
the corrected synced target, and only then returns to `LIVE` on the new frame. The threshold
`τ ≈ 750 ms` is chosen so that ordinary offset refinement (the tens-to-low-hundreds-of-ms drift of
EDGE 3a) always slews and finishes within ~1–2 s of real time, while a genuinely large jump (EDGE 3b/6's
~1 h) is far over `τ` and re-bootstraps instead of taking hours to slew.

## Message / Field
Slewing is **purely local** and adds no message: the derived-tick computation
(`tick = floor((clock.now() - epoch)/tickMs)`) gains a transient additive term `this._slewResidual` (ms
still to absorb) that eases to zero over successive ticks, plus a config constant
`this._correctionThresholdMs` (= `τ`). The large path **reuses the existing bootstrap exchange**:
`MSG_BOOT_REQ` → `MSG_BOOT` (the snapshot already used for late-join at `SimulationEngine.js`), driven
by flipping the existing `this._catchUp.status` field from `CatchUpStatus.LIVE` back to
`CatchUpStatus.CATCHING_UP` and re-entering the catch-up lifecycle (`_catchUpAndGoLive` /
`_syncedTarget`). The trigger is computed inside `_adoptAuthority` (see the spine file): it sets
`this._pendingCorrectionMs = Δ` and routes to `_slew()` or `_resync()` accordingly — no new message
type and no new field on the wire; the only wire traffic in the large case is the snapshot the bootstrap
path already defines.

## Monotonicity Argument
Slewing shapes only the *rate* of the derived tick and is clamped to be **non-negative**: an upward
correction makes the tick run at most slightly fast (bounded by the spec's "≤ 2000/50 + 5" per-window
cap), and a **downward** correction makes it **stall** (hold the current tick) until real time catches
up — never a backward step. This mirrors the v1 invariant ("never go backwards, slow down instead",
`SyncedClock.js:123`) and the `SyncedPeerClock` monotonic clamp. The abandon-and-resync path freezes the
tick while `CATCHING_UP`, re-simulates, and resumes `LIVE` at `max(currentTick, syncedTarget)` — so the
counter never drops below what was already displayed even when the corrected frame's raw arithmetic would
imply a lower number; it stalls, then advances. Both branches are therefore provably non-decreasing, and
because the *correcting* peer is always the junior one (the senior frame's members never adopt — spine
file), no senior peer's tick is ever touched.

## Determinism Argument
`τ` is a fixed constant and `Δ` is computed deterministically from the adopted token's `epoch`, so two
peers facing the same correction make the **same** slew-vs-resync decision. The slew rate is a fixed
function of `_slewResidual` (e.g. drain a constant fraction per tick), so the tick trajectory is
reproducible; the resync path re-runs the **same deterministic `step` function** over the **same input
log** from the **same snapshot** to the **same target**, producing identical state on every peer and
every run (no randomness anywhere in the decision or the replay). Where the spine's election is
deterministic (it is — lexicographic with a `founderId` tie-break), the correction that follows is a
pure function of the elected frame, so the whole adopt→correct pipeline is deterministic end to end.

## Failure Modes
1. **Boundary flapping at `Δ ≈ τ`.** A correction hovering near `τ` could oscillate between slewing and
   re-bootstrapping as later samples nudge `Δ` across the line, thrashing the peer in and out of
   catching-up. Mitigation: **hysteresis** — once slewing has begun, finish the slew even if `Δ` creeps
   just over `τ`; only escalate to resync if `Δ` grows past a higher escalation bound (e.g. `2τ`).
2. **Accumulating small corrections.** A sequence of individually-sub-`τ` corrections (e.g. EDGE 3a's
   steady 0.1% drift) could sum to a large effective error while each one only slews, leaving the tick
   chronically behind. Mitigation: track the **cumulative undrained `_slewResidual`**; if it cannot drain
   within the slew window `W`, escalate that accumulation to a single abandon-and-resync.
3. **Resync livelock.** During a messy multi-group merge the elected frame could change repeatedly, each
   change a large correction, so a peer re-bootstraps over and over and never reaches `LIVE`. Mitigation:
   a **minimum live-dwell + backoff** — after going live a peer ignores further large corrections for a
   short dwell, and the spine's convergence (a single maximal token wins) bounds how often the frame can
   actually change.
4. **Stale snapshot on arrival.** Latency means the snapshot fetched during resync may already be a few
   ticks behind real time when it lands. Mitigation: resume at `max(syncedTarget, presentTick)` and let
   the normal catch-up loop close the residual gap, exactly as late-join already does.
5. **Over-long downward stall.** A near-`τ` *downward* correction stalls the tick for up to ~`τ` of real
   time, which could read as a visible freeze. Mitigation: `τ` is tuned (~750 ms) so the worst sub-`τ`
   stall stays under ~1 s, below the threshold of a disruptive pause for a 50 ms-tick game.

## How This Stays Sparse
The common case — EDGE 3a small recalibration — is **zero added traffic**: slewing is local rate-shaping
of an arithmetic the peer already computes every tick, costing O(1) state (`_slewResidual`). The
expensive path (abandon-and-resync) reuses the **existing bootstrap request/response** and is invoked
**only on large corrections or merges**, which are rare events, not a per-tick operation — so there is no
new per-tick flood and no new message type. Across a session a healthy peer pays nothing for this rule
beyond a couple of words of state; a peer that genuinely meets a far-off group pays exactly one snapshot
exchange, amortized to ~never. Nothing here scales with the tick rate.

## How a Peer Could Exploit or Break This
A peer could **flap its advertised `epoch`** to repeatedly force a victim over the `Δ > τ` line,
inducing the victim to abandon-and-resync again and again — a disruption/bandwidth-waste (re-pulling
snapshots, never going live) bordering on a DoS. The first-order consequence is a victim stuck in
`CATCHING_UP`, thrashing. Mitigation: the authority spine's **corroboration gate** (`DESIGN_EDGE5A.md`)
means a lone flapper cannot actually *win* the frame, so a peer already in a healthy agreement never
honors its corrections in the first place; and the **min-dwell + backoff** above caps how often *any*
large correction is honored, so even a trusted-but-buggy peer flapping its clock is contained to at most
one resync per dwell. A correction that is *both* large and corroborated by a colluding majority would
still force a legitimate-looking resync; defending that is the Byzantine case explicitly **out of initial
scope** and recorded as an open risk.
