Skip to content

gbn+mailbox: fix transport timeouts, add dynamic pong, harden with rapid tests#123

Open
Roasbeef wants to merge 9 commits intomasterfrom
fix-gbn-transport-timeouts
Open

gbn+mailbox: fix transport timeouts, add dynamic pong, harden with rapid tests#123
Roasbeef wants to merge 9 commits intomasterfrom
fix-gbn-transport-timeouts

Conversation

@Roasbeef
Copy link
Member

@Roasbeef Roasbeef commented Mar 5, 2026

In this PR, we fix a class of LNC transport failures where the GBN
keepalive mechanism was too aggressively timing out connections that
traverse the aperture mailbox relay. Users were hitting "cannot send, gbn
exited" errors in the Terminal Web UI, particularly on Pool RPCs, because
the 3s pong timeout couldn't accommodate the variable latency introduced
by the relay hop.

We address this with three changes: bumping the static timeout defaults to
more reasonable values, adding a dynamic pong timeout that adapts to
observed RTT, and fixing a pre-existing deserialize bug found via
property-based testing.

Dynamic Pong Timeout

The core idea is simple: rather than a fixed pong deadline, we compute it
as max(basePongTime, pongMultiplier * latestRTT), capped at
maxPongTime. The RTT is already tracked by the TimeoutManager from ACK
response measurements, so no additional round trips are needed. A new
WithDynamicPongTimeout config option controls the multiplier (default 3x)
and ceiling (default 15s). When dynamic mode is disabled or no RTT data
exists yet, we fall back to the static base pong time, so this is fully
backwards compatible with existing deployments.

In gbn_conn.go, the pong ticker now calls ResetWithInterval on each
ping send to pick up the latest dynamic value rather than reusing the
original static interval.

Updated Defaults

The mailbox client and server connections get new defaults tuned for the
relay transport:

  • Client ping: 7s -> 10s
  • Server ping: 5s -> 8s
  • Pong timeout: 3s -> 5s (base, scaling up to 15s via dynamic RTT)

These are backwards compatible since GBN timeouts are configured
independently per side with no negotiation. A new client connecting to an
old server (or vice versa) works fine because each side's pings arrive well
within the other's receive window.

Deserialize Bug Fix

While writing property-based tests with pgregory.net/rapid, we found an
off-by-one in Deserialize for DATA packets: it checked len(b) < 3 but
then accessed b[3] for the IsPing field, panicking on 3-byte inputs like
[]byte{0x02, 0x00, 0x00}. The fix is simply len(b) < 4.

Property-Based Tests

We add eight property test groups using rapid that exercise GBN invariants
under randomized inputs: message roundtrip, deserialize safety (which
caught the bug above), modular sequence arithmetic, queue size bounds after
random add/ACK/NACK sequences, dynamic pong timeout floor/ceiling
invariants, and timeout booster behavior. We also add two backwards
compatibility integration tests that verify bidirectional data transfer
between mixed old/new timeout configurations.

See each commit message for a detailed description w.r.t the incremental
changes.

Roasbeef added 6 commits March 5, 2026 15:20
In this commit, we fix an off-by-one error in the Deserialize function
for DATA packets. The previous bounds check required `len(b) >= 3` but
then accessed `b[3]` for the IsPing field, which would panic on exactly
3-byte inputs. The minimum length for a DATA packet is 4 bytes: type
byte, sequence number, FinalChunk flag, and IsPing flag (with payload
being optional and starting at `b[4:]`).

This bug was discovered via property-based testing with pgregory.net/rapid,
which generated the crashing input `[]byte{0x02, 0x00, 0x00}`.
In this commit, we introduce adaptive pong timeout behavior for the GBN
keepalive mechanism. Previously the pong timeout was a fixed static value,
which could be too aggressive for relay-based transports like the LNC
mailbox where round-trip times through the intermediary can vary
significantly with network conditions.

The dynamic pong timeout is computed as:

    pongTimeout = max(basePongTime, pongMultiplier * latestRTT)

capped at a configurable maximum (maxPongTime). The RTT is tracked by
the TimeoutManager from existing ACK response measurements, so no
additional round trips are needed. A new WithDynamicPongTimeout config
option controls the multiplier and ceiling. When dynamic mode is disabled
or no RTT data is available yet, the behavior falls back to the static
base pong time, preserving full backwards compatibility.

The pong ticker in gbn_conn.go now uses ResetWithInterval on each ping
send to pick up the latest dynamic value rather than reusing the original
static interval.
In this commit, we add four focused tests for the new dynamic pong
timeout behavior introduced in the TimeoutManager:

- TestDynamicPongTimeout: exercises the core computation verifying
  the floor (basePongTime), ceiling (maxPongTime), and RTT-proportional
  scaling via the pong multiplier.
- TestDynamicPongTimeoutDisabled: confirms static fallback behavior
  when dynamic mode is not enabled.
- TestDynamicPongTimeoutWithDataPackets: verifies that RTT measurements
  from data packet ACKs feed into the dynamic pong computation.
- TestGetLatestRTT: validates the RTT tracking accessor.
In this commit, we adjust the default GBN keepalive timeouts for the
mailbox client and server connections and wire up the new dynamic pong
timeout feature. The previous values were tuned for direct connections
but proved too aggressive for the relay-based LNC transport, where
messages traverse the aperture mailbox server, adding variable latency.

The new defaults are:
- Client ping: 7s -> 10s
- Server ping: 5s -> 8s
- Pong timeout: 3s -> 5s (base, with dynamic scaling up to 15s)

Both client and server connections now pass WithDynamicPongTimeout with
a 3x RTT multiplier and 15s ceiling, allowing the pong deadline to
adapt to observed network conditions while still catching genuinely
dead connections.

These changes are backwards compatible: GBN timeouts are configured
independently per side with no negotiation, so a new client connecting
to an old server (or vice versa) will function correctly since each
side's pings arrive well within the other's receive window.
In this commit, we add two integration-level tests that verify mixed-
version GBN connections work correctly with the new timeout values:

- TestBackwardsCompatMixedTimeouts: connects a new client (10s ping,
  5s base pong, dynamic pong enabled) with an old server (5s ping,
  3s pong, static only) and verifies bidirectional data transfer.
- TestBackwardsCompatOldClientNewServer: the reverse scenario with an
  old client connecting to a new server.

Both tests exercise the full GBN handshake and multiple data round
trips, confirming that the independently-configured timeout values
remain compatible across versions without protocol negotiation.
In this commit, we introduce property-based tests using pgregory.net/rapid
to exercise GBN protocol invariants under randomized inputs. These tests
complement the existing example-based suite by exploring the input space
more broadly, and in fact caught the DATA deserialize bounds-check bug
fixed in a prior commit.

The test suite covers eight property groups:

- MessageRoundTrip: any valid message survives serialize/deserialize.
- DeserializeNeverPanics: arbitrary byte slices never cause a panic.
- ContainsSequenceProperties: modular arithmetic invariants for the
  half-open interval [base, top) used by the send queue.
- ContainsSequenceBruteForce: validates containsSequence against a
  reference implementation via enumeration in small modular spaces.
- DynamicPongTimeoutProperties: the dynamic pong value always respects
  the floor (basePongTime) and ceiling (maxPongTime), falls back to
  static when disabled or when no RTT data exists.
- QueueSizeInvariants: the send queue size never exceeds the window
  after random sequences of add/ACK/NACK operations.
- TimeoutBoosterProperties: boosted timeouts are always >= original,
  and reset correctly returns to a new base value.
- PingPongZeroDisablesPing: zero ping time yields MaxInt64 (disabled).
Roasbeef added 3 commits March 5, 2026 15:38
In this commit, we set the pongMultiplier and maxPongTime fields to
their default constant values in NewTimeOutManager. Previously these
fields relied on the zero-value being safe (guarded by the
dynamicPongTime boolean), but initializing them explicitly provides
defense in depth against future code paths that might enable dynamic
mode without going through WithDynamicPongTimeout.
In this commit, we replace require.NoError calls inside goroutines in
the backwards compatibility tests with error channels that propagate
errors back to the main test goroutine. Calling require.NoError (which
invokes t.FailNow) from a non-test goroutine causes a panic rather
than a clean test failure, as documented by the testify library.
In this commit, we update the reference implementation used in the
brute-force property test to operate in the full uint8 space with
natural wrapping, matching the production containsSequence exactly.
The previous version used a mod-s space which happened to agree for
small values but was a non-obvious equivalence that could silently
break if the test ranges were widened.
@ellemouton ellemouton requested a review from ViktorT-11 March 9, 2026 11:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant