gbn+mailbox: fix transport timeouts, add dynamic pong, harden with rapid tests#123
Open
gbn+mailbox: fix transport timeouts, add dynamic pong, harden with rapid tests#123
Conversation
In this commit, we fix an off-by-one error in the Deserialize function
for DATA packets. The previous bounds check required `len(b) >= 3` but
then accessed `b[3]` for the IsPing field, which would panic on exactly
3-byte inputs. The minimum length for a DATA packet is 4 bytes: type
byte, sequence number, FinalChunk flag, and IsPing flag (with payload
being optional and starting at `b[4:]`).
This bug was discovered via property-based testing with pgregory.net/rapid,
which generated the crashing input `[]byte{0x02, 0x00, 0x00}`.
In this commit, we introduce adaptive pong timeout behavior for the GBN
keepalive mechanism. Previously the pong timeout was a fixed static value,
which could be too aggressive for relay-based transports like the LNC
mailbox where round-trip times through the intermediary can vary
significantly with network conditions.
The dynamic pong timeout is computed as:
pongTimeout = max(basePongTime, pongMultiplier * latestRTT)
capped at a configurable maximum (maxPongTime). The RTT is tracked by
the TimeoutManager from existing ACK response measurements, so no
additional round trips are needed. A new WithDynamicPongTimeout config
option controls the multiplier and ceiling. When dynamic mode is disabled
or no RTT data is available yet, the behavior falls back to the static
base pong time, preserving full backwards compatibility.
The pong ticker in gbn_conn.go now uses ResetWithInterval on each ping
send to pick up the latest dynamic value rather than reusing the original
static interval.
In this commit, we add four focused tests for the new dynamic pong timeout behavior introduced in the TimeoutManager: - TestDynamicPongTimeout: exercises the core computation verifying the floor (basePongTime), ceiling (maxPongTime), and RTT-proportional scaling via the pong multiplier. - TestDynamicPongTimeoutDisabled: confirms static fallback behavior when dynamic mode is not enabled. - TestDynamicPongTimeoutWithDataPackets: verifies that RTT measurements from data packet ACKs feed into the dynamic pong computation. - TestGetLatestRTT: validates the RTT tracking accessor.
In this commit, we adjust the default GBN keepalive timeouts for the mailbox client and server connections and wire up the new dynamic pong timeout feature. The previous values were tuned for direct connections but proved too aggressive for the relay-based LNC transport, where messages traverse the aperture mailbox server, adding variable latency. The new defaults are: - Client ping: 7s -> 10s - Server ping: 5s -> 8s - Pong timeout: 3s -> 5s (base, with dynamic scaling up to 15s) Both client and server connections now pass WithDynamicPongTimeout with a 3x RTT multiplier and 15s ceiling, allowing the pong deadline to adapt to observed network conditions while still catching genuinely dead connections. These changes are backwards compatible: GBN timeouts are configured independently per side with no negotiation, so a new client connecting to an old server (or vice versa) will function correctly since each side's pings arrive well within the other's receive window.
In this commit, we add two integration-level tests that verify mixed- version GBN connections work correctly with the new timeout values: - TestBackwardsCompatMixedTimeouts: connects a new client (10s ping, 5s base pong, dynamic pong enabled) with an old server (5s ping, 3s pong, static only) and verifies bidirectional data transfer. - TestBackwardsCompatOldClientNewServer: the reverse scenario with an old client connecting to a new server. Both tests exercise the full GBN handshake and multiple data round trips, confirming that the independently-configured timeout values remain compatible across versions without protocol negotiation.
In this commit, we introduce property-based tests using pgregory.net/rapid to exercise GBN protocol invariants under randomized inputs. These tests complement the existing example-based suite by exploring the input space more broadly, and in fact caught the DATA deserialize bounds-check bug fixed in a prior commit. The test suite covers eight property groups: - MessageRoundTrip: any valid message survives serialize/deserialize. - DeserializeNeverPanics: arbitrary byte slices never cause a panic. - ContainsSequenceProperties: modular arithmetic invariants for the half-open interval [base, top) used by the send queue. - ContainsSequenceBruteForce: validates containsSequence against a reference implementation via enumeration in small modular spaces. - DynamicPongTimeoutProperties: the dynamic pong value always respects the floor (basePongTime) and ceiling (maxPongTime), falls back to static when disabled or when no RTT data exists. - QueueSizeInvariants: the send queue size never exceeds the window after random sequences of add/ACK/NACK operations. - TimeoutBoosterProperties: boosted timeouts are always >= original, and reset correctly returns to a new base value. - PingPongZeroDisablesPing: zero ping time yields MaxInt64 (disabled).
In this commit, we set the pongMultiplier and maxPongTime fields to their default constant values in NewTimeOutManager. Previously these fields relied on the zero-value being safe (guarded by the dynamicPongTime boolean), but initializing them explicitly provides defense in depth against future code paths that might enable dynamic mode without going through WithDynamicPongTimeout.
In this commit, we replace require.NoError calls inside goroutines in the backwards compatibility tests with error channels that propagate errors back to the main test goroutine. Calling require.NoError (which invokes t.FailNow) from a non-test goroutine causes a panic rather than a clean test failure, as documented by the testify library.
In this commit, we update the reference implementation used in the brute-force property test to operate in the full uint8 space with natural wrapping, matching the production containsSequence exactly. The previous version used a mod-s space which happened to agree for small values but was a non-obvious equivalence that could silently break if the test ranges were widened.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
In this PR, we fix a class of LNC transport failures where the GBN
keepalive mechanism was too aggressively timing out connections that
traverse the aperture mailbox relay. Users were hitting "cannot send, gbn
exited" errors in the Terminal Web UI, particularly on Pool RPCs, because
the 3s pong timeout couldn't accommodate the variable latency introduced
by the relay hop.
We address this with three changes: bumping the static timeout defaults to
more reasonable values, adding a dynamic pong timeout that adapts to
observed RTT, and fixing a pre-existing deserialize bug found via
property-based testing.
Dynamic Pong Timeout
The core idea is simple: rather than a fixed pong deadline, we compute it
as
max(basePongTime, pongMultiplier * latestRTT), capped atmaxPongTime. The RTT is already tracked by theTimeoutManagerfrom ACKresponse measurements, so no additional round trips are needed. A new
WithDynamicPongTimeoutconfig option controls the multiplier (default 3x)and ceiling (default 15s). When dynamic mode is disabled or no RTT data
exists yet, we fall back to the static base pong time, so this is fully
backwards compatible with existing deployments.
In
gbn_conn.go, the pong ticker now callsResetWithIntervalon eachping send to pick up the latest dynamic value rather than reusing the
original static interval.
Updated Defaults
The mailbox client and server connections get new defaults tuned for the
relay transport:
These are backwards compatible since GBN timeouts are configured
independently per side with no negotiation. A new client connecting to an
old server (or vice versa) works fine because each side's pings arrive well
within the other's receive window.
Deserialize Bug Fix
While writing property-based tests with
pgregory.net/rapid, we found anoff-by-one in
Deserializefor DATA packets: it checkedlen(b) < 3butthen accessed
b[3]for the IsPing field, panicking on 3-byte inputs like[]byte{0x02, 0x00, 0x00}. The fix is simplylen(b) < 4.Property-Based Tests
We add eight property test groups using rapid that exercise GBN invariants
under randomized inputs: message roundtrip, deserialize safety (which
caught the bug above), modular sequence arithmetic, queue size bounds after
random add/ACK/NACK sequences, dynamic pong timeout floor/ceiling
invariants, and timeout booster behavior. We also add two backwards
compatibility integration tests that verify bidirectional data transfer
between mixed old/new timeout configurations.
See each commit message for a detailed description w.r.t the incremental
changes.