Skip to content

🤖 fix: add MCP server startup timeout and surface failures#2786

Open
ibetitsmike wants to merge 9 commits intomainfrom
mike/fix-mcp-startup-timeout
Open

🤖 fix: add MCP server startup timeout and surface failures#2786
ibetitsmike wants to merge 9 commits intomainfrom
mike/fix-mcp-startup-timeout

Conversation

@ibetitsmike
Copy link
Contributor

Summary

Misconfigured MCP servers (e.g., wrong command+args format) cause startSingleServer() to hang indefinitely on createMCPClient() / client.tools(), blocking the entire chat stream. This adds a 60s startup timeout and surfaces failures to the user via a system message warning.

Background

  • startSingleServer() awaits MCP client creation with no timeout — a broken server config hangs forever
  • The existing runServerTest() (settings "Test" button) already has a Promise.race timeout, but the production startup path did not
  • Errors were caught and logged but never surfaced to the user

Implementation

Startup timeout (mcpServerManager.ts):

  • New MCP_STARTUP_TIMEOUT_MS = 60_000 constant
  • Extracted startSingleServerstartSingleServerImpl, wrapped with Promise.race timeout following the existing runMCPToolWithDeadline pattern (cleanup via clearTimeout + .unref())
  • Reduced stdio exec timeout from 24h to match startup timeout

Failure tracking (mcpServerManager.ts):

  • startServers() now returns { instances, failedServerNames }
  • MCPWorkspaceStats extended with failedServerNames: string[]
  • Both call sites in getToolsForWorkspace() updated to propagate failures

User-facing warning (aiService.ts):

  • When servers fail, a warning is prepended to the system message so the model informs the user which servers are unavailable — zero frontend changes needed

Generated with mux • Model: anthropic:claude-opus-4-6 • Thinking: xhigh • Cost: $4.32

@ibetitsmike
Copy link
Contributor Author

@codex review

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6ab14a2970

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@ibetitsmike
Copy link
Contributor Author

@codex review

Addressed: updated the leased+closed restart path to merge failedServerNames into returned stats (commit df5be4e), so aiService.ts now correctly surfaces the warning.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: df5be4e517

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@ibetitsmike
Copy link
Contributor Author

@codex review

Addressed: reverted the stdio exec timeout back to 24h (process lifetime), since startup timeout is already handled by the Promise.race wrapper in startSingleServer.

@ibetitsmike
Copy link
Contributor Author

@codex review

The stdio exec timeout is already 24h (line 1367: timeout: 60 * 60 * 24). This was fixed in commit 7014420 — the previous thread was about the same issue and was already resolved. Startup timeout is handled separately by the Promise.race wrapper.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 70144207cd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@ibetitsmike
Copy link
Contributor Author

@codex review

Addressed: added AbortController to startSingleServer so timed-out startups now abort the in-flight impl and clean up spawned processes. Both stdio and HTTP/SSE paths have abort listeners that close transports/clients on timeout.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e70d79d00b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@ibetitsmike
Copy link
Contributor Author

@codex review

Addressed: token recount after MCP warning prepend. Uses getTokenizerForModel + resolveModelForMetadata to recount systemMessageTokens after mutating the system message.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant