fix(coderd/x/chatd): wait long enough for cold-start workspace MCP discovery#25035
Conversation
fix(coderd/x/chatd): wait long enough for cold-start workspace MCP discovery

The agent's per-server MCP connectTimeout (in agent/x/agentmcp) is 30s. chatd's per-turn workspaceMCPDiscoveryTimeout was 5s, which caused the per-turn dial to abandon legitimate cold-start reloads and surface an empty tool list to the LLM.

On a cold workspace, the first GET /api/v0/mcp/tools to the agent races the agent's post-Ready reload; if that reload waits on a slow stdio MCP server (typescript-language-server in the captured incident reliably hits the 30s ceiling), chatd's 5s caller cancels the HTTP request long before the reload settles, the handler returns the pre-reload (empty) cache, and the workspace tools never reach the LLM for that turn. chatd's per-turn cache only stores non-empty results, so subsequent turns retry, but turns issued within 5s of agent Ready see no workspace MCP tools.

Bump workspaceMCPDiscoveryTimeout to 35s so the per-turn dial outlasts a single cold reload that touches a slow server. Document the cross-package contract (it must be greater than agentmcp.connectTimeout) inline so a future change to either constant is reviewed against the other.

This is Phase 3 of the broader fix; Phases 1 and 2 (the server-side "first reload settled" gate and handler wait) land separately. Phase 3 is independently shippable and partially mitigates the bug on its own: chatd will now wait long enough to receive the agent's reload result on cold paths.

Coder Agents generated.
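To make the canceled-call mechanism concrete, here is a minimal, self-contained Go sketch. It is not coderd code: the handler, payload, and durations are invented stand-ins for the agent's tools endpoint and chatd's old 5s budget, showing how a caller-side context deadline cancels the request before a slow handler can answer.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// Stand-in for the agent's tools handler while a cold reload is still
	// waiting on a slow stdio MCP server.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(300 * time.Millisecond): // the "reload" settles
			fmt.Fprint(w, `{"tools":["workspace-tool"]}`)
		case <-r.Context().Done(): // the caller gave up first
		}
	}))
	defer srv.Close()

	// Stand-in for chatd's old 5s per-turn budget: far shorter than the
	// time the handler needs, so the caller cancels the in-flight request.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, srv.URL, nil)
	resp, err := http.DefaultClient.Do(req)
	if err == nil {
		resp.Body.Close()
	}
	// Prints true: the turn sees a failed/empty discovery, never the tools.
	fmt.Println(errors.Is(err, context.DeadlineExceeded))
}
```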
/coder-agents-review
Tight, proportional fix. The timeout mismatch root-cause analysis is solid, the commit message tells the full story, and the regression test genuinely proves the fix by construction. The test reuses established patterns without inventing new infrastructure.
2 P2, 1 P3, 2 Nit, 1 P4.
Both P2s are facets of the same theme: the cross-package contract is documented only on the chatd side, and the comment's budget model is narrower than the actual dependency chain. The P3 follows naturally: even with better documentation, there is no automated enforcement of the invariant.
Leorio said it best: "Two-directional contracts survive. One-directional contracts are how this bug happened in the first place."
The 7s wall-clock cost of the regression test was flagged by seven reviewers independently. All agreed it is the correct tradeoff: the production code uses context.WithTimeout (not quartz), so eliminating the real-time wait would require production refactoring outside this PR's scope.
agent/x/agentmcp/manager.go:38
P2 [DEREM-2] connectTimeout has no comment pointing back to coderd/x/chatd.workspaceMCPDiscoveryTimeout. Someone bumping this constant to, say, 60s sees no local signal that chatd's discovery timeout must stay ahead of it. They grep their own package, find nothing about chatd, ship it, and the exact bug this PR fixes returns silently.
The chatd-side comment is thorough; matching it here makes the contract bidirectional:
```go
// connectTimeout bounds how long we wait for a single MCP server
// to start its transport and complete initialization.
// coderd/x/chatd.workspaceMCPDiscoveryTimeout must exceed this
// value; update it if this constant changes.
const connectTimeout = 30 * time.Second
```

(Leorio P2, Ryosuke P3, Kite P3)
🤖 This review was automatically generated with Coder Agents.
```go
// comment is the contract; do not import the agent constant
// at runtime. If agentmcp.connectTimeout changes, bump this
// value to match.
workspaceMCPDiscoveryTimeout = 35 * time.Second
```
P4 [DEREM-6] Pre-existing: g2 (line 6497) is a plain errgroup.Group, not errgroup.WithContext. If ConvertMessagesWithFiles errors immediately, g2.Wait() still blocks until the MCP discovery goroutine finishes, which now takes up to 35s instead of 5s. The MCP goroutine always returns nil (it logs and swallows errors), so it runs to completion or context expiry regardless of peer failures.
An errgroup.WithContext that derives from the parent context would let the MCP goroutine exit early on first error. Not introduced by this PR, but the 7x amplification of the dead-wait window is relevant.
(Hisoka)
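As an illustration of the tradeoff (not the runChat code; the goroutine bodies, error message, and durations are invented), a minimal sketch of how errgroup.WithContext lets a long-running sibling observe the first peer error, which a plain errgroup.Group never signals:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

func main() {
	// WithContext cancels the derived context on the first error, so a
	// long-running sibling goroutine can bail out early.
	g, ctx := errgroup.WithContext(context.Background())

	g.Go(func() error {
		return errors.New("prompt build failed") // stand-in for a fast-failing peer
	})

	g.Go(func() error {
		// Stand-in for the MCP discovery goroutine: with a plain
		// errgroup.Group it would run for its full (now 35s) budget; here
		// it observes the cancellation instead.
		select {
		case <-time.After(35 * time.Second):
			return nil
		case <-ctx.Done():
			fmt.Println("discovery aborted early:", ctx.Err())
			return nil
		}
	})

	fmt.Println("group error:", g.Wait())
}
```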
Acknowledged. g2 is a plain errgroup.Group (chatd.go:6497) and the MCP discovery goroutine swallows its own errors and returns nil, so a peer failure cannot short-circuit it. Bumping the MCP timeout from 5s to 35s lengthens the worst-case dead-wait window 7x, as you noted. Converting the group to errgroup.WithContext requires auditing every g2.Go caller (prompt build, instruction persist, user prompt resolve, MCP connect, workspace MCP discovery) to confirm early-cancel is the desired semantics for each. That audit is meaningfully larger than this PR's intent and risks regressions on paths unrelated to the cold-start race. Documented in the PR description's deferred-work section as future work; will track separately. The 35s worst case remains bounded by the chat-turn context.
🤖 Posted using the /amend-review skill via Coder Agents.
Address review feedback on PR #25035:

- DEREM-3 (P2): the previous workspaceMCPCtx wrapped DB lookup, dial, and ListMCPTools as a single chain. The constant comment described it as "connectTimeout + small buffer," which only matches the ListMCPTools leg. Scope the timeout to wrap only ListMCPTools so the constant maps one-to-one with the operation that must outlast the agent's per-server connectTimeout. The DB lookup runs under the parent chat-turn ctx, and getWorkspaceConn already has its own internal dialTimeout (30s).
- DEREM-1 (P3): export workspaceMCPDiscoveryTimeout via export_test.go and add a fast unit test that asserts the cross-package invariant (chatd value >= agentmcp.connectTimeout + 1s buffer = 31s minimum). This catches the chatd half mechanically; the wall-clock regression test still covers the integration end to end.
- DEREM-5 (Nit): rewrite the constant's comment to acknowledge that agentmcp.connectTimeout is unexported (so Go enforces non-import), rather than framing the rule as a developer choice.
- DEREM-4 (Nit): remove "old 5s" references in the regression test's comments since that value no longer exists in the codebase.
- DEREM-2 (P2): defer to PR #25034. That PR owns agent/x/agentmcp/manager.go and is the right place to add a back-reference comment so the contract becomes bidirectional.
- DEREM-6 (P4): defer. g2's plain errgroup.Group amplifies the dead-wait window 7x on peer-error paths now that the MCP timeout is 35s. Pre-existing structural issue; converting to errgroup.WithContext requires auditing every g2.Go caller for early-cancel safety, out of scope for a constant tweak.

Coder Agents generated.
In reply to DEREM-2 (P2): Valid finding; agree the contract should be bidirectional. The fix is a comment addition on agentmcp.connectTimeout:

```go
// connectTimeout bounds how long we wait for a single MCP server
// to start its transport and complete initialization.
// coderd/x/chatd.workspaceMCPDiscoveryTimeout must exceed this
// value; update it if this constant changes.
const connectTimeout = 30 * time.Second
```

If #25034 has already passed deep review, I can open a small standalone follow-up PR after both branches land. Open question for the operator: prefer routing this to #25034 or a follow-up?
Reviewer feedback: the comments I added in commit 40a18d8 read like design documents. Trim each to what the reader actually needs. No behavior change. Coder Agents generated.
Remove the constant export in export_test.go and the matching TestWorkspaceMCPDiscoveryTimeoutExceedsAgentConnectTimeout assertion test. Asserting a constant's value against a hardcoded literal in the same package is not a meaningful test. Coder Agents generated.
Why
`workspaceMCPDiscoveryTimeout` was 5 seconds. The agent's per-server `agentmcp.connectTimeout` is 30 seconds. Any cold-start fetch where a reload is in flight (the common case on a fresh workspace) guarantees the chatd dial cancels before the reload settles. The result is logged by the agent as `mcp reload canceled by caller` and surfaced to chatd as a 200 with an empty tool list.
What changes
Bump `workspaceMCPDiscoveryTimeout` to 35 seconds and scope it to exactly the call that races the agent's `connectTimeout`:

- `workspaceMCPCtx` previously wrapped the DB lookup, the dial, and the `ListMCPTools` call as a single 5-second chain. Now only `ListMCPTools` runs under `workspaceMCPDiscoveryTimeout` (see the sketch after this list). The DB lookup uses the parent chat-turn ctx; `getWorkspaceConn` already has its own internal `dialTimeout` (30s).
- The constant's comment states that it must exceed `agent/x/agentmcp.connectTimeout`. The agent constant is unexported, so the comment is the only link between the two.

The MCP discovery goroutine runs concurrently with prompt resolution inside `g2` in `runChat`, so the longer wait does not bound user-visible latency in the steady state.
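A minimal sketch of that scoping, assuming placeholder names (`chatdsketch`, `Tool`, `agentConn`, and `discoverWorkspaceMCPTools` are stand-ins for illustration, not the actual chatd types or function):

```go
package chatdsketch

import (
	"context"
	"time"
)

// Tool and agentConn are placeholders standing in for chatd's real agent
// client; the context scoping is the only point of this sketch.
type Tool struct{ Name string }

type agentConn interface {
	ListMCPTools(ctx context.Context) ([]Tool, error)
}

// Must exceed the agent's per-server MCP connectTimeout (30s) so a single
// cold-start reload can settle before the caller gives up.
const workspaceMCPDiscoveryTimeout = 35 * time.Second

// discoverWorkspaceMCPTools shows the post-review scoping: the chat-turn
// ctx governs everything before this call (DB lookup, dial), and only the
// ListMCPTools leg runs under the 35s discovery budget.
func discoverWorkspaceMCPTools(ctx context.Context, conn agentConn) ([]Tool, error) {
	toolsCtx, cancel := context.WithTimeout(ctx, workspaceMCPDiscoveryTimeout)
	defer cancel()
	return conn.ListMCPTools(toolsCtx)
}
```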
Tests
`TestRunChat_WorkspaceMCPDiscoveryWaitsForSlowAgent` mocks `ListMCPTools` to block for 7 seconds and asserts the workspace MCP tool name reaches the LLM tool list. Fails at the pre-fix 5s constant; passes at 35s.
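The real test lives in chatd's suite and uses its own harness; purely as a schematic of its shape, reusing the placeholder package from the sketch above (`slowConn` and its 7s delay are invented here):

```go
package chatdsketch

import (
	"context"
	"testing"
	"time"
)

// slowConn fakes an agent whose cold-start reload holds ListMCPTools for a
// while before returning a single workspace tool.
type slowConn struct{ delay time.Duration }

func (s slowConn) ListMCPTools(ctx context.Context) ([]Tool, error) {
	select {
	case <-time.After(s.delay):
		return []Tool{{Name: "workspace-tool"}}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// A 7s block fits inside the 35s discovery budget, so the tools must come
// back; under the old 5s budget the same fake would hit ctx.Done() first.
func TestDiscoveryOutlastsSlowAgent(t *testing.T) {
	tools, err := discoverWorkspaceMCPTools(context.Background(), slowConn{delay: 7 * time.Second})
	if err != nil || len(tools) == 0 {
		t.Fatalf("expected workspace tools, got %v (err=%v)", tools, err)
	}
}
```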
Companion change
The agent-side fix that makes the handler actually wait for a settled reload is in PR #25034. This PR is independent. The reviewer suggested adding a back-reference comment on `agent/x/agentmcp.connectTimeout` so the contract becomes bidirectional; that comment belongs in PR #25034 because that PR already owns the file.
Reviewer-flagged future work
`g2` in `runChat` is a plain `errgroup.Group`, not `errgroup.WithContext`. The MCP discovery goroutine always returns nil (errors are logged and swallowed), so a peer's failure does not short-circuit it. With the timeout at 35s instead of 5s, the worst-case dead-wait window after a peer error is 7x longer. Converting the group to share a cancel signal requires auditing every `g2.Go` caller for early-cancel safety; out of scope for this constant tweak. Tracked separately.