fix(coderd/x/chatd): wait long enough for cold-start workspace MCP discovery#25035
Conversation
fix(coderd/x/chatd): wait long enough for cold-start workspace MCP discovery

The agent's per-server MCP connectTimeout (in agent/x/agentmcp) is 30s. chatd's per-turn workspaceMCPDiscoveryTimeout was 5s, which caused the per-turn dial to abandon legitimate cold-start reloads and surface an empty tool list to the LLM.

On a cold workspace, the first GET /api/v0/mcp/tools to the agent races the agent's post-Ready reload; if that reload waits on a slow stdio MCP server (typescript-language-server in the captured incident reliably hits the 30s ceiling), chatd's 5s caller cancels the HTTP request long before the reload settles, the handler returns the pre-reload (empty) cache, and the workspace tools never reach the LLM for that turn. chatd's per-turn cache only stores non-empty results, so subsequent turns retry, but turns issued within 5s of agent Ready see no workspace MCP tools.

Bump workspaceMCPDiscoveryTimeout to 35s so the per-turn dial outlasts a single cold reload that touches a slow server. Document the cross-package contract (it must be greater than agentmcp.connectTimeout) inline so a future change to either constant is reviewed against the other.

This is Phase 3 of the broader fix; Phases 1 and 2 (the server-side "first reload settled" gate and handler wait) land separately. Phase 3 is independently shippable and partially mitigates the bug on its own: chatd will now wait long enough to receive the agent's reload result on cold paths.

Coder Agents generated.
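To make the canceled-call mechanism concrete, here is a minimal, self-contained Go sketch. It is not coderd code: the handler, payload, and durations are invented stand-ins for the agent's tools endpoint and chatd's old 5s budget, showing how a caller-side context deadline cancels the request before a slow handler can answer.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// Stand-in for the agent's tools handler while a cold reload is still
	// waiting on a slow stdio MCP server.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(300 * time.Millisecond): // the "reload" settles
			fmt.Fprint(w, `{"tools":["workspace-tool"]}`)
		case <-r.Context().Done(): // the caller gave up first
		}
	}))
	defer srv.Close()

	// Stand-in for chatd's old 5s per-turn budget: far shorter than the
	// time the handler needs, so the caller cancels the in-flight request.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, srv.URL, nil)
	resp, err := http.DefaultClient.Do(req)
	if err == nil {
		resp.Body.Close()
	}
	// Prints true: the turn sees a failed/empty discovery, never the tools.
	fmt.Println(errors.Is(err, context.DeadlineExceeded))
}
```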
/coder-agents-review
Tight, proportional fix. The timeout mismatch root-cause analysis is solid, the commit message tells the full story, and the regression test genuinely proves the fix by construction. The test reuses established patterns without inventing new infrastructure.
2 P2, 1 P3, 2 Nit, 1 P4.
Both P2s are facets of the same theme: the cross-package contract is documented only on the chatd side, and the comment's budget model is narrower than the actual dependency chain. The P3 follows naturally: even with better documentation, there is no automated enforcement of the invariant.
Leorio said it best: "Two-directional contracts survive. One-directional contracts are how this bug happened in the first place."
The 7s wall-clock cost of the regression test was flagged by seven reviewers independently. All agreed it is the correct tradeoff: the production code uses context.WithTimeout (not quartz), so eliminating the real-time wait would require production refactoring outside this PR's scope.
agent/x/agentmcp/manager.go:38
P2 [DEREM-2] connectTimeout has no comment pointing back to coderd/x/chatd.workspaceMCPDiscoveryTimeout. Someone bumping this constant to, say, 60s sees no local signal that chatd's discovery timeout must stay ahead of it. They grep their own package, find nothing about chatd, ship it, and the exact bug this PR fixes returns silently.
The chatd-side comment is thorough; matching it here makes the contract bidirectional:
```go
// connectTimeout bounds how long we wait for a single MCP server
// to start its transport and complete initialization.
// coderd/x/chatd.workspaceMCPDiscoveryTimeout must exceed this
// value; update it if this constant changes.
const connectTimeout = 30 * time.Second
```

(Leorio P2, Ryosuke P3, Kite P3)
🤖 This review was automatically generated with Coder Agents.
```go
// comment is the contract; do not import the agent constant
// at runtime. If agentmcp.connectTimeout changes, bump this
// value to match.
workspaceMCPDiscoveryTimeout = 35 * time.Second
```
P4 [DEREM-6] Pre-existing: g2 (line 6497) is a plain errgroup.Group, not errgroup.WithContext. If ConvertMessagesWithFiles errors immediately, g2.Wait() still blocks until the MCP discovery goroutine finishes, which now takes up to 35s instead of 5s. The MCP goroutine always returns nil (it logs and swallows errors), so it runs to completion or context expiry regardless of peer failures.
An errgroup.WithContext that derives from the parent context would let the MCP goroutine exit early on first error. Not introduced by this PR, but the 7x amplification of the dead-wait window is relevant.
(Hisoka)
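As an illustration of the tradeoff (not the runChat code; the goroutine bodies, error message, and durations are invented), a minimal sketch of how errgroup.WithContext lets a long-running sibling observe the first peer error, which a plain errgroup.Group never signals:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

func main() {
	// WithContext cancels the derived context on the first error, so a
	// long-running sibling goroutine can bail out early.
	g, ctx := errgroup.WithContext(context.Background())

	g.Go(func() error {
		return errors.New("prompt build failed") // stand-in for a fast-failing peer
	})

	g.Go(func() error {
		// Stand-in for the MCP discovery goroutine: with a plain
		// errgroup.Group it would run for its full (now 35s) budget; here
		// it observes the cancellation instead.
		select {
		case <-time.After(35 * time.Second):
			return nil
		case <-ctx.Done():
			fmt.Println("discovery aborted early:", ctx.Err())
			return nil
		}
	})

	fmt.Println("group error:", g.Wait())
}
```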
Acknowledged. g2 is a plain errgroup.Group (chatd.go:6497) and the MCP discovery goroutine swallows its own errors and returns nil, so a peer failure cannot short-circuit it. Bumping the MCP timeout from 5s to 35s lengthens the worst-case dead-wait window 7x, as you noted. Converting the group to errgroup.WithContext requires auditing every g2.Go caller (prompt build, instruction persist, user prompt resolve, MCP connect, workspace MCP discovery) to confirm early-cancel is the desired semantics for each. That audit is meaningfully larger than this PR's intent and risks regressions on paths unrelated to the cold-start race. Documented in the PR description's deferred-work section as future work; will track separately. The 35s worst case remains bounded by the chat-turn context.
🤖 Posted using the /amend-review skill via Coder Agents.
Address review feedback on PR #25035:

- DEREM-3 (P2): the previous workspaceMCPCtx wrapped DB lookup, dial, and ListMCPTools as a single chain. The constant comment described it as "connectTimeout + small buffer," which only matches the ListMCPTools leg. Scope the timeout to wrap only ListMCPTools so the constant maps one-to-one with the operation that must outlast the agent's per-server connectTimeout. The DB lookup runs under the parent chat-turn ctx, and getWorkspaceConn already has its own internal dialTimeout (30s).
- DEREM-1 (P3): export workspaceMCPDiscoveryTimeout via export_test.go and add a fast unit test that asserts the cross-package invariant (chatd value >= agentmcp.connectTimeout + 1s buffer = 31s minimum). This catches the chatd half mechanically; the wall-clock regression test still covers the integration end to end.
- DEREM-5 (Nit): rewrite the constant's comment to acknowledge that agentmcp.connectTimeout is unexported (so Go enforces non-import), rather than framing the rule as a developer choice.
- DEREM-4 (Nit): remove "old 5s" references in the regression test's comments since that value no longer exists in the codebase.
- DEREM-2 (P2): defer to PR #25034. That PR owns agent/x/agentmcp/manager.go and is the right place to add a back-reference comment so the contract becomes bidirectional.
- DEREM-6 (P4): defer. g2's plain errgroup.Group amplifies the dead-wait window 7x on peer-error paths now that the MCP timeout is 35s. Pre-existing structural issue; converting to errgroup.WithContext requires auditing every g2.Go caller for early-cancel safety, out of scope for a constant tweak.

Coder Agents generated.
In reply to DEREM-2 (P2): Valid finding; agree the contract should be bidirectional. The fix is a comment addition on agentmcp.connectTimeout:

```go
// connectTimeout bounds how long we wait for a single MCP server
// to start its transport and complete initialization.
// coderd/x/chatd.workspaceMCPDiscoveryTimeout must exceed this
// value; update it if this constant changes.
const connectTimeout = 30 * time.Second
```

If #25034 has already passed deep review, I can open a small standalone follow-up PR after both branches land. Open question for the operator: prefer routing this to #25034 or a follow-up?
Reviewer feedback: the comments I added in commit 40a18d8 read like design documents. Trim each to what the reader actually needs. No behavior change. Coder Agents generated.
Remove the constant export in export_test.go and the matching TestWorkspaceMCPDiscoveryTimeoutExceedsAgentConnectTimeout assertion test. Asserting a constant's value against a hardcoded literal in the same package is not a meaningful test. Coder Agents generated.
Why
`workspaceMCPDiscoveryTimeout` was 5 seconds. The agent's per-server `agentmcp.connectTimeout` is 30 seconds. Any cold-start fetch where a reload is in flight (the common case on a fresh workspace) guarantees the chatd dial cancels before the reload settles. The result is logged by the agent as `mcp reload canceled by caller` and surfaced to chatd as a 200 with an empty tool list.
What changes
Bump `workspaceMCPDiscoveryTimeout` to 35 seconds and scope it to exactly the call that races the agent's `connectTimeout`:

- `workspaceMCPCtx` previously wrapped the DB lookup, the dial, and the `ListMCPTools` call as a single 5-second chain. Now only `ListMCPTools` runs under `workspaceMCPDiscoveryTimeout` (see the sketch after this list). The DB lookup uses the parent chat-turn ctx; `getWorkspaceConn` already has its own internal `dialTimeout` (30s).
- The constant's comment states that it must exceed `agent/x/agentmcp.connectTimeout`. The agent constant is unexported, so the comment is the only link between the two.

The MCP discovery goroutine runs concurrently with prompt resolution inside `g2` in `runChat`, so the longer wait does not bound user-visible latency in the steady state.
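A minimal sketch of that scoping, assuming placeholder names (`chatdsketch`, `Tool`, `agentConn`, and `discoverWorkspaceMCPTools` are stand-ins for illustration, not the actual chatd types or function):

```go
package chatdsketch

import (
	"context"
	"time"
)

// Tool and agentConn are placeholders standing in for chatd's real agent
// client; the context scoping is the only point of this sketch.
type Tool struct{ Name string }

type agentConn interface {
	ListMCPTools(ctx context.Context) ([]Tool, error)
}

// Must exceed the agent's per-server MCP connectTimeout (30s) so a single
// cold-start reload can settle before the caller gives up.
const workspaceMCPDiscoveryTimeout = 35 * time.Second

// discoverWorkspaceMCPTools shows the post-review scoping: the chat-turn
// ctx governs everything before this call (DB lookup, dial), and only the
// ListMCPTools leg runs under the 35s discovery budget.
func discoverWorkspaceMCPTools(ctx context.Context, conn agentConn) ([]Tool, error) {
	toolsCtx, cancel := context.WithTimeout(ctx, workspaceMCPDiscoveryTimeout)
	defer cancel()
	return conn.ListMCPTools(toolsCtx)
}
```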
Tests
`TestRunChat_WorkspaceMCPDiscoveryWaitsForSlowAgent` mocks `ListMCPTools` to block for 7 seconds and asserts the workspace MCP tool name reaches the LLM tool list. Fails at the pre-fix 5s constant; passes at 35s.
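The real test lives in chatd's suite and uses its own harness; purely as a schematic of its shape, reusing the placeholder package from the sketch above (`slowConn` and its 7s delay are invented here):

```go
package chatdsketch

import (
	"context"
	"testing"
	"time"
)

// slowConn fakes an agent whose cold-start reload holds ListMCPTools for a
// while before returning a single workspace tool.
type slowConn struct{ delay time.Duration }

func (s slowConn) ListMCPTools(ctx context.Context) ([]Tool, error) {
	select {
	case <-time.After(s.delay):
		return []Tool{{Name: "workspace-tool"}}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// A 7s block fits inside the 35s discovery budget, so the tools must come
// back; under the old 5s budget the same fake would hit ctx.Done() first.
func TestDiscoveryOutlastsSlowAgent(t *testing.T) {
	tools, err := discoverWorkspaceMCPTools(context.Background(), slowConn{delay: 7 * time.Second})
	if err != nil || len(tools) == 0 {
		t.Fatalf("expected workspace tools, got %v (err=%v)", tools, err)
	}
}
```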
Companion change
The agent-side fix that makes the handler actually wait for a settled reload is in PR #25034. This PR is independent. The reviewer suggested adding a back-reference comment on `agent/x/agentmcp.connectTimeout` so the contract becomes bidirectional; that comment belongs in PR #25034 because that PR already owns the file.
Reviewer-flagged future work
`g2` in `runChat` is a plain `errgroup.Group`, not `errgroup.WithContext`. The MCP discovery goroutine always returns nil (errors are logged and swallowed), so a peer's failure does not short-circuit it. With the timeout at 35s instead of 5s, the worst-case dead-wait window after a peer error is 7x longer. Converting the group to share a cancel signal requires auditing every `g2.Go` caller for early-cancel safety; out of scope for this constant tweak. Tracked separately.