
fix(coderd/x/chatd): wait long enough for cold-start workspace MCP discovery#25035

Open
mafredri wants to merge 4 commits into main from chatd-mcp-discovery-timeout

Conversation

@mafredri (Member) commented May 7, 2026

Why

workspaceMCPDiscoveryTimeout was 5 seconds, while the agent's
per-server agentmcp.connectTimeout is 30 seconds. Any cold-start
fetch with a reload in flight (the common case on a fresh workspace)
is therefore guaranteed to cancel the chatd dial before the reload
settles. The agent logs this as mcp reload canceled by caller, and
chatd sees a 200 response with an empty tool list.

What changes

Bump workspaceMCPDiscoveryTimeout to 35 seconds and scope it to
exactly the call that races the agent's connectTimeout:

  • workspaceMCPCtx previously wrapped the DB lookup, the dial, and
    the ListMCPTools call as a single 5-second chain. Now only
    ListMCPTools runs under workspaceMCPDiscoveryTimeout. The DB
    lookup uses the parent chat-turn ctx; getWorkspaceConn already
    has its own internal dialTimeout (30s).
  • The constant's comment documents the contract: must exceed
    agent/x/agentmcp.connectTimeout. The agent constant is
    unexported, so the comment is the only link between the two.

The MCP discovery goroutine runs concurrently with prompt resolution
inside g2 in runChat, so the longer wait does not bound
user-visible latency in the steady state.

Tests

TestRunChat_WorkspaceMCPDiscoveryWaitsForSlowAgent mocks
ListMCPTools to block for 7 seconds and asserts the workspace MCP
tool name reaches the LLM tool list. Fails at the pre-fix 5s
constant; passes at 35s.

Companion change

The agent-side fix that makes the handler actually wait for a settled
reload is in PR #25034. This PR is independent. The reviewer
suggested adding a back-reference comment on
agent/x/agentmcp.connectTimeout so the contract becomes
bidirectional; that comment belongs in PR #25034 because that PR
already owns the file.

Reviewer-flagged future work

g2 in runChat is a plain errgroup.Group, not
errgroup.WithContext. The MCP discovery goroutine always returns
nil (errors are logged and swallowed), so a peer's failure does
not short-circuit it. With the timeout at 35s instead of 5s, the
worst-case dead-wait window after a peer error is 7x longer.
Converting the group to share a cancel signal requires auditing
every g2.Go caller for early-cancel safety; out of scope for
this constant tweak. Tracked separately.

🤖 This PR was created with the help of Coder Agents, and will be reviewed by a human. 🏂🏻

fix(coderd/x/chatd): wait long enough for cold-start workspace MCP discovery

The agent's per-server MCP connectTimeout (in agent/x/agentmcp) is
30s. chatd's per-turn workspaceMCPDiscoveryTimeout was 5s, which
caused the per-turn dial to abandon legitimate cold-start reloads
and surface an empty tool list to the LLM. On a cold workspace,
the first GET /api/v0/mcp/tools to the agent races the agent's
post-Ready reload; if that reload waits on a slow stdio MCP
server (typescript-language-server in the captured incident
reliably hits the 30s ceiling), chatd's 5s caller cancels the
HTTP request long before the reload settles, the handler returns
the pre-reload (empty) cache, and the workspace tools never
reach the LLM for that turn. chatd's per-turn cache only stores
non-empty results, so subsequent turns retry, but turns issued
within 5s of agent Ready see no workspace MCP tools.

Bump workspaceMCPDiscoveryTimeout to 35s so the per-turn dial
outlasts a single cold reload that touches a slow server.
Document the cross-package contract (must be greater than
agentmcp.connectTimeout) inline so a future change to either
constant is reviewed against the other.

This is Phase 3 of the broader fix; Phases 1 and 2 (server-side
"first reload settled" gate and handler wait) land separately.
Phase 3 is independently shippable and partially mitigates the
bug on its own: chatd will now wait long enough to receive the
agent's reload result on cold paths.

Coder Agents generated.
Member Author

mafredri commented May 7, 2026

/coder-agents-review

🤖 This review was automatically generated with Coder Agents.


@coder-agents-review (bot) left a comment


Tight, proportional fix. The timeout mismatch root-cause analysis is solid, the commit message tells the full story, and the regression test genuinely proves the fix by construction. The test reuses established patterns without inventing new infrastructure.

2 P2, 1 P3, 2 Nit, 1 P4.

Both P2s are facets of the same theme: the cross-package contract is documented only on the chatd side, and the comment's budget model is narrower than the actual dependency chain. The P3 follows naturally: even with better documentation, there is no automated enforcement of the invariant.

Leorio said it best: "Two-directional contracts survive. One-directional contracts are how this bug happened in the first place."

The 7s wall-clock cost of the regression test was flagged by seven reviewers independently. All agreed it is the correct tradeoff: the production code uses context.WithTimeout (not quartz), so eliminating the real-time wait would require production refactoring outside this PR's scope.


agent/x/agentmcp/manager.go:38

P2 [DEREM-2] connectTimeout has no comment pointing back to coderd/x/chatd.workspaceMCPDiscoveryTimeout. Someone bumping this constant to, say, 60s sees no local signal that chatd's discovery timeout must stay ahead of it. They grep their own package, find nothing about chatd, ship it, and the exact bug this PR fixes returns silently.

The chatd-side comment is thorough; matching it here makes the contract bidirectional:

// connectTimeout bounds how long we wait for a single MCP server
// to start its transport and complete initialization.
// coderd/x/chatd.workspaceMCPDiscoveryTimeout must exceed this
// value; update it if this constant changes.
const connectTimeout = 30 * time.Second

(Leorio P2, Ryosuke P3, Kite P3)

🤖

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chatd.go
// comment is the contract; do not import the agent constant
// at runtime. If agentmcp.connectTimeout changes, bump this
// value to match.
workspaceMCPDiscoveryTimeout = 35 * time.Second

P4 [DEREM-6] Pre-existing: g2 (line 6497) is a plain errgroup.Group, not errgroup.WithContext. If ConvertMessagesWithFiles errors immediately, g2.Wait() still blocks until the MCP discovery goroutine finishes, which now takes up to 35s instead of 5s. The MCP goroutine always returns nil (it logs and swallows errors), so it runs to completion or context expiry regardless of peer failures.

An errgroup.WithContext that derives from the parent context would let the MCP goroutine exit early on first error. Not introduced by this PR, but the 7x amplification of the dead-wait window is relevant.

(Hisoka)

🤖

Member Author

Acknowledged. g2 is a plain errgroup.Group (chatd.go:6497) and the MCP discovery goroutine swallows its own errors and returns nil, so a peer failure cannot short-circuit it. Bumping the MCP timeout from 5s to 35s lengthens the worst-case dead-wait window 7x, as you noted. Converting the group to errgroup.WithContext requires auditing every g2.Go caller (prompt build, instruction persist, user prompt resolve, MCP connect, workspace MCP discovery) to confirm early-cancel is the desired semantics for each. That audit is meaningfully larger than this PR's intent and risks regressions on paths unrelated to the cold-start race. Documented in the PR description's deferred-work section as future work; will track separately. The 35s worst case remains bounded by the chat-turn context.

🤖 Posted using /amend-review skill via Coder Agents.

Address review feedback on PR #25035:

- DEREM-3 (P2): the previous workspaceMCPCtx wrapped DB lookup, dial,
  and ListMCPTools as a single chain. The constant comment described
  it as "connectTimeout + small buffer," which only matches the
  ListMCPTools leg. Scope the timeout to wrap only ListMCPTools so
  the constant maps one-to-one with the operation that must outlast
  the agent's per-server connectTimeout. The DB lookup runs under
  the parent chat-turn ctx, and getWorkspaceConn already has its
  own internal dialTimeout (30s).

- DEREM-1 (P3): export workspaceMCPDiscoveryTimeout via export_test.go
  and add a fast unit test that asserts the cross-package invariant
  (chatd value >= agentmcp.connectTimeout + 1s buffer = 31s minimum).
  This catches the chatd half mechanically; the wall-clock regression
  test still covers the integration end to end.

- DEREM-5 (Nit): rewrite the constant's comment to acknowledge that
  agentmcp.connectTimeout is unexported (so Go enforces non-import),
  rather than framing the rule as a developer choice.

- DEREM-4 (Nit): remove "old 5s" references in the regression test's
  comments since that value no longer exists in the codebase.

- DEREM-2 (P2): defer to PR #25034. That PR owns
  agent/x/agentmcp/manager.go and is the right place to add a
  back-reference comment so the contract becomes bidirectional.

- DEREM-6 (P4): defer. g2's plain errgroup.Group amplifies the
  dead-wait window 7x on peer-error paths now that the MCP timeout
  is 35s. Pre-existing structural issue; converting to
  errgroup.WithContext requires auditing every g2.Go caller for
  early-cancel safety, out of scope for a constant tweak.

Coder Agents generated.
Member Author

mafredri commented May 7, 2026

In reply to DEREM-2 (P2, agent/x/agentmcp/manager.go:38): #25035 (review)

Valid finding; agree the contract should be bidirectional. The fix is a comment addition on agent/x/agentmcp/manager.go:38, which is owned by PR #25034 (agent-mcp-handler-wait-integrated, the combined Phase 1+2 of this plan). I cannot land it from this branch without conflicting with that PR. Routing this finding there: the comment to add should mirror the chatd-side one, e.g.

// connectTimeout bounds how long we wait for a single MCP server
// to start its transport and complete initialization.
// coderd/x/chatd.workspaceMCPDiscoveryTimeout must exceed this
// value; update it if this constant changes.
const connectTimeout = 30 * time.Second

If #25034 has already passed deep review, I can open a small standalone follow-up PR after both branches land. Open question for the operator: prefer routing this to #25034 or a follow-up?

🤖 Posted using /amend-review skill via Coder Agents.

Reviewer feedback: the comments I added in commit 40a18d8 read
like design documents. Trim each to what the reader actually needs.

No behavior change.

Coder Agents generated.
@mafredri mafredri requested a review from johnstcn May 7, 2026 13:12
Remove the constant export in export_test.go and the matching
TestWorkspaceMCPDiscoveryTimeoutExceedsAgentConnectTimeout assertion
test. Asserting a constant's value against a hardcoded literal in
the same package is not a meaningful test.

Coder Agents generated.
@mafredri mafredri marked this pull request as ready for review May 7, 2026 15:12