Workspace MCP Transport Pool
Overview
McpTransportPool (packages/core/src/tools/mcp-transport-pool.ts) is the F2 (#4175 commit 5) workspace-scoped pool: multiple ACP sessions on one daemon share one transport per unique (serverName + configFingerprint) tuple, instead of each spawning its own MCP child process. The pool lives inside the ACP child (QwenAgent.mcpPool), is constructed once at agent startup with the daemon’s bootstrap Config, and survives session lifecycles. Entries reference-count session attaches and close after a configurable grace period when the reference count reaches zero.
It is the main mechanism that prevents a multi-session daemon from forking one copy of every MCP server per session.
Responsibilities
- Acquire or spawn one MCP transport per (name + fingerprint), deduplicating concurrent acquires via spawnInFlight.
- Release per-session references; arm the entry’s drain timer when the last reference detaches.
- Survive ref-count churn with a hard MAX_IDLE_MS cap so a thrashing client cannot keep an idle transport alive forever.
- Reference-count sessions in a reverse index (sessionToEntries) so releaseSession(sessionId) is O(refs) rather than O(entries).
- Restart entries on demand (restartByName): single-entry returns {restarted, durationMs}, multi-entry returns {entries: RestartResult[]} (F2 multi-entry contract).
- Drain the entire pool on daemon shutdown with a configurable timeout; refuse new acquires while draining.
- Consult WorkspaceMcpBudget (see 06-mcp-budget-guardrails.md) on acquire to enforce per-name reservation caps; release the slot on entry close when no sibling entry holds the same name.
- Produce per-session filtered tool/prompt snapshots via SessionMcpView so a discovery in one session does not register tools into other sessions.
Architecture
Public surface
class McpTransportPool {
constructor(cliConfig: Config, options: McpTransportPoolOptions);
acquire(
serverName,
cfg,
sessionId,
sessionToolRegistry,
sessionPromptRegistry,
): Promise<PooledConnection>;
release(id, sessionId): void;
releaseSession(sessionId): void;
restartByName(
name,
opts?,
): Promise<RestartResult | { entries: RestartResult[] }>;
drainAll(opts?): Promise<void>;
getBudget(): WorkspaceMcpBudget | undefined;
getSnapshot(): McpPoolSnapshot;
}

McpTransportPoolOptions:
- workspaceContext: WorkspaceContext (required).
- debugMode: boolean.
- sendSdkMcpMessage?: per-session callback (pool bypasses SDK MCP).
- pooledTransports?: ReadonlySet<McpTransportKind>, default {stdio, websocket}. HTTP/SSE transports stay unpooled by default because their headers can carry session-specific OAuth state, but operators can explicitly opt them into pooling with QWEN_SERVE_MCP_POOL_TRANSPORTS.
- drainDelayMs?: default 30_000.
- entryOptions?: (transport) => PoolEntryOptions.
- budget?: WorkspaceMcpBudget.
Internal state
| State | Type | Purpose |
|---|---|---|
| entries | Map<ConnectionId, PoolEntry> | Live pool entries keyed by connectionIdOf(name, fingerprint). |
| unpooledIds | Set<ConnectionId> | Entries for transports outside the configured pooledTransports allowlist. |
| spawnInFlight | Map<ConnectionId, Promise<PoolEntry>> | Deduplicates concurrent cold acquires for the same key. |
| sessionToEntries | Map<string, Set<ConnectionId>> | V21-2 reverse index for O(refs) releaseSession. |
| draining | boolean | Drain mutex: once set, all acquire calls reject. |
| nextIndexByName | Map<string, number> | V21-7 monotonic entryIndex per server name (dashboards do not reshuffle when a new entry appears). |
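The interplay between entries and spawnInFlight can be sketched as follows. This is a minimal illustration, not the pool's actual code: the InFlightDedup class, its spawn callback, and the spawnCount counter are hypothetical names introduced here, and the draining check and budget reservation are left out.

```typescript
// Hypothetical stand-ins; the real ConnectionId and PoolEntry live in the pool sources.
type ConnectionId = string;
interface PoolEntry {
  id: ConnectionId;
}

class InFlightDedup {
  private readonly entries = new Map<ConnectionId, PoolEntry>();
  private readonly spawnInFlight = new Map<ConnectionId, Promise<PoolEntry>>();
  private readonly spawn: (id: ConnectionId) => Promise<PoolEntry>;
  spawnCount = 0; // illustration only: counts cold spawns

  constructor(spawn: (id: ConnectionId) => Promise<PoolEntry>) {
    this.spawn = spawn;
  }

  acquire(id: ConnectionId): Promise<PoolEntry> {
    const live = this.entries.get(id);
    if (live) return Promise.resolve(live); // warm path: entry already exists

    const pending = this.spawnInFlight.get(id);
    if (pending) return pending; // join the cold spawn that is already under way

    this.spawnCount += 1;
    const started = this.spawn(id)
      .then((entry) => {
        this.entries.set(id, entry);
        return entry;
      })
      .finally(() => {
        // Clear the in-flight slot whether the spawn succeeded or failed.
        this.spawnInFlight.delete(id);
      });
    this.spawnInFlight.set(id, started);
    return started;
  }
}

// Two acquires issued back to back share one spawn.
const dedup = new InFlightDedup(async (id) => ({ id }));
const first = dedup.acquire("github::abc123");
const second = dedup.acquire("github::abc123");
```

Both callers receive the same promise, so only one transport is spawned for the key even though neither acquire has resolved yet.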
PoolEntry (per-entry structure, mcp-pool-entry.ts)
State machine: spawning → active. From active, an entry can cycle through reconnect and back (active ↔ reconnect), or move to draining on the last detach. From draining, it returns to active on attach or moves to closed when the drain timer fires.
| Field | Purpose |
|---|---|
| localStatus: MCPServerStatus | Driven by MCPServerStatus lifecycle. |
| state: PoolEntryState | spawning/active/draining/closed/failed. |
| generation: number | Bumped on each restart; subscribers compare to detect reconnect cycles. |
| refs: Set<string> | Session ids currently attached. |
| subscribers: Map<string, SessionMcpView> | Per-session filtered views. |
| subscriberHandles: Map<string, PooledConnectionImpl> | Handles returned from acquire. |
| toolsSnapshot[], promptsSnapshot[] | Canonical pool-level snapshots; re-issued on toolsChanged / promptsChanged. |
| drainTimer? | Armed when refs.size === 0; default 30s. Reset on attach. |
| maxIdleTimer? | Armed at first idle; never reset by acquire/release churn. Default 5 min. |
| firstIdleAt? | Watermark for the max-idle hard cap. |
| restartInFlight? | Mutex for restart(). |
PoolEntryOptions
interface PoolEntryOptions {
drainDelayMs: number; // default 30_000
maxIdleMs: number; // default 5 * 60_000
maxReconnectAttempts: number; // default 3 (stdio/ws) or 5 (http/sse)
reconnectStrategy:
| { kind: 'fixed'; delayMs: number }
| { kind: 'exponential'; baseMs: number; capMs: number };
}

defaultPoolEntryOptions(transport) (mcp-pool-entry.ts) returns stdio/ws defaults {fixed 5s, 3 attempts} and http/sse defaults {exponential 1s → 16s, 5 attempts}. Remote transports get longer retry budgets because their failures are more often transient.
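To make the two retry schedules concrete, here is a small sketch of how a delay could be derived from reconnectStrategy. The reconnectDelayMs helper is hypothetical, and the doubling-per-attempt rule is an assumption chosen because it matches the documented 1s → 16s range over 5 attempts; the source does not state the exact formula.

```typescript
type ReconnectStrategy =
  | { kind: 'fixed'; delayMs: number }
  | { kind: 'exponential'; baseMs: number; capMs: number };

// Hypothetical helper. `attempt` is zero-based; doubling per attempt is an
// assumption of this sketch, not a confirmed detail of mcp-pool-entry.ts.
function reconnectDelayMs(strategy: ReconnectStrategy, attempt: number): number {
  if (strategy.kind === 'fixed') return strategy.delayMs;
  return Math.min(strategy.baseMs * 2 ** attempt, strategy.capMs);
}

const local: ReconnectStrategy = { kind: 'fixed', delayMs: 5_000 };
const remote: ReconnectStrategy = { kind: 'exponential', baseMs: 1_000, capMs: 16_000 };

// stdio/ws: three attempts at a fixed 5s.
const localSchedule = [0, 1, 2].map((n) => reconnectDelayMs(local, n));
// http/sse: five attempts, 1s doubling up to the 16s cap.
const remoteSchedule = [0, 1, 2, 3, 4].map((n) => reconnectDelayMs(remote, n));
```

Under these assumptions the remote schedule is 1s, 2s, 4s, 8s, 16s, which gives a remote server about 31 seconds to recover before the entry gives up.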
Workflow
acquire
release + drain
hasNameSibling(name) (mcp-transport-pool.ts) iterates both entries.values() and spawnInFlight.keys(), parsing the latter with parseConnectionId. Server names can legitimately contain ::, so a startsWith check would false-positive on a sibling name beginning with ${name}::.
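The difference between prefix matching and a parsed comparison can be shown with a short sketch. The function bodies below are illustrative reconstructions, not copies of mcp-pool-key.ts; they assume the id format is name, then ::, then a fingerprint that itself contains no ::.

```typescript
type ConnectionId = string;

// Sketch: assumes ids are built as `${name}::${fingerprint}`.
function connectionIdOf(name: string, fingerprint: string): ConnectionId {
  return `${name}::${fingerprint}`;
}

// Split on the LAST '::' so a name that itself contains '::' survives intact.
function parseConnectionId(id: ConnectionId): { name: string; fingerprint: string } {
  const cut = id.lastIndexOf('::');
  if (cut < 0) throw new Error(`malformed connection id: ${id}`);
  return { name: id.slice(0, cut), fingerprint: id.slice(cut + 2) };
}

function hasNameSibling(name: string, ids: readonly ConnectionId[]): boolean {
  return ids.some((id) => parseConnectionId(id).name === name);
}

// A server literally named "acme::tools" is in flight; we ask about "acme".
const inFlight = [connectionIdOf('acme::tools', 'f00d')];
const prefixMatch = inFlight.some((id) => id.startsWith('acme::')); // false positive
const parsedMatch = hasNameSibling('acme', inFlight);
```

The prefix check reports a sibling that does not exist, which would keep a budget slot reserved for a name that no longer has any entry.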
releaseSession(sessionId) reads from sessionToEntries, releases all referenced entries in O(refs), then clears the index entry. Used by the bridge’s session-close path so it does not iterate the full entry map.
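A reverse index of this kind might look like the sketch below. SessionIndex, refsByEntry, and getOrCreate are hypothetical names for illustration; the real pool keeps refs on each PoolEntry and arms the drain timer rather than returning a list.

```typescript
type ConnectionId = string;

function getOrCreate<K, V>(map: Map<K, Set<V>>, key: K): Set<V> {
  let set = map.get(key);
  if (!set) {
    set = new Set<V>();
    map.set(key, set);
  }
  return set;
}

class SessionIndex {
  // Forward direction: which sessions hold each entry.
  private readonly refsByEntry = new Map<ConnectionId, Set<string>>();
  // Reverse direction: which entries each session holds.
  private readonly sessionToEntries = new Map<string, Set<ConnectionId>>();

  attach(id: ConnectionId, sessionId: string): void {
    getOrCreate(this.refsByEntry, id).add(sessionId);
    getOrCreate(this.sessionToEntries, sessionId).add(id);
  }

  // Touches only the entries this session holds; returns those left with no refs.
  releaseSession(sessionId: string): ConnectionId[] {
    const idle: ConnectionId[] = [];
    const held = this.sessionToEntries.get(sessionId);
    if (!held) return idle;
    held.forEach((id) => {
      const refs = this.refsByEntry.get(id);
      if (!refs) return;
      refs.delete(sessionId);
      if (refs.size === 0) idle.push(id); // the real pool would arm the drain timer here
    });
    this.sessionToEntries.delete(sessionId);
    return idle;
  }
}

const index = new SessionIndex();
index.attach('a::1', 's1');
index.attach('b::2', 's1');
index.attach('a::1', 's2');
const idleAfterS1 = index.releaseSession('s1');
const idleAfterRepeat = index.releaseSession('s1');
const idleAfterS2 = index.releaseSession('s2');
```

Closing s1 leaves a::1 alive because s2 still holds it, and a repeated release for the same session is a no-op because its index entry is already gone.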
restartByName
The preflight budget check at the daemon HTTP layer returns {restarted:false, skipped:true, reason:'budget_would_exceed'} (Wave 4 mutation control) when the target’s slot is not already reserved and a restart would push the live count over the budget in enforce mode.
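A caller that has to handle all three response shapes (single entry, multi-entry, and the skipped preflight) could narrow them as in this sketch. The type names RestartSkipped and RestartResponse and the summarizeRestart helper are assumptions made for illustration; only the field names come from the text above.

```typescript
// Field names follow the shapes described in this page; the type names and
// required/optional split are assumptions of this sketch.
type RestartResult = { restarted: boolean; durationMs: number };
type RestartSkipped = { restarted: false; skipped: true; reason: string };
type RestartResponse = RestartResult | RestartSkipped | { entries: RestartResult[] };

function summarizeRestart(res: RestartResponse): string {
  if ('entries' in res) {
    // Multi-entry contract: one result per pool entry sharing the name.
    const ok = res.entries.filter((e) => e.restarted).length;
    return `${ok}/${res.entries.length} entries restarted`;
  }
  if ('skipped' in res) {
    // Preflight refusal, e.g. reason 'budget_would_exceed'.
    return `skipped: ${res.reason}`;
  }
  return res.restarted ? `restarted in ${res.durationMs}ms` : 'not restarted';
}

const single = summarizeRestart({ restarted: true, durationMs: 120 });
const skipped = summarizeRestart({ restarted: false, skipped: true, reason: 'budget_would_exceed' });
const multi = summarizeRestart({
  entries: [
    { restarted: true, durationMs: 90 },
    { restarted: false, durationMs: 0 },
  ],
});
```

Checking for entries first keeps the multi-entry branch from being mistaken for a single result that merely lacks a restarted flag.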
drainAll
State & Lifecycle
- Pool construction is synchronous; the first acquire cold-starts a transport.
- The drain timer (drainDelayMs, default 30s) is cancelled on attach.
- maxIdleMs (default 5 min) is never reset by attach/detach churn: it starts ticking at the first idle and only stops when the entry actually closes or attaches before the deadline. This is the defense against thrashing clients.
- nextIndexByName is monotonic. Old entries keep their assigned index even after newer ones appear, so dashboards reading entryIndex do not reshuffle.
- Spawn failure releases the reserved budget slot (V21-4). Without this, a cold spawn that crashed mid-connect would leak the reservation forever.
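The relationship between the two idle timers can be expressed as a single deadline computation. This is a simplified model under stated assumptions: closeDeadline and IdleTimers are hypothetical names, timestamps are plain millisecond numbers, and the sketch only covers the thrashing case where the entry keeps returning to idle.

```typescript
interface IdleTimers {
  drainDelayMs: number;
  maxIdleMs: number;
}

// firstIdleAt: when refs first dropped to zero (the max-idle watermark).
// lastDetachAt: when refs most recently dropped to zero (re-arms the drain timer).
// The entry closes at whichever deadline comes first.
function closeDeadline(firstIdleAt: number, lastDetachAt: number, t: IdleTimers): number {
  const drainDeadline = lastDetachAt + t.drainDelayMs; // moves with every detach
  const hardCap = firstIdleAt + t.maxIdleMs; // fixed at first idle
  return Math.min(drainDeadline, hardCap);
}

const defaults: IdleTimers = { drainDelayMs: 30_000, maxIdleMs: 5 * 60_000 };

// Quiet client: went idle once at t=0 and never came back.
const quiet = closeDeadline(0, 0, defaults);
// Thrashing client: first idle at t=0, still attaching/detaching at t=290s.
const thrashing = closeDeadline(0, 290_000, defaults);
```

For the quiet client the drain timer wins at 30s. For the thrashing client the drain deadline would have slid to 320s, but the hard cap closes the entry at 300s.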
Dependencies
- packages/core/src/tools/mcp-client.ts: McpClient, status enum, SendSdkMcpMessage.
- packages/core/src/tools/mcp-pool-entry.ts: PoolEntry, PoolEntryOptions, defaultPoolEntryOptions.
- packages/core/src/tools/mcp-pool-key.ts: connectionIdOf, parseConnectionId, isPoolable, mcpTransportOf, POOLED_TRANSPORTS_DEFAULT.
- packages/core/src/tools/mcp-pool-events.ts: ConnectionId, PoolEntryState, PoolEvent.
- packages/core/src/tools/session-mcp-view.ts: per-session view that filters pool snapshots.
- packages/core/src/tools/mcp-workspace-budget.ts: WorkspaceMcpBudget (see 06-mcp-budget-guardrails.md).
- packages/core/src/tools/mcp-discovery-timeout.ts: discoveryTimeoutFor, runWithTimeout.
Configuration
| Source | Knob | Effect |
|---|---|---|
| Env | QWEN_SERVE_NO_MCP_POOL=1 | Kill switch — QwenAgent.mcpPool stays undefined; per-session McpClientManager enforces (pre-F2 path). |
| Flag | --mcp-client-budget=N, --mcp-budget-mode={off,warn,enforce} | Forwarded to ACP child via childEnvOverrides; child constructs WorkspaceMcpBudget and passes to pool. |
| Capability tags (conditional) | mcp_workspace_pool, mcp_pool_restart | Advertised together when pool is on. SDK pre-flights both to branch on pool-aware response shapes. |
Unpooled entries (HTTP / SSE / SDK-MCP)
Transports outside the configured pooledTransports allowlist (HTTP, SSE, and SDK-MCP by default) take a separate path: createUnpooledConnection(name, cfg, sessionId, ...) (mcp-transport-pool.ts) creates a per-session entry with id ${name}::unpooled-${entryIndex}. Differences from pooled entries:
- Stored in entries AND tracked in unpooledIds: Set<ConnectionId> so release/releaseSession can fast-path the close-on-detach behavior (refs always max out at 1).
- McpClient.discover() is used directly instead of pool replay; applyTools/applyPrompts are no-ops because the session’s registries already hold what was registered (W77 / skipReplay: true in attach()).
- Workspace budget still gates them: the F2 budget follow-up closed the prior loophole where unpooled connections bypassed tryReserve; the same WorkspaceMcpBudget slot is reserved and released on entry close (whether pooled or unpooled).
The W77 race (cb206da36): createUnpooledConnection stores the entry in this.entries BEFORE awaiting client.connect() / client.discover(), but only indexes sessionToEntries[sessionId] AFTER attach() succeeds. A concurrent closeStoredSession() / releaseSession(sessionId) during the connect/discover window saw an empty index, let the unpooled spawn finish, and attach() then registered tools/prompts into an already-closed session. The fix:
- mcp-pool-entry.ts: public isTerminated(): boolean probe (state === 'closed' || state === 'failed').
- mcp-pool-entry.ts: markActive() short-circuits if isTerminated() so a torn-down entry cannot be resurrected to 'active'.
- Callers (the pool’s unpooled path) probe isTerminated() between the awaits and abort the attach if the parent session went away.
This race was latent at the time (the W61/W71 per-session releaseSession hooks land in F4), but would become live the moment that hook arrived. The fix was applied early in the F2 series.
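The guard described above can be sketched in a few lines. EntrySketch and connectThenAttach are hypothetical stand-ins, not the real PoolEntry or createUnpooledConnection; the steps array stands in for the client.connect() and client.discover() awaits, and the boolean return of markActive is an assumption added so the example can be observed.

```typescript
type PoolEntryState = 'spawning' | 'active' | 'draining' | 'closed' | 'failed';

class EntrySketch {
  state: PoolEntryState = 'spawning';

  isTerminated(): boolean {
    return this.state === 'closed' || this.state === 'failed';
  }

  // Short-circuits on a torn-down entry so it cannot be resurrected.
  markActive(): boolean {
    if (this.isTerminated()) return false;
    this.state = 'active';
    return true;
  }

  close(): void {
    this.state = 'closed';
  }
}

// Hypothetical caller: probe after every await, abort if the session went away.
async function connectThenAttach(
  entry: EntrySketch,
  steps: Array<() => Promise<void>>,
  attach: () => void,
): Promise<boolean> {
  for (const step of steps) {
    await step();
    if (entry.isTerminated()) return false; // closed during the connect/discover window
  }
  if (!entry.markActive()) return false;
  attach();
  return true;
}

// A session close that lands before activation wins.
const torn = new EntrySketch();
torn.close();
const resurrected = torn.markActive();

const healthy = new EntrySketch();
const activated = healthy.markActive();
```

The probe between awaits is what stops attach() from registering tools into a closed session; the markActive() short-circuit is the backstop if a caller forgets the probe.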
GET /workspace/mcp pool-aware snapshot fields
When the pool is active, each ServeWorkspaceMcpStatus server cell
(packages/acp-bridge/src/status.ts) includes three additional fields:
| Field | Type | Purpose |
|---|---|---|
| disabledReason | 'config' \| 'budget' | Distinguishes operator-disabled servers (disabled: true from disabledMcpServers) from budget refusal (status: 'error', errorKind: 'budget_exhausted'). Dashboards can render one server row without cross-reading errors[] or budgets[]. |
| entryCount | number (>=1) | In pool mode a workspace can have multiple PoolEntry instances with the same name when sessions inject different fingerprints such as per-session OAuth headers. This field is absent when QWEN_SERVE_NO_MCP_POOL=1 disables the pool. New clients render an “N entries” badge when entryCount > 1. |
| entrySummary | ReadonlyArray<{entryIndex, refs, status}> | Per-entry breakdown. entryIndex is the stable opaque integer assigned when the entry was created; it is not the raw fingerprint, so snapshot diffs do not leak OAuth or env rotation timing. refs is the current attached-session count. status lets dashboards show per-entry health while aggregate mcpStatus is already connected. |
(entryCount, entrySummary) are always broadcast as a pair. The
mcp_workspace_pool capability tag implies both fields. Older SDK clients
ignore them under the additive protocol contract.
Pool snapshots also expose subprocessCount. It counts only the 'stdio'
family. WebSocket, HTTP, and SSE transports connect to remote servers and do
not spawn local child processes. Early versions counted WebSocket transports as
local subprocesses, which inflated resource dashboards.
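As a small illustration of the counting rule, the sketch below tallies only stdio entries. The subprocessCount function and the 'http' and 'sse' kind strings are assumptions for this example; the source only confirms the 'stdio' family name and the existence of WebSocket, HTTP, and SSE transports.

```typescript
// Kind strings other than 'stdio' are assumed spellings for this sketch.
type McpTransportKind = 'stdio' | 'websocket' | 'http' | 'sse';

// Only stdio transports spawn a local child process.
function subprocessCount(kinds: readonly McpTransportKind[]): number {
  return kinds.filter((kind) => kind === 'stdio').length;
}

const liveKinds: McpTransportKind[] = ['stdio', 'websocket', 'stdio', 'http', 'sse'];
const localChildren = subprocessCount(liveKinds);
```

Five live entries yield a subprocess count of two, since the WebSocket, HTTP, and SSE entries point at remote servers.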
Drain runs from both shutdown paths
Pool drain is not limited to the SIGTERM handler. The normal IDE shutdown path
(await connection.closed) also calls drainAll via
packages/cli/src/acp-integration/acpAgent.ts’s drainPoolBeforeExit. Whether
the daemon receives a process signal or the IDE closes its connection cleanly,
the pool enters draining, refuses new acquires, and waits for entries to
close.
/mcp refresh shares the boot discovery path
discoverAllMcpTools (boot discovery) and
discoverAllMcpToolsIncremental (/mcp refresh / hot reload) both consult the
pool first in pool mode (packages/core/src/tools/mcp-client-manager.ts). The
shared gate prevents hot reload from accidentally creating a per-session
client, double-counting budget, or leaving an orphan transport behind.
In-flight tool calls during reconnect (MCPCallInterruptedError)
When the underlying MCP transport silently disconnects (the connection jumps
from 'active' / 'draining' to localStatus === DISCONNECTED without an
explicit close), the pool marks the entry 'failed', evicts it from
pool.entries, and emits the failed event before detaching subscriber views.
That emit-before-detach order matters: subscribers receive the failed event
soon enough to route pending callTool promises to
MCPCallInterruptedError, so a stuck await client.callTool(...) rejects
cleanly instead of hanging. forceShutdown uses the same emit-then-detach
ordering.
Fingerprint and canonicalOAuth normalization
The pool key comes from fingerprint(cfg) in mcp-pool-key.ts. The hash covers
all transport-defining fields:
transport, command, args, cwd, env, url, httpUrl, tcp, headers, timeout, oauth
Per-session filtering and metadata fields (includeTools, excludeTools,
trust, description, extensionName, discoveryTimeoutMs) are excluded, so
sessions with different filters can share one entry.
For the OAuth cell, canonicalOAuth(o) hashes every MCPOAuthConfig field:
clientId, clientSecret, sorted scopes, sorted audiences,
authorizationUrl, tokenUrl, redirectUri, tokenParamName, and
registrationUrl. This is the credential-isolation contract: two session
configs that differ only by clientSecret, audiences, or redirectUri get
different fingerprints and cannot share one entry. Confidential clients and
multi-audience token deployments depend on this.
Sorting scopes and audiences makes callsite order irrelevant. Explicit
null is normalized so undefined fields hash the same as explicit null. The
key does not include discoveryTimeoutMs; concurrent acquire calls with the
same key but different timeouts are “first wins”, matching the pre-F2
per-session manager behavior.
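The normalization rules can be demonstrated with a reduced sketch. It covers only a subset of the MCPOAuthConfig fields, uses the hypothetical names OAuthSketch and canonicalOAuthSketch, and returns a canonical string rather than a hash; the real fingerprint(cfg) hashes the result together with the other transport fields.

```typescript
// Subset of fields for illustration; the real config has more.
interface OAuthSketch {
  clientId?: string | null;
  clientSecret?: string | null;
  scopes?: string[] | null;
  audiences?: string[] | null;
  redirectUri?: string | null;
}

function canonicalOAuthSketch(o: OAuthSketch): string {
  // Copy before sorting so the caller's array is not mutated.
  const sorted = (xs?: string[] | null): string[] | null => (xs ? [...xs].sort() : null);
  // Fixed key order plus `?? null` makes undefined and explicit null serialize alike.
  return JSON.stringify({
    clientId: o.clientId ?? null,
    clientSecret: o.clientSecret ?? null,
    scopes: sorted(o.scopes),
    audiences: sorted(o.audiences),
    redirectUri: o.redirectUri ?? null,
  });
}

const a = canonicalOAuthSketch({ clientId: 'app', scopes: ['write', 'read'] });
const b = canonicalOAuthSketch({ clientId: 'app', scopes: ['read', 'write'], redirectUri: null });
const c = canonicalOAuthSketch({ clientId: 'app', scopes: ['read', 'write'], clientSecret: 's3cret' });
```

Here a and b canonicalize to the same string despite different scope order and an explicit null, while c differs because clientSecret is part of the credential-isolation contract.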
PoolEntry keeps cfg: MCPServerConfig private. External code must use the
entry.transportKind getter when it needs the transport family. That prevents
env, header auth, and OAuth fields from leaking to consumers by accident.
Extension unloads rely on MAX_IDLE_MS
There is intentionally no active cleanup path for unloading an MCP extension at
runtime. Orphan entries whose MCPServerConfig no longer appears in the merged
workspace settings are reclaimed naturally by the MAX_IDLE_MS hard cap after
the last subscriber detaches. A synchronous unload-cleanup path would add
complexity for a rare operator edge case; the hard cap limits orphan process
lifetime past the unload point to 5 minutes by default.
Operators who need faster cleanup can restart the daemon or call
POST /workspace/mcp/:server/restart for the now-unconfigured name, which goes
through the disabled-server path and tears the entry down.
Self-heal observability
The pool emits two structured diagnostics on the self-heal path:
- McpClient.lastTransportError: Error | undefined (packages/core/src/tools/mcp-client.ts): McpClient.onerror stores the most recent transport exception in a private field and clears it at connect() entry. The PoolEntry silent-drop path reads client.getLastTransportError() and includes it in emit({kind:'failed', lastError}), so subscribers and dashboards do not have to grep stderr for root cause.
- SweepResult (internal interface, not exported; packages/core/src/tools/mcp-pool-entry.ts): sweepAndDisconnect(reason) returns Promise<SweepResult>:
interface SweepResult {
pidSweepError?: Error; // listDescendantPids itself threw
descendantsFound?: number; // descendant pid count found
descendantsSignaled?: number; // successfully SIGTERM'd count
}

The only consumer is the silent-drop block in statusChangeListener. It uses
descendantsFound / descendantsSignaled to detect partial-signal cases
(fewer processes signaled than found, usually because a process exited or EPERM
occurred between listDescendantPids and sigtermPids) and sweep errors, then
logs a structured warning. forceShutdown and doRestart ignore this return
value because their catch paths already carry richer failure signals.
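The partial-signal check might be written along these lines. The sweepWarning helper and its message strings are hypothetical; only the SweepResult field names come from the interface above.

```typescript
interface SweepResult {
  pidSweepError?: Error;
  descendantsFound?: number;
  descendantsSignaled?: number;
}

// Hypothetical helper: returns a warning message, or undefined for a clean sweep.
function sweepWarning(result: SweepResult): string | undefined {
  if (result.pidSweepError) {
    return `pid sweep failed: ${result.pidSweepError.message}`;
  }
  const found = result.descendantsFound ?? 0;
  const signaled = result.descendantsSignaled ?? 0;
  if (signaled < found) {
    // A process exited or returned EPERM between enumeration and SIGTERM.
    return `partial signal: ${signaled}/${found} descendants signaled`;
  }
  return undefined;
}

const clean = sweepWarning({ descendantsFound: 3, descendantsSignaled: 3 });
const partial = sweepWarning({ descendantsFound: 3, descendantsSignaled: 2 });
const errored = sweepWarning({ pidSweepError: new Error('ps not found') });
```

A sweep error takes precedence because the counts are not meaningful when enumeration itself failed.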
Subprocess cleanup: the pid-descendants snapshot path
When McpTransportPool shuts down stdio subprocesses, it has to enumerate their
descendant processes; npx wrappers and shell wrappers can create multiple fork
levels. packages/core/src/tools/pid-descendants.ts exposes
listDescendantPids(rootPid) → Promise<number[]> and sigtermPids(pids) for
sweepAndDisconnect.
Linux / macOS primary path
A single ps -A -o pid=,ppid= snapshot reads the process table, parses it into
Map<ppid, pid[]>, then walkDescendants(tree, root) performs BFS to extract
the subtree. Any depth requires only one ps fork.
walkDescendants maintains visited: Set<number> and includes root in the
set to defend against PID-reuse cycles. Under fast process churn, the snapshot
can theoretically contain A→B / B→A loops. Without visited, the walker could
fill the MAX_DESCENDANTS quota with bogus data and crowd out real descendants.
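The snapshot-then-walk approach can be sketched as below. This is an illustrative reconstruction rather than the contents of pid-descendants.ts: parsePsSnapshot is a hypothetical name, the bounds reuse the documented MAX_DESCENDANTS = 256 and MAX_DEPTH = 8, and counting depth in BFS levels is an assumption.

```typescript
const MAX_DESCENDANTS = 256;
const MAX_DEPTH = 8;

// Parse `ps -A -o pid=,ppid=` output (one "pid ppid" pair per line) into ppid -> pids.
function parsePsSnapshot(stdout: string): Map<number, number[]> {
  const tree = new Map<number, number[]>();
  for (const line of stdout.split('\n')) {
    const m = /^\s*(\d+)\s+(\d+)\s*$/.exec(line);
    if (!m) continue; // skip blank or malformed rows
    const pid = Number(m[1]);
    const ppid = Number(m[2]);
    const kids = tree.get(ppid);
    if (kids) kids.push(pid);
    else tree.set(ppid, [pid]);
  }
  return tree;
}

// Level-by-level BFS; `visited` includes root so a reuse cycle cannot loop back.
function walkDescendants(tree: Map<number, number[]>, root: number): number[] {
  const visited = new Set<number>([root]);
  const out: number[] = [];
  let frontier: number[] = [root];
  for (let depth = 0; depth < MAX_DEPTH && frontier.length > 0; depth++) {
    const next: number[] = [];
    for (const parent of frontier) {
      for (const child of tree.get(parent) ?? []) {
        if (visited.has(child)) continue;
        visited.add(child);
        out.push(child);
        if (out.length >= MAX_DESCENDANTS) return out;
        next.push(child);
      }
    }
    frontier = next;
  }
  return out;
}

// Snapshot with a bogus 10 <-> 20 loop, as could appear under fast PID reuse.
const snapshot = '   10    20\n   20    10\n   30    10\n';
const descendants = walkDescendants(parsePsSnapshot(snapshot), 10);
```

With the visited set in place the walk returns 20 and 30 once each and terminates; without it, the 10 ↔ 20 loop would keep re-adding the same pids until a bound was hit.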
Windows primary path
A single Get-CimInstance Win32_Process | ConvertTo-Csv -Delimiter ","
snapshot emits all (ProcessId, ParentProcessId) rows, then the same Map and
walkDescendants path runs.
The explicit -Delimiter "," is required. PowerShell 5.1, which ships with
Windows, defaults ConvertTo-Csv to the system locale list separator; DE, FR,
NL, IT, and similar locales use ;, so the pre-fix parser
^"(\d+)","(\d+)"$ never matched and every daemon shutdown fell back to the
per-pid CIM filter path, adding roughly 0.5-1s of PowerShell startup cost per
child.
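The locale problem is easy to reproduce against the row pattern quoted above. In this sketch parseCimCsv is a hypothetical name and the sample outputs are hand-written, assuming the (ProcessId, ParentProcessId) column order described in the text.

```typescript
// The row pattern quoted in the text: two quoted integers separated by a comma.
const CSV_ROW = /^"(\d+)","(\d+)"$/;

// Returns [pid, ppid] pairs; header and type-info lines simply fail to match.
function parseCimCsv(stdout: string): Array<[number, number]> {
  const rows: Array<[number, number]> = [];
  for (const raw of stdout.split(/\r?\n/)) {
    const m = CSV_ROW.exec(raw.trim());
    if (m) rows.push([Number(m[1]), Number(m[2])]);
  }
  return rows;
}

// Hand-written samples: what an explicit comma delimiter vs. a ';' locale would emit.
const commaOutput = '"ProcessId","ParentProcessId"\r\n"4312","880"\r\n"4400","4312"\r\n';
const semicolonOutput = '"ProcessId";"ParentProcessId"\r\n"4312";"880"\r\n"4400";"4312"\r\n';

const commaRows = parseCimCsv(commaOutput);
const semicolonRows = parseCimCsv(semicolonOutput);
```

The semicolon sample parses to zero rows, which is exactly the condition that triggers the slower per-pid fallback.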
Fallback path
BusyBox <v1.28 lacks ps -o, distroless containers might not include ps,
and some Windows environments truncate CIM output via ACLs. When the primary
path parses zero rows or throws, the code falls back to per-pid BFS: Linux /
macOS use pgrep -P <pid>, and Windows uses
Get-CimInstance -Filter "ParentProcessId=$p" where $p is a PowerShell
variable binding rather than string concatenation. The current
Number.isInteger guard is sufficient for the entry point; the binding is
defense-in-depth.
Shared constraints
Both paths are bounded by MAX_DESCENDANTS = 256 and MAX_DEPTH = 8 to keep a
malicious or degenerate process tree from dragging down sweep.
The snapshot path uses maxBuffer: 8MB, enough for pathological hosts with
about 250k processes. Node’s default 1MB buffer can truncate child-process
output around 30k processes.
The performance gain is intentionally modest (typical 200-500 process dev
machines parse in under 10ms, around 2x faster than per-pid pgrep). The main
benefit is fork hygiene and snapshot consistency: BFS sees the full subtree at
once, while the previous per-pid query path could miss a grandchild forked
between two queries.
Embedder note: McpClientManager constructor
McpClientManager is constructed as
(config, toolRegistry, options?: McpClientManagerOptions). Embedders that
import the class directly should pass:
new McpClientManager(config, toolRegistry, {
eventEmitter,
sendSdkMcpMessage,
healthConfig,
budgetConfig,
pool,
});

Tests should prefer an mkManager(overrides?) factory so cases that care about
one or two fields stay one line.
Implementation notes
These helpers are internal, but source readers may see them:
- McpTransportPool.acquire() uses attachPooledSession and rollbackReservationOnSpawnFailure to share fast-path attach, post-spawn attach, and pooled spawn-in-flight catch behavior. Runtime behavior is unchanged; race-window invariants still live at the call sites.
- SessionMcpView.applyTools / applyPrompts compile includeTools / excludeTools once via compileNameFilter(cfg) and check each tool with compiledFilterAccepts(compiled, name). Exported passesSessionFilter / passesSessionPromptFilter use the same compiled path. excludeTools is exact-match; includeTools strips the first (...) suffix so toolName(args) matches toolName.
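The compile-once filter can be illustrated with the sketch below. The function names match those mentioned above, but the bodies are guesses: cutting at the first opening parenthesis and letting excludeTools win over includeTools are assumptions of this example, not confirmed behavior of session-mcp-view.ts.

```typescript
interface NameFilterCfg {
  includeTools?: string[];
  excludeTools?: string[];
}

interface CompiledNameFilter {
  include?: Set<string>; // undefined means "no allowlist"
  exclude: Set<string>;
}

// Compile once per cfg so each tool check is a pair of Set lookups.
function compileNameFilter(cfg: NameFilterCfg): CompiledNameFilter {
  // Assumption: "strips the first (...) suffix" is modeled as cutting at the first '('.
  const stripArgs = (entry: string): string => {
    const open = entry.indexOf('(');
    return open < 0 ? entry : entry.slice(0, open);
  };
  return {
    include: cfg.includeTools ? new Set(cfg.includeTools.map(stripArgs)) : undefined,
    exclude: new Set(cfg.excludeTools ?? []), // exact match, no stripping
  };
}

function compiledFilterAccepts(compiled: CompiledNameFilter, name: string): boolean {
  if (compiled.exclude.has(name)) return false; // assumption: exclude wins
  if (!compiled.include) return true;
  return compiled.include.has(name);
}

const filter = compileNameFilter({
  includeTools: ['search(query)', 'fetch'],
  excludeTools: ['fetch'],
});
const acceptsSearch = compiledFilterAccepts(filter, 'search');
const acceptsFetch = compiledFilterAccepts(filter, 'fetch');
const acceptsOther = compiledFilterAccepts(filter, 'other');
```

Because filters are excluded from the fingerprint, two sessions with different includeTools lists can share one pool entry and still see different tool sets.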
Design document: ../../design/f2-mcp-transport-pool.md §6 covers the transport pool state machine, reconnect, drain, and descendant sweep paths.
Caveats & Known Limits
- HTTP / SSE transports are unpooled by default: unless operators explicitly include them in QWEN_SERVE_MCP_POOL_TRANSPORTS, each acquire mints a fresh entry that lives only as long as its session. Their headers may carry session-specific OAuth state, so pooling them by default would risk leaking credentials across sessions.
- maxIdleMs is a hard cap that survives attach/detach churn. A 5-minute idle hard cap means even an aggressively attaching/detaching client cannot keep an idle transport pinned past 5 minutes. Operators who want pinned long-lived transports should increase maxIdleMs or run the server outside the pool.
- Per-server-name budget slots mean two pool entries that share a name but differ by fingerprint consume ONE slot together, not two. Subprocess accounting is exposed separately via pool.getSnapshot().subprocessCount.
- A startsWith regression was avoided in hasNameSibling because MCP server names can legitimately contain :: (mcp-pool-key.test.ts). Always use parseConnectionId’s lastIndexOf('::') split, never string-prefix matching.
- Pool draining is one-way: drainAll sets draining = true permanently; a fresh pool is required for further work.
References
- packages/core/src/tools/mcp-transport-pool.ts (entire file)
- packages/core/src/tools/mcp-pool-entry.ts (entry lifecycle)
- packages/core/src/tools/mcp-pool-key.ts (connectionIdOf, parseConnectionId)
- packages/core/src/tools/mcp-pool-events.ts (event types)
- packages/core/src/tools/session-mcp-view.ts (per-session filtered view)
- F2 design document (v2.2, with the 32-item review fold-in changelog): ../../design/f2-mcp-transport-pool.md. Treat the design contract as authoritative; this page is the developer deep dive.
- F2 design notes: issue #4175 (commits 4-6 of the F2 series).