Adaptive Output Token Escalation Design
Reduces GPU slot over-reservation by ~4x through a “low default + escalate on truncation” strategy for output tokens.
Problem
Every API request reserves a fixed GPU slot proportional to max_tokens. The previous default of 32K tokens means each request reserves a 32K output slot, but 99% of responses are under 5K tokens. This over-reserves GPU capacity by 4-6x, limiting server concurrency and increasing cost.
Solution
Use a capped default of 8K output tokens. When a response is truncated (the model hits max_tokens), automatically retry once with an escalated limit of 64K. Since <1% of requests are actually truncated, this reduces average slot reservation significantly while preserving output quality for long responses.
Architecture
┌─────────────────────────┐
│ Request starts │
│ max_tokens = 8K │
└───────────┬─────────────┘
│
▼
┌─────────────────────────┐
│ Stream response │
└───────────┬─────────────┘
│
┌─────────┴─────────┐
│ │
finish_reason finish_reason
!= MAX_TOKENS == MAX_TOKENS
│ │
▼ ▼
┌───────────┐ ┌─────────────────────┐
│ Done │ │ Check conditions: │
└───────────┘ │ - No user override? │
│ - No env override? │
│ - Not already │
│ escalated? │
└─────────┬───────────┘
YES │ NO
┌─────────┴────┐
│ │
▼ ▼
┌─────────────┐ ┌──────────┐
│ Pop partial │ │ Done │
│ model resp │ │ (truncd) │
│ from history│ └──────────┘
│ │
│ Yield RETRY │
│ event │
│ │
│ Re-send │
│ max_tokens │
│ = 64K │
└─────────────┘

Token limit determination
The effective max_tokens is resolved in the following priority order:
| Priority | Source | Value (known model) | Value (unknown model) | Escalation behavior |
|---|---|---|---|---|
| 1 (highest) | User config (`samplingParams.max_tokens`) | min(userValue, modelLimit) | userValue | No escalation |
| 2 | Environment variable (`QWEN_CODE_MAX_OUTPUT_TOKENS`) | min(envValue, modelLimit) | envValue | No escalation |
| 3 (lowest) | Capped default | min(modelLimit, 8K) | min(32K, 8K) = 8K | Escalates to 64K on truncation |
A “known model” is one that has an explicit entry in OUTPUT_PATTERNS (checked via hasExplicitOutputLimit()). For known models, the effective value is always capped at the model’s declared output limit to avoid API errors. Unknown models (custom deployments, self-hosted endpoints) pass the user’s value through directly, since the backend may support larger limits.
This logic is implemented in three content generators:
- `DefaultOpenAICompatibleProvider.applyOutputTokenLimit()` — OpenAI-compatible providers
- `DashScopeProvider` — inherits `applyOutputTokenLimit()` from the default provider
- `AnthropicContentGenerator.buildSamplingParameters()` — Anthropic provider
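The priority order above can be sketched as a single resolution function. This is an illustrative sketch, not the actual provider code: `resolveMaxTokens` and its input shape are hypothetical names, and the 32K fallback for unknown models follows the table above.

```typescript
// Hypothetical sketch of the resolution order described above.
const CAPPED_DEFAULT_MAX_TOKENS = 8_000;

interface ResolveInput {
  userMaxTokens?: number; // samplingParams.max_tokens
  envMaxTokens?: number;  // QWEN_CODE_MAX_OUTPUT_TOKENS
  modelLimit?: number;    // defined only for "known" models (OUTPUT_PATTERNS)
}

interface Resolved {
  maxTokens: number;
  escalatable: boolean;   // true only when the capped default was used
}

function resolveMaxTokens(input: ResolveInput): Resolved {
  const { userMaxTokens, envMaxTokens, modelLimit } = input;

  // Priority 1: explicit user config; capped for known models, never escalated.
  if (userMaxTokens !== undefined) {
    return {
      maxTokens: modelLimit !== undefined ? Math.min(userMaxTokens, modelLimit) : userMaxTokens,
      escalatable: false,
    };
  }

  // Priority 2: environment override; same capping rule, never escalated.
  if (envMaxTokens !== undefined) {
    return {
      maxTokens: modelLimit !== undefined ? Math.min(envMaxTokens, modelLimit) : envMaxTokens,
      escalatable: false,
    };
  }

  // Priority 3: capped default; unknown models are treated as 32K for capping.
  const limit = modelLimit ?? 32_000;
  return { maxTokens: Math.min(limit, CAPPED_DEFAULT_MAX_TOKENS), escalatable: true };
}
```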
Escalation mechanism
The escalation logic lives in geminiChat.ts, placed outside the main retry loop. This is intentional:
- The retry loop handles transient errors (rate limits, invalid streams, content validation)
- Truncation is not an error — it’s a successful response that was cut short
- Errors from the escalated stream should propagate directly to the caller, not be caught by retry logic
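A condensed sketch of this placement, with streaming and UI events omitted. Everything here is illustrative: `sendWithEscalation` and `StreamResult` are stand-in names, and the inner transient-error retry loop is represented only by a comment.

```typescript
// Illustrative sketch: escalation sits outside the transient-error retry
// loop, because truncation is a successful response, not an error.
type FinishReason = "STOP" | "MAX_TOKENS";

interface StreamResult {
  finishReason: FinishReason;
  text: string;
}

const ESCALATED_MAX_TOKENS = 64_000;

async function sendWithEscalation(
  // In the real code, `send` would itself be wrapped by the retry loop
  // that handles rate limits, invalid streams, etc.
  send: (maxTokens: number) => Promise<StreamResult>,
  maxTokens: number,
  hasUserMaxTokensOverride: boolean,
): Promise<StreamResult> {
  const first = await send(maxTokens);

  if (first.finishReason !== "MAX_TOKENS") return first; // not truncated
  if (hasUserMaxTokensOverride) return first;            // respect user intent

  // Single escalation: re-send once with the larger limit. Any error from
  // this second stream propagates directly to the caller instead of being
  // caught by the retry loop.
  return send(ESCALATED_MAX_TOKENS);
}
```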
Escalation steps (geminiChat.ts)
1. Stream completes successfully (lastError === null)
2. Last chunk has finishReason === MAX_TOKENS
3. Guard checks pass:
- maxTokensEscalated === false (prevent infinite escalation)
- hasUserMaxTokensOverride === false (respect user intent)
4. Pop the partial model response from chat history
5. Yield RETRY event → UI discards partial output
6. Re-send the same request with `maxOutputTokens: 64K`

State cleanup on RETRY (turn.ts)
When the Turn class receives a RETRY event, it clears accumulated state to prevent inconsistencies:
- `pendingToolCalls` — cleared to avoid duplicate tool calls if the first truncated response contained completed tool calls that are repeated in the escalated response
- `pendingCitations` — cleared to avoid duplicate citations
- `debugResponses` — cleared to avoid stale debug data
- `finishReason` — reset to `undefined` so the new response's finish reason is used
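The reset above might look roughly like this. The field names match the list; the class shape, field types, and the `handleRetryEvent` method name are assumptions for illustration.

```typescript
// Hypothetical sketch of the Turn state reset on a RETRY event.
class Turn {
  pendingToolCalls: unknown[] = [];
  pendingCitations = new Set<string>();
  debugResponses: unknown[] = [];
  finishReason: string | undefined;

  handleRetryEvent(): void {
    // Discard everything accumulated from the truncated response so the
    // escalated response starts from a clean slate.
    this.pendingToolCalls = [];
    this.pendingCitations.clear();
    this.debugResponses = [];
    this.finishReason = undefined;
  }
}
```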
Constants
Defined in tokenLimits.ts:
| Constant | Value | Purpose |
|---|---|---|
| `CAPPED_DEFAULT_MAX_TOKENS` | 8,000 | Default output token limit when no user override is set |
| `ESCALATED_MAX_TOKENS` | 64,000 | Output token limit used on truncation retry |
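In tokenLimits.ts these might be declared roughly as follows (a sketch; the actual file may export them differently):

```typescript
// Sketch of the constants from the table above.
const CAPPED_DEFAULT_MAX_TOKENS = 8_000;  // default output slot size
const ESCALATED_MAX_TOKENS = 64_000;      // limit used on truncation retry
```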
Design decisions
Why 8K default?
- 99% of responses are under 5K tokens
- 8K provides reasonable headroom for slightly longer responses without triggering unnecessary retries
- Reduces average slot reservation from 32K to 8K (4x improvement)
Why 64K escalated limit?
- Covers the vast majority of long outputs that were truncated at 8K
- Matches the output limit of many modern models (Claude Sonnet, Gemini 3.x, Qwen3.x)
- Higher values (e.g., 128K) would negate slot optimization benefits for the <1% of requests that escalate
Why not progressive escalation (8K → 16K → 32K → 64K)?
- Each retry adds latency (the full response must be regenerated)
- A single retry is the simplest approach that captures almost all cases
- The <1% truncation rate at 8K means almost no requests need escalation; those that do are likely to need significantly more than 16K
Why is escalation outside the retry loop?
- Truncation is a success case, not an error
- Errors from the escalated stream (rate limits, network failures) should propagate directly rather than being silently retried with incorrect parameters
- Keeps the retry loop focused on its original purpose (transient error recovery)