
Adaptive Output Token Escalation Design

Reduces GPU slot over-reservation by ~4x through a “low default + escalate on truncation” strategy for output tokens.

Problem

Every API request reserves a fixed GPU slot proportional to max_tokens. The previous default of 32K tokens meant each request reserved a 32K output slot, even though 99% of responses are under 5K tokens. This over-reserves GPU capacity by 4-6x, limiting server concurrency and increasing cost.

Solution

Use a capped default of 8K output tokens. When a response is truncated (the model hits max_tokens), automatically retry once with an escalated limit of 64K. Since <1% of requests are actually truncated, this reduces average slot reservation significantly while preserving output quality for long responses.

Architecture

┌─────────────────────────┐
│     Request starts      │
│     max_tokens = 8K     │
└───────────┬─────────────┘
            │
┌───────────▼─────────────┐
│     Stream response     │
└───────────┬─────────────┘
            │
  ┌─────────┴─────────┐
  │                   │
finish_reason       finish_reason
!= MAX_TOKENS       == MAX_TOKENS
  │                   │
  ▼                   ▼
┌───────────┐  ┌─────────────────────┐
│   Done    │  │ Check conditions:   │
└───────────┘  │ - No user override? │
               │ - No env override?  │
               │ - Not already       │
               │   escalated?        │
               └─────────┬───────────┘
                  YES    │    NO
            ┌────────────┴───┐
            │                │
            ▼                ▼
     ┌─────────────┐   ┌──────────┐
     │ Pop partial │   │  Done    │
     │ model resp  │   │ (truncd) │
     │ from history│   └──────────┘
     │             │
     │ Yield RETRY │
     │ event       │
     │             │
     │ Re-send     │
     │ max_tokens  │
     │ = 64K       │
     └─────────────┘

Token limit determination

The effective max_tokens is resolved in the following priority order:

| Priority | Source | Value (known model) | Value (unknown model) | Escalation behavior |
|---|---|---|---|---|
| 1 (highest) | User config (samplingParams.max_tokens) | min(userValue, modelLimit) | userValue | No escalation |
| 2 | Environment variable (QWEN_CODE_MAX_OUTPUT_TOKENS) | min(envValue, modelLimit) | envValue | No escalation |
| 3 (lowest) | Capped default | min(modelLimit, 8K) | min(32K, 8K) = 8K | Escalates to 64K on truncation |

A “known model” is one that has an explicit entry in OUTPUT_PATTERNS (checked via hasExplicitOutputLimit()). For known models, the effective value is always capped at the model’s declared output limit to avoid API errors. Unknown models (custom deployments, self-hosted endpoints) pass the user’s value through directly, since the backend may support larger limits.

This logic is implemented in three content generators:

  • DefaultOpenAICompatibleProvider.applyOutputTokenLimit() — OpenAI-compatible providers
  • DashScopeProvider — inherits applyOutputTokenLimit() from the default provider
  • AnthropicContentGenerator.buildSamplingParameters() — Anthropic provider
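The priority order above can be sketched as a small resolution function. This is an illustrative sketch, not the real provider code: the input/output shapes and the name resolveMaxTokens are assumptions; only CAPPED_DEFAULT_MAX_TOKENS comes from the document.

```typescript
const CAPPED_DEFAULT_MAX_TOKENS = 8_000;

interface ResolveInput {
  userMaxTokens?: number; // samplingParams.max_tokens
  envMaxTokens?: number;  // QWEN_CODE_MAX_OUTPUT_TOKENS
  modelLimit?: number;    // declared output limit; undefined for unknown models
}

interface Resolved {
  maxTokens: number;
  canEscalate: boolean; // only the capped default may escalate on truncation
}

function resolveMaxTokens(input: ResolveInput): Resolved {
  // Known models are capped at their declared limit; unknown models
  // pass the configured value through unchanged.
  const cap = (v: number): number =>
    input.modelLimit !== undefined ? Math.min(v, input.modelLimit) : v;

  if (input.userMaxTokens !== undefined) {
    return { maxTokens: cap(input.userMaxTokens), canEscalate: false };
  }
  if (input.envMaxTokens !== undefined) {
    return { maxTokens: cap(input.envMaxTokens), canEscalate: false };
  }
  // Capped default: min(modelLimit, 8K); unknown models fall back to 8K.
  return { maxTokens: cap(CAPPED_DEFAULT_MAX_TOKENS), canEscalate: true };
}
```

Note that canEscalate is false for both user and environment overrides: an explicit limit is treated as user intent and never silently raised.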

Escalation mechanism

The escalation logic lives in geminiChat.ts, placed outside the main retry loop. This is intentional:

  1. The retry loop handles transient errors (rate limits, invalid streams, content validation)
  2. Truncation is not an error — it’s a successful response that was cut short
  3. Errors from the escalated stream should propagate directly to the caller, not be caught by retry logic

Escalation steps (geminiChat.ts)

1. Stream completes successfully (lastError === null)
2. Last chunk has finishReason === MAX_TOKENS
3. Guard checks pass:
   - maxTokensEscalated === false (prevent infinite escalation)
   - hasUserMaxTokensOverride === false (respect user intent)
4. Pop the partial model response from chat history
5. Yield RETRY event → UI discards partial output
6. Re-send the same request with maxOutputTokens: 64K
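The steps above can be condensed into a sketch of the post-stream check. This is a simplified model, not the geminiChat.ts implementation: the StreamResult shape, the send callback, and runWithEscalation are assumptions; the guard flags and the 8K/64K limits come from the document.

```typescript
const ESCALATED_MAX_TOKENS = 64_000;

interface StreamResult {
  finishReason: "STOP" | "MAX_TOKENS";
  text: string;
}

// Stand-in for re-sending the same request with a given output limit.
type SendFn = (maxOutputTokens: number) => StreamResult;

function runWithEscalation(
  send: SendFn,
  opts: { hasUserMaxTokensOverride: boolean },
  events: string[],
): StreamResult {
  let maxTokensEscalated = false;
  let result = send(8_000);

  if (
    result.finishReason === "MAX_TOKENS" &&
    !maxTokensEscalated &&
    !opts.hasUserMaxTokensOverride
  ) {
    // Pop the partial response from history (elided here), signal the UI
    // to discard partial output, then retry once with the larger limit.
    events.push("RETRY");
    maxTokensEscalated = true;
    result = send(ESCALATED_MAX_TOKENS);
  }
  // A second MAX_TOKENS finish is returned as-is (truncated): the
  // maxTokensEscalated flag guarantees at most one escalation.
  return result;
}
```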

State cleanup on RETRY (turn.ts)

When the Turn class receives a RETRY event, it clears accumulated state to prevent inconsistencies:

  • pendingToolCalls — cleared to avoid duplicate tool calls if the first truncated response contained completed tool calls that are repeated in the escalated response
  • pendingCitations — cleared to avoid duplicate citations
  • debugResponses — cleared to avoid stale debug data
  • finishReason — reset to undefined so the new response’s finish reason is used
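A minimal sketch of this reset, using the field names from the list above; the class shape and handleRetry method name are assumptions, not the actual Turn API.

```typescript
interface ToolCall { name: string }

class TurnState {
  pendingToolCalls: ToolCall[] = [];
  pendingCitations: Set<string> = new Set();
  debugResponses: unknown[] = [];
  finishReason: string | undefined;

  handleRetry(): void {
    // Discard everything accumulated from the truncated response so the
    // escalated response starts from a clean slate.
    this.pendingToolCalls = [];
    this.pendingCitations.clear();
    this.debugResponses = [];
    this.finishReason = undefined;
  }
}
```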

Constants

Defined in tokenLimits.ts:

| Constant | Value | Purpose |
|---|---|---|
| CAPPED_DEFAULT_MAX_TOKENS | 8,000 | Default output token limit when no user override is set |
| ESCALATED_MAX_TOKENS | 64,000 | Output token limit used on truncation retry |

Design decisions

Why 8K default?

  • 99% of responses are under 5K tokens
  • 8K provides reasonable headroom for slightly longer responses without triggering unnecessary retries
  • Reduces average slot reservation from 32K to 8K (4x improvement)

Why 64K escalated limit?

  • Covers the vast majority of long outputs that were truncated at 8K
  • Matches the output limit of many modern models (Claude Sonnet, Gemini 3.x, Qwen3.x)
  • Higher values (e.g., 128K) would negate slot optimization benefits for the <1% of requests that escalate

Why not progressive escalation (8K → 16K → 32K → 64K)?

  • Each retry adds latency (the full response must be regenerated)
  • A single retry is the simplest approach that captures almost all cases
  • The <1% truncation rate at 8K means almost no requests need escalation; those that do are likely to need significantly more than 16K

Why is escalation outside the retry loop?

  • Truncation is a success case, not an error
  • Errors from the escalated stream (rate limits, network failures) should propagate directly rather than being silently retried with incorrect parameters
  • Keeps the retry loop focused on its original purpose (transient error recovery)