LLM Request Timing Decomposition Design (P3 Phase 4)
Issue #3731 — Phase 4 of hierarchical session tracing. Adds time-to-first-token, request-setup duration, sampling duration, and per-attempt retry telemetry to the
qwen-code.llm_requestspan so operators can answer “why was this LLM call slow?” without guessing.Builds on Phase 1 (#4126), Phase 1.5 (#4302), Phase 2 (#4321). Independent of Phase 3 (#4410, in review) — recommended to land Phase 3 first so Phase 4’s per-attempt fields aggregate cleanly under subagent subtrees.
Problem
qwen-code.llm_request spans today carry only model, prompt_id, input_tokens, output_tokens, success, error, duration_ms. Operators reading a single trace cannot tell:
- How much of
duration_mswas the model thinking vs the network setup. A 12-secondduration_mscould be 11s of retries followed by 1s of fast generation, or 100ms of setup followed by 12s of slow streaming — the trace doesn’t say. - When the user saw the first token. TTFT (time-to-first-token) is the standard latency SLO for chat UIs. We can’t compute it; we don’t capture it.
- What happened during retries.
retryWithBackoff(utils/retry.ts:285) only callsdebugLogger.warn— no OTel event, no span attribute. The 4 LLM call sites that go through it (client.ts:1540,baseLlmClient.ts:193,282,geminiChat.ts:1039) have zero retry visibility in traces or metrics.ContentRetryEventexists for content-recovery retries insidegeminiChat.ts:806,830but not for the more common rate-limit / 5xx retries. - That
api.request.breakdownis dead code. The metric is defined atmetrics.ts:242-251with 4ApiRequestPhasevalues, exported fromindex.ts:117, tested inmetrics.test.ts:646-675— butrecordApiRequestBreakdown()has zero callers in production code. The metric infrastructure is paid for; the data flow was never connected.
These gaps make qwen-code.llm_request the least informative span in the trace tree. Tool spans (#4126/#4321) and subagent spans (#4410) both surface lifecycle phases; LLM spans collapse the entire request into one opaque duration.
Existing surface (no change)
| Component | Location | Why we don’t touch it |
|---|---|---|
| LLM request span lifecycle | session-tracing.ts startLLMRequestSpan / endLLMRequestSpan | Phase 1 (#4126) established the helpers. We extend the metadata interface, don’t restructure |
| Active span propagation into provider generators | loggingContentGenerator.ts:213,287 | Phase 1 (#4126) replaced withSpan('api.*') with native helpers; the active context already reaches the stream wrapper |
ContentRetryEvent schema + consumers | types.ts:626, qwen-logger.ts:947, loggers.ts:717 | Existing event keeps its shape and downstreams; we add a sibling event class for the retryWithBackoff path |
LogToSpanProcessor log-bridge spans | log-to-span-processor.ts | ContentRetryEvent’s existing bridge continues to nest under the active LLM span. Phase 4 does not change this |
ApiRequestPhase enum | metrics.ts:330-334 | Public surface (4 values). We populate 3 of the 4 from production code; leave the enum unchanged for backward compatibility |
Per-provider chunk normalization → GenerateContentResponse | loggingContentGenerator.ts:286-393 | Each provider already normalizes to Google’s GenerateContentResponse shape before LoggingContentGenerator sees the stream. TTFT detection runs centrally over this normalized shape; no per-provider code |
retryWithBackoff general-purpose retry | utils/retry.ts:140 | Used by both LLM callers and non-LLM (channels/weixin/src/api.ts). We extend with an opt-in onRetry callback rather than hard-coupling to LLM telemetry |
Non-streaming generateContent | loggingContentGenerator.ts:212 | TTFT is not meaningful for non-streaming; the new fields stay undefined. Span lifecycle and existing attrs unchanged |
Out-of-scope (deferred)
- SDK-level retries (openai SDK
maxRetries=3, google-genai SDK internal retries). These happen entirely inside the third-party SDK; observing them requires disabling SDK retries and reimplementing inretryWithBackoff. Separate decision, not Phase 4. - Per-token streaming metrics (inter-token latency, per-chunk size). Useful for inference-engine perf debugging, not for the user-perceived latency questions Phase 4 targets.
- Separate TTFT for reasoning/thinking blocks. “First token” includes thinking content (see D1). A future enhancement could split
ttft_to_reasoning_msvsttft_to_answer_ms, but only after we know there’s demand. - Sampling phase as a dedicated child span. Computable from
duration_ms - ttft_ms - request_setup_ms; child span adds nothing for OTel-only backends (claude-code uses one for Perfetto only). Stored as a span attribute instead — see D6. - Persistent retry mode (
QWEN_CODE_UNATTENDED_RETRY) event-level rate limiting. A single LLM request can produce 50+ContentRetryEvent/ApiRetryEventrecords under persistent retry. Capping emission is a follow-up — Phase 4 emits all events; if production volumes prove unbearable, add a per-span emission cap with a “+N more attempts (truncated)” summary event in a follow-up PR. TOKEN_PROCESSINGbreakdown phase. Enum value exists but qwen-code has no real post-stream local processing worth measuring (<10ms typical). Skipped in production callers; enum value retained for future use or for callers we don’t control.- Migrating
ContentRetryEventonto LLM span as span events. Same reasoning as Phase 3’ssubagent_executionLogRecord: existing consumers (qwen-logger RUM, future metrics) are tightly coupled to the LogRecord. Bridge-span coverage is good enough.
References (decision evidence)
| Source | Key takeaway |
|---|---|
claude-code (Anthropic) claude.ts:1762, 1789, 1982, 2882 | TTFT captured as Date.now() - start on message_start SSE event; start reset per retry attempt. requestSetupMs = start - startIncludingRetries. attemptStartTimes array preserved per attempt. Confirms feasibility of the approach; their TTFT semantic is “first stream event” (we diverge to “first content” — see D1) |
claude-code perfettoTracing.ts:549-671 | Renders Request Setup → Attempt N (retry) → First Token → Sampling as nested B/E pairs. Demonstrates the visual decomposition; qwen-code does the same decomposition with OTel attributes since we have no Perfetto |
claude-code sessionTracing.ts:447 | Only ttft_ms makes it onto the OTel span (not requestSetupMs, not samplingMs, not per-attempt timing). We deliberately put more on the span — claude-code has Perfetto for visualization; we don’t |
opencode (sst/opencode) session/llm.ts, route/client.ts | No TTFT measurement. Single LLM.run Effect span covers everything. Validates that the gap exists across competing tools; not a reference for what to do |
| OTel GenAI Semantic Conventions (status: Development / Experimental) | gen_ai.usage.input_tokens (Stable), gen_ai.usage.output_tokens (Stable), gen_ai.usage.cached_tokens (Experimental), gen_ai.request.model (Stable), gen_ai.server.time_to_first_token (Experimental, seconds as double). Dual-emit pattern follows #4410 precedent |
| OTel Trace Spec — Span Events | ”Events SHOULD NOT be used to record information that’s better captured as Span Attributes.” Confirms per-attempt info belongs on the LLM span attributes + log-bridge spans, not as Span Events on the parent |
Phase 3 design doc (telemetry-subagent-spans-design.md) | Established the dual-emit pattern (qwen-code.subagent.id + gen_ai.agent.id) and the “private name is authoritative” rule. Phase 4 follows the same convention for TTFT and token fields |
Design — seven decisions, each justified
D1 — TTFT semantic: “first chunk containing user-visible content”
TTFT measures wall-clock from the successful attempt’s request dispatch to the first stream chunk that contains user-visible output. A chunk is “user-visible” if any normalized Part in candidates[0].content.parts is one of:
textwith non-empty stringfunctionCall(tool use)inlineData(image, binary)executableCodethought/ reasoning content (whatever the provider surfaces — Gemini’sthought, Anthropic’s<thinking>block, OpenAI o1 reasoning chunk)
Chunks containing only role metadata or only usageMetadata (final usage-summary chunk) do not trigger TTFT.
Why not “first stream event of any kind” (claude-code’s choice): claude-code measures TTFT at message_start, an Anthropic-specific metadata event that fires 50–300ms before any actual content. Their internal headlessProfiler.ts already separates time_to_first_response_ms for the “user saw something” semantic, acknowledging the distinction. qwen-code spans multiple providers (Anthropic, OpenAI, Gemini, Qwen) — picking the metadata-event semantic means TTFT for Anthropic is fundamentally different from TTFT for OpenAI (which has no analogous metadata-only first event). The user-visible-content semantic is uniform across all 4 providers and matches “time-to-first-token” literally.
Why include thought / reasoning: from the operator’s perspective, reasoning chunks are still “the model produced output.” Excluding them would understate TTFT for reasoning-heavy models (o1, Qwen thinking variants). Future split into ttft_to_reasoning_ms vs ttft_to_answer_ms is possible; not Phase 4.
Why include tool-call-only chunks: agent tool-decision LLM calls (one tool_use, no text) are common in qwen-code’s workflow. Excluding them means TTFT is undefined for these requests. The functionCall Part is meaningful output.
Cross-product comparison note: design doc explicitly states qwen-code.ttft_ms ≈ claude-code.time_to_first_response_ms ≠ claude-code.ttft_ms. Operators comparing across products should align on the user-visible-content semantic.
D2 — TTFT measurement site: method-local variables in LoggingContentGenerator.generateContentStream
The first-chunk detection runs inside the existing stream wrapper at loggingContentGenerator.ts:393 (async function* processStreamGenerator). Per-call variables (start, ttftMs) live in the method’s closure; never as instance fields.
Why never instance fields: LoggingContentGenerator is instantiated once per ContentGenerator (contentGenerator.ts:377) and shared across all concurrent generateContentStream calls — subagent fan-out, warmup queries, side-queries from geminiChat. An instance field would be overwritten across concurrent calls, producing nonsense TTFT for one of every two interleaved requests.
Why not AsyncLocalStorage: ALS would work but adds a context-management layer for a piece of state that doesn’t need to escape the method. Method-local is simpler, zero overhead, zero risk of leakage.
// loggingContentGenerator.ts — inside generateContentStream
const attemptStart = Date.now(); // per-call local
const requestEntryTime = Date.now(); // also per-call local — see D3
let ttftMs: number | undefined;
const attemptStartTimes: number[] = [attemptStart];
let retryTotalDelayMs = 0;
let finalAttempt = 1;
// stream wrapper inspects each chunk; first one matching hasUserVisibleContent:
// ttftMs = Date.now() - attemptStart;hasUserVisibleContent(chunk) is a small standalone helper colocated with the wrapper, exported for tests:
function hasUserVisibleContent(chunk: GenerateContentResponse): boolean {
const parts = chunk.candidates?.[0]?.content?.parts;
if (!parts?.length) return false;
return parts.some(
(p) =>
(typeof p.text === 'string' && p.text.length > 0) ||
p.functionCall !== undefined ||
p.inlineData !== undefined ||
p.executableCode !== undefined ||
// @ts-expect-error — `thought` is not on all SDK versions but providers emit it
p.thought !== undefined,
);
}D3 — request_setup_ms computation: entry-time vs successful-attempt-start
request_setup_ms measures wall-clock from generateContentStream/generateContent entry to the start of the successful attempt — including all failed retries, backoff sleeps, and any pre-retry preparation work.
request_setup_ms = attemptStart_of_successful_attempt - requestEntryTime;When attempt === 1 and no retries happened, request_setup_ms is small (just SDK setup). When retries occurred, it captures the entire retry-budget overhead.
Putting it on the OTel span (diverges from claude-code, which puts it only on Perfetto): rationale at three levels:
- No Perfetto — qwen-code has no out-of-band visualization layer. OTel attributes are the only channel.
- Single-trace debug — operator sees
duration_ms=12000, request_setup_ms=11500, ttft_ms=200, sampling_ms=300→ instantly diagnoses “retries ate 11.5s, model itself was fast.” Computingrequest_setup_msfrom other fields requires also exposingsampling_ms, which we do anyway (D6). - Negligible cost — 1 INT64 attribute. Same order of magnitude as the existing
input_tokens,output_tokensattributes. Backend ingest cost is not material.
D4 — Retry telemetry: onRetry callback option on retryWithBackoff + ApiRetryEvent + AsyncLocalStorage propagation
Phase 4b update (post-design discovery): this section was originally written assuming claude-code’s “one LLM span owns the retry loop” pattern. While implementing Phase 4b, we discovered that qwen-code’s 4
retryWithBackoffcall sites (client.ts:2109,baseLlmClient.ts:235,333,geminiChat.ts:2035— line numbers as of merge) all wrapapiCall = () => contentGenerator.generateContent(...). The retry layer sits above LoggingContentGenerator. Each retry attempt invokesapiCall()fresh → freshqwen-code.llm_requestspan. There is no single shared span across attempts. An in-LoggingContentGeneratoraccumulator wouldn’t work.Resolution: propagate retry state via
AsyncLocalStorage(retryContextinpackages/core/src/utils/retryContext.ts).retryWithBackoffwraps eachawait fn()inretryContext.run({ attempt, requestSetupMs, retryTotalDelayMs }, fn).LoggingContentGeneratorreads the ALS in its synchronous prelude and forwards the values toendLLMRequestSpan. This actually gives richer observability than the original plan — each per-attempt span has its ownduration_ms/ttft_ms/ error details AND knows where in the retry budget it sits via the per-attemptattempt/requestSetupMs/retryTotalDelayMsattributes.The ALS approach matches existing patterns in the codebase (
promptIdContext,subagentNameContext,agent-context) — minimal new surface, well-understood semantics. Plan-mode review process captured this revision through 3 review rounds finding 22 issues, all addressed before merge.
retryWithBackoff currently calls logRetryAttempt (retry.ts:343) which only writes to debugLogger.warn. We extend the RetryOptions interface with an opt-in callback:
// utils/retry.ts
interface RetryOptions<T> {
// ... existing fields ...
/**
* Optional. Called once per failed attempt, before the backoff sleep.
* Receives the attempt number (1-based), the error, and the delay before
* the next attempt. Use this to emit telemetry events for LLM call sites;
* leave undefined for non-LLM callers (e.g., channels/weixin) so they
* stay silent in LLM-specific telemetry channels.
*/
onRetry?: (info: RetryAttemptInfo) => void;
}
interface RetryAttemptInfo {
attempt: number; // 1-based, matches debugLogger output
error: unknown;
errorStatus?: number;
delayMs: number; // backoff delay before next attempt
}The 4 LLM call sites (client.ts:1540, baseLlmClient.ts:193,282, geminiChat.ts:1039) register a callback that emits a new ApiRetryEvent:
// types.ts — new event class, sibling to ContentRetryEvent
export class ApiRetryEvent implements BaseTelemetryEvent {
'event.name': typeof EVENT_API_RETRY;
'event.timestamp': string;
model: string;
prompt_id?: string;
attempt_number: number; // 1-based
error_type: string;
error_message: string; // truncated to 256 chars
status_code?: number;
retry_delay_ms: number;
// ... duration_ms set to retry_delay_ms so LogToSpanProcessor renders
// a bridge span of meaningful width
duration_ms: number;
}Why a new event class, not extending ContentRetryEvent:
ContentRetryEventhas 2 downstream consumers (qwen-logger, log-record export). Changing its payload risks breaking them.- The naming “content retry” semantically refers to content-recovery retries (invalid stream, schema repair) — extending it to cover rate-limit retries would muddy the schema.
- New event is additive; no consumer surprise.
Why not embed callback IN retry.ts: retry.ts is called by channels/weixin/src/api.ts too (microsoft messaging API retries). Hard-coupling LLM telemetry inside retry.ts would emit ApiRetryEvent for non-LLM retries. The onRetry callback is opt-in per caller — LLM callers opt in, weixin caller doesn’t.
ContentRetryEvent coexistence: ContentRetryEvent stays as-is for content-recovery retries inside geminiChat.ts:806,830. ApiRetryEvent covers the rate-limit / 5xx retries from retryWithBackoff. The two events fire from different layers and never duplicate. Existing log-bridge behavior for both events is preserved via LogToSpanProcessor — both events nest under the active LLM span automatically (Phase 1 wiring ensures the LLM span is active during retries).
Persistent retry mode (QWEN_CODE_UNATTENDED_RETRY): a single 429-loop request may emit 50+ events. Out of scope to rate-limit emission in Phase 4 — if production volumes prove unbearable, add a per-span cap with summary event in a follow-up PR. The aggregated attempt and retry_total_delay_ms on the parent LLM span (D5) remain accurate regardless of event cap.
D5 — Parent LLM span aggregation: scalar attributes only (no map-typed attrs)
OTel span attributes are scalars (string | number | boolean | array of these). Map-typed attributes (like retry_count_by_status: {429:2, 503:1}) require JSON serialization and are awkward to query. Skip them.
| Attribute | Type | Semantic |
|---|---|---|
attempt | int | 1-based monotonic counter from retryContext.attempt (this attempt’s iteration). Always populated (defaults to 1 when no retry context) |
retry_total_delay_ms | int | Cumulative backoff sleep BEFORE this attempt started. Undefined for direct calls; 0 for attempt 1; > 0 for subsequent retried attempts |
ttft_ms | int | TTFT per D1; undefined for non-streaming or aborted-before-first-chunk requests |
request_setup_ms | int | Per D3 |
sampling_ms | int | Per D6 |
output_tokens_per_second | double | Derived; output_tokens / (sampling_ms / 1000); undefined when sampling_ms === 0 |
Per-attempt status-code distribution (e.g., “2 of the 3 attempts were 429s”) is queryable from log-bridge spans of ApiRetryEvent records. No need to duplicate it as a flattened attribute on the parent.
Why sampling_ms and output_tokens_per_second on the span: derivable but cumbersome to compute in backend queries when summing across many spans. Same cost-benefit as request_setup_ms (D3).
D6 — Activate recordApiRequestBreakdown() for 3 of 4 phases
In endLLMRequestSpan (or the wrapper that calls it), after computing TTFT/setup/sampling, emit:
recordApiRequestBreakdown(config, model, [
{ phase: ApiRequestPhase.REQUEST_PREPARATION, durationMs: requestSetupMs },
{ phase: ApiRequestPhase.NETWORK_LATENCY, durationMs: ttftMs }, // ttftMs = network + first-token-generation
{ phase: ApiRequestPhase.RESPONSE_PROCESSING, durationMs: samplingMs },
]);Why skip TOKEN_PROCESSING: qwen-code does stream chunk processing inline (consolidation happens in the wrapper at loggingContentGenerator.ts:644); the post-stream wrap-up phase is <10ms and not architecturally distinct. Filling it with a meaningless value pollutes the histogram. Leaving the enum value unused is safe — apiRequestBreakdownHistogram.record(value, {model, phase}) is just a histogram with phase as a label; missing labels are simply absent in queries.
Why not redefine NETWORK_LATENCY: the spec name is slightly misleading (it’s network + first-token-generation, not pure network latency), but:
- The enum is part of
metrics.ts:330-334which is exported fromindex.ts:117and tested. - Backend dashboards may already reference these phase names.
- Renaming or adding a new phase would be a breaking change for trivially marginal accuracy improvement.
Document the semantic in the design doc; leave the enum unchanged.
Why on the span path, not parallel: keeps recordApiRequestBreakdown colocated with span attribute writes — single gated emission point (see D7 idempotency), single ordering invariant.
D7 — endLLMRequestSpan idempotency: metric recording gated on existing double-end guard
Phase 1.5 (#4302) established that endLLMRequestSpan may be called twice (abort path + error path collision). The existing guard at session-tracing.ts:~470 (if (!activeSpans.has(...)) return;) prevents double span.end(). Phase 4 metric recording (D6) must sit inside the same guarded block, before span.end():
// session-tracing.ts — endLLMRequestSpan
const llmCtx = activeSpans.get(spanRef);
if (!llmCtx) return; // already ended — double-end guard
activeSpans.delete(spanRef); // claim the end
// ... compute duration, set attributes ...
if (metadata) {
recordApiRequestBreakdown(config, llmCtx.attributes.model, [...]); // NEW — gated
recordTokenUsageMetrics(...); // existing
}
span.end();This guarantees metric is recorded exactly once per LLM request, matching the span lifecycle.
Why not record in loggingContentGenerator: it doesn’t see the abort path. Recording at the span lifecycle layer ensures every LLM request that opens a span produces exactly one breakdown sample, regardless of success/failure/abort.
D8 — GenAI semantic conventions dual-emit (private name authoritative)
Each Phase 4 attribute that corresponds to an OTel GenAI semconv attribute is written twice on the span:
| qwen-code private (authoritative) | GenAI semconv (compat layer) | Unit conversion | Spec status |
|---|---|---|---|
ttft_ms (ms, int) | gen_ai.server.time_to_first_token (s, double) | ttftMs / 1000 | Experimental |
input_tokens (int) | gen_ai.usage.input_tokens (int) | identical | Stable |
output_tokens (int) | gen_ai.usage.output_tokens (int) | identical | Stable |
cached_input_tokens (int) (when present) | gen_ai.usage.cached_tokens (int) | identical | Experimental |
qwen-code.model (string) | gen_ai.request.model (string) | identical | Stable |
Existing token attribute names on the LLM span (set in endLLMRequestSpan before Phase 4): qwen-code uses bare input_tokens and output_tokens already. Phase 4 adds the gen_ai.usage.* siblings to match #4410’s pattern. The bare names stay; don’t rename.
Fields with no GenAI semconv equivalent — request_setup_ms, sampling_ms, retry_total_delay_ms, attempt, output_tokens_per_second — are emitted only under the qwen-code namespace.
Why “private authoritative, semconv as compat”:
- Internal dashboards, SLOs, debugLogger output, qwen-logger RUM, ARMS queries — all reference
ttft_msetc. Treating those as canonical avoids a flag-day migration. - The Experimental GenAI semconv may rename
gen_ai.server.time_to_first_tokenbefore reaching Stable. If/when it does, we update the semconv emission; the qwen-code names don’t move. - Future spec-aware backends (Datadog AI views, Honeycomb AI, ARMS GenAI dashboards) auto-pick up the
gen_ai.*attributes without our involvement.
Why dual-emit unit conversion (ms ↔ seconds): GenAI semconv chose seconds-as-double for latency; qwen-code chose ms-as-int (matches duration_ms already on the span). Both representations have value; the conversion is cheap.
Helper API (additive to session-tracing.ts)
// session-tracing.ts — LLMRequestMetadata interface extended (additive)
export interface LLMRequestMetadata {
// ... existing fields: inputTokens, outputTokens, cachedInputTokens, success, error, ...
/** Time from successful attempt start to first user-visible content chunk (ms). Undefined for non-streaming or aborted-before-first-chunk requests. */
ttftMs?: number;
/** Time from generateContent entry to start of successful attempt (ms). Includes all failed retries + backoff. */
requestSetupMs?: number;
/** Final attempt number (1-based). 1 = no retries. */
attempt?: number;
/** Sum of all backoff delays before the successful attempt (ms). */
retryTotalDelayMs?: number;
}
// No new exported helpers — Phase 4 reuses startLLMRequestSpan / endLLMRequestSpan with extended metadata.// types.ts — new event class
export class ApiRetryEvent implements BaseTelemetryEvent {
'event.name': typeof EVENT_API_RETRY = EVENT_API_RETRY;
'event.timestamp': string;
model: string;
prompt_id?: string;
attempt_number: number;
error_type: string;
error_message: string;
status_code?: number;
retry_delay_ms: number;
duration_ms: number; // = retry_delay_ms, drives LogToSpanProcessor bridge span width
constructor(opts: { model: string; promptId?: string; attemptNumber: number; error: unknown; statusCode?: number; retryDelayMs: number }) { ... }
}
// constants.ts
export const EVENT_API_RETRY = 'qwen-code.api_retry';
// loggers.ts
export function logApiRetry(config: Config, event: ApiRetryEvent): void { ... }// utils/retry.ts — RetryOptions extension
interface RetryOptions<T> {
// ... existing ...
onRetry?: (info: RetryAttemptInfo) => void;
}
interface RetryAttemptInfo {
attempt: number;
error: unknown;
errorStatus?: number;
delayMs: number;
}
// Inside retryWithBackoff, where logRetryAttempt is called today:
options.onRetry?.({ attempt, error, errorStatus, delayMs: actualDelay });
logRetryAttempt(attempt, error, errorStatus); // existing debugLogger call unchangedLifecycle wiring
Streaming path (the common case)
// loggingContentGenerator.ts:283 — generateContentStream
async generateContentStream(req, userPromptId): Promise<AsyncGenerator<GenerateContentResponse>> {
const requestEntryTime = Date.now();
let attemptStart = requestEntryTime;
const attemptStartTimes: number[] = [attemptStart];
let retryTotalDelayMs = 0;
let finalAttempt = 1;
// Use existing startLLMRequestSpan (Phase 1)
// Pass onRetry callback to whatever retry layer is in use:
const onRetry: RetryAttemptInfo & { invoke: ... } = (info) => {
finalAttempt = info.attempt + 1; // we're about to start attempt N+1
retryTotalDelayMs += info.delayMs;
attemptStart = Date.now() + info.delayMs; // approximate; actual reset is at top of next attempt
attemptStartTimes.push(attemptStart);
// emit ApiRetryEvent
logApiRetry(this.config, new ApiRetryEvent({
model: req.model,
promptId: userPromptId,
attemptNumber: info.attempt,
error: info.error,
statusCode: info.errorStatus,
retryDelayMs: info.delayMs,
}));
};
// stream wrapper detects first user-visible chunk:
return this.processStreamGenerator(stream, ..., {
onFirstUserVisibleChunk: (now) => {
ttftMs = now - attemptStart;
},
});
}At span end (already in Phase 1’s endLLMRequestSpan flow), include the new fields in LLMRequestMetadata:
endLLMRequestSpan(llmSpan, {
success: true,
inputTokens,
outputTokens,
cachedInputTokens,
ttftMs,
requestSetupMs: attemptStart - requestEntryTime,
attempt: finalAttempt,
retryTotalDelayMs,
});Non-streaming path
generateContent (loggingContentGenerator.ts:212) does not produce streaming chunks. TTFT is undefined; request_setup_ms is still meaningful (captures retry overhead). The breakdown metric records 2 phases (REQUEST_PREPARATION + RESPONSE_PROCESSING where RESPONSE_PROCESSING = duration_ms - request_setup_ms), not 3.
Retry layer integration (4 sites)
Each of the 4 LLM retryWithBackoff call sites adds onRetry:
// client.ts:1540 (similar at baseLlmClient.ts:193, 282, geminiChat.ts:1039)
const result = await retryWithBackoff(apiCall, {
...existingOptions,
onRetry: (info) => {
logApiRetry(
this.config,
new ApiRetryEvent({
model,
promptId: userPromptId,
attemptNumber: info.attempt,
error: info.error,
statusCode: info.errorStatus,
retryDelayMs: info.delayMs,
}),
);
// also feed back into LoggingContentGenerator's local retry accumulator
// (when in scope — for callers that don't go through LoggingContentGenerator,
// the LLM span still gets `attempt` and `retry_total_delay_ms` via the
// metadata path because endLLMRequestSpan is called at the LLM layer)
},
});The non-LLM caller (channels/weixin/src/api.ts) does not register onRetry — no ApiRetryEvent is emitted for its retries, matching today’s behavior.
Concurrent safety — the headline guarantee
LoggingContentGenerator instance is shared (one per ContentGenerator, contentGenerator.ts:377). Three concurrent generateContentStream calls (e.g., 3 subagents fan out via coreToolScheduler.runConcurrently) execute three independent closures of generateContentStream:
call_A: attemptStart_A, ttftMs_A, ... (closure)
call_B: attemptStart_B, ttftMs_B, ... (closure)
call_C: attemptStart_C, ttftMs_C, ... (closure)Per-call locals never overlap. Stream chunks are detected against the local attemptStart of each call. Span attributes are set at each call’s own endLLMRequestSpan.
AsyncLocalStorageContextManager (registered by NodeSDK at sdk.ts:273) already ensures the active OTel context — and thus the parent span passed to startLLMRequestSpan — is correct per fiber.
Files to change
| File | Change | LOC est |
|---|---|---|
packages/core/src/telemetry/constants.ts | Add EVENT_API_RETRY constant | +2 |
packages/core/src/telemetry/types.ts | Add ApiRetryEvent class + union member | +40 |
packages/core/src/telemetry/loggers.ts | Add logApiRetry() function | +20 |
packages/core/src/telemetry/qwen-logger/qwen-logger.ts | Add logApiRetryEvent() for RUM downstream consistency | +20 |
packages/core/src/telemetry/session-tracing.ts | Extend LLMRequestMetadata (ttftMs, requestSetupMs, attempt, retryTotalDelayMs); extend endLLMRequestSpan to set new attrs + breakdown metric + dual-emit gen_ai.* | +60 |
packages/core/src/telemetry/metrics.ts | Wire recordApiRequestBreakdown callsite inside endLLMRequestSpan (no change to the existing recorder) | 0 |
packages/core/src/utils/retry.ts | Add onRetry?: (info: RetryAttemptInfo) => void to RetryOptions; export RetryAttemptInfo; invoke callback in the existing logRetryAttempt site | +25 |
packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts | TTFT capture: method-local accumulators + hasUserVisibleContent helper + first-chunk detection in stream wrapper; pass new metadata to endLLMRequestSpan | +80 |
packages/core/src/core/client.ts | Wire onRetry callback at retryWithBackoff call site (client.ts:1540) | +15 |
packages/core/src/core/baseLlmClient.ts | Wire onRetry callback at 2 retryWithBackoff call sites | +25 |
packages/core/src/core/geminiChat.ts | Wire onRetry callback at retryWithBackoff call site (geminiChat.ts:1039) | +15 |
packages/core/src/telemetry/session-tracing.test.ts | endLLMRequestSpan sets ttft_ms / request_setup_ms / attempt / retry_total_delay_ms / sampling_ms / output_tokens_per_second + gen_ai dual-emit + breakdown metric (each phase) + idempotent end | +120 |
packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts | hasUserVisibleContent (text / functionCall / inlineData / executableCode / thought / role-only / usage-only); concurrent calls don’t cross-contaminate; TTFT undefined when aborted before first chunk; TTFT undefined on non-streaming | +100 |
packages/core/src/utils/retry.test.ts | onRetry invoked per failed attempt with correct attempt, delayMs, error, errorStatus; absence of onRetry is silent (no telemetry emitted) | +50 |
packages/core/src/telemetry/loggers.test.ts | logApiRetry emits LogRecord with expected payload; bridges through LogToSpanProcessor to nested span under active LLM span | +40 |
Total: 14 files, ~610 LOC. Larger than Phase 2 (#4321) but comparable to Phase 3 (#4410) and justified by the breadth of integration (4 retry sites + telemetry plumbing + streaming wrapper).
If review pushes back on size: split into Phase 4a + 4b + 4c:
- 4a (~200 LOC): TTFT capture + extended
LLMRequestMetadata+ dual-emit. Self-contained value (TTFT visibility from day one). - 4b (~250 LOC):
onRetrycallback +ApiRetryEvent+ 4 caller wiring. Independently a bug fix for theretryWithBackofftelemetry gap. - 4c (~160 LOC):
recordApiRequestBreakdownactivation + parent span aggregation attrs (attempt,retry_total_delay_ms,sampling_ms,output_tokens_per_second). Depends on 4a + 4b.
Testing strategy
| Test | What it proves |
|---|---|
hasUserVisibleContent returns true for text/functionCall/inlineData/executableCode/thought | D1 semantics across part types |
hasUserVisibleContent returns false for role-only and usage-only chunks | D1 negative cases |
| streaming: TTFT measured from attempt start to first user-visible chunk | End-to-end TTFT detection |
| streaming: TTFT undefined if stream aborts before any user-visible chunk | Edge case |
| streaming: TTFT computed from final attempt’s start (not first attempt) | D3 — TTFT reset on retry |
| non-streaming: TTFT remains undefined | S3 decision |
concurrent generateContentStream calls don’t cross-contaminate TTFT | D2 — method-local guarantee |
endLLMRequestSpan sets all Phase 4 attrs (ttft_ms, request_setup_ms, sampling_ms, attempt, retry_total_delay_ms, output_tokens_per_second) | Attribute presence |
endLLMRequestSpan dual-emits gen_ai.server.time_to_first_token + gen_ai.usage.* + gen_ai.request.model | D8 dual-emit |
endLLMRequestSpan records breakdown metric with 3 phases for streaming, 2 for non-streaming | D6 |
endLLMRequestSpan called twice: metric recorded exactly once, attrs not re-set | D7 idempotency |
retryWithBackoff with onRetry: callback invoked per failed attempt with correct args | D4 callback contract |
retryWithBackoff without onRetry: no telemetry emitted (silent for non-LLM callers) | P2 — channels/weixin scope protection |
client.ts / baseLlmClient.ts / geminiChat.ts retry callsites emit ApiRetryEvent on retry | Integration of D4 at 4 sites |
ApiRetryEvent LogRecord bridges via LogToSpanProcessor to a child span under active LLM span | Trace tree correctness |
LLM span attempt field correctly reflects final attempt number under retries | D5 aggregation |
LLM span retry_total_delay_ms correctly sums onRetry delays | D5 aggregation |
output_tokens_per_second undefined when sampling_ms === 0 (no streaming) | Avoid divide-by-zero |
Edge cases
| Case | Handling |
|---|---|
| Stream aborts before any chunk arrives | ttftMs = undefined, sampling_ms = undefined, output_tokens_per_second = undefined. attempt, request_setup_ms still set. success = false |
| Stream aborts after first chunk | ttftMs set; sampling_ms = duration_ms - ttftMs - request_setup_ms; reflects partial response time. success = false |
| Retry succeeds on attempt 1 (no retries) | attempt = 1, retry_total_delay_ms = 0, no ApiRetryEvent emitted, breakdown metric records request_setup_ms close to 0 |
| Persistent retry mode 50+ attempts | 50+ ApiRetryEvent records emitted (out-of-scope cap deferred); LLM span attempt = 51, retry_total_delay_ms = sum of all delays. Operator sees aggregated view on span; full per-attempt detail in log-bridge spans |
Non-LLM retryWithBackoff caller (channels/weixin) | No onRetry registered; only existing debugLogger.warn fires. No ApiRetryEvent; no breakdown metric (caller isn’t an LLM site) |
endLLMRequestSpan called twice (abort + error race) | Phase 1.5 guard at activeSpans.delete() returns early on second call; recordApiRequestBreakdown is inside the guard, recorded exactly once |
Anthropic message_start chunk arrives before content | hasUserVisibleContent returns false for it (no parts with text/functionCall/etc.); TTFT not triggered until subsequent content_block_delta chunk |
OpenAI first chunk with empty delta.content but role only | hasUserVisibleContent returns false; TTFT not triggered until first chunk with non-empty delta |
| Tool-call-only response (no text) | First chunk with functionCall Part triggers TTFT; output_tokens_per_second computed against tool-call token count |
| Concurrent subagents (3 calls in flight) | Each call’s closure has its own attemptStart, ttftMs, attemptStartTimes. Per-call span receives its own metadata at endLLMRequestSpan. No interleaving (D2) |
SDK-level retries inside openai-sdk (maxRetries=3) | Invisible to qwen-code telemetry — happens entirely inside SDK before retryWithBackoff sees the request. attempt reflects retryWithBackoff attempts only. Out of scope (see Out-of-scope) |
gen_ai.server.time_to_first_token spec renames before reaching Stable | Single-file update: session-tracing.ts:endLLMRequestSpan. The qwen-code-native ttft_ms stays authoritative — no downstream impact |
| Subagent’s LLM request | Parent is the subagent span (Phase 3). Phase 4 fields nest correctly. Aggregations grouped by qwen-code.subagent.id give per-subagent LLM perf — design-doc-future, easy follow-up |
| Reasoning model with long thought blocks | First thought Part triggers TTFT; sampling_ms includes both thinking + answer phases. Split into separate metrics deferred |
Rollback
The change is additive at the OTel and metric level — every new attribute is optional, every new event is a new class. Existing dashboards that don’t filter on the new fields keep working unchanged.
Behavior-affecting changes:
- New
ApiRetryEventLogRecord starts flowing → log volume increases proportional to retry rate (typically <1% of requests retry). Mitigate by sampling LogRecord at the SDK layer if needed. - New breakdown metric
qwen-code.api.request.breakdownstarts producing time series → mild Prometheus cardinality bump ({model, phase}— bounded). output_tokens_per_secondderived attribute may appear unusual on dashboards filtering “all attributes” — document.
Rollback path: revert the single PR (or each of 4a/4b/4c independently). All new fields use defensive defaults (undefined / 0) and don’t change span structure.
Sequencing
- After Phase 3 (#4410, in review): not a hard dependency. Phase 4 attributes attach to
qwen-code.llm_requestspans regardless of whether they’re under aqwen-code.subagent(Phase 3) orqwen-code.interaction(Phase 1) parent. Recommend Phase 3 land first so per-attempt aggregation under subagent subtrees works naturally. - Independent of #4384 (
traceparent+X-Qwen-Code-Session-Idoutbound propagation). They touch the HTTP layer; Phase 4 touches the stream/retry/metric layer. - Independent of
clearDetailedSpanStatechat-compression follow-up (#4097 follow-up). Different surface.
Open questions
onRetrycallback firing semantics: invoked before backoff sleep (current proposal) or after (when the next attempt is about to start)? Before is simpler — callback has all the info immediately; after would require capturing the just-completed delay separately. Pre-sleep is the recommendation; document in callback contract.- Per-attempt timing on the LLM span: should we add
attempt_durations_ms: number[]array? OTel supports array-of-primitive attributes. Useful for “which attempt of N was slow” diagnostics. Defer until production data shows demand — log-bridge spans already carry the equivalent. - Persistent retry mode emission cap: at what
attempt > Nthreshold should we start sampling?N = 5then 1-in-10?N = 10then summary-only? Defer until we have production volume data. TOKEN_PROCESSINGphase: keep enum value dormant or wire it to something (e.g., consolidation time)? Defer — wait for a real use case.- Subagent-level LLM rollups: trivial follow-up once Phase 4 lands — sum
ttft_ms/output_tokens/input_tokensper subagent subtree. Not Phase 4 scope but the data flow enables it.