API Service

The API service (src/services/api/) is the central interface between Claude Code and the Anthropic Claude API. It handles streaming message queries, retry logic, token estimation, usage tracking, error classification, and prompt caching.

Core Architecture

The API service is built around queryModelWithStreaming() in claude.ts, which assembles the full API request (system prompt, messages, tools, betas, thinking config) and streams the response back as an async generator of StreamEvent objects.

// Simplified call flow
const events = queryModelWithStreaming({
  messages,
  systemPrompt,
  tools,
  model,
  thinkingConfig,
  signal,
})

for await (const event of events) {
  // Handle content_block_delta, message_stop, etc.
}
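The loop above typically dispatches on the event type. A minimal sketch, assuming a simplified two-variant StreamEvent (the service's actual union has more members):

```typescript
// Simplified stand-in for the service's StreamEvent union.
type StreamEvent =
  | { type: 'content_block_delta'; delta: { type: 'text_delta'; text: string } }
  | { type: 'message_stop' }

// Returns true once the message is complete.
function handleEvent(event: StreamEvent, textParts: string[]): boolean {
  switch (event.type) {
    case 'content_block_delta':
      textParts.push(event.delta.text) // accumulate streamed text
      return false
    case 'message_stop':
      return true // the turn is complete
  }
}
```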

Streaming Queries

queryModelWithStreaming() accepts the following parameters:

messages: Message[]

Conversation history normalized for the API (user/assistant alternation, tool result pairing).

systemPrompt: SystemPrompt

The full system prompt including the CLI-specific prefix, context, and MCP instructions.

tools: BetaToolUnion[]

Tool definitions converted to API schema format via toolToAPISchema().

model: string

Model identifier, normalized for the provider (Anthropic, Bedrock, or Vertex).

thinkingConfig: ThinkingConfig

Extended thinking configuration with budget tokens (when supported by the model).

Model Support

The API service supports multiple model providers through a unified interface:

Provider                  Configuration                  Authentication
Anthropic (first-party)   Default or ANTHROPIC_API_KEY   API key or OAuth
AWS Bedrock               CLAUDE_CODE_USE_BEDROCK=1      AWS credentials
Google Vertex             CLAUDE_CODE_USE_VERTEX=1       GCP credentials
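Based on the configuration column above, provider selection can be sketched as a simple environment check (the function name and return shape are illustrative, not the service's actual code):

```typescript
// Illustrative provider selection from the documented env vars.
type Provider = 'anthropic' | 'bedrock' | 'vertex'

function resolveProvider(env: Record<string, string | undefined>): Provider {
  if (env.CLAUDE_CODE_USE_BEDROCK === '1') return 'bedrock' // AWS credentials
  if (env.CLAUDE_CODE_USE_VERTEX === '1') return 'vertex'   // GCP credentials
  return 'anthropic' // first-party default: API key or OAuth
}
```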

Retry Logic

The retry system in withRetry.ts wraps API calls with configurable retry behavior using an async generator pattern that yields status messages between attempts.

export async function* withRetry<T>(
  getClient: () => Promise<Anthropic>,
  operation: (client: Anthropic, attempt: number, context: RetryContext) => Promise<T>,
  options: RetryOptions,
): AsyncGenerator<SystemAPIErrorMessage, T>
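Because the generator yields status messages and delivers its result through the generator's return value, callers drive it step by step. A simplified model of the pattern (a toy stand-in, not the real withRetry body):

```typescript
// Toy retry generator: yields a status line between attempts and
// returns the operation's result once it succeeds.
async function* retrying<T>(
  operation: (attempt: number) => Promise<T>,
  maxRetries = 10,
): AsyncGenerator<string, T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation(attempt)
    } catch (err) {
      if (attempt >= maxRetries) throw err
      yield `API error, retrying (attempt ${attempt + 1}/${maxRetries})`
    }
  }
}

// Drive the generator: surface yielded statuses, capture the return value.
async function drive<T>(gen: AsyncGenerator<string, T>): Promise<T> {
  let step = await gen.next()
  while (!step.done) {
    console.log(step.value) // e.g. render as a SystemAPIErrorMessage
    step = await gen.next()
  }
  return step.value
}
```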

Retry Behavior

maxRetries: number

Maximum retry attempts. Defaults to 10, configurable via CLAUDE_CODE_MAX_RETRIES.

BASE_DELAY_MS: number

Base delay of 500ms with exponential backoff and 25% jitter.

MAX_529_RETRIES: number

After 3 consecutive 529 (overloaded) errors, triggers fallback to a secondary model if one is configured.

The following errors are treated as retryable:

  • 408: Request timeout
  • 409: Lock timeout
  • 429: Rate limit (non-subscriber or enterprise users)
  • 401: Authentication (triggers a token refresh)
  • 5xx: Server errors
  • 529: Overloaded (foreground queries only)
  • Connection errors: ECONNRESET, EPIPE (disables keep-alive on retry)
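The delay schedule described above (500ms base, exponential growth, 25% jitter) can be sketched as follows; the 32-second cap is an assumption, not from the source:

```typescript
const BASE_DELAY_MS = 500

// Exponential backoff with up to 25% additive jitter. The maxDelayMs
// ceiling is illustrative; the actual service may use a different cap.
function backoffDelayMs(attempt: number, maxDelayMs = 32_000): number {
  const capped = Math.min(BASE_DELAY_MS * 2 ** attempt, maxDelayMs)
  return capped + capped * 0.25 * Math.random()
}
```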

Persistent Retry Mode

When CLAUDE_CODE_UNATTENDED_RETRY is enabled, the service retries 429/529 errors indefinitely with up to 5-minute backoff, yielding periodic heartbeat messages every 30 seconds to prevent the host from marking the session idle.
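The long waits in this mode can be modeled as a generator that sleeps in heartbeat-sized slices, yielding between slices; the names and structure here are illustrative, not the service's actual code:

```typescript
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms))

// Wait up to five minutes, yielding a heartbeat roughly every 30 seconds
// so the host sees activity and does not mark the session idle.
async function* waitWithHeartbeat(
  delayMs: number,
  heartbeatMs = 30_000,
): AsyncGenerator<string, void> {
  const capped = Math.min(delayMs, 5 * 60_000) // 5-minute maximum backoff
  for (let waited = 0; waited < capped; waited += heartbeatMs) {
    await sleep(Math.min(heartbeatMs, capped - waited))
    yield 'Retrying rate-limited request…'
  }
}
```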

Fast Mode Fallback

When fast mode is active and a 429/529 occurs:

  1. Short retry-after (under 20s): Wait and retry with fast mode still active to preserve prompt cache
  2. Long retry-after: Enter cooldown (10-30 minutes), switching to standard speed model
  3. Overage rejection: Permanently disable fast mode with a specific reason
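The three outcomes can be expressed as a small decision function; the 20-second threshold and 10-30 minute cooldown come from the text, while the names and shapes are assumptions:

```typescript
type FastModeAction =
  | { kind: 'retry-fast'; waitMs: number }            // prompt cache preserved
  | { kind: 'cooldown-standard'; cooldownMs: number } // switch to standard model
  | { kind: 'disable'; reason: string }               // permanent

function onFastModeRateLimit(retryAfterSec: number, overageRejected: boolean): FastModeAction {
  if (overageRejected) {
    return { kind: 'disable', reason: 'overage rejected' }
  }
  if (retryAfterSec < 20) {
    return { kind: 'retry-fast', waitMs: retryAfterSec * 1000 }
  }
  const cooldownMs = (10 + Math.random() * 20) * 60_000 // 10-30 minutes
  return { kind: 'cooldown-standard', cooldownMs }
}
```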

Error Handling

The error handling system in errors.ts classifies API errors into user-facing messages:

Error                Message
Prompt too long      Token counts parsed from the API error to drive reactive compaction
Credit balance low   "Credit balance is too low"
Invalid API key      "Not logged in - Please run /login"
Token revoked        "OAuth token revoked - Please run /login"
PDF too large        Includes page/size limits with recovery instructions
Image too large      Size-specific message with retry guidance
Repeated 529         "Repeated 529 Overloaded errors"
Request timeout      "Request timed out"

Context Overflow Recovery

When the API returns a 400 error indicating max_tokens exceeds the context limit, the retry system automatically adjusts:

const overflowData = parseMaxTokensContextOverflowError(error)
if (overflowData) {
  const availableContext = contextLimit - inputTokens - 1000 // safety buffer
  retryContext.maxTokensOverride = Math.max(3000, availableContext)
}

Usage Tracking

The usage.ts module fetches rate limit utilization for Claude.ai subscribers:

five_hour: RateLimit | null

5-hour rolling window utilization percentage and reset timestamp.

seven_day: RateLimit | null

7-day rolling window utilization.

extra_usage: ExtraUsage | null

Extra usage (overage) status including the monthly limit and used credits.
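Put together, the response might be typed roughly as below; the field names follow the text, but the nested shapes are assumptions:

```typescript
interface RateLimit {
  utilization: number // percentage of the window consumed
  resets_at: string   // when the rolling window resets
}

interface ExtraUsage {
  monthly_limit: number
  used_credits: number
}

interface UsageResponse {
  five_hour: RateLimit | null
  seven_day: RateLimit | null
  extra_usage: ExtraUsage | null
}
```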

Prompt Caching

The API service supports Anthropic's prompt caching through several mechanisms:

  • Cache scope headers: PROMPT_CACHING_SCOPE_BETA_HEADER controls cache scope (global vs. per-session)
  • 1-hour cache eligibility: Checked via allowlist for extended cache TTL
  • Cache break detection: promptCacheBreakDetection.ts monitors for unexpected cache invalidation after compaction
  • Cache-safe parameters: Forked agents receive CacheSafeParams to share the parent's prompt cache
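On the wire, a cache breakpoint is an Anthropic cache_control marker on a content block; the 1-hour ttl variant requires the extended-cache-TTL beta. The prompt text below is a placeholder:

```typescript
const systemPromptText = 'You are Claude Code… (placeholder)'

// A cache_control breakpoint makes everything up to and including this
// block reusable across requests that share the same prefix.
const system = [
  {
    type: 'text' as const,
    text: systemPromptText,
    cache_control: { type: 'ephemeral' as const, ttl: '1h' as const },
  },
]
```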

Attribution Headers

For first-party Anthropic requests, the service includes attribution headers identifying the CLI version and product:

const headers = {
  'anthropic-beta': getMergedBetas(model),
  ...getAttributionHeader(),
  'anthropic-client-request-id': randomUUID(),
}

Beta Headers

The service manages a set of beta feature headers dynamically:

Beta                 Header
Extended thinking    Model-specific betas
Context management   CONTEXT_MANAGEMENT_BETA_HEADER
1M context           CONTEXT_1M_BETA_HEADER
Structured outputs   STRUCTURED_OUTPUTS_BETA_HEADER
Fast mode            FAST_MODE_BETA_HEADER
AFK mode             AFK_MODE_BETA_HEADER
Effort control       EFFORT_BETA_HEADER
Tool search          Dynamic based on model support
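Merging the active betas into the single anthropic-beta header is essentially de-duplication plus joining; a sketch (the values shown are placeholders, not the real beta strings):

```typescript
// Combine active beta flags into one comma-separated header value.
function mergeBetas(features: Iterable<string>): string {
  return [...new Set(features)].join(',')
}

const header = mergeBetas([
  'context-management-placeholder',
  'structured-outputs-placeholder',
  'context-management-placeholder', // duplicates are dropped
])
```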