API Service

The API service (src/services/api/) is the central interface between Claude Code and the Anthropic Claude API. It handles streaming message queries, retry logic, token estimation, usage tracking, error classification, and prompt caching.

Core Architecture

The API service is built around queryModelWithStreaming() in claude.ts, which assembles the full API request (system prompt, messages, tools, betas, thinking config) and streams the response back as an async generator of StreamEvent objects.

// Simplified call flow
const events = queryModelWithStreaming({
  messages,
  systemPrompt,
  tools,
  model,
  thinkingConfig,
  signal,
})

for await (const event of events) {
  // Handle content_block_delta, message_stop, etc.
}
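The loop above typically dispatches on the event type. A minimal sketch, assuming a simplified two-variant StreamEvent (the service's actual union has more members):

```typescript
// Simplified stand-in for the service's StreamEvent union.
type StreamEvent =
  | { type: 'content_block_delta'; delta: { type: 'text_delta'; text: string } }
  | { type: 'message_stop' }

// Returns true once the message is complete.
function handleEvent(event: StreamEvent, textParts: string[]): boolean {
  switch (event.type) {
    case 'content_block_delta':
      textParts.push(event.delta.text) // accumulate streamed text
      return false
    case 'message_stop':
      return true // the turn is complete
  }
}
```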

Streaming Queries

queryModelWithStreaming() accepts the following parameters:

messages: Message[]

Conversation history normalized for the API (user/assistant alternation, tool result pairing).

systemPrompt: SystemPrompt

The full system prompt including the CLI-specific prefix, context, and MCP instructions.

tools: BetaToolUnion[]

Tool definitions converted to API schema format via toolToAPISchema().

model: string

Model identifier, normalized for the provider (Anthropic, Bedrock, or Vertex).

thinkingConfig: ThinkingConfig

Extended thinking configuration with budget tokens (when supported by the model).

Model Support

The API service supports multiple model providers through a unified interface:

Provider                  Configuration                  Authentication
Anthropic (first-party)   Default or ANTHROPIC_API_KEY   API key or OAuth
AWS Bedrock               CLAUDE_CODE_USE_BEDROCK=1      AWS credentials
Google Vertex             CLAUDE_CODE_USE_VERTEX=1       GCP credentials
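Based on the configuration column above, provider selection can be sketched as a simple environment check (the function name and return shape are illustrative, not the service's actual code):

```typescript
// Illustrative provider selection from the documented env vars.
type Provider = 'anthropic' | 'bedrock' | 'vertex'

function resolveProvider(env: Record<string, string | undefined>): Provider {
  if (env.CLAUDE_CODE_USE_BEDROCK === '1') return 'bedrock' // AWS credentials
  if (env.CLAUDE_CODE_USE_VERTEX === '1') return 'vertex'   // GCP credentials
  return 'anthropic' // first-party default: API key or OAuth
}
```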

Retry Logic

The retry system in withRetry.ts wraps API calls with configurable retry behavior using an async generator pattern that yields status messages between attempts.

export async function* withRetry<T>(
  getClient: () => Promise<Anthropic>,
  operation: (client: Anthropic, attempt: number, context: RetryContext) => Promise<T>,
  options: RetryOptions,
): AsyncGenerator<SystemAPIErrorMessage, T>
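Because the generator yields status messages and delivers its result through the generator's return value, callers drive it step by step. A simplified model of the pattern (a toy stand-in, not the real withRetry body):

```typescript
// Toy retry generator: yields a status line between attempts and
// returns the operation's result once it succeeds.
async function* retrying<T>(
  operation: (attempt: number) => Promise<T>,
  maxRetries = 10,
): AsyncGenerator<string, T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation(attempt)
    } catch (err) {
      if (attempt >= maxRetries) throw err
      yield `API error, retrying (attempt ${attempt + 1}/${maxRetries})`
    }
  }
}

// Drive the generator: surface yielded statuses, capture the return value.
async function drive<T>(gen: AsyncGenerator<string, T>): Promise<T> {
  let step = await gen.next()
  while (!step.done) {
    console.log(step.value) // e.g. render as a SystemAPIErrorMessage
    step = await gen.next()
  }
  return step.value
}
```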

Retry Behavior

maxRetries: number

Maximum retry attempts. Defaults to 10, configurable via CLAUDE_CODE_MAX_RETRIES.

BASE_DELAY_MS: number

Base delay of 500ms with exponential backoff and 25% jitter.

MAX_529_RETRIES: number

After 3 consecutive 529 (overloaded) errors, triggers fallback to a secondary model if one is configured.

The following errors are treated as retryable:

  • 408: Request timeout
  • 409: Lock timeout
  • 429: Rate limit (non-subscriber or enterprise users)
  • 401: Authentication (triggers a token refresh)
  • 5xx: Server errors
  • 529: Overloaded (foreground queries only)
  • Connection errors: ECONNRESET, EPIPE (disables keep-alive on retry)
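The delay schedule described above (500ms base, exponential growth, 25% jitter) can be sketched as follows; the 32-second cap is an assumption, not from the source:

```typescript
const BASE_DELAY_MS = 500

// Exponential backoff with up to 25% additive jitter. The maxDelayMs
// ceiling is illustrative; the actual service may use a different cap.
function backoffDelayMs(attempt: number, maxDelayMs = 32_000): number {
  const capped = Math.min(BASE_DELAY_MS * 2 ** attempt, maxDelayMs)
  return capped + capped * 0.25 * Math.random()
}
```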

Persistent Retry Mode

When CLAUDE_CODE_UNATTENDED_RETRY is enabled, the service retries 429/529 errors indefinitely with up to 5-minute backoff, yielding periodic heartbeat messages every 30 seconds to prevent the host from marking the session idle.
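The long waits in this mode can be modeled as a generator that sleeps in heartbeat-sized slices, yielding between slices; the names and structure here are illustrative, not the service's actual code:

```typescript
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms))

// Wait up to five minutes, yielding a heartbeat roughly every 30 seconds
// so the host sees activity and does not mark the session idle.
async function* waitWithHeartbeat(
  delayMs: number,
  heartbeatMs = 30_000,
): AsyncGenerator<string, void> {
  const capped = Math.min(delayMs, 5 * 60_000) // 5-minute maximum backoff
  for (let waited = 0; waited < capped; waited += heartbeatMs) {
    await sleep(Math.min(heartbeatMs, capped - waited))
    yield 'Retrying rate-limited request…'
  }
}
```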

Fast Mode Fallback

When fast mode is active and a 429/529 occurs:

  1. Short retry-after (under 20s): Wait and retry with fast mode still active to preserve prompt cache
  2. Long retry-after: Enter cooldown (10-30 minutes), switching to standard speed model
  3. Overage rejection: Permanently disable fast mode with a specific reason
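The three outcomes can be expressed as a small decision function; the 20-second threshold and 10-30 minute cooldown come from the text, while the names and shapes are assumptions:

```typescript
type FastModeAction =
  | { kind: 'retry-fast'; waitMs: number }            // prompt cache preserved
  | { kind: 'cooldown-standard'; cooldownMs: number } // switch to standard model
  | { kind: 'disable'; reason: string }               // permanent

function onFastModeRateLimit(retryAfterSec: number, overageRejected: boolean): FastModeAction {
  if (overageRejected) {
    return { kind: 'disable', reason: 'overage rejected' }
  }
  if (retryAfterSec < 20) {
    return { kind: 'retry-fast', waitMs: retryAfterSec * 1000 }
  }
  const cooldownMs = (10 + Math.random() * 20) * 60_000 // 10-30 minutes
  return { kind: 'cooldown-standard', cooldownMs }
}
```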

Error Handling

The error handling system in errors.ts classifies API errors into user-facing messages:

Error                Message
Prompt too long      Token counts parsed from the API error to drive reactive compaction
Credit balance low   "Credit balance is too low"
Invalid API key      "Not logged in - Please run /login"
Token revoked        "OAuth token revoked - Please run /login"
PDF too large        Includes page/size limits with recovery instructions
Image too large      Size-specific message with retry guidance
Repeated 529         "Repeated 529 Overloaded errors"
Request timeout      "Request timed out"

Context Overflow Recovery

When the API returns a 400 error indicating max_tokens exceeds the context limit, the retry system automatically adjusts:

const overflowData = parseMaxTokensContextOverflowError(error)
if (overflowData) {
  const availableContext = contextLimit - inputTokens - 1000 // safety buffer
  retryContext.maxTokensOverride = Math.max(3000, availableContext)
}

Usage Tracking

The usage.ts module fetches rate limit utilization for Claude.ai subscribers:

five_hour: RateLimit | null

5-hour rolling window utilization percentage and reset timestamp.

seven_day: RateLimit | null

7-day rolling window utilization.

extra_usage: ExtraUsage | null

Extra usage (overage) status including the monthly limit and used credits.
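Put together, the response might be typed roughly as below; the field names follow the text, but the nested shapes are assumptions:

```typescript
interface RateLimit {
  utilization: number // percentage of the window consumed
  resets_at: string   // when the rolling window resets
}

interface ExtraUsage {
  monthly_limit: number
  used_credits: number
}

interface UsageResponse {
  five_hour: RateLimit | null
  seven_day: RateLimit | null
  extra_usage: ExtraUsage | null
}
```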

Prompt Caching

The API service supports Anthropic's prompt caching through several mechanisms:

  • Cache scope headers: PROMPT_CACHING_SCOPE_BETA_HEADER controls cache scope (global vs. per-session)
  • 1-hour cache eligibility: Checked via allowlist for extended cache TTL
  • Cache break detection: promptCacheBreakDetection.ts monitors for unexpected cache invalidation after compaction
  • Cache-safe parameters: Forked agents receive CacheSafeParams to share the parent's prompt cache
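On the wire, a cache breakpoint is an Anthropic cache_control marker on a content block; the 1-hour ttl variant requires the extended-cache-TTL beta. The prompt text below is a placeholder:

```typescript
const systemPromptText = 'You are Claude Code… (placeholder)'

// A cache_control breakpoint makes everything up to and including this
// block reusable across requests that share the same prefix.
const system = [
  {
    type: 'text' as const,
    text: systemPromptText,
    cache_control: { type: 'ephemeral' as const, ttl: '1h' as const },
  },
]
```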

Attribution Headers

For first-party Anthropic requests, the service includes attribution headers identifying the CLI version and product:

const headers = {
  'anthropic-beta': getMergedBetas(model),
  ...getAttributionHeader(),
  'anthropic-client-request-id': randomUUID(),
}

Beta Headers

The service manages a set of beta feature headers dynamically:

Beta                 Header
Extended thinking    Model-specific betas
Context management   CONTEXT_MANAGEMENT_BETA_HEADER
1M context           CONTEXT_1M_BETA_HEADER
Structured outputs   STRUCTURED_OUTPUTS_BETA_HEADER
Fast mode            FAST_MODE_BETA_HEADER
AFK mode             AFK_MODE_BETA_HEADER
Effort control       EFFORT_BETA_HEADER
Tool search          Dynamic based on model support
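Merging the active betas into the single anthropic-beta header is essentially de-duplication plus joining; a sketch (the values shown are placeholders, not the real beta strings):

```typescript
// Combine active beta flags into one comma-separated header value.
function mergeBetas(features: Iterable<string>): string {
  return [...new Set(features)].join(',')
}

const header = mergeBetas([
  'context-management-placeholder',
  'structured-outputs-placeholder',
  'context-management-placeholder', // duplicates are dropped
])
```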