API Service
The API service (src/services/api/) is the central interface between Claude Code and the Anthropic Claude API. It handles streaming message queries, retry logic, token estimation, usage tracking, error classification, and prompt caching.
Core Architecture
The API service is built around queryModelWithStreaming() in claude.ts, which assembles the full API request (system prompt, messages, tools, betas, thinking config) and streams the response back as an async generator of StreamEvent objects.
```typescript
// Simplified call flow
const events = queryModelWithStreaming({
  messages,
  systemPrompt,
  tools,
  model,
  thinkingConfig,
  signal,
})
for await (const event of events) {
  // Handle content_block_delta, message_stop, etc.
}
```
Streaming Queries
Each streaming query assembles the following request components:
- Conversation history, normalized for the API (user/assistant alternation, tool result pairing).
- The full system prompt, including the CLI-specific prefix, context, and MCP instructions.
- Tool definitions, converted to API schema format via toolToAPISchema().
- The model identifier, normalized for the provider (Anthropic, Bedrock, or Vertex).
- Extended thinking configuration with budget tokens (when supported by the model).
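As a rough illustration of the tool conversion step, here is a minimal sketch of what toolToAPISchema() might do. The ToolDefinition shape and the camelCase-to-snake_case mapping are assumptions, not the actual implementation:

```typescript
// Hypothetical internal tool shape; the real ToolDefinition differs.
interface ToolDefinition {
  name: string
  description: string
  inputSchema: Record<string, unknown> // JSON Schema for the tool's input
}

// Shape the Anthropic API expects for a tool entry.
interface APIToolSchema {
  name: string
  description: string
  input_schema: Record<string, unknown>
}

function toolToAPISchema(tool: ToolDefinition): APIToolSchema {
  return {
    name: tool.name,
    description: tool.description,
    input_schema: tool.inputSchema, // API uses snake_case for this key
  }
}
```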
Model Support
The API service supports multiple model providers through a unified interface:
| Provider | Configuration | Authentication |
|---|---|---|
| Anthropic (first-party) | Default or ANTHROPIC_API_KEY | API key or OAuth |
| AWS Bedrock | CLAUDE_CODE_USE_BEDROCK=1 | AWS credentials |
| Google Vertex | CLAUDE_CODE_USE_VERTEX=1 | GCP credentials |
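The environment switches in the table suggest a provider resolution step along these lines. This is a sketch; the function name and exact selection logic are assumptions:

```typescript
type Provider = 'anthropic' | 'bedrock' | 'vertex'

// Resolve the provider from the environment flags listed above.
function resolveProvider(env: Record<string, string | undefined>): Provider {
  if (env.CLAUDE_CODE_USE_BEDROCK === '1') return 'bedrock'
  if (env.CLAUDE_CODE_USE_VERTEX === '1') return 'vertex'
  return 'anthropic' // first-party default
}
```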
Retry Logic
The retry system in withRetry.ts wraps API calls with configurable retry behavior using an async generator pattern that yields status messages between attempts.
```typescript
export async function* withRetry<T>(
  getClient: () => Promise<Anthropic>,
  operation: (client: Anthropic, attempt: number, context: RetryContext) => Promise<T>,
  options: RetryOptions,
): AsyncGenerator<SystemAPIErrorMessage, T>
```
Retry Behavior
- Maximum retry attempts: defaults to 10, configurable via CLAUDE_CODE_MAX_RETRIES.
- Backoff: base delay of 500ms with exponential backoff and 25% jitter.
- Model fallback: after 3 consecutive 529 (overloaded) errors, falls back to a secondary model if configured.
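The backoff schedule above (500ms base, exponential growth, 25% jitter) can be sketched as follows; the jitter distribution and function name are assumptions:

```typescript
const BASE_DELAY_MS = 500

// Exponential backoff with up to 25% additive jitter.
function backoffDelay(attempt: number, random: () => number = Math.random): number {
  const exponential = BASE_DELAY_MS * 2 ** (attempt - 1) // 500, 1000, 2000, ...
  const jitter = exponential * 0.25 * random()
  return exponential + jitter
}
```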
Retryable errors:
- 408: Request timeout
- 409: Lock timeout
- 429: Rate limit (non-subscriber or enterprise users)
- 401: Authentication (triggers token refresh)
- 5xx: Server errors
- 529: Overloaded (foreground queries only)
- Connection errors: ECONNRESET, EPIPE (disables keep-alive on retry)
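A sketch of how the status codes above might be classified as retryable. The predicate name and structure are assumptions, and the extra conditions noted above (429 audience, 529 foreground-only, keep-alive handling) are omitted for brevity:

```typescript
// HTTP statuses retried per the list above; 5xx is handled as a range.
const RETRYABLE_STATUS = new Set([408, 409, 429, 401, 529])

function isRetryableStatus(status: number): boolean {
  return RETRYABLE_STATUS.has(status) || (status >= 500 && status < 600)
}
```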
Persistent Retry Mode
When CLAUDE_CODE_UNATTENDED_RETRY is enabled, the service retries 429/529 errors indefinitely with up to 5-minute backoff, yielding periodic heartbeat messages every 30 seconds to prevent the host from marking the session idle.
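Assuming the same exponential growth from the 500ms base, the unattended backoff cap could look like this sketch; the function name is hypothetical and only the 5-minute cap comes from the text above:

```typescript
const MAX_UNATTENDED_DELAY_MS = 5 * 60 * 1000 // 5-minute cap

// Exponential backoff, but never longer than the 5-minute cap.
function unattendedDelay(attempt: number): number {
  return Math.min(500 * 2 ** (attempt - 1), MAX_UNATTENDED_DELAY_MS)
}
```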
Fast Mode Fallback
When fast mode is active and a 429/529 occurs:
- Short retry-after (under 20s): Wait and retry with fast mode still active to preserve prompt cache
- Long retry-after: Enter cooldown (10-30 minutes), switching to standard speed model
- Overage rejection: Permanently disable fast mode with a specific reason
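The three branches above can be sketched as a decision function. The 20s threshold is from the text; the types, names, and fixed 10-minute cooldown are simplifications (the real cooldown ranges from 10 to 30 minutes):

```typescript
type FastModeAction =
  | { kind: 'retry-fast' }                // keep fast mode, preserve cache
  | { kind: 'cooldown'; minutes: number } // drop to standard speed model
  | { kind: 'disable'; reason: string }   // permanently off

// Decision sketch for a 429/529 while fast mode is active.
function decideFastModeFallback(
  retryAfterSeconds: number,
  overageRejected: boolean,
): FastModeAction {
  if (overageRejected) return { kind: 'disable', reason: 'overage rejected' }
  if (retryAfterSeconds < 20) return { kind: 'retry-fast' }
  return { kind: 'cooldown', minutes: 10 }
}
```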
Error Handling
The error handling system in errors.ts classifies API errors into user-facing messages:
Error Classifications
| Error | Message |
|---|---|
| Prompt too long | Token counts parsed from the API error to drive reactive compaction |
| Credit balance low | "Credit balance is too low" |
| Invalid API key | "Not logged in - Please run /login" |
| Token revoked | "OAuth token revoked - Please run /login" |
| PDF too large | Includes page/size limits with recovery instructions |
| Image too large | Size-specific message with retry guidance |
| Repeated 529 | "Repeated 529 Overloaded errors" |
| Request timeout | "Request timed out" |
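A classification function mapping raw API errors to the user-facing messages tabled above might look like this sketch; the matching predicates are assumptions:

```typescript
// Map an API error to a user-facing message; falls back to the raw message.
function classifyAPIError(status: number, message: string): string {
  if (message.includes('credit balance')) return 'Credit balance is too low'
  if (status === 401 && message.includes('revoked'))
    return 'OAuth token revoked - Please run /login'
  if (status === 401) return 'Not logged in - Please run /login'
  if (message.includes('timed out')) return 'Request timed out'
  return message
}
```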
Context Overflow Recovery
When the API returns a 400 error indicating max_tokens exceeds the context limit, the retry system automatically adjusts:
```typescript
const overflowData = parseMaxTokensContextOverflowError(error)
if (overflowData) {
  const availableContext = contextLimit - inputTokens - 1000 // safety buffer
  retryContext.maxTokensOverride = Math.max(3000, availableContext)
}
```
Usage Tracking
The usage.ts module fetches rate limit utilization for Claude.ai subscribers:
- 5-hour rolling window utilization percentage and reset timestamp.
- 7-day rolling window utilization.
- Extra usage (overage) status, including the monthly limit and used credits.
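A hypothetical shape for the usage snapshot described above, with an example consumer. Field names are assumptions, not the actual API response:

```typescript
interface UsageSnapshot {
  fiveHour: { utilization: number; resetsAt: string } // utilization as 0..1
  sevenDay: { utilization: number }
  overage: { enabled: boolean; monthlyLimit: number; usedCredits: number }
}

// Example consumer: warn when either rolling window is near its limit.
function isNearLimit(usage: UsageSnapshot, threshold = 0.9): boolean {
  return (
    usage.fiveHour.utilization >= threshold ||
    usage.sevenDay.utilization >= threshold
  )
}
```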
Prompt Caching
The API service supports Anthropic's prompt caching through several mechanisms:
- Cache scope headers: PROMPT_CACHING_SCOPE_BETA_HEADER controls cache scope (global vs. per-session)
- 1-hour cache eligibility: checked via allowlist for extended cache TTL
- Cache break detection: promptCacheBreakDetection.ts monitors for unexpected cache invalidation after compaction
- Cache-safe parameters: forked agents receive CacheSafeParams to share the parent's prompt cache
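For context, the Anthropic API marks cache breakpoints with a cache_control field on a content block. Whether Claude Code places its breakpoints exactly like this is an assumption, and the block contents are placeholders:

```typescript
type SystemBlock = {
  type: 'text'
  text: string
  cache_control?: { type: 'ephemeral' }
}

const system: SystemBlock[] = [
  { type: 'text', text: 'You are Claude Code...' },
  {
    type: 'text',
    text: '<project context>',
    cache_control: { type: 'ephemeral' }, // everything up to here is cacheable
  },
]
```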
Attribution Headers
For first-party Anthropic requests, the service includes attribution headers identifying the CLI version and product:
```typescript
const headers = {
  'anthropic-beta': getMergedBetas(model),
  ...getAttributionHeader(),
  'anthropic-client-request-id': randomUUID(),
}
```
Beta Headers
The service manages a set of beta feature headers dynamically:
| Beta | Header |
|---|---|
| Extended thinking | Model-specific betas |
| Context management | CONTEXT_MANAGEMENT_BETA_HEADER |
| 1M context | CONTEXT_1M_BETA_HEADER |
| Structured outputs | STRUCTURED_OUTPUTS_BETA_HEADER |
| Fast mode | FAST_MODE_BETA_HEADER |
| AFK mode | AFK_MODE_BETA_HEADER |
| Effort control | EFFORT_BETA_HEADER |
| Tool search | Dynamic based on model support |
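A simplified sketch of merging beta IDs into a single header value. The real getMergedBetas(model) derives the applicable betas from the model first; this helper, its name, and the placeholder beta IDs are assumptions:

```typescript
// Deduplicate and join beta IDs; the API accepts a comma-separated list.
function mergeBetas(betas: string[]): string {
  return [...new Set(betas)].join(',')
}
```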