
Rate limiting

LLM providers enforce rate limits. Exceed them and the provider returns 429 — sustained overage causes account degradation. Factflow’s factflow-llm package ships two cooperating mechanisms: an AIMD concurrency limiter (gates how many requests are in flight) and a retry-with-jitter wrapper (handles transient errors). The two layers compose.

flowchart LR
Adapter[Adapter] --> RLC[RateLimitedClient<br/>retry + jitter]
RLC --> ARL[AdaptiveRateLimiter<br/>AIMD semaphore]
ARL --> RealCall[Real LLM call]
RealCall -.429.-> ARL
RealCall -.5xx/timeout.-> RLC
style ARL fill:#e6f2ff,stroke:#2563eb,color:#111
style RLC fill:#f0fdf4,stroke:#16a34a,color:#111
Two layers. RateLimitedClient wraps the base client for retry. AdaptiveRateLimiter gates concurrency inside each attempt.

Order matters. The adapter calls RateLimitedClient.complete(...). RateLimitedClient makes up to 7 attempts. Each attempt goes through the AdaptiveRateLimiter.slot() context manager before hitting the real API.

Layer 1 — AdaptiveRateLimiter (AIMD concurrency)


Lives at factflow_llm.rate_limiter.AdaptiveRateLimiter. Implements TCP-style congestion control applied to API concurrency. Not RPS, not token-counting — just how many requests are allowed to be in flight simultaneously.

on success: limit = min(max_concurrency, limit + 1) # additive increase
on 429: limit = max(floor, limit // 2) # multiplicative decrease

Two tunables:

  • max_concurrency — hard ceiling. Default 50.
  • floor — minimum slots even after repeated decreases. Default 5.

Starts at max_concurrency, converges downward through 429 feedback, climbs back when the provider lets it. No per-provider tuning required — the ceiling is discovered empirically.
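The update rule above can be sketched in a few lines. This is an illustrative model of the AIMD arithmetic only (class name `AIMDSketch` is hypothetical, not the real `AdaptiveRateLimiter`):

```python
class AIMDSketch:
    """Illustrative AIMD update rule: additive increase on success,
    multiplicative decrease on 429, clamped to [floor, max_concurrency]."""

    def __init__(self, max_concurrency: int = 50, floor: int = 5):
        self.max_concurrency = max_concurrency
        self.floor = floor
        self.limit = max_concurrency  # start optimistic, converge via 429s

    def on_success(self) -> None:
        self.limit = min(self.max_concurrency, self.limit + 1)

    def on_rate_limit(self) -> None:
        self.limit = max(self.floor, self.limit // 2)

limiter = AIMDSketch()
limiter.on_rate_limit()  # 50 -> 25
limiter.on_rate_limit()  # 25 -> 12
limiter.on_success()     # 12 -> 13
```

Two consecutive 429s halve the limit twice (50 → 25 → 12); each subsequent success claws back one slot.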

stateDiagram-v2
[*] --> MaxConcurrency: init
MaxConcurrency --> Saturated: each success limit+1
Saturated --> Backoff: 429 received
Backoff --> Recovering: limit = max(floor, limit//2)
Recovering --> Saturated: each success limit+1
Recovering --> Backoff: another 429
Backoff --> Floor: cannot decrease further
Floor --> Recovering: success
AIMD state machine. Limit climbs on success, halves on 429, bottoms out at the configured floor.
from factflow_llm.rate_limiter import AdaptiveRateLimiter

limiter = AdaptiveRateLimiter(max_concurrency=50, floor=5)

async with limiter.slot() as signal:
    try:
        response = await llm_client.complete(...)
    except RateLimitError:
        signal.was_rate_limited = True
        raise

Critical: the caller MUST set signal.was_rate_limited = True on 429 before the slot() context exits. Without that, the limiter never triggers a decrease, and throughput never adapts. The rule is enforced by code review — it’s not automated.
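To make the contract concrete, here is a hypothetical sketch of how a `slot()`-style context manager can read the signal on exit (class and field names are illustrative, not the real `AdaptiveRateLimiter` internals; the real limiter also resizes its concurrency gate as the limit changes, which is elided here):

```python
import asyncio
import contextlib
from dataclasses import dataclass

@dataclass
class Signal:
    was_rate_limited: bool = False

class SlotLimiterSketch:
    """Illustrative slot() contract: the AIMD update happens when the
    context exits, driven entirely by signal.was_rate_limited."""

    def __init__(self, max_concurrency: int = 50, floor: int = 5):
        self.max_concurrency = max_concurrency
        self.floor = floor
        self.limit = max_concurrency
        # Real implementation would shrink/grow this gate with `limit`;
        # a fixed semaphore is enough to show the signal handshake.
        self._sem = asyncio.Semaphore(max_concurrency)

    @contextlib.asynccontextmanager
    async def slot(self):
        async with self._sem:
            signal = Signal()
            try:
                yield signal
            finally:
                # Runs even when the caller re-raises after marking 429.
                if signal.was_rate_limited:
                    self.limit = max(self.floor, self.limit // 2)
                else:
                    self.limit = min(self.max_concurrency, self.limit + 1)
```

Because the update lives in `finally`, re-raising the `RateLimitError` after setting the flag (as the snippet above does) still triggers the decrease.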

AdaptiveRateLimiter.metrics returns an AIMDMetrics dataclass:

| Field | Meaning |
| --- | --- |
| `current_limit` | Current concurrency ceiling (adapts over time) |
| `total_acquires` | Total slots ever acquired |
| `total_rate_limits` | Total 429 signals received |
| `total_decreases` | Times the limit was actually decreased (may be less than `total_rate_limits` if already at floor) |
| `peak_active` | Highest simultaneous in-flight count observed |
| `limit_history` | Deque of last 100 limit values after decreases (for charting) |
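Structurally, the dataclass might look roughly like the following (field names are from the table above; defaults and exact types are assumptions, not the real `AIMDMetrics` definition):

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class AIMDMetricsSketch:
    """Assumed shape of the metrics snapshot; illustrative only."""
    current_limit: int = 50
    total_acquires: int = 0
    total_rate_limits: int = 0
    total_decreases: int = 0
    peak_active: int = 0
    limit_history: deque = field(default_factory=lambda: deque(maxlen=100))

# What one 429 would record:
m = AIMDMetricsSketch()
m.total_rate_limits += 1
m.current_limit = max(5, m.current_limit // 2)
m.total_decreases += 1
m.limit_history.append(m.current_limit)
```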

Layer 2 — RateLimitedClient (retry + jitter)


Lives at factflow_llm.rate_limited_client.RateLimitedClient. Wraps any BaseLLMClient or embedding client with retry logic.

  • Max retries: 7
  • Base delay: 0.5 seconds
  • Cap delay: 60 seconds
  • Jitter: full (randomised between 0 and computed delay)
  • Schedule: exponential — base * 2^attempt capped at 60s, then jittered
sequenceDiagram
participant Adapter
participant RLC as RateLimitedClient
participant ARL as AdaptiveRateLimiter
participant API as LLM provider
Adapter->>RLC: complete(...)
loop up to 7 attempts
  RLC->>ARL: slot()
  ARL-->>RLC: slot acquired
  RLC->>API: HTTP POST
  alt success
    API-->>RLC: 200
    RLC-->>Adapter: response
  else 429
    API-->>RLC: 429
    Note over RLC,ARL: signal.was_rate_limited = True
    RLC->>RLC: sleep(exp_backoff with jitter)
  else 5xx or timeout
    API-->>RLC: transient error
    RLC->>RLC: sleep(exp_backoff with jitter)
  else fatal (401, 403, 404)
    API-->>RLC: fatal error
    RLC-->>Adapter: raise (no retry)
  end
end
Per-attempt flow. Retries on 429 / 5xx / timeout. Fatal errors (auth/permission/not-found) short-circuit immediately.

Without jitter, bursts of simultaneous 429s produce synchronised retries → another burst of 429s → thundering herd. Full jitter breaks the phase alignment. Standard cloud-retry practice.
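The schedule from the bullet list above reduces to one function. The constants come from the stated policy (0.5 s base, 60 s cap, full jitter); the function name is illustrative:

```python
import random

BASE_DELAY = 0.5   # seconds, per the retry policy above
CAP_DELAY = 60.0   # hard cap on the backoff ceiling

def full_jitter_delay(attempt: int) -> float:
    """Full jitter: sample uniformly in [0, min(cap, base * 2^attempt)],
    so simultaneous retriers pick different delays and desynchronise."""
    ceiling = min(CAP_DELAY, BASE_DELAY * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```

Note the cap applies to the ceiling before jittering: by attempt 7 the exponential term (64 s) already exceeds the 60 s cap, so late attempts all draw from [0, 60].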

factflow_llm.error_classification.classify_llm_error(exc) maps raw exceptions to one of three bands:

| Band | Examples | Retry behaviour |
| --- | --- | --- |
| Fatal | 401 auth, 403 permission, 404 not found | Never retry; propagate to caller |
| Rate-limited | 429 | Retry (via RateLimitedClient); trigger AIMD decrease |
| Retryable | 500+, connection error, timeout | Retry with backoff |

Unknown exception types default to retryable (conservative — avoids masking bugs with silent non-retries).
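As a status-code-only approximation of the three bands (the real `classify_llm_error` inspects exception types, not just codes, so this function is a simplification):

```python
FATAL_STATUS = {401, 403, 404}  # auth, permission, not-found

def classify_by_status(status_code) -> str:
    """Illustrative three-band classification by HTTP status code."""
    if status_code in FATAL_STATUS:
        return "fatal"           # never retry; propagate
    if status_code == 429:
        return "rate_limited"    # retry + AIMD decrease
    return "retryable"           # 5xx, timeouts, and unknowns default here
```

The fall-through `return` is what makes unknowns retryable by default.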

get_error_metadata(exc) returns a dict for AdapterResult.metadata:

{
    "fatal": bool,
    "retryable": bool,
    "status_code": int | None,
    "error_type": str,
}

Adapters typically ship this dict back to the processor on failure, which stamps it into lineage for debugging.

Putting it together — one adapter’s call path

sequenceDiagram
participant Proc as Processor
participant Ad as Adapter
participant Fac as LLMClientFactory
participant RLC as RateLimitedClient
participant ARL as AdaptiveRateLimiter
participant API as Real LLM API
Proc->>Ad: await process(ctx)
Ad->>Fac: create_completion_client("default")
Fac-->>Ad: RateLimitedClient wrapping OpenAIClient
Ad->>RLC: complete(messages=[...])
loop up to 7 attempts
  RLC->>ARL: async with slot() as signal
  ARL-->>RLC: slot acquired (throttle if saturated)
  RLC->>API: POST /v1/chat/completions
  alt 200 OK
    API-->>RLC: response
    RLC-->>ARL: exit context
    Note over ARL: limit += 1
    RLC-->>Ad: CompletionResponse
  else 429
    API-->>RLC: RateLimitError
    RLC->>RLC: signal.was_rate_limited = True
    Note over ARL: limit = max(floor, limit // 2)
    RLC->>RLC: sleep exp+jitter
  end
end
Ad-->>Proc: AdapterResult
The full call path. Factory returns a ready-wrapped client; adapter stays ignorant of retry and rate-limit plumbing.
  • No retry inside clients. Raw API errors propagate from OpenAIClient / AnthropicClient / etc. Retry is external (RateLimitedClient).
  • No circuit breaker inside clients. Circuit breakers live on adapters, not clients.
  • Caller must mark the 429 signal. If you implement a custom client wrapper, signal.was_rate_limited = True is mandatory on rate-limit errors.
  • Unknown errors default retryable. Conservative by choice.
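The rules above compose into a loop like the following. This is a hypothetical sketch of the two layers working together, not the real `RateLimitedClient` (the stub `_slot()` stands in for `AdaptiveRateLimiter.slot()`, and the exception classes are placeholders):

```python
import asyncio
import contextlib
import random

MAX_ATTEMPTS = 7
BASE_DELAY, CAP_DELAY = 0.5, 60.0

class RateLimitError(Exception): pass   # placeholder for the 429 band
class FatalError(Exception): pass       # placeholder for 401/403/404

class _Signal:
    was_rate_limited = False

@contextlib.asynccontextmanager
async def _slot():
    # Stand-in for AdaptiveRateLimiter.slot(): the real limiter gates
    # concurrency and reads was_rate_limited on context exit.
    yield _Signal()

async def complete_with_retry(call):
    """Illustrative retry loop: slot per attempt, classify, backoff."""
    for attempt in range(MAX_ATTEMPTS):
        async with _slot() as signal:
            try:
                return await call()
            except RateLimitError:
                signal.was_rate_limited = True   # mandatory 429 marking
            except FatalError:
                raise                            # fatal: no retry
            except Exception:
                pass                             # 5xx/timeout: retry
        # Full-jitter exponential backoff between attempts.
        await asyncio.sleep(random.uniform(0.0, min(CAP_DELAY, BASE_DELAY * 2 ** attempt)))
    raise RuntimeError("retries exhausted")
```

Note the sleep happens outside the slot, so a backing-off attempt does not hold a concurrency slot while it waits.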

RateLimitConfig controls the AIMD limiter per adapter:

adapters:
  - type: "llm_translator"
    config:
      provider: "default"
      model: "claude-sonnet-4-6"
      rate_limit:
        max_concurrency: 50
        floor: 5

The retry wrapper’s 7-retry policy is currently non-configurable — a deliberate choice, tuned for LLM error patterns.

HTTP fetching is not LLM calling. The webscraper package ships its own, separate rate-limiting mechanism: token-bucket RPS limiting with five pre-set strategies (conservative / moderate / aggressive / adaptive / none). Appropriate when concurrency is cheap but per-endpoint RPS matters. See Workflow: Webscraper.
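For contrast with the AIMD approach, a generic token bucket looks like this. It is the textbook algorithm only, not the webscraper package's actual implementation or its strategy presets:

```python
import time

class TokenBucket:
    """Generic token-bucket RPS limiter: tokens refill continuously at
    `rate` per second up to `capacity`; each request spends one token."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The contrast with AIMD: a token bucket enforces a fixed request *rate* per endpoint regardless of feedback, while AIMD discovers a *concurrency* ceiling from 429 responses.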