
Rate limiting

LLM providers enforce rate limits. Exceed them and the provider returns 429 — sustained overage causes account degradation. Factflow’s factflow-llm package ships two cooperating mechanisms: an AIMD concurrency limiter (gates how many requests are in flight) and a retry-with-jitter wrapper (handles transient errors). The two layers compose.

flowchart LR
Adapter[Adapter] --> RLC[RateLimitedClient<br/>retry + jitter]
RLC --> ARL[AdaptiveRateLimiter<br/>AIMD semaphore]
ARL --> RealCall[Real LLM call]
RealCall -.429.-> ARL
RealCall -.5xx/timeout.-> RLC
style ARL fill:#e6f2ff,stroke:#2563eb,color:#111
style RLC fill:#f0fdf4,stroke:#16a34a,color:#111
Two layers. RateLimitedClient wraps the base client for retry. AdaptiveRateLimiter gates concurrency inside each attempt.

Order matters. The adapter calls RateLimitedClient.complete(...). RateLimitedClient makes up to 7 attempts. Each attempt goes through the AdaptiveRateLimiter.slot() context manager before hitting the real API.

Layer 1 — AdaptiveRateLimiter (AIMD concurrency)


Lives at factflow_llm.rate_limiter.AdaptiveRateLimiter. Implements TCP-style congestion control applied to API concurrency. Not RPS, not token-counting — just how many requests are allowed to be in flight simultaneously.

on success: limit = min(max_concurrency, limit + 1) # additive increase
on 429: limit = max(floor, limit // 2) # multiplicative decrease

Two tunables:

  • max_concurrency — hard ceiling. Default 50.
  • floor — minimum slots even after repeated decreases. Default 5.

Starts at max_concurrency, converges downward through 429 feedback, climbs back when the provider lets it. No per-provider tuning required — the ceiling is discovered empirically.
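The update rule above can be sketched in a few lines. This is an illustrative model of the AIMD arithmetic only (class name `AIMDSketch` is hypothetical, not the real `AdaptiveRateLimiter`):

```python
class AIMDSketch:
    """Illustrative AIMD update rule: additive increase on success,
    multiplicative decrease on 429, clamped to [floor, max_concurrency]."""

    def __init__(self, max_concurrency: int = 50, floor: int = 5):
        self.max_concurrency = max_concurrency
        self.floor = floor
        self.limit = max_concurrency  # start optimistic, converge via 429s

    def on_success(self) -> None:
        self.limit = min(self.max_concurrency, self.limit + 1)

    def on_rate_limit(self) -> None:
        self.limit = max(self.floor, self.limit // 2)

limiter = AIMDSketch()
limiter.on_rate_limit()  # 50 -> 25
limiter.on_rate_limit()  # 25 -> 12
limiter.on_success()     # 12 -> 13
```

Two consecutive 429s halve the limit twice (50 → 25 → 12); each subsequent success claws back one slot.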

stateDiagram-v2
[*] --> MaxConcurrency: init
MaxConcurrency --> Saturated: each success limit+1
Saturated --> Backoff: 429 received
Backoff --> Recovering: limit = max(floor, limit//2)
Recovering --> Saturated: each success limit+1
Recovering --> Backoff: another 429
Backoff --> Floor: cannot decrease further
Floor --> Recovering: success
AIMD state machine. Limit climbs on success, halves on 429, bottoms out at the configured floor.
from factflow_llm.rate_limiter import AdaptiveRateLimiter

limiter = AdaptiveRateLimiter(max_concurrency=50, floor=5)

async with limiter.slot() as signal:
    try:
        response = await llm_client.complete(...)
    except RateLimitError:
        signal.was_rate_limited = True
        raise

Critical: the caller MUST set signal.was_rate_limited = True on 429 before the slot() context exits. Without that, the limiter never triggers a decrease, and throughput never adapts. The rule is enforced by code review — it’s not automated.
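To make the contract concrete, here is a hypothetical sketch of how a `slot()`-style context manager can read the signal on exit (class and field names are illustrative, not the real `AdaptiveRateLimiter` internals; the real limiter also resizes its concurrency gate as the limit changes, which is elided here):

```python
import asyncio
import contextlib
from dataclasses import dataclass

@dataclass
class Signal:
    was_rate_limited: bool = False

class SlotLimiterSketch:
    """Illustrative slot() contract: the AIMD update happens when the
    context exits, driven entirely by signal.was_rate_limited."""

    def __init__(self, max_concurrency: int = 50, floor: int = 5):
        self.max_concurrency = max_concurrency
        self.floor = floor
        self.limit = max_concurrency
        # Real implementation would shrink/grow this gate with `limit`;
        # a fixed semaphore is enough to show the signal handshake.
        self._sem = asyncio.Semaphore(max_concurrency)

    @contextlib.asynccontextmanager
    async def slot(self):
        async with self._sem:
            signal = Signal()
            try:
                yield signal
            finally:
                # Runs even when the caller re-raises after marking 429.
                if signal.was_rate_limited:
                    self.limit = max(self.floor, self.limit // 2)
                else:
                    self.limit = min(self.max_concurrency, self.limit + 1)
```

Because the update lives in `finally`, re-raising the `RateLimitError` after setting the flag (as the snippet above does) still triggers the decrease.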

AdaptiveRateLimiter.metrics returns an AIMDMetrics dataclass:

| Field | Meaning |
| --- | --- |
| `current_limit` | Current concurrency ceiling (adapts over time) |
| `total_acquires` | Total slots ever acquired |
| `total_rate_limits` | Total 429 signals received |
| `total_decreases` | Times the limit was actually decreased (may be less than `total_rate_limits` if already at floor) |
| `peak_active` | Highest simultaneous in-flight count observed |
| `limit_history` | Deque of last 100 limit values after decreases (for charting) |
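Structurally, the dataclass might look roughly like the following (field names are from the table above; defaults and exact types are assumptions, not the real `AIMDMetrics` definition):

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class AIMDMetricsSketch:
    """Assumed shape of the metrics snapshot; illustrative only."""
    current_limit: int = 50
    total_acquires: int = 0
    total_rate_limits: int = 0
    total_decreases: int = 0
    peak_active: int = 0
    limit_history: deque = field(default_factory=lambda: deque(maxlen=100))

# What one 429 would record:
m = AIMDMetricsSketch()
m.total_rate_limits += 1
m.current_limit = max(5, m.current_limit // 2)
m.total_decreases += 1
m.limit_history.append(m.current_limit)
```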

Layer 2 — RateLimitedClient (retry + jitter)


Lives at factflow_llm.rate_limited_client.RateLimitedClient. Wraps any BaseLLMClient or embedding client with retry logic.

  • Max retries: 7
  • Base delay: 0.5 seconds
  • Cap delay: 60 seconds
  • Jitter: full (randomised between 0 and computed delay)
  • Schedule: exponential — base * 2^attempt capped at 60s, then jittered
sequenceDiagram
participant Adapter
participant RLC as RateLimitedClient
participant ARL as AdaptiveRateLimiter
participant API as LLM provider
Adapter->>RLC: complete(...)
loop up to 7 attempts
  RLC->>ARL: slot()
  ARL-->>RLC: slot acquired
  RLC->>API: HTTP POST
  alt success
    API-->>RLC: 200
    RLC-->>Adapter: response
  else 429
    API-->>RLC: 429
    Note over RLC,ARL: signal.was_rate_limited = True
    RLC->>RLC: sleep(exp_backoff with jitter)
  else 5xx or timeout
    API-->>RLC: transient error
    RLC->>RLC: sleep(exp_backoff with jitter)
  else fatal (401, 403, 404)
    API-->>RLC: fatal error
    RLC-->>Adapter: raise (no retry)
  end
end
Per-attempt flow. Retries on 429 / 5xx / timeout. Fatal errors (auth/permission/not-found) short-circuit immediately.

Without jitter, bursts of simultaneous 429s produce synchronised retries → another burst of 429s → thundering herd. Full jitter breaks the phase alignment. Standard cloud-retry practice.
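The schedule from the bullet list above reduces to one function. The constants come from the stated policy (0.5 s base, 60 s cap, full jitter); the function name is illustrative:

```python
import random

BASE_DELAY = 0.5   # seconds, per the retry policy above
CAP_DELAY = 60.0   # hard cap on the backoff ceiling

def full_jitter_delay(attempt: int) -> float:
    """Full jitter: sample uniformly in [0, min(cap, base * 2^attempt)],
    so simultaneous retriers pick different delays and desynchronise."""
    ceiling = min(CAP_DELAY, BASE_DELAY * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```

Note the cap applies to the ceiling before jittering: by attempt 7 the exponential term (64 s) already exceeds the 60 s cap, so late attempts all draw from [0, 60].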

factflow_llm.error_classification.classify_llm_error(exc) maps raw exceptions to one of three bands:

| Band | Examples | Retry behaviour |
| --- | --- | --- |
| Fatal | 401 auth, 403 permission, 404 not found | Never retry; propagate to caller |
| Rate-limited | 429 | Retry (via RateLimitedClient); trigger AIMD decrease |
| Retryable | 500+, connection error, timeout | Retry with backoff |

Unknown exception types default to retryable (conservative — avoids masking bugs with silent non-retries).
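As a status-code-only approximation of the three bands (the real `classify_llm_error` inspects exception types, not just codes, so this function is a simplification):

```python
FATAL_STATUS = {401, 403, 404}  # auth, permission, not-found

def classify_by_status(status_code) -> str:
    """Illustrative three-band classification by HTTP status code."""
    if status_code in FATAL_STATUS:
        return "fatal"           # never retry; propagate
    if status_code == 429:
        return "rate_limited"    # retry + AIMD decrease
    return "retryable"           # 5xx, timeouts, and unknowns default here
```

The fall-through `return` is what makes unknowns retryable by default.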

get_error_metadata(exc) returns a dict for AdapterResult.metadata:

{
    "fatal": bool,
    "retryable": bool,
    "status_code": int | None,
    "error_type": str,
}

Adapters typically ship this dict back to the processor on failure, which stamps it into lineage for debugging.

Putting it together — one adapter’s call path

sequenceDiagram
participant Proc as Processor
participant Ad as Adapter
participant Fac as LLMClientFactory
participant RLC as RateLimitedClient
participant ARL as AdaptiveRateLimiter
participant API as Real LLM API
Proc->>Ad: await process(ctx)
Ad->>Fac: create_completion_client("default")
Fac-->>Ad: RateLimitedClient wrapping OpenAIClient
Ad->>RLC: complete(messages=[...])
loop up to 7 attempts
  RLC->>ARL: async with slot() as signal
  ARL-->>RLC: slot acquired (throttle if saturated)
  RLC->>API: POST /v1/chat/completions
  alt 200 OK
    API-->>RLC: response
    RLC-->>ARL: exit context
    Note over ARL: limit += 1
    RLC-->>Ad: CompletionResponse
  else 429
    API-->>RLC: RateLimitError
    RLC->>RLC: signal.was_rate_limited = True
    Note over ARL: limit = max(floor, limit // 2)
    RLC->>RLC: sleep exp+jitter
  end
end
Ad-->>Proc: AdapterResult
The full call path. Factory returns a ready-wrapped client; adapter stays ignorant of retry and rate-limit plumbing.
  • No retry inside clients. Raw API errors propagate from OpenAIClient / AnthropicClient / etc. Retry is external (RateLimitedClient).
  • No circuit breaker inside clients. Circuit breakers live on adapters, not clients.
  • Caller must mark the 429 signal. If you implement a custom client wrapper, signal.was_rate_limited = True is mandatory on rate-limit errors.
  • Unknown errors default retryable. Conservative by choice.
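The rules above compose into a loop like the following. This is a hypothetical sketch of the two layers working together, not the real `RateLimitedClient` (the stub `_slot()` stands in for `AdaptiveRateLimiter.slot()`, and the exception classes are placeholders):

```python
import asyncio
import contextlib
import random

MAX_ATTEMPTS = 7
BASE_DELAY, CAP_DELAY = 0.5, 60.0

class RateLimitError(Exception): pass   # placeholder for the 429 band
class FatalError(Exception): pass       # placeholder for 401/403/404

class _Signal:
    was_rate_limited = False

@contextlib.asynccontextmanager
async def _slot():
    # Stand-in for AdaptiveRateLimiter.slot(): the real limiter gates
    # concurrency and reads was_rate_limited on context exit.
    yield _Signal()

async def complete_with_retry(call):
    """Illustrative retry loop: slot per attempt, classify, backoff."""
    for attempt in range(MAX_ATTEMPTS):
        async with _slot() as signal:
            try:
                return await call()
            except RateLimitError:
                signal.was_rate_limited = True   # mandatory 429 marking
            except FatalError:
                raise                            # fatal: no retry
            except Exception:
                pass                             # 5xx/timeout: retry
        # Full-jitter exponential backoff between attempts.
        await asyncio.sleep(random.uniform(0.0, min(CAP_DELAY, BASE_DELAY * 2 ** attempt)))
    raise RuntimeError("retries exhausted")
```

Note the sleep happens outside the slot, so a backing-off attempt does not hold a concurrency slot while it waits.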

RateLimitConfig controls the AIMD limiter per adapter:

adapters:
  - type: "llm_translator"
    config:
      provider: "default"
      model: "claude-sonnet-4-6"
      rate_limit:
        max_concurrency: 50
        floor: 5

The retry wrapper’s 7-retry policy is currently non-configurable — a deliberate choice, tuned for LLM error patterns.

HTTP fetching is not LLM calling. The webscraper package ships its own, separate rate-limiting mechanism: token-bucket RPS limiting with five pre-set strategies (conservative / moderate / aggressive / adaptive / none). Appropriate when concurrency is cheap but per-endpoint RPS matters. See Workflow: Webscraper.
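For contrast with the AIMD approach, a generic token bucket looks like this. It is the textbook algorithm only, not the webscraper package's actual implementation or its strategy presets:

```python
import time

class TokenBucket:
    """Generic token-bucket RPS limiter: tokens refill continuously at
    `rate` per second up to `capacity`; each request spends one token."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The contrast with AIMD: a token bucket enforces a fixed request *rate* per endpoint regardless of feedback, while AIMD discovers a *concurrency* ceiling from 429 responses.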