Rate limiting
LLM providers enforce rate limits. Exceed them and the provider returns 429 — sustained overage causes account degradation. Factflow’s factflow-llm package ships two cooperating mechanisms: an AIMD concurrency limiter (gates how many requests are in flight) and a retry-with-jitter wrapper (handles transient errors). The two layers compose.
Two layers, one call
```mermaid
flowchart LR
  Adapter[Adapter] --> RLC[RateLimitedClient<br/>retry + jitter]
  RLC --> ARL[AdaptiveRateLimiter<br/>AIMD semaphore]
  ARL --> RealCall[Real LLM call]
  RealCall -.429.-> ARL
  RealCall -.5xx/timeout.-> RLC
  style ARL fill:#e6f2ff,stroke:#2563eb,color:#111
  style RLC fill:#f0fdf4,stroke:#16a34a,color:#111
```
Order matters. The adapter calls RateLimitedClient.complete(...). RateLimitedClient makes up to 7 attempts. Each attempt goes through the AdaptiveRateLimiter.slot() context manager before hitting the real API.
Layer 1 — AdaptiveRateLimiter (AIMD concurrency)
Lives at `factflow_llm.rate_limiter.AdaptiveRateLimiter`. Implements TCP-style congestion control applied to API concurrency. Not RPS, not token counting — just how many requests are allowed to be in flight simultaneously.
Algorithm
```
on success: limit = min(max_concurrency, limit + 1)   # additive increase
on 429:     limit = max(floor, limit // 2)            # multiplicative decrease
```

Two tunables:
- `max_concurrency` — hard ceiling. Default 50.
- `floor` — minimum slots even after repeated decreases. Default 5.
Starts at max_concurrency, converges downward through 429 feedback, climbs back when the provider lets it. No per-provider tuning required — the ceiling is discovered empirically.
```mermaid
stateDiagram-v2
  [*] --> MaxConcurrency: init
  MaxConcurrency --> Saturated: each success limit+1
  Saturated --> Backoff: 429 received
  Backoff --> Recovering: limit = max(floor, limit//2)
  Recovering --> Saturated: each success limit+1
  Recovering --> Backoff: another 429
  Backoff --> Floor: cannot decrease further
  Floor --> Recovering: success
```
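The update rule and slot bookkeeping can be sketched as a small asyncio primitive. This is a minimal illustration under stated assumptions, not the package's actual implementation — `AIMDLimiter` and `SlotSignal` are illustrative names (the real class is `AdaptiveRateLimiter`):

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass


@dataclass
class SlotSignal:
    """Caller flips this flag inside the slot when a 429 is observed."""
    was_rate_limited: bool = False


class AIMDLimiter:
    """Illustrative AIMD concurrency gate (a sketch, not the real class)."""

    def __init__(self, max_concurrency: int = 50, floor: int = 5):
        self.max_concurrency = max_concurrency
        self.floor = floor
        self.limit = max_concurrency      # start optimistic, at the ceiling
        self._active = 0
        self._cond = asyncio.Condition()

    @asynccontextmanager
    async def slot(self):
        async with self._cond:
            # Block while in-flight requests have reached the current limit.
            await self._cond.wait_for(lambda: self._active < self.limit)
            self._active += 1
        signal = SlotSignal()
        try:
            yield signal
        finally:
            async with self._cond:
                self._active -= 1
                if signal.was_rate_limited:
                    # Multiplicative decrease, never below the floor.
                    self.limit = max(self.floor, self.limit // 2)
                else:
                    # Additive increase, never above the hard ceiling.
                    self.limit = min(self.max_concurrency, self.limit + 1)
                self._cond.notify_all()
```

Note that the adjustment happens on context exit, which is why the caller's flag (below) is load-bearing: the limiter itself never sees the HTTP response.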
Usage pattern
```python
from factflow_llm.rate_limiter import AdaptiveRateLimiter

limiter = AdaptiveRateLimiter(max_concurrency=50, floor=5)

async with limiter.slot() as signal:
    try:
        response = await llm_client.complete(...)
    except RateLimitError:
        signal.was_rate_limited = True
        raise
```

Critical: the caller MUST set `signal.was_rate_limited = True` on 429 before the `slot()` context exits. Without that, the limiter never triggers a decrease and throughput never adapts. The rule is enforced by code review — it’s not automated.
Observable state
`AdaptiveRateLimiter.metrics` returns an `AIMDMetrics` dataclass:
| Field | Meaning |
|---|---|
| `current_limit` | Current concurrency ceiling (adapts over time) |
| `total_acquires` | Total slots ever acquired |
| `total_rate_limits` | Total 429 signals received |
| `total_decreases` | Times the limit was actually decreased (may be less than `total_rate_limits` if already at floor) |
| `peak_active` | Highest simultaneous in-flight count observed |
| `limit_history` | Deque of last 100 limit values after decreases (for charting) |
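As a rough sketch, the table above maps onto a dataclass of this shape — field names follow the table, but the definition itself (defaults, types) is an assumption, not the package's actual code:

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AIMDMetrics:
    """Illustrative shape of the metrics snapshot described above."""
    current_limit: int
    total_acquires: int = 0
    total_rate_limits: int = 0
    total_decreases: int = 0
    peak_active: int = 0
    # Bounded history: only the last 100 post-decrease limits are retained.
    limit_history: deque = field(default_factory=lambda: deque(maxlen=100))
```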
Layer 2 — RateLimitedClient (retry + jitter)
Lives at `factflow_llm.rate_limited_client.RateLimitedClient`. Wraps any `BaseLLMClient` or embedding client with retry logic.
Policy
- Max retries: 7
- Base delay: 0.5 seconds
- Cap delay: 60 seconds
- Jitter: full (randomised between 0 and computed delay)
- Schedule: exponential — `base * 2^attempt`, capped at 60s, then jittered
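The schedule above can be sketched in a few lines. `backoff_delay`, `BASE_DELAY`, and `CAP_DELAY` are illustrative names, and attempt numbering starting at 0 is an assumption:

```python
import random

BASE_DELAY = 0.5   # seconds (the wrapper's base delay)
CAP_DELAY = 60.0   # seconds (the wrapper's cap)


def backoff_delay(attempt: int, rng: random.Random = random) -> float:
    # Exponential growth from the base, clamped at the cap...
    ceiling = min(CAP_DELAY, BASE_DELAY * (2 ** attempt))
    # ...then full jitter: a uniform draw between 0 and that ceiling.
    return rng.uniform(0.0, ceiling)
```

With full jitter the *expected* sleep is half the exponential ceiling, but the spread is what matters: concurrent clients that failed together wake up at different times.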
```mermaid
sequenceDiagram
  participant Adapter
  participant RLC as RateLimitedClient
  participant ARL as AdaptiveRateLimiter
  participant API as LLM provider
  Adapter->>RLC: complete(...)
  loop up to 7 attempts
    RLC->>ARL: slot()
    ARL-->>RLC: slot acquired
    RLC->>API: HTTP POST
    alt success
      API-->>RLC: 200
      RLC-->>Adapter: response
    else 429
      API-->>RLC: 429
      Note over RLC,ARL: signal.was_rate_limited = True
      RLC->>RLC: sleep(exp_backoff with jitter)
    else 5xx or timeout
      API-->>RLC: transient error
      RLC->>RLC: sleep(exp_backoff with jitter)
    else fatal (401, 403, 404)
      API-->>RLC: fatal error
      RLC-->>Adapter: raise (no retry)
    end
  end
```
Why jitter
Without jitter, bursts of simultaneous 429s produce synchronised retries → another burst of 429s → thundering herd. Full jitter breaks the phase alignment. Standard cloud-retry practice.
Error classification
`factflow_llm.error_classification.classify_llm_error(exc)` maps raw exceptions to one of three bands:
| Band | Examples | Retry behaviour |
|---|---|---|
| Fatal | 401 auth, 403 permission, 404 not found | Never retry; propagate to caller |
| Rate-limited | 429 | Retry (via RateLimitedClient); trigger AIMD decrease |
| Retryable | 500+, connection error, timeout | Retry with backoff |
Unknown exception types default to retryable (conservative — avoids masking bugs with silent non-retries).
get_error_metadata(exc) returns a dict for AdapterResult.metadata:
{ "fatal": bool, "retryable": bool, "status_code": int | None, "error_type": str,}Adapters typically ship this dict back to the processor on failure, which stamps it into lineage for debugging.
Putting it together — one adapter’s call path
```mermaid
sequenceDiagram
  participant Proc as Processor
  participant Ad as Adapter
  participant Fac as LLMClientFactory
  participant RLC as RateLimitedClient
  participant ARL as AdaptiveRateLimiter
  participant API as Real LLM API
  Proc->>Ad: await process(ctx)
  Ad->>Fac: create_completion_client("default")
  Fac-->>Ad: RateLimitedClient wrapping OpenAIClient
  Ad->>RLC: complete(messages=[...])
  loop up to 7 attempts
    RLC->>ARL: async with slot() as signal
    ARL-->>RLC: slot acquired (throttle if saturated)
    RLC->>API: POST /v1/chat/completions
    alt 200 OK
      API-->>RLC: response
      RLC-->>ARL: exit context
      Note over ARL: limit += 1
      RLC-->>Ad: CompletionResponse
    else 429
      API-->>RLC: RateLimitError
      RLC->>RLC: signal.was_rate_limited = True
      Note over ARL: limit = max(floor, limit // 2)
      RLC->>RLC: sleep exp+jitter
    end
  end
  Ad-->>Proc: AdapterResult
```
Invariants worth protecting
- No retry inside clients. Raw API errors propagate from `OpenAIClient` / `AnthropicClient` / etc. Retry is external (`RateLimitedClient`).
- No circuit breaker inside clients. Circuit breakers live on adapters, not clients.
- Caller must mark the 429 signal. If you implement a custom client wrapper, `signal.was_rate_limited = True` is mandatory on rate-limit errors.
- Unknown errors default retryable. Conservative by choice.
Configuration
`RateLimitConfig` controls the AIMD limiter per adapter:
```yaml
adapters:
  - type: "llm_translator"
    config:
      provider: "default"
      model: "claude-sonnet-4-6"
      rate_limit:
        max_concurrency: 50
        floor: 5
```

The retry wrapper’s 7-retry policy is currently non-configurable — a deliberate choice, tuned for LLM error patterns.
Webscraper uses a different model
HTTP fetching is not LLM calling. The webscraper package ships its own, separate rate-limiting mechanism: token-bucket RPS limiting with five pre-set strategies (conservative / moderate / aggressive / adaptive / none). Appropriate when concurrency is cheap but per-endpoint RPS matters. See Workflow: Webscraper.
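For contrast with AIMD concurrency gating, a token bucket meters requests per second: tokens refill at a fixed rate, and each request spends one. A minimal sketch — illustrative only, not the webscraper's actual code:

```python
import time


class TokenBucket:
    """Illustrative RPS limiter: refill at `rate` tokens/sec, burst up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens added per second
        self.capacity = capacity     # burst ceiling
        self.tokens = capacity       # start full
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, clamped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The contrast with AIMD: a token bucket enforces a fixed sustained rate regardless of feedback, while the AIMD limiter discovers its ceiling from 429 responses. The webscraper's "adaptive" strategy presumably blends the two, but that is not detailed here.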