
LLM clients

Every adapter that calls a language model talks to an LLMClientProtocol or EmbeddingClientProtocol. The concrete implementation is chosen at construction time by factflow-llm based on config. Adapter authors never import a specific provider.

Three practical reasons:

  • Cost optimisation — swap from Claude Opus to Claude Sonnet via a config change, no redeploy
  • Vendor redundancy — if OpenAI is down, route to Bedrock without touching code
  • Testability — production code depends on a protocol; tests inject a mock satisfying the same protocol
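The testability point can be sketched with a minimal example. The protocol body and return type here are assumptions (the real LLMClientProtocol in factflow-llm may carry more methods and return a richer response object); only the complete(messages=...) call shape is taken from the docs below.

```python
import asyncio
from typing import Any, Protocol


class LLMClientProtocol(Protocol):
    """Structural sketch of the chat-side protocol (simplified)."""

    async def complete(self, messages: list[dict[str, Any]]) -> str: ...


class StubClient:
    """Test double that satisfies the protocol without network access."""

    async def complete(self, messages: list[dict[str, Any]]) -> str:
        return "canned response"


async def summarise(client: LLMClientProtocol, text: str) -> str:
    # Production code depends only on the protocol, never on a provider class.
    return await client.complete(messages=[{"role": "user", "content": text}])
```

Because Protocol uses structural typing, StubClient needs no inheritance — any object with a matching complete method passes the type check.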

| Provider    | Chat                            | Embeddings                 | Notes                            |
|-------------|---------------------------------|----------------------------|----------------------------------|
| OpenAI      | OpenAIClient, AzureOpenAIClient |                            |                                  |
| Anthropic   | AnthropicClient                 |                            |                                  |
| Bedrock     |                                 | BedrockEmbeddingClient     | via Titan                        |
| HuggingFace |                                 | HuggingFaceEmbeddingClient | via sentence-transformers, local |

Adding a provider is one file implementing LLMClientProtocol / EmbeddingClientProtocol plus registration in the factory.

Optional dependencies are gated by _AVAILABLE flags (ANTHROPIC_AVAILABLE, BEDROCK_AVAILABLE, SENTENCE_TRANSFORMERS_AVAILABLE). The factory skips providers whose libraries aren’t installed.
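The discovery-flag pattern can be sketched as follows. The flag names are taken from the docs; the try_import implementation via find_spec is an assumption about how such a helper is typically written.

```python
import importlib.util


def try_import(module_name: str) -> bool:
    """Report whether an optional dependency is importable,
    without actually importing it (sketch of the discovery-flag pattern)."""
    return importlib.util.find_spec(module_name) is not None


# Flags mirror the ones named above; the factory can consult them
# to skip providers whose libraries aren't installed.
ANTHROPIC_AVAILABLE = try_import("anthropic")
SENTENCE_TRANSFORMERS_AVAILABLE = try_import("sentence_transformers")
```

Using find_spec rather than a bare import keeps startup cheap: the module is located but not executed until a client actually needs it.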

```python
from factflow_llm import LLMClientFactory
from factflow_llm.settings import LLMConfig

config: LLMConfig = ...  # loaded from app config
factory = LLMClientFactory(config)

chat = factory.create_completion_client(provider_name="default")
emb = factory.create_embedding_client(provider_name="openai-embed")

response = await chat.complete(messages=[...])
vectors = await emb.embed(texts=[...])
```

Clients are cached per provider profile. First call constructs; subsequent calls reuse.
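The caching behaviour amounts to per-profile memoisation. This sketch uses illustrative names (ClientCache, build) that are not factflow's internals; it only demonstrates the construct-once, reuse-after semantics described above.

```python
from typing import Callable, Dict


class ClientCache:
    """Per-profile memoisation (illustrative, not factflow's actual code)."""

    def __init__(self, build: Callable[[str], object]) -> None:
        self._build = build
        self._clients: Dict[str, object] = {}

    def get(self, profile: str) -> object:
        # First call constructs; subsequent calls reuse the same instance.
        if profile not in self._clients:
            self._clients[profile] = self._build(profile)
        return self._clients[profile]
```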

Production LLM providers enforce rate limits. Hitting the limit produces a 429 / RateLimitError; sustained violations can degrade your account's standing with the provider.

Factflow wraps every client in RateLimitedClient, which uses AIMD (additive increase / multiplicative decrease):

  • Additive increase — on sustained success, gradually raise the per-second token + request budget
  • Multiplicative decrease — on a rate-limit signal, halve the budget and back off
  • Recovery — after cooldown, start probing again

Effect: throughput climbs until you hit the ceiling, backs off cleanly, finds equilibrium, and adapts automatically when the provider’s rate limit changes without notice.
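The three AIMD rules above can be sketched as a small budget controller. Parameter names and defaults here are illustrative, not factflow's actual RateLimitConfig fields.

```python
class AIMDBudget:
    """Additive-increase / multiplicative-decrease budget (sketch)."""

    def __init__(self, initial: float = 10.0, step: float = 1.0,
                 factor: float = 0.5, floor: float = 1.0,
                 ceiling: float = 1000.0) -> None:
        self.budget = initial    # allowed requests per second
        self.step = step         # additive increase per success window
        self.factor = factor     # multiplicative decrease on a 429
        self.floor = floor       # never back off below this
        self.ceiling = ceiling   # never probe above this

    def on_success(self) -> None:
        # Additive increase: probe for headroom slowly.
        self.budget = min(self.ceiling, self.budget + self.step)

    def on_rate_limit(self) -> None:
        # Multiplicative decrease: back off sharply when the provider pushes back.
        self.budget = max(self.floor, self.budget * self.factor)
```

The asymmetry is the point: linear growth keeps the probe gentle, while halving on a 429 sheds load fast enough to stop a limit-violation spiral.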

Configuration: RateLimitConfig on each provider profile. Sensible defaults ship; tune only if your workload is unusual.

Every provider exception is classified:

| Classification | Meaning                                   | Caller behaviour              |
|----------------|-------------------------------------------|-------------------------------|
| RETRYABLE      | Transient network or server issue         | Retry with backoff            |
| TERMINAL       | Bad request, invalid prompt, auth failure | Don't retry; propagate        |
| RATE_LIMITED   | 429 or provider backoff signal            | Back off per the rate limiter |

Adapter authors rarely implement custom classification — classify_llm_error(exc) does the right thing across providers.
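A toy classifier by HTTP status shows the shape of the mapping. This is not the real classify_llm_error, which also inspects provider-specific exception types; the enum values mirror the table above.

```python
from enum import Enum


class ErrorClass(Enum):
    RETRYABLE = "retryable"
    TERMINAL = "terminal"
    RATE_LIMITED = "rate_limited"


def classify_by_status(status: int) -> ErrorClass:
    """Classify an HTTP status into the three buckets (toy version)."""
    if status == 429:
        return ErrorClass.RATE_LIMITED    # provider backoff signal
    if 500 <= status < 600:
        return ErrorClass.RETRYABLE       # transient server issue
    return ErrorClass.TERMINAL            # bad request, auth failure, etc.
```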

There is no global “default model”. Each pipeline’s config specifies what it wants:

- type: "llm_translator"
config:
provider: "default" # points at the provider profile
model: "claude-sonnet-4-6" # explicit model id
max_tokens: 4096

A pipeline wanting GPT-4o and a pipeline wanting Claude Opus coexist without conflict.

To add a new provider:

  1. Implement LLMClientProtocol (and/or EmbeddingClientProtocol) in a new file under factflow-llm/src/factflow_llm/
  2. Add a discovery flag (MYPROVIDER_AVAILABLE = try_import("myprovider"))
  3. Register in LLMClientFactory._create_client_for_provider
  4. Add provider-specific settings to LLMProviderConfig (or extend via the extra dict pattern)
  5. Add a test that constructs the client without credentials — should fail fast, not hang
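Step 5 can look like the following. The client class, exception type, and environment-variable name are all hypothetical; the point is that missing credentials raise at construction time rather than hanging on a deferred network call.

```python
from typing import Optional


class MissingCredentialsError(ValueError):
    """Hypothetical error raised at construction when credentials are absent."""


class MyProviderClient:
    """Hypothetical provider client illustrating the fail-fast pattern."""

    def __init__(self, api_key: Optional[str]) -> None:
        if not api_key:
            # Fail fast: validate credentials now, never hang on first use.
            raise MissingCredentialsError("MYPROVIDER_API_KEY is not set")
        self.api_key = api_key


def test_constructs_fail_fast_without_credentials() -> None:
    try:
        MyProviderClient(api_key=None)
    except MissingCredentialsError:
        return  # expected: constructor refused immediately
    raise AssertionError("client should not construct without credentials")
```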