Guide
LLM Providers
Connect Omni to one or more LLM providers. Use cloud APIs, run models locally, or bring your own endpoint.
Overview
Omni's LLM bridge is a provider-agnostic abstraction layer. You configure one or more providers, and Omni handles streaming, token counting, and automatic failover. All providers support tool calling (function calling), which is how the agent invokes native tools and extension tools.
OpenAI: GPT-4o, GPT-4, o1, etc.
Anthropic: Claude Opus, Sonnet, Haiku
Google Gemini: Gemini Pro, Ultra, Flash
Ollama: Llama, Mistral, Phi, etc.
AWS Bedrock: Claude, Titan, Llama via AWS
Custom HTTP: any OpenAI-compatible API
OpenAI
Access GPT-4o, GPT-4, and other OpenAI models. Requires an API key from platform.openai.com.
[providers.openai]
provider_type = "openai"
default_model = "gpt-4o"
max_tokens = 4096
temperature = 0.7
The API key is stored in the OS keychain (not the config file). Enter it through Settings → Providers → OpenAI. Token counting uses the cl100k_base tokenizer via tiktoken-rs for accurate billing estimation.
Anthropic
Access Claude models. Requires an API key from console.anthropic.com.
[providers.anthropic]
provider_type = "anthropic"
default_model = "claude-opus-4-6"
temperature = 0.8
Uses SSE streaming for real-time responses. Token counting uses the cl100k_base tokenizer (same as OpenAI) for estimation; Anthropic's models use a different tokenizer, so the count is approximate.
Google Gemini
Access Gemini models. Get an API key from aistudio.google.com.
[providers.gemini]
provider_type = "gemini"
default_model = "gemini-pro"
max_tokens = 4096
Uses the Gemini API v1beta. Token counting uses character-based estimation (characters / 4).
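The character-based estimate is simple enough to sketch in a few lines of Python (the function name is illustrative, not Omni's actual API):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: roughly 4 characters per token, minimum 1."""
    return max(1, len(text) // 4)

print(estimate_tokens("Hello, Gemini!"))  # 14 characters -> 3
```

This heuristic is coarse (it overcounts for dense code, undercounts for some non-Latin scripts) but requires no tokenizer dependency.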
Ollama (Local)
Run models locally on your machine using Ollama. No API key needed — completely private and offline.
# After installing Ollama, pull one or more models
ollama pull llama3.1
ollama pull mistral
ollama pull phi3
[providers.ollama]
provider_type = "ollama"
default_model = "llama3.1"
endpoint = "http://localhost:11434"
Ollama runs on port 11434 by default. Set the endpoint if you're running it on a different host or port. Token counting uses character-based estimation.
AWS Bedrock
Access models through your AWS account using Bedrock. Supports Claude, Titan, Llama, and other models available in your AWS region.
[providers.bedrock]
provider_type = "bedrock"
default_model = "anthropic.claude-v2"
endpoint = "https://bedrock-runtime.us-east-1.amazonaws.com"
Authentication uses AWS SigV4 signing. Configure your AWS credentials via environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) or the standard AWS credentials file. Uses the InvokeModelWithResponseStream API for streaming.
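For example, the standard AWS environment variables can be exported before launching Omni (the values below are placeholders, not real credentials):

```shell
# Standard AWS credential environment variables picked up by SigV4 signing.
export AWS_ACCESS_KEY_ID="AKIA_EXAMPLE_KEY_ID"
export AWS_SECRET_ACCESS_KEY="example-secret-key"
# Optional: region should match the Bedrock endpoint in the config.
export AWS_DEFAULT_REGION="us-east-1"
```

The credentials file alternative (~/.aws/credentials) works the same way and needs no environment setup.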
Custom HTTP
Connect to any API endpoint that follows the OpenAI chat completions format. Use this for self-hosted models, LLM proxies, or alternative providers.
[providers.my-proxy]
provider_type = "custom"
default_model = "my-model"
endpoint = "https://my-llm-proxy.example.com/v1"
max_tokens = 2048
The endpoint must support the /chat/completions path with SSE streaming. API key authentication is optional.
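For reference, consuming an OpenAI-style /chat/completions SSE stream looks roughly like this on the client side (a dependency-free Python sketch with a hard-coded sample stream; Omni's real implementation is async Rust):

```python
import json

def collect_sse_text(lines):
    """Accumulate text deltas from OpenAI-style chat completion SSE lines."""
    out = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alives and comment lines
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break  # sentinel marking end of stream
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            out.append(delta)
    return "".join(out)

sample = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print(collect_sse_text(sample))  # -> Hello
```

An endpoint that speaks this format (deltas in choices[0].delta, terminated by [DONE]) should work with the custom provider.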
Provider Rotation & Fallback
When you configure multiple providers, Omni automatically handles failover. If the primary provider returns an error or times out, the request is retried with the next available provider using exponential backoff.
Backoff Schedule
1st retry: 5 seconds
2nd retry: 15 seconds
3rd retry: 60 seconds
4th retry: 300 seconds
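The schedule is a fixed lookup rather than a computed curve; a sketch of that lookup (delay values from the table above; the clamping behaviour past the 4th retry is an assumption, and the function name is hypothetical):

```python
BACKOFF_SECONDS = [5, 15, 60, 300]

def retry_delay(attempt: int) -> int:
    """Delay in seconds before the Nth retry (1-based).

    Attempts beyond the table reuse the last entry (assumed behaviour).
    """
    index = min(attempt, len(BACKOFF_SECONDS)) - 1
    return BACKOFF_SECONDS[index]

print([retry_delay(n) for n in range(1, 6)])  # -> [5, 15, 60, 300, 300]
```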
Providers are tried in order of their configured priority. A provider that repeatedly fails is temporarily marked unavailable and skipped until its backoff period expires. You can disable a provider without removing it by setting enabled = false.
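For example, a provider can be parked without deleting its section (only the enabled key is documented above; the rest of the snippet mirrors the earlier OpenAI example):

```toml
[providers.openai]
provider_type = "openai"
default_model = "gpt-4o"
enabled = false   # keep the config, skip this provider during rotation
```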
Streaming
All providers use Server-Sent Events (SSE) streaming for real-time token delivery. As the LLM generates its response, tokens are streamed to the UI as they arrive rather than after the full completion.
The streaming architecture uses Tokio async streams with byte buffer accumulation. Each chunk can contain:
TextDelta: partial text content
ToolCallDelta: tool call being assembled
Usage: token count update
Done: stream complete signal
Token Counting
Omni tracks token usage per request to help you monitor costs. The counting method varies by provider.