
Provider Integration Guide

Integrate Aixgo with OpenAI, Anthropic, Google Vertex AI, HuggingFace, Ollama (local), and vector databases.

Provider Status

LLM Providers

Provider           | Status    | Notes
OpenAI             | AVAILABLE | Chat, streaming SSE, function calling, JSON mode
Anthropic (Claude) | AVAILABLE | Messages API, streaming SSE, tool use
Google Gemini      | AVAILABLE | GenerateContent API, streaming SSE, function calling
xAI (Grok)         | AVAILABLE | Chat, streaming SSE, function calling (OpenAI-compatible)
Vertex AI          | AVAILABLE | Google Cloud AI Platform, streaming SSE, function calling
HuggingFace        | AVAILABLE | Free Inference API, cloud backends
Ollama             | AVAILABLE | Local models, zero costs, enterprise SSRF protection, hybrid fallback

Vector Databases

Provider  | Status    | Notes
Firestore | AVAILABLE | Google Cloud serverless vector search
In-Memory | AVAILABLE | Development and testing
Qdrant    | PLANNED   | Planned for v0.2
pgvector  | PLANNED   | Planned for v0.2

Embedding Providers

Provider        | Status    | Notes
OpenAI          | AVAILABLE | text-embedding-3-small, text-embedding-3-large
HuggingFace API | AVAILABLE | Free inference API, 100+ models
HuggingFace TEI | AVAILABLE | Self-hosted high-performance server

LLM Providers

OpenAI (GPT-4, GPT-3.5)

Supported models:

  • gpt-4 - Most capable, higher cost
  • gpt-4-turbo - Faster, lower cost than GPT-4
  • gpt-3.5-turbo - Fast, cost-effective

Configuration:

# config/agents.yaml
agents:
  - name: analyzer
    role: react
    model: gpt-4-turbo
    provider: openai
    api_key: ${OPENAI_API_KEY}
    temperature: 0.7
    max_tokens: 1000

Environment variables:

export OPENAI_API_KEY=sk-...

Go code:

import "github.com/aixgo-dev/aixgo/providers/openai"

agent := aixgo.NewAgent(
    aixgo.WithName("analyzer"),
    aixgo.WithModel("gpt-4-turbo"),
    aixgo.WithProvider(openai.Provider{
        APIKey: os.Getenv("OPENAI_API_KEY"),
    }),
)

Features:

  • ✅ Chat completions
  • ✅ Function calling (tools)
  • ✅ Streaming SSE responses
  • ✅ JSON mode
  • ✅ Token usage tracking

Pricing (as of 2025):

  • GPT-4 Turbo: $0.01 per 1K input tokens, $0.03 per 1K output tokens
  • GPT-3.5 Turbo: $0.0005 per 1K input tokens, $0.0015 per 1K output tokens
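As a rough worked example of these rates (a sketch only; check current OpenAI pricing before budgeting), the helper below converts reported token usage into dollars for GPT-4 Turbo:

// Rough cost estimate for GPT-4 Turbo at the rates listed above:
// $0.01 per 1K input tokens, $0.03 per 1K output tokens.
func estimateGPT4TurboCost(promptTokens, completionTokens int) float64 {
    return float64(promptTokens)/1000*0.01 + float64(completionTokens)/1000*0.03
}

For example, a call with 2,000 input tokens and 500 output tokens costs roughly 2 × $0.01 + 0.5 × $0.03 ≈ $0.035.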

Anthropic (Claude)

Supported models:

  • claude-3-opus - Most capable
  • claude-3-sonnet - Balanced performance/cost
  • claude-3-haiku - Fastest, lowest cost

Configuration:

agents:
  - name: analyst
    role: react
    model: claude-3-sonnet
    provider: anthropic
    api_key: ${ANTHROPIC_API_KEY}
    temperature: 0.5
    max_tokens: 2000

Environment variables:

export ANTHROPIC_API_KEY=sk-ant-...

Go code:

import "github.com/aixgo-dev/aixgo/providers/anthropic"

agent := aixgo.NewAgent(
    aixgo.WithName("analyst"),
    aixgo.WithModel("claude-3-sonnet"),
    aixgo.WithProvider(anthropic.Provider{
        APIKey: os.Getenv("ANTHROPIC_API_KEY"),
    }),
)

Features:

  • ✅ Long context window (200K tokens supported by API)
  • ✅ Tool use
  • ✅ Streaming SSE responses
  • 🚧 Vision support (Planned)

Pricing:

  • Claude 3 Opus: $0.015 per 1K input tokens, $0.075 per 1K output tokens
  • Claude 3 Sonnet: $0.003 per 1K input tokens, $0.015 per 1K output tokens
  • Claude 3 Haiku: $0.00025 per 1K input tokens, $0.00125 per 1K output tokens

Google Vertex AI (Gemini)

Supported models:

  • gemini-1.5-pro - Most capable
  • gemini-1.5-flash - Fast, cost-effective

Configuration:

agents:
  - name: processor
    role: react
    model: gemini-1.5-flash
    provider: vertexai
    project_id: ${GCP_PROJECT_ID}
    location: us-central1
    temperature: 0.8

Authentication:

# Service account key
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

# Or use gcloud default credentials
gcloud auth application-default login

Go code:

import "github.com/aixgo-dev/aixgo/providers/vertexai"

agent := aixgo.NewAgent(
    aixgo.WithName("processor"),
    aixgo.WithModel("gemini-1.5-flash"),
    aixgo.WithProvider(vertexai.Provider{
        ProjectID: os.Getenv("GCP_PROJECT_ID"),
        Location:  "us-central1",
    }),
)

Features:

  • ✅ Long context (2M tokens for Gemini 1.5)
  • ✅ Multimodal (text, images, video)
  • ✅ Function calling
  • ✅ Grounding with Google Search

Pricing:

  • Gemini 1.5 Pro: $0.00125 per 1K input chars, $0.005 per 1K output chars
  • Gemini 1.5 Flash: $0.000125 per 1K input chars, $0.000375 per 1K output chars

HuggingFace Inference API

Supported backends:

  • HuggingFace Inference API (cloud)
  • Ollama (local)
  • vLLM (self-hosted)

Supported models:

  • Any model on HuggingFace with Inference API enabled
  • Popular: meta-llama/Llama-2-70b-chat-hf, mistralai/Mixtral-8x7B-Instruct-v0.1

Configuration:

agents:
  - name: classifier
    role: react
    model: meta-llama/Llama-2-70b-chat-hf
    provider: huggingface
    api_key: ${HUGGINGFACE_API_KEY}
    endpoint: https://api-inference.huggingface.co

Environment variables:

export HUGGINGFACE_API_KEY=hf_...

Go code:

import "github.com/aixgo-dev/aixgo/providers/huggingface"

agent := aixgo.NewAgent(
    aixgo.WithName("classifier"),
    aixgo.WithModel("meta-llama/Llama-2-70b-chat-hf"),
    aixgo.WithProvider(huggingface.Provider{
        APIKey:   os.Getenv("HUGGINGFACE_API_KEY"),
        Endpoint: "https://api-inference.huggingface.co",
    }),
)

Features:

  • ✅ Open-source models
  • ✅ Self-hosted option (Ollama, vLLM)
  • ✅ Cloud backends
  • ✅ Streaming support
  • ✅ Custom fine-tuned models
  • ⚠️ Tool calling support (model-dependent)

Pricing:

  • Pay-per-request or dedicated endpoints
  • Varies by model size and usage

Ollama (Local Models)

Run production AI models on your own infrastructure with zero API costs and complete data privacy.

Ollama enables you to run state-of-the-art open-source models locally or on-premises. Aixgo provides enterprise-grade Ollama integration with hardened security, automatic fallback to cloud APIs, and production-ready deployment templates.

Supported models:

Any model from the Ollama library:

  • phi3.5:3.8b-mini-instruct-q4_K_M - Fast, efficient (3.8B parameters)
  • gemma2:2b-instruct-q4_0 - Google’s lightweight model (2B)
  • llama3.1:8b - Meta’s Llama 3.1 (8B)
  • mistral:7b - Mistral 7B
  • codellama:7b - Code-focused model
  • Custom quantized models (int4, int8, fp16)

Quick Start:

  1. Install Ollama:
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows - download from https://ollama.com/download
  2. Pull a model:
ollama pull phi3.5:3.8b-mini-instruct-q4_K_M
  3. Configure Aixgo:
# config/agents.yaml
model_services:
  - name: phi-local
    provider: huggingface
    model: phi3.5:3.8b-mini-instruct-q4_K_M
    runtime: ollama
    transport: local
    config:
      address: http://localhost:11434  # Default, can be omitted
      quantization: int4

agents:
  - name: local-assistant
    role: react
    model: phi-local
    prompt: |
      You are a helpful AI assistant running locally.
    temperature: 0.7
    max_tokens: 1000

Go SDK:

import (
    "context"
    "fmt"
    "log"

    "github.com/aixgo-dev/aixgo/internal/llm/inference"
)

// Create Ollama service
ollama := inference.NewOllamaService("http://localhost:11434")

// Check availability
if !ollama.Available() {
    log.Fatal("Ollama not running")
}

// List available models
models, err := ollama.ListModels(context.Background())
if err != nil {
    log.Fatal(err)
}
for _, model := range models {
    fmt.Printf("Model: %s (Size: %d bytes)\n", model.Name, model.Size)
}

// Generate text
req := inference.GenerateRequest{
    Model:       "phi3.5:3.8b-mini-instruct-q4_K_M",
    Prompt:      "Explain quantum computing in simple terms.",
    MaxTokens:   500,
    Temperature: 0.7,
    Stop:        []string{"\n\n"},
}
resp, err := ollama.Generate(context.Background(), req)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Response: %s\n", resp.Text)
fmt.Printf("Tokens used: %d prompt + %d completion = %d total\n",
    resp.Usage.PromptTokens, resp.Usage.CompletionTokens, resp.Usage.TotalTokens)

// Chat completions
chatResp, err := ollama.Chat(context.Background(), "phi3.5:3.8b-mini-instruct-q4_K_M", []inference.ChatMessage{
    {Role: "user", Content: "What is Aixgo?"},
})
if err != nil {
    log.Fatal(err)
}
_ = chatResp // the assistant's reply is carried in the returned chat response

Configuration Options:

model_services:
  - name: my-ollama-service
    provider: huggingface
    model: llama3.1:8b
    runtime: ollama
    transport: local
    config:
      # Ollama server address (optional, defaults to http://localhost:11434)
      address: http://localhost:11434

      # Quantization level (optional)
      quantization: int4  # Options: int4, int8, fp16

      # Request timeout (optional, defaults to 5 minutes)
      timeout: 300s

Hybrid Inference with Automatic Fallback:

Aixgo can automatically fall back to cloud APIs if Ollama is unavailable:

agents:
  - name: resilient-agent
    role: react
    providers:
      # Try local Ollama first
      - model: phi-local
        provider: huggingface
        runtime: ollama

      # Fallback to cloud if local unavailable
      - model: gpt-4-turbo
        provider: openai
        api_key: ${OPENAI_API_KEY}

      - model: claude-3-haiku
        provider: anthropic
        api_key: ${ANTHROPIC_API_KEY}

    fallback_strategy: cascade  # Try each in order
    prompt: |
      You are a resilient assistant with automatic failover.

Security Features:

Aixgo’s Ollama integration includes enterprise-grade security:

  • SSRF Protection: Strict host allowlist (localhost, 127.0.0.1, ::1, ollama)
  • No Redirects: Prevents redirect-based SSRF attacks
  • IP Validation: Blocks private ranges, link-local, multicast, cloud metadata endpoints
  • DNS Rebinding Protection: Per-connection hostname validation
  • 40+ Security Test Cases: Comprehensive security validation
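These protections are built into Aixgo's Ollama client and require no configuration. As a rough illustration of the allowlist idea only (the helper below is a hypothetical sketch, not Aixgo's actual validator), host checking plus a redirect-refusing HTTP client can look like this:

import (
    "fmt"
    "net/http"
    "net/url"
)

// Hypothetical sketch of the allowlist check described above; Aixgo's real
// validator also inspects resolved IPs and guards against DNS rebinding.
func validateOllamaAddress(rawURL string) error {
    u, err := url.Parse(rawURL)
    if err != nil {
        return err
    }
    allowed := map[string]bool{"localhost": true, "127.0.0.1": true, "::1": true, "ollama": true}
    if !allowed[u.Hostname()] {
        return fmt.Errorf("ollama host %q is not in the allowlist", u.Hostname())
    }
    return nil
}

// Refuse to follow redirects, closing off redirect-based SSRF.
var ollamaHTTPClient = &http.Client{
    CheckRedirect: func(req *http.Request, via []*http.Request) error {
        return http.ErrUseLastResponse
    },
}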

Production Deployment:

Docker Compose:

# docker-compose.yaml
version: '3.8'
services:
  ollama:
    image: ollama/ollama:0.5.4
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 10s
      timeout: 5s
      retries: 3

  aixgo:
    build: .
    depends_on:
      ollama:
        condition: service_healthy
    environment:
      - OLLAMA_HOST=http://ollama:11434
    ports:
      - "8080:8080"

volumes:
  ollama-data:

Kubernetes:

Aixgo provides production-ready Kubernetes manifests at deploy/k8s/base/ollama-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: aixgo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      # Security: Non-root user, seccomp, capabilities dropped
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        seccompProfile:
          type: RuntimeDefault

      containers:
      - name: ollama
        image: ollama/ollama:0.5.4
        ports:
        - containerPort: 11434

        resources:
          requests:
            cpu: 2
            memory: 4Gi
          limits:
            cpu: 4
            memory: 8Gi

        # Health checks
        livenessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 30
          periodSeconds: 10

        readinessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 10
          periodSeconds: 5

        volumeMounts:
        - name: ollama-data
          mountPath: /.ollama

      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc

Deploy with:

kubectl apply -f deploy/k8s/base/ollama-deployment.yaml

Custom Docker Image:

Build a security-hardened Ollama image with pre-loaded models:

# docker/ollama.Dockerfile
FROM ollama/ollama:0.5.4

# Pre-pull models at build time (optional)
ARG MODELS="phi3.5:3.8b-mini-instruct-q4_K_M gemma2:2b-instruct-q4_0"
RUN ollama serve & sleep 5 && \
    for model in $MODELS; do ollama pull $model; done && \
    pkill ollama

# Run as non-root user (models pulled above live in /root/.ollama, so either
# grant that path to this UID or point OLLAMA_MODELS at a readable directory)
USER 1000

EXPOSE 11434
CMD ["serve"]

Build and run:

docker build -f docker/ollama.Dockerfile -t my-ollama:latest .
docker run -d -p 11434:11434 -v ollama-data:/root/.ollama my-ollama:latest

API Endpoints:

Aixgo supports these Ollama API endpoints:

Endpoint      | Method | Purpose                    | Supported
/api/generate | POST   | Text generation            | ✅ Yes
/api/chat     | POST   | Chat completions           | ✅ Yes
/api/tags     | GET    | List models / health check | ✅ Yes
/             | GET    | Health check               | ✅ Yes
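For debugging, the same endpoints can be called directly; the sketch below posts a non-streaming request to /api/chat with Go's standard library (in normal use the inference.OllamaService shown earlier wraps this for you):

import (
    "bytes"
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

// Raw, non-streaming call to Ollama's /api/chat endpoint.
payload, _ := json.Marshal(map[string]any{
    "model":  "phi3.5:3.8b-mini-instruct-q4_K_M",
    "stream": false,
    "messages": []map[string]string{
        {"role": "user", "content": "What is Aixgo?"},
    },
})
resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(payload))
if err != nil {
    log.Fatal(err)
}
defer resp.Body.Close()

var out struct {
    Message struct {
        Content string `json:"content"`
    } `json:"message"`
}
if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
    log.Fatal(err)
}
fmt.Println(out.Message.Content)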

Environment Variables:

# Ollama server address (optional, defaults to http://localhost:11434)
export OLLAMA_HOST=http://localhost:11434

# For custom deployments
export OLLAMA_HOST=http://ollama-service.aixgo.svc.cluster.local:11434

Features:

  • ✅ Zero API costs
  • ✅ Complete data privacy
  • ✅ Any Ollama-compatible model
  • ✅ Text generation and chat completions
  • ✅ Token usage tracking
  • ✅ Model listing and health checks
  • ✅ Hybrid inference with cloud fallback
  • ✅ Enterprise security (SSRF protection)
  • ✅ Production Kubernetes manifests
  • ✅ Docker and Docker Compose support
  • ✅ Non-root container execution
  • ❌ Streaming (not yet supported)
  • ❌ Function calling (model-dependent)

Performance:

Model          | Size  | Speed (tokens/sec) | Memory | Best For
phi3.5:3.8b-q4 | 2.2GB | ~50-100            | 4GB    | General purpose, fast responses
gemma2:2b-q4   | 1.6GB | ~80-150            | 3GB    | Lightweight, edge deployment
llama3.1:8b    | 4.7GB | ~30-60             | 8GB    | Higher quality, reasoning
mistral:7b     | 4.1GB | ~40-80             | 6GB    | Balanced performance/quality

Model Selection Guide:

  • Development/Testing: gemma2:2b-q4_0 - Fastest, smallest
  • Production (CPU): phi3.5:3.8b-mini-instruct-q4_K_M - Best quality/speed balance
  • Production (GPU): llama3.1:8b or mistral:7b - Higher quality
  • Code Generation: codellama:7b - Specialized for code
  • Edge Devices: gemma2:2b-q4_0 - Smallest memory footprint

Troubleshooting:

Ollama not available:

# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start Ollama
ollama serve

# Check server logs (Linux installs run Ollama as a systemd service;
# on macOS see ~/.ollama/logs/server.log)
journalctl -u ollama

Model not found:

# List available models
ollama list

# Pull missing model
ollama pull phi3.5:3.8b-mini-instruct-q4_K_M

Connection refused in Kubernetes:

# Check service
kubectl get svc ollama-service -n aixgo

# Check pod
kubectl get pods -n aixgo -l app=ollama

# View logs
kubectl logs -n aixgo -l app=ollama

# Port forward for testing
kubectl port-forward -n aixgo svc/ollama-service 11434:11434

High memory usage:

  • Use quantized models (q4_K_M, q4_0)
  • Reduce num_ctx parameter
  • Limit concurrent requests
  • Use smaller models (2B-7B vs 13B+)

xAI (Grok)

Supported models:

  • grok-beta - Grok model served through xAI's OpenAI-compatible API

Configuration:

agents:
  - name: researcher
    role: react
    model: grok-beta
    provider: xai
    api_key: ${XAI_API_KEY}

Environment variables:

export XAI_API_KEY=xai-...

Features:

  • ✅ Real-time web access
  • ✅ Tool calling
  • ✅ Long context window

Provider Comparison

Provider      | Best For                          | Context Length  | Tool Support | Cost
OpenAI        | General purpose, function calling | 128K tokens     | ✅ Excellent | $$$
Anthropic     | Long documents, safety            | 200K tokens     | ✅ Excellent | $$$$
Google Vertex | Multimodal, grounding             | 2M tokens       | ✅ Good      | $$
HuggingFace   | Open source, custom models        | Varies          | ⚠️ Limited   | $
xAI           | Real-time info, research          | 128K tokens     | ✅ Good      | $$$
Ollama        | Local inference, data privacy     | Varies (4K-32K) | ⚠️ Limited   | Free

Multi-Provider Strategy

Fallback Configuration

Use multiple providers with automatic fallback:

agents:
  - name: resilient-analyzer
    role: react
    providers:
      - model: gpt-4-turbo
        provider: openai
        api_key: ${OPENAI_API_KEY}
      - model: claude-3-sonnet
        provider: anthropic
        api_key: ${ANTHROPIC_API_KEY}
      - model: gemini-1.5-flash
        provider: vertexai
        project_id: ${GCP_PROJECT_ID}
    fallback_strategy: cascade # Try each in order

If OpenAI fails, automatically try Anthropic, then Google.

Cost Optimization

Route based on complexity:

# Simple tasks: cheap model
- name: simple-classifier
  role: react
  model: gpt-3.5-turbo
  provider: openai

# Complex reasoning: capable model
- name: complex-analyzer
  role: react
  model: gpt-4-turbo
  provider: openai

Region-Specific Routing

# US region: Vertex AI (low latency)
- name: us-agent
  role: react
  model: gemini-1.5-flash
  provider: vertexai
  location: us-central1

# EU region: OpenAI EU endpoint
- name: eu-agent
  role: react
  model: gpt-4-turbo
  provider: openai
  endpoint: https://api.openai.com/v1 # or EU-specific endpoint

Vector Databases & Embeddings

Overview

Aixgo provides integrated support for vector databases and embeddings, enabling Retrieval-Augmented Generation (RAG) systems. The architecture separates embedding generation from vector storage for maximum flexibility.

Architecture:

Documents → Embeddings Service → Vector Database → Semantic Search

Embedding Providers

OpenAI Embeddings

Best for: Production deployments, highest quality

Configuration:

embeddings:
  provider: openai
  openai:
    api_key: ${OPENAI_API_KEY}
    model: text-embedding-3-small # or text-embedding-3-large

Go code:

import "github.com/aixgo-dev/aixgo/pkg/embeddings"

config := embeddings.Config{
    Provider: "openai",
    OpenAI: &embeddings.OpenAIConfig{
        APIKey: os.Getenv("OPENAI_API_KEY"),
        Model:  "text-embedding-3-small",
    },
}

embSvc, err := embeddings.New(config)
if err != nil {
    log.Fatal(err)
}
defer embSvc.Close()

// Generate embedding
ctx := context.Background()
embedding, err := embSvc.Embed(ctx, "Your text here")

Models:

  • text-embedding-3-small: 1536 dimensions, $0.02 per 1M tokens
  • text-embedding-3-large: 3072 dimensions, $0.13 per 1M tokens
  • text-embedding-ada-002: 1536 dimensions (legacy)

HuggingFace Inference API

Best for: Development, cost-sensitive deployments

Configuration:

embeddings:
  provider: huggingface
  huggingface:
    model: sentence-transformers/all-MiniLM-L6-v2
    api_key: ${HUGGINGFACE_API_KEY} # Optional
    wait_for_model: true
    use_cache: true

Popular models:

  • sentence-transformers/all-MiniLM-L6-v2: 384 dims, fast
  • BAAI/bge-large-en-v1.5: 1024 dims, excellent quality
  • thenlper/gte-large: 1024 dims, multilingual

Pricing: FREE (Inference API) with rate limits
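In Go, the free Inference API uses the same embeddings package as the OpenAI example above; the HuggingFaceConfig struct below is the one used in the complete RAG example later in this guide (the Go field for the optional API key is not shown in this guide, so the sketch leaves it out):

import (
    "context"
    "log"

    "github.com/aixgo-dev/aixgo/pkg/embeddings"
)

// Free HuggingFace Inference API embeddings, useful for development.
embSvc, err := embeddings.New(embeddings.Config{
    Provider: "huggingface",
    HuggingFace: &embeddings.HuggingFaceConfig{
        Model: "sentence-transformers/all-MiniLM-L6-v2",
    },
})
if err != nil {
    log.Fatal(err)
}
defer embSvc.Close()

// Generate a 384-dimensional embedding for the input text
embedding, err := embSvc.Embed(context.Background(), "Your text here")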

HuggingFace TEI (Self-Hosted)

Best for: High-throughput production workloads

Docker setup:

docker run -d \
  --name tei \
  -p 8080:8080 \
  --gpus all \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-large-en-v1.5

Configuration:

embeddings:
  provider: huggingface_tei
  huggingface_tei:
    endpoint: http://localhost:8080
    model: BAAI/bge-large-en-v1.5
    normalize: true

Vector Store Providers

Firestore Vector Search

Best for: Serverless production deployments on GCP

Setup:

# Enable Firestore
gcloud services enable firestore.googleapis.com

# Create vector index
gcloud firestore indexes composite create \
  --collection-group=embeddings \
  --query-scope=COLLECTION \
  --field-config=field-path=embedding,vector-config='{"dimension":"384","flat":{}}'

Configuration:

vectorstore:
  provider: firestore
  embedding_dimensions: 384
  firestore:
    project_id: ${GCP_PROJECT_ID}
    collection: embeddings
    credentials_file: /path/to/key.json # Optional

Go code:

import "github.com/aixgo-dev/aixgo/pkg/vectorstore"

config := vectorstore.Config{
    Provider:            "firestore",
    EmbeddingDimensions: 384,
    Firestore: &vectorstore.FirestoreConfig{
        ProjectID:  os.Getenv("GCP_PROJECT_ID"),
        Collection: "embeddings",
    },
}

store, err := vectorstore.New(config)
if err != nil {
    log.Fatal(err)
}
defer store.Close()

// Upsert documents (the embedding comes from the embeddings service shown earlier)
ctx := context.Background()
doc := vectorstore.Document{
    ID:        "doc-1",
    Content:   "Your document content",
    Embedding: embedding,
    Metadata: map[string]interface{}{
        "category": "documentation",
    },
}
store.Upsert(ctx, []vectorstore.Document{doc})

// Search
results, err := store.Search(ctx, vectorstore.SearchQuery{
    Embedding: queryEmbedding,
    TopK:      5,
    MinScore:  0.7,
})

Features:

  • ✅ Serverless, auto-scaling
  • ✅ Persistent storage
  • ✅ Real-time updates
  • ✅ ACID transactions

Pricing: ~$0.06 per 100K reads + storage

In-Memory Vector Store

Best for: Development, testing, prototyping

Configuration:

vectorstore:
  provider: memory
  embedding_dimensions: 384
  memory:
    max_documents: 10000

Features:

  • ✅ Zero setup
  • ✅ Fast for small datasets
  • ❌ Data lost on restart
  • ❌ Limited capacity
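In Go, the in-memory store uses the same vectorstore API as Firestore above; this sketch assumes that no backend-specific config block is needed when the provider is "memory" (the max_documents cap is shown only in the YAML form):

import (
    "log"

    "github.com/aixgo-dev/aixgo/pkg/vectorstore"
)

// Throwaway in-memory store for unit tests and prototypes; contents are lost on restart.
store, err := vectorstore.New(vectorstore.Config{
    Provider:            "memory",
    EmbeddingDimensions: 384,
})
if err != nil {
    log.Fatal(err)
}
defer store.Close()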

Qdrant (Planned - v0.2)

High-performance dedicated vector database:

# Coming soon
vectorstore:
  provider: qdrant
  embedding_dimensions: 384
  qdrant:
    host: localhost
    port: 6333
    collection: knowledge_base

pgvector (Planned - v0.2)

PostgreSQL extension for vector search:

# Coming soon
vectorstore:
  provider: pgvector
  embedding_dimensions: 384
  pgvector:
    connection_string: postgresql://user:pass@localhost/db
    table: embeddings

Complete RAG Example

package main

import (
    "context"
    "fmt"

    "github.com/aixgo-dev/aixgo/pkg/embeddings"
    "github.com/aixgo-dev/aixgo/pkg/vectorstore"
)

func main() {
    ctx := context.Background()

    // Setup embeddings
    embConfig := embeddings.Config{
        Provider: "huggingface",
        HuggingFace: &embeddings.HuggingFaceConfig{
            Model: "sentence-transformers/all-MiniLM-L6-v2",
        },
    }
    embSvc, _ := embeddings.New(embConfig)
    defer embSvc.Close()

    // Setup vector store
    storeConfig := vectorstore.Config{
        Provider:            "firestore",
        EmbeddingDimensions: embSvc.Dimensions(),
        Firestore: &vectorstore.FirestoreConfig{
            ProjectID:  "my-project",
            Collection: "knowledge_base",
        },
    }
    store, _ := vectorstore.New(storeConfig)
    defer store.Close()

    // Index documents
    docs := []string{
        "Aixgo is a production-grade AI framework",
        "RAG combines retrieval with generation",
    }

    for i, content := range docs {
        emb, _ := embSvc.Embed(ctx, content)
        doc := vectorstore.Document{
            ID:        fmt.Sprintf("doc-%d", i),
            Content:   content,
            Embedding: emb,
        }
        store.Upsert(ctx, []vectorstore.Document{doc})
    }

    // Search
    query := "What is Aixgo?"
    queryEmb, _ := embSvc.Embed(ctx, query)
    results, _ := store.Search(ctx, vectorstore.SearchQuery{
        Embedding: queryEmb,
        TopK:      3,
    })

    for _, result := range results {
        fmt.Printf("Score: %.2f - %s\n", result.Score, result.Document.Content)
    }
}

Provider Comparison: Embeddings

Provider        | Cost                 | Quality        | Speed     | Best For
OpenAI          | $0.02-0.13/1M tokens | Excellent      | Fast      | Production
HuggingFace API | Free                 | Good-Excellent | Medium    | Development
HuggingFace TEI | Free (self-host)     | Good-Excellent | Very Fast | High-volume

Provider Comparison: Vector Stores

Provider           | Persistence | Scalability | Setup  | Cost
Memory             | No          | Low         | None   | Free
Firestore          | Yes         | Unlimited   | Medium | $$
Qdrant (planned)   | Yes         | Very High   | Medium | Self-host
pgvector (planned) | Yes         | High        | Medium | Self-host

API Key Management

Environment Variables

# .env file
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GCP_PROJECT_ID=my-project
HUGGINGFACE_API_KEY=hf_...

Load with:

export $(cat .env | xargs)

Kubernetes Secrets

kubectl create secret generic llm-keys \
  --from-literal=OPENAI_API_KEY=sk-... \
  --from-literal=ANTHROPIC_API_KEY=sk-ant-...

Reference in deployment:

env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: llm-keys
        key: OPENAI_API_KEY

Cloud Secret Managers

Google Secret Manager:

import "cloud.google.com/go/secretmanager/apiv1"

func getAPIKey(ctx context.Context, secretName string) (string, error) {
    client, _ := secretmanager.NewClient(ctx)
    result, _ := client.AccessSecretVersion(ctx, &secretmanagerpb.AccessSecretVersionRequest{
        Name: secretName,
    })
    return string(result.Payload.Data), nil
}

Rate Limiting & Retries

Provider Rate Limits

Provider  | Tier        | Requests/Min | Tokens/Min
OpenAI    | Free        | 3            | 40,000
OpenAI    | Paid Tier 1 | 500          | 90,000
Anthropic | Free        | 5            | 25,000
Anthropic | Paid        | 50           | 100,000
Vertex AI | Default     | 60           | 60,000

Retry Configuration

agents:
  - name: resilient-agent
    role: react
    model: gpt-4-turbo
    provider: openai
    retry:
      max_attempts: 3
      initial_backoff: 1s
      max_backoff: 10s
      multiplier: 2
      retry_on:
        - rate_limit
        - timeout
        - server_error

Monitoring Provider Performance

Track Latency by Provider

import "github.com/prometheus/client_golang/prometheus"

var providerLatency = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name: "llm_provider_latency_seconds",
        Help: "LLM API call latency by provider",
    },
    []string{"provider", "model"},
)

// Aixgo tracks this automatically

Cost Tracking

observability:
  cost_tracking: true
  cost_alert_threshold: 100 # Alert if daily cost > $100

Best Practices

1. Use Environment-Specific Keys

# Development
OPENAI_API_KEY=sk-dev-...

# Production
OPENAI_API_KEY=sk-prod-...

2. Implement Fallback Providers

Always have a backup provider to avoid single point of failure.

3. Monitor Token Usage

Track and alert on unexpected token consumption:

observability:
  llm_observability:
    enabled: true
    track_tokens: true
    daily_token_limit: 1000000

4. Choose Models Strategically

  • Simple tasks: gpt-3.5-turbo, gemini-flash, claude-haiku
  • Complex reasoning: gpt-4-turbo, claude-3-opus
  • Long documents: claude-3-opus (200K), gemini-pro (2M)
  • Cost-sensitive: gemini-flash, gpt-3.5-turbo

5. Use Caching

Cache LLM responses for repeated queries:

import "github.com/aixgo-dev/aixgo/cache"

agent := aixgo.NewAgent(
    aixgo.WithName("cached-analyzer"),
    aixgo.WithCache(cache.NewRedisCache("localhost:6379")),
    aixgo.WithCacheTTL(1 * time.Hour),
)

Troubleshooting

Authentication Errors

Error: 401 Unauthorized

Solution:

  • Verify API key is correct
  • Check key has not expired
  • Ensure environment variable is loaded
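A simple guard at startup catches the missing-environment-variable case before the first API call:

import (
    "log"
    "os"
)

// Fail fast if the key was not exported or the .env file was not loaded.
if os.Getenv("OPENAI_API_KEY") == "" {
    log.Fatal("OPENAI_API_KEY is not set; export it or load your .env file first")
}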

Rate Limit Exceeded

Error: 429 Too Many Requests

Solution:

  • Implement exponential backoff
  • Reduce request rate
  • Upgrade to higher tier
  • Add multiple API keys for rotation
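If you retry outside of Aixgo's retry block (shown earlier), a minimal exponential backoff loop looks like the sketch below; callLLM stands in for whatever request you are retrying:

import (
    "fmt"
    "math/rand"
    "time"
)

// retryWithBackoff retries callLLM with exponential backoff and jitter,
// roughly mirroring the retry settings from the configuration above.
func retryWithBackoff(maxAttempts int, callLLM func() error) error {
    delay := time.Second
    var err error
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        if err = callLLM(); err == nil {
            return nil
        }
        // Sleep with jitter, then double the delay up to a 10s ceiling.
        time.Sleep(delay + time.Duration(rand.Int63n(int64(delay/2))))
        delay *= 2
        if delay > 10*time.Second {
            delay = 10 * time.Second
        }
    }
    return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}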

Timeout Errors

Error: Request timeout

Solution:

agents:
  - name: patient-agent
    role: react
    model: gpt-4-turbo
    timeout: 60s # Increase timeout
