LiteLLM Proxy

LiteLLM Proxy Integration

Enterprise developers often need a unified routing layer. With LiteLLM, you can easily load-balance and route requests between your local Privane models and fallback cloud APIs (like OpenAI or Anthropic) seamlessly.

This architecture ensures you get the absolute lowest latency and zero-cost inference for 90% of requests (handled locally by Privane), while seamlessly falling back to a massive 70B+ parameter cloud model only for complex queries.

Configuring LiteLLM

Install the LiteLLM proxy:

pip install litellm

Create a config.yaml that routes the default traffic to Privane, and complex traffic to OpenAI:

model_list:
  # Route 1: Local Sovereign AI (Zero Cost, Zero Latency)
  - model_name: "default-model"
    litellm_params:
      model: "openai/gemma-2b"
      api_base: "http://localhost:8080/v1"
      api_key: "privane"
 
  # Route 2: Cloud Fallback
  - model_name: "complex-model"
    litellm_params:
      model: "gpt-4o"
      api_key: "os.environ/OPENAI_API_KEY"
 
router_settings:
  routing_strategy: usage-based-routing
  fallback_models: ["complex-model"]

Running the Proxy

litellm --config config.yaml --port 4000

Now, your application simply points to http://localhost:4000. LiteLLM will automatically route standard traffic to the Privane local server running on localhost:8080, vastly reducing your cloud API bills while maximizing data privacy.