Load balancing

Load balancing is a method used to distribute incoming requests across multiple instances of the same model. It can be used to improve performance of the system by distributing the load or by the selection of the model with the best performance. Load balancing can be configured globally for all model groups:

load_balancer:
  default: "round-robin"

or for a specific model group:

llm_groups:
  gpt-4o:
    models:
      - azure-us-gpt4o
      - azure-uk-gpt4o
      - azure-sweden-gpt4o
    load_balancer: "epsilon-greedy"

Round Robin load balancing

load_balancer:
  default: "round-robin"

Round Robin is a simple load balancing strategy, where the requests are distributed evenly across all models in the group. Requests are distributed in a cycle, where each model is selected once per cycle. This strategy is useful when all models have the similar performance and you want to distribute the load evenly.

When some model is unavailable, the load balancer will skip it and continue with the next model in the cycle.

Multi-arm bandit based load balancing

Multi-arm bandit-based load balancing strategies are used to select the best model based on the performance of the models in the group. The load balancer keeps track of the performance of each model and uses this information to select the best model for the next request. This strategy contains two phases: exploration and exploitation. In the exploration phase, the load balancer tries to assess the performance of each model in the group. Exploitation then uses the information gathered from exploration to select the best target model for the next request.

Epsilon greedy

load_balancer:
  default: "epsilon-greedy"
  epsilon: 0.1

Epsilon greedy is a simple implementation of the multi-arm bandit load balancing strategy. In the exploration phase, the load balancer selects a random model from the group and sends the request to it. The exploitation phase then selects the model with the best performance based on the exploration phase. The epsilon parameter defines the probability of the exploration phase in each request. The higher the epsilon, the more often the load balancer will explore the models in the group.

Health check

load_balancer:
  default: "health-check"

health_check:
  enabled: True
  interval_seconds: 900
  excluded_groups: [
    "text-embedding-3-large"
  ]
  health_check_prompt: "Translate 'Hello, how are you?' to French."
  user: health-check
  project: health-check-project

Health check works similarly to the epsilon greedy strategy, but the exploration phase is handled by the health check. The health check periodically sends requests to all models in the group and evaluates the performance of each model. The load balancer then only sends requests to the models with the best performance.

Example of the health check load balancing strategy:

Sweden model returns best performance, so the load balancer will send the next request to this model.

Priorities

Priorities allow you to prefer certain models over others in the selected group. When a user sends a request to the group, models with the highest priority are chosen to process the request. If all models with the highest priority are inaccessible, then the request is sent to models with a lower priority. A lower number means higher priority. If two models have the same priority, the gateway will choose one based on the configured load balancing strategy. Priority can be any positive integer. By default, all models have their priority set to 50. Priorities can be used in many ways, for example, to prefer geographically closer models or to prefer PTU instances over non-PTU instances.

  azure-uk-gpt35:
    deployment_name: gpt-35-turbo
    type: AzureOpenAIText
    url: "https://dummy-model-models-cognitive-account-uk.openai.azure.com"
    api_key: "plaintext:secret"
    cost_profile: azure-gpt35-turbo
    priority: 1

Usage example - PTU

Prioritize single instance of model endpoint with PTU over model endpoints without PTU. Call non-PTU instances only when PTU instance is not available.

llms:    
   # Azure GPT-4.0
  azure-us-gpt4o-ptu:
    deployment_name: gpt-4o
    type: AzureOpenAIText
    url: "https://models-cognitive-account-eastus2.openai.azure.com"
    api_key: "plaintext:secret"
    cost_profile: azure-gpt-4o
    priority: 1
  
  azure-uk-gpt4o:
    deployment_name: gpt-4o
    type: AzureOpenAIText
    url: "https://models-cognitive-account-uk.openai.azure.com"
    api_key: "plaintext:secret"
    cost_profile: azure-gpt-4o
  
  azure-sweden-gpt4o:
    deployment_name: gpt-4o
    type: AzureOpenAIText
    url: "https://models-cognitive-account-swedencentral.openai.azure.com"
    api_key: "plaintext:secret"
    cost_profile: azure-gpt-4o

llm_groups:
  gpt-4o:
    models:
      # Prioritize US PTU model, UK and Sweden models are the next option
      - azure-us-gpt4o-ptu
      - azure-uk-gpt4o
      - azure-sweden-gpt4o

cost_profiles:
  - id: azure-gpt-4o
    usd_per_1k_input_tokens: 0.005
    usd_per_1k_output_tokens: 0.015

load_balancer:
  default: "round-robin"

Non-PTU instances don’t have a set priority, so they have a default priority of 50.

Usage example - prefer geographically closer models

Prioritize models based on their location. Call the closest models first, and if they are not available, call the other ones.

llms:    
   # Azure GPT-4.0
  azure-us-gpt4o:
    deployment_name: gpt-4o
    type: AzureOpenAIText
    url: "https://models-cognitive-account-eastus2.openai.azure.com"
    api_key: "plaintext:secret"
    cost_profile: azure-gpt-4o
    priority: 2
  
  azure-uk-gpt4o:
    deployment_name: gpt-4o
    type: AzureOpenAIText
    url: "https://models-cognitive-account-uk.openai.azure.com"
    api_key: "plaintext:secret"
    cost_profile: azure-gpt-4o
    priority: 1
  
  azure-sweden-gpt4o:
    deployment_name: gpt-4o
    type: AzureOpenAIText
    url: "https://models-cognitive-account-swedencentral.openai.azure.com"
    api_key: "plaintext:secret"
    cost_profile: azure-gpt-4o
    priority: 1

llm_groups:
  gpt-4o:
    models:
      # Prioritize UK and Sweden models, US model is the last option
      - azure-us-gpt4o
      - azure-uk-gpt4o
      - azure-sweden-gpt4o

cost_profiles:
  - id: azure-gpt-4o
    usd_per_1k_input_tokens: 0.005
    usd_per_1k_output_tokens: 0.015

load_balancer:
  default: "round-robin"