HA Multi-AZ Setup

If you need High Availability (HA) for a Kloudfuse cluster, you can deploy it across multiple availability zones (multi-AZ). A multi-AZ setup minimizes downtime and keeps observability workflows running in the event of a zone failure. Each zone must have at least 2 nodes to run the supported components.

Benefits of Multi-AZ Deployment

  • Protects against zone-level failures

  • Ensures service continuity

  • Balances workload across zones

  • Aligns with SRE best practices

Prerequisites

Before you begin:

  • Kubernetes Cluster Infrastructure

• A fresh installation; an existing Kloudfuse cluster cannot be upgraded to a multi-AZ setup.

    • Nodes to be set up across 3 availability zones (required)

    • Each availability zone must have an equal number of nodes (required)

    • The total number of nodes must be a multiple of 6 (required)

• Nodes must not have taints other than those used by Kloudfuse (required). Additional taints cause validation to fail, and the tainted nodes are not used.

• Cloud-managed PostgreSQL instance

    • One of the following cloud-provider managed services:

      • AWS RDS

      • GCP Cloud SQL

      • Azure Database for PostgreSQL

    • PostgreSQL must be reachable from all 3 availability zones.

    • PostgreSQL version 14.11+

    • Kubernetes Secret with PostgreSQL credentials

• AWS-only

    • An NLB (Network Load Balancer) and Kloudfuse DNS mapping

    • Use an NLB for the Kloudfuse DNS mapping. Because Elastic IPs are AZ-specific, the DNS record for the Kloudfuse endpoint must be a CNAME pointing to the NLB DNS name so that traffic is still routed correctly during a zone failure.
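As a sketch, the PostgreSQL credentials Secret mentioned in the prerequisites might look like the following. The Secret name, namespace, and key shown here are illustrative assumptions, not Kloudfuse-mandated values; consult your installation guide for the exact names expected:

```yaml
# Hypothetical example of a credentials Secret for the managed PostgreSQL instance.
# The name, namespace, and key are assumptions for illustration only.
apiVersion: v1
kind: Secret
metadata:
  name: kfuse-pg-credentials
  namespace: kfuse
type: Opaque
stringData:
  password: <postgres-password>
```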

Step 1: Configure Helm Values

In the custom_values.yaml file, configure the following fields:

global:
  cloudProvider: <aws | gcp | azure>
  numNodes: <Total number of nodes across all zones>
  multiAzDeployment:
    enabled: true
  configDB:
    host: <Postgres host for configDB>
  orchestratorDB:
    host: <Postgres host for orchestratorDB>

installKfusePgCredentials: false

This configuration ensures Kloudfuse uses external PostgreSQL and skips deploying its own credentials secret.
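For reference, a filled-in version of the block above for a hypothetical AWS deployment with 6 nodes might look like this; the endpoint hostnames are placeholder assumptions, not real values:

```yaml
global:
  cloudProvider: aws
  numNodes: 6                   # 2 nodes in each of the 3 availability zones
  multiAzDeployment:
    enabled: true
  configDB:
    host: kfuse-configdb.example.us-east-1.rds.amazonaws.com   # hypothetical RDS endpoint
  orchestratorDB:
    host: kfuse-orchdb.example.us-east-1.rds.amazonaws.com     # hypothetical RDS endpoint

installKfusePgCredentials: false
```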

Step 2: Disable Embedded PostgreSQL

To use cloud-managed PostgreSQL, disable the internal PostgreSQL services:

ingester:
  postgresql:
    enabled: false

kfuse-configdb:
  enabled: false

Step 3: Automatic Scaling and Anti-Affinity Rules

When multiAzDeployment.enabled is set to true, Kloudfuse automatically:

  • Adjusts replicaCount for services based on global.numNodes

  • Applies pod anti-affinity rules to distribute replicas across availability zones

Do not manually set replicaCount for most services.
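The generated rules are conceptually similar to a standard Kubernetes pod anti-affinity clause keyed on the zone topology label. The sketch below shows the general shape only; the exact labels and selectors Kloudfuse generates may differ:

```yaml
# Conceptual sketch of zone-level pod anti-affinity (not the exact generated spec).
# Pods matching the label selector are kept on nodes in different availability zones.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: <service-name>
        topologyKey: topology.kubernetes.io/zone
```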

Service Behavior in Multi-AZ Mode

These services auto-scale based on the number of nodes in the cluster.

  • advance-functions-service

  • events-query-service

  • ingester

  • kafka

  • llm-evaluation-service

  • llm-query-service

  • logs-parser

  • logs-query-service

  • logs-transformer

  • kfuse-observability-agent

  • kfuse-vector

  • metrics-transformer

  • pinot

  • query-service

  • rum-query-service

  • trace-query-service

  • trace-transformer

  • zapper

Services that, by default, will have 3 replicas (1 per zone)

  • beffe

  • ingress-nginx

  • kafka zookeeper

  • kfuse-profiling-server (requires cloud storage)

  • pinot zookeeper

  • redis

  • ui

Services that always use 1 replica

These components remain single-instance. On zone failure, Kubernetes reschedules the pod to a healthy node in another zone.

  • az-service

  • config-mgmt-service

  • grafana

  • hydration-service

  • kfuse-auth

  • kfuse-cloud-exporter (scraper and exporters)

  • kfuse-profiler

  • kfuse-saml

  • rule-manager

  • user-mgmt-service

Additional Notes

  • Kafka and Pinot replication factors are managed automatically

  • Do not override replicaCount unless specifically instructed to do so

  • global.numNodes drives replica scaling