Deploy Kloudfuse in a Multi-AZ Kubernetes Cluster

Deploy Kloudfuse across multiple availability zones (multi-AZ) to ensure high availability and fault tolerance. A multi-AZ setup minimizes downtime and maintains observability workflows in the event of a zone failure. Each zone must have at least one pod to run the supported components.

Benefits of Multi-AZ Deployment

  • Protects against zone-level failures

  • Ensures service continuity

  • Balances workload across zones

  • Aligns with SRE best practices

Prerequisites

Before you begin:

  • Kubernetes Cluster Infrastructure

    • This must be a fresh installation. Multi-AZ support is not compatible with upgrades from a single-zone deployment.

    • A Kubernetes cluster with nodes across three availability zones is required.

    • Each zone must have an equal number of nodes, ideally grouped into separate node pools per zone.

    • The total number of nodes must be a multiple of 6.

    • Nodes must not have any additional taints. Kloudfuse performs strict taint validation and will disregard tainted nodes. (You can verify zone spread and taints with the commands after this list.)

  • Cloud Managed Postgres

    • Create managed PostgreSQL instances using one of the following cloud services:

      • AWS RDS

      • GCP Cloud SQL

      • Azure Database for PostgreSQL

    • PostgreSQL must run in all three availability zones.

    • Tested version: Postgres 14.11.

    • Create a Kubernetes secret named kfuse-pg-credentials that contains the base64-encoded Postgres password:

kubectl create secret generic kfuse-pg-credentials \
  --from-literal=postgres=<base64-encoded-password>

  • AWS NLB and Kfuse DNS mapping (on AWS)

    • Use an AWS NLB (Network Load Balancer) for Kloudfuse DNS mapping. Because Elastic IPs are AZ-specific, the DNS record for the Kloudfuse endpoint must be a CNAME pointing to the NLB DNS name so that traffic continues to route during a zone failure. (A quick DNS check is shown after this list.)
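
To check the node requirements before installing, you can list the nodes with their zone labels and taints. The commands below assume the standard topology.kubernetes.io/zone label that managed Kubernetes services apply to nodes.

# Nodes and their zone labels; expect exactly three zones with an equal node count in each.
kubectl get nodes -L topology.kubernetes.io/zone

# Taints per node; Kloudfuse disregards tainted nodes, so this should report <none>
# for every node intended to run Kloudfuse pods.
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints'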
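
A quick way to confirm the DNS setup is to check the record type of your Kloudfuse endpoint. The hostname below is a placeholder for your actual endpoint.

# kfuse.example.com is a placeholder; the answer should be a CNAME pointing to the
# NLB DNS name rather than A records for zonal Elastic IPs.
dig +short CNAME kfuse.example.com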

Step 1: Configure Helm Values

In your custom_values.yaml, configure the following fields:

global:
  cloudProvider: <aws | gcp | azure>
  numNodes: <Total number of nodes across all zones>
  multiAzDeployment:
    enabled: true
  configDB:
    host: <Postgres host for configDB>
  orchestratorDB:
    host: <Postgres host for orchestratorDB>

installKfusePgCredentials: false

This configuration ensures Kloudfuse uses external Postgres and skips deploying its own credentials secret.

Step 2: Disable Embedded PostgreSQL

To use cloud-managed Postgres, disable the internal PostgreSQL services:

ingester:
  postgresql:
    enabled: false

kfuse-configdb:
  enabled: false
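
The changes from Steps 1 and 2 take effect when you apply custom_values.yaml to your Helm release. This is a minimal sketch; the release name, chart reference, and namespace are placeholders, so use the same chart source as a standard Kloudfuse install.

# Placeholders: substitute your release name, chart reference, and namespace.
helm upgrade --install <release-name> <kfuse-chart> \
  --namespace <kloudfuse-namespace> \
  -f custom_values.yaml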

Step 3: Automatic Scaling and Anti-Affinity Rules

When multiAzDeployment.enabled is set to true, Kloudfuse automatically:

  • Adjusts replicaCount for services based on global.numNodes

  • Applies pod anti-affinity rules to distribute replicas across availability zones

You do not need to manually set replicaCount for most services.
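
To see the effect on a running cluster, you can inspect the scheduling constraints generated for any of the auto-scaled pods. The placeholders below follow the conventions used elsewhere in this guide, and the exact rule contents depend on your chart version.

# Prints the generated pod anti-affinity (if any) for the selected pod as JSON.
kubectl get pod <pod-name> -n <kloudfuse-namespace> -o jsonpath='{.spec.affinity.podAntiAffinity}'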

Service Behavior in Multi-AZ Mode

Services that auto-scale based on numNodes

  • kafka

  • ingester

  • logs-transformer

  • metrics-transformer

  • trace-transformer

  • pinot

  • query-service

  • advance-functions-service

  • events-query-service

  • trace-query-service

  • logs-query-service

  • llm-query-service

  • llm-evaluation-service

  • rum-query-service

  • zapper

  • kfuse-vector

  • kfuse-observability-agent

Logs Parser uses the Kafka topic partition count (logs_ingest_topic) to determine its replica count.
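
If you need to confirm the partition count that drives the Logs Parser replica count, you can describe the topic from a Kafka broker pod. The pod name, script name, and port below are assumptions for a typical Kafka StatefulSet and may differ in your deployment.

# Describe the logs ingest topic; the PartitionCount field determines the Logs Parser replica count.
kubectl exec -n <kloudfuse-namespace> kafka-0 -- \
  kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic logs_ingest_topic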

Services that default to 3 replicas (1 pod per zone)

  • ingress-nginx

  • kafka zookeeper

  • pinot zookeeper

  • redis

  • kfuse-profiling-server (requires cloud storage)

  • ui

  • beffe

Services that always use 1 replica

These components remain single-instance. On zone failure, Kubernetes reschedules the pod to a healthy node in another zone.

  • grafana

  • hydration-service

  • recorder

  • az-service

  • config-mgmt-service

  • rule-manager

  • user-mgmt-service

  • kfuse-auth

  • kfuse-saml

  • kfuse-cloud-exporter (scraper and exporters)

  • kfuse-profiler

Notes

  • Kafka topic partition replication factor and Pinot segment replication factor are automatically managed.

  • Most services calculate replica count automatically. Do not override unless required.

  • Set global.numNodes to influence replica scaling.

  • Ensure your cloud-managed PostgreSQL is accessible from all AZs.
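
One way to spot-check Postgres reachability from inside the cluster is to run a temporary client pod. The image and port are assumptions (PostgreSQL default 5432); pin the pod to nodes in specific zones with a node selector if you want per-zone confirmation.

# Launches a throwaway pod and checks that the managed Postgres endpoint accepts connections.
kubectl run pg-check --rm -it --restart=Never --image=postgres:14 -n <kloudfuse-namespace> -- \
  pg_isready -h <postgres-host> -p 5432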

Validation Checklist

After deployment, validate your setup:

kubectl get pods -o wide -n <kloudfuse-namespace>

Confirm:

  • Pods are distributed across three zones.

  • Critical services (like Redis, Kafka) have one pod in each zone.

  • Single-instance services are running; if a zone fails, Kubernetes reschedules them onto healthy nodes in the surviving zones.
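
The following commands summarize how many Kloudfuse pods landed on each node; cross-referencing the counts with the node-to-zone mapping shows the spread across zones. They assume the standard topology.kubernetes.io/zone node label.

# Node-to-zone mapping.
kubectl get nodes -L topology.kubernetes.io/zone

# Count of Kloudfuse pods per node; compare against the mapping above to confirm
# an even spread across the three zones.
kubectl get pods -n <kloudfuse-namespace> --no-headers \
  -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName' | awk '{print $2}' | sort | uniq -c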