HA Multi-AZ Setup
If you need High Availability (HA) for a Kloudfuse cluster, you can deploy it across multiple availability zones (multi-AZ). A multi-AZ setup minimizes downtime and keeps observability workflows running if a zone fails. Each zone must have at least 2 pods to run the supported components.
Benefits of Multi-AZ Deployment
- Protects against zone-level failures
- Ensures service continuity
- Balances workload across zones
- Aligns with SRE best practices
Prerequisites
Before you begin:
- Kubernetes Cluster Infrastructure
  - A fresh installation. An existing Kloudfuse cluster cannot be upgraded to a Multi-AZ setup.
  - Nodes set up across 3 availability zones (required)
  - An equal number of nodes in each availability zone (required)
  - A total number of nodes that is a multiple of 6 (required)
  - No node taints other than those used by Kloudfuse (required). Additional taints cause validation to fail, and the tainted nodes will not be used.
- Cloud Managed PostgreSQL instance
  - Cloud provider managed PostgreSQL instance:
    - AWS RDS
    - GCP Cloud SQL
    - Azure Database for PostgreSQL
  - PostgreSQL must be available to all 3 availability zones.
  - PostgreSQL version 14.11+
  - Kubernetes Secret with PostgreSQL credentials
- AWS-only
  - An NLB and Kloudfuse DNS mapping
    - Use an NLB (Network Load Balancer) for Kloudfuse DNS mapping. Because Elastic IPs are AZ-specific, the DNS record for the Kloudfuse endpoint must be a CNAME pointing to the NLB DNS name, so that traffic is still routed during a zone failure.
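The node-layout prerequisites above reduce to simple arithmetic, so they can be checked before installation. The following is a minimal sketch, not a Kloudfuse tool: the function name check_node_layout and the zone-to-node-count mapping are hypothetical, and in practice the counts might be derived from the topology.kubernetes.io/zone node label.

```python
from typing import Dict, List

REQUIRED_ZONES = 3   # nodes must span 3 availability zones
NODE_MULTIPLE = 6    # total node count must be a multiple of 6


def check_node_layout(nodes_per_zone: Dict[str, int]) -> List[str]:
    """Return a list of prerequisite violations (empty if the layout is valid).

    nodes_per_zone is a hypothetical input: zone name -> number of nodes.
    """
    problems = []
    counts = list(nodes_per_zone.values())
    total = sum(counts)

    if len(nodes_per_zone) != REQUIRED_ZONES:
        problems.append(
            f"expected {REQUIRED_ZONES} zones, found {len(nodes_per_zone)}"
        )
    if len(set(counts)) > 1:
        problems.append(f"zones have unequal node counts: {nodes_per_zone}")
    if total == 0 or total % NODE_MULTIPLE != 0:
        problems.append(
            f"total nodes ({total}) is not a positive multiple of {NODE_MULTIPLE}"
        )
    return problems


if __name__ == "__main__":
    # Hypothetical zone names; 2 nodes per zone is the smallest valid layout.
    print(check_node_layout({"zone-a": 2, "zone-b": 2, "zone-c": 2}))
    # 3 nodes per zone is balanced, but 9 is not a multiple of 6.
    print(check_node_layout({"zone-a": 3, "zone-b": 3, "zone-c": 3}))
```

With three equal zones, the multiple-of-6 rule implies an even number of nodes per zone, so the smallest valid cluster has 2 nodes in each zone.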
Step 1: Configure Helm Values
In the custom_values.yaml, configure the following fields:
global:
  cloudProvider: <aws | gcp | azure>
  numNodes: <Total number of nodes across all zones>
  multiAzDeployment:
    enabled: true
  configDB:
    host: <Postgres host for configDB>
  orchestratorDB:
    host: <Postgres host for orchestratorDB>
installKfusePgCredentials: false
This configuration ensures Kloudfuse uses external PostgreSQL and skips deploying its own credentials secret.
Step 2: Disable Embedded PostgreSQL
To use cloud-managed PostgreSQL, disable the internal PostgreSQL services:
ingester:
  postgresql:
    enabled: false
kfuse-configdb:
  enabled: false
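The settings from Steps 1 and 2 can be sanity-checked together once the values file has been parsed into a dictionary. This is a rough sketch under assumptions: the helper names get_path and check_multi_az_values are hypothetical, the key nesting used here (for example multiAzDeployment under global) is an assumption that should be confirmed against your chart, and loading the YAML file itself is left out.

```python
from typing import Any, Dict, List


def get_path(values: Dict[str, Any], path: str) -> Any:
    """Walk a dotted path through nested dicts; return None if a key is missing."""
    node: Any = values
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def check_multi_az_values(values: Dict[str, Any]) -> List[str]:
    """Return the settings that differ from what Steps 1 and 2 describe."""
    # Assumed key layout; adjust if your chart nests these differently.
    expected = {
        "global.multiAzDeployment.enabled": True,
        "installKfusePgCredentials": False,
        "ingester.postgresql.enabled": False,
        "kfuse-configdb.enabled": False,
    }
    problems = [
        f"{path} should be {want!r}, found {get_path(values, path)!r}"
        for path, want in expected.items()
        if get_path(values, path) is not want
    ]
    # Both external PostgreSQL hosts must be filled in.
    for path in ("global.configDB.host", "global.orchestratorDB.host"):
        if not get_path(values, path):
            problems.append(f"{path} is not set")
    return problems


if __name__ == "__main__":
    # Hypothetical values; the host names are placeholders.
    values = {
        "global": {
            "cloudProvider": "aws",
            "numNodes": 6,
            "multiAzDeployment": {"enabled": True},
            "configDB": {"host": "configdb.example.internal"},
            "orchestratorDB": {"host": "orchestratordb.example.internal"},
        },
        "installKfusePgCredentials": False,
        "ingester": {"postgresql": {"enabled": False}},
        "kfuse-configdb": {"enabled": False},
    }
    print(check_multi_az_values(values))
```

A check like this catches the most likely slip, which is leaving the embedded PostgreSQL enabled while also pointing at an external host.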
Step 3: Automatic Scaling and Anti-Affinity Rules
When multiAzDeployment.enabled is set to true, Kloudfuse automatically:
- Adjusts replicaCount for services based on global.numNodes
- Applies pod anti-affinity rules to distribute replicas across availability zones
Do not manually set replicaCount for most services.
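To illustrate what spreading replicas across zones achieves, the sketch below places replicas evenly over three zones and counts how many remain after one zone is lost. It is a toy model, not Kloudfuse's scheduling logic or its replica formula: the function names and zone names are hypothetical, and the replica counts are arbitrary inputs.

```python
from typing import Dict, List


def spread_replicas(replicas: int, zones: List[str]) -> Dict[str, int]:
    """Assign replicas to zones round-robin, mimicking an even zone spread."""
    placement = {zone: 0 for zone in zones}
    for i in range(replicas):
        placement[zones[i % len(zones)]] += 1
    return placement


def surviving_replicas(placement: Dict[str, int], failed_zone: str) -> int:
    """Count replicas still running after one zone becomes unavailable."""
    return sum(count for zone, count in placement.items() if zone != failed_zone)


if __name__ == "__main__":
    zones = ["zone-a", "zone-b", "zone-c"]  # hypothetical zone names
    for replicas in (1, 3, 6):
        placement = spread_replicas(replicas, zones)
        left = surviving_replicas(placement, "zone-a")
        print(f"{replicas} replicas -> {placement}, {left} left if zone-a fails")
```

In this model, a service with one replica per zone keeps two of three replicas through a zone failure, while a single-replica service has none left until its pod is rescheduled in another zone.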
Service Behavior in Multi-AZ Mode
These services will auto-scale based on the number of nodes that are being used in the cluster.
- advance-functions-service
- events-query-service
- ingester
- kafka
- llm-evaluation-service
- llm-query-service
- logs-parser
- logs-query-service
- logs-transformer
- kfuse-observability-agent
- kfuse-vector
- metrics-transformer
- pinot
- query-service
- rum-query-service
- trace-query-service
- trace-transformer
- zapper
Services that, by default, will have 3 replicas (1 per zone)
- beffe
- ingress-nginx
- kafka zookeeper
- kfuse-profiling-server (requires cloud storage)
- pinot zookeeper
- redis
- ui
Services that always use 1 replica
These components remain single-instance. On zone failure, Kubernetes reschedules the pod to a healthy node in another zone.
- az-service
- config-mgmt-service
- grafana
- hydration-service
- kfuse-auth
- kfuse-cloud-exporter (scraper and exporters)
- kfuse-profiler
- kfuse-saml
- rule-manager
- user-mgmt-service
Additional Notes
- Replication factors are managed automatically for Kafka and Pinot.
- Do not override replicaCount unless given specific guidance to do so.
- global.numNodes is used to influence replica scaling.