Cross-AZ Recovery
Kloudfuse supports a cost-efficient multi-AZ deployment model that uses cloud storage (S3, GCS, or Azure Blob) as a synchronization mechanism between availability zones. This avoids continuous cross-AZ data transfer costs while still enabling fast failover and full data recovery after an AZ outage.
Deployment Architecture
The design keeps each Kloudfuse cluster isolated within a single AZ, while sharing cloud storage across AZs at the region level.
Key characteristics:
- Node groups are deployed within a single AZ. Kloudfuse 2 on AZ2 can be pre-created with zero nodes running, keeping standby costs near zero.
- Load balancer is deployed in a multi-AZ configuration and routes all traffic to the active cluster. It is the only component that spans AZs in normal operation.
- Cloud storage (S3 / GCS / Azure Blob) is region-scoped and accessible from all AZs. It serves as the shared synchronization layer between the two clusters.
- Pinot segments are continuously backed up to cloud storage as they are created.
- PostgreSQL (dashboards, alerts, configuration) is periodically dumped to cloud storage.
- AZ-Service: the az-service backs up its configuration databases (configdb, orchestratordb) to cloud storage. Starting with release 3.4.2, az-service is enabled by default.
Prerequisites
Before you begin:
- A fresh installation for both AZ1 and AZ2. An existing Kloudfuse cluster cannot be upgraded to an AZ failover setup.
- Cloud storage must be set up and accessible from both AZs within the region.
- ingress-nginx must be installed separately from Kloudfuse; it must not be installed as part of the Kloudfuse cluster configuration.
| For a cold standby, after the initial install, you can scale the number of nodes for AZ2 down to zero. This reduces the cost of instances that you are not actually using. |
Building the Kloudfuse Clusters
Both Kloudfuse clusters must be installed in the same Kubernetes cluster, but each must run on nodes tied to a specific AZ: the cluster for AZ1 must have all of its nodes in AZ1, and the backup cluster must be associated with nodes that run only in AZ2. To do this, add nodeAffinity and tolerations in the global section of custom_values.yaml.
Refer to the documentation at Node Affinity and Tolerations on how to set this in custom_values.yaml.
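As an illustration, a per-AZ pin in each cluster's custom_values.yaml might look like the following sketch. The placement of these keys under global is an assumption here; the exact key names are given in the Node Affinity and Tolerations documentation. The affinity term itself uses the standard Kubernetes topology.kubernetes.io/zone label, and the zone values are placeholders.

```yaml
# Sketch only: key placement under `global` is an assumption,
# per the Node Affinity and Tolerations documentation.
global:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - <az1-zone-id>   # for the AZ1 cluster; use the AZ2 zone ID in the backup cluster's file
  tolerations: []                   # add tolerations here if the AZ nodes are tainted
```

With this in place, every Kloudfuse pod in a given release schedules only onto nodes in that release's AZ, which is what keeps each cluster isolated to its zone.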
Ingress Installation
A dedicated ingress-nginx controller must be installed as a standalone Helm release, separate from the Kloudfuse chart. Disable the bundled ingress-nginx in custom-values.yaml.
See Standalone Nginx Ingress for full configuration and installation instructions.
AZ-Service
az-service reads its cloud storage configuration from global.cloudStorage, which is the recommended approach since all Kloudfuse components share the same storage account. The zone field is always required under az-service.config. An az-service-specific override is available if the service needs to write to a different bucket.
Cloud Storage Configuration (global.cloudStorage)
Set global.cloudStorage in your custom-values.yaml. az-service will use this configuration automatically alongside the rest of the Kloudfuse components.
global:
  cloudStorage:
    type: <s3 | gcs | azure> (1)
    useSecret: true (2)
    secretName: cloud-storage-secret (3)
    s3:
      region: <region> (4)
      bucket: <bucket> (5)
    # gcs:
    #   bucket: <bucket>
    # azure:
    #   container: <container>
az-service:
  config:
    zone: "<zone-id>" (6)
    backupIntervalMinutes: 60 (7)
| 1 | type: Cloud storage provider — s3, gcs, or azure. |
| 2 | useSecret: Set to true to load credentials from a Kubernetes secret. Set to false to use the node’s ambient credentials (e.g., an IAM instance role or Workload Identity). |
| 3 | secretName: Name of the Kubernetes secret containing cloud storage credentials. See Cloud Storage Credentials for how to create the secret for each provider. |
| 4 | s3.region: AWS region of the S3 bucket. |
| 5 | s3.bucket: S3 bucket name. |
| 6 | zone: A unique identifier for the backup path within the bucket. If the bucket is shared across clusters, this prevents data from being written to the same folder. Defaults to <orgId>-<Release.Name>-<Release.Namespace> if not set. |
| 7 | backupIntervalMinutes: How often az-service backs up PostgreSQL to cloud storage. Defaults to 60. |
Cloud Storage Credentials
Create the Kubernetes secret before running the Helm install. The secret name must match cloudStorage.secretName in your configuration.
S3 — the secret must contain accessKey and secretKey:
kubectl create secret generic cloud-storage-secret \
--from-literal=accessKey=<accessKey> \
--from-literal=secretKey='<secretKey>'
GCS — the secret must contain the JSON credential file named secretKey:
kubectl create secret generic cloud-storage-secret --from-file=./secretKey
Azure — the secret must contain the storage account connectionString:
kubectl create secret generic cloud-storage-secret \
--from-literal=connectionString='<connectionString>'
Overriding Cloud Storage for AZ-Service
If az-service needs to write backups to a different bucket than the one used by the rest of Kloudfuse, set az-service.config.cloudStorage explicitly. This takes precedence over global.cloudStorage for az-service only.
az-service:
  config:
    zone: "<zone-id>"
    cloudStorage:
      type: s3
      useSecret: true
      secretName: az-service-storage-secret
      s3:
        region: <region>
        bucket: <az-service-bucket>
Failover Steps
The following steps shift traffic from Kloudfuse 1 (AZ1) to Kloudfuse 2 (AZ2). These steps apply whether you are responding to a failure or executing a planned migration. The same steps are used in reverse to move back from AZ2 to AZ1.
Switching steps
- Install secrets in Kloudfuse AZ2.
  Re-create all secrets from AZ1 in the AZ2 cluster before running the Helm install. The following are the most common required secrets:
  - Image pull credentials (required to pull Kloudfuse container images)
  - Cloud storage secret (required by az-service for backup and restore)
  - PostgreSQL credentials (if using a cloud-managed database)
  - TLS certificate (if HTTPS is enabled on the cluster)
  - SSO metadata secret (required for SAML authentication)

  This is not intended to be a complete list, but a reminder that a number of systems may require secrets to be set up.
- Scale up Node Group 2 on AZ2.
  If the node group was pre-created with zero nodes, scale it up to the required node count. This step may take some time as the cloud provider launches new nodes.
- Do a helm upgrade on the cluster in AZ2.
  Run the Helm upgrade command to install or update Kloudfuse on the AZ2 cluster. Ensure custom-values.yaml reflects the AZ2 configuration.

  helm upgrade --install kfuse oci://us-east1-docker.pkg.dev/mvp-demo-301906/kfuse-helm/kfuse \
    -n kfuse \
    --version <VERSION> \ (1)
    -f custom-values.yaml

  | 1 | Replace <VERSION> with a valid Kloudfuse release value. See Release Notes for the latest release. |
- Update the Nginx controller to route traffic to Kloudfuse AZ2.
  Update the load balancer to direct traffic to AZ2. From this point, Kloudfuse 2 is accessible for UI access and data ingestion. No data loss occurs at this step.

  helm upgrade --install <release-name> ingress-nginx/ingress-nginx \ (1)
    --namespace <release-name> --create-namespace \
    --version <version> \ (2)
    -f values.yaml

  | 1 | Replace <release-name> with a name that identifies the AZ (for example, ingress-az1 or ingress-az2). Use the same value for both the release name and namespace so each AZ controller is isolated. |
  | 2 | Replace <version> with the ingress-nginx chart version required for your environment. See the ingress-nginx releases page for available versions. |
- Port-forward az-service on AZ1 and AZ2.
  Open two terminal sessions, one for each cluster. Switch context, set the namespace, and forward the az-service port so the curl commands in the steps below can reach each cluster locally. Replace <AZ1_CONTEXT> and <AZ2_CONTEXT> with the context names from kubectl config get-contexts.

  AZ1:

  kubectl config use-context <AZ1_CONTEXT>
  kubectl config set-context --current --namespace kfuse
  kubectl port-forward svc/az-service 8081:8080

  AZ2 (separate terminal):

  kubectl config use-context <AZ2_CONTEXT>
  kubectl config set-context --current --namespace kfuse
  kubectl port-forward svc/az-service 8082:8080

  Keep both port-forward sessions running for the duration of the failover. The subsequent curl commands use localhost:8081 for AZ1 and localhost:8082 for AZ2.
- Pause Kloudfuse AZ1.
  Silences alerts and pauses Pinot consumption on AZ1, ensuring all in-flight data is flushed to Kafka before the segment state is transferred.

  curl -X POST https://localhost:8081/pause

  To bypass the Kafka consumption check (for example, if AZ1 is unreachable):

  curl -X POST "https://localhost:8081/pause?force=true"
- Activate Kloudfuse AZ2.
  Restores PostgreSQL (configdb, orchestratordb) from the S3 backup, rehydrates missing Pinot segments from cloud storage, and resumes Pinot consumption. The zone parameter must match the zone ID used by AZ1's az-service.

  curl -X POST "https://localhost:8082/activate?zone=<zone-id>"

  To restore to a specific point in time, add the latestTime parameter:

  curl -X POST "https://localhost:8082/activate?zone=<zone-id>&latestTime=<timestamp>"

  To monitor rehydration progress:

  curl https://localhost:8082/status
- Resume alerts on Kloudfuse AZ2.
  Unsilences alerts once activation is complete and you have verified that the cluster is healthy.

  curl -X POST https://localhost:8082/resume
| The new cluster must use the same cloud storage configuration (S3, GCS, or Azure) as the original backed-up cluster, including the same bucket or container and credentials. The zone configuration may differ from the original cluster. |
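The pause, activate, and resume calls above can be assembled into a single script for review before a failover. The sketch below only builds the curl commands from the steps (the zone value is a placeholder, and the ports match the port-forward examples); it prints the sequence rather than executing it, so nothing runs until you invoke the commands yourself.

```shell
#!/bin/sh
# Sketch of the failover call sequence from the steps above.
# ZONE is a placeholder; 8081/8082 match the port-forward examples.
ZONE="<zone-id>"
AZ1="https://localhost:8081"
AZ2="https://localhost:8082"

pause_cmd="curl -X POST ${AZ1}/pause"
activate_cmd="curl -X POST \"${AZ2}/activate?zone=${ZONE}\""
status_cmd="curl ${AZ2}/status"
resume_cmd="curl -X POST ${AZ2}/resume"

# Print the sequence for review; run each command by hand once
# the port-forward sessions from the earlier step are in place.
printf '%s\n' "$pause_cmd" "$activate_cmd" "$status_cmd" "$resume_cmd"
```

Keeping the script print-only is deliberate: activation restores databases and should only be triggered after you have confirmed that AZ1 is paused (or unreachable, in which case add force=true to the pause call).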
AZ-Service command reference
The az-service exposes four endpoints. All write operations use POST.
Pause
Silences alerts and pauses Pinot consumption. Use this on the cluster you are failing away from.
curl -X POST https://<KFUSE_ENDPOINT>/pause
| Parameter | Required | Description |
|---|---|---|
| force | No | Set to true to bypass the Kafka consumption check (for example, when the cluster is unreachable). |
Activate
Restores configdb and orchestratordb from S3, rehydrates missing Pinot segments from cloud storage, and resumes Pinot consumption. Use this on the cluster you are failing over to.
curl -X POST "https://<KFUSE_ENDPOINT>/activate?zone=<zone-id>"
| Parameter | Required | Description |
|---|---|---|
| zone | Yes | Zone ID of the source backup. Must match the zone used by the source cluster's az-service. |
| latestTime | No | Restore to a specific point in time (timestamp). |
|  | No | Kubernetes namespace of the source cluster. When provided, |
| force | No | Passed through to the source cluster pause call when |
|  | No | Number of Pinot segments to rehydrate per batch. Defaults to the service configuration value. |
|  | No | Set to |