Disaster Recovery on the Google Kubernetes Engine (GKE) Platform
Learn how to implement a disaster recovery plan on the Google Kubernetes Engine (GKE) platform. For more information, see Google documentation on Backup for GKE.
Google Cloud Backup
To handle region and availability zone failures, we recommend the following steps:
- Create a GKE cluster for the primary region and availability zone; see Google documentation for Creating a zonal cluster.
- Create another GKE cluster in the fail-over region and availability zone.
- For both GKE clusters, complete the DNS/TLS setup prerequisites as described in Configure HTTPS/TLS on the Kloudfuse Ingress or Configure Kloudfuse Stack to Work with TLS Termination, depending on your approach.
- Enable GKE backup and recovery for these clusters; see Google documentation on Backup for GKE.
- Create a GCS bucket with cross-region and availability zone access.
- Install the Kloudfuse helm chart in the primary GKE cluster; see Install Kloudfuse Using Helm.
  Add the following label to user-configured secrets and config maps, such as those for SSO/SAML setup, TLS certificates, and so on:
  app.kubernetes.io/instance: kfuse
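As an illustration of the labeling step, the following sketch wraps kubectl label in a small helper. The function name label_kfuse_resource and the KUBECTL override variable are hypothetical and exist only for this example; the namespace and resource names are whatever you used when you created the secrets and config maps.

```shell
# Hypothetical helper, not part of Kloudfuse: applies the label
# app.kubernetes.io/instance=kfuse to a user-created Secret or ConfigMap.
# Set KUBECTL=echo to print the command instead of running it.
KUBECTL="${KUBECTL:-kubectl}"

label_kfuse_resource() {
  # $1 = namespace, $2 = kind (secret or configmap), $3 = resource name
  "$KUBECTL" label "$2" "$3" \
    --namespace "$1" \
    --overwrite \
    app.kubernetes.io/instance=kfuse
}
```

For example, label_kfuse_resource <namespace> secret <secret-name> labels a single secret. The --overwrite flag makes the call safe to repeat if the label is already present.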
Configure Backup Policy
Set up the GKE backup policy based on the Recovery Point Objective (RPO) and retention policy of your organization.
This automatically creates backups of Kloudfuse installations to fulfill the RPO.
GKE retains the backups for the specified period, and deletes them after.
GKE uses different configuration parameters, depending on the RPO:
- For RPOs of 60 minutes or longer, use the option target-rpo-minutes.
- For RPOs less than 60 minutes, use the option cron-schedule.

Use the following example as a guide for setting up a GKE backup policy based on the target-rpo-minutes option.

gcloud beta container backup-restore backup-plans create kloudfuse-backup-plan \
  --project=<projectID> \
  --location=<location of the primary GKE cluster> \
  --cluster=<name of the primary GKE cluster> \
  --selected-applications=<namespace>/kloudfuse \
  --include-secrets \
  --target-rpo-minutes=60 \
  --backup-retain-days=1

- project: The ID of the Google Cloud project.
- location: The location of the primary GKE cluster, which is the region of the backup plan; see Google documentation on Available regions and zones.
- cluster: The name of the primary GKE cluster to back up.
- selected-applications: A list of protected applications; see Google documentation on Define custom backup and restore logic.
- include-secrets: An optional argument to include Secret resources if they are in the scope of the backup and restore policy.
- target-rpo-minutes: The target RPO, in minutes. The minimum is 60 minutes; for shorter RPOs, use the cron-schedule option instead.
- backup-retain-days: The data retention time, in days.

Use the following example as a guide for setting up a GKE backup policy based on the cron-schedule option.

The value mask is minute hour day-of-month month day-of-week. The * (wildcard) character is equivalent to all possible values.
For example, the value of 10 3 * * * creates a backup at 3:10 AM every day. All times are in UTC.

gcloud beta container backup-restore backup-plans create kloudfuse-backup-plan \
  --project=<project> \
  --location=<location of the primary GKE cluster> \
  --cluster=<name of the primary GKE cluster> \
  --selected-applications=<namespace>/kloudfuse \
  --include-secrets \
  --cron-schedule="10 3 * * *" \
  --backup-retain-days=1

- project: The ID of the Google Cloud project.
- location: The location of the primary GKE cluster, which is the region of the backup plan; see Google documentation on Available regions and zones.
- cluster: The name of the primary GKE cluster to back up.
- selected-applications: A list of protected applications; see Google documentation on Define custom backup and restore logic.
- include-secrets: An optional argument to include Secret resources if they are in the scope of the backup and restore policy.
- cron-schedule: The backup schedule, in cron format. Use it for RPOs of less than 60 minutes; for RPOs of 60 minutes or longer, use the target-rpo-minutes option instead.
- backup-retain-days: The data retention time, in days.
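The choice between the two scheduling options can be sketched as a small shell helper. This is an illustration only: the function name schedule_flag_for_rpo is hypothetical, and the "every N minutes" cron expression it prints for short RPOs is an assumption about how you might map an RPO to a schedule. Backup for GKE may enforce a minimum interval between scheduled backups, so check the Google documentation before using very small values.

```shell
# Hypothetical helper: print the backup-plan scheduling flag for an RPO
# given in whole minutes.
#   RPO >= 60 -> --target-rpo-minutes=<RPO>
#   RPO <  60 -> --cron-schedule=*/<RPO> * * * *   (assumed mapping)
schedule_flag_for_rpo() {
  rpo_minutes="$1"
  # Reject empty or non-numeric input.
  case "$rpo_minutes" in
    ''|*[!0-9]*)
      echo "RPO must be a whole number of minutes" >&2
      return 1
      ;;
  esac
  # Reject zero, which has no meaningful schedule.
  if [ "$rpo_minutes" -le 0 ]; then
    echo "RPO must be greater than zero" >&2
    return 1
  fi
  if [ "$rpo_minutes" -ge 60 ]; then
    printf '%s\n' "--target-rpo-minutes=$rpo_minutes"
  else
    printf '%s\n' "--cron-schedule=*/$rpo_minutes * * * *"
  fi
}
```

When you pass the result to gcloud, quote the command substitution, for example "$(schedule_flag_for_rpo 15)", so that the cron expression stays a single argument and the * characters are not expanded by the shell.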
Restore
- Configure the restore plan.

  gcloud beta container backup-restore restore-plans create kloudfuse-restore-plan \
    --project=<project> \
    --location=<location of the fail-over GKE cluster> \
    --backup-plan=kloudfuse-backup-plan \
    --cluster=<name of the fail-over GKE cluster> \
    --cluster-resource-conflict-policy=delete-and-restore \
    --selected-applications=<namespace>/kloudfuse \
    --volume-data-restore-policy=restore-volume-data-from-backup

  - project: The ID of the Google Cloud project.
  - location: The location of the fail-over GKE cluster, which is the region of the restore plan; see Google documentation on Available regions and zones.
  - backup-plan: The saved plan for Kloudfuse backups.
  - cluster: The name of the fail-over GKE cluster to restore into.
  - cluster-resource-conflict-policy: Defines how to handle cluster-scoped resources that already exist in the target cluster; see Google documentation for ClusterResourceConflictPolicy.
  - selected-applications: A list of protected applications; see Google documentation on Define custom backup and restore logic.
  - volume-data-restore-policy: Defines how to populate data for restored volumes. See Google documentation on how to Create a restore plan.

- When the primary region or availability zone fails, the administrator can restore Kloudfuse to the target (fail-over) GKE cluster.
  We recommend that you set up an automatic detection and alerting mechanism.

  gcloud beta container backup-restore restores create kloudfuse-restore \
    --project=<project> \
    --location=<location of the fail-over GKE cluster> \
    --restore-plan=kloudfuse-restore-plan \
    --backup=<latest GKE backup of Kloudfuse>

  - project: The ID of the Google Cloud project.
  - location: The location of the fail-over GKE cluster, which is the region of the restore plan; see Google documentation on Available regions and zones.
  - restore-plan: The saved plan for Kloudfuse restores.
  - backup: The backup to restore; must be located in the backup plan used by the restore plan.

- When using a regional static IP address for the load balancer, complete these additional steps:
  - In the Kloudfuse Ingress configuration, update the load balancer IP to the regional static IP for the fail-over region. See Configure Helm Values for Kloudfuse Ingress.
  - Update the DNS record to point to this new static IP.

- Alternatively, when using a global static IP address for the load balancer, be sure to "fence off" the failed GKE cluster. Otherwise, if the primary GKE cluster recovers, it can accidentally bind to the same static IP address.
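The fail-over sequence above can be sketched as three shell helpers: find the most recent successful backup, start the restore, and repoint DNS. Treat this as a sketch under assumptions, not a tested runbook. The function names and the GCLOUD override variable are hypothetical; the DNS step assumes the record is hosted in Cloud DNS; and the backup lookup assumes that the backup plan's region is still reachable and that backups expose the state and createTime fields used in the filter and sort.

```shell
# Sketch of the fail-over steps; every resource name is a placeholder.
# Set GCLOUD=echo to print the commands instead of running them.
GCLOUD="${GCLOUD:-gcloud}"

# Print the resource name of the newest successful backup in a backup plan.
latest_backup() {
  # $1 = project, $2 = backup plan location, $3 = backup plan name
  "$GCLOUD" beta container backup-restore backups list \
    --project="$1" --location="$2" --backup-plan="$3" \
    --filter="state=SUCCEEDED" \
    --sort-by="~createTime" --limit=1 \
    --format="value(name)"
}

# Start a restore into the fail-over cluster from the given backup.
start_restore() {
  # $1 = project, $2 = restore plan location, $3 = backup name
  "$GCLOUD" beta container backup-restore restores create kloudfuse-restore \
    --project="$1" --location="$2" \
    --restore-plan=kloudfuse-restore-plan \
    --backup="$3"
}

# Point an existing A record at the fail-over static IP (assumes Cloud DNS).
update_dns() {
  # $1 = project, $2 = managed zone, $3 = record name (with trailing dot), $4 = new IP
  "$GCLOUD" dns record-sets update "$3" \
    --project="$1" --zone="$2" --type=A --ttl=300 --rrdatas="$4"
}
```

A typical sequence would capture the output of latest_backup in a variable, pass it to start_restore, and call update_dns only after the restore completes and the fail-over ingress is healthy. The 300-second TTL is an arbitrary example value.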
Backup Limitations
Backup works on Kubernetes resources and underlying persistent volumes only.
It does not work for:
- Cluster configuration information: node configuration, node pools, initial cluster size, or enabled features.
- Container images referenced by a backup. If an image referenced by a workload manifest is removed from its image repository, a subsequent restore of that configuration does not restore the workload.
- Configuration information or state of services outside the cluster, such as Cloud SQL or external load balancers.
- Other volume types, such as Filestore NFS or Google Cloud NetApp Volumes. However, you can use Backup for GKE to provide solutions for workloads that are backed by Filestore volumes. See Handle Filestore volumes with Backup for GKE.
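Because cluster configuration is not part of a backup, one possible mitigation is to keep a copy of the cluster definition next to your other recovery material. The sketch below is an assumption about how you might do that, not a Kloudfuse or Google recommendation. The function name export_cluster_config and the GCLOUD override variable are hypothetical, and the --location flag assumes a reasonably recent gcloud release (older releases use --zone or --region).

```shell
# Hypothetical helper: save a cluster's definition (node pools, enabled
# features, and so on) as YAML for reference when rebuilding a cluster.
# Set GCLOUD=echo to write the command line to the file instead of running it.
GCLOUD="${GCLOUD:-gcloud}"

export_cluster_config() {
  # $1 = project, $2 = cluster location, $3 = cluster name, $4 = output file
  "$GCLOUD" container clusters describe "$3" \
    --project="$1" --location="$2" --format=yaml > "$4"
}
```

The exported YAML is a reference for recreating the fail-over cluster with matching settings; it is not an input that Backup for GKE consumes.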