Disaster Recovery - GKE

Learn how to implement a disaster recovery plan for the Google Kubernetes Engine (GKE) platform. For more information, see the Google documentation on Backup for GKE.

Google Cloud Backup

To handle region and availability zone failures, Kloudfuse recommends the following steps:

  1. Create a GKE cluster for the primary region and availability zone. See the Google documentation for Creating a zonal cluster.

  2. Create another GKE cluster in the failover region and availability zone.

  3. For both GKE clusters, complete the DNS/TLS setup prerequisites as described in Https Configuration or TLS Host Based Routing, depending on your approach.

  4. Enable GKE backup and recovery for these clusters. See the Google documentation on Backup for GKE.

  5. Create a GCS bucket with cross-region and availability zone access.

  6. Install the Kloudfuse Helm chart in the primary GKE cluster. See Installation Topics.

    Kloudfuse adds the following label to user-configured secrets and config maps, such as SSO/SAML setup and TLS certificates:

    app.kubernetes.io/instance: kfuse
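
Before relying on a backup, it can help to confirm which secrets and config maps carry this label. The following is a minimal sketch, assuming `kubectl` is already pointed at the primary cluster; the function name `list_kfuse_config` is hypothetical, and the namespace argument is whatever namespace you installed Kloudfuse in.

```shell
# Label that Kloudfuse adds to user-configured secrets and config maps.
KFUSE_LABEL='app.kubernetes.io/instance=kfuse'

# Hypothetical helper: list the labeled Secrets and ConfigMaps in a namespace.
# Usage: list_kfuse_config <namespace>
list_kfuse_config() {
  namespace="$1"
  kubectl get secrets,configmaps \
    --namespace "$namespace" \
    --selector "$KFUSE_LABEL"
}
```

Any user-configured secret or config map that is missing from the output does not carry the label and is worth checking before you depend on the backup.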

Configure Backup Policy

Set up the GKE backup policy based on the Recovery Point Objective (RPO) and retention policy of your organization.

This automatically creates backups of Kloudfuse installations to fulfill the RPO.

GKE retains backups for the specified period and deletes them afterward.

GKE uses different configuration parameters depending on the RPO:

RPO ≥ 60 minutes

Use the following example as a guide for setting up a GKE backup policy based on the --target-rpo-minutes option.

gcloud beta container backup-restore backup-plans create kloudfuse-backup-plan \
    --project=<projectID> \ (1)
    --location=<location of the primary GKE cluster> \ (2)
    --cluster=<name of the primary GKE cluster> \ (3)
    --selected-applications=<namespace>/kloudfuse \ (4)
    --include-secrets \ (5)
    --target-rpo-minutes=60 \ (6)
    --backup-retain-days=1 (7)
1 project: The ID of the Google Cloud project.
2 location: The location of the primary GKE cluster, which is also the region of the backup plan; see Google documentation on Available regions and zones.
3 cluster: The name of the primary GKE cluster to back up and restore.
4 selected-applications: A list of protected applications; see Google documentation on Define custom backup and restore logic.
5 include-secrets: An optional argument to include Secret resources if they are in the scope of the backup and restore policy.
6 target-rpo-minutes: The target RPO, in minutes; the minimum value is 60. See RPO < 60 minutes for instructions on how to configure shorter RPOs.
7 backup-retain-days: Specify the data retention time, in days.
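
After creating the plan, you may want to confirm that it was saved with the settings you expect. The sketch below wraps the `backup-plans describe` command; the function name `describe_backup_plan` is hypothetical, and the default plan name matches the example above.

```shell
# Hypothetical helper: show the saved settings of a backup plan.
# Usage: describe_backup_plan <projectID> <location> [plan-name]
describe_backup_plan() {
  project="$1"
  location="$2"
  plan="${3:-kloudfuse-backup-plan}"
  gcloud beta container backup-restore backup-plans describe "$plan" \
    --project="$project" \
    --location="$location"
}
```

The output lists the plan's cluster, schedule, and retention settings, which you can compare against the values you passed at creation time.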
RPO < 60 minutes

Use the following example as a guide for setting up a GKE backup policy based on the --cron-schedule option.

The schedule uses standard cron syntax with five fields: minute, hour, day of the month, month, and day of the week. The * (wildcard) character matches all possible values.

For example, the value 10 3 * * * creates a backup at 3:10 AM every day. All times are in UTC.

gcloud beta container backup-restore backup-plans create kloudfuse-backup-plan \
    --project=<project> \ (1)
    --location=<location of the primary GKE cluster> \ (2)
    --cluster=<name of the primary GKE cluster> \ (3)
    --selected-applications=<namespace>/kloudfuse \ (4)
    --include-secrets \ (5)
    --cron-schedule="10 3 * * *" \ (6)
    --backup-retain-days=1 (7)
1 project: The ID of the Google Cloud project.
2 location: The location of the primary GKE cluster, which is also the region of the backup plan; see Google documentation on Available regions and zones.
3 cluster: The name of the primary GKE cluster to back up and restore.
4 selected-applications: A list of protected applications; see Google documentation on Define custom backup and restore logic.
5 include-secrets: An optional argument to include Secret resources if they are in the scope of the backup and restore policy.
6 cron-schedule: The backup schedule, in cron syntax. Use this option when the RPO is less than 60 minutes. See RPO ≥ 60 minutes for instructions on how to configure longer RPOs.
7 backup-retain-days: Specify the data retention time, in days.
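
The choice between the two scheduling options can be sketched as a small helper. This is an illustration only: the function name `schedule_flag` is hypothetical, and it assumes that an RPO shorter than 60 minutes divides 60 evenly so that a `*/N` minute field spaces backups uniformly. Check the Google documentation for the shortest interval that Backup for GKE accepts.

```shell
# Hypothetical helper: print the scheduling flag for a target RPO in minutes.
# Usage: schedule_flag <rpo-minutes>
schedule_flag() {
  rpo="$1"
  if [ "$rpo" -ge 60 ]; then
    # RPO of 60 minutes or more: let GKE derive the schedule from the target.
    printf '%s\n' "--target-rpo-minutes=$rpo"
  elif [ "$rpo" -gt 0 ] && [ $((60 % rpo)) -eq 0 ]; then
    # Shorter RPO: "*/N * * * *" runs every N minutes (N must divide 60).
    printf '%s\n' "--cron-schedule=*/$rpo * * * *"
  else
    echo "unsupported RPO: $rpo" >&2
    return 1
  fi
}
```

Because the cron expression contains spaces, quote the result when passing it to gcloud, for example `"$(schedule_flag 15)"`.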

Restore

  1. Configure the restore plan:

    gcloud beta container backup-restore restore-plans create kloudfuse-restore-plan \
        --project=<project> \ (1)
        --location=<location of the failover GKE cluster> \ (2)
        --backup-plan=kloudfuse-backup-plan \ (3)
        --cluster=<name of the failover GKE cluster> \ (4)
        --cluster-resource-conflict-policy=delete-and-restore \ (5)
        --selected-applications=<namespace>/kloudfuse \ (6)
        --volume-data-restore-policy=restore-volume-data-from-backup (7)
    1 project: The ID of the Google Cloud project.
    2 location: The location of the failover GKE cluster, which is also the region of the restore plan; see Google documentation on Available regions and zones.
    3 backup-plan: The saved plan for Kloudfuse backups.
    4 cluster: The name of the failover GKE cluster.
    5 cluster-resource-conflict-policy: Defines how conflicting cluster-scoped resources already present in the failover cluster are handled during restore; see Google documentation for ClusterResourceConflictPolicy.
    6 selected-applications: A list of protected applications; see Google documentation on Define custom backup and restore logic.
    7 volume-data-restore-policy: Defines how to populate data for restored volumes. See Google documentation on how to Create a restore plan.
  2. When a primary region or availability zone fails, the administrator can restore Kloudfuse to the target failover GKE cluster.

    Kloudfuse recommends setting up an automatic detection and alerting mechanism for zone failures.

    gcloud beta container backup-restore restores create kloudfuse-restore \
        --project=<project> \ (1)
        --location=<location of the failover GKE cluster> \ (2)
        --restore-plan=kloudfuse-restore-plan \ (3)
        --backup=<select the latest GKE backup for Kloudfuse> (4)
    1 project: The ID of the Google Cloud project.
    2 location: The location of the failover GKE cluster, which is also the region of the restore plan; see Google documentation on Available regions and zones.
    3 restore-plan: The saved plan for Kloudfuse restores.
    4 backup: The backup to restore; must be located in the backup plan used by the restore plan.
  3. When using a regional static IP address for the load balancer, complete these additional steps:

    • In the Kloudfuse Ingress configuration, update the load balancer IP to the regional static IP for the failover region. See Configure Nginx Ingress.

    • Update the DNS record to point to the new static IP.

  4. When using a global static IP address for the load balancer, ensure that the failed GKE cluster is fenced off. Otherwise, if the primary GKE cluster recovers, it can accidentally bind to the same static IP address.
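
Step 2 asks for the latest GKE backup for Kloudfuse. One way to look it up is sketched below; the function name `latest_backup` is hypothetical, and the sketch assumes that sorting by the backup's `createTime` field in descending order puts the most recent backup first. The location argument is the location of the backup plan, not of the restore plan.

```shell
# Hypothetical helper: print the full resource name of the newest backup
# in a backup plan.
# Usage: latest_backup <project> <backup-plan-location> [plan-name]
latest_backup() {
  project="$1"
  location="$2"
  plan="${3:-kloudfuse-backup-plan}"
  gcloud beta container backup-restore backups list \
    --project="$project" \
    --location="$location" \
    --backup-plan="$plan" \
    --sort-by='~createTime' \
    --limit=1 \
    --format='value(name)'
}

# Example (not run here): pass the result to the restore command in step 2.
#   --backup="$(latest_backup <project> <location of the primary GKE cluster>)"
```

Printing the full resource name rather than the short backup ID avoids ambiguity, because the backup plan and the restore plan are in different locations.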

Backup Limitations

Backup works on Kubernetes resources and underlying persistent volumes only.

It does not cover:

  • Cluster configuration information: node configuration, node pools, initial cluster size, or enabled features.

  • Container images referenced by a backup. If an image referenced by a workload manifest is removed from its image repository, a subsequent restore does not restore the workload.

  • Configuration information or state of services outside the cluster, such as Cloud SQL or external load balancers.

  • Other volume types, such as Filestore NFS or Google Cloud NetApp Volumes. However, you can use Backup for GKE to provide solutions for workloads backed by Filestore volumes. See Handle Filestore volumes with Backup for GKE.