This page describes how to back up and restore clusters created with Google Distributed Cloud. These instructions apply to all cluster types supported by Google Distributed Cloud.
Back up a cluster
The backup process has two parts. First, a snapshot is made from the etcd store. Then, the related PKI certificates are saved to a tar file. The etcd store is the Kubernetes backing store for all cluster data and contains all the Kubernetes objects and custom objects required to manage cluster state. The PKI certificates are used for authentication over TLS. This data is backed up from the cluster's control plane or from one of the control planes for a high-availability (HA) cluster.
We recommend that you back up your clusters regularly so that your snapshot data stays relatively current. How often you back up depends on how frequently significant changes occur in your clusters.
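One way to keep backups current is to check the age of the most recent snapshot and take a new one when it is too old. The following is a minimal sketch, not part of the product: the function name snapshot_is_stale and the age threshold are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when a new backup is due.
# Usage: snapshot_is_stale FILE MAX_AGE_DAYS
snapshot_is_stale() (
  snapshot_file=$1
  max_age_days=$2
  # A missing snapshot always counts as stale.
  [ -f "$snapshot_file" ] || exit 0
  # find prints the path only when the file's age, counted in whole
  # 24-hour periods, is greater than max_age_days.
  [ -n "$(find "$snapshot_file" -mtime +"$max_age_days")" ]
)
```

For example, `snapshot_is_stale BACKUP_PATH/snapshot.db 7` succeeds when the file is missing or was last modified more than about a week ago. Pick a threshold that matches how often your clusters change.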
Make a snapshot of the etcd store
In Google Distributed Cloud, a Pod named etcd-CONTROL_PLANE_NAME in the kube-system namespace runs the etcd for that control plane. To back up the cluster's etcd store, perform the following steps from your admin workstation:
Use kubectl get po to identify the etcd Pod.
kubectl --kubeconfig CLUSTER_KUBECONFIG get po -n kube-system \
  -l 'component=etcd,tier=control-plane'
The response includes the etcd Pod name and its status.
Use kubectl describe pod to see the containers running in the etcd Pod, including the etcd container.
kubectl --kubeconfig CLUSTER_KUBECONFIG describe pod ETCD_POD_NAME -n kube-system
Run a shell in the etcd container:
kubectl --kubeconfig CLUSTER_KUBECONFIG exec -it \
  ETCD_POD_NAME --container etcd --namespace kube-system \
  -- /bin/sh
From the shell within the etcd container, use etcdctl (version 3 of the API) to save a snapshot of the etcd store to a file named snapshotDATESTAMP.db.
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  snapshot save snapshotDATESTAMP.db
Replace DATESTAMP with the current date so that you don't overwrite earlier snapshots.
Exit from the shell in the container and run the following command to copy the snapshot file to the admin workstation. Use the same DATESTAMP value that you used when you saved the snapshot.
kubectl --kubeconfig CLUSTER_KUBECONFIG cp \
  kube-system/ETCD_POD_NAME:snapshotDATESTAMP.db \
  --container etcd snapshotDATESTAMP.db
Store the snapshot file in a location that is outside of the cluster and is not dependent on the cluster's operation.
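The snapshot steps above can also be run without an interactive shell. The following sketch wraps them in one function, under these assumptions: the function name backup_etcd_snapshot is hypothetical, the date format is one possible choice for DATESTAMP, and the etcd container provides /bin/sh. It reuses only the kubectl and etcdctl options shown in the steps.

```shell
#!/bin/sh
# Sketch: take a datestamped etcd snapshot and copy it to the workstation.
# Usage: backup_etcd_snapshot CLUSTER_KUBECONFIG ETCD_POD_NAME [DATESTAMP]
backup_etcd_snapshot() (
  kubeconfig=$1
  etcd_pod=$2
  # Assumption: default to today's date in UTC, for example 20240131.
  stamp=${3:-$(date -u +%Y%m%d)}
  snapshot="snapshot${stamp}.db"

  # Run etcdctl inside the etcd container. Command output goes to stderr
  # so that stdout carries only the snapshot file name.
  kubectl --kubeconfig "$kubeconfig" exec "$etcd_pod" \
    --container etcd --namespace kube-system -- \
    /bin/sh -c "ETCDCTL_API=3 etcdctl \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
      --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
      snapshot save ${snapshot}" >&2 || exit 1

  # Copy the snapshot out of the Pod into the current directory.
  kubectl --kubeconfig "$kubeconfig" cp \
    "kube-system/${etcd_pod}:${snapshot}" \
    --container etcd "$snapshot" >&2 || exit 1

  # Print the name of the local snapshot file.
  echo "$snapshot"
)
```

Calling `backup_etcd_snapshot CLUSTER_KUBECONFIG ETCD_POD_NAME` prints the name of the local snapshot file on success, which you can then move to storage outside the cluster.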
Archive the PKI certificates
The certificates to be backed up are located in the /etc/kubernetes/pki directory of the control plane. The PKI certificates, together with the etcd store snapshot.db file, are needed to recover a cluster in the event that the control plane goes down completely. The following steps create a tar file containing the PKI certificates.
Use ssh to connect to the cluster's control plane as root.
ssh root@CONTROL_PLANE_NAME
From the control plane, create a tar file, certs_backup.tar.gz, with the contents of the /etc/kubernetes/pki directory.
tar -czvf certs_backup.tar.gz -C /etc/kubernetes/pki .
Creating the tar file from within the control plane preserves all the certificate file permissions.
Exit the control plane and, from the workstation, copy the tar file containing the certificates to a preferred location on the workstation.
sudo scp root@CONTROL_PLANE_NAME:certs_backup.tar.gz BACKUP_PATH
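Before relying on the archive, it can help to confirm that it is readable and contains the files you expect. The following is a hedged sketch: the function name verify_cert_archive is hypothetical, and the list of expected files is whatever you choose to pass in. It assumes the archive was created relative to the pki directory, as in the tar command above, so entries are listed with a leading ./ prefix.

```shell
#!/bin/sh
# Sketch: check that a certificate archive lists the expected files.
# Usage: verify_cert_archive ARCHIVE EXPECTED_FILE...
verify_cert_archive() (
  archive=$1
  shift
  # tar -tzf lists the archive without extracting it. A missing or
  # corrupt archive makes tar exit with an error.
  listing=$(tar -tzf "$archive") || exit 1
  for expected in "$@"; do
    # Entries were archived relative to ".", so they look like ./etcd/ca.crt.
    printf '%s\n' "$listing" | grep -qxF "./$expected" || {
      echo "missing from archive: $expected" >&2
      exit 1
    }
  done
)
```

For example, `verify_cert_archive BACKUP_PATH/certs_backup.tar.gz etcd/ca.crt etcd/server.crt etcd/server.key` succeeds only if all three entries are present.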
Restore a cluster
Restoring a cluster from a backup is a last resort and should be used only when a cluster has failed catastrophically and cannot be returned to service any other way, for example when the etcd data is corrupted or the etcd Pod is in a crash loop.
The cluster restore process has two parts. First, the PKI certificates are restored on the control plane. Then, the etcd store data is restored.
Restore PKI certificates
Assuming you have backed up PKI certificates as described in Archive the PKI certificates, the following steps describe how to restore the certificates from the tar file to a control plane.
Copy the PKI certificates tar file, certs_backup.tar.gz, from the workstation to the cluster control plane.
sudo scp -r BACKUP_PATH/certs_backup.tar.gz root@CONTROL_PLANE_NAME:~/
Use ssh to connect to the cluster's control plane as root.
ssh root@CONTROL_PLANE_NAME
From the control plane, extract the contents of the tar file to the /etc/kubernetes/pki directory.
tar -xzvf certs_backup.tar.gz -C /etc/kubernetes/pki/
Exit the control plane.
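Because the archive is meant to preserve certificate file permissions, you might want to check the restored private keys before continuing. The following sketch rests on two assumptions that are not stated in this page: private keys use the .key extension, and they should be accessible only by their owner. The function name find_exposed_keys is hypothetical, and the -perm /077 test uses GNU find syntax.

```shell
#!/bin/sh
# Hypothetical check: report private keys that group or others can access.
# Usage: find_exposed_keys PKI_DIR
find_exposed_keys() (
  pki_dir=$1
  # -perm /077 matches files with any group or other permission bit set.
  exposed=$(find "$pki_dir" -type f -name '*.key' -perm /077) || exit 2
  if [ -n "$exposed" ]; then
    # Print the offending paths and signal failure.
    printf '%s\n' "$exposed"
    exit 1
  fi
)
```

Run on the control plane as `find_exposed_keys /etc/kubernetes/pki`, the function prints nothing and succeeds when no key file grants group or other access.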
Restore the etcd store
When restoring the etcd store, the process depends upon whether or not the cluster is running in high availability (HA) mode and, if so, whether or not quorum has been preserved. Use the following guidance to restore the etcd store for a given cluster failure situation:
If the failed cluster is not running in HA mode, restore the etcd store on the control plane with the following steps.
If the cluster is running in HA mode and quorum is preserved, do nothing. As long as quorum is preserved, you don't need to restore the etcd store.
If the cluster is running in HA mode and quorum is lost, repeat the following steps to restore the etcd store for each failed member.
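The three cases above can be summarized as a small decision helper. This is only an illustration of the guidance, not a tool that ships with the product: the function name etcd_restore_action and its yes/no arguments are assumptions, and it does not detect HA mode or quorum for you.

```shell
#!/bin/sh
# Usage: etcd_restore_action HA QUORUM
#   HA:     "yes" if the cluster runs in HA mode, otherwise "no"
#   QUORUM: "yes" if etcd quorum is preserved, otherwise "no"
#           (ignored when HA is "no")
etcd_restore_action() {
  case "$1:$2" in
    no:*)    echo "restore the etcd store on the control plane" ;;
    yes:yes) echo "do nothing" ;;
    yes:no)  echo "restore the etcd store on each failed member" ;;
    *)       echo "usage: etcd_restore_action yes|no yes|no" >&2
             return 2 ;;
  esac
}
```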
Follow these steps from the workstation to remove and restore the etcd store on a control plane for a failed cluster:
Create a /backup directory in the root directory of the control plane.
ssh root@CONTROL_PLANE_NAME "mkdir /backup"
This step is not strictly required, but we recommend it. The following steps assume you have created a /backup directory.
Copy the etcd snapshot file, snapshot.db, from the workstation to the /backup directory on the cluster control plane.
sudo scp snapshot.db root@CONTROL_PLANE_NAME:/backup
If your snapshot file has a different name, such as snapshotDATESTAMP.db, use that name here and in the restore command that follows.
Use SSH to connect to the control plane node:
ssh root@CONTROL_PLANE_NAME
Stop the etcd and kube-apiserver static Pods by moving their manifest files out of the /etc/kubernetes/manifests directory and into the /backup directory.
sudo mv /etc/kubernetes/manifests/etcd.yaml /backup/etcd.yaml
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /backup/kube-apiserver.yaml
Remove the etcd data directory.
rm -rf /var/lib/etcd/
Run etcdctl snapshot restore using docker.
sudo docker run --rm -t \
  -v /var/lib:/var/lib \
  -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
  -v /backup:/backup \
  --env ETCDCTL_API=3 \
  k8s.gcr.io/etcd:3.2.24 etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --data-dir=/var/lib/etcd \
  --name=CONTROL_PLANE_NAME \
  --initial-advertise-peer-urls=https://CONTROL_PLANE_IP:2380 \
  --initial-cluster=CONTROL_PLANE_NAME=https://CONTROL_PLANE_IP:2380 \
  snapshot restore /backup/snapshot.db
The entries for --name, --initial-advertise-peer-urls, and --initial-cluster can be found in the etcd.yaml manifest file that was moved to the /backup directory.
Ensure that /var/lib/etcd was recreated and that a new member is created in /var/lib/etcd/member.
Move the etcd and kube-apiserver manifests back to the /etc/kubernetes/manifests directory so that the static Pods can restart.
sudo mv /backup/etcd.yaml /etc/kubernetes/manifests/etcd.yaml
sudo mv /backup/kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml
Run a shell in the etcd container:
kubectl --kubeconfig CLUSTER_KUBECONFIG exec -it \
  ETCD_POD_NAME --container etcd --namespace kube-system \
  -- /bin/sh
Use etcdctl to confirm the added member is working properly.
ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --endpoints=CONTROL_PLANE_IP:2379 \
  endpoint health
If you are restoring multiple failed members, wait until all of them have been restored, and then run the command with the control plane IP addresses of all restored members in the --endpoints field.
For example:
ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --endpoints=10.200.0.3:2379,10.200.0.4:2379,10.200.0.5:2379 \
  endpoint health
If the command reports every endpoint as healthy, your cluster should be working properly.
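When several members were restored, the comma-separated --endpoints value can be built from a list of control plane IP addresses. The following is a small sketch: the function name join_endpoints is hypothetical, and port 2379 is taken from the commands above.

```shell
#!/bin/sh
# Usage: join_endpoints IP...
# Prints IP:2379 entries separated by commas, for use with --endpoints.
join_endpoints() (
  result=
  for ip in "$@"; do
    # Add a comma before every entry except the first.
    result="${result:+$result,}${ip}:2379"
  done
  printf '%s\n' "$result"
)
```

Running `join_endpoints 10.200.0.3 10.200.0.4 10.200.0.5` on the workstation produces the value used in the example command, which you can then pass to etcdctl inside the etcd container.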