Kubernetes Backup and Restore Made Easy - TechWorld with Nana


(TechWorld with Nana)

Nana Janashia:

In this video, we're going to talk about the challenging task of data management in Kubernetes and a tool that makes data management very easy for Kubernetes administrators: Kasten’s K10. So what does ‘data management in Kubernetes’ actually mean, and why is it challenging? Imagine we have the following real-world application setup in our Kubernetes production cluster. Let's say we have an EKS cluster where our microservices application is running. Our microservices use an Elasticsearch database, which is also running in the cluster. In addition, our application uses Amazon RDS, a managed database service outside the cluster. This means our application has data services both inside and outside the cluster, and that data is physically stored on some storage backend. RDS data is stored on AWS, of course. Elasticsearch accesses its data in the cluster through Kubernetes persistent volumes, but that data also needs to be physically stored somewhere: cloud storage on AWS, Google Cloud, et cetera, or an on-premises server.

Now let's look at the use cases we have to think about for data management in this specific setup. Imagine the following scenarios in your Kubernetes cluster. The underlying infrastructure where the cluster is running fails, and we lose all the pods, services, and the whole cluster configuration. We would need to recreate the cluster with the same cluster state and the same application data; in other words, restore the cluster to its previous state.

Or let's say our Elasticsearch database gets corrupted or hacked into, and we lose all the data there. Again, we need to restore our database to the latest working state.

Or another very common use case: say our cluster is running on AWS, but we want to make our production cluster more reliable and flexible by not depending on just one cloud provider, replicating it in a Google Cloud environment with the same application setup and application data.

In all these cases, the challenge is how to capture an application backup that includes all the data the application needs, whether it's a database inside the Kubernetes cluster or a managed data service outside it. That way, if the cluster fails or something happens to the application and we lose the data, we can restore or replicate the application’s state with all its components (pods, services, config maps, secrets, et cetera) and all its data. And that is a challenging task. Essentially, you need a way to capture and back up all these parts, and then easily use that backup to replicate or restore the cluster state.

Now let's look at what alternatives we have. If we do VM backups of our cluster nodes or etcd backups, we save the state of the cluster. But what about the application data? It isn't stored on the worker nodes, right? It's stored outside the cluster, on a cloud platform or on on-premises servers. On the other hand, cloud storage backends come with their providers' own backup and replication mechanisms, but these are only partially managed by the platform, so you still have to configure the data backups and take care of your data yourself. Plus, that covers only the data in the volume, not the cluster state.

Many teams write custom scripts to piece together backup solutions on different levels: components and state inside the cluster, and data outside the cluster. But these scripts quickly get very complex, because data and state are spread across many levels and many platforms, and the script code usually ends up tightly coupled to the underlying platform or infrastructure where the data is stored. The same goes for the restore logic.

Many teams also write custom scripts for restore or cluster-recreation logic that pulls from all the different backup sources. So overall, your team may end up with lots of complex self-managed scripts, which are usually hard to maintain. These are exactly the challenges that Kasten's K10 addresses.

So how does K10 solve these problems? K10 abstracts away the underlying infrastructure to give you consistent data management, no matter where the data is actually stored. Teams can choose whichever infrastructure or platform they want for their application without sacrificing operational simplicity, because K10 has a fairly extensive ecosystem: it integrates with various relational and NoSQL databases, many different Kubernetes distributions, and all clouds. So instead of writing backup scripts for each platform or level, you use K10's single UI to create complete application backups in the cluster.

Everything that is part of the application, meaning the Kubernetes components themselves, the application data in volumes, and data in managed data services outside the cluster, is captured in the application snapshot by K10. You can then take that snapshot and reproduce or restore your cluster on any infrastructure you want. K10 also works with policies: instead of manually backing up and restoring your applications, which means more effort and a higher risk of mistakes, you configure backup and restore tasks to run automatically with the settings you define in the policy.

Now, what if you have multiple clusters across multiple zones or regions, or even across cloud platforms? How do you consistently manage tens or hundreds of cluster backups? For that, K10 has a multi-cluster mode. In K10's multi-cluster dashboard, you get a single overview of all your clusters, as well as a way to create and configure global backup and restore policies that you can apply to multiple clusters.

And if you have hundreds or thousands of applications across many clusters, you don't want to create policies by hand in the UI. For that, K10 provides a Kubernetes-native way of defining policies in YAML, so you can automate policy creation and configuration as part of a policy-as-code workflow.

Now let's see how K10 actually works. First, we install K10 in a Kubernetes cluster, which is easy to do with a Helm chart, in its own namespace. Once deployed, K10 automatically discovers all the applications running inside the cluster. In our case, let's say we have a MySQL application running in its own mysql namespace with persistent data storage configured. You can see the automatically discovered applications on K10's dashboard, which is deployed in Kubernetes along with K10's other components.
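For reference, here is a minimal sketch of that Helm install, written as a small Python wrapper that only prints the commands unless dry_run is set to False. It assumes the chart repository URL (https://charts.kasten.io/), the release name k10, and the kasten-io namespace from Kasten's install documentation at the time of writing; check the current docs before running it against a real cluster.

```python
import shlex
import subprocess

# Assumed values, taken from Kasten's documented Helm install; verify before use.
K10_REPO_NAME = "kasten"
K10_REPO_URL = "https://charts.kasten.io/"


def k10_install_commands(namespace: str = "kasten-io") -> list[list[str]]:
    """Return the helm commands that install K10 into its own namespace."""
    return [
        ["helm", "repo", "add", K10_REPO_NAME, K10_REPO_URL],
        ["helm", "repo", "update"],
        [
            "helm", "install", "k10", f"{K10_REPO_NAME}/k10",
            f"--namespace={namespace}", "--create-namespace",
        ],
    ]


def install_k10(namespace: str = "kasten-io", dry_run: bool = True) -> list[str]:
    """Print each command; only execute it when dry_run is False."""
    printed = []
    for cmd in k10_install_commands(namespace):
        line = shlex.join(cmd)
        print(line)
        printed.append(line)
        if not dry_run:
            # check=True stops at the first helm command that fails
            subprocess.run(cmd, check=True)
    return printed


if __name__ == "__main__":
    install_k10()
```

Keeping the install in its own namespace, as described above, makes it easy to tell K10's own components apart from the applications it discovers.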

The applications card also shows a warning that the discovered applications are unmanaged, meaning no backup policies are configured for them yet, so the application data isn't protected. For each application, there is a details view that groups all of the application's components together, including its labels, data components, and workloads, as well as its configuration and networking components.

So as a next step, we can create an automated backup policy to protect the MySQL application and its data. If I click on “new policy,” I can create a policy for my application with all the needed configuration options. Creating a policy in the UI is as easy as choosing a snapshot action and selecting which application you want to back up and how often. You can also decide exactly what to back up: protect everything associated with the application, be more granular and protect only some of its components using resource filters, or go even broader and snapshot multiple applications at once using labels. This configures local snapshots, but ideally we want to store our snapshots in an external storage location so our backups are protected and live outside the cluster.
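To make label-based selection and the "unmanaged" warning concrete, here is a small hypothetical Python model. It assumes each discovered application is a namespace with labels and applies Kubernetes matchLabels semantics (every key/value pair in the selector must be present); the App class and the function names are illustrative only, and this is not how K10 implements discovery or selection internally.

```python
from dataclasses import dataclass, field


@dataclass
class App:
    """Hypothetical model: one discovered application per namespace."""
    namespace: str
    labels: dict[str, str] = field(default_factory=dict)


def matches(selector: dict[str, str], labels: dict[str, str]) -> bool:
    # matchLabels semantics: every key/value pair in the selector must be present.
    # An empty selector therefore matches everything.
    return all(labels.get(key) == value for key, value in selector.items())


def select_apps(apps: list[App], selector: dict[str, str]) -> list[str]:
    """Namespaces a label-based policy would cover, in discovery order."""
    return [app.namespace for app in apps if matches(selector, app.labels)]


def unmanaged(apps: list[App], policy_selectors: list[dict[str, str]]) -> list[str]:
    """Namespaces not covered by any policy, roughly what the warning reports."""
    covered = {ns for selector in policy_selectors for ns in select_apps(apps, selector)}
    return sorted(app.namespace for app in apps if app.namespace not in covered)


apps = [
    App("mysql", {"tier": "db"}),
    App("elasticsearch", {"tier": "db", "team": "search"}),
    App("frontend", {"tier": "web"}),
]
print(select_apps(apps, {"tier": "db"}))  # ['mysql', 'elasticsearch']
print(unmanaged(apps, [{"tier": "db"}]))  # ['frontend']
```

One label selector can cover many applications at once, which is why the label option scales better than picking applications one by one.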

For that, we can enable backups via snapshot exports and select the export location, which is the backup target. The backup target is configured in the “location profiles” section of the settings, and it can be object storage such as Amazon S3 or another S3-compatible store like MinIO, Google Cloud Storage, Azure Storage, et cetera, so you can store your application backups in your preferred location. In our case, we configure a MinIO storage profile by adding the credentials, endpoint, and bucket name, and saving the profile.

Now we can use this location profile as a remote storage location for our snapshots by selecting it here. Before creating the policy, if we click on the YAML button, we see the policy in a familiar YAML format. This is the Kubernetes-native API behind the policy; everything you see in the UI is backed by this API. So you can script your policies, which is very useful if you have hundreds or thousands of applications in your cluster to back up and need a way to scale your policies and configuration.
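Because policies are Kubernetes resources, they can be generated rather than clicked together. Below is a hedged Python sketch that builds one backup-and-export policy per application namespace. The apiVersion, kind, and spec fields (frequency, a selector on the k10.kasten.io/appNamespace label, backup and export actions, exportParameters.profile) reflect K10's Policy resource as we understand it from Kasten's documentation, and minio-profile is a hypothetical profile name; compare the output with what the UI's YAML button shows before applying anything. Settings such as retention are left out for brevity. The output is JSON, which is also valid YAML, wrapped in a v1 List so a single kubectl apply -f can create all the policies.

```python
import json

# Assumed: K10 is installed in the "kasten-io" namespace.
K10_NAMESPACE = "kasten-io"


def backup_policy(app_namespace: str, profile: str, frequency: str = "@hourly") -> dict:
    """Build a K10 Policy manifest that snapshots one namespace and exports it.

    Field names are an assumption based on K10's documented Policy resource.
    """
    return {
        "apiVersion": "config.kio.kasten.io/v1alpha1",
        "kind": "Policy",
        "metadata": {"name": f"{app_namespace}-backup", "namespace": K10_NAMESPACE},
        "spec": {
            "frequency": frequency,
            # Select the application by the namespace it runs in.
            "selector": {
                "matchExpressions": [
                    {
                        "key": "k10.kasten.io/appNamespace",
                        "operator": "In",
                        "values": [app_namespace],
                    }
                ]
            },
            "actions": [
                # Local snapshot first...
                {"action": "backup"},
                # ...then copy it to the external location profile.
                {
                    "action": "export",
                    "exportParameters": {
                        "frequency": frequency,
                        "profile": {"name": profile, "namespace": K10_NAMESPACE},
                        "exportData": {"enabled": True},
                    },
                },
            ],
        },
    }


def policies_manifest(app_namespaces: list[str], profile: str) -> str:
    """Wrap one policy per namespace in a v1 List; JSON output is also valid YAML."""
    items = [backup_policy(ns, profile) for ns in app_namespaces]
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2)


if __name__ == "__main__":
    # Hypothetical namespaces and profile name, for illustration only.
    print(policies_manifest(["mysql", "elasticsearch"], "minio-profile"))
```

Piping the output to kubectl apply -f - would create the policies, provided the field names match your K10 version.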

So we create the policy and this policy will run every hour since we configured an hourly backup, but we can also run the policy manually at any time. So if we click on “run policy,” this will trigger a backup job that you can see on the dashboard. And when completed, we will see that all application components have been captured in that snapshot.

Now that we have a backup of the application, with a local snapshot as well as its remote copy, the whole application is protected. So if something happens, for example we lose the data or the application gets misconfigured, we can restore the application from the latest snapshot simply by clicking “restore” and selecting the snapshot. We can restore the application in place or even clone it into a different namespace in the same cluster. In our case, let's clone the MySQL application into a new namespace called mysql-clone. When I click on “restore,” this triggers a restore job. If we go back to the cluster, we can see what's happening in the background: a new namespace was created, and all the application components and application data were brought back, so they are now available in that new namespace.
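Conceptually, cloning into a new namespace means re-creating the captured resource specs with their namespace rewritten. The Python sketch below shows only that metadata step on plain manifest dictionaries. It is a simplified illustration, not K10's restore logic (which also restores the volume data), and the set of cluster-scoped kinds it skips is a hypothetical, incomplete example.

```python
import copy

# Illustrative, not exhaustive: kinds that are cluster-scoped and have no namespace.
CLUSTER_SCOPED_KINDS = {"Namespace", "PersistentVolume", "StorageClass", "ClusterRole"}
# Fields the API server assigns on creation; strip them so each copy is created fresh.
SERVER_ASSIGNED_FIELDS = ("uid", "resourceVersion", "creationTimestamp")


def clone_to_namespace(manifests: list[dict], target_namespace: str) -> list[dict]:
    """Return copies of the namespaced manifests, re-targeted at target_namespace."""
    cloned = []
    for manifest in manifests:
        if manifest.get("kind") in CLUSTER_SCOPED_KINDS:
            continue  # cluster-scoped resources cannot be moved into a namespace
        clone = copy.deepcopy(manifest)  # never mutate the captured snapshot
        metadata = clone.setdefault("metadata", {})
        metadata["namespace"] = target_namespace
        for field_name in SERVER_ASSIGNED_FIELDS:
            metadata.pop(field_name, None)
        cloned.append(clone)
    return cloned
```

Working on deep copies keeps the original snapshot untouched, so the same snapshot can later be restored in place as well.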

And finally, when you're restoring the application, maybe you don't want to clone and run the application exactly the same way with the same configuration. For example, if you're cloning your application to another platform, maybe you want to change the storage type to use the storage of that platform or change the number of replicas of your applications or change the availability zone the application will run in and so on. You can actually do such adjustments to the application when restoring it using what's called transformations in K10, just by selecting “transformations” as an option, and then basically configuring what you need to be adjusted.

So as you see, K10 can make data management for applications running in Kubernetes much easier, and you can get started with K10 for free, as it has a “free forever” option for managing up to 10 nodes. In the description of this video, you'll find a link to K10’s page, where you can go through a hands-on lab to try out the tool yourself without any cluster setup, as well as a link to the free K10 version. So make sure to check out those links. Thank you for watching, and see you in the next video.