Kubernetes bare metal storage

Published:  27/03/2019 10:11

Introduction

A downside of using containers is that they're really not meant for stateful applications.

You can mount local storage on them, but that can become confusing very quickly with multiple storage options and the distributed model of Kubernetes.

The persistent storage model used by Kubernetes is often praised as one of its flagship features.

This article is a follow-up to two earlier ones about bare metal Kubernetes deployments.

A word of warning

Containers are awesome because they have this frozen-in-time, perfect and consistent state when you start or restart them, whether on a developer's laptop or in the production K8S cluster. It's very easy to streamline an automated build process that outputs a working, portable container with predictable results in both production and development.

A container crashed on Kubernetes? Kill it so that it just restarts (K8S will actually try to restart it automatically by default); nothing can go wrong since it's self-sufficient.

When you start to rely on the consistency of a state that has to be mounted on the container, things become less awesome. Or at the very least that's our opinion here at Net7.

Apps using persistent storage vary in how much they actually rely on that storage, how they manage data consistency and how much concurrency they allow. Simple apps that store plain files through an engine you fully understand probably won't be much of an issue; this is especially true of log parsing systems where you might be OK with losing some of the logging.

On that note, the ELK stack is a common deployment on Kubernetes and we think it's a great app to run on such an infrastructure, even though it requires persistent storage.

On the other hand, you can find a ton of stories of issues with containers that heavily rely on persistent storage, namely database systems.

Not to play on confirmation bias, because you could turn pretty much any web search into confirming what you already believe, but if you handle database systems day to day and have run into filesystem issues before, you probably feel a little queasy about storage that could be brutally detached, reattached, shared or completely removed by accident.

Success stories still exist aplenty when using Kubernetes. Our point is that you should not strive at all costs to get all your databases running in pods, especially when you already have a database system that works reliably.

More precisely, a database in Kubernetes can't benefit from some of its best features, e.g. rolling updates (multi-node data systems require a more complex deployment model called a StatefulSet) and horizontal pod autoscaling. Database systems therefore don't gain much from running inside a Kubernetes cluster and tend to suffer more from the container layer overhead than stateless application clusters do.

The everything-in-containers approach is totally fine, but it's not something you just decide to do one Monday morning: it requires careful planning and testing for what is expected to be the most reliable layer of your infrastructure, the data layer.

The Kubernetes Storage Model

There are a few different storage options in Kubernetes that we'll try to present here; we'll then explain what we've chosen as our baseline persistent storage option.

We won't cover the basic Kubernetes pod volume storage in detail; it's usually shown mounting storage for caching or configuration sharing.

It can still be useful in some multi-container deployments, or when you want an in-memory filesystem for a buffer, cache or temporary files.
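As a quick illustration, here is a minimal sketch of such a basic pod volume: an in-memory emptyDir mounted into the container (the pod name, image and mount path are just examples):

apiVersion: v1
kind: Pod
metadata:
  name: cache-example
spec:
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: cache
      mountPath: /tmp/cache
  volumes:
  - name: cache
    emptyDir:
      medium: Memory    # backed by tmpfs, wiped when the pod goes away
      sizeLimit: 256Mi

The volume lives and dies with the pod, which is exactly why it's not a persistent storage option.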

It's still possible to mount an NFS share as a simple volume, and the data will remain intact once the pod is shut down. The volume can also easily be pre-populated with data this way.

Mounting an entire NFS share as a volume is probably the easiest form of persistent storage you can get. What we would like, though, is to leverage a single NFS share and have it provision different volumes for us.
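For reference, a sketch of an NFS share mounted as a plain pod volume (same <NFS_SERVER> and <NFS_PATH> placeholders as in the rest of this article; the pod name and image are examples):

apiVersion: v1
kind: Pod
metadata:
  name: nfs-volume-example
spec:
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /data    # files written here land directly on the NFS share
  volumes:
  - name: data
    nfs:
      server: <NFS_SERVER>
      path: <NFS_PATH>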

Persistent Volumes

While basic pod volumes are not actual API resources, the objects of type PersistentVolume (later abbreviated PV) and PersistentVolumeClaim (later abbreviated PVC) are.

The idea is that when we create Deployment objects we can make claims for persistent volumes; each claim will in turn look for a free matching PersistentVolume and claim it for that deployment.

In this model, you're supposed to create PersistentVolume items before attempting to claim them in deployments.

Most cloud providers will allow you to skip the PersistentVolume creation by providing a special automatic provisioner for Persistent Volume Claims. We'll see later on how we can also implement that feature at our scale.

To create a PV, a prerequisite is to have access to existing distributed storage. K8S supports a variety of storage systems, including NFS, CIFS, Ceph, GlusterFS and more.

We use NFS for our baseline storage here at Net7, as we have several SSD-backed storage systems with performance on par with the SSD volumes offered by cloud providers.

The rest of the article will assume you already have a working NFS server.

If you want to use NFS and are running Ubuntu or Debian, you need to install the NFS client on each and every node:

apt-get install nfs-common
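Optionally, you can check from a node that the exports are reachable before going any further (showmount is part of nfs-common; the placeholders are the same as in the rest of the article):

# list the exports published by the server
showmount -e <NFS_SERVER>

# quick manual mount test, to be unmounted right away
mount -t nfs <NFS_SERVER>:<NFS_PATH> /mnt && umount /mnt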

Create a simple persistent volume

To just create one volume to be claimed by a specific deployment you want to put in production, you could declare it like so:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: <VOLUME_NAME>
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  persistentVolumeReclaimPolicy: Retain
  accessModes:
    - ReadWriteMany
  nfs:
    server: <NFS_SERVER>
    path: <NFS_PATH>

Where <NFS_SERVER> is the IP address or hostname of the NFS server and <NFS_PATH> is the NFS export path configured on the server. Don't forget to fill in a <VOLUME_NAME>. Note that PersistentVolumes are cluster-scoped resources, so there is no namespace to set here; PersistentVolumeClaims, on the other hand, are namespaced.

It's important to have accessModes set to ReadWriteMany with NFS, as the claim we'll create below requests that mode and the PV must offer it.

The capacity is important in that PersistentVolumeClaims will look for a PersistentVolume that has enough capacity for what was requested in the claim.
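Assuming the manifest above was saved as mysql-pv.yaml (the filename is just an example), the volume can be created and inspected like so:

kubectl apply -f mysql-pv.yaml

# the volume should show up as Available until a claim binds it
kubectl get pv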

Create the persistent volume claim

The claim could be declared right before your deployment as part of the same YAML file, but since it's referenced in the deployment's template spec, it has to be created before the deployment is.

Example PVC for the PV created before:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-pv-claim
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi

Do note that we're calling it "mysql-pv-claim" because our test deployment will be a very simple MySQL server.

The PVC will look at existing PVs and pick one to claim in its entirety, so there's no point in using a smaller size in the request part of the configuration: Kubernetes will look for the first volume that can accommodate our storage request but will still bind the whole of it.

You can inspect the PVC using the command:

kubectl get pvc

to see if it got bound correctly. In case of errors it will probably stay in the "Pending" state forever; you can use kubectl describe pvc to try to identify the source of the failure.

Use the PVC in a deployment

We can now use that PVC to deploy MySQL. The config is fairly long but it's pretty straightforward to understand.

The only new thing is that we'll use the Kubernetes secrets API to store our future MySQL root password. To do so you need to issue a kubectl command such as:

kubectl create secret generic mysql-pass --from-literal=password=<YOUR_PASSWORD>

Where the secret name here is "mysql-pass"; write yours down and change it in the configuration below if you want to use a different name (also note that all of this uses the default namespace).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql-server
  labels:
    app: mysql-server
spec:
  selector:
    matchLabels:
      app: mysql-server
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: mysql-server
    spec:
      containers:
      - image: mysql:5.6
        name: mysql
        env:
        - name: MYSQL_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysql-pass
              key: password
        ports:
        - containerPort: 3306
          name: mysql
        volumeMounts:
        - name: mysql-persistent-storage
          mountPath: /var/lib/mysql
          subPath: mysql-server
      volumes:
      - name: mysql-persistent-storage
        persistentVolumeClaim:
          claimName: mysql-pv-claim

A quick word about the deployment strategy: it's more than strongly advised to use Recreate here, to make sure we don't use the rolling update system, which would start the new pod alongside the old one and have them both try to bind the persistent volume. That is a very bad idea.

Unless you really know what you're doing, any deployment using persistent storage should use the Recreate deployment strategy.
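Once the secret, the PV, the PVC and the deployment are applied, a quick sanity check is to open a MySQL shell inside the pod (a sketch; the label selector matches the deployment above and <POD_NAME> comes from the first command):

# wait for the pod to be Running and note its name
kubectl get pods -l app=mysql-server

# open a MySQL session inside the container, using the root password stored in the secret
kubectl exec -it <POD_NAME> -- mysql -uroot -p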

How do we get the volume back?

Let's say we don't need our MySQL instance anymore and destroy the deployment. In that process we also remove the PVC. What happens now?

It depends on the retention policy that was chosen when we created the persistent volume. See the following line:

persistentVolumeReclaimPolicy: Retain

There are two main reclaim policies:

  • Retain: deleting the bound PVC does not delete the data on the PV, but the PV cannot be claimed by another PVC and reused. The idea is that you can still read the data directly on the storage subsystem (the NFS server, for instance) and back everything up one last time, but you can't directly reuse that data for a new storage claim; you'd have to create a new volume and manually copy the data over.
    • NB: you still have to manually delete the PV (see the commands just below).
  • Delete: deleting the PVC causes the deletion of the PV, meaning you can recreate that PV and claim it anew. Cloud providers may also completely remove the underlying storage resource if you use that policy.
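For reference, cleaning up a retained volume, or switching an existing PV to the other policy, can be done with commands like these (the volume name is a placeholder):

# remove a released PV once its data has been backed up
kubectl delete pv <VOLUME_NAME>

# change the reclaim policy of an existing PV
kubectl patch pv <VOLUME_NAME> -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'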

The reclaim policy and the visibility of the PersistentVolume resource with its set storage capacity are the advantages of creating an NFS PersistentVolume over just mounting the NFS share as a simple volume, as briefly mentioned at the beginning of this chapter.

Many ready-made deployment configuration files for K8S will also have PersistentVolumeClaims in them and thus assume you have Persistent Volumes or a way to provision them dynamically.

Dynamic provisioning

For now we have to manually create PV entries so that the future PVCs made by our deployments can successfully bind storage.

If a PVC is made and no adequate PV can be found, the deployment will obviously fail.

The idea behind dynamic provisioning is that the PVCs you make in your deployment will automatically create adequate new PV items and bind them.

Dynamic provisioning is normally meant for cloud storage and storage engines such as NFS do not provide a default dynamic provisioner. However, it's possible to write your own using the Kubernetes API.

We'll be using a dynamic NFS provisioner from quay.io.

If you don't have the NFS client installed on all your K8S nodes, please install it before attempting the deployment.

Another quick note, about your NFS server this time: if it has an option that allows "cross-mounts", you need to enable it too. This is not required if you use the Linux kernel NFS server.
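For the Linux kernel NFS server, the export used in this article could look like the following line in /etc/exports (the path, subnet and options are only an example; no_root_squash is typically needed so the provisioner can create directories on the share, so adapt this to your own security requirements):

# export the share to the cluster subnet, writable
/srv/nfs/kubernetes  10.0.0.0/24(rw,sync,no_subtree_check,no_root_squash)

After editing the file, reload the exports with exportfs -ra.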

First we need to create a service account for it:

kind: ServiceAccount
apiVersion: v1
metadata:
  name: nfs-client-provisioner
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: nfs-client-provisioner-runner
rules:
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "update", "patch"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: run-nfs-client-provisioner
subjects:
  - kind: ServiceAccount
    name: nfs-client-provisioner
    namespace: default
roleRef:
  kind: ClusterRole
  name: nfs-client-provisioner-runner
  apiGroup: rbac.authorization.k8s.io
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: leader-locking-nfs-client-provisioner
rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: leader-locking-nfs-client-provisioner
subjects:
  - kind: ServiceAccount
    name: nfs-client-provisioner
    # replace with namespace where provisioner is deployed
    namespace: default
roleRef:
  kind: Role
  name: leader-locking-nfs-client-provisioner
  apiGroup: rbac.authorization.k8s.io

Now to deploy the provisioner:

kind: Deployment
apiVersion: apps/v1
metadata:
  name: nfs-client-provisioner
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nfs-client-provisioner
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: nfs-client-provisioner
    spec:
      serviceAccountName: nfs-client-provisioner
      containers:
        - name: nfs-client-provisioner
          image: quay.io/external_storage/nfs-client-provisioner:latest
          volumeMounts:
            - name: nfs-client-root
              mountPath: /persistentvolumes
          env:
            - name: PROVISIONER_NAME
              value: <PROVISIONER_NAME>
            - name: NFS_SERVER
              value: <NFS_SERVER>
            - name: NFS_PATH
              value: <NFS_MOUNT_PATH>
      volumes:
        - name: nfs-client-root
          nfs:
            server: <NFS_SERVER>
            path: <NFS_MOUNT_PATH>

Where <NFS_SERVER> is the IP address or hostname of your NFS server, and <PROVISIONER_NAME> is a name of your choosing; write it down somewhere, as we'll need it later to create a StorageClass object.

As with the simple PersistentVolume example from earlier, we also need to provide the NFS <NFS_MOUNT_PATH> as configured on the server.

Let's now create a storage class to use the dynamic provisioner. We're choosing the name fast for it but that's really up to you and relative to your infrastructure.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: <PROVISIONER_NAME>
parameters:
  archiveOnDelete: "false"

Where we need to mention the <PROVISIONER_NAME> configured earlier.
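Optionally, you can mark this class as the default one, so that PVCs that don't specify any storage class will use it (this uses the standard Kubernetes default-class annotation):

kubectl patch storageclass fast -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'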

You should now be able to create PVCs as in:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: <CLAIM_NAME>
  annotations:
    volume.beta.kubernetes.io/storage-class: "fast"
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi

Of course you should choose an adequate <CLAIM_NAME> and possibly also set a namespace for it. On current Kubernetes versions you can also use the storageClassName field in the PVC spec instead of the beta annotation shown above.

Physically, that specific NFS dynamic provisioner will create a subfolder on the main NFS share for each PVC, whose name includes the namespace and the PVC's UUID.

If you delete the PVC resource, the whole volume will be removed from the NFS share (the whole subfolder that was created will be deleted).

The actual resource request in Gi doesn't matter with this provisioner. It's in no way some sort of quota and the provisioner won't do any kind of check to see if the requested volume is actually available.

You might still want to keep legitimate values in your configuration files in case you want to deploy using a cloud provisioner that will request a persistent volume of exactly that size.

Conclusion

Using an NFS dynamic storage provisioner is a very easy way to provide flexible storage capabilities to a bare metal Kubernetes cluster.

Even in the event of an NFS server crash, pods will still automatically restart at some point when the server comes back. That being said, you should still strive to take regular backups or snapshots of your NFS servers.

We can also easily snapshot the data of the NFS server to another NFS server and switch them in case of critical failure.

The next step we'd recommend would be to use Ceph and its dynamic provisioner, which is natively present in Kubernetes.

As said in "A word of warning", though, you should keep in mind that database systems inside Kubernetes do not benefit from some of its best features. There is nothing wrong with wanting to deploy a MariaDB Galera cluster outside of Kubernetes, with tightly controlled, stable software versions and hardware and easy access to configuration.

About K8S itself, there is still more we can discuss in the future, mainly auto-scaling capabilities and how to deploy your own private Docker registry. We may also want to describe how to cleanly make a rolling update of a packaged monolithic application using solutions such as Spring Boot (Java), Django (Python), NodeJS, ...
