Hacking Kubernetes etcd for (personal) profit
As techies, we make mistakes every once in a while. We learn new things, we experiment, and then we try hard to make the best of our knowledge. We also don’t like to admit to our mistakes or poor choices. This story is about one of those: a poor choice (a mistake made on purpose) which could have been avoided, but which happened anyway, and the journey to fix it with the least amount of effort.
The Background
I run a few Kubernetes (Rancher) clusters for my homes’ needs. I consider them production, though of course “production” in quotes would be more appropriate, as they are not business critical for anyone except me and my family. To cut a long story short, I have set up home labs in 3 different locations, all connected through site-to-site VPNs, running assorted software, VMs and Kubernetes clusters.
The main location is the centre of command. Aside from the compute layer, I have a storage shelf with 15 hard drives, fully managed by the excellent TrueNAS. It is also worth noting that the TrueNAS instance (iSCSI/NFS/SMB) is accessible via both 10Gbit and 1Gbit network interfaces; my main switch, an old Zyxel GS1920-24, unfortunately doesn’t have a 10G port, so the TrueNAS box has an additional 10G PCI-E NIC attached.
TrueNAS also serves as the persistence layer for my Kubernetes clusters, and that’s where the real story begins.
Kubernetes
For flexibility and ease of use, my Kubernetes cluster uses an external NFS provisioner to create Persistent Volumes on the fly (Dynamic Provisioning). It’s an excellent solution for a home lab, and although it comes with some caveats (mainly due to NFS limitations, e.g. the size of a PV is not really honoured), I’ve enjoyed working with it.
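To give a feel for what the provisioner serves, here is a minimal sketch of the kind of claim a workload would create against it; the claim name and sizes are illustrative only, and the StorageClass name matches the one used later in this post:

```yaml
# Hypothetical PVC: the external provisioner watches for claims against its
# StorageClass, carves out a subdirectory on the NFS export, and binds a PV.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hello1
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 1Gi
```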
In my infinite wisdom (self-sarcasm), I’ve been using the same IP address for my NAS for years; it has almost become as solid to me as a DNS entry would be. You could wake me up at night and I’d recite 192.168.1.72. So naturally, despite everything I preach to others, my colleagues and my customers, I used that IP address when I set up my external NFS provisioner. Here’s the proof, the values given to the Helm chart:
affinity: {}
image:
  pullPolicy: IfNotPresent
  repository: k8s.gcr.io/sig-storage/nfs-subdir-external-provisioner
  tag: v4.0.2
imagePullSecrets: []
labels: {}
leaderElection:
  enabled: true
nfs:
  mountOptions: null
  path: /mnt/storage-secondary/k8s-pv
  server: 192.168.1.72
  volumeName: nfs-pv-root
I completely forgot about it. No issues whatsoever, until I decided to reorganise my network, adding a few VLANs and separating some traffic. That also meant my NAS was moving to a completely different subnet, a dedicated one for storage (let’s call it a home SAN), and its traffic would no longer go through the main pfSense router.
Storage in Kubernetes is governed by a set of CSI drivers. Let’s recap what the Kubernetes CSI is (from the GA announcement a few years back):
CSI was developed as a standard for exposing arbitrary block and file storage systems to containerized workloads on Container Orchestration Systems (COs) like Kubernetes. With the adoption of the Container Storage Interface, the Kubernetes volume layer becomes truly extensible. Using CSI, third-party storage providers can write and deploy plugins exposing new storage systems in Kubernetes without ever having to touch the core Kubernetes code.
The NFS external provisioner is a CSI driver which, upon receipt of a CreateVolume call from Kubernetes, goes “outside” the cluster’s realm and creates an NFS mount in the underlying operating system, all in line with the CSI documentation:
When volume provisioning is invoked, the parameter type: pd-ssd and any referenced secret(s) are passed to the CSI plugin csi-driver.example.com via a CreateVolume call. In response, the external volume plugin provisions a new volume and then automatically creates a PersistentVolume object to represent the new volume. Kubernetes then binds the new PersistentVolume object to the PersistentVolumeClaim, making it ready to use.
and we can actually see it on a Kubernetes node when we SSH to it and run:
mount | grep nfs
the result will be as follows:
192.168.1.72:/mnt/storage-secondary/k8s-pv-test-cluster/default-hello1-pvc-5e649f65-a34e-45fe-aac3-1b7c774a3944 on /var/lib/kubelet/pods/72362a2a-c914-4e52-a549-f61e3c7416d4/volumes/kubernetes.io~nfs/pvc-5e649f65-a34e-45fe-aac3-1b7c774a3944 type nfs4 (rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.30.70,local_lock=none,addr=192.168.1.72)
192.168.1.72:/mnt/storage-secondary/k8s-pv-test-cluster/default-hello2-pvc-b7d3421e-a24a-4525-95cf-295e0c190ba9 on /var/lib/kubelet/pods/72362a2a-c914-4e52-a549-f61e3c7416d4/volumes/kubernetes.io~nfs/pvc-b7d3421e-a24a-4525-95cf-295e0c190ba9 type nfs4 (rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.30.70,local_lock=none,addr=192.168.1.72)
So how does this all happen? The external NFS provisioner receives a call from Kubernetes and delegates the creation and mounting of the path to an NFS controller. From there, the controller uses a set of standard Linux commands to mount the path on the node (it also relies on a standard set of system libraries, i.e. nfs-common on an Ubuntu node).
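Conceptually, each mount you saw above boils down to a plain mount(8) invocation. The sketch below only assembles (and prints) the argument list such a call could use; the server, export and target paths are illustrative, not taken from a real controller:

```go
package main

import (
	"fmt"
	"strings"
)

// nfsMountArgs builds an argument list for a standard Linux mount(8)
// invocation of the kind the node-level mounter effectively performs
// for an NFS-backed PV.
func nfsMountArgs(server, exportPath, targetDir string, opts []string) []string {
	args := []string{"-t", "nfs"}
	if len(opts) > 0 {
		args = append(args, "-o", strings.Join(opts, ","))
	}
	return append(args, server+":"+exportPath, targetDir)
}

func main() {
	args := nfsMountArgs(
		"192.168.1.72",
		"/mnt/storage-secondary/k8s-pv-test-cluster/default-hello1-pvc-5e649f65",
		"/var/lib/kubelet/pods/72362a2a/volumes/kubernetes.io~nfs/pvc-5e649f65",
		[]string{"vers=4.1", "hard"},
	)
	fmt.Println("mount " + strings.Join(args, " "))
}
```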
BUT, when doing all of that, it takes the IP/DNS address of the mount from the PV definition in Kubernetes. Here’s an example:
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: cluster.local/nfs-provisioner-test-nfs-subdir-external-provisioner
  creationTimestamp: "2021-09-24T16:16:23Z"
  finalizers:
  - kubernetes.io/pv-protection
  managedFields:
  - apiVersion: v1
    manager: nfs-subdir-external-provisioner
    operation: Update
    time: "2021-09-24T16:16:23Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .: {}
          v:"kubernetes.io/pv-protection": {}
      f:status:
        f:phase: {}
    manager: kube-controller-manager
    operation: Update
    time: "2021-09-24T19:43:54Z"
  name: pvc-5e649f65-a34e-45fe-aac3-1b7c774a3944
  resourceVersion: "400447"
  uid: 17eb6580-d672-4e33-926b-aefd63a40ed4
spec:
  accessModes:
  - ReadWriteOnce
  - ReadWriteMany
  capacity:
    storage: 1Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: hello1
    namespace: default
    resourceVersion: "43088"
    uid: 5e649f65-a34e-45fe-aac3-1b7c774a3944
  nfs:
    path: /mnt/storage-secondary/k8s-pv-test-cluster/default-hello1-pvc-5e649f65-a34e-45fe-aac3-1b7c774a3944
    server: 192.168.1.72
  persistentVolumeReclaimPolicy: Delete
  storageClassName: nfs-client
  volumeMode: Filesystem
So, finally, in my case: I had tens, if not hundreds, of PVs already created, and each carried the hardcoded IP 192.168.1.72 of my NFS server. I had to change it.
And now, if you’re not so proficient with Kubernetes, you may think: what’s the problem? Just change it. The thing is, once provisioned, the core attributes of both PVs and PVCs (Persistent Volume Claims) are immutable, and the Kubernetes API will very quickly point that out:
spec.persistentvolumesource: Forbidden: spec.persistentvolumesource is immutable after creation
and that is by design. It makes a lot of sense for them to be immutable in the Kubernetes world; on this occasion, however, it was either that or literally removing all the PVs and PVCs and then reprovisioning them by hand against the new IP.
I certainly didn’t want to manually re-provision hundreds of PVs.
In the search for a solution
We know Kubernetes won’t allow us to change the server attribute on the PV, but… unless that value is stored in some sort of ROM (Read-Only Memory) which by virtue of physics couldn’t be changed (and we know it’s not), it’s just a few bytes stored somewhere, and only the k8s API is standing in the way.
If you live in the Kubernetes world, you’ve probably heard about etcd. If you’ve lived under a rock for at least the last 5 years, here’s the recap from etcd.io:
etcd is a strongly consistent, distributed key-value store that provides a reliable way to store data that needs to be accessed by a distributed system or cluster of machines.
etcd is the database where Kubernetes stores all of its definitions and the current state of the cluster, including PVs.
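Each object lives under a predictable key. On a stock cluster the API server stores PVs under the /registry prefix; note this prefix is configurable (via the API server’s --etcd-prefix flag), so treat the layout below as an assumption rather than a guarantee:

```go
package main

import "fmt"

// pvEtcdKey returns the etcd key under which the API server stores a
// PersistentVolume, assuming the default "/registry" prefix.
func pvEtcdKey(pvName string) string {
	return "/registry/persistentvolumes/" + pvName
}

func main() {
	fmt.Println(pvEtcdKey("pvc-5e649f65-a34e-45fe-aac3-1b7c774a3944"))
}
```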
Now, even though there are some etcd browsers available, most of the ones I tried are very generic and do not anticipate that the objects stored as values are serialised with protobuf, so the result you get resembles this:
k8sv1PersistentVolume� �(pvc-5e649f65-a34e-45fe-aac3-1b7c774a3944"*$17eb6580-d672-4e33-926b-aefd63a40ed428B��bepv.kubernetes.io/provisioned-byBcluster.local/nfs-provisioner-test-nfs-subdir-external-provisionerrkubernetes.io/pv-protectionz��nfs-subdir-external-provisionerUpdatev1"��2FieldsV1:��{"f:metadata":{"f:annotations":{".":{},"f:pv.kubernetes.io/provisioned-by":{}}},"f:spec":{"f:accessModes":{},"f:capacity":{".":{},"f:storage":{}},"f:claimRef":{".":{},"f:apiVersion":{},"f:kind":{},"f:name":{},"f:namespace":{},"f:resourceVersion":{},"f:uid":{}},"f:nfs":{".":{},"f:path":{},"f:server":{}},"f:persistentVolumeReclaimPolicy":{},"f:storageClassName":{},"f:volumeMode":{}}}��kube-controller-managerUpdatev1"�Ը�2FieldsV1:ki{"f:metadata":{"f:finalizers":{".":{},"v:\"kubernetes.io/pv-protection\"":{}}},"f:status":{"f:phase":{}}}�storage1Giv*t 192.168.1.72b/mnt/storage-secondary/k8s-pv-test-cluster/default-hello1-pvc-5e649f65-a34e-45fe-aac3-1b7c774a3944ReadWriteOnceReadWriteMany"[PersistentVolumeClaimdefaulthello1"$5e649f65-a34e-45fe-aac3-1b7c774a3944*v1243088:*Delete2nfs-clientBFilesystem Bound"
If you look closely at the human-readable part, you will see the IP address, but changing only that part will result in a massive cluster breakdown, as there’s more to it than meets the eye.
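One reason a naive in-place byte swap breaks things: protobuf string fields are length-delimited on the wire, so a tag byte and a length prefix precede the bytes of the IP address. The hand-rolled sketch below (not a real k8s structure, just field number 1 with wire type 2) shows why substituting a longer address invalidates the record:

```go
package main

import "fmt"

// encodeStringField hand-rolls the protobuf wire encoding of one string
// field: a tag byte (field number and wire type 2, length-delimited),
// then a length prefix, then the raw bytes. Swapping "192.168.1.72" for
// a longer address without fixing the length prefix corrupts the value.
func encodeStringField(fieldNum int, s string) []byte {
	tag := byte(fieldNum<<3 | 2)      // wire type 2 = length-delimited
	out := []byte{tag, byte(len(s))}  // single-byte varint suffices for short strings
	return append(out, s...)
}

func main() {
	fmt.Printf("% x\n", encodeStringField(1, "192.168.1.72"))
	fmt.Printf("% x\n", encodeStringField(1, "nas.home.example"))
}
```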
The guys from Flant gave me an idea; I’m not alone in my quest of hacking etcd for profit (whether expressed in money or in saved time). They based their solution on the OpenShift etcdhelper app written in Go. Check it out: https://github.com/openshift/origin/tree/master/tools/etcdhelper
From here, the path was straightforward. I have touched Go a few times in my life, but I don’t have the kind of experience with it that I have with the other languages (Java/Python/.NET/C++) in which I wrote production platforms for many years. But how hard could it be? I quickly set up JetBrains GoLand and started experimenting.
The result was a method which takes a reference to the k8s client in Go, the Kubernetes StorageClass name you want to modify, and the new NFS server address you want to inject. We start by grabbing all persistent volumes from etcd, then iterate through them, filtering for those with the desired StorageClass. Once we’re certain a PV is indeed governed by an NFS controller, the condition being:
pv.Spec.NFS != nil
we can do a very simple substitution of the value in our object:
pv.Spec.NFS.Server = newNFSServer
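The filter-and-substitute step can be illustrated with a self-contained sketch; the structs below are simplified stand-ins for the real corev1 types, and the function name is mine, not the one in the repo:

```go
package main

import "fmt"

// Simplified stand-ins for the corev1.PersistentVolume fields we touch.
type NFSSource struct{ Server, Path string }
type PVSpec struct {
	StorageClassName string
	NFS              *NFSSource // nil when the PV is not NFS-backed
}
type PersistentVolume struct {
	Name string
	Spec PVSpec
}

// retargetNFSServer rewrites the NFS server on every PV of the given
// storage class, skipping PVs not governed by an NFS controller,
// and returns how many PVs were changed.
func retargetNFSServer(pvs []PersistentVolume, storageClass, newServer string) int {
	changed := 0
	for i := range pvs {
		pv := &pvs[i]
		if pv.Spec.StorageClassName != storageClass || pv.Spec.NFS == nil {
			continue
		}
		pv.Spec.NFS.Server = newServer
		changed++
	}
	return changed
}

func main() {
	pvs := []PersistentVolume{
		{Name: "pvc-a", Spec: PVSpec{StorageClassName: "nfs-client", NFS: &NFSSource{Server: "192.168.1.72", Path: "/mnt/a"}}},
		{Name: "pvc-b", Spec: PVSpec{StorageClassName: "local-path"}},
	}
	n := retargetNFSServer(pvs, "nfs-client", "nas.home.example")
	fmt.Println(n, pvs[0].Spec.NFS.Server)
}
```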
We also need to remember to encode the object with protobuf, which is easy enough with:
protoSerializer := protobuf.NewSerializer(scheme.Scheme, scheme.Scheme)
newObj := new(bytes.Buffer)
protoSerializer.Encode(obj, newObj)
and lastly, we hammer the serialized object into our etcd database with:
clientv3.NewKV(client).Put(context.Background(), string(kv.Key), newObj.String())
When we run our program, it changes the values in etcd directly, bypassing the k8s API altogether.
There is one more thing you need to do. Remember that those mounts exist at the operating-system level. Changing the value in etcd directly will NOT trigger the NFS controller to remount all your NFS shares; therefore, you need to reboot ALL affected k8s nodes (all nodes with NFS shares mounted). When a node boots up, the controller will read the PV definition from etcd and remount the share against the new IP or DNS name.
That way I was able to change hundreds of PVs to use the NFS coordinates they should have used from day one: a DNS entry instead of an IP address. A DNS entry is always something that can be changed easily on your DNS server.
Key take-aways
It may seem obvious, but always, absolutely always, even if it’s just your playground system, use DNS entries instead of hardcoded IPs or other values. As I said at the very beginning, you may know this, and you may think you will never need to change that value, until you actually do.
Never give up in tech; the fact that you’re blocked at the API level doesn’t mean you cannot do something. In this post we’ve made the immutable mutable. Yes, it was a hack, but isn’t everything we do in our daily jobs a hack of sorts?
Lastly, and most importantly, it goes without saying that this is not an official way of altering your k8s resources, by any stretch of the imagination. The code snippet comes with absolutely no warranty. Before making any changes to your etcd server you should always back up its state; there are tools available which can help you with that.
Git repo
The modified version of etcdhelper is available on my GitHub: https://github.com/rliwoch/etcdhelper-nfs