Talos - The Kubernetes Operating System
Introduction
I’ve had a home lab for playing with servers for going on 15 years. It started in University, as a tired old desktop with as many hard drives as I could scrounge up and keep alive, all jerry-rigged into a then-cutting-edge LVM2 file server, as much memory as we could afford to host game servers (one at a time or the poor thing would overheat), and IPCop so we could all share the single internet connection we had. It was a frankensteinian beast, but it served me well and lit the match to the furnace of my career.
Fast-forward to the present, and I’ve a very forgiving wife who has accepted that I will always have the mistress that is my home lab, now resplendent in her 12U rack, with 10G networking from Ubiquiti, battery back-up and power management by APC, and three Xeon D-based servers from SuperMicro. Some of it was even bought brand new, though most still came in second-hand. The time has come to give her a make-over, and I’ve decided to go all-in on Kubernetes this time around.
So, what is Talos?
Talos from Sidero Labs is a Linux distribution built from the ground up for Kubernetes. What is so great about that? Well, most Kubernetes deployments run on top of existing Linux distributions, such as Ubuntu or Amazon Linux, and in doing so inherit all the issues of the underlying distribution and its bundled software.
Talos is the reverse. When I say it was built from the ground up, I mean it ships only a Linux kernel (compiled with the KSPP config), grub, containerd and runc, plus a few Talos-built services to manage the host itself. It brings its own init, so there is no systemd, sysvinit, or upstart; machined manages the machine itself, replacing NetworkManager/systemd-networkd/netctl and handling local storage too. There is no shell in Talos, so no need for SSH either. Instead, Talos is managed by a robust and secure gRPC API with mutual TLS, handled by apid and trustd. It runs from a squashfs image, so it is immutable and ephemeral. It really is the absolute bare-minimalist distribution required to get a Kubernetes platform running in a reliable and secure manner.
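Everything about a running node is managed through that API. Just as a flavour of what that looks like in practice, here is a rough sketch of poking at a node with talosctl instead of SSH (it assumes a talosconfig has already been generated for the cluster, and the node IP is only an example):
$ talosctl --nodes 192.168.10.101 version
$ talosctl --nodes 192.168.10.101 services
The first reports the Talos version running on the node, and the second lists the Talos-managed services (apid, trustd, etcd, kubelet and friends) along with their health.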
With all that out of the way, let’s get started on the fun stuff!
Installing Talos
Talos supports installation on pretty much anything: cloud platforms like AWS, DigitalOcean, and Hetzner, bare metal platforms such as Equinix Metal, and local installs right on top of Docker or virtualised in VirtualBox or QEMU.
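If you just want to kick the tyres before dedicating any hardware, talosctl can also spin up a throwaway cluster on top of a local Docker daemon. A minimal sketch (the cluster name is arbitrary, and Docker needs to be running):
$ talosctl cluster create --name talos-sandbox
$ talosctl cluster destroy --name talos-sandbox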
Since I want to use my physical servers, this guide is going to stick to the bare metal methods. The same team that created Talos has also created Sidero Metal to manage Talos Kubernetes clusters on bare metal platforms, which I’ll cover at a later date, along with a highly-available control plane and customised network configuration. This first cluster will be a simple single control plane node with two worker nodes, and the networking will be handled by flannel, which comes built-in as standard with Talos.
Get Set Up
First things first, I need the talosctl tool to manage a Talos cluster, so I download the latest release and make it executable. For a simple boot process to get the cluster up-and-running, I’m just going to download the ISO and boot directly from that. You may want to install talosctl to a different directory depending on your $PATH environment variable.
$ curl -Lo ~/.local/bin/talosctl https://github.com/talos-systems/talos/releases/latest/download/talosctl-linux-amd64
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 151 100 151 0 0 537 0 --:--:-- --:--:-- --:--:-- 537
100 659 100 659 0 0 1415 0 --:--:-- --:--:-- --:--:-- 1415
100 56.1M 100 56.1M 0 0 20.7M 0 0:00:02 0:00:02 --:--:-- 33.5M
$ chmod +x ~/.local/bin/talosctl
$ curl -Lo ~/Downloads/talos-amd64.iso https://github.com/talos-systems/talos/releases/latest/download/talos-amd64.iso
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 146 100 146 0 0 1254 0 --:--:-- --:--:-- --:--:-- 1258
100 654 100 654 0 0 4375 0 --:--:-- --:--:-- --:--:-- 4375
100 72.7M 100 72.7M 0 0 30.6M 0 0:00:02 0:00:02 --:--:-- 35.0M
Next, generate the configuration needed for the Talos Kubernetes cluster, giving the cluster a name and API endpoint. In this case, I know the control plane node has a fixed IP given as a static lease from the DHCP server. You could just as easily use the host name if your internal DNS supports it.
$ export TALOS_CONTROL_PLANE="192.168.10.101"
$ talosctl gen config home-lab-talos-cluster https://${TALOS_CONTROL_PLANE}:6443
generating PKI and tokens
created /home/tim/projects/home-lab-talos/controlplane.yaml
created /home/tim/projects/home-lab-talos/worker.yaml
created /home/tim/projects/home-lab-talos/talosconfig
$ talosctl --talosconfig talosconfig config endpoint ${TALOS_CONTROL_PLANE}
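Before pointing these files at real machines, it doesn’t hurt to sanity-check them. A quick sketch using talosctl’s built-in validation, assuming the generated files are in the current directory (the mode matches the target platform, metal in this case):
$ talosctl validate --config controlplane.yaml --mode metal
$ talosctl validate --config worker.yaml --mode metal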
Now I am ready to boot the servers via the ISO downloaded earlier.
Bootstrapping the Control Plane
When booting an unconfigured environment, Talos will boot into a minimal ‘live’ system waiting for configuration. This can be provided by an HTTP endpoint serving the configuration, by putting it in user-data, or by using talosctl itself to push the configuration to the machine. I am using the latter method in this case.
During the boot sequence, there is a message to signal that Talos is waiting for configuration, which should look similar to the following:
[ 123.278059] [talos] task loadConfig (1/1): this machine is reachable at:
[ 123.446031] [talos] task loadConfig (1/1): 192.168.10.101
[ 123.999253] [talos] task loadConfig (1/1): server certificate fingerprint:
[ 124.170420] [talos] task loadConfig (1/1): dQJHkqoTA1/RP/YXS4QgJt/Pr6K7sR4JKQcb71R6cv4=
[ 124.370537] [talos] task loadConfig (1/1):
[ 124.476888] [talos] task loadConfig (1/1): upload configuration using talosctl:
[ 124.658006] [talos] task loadConfig (1/1): talosctl apply-config --insecure --nodes 192.168.10.101 --file <config.yaml>
[ 124.930462] [talos] task loadConfig (1/1): or apply configuration using talosctl interactive installer:
[ 125.167492] [talos] task loadConfig (1/1): talosctl apply-config --insecure --nodes 192.168.10.101 --interactive
[ 125.425906] [talos] task loadConfig (1/1): optionally with node fingerprint check:
[ 122.613633] [talos] task loadConfig (1/1): talosctl apply-config --insecure --nodes 192.168.10.101 --cert-fingerprint 'dQJHkqoTA1/RP/YXS4QgJt/Pr6K7sR4JKQcb71R6cv4=' --file <config.yaml>
This is the cue to apply the Talos configuration; unfortunately, it can scroll by pretty quickly as more kernel messages are output in the wait loop. In any case, I can apply the control plane configuration to the node a couple of minutes after booting.
$ talosctl --nodes ${TALOS_CONTROL_PLANE} apply-config --insecure --file controlplane.yaml
That starts the install process on the Talos control plane node, pulling the installer container, formatting the disk, and copying the file system into place. Once that is complete, Talos enters another wait-loop condition, this time to join the etcd cluster, which currently does not exist. This is shown in a looped message in the kernel logs, something akin to:
[ 110.511544] [talos] etcd is waiting to join the cluster, if this node is the first node in the cluster, please run `talosctl bootstrap`
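If I’m not sat at the console to catch it, the same kernel log can be followed remotely over the Talos API; a small sketch using the talosconfig generated earlier:
$ talosctl --talosconfig talosconfig --nodes ${TALOS_CONTROL_PLANE} dmesg --follow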
So, with the Talos configuration generated earlier I can bootstrap the Talos control plane:
$ talosctl --talosconfig talosconfig --nodes ${TALOS_CONTROL_PLANE} bootstrap
And Talos begins to configure the etcd cluster, as well as the static pods that function as part of the control plane. Once it is all done, a kernel message is logged confirming the boot process is complete and that I now have a functional Talos Kubernetes control plane.
[ 302.315792] [talos] boot sequence: done: 4m17.856857358s
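Before going any further, it’s worth confirming the control plane really is healthy. A quick sketch using talosctl’s built-in health check, which waits for etcd, the API server, and the other control plane components to report ready:
$ talosctl --talosconfig talosconfig --nodes ${TALOS_CONTROL_PLANE} health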
With that, the kubeconfig for the Talos cluster can be retrieved and used by kubectl.
$ talosctl --talosconfig talosconfig --nodes 192.168.10.101 kubeconfig ./kubeconfig
$ kubectl --kubeconfig ./kubeconfig get nodes
NAME STATUS ROLES AGE VERSION
talos-192-168-10-101 Ready control-plane,master 28m v1.23.1
The cluster still isn’t much use, as we only have a single control plane node, which by default won’t allow regular workload pods to be scheduled on it.
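That restriction comes from the control-plane taint applied to the node. As a quick illustration (the node name comes from the output above), the taint can be inspected with:
$ kubectl --kubeconfig ./kubeconfig describe node talos-192-168-10-101 | grep Taints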
Adding Worker Nodes
Adding the remaining nodes to the cluster as workload nodes is very similar to the process for the control plane, but instead applying the configuration file of the worker node at the configuration prompt of the boot process. Since the workers are all cattle, they can be configured at the same time with the same configuration. As with the control plane node, the workers get static IP leases from the DHCP server.
$ export TALOS_WORKER_01="192.168.10.102"
$ export TALOS_WORKER_02="192.168.10.103"
$ talosctl --nodes ${TALOS_WORKER_01} apply-config --insecure --file worker.yaml
$ talosctl --nodes ${TALOS_WORKER_02} apply-config --insecure --file worker.yaml
Once they have gone through the install and boot sequence, they will automatically join the cluster control plane set up earlier. We can check the node status with kubectl.
$ kubectl --kubeconfig ./kubeconfig get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
talos-192-168-10-101 Ready control-plane,master 20h v1.23.1 192.168.10.101 <none> Talos (v0.14.0) 5.15.6-talos containerd://1.5.8
talos-192-168-10-102 Ready <none> 49s v1.23.1 192.168.10.102 <none> Talos (v0.14.0) 5.15.6-talos containerd://1.5.8
talos-192-168-10-103 Ready <none> 31s v1.23.1 192.168.10.103 <none> Talos (v0.14.0) 5.15.6-talos containerd://1.5.8
$ kubectl --kubeconfig ./kubeconfig get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system coredns-6f779cf5f6-b7zmc 1/1 Running 0 20h 10.244.0.3 talos-192-168-10-101 <none> <none>
kube-system coredns-6f779cf5f6-kdkmk 1/1 Running 0 20h 10.244.0.2 talos-192-168-10-101 <none> <none>
kube-system kube-apiserver-talos-192-168-10-101 1/1 Running 0 20h 192.168.10.101 talos-192-168-10-101 <none> <none>
kube-system kube-controller-manager-talos-192-168-10-101 1/1 Running 2 (20h ago) 20h 192.168.10.101 talos-192-168-10-101 <none> <none>
kube-system kube-flannel-9x4tb 1/1 Running 0 13m 192.168.10.102 talos-192-168-10-102 <none> <none>
kube-system kube-flannel-b786x 1/1 Running 0 13m 192.168.10.103 talos-192-168-10-103 <none> <none>
kube-system kube-flannel-x4x9q 1/1 Running 0 20h 192.168.10.101 talos-192-168-10-101 <none> <none>
kube-system kube-proxy-f9dzb 1/1 Running 0 20h 192.168.10.101 talos-192-168-10-101 <none> <none>
kube-system kube-proxy-vff8v 1/1 Running 0 13m 192.168.10.102 talos-192-168-10-102 <none> <none>
kube-system kube-proxy-zd5gq 1/1 Running 0 13m 192.168.10.103 talos-192-168-10-103 <none> <none>
kube-system kube-scheduler-talos-192-168-10-101 1/1 Running 2 (20h ago) 20h 192.168.10.101 talos-192-168-10-101 <none> <none>
Using the Cluster
To ensure the cluster is working as expected, it would be prudent to run a simple test workload on it and check the results. I’m going to use the Google Kubernetes Bootcamp project for this.
First, set up a simple deployment and service.
$ kubectl --kubeconfig ./kubeconfig create deployment kubernetes-bootcamp --image=gcr.io/google-samples/kubernetes-bootcamp:v1
deployment.apps/kubernetes-bootcamp created
$ kubectl --kubeconfig ./kubeconfig get pods
NAME READY STATUS RESTARTS AGE
kubernetes-bootcamp-65d5b99f84-hpscj 1/1 Running 0 7m25s
$ kubectl --kubeconfig ./kubeconfig expose deployment/kubernetes-bootcamp --port 8080
service/kubernetes-bootcamp exposed
Forward the service locally and make sure the HTTP endpoint works:
$ kubectl --kubeconfig ./kubeconfig port-forward svc/kubernetes-bootcamp 8080 &
[2] 11198
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
$ curl localhost:8080
Hello Kubernetes bootcamp! | Running on: kubernetes-bootcamp-65d5b99f84-hpscj | v=1
Scale the deployment up and check the pods are distributed across the worker nodes:
$ kubectl --kubeconfig ./kubeconfig scale deployments/kubernetes-bootcamp --replicas=4
deployment.apps/kubernetes-bootcamp scaled
$ kubectl --kubeconfig ./kubeconfig get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kubernetes-bootcamp-65d5b99f84-8njn9 1/1 Running 0 45s 10.244.1.3 talos-192-168-10-102 <none> <none>
kubernetes-bootcamp-65d5b99f84-bdsgw 1/1 Running 0 45s 10.244.2.3 talos-192-168-10-103 <none> <none>
kubernetes-bootcamp-65d5b99f84-hpscj 1/1 Running 0 21m 10.244.2.2 talos-192-168-10-103 <none> <none>
kubernetes-bootcamp-65d5b99f84-rgl6x 1/1 Running 0 45s 10.244.1.2 talos-192-168-10-102 <none> <none>
It is pretty clear the cluster is working as expected, and is ready for general use.
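Before tearing anything down, the test workload can be removed and the background port-forward stopped; a short sketch (the job number comes from the earlier port-forward output):
$ kill %2
$ kubectl --kubeconfig ./kubeconfig delete service,deployment kubernetes-bootcamp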
Clean up
Now there is a working Talos cluster in the home lab; the default configuration put together here is great for testing and experimentation. If the machines need to be used for anything else, it’s best to restore them to their pre-Talos state with a simple reset.
$ talosctl --talosconfig talosconfig --nodes ${TALOS_CONTROL_PLANE},${TALOS_WORKER_01},${TALOS_WORKER_02} reset --graceful=false
This will wipe the installation disk of the Talos cluster nodes, including the boot loader, leaving them clean and ready for more testing!