
Talos Linux: The Kubernetes OS of Dreams

Author
Liam Hardman

What & Why Talos?

Over the last few years, I’ve hosted my Kubernetes cluster in a few different ways. I’ve covered parts of this on my own blog, but things never felt especially stable. Was that user error on my part? Yes, definitely. But for a (at the time) newcomer to Kubernetes, ‘The Golden Path’ wasn’t exactly clear.

Nowadays though, it’s pretty clear to me. If I’m not using a hosted Kubernetes service, then I’m using Talos Linux. To give a brief explanation, Talos Linux is a Kubernetes OS. I mean that quite literally, since there’s no console. You don’t even go through an installer; you just boot the ISO and it’s ready to go. There’s one slight annoyance in that you have to use talosctl to configure the cluster, but it’s much better than something like Rancher, which I used before, since you can keep track of what you’ve actually done via the generated config files.
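To give a flavour of what that talosctl workflow looks like, here’s a minimal bootstrap sketch. The cluster name is a placeholder, and the node IP is borrowed from my first control-plane node; adjust both for your own network.

```shell
# Minimal sketch of bootstrapping a Talos cluster with talosctl.
# Cluster name and IPs are illustrative.

# Generate controlplane.yaml, worker.yaml and talosconfig locally
talosctl gen config home-cluster https://192.168.6.41:6443

# Push the config to a node booted from the Talos ISO (still in maintenance mode)
talosctl apply-config --insecure --nodes 192.168.6.41 --file controlplane.yaml

# Bootstrap etcd on the first control-plane node, then grab a kubeconfig
talosctl bootstrap --nodes 192.168.6.41 --endpoints 192.168.6.41 --talosconfig ./talosconfig
talosctl kubeconfig --nodes 192.168.6.41 --endpoints 192.168.6.41 --talosconfig ./talosconfig
```

Those generated YAML files are the record of ‘what you’ve actually done’, and they check straight into git.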

One big advantage of Talos is that since it’s only used for Kubernetes, it’s extremely light. Idle RAM usage, even once a node has joined a cluster, is usually less than 200MB. That means with the same amount of RAM you’d otherwise spend running Kubernetes on a general-purpose distro, you can provision more nodes and thus get more redundancy. For reference, here’s what my cluster looks like:

Cluster Topology

```mermaid
graph TD
    Cluster[Talos Cluster]

    subgraph CP [Control Plane]
        direction LR
        N1[talos-cp-01<br>192.168.6.41]
        N2[talos-cp-02<br>192.168.6.48]
    end

    subgraph W [Workers]
        direction LR
        N3[talos-w-01<br>192.168.6.42]
        N4[talos-w-02<br>192.168.6.43]
        N5[talos-w-03<br>192.168.6.44]
        N6[talos-w-04<br>192.168.6.45]
    end

    Cluster --- CP
    Cluster --- W
```

I’ve rolled with this setup since I’ve got 2 physical machines in my rack, which means that if one of them goes down, my running applications won’t be affected (though note that with only two control-plane nodes, etcd loses quorum until the second one comes back).
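If you want to sanity-check that the live cluster matches that topology, both kubectl and talosctl can report it; a quick sketch, assuming the kubeconfig and talosconfig from the initial bootstrap are in place:

```shell
# List nodes with roles and IPs as Kubernetes sees them
kubectl get nodes -o wide

# Ask Talos itself which machines have joined
# (members is a Talos cluster-discovery resource)
talosctl --endpoints 192.168.6.41 --nodes 192.168.6.41 get members
```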

Storage

For any new cluster, you’ve got a few options for storage. If you’ve got a storage server running on your network, you could try your hand at the NFS Provisioner. Personally, it just wasn’t that stable for me, and it restricted me when running stateful applications with multiple replicas.

You’ve also got Longhorn as an option, but it has a few limitations that make it a bit pointless in my opinion. Most notably:

  • No sharding of volumes across multiple nodes
  • Frequent hanging when expanding PVCs
  • Documentation for some edge-case workloads is a bit sparse

Ceph is what I went with in the end. Is the config really frustrating at points? Definitely. But I’ve had it running for quite some time now and it’s been very stable, so the pain of the initial config phase has definitely been worth it.
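The usual way to run Ceph on Kubernetes is via the Rook operator; here’s a rough sketch of the Helm install, using chart names from Rook’s release repo. The resource sizing and CephCluster spec are where the real config pain lives, and they’re omitted here.

```shell
# Install the Rook operator, then a CephCluster via the companion chart.
# Namespace and release names follow Rook's documented defaults.
helm repo add rook-release https://charts.rook.io/release
helm install --create-namespace --namespace rook-ceph rook-ceph rook-release/rook-ceph
helm install --namespace rook-ceph rook-ceph-cluster \
  --set operatorNamespace=rook-ceph rook-release/rook-ceph-cluster
```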

I Declare, Therefore It Is

One of the aspects I really enjoy about Talos is that it forces you to think about configuring OSes very differently from ’the old days’. It’s a restriction, and one that will frustrate you at times, but the fact that you have to put so much effort into getting a remote console on any node means you’re forced onto the ‘golden path’ of declarative configuration.

As a good example, check back in with this article I wrote on RKE2. It’s just under 2000 words and requires a good bit of manual fiddling. If you want to re-apply the config, you’d have to go back to your notes, or that article, and repeat the process. To set up that exact same cluster with Talos, you can follow the instructions in this git repo instead.
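Re-applying is then a single command per node; a sketch, assuming the machine configs live in that repo (the dry-run pass shows the diff before anything changes):

```shell
# Preview what would change on a worker, then apply it for real
talosctl apply-config --nodes 192.168.6.42 --dry-run --file worker.yaml
talosctl apply-config --nodes 192.168.6.42 --file worker.yaml
```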

1967 words in an article, compared to 437 in a git repo. When you have to apply it again, that’s still 1967 words to refamiliarize yourself with, compared to 437 in the repo. If I could declaratively create a time machine and tell past me to build a Talos cluster instead, it would’ve saved me at least 4000 words of reading! Then again, the Talos docs, while comprehensive, are a lot of reading too.

The cherry on top is that this is somewhat self-documenting as well. If you want to know how your cluster is configured, you can just check the config files. You’ll still want some additional documentation on the side, but half the battle is already won.