Lab Update

This is a short update on the lab status and what I’ve been working on. Life’s been busy recently and posting hasn’t been a priority. I’m hoping that will change soon, as I’m pursuing opportunities that will give me more exposure to the communities I’ve loved and missed over the years.
Hardware
Much the same as before - I hadn’t needed to upgrade until recently, when I started actual development work on projects.
- Ubiquiti Router
- Cisco Switches
- pfSense (BGP)
- Lenovo Tiny desktops
  - Dedicated to the Kubernetes master nodes, freeing me from relying on my Pi clusters.
  - One system runs the home automation via Home Assistant.
- Raspberry Pi CM4 cluster (6 nodes, 8 GB RAM each, NVMe storage)
  - DeskPi Super6C Raspberry Pi CM4 cluster board
- Raspberry Pi 4 - solely handling DNS duties.
- Synology NAS
Differences
- Networking
  - BGP handles the Kubernetes service routing.
  - pfSense provides the BGP peering.
  - The default gateway routes the designated networks to the pfSense gateway address; the load balancer IPs themselves are provisioned dynamically in the cluster and learned by pfSense over BGP (see the sketch after this list).
- Storage
  - The Raspberry Pis are all sporting NVMe drives for stateful-set storage.
  - Half of the Raspberry Pis have internal eMMC storage.
  - Half are using SD cards (don’t do this).
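For the curious, here’s roughly what that BGP setup looks like on the Cilium side. This is a minimal sketch, assuming Cilium’s BGP control plane is enabled; the ASNs, node label, and pfSense peer address are placeholders for my own values, and field names can shift between Cilium versions.

```yaml
# CiliumBGPPeeringPolicy: peer the labeled nodes with pfSense so LoadBalancer
# IPs get advertised upstream. ASNs and addresses are illustrative.
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeeringPolicy
metadata:
  name: pfsense-peering
spec:
  nodeSelector:
    matchLabels:
      bgp: enabled          # only nodes carrying this label peer with pfSense
  virtualRouters:
    - localASN: 64512       # the cluster's private ASN (example)
      exportPodCIDR: false  # advertise service IPs, not pod CIDRs
      serviceSelector:      # match-everything trick: advertise all LB services
        matchExpressions:
          - { key: somekey, operator: NotIn, values: ["never-used-value"] }
      neighbors:
        - peerAddress: "10.0.0.1/32"  # pfSense gateway address (example)
          peerASN: 64513              # pfSense's ASN (example)
```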
Software
- OS: Talos Linux
- Networking: Cilium
- Storage: local-path-provisioner
- Network Monitoring: Hubble UI
- Monitoring: Prometheus
- Reporting: Grafana
- Secrets: Vault
A few YouTube videos and a bit of reading about Talos, and I knew I wanted to learn more. It had everything I wanted:
- API access
- Required config management (if you’re doing it right, please don’t just apply random patches, use the pipeline)
- Security by default
If I was going to learn how to leverage Kubernetes, I might as well learn it right!
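To put the config-management point into practice, here’s the shape of it: small, reviewable patches living in git and applied through the pipeline rather than ad-hoc edits. The file path, node IP, and hostname below are hypothetical.

```yaml
# patches/worker-01-hostname.yaml (hypothetical): a tiny, reviewable Talos
# machine-config patch kept in git. The pipeline applies it with something like:
#   talosctl patch machineconfig --nodes 10.0.10.21 --patch @patches/worker-01-hostname.yaml
machine:
  network:
    hostname: worker-01
```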
So I wiped the functional Debian Kubernetes cluster and started from scratch… It was more difficult than I expected, and I brought a lot of the pain upon myself: a few issues getting the Pis configured properly, a desire to PXE boot only to realize my setup wouldn’t work as configured, hardware failures, changing jobs, moving homes, getting everything set up again, buying new 128GB SD cards (they are all bad…), flashing what I had, and finally starting to build the cluster.
Mistakes
- SD cards too small
  - I needed to use what I had, and the Talos OS is pretty small. I’m using NVMe for the persistent storage, so I’m good! WRONG! I forgot about ephemeral storage. Pods that don’t have stateful-sets still need scratch space to operate, and that lives in ephemeral storage. Since I dedicated the entire NVMe drive to the local-path-provisioner, only what was left on the SD card was available for ephemeral storage. That means my pods are beating the hell out of the SD cards on half the worker nodes. Less than ideal, and it led to an idea for expansion… (see the machine-config sketch after this list).
- Didn’t plan out the networking
  - I was trying to get everything working, assigned an entire /24 subnet to the load balancer pool, and it worked! CHEERS! I went to bed. Later I came back and realized that wasn’t what I wanted; I had to go back, plan out the environments the way I actually wanted them, and touch it all a second time. Now they’re configured and everything is working as planned (see the IP-pool sketch after this list). More of a personal ‘damn it’ than a mistake, but I’m learning and should have put some thought into it beforehand.
  - Why do I want to separate my networking environments? This is a lab where I’m trying to simulate an enterprise environment with minimal hardware, and network separation is simple and effective if you have proper network controls. I would eventually like to migrate this to VLANs, but I’ve had mixed results testing with the Ubiquiti and my Cisco gear. It’s on the list, but not a priority until the expansion plans happen.
- Hardware
  - The Pi CM4s are handling my current load without an issue - I’m actually quite impressed! My issue is the networking: it’s a hardware limitation that each CM4 connects to a 1 Gb switch that has a single 1 Gb external uplink. My control nodes are x64 systems, each connected to its own 1 Gb port. Everything can communicate, but the network is… less than reliable right now.
  - The control nodes are older systems, and it certainly shows. They need to be replaced, and I would like to expand the cluster with additional Pi nodes to do it; they are efficient and do the job.
- Configuration - jeez, here are the big ones…
  - I’m learning as I go, so the first thing I did was change the cluster name. While this can be overcome and may even be normal practice, it’s not something you should do while you’re learning.
  - I tried to get MetalLB working on the Talos nodes. It worked on the Debian cluster, but I really didn’t understand the security posture of Talos yet. MetalLB required more access than I was comfortable configuring, but it started me down the BGP path, so it all worked out.
  - I tried to initialize, unseal, join nodes, and configure auto-unseal in a single step. Why?!? Well, like I said, I wanted this to be a production-grade, security-first cluster, so I don’t want Vault to stay sealed if the cluster or a pod goes down. I certainly want it in production mode, clustered, and using Raft. I currently have everything working from bootstrap to unsealed and ready for use, but I haven’t revisited auto-unseal yet (see the Vault values sketch after this list).
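Here’s the machine-config sketch I mentioned under the SD card mistake: dedicating the NVMe drive to a mount for the local-path-provisioner. The device name and mountpoint are my assumptions; Talos keeps its EPHEMERAL partition on the install disk (the SD card here), which is exactly the problem.

```yaml
# Talos machine-config patch: hand the whole NVMe drive to a mount used by
# local-path-provisioner. Device name and mountpoint are assumptions.
machine:
  disks:
    - device: /dev/nvme0n1
      partitions:
        - mountpoint: /var/mnt/local-path   # point local-path-provisioner here
```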
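And the IP-pool sketch from the networking mistake: deliberately sized pools per environment instead of one giant /24. The name and CIDR are placeholders, and note that newer Cilium releases use spec.blocks where older ones used spec.cidrs.

```yaml
# One right-sized LB IPAM pool per environment instead of a blanket /24.
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: prod-pool
spec:
  blocks:
    - cidr: 10.0.50.0/27   # 30 usable LB IPs for one environment (example range)
```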
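Finally, the Vault piece. This is a minimal sketch of the Helm values for the production-mode, Raft-clustered setup I described; the replica count and listener settings are illustrative, not my exact values, and auto-unseal (e.g., via a cloud KMS) would layer on top later.

```yaml
# values.yaml for the official Vault Helm chart: HA with integrated Raft storage.
server:
  ha:
    enabled: true
    replicas: 3          # one Vault pod per failure domain
    raft:
      enabled: true
      config: |
        ui = true
        listener "tcp" {
          address         = "[::]:8200"
          cluster_address = "[::]:8201"
        }
        storage "raft" {
          path = "/vault/data"
        }
        service_registration "kubernetes" {}
```

Bootstrap is still `vault operator init` plus an unseal on each replica; wiring all of that into one automated step is the part I haven’t revisited.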
Things to Finish
These are actually running, just not fully configured. They are also going to be collecting stats from the Home Assistant installation (see the scrape-config sketch after this list).
- Prometheus
- Grafana
- Home Assistant integration - I want to trigger events or start pods/apps based on external inputs. These plans are still being developed and haven’t been researched yet.
- BDR - not if, when! I have this set up, and now I need to protect it and the workloads. I want to do this two ways:
  - Local: back up to NFS, both the Kubernetes components (etcd, configs, secrets, etc.) and the applications. While snapshots are great in general, they aren’t in this environment, so I’m going to need other solutions.
  - Remote: I’m going to leverage Storj for the remote backup option. Why? Well, it’s a lot less expensive than S3, client-side encrypted, an interesting project, and I don’t need compliance in my lab.
- DHCP
  - I’m currently using DHCP off my Ubiquiti router, and it’s not very feature-rich when it comes to what I need. I’m already using BIND for DNS and would like to migrate to Kea for DHCP. That will let me start leveraging PXE, which will bring a whole new game to the lab!
- PXE
  - Currently the systems boot from SD cards located underneath the cluster board; while the board is mounted, they can’t be accessed. If an SD card accidentally gets wiped - say, from forgetting to specify a partition - you have to bring the entire cluster down, disassemble it, remove and re-flash the SD card, and put it all back together. Less than ideal for a lab. Once PXE boot is configured, that dependency goes away and the SD card can serve as just ephemeral storage.
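The scrape-config sketch mentioned above: on the Home Assistant side it’s just `prometheus:` in configuration.yaml, which exposes metrics at /api/prometheus; the Prometheus side is a scrape job like the one below. The hostname, port, and token are placeholders.

```yaml
# Prometheus scrape job for Home Assistant's /api/prometheus endpoint.
# The target address and the long-lived access token are placeholders.
scrape_configs:
  - job_name: homeassistant
    metrics_path: /api/prometheus
    authorization:
      credentials: "PASTE-LONG-LIVED-TOKEN-HERE"   # HA long-lived access token
    static_configs:
      - targets: ["homeassistant.local:8123"]
```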
Things I would do differently
This is a loaded idea because there’s not much I would really keep the same.
- I would do a lot more planning, now that I know what to look for in a general sense. That would reduce the number of mistakes, changes, confusing configurations, and general cruft in the system. I’m going to cut myself a little slack, since this is my first cluster leveraging Talos and I was learning the general workings. There’s a lot more, but I learn by doing, so I’m definitely going to break something.
- Different hardware: I would NOT use the cluster board for what I’m trying to do. While it’s completely functional for learning and small projects, it’s not appropriate for my use case. I would replace the cluster board with individual PoE Pi boards or blades: boot from PXE (preferred) or SD card, eMMC for ephemeral storage, and NVMe for stateful-sets.
Current Uses
I’m writing cloud-first apps leveraging Kubernetes, and just like the cluster they’re a continuous work in progress - but a great learning platform: how to leverage different components, proper secret management, pipeline development and checks, etc.
The Stack
- Database: MongoDB (stateful-set)
- Caching: Redis
- Frontend: Streamlit and FastAPI/Kong
- Monitoring: Prometheus
- Reporting: Grafana
- Security: Vault
- Pipeline: GitHub Actions
- Runners: Local
Current Development
- Health & Safety app
- Agent Analysis app
- Agent Development Company (ambitious, but it’s fun and I’m learning lots)
Plans
I’m going to need to expand this soon, and that’s one of the things I think is great about Talos. I’m leveraging KubeSpan and am going to configure autoscaling into GCP (price - it’s always price). KubeSpan creates a WireGuard VPN across the nodes of the cluster, encrypting all traffic in flight. This lets me extend the cluster into multiple cloud providers as needed, dynamically leveraging the least expensive resources possible, and it seamlessly protects against a major single-cloud-provider failure. Leveraging Terraform, I’ll be able to maintain standards and change control across multiple providers. This is an awesome win!
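Enabling KubeSpan in Talos is refreshingly small, which is what makes the stretch-into-GCP idea practical. A minimal sketch of the machine-config patch (node discovery, which KubeSpan relies on, uses Sidero’s public discovery service by default):

```yaml
# Talos machine-config patch: enable KubeSpan's WireGuard mesh between nodes,
# with cluster discovery so nodes can find each other across NAT/providers.
machine:
  network:
    kubespan:
      enabled: true
cluster:
  discovery:
    enabled: true
```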
Anyway, I think that’s enough rambling, but that’s the status of the lab and why it’s in its current state. I’m looking forward to working on these projects and hope to transition into working on this type of stack full-time. I’m excited about the possibilities and where they can lead, and I’m certainly not short on ideas for how to leverage this in all sorts of settings.
Thanks for sticking around. Cheers!