Before RKE2 — What Problems Does It Solve?
Imagine you built an amazing web application. It's running on one server. Then thousands of users show up. One server can't handle it. You need many servers working together — but how do you coordinate them?
Think of a restaurant kitchen. One chef can cook a few meals. But for a busy restaurant you need many chefs, a head chef (manager), someone to take orders, and a system to make sure every dish gets cooked and served. Kubernetes is that entire kitchen management system for your software. RKE2 is a specific, security-first way to set up that kitchen.
The Core Problems
Container Orchestration
You have dozens or hundreds of containers. Someone needs to decide which server runs which container, restart them if they crash, and scale them up/down.
Security & Compliance
Government agencies and enterprises need Kubernetes that is hardened out of the box — CIS benchmarks, FIPS encryption, and audit logging by default.
Operational Simplicity
Setting up vanilla Kubernetes is hard (dozens of certificates, configs, binaries). RKE2 bundles everything into a single binary with sane defaults.
What Exactly Is RKE2?
RKE2 (Rancher Kubernetes Engine 2), also known as RKE Government, is a fully conformant Kubernetes distribution built by SUSE / Rancher. It's designed for environments where security is non-negotiable.
RKE2 combines the ease of use of K3s (Rancher's lightweight distro) with the close upstream alignment of RKE1 — and adds hardened security on top.
The Name Explained
| Part | Meaning |
|---|---|
| RKE | Rancher Kubernetes Engine — Rancher's family of K8s distributions |
| 2 | Second generation, replacing RKE1 (which reached end-of-life in July 2025) |
| "Government" | Originally built to meet U.S. federal security standards (DISA STIG, CIS, FIPS) |
What You Get Out of the Box
- CNCF-certified Kubernetes (passes all conformance tests)
- Embedded etcd — no external database needed
- containerd runtime (Docker-free)
- CIS Kubernetes Benchmark hardening by default
- FIPS 140-2 compliant cryptography
- Built-in Canal CNI (Calico + Flannel)
- Built-in NGINX Ingress Controller
- Built-in CoreDNS and Metrics Server
- Helm Controller for managing add-ons
- SELinux support
RKE2 Architecture — The Big Picture
An RKE2 cluster has two types of machines (called nodes):
Server Node (Control Plane)
The "brain" of the cluster. It decides what runs where, stores the cluster state, and exposes the Kubernetes API. In production you run 3 server nodes for high availability.
Agent Node (Worker)
The "muscles" of the cluster. These actually run your applications (as containers inside Pods). You can have as many agent nodes as you need.
How the Pieces Talk to Each Other
Port Reference
| Port | Protocol | Purpose | Used By |
|---|---|---|---|
| 9345 | TCP | RKE2 supervisor API (node registration) | Server & Agent nodes |
| 6443 | TCP | Kubernetes API server | kubectl, agents, external |
| 2379-2380 | TCP | etcd client & peer communication | Server nodes only |
| 10250 | TCP | kubelet metrics / exec | All nodes |
| 8472 | UDP | VXLAN (Canal/Flannel overlay) | All nodes |
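If a host firewall is running, these ports must be opened before nodes can talk to each other. As a sketch, assuming firewalld on a RHEL-family host (adjust for ufw, security groups, or your own zones):

```shell
# Open the RKE2 ports listed above (run on every node; adjust per role)
sudo firewall-cmd --permanent --add-port=9345/tcp       # supervisor API
sudo firewall-cmd --permanent --add-port=6443/tcp       # Kubernetes API
sudo firewall-cmd --permanent --add-port=2379-2380/tcp  # etcd (server nodes only)
sudo firewall-cmd --permanent --add-port=10250/tcp      # kubelet
sudo firewall-cmd --permanent --add-port=8472/udp       # VXLAN overlay
sudo firewall-cmd --reload
```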
Core Components Explained
Let's break down every piece running inside an RKE2 cluster, sticking with the restaurant kitchen analogy: each Kubernetes component maps to a role in the kitchen.
kube-apiserver
The Front Desk. Every request goes through here — "create a pod", "list services", "delete a deployment". Port 6443.
etcd
The Recipe Book. A distributed key-value database storing the entire state of the cluster. RKE2 embeds it — no separate setup.
kube-scheduler
The Assignment Manager. Picks the best node for new pods based on CPU, memory, affinity rules.
kube-controller-manager
The Quality Inspector. Watches state and makes corrections. If 3 replicas desired but 2 running — creates the 3rd.
kubelet
The Line Cook. Runs on every node. Receives instructions and makes sure containers are running.
kube-proxy
The Waiter. Manages network rules so requests reach the right pod — even across nodes.
containerd
The Oven. The container runtime that pulls images, starts containers, manages their lifecycle. Not Docker.
Canal CNI
The Intercom. Calico handles network policies. Flannel creates the overlay so all pods can communicate.
NGINX Ingress
The Entrance. Routes external HTTP/HTTPS traffic to correct services based on hostnames and paths.
CoreDNS
Name Tags. Internal DNS so pods find services by name instead of IP addresses.
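You can see CoreDNS at work with a quick lookup from a throwaway pod (assumes a running cluster; busybox's nslookup output format varies by image version):

```shell
# Resolve a service name through the cluster DNS
kubectl run dns-test --image=busybox --rm -it --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local
```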
Metrics Server
The Thermometer. Collects CPU/memory usage. Required for kubectl top and autoscaling.
Helm Controller
Recipe Installer. Auto-deploys Helm charts shipped with RKE2 via HelmChart CRDs.
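The same mechanism works for your own add-ons: any HelmChart manifest dropped into the server's manifests directory is deployed automatically. A minimal sketch — the repo URL and chart name here are illustrative placeholders, not a real chart:

```shell
# Hypothetical example: auto-deploy a chart via the Helm Controller
sudo tee /var/lib/rancher/rke2/server/manifests/my-app.yaml <<EOF
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: my-app
  namespace: kube-system
spec:
  repo: https://charts.example.com   # assumption: your chart repository
  chart: my-app
  targetNamespace: default
  valuesContent: |-
    replicaCount: 2
EOF
```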
How RKE2 Starts — The Bootstrap Sequence
The single binary contains everything: apiserver, scheduler, controller-manager, etcd, kubelet, and containerd.
Self-signed CA + all component certificates are created automatically in /var/lib/rancher/rke2/server/tls/.
The embedded etcd launches and initializes the cluster state database.
API server comes online, listening on port 6443.
Unlike RKE1 (which used Docker), RKE2 runs control plane components as static pods managed by the kubelet.
The Helm Controller deploys bundled add-ons: Canal CNI, CoreDNS, NGINX Ingress, and Metrics Server.
A token is written to /var/lib/rancher/rke2/server/node-token — this is what agent nodes use to join.
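After a successful bootstrap you can inspect each of these artifacts on the server node:

```shell
# Auto-generated CA and component certificates
sudo ls /var/lib/rancher/rke2/server/tls/
# Control plane static pod manifests (run by the kubelet, not Docker)
sudo ls /var/lib/rancher/rke2/agent/pod-manifests/
# Join token for agent nodes
sudo cat /var/lib/rancher/rke2/server/node-token
```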
How RKE2 Differs from K3s & RKE1
Rancher/SUSE offers three Kubernetes distributions. Here's how they compare:
| Feature | RKE1 (EOL) | K3s | RKE2 |
|---|---|---|---|
| Status | End-of-life (Jul 2025) | Active | Active (recommended) |
| Target | General purpose | Edge / IoT / Dev | Enterprise / Government |
| Runtime | Docker | containerd | containerd |
| Data Store | External etcd | SQLite / etcd / MySQL / Postgres | Embedded etcd only |
| CIS Hardened | Manual effort | Manual effort | By default |
| FIPS 140-2 | No | No | Yes |
| SELinux | Limited | Supported | Full support |
| Secrets Encryption | Manual | Optional flag | Enabled by default |
| Min Resources | 4 GB RAM | 512 MB RAM | 4 GB RAM |
| Control Plane | Docker containers | Single process | Static pods via kubelet |
| Windows Nodes | Yes | No | Yes |
Choose RKE2 if: you need production-grade security, compliance certifications, enterprise support, or you're replacing RKE1.
Choose K3s if: you're running on edge devices, Raspberry Pis, CI/CD environments, or need minimal resource usage.
Avoid RKE1: it's end-of-life. Migrate to RKE2 or K3s.
Hands-On: Installing a Single-Node Cluster
Let's set up a working RKE2 cluster from scratch. We'll start with a single server node — perfect for learning.
Prerequisites
| Requirement | Minimum | Recommended |
|---|---|---|
| OS | Ubuntu 20.04+ / RHEL 8+ / SLES 15+ | Ubuntu 22.04 LTS |
| CPU | 2 cores | 4 cores |
| RAM | 4 GB | 8 GB |
| Disk | 20 GB | 50+ GB SSD |
| User | root or sudo access | |
RKE2 provides an installer script that sets up the binary and systemd service:
# Download and run the official installer
curl -sfL https://get.rke2.io | sudo sh -
# What just happened?
# 1. Downloaded the rke2 binary to /usr/local/bin/rke2
# 2. Created a systemd service: rke2-server.service
# 3. Created config directory: /etc/rancher/rke2/
# Create the config directory
sudo mkdir -p /etc/rancher/rke2
# Create config file
sudo tee /etc/rancher/rke2/config.yaml <<EOF
# Specify a custom node name
node-name: my-first-server
# Enable CIS hardening profile
profile: cis
# Use Canal CNI (default, but being explicit)
cni: canal
# Write kubeconfig with this permission
write-kubeconfig-mode: "0644"
# TLS Subject Alternative Names
tls-san:
- my-server.example.com
- 10.0.0.100
EOF
# Enable the service so it starts on boot
sudo systemctl enable rke2-server.service
# Start RKE2 (this takes 2-5 minutes on first run)
sudo systemctl start rke2-server.service
# Watch the startup logs in real-time
sudo journalctl -u rke2-server -f
The first start downloads container images and bootstraps etcd + all control plane components. It can take 2–10 minutes depending on your internet speed.
# Add RKE2 binaries to PATH
echo 'export PATH=$PATH:/var/lib/rancher/rke2/bin' >> ~/.bashrc
# Point kubectl to the RKE2 kubeconfig
echo 'export KUBECONFIG=/etc/rancher/rke2/rke2.yaml' >> ~/.bashrc
# Apply changes immediately
source ~/.bashrc
# Verify kubectl works (the --short flag was removed in recent kubectl releases)
kubectl version
# Check node status (should show "Ready")
kubectl get nodes
NAME STATUS ROLES AGE VERSION
my-first-server Ready control-plane,etcd,master 3m v1.30.x+rke2r1
# Check all system pods are running
kubectl get pods -A
NAMESPACE NAME READY STATUS
kube-system etcd-my-first-server 1/1 Running
kube-system kube-apiserver-my-first-server 1/1 Running
kube-system kube-controller-manager-my-first-server 1/1 Running
kube-system kube-scheduler-my-first-server 1/1 Running
kube-system rke2-canal-xxxxx 2/2 Running
kube-system rke2-coredns-rke2-coredns-xxxxx 1/1 Running
kube-system rke2-ingress-nginx-controller-xxxxx 1/1 Running
kube-system rke2-metrics-server-xxxxx 1/1 Running
# Save the node token (needed for adding more nodes later)
sudo cat /var/lib/rancher/rke2/server/node-token
K10abc123def456...::server:xyz789...
You now have a working single-node RKE2 Kubernetes cluster with all core components running.
Hands-On: Adding Worker Nodes
A single node is great for learning, but in production you need dedicated worker (agent) nodes. Here's how to add them.
# Install RKE2 in AGENT mode (not server)
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE="agent" sudo sh -
sudo mkdir -p /etc/rancher/rke2
sudo tee /etc/rancher/rke2/config.yaml <<EOF
# Point to the server's supervisor port
server: https://10.0.0.100:9345
# Paste the token from the server node
token: K10abc123def456...::server:xyz789...
# Give this worker a friendly name
node-name: worker-01
EOF
sudo systemctl enable --now rke2-agent.service
# Back on the server node:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
my-first-server Ready control-plane,etcd,master 30m v1.30.x+rke2r1
worker-01 Ready <none> 2m v1.30.x+rke2r1
# Label the worker for organization
kubectl label node worker-01 node-role.kubernetes.io/worker=worker
Repeat the install, config, and enable steps above on each additional machine, changing the node-name each time (worker-02, worker-03, and so on).
Hands-On: High-Availability (HA) Setup
For production, you need 3 server nodes so that if one dies, the cluster keeps running.
etcd needs a majority of nodes alive to accept writes. With 3 nodes, you can lose 1 (2/3 majority). With 2 nodes, losing 1 means no majority — cluster is stuck!
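The quorum arithmetic generalizes: an n-node etcd cluster needs floor(n/2)+1 members alive and tolerates floor((n-1)/2) failures — which is why odd sizes (3, 5) are used:

```shell
# Quorum and fault tolerance for etcd cluster sizes 1-5
for n in 1 2 3 4 5; do
  echo "$n nodes -> quorum $(( n/2 + 1 )), can lose $(( (n-1)/2 ))"
done
```

Note that 4 nodes tolerate no more failures than 3, so the extra node buys nothing.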
curl -sfL https://get.rke2.io | sudo sh -
sudo mkdir -p /etc/rancher/rke2
sudo tee /etc/rancher/rke2/config.yaml <<EOF
token: my-shared-secret-token-12345
tls-san:
- rke2.example.com
- 10.0.0.100
- 10.0.0.101
- 10.0.0.102
EOF
sudo systemctl enable --now rke2-server.service
# Wait until fully ready before starting server 2!
curl -sfL https://get.rke2.io | sudo sh -
sudo mkdir -p /etc/rancher/rke2
sudo tee /etc/rancher/rke2/config.yaml <<EOF
server: https://10.0.0.100:9345
token: my-shared-secret-token-12345
tls-san:
- rke2.example.com
- 10.0.0.100
- 10.0.0.101
- 10.0.0.102
EOF
sudo systemctl enable --now rke2-server.service
# /etc/nginx/nginx.conf (on a separate LB machine)
stream {
    upstream rke2_api {
        least_conn;
        server 10.0.0.100:6443 max_fails=3 fail_timeout=5s;
        server 10.0.0.101:6443 max_fails=3 fail_timeout=5s;
        server 10.0.0.102:6443 max_fails=3 fail_timeout=5s;
    }
    upstream rke2_supervisor {
        least_conn;
        server 10.0.0.100:9345 max_fails=3 fail_timeout=5s;
        server 10.0.0.101:9345 max_fails=3 fail_timeout=5s;
        server 10.0.0.102:9345 max_fails=3 fail_timeout=5s;
    }
    server { listen 6443; proxy_pass rke2_api; }
    server { listen 9345; proxy_pass rke2_supervisor; }
}
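With the load balancer in place, additional servers and agents can join through it instead of a single server IP, so the cluster survives the loss of any one server. A sketch, assuming the LB answers at rke2.example.com:

```shell
# On joining nodes, point at the load balancer rather than one server
sudo tee /etc/rancher/rke2/config.yaml <<EOF
server: https://rke2.example.com:9345
token: my-shared-secret-token-12345
EOF
```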
Networking Deep Dive (CNI)
If pods are apartments in different buildings (nodes), the CNI is the road system and phone lines that connect them. Without CNI, pods on different nodes can't talk to each other.
Supported CNI Plugins
| CNI | What It Does | Best For |
|---|---|---|
| Canal (default) | Flannel (VXLAN overlay) + Calico (network policies). Simple, reliable. | Most use cases |
| Calico | Full Calico: BGP routing + advanced network policies. No overlay = better perf. | Large clusters, bare metal |
| Cilium | eBPF-based networking. Advanced observability, L7 policies, service mesh. | Modern stacks, security |
| Multus | Meta-plugin: attaches multiple network interfaces to pods. | Telco / NFV workloads |
How to Switch CNI
# In /etc/rancher/rke2/config.yaml BEFORE first start — set exactly ONE:
cni: cilium        # Option A: Cilium
cni: calico        # Option B: Calico
cni: multus,canal  # Option C: Multus + Canal
You must choose the CNI before the first server start. Changing CNI on a running cluster requires a full reinstall.
Security & CIS Hardening
Security is RKE2's biggest selling point. Let's understand what it does automatically and what you need to do manually.
What RKE2 Does Automatically
Secrets Encryption at Rest
K8s Secrets stored in etcd are encrypted. Attackers with access to etcd data can't read your passwords.
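You can confirm encryption at rest with RKE2's built-in subcommand (run on a server node):

```shell
# Show the current secrets-encryption status and active key
sudo rke2 secrets-encrypt status
```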
Pod Security Standards
With profile: cis, RKE2 enforces the restricted Pod Security Standard, blocking privileged containers by default.
Network Policies
Canal/Calico enforces network policy rules defining which pods can communicate.
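Network policies are ordinary Kubernetes objects that Canal/Calico enforces. A minimal sketch that denies all inbound traffic to a namespace (the namespace name here is illustrative):

```shell
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: demo          # assumption: an existing namespace
spec:
  podSelector: {}          # empty selector = every pod in the namespace
  policyTypes:
  - Ingress                # Ingress listed with no rules = deny all inbound
EOF
```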
Audit Logging
API server audit logs record who did what and when. Found at /var/lib/rancher/rke2/server/logs/.
FIPS 140-2 Cryptography
All TLS uses FIPS-validated BoringCrypto — required by government agencies.
Minimal Attack Surface
No Docker daemon. Container images scanned with Trivy. Only essential components run.
What You Must Do Manually
# 1. Enable the CIS profile in config
profile: cis
# 2. Set kernel parameters (on EVERY node)
sudo tee /etc/sysctl.d/90-rke2-cis.conf <<EOF
vm.panic_on_oom=0
vm.overcommit_memory=1
kernel.panic=10
kernel.panic_on_oops=1
EOF
sudo sysctl --system
# 3. Create the etcd user (required for CIS profile)
sudo useradd -r -c "etcd user" -s /sbin/nologin -M etcd -U
# 4. Protect the etcd data directory
sudo chmod 700 /var/lib/rancher/rke2/server/db
Configuration Reference
# /etc/rancher/rke2/config.yaml — Complete Reference
# ── Cluster Identity ──
node-name: prod-server-01 # Friendly node name
token: my-super-secret-token # Shared secret for joining
server: https://lb.example.com:9345 # Only for joining nodes
# ── TLS ──
tls-san:
- lb.example.com
- 10.0.0.100
# ── Security ──
profile: cis
secrets-encryption: true
write-kubeconfig-mode: "0600"
# ── Networking ──
cni: canal
cluster-cidr: 10.42.0.0/16
service-cidr: 10.43.0.0/16
cluster-dns: 10.43.0.10
# ── etcd ──
etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-snapshot-retention: 10
# ── Container Runtime ──
system-default-registry: registry.example.com
# ── Node Labels & Taints ──
node-label:
- environment=production
node-taint:
- dedicated=control-plane:NoSchedule
# ── Disable Components ──
disable:
- rke2-ingress-nginx
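After the service (re)starts with this config, you can verify that the node-level settings actually landed:

```shell
# Confirm the labels and taints from config.yaml were applied
kubectl get node prod-server-01 --show-labels
kubectl describe node prod-server-01 | grep -A1 Taints
```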
File System Layout
| Path | Purpose |
|---|---|
| /etc/rancher/rke2/config.yaml | Main configuration file |
| /etc/rancher/rke2/rke2.yaml | Kubeconfig for kubectl |
| /var/lib/rancher/rke2/bin/ | Binaries: kubectl, crictl, ctr |
| /var/lib/rancher/rke2/server/ | Server data: etcd, manifests, TLS certs |
| /var/lib/rancher/rke2/server/node-token | Token for joining new nodes |
| /var/lib/rancher/rke2/server/db/ | etcd database files |
| /var/lib/rancher/rke2/server/manifests/ | Auto-deploy manifests |
Day-2 Operations: Upgrades, Backups & Monitoring
Upgrading RKE2
Always upgrade server nodes first, one at a time, then agent nodes. Never skip more than one minor version.
# Method 1: Automated with the install script
curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION="v1.31.2+rke2r1" sudo sh -
sudo systemctl restart rke2-server
# Method 2: System Upgrade Controller (recommended for HA)
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml
etcd Backups
# Take an on-demand snapshot (requires root on a server node)
sudo rke2 etcd-snapshot save --name my-backup
# List existing snapshots
sudo rke2 etcd-snapshot ls
Name Size Created
my-backup 3.2 MB 2026-03-17T10:30:00Z
# Restore from a snapshot (disaster recovery)
sudo systemctl stop rke2-server
sudo rke2 server --cluster-reset \
--cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/my-backup
sudo systemctl start rke2-server
Monitoring
# Quick health checks
kubectl get nodes # All nodes Ready?
kubectl get pods -A # All pods Running?
kubectl top nodes # CPU/Memory usage
kubectl top pods -A # Pod resource usage
# Check etcd health — etcdctl is not shipped on the host, so run it inside the etcd pod
kubectl -n kube-system exec etcd-my-first-server -- etcdctl \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint health
Troubleshooting Cheat Sheet
| Problem | Diagnostic Command | Common Fix |
|---|---|---|
| Server won't start | journalctl -u rke2-server -f | Check port conflicts, firewall, disk space |
| Agent can't join | journalctl -u rke2-agent -f | Verify token, server URL, port 9345 |
| Node "NotReady" | kubectl describe node <name> | Check CNI pods, disk/memory pressure |
| Pods stuck "Pending" | kubectl describe pod <name> | Not enough resources, taints |
| Pods can't connect | kubectl get pods -n kube-system | CNI pods running? UDP 8472 open? |
| kubectl not working | echo $KUBECONFIG | Set KUBECONFIG path |
| etcd unhealthy | etcdctl endpoint health | Check disk I/O, restore snapshot |
Essential Commands
# ── Service Logs ──
journalctl -u rke2-server -f
journalctl -u rke2-agent -f
# ── Container Runtime ──
sudo /var/lib/rancher/rke2/bin/crictl ps
sudo /var/lib/rancher/rke2/bin/crictl logs <id>
# ── Network Debug ──
kubectl run debug --image=busybox --rm -it -- sh
# ── Full Uninstall ──
sudo /usr/local/bin/rke2-uninstall.sh # Server
sudo /usr/local/bin/rke2-agent-uninstall.sh # Agent
Real-World Example: Deploy an App End-to-End
A simple Nginx web app with 3 replicas, exposed via a Service and Ingress so the outside world can reach it at myapp.example.com.
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-web-app
  template:
    metadata:
      labels:
        app: my-web-app
    spec:
      containers:
      - name: nginx
        image: nginx:1.25-alpine
        ports:
        - containerPort: 80
        resources:
          requests: { cpu: 100m, memory: 64Mi }
          limits: { cpu: 200m, memory: 128Mi }
EOF
kubectl get deploy my-web-app
NAME READY UP-TO-DATE AVAILABLE
my-web-app 3/3 3 3
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: my-web-app-svc
spec:
  selector:
    app: my-web-app
  ports:
  - port: 80
    targetPort: 80
  type: ClusterIP
EOF
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-web-app-ingress
spec:
  ingressClassName: nginx
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-web-app-svc
            port:
              number: 80
EOF
# Internal test (throwaway pod, removed on exit)
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
  curl http://my-web-app-svc.default.svc.cluster.local
# External test
curl http://myapp.example.com
# Scale up!
kubectl scale deployment my-web-app --replicas=5
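To watch the scale-out complete and see how the scheduler spread the new pods:

```shell
# Wait for all 5 replicas to become available
kubectl rollout status deployment/my-web-app
# See which nodes the pods landed on
kubectl get pods -l app=my-web-app -o wide
```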
The Full Request Flow
Putting it all together, a request to http://myapp.example.com travels: client → DNS resolves to a node IP → NGINX Ingress Controller matches the host and path → Service my-web-app-svc (ClusterIP) → kube-proxy rules pick a backend → one of the nginx pods serves the response.
Summary & Where to Go Next
What You've Learned
- What RKE2 is and why it exists (security-first Kubernetes)
- The full architecture: server nodes, agent nodes, and all core components
- How RKE2 differs from K3s and RKE1
- How to install a single-node cluster from scratch
- How to add worker nodes and set up HA with 3 servers
- How networking works (CNI: Canal, Calico, Cilium)
- Security features and CIS benchmark hardening
- Day-2 ops: upgrades, etcd backups, and monitoring
- How to deploy a real application end-to-end
Recommended Next Steps
Official Docs
Read the full docs at docs.rke2.io — the Configuration and Advanced sections.
Rancher Manager
Install Rancher on top of RKE2 for a beautiful web UI to manage clusters and workloads.
Longhorn Storage
Add distributed storage with Longhorn — Rancher's cloud-native storage solution.
cert-manager
Automate TLS certificate management with Let's Encrypt for your Ingress resources.
You've gone from zero knowledge to understanding RKE2's architecture, security model, installation process, and day-to-day operations. Go build something amazing.