Rancher custom cluster creation: node registration fails

Environment:
Server OS: Rocky Linux 9.6 (Blue Onyx)
Rancher version: v2.12.3
RKE2 version: rke2 version v1.33.5+rke2r1

Rancher deployment procedure

[root@rancher-k8s-node-130 ~]# rke2 --version
rke2 version v1.33.5+rke2r1 (d1092839cf08cb901b1d40461b0fa6e7ae6f8fc4)
go version go1.24.6 X:boringcrypto

Node pre-deployment preparation commands

  1. Run the following commands on every node
sudo dnf update -y     # update packages
sudo dnf install -y curl wget git vim jq firewalld openssl chrony tar  # install tools

  2. Kernel parameter tuning: create the k8s sysctl settings

cat <<EOF | sudo tee /etc/sysctl.d/99-k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
vm.swappiness = 0
EOF
sudo sysctl --system
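
One caveat (my assumption, since a stock Rocky Linux 9 install may not load it at boot): the net.bridge.* keys above only exist while the br_netfilter kernel module is loaded, so sysctl --system will otherwise complain that they cannot be found. A minimal sketch:

# Load br_netfilter now and on every boot, then re-apply the sysctl settings
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
br_netfilter
EOF
sudo modprobe br_netfilter
sudo sysctl --system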
  3. Disable swap and set SELinux to permissive
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
sudo setenforce 0
sudo sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config
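
A quick check that both changes took effect:

swapon --show   # no output means swap is fully off
getenforce      # should print Permissive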
  4. Hostname and hosts resolution
    Set the hostname on every node
sudo hostnamectl set-hostname rancher-k8s-node-130
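
The hosts-resolution half of this step is not shown above; here is a hedged sketch using the addresses that appear later in this post (the .130 address is my assumption based on the hostname; adjust to your real IPs):

# Assumed addresses (hypothetical: .130 inferred from the hostname; .139 is the MetalLB VIP)
cat <<EOF | sudo tee -a /etc/hosts
192.168.1.130 rancher-k8s-node-130
192.168.1.131 common-cluster-131
192.168.1.139 rancher.sweetnight.com.cn
EOF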
  5. Disable the firewall
sudo systemctl stop firewalld  # stop the firewall service
  6. Sync system time (to avoid certificate validity problems)
systemctl start chronyd && systemctl enable chronyd
timedatectl # verify time sync status

K8s image registry mirror configuration

vi /etc/rancher/rke2/registries.yaml
mirrors:
  "docker.io":
    endpoint:
      - "https://docker.1ms.run"

K8s cluster configuration

vi /etc/rancher/rke2/config.yaml
token: G7gd7x1KFDAaaADFY52  # custom cluster token
tls-san:
  - "rancher.sweetnight.com.cn"  # Rancher access domain
cni: "calico"                    # Calico network plugin
etcd-snapshot-schedule-cron: "0 */6 * * *" # etcd snapshot every 6 hours
etcd-snapshot-retention: 24      # keep 24 snapshots
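
A small hedged prerequisite: on a fresh node the /etc/rancher/rke2 directory does not exist until RKE2 is installed, so create it before writing the two files above:

sudo mkdir -p /etc/rancher/rke2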

Deploy rke2-server

curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_TYPE="server" sh -
sudo systemctl start rke2-server.service
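
Once the service is up, a quick verification sketch (the kubeconfig and bundled CLI tools live in RKE2's standard locations):

sudo systemctl enable rke2-server.service   # survive reboots
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
export PATH=$PATH:/var/lib/rancher/rke2/bin
kubectl get nodes   # the node should reach Ready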

Deploy cert-manager

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --version v1.15.3 \
  --set crds.enabled=true
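
The install above assumes the jetstack Helm repo has already been added; if not, a minimal sketch, plus a sanity check:

helm repo add jetstack https://charts.jetstack.io
helm repo update
kubectl get pods -n cert-manager   # cert-manager, cainjector and webhook should all be Running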

Deploy Nginx Ingress with MetalLB (the plan is to use MetalLB for load balancing)

Install MetalLB v0.15.2 (using the official CRD deployment manifest); pick the release that matches your Kubernetes version, which here is v1.33.5+rke2r1.

kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.15.2/config/manifests/metallb-native.yaml
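
The IPAddressPool below is validated by MetalLB's admission webhook, so wait for the controller to come up before applying it:

kubectl -n metallb-system wait --for=condition=Available deploy/controller --timeout=120s
kubectl -n metallb-system get pods   # controller and speaker pods should be Running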

Create the MetalLB IP pool; a fixed IP is used here

cat <<EOF | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: single-fixed-ip-pool  # name of the single-IP pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.139/32  # only the fixed IP you want (/32 means a single address)
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: single-ip-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
    - single-fixed-ip-pool  # bind to the single-IP pool
EOF

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: rke2-ingress-nginx-controller
  namespace: kube-system
  annotations:
    metallb.io/static-ip: 192.168.1.139  # your fixed IP (already in the MetalLB pool)
spec:
  type: LoadBalancer  # handled by MetalLB
  ports:
  - name: http
    port: 80
    targetPort: 80
    protocol: TCP
  - name: https
    port: 443
    targetPort: 443
    protocol: TCP
  selector:
    # key: must match the labels on the Ingress Controller pods (as queried earlier)
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: rke2-ingress-nginx
    app.kubernetes.io/name: rke2-ingress-nginx
EOF
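
A quick check that MetalLB actually handed out the address:

kubectl -n kube-system get svc rke2-ingress-nginx-controller
# EXTERNAL-IP should show 192.168.1.139; <pending> usually means the pool or webhook is not ready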

Create an ingressclass.yaml so Rancher can bind to it later

cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: rke2-ingress-nginx  # must exactly match the ingress.ingressClassName passed to Rancher
spec:
  controller: k8s.io/ingress-nginx  # key: matches the default controller identifier of ingress-nginx
EOF

Deploy Rancher v2.12.3

helm upgrade --install rancher rancher-stable/rancher \
--namespace cattle-system \
--version 2.12.3 \
--set hostname=rancher.sweetnight.com.cn \
--set ingress.ingressClassName=rke2-ingress-nginx \
--set replicas=1 \
--set bootstrapPassword=ZHzvanB8YKfQM7pwzYyw \
--set global.cattle.psp.enabled=false \
--set ingress.tls.source=rancher
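
This assumes the rancher-stable repo and the cattle-system namespace already exist; if not, a minimal sketch, then watch the rollout:

helm repo add rancher-stable https://releases.rancher.com/server-charts/stable
helm repo update
kubectl create namespace cattle-system
kubectl -n cattle-system rollout status deploy/rancher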

Create a new custom cluster in Rancher
Kubernetes version: v1.33.5+rke2r1
Container network: calico
Image registry: https://docker.1ms.run
Networking: cluster CIDR (10.42.0.0/16), service CIDR (10.42.0.0/16), cluster DNS 10.42.0.10
Everything else was left at the defaults.

Add a new node in Rancher
On the new node, I first ran all of the node pre-deployment preparation commands above.

curl --insecure -fL https://rancher.sweetnight.com.cn/system-agent-install.sh | sudo sh -s - \
  --server https://rancher.sweetnight.com.cn \
  --label 'cattle.io/os=linux' \
  --token x6fvgmtjvtbsfqmcf85pzmgqxktxsp9pp8fr28j8zv22rxcbkhmk27 \
  --ca-checksum d2ec7967668eb6ef67dd0ee49ae84ad1580838df571bc90ce78a489519ebeba5 \
  --etcd \
  --controlplane \
  --address 192.168.1.131 \
  --internal-address 192.168.1.131 \
  --node-name common-cluster-131

[INFO]  Label: --cattle.io/os=linux
[INFO]  Role requested: etcd
[INFO]  Role requested: controlplane
[INFO]  CA strict verification is set to true
[INFO]  Using default agent configuration directory /etc/rancher/agent
[INFO]  Using default agent var directory /var/lib/rancher/agent
[INFO]  Successfully downloaded CA certificate
[INFO]  Value from https://rancher.sweetnight.com.cn/cacerts is an x509 certificate
[INFO]  Successfully tested Rancher connection
[INFO]  Downloading rancher-system-agent binary from https://rancher.sweetnight.com.cn/assets/rancher-system-agent-amd64
[INFO]  Successfully downloaded the rancher-system-agent binary.
[INFO]  Downloading rancher-system-agent-uninstall.sh script from https://rancher.sweetnight.com.cn/assets/system-agent-uninstall.sh
[INFO]  Successfully downloaded the rancher-system-agent-uninstall.sh script.
[INFO]  Generating Cattle ID
[INFO]  Successfully downloaded Rancher connection information
[INFO]  systemd: Creating service file
[INFO]  Creating environment file /etc/systemd/system/rancher-system-agent.env
[INFO]  Enabling rancher-system-agent.service
Created symlink /etc/systemd/system/multi-user.target.wants/rancher-system-agent.service → /etc/systemd/system/rancher-system-agent.service.
[INFO]  Starting/restarting rancher-system-agent.service

Checked the logs; no new entries were ever produced

[root@common-node-131 tmp]# journalctl -u rancher-system-agent -f
Nov 13 11:11:43 common-node-131 systemd[1]: Started Rancher System Agent.
Nov 13 11:11:43 common-node-131 rancher-system-agent[54303]: time="2025-11-13T11:11:43+08:00" level=info msg="Rancher System Agent version v0.3.13 (5a64be2) is starting"
Nov 13 11:11:43 common-node-131 rancher-system-agent[54303]: time="2025-11-13T11:11:43+08:00" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Nov 13 11:11:43 common-node-131 rancher-system-agent[54303]: time="2025-11-13T11:11:43+08:00" level=info msg="Starting remote watch of plans"
Nov 13 11:11:43 common-node-131 rancher-system-agent[54303]: time="2025-11-13T11:11:43+08:00" level=info msg="Starting /v1, Kind=Secret controller"

I waited a long time and registration never succeeded.
Could someone take a look: did I miss a step, or is something wrong with my environment? Many thanks.

Is the custom cluster you created an RKE2 cluster? If so, you can follow RKE2 commands to confirm whether containerd and the other components started correctly.

Yes, it is a custom RKE2 cluster.
The RKE2 commands page you linked says to first manually run:
curl -sL https://get.rke2.io | sh
systemctl daemon-reload
systemctl start rke2-server

Does that service have to be deployed first before the node can register with this command?
curl --insecure -fL https://rancher.sweetnight.com.cn/system-agent-install.sh | sudo sh -s - \
  --server https://rancher.sweetnight.com.cn \
  --label 'cattle.io/os=linux' \
  --token x6fvgmtjvtbsfqmcf85pzmgqxktxsp9pp8fr28j8zv22rxcbkhmk27 \
  --ca-checksum d2ec7967668eb6ef67dd0ee49ae84ad1580838df571bc90ce78a489519ebeba5 \
  --etcd \
  --controlplane \
  --address 192.168.1.131 \
  --internal-address 192.168.1.131 \
  --node-name common-cluster-131

Right now the problem seems to be that the rke2-server service was never installed on the new node:
[root@common-node-131 tmp]# systemctl status rke2-server
Unit rke2-server.service could not be found.
[root@common-node-131 tmp]#

I assumed that running the node registration command would install rke2-server automatically.

Yes, the rke2 service is deployed automatically, so this means the process has not reached the rke2 install step yet. Check whether the crictl commands have been generated; if crictl is not available either, and rancher-system-agent produces no logs, then the only option is to watch the Rancher server logs and see which step the installation has reached.
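
A hedged sketch of those checks (RKE2's crictl binary and containerd socket live under these standard paths once the agent reaches the install step):

# Hedged checks: these paths exist only after rancher-system-agent reaches the RKE2 install step
ls /var/lib/rancher/rke2/bin/ 2>/dev/null || echo "rke2 binaries not installed yet"
/var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock ps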

For the Rancher server logs, do you mean mainly the rancher pod's logs?

[root@rancher-k8s-node-130 ~]# kubectl logs rancher-86984c6d89-nxhrf -n cattle-system | grep ERROR

It is reporting the errors below. What else should I look into?

2025/11/12 16:53:51 [ERROR] Error during subscribe websocket: close sent
2025/11/12 17:22:13 [ERROR] watcher channel closed
2025/11/12 18:01:38 [ERROR] watcher channel closed
2025/11/12 18:32:53 [ERROR] watcher channel closed
2025/11/12 19:14:56 [ERROR] watcher channel closed
2025/11/12 19:30:37 [ERROR] Error during subscribe write tcp 10.42.30.102:80->10.42.30.76:33030: write: broken pipe
2025/11/12 19:54:54 [ERROR] watcher channel closed
2025/11/12 20:45:09 [ERROR] watcher channel closed
2025/11/12 21:43:10 [ERROR] watcher channel closed
2025/11/12 22:38:06 [ERROR] watcher channel closed
2025/11/12 23:24:51 [ERROR] watcher channel closed
2025/11/13 00:06:08 [ERROR] watcher channel closed
2025/11/13 00:45:15 [ERROR] watcher channel closed
2025/11/13 01:25:47 [ERROR] watcher channel closed
2025/11/13 01:28:39 [ERROR] Error during subscribe websocket: close sent
2025/11/13 01:38:35 [ERROR] error in transform: failed to find status.conditions block in cluster c-m-5zs4sn25
2025/11/13 01:38:35 [ERROR] [rkecluster] fleet-default/common-cluster: error getting CAPI cluster no matching controller owner ref
2025/11/13 01:38:35 [ERROR] error syncing 'fleet-default/common-cluster': handler rke-cluster: no matching controller owner ref, requeuing
2025/11/13 01:38:35 [ERROR] [rkecluster] fleet-default/common-cluster: error getting CAPI cluster no matching controller owner ref
2025/11/13 01:38:35 [ERROR] error syncing 'fleet-default/common-cluster': handler rke-cluster: no matching controller owner ref, requeuing
2025/11/13 01:38:35 [ERROR] [rkecluster] fleet-default/common-cluster: error getting CAPI cluster no matching controller owner ref
2025/11/13 01:38:35 [ERROR] error syncing 'fleet-default/common-cluster': handler rke-cluster: no matching controller owner ref, requeuing
2025/11/13 01:38:35 [ERROR] error in transform: failed to find status.conditions block in cluster c-m-5zs4sn25
2025/11/13 01:38:35 [ERROR] [rkecluster] fleet-default/common-cluster: error getting CAPI cluster no matching controller owner ref
2025/11/13 01:38:35 [ERROR] error syncing 'fleet-default/common-cluster': handler rke-cluster: no matching controller owner ref, requeuing
2025/11/13 01:38:35 [ERROR] [rkecluster] fleet-default/common-cluster: error getting CAPI cluster no matching controller owner ref
2025/11/13 01:38:35 [ERROR] error syncing 'fleet-default/common-cluster': handler rke-cluster: no matching controller owner ref, requeuing
2025/11/13 01:38:35 [ERROR] [planner] rkecluster fleet-default/common-cluster: error during plan processing: no matching controller owner ref
2025/11/13 01:38:36 [ERROR] defaultSvcAccountHandler: Sync: error handling default ServiceAccount of namespace key=c-m-5zs4sn25, err=Operation cannot be fulfilled on namespaces "c-m-5zs4sn25": the object has been modified; please apply your changes to the latest version and try again
2025/11/13 01:38:36 [ERROR] defaultSvcAccountHandler: Sync: error handling default ServiceAccount of namespace key=c-m-5zs4sn25, err=Operation cannot be fulfilled on namespaces "c-m-5zs4sn25": the object has been modified; please apply your changes to the latest version and try again
2025/11/13 01:38:36 [ERROR] defaultSvcAccountHandler: Sync: error handling default ServiceAccount of namespace key=c-m-5zs4sn25-p-dbls8, err=Operation cannot be fulfilled on namespaces "c-m-5zs4sn25-p-dbls8": the object has been modified; please apply your changes to the latest version and try again
2025/11/13 01:38:36 [ERROR] defaultSvcAccountHandler: Sync: error handling default ServiceAccount of namespace key=c-m-5zs4sn25, err=Operation cannot be fulfilled on namespaces "c-m-5zs4sn25": the object has been modified; please apply your changes to the latest version and try again
2025/11/13 01:38:36 [ERROR] defaultSvcAccountHandler: Sync: error handling default ServiceAccount of namespace key=c-m-5zs4sn25-p-k4hxt, err=Operation cannot be fulfilled on namespaces "c-m-5zs4sn25-p-k4hxt": the object has been modified; please apply your changes to the latest version and try again
2025/11/13 01:38:36 [ERROR] defaultSvcAccountHandler: Sync: error handling default ServiceAccount of namespace key=c-m-5zs4sn25-p-k4hxt, err=Operation cannot be fulfilled on namespaces "c-m-5zs4sn25-p-k4hxt": the object has been modified; please apply your changes to the latest version and try again
2025/11/13 01:38:36 [ERROR] defaultSvcAccountHandler: Sync: error handling default ServiceAccount of namespace key=c-m-5zs4sn25-p-k4hxt, err=Operation cannot be fulfilled on namespaces "c-m-5zs4sn25-p-k4hxt": the object has been modified; please apply your changes to the latest version and try again
2025/11/13 01:38:36 [ERROR] unable to update cluster c-m-5zs4sn25 with sync annotation, grs will re-enqueue on change: Operation cannot be fulfilled on clusters.management.cattle.io "c-m-5zs4sn25": the object has been modified; please apply your changes to the latest version and try again
2025/11/13 02:20:12 [ERROR] watcher channel closed

[root@common-node-131 lib]# cd /var/lib/rancher/
[root@common-node-131 rancher]# ll
total 0
drwx------. 3 root root 60 Nov 13 11:11 agent
[root@common-node-131 rancher]# cd agent/
[root@common-node-131 agent]# ll
total 4
drwx------. 2 root root 6 Nov 13 11:11 interlock
-rw-------. 1 root root 2317 Nov 13 11:11 rancher2_connection_info.json
[root@common-node-131 agent]#

Right now the /var/lib/rancher/ directory on the new node has essentially nothing in it.

The problem is solved. There were two operational mistakes:

  1. When adding the new node, the worker role must be selected as well; all three roles need to be checked.
  2. When entering the registry mirror, https:// must not be included; write only docker.1ms.run (see the sketch after this list).
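
A hedged sketch of point 2 (my assumption: this is the same mirror mapping as the registries.yaml shown earlier, just with the scheme dropped for the Rancher registry field):

# Hedged sketch: same mirror mapping as the earlier registries.yaml, scheme dropped
mirrors:
  "docker.io":
    endpoint:
      - "docker.1ms.run"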

Thanks for the support.

You do not need to check all 3 roles at the same time; installation can only start once the cluster being added contains all 3 roles. A few examples:
Example 1:
node1: etcd, controlplane
node2: worker

Example 2:
node1: etcd
node2: controlplane
node3: worker

Either of these layouts will let the installation proceed.