Node stuck at "Waiting for probes"

Rancher Server setup

  • Rancher version: v2.7.6
  • Installation option (Docker install/Helm Chart):
    • If Helm Chart, provide the Local cluster type (RKE1, RKE2, k3s, EKS, etc.) and version: Helm Chart, RKE2 v2.16.8-rancher2
  • Online or air-gapped deployment: online

Downstream cluster information

  • Kubernetes version: v1.26.8+rke2r1

Host OS: Ubuntu 22.04.3 LTS

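Problem description:
One node has been stuck in the Reconciling state, with "Waiting for probes: kube-controller-manager, kube-scheduler" shown beneath it. The cluster was fine right after it was created; the problem appeared after it had been in use for some time.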

Screenshot:

Other context:

Logs
Apr 14 11:04:17 zh***01 rancher-system-agent[943]: time="2025-04-14T11:04:17+08:00" level=info msg="[0d8b686d47874ac5432d5eb728d17afaf017aa4b68b1f4547d7085feef971d8c_0:stderr]: + [  = true ]"
Apr 14 11:04:17 zh***01 rancher-system-agent[943]: time="2025-04-14T11:04:17+08:00" level=info msg="[0d8b686d47874ac5432d5eb728d17afaf017aa4b68b1f4547d7085feef971d8c_0:stderr]: + [ false = true ]"
Apr 14 11:04:17 zh***01 rancher-system-agent[943]: time="2025-04-14T11:04:17+08:00" level=info msg="[Applyinator] Command sh [-c run.sh] finished with err: <nil> and exit code: 0"
Apr 14 11:04:17 zh***01 rancher-system-agent[943]: time="2025-04-14T11:04:17+08:00" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20250414-110324/0d8b686d47874ac5432d5eb728d17afaf017aa4b68b1f4547d7085feef971d8c_0"
Apr 14 11:04:17 zh***01 rancher-system-agent[943]: time="2025-04-14T11:04:17+08:00" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
Apr 14 11:04:17 zh***01 rancher-system-agent[943]: time="2025-04-14T11:04:17+08:00" level=info msg="[0d8b686d47874ac5432d5eb728d17afaf017aa4b68b1f4547d7085feef971d8c_0:stdout]: Name                                    Location                                                                                 Size     Created"
Apr 14 11:04:17 zh***01 rancher-system-agent[943]: time="2025-04-14T11:04:17+08:00" level=info msg="[0d8b686d47874ac5432d5eb728d17afaf017aa4b68b1f4547d7085feef971d8c_0:stdout]: etcd-snapshot-zh***01-1744560001 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-zh***01-1744560001 85114912 2025-04-14T00:00:02+08:00"
Apr 14 11:04:17 zh***01 rancher-system-agent[943]: time="2025-04-14T11:04:17+08:00" level=info msg="[0d8b686d47874ac5432d5eb728d17afaf017aa4b68b1f4547d7085feef971d8c_0:stdout]: etcd-snapshot-zh***01-1744578002 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-zh***01-1744578002 85114912 2025-04-14T05:00:02+08:00"
Apr 14 11:04:17 zh***01 rancher-system-agent[943]: time="2025-04-14T11:04:17+08:00" level=info msg="[0d8b686d47874ac5432d5eb728d17afaf017aa4b68b1f4547d7085feef971d8c_0:stdout]: etcd-snapshot-zh***01-1744596000 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-zh***01-1744596000 85114912 2025-04-14T10:00:01+08:00"
Apr 14 11:04:17 zh***01 rancher-system-agent[943]: time="2025-04-14T11:04:17+08:00" level=info msg="[0d8b686d47874ac5432d5eb728d17afaf017aa4b68b1f4547d7085feef971d8c_0:stdout]: on-demand-zh***01-1695091299     file:///var/lib/rancher/rke2/server/db/snapshots/on-demand-zh***01-1695091299     14057504 2023-09-19T10:41:39+08:00"
Apr 14 11:04:17 zh***01 rancher-system-agent[943]: time="2025-04-14T11:04:17+08:00" level=info msg="[0d8b686d47874ac5432d5eb728d17afaf017aa4b68b1f4547d7085feef971d8c_0:stdout]: etcd-snapshot-zh***01-1744527603 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-zh***01-1744527603 85114912 2025-04-13T15:00:04+08:00"
Apr 14 11:04:17 zh***01 rancher-system-agent[943]: time="2025-04-14T11:04:17+08:00" level=info msg="[0d8b686d47874ac5432d5eb728d17afaf017aa4b68b1f4547d7085feef971d8c_0:stdout]: etcd-snapshot-zh***01-1744545602 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-zh***01-1744545602 85114912 2025-04-13T20:00:03+08:00"
Apr 14 11:04:17 zh***01 rancher-system-agent[943]: time="2025-04-14T11:04:17+08:00" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
Apr 14 11:04:18 zh***01 rancher-system-agent[943]: time="2025-04-14T11:04:18+08:00" level=info msg="[K8s] updated plan secret fleet-default/custom-da3ffc36a4da-machine-plan with feedback"
Apr 14 11:13:24 zh***01 rancher-system-agent[943]: time="2025-04-14T11:13:24+08:00" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20250414-111324/0d8b686d47874ac5432d5eb728d17afaf017aa4b68b1f4547d7085feef971d8c_0"
Apr 14 11:13:24 zh***01 rancher-system-agent[943]: time="2025-04-14T11:13:24+08:00" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
Apr 14 11:13:24 zh***01 rancher-system-agent[943]: time="2025-04-14T11:13:24+08:00" level=info msg="[0d8b686d47874ac5432d5eb728d17afaf017aa4b68b1f4547d7085feef971d8c_0:stdout]: Name                                    Location                                                                                 Size     Created"
Apr 14 11:13:24 zh***01 rancher-system-agent[943]: time="2025-04-14T11:13:24+08:00" level=info msg="[0d8b686d47874ac5432d5eb728d17afaf017aa4b68b1f4547d7085feef971d8c_0:stdout]: etcd-snapshot-zh***01-1744545602 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-zh***01-1744545602 85114912 2025-04-13T20:00:03+08:00"
Apr 14 11:13:24 zh***01 rancher-system-agent[943]: time="2025-04-14T11:13:24+08:00" level=info msg="[0d8b686d47874ac5432d5eb728d17afaf017aa4b68b1f4547d7085feef971d8c_0:stdout]: etcd-snapshot-zh***01-1744560001 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-zh***01-1744560001 85114912 2025-04-14T00:00:02+08:00"
Apr 14 11:13:24 zh***01 rancher-system-agent[943]: time="2025-04-14T11:13:24+08:00" level=info msg="[0d8b686d47874ac5432d5eb728d17afaf017aa4b68b1f4547d7085feef971d8c_0:stdout]: etcd-snapshot-zh***01-1744578002 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-zh***01-1744578002 85114912 2025-04-14T05:00:02+08:00"
Apr 14 11:13:24 zh***01 rancher-system-agent[943]: time="2025-04-14T11:13:24+08:00" level=info msg="[0d8b686d47874ac5432d5eb728d17afaf017aa4b68b1f4547d7085feef971d8c_0:stdout]: etcd-snapshot-zh***01-1744596000 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-zh***01-1744596000 85114912 2025-04-14T10:00:01+08:00"
Apr 14 11:13:24 zh***01 rancher-system-agent[943]: time="2025-04-14T11:13:24+08:00" level=info msg="[0d8b686d47874ac5432d5eb728d17afaf017aa4b68b1f4547d7085feef971d8c_0:stdout]: on-demand-zh***01-1695091299     file:///var/lib/rancher/rke2/server/db/snapshots/on-demand-zh***01-1695091299     14057504 2023-09-19T10:41:39+08:00"
Apr 14 11:13:24 zh***01 rancher-system-agent[943]: time="2025-04-14T11:13:24+08:00" level=info msg="[0d8b686d47874ac5432d5eb728d17afaf017aa4b68b1f4547d7085feef971d8c_0:stdout]: etcd-snapshot-zh***01-1744527603 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-zh***01-1744527603 85114912 2025-04-13T15:00:04+08:00"
Apr 14 11:13:24 zh***01 rancher-system-agent[943]: time="2025-04-14T11:13:24+08:00" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
Apr 14 11:13:24 zh***01 rancher-system-agent[943]: time="2025-04-14T11:13:24+08:00" level=info msg="[K8s] updated plan secret fleet-default/custom-da3ffc36a4da-machine-plan with feedback"

You could check whether the component certificates have expired. See: [BUG] Rancher managed RKE2 clusters stuck in "Waiting for probes: kube-controller-manager, kube-scheduler" · Issue #41125 · rancher/rancher · GitHub
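
For anyone who wants to run that check directly on a control-plane node, here is a minimal sketch using `openssl`. The certificate paths are an assumption based on the usual RKE2 layout under /var/lib/rancher/rke2/server/tls and are not taken from this thread; `TLS_DIR` and `check_cert` are names made up for the example.

```shell
#!/bin/sh
# Sketch: report whether the kube-controller-manager and kube-scheduler
# certificates have expired. The directory below is an ASSUMPTION about
# the RKE2 layout; override TLS_DIR if your node differs.
TLS_DIR="${TLS_DIR:-/var/lib/rancher/rke2/server/tls}"

# check_cert PATH
# Prints one status line. Returns 0 if still valid, 1 if expired
# (or unreadable by openssl), 2 if the file does not exist.
check_cert() {
    cert="$1"
    if [ ! -f "$cert" ]; then
        echo "MISSING  $cert"
        return 2
    fi
    # notAfter=... line, empty if openssl cannot parse the file
    end="$(openssl x509 -enddate -noout -in "$cert" 2>/dev/null)"
    # -checkend 0 exits non-zero when the certificate is already expired
    if openssl x509 -checkend 0 -noout -in "$cert" >/dev/null 2>&1; then
        echo "OK       $cert ($end)"
        return 0
    fi
    echo "EXPIRED  $cert ($end)"
    return 1
}

status=0
for name in kube-controller-manager kube-scheduler; do
    check_cert "$TLS_DIR/$name/$name.crt" || status=1
done
```

An EXPIRED line for either file would be consistent with the probes never going healthy; a MISSING line only means the assumed path does not match the node.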


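It was indeed an expired-certificate problem. I had already tried Rotate Certificates in the cluster management UI to refresh the certificates, but clicking it had no effect at all. It turns out you have to log in to each node and do it manually.
The node had been stuck in this state for a long time, and it's finally fixed! :pray::pray::pray: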
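
The thread does not record the exact commands that were run on the nodes. As a rough sketch only: RKE2 ships a `rke2 certificate rotate` subcommand that is meant to be run with the service stopped. Whether that alone clears the kube-controller-manager and kube-scheduler probes on this version is an assumption, so check the linked issue for the workaround that applies to your release. The `rke2-server` unit name assumes a control-plane node, and `run`, `rotate_rke2_certs` and `DRY_RUN` are names invented for the example. By default the script only prints what it would do.

```shell
#!/bin/sh
# Sketch: rotate RKE2 certificates on ONE server (control-plane) node.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to
# actually execute them as root. Run on one node at a time.
DRY_RUN="${DRY_RUN:-1}"

# run CMD [ARGS...]: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rotate_rke2_certs() {
    run systemctl stop rke2-server    # rotation expects the service to be stopped
    run rke2 certificate rotate       # regenerate the RKE2-managed certificates
    run systemctl start rke2-server   # components restart with the new certificates
}

rotate_rke2_certs
```

With the defaults this prints three lines prefixed with `+` and changes nothing on the node.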

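I ran into this a while ago as well. In my case the certificates had already expired and rotating them from the UI no longer had any effect, so I had to run the rotation on the nodes. I then also tried rotating the not-yet-expired certificates of my other clusters from the UI, and that did nothing either (I remember rotation from the UI working in the past, so I'm not sure why it suddenly stopped :thinking:). In the end I had to rotate the certificates of every cluster entirely by hand :sweat_smile: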