Rancher Server 2.5.12: cannot connect to the cluster after replacing the certificate

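Rancher Server setup

  • Rancher version: 2.5.12
  • Installation option (Docker install/Helm Chart): Docker install

Problem description:
After replacing the certificate on Rancher Server, the cluster can no longer be reached, and the cluster list shows this error for it: Cluster must have at least one etcd plane host: failed to connect to the following etcd host(s) [172.16.60.17]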

Screenshots:

Other context:

Logs

2022/03/23 02:38:30 [ERROR] cluster [c-xq6v8] provisioning: Failed to set up SSH tunneling for host [172.16.60.17]: Can't retrieve Docker Info: error during connect: Get "http://%!F(MISSING)var%!F(MISSING)run%!F(MISSING)docker.sock/v1.24/info": can not build dialer to [c-xq6v8:m-84e8ff4e25f7]
2022/03/23 02:38:30 [ERROR] cluster [c-xq6v8] provisioning: Removing host [172.16.60.17] from node lists
2022/03/23 02:38:45 [ERROR] error syncing 'c-xq6v8': handler cluster-deploy: Get "https://172.16.60.17:6443/apis/apps/v1/namespaces/cattle-system/daemonsets/cattle-node-agent": waiting for cluster [c-xq6v8] agent to connect, handler cluster-provisioner-controller: Cluster must have at least one etcd plane host: failed to connect to the following etcd host(s) [172.16.60.17], requeuing
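
When the Rancher server log is long, it can help to keep only the error lines for the affected cluster. A minimal sketch, assuming the server container is named rancher (as in the docker commands in this thread); the function name cluster_errors is made up for illustration:

```shell
#!/bin/sh
# Keep only [ERROR] lines that mention the given cluster ID (read from stdin).
# grep -F treats the brackets literally instead of as a character class.
cluster_errors() {
    grep -F '[ERROR]' | grep -F "[$1]"
}

# Example (not executed here), assuming the container is named "rancher":
#   docker logs --since 1h rancher 2>&1 | cluster_errors c-xq6v8
```

The filter matches the cluster ID in square brackets, which is how it appears in the log lines above.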

Please list the detailed steps you followed to update the certificate.

  1. Enter the Rancher server container and run the following commands
docker exec -it rancher /bin/sh
kubectl --insecure-skip-tls-verify -n kube-system delete secrets k3s-serving
kubectl --insecure-skip-tls-verify delete secret serving-cert -n cattle-system
rm -f /var/lib/rancher/k3s/server/tls/dynamic-cert.json
  2. Send a request to refresh the settings
curl --insecure -sfL https://localhost:8443/v3
  3. Restart the Rancher server container
docker restart rancher

I then found that the webhook was reporting errors, so I followed the official documentation:

Rotation of Expired Webhook Certificates

For Rancher versions that have rancher-webhook installed, these certificates will expire after one year. It will be necessary for you to rotate your webhook certificate when this occurs.

Rancher will advise the community once there is a permanent solution in place for this known issue. Currently, there are two methods to work around this issue:

1. Users with cluster access, run the following commands:

kubectl delete secret -n cattle-system cattle-webhook-tls
kubectl delete mutatingwebhookconfigurations.admissionregistration.k8s.io --ignore-not-found=true rancher.cattle.io
kubectl delete pod -n cattle-system -l app=rancher-webhook
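
Before or after running the commands above, the expiry date of the webhook certificate can be read from the secret to confirm whether it has expired or been regenerated. A minimal sketch, assuming the certificate is stored under the tls.crt key of cattle-webhook-tls and that kubectl, base64, and openssl are available where you run it; both function names are made up for illustration:

```shell
#!/bin/sh
# Print the notAfter date of a PEM certificate read from stdin.
cert_enddate() {
    openssl x509 -noout -enddate
}

# Fetch the webhook certificate from the secret and show its expiry.
# Assumption: the secret stores the certificate under the key "tls.crt".
webhook_cert_enddate() {
    kubectl get secret -n cattle-system cattle-webhook-tls \
        -o jsonpath='{.data.tls\.crt}' | base64 -d | cert_enddate
}

# Example (not executed here): webhook_cert_enddate
```

If the secret does not exist yet (for example right after deleting it), kubectl reports an error and no date is printed.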

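Try replacing localhost in this command (the curl request) with the corresponding IP.

Done; the same error still occurs.

Then troubleshoot along the lines of what the log says: find out why etcd on that node cannot be reached. The etcd logs are a good place to look.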
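
One way to look at the etcd logs is sketched below. It assumes an RKE-provisioned node where etcd runs as a Docker container named etcd, and that you run it on the node itself (172.16.60.17 here); the function name recent_logs is made up for illustration:

```shell
#!/bin/sh
# Show the last N log lines (default 100) of a container, stderr included.
recent_logs() {
    docker logs --tail "${2:-100}" "$1" 2>&1
}

# Example (not executed here), run on the affected node:
#   recent_logs etcd 200
#   docker ps -a --filter name=etcd   # check whether the container is running at all
```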

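It looks like it was related to the port number. After I changed it to 443, the port actually used to access Rancher, it worked.

A question: does running curl --insecure -sfL https://localhost:8443/v3 produce any output? Should it be run locally on the Linux host, or inside the Rancher container?

You can run this command on any machine, including a Linux host or your own PC, but localhost must be replaced with the access address of your Rancher server.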
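
Putting the last few replies together, the refresh request might look like the sketch below. The host rancher.example.com is a placeholder, the default port 443 reflects what worked for the original poster, and both function names are made up for illustration:

```shell
#!/bin/sh
# Build the /v3 URL for a given Rancher address and port.
rancher_v3_url() {
    host="$1"
    port="${2:-443}"   # defaults to 443, the port that worked in this thread
    printf 'https://%s:%s/v3\n' "$host" "$port"
}

# Send the refresh request. --insecure skips TLS verification,
# as in the original command.
refresh_rancher_cert() {
    curl --insecure -sfL "$(rancher_v3_url "$1" "$2")"
}

# Example (not executed here): refresh_rancher_cert rancher.example.com 443
```

Because of the -s and -f flags, curl prints nothing on failure, so check its exit status if the command appears to do nothing.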