After updating the Server SSL certificate on Rancher 2.6.8, the downstream RKE cluster shows "Cluster agent is not connected"

Rancher Server setup

  • Rancher version: 2.6.8
  • Installation option (Docker install/Helm Chart): Docker

Downstream cluster information

  • Kubernetes version: 1.23.10
  • Cluster Type (Local/Downstream): Downstream
    • If Downstream, what type of cluster? (Custom/Imported or Hosted, etc.): Custom RKE

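User information

  • What is the role of the logged-in user? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom):
    • If Custom, the custom permission set:

Host operating system: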
CentOS 7.9

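Problem description:
After updating the SSL certificate and CA certificate chain on the Rancher 2.6.8 Docker server, I followed this gist (https://gist.github.com/superseb/076f20146e012f1d4e289f5bd1bd4971) to recreate the cluster-agent and node-agent in the cattle-system namespace.

Commands run on the control-plane node: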
kubectl -n cattle-system delete daemonset.apps/cattle-node-agent deployment.apps/cattle-cluster-agent
curl --insecure -sfL https://mydomain.net/v3/import/mycode_c-sp9kw.yaml | kubectl apply -f -

However, in the Rancher UI the downstream RKE cluster still shows [Disconnected] Cluster agent is not connected.
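
One way to confirm that the re-applied manifest actually produced fresh agent workloads is sketched below. It assumes kubectl on the control-plane node is pointed at the downstream cluster; `check_agents` is a hypothetical helper name, and the 120-second timeout is an arbitrary choice.

```shell
#!/bin/sh
# Sketch: verify that the re-created Rancher agents rolled out in cattle-system.
# Assumes kubectl is configured for the downstream cluster.
# check_agents is an illustrative helper, not part of Rancher.
check_agents() {
  ns="cattle-system"
  # Wait for the cluster agent Deployment to finish rolling out.
  kubectl -n "$ns" rollout status deployment/cattle-cluster-agent --timeout=120s || return 1
  # Wait for the node agent DaemonSet to finish rolling out.
  kubectl -n "$ns" rollout status daemonset/cattle-node-agent --timeout=120s || return 1
  # Show where the agent pods landed and how old they are.
  kubectl -n "$ns" get pods -o wide
}
```

Calling `check_agents` after the `kubectl apply` step should show pods whose age matches the time of the re-apply. If the pods are older than that, the manifest was probably not applied to this cluster.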

Steps to reproduce:

Result:

Expected result:

Screenshots:

Additional context:

Logs

The cattle-cluster-agent pod is running normally, with no errors in its log:
time="2023-01-07T17:11:16Z" level=info msg="Listening on /tmp/log.sock"
time="2023-01-07T17:11:16Z" level=info msg="Rancher agent version v2.6.8 is starting"
time="2023-01-07T17:11:16Z" level=info msg="Connecting to wss://mydomain.net/v3/connect/register with token starting with nfg78vndnq9zvh5cqsm4tvrffht"
time="2023-01-07T17:11:16Z" level=info msg="Connecting to proxy" url="wss://mydomain.net/v3/connect/register"

The node-agent pod is running normally, with no errors in its log:
level=info msg="Connecting to wss://mydomain.net/v3/connect with token starting with nfg78vndnq9zvh5cqsm4tvrffht"
level=info msg="Connecting to proxy" url="wss://mydomain.net/v3/connect"
level=info msg="Starting plan monitor, checking every 120 seconds"

Rancher Server has some error logs (roughly three kinds of errors, which are probably unrelated to this update):
[ERROR] error syncing 'c-sp9kw/p-2kqvq': handler system-image-upgrade-controller: upgrade cluster c-sp9kw system service alerting failed: template system-library-rancher-monitoring incompatible with rancher version or cluster's [c-sp9kw] kubernetes version, requeuing
2023/01/07 18:22:45 [ERROR] Failed to handle tunnel request from remote address 10.231.227.113:25364: response 400: cluster not found
2023/01/07 18:22:45 [ERROR] Failed to handle tunnel request from remote address 10.231.227.113:25366: response 400: cluster not found
2023/01/07 18:22:46 [ERROR] error syncing 'system-library': handler system-image-upgrade-catalog-controller: upgrade cluster local system service alerting failed: template system-library-rancher-monitoring incompatible with rancher version or cluster's [local] kubernetes version, handler system-image-upgrade-catalog-controller: upgrade cluster c-sp9kw system service alerting failed: template system-library-rancher-monitoring incompatible with rancher version or cluster's [c-sp9kw] kubernetes version, requeuing


For the certificate replacement procedure, you can refer to:

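Yes, that is exactly what I followed. The UI now shows that the downstream agent cannot register with the Server, but the agent logs show the connection succeeding.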

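The article above may not apply to 2.5. If you are on 2.6, try following https://mp.weixin.qq.com/s/qeE1LxtIgepA9nFgyKoXBA, starting from the section titled "重置下游集群配置" (Reset the downstream cluster configuration) and continuing with the steps after it.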


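Thanks. I have two more questions:
1. With a 2.5.x rancher-server, updating the SSL certificate while keeping the same domain name causes the downstream k8s clusters to lose connection, and the cattle agents have to be regenerated before they can reconnect. Is that right?
2. With a 2.6.x rancher server, however, updating the SSL certificate while keeping the same domain name did not cause the downstream k8s clusters to lose connection. Does that mean the downstream k8s agents do not need to be reinstalled on this version?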

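The agent also stores the rancher server's certificate information, so once the certificate on the rancher server is replaced, the agent needs the updated certificate as well. The simplest way to do that is to regenerate the agent.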
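
To see whether the agent still carries old certificate information, one option is to compare the CA checksum in the agent deployment with the CA that the server currently publishes. This is only a sketch: it assumes the checksum is stored in the `CATTLE_CA_CHECKSUM` environment variable of the first container of `cattle-cluster-agent`, that the server exposes its CA at `/v3/settings/cacerts`, and that `jq` and `sha256sum` are installed. The function names are hypothetical, and `RANCHER_URL` defaults to the placeholder domain used in this thread.

```shell
#!/bin/sh
# Sketch: compare the CA checksum stored in the agent with the CA the server serves now.
# Function names are illustrative; RANCHER_URL is a placeholder for your server URL.
RANCHER_URL="${RANCHER_URL:-https://mydomain.net}"

server_ca_checksum() {
  # Download the CA the server currently publishes and hash it.
  # Prints nothing if the setting is empty (no private CA configured).
  tmp=$(mktemp)
  curl --insecure -sfL "$RANCHER_URL/v3/settings/cacerts" \
    | jq -r '.value | select(length > 0)' > "$tmp"
  if [ -s "$tmp" ]; then
    sha256sum "$tmp" | awk '{print $1}'
  fi
  rm -f "$tmp"
}

agent_ca_checksum() {
  # Read the checksum from the agent Deployment's environment (assumed variable name).
  kubectl -n cattle-system get deployment cattle-cluster-agent \
    -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="CATTLE_CA_CHECKSUM")].value}'
}

compare_checksums() {
  # $1 = checksum from the server, $2 = checksum from the agent.
  # Two empty values also count as a match (no private CA on either side).
  if [ "$1" = "$2" ]; then
    echo "match"
  else
    echo "mismatch"
    return 1
  fi
}
```

Usage would be `compare_checksums "$(server_ca_checksum)" "$(agent_ca_checksum)"`. A mismatch suggests the agent was generated before the certificate change and should be regenerated.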

I don't remember the details. If, after the certificate update, the cluster state is active and neither rancher-agent nor fleet-agent logs any errors, there should be no problem.
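
The log check described above could be scripted roughly as follows. This is a sketch under assumptions: the namespace and deployment names (`cattle-system/cattle-cluster-agent` and `cattle-fleet-system/fleet-agent`) are what I believe a 2.6 downstream cluster uses, so adjust them if `kubectl get deploy -A` shows something different. The helper names and the 200-line window are illustrative.

```shell
#!/bin/sh
# Sketch: count error-level lines in recent agent logs on the downstream cluster.
# Namespace/deployment names are assumptions for Rancher 2.6; helpers are illustrative.

count_error_lines() {
  # Reads log text on stdin and prints how many lines are error or fatal level.
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -c -E 'level=(error|fatal)' || true
}

scan_agent_logs() {
  for target in cattle-system/cattle-cluster-agent cattle-fleet-system/fleet-agent; do
    ns=${target%/*}
    name=${target#*/}
    n=$(kubectl -n "$ns" logs "deployment/$name" --tail=200 2>/dev/null | count_error_lines)
    echo "$ns/$name: $n error lines in the last 200"
  done
}
```

Running `scan_agent_logs` gives one summary line per agent. A count of zero together with an active cluster state in the UI matches the "no problem" condition described above.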