Rancher Server Setup
- Rancher version: 2.8.5
- Installation option (Docker install/Helm Chart): Helm Chart
- If Helm Chart, the type of the Local cluster (RKE1, RKE2, k3s, EKS, etc.) and its version: RKE2 v1.27.16+rke2r2
- Online or air-gapped deployment: Online
Downstream Cluster Information
- Kubernetes version: v1.27.16+rke2r2
- Cluster Type (Local/Downstream): Downstream
- If Downstream, what type of cluster? (Custom/Imported or Hosted, etc.): Hosted RKE2
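User Information
- What is the role of the logged-in user? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom):
- If Custom, the custom permission set:
Host OS: Ubuntu 20.04.5 LTS 5.4.0-186-generic
Problem description: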
helm upgrade rancher rancher-stable/rancher --version=2.9.1 --namespace cattle-system --set hostname=dev-rancher.test.com --set ingress.tls.source=secret --set replicas=2 --set rancherImage=docker.m.daocloud.io/rancher/rancher --set systemDefaultRegistry=docker.m.daocloud.io --debug
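After upgrading Rancher from v2.8.5 to v2.9.1 with the command above, the downstream cluster automatically uninstalled its calico component and then installed rke2-canal. (Workloads on the downstream cluster returned 504 during the upgrade; was that caused by this component replacement?)
Afterwards the UI showed all downstream clusters as disconnected. The logs of the cattle-cluster-agent pod on a downstream cluster contain a large number of errors: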
root@sysadm-test-master1:~# kubectl -n cattle-system logs cattle-cluster-agent-758c48f97f-rts7w
INFO: Environment: CATTLE_ADDRESS=10.42.1.2 CATTLE_AGENT_FALLBACK_PATH=/opt/rke2/bin CATTLE_CA_CHECKSUM= CATTLE_CLUSTER=true CATTLE_CLUSTER_AGENT_PORT=tcp://10.43.238.207:80 CATTLE_CLUSTER_AGENT_PORT_443_TCP=tcp://10.43.238.207:443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_ADDR=10.43.238.207 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PORT=443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_PORT_80_TCP=tcp://10.43.238.207:80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_ADDR=10.43.238.207 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PORT=80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_SERVICE_HOST=10.43.238.207 CATTLE_CLUSTER_AGENT_SERVICE_PORT=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTP=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTPS_INTERNAL=443 CATTLE_CLUSTER_REGISTRY=docker.m.daocloud.io CATTLE_FEATURES=embedded-cluster-api=false,fleet=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false,ui-sql-cache=false CATTLE_INGRESS_IP_DOMAIN=sslip.io CATTLE_INSTALL_UUID=b39532d4-7172-43d9-92f2-d120994e1c65 CATTLE_INTERNAL_ADDRESS= CATTLE_IS_RKE=false CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-758c48f97f-rts7w CATTLE_RANCHER_PROVISIONING_CAPI_VERSION= CATTLE_RANCHER_WEBHOOK_VERSION=104.0.1+up0.5.1 CATTLE_SERVER=https://dev-rancher.test.com CATTLE_SERVER_VERSION=v2.9.1
INFO: Using resolv.conf: search cattle-system.svc.cluster.local svc.cluster.local cluster.local nameserver 10.43.0.10 options ndots:5
INFO: https://dev-rancher.shanqu.cc/ping is accessible
INFO: dev-rancher.shanqu.cc resolves to xxx.xxx.xxx.xxx
time="2025-07-03T06:12:15Z" level=info msg="Listening on /tmp/log.sock"
time="2025-07-03T06:12:15Z" level=info msg="Rancher agent version v2.9.1 is starting"
time="2025-07-03T06:12:15Z" level=error msg="unable to read CA file from /etc/kubernetes/ssl/certs/serverca: open /etc/kubernetes/ssl/certs/serverca: no such file or directory"
time="2025-07-03T06:12:20Z" level=info msg="Connecting to wss://dev-rancher.shanqu.cc/v3/connect/register with token starting with nwfg69zf5qxqktnx9m9pm76kt7h"
time="2025-07-03T06:12:20Z" level=info msg="Connecting to proxy" url="wss://dev-rancher.shanqu.cc/v3/connect/register"
time="2025-07-03T06:12:30Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp: lookup dev-rancher.shanqu.cc: i/o timeout"
time="2025-07-03T06:12:30Z" level=error msg="Remotedialer proxy error" error="dial tcp: lookup dev-rancher.test.com: i/o timeout"
time="2025-07-03T06:12:40Z" level=info msg="Connecting to wss://dev-rancher.test.com/v3/connect/register with token starting with nwfg69zf5qxqktnx9m9pm76kt7h"
time="2025-07-03T06:12:40Z" level=info msg="Connecting to proxy" url="wss://dev-rancher.test.com/v3/connect/register"
time="2025-07-03T06:13:10Z" level=warning msg="Error while getting agent config: Get \"https://dev-rancher.test.com/v3/connect/config\": dial tcp: lookup dev-rancher.test.com on 10.43.0.10:53: read udp 10.42.1.2:40848->10.43.0.10:53: i/o timeout"
......
time="2025-07-03T06:23:06Z" level=error msg="error syncing 'rancher-charts': handler helm-clusterrepo-ensure: ensure failure: git -C /var/lib/rancher-data/local-catalogs/v2/rancher-charts/4b40cac650031b74776e87c1a726b0484d0877c3ec137da0872547ff9b73a721 fetch origin -- d565b80afc87e15060d3d74372965da29246071f error: exit status 128, detail: error: RPC failed; curl 92 HTTP/2 stream 7 was not closed cleanly: INTERNAL_ERROR (err 2)\nerror: 6004 bytes of body are still expected\nfetch-pack: unexpected disconnect while reading sideband packet\nfatal: early EOF\nfatal: fetch-pack: invalid index-pack output\n, requeuing"
time="2025-07-03T06:23:06Z" level=error msg="Failed to dial steve aggregation server: dial tcp: lookup dev-rancher.test.com: i/o timeout"
time="2025-07-03T06:23:16Z" level=error msg="Failed to dial steve aggregation server: dial tcp: lookup dev-rancher.test.com: i/o timeout"
E0703 06:23:16.888625 38 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: stale GroupVersion discovery: metrics.k8s.io/v1beta1
time="2025-07-03T06:23:16Z" level=error msg="Failed to read API for groups map[metrics.k8s.io/v1beta1:stale GroupVersion discovery: metrics.k8s.io/v1beta1]"
W0703 06:23:17.134532 38 warnings.go:70] v1 ComponentStatus is deprecated in v1.19+
time="2025-07-03T06:23:21Z" level=error msg="Failed to read API for groups map[metrics.k8s.io/v1beta1:stale GroupVersion discovery: metrics.k8s.io/v1beta1]"
E0703 06:23:21.894366 38 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: stale GroupVersion discovery: metrics.k8s.io/v1beta1
W0703 06:23:22.118962 38 warnings.go:70] v1 ComponentStatus is deprecated in v1.19+
time="2025-07-03T06:23:46Z" level=error msg="Failed to dial steve aggregation server: dial tcp: lookup dev-rancher.test.com: i/o timeout"
time="2025-07-03T06:23:56Z" level=error msg="Failed to dial steve aggregation server: dial tcp: lookup dev-rancher.teset.com: i/o timeout"
time="2025-07-03T06:24:06Z" level=error msg="Failed to dial steve aggregation server: dial tcp: lookup dev-rancher.test.com: i/o timeout"
E0703 06:24:11.888069 38 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: stale GroupVersion discovery: metrics.k8s.io/v1beta1
time="2025-07-03T06:24:11Z" level=error msg="Failed to read API for groups map[metrics.k8s.io/v1beta1:stale GroupVersion discovery: metrics.k8s.io/v1beta1]"
W0703 06:24:12.096426 38 warnings.go:70] v1 ComponentStatus is deprecated in v1.19+
time="2025-07-03T06:24:16Z" level=error msg="Failed to dial steve aggregation server: dial tcp: lookup dev-rancher.test.com: i/o timeout"
E0703 06:24:16.894671 38 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: stale GroupVersion discovery: metrics.k8s.io/v1beta1
time="2025-07-03T06:24:16Z" level=error msg="Failed to read API for groups map[metrics.k8s.io/v1beta1:stale GroupVersion discovery: metrics.k8s.io/v1beta1]"
W0703 06:24:17.174033 38 warnings.go:70] v1 ComponentStatus is deprecated in v1.19+
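Most of the errors in the log above are DNS lookups for the Rancher hostname timing out (the one detailed message shows the query going to 10.43.0.10:53, the nameserver listed in resolv.conf), mixed with repeated stale `metrics.k8s.io/v1beta1` discovery errors and one chart repository sync failure. To see how the errors are distributed, they can be tallied with a small helper. This is only a sketch: the function name `summarize_agent_log` and the output labels are made up for illustration, and the patterns are taken from the messages quoted in this post, not from any official list of agent errors.

```shell
#!/bin/sh
# Hypothetical helper: count the recurring error classes in a
# cattle-cluster-agent log read from stdin.
# The three patterns come from the log lines quoted in this post.
summarize_agent_log() {
  awk '
    /lookup .*i\/o timeout/        { dns++ }    # DNS resolution timed out
    /stale GroupVersion discovery/ { stale++ }  # aggregated metrics API unavailable
    /helm-clusterrepo-ensure/      { repo++ }   # chart repository sync failed
    END {
      # Unset counters print as 0.
      printf "dns_lookup_timeout=%d\n", dns
      printf "stale_metrics_api=%d\n", stale
      printf "chart_repo_sync_failure=%d\n", repo
    }
  '
}
```

It can be fed straight from the pod, for example `kubectl -n cattle-system logs cattle-cluster-agent-758c48f97f-rts7w | summarize_agent_log`. If DNS timeouts make up most of the errors, that would point at in-cluster DNS or pod networking on the downstream cluster, not at Rancher itself, but the counts alone do not prove that.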
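After manually restarting cattle-cluster-agent on the downstream cluster several times, the downstream cluster's status in the UI returned to Active, but the pod still logs many of the errors shown above.
In addition, editing the downstream cluster's configuration under Cluster Management fails with the error: No version info found in KDM
The rancher-webhook on the downstream cluster was also not upgraded to the matching version; it is still the old version that corresponds to Rancher 2.8.5, whereas on Rancher's local cluster rancher-webhook has already been upgraded to v0.5.1.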
Steps to reproduce:
Result:
Expected result:
Screenshots:
Additional context:
Logs