Rancher install: Failed to connect to peer wss://IP/v3/connect [local ID=IP]: dial tcp IP:443: i/o timeout

RKE2 version:
rke2 version v1.24.12+rke2r1 (1cbcfe3c873df5a7555cde3211a144055312b2a5)

Node CPU architecture, operating system, and version (identical on all 3 nodes):
Ubuntu 20.04.2 LTS
Linux Cube-1 5.4.0-81-generic #91-Ubuntu SMP Thu Jul 15 19:09:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

3 servers, 0 agents
cube-1 Ready control-plane,etcd,master 25d v1.24.12+rke2r1
cube-2 Ready control-plane,etcd,master 25d v1.24.12+rke2r1
cube-3 Ready control-plane,etcd,master 25d v1.24.12+rke2r1

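After RKE2 was successfully deployed in HA mode on the 3 machines, I installed Rancher in HA mode with Helm. The installation did not fully succeed, and the Rancher service cannot be reached.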


  • Commands used to install RKE2:
  1. Install RKE2:
    curl -sfL https://get.rke2.io | sh -
  2. kubectl create namespace cattle-system
  3. kubectl -n cattle-system create secret tls tls-rancher-ingress
    (The certificate was issued by GoDaddy.com, Inc, so it should be fine.)
  4. helm install rancher rancher-stable/rancher \
    --namespace cattle-system \
    --set hostname=dev..com \
    --set bootstrapPassword= \
    --set ingress.tls.source=secret \
    --set ingress.ingressClassName=nginx
  5. kubectl -n cattle-system rollout status deploy/rancher
    kubectl -n cattle-system get deploy rancher
  6. Two helm-operation-**** containers fail (logs below).
  7. The Rancher service is unreachable, and the rancher server logs on all 3 nodes show errors (see below).

Expected result: the Rancher UI is reachable and Rancher manages the RKE2 cluster normally.

Actual result: the Rancher server cannot be reached.



  1. The Helm install fails with:
    Error: INSTALLATION FAILED: failed to create resource: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": failed to call webhook: Post "https://rke2-ingress-nginx-controller-admission.kube-system.svc:443/networking/v1/ingresses?timeout=10s": context deadline exceeded

  2. From inside the rancher server pod's container, outbound access is very slow and the other rancher pods cannot be reached at all. Host-to-host access is very fast, and outbound access from the hosts is also very fast.
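
To confirm item 2 pod by pod, one option is to probe each rancher pod from every rancher pod that runs on a different node. The following is only a sketch: it assumes the chart's default `app=rancher` pod label, that `curl` is available inside the rancher container, and that the pod answers on `https://<pod IP>/ping`. The helper name `cross_node_probes` is made up for this example.

```shell
#!/bin/sh
# Read "NAME IP NODE" lines on stdin and print one kubectl exec probe for
# every ordered pair of pods that run on different nodes.
# Hypothetical helper; the /ping path and the presence of curl in the
# container are assumptions, not facts taken from the thread.
cross_node_probes() {
    awk '
        { name[NR] = $1; ip[NR] = $2; node[NR] = $3 }
        END {
            for (i = 1; i <= NR; i++)
                for (j = 1; j <= NR; j++)
                    if (i != j && node[i] != node[j])
                        printf "kubectl -n cattle-system exec %s -- curl -sk -m 5 https://%s/ping\n", name[i], ip[j]
        }'
}

# Possible usage against the live cluster (prints the probes, then runs them):
# kubectl -n cattle-system get pods -l app=rancher --no-headers \
#   -o custom-columns=NAME:.metadata.name,IP:.status.podIP,NODE:.spec.nodeName \
#   | cross_node_probes | sh
```

If every probe runs into the 5 second limit while the hosts themselves reach each other quickly, that would point at the pod overlay network between hosts rather than at Rancher itself.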

helm-operation-****** logs (the pod stays in Running for a long time, then ends in Error)

Defaulted container "helm" out of: helm, proxy
Waiting for Kubernetes API to be available

rancher-webhook-****** logs

time="2023-05-17T13:37:43Z" level=info msg="Rancher-webhook version v0.3.3 (1b9d829) is starting"
time="2023-05-17T13:37:44Z" level=info msg="generated self-signed CA certificate CN=dynamiclistener-ca@1684330664,O=dynamiclistener-org: notBefore=2023-05-17 13:37:44.002591079 +0000 UTC notAfter=2033-05-14 13:37:44.002591079 +0000 UTC"
time="2023-05-17T13:37:44Z" level=info msg="Listening on :9443"
time="2023-05-17T13:37:44Z" level=info msg="certificate CN=dynamic,O=dynamic signed by CN=dynamiclistener-ca@1684330664,O=dynamiclistener-org: notBefore=2023-05-17 13:37:44 +0000 UTC notAfter=2033-05-14 13:37:44 +0000 UTC"
time="2023-05-17T13:37:44Z" level=warning msg="dynamiclistener [::]:9443: no cached certificate available for preload - deferring certificate load until storage initialization or first client request"
time="2023-05-17T13:37:44Z" level=info msg="Creating new TLS secret for cattle-system/cattle-webhook-tls (count: 1): map[listener.cattle.io/cn-rancher-webhook.cattle-system.svc:rancher-webhook.cattle-system.svc listener.cattle.io/fingerprint:SHA1=51A487101660C38F49AA8B5EDB4CA80D6CBA20FD]"
time="2023-05-17T13:37:44Z" level=info msg="Active TLS secret cattle-system/cattle-webhook-tls (ver=11899619) (count 1): map[listener.cattle.io/cn-rancher-webhook.cattle-system.svc:rancher-webhook.cattle-system.svc listener.cattle.io/fingerprint:SHA1=51A487101660C38F49AA8B5EDB4CA80D6CBA20FD]"
time="2023-05-17T13:37:44Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=Role controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=RoleBinding controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting management.cattle.io/v3, Kind=ClusterRoleTemplateBinding controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting management.cattle.io/v3, Kind=Cluster controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting apiregistration.k8s.io/v1, Kind=APIService controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRole controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting /v1, Kind=Secret controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting management.cattle.io/v3, Kind=GlobalRole controller"
time="2023-05-17T13:37:44Z" level=info msg="Sleeping for 15 seconds then applying webhook config"
time="2023-05-17T13:37:44Z" level=info msg="Starting apiextensions.k8s.io/v1, Kind=CustomResourceDefinition controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting management.cattle.io/v3, Kind=ProjectRoleTemplateBinding controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting management.cattle.io/v3, Kind=PodSecurityAdmissionConfigurationTemplate controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting provisioning.cattle.io/v1, Kind=Cluster controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting management.cattle.io/v3, Kind=RoleTemplate controller"
time="2023-05-17T13:37:44Z" level=info msg="Updating TLS secret for cattle-system/cattle-webhook-tls (count: 1): map[listener.cattle.io/cn-rancher-webhook.cattle-system.svc:rancher-webhook.cattle-system.svc listener.cattle.io/fingerprint:SHA1=51A487101660C38F49AA8B5EDB4CA80D6CBA20FD]"

rancher-****** logs (tail end)

2023/05/17 13:36:15 [ERROR] Failed to connect to peer wss:// [local ID=]: dial tcp i/o timeout
2023/05/17 13:36:19 [ERROR] Failed to connect to peer wss:// [local ID=]: dial tcp i/o timeout
2023/05/17 13:36:30 [ERROR] Failed to connect to peer wss:// [local ID=]: dial tcp i/o timeout
2023/05/17 13:36:34 [ERROR] Failed to connect to peer wss:// [local ID=]: dial tcp i/o timeout
2023/05/17 13:36:45 [ERROR] Failed to connect to peer wss:// [local ID=]: dial tcp i/o timeout
2023/05/17 13:36:49 [ERROR] Failed to connect to peer wss:// [local ID=]: dial tcp i/o timeout
2023/05/17 13:37:00 [ERROR] Failed to connect to peer wss:// [local ID=]: dial tcp i/o timeout
2023/05/17 13:37:04 [ERROR] Failed to connect to peer wss:// [local ID=]: dial tcp i/o timeout
2023/05/17 13:37:15 [ERROR] Failed to connect to peer wss:// [local ID=]: dial tcp i/o timeout
2023/05/17 13:37:19 [ERROR] Failed to connect to peer wss:// [local ID=]: dial tcp i/o timeout
W0517 13:37:29.577586 33 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=BundleNamespaceMapping
2023/05/17 13:37:29 [INFO] Watching metadata for gitjob.cattle.io/v1, Kind=GitJob
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=ClusterRegistration
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=BundleDeployment
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=ImageScan
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=ClusterRegistrationToken
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=GitRepo
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=GitRepoRestriction
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=Content
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=ClusterGroup
2023/05/17 13:37:30 [ERROR] Failed to connect to peer wss:// [local ID=]: dial tcp i/o timeout
2023/05/17 13:37:34 [ERROR] Failed to connect to peer wss:// [local ID=]: dial tcp i/o timeout
2023/05/17 13:37:45 [ERROR] Failed to connect to peer wss:// [local ID=]: dial tcp i/o timeout
2023/05/17 13:37:49 [ERROR] Failed to connect to peer wss:// [local ID=]: dial tcp i/o timeout
2023/05/17 13:38:00 [ERROR] Failed to connect to peer wss:// [local ID=]: dial tcp i/o timeout
2023/05/17 13:38:04 [ERROR] Failed to connect to peer wss:// [local ID=]: dial tcp i/o timeout

rke2-ingress-nginx-controller-****** logs

W0517 13:19:12.260018 7 controller.go:1112] Service "cattle-system/rancher" does not have any active Endpoint.
W0517 13:19:15.594027 7 controller.go:1112] Service "cattle-system/rancher" does not have any active Endpoint.
W0517 13:21:57.720394 7 controller.go:1112] Service "cattle-system/rancher" does not have any active Endpoint.
W0517 13:22:17.744991 7 controller.go:1112] Service "cattle-system/rancher" does not have any active Endpoint.
W0517 13:24:51.212331 7 controller.go:1112] Service "cattle-system/rancher" does not have any active Endpoint.
W0517 13:24:54.545758 7 controller.go:1112] Service "cattle-system/rancher" does not have any active Endpoint.
W0517 13:24:57.880022 7 controller.go:1112] Service "cattle-system/rancher" does not have any active Endpoint.
I0517 13:26:38.720027 7 store.go:658] "secret was deleted and it is used in ingress annotations. Parsing" secret="cattle-system/tls-rancher-ingress"
W0517 13:26:38.720409 7 controller.go:1112] Service "cattle-system/rancher" does not have any active Endpoint.
W0517 13:26:38.720447 7 controller.go:1333] Error getting SSL certificate "cattle-system/tls-rancher-ingress": local SSL certificate cattle-system/tls-rancher-ingress was not found. Using default certificate
W0517 13:26:42.053846 7 controller.go:1018] Error obtaining Endpoints for Service "cattle-system/rancher": no object matching key "cattle-system/rancher" in local store
W0517 13:26:42.053896 7 controller.go:1333] Error getting SSL certificate "cattle-system/tls-rancher-ingress": local SSL certificate cattle-system/tls-rancher-ingress was not found. Using default certificate
I0517 13:26:42.053972 7 controller.go:168] "Configuration changes detected, backend reload required"
I0517 13:26:42.115223 7 controller.go:185] "Backend successfully reloaded"
I0517 13:26:42.115463 7 event.go:285] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"rke2-ingress-nginx-controller-k4fh5", UID:"71b2cfab-c167-4dd4-99a0-1780f31754b4", APIVersion:"v1", ResourceVersion:"10834628", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
I0517 13:26:45.387983 7 controller.go:168] "Configuration changes detected, backend reload required"
I0517 13:26:45.440137 7 controller.go:185] "Backend successfully reloaded"
I0517 13:26:45.440307 7 event.go:285] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"rke2-ingress-nginx-controller-k4fh5", UID:"71b2cfab-c167-4dd4-99a0-1780f31754b4", APIVersion:"v1", ResourceVersion:"10834628", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
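
The repeated "does not have any active Endpoint" warnings mean that the cattle-system/rancher Service had no ready pods behind it at those moments. A small sketch for checking this directly follows; `summarise_endpoints` is a hypothetical helper, and the commented `kubectl` line shows one way it might be fed.

```shell
#!/bin/sh
# Summarise the ready endpoint IPs of a Service, passed as a single
# whitespace-separated string (the shape the jsonpath query below prints).
# Hypothetical helper written for this thread.
summarise_endpoints() {
    # Re-split the argument into positional parameters so $# counts the IPs.
    set -- $1
    if [ "$#" -eq 0 ]; then
        echo "no ready endpoints"
        return 1
    fi
    echo "$# ready endpoint(s): $*"
}

# Possible usage against the live cluster:
# summarise_endpoints "$(kubectl -n cattle-system get endpoints rancher \
#   -o jsonpath='{.subsets[*].addresses[*].ip}')"
```

The same log also reports the tls-rancher-ingress secret as missing at 13:26:38, so `kubectl -n cattle-system get secret tls-rancher-ingress` is worth running alongside this check.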

It looks like the rancher pods on different hosts cannot reach each other. First check whether ip_forward is enabled: cat /proc/sys/net/ipv4/ip_forward
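
The check can be wrapped so that it prints a clear verdict on each of the three nodes. This is a minimal sketch; the function name `check_ip_forward` and its optional file argument are additions for illustration, the latter only so the function can be tried against a test file.

```shell
#!/bin/sh
# Report whether IPv4 forwarding is enabled on this host.
# The path defaults to the real kernel setting; the optional argument is an
# illustrative addition, not part of the original suggestion.
check_ip_forward() {
    file="${1:-/proc/sys/net/ipv4/ip_forward}"
    if [ ! -r "$file" ]; then
        echo "unknown: cannot read $file"
        return 2
    fi
    value="$(cat "$file")"
    if [ "$value" = "1" ]; then
        echo "enabled"
    else
        echo "disabled (value: $value)"
        return 1
    fi
}

# Run the check on the current node; "|| true" keeps a strict shell going.
check_ip_forward || true
```

If a node reports it as disabled, `sysctl -w net.ipv4.ip_forward=1` turns it on until the next reboot; a file under /etc/sysctl.d/ is needed to make the setting persistent.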

Or scale the pod replica count down to 1 and see whether it starts.