Rancher installation: Failed to connect to peer wss://IP/v3/connect [local ID=IP]: dial tcp IP:443: i/o timeout

Environment:
RKE2 version:
rke2 version v1.24.12+rke2r1 (1cbcfe3c873df5a7555cde3211a144055312b2a5)

Node CPU architecture, OS, and version (identical on all 3 nodes):
Ubuntu 20.04.2 LTS
Linux Cube-1 5.4.0-81-generic #91-Ubuntu SMP Thu Jul 15 19:09:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Cluster configuration:
3 servers, 0 agents
cube-1 Ready control-plane,etcd,master 25d v1.24.12+rke2r1
cube-2 Ready control-plane,etcd,master 25d v1.24.12+rke2r1
cube-3 Ready control-plane,etcd,master 25d v1.24.12+rke2r1
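(The 3 machines form the cluster. The domain name points to nginx installed on a separate machine, and that nginx proxies upstream to the 3 machines. All connections are over the LAN.)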

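Problem description:
After RKE2 was successfully deployed in HA mode on the 3 machines, I installed Rancher HA with helm. The installation did not fully succeed, and the Rancher service cannot be accessed.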

Steps to reproduce:

  • Commands used to install RKE2:
  1. Install RKE2
    curl -sfL https://get.rke2.io | sh -
    tls-san is set to a second-level domain that I registered myself
  2. kubectl create namespace cattle-system
  3. kubectl -n cattle-system create secret tls tls-rancher-ingress \
    --cert=tls.crt \
    --key=tls.key
    (The certificate was issued by GoDaddy.com, Inc, so it should be fine.)
  4. helm install rancher rancher-stable/rancher \
    --namespace cattle-system \
    --set hostname=dev..com \
    --set bootstrapPassword= \
    --set ingress.tls.source=secret \
    --set ingress.ingressClassName=nginx
  5. kubectl -n cattle-system rollout status deploy/rancher
    kubectl -n cattle-system get deploy rancher
    Both commands completed normally
  6. Two helm-operation-**** containers ran into errors (logs below)
  7. The Rancher service is unreachable, and the rancher server logs on all 3 nodes show errors (see below).

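Expected result:
The Rancher UI is accessible and Rancher manages the RKE2 cluster normally

Actual result:
The Rancher UI cannot be accessed
RKE2 looks fine on the surface
kubectl works normally
The Rancher server cannot be accessed

(Because of new-user restrictions, more information is posted in follow-up replies.)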

Additional notes

  1. helm reports an error during installation:
    Error: INSTALLATION FAILED: failed to create resource: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": failed to call webhook: Post "https://rke2-ingress-nginx-controller-admission.kube-system.svc:443/networking/v1/ingresses?timeout=10s": context deadline exceeded

  2. From inside the rancher server pod's container, outbound access is very slow, and the other rancher pods cannot be reached at all. Host-to-host access is very fast, and outbound access from the hosts is also very fast.
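
Both symptoms involve pod-network traffic (the API server calling the admission webhook Service, and pods calling each other), so it can help to see where the pods run and whether the webhook Service has endpoints. A minimal sketch: the Service name is taken from the helm error above, while the `app=rancher` label, the function name, and the `KUBECTL` override variable are assumptions of this sketch and not confirmed in this thread.

```shell
# KUBECTL can be overridden (for example with a wrapper); defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

show_pod_network_state() {
  # Which node and pod IP each rancher replica has (label is an assumption).
  "$KUBECTL" get pods -n cattle-system -l app=rancher -o wide
  # Whether the ingress admission webhook Service is backed by any endpoint.
  "$KUBECTL" get endpoints -n kube-system rke2-ingress-nginx-controller-admission
}

# Usage: show_pod_network_state
```

If the endpoints exist but the webhook call still times out, that would be consistent with a cross-node pod networking problem rather than a problem in Rancher itself.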

helm-operation-****** logs (the pod stays Running for a long time and finally ends in Error)

Defaulted container "helm" out of: helm, proxy
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available

rancher-webhook-****** logs

time="2023-05-17T13:37:43Z" level=info msg="Rancher-webhook version v0.3.3 (1b9d829) is starting"
time="2023-05-17T13:37:44Z" level=info msg="generated self-signed CA certificate CN=dynamiclistener-ca@1684330664,O=dynamiclistener-org: notBefore=2023-05-17 13:37:44.002591079 +0000 UTC notAfter=2033-05-14 13:37:44.002591079 +0000 UTC"
time="2023-05-17T13:37:44Z" level=info msg="Listening on :9443"
time="2023-05-17T13:37:44Z" level=info msg="certificate CN=dynamic,O=dynamic signed by CN=dynamiclistener-ca@1684330664,O=dynamiclistener-org: notBefore=2023-05-17 13:37:44 +0000 UTC notAfter=2033-05-14 13:37:44 +0000 UTC"
time="2023-05-17T13:37:44Z" level=warning msg="dynamiclistener [::]:9443: no cached certificate available for preload - deferring certificate load until storage initialization or first client request"
time="2023-05-17T13:37:44Z" level=info msg="Creating new TLS secret for cattle-system/cattle-webhook-tls (count: 1): map[listener.cattle.io/cn-rancher-webhook.cattle-system.svc:rancher-webhook.cattle-system.svc listener.cattle.io/fingerprint:SHA1=51A487101660C38F49AA8B5EDB4CA80D6CBA20FD]"
time="2023-05-17T13:37:44Z" level=info msg="Active TLS secret cattle-system/cattle-webhook-tls (ver=11899619) (count 1): map[listener.cattle.io/cn-rancher-webhook.cattle-system.svc:rancher-webhook.cattle-system.svc listener.cattle.io/fingerprint:SHA1=51A487101660C38F49AA8B5EDB4CA80D6CBA20FD]"
time="2023-05-17T13:37:44Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=Role controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=RoleBinding controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting management.cattle.io/v3, Kind=ClusterRoleTemplateBinding controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting management.cattle.io/v3, Kind=Cluster controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting apiregistration.k8s.io/v1, Kind=APIService controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRole controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting /v1, Kind=Secret controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting management.cattle.io/v3, Kind=GlobalRole controller"
time="2023-05-17T13:37:44Z" level=info msg="Sleeping for 15 seconds then applying webhook config"
time="2023-05-17T13:37:44Z" level=info msg="Starting apiextensions.k8s.io/v1, Kind=CustomResourceDefinition controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting management.cattle.io/v3, Kind=ProjectRoleTemplateBinding controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting management.cattle.io/v3, Kind=PodSecurityAdmissionConfigurationTemplate controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting provisioning.cattle.io/v1, Kind=Cluster controller"
time="2023-05-17T13:37:44Z" level=info msg="Starting management.cattle.io/v3, Kind=RoleTemplate controller"
time="2023-05-17T13:37:44Z" level=info msg="Updating TLS secret for cattle-system/cattle-webhook-tls (count: 1): map[listener.cattle.io/cn-rancher-webhook.cattle-system.svc:rancher-webhook.cattle-system.svc listener.cattle.io/fingerprint:SHA1=51A487101660C38F49AA8B5EDB4CA80D6CBA20FD]"

Later part of the rancher-****** logs

2023/05/17 13:36:15 [ERROR] Failed to connect to peer wss://10.42.0.162/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.0.162:443: i/o timeout
2023/05/17 13:36:19 [ERROR] Failed to connect to peer wss://10.42.2.236/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.2.236:443: i/o timeout
2023/05/17 13:36:30 [ERROR] Failed to connect to peer wss://10.42.0.162/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.0.162:443: i/o timeout
2023/05/17 13:36:34 [ERROR] Failed to connect to peer wss://10.42.2.236/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.2.236:443: i/o timeout
2023/05/17 13:36:45 [ERROR] Failed to connect to peer wss://10.42.0.162/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.0.162:443: i/o timeout
2023/05/17 13:36:49 [ERROR] Failed to connect to peer wss://10.42.2.236/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.2.236:443: i/o timeout
2023/05/17 13:37:00 [ERROR] Failed to connect to peer wss://10.42.0.162/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.0.162:443: i/o timeout
2023/05/17 13:37:04 [ERROR] Failed to connect to peer wss://10.42.2.236/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.2.236:443: i/o timeout
2023/05/17 13:37:15 [ERROR] Failed to connect to peer wss://10.42.0.162/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.0.162:443: i/o timeout
2023/05/17 13:37:19 [ERROR] Failed to connect to peer wss://10.42.2.236/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.2.236:443: i/o timeout
W0517 13:37:29.577586 33 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=BundleNamespaceMapping
2023/05/17 13:37:29 [INFO] Watching metadata for gitjob.cattle.io/v1, Kind=GitJob
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=ClusterRegistration
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=BundleDeployment
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=ImageScan
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=ClusterRegistrationToken
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=GitRepo
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=GitRepoRestriction
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=Content
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=ClusterGroup
2023/05/17 13:37:30 [ERROR] Failed to connect to peer wss://10.42.0.162/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.0.162:443: i/o timeout
2023/05/17 13:37:34 [ERROR] Failed to connect to peer wss://10.42.2.236/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.2.236:443: i/o timeout
2023/05/17 13:37:45 [ERROR] Failed to connect to peer wss://10.42.0.162/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.0.162:443: i/o timeout
2023/05/17 13:37:49 [ERROR] Failed to connect to peer wss://10.42.2.236/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.2.236:443: i/o timeout
2023/05/17 13:38:00 [ERROR] Failed to connect to peer wss://10.42.0.162/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.0.162:443: i/o timeout
2023/05/17 13:38:04 [ERROR] Failed to connect to peer wss://10.42.2.236/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.2.236:443: i/o timeout
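
The repeating errors above can be condensed to show which peers are unreachable from which pod. A minimal sketch, assuming the pod log has been saved to a file; the file name `rancher.log` and the function name are hypothetical.

```shell
# Reads rancher log lines on stdin and prints one line per
# (local pod IP, peer pod IP) pair in the form "local -> peer: count".
summarize_peer_failures() {
  grep 'Failed to connect to peer' \
    | sed -E 's#.*peer wss://([^/]+)/.*local ID=([^]]+)\].*#\2 -> \1#' \
    | sort \
    | uniq -c \
    | awk '{print $2, $3, $4 ": " $1}'
}

# Usage: summarize_peer_failures < rancher.log
```

Comparing the peer IPs with the pod-to-node mapping shows whether only cross-node connections fail, which is what the 10.42.0.x, 10.42.1.x, and 10.42.2.x addresses here suggest.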

rke2-ingress-nginx-controller-****** logs

W0517 13:19:12.260018 7 controller.go:1112] Service "cattle-system/rancher" does not have any active Endpoint.
W0517 13:19:15.594027 7 controller.go:1112] Service "cattle-system/rancher" does not have any active Endpoint.
W0517 13:21:57.720394 7 controller.go:1112] Service "cattle-system/rancher" does not have any active Endpoint.
W0517 13:22:17.744991 7 controller.go:1112] Service "cattle-system/rancher" does not have any active Endpoint.
W0517 13:24:51.212331 7 controller.go:1112] Service "cattle-system/rancher" does not have any active Endpoint.
W0517 13:24:54.545758 7 controller.go:1112] Service "cattle-system/rancher" does not have any active Endpoint.
W0517 13:24:57.880022 7 controller.go:1112] Service "cattle-system/rancher" does not have any active Endpoint.
I0517 13:26:38.720027 7 store.go:658] "secret was deleted and it is used in ingress annotations. Parsing" secret="cattle-system/tls-rancher-ingress"
W0517 13:26:38.720409 7 controller.go:1112] Service "cattle-system/rancher" does not have any active Endpoint.
W0517 13:26:38.720447 7 controller.go:1333] Error getting SSL certificate "cattle-system/tls-rancher-ingress": local SSL certificate cattle-system/tls-rancher-ingress was not found. Using default certificate
W0517 13:26:42.053846 7 controller.go:1018] Error obtaining Endpoints for Service "cattle-system/rancher": no object matching key "cattle-system/rancher" in local store
W0517 13:26:42.053896 7 controller.go:1333] Error getting SSL certificate "cattle-system/tls-rancher-ingress": local SSL certificate cattle-system/tls-rancher-ingress was not found. Using default certificate
I0517 13:26:42.053972 7 controller.go:168] "Configuration changes detected, backend reload required"
I0517 13:26:42.115223 7 controller.go:185] "Backend successfully reloaded"
I0517 13:26:42.115463 7 event.go:285] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"rke2-ingress-nginx-controller-k4fh5", UID:"71b2cfab-c167-4dd4-99a0-1780f31754b4", APIVersion:"v1", ResourceVersion:"10834628", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration
I0517 13:26:45.387983 7 controller.go:168] "Configuration changes detected, backend reload required"
I0517 13:26:45.440137 7 controller.go:185] "Backend successfully reloaded"
I0517 13:26:45.440307 7 event.go:285] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"rke2-ingress-nginx-controller-k4fh5", UID:"71b2cfab-c167-4dd4-99a0-1780f31754b4", APIVersion:"v1", ResourceVersion:"10834628", FieldPath:""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration

It looks like the rancher pods on different hosts cannot reach each other. First check whether ip_forward is enabled: cat /proc/sys/net/ipv4/ip_forward
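
The ip_forward check can be wrapped so it gives a clear answer when run on each of the three nodes. A minimal sketch: the function name is hypothetical, and the optional path argument exists only so the check can be pointed at a different file.

```shell
# Reports whether IPv4 forwarding is on; returns non-zero when it is off.
check_ip_forward() {
  ipf_path="${1:-/proc/sys/net/ipv4/ip_forward}"
  ipf_value="$(cat "$ipf_path")" || return 2
  if [ "$ipf_value" = "1" ]; then
    echo "ip_forward is enabled"
  else
    echo "ip_forward is disabled (value: $ipf_value)"
    return 1
  fi
}

# Usage on each node: check_ip_forward
# If it is disabled, it can be turned on at runtime as root with:
#   sysctl -w net.ipv4.ip_forward=1
```

A value of 1 on every node rules out this cause; it does not by itself prove that the pod network is healthy.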

Alternatively, scale the pod replicas down to 1 and see whether it starts.
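
Scaling down could look like the following. A minimal sketch, assuming the deployment is named rancher in the cattle-system namespace (as in the rollout status command used earlier in the thread); the function name and the `KUBECTL` override variable are hypothetical conveniences.

```shell
# KUBECTL can be overridden (for example with a wrapper); defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

scale_rancher() {
  # Number of replicas to run; defaults to 1 for this test.
  rancher_replicas="${1:-1}"
  "$KUBECTL" scale deploy/rancher -n cattle-system --replicas="$rancher_replicas"
}

# Usage:
#   scale_rancher 1
#   kubectl -n cattle-system rollout status deploy/rancher
```

With a single replica there are no peers to connect to, so the "Failed to connect to peer" errors should stop; if Rancher then starts, that points at cross-node pod networking rather than at the Rancher installation.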