Cluster error: Failed to ensure monitoring project name: failed to find "cattle-prometheus" Namespace

Rancher Server setup

  • Rancher version: v2.5.9
  • Installation option (Docker install/Helm Chart): Docker install
    • If Helm Chart install, provide the Local cluster type (RKE1, RKE2, k3s, EKS, etc.) and version: RKE: v1.1.11
  • Online or air-gapped deployment: air-gapped

Downstream cluster information

  • Kubernetes version: client (1.22) and server (1.18)
  • Cluster Type (Local/Downstream): local
    • If Downstream, what type of cluster? (Custom/Imported or Hosted, etc.):

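User information

  • What is the role of the logged-in user? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom): admin
    • If Custom, the custom permission set:

Host OS: CentOS 7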

Problem description: The cluster was healthy yesterday and no operations or changes were made to it. After logging in to the Rancher UI today, the cluster shows an error: Failed to ensure monitoring project name: failed to find "cattle-prometheus" Namespace: Get "https://10.43.0.1:443/api/v1/namespaces/cattle-prometheus": waiting for cluster [c-swd8k] agent to connect; waiting on cluster-scoped-gc

Steps to reproduce:

Result:

Expected result:

Screenshots:

Additional context:

Logs
I0429 00:54:56.119792      54 request.go:645] Throttling request took 1.034102129s, request: GET:https://127.0.0.1:6444/apis/management.cattle.io/v3/rkek8sserviceoptions?limit=500&resourceVersion=0
I0429 00:54:56.693654      54 shared_informer.go:240] Waiting for caches to sync for garbage collector
I0429 00:54:57.176689      54 shared_informer.go:247] Caches are synced for resource quota 
I0429 00:54:57.206887      54 shared_informer.go:247] Caches are synced for resource quota 
2024/04/29 00:54:58 [ERROR] error syncing 'system-library': handler catalog: Update failed: fatal: unable to access 'https://git.rancher.io/system-charts/': gnutls_handshake() failed: Error in the pull function.
: exit status 128, requeuing
I0429 00:54:58.494414      54 shared_informer.go:247] Caches are synced for garbage collector 
2024/04/29 00:54:58 [ERROR] error syncing 'helm3-library': handler catalog: Update failed: fatal: unable to access 'https://git.rancher.io/helm3-charts/': gnutls_handshake() failed: Error in the pull function.
: exit status 128, requeuing
2024/04/29 00:54:58 [ERROR] error syncing 'library': handler catalog: Update failed: fatal: unable to access 'https://git.rancher.io/charts/': gnutls_handshake() failed: Error in the pull function.
: exit status 128, requeuing
I0429 00:54:58.554887      54 shared_informer.go:247] Caches are synced for garbage collector 
I0429 00:54:58.554959      54 garbagecollector.go:137] Garbage collector: all resource monitors have synced. Proceeding to collect garbage
2024/04/29 00:55:01 [ERROR] failed on subscribe replicationController: Get "https://10.43.0.1:443/api/v1/replicationcontrollers?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:02 [ERROR] failed on subscribe replicaSet: Get "https://10.43.0.1:443/apis/apps/v1/replicasets?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:03 [ERROR] failed on subscribe serviceMonitor: Get "https://10.43.0.1:443/apis/monitoring.coreos.com/v1/servicemonitors?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:05 [ERROR] failed on subscribe alertmanager: Get "https://10.43.0.1:443/apis/monitoring.coreos.com/v1/alertmanagers?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:06 [ERROR] failed on subscribe job: Get "https://10.43.0.1:443/apis/batch/v1/jobs?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:07 [ERROR] failed on subscribe daemonSet: Get "https://10.43.0.1:443/apis/apps/v1/daemonsets?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:07 [ERROR] failed on subscribe configMap: Get "https://10.43.0.1:443/api/v1/configmaps?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:07 [ERROR] failed on subscribe statefulSet: Get "https://10.43.0.1:443/apis/apps/v1/statefulsets?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:07 [ERROR] failed on subscribe ingress: Get "https://10.43.0.1:443/apis/extensions/v1beta1/ingresses?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:07 [ERROR] failed on subscribe service: Get "https://10.43.0.1:443/api/v1/services?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:08 [ERROR] failed on subscribe prometheus: Get "https://10.43.0.1:443/apis/monitoring.coreos.com/v1/prometheuses?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:09 [ERROR] error syncing 'library': handler catalog: Update failed: fatal: unable to access 'https://git.rancher.io/charts/': gnutls_handshake() failed: Error in the pull function.
: exit status 128, requeuing
2024/04/29 00:55:09 [ERROR] error syncing 'system-library': handler catalog: Update failed: fatal: unable to access 'https://git.rancher.io/system-charts/': gnutls_handshake() failed: Error in the pull function.
: exit status 128, requeuing
2024/04/29 00:55:09 [ERROR] error syncing 'helm3-library': handler catalog: Update failed: fatal: unable to access 'https://git.rancher.io/helm3-charts/': gnutls_handshake() failed: Error in the pull function.
: exit status 128, requeuing
2024/04/29 00:55:10 [ERROR] failed on subscribe dnsRecord: Get "https://10.43.0.1:443/api/v1/services?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:10 [ERROR] failed on subscribe prometheusRule: Get "https://10.43.0.1:443/apis/monitoring.coreos.com/v1/prometheusrules?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:10 [ERROR] failed on subscribe deployment: Get "https://10.43.0.1:443/apis/apps/v1/deployments?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:10 [ERROR] failed on subscribe virtualService: Get "https://10.43.0.1:443/apis/networking.istio.io/v1alpha3/virtualservices?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:10 [ERROR] failed on subscribe cronJob: Get "https://10.43.0.1:443/apis/batch/v1beta1/cronjobs?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:11 [ERROR] failed on subscribe gateway: Get "https://10.43.0.1:443/apis/networking.istio.io/v1alpha3/gateways?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:12 [ERROR] failed on subscribe destinationRule: Get "https://10.43.0.1:443/apis/networking.istio.io/v1alpha3/destinationrules?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:12 [ERROR] failed on subscribe pod: Get "https://10.43.0.1:443/api/v1/pods?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:13 [ERROR] failed on subscribe namespacedDockerCredential: Get "https://10.43.0.1:443/api/v1/secrets?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:14 [ERROR] failed on subscribe persistentVolumeClaim: Get "https://10.43.0.1:443/api/v1/persistentvolumeclaims?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:25 [ERROR] error syncing 'c-swd8k': handler cluster-deploy: Get "https://10.43.0.1:443/apis/apps/v1/namespaces/cattle-system/daemonsets/cattle-node-agent": waiting for cluster [c-swd8k] agent to connect, requeuing
time="2024-04-29T00:55:30.515409722Z" level=info msg="Cluster-Http-Server 2024/04/29 00:55:30 http: TLS handshake error from 10.42.0.34:59750: remote error: tls: bad certificate"
E0429 00:55:30.524220      54 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Token has been invalidated]
2024/04/29 00:55:30 [ERROR] error syncing 'library': handler catalog: Update failed: fatal: unable to access 'https://git.rancher.io/charts/': gnutls_handshake() failed: Error in the pull function.
: exit status 128, requeuing
2024/04/29 00:55:30 [ERROR] error syncing 'system-library': handler catalog: Update failed: fatal: unable to access 'https://git.rancher.io/system-charts/': gnutls_handshake() failed: Error in the pull function.
: exit status 128, requeuing
2024/04/29 00:55:31 [ERROR] error syncing 'helm3-library': handler catalog: Update failed: fatal: unable to access 'https://git.rancher.io/helm3-charts/': gnutls_handshake() failed: Error in the pull function.
: exit status 128, requeuing
2024/04/29 00:55:31 [ERROR] failed on subscribe replicationController: Get "https://10.43.0.1:443/api/v1/replicationcontrollers?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:32 [ERROR] failed on subscribe replicaSet: Get "https://10.43.0.1:443/apis/apps/v1/replicasets?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:36 [ERROR] failed on subscribe statefulSet: Get "https://10.43.0.1:443/apis/apps/v1/statefulsets?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:37 [ERROR] failed on subscribe daemonSet: Get "https://10.43.0.1:443/apis/apps/v1/daemonsets?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:37 [ERROR] failed on subscribe job: Get "https://10.43.0.1:443/apis/batch/v1/jobs?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:40 [ERROR] failed on subscribe deployment: Get "https://10.43.0.1:443/apis/apps/v1/deployments?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:40 [ERROR] failed on subscribe cronJob: Get "https://10.43.0.1:443/apis/batch/v1beta1/cronjobs?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:55:46 [ERROR] failed on subscribe namespacedSshAuth: Get "https://10.43.0.1:443/api/v1/secrets?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
2024/04/29 00:56:01 [ERROR] error syncing 'library': handler catalog: Update failed: fatal: unable to access 'https://git.rancher.io/charts/': gnutls_handshake() failed: Error in the pull function.
: exit status 128, requeuing
2024/04/29 00:56:01 [ERROR] error syncing 'system-library': handler catalog: Update failed: fatal: unable to access 'https://git.rancher.io/system-charts/': gnutls_handshake() failed: Error in the pull function.
: exit status 128, requeuing
2024/04/29 00:56:01 [ERROR] error syncing 'helm3-library': handler catalog: Update failed: fatal: unable to access 'https://git.rancher.io/helm3-charts/': gnutls_handshake() failed: Error in the pull function.
: exit status 128, requeuing
2024/04/29 00:56:07 [ERROR] error syncing 'c-swd8k': handler cluster-deploy: Get "https://10.43.0.1:443/apis/apps/v1/namespaces/cattle-system/daemonsets/cattle-node-agent": waiting for cluster [c-swd8k] agent to connect, requeuing
2024/04/29 00:56:10 [INFO] Stopping cluster agent for c-swd8k
2024/04/29 00:56:10 [ERROR] failed to start cluster controllers c-swd8k: context canceled
2024/04/29 00:56:17 [ERROR] failed on subscribe namespacedSecret: Get "https://10.43.0.1:443/api/v1/secrets?resourceVersion=0&timeout=30m0s&timeoutSeconds=1800&watch=true": waiting for cluster [c-swd8k] agent to connect
time="2024-04-29T00:56:30.539847234Z" level=info msg="Cluster-Http-Server 2024/04/29 00:56:30 http: TLS handshake error from 10.42.0.34:60076: remote error: tls: bad certificate"
E0429 00:56:30.550754      54 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, Token has been invalidated]
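
The log above interleaves several distinct failures, so a per-category count can make it easier to see which ones dominate. The following is a minimal sketch, not an official Rancher tool: the category names in `PATTERNS` and the function `summarize` are hypothetical, and the regular expressions are simply copied from message fragments that appear in this log.

```python
import re
import sys
from collections import Counter

# Hypothetical category names; each pattern is a fragment taken
# verbatim from a message in the log above.
PATTERNS = {
    "agent_not_connected": re.compile(r"waiting for cluster \[[^\]]+\] agent to connect"),
    "catalog_sync_failed": re.compile(r"handler catalog: Update failed"),
    "tls_bad_certificate": re.compile(r"tls: bad certificate"),
    "token_invalidated": re.compile(r"Token has been invalidated"),
}


def summarize(lines):
    """Count how many log lines match each known failure pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Read the log file named on the command line and print counts,
    # most frequent category first.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for name, n in summarize(fh).most_common():
            print(f"{n:5d}  {name}")
```

Assuming the Rancher container's output has been saved to a file (for example with `docker logs <container> > rancher.log 2>&1`, where `<container>` is a placeholder), the script could be run as `python summarize.py rancher.log`. In this excerpt the catalog sync failures are consistent with the air-gapped deployment being unable to reach git.rancher.io, while the `tls: bad certificate` and `Token has been invalidated` lines recur alongside the agent connection failures.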

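Additional note: the Kubernetes cluster itself is currently unaffected. All nodes are in the Ready state, and the pods running on the worker nodes are all healthy.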

Take a look at this: cattle-prometheus - waiting for cluster agent to connect - Unavailable cluster · Issue #23596 · rancher/rancher · GitHub