Error when creating a cluster through Rancher

Rancher Server Setup

  • Rancher version: 2.8.5
  • Installation option (Docker install/Helm Chart):
    • If Helm Chart, provide the Local cluster type (RKE1, RKE2, k3s, EKS, etc.) and version:
  • Online or air-gapped deployment:

Rancher was installed via Helm:

helm install rancher rancher-latest/rancher \
  --namespace cattle-system \
  --set hostname=rancher.XXX.com \
  --set replicas=1 \
  --set ingress.tls.source=secret \
  --set rancherImage=registry.cn-hangzhou.aliyuncs.com/rancher/rancher \
  --set systemDefaultRegistry=registry.cn-hangzhou.aliyuncs.com

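Downstream Cluster Information

  • Kubernetes version:
  • Cluster Type (Local/Downstream):
    • If Downstream, what type of cluster? (Custom/Imported or Hosted, etc.):

User Information

  • What is the role of the logged-in user? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom):
    • If Custom, the custom permission set:

Host OS: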

Problem description:
I created a new cluster through the Rancher UI, then copied the registration command shown by the UI and ran it on a new server.

Steps to reproduce:
curl -fL https://rancher.XXX.com/system-agent-install.sh | sudo sh -s - --server https://rancher.XXX.com --label 'cattle.io/os=linux' --token l6rvp7vzgz798b2h84sdmqj7kps67c887wj9gtwspqlwn8s6pkr6r7 --etcd --controlplane --worker
Result:
Jul 18 15:05:02 rancher-system-agent[4672]: time="2024-07-18T15:05:02+08:00" level=info msg="Rancher System Agent version v0.3.6 (41c07d0) is starting"
Jul 18 15:05:02 rancher-system-agent[4672]: time="2024-07-18T15:05:02+08:00" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Jul 18 15:05:02 rancher-system-agent[4672]: time="2024-07-18T15:05:02+08:00" level=info msg="Starting remote watch of plans"
Jul 18 15:05:02 rancher-system-agent[4672]: time="2024-07-18T15:05:02+08:00" level=info msg="Initial connection to Kubernetes cluster failed with error Get \"https://rancher.XXX.com/version\": tls: failed to verify certificate: x509: certificate signed by unknown authority, removing CA data and trying again"
Jul 18 15:05:02 rancher-system-agent[4672]: time="2024-07-18T15:05:02+08:00" level=fatal msg="error while connecting to Kubernetes cluster with nullified CA data: an error on the server (\"invalid upgrade response: status code 200\") has prevented the request from succeeding"
Jul 18 15:05:02 systemd[1]: rancher-system-agent.service: Main process exited, code=exited, status=1/FAILURE
Jul 18 15:05:02 systemd[1]: rancher-system-agent.service: Failed with result 'exit-code'.
Jul 18 15:05:07 systemd[1]: rancher-system-agent.service: Service hold-off time over, scheduling restart.
Jul 18 15:05:07 systemd[1]: rancher-system-agent.service: Scheduled restart job, restart counter is at 55.
Expected result:
rancher-system-agent runs normally.
Screenshots:

Additional context:

Logs


Judging from the logs, an untrusted self-signed certificate is in use, so you need to set the privateCA parameter.
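A minimal sketch of what setting privateCA could look like on an existing install. It assumes the Helm release is named rancher with the rancher-latest repo alias (as in the install command in the original post), that the CA bundle is available as a PEM file, and that kubectl and helm point at the local cluster. The run helper and the apply_private_ca function are illustrative names, not part of Rancher.

```shell
#!/bin/sh
# Sketch only: registers a private CA with an existing Rancher Helm release.
# Assumptions (not from the thread): the CA bundle is a PEM file, and
# kubectl/helm are configured for the local cluster.

# run: print the command when DRY_RUN=1 (the default), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# apply_private_ca CA_FILE
apply_private_ca() {
  ca_file=$1
  # Rancher reads the CA from a secret named tls-ca with the key cacerts.pem.
  run kubectl -n cattle-system create secret generic tls-ca \
    --from-file=cacerts.pem="$ca_file"
  # Keep the values from the original install and only switch privateCA on.
  run helm upgrade rancher rancher-latest/rancher \
    --namespace cattle-system \
    --reuse-values \
    --set privateCA=true
}
```

Calling apply_private_ca ./cacerts.pem prints the two commands for review; DRY_RUN=0 apply_private_ca ./cacerts.pem executes them.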

I am not using the Traefik bundled with k3s; nginx on the server proxies to Rancher, and nginx is configured with the certificate for that domain. Does this matter?
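One way to narrow this down is to look at the two error signatures in the agent log separately. The sketch below maps each to a likely cause. The hints are assumptions based on how these errors commonly arise (an untrusted or incomplete chain served by the proxy, and a proxy that does not forward the WebSocket Upgrade and Connection headers), not a confirmed diagnosis for this setup. The function name diagnose_agent_log is hypothetical.

```shell
#!/bin/sh
# Sketch only: maps the two error signatures from the agent log to a likely
# cause. The hints are assumptions about typical causes, not a confirmed
# diagnosis.

# diagnose_agent_log: reads journal text on stdin, prints one hint per match.
diagnose_agent_log() {
  log=$(cat)
  case $log in
    *"x509: certificate signed by unknown authority"*)
      echo "tls: node does not trust the certificate chain served for Rancher" ;;
  esac
  case $log in
    *"invalid upgrade response: status code 200"*)
      echo "proxy: WebSocket upgrade was answered with a plain HTTP 200" ;;
  esac
}
```

On the affected node it could be fed from the journal, for example: journalctl -u rancher-system-agent --no-pager | diagnose_agent_log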

If you are using an external load balancer, see the External TLS Termination section:
https://ranchermanager.docs.rancher.com/zh/getting-started/installation-and-upgrade/installation-references/helm-chart-options#外部-tls-终止
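The linked section covers terminating TLS in front of Rancher. A minimal sketch of the corresponding chart change, assuming the release name rancher and the rancher-latest repo alias from the install command in the original post; the run helper and the set_external_tls function are illustrative names.

```shell
#!/bin/sh
# Sketch only: tells the Rancher chart that TLS is terminated by an external
# proxy or load balancer. Assumes helm is configured for the local cluster.

# run: print the command when DRY_RUN=1 (the default), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# set_external_tls: keep existing values and only change the tls option.
set_external_tls() {
  run helm upgrade rancher rancher-latest/rancher \
    --namespace cattle-system \
    --reuse-values \
    --set tls=external
}
```

With TLS terminated at nginx, the proxy also has to pass WebSocket upgrade requests through to Rancher; that part lives in the nginx configuration and is not covered by this sketch.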


What is the path of the cluster.yml mentioned in that document?

Please read the description carefully: that part describes settings needed when the local cluster uses an ingress-controller.