Downstream cluster cannot register with the upstream Rancher

Hi everyone,

Environment:
RKE2 version:

# rke2 -v
rke2 version v1.30.5+rke2r1 (0c83bc82315cd61664880d0b52a7e070e9fbd623)
go version go1.22.6 X:boringcrypto

Rancher version:
v2.9.2

Node CPU architecture, operating system, and version:

# uname -a
Linux ml-57-6 4.19.90-23.21.v2101.fortest.ky10.aarch64 #1 SMP Fri Mar 11 11:37:09 CST 2022 aarch64 aarch64 aarch64 GNU/Linux

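The operating system is Rocky Linux 9.5, arm64.

Cluster configuration:
Both clusters run the same RKE2 version
Upstream cluster: 1 server
Downstream cluster: 3 servers, no agents

Problem description:
I have two clusters and, following the recommendations on rancher.cn, used a certificate signed by a private CA. When I try to import the downstream cluster, cattle-cluster-agent fails during startup, so the downstream cluster cannot be brought under management.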

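                    Upstream cluster                Downstream cluster
Configuration       Single server node              Three servers, no agents
Purpose             Hosts Rancher                   Workload cluster
Deployment method   Manual air-gapped deployment    Same as upstream, deployed manually
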
time="2024-12-09T08:13:56Z" level=error msg="Could not securely connect to https://172.16.57.2: Get \"https://172.16.57.2\": tls: failed to verify certificate: x509: cannot validate certificate for 172.16.57.2 because it doesn't contain any IP SANs"

Steps to reproduce:

  • Command used to install RKE2:

    INSTALL_RKE2_ARTIFACT_PATH=$PWD sh install.sh
    

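    Both the upstream and downstream clusters were installed with this command, and both start normally.

  • Install Rancher on the upstream cluster

    helm repo add rancher-stable https://releases.rancher.com/server-charts/stable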
    helm fetch rancher-stable/rancher --version=v2.9.2
    
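  • Generate the private key with the script from this page:
    "生成自签名 SSL 证书 | Rancher文档" (Generating a self-signed SSL certificate | Rancher documentation)
    The command used was: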

    bash generate-key.sh --ssl-trusted-ip=172.16.57.2,172.16.57.9 --ssl-domain=rancher.ml.local --ssl-date=3650
    

    Then run

    cd certs
    kubectl -n cattle-system create secret tls tls-rancher-ingress \
    --cert=tls.crt \
    --key=tls.key
    
    kubectl -n cattle-system create secret generic tls-ca --from-file=cacerts.pem
    

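    The commands above import the certificates into the upstream cluster.

  • Install Rancher on the upstream cluster with the following commands. The installation succeeds and the Rancher UI is reachable.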

    cd rancher
    # RKE2 is detected as 1.31, so temporarily remove the kubeVersion requirement
    sed -i "/kubeVersion/d" Chart.yaml
    helm upgrade --install rancher --namespace cattle-system \
    --set hostname=rancher.ml.local \
    --set rancherImage=172.16.57.5:443/rancher/rancher \
    --set ingress.tls.source=secret \
    --set privateCA=true \
    --set useBundledSystemChart=true .
    
  • Copy the cluster import command from Rancher and paste it into the downstream cluster.
    The agent on the downstream cluster fails at startup with:

    time="2024-12-09T08:13:56Z" level=error msg="Could not securely connect to https://172.16.57.2: Get \"https://172.16.57.2\": tls: failed to verify certificate: x509: cannot validate certificate for 172.16.57.2 because it doesn't contain any IP SANs"
    

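Expected result:
The downstream cluster's rancher-agent starts normally and the cluster is managed by the upstream Rancher.

Actual result:
The rancher-agent in the downstream cluster fails at startup.

Logs

Startup log of the downstream cluster's rancher-agent: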

INFO: Environment: CATTLE_ADDRESS=10.42.0.50 CATTLE_CA_CHECKSUM=6afa873fddb087f5557bfe42550fde4c3a4b4848467c013268e3e12e09047998 CATTLE_CLUSTER=true CATTLE_CLUSTER_AGENT_PORT=tcp://10.43.137.240:80 CATTLE_CLUSTER_AGENT_PORT_443_TCP=tcp://10.43.137.240:443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_ADDR=10.43.137.240 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PORT=443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_PORT_80_TCP=tcp://10.43.137.240:80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_ADDR=10.43.137.240 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PORT=80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_SERVICE_HOST=10.43.137.240 CATTLE_CLUSTER_AGENT_SERVICE_PORT=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTP=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTPS_INTERNAL=443 CATTLE_CLUSTER_REGISTRY= CATTLE_INGRESS_IP_DOMAIN=sslip.io CATTLE_INSTALL_UUID=c04358c4-8b51-4b3a-88bc-2744edca0bd0 CATTLE_INTERNAL_ADDRESS= CATTLE_IS_RKE=false CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-655ff6f66-5749n CATTLE_RANCHER_PROVISIONING_CAPI_VERSION= CATTLE_RANCHER_WEBHOOK_VERSION=104.0.2+up0.5.2 CATTLE_SERVER=https://172.16.57.2 CATTLE_SERVER_VERSION=v2.9.2
INFO: Using resolv.conf: search cattle-system.svc.cluster.local svc.cluster.local cluster.local nameserver 10.43.0.10 options ndots:5
INFO: https://172.16.57.2/ping is accessible
INFO: Value from https://172.16.57.2/v3/settings/cacerts is an x509 certificate
time="2024-12-09T08:39:31Z" level=info msg="Listening on /tmp/log.sock"
time="2024-12-09T08:39:31Z" level=info msg="Rancher agent version v2.9.2 is starting"
time="2024-12-09T08:39:31Z" level=info msg="Testing connection to https://172.16.57.2 using trusted certificate authorities within: /etc/kubernetes/ssl/certs/serverca"
time="2024-12-09T08:39:31Z" level=error msg="Could not securely connect to https://172.16.57.2: Get \"https://172.16.57.2\": tls: failed to verify certificate: x509: cannot validate certificate for 172.16.57.2 because it doesn't contain any IP SANs"

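Did I do something wrong, or did I miss a step? Thanks in advance for any pointers and help!

You installed Rancher with Helm, so why is the registration address an IP?

Thanks for the reply.

Is there a restriction here? I would like the downstream cluster to register and connect by IP. Does a Helm installation only support access by domain name?

With a Helm-installed Rancher, downstream clusters can only be registered through the domain name; otherwise the certificate cannot be verified.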
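
For reference, the error in the log means the TLS verifier was given an IP address as the target and found no IP entries in the certificate's subject alternative names; DNS names in the certificate are not considered when the target is an IP. A minimal sketch for checking a certificate file against an IP or a hostname, assuming `openssl` is installed. The function name `cert_covers` and the file name `served.pem` are made up for illustration:

```shell
# Report whether a PEM certificate is valid for a given IP or DNS name.
# Only the certificate file is inspected; no network access is needed.
# Usage: cert_covers ip   172.16.57.2      certs/tls.crt
#        cert_covers host rancher.ml.local certs/tls.crt
cert_covers() {
    kind=$1
    name=$2
    pem=$3
    case "$kind" in
        ip)   flag=-checkip ;;
        host) flag=-checkhost ;;
        *)    echo "kind must be 'ip' or 'host'" >&2; return 2 ;;
    esac
    # openssl prints "... does match certificate" or "... does NOT match certificate".
    openssl x509 -in "$pem" -noout "$flag" "$name" | grep -q 'does match certificate'
}

# To check what the endpoint actually presents (needs network access), the
# served certificate can be saved first, for example:
#   openssl s_client -connect 172.16.57.2:443 </dev/null 2>/dev/null \
#       | openssl x509 -out served.pem
#   cert_covers ip 172.16.57.2 served.pem && echo covered || echo "not covered"
```

Running the check against both the generated `tls.crt` and the certificate the endpoint presents shows whether the two agree. If they differ, the endpoint is serving a different certificate for requests made by IP than the one that was generated.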

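Then if I want to use an IP and have Rancher installed in an RKE2 cluster, which installation method should I use?

That is not supported.

In that case, is a Docker installation the only option?

Yes, and the Docker installation is only suitable for testing and demos. For production, installing on Kubernetes is recommended.

I tried editing the hosts entries in the downstream cluster's CoreDNS so that rancher.ml.local points to 172.16.57.2, then restarted the agent, but it still fails.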

INFO: Environment: CATTLE_ADDRESS=10.42.0.57 CATTLE_CA_CHECKSUM=a9af61a0a8cba2f57dea7e673cb7659fc7ae29609af9f57f70991bade007e2cb CATTLE_CLUSTER=true CATTLE_CLUSTER_AGENT_PORT=tcp://10.43.137.240:80 CATTLE_CLUSTER_AGENT_PORT_443_TCP=tcp://10.43.137.240:443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_ADDR=10.43.137.240 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PORT=443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_PORT_80_TCP=tcp://10.43.137.240:80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_ADDR=10.43.137.240 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PORT=80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_SERVICE_HOST=10.43.137.240 CATTLE_CLUSTER_AGENT_SERVICE_PORT=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTP=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTPS_INTERNAL=443 CATTLE_CLUSTER_REGISTRY= CATTLE_INGRESS_IP_DOMAIN=sslip.io CATTLE_INSTALL_UUID=c04358c4-8b51-4b3a-88bc-2744edca0bd0 CATTLE_INTERNAL_ADDRESS= CATTLE_IS_RKE=false CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-5fbd78d777-r2zpm CATTLE_RANCHER_PROVISIONING_CAPI_VERSION= CATTLE_RANCHER_WEBHOOK_VERSION=104.0.2+up0.5.2 CATTLE_SERVER=https://rancher.ml.local CATTLE_SERVER_VERSION=v2.9.2
INFO: Using resolv.conf: search cattle-system.svc.cluster.local svc.cluster.local cluster.local nameserver 10.43.0.10 options ndots:5
INFO: https://rancher.ml.local/ping is accessible
INFO: rancher.ml.local resolves to 172.16.57.2
INFO: Value from https://rancher.ml.local/v3/settings/cacerts is an x509 certificate
time="2024-12-10T05:56:23Z" level=info msg="Listening on /tmp/log.sock"
time="2024-12-10T05:56:23Z" level=info msg="Rancher agent version v2.9.2 is starting"
time="2024-12-10T05:56:23Z" level=info msg="Testing connection to https://172.16.57.2 using trusted certificate authorities within: /etc/kubernetes/ssl/certs/serverca"
time="2024-12-10T05:56:23Z" level=error msg="Could not securely connect to https://172.16.57.2: Get \"https://172.16.57.2\": tls: failed to verify certificate: x509: cannot validate certificate for 172.16.57.2 because it doesn't contain any IP SANs"
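
Note that in this second log CATTLE_SERVER is already https://rancher.ml.local, yet the agent still tests https://172.16.57.2. A small sketch that pulls both values out of an agent log so the mismatch is easy to spot; the function name `agent_urls` is made up for illustration, and it assumes the log format shown above:

```shell
# Print the server URL from the agent's environment dump (CATTLE_SERVER) and
# the URL the agent actually tested. Reads agent log text on stdin, e.g.:
#   kubectl -n cattle-system logs deploy/cattle-cluster-agent | agent_urls
agent_urls() {
    awk '
        {
            # The environment dump is a list of space-separated KEY=VALUE pairs.
            for (i = 1; i <= NF; i++)
                if ($i ~ /^CATTLE_SERVER=/)
                    configured = substr($i, 15)   # drop "CATTLE_SERVER="
        }
        # The agent logs the URL it is about to verify.
        match($0, /Testing connection to https?:\/\/[^ ]+/) {
            tested = substr($0, RSTART + 22, RLENGTH - 22)   # drop the 22-char prefix
        }
        END {
            print "configured=" configured
            print "tested=" tested
        }
    '
}
```

When the two lines differ, the URL used for the connection test is coming from somewhere other than the CATTLE_SERVER environment variable on the deployment.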

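I already said you cannot register a cluster by IP.

I changed the configuration in the deployment to a domain name and added a hosts entry. That does not work either?

Whatever you changed, this part is still registering with the IP, so it is bound to fail.

I looked at the Rancher source code: when node.TokenAndURL() returns nothing, it falls back to cluster.TokenAndURL(). I had not changed the value that function reads, so registration was still using the IP address. After changing that value, everything works.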
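
The post above does not say exactly which stored value was edited. One place Rancher keeps the URL that goes into generated registration commands is the `server-url` setting, so a hedged sketch for inspecting and changing it follows. It assumes `kubectl` is pointed at the upstream (Rancher) cluster; the function names are made up for illustration:

```shell
# Show the URL Rancher currently advertises to downstream clusters.
show_server_url() {
    kubectl get settings.management.cattle.io server-url -o jsonpath='{.value}'
}

# Set it to a hostname that the Rancher certificate covers.
# Usage: set_server_url https://rancher.ml.local
set_server_url() {
    case "$1" in
        https://*) ;;
        *) echo "expected an https:// URL" >&2; return 2 ;;
    esac
    kubectl patch settings.management.cattle.io server-url \
        --type merge -p "{\"value\":\"$1\"}"
}
```

After changing the setting, a cluster that was imported with the old IP-based command would likely need the registration command copied from Rancher again and re-applied, since the previously applied manifest still carries the old URL.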

Thanks for the help.

One more off-topic question: can I deploy Rancher in a DevOps cluster, or is it better to set up a small cluster dedicated to Rancher?