RKE2 cluster offline deployment: the first server fails to start

Environment:
RKE2 version: v1.25.7+rke2r1

Node CPU architecture, operating system, and version:
CentOS 7.9, x86

Cluster configuration:
3 nodes

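Problem description:
During an offline RKE2 installation, after the first node is installed, systemctl start rke2-server.service keeps failing with the errors shown in the log below. This installation is on physical servers; earlier installations on virtual machines all succeeded on the first attempt.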

Steps to reproduce:
Installation files:
install.sh
rke2-images.linux-amd64.tar.zst
rke2.linux-amd64.tar.gz
sha256sum-amd64.txt

  • Command used to install RKE2:
    INSTALL_RKE2_TYPE="server" INSTALL_RKE2_MIRROR=cn INSTALL_RKE2_ARTIFACT_PATH=/data/rke2-artifacts sh install.sh

Expected result:
After installation completes on the first node, systemctl start rke2-server.service starts successfully, and the cluster can then be configured in /etc/rancher/rke2/config.yaml.

Actual result:
After installation completes on the first node, systemctl start rke2-server.service fails to start. Reinstalling several times made no difference.

Logs:

Jun 18 01:30:51 xxh_testplatform1 rke2: time=“2023-06-18T01:30:51+08:00” level=info msg=“Tunnel server egress proxy waiting for runtime core to become available”
Jun 18 01:30:52 xxh_testplatform1 rke2: {“level”:“warn”,“ts”:“2023-06-18T01:30:52.677+0800”,“logger”:“etcd-client”,“caller”:“v3@v3.5.4-k3s1/retry_interceptor.go:62”,“msg”:“retrying of unary invoker failed”,“target”:“etcd-endpoints://0xc0009a8700/127.0.0.1:2379”,“attempt”:0,“error”:“rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = “transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused””}
Jun 18 01:30:52 xxh_testplatform1 rke2: time=“2023-06-18T01:30:52+08:00” level=error msg=“Failed to check local etcd status for learner management: context deadline exceeded”
Jun 18 01:30:52 xxh_testplatform1 rke2: time=“2023-06-18T01:30:52+08:00” level=info msg=“Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error”
Jun 18 01:30:56 xxh_testplatform1 rke2: time=“2023-06-18T01:30:56+08:00” level=info msg=“Container for etcd not found (no matching container found), retrying”
Jun 18 01:30:56 xxh_testplatform1 rke2: time=“2023-06-18T01:30:56+08:00” level=info msg=“Tunnel server egress proxy waiting for runtime core to become available”
Jun 18 01:30:57 xxh_testplatform1 rke2: time=“2023-06-18T01:30:57+08:00” level=info msg=“Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error”
Jun 18 01:31:01 xxh_testplatform1 rke2: {“level”:“warn”,“ts”:“2023-06-18T01:31:01.195+0800”,“logger”:“etcd-client”,“caller”:“v3@v3.5.4-k3s1/retry_interceptor.go:62”,“msg”:“retrying of unary invoker failed”,“target”:“etcd-endpoints://0xc0009a8700/127.0.0.1:2379”,“attempt”:0,“error”:“rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = “transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused””}
Jun 18 01:31:01 xxh_testplatform1 rke2: time=“2023-06-18T01:31:01+08:00” level=info msg=“Failed to test data store connection: context deadline exceeded”
Jun 18 01:31:01 xxh_testplatform1 rke2: time=“2023-06-18T01:31:01+08:00” level=info msg=“Tunnel server egress proxy waiting for runtime core to become available”
Jun 18 01:31:02 xxh_testplatform1 rke2: time=“2023-06-18T01:31:02+08:00” level=info msg=“Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error”
Jun 18 01:31:06 xxh_testplatform1 rke2: time=“2023-06-18T01:31:06+08:00” level=info msg=“Waiting for etcd server to become available”
Jun 18 01:31:06 xxh_testplatform1 rke2: time=“2023-06-18T01:31:06+08:00” level=info msg=“Waiting for API server to become available”
Jun 18 01:31:06 xxh_testplatform1 rke2: {“level”:“warn”,“ts”:“2023-06-18T01:31:06.200+0800”,“logger”:“etcd-client”,“caller”:“v3@v3.5.4-k3s1/retry_interceptor.go:62”,“msg”:“retrying of unary invoker failed”,“target”:“etcd-endpoints://0xc0009a8700/127.0.0.1:2379”,“attempt”:0,“error”:“rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = “transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused””}
Jun 18 01:31:06 xxh_testplatform1 rke2: {“level”:“info”,“ts”:“2023-06-18T01:31:06.200+0800”,“logger”:“etcd-client”,“caller”:“v3@v3.5.4-k3s1/client.go:210”,“msg”:“Auto sync endpoints failed.”,“error”:“context deadline exceeded”}
Jun 18 01:31:06 xxh_testplatform1 rke2: time=“2023-06-18T01:31:06+08:00” level=info msg=“Tunnel server egress proxy waiting for runtime core to become available”
Jun 18 01:31:07 xxh_testplatform1 rke2: {“level”:“warn”,“ts”:“2023-06-18T01:31:07.678+0800”,“logger”:“etcd-client”,“caller”:“v3@v3.5.4-k3s1/retry_interceptor.go:62”,“msg”:“retrying of unary invoker failed”,“target”:“etcd-endpoints://0xc0009a8700/127.0.0.1:2379”,“attempt”:0,“error”:“rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = “transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused””}
Jun 18 01:31:07 xxh_testplatform1 rke2: time=“2023-06-18T01:31:07+08:00” level=error msg=“Failed to check local etcd status for learner management: context deadline exceeded”
Jun 18 01:31:07 xxh_testplatform1 rke2: time=“2023-06-18T01:31:07+08:00” level=info msg=“Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error”

Just reboot that node and it should be fine.

Judging from the logs, etcd did not start successfully. You can investigate using some of the troubleshooting commands listed in RKE2 commands.
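
As a rough sketch of that kind of check (not a definitive procedure), the script below looks for the etcd container and prints the tail of the kubelet and containerd logs. It assumes the default RKE2 data directory /var/lib/rancher/rke2 and the bundled crictl; the RKE2_DATA_DIR variable and the check_etcd function name are hypothetical helpers introduced here for illustration.

```shell
#!/bin/sh
# Assumption: RKE2 uses its default data directory. Override RKE2_DATA_DIR
# (a hypothetical variable for this sketch) if data-dir was changed.
RKE2_DATA_DIR="${RKE2_DATA_DIR:-/var/lib/rancher/rke2}"
CRICTL="$RKE2_DATA_DIR/bin/crictl"
export CRICTL_CONFIG_FILE="$RKE2_DATA_DIR/agent/etc/crictl.yaml"

# check_etcd: hypothetical helper that shows whether an etcd container
# was ever created and what kubelet/containerd logged recently.
check_etcd() {
    if [ ! -x "$CRICTL" ]; then
        echo "crictl not found at $CRICTL; is RKE2 installed?"
        return 1
    fi
    # List all containers (including exited ones) whose name matches etcd.
    "$CRICTL" ps -a --name etcd
    # If no etcd container exists, the kubelet and containerd logs
    # usually explain why the static pod was not started.
    tail -n 50 "$RKE2_DATA_DIR/agent/logs/kubelet.log"
    tail -n 50 "$RKE2_DATA_DIR/agent/containerd/containerd.log"
}

# Run the check; do not abort the shell if RKE2 is not installed here.
check_etcd || true
```

Run it as root on the failing server. The service log itself can be followed separately with journalctl -u rke2-server -f. In the log above, "Container for etcd not found" suggests the etcd container was never created, so the kubelet and containerd logs are likely the more informative place to look.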