Installing RKE2 on Ubuntu 24.04: etcd connection refused

Environment information:
RKE2 version: v1.34.4+rke2r1

Node CPU architecture, OS and version:
Linux xxx 6.8.0-101-generic #101-Ubuntu SMP PREEMPT_DYNAMIC Mon Feb 9 10:15:05 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux

Issue description:
After installing and starting the service, the logs report that the runtime image cannot be found locally and that the etcd connection is refused.

Steps to reproduce:

  • RKE2 install command:
    vim /etc/rancher/rke2/config.yaml
    token: canwin-sdc
    node-name: k8s-master01
    tls-san: 10.xxx.xxx.xx
    system-default-registry: "docker.m.daocloud.io"
    kube-proxy-arg:
      - proxy-mode=ipvs
      - ipvs-strict-arp=true

curl -sfL https://rancher-mirror.rancher.cn/rke2/install.sh | INSTALL_RKE2_MIRROR=cn sh -
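For reference, the install script also honours INSTALL_RKE2_VERSION if you need to pin the exact version from this report (assuming the cn mirror carries that tag):

curl -sfL https://rancher-mirror.rancher.cn/rke2/install.sh | INSTALL_RKE2_MIRROR=cn INSTALL_RKE2_VERSION=v1.34.4+rke2r1 sh -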

systemctl enable --now rke2-server
Actual result:
journalctl -xeu rke2-server --no-pager

Logs:

Mar 03 10:08:59 canwinsdc1 systemd[1]: Starting rke2-server.service - Rancher Kubernetes Engine v2 (server)…
░░ Subject: A start job for unit rke2-server.service has begun execution
░░ Defined-By: systemd
░░ Support: Enterprise open source support | Ubuntu
░░
░░ A start job for unit rke2-server.service has begun execution.
░░
░░ The job identifier is 6470255.
Mar 03 10:08:59 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:08:59+08:00” level=warning msg=“not running in CIS mode”
Mar 03 10:08:59 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:08:59+08:00” level=info msg=“Applying Pod Security Admission Configuration”
Mar 03 10:08:59 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:08:59+08:00” level=info msg=“Starting rke2 v1.34.4+rke2r1 (c6b97dc03cefec17e8454a6f45b29f4e3d0a81d6)”
Mar 03 10:08:59 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:08:59+08:00” level=info msg=“Managed etcd cluster initializing”
Mar 03 10:09:01 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:01+08:00” level=info msg=“Password verified locally for node canwinsdc1”
Mar 03 10:09:01 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:01+08:00” level=info msg=“certificate CN=canwinsdc1 signed by CN=rke2-server-ca@1772502570: notBefore=2026-03-03 01:49:30 +0000 UTC notAfter=2027-03-03 02:09:01 +0000 UTC”
Mar 03 10:09:01 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:01+08:00” level=info msg=“certificate CN=system:node:canwinsdc1,O=system:nodes signed by CN=rke2-client-ca@1772502570: notBefore=2026-03-03 01:49:30 +0000 UTC notAfter=2027-03-03 02:09:01 +0000 UTC”
Mar 03 10:09:01 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:01+08:00” level=info msg=“certificate CN=system:kube-proxy signed by CN=rke2-client-ca@1772502570: notBefore=2026-03-03 01:49:30 +0000 UTC notAfter=2027-03-03 02:09:01 +0000 UTC”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“certificate CN=system:rke2-controller signed by CN=rke2-client-ca@1772502570: notBefore=2026-03-03 01:49:30 +0000 UTC notAfter=2027-03-03 02:09:02 +0000 UTC”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Module overlay was already loaded”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Module nf_conntrack was already loaded”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Module br_netfilter was already loaded”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Module iptable_nat was already loaded”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Module iptable_filter was already loaded”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=warning msg=“Failed to load kernel module nft-expr-counter with modprobe”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Removed kube-proxy static pod manifest”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Checking local image archives in /var/lib/rancher/rke2/agent/images for index.docker.io/rancher/rke2-runtime:v1.34.4-rke2r1”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=warning msg=“Failed to load runtime image index.docker.io/rancher/rke2-runtime:v1.34.4-rke2r1 from tarball: no local image available for index.docker.io/rancher/rke2-runtime:v1.34.4-rke2r1: not found in any file in /var/lib/rancher/rke2/agent/images: image not found”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Checking local image archives in /var/lib/rancher/rke2/agent/images for index.docker.io/rancher/rke2-runtime:v1.34.4-rke2r1”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Found crun container runtime at /usr/bin/crun”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=warning msg=“Failed to load runtime image index.docker.io/rancher/rke2-runtime:v1.34.4-rke2r1 from tarball: no local image available for index.docker.io/rancher/rke2-runtime:v1.34.4-rke2r1: not found in any file in /var/lib/rancher/rke2/agent/images: image not found”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Pulling runtime image index.docker.io/rancher/rke2-runtime:v1.34.4-rke2r1”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Waiting for cri connection: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory"”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Connecting to proxy” url=“wss://127.0.0.1:9345/v1-rke2/connect”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Creating rke2-cert-monitor event broadcaster”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Starting etcd for new cluster, cluster-reset=false”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=start
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“schedule, now=2026-03-03T10:09:02+08:00, entry=1, next=2026-03-03T12:00:00+08:00”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Failed to test etcd connection: failed to get etcd status: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Running kube-apiserver --advertise-port=6443 --allow-privileged=true --anonymous-auth=false --api-audiences=https://kubernetes.default.svc.cluster.local,rke2 --authorization-mode=Node,RBAC --bind-address=0.0.0.0 --cert-dir=/var/lib/rancher/rke2/server/tls/temporary-certs --client-ca-file=/var/lib/rancher/rke2/server/tls/client-ca.crt --egress-selector-config-file=/var/lib/rancher/rke2/server/etc/egress-selector-config.yaml --enable-admission-plugins=NodeRestriction --enable-aggregator-routing=true --enable-bootstrap-token-auth=true --encryption-provider-config=/var/lib/rancher/rke2/server/cred/encryption-config.json --encryption-provider-config-automatic-reload=true --etcd-cafile=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --etcd-certfile=/var/lib/rancher/rke2/server/tls/etcd/client.crt --etcd-keyfile=/var/lib/rancher/rke2/server/tls/etcd/client.key --etcd-servers=https://127.0.0.1:2379 --kubelet-certificate-authority=/var/lib/rancher/rke2/server/tls/server-ca.crt --kubelet-client-certificate=/var/lib/rancher/rke2/server/tls/client-kube-apiserver.crt --kubelet-client-key=/var/lib/rancher/rke2/server/tls/client-kube-apiserver.key --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --profiling=false --proxy-client-cert-file=/var/lib/rancher/rke2/server/tls/client-auth-proxy.crt --proxy-client-key-file=/var/lib/rancher/rke2/server/tls/client-auth-proxy.key --requestheader-allowed-names=system:auth-proxy --requestheader-client-ca-file=/var/lib/rancher/rke2/server/tls/request-header-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6443 --service-account-issuer=https://kubernetes.default.svc.cluster.local --service-account-key-file=/var/lib/rancher/rke2/server/tls/service.key --service-account-signing-key-file=/var/lib/rancher/rke2/server/tls/service.current.key --service-cluster-ip-range=10.43.0.0/16 --service-node-port-range=30000-32767 --storage-backend=etcd3 --tls-cert-file=/var/lib/rancher/rke2/server/tls/serving-kube-apiserver.crt --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305 --tls-private-key-file=/var/lib/rancher/rke2/server/tls/serving-kube-apiserver.key”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Removed kube-apiserver static pod manifest”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Running kube-scheduler --authentication-kubeconfig=/var/lib/rancher/rke2/server/cred/scheduler.kubeconfig --authorization-kubeconfig=/var/lib/rancher/rke2/server/cred/scheduler.kubeconfig --bind-address=127.0.0.1 --kubeconfig=/var/lib/rancher/rke2/server/cred/scheduler.kubeconfig --profiling=false --secure-port=10259 --tls-cert-file=/var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt --tls-private-key-file=/var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.key”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Running kube-controller-manager --allocate-node-cidrs=true --authentication-kubeconfig=/var/lib/rancher/rke2/server/cred/controller.kubeconfig --authorization-kubeconfig=/var/lib/rancher/rke2/server/cred/controller.kubeconfig --bind-address=127.0.0.1 --cluster-cidr=10.42.0.0/16 --cluster-signing-kube-apiserver-client-cert-file=/var/lib/rancher/rke2/server/tls/client-ca.nochain.crt --cluster-signing-kube-apiserver-client-key-file=/var/lib/rancher/rke2/server/tls/client-ca.key --cluster-signing-kubelet-client-cert-file=/var/lib/rancher/rke2/server/tls/client-ca.nochain.crt --cluster-signing-kubelet-client-key-file=/var/lib/rancher/rke2/server/tls/client-ca.key --cluster-signing-kubelet-serving-cert-file=/var/lib/rancher/rke2/server/tls/server-ca.nochain.crt --cluster-signing-kubelet-serving-key-file=/var/lib/rancher/rke2/server/tls/server-ca.key --cluster-signing-legacy-unknown-cert-file=/var/lib/rancher/rke2/server/tls/server-ca.nochain.crt --cluster-signing-legacy-unknown-key-file=/var/lib/rancher/rke2/server/tls/server-ca.key --configure-cloud-routes=false --controllers=*,tokencleaner,-service,-route,-cloud-node-lifecycle --kubeconfig=/var/lib/rancher/rke2/server/cred/controller.kubeconfig --profiling=false --root-ca-file=/var/lib/rancher/rke2/server/tls/server-ca.crt --secure-port=10257 --service-account-private-key-file=/var/lib/rancher/rke2/server/tls/service.current.key --service-cluster-ip-range=10.43.0.0/16 --tls-cert-file=/var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt --tls-private-key-file=/var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.key --use-service-account-credentials=true”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Running cloud-controller-manager --allocate-node-cidrs=true --authentication-kubeconfig=/var/lib/rancher/rke2/server/cred/cloud-controller.kubeconfig --authorization-kubeconfig=/var/lib/rancher/rke2/server/cred/cloud-controller.kubeconfig --bind-address=127.0.0.1 --cloud-config=/var/lib/rancher/rke2/server/etc/cloud-config.yaml --cloud-provider=rke2 --cluster-cidr=10.42.0.0/16 --configure-cloud-routes=false --controllers=*,-route,-service --kubeconfig=/var/lib/rancher/rke2/server/cred/cloud-controller.kubeconfig --leader-elect-resource-name=rke2-cloud-controller-manager --node-status-update-frequency=1m0s --profiling=false”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Server node token is available at /var/lib/rancher/rke2/server/token”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Handling backend connection request [canwinsdc1]”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Connected to proxy” url=“wss://127.0.0.1:9345/v1-rke2/connect”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Remotedialer connected to proxy” url=“wss://127.0.0.1:9345/v1-rke2/connect”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“To join server node to cluster: rke2 server -s https://10.16.228.11:9345 -t {SERVER_NODE_TOKEN}”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Agent node token is available at /var/lib/rancher/rke2/server/agent-token”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“To join agent node to cluster: rke2 agent -s https://10.16.228.11:9345 -t {AGENT_NODE_TOKEN}”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Wrote kubeconfig /etc/rancher/rke2/rke2.yaml”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Run: rke2 kubectl”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=error msg=“Sending HTTP/1.1 503 response to 127.0.0.1:54312: runtime core not ready”
Mar 03 10:09:02 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:02+08:00” level=info msg=“Running kube-proxy --cluster-cidr=10.42.0.0/16 --conntrack-max-per-core=0 --conntrack-tcp-timeout-close-wait=0s --conntrack-tcp-timeout-established=0s --healthz-bind-address=127.0.0.1 --hostname-override=canwinsdc1 --kubeconfig=/var/lib/rancher/rke2/agent/kubeproxy.kubeconfig --proxy-mode=iptables”
Mar 03 10:09:07 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:07+08:00” level=info msg=“Failed to test etcd connection: failed to get etcd status: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"”
Mar 03 10:09:12 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:12+08:00” level=info msg=“Failed to test etcd connection: failed to get etcd status: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"”
Mar 03 10:09:17 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:17+08:00” level=info msg=“Failed to test etcd connection: failed to get etcd status: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"”
Mar 03 10:09:22 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:22+08:00” level=info msg=“Waiting for cri connection: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory"”
Mar 03 10:09:22 canwinsdc1 rke2[1007358]: time=“2026-03-03T10:09:22+08:00” level=info msg=“Failed to test etcd connection: failed to get etcd status: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"”

The etcd connection issue has since been resolved: the cause was a previous RKE2 installation that had not been uninstalled cleanly. After uninstalling it again and reinstalling, the service started normally.
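For reference, a full cleanup of a server node can be done with the uninstall script that ships with RKE2 (the path below assumes the default script/tarball install; it removes /etc/rancher/rke2 and /var/lib/rancher/rke2, so back up anything you still need first):

/usr/local/bin/rke2-uninstall.sh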


However, a new problem came up next.
While installing the second node, the CNI network plugin image failed to pull.

vim /etc/rancher/rke2/config.yaml

server: https://10.16.228.11:9345
token: canwin-sdc
node-name: k8s-master02
tls-san: 10.16.228.12
system-default-registry: "docker.m.daocloud.io"
kube-proxy-arg:
  - proxy-mode=ipvs
  - ipvs-strict-arp=true

kubectl get nodes
NAME           STATUS     ROLES                AGE     VERSION
k8s-master01   Ready      control-plane,etcd   45m     v1.34.4+rke2r1
k8s-master02   NotReady   control-plane,etcd   8m19s   v1.34.4+rke2r1

kubectl get pods -n kube-system -o wide

kubectl describe pod rke2-canal-fhtsz -n kube-system

The describe output says the image cannot be pulled from https://docker.m.daocloud.io/v2/rancher/hardened-calico/manifests/v3.31.3-build20260206
However, pulling the same image manually with docker works fine.
There is also a strange symptom: once rke2 has started and the CNI plugin image pull has failed, the host can no longer reach the public network at all; connectivity only comes back after stopping the rke2-server service.
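When the host loses outbound connectivity like this, it can help to snapshot the routing, NAT and DNS state while rke2 is running and compare it with the state after stopping the service (a diagnostic sketch; the grep patterns are only examples):

ip route
ip rule
iptables-save | grep -Ei 'kube|cali|flannel' | head -50
resolvectl status   # check whether DNS, rather than routing, is what actually broke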

From what I read, I suspected the NetworkManager service, so I simply stopped NetworkManager and restarted rke2, but the problem was not resolved.

You should test the pull with the crictl bundled with rke2; see: RKE2 commands
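For example (paths assume a default RKE2 install, and the image reference is the mirror-prefixed hardened-calico tag from the error above):

export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl pull docker.m.daocloud.io/rancher/hardened-calico:v3.31.3-build20260206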


It indeed fails to pull, matching the error reported by the pod.
The current situation is that once rke2 is started, the host cannot reach the public network, which in turn is why the image cannot be pulled.
Once rke2 is stopped, host networking recovers, but with rke2 stopped crictl can no longer pull images either...

Then try removing those settings and see whether they are what is causing it.

Further troubleshooting pointed to a possible NetworkManager issue (see the CSDN post “rancher RKE2 NetworkManager 限制说明及修改_rke2 canal” on RKE2's NetworkManager requirements), so I added the following configuration:

vim /etc/NetworkManager/conf.d/rke2-canal.conf

[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:flannel*

systemctl reload NetworkManager
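If the exclusion took effect, the Calico/Flannel interfaces should then show up as unmanaged (an assumed quick check with nmcli):

nmcli device status | grep -E 'cali|flannel'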

Reloading the service had no effect, so I tried rebooting servers node1 and node2. After the reboot, rke2 on node2 started normally and the node had normal network connectivity again.

But the same steps had no effect on node3, so I tried manually importing the image that could not be pulled into containerd:

ctr --address /run/k3s/containerd/containerd.sock -n k8s.io images import ./hardened-calico.tar
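For completeness, such a tarball can be exported from a node that already has the image, e.g. (the image reference below is assumed to be the mirror-prefixed tag from the pull error above):

ctr --address /run/k3s/containerd/containerd.sock -n k8s.io images export ./hardened-calico.tar docker.m.daocloud.io/rancher/hardened-calico:v3.31.3-build20260206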

After restarting rke2, that image no longer reported a pull failure, but other images then failed to pull, and rke2 on that node still would not start.
I later rebooted servers node1, node2 and node3 at the same time, and after a while node3 (sdc3) was back to normal.

The behaviour above is still puzzling, but fortunately the cluster is now running normally.