Environment information:
RKE2 version:
rke2 version v1.28.10+rke2r1 (b0d0d687d98f4fa015e7b30aaf2807b50edcc5d7)
go version go1.21.9 X:boringcrypto
Node CPU architecture, operating system, and version:
Linux k8s-2-gpu 5.15.0-107-generic #117~20.04.1-Ubuntu SMP Tue Apr 30 10:35:57 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Cluster configuration:
5 nodes, all running as server + agent
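Problem description:
In an intranet-only (air-gapped) environment, one host could not rejoin etcd after a power loss and reboot (possibly related to the host having two network interfaces?).
I then ran rke2-uninstall.sh on that host and deleted the node from the Kubernetes cluster. After reinstalling, rke2-server does not start properly and the node cannot rejoin the cluster.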
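Since the dual-NIC setup is suspected, one quick sanity check is to see which interface actually carries the advertised address. The following is only a sketch in POSIX shell; the helper names `list_ipv4` and `has_ipv4` are made up for illustration, and they simply parse the one-line-per-address output of `ip -o -4 addr show`.

```shell
# list_ipv4: read `ip -o -4 addr show` output on stdin and print
# "<interface> <address>" for each entry, without the prefix length.
list_ipv4() {
    awk '$3 == "inet" { split($4, a, "/"); print $2, a[1] }'
}

# has_ipv4 ADDRESS: succeed if ADDRESS is among the addresses read from stdin.
has_ipv4() {
    list_ipv4 | awk -v want="$1" '$2 == want { found = 1 } END { exit !found }'
}

# Usage on the affected node (10.0.13.62 is the advertise-address from this report):
#   ip -o -4 addr show | list_ipv4
#   ip -o -4 addr show | has_ipv4 10.0.13.62 && echo "address present"
```

If the address is present but the node still registers with the other interface's address, the RKE2 `node-ip` option may be worth looking at; whether that applies to this cluster is not confirmed here.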
Steps to reproduce:
- Command used to install RKE2:
RKE2 was then reinstalled using the offline (air-gap) method:
root@k8s-2-gpu:~/rke2-artifacts# INSTALL_RKE2_ARTIFACT_PATH=/root/rke2-artifacts sh install.sh
[INFO] staging local checksums from /root/rke2-artifacts/sha256sum-amd64.txt
[INFO] staging zst airgap image tarball from /root/rke2-artifacts/rke2-images.linux-amd64.tar.zst
[INFO] staging tarball from /root/rke2-artifacts/rke2.linux-amd64.tar.gz
[INFO] verifying airgap tarball
[INFO] installing airgap tarball to /var/lib/rancher/rke2/agent/images
[INFO] Installing airgap image from /root/rke2-artifacts/rke2-images-all.linux-amd64.txt
[INFO] verifying tarball
[INFO] unpacking tarball file to /usr/local
Then I ran systemctl start rke2-server.
The service fails to start:
root@k8s-2-gpu:~/rke2-artifacts# systemctl start rke2-server
# It hangs here for a long time, then reports an error
Job for rke2-server.service failed because the control process exited with error code.
See "systemctl status rke2-server.service" and "journalctl -xe" for details.
# rke2-server status
root@k8s-2-gpu:~/rke2-artifacts# systemctl status rke2-server
● rke2-server.service - Rancher Kubernetes Engine v2 (server)
Loaded: loaded (/usr/local/lib/systemd/system/rke2-server.service; enabled; vendor preset: enabled)
Active: activating (start) since Thu 2024-11-28 15:33:55 CST; 11min ago
Docs: https://github.com/rancher/rke2#readme
Process: 2854853 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
Process: 2854855 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
Process: 2854856 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Main PID: 2854857 (rke2)
Tasks: 61
Memory: 47.1M
CGroup: /system.slice/rke2-server.service
├─2854857 /usr/local/bin/rke2 server
└─2854952 containerd -c /var/lib/rancher/rke2/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/ranch>
11月 28 15:44:55 k8s-2-gpu rke2[2854857]: time="2024-11-28T15:44:55+08:00" level=info msg="Failed to test data store connection: context deadline exceeded"
11月 28 15:44:56 k8s-2-gpu rke2[2854857]: time="2024-11-28T15:44:56+08:00" level=info msg="Waiting for etcd server to become available"
11月 28 15:44:56 k8s-2-gpu rke2[2854857]: time="2024-11-28T15:44:56+08:00" level=info msg="Waiting for API server to become available"
11月 28 15:44:59 k8s-2-gpu rke2[2854857]: time="2024-11-28T15:44:59+08:00" level=info msg="Pulling image docker.io/rancher/mirrored-calico-node:v3.27.3"
11月 28 15:45:25 k8s-2-gpu rke2[2854857]: time="2024-11-28T15:45:25+08:00" level=info msg="Waiting for container runtime to become ready before joining etcd cluster"
11月 28 15:45:26 k8s-2-gpu rke2[2854857]: time="2024-11-28T15:45:26+08:00" level=info msg="Waiting for API server to become available"
11月 28 15:45:26 k8s-2-gpu rke2[2854857]: time="2024-11-28T15:45:26+08:00" level=info msg="Waiting for etcd server to become available"
11月 28 15:45:29 k8s-2-gpu rke2[2854857]: time="2024-11-28T15:45:29+08:00" level=info msg="Pulling image docker.io/rancher/mirrored-calico-operator:v1.32.7"
11月 28 15:45:30 k8s-2-gpu rke2[2854857]: {"level":"warn","ts":"2024-11-28T15:45:30.872999+0800","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"ret>
11月 28 15:45:30 k8s-2-gpu rke2[2854857]: time="2024-11-28T15:45:30+08:00" level=info msg="Failed to test data store connection: context deadline exceeded"
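The `systemctl status` output above truncates long lines (the trailing `>`), so the etcd client warning is cut off. A small filter over the full journal can make the relevant lines easier to spot. This is a sketch only; `etcd_related` is a hypothetical helper name and the keyword list is a guess at what matters for this failure.

```shell
# etcd_related: keep only journal lines that mention etcd or the datastore,
# plus anything logged at warning level or above.
etcd_related() {
    grep -E 'etcd|data store|datastore|level=(warning|error|fatal)'
}

# Usage on the affected node (prints untruncated lines, unlike `systemctl status`):
#   journalctl -u rke2-server --no-pager --since "2024-11-28 15:33" | etcd_related
```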
Configuration in /etc/rancher/rke2:
bind-address: 10.0.13.62
advertise-address: 10.0.13.62
server: https://10.0.13.61:9345
write-kubeconfig-mode: "0644"
# system-default-registry: "10.0.13.65:5443"
token: *********
tls-san:
- "******"
debug: false
etcd-expose-metrics: true
etcd-disable-snapshots: false
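Because the join depends on the `server:` URL above, it may help to confirm that the supervisor on the existing server answers from this node. The sketch below only extracts the URL; `server_url` is a hypothetical helper, the path /etc/rancher/rke2/config.yaml is assumed to be where this configuration lives (the RKE2 default), and the `/cacerts` request in the usage comment is an assumption about what the supervisor port serves without authentication.

```shell
# server_url: print the value of the top-level `server:` key from a
# config.yaml supplied on stdin (surrounding quotes and trailing spaces removed).
server_url() {
    sed -n 's/^server:[[:space:]]*//p' |
        sed 's/[[:space:]]*$//; s/^"//; s/"$//' |
        head -n 1
}

# Usage on the affected node (assumed config path; -k skips TLS verification):
#   url=$(server_url < /etc/rancher/rke2/config.yaml)
#   curl -ks --max-time 5 -o /dev/null -w '%{http_code}\n' "$url/cacerts"
```

A timeout or connection error here would point at networking or routing between the two hosts rather than at etcd itself.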
registries.yaml
mirrors:
  docker.io:
    endpoint:
      - "https://10.0.13.65:5443"
  "10.0.13.65:5000":
    endpoint:
      - "http://10.0.13.65:5000"
  "10.0.13.65:5001":
    endpoint:
      - "http://10.0.13.65:5001"
configs:
  "10.0.13.65:5443":
    auth:
      username: *******
      password: *******
    tls:
      insecure_skip_verify: true
  "10.0.13.65:5000":
    auth:
      username: *******
      password: *******
    tls:
      insecure_skip_verify: true
  "10.0.13.65:5001":
    auth:
      username: *******
      password: *******
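The status output shows image pulls for docker.io images taking roughly 30 seconds each, so it could be worth checking that the mirror endpoints listed in registries.yaml respond from this node. The following is a rough sketch; `mirror_endpoints` is a hypothetical helper that just collects the URLs, and the probe in the usage comment relies on `/v2/` being the base path of the standard registry HTTP API.

```shell
# mirror_endpoints: print each unique http(s) endpoint URL found in a
# registries.yaml supplied on stdin.
mirror_endpoints() {
    grep -oE 'https?://[^"[:space:]]+' | LC_ALL=C sort -u
}

# Usage on the affected node; a reachable registry typically answers /v2/
# with 200 or 401, while 000 means curl could not connect:
#   mirror_endpoints < /etc/rancher/rke2/registries.yaml | while read -r url; do
#       printf '%s ' "$url"
#       curl -ks -o /dev/null -w '%{http_code}\n' --max-time 5 "$url/v2/"
#   done
```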
Expected result:
The server starts normally and joins the cluster.
Actual result:
It hangs and cannot join.
Logs
-- Logs begin at Thu 2024-11-28 15:48:04 CST, end at Thu 2024-11-28 15:59:35 CST. --
11月 28 15:48:20 k8s-2-gpu systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
11月 28 15:48:20 k8s-2-gpu sh[2876158]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
11月 28 15:48:20 k8s-2-gpu sh[2876159]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
11月 28 15:48:21 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:21+08:00" level=warning msg="not running in CIS mode"
11月 28 15:48:21 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:21+08:00" level=info msg="Applying Pod Security Admission Configuration"
11月 28 15:48:21 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:21+08:00" level=info msg="Starting rke2 v1.28.10+rke2r1 (b0d0d687d98f4fa015e7b30aaf2807b50edcc5d7)"
11月 28 15:48:21 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:21+08:00" level=info msg="Managed etcd cluster not yet initialized"
11月 28 15:48:21 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:21+08:00" level=info msg="Reconciling bootstrap data between datastore and disk"
11月 28 15:48:21 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:21+08:00" level=info msg=start
11月 28 15:48:21 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:21+08:00" level=info msg="schedule, now=2024-11-28T15:48:21+08:00, entry=1, next=2024-11-29T00:00:00+08:00"
11月 28 15:48:21 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:21+08:00" level=info msg="Running kube-apiserver --advertise-address=10.0.13.62 --advertise-port=6443 --allow-privileged=true --anonymous-auth=false --api-audiences=https://kubernetes.default.svc.cluster.local,rke2 --authorization-mode=Node,RBAC --bind-address=0.0.0.0 --cert-dir=/var/lib/rancher/rke2/server/tls/temporary-certs --client-ca-file=/var/lib/rancher/rke2/server/tls/client-ca.crt --egress-selector-config-file=/var/lib/rancher/rke2/server/etc/egress-selector-config.yaml --enable-admission-plugins=NodeRestriction --enable-aggregator-routing=true --enable-bootstrap-token-auth=true --encryption-provider-config=/var/lib/rancher/rke2/server/cred/encryption-config.json --encryption-provider-config-automatic-reload=true --etcd-cafile=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --etcd-certfile=/var/lib/rancher/rke2/server/tls/etcd/client.crt --etcd-keyfile=/var/lib/rancher/rke2/server/tls/etcd/client.key --etcd-servers=https://127.0.0.1:2379 --kubelet-certificate-authority=/var/lib/rancher/rke2/server/tls/server-ca.crt --kubelet-client-certificate=/var/lib/rancher/rke2/server/tls/client-kube-apiserver.crt --kubelet-client-key=/var/lib/rancher/rke2/server/tls/client-kube-apiserver.key --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --profiling=false --proxy-client-cert-file=/var/lib/rancher/rke2/server/tls/client-auth-proxy.crt --proxy-client-key-file=/var/lib/rancher/rke2/server/tls/client-auth-proxy.key --requestheader-allowed-names=system:auth-proxy --requestheader-client-ca-file=/var/lib/rancher/rke2/server/tls/request-header-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6443 --service-account-issuer=https://kubernetes.default.svc.cluster.local --service-account-key-file=/var/lib/rancher/rke2/server/tls/service.key --service-account-signing-key-file=/var/lib/rancher/rke2/server/tls/service.current.key --service-cluster-ip-range=10.43.0.0/16 --service-node-port-range=30000-32767 --storage-backend=etcd3 --tls-cert-file=/var/lib/rancher/rke2/server/tls/serving-kube-apiserver.crt --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305 --tls-private-key-file=/var/lib/rancher/rke2/server/tls/serving-kube-apiserver.key"
11月 28 15:48:21 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:21+08:00" level=info msg="Removed kube-apiserver static pod manifest"
11月 28 15:48:21 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:21+08:00" level=info msg="Running kube-scheduler --authentication-kubeconfig=/var/lib/rancher/rke2/server/cred/scheduler.kubeconfig --authorization-kubeconfig=/var/lib/rancher/rke2/server/cred/scheduler.kubeconfig --bind-address=127.0.0.1 --kubeconfig=/var/lib/rancher/rke2/server/cred/scheduler.kubeconfig --profiling=false --secure-port=10259"
11月 28 15:48:21 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:21+08:00" level=info msg="Running kube-controller-manager --allocate-node-cidrs=true --authentication-kubeconfig=/var/lib/rancher/rke2/server/cred/controller.kubeconfig --authorization-kubeconfig=/var/lib/rancher/rke2/server/cred/controller.kubeconfig --bind-address=127.0.0.1 --cluster-cidr=10.42.0.0/16 --cluster-signing-kube-apiserver-client-cert-file=/var/lib/rancher/rke2/server/tls/client-ca.nochain.crt --cluster-signing-kube-apiserver-client-key-file=/var/lib/rancher/rke2/server/tls/client-ca.key --cluster-signing-kubelet-client-cert-file=/var/lib/rancher/rke2/server/tls/client-ca.nochain.crt --cluster-signing-kubelet-client-key-file=/var/lib/rancher/rke2/server/tls/client-ca.key --cluster-signing-kubelet-serving-cert-file=/var/lib/rancher/rke2/server/tls/server-ca.nochain.crt --cluster-signing-kubelet-serving-key-file=/var/lib/rancher/rke2/server/tls/server-ca.key --cluster-signing-legacy-unknown-cert-file=/var/lib/rancher/rke2/server/tls/server-ca.nochain.crt --cluster-signing-legacy-unknown-key-file=/var/lib/rancher/rke2/server/tls/server-ca.key --configure-cloud-routes=false --controllers=*,tokencleaner,-service,-route,-cloud-node-lifecycle --kubeconfig=/var/lib/rancher/rke2/server/cred/controller.kubeconfig --profiling=false --root-ca-file=/var/lib/rancher/rke2/server/tls/server-ca.crt --secure-port=10257 --service-account-private-key-file=/var/lib/rancher/rke2/server/tls/service.current.key --service-cluster-ip-range=10.43.0.0/16 --use-service-account-credentials=true"
11月 28 15:48:21 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:21+08:00" level=info msg="Running cloud-controller-manager --allocate-node-cidrs=true --authentication-kubeconfig=/var/lib/rancher/rke2/server/cred/cloud-controller.kubeconfig --authorization-kubeconfig=/var/lib/rancher/rke2/server/cred/cloud-controller.kubeconfig --bind-address=127.0.0.1 --cloud-config=/var/lib/rancher/rke2/server/etc/cloud-config.yaml --cloud-provider=rke2 --cluster-cidr=10.42.0.0/16 --configure-cloud-routes=false --controllers=*,-route,-service --feature-gates=CloudDualStackNodeIPs=true --kubeconfig=/var/lib/rancher/rke2/server/cred/cloud-controller.kubeconfig --leader-elect-resource-name=rke2-cloud-controller-manager --node-status-update-frequency=1m0s --profiling=false"
11月 28 15:48:21 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:21+08:00" level=info msg="Server node token is available at /var/lib/rancher/rke2/server/token"
11月 28 15:48:21 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:21+08:00" level=info msg="To join server node to cluster: rke2 server -s https://10.0.13.62:9345 -t ${SERVER_NODE_TOKEN}"
11月 28 15:48:21 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:21+08:00" level=info msg="Agent node token is available at /var/lib/rancher/rke2/server/agent-token"
11月 28 15:48:21 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:21+08:00" level=info msg="To join agent node to cluster: rke2 agent -s https://10.0.13.62:9345 -t ${AGENT_NODE_TOKEN}"
11月 28 15:48:21 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:21+08:00" level=info msg="Waiting for cri connection: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/k3s/containerd/containerd.sock: connect: connection refused\""
11月 28 15:48:21 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:21+08:00" level=info msg="Wrote kubeconfig /etc/rancher/rke2/rke2.yaml"
11月 28 15:48:21 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:21+08:00" level=info msg="Run: rke2 kubectl"
11月 28 15:48:22 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:22+08:00" level=info msg="Password verified locally for node k8s-2-gpu"
11月 28 15:48:22 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:22+08:00" level=info msg="certificate CN=k8s-2-gpu signed by CN=rke2-server-ca@1719194657: notBefore=2024-06-24 02:04:17 +0000 UTC notAfter=2025-11-28 07:48:22 +0000 UTC"
11月 28 15:48:22 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:22+08:00" level=info msg="certificate CN=system:node:k8s-2-gpu,O=system:nodes signed by CN=rke2-client-ca@1719194657: notBefore=2024-06-24 02:04:17 +0000 UTC notAfter=2025-11-28 07:48:22 +0000 UTC"
11月 28 15:48:22 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:22+08:00" level=info msg="Using private registry config file at /etc/rancher/rke2/registries.yaml"
11月 28 15:48:22 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:22+08:00" level=info msg="Module overlay was already loaded"
11月 28 15:48:22 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:22+08:00" level=info msg="Module nf_conntrack was already loaded"
11月 28 15:48:22 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:22+08:00" level=info msg="Module br_netfilter was already loaded"
11月 28 15:48:22 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:22+08:00" level=info msg="Module iptable_nat was already loaded"
11月 28 15:48:22 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:22+08:00" level=info msg="Module iptable_filter was already loaded"
11月 28 15:48:22 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:22+08:00" level=info msg="Runtime image index.docker.io/rancher/rke2-runtime:v1.28.10-rke2r1 bin and charts directories already exist; skipping extract"
11月 28 15:48:22 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:22+08:00" level=info msg="Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-calico-crd.yaml to set cluster configuration values"
11月 28 15:48:22 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:22+08:00" level=info msg="Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-cilium.yaml to set cluster configuration values"
11月 28 15:48:22 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:22+08:00" level=info msg="Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-flannel.yaml to set cluster configuration values"
11月 28 15:48:22 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:22+08:00" level=info msg="Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-snapshot-validation-webhook.yaml to set cluster configuration values"
11月 28 15:48:22 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:22+08:00" level=info msg="Updated manifest /var/lib/rancher/rke2/server/manifests/harvester-cloud-provider.yaml to set cluster configuration values"
11月 28 15:48:22 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:22+08:00" level=info msg="Updated manifest /var/lib/rancher/rke2/server/manifests/harvester-csi-driver.yaml to set cluster configuration values"
11月 28 15:48:22 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:22+08:00" level=info msg="Updated manifest /var/lib/rancher/rke2/server/manifests/rancher-vsphere-csi.yaml to set cluster configuration values"
11月 28 15:48:23 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:23+08:00" level=info msg="Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-ingress-nginx.yaml to set cluster configuration values"
11月 28 15:48:23 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:23+08:00" level=info msg="Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-multus.yaml to set cluster configuration values"
11月 28 15:48:23 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:23+08:00" level=info msg="Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-snapshot-controller.yaml to set cluster configuration values"
11月 28 15:48:23 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:23+08:00" level=info msg="Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-snapshot-controller-crd.yaml to set cluster configuration values"
11月 28 15:48:23 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:23+08:00" level=info msg="Updated manifest /var/lib/rancher/rke2/server/manifests/rancher-vsphere-cpi.yaml to set cluster configuration values"
11月 28 15:48:23 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:23+08:00" level=info msg="Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-calico.yaml to set cluster configuration values"
11月 28 15:48:23 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:23+08:00" level=info msg="Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-canal.yaml to set cluster configuration values"
11月 28 15:48:23 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:23+08:00" level=info msg="Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-coredns.yaml to set cluster configuration values"
11月 28 15:48:23 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:23+08:00" level=info msg="Updated manifest /var/lib/rancher/rke2/server/manifests/rke2-metrics-server.yaml to set cluster configuration values"
11月 28 15:48:23 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:23+08:00" level=info msg="Logging containerd to /var/lib/rancher/rke2/agent/containerd/containerd.log"
11月 28 15:48:23 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:23+08:00" level=info msg="Running containerd -c /var/lib/rancher/rke2/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/rke2/agent/containerd"
11月 28 15:48:24 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:24+08:00" level=info msg="containerd is now running"
11月 28 15:48:24 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:24+08:00" level=info msg="Pulling images from /var/lib/rancher/rke2/agent/images/rke2-images-all.linux-amd64.txt"
11月 28 15:48:24 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:24+08:00" level=info msg="Pulling image docker.io/rancher/hardened-addon-resizer:1.8.20-build20240410"
11月 28 15:48:51 k8s-2-gpu rke2[2876162]: time="2024-11-28T15:48:51+08:00" level=info msg="Waiting for container runtime to become ready before joining etcd cluster"
11月 28 15:48:51 k8s-2-gpu rke2[2876162]: {"level":"warn","ts":"2024-11-28T15:48:51.423136+0800","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0009b08c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
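Because this host was an etcd member before it was uninstalled, it may be worth checking on a healthy server whether a stale member entry for k8s-2-gpu is still registered. The sketch below only filters `etcdctl member list` output in its default comma-separated format; `members_matching` is a hypothetical helper, and the `crictl`/`etcdctl` invocation in the usage comment is an assumption based on the usual RKE2 file layout, not something taken from this report.

```shell
# members_matching NAME_PREFIX: read `etcdctl member list` output (default
# "ID, status, name, peer URLs, client URLs, learner" format) on stdin and
# print "<id> <name> <peer-url>" for members whose name starts with NAME_PREFIX.
members_matching() {
    awk -F', ' -v prefix="$1" 'index($3, prefix) == 1 { print $1, $3, $4 }'
}

# Usage on a healthy server node (paths are assumptions about a default install):
#   export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
#   etcd=$(/var/lib/rancher/rke2/bin/crictl ps --label io.kubernetes.container.name=etcd --quiet)
#   /var/lib/rancher/rke2/bin/crictl exec "$etcd" etcdctl \
#       --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
#       --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
#       --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
#       member list | members_matching k8s-2-gpu
```

An entry that still advertises https://10.0.13.62:2380 after the node was deleted would be one possible explanation for the join never completing, though the logs above do not confirm it.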