Creating a k8s cluster in Rancher keeps failing with an error

lbjames23 · November 17, 2023, 09:06

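Environment:
RKE2 version:

Rancher version: v2.7.9
Node CPU architecture, operating system, and version:

3.10.0-1160.99.1.el7.x86_64 #1 SMP Thu Aug 10 10:46:21 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
Cluster configuration:

This is a test node; there is only one server node.
Problem description:

When creating the k8s cluster in Rancher, the cluster reports: Waiting for agent to check in and apply initial plan. I am also not sure whether this is related to Rancher v2.7.9.
Steps to reproduce:

Command used to install RKE2: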

Logs

"[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20231117-153252/c4ee63cd896420f19cbd88a05165f5d212dd07ce1625ba584fd0693a9e6d7e87_0"
"[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
"[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: and exit code: 1"
"error loading x509 client cert/key for probe kube-apiserver (/var/lib/rancher/rke2/server/tls/client-kube-apiserver.crt//var/lib/rancher/rke2/server/tls/client-kube-apiserver.key): open /var/lib/rancher/rke2/server/tls/client-kube-apiserver.crt: no such file or directory"
"error loading CA cert for probe (kube-scheduler) /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: open /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: no such file or directory"
"error while appending ca cert to pool for probe kube-scheduler"
level=error msg="error loading CA cert for probe (kube-controller-manager) /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt: open /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt: no such file or directory"
"error while appending ca cert to pool for probe kube-controller-manager"
"error loading CA cert for probe (kube-apiserver) /var/lib/rancher/rke2/server/tls/server-ca.crt: open /var/lib/rancher/rke2/server/tls/server-ca.crt: no such file or directory"
"error while appending ca cert to pool for probe kube-apiserver"

rke2-server.service holdoff time over, scheduling restart.
Stopped Rancher Kubernetes Engine v2 (server).
Starting Rancher Kubernetes Engine v2 (server)…

/usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Failed to get unit file state for nm-cloud-setup.service: No such file or directory
"missing required: user: unknown user etcd\nmissing required: group: unknown group etcd\ninvalid kernel parameter value vm.overcommit_memory=0 - expected 1\ninvalid kernel parameter value kernel.panic=0 - expected 10\n"
rke2-server.service: main process exited, code=exited, status=1/FAILURE
Failed to start Rancher Kubernetes Engine v2 (server).
Unit rke2-server.service entered failed state.
rke2-server.service failed.
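
The message logged just before rke2-server exits packs four separate findings into one string. As a reading aid, here is a minimal Python sketch that splits that string into missing accounts and mismatched kernel parameters. The regular expressions and the function name `parse_findings` are illustrative assumptions, not part of RKE2. The findings resemble the host checks RKE2 performs when a CIS hardening profile is selected for the cluster, but that is an assumption; the thread does not confirm which profile was configured.

```python
import re

# The failure message copied from the rke2-server log above.
# In the journal the separators appear as a literal backslash followed by "n".
MESSAGE = (
    r"missing required: user: unknown user etcd\n"
    r"missing required: group: unknown group etcd\n"
    r"invalid kernel parameter value vm.overcommit_memory=0 - expected 1\n"
    r"invalid kernel parameter value kernel.panic=0 - expected 10\n"
)

# Hypothetical patterns written for this one message format.
ACCOUNT_RE = re.compile(
    r"missing required: (?P<kind>user|group): unknown (?:user|group) (?P<name>\S+)"
)
SYSCTL_RE = re.compile(
    r"invalid kernel parameter value (?P<key>[\w.]+)=(?P<actual>\S+)"
    r" - expected (?P<expected>\S+)"
)


def parse_findings(message: str) -> dict:
    """Split the message into missing accounts and mismatched sysctl values."""
    findings = {"accounts": [], "sysctls": []}
    # Accept both literal "\n" sequences and real newlines.
    for line in message.replace("\\n", "\n").splitlines():
        account = ACCOUNT_RE.search(line)
        sysctl = SYSCTL_RE.search(line)
        if account:
            findings["accounts"].append((account["kind"], account["name"]))
        elif sysctl:
            findings["sysctls"].append(
                (sysctl["key"], sysctl["actual"], sysctl["expected"])
            )
    return findings


if __name__ == "__main__":
    result = parse_findings(MESSAGE)
    for kind, name in result["accounts"]:
        print(f"missing {kind}: {name}")
    for key, actual, expected in result["sysctls"]:
        print(f"{key} is {actual}, expected {expected}")
```

Read this way, the server appears to stop because the host lacks an `etcd` user and group and because two sysctl values differ from what it expects. The thread does not record how, or whether, this was resolved.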

ksd · November 20, 2023, 06:57

From the description, it looks like only one node was added to your cluster. Which roles did you select for the node when you added it?

lbjames23 · November 21, 2023, 02:20

I selected all three roles. What is strange is that I am not using any certificates, so why am I getting certificate errors?
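
The certificate messages in the first post's log all have the same shape: a probe tries to open a file under /var/lib/rancher/rke2/server/tls and the file does not exist. One possible reading, not confirmed anywhere in this thread, is that these are files the RKE2 server writes for itself during startup, so they would be absent as long as rke2-server keeps exiting early. The Python sketch below only extracts the paths from such lines and checks whether they exist on the node; the helper names `missing_paths` and `still_missing` are hypothetical.

```python
import re
from pathlib import Path

# Three of the probe errors from the first post, copied from the log.
LOG_LINES = [
    "error loading x509 client cert/key for probe kube-apiserver "
    "(/var/lib/rancher/rke2/server/tls/client-kube-apiserver.crt/"
    "/var/lib/rancher/rke2/server/tls/client-kube-apiserver.key): "
    "open /var/lib/rancher/rke2/server/tls/client-kube-apiserver.crt: "
    "no such file or directory",
    "error loading CA cert for probe (kube-scheduler) "
    "/var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: "
    "open /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: "
    "no such file or directory",
    "error loading CA cert for probe (kube-apiserver) "
    "/var/lib/rancher/rke2/server/tls/server-ca.crt: "
    "open /var/lib/rancher/rke2/server/tls/server-ca.crt: "
    "no such file or directory",
]

# Matches the "open <path>: no such file or directory" tail of each error.
OPEN_RE = re.compile(r"open (?P<path>/\S+): no such file or directory")


def missing_paths(lines):
    """Return the unique paths named in 'open ...: no such file' errors, in order."""
    seen = []
    for line in lines:
        for match in OPEN_RE.finditer(line):
            if match["path"] not in seen:
                seen.append(match["path"])
    return seen


def still_missing(paths):
    """Return the subset of paths that do not exist on this machine right now."""
    return [p for p in paths if not Path(p).exists()]


if __name__ == "__main__":
    for path in still_missing(missing_paths(LOG_LINES)):
        print("not present:", path)
```

Run on the node itself, this shows whether the files are still absent after the next start attempt of rke2-server.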

ksd · November 21, 2023, 03:00

Here are some basic rke2 commands and log troubleshooting steps; you can follow the link to investigate: RKE2 commands

lbjames23 · December 5, 2023, 03:15

Could you please take a look at this error?
The kubelet log reports that the network plugin has not been initialized:
E1205 10:59:33.849300 2417 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"tigera-operator\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=tigera-operator pod=tigera-operator-c48559c97-n5dfb_tigera-operator(447ed086-bb7c-450d-91e4-b5770d4516ca)\"" pod="tigera-operator/tigera-operator-c48559c97-n5dfb" podUID=447ed086-bb7c-450d-91e4-b5770d4516ca
E1205 10:59:37.784085 2417 kubelet.go:2352] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
E1205 10:59:42.785359 2417 kubelet.go:2352] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
E1205 10:59:47.787523 2417 kubelet.go:2352] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
I1205 10:59:47.847600 2417 scope.go:110] "RemoveContainer" containerID="b029c7c648a13733ecdf205bb98a4572affa75d3caec23e9b884e54eff981ea4"

I checked the containerd log, and it looks normal:
time="2023-12-05T11:05:03.886248499+08:00" level=info msg="CreateContainer within sandbox \"0b1a3c9204adbd8a9d165c5e6c6b95b13660273e5396dce487bb3994d047f80a\" for &ContainerMetadata{Name:tigera-operator,Attempt:9,} returns container id \"a8cdbf3b085abb441f1814a6f1e621b46831d90d6d557c81c1381441cb34f450\""
time="2023-12-05T11:05:03.886871098+08:00" level=info msg="StartContainer for \"a8cdbf3b085abb441f1814a6f1e621b46831d90d6d557c81c1381441cb34f450\""
time="2023-12-05T11:05:04.075436666+08:00" level=info msg="StartContainer for \"a8cdbf3b085abb441f1814a6f1e621b46831d90d6d557c81c1381441cb34f450\" returns successfully"
time="2023-12-05T11:05:07.292240657+08:00" level=info msg="shim disconnected" id=a8cdbf3b085abb441f1814a6f1e621b46831d90d6d557c81c1381441cb34f450 namespace=k8s.io
time="2023-12-05T11:05:07.292326239+08:00" level=warning msg="cleaning up after shim disconnected" id=a8cdbf3b085abb441f1814a6f1e621b46831d90d6d557c81c1381441cb34f450 namespace=k8s.io
time="2023-12-05T11:05:07.292340116+08:00" level=info msg="cleaning up dead shim" namespace=k8s.io
time="2023-12-05T11:05:07.879538140+08:00" level=info msg="RemoveContainer for \"7f883247860761f5c82a98f3cca85589f1efa360b94dcc93d84c0350b29a56c8\""
time="2023-12-05T11:05:07.882377849+08:00" level=info msg="RemoveContainer for \"7f883247860761f5c82a98f3cca85589f1efa360b94dcc93d84c0350b29a56c8\" returns successfully"
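
Although every containerd line above is at info or warning level, the timestamps show how briefly the tigera-operator container stayed up: it starts successfully and its shim disconnects a few seconds later. That would be consistent with the process exiting shortly after start, which matches the CrashLoopBackOff in the kubelet log, though the thread does not confirm the cause. The following Python sketch computes that interval from the two relevant lines; the function names and the string matching are assumptions tailored to this excerpt, not a general containerd log parser.

```python
import re
from datetime import datetime

# Container ID of the tigera-operator attempt shown in the log above.
CONTAINER_ID = "a8cdbf3b085abb441f1814a6f1e621b46831d90d6d557c81c1381441cb34f450"

# Two lines copied from the containerd log (start confirmed, shim disconnected).
LOG_LINES = [
    r'time="2023-12-05T11:05:04.075436666+08:00" level=info '
    r'msg="StartContainer for \"' + CONTAINER_ID + r'\" returns successfully"',
    r'time="2023-12-05T11:05:07.292240657+08:00" level=info '
    r'msg="shim disconnected" id=' + CONTAINER_ID + r" namespace=k8s.io",
]

# Accept straight or typographic quotes around the timestamp.
TIME_RE = re.compile(r'time=["\u201c](?P<ts>[^"\u201d]+)["\u201d]')


def parse_time(line):
    """Return the line's timestamp as a datetime, or None if there is none."""
    match = TIME_RE.search(line)
    if match is None:
        return None
    # containerd logs nanoseconds; datetime keeps microseconds, so trim to 6 digits.
    trimmed = re.sub(r"(\.\d{6})\d+", r"\1", match["ts"])
    return datetime.fromisoformat(trimmed)


def container_lifetime(lines, container_id):
    """Seconds between 'StartContainer ... returns successfully' and 'shim disconnected'."""
    started = stopped = None
    for line in lines:
        if container_id not in line:
            continue
        if "StartContainer for" in line and "returns successfully" in line:
            started = parse_time(line)
        elif "shim disconnected" in line and "cleaning up" not in line:
            stopped = parse_time(line)
    if started is None or stopped is None:
        return None
    return (stopped - started).total_seconds()


if __name__ == "__main__":
    seconds = container_lifetime(LOG_LINES, CONTAINER_ID)
    print(f"container ran for about {seconds:.1f} s")
```

With a lifetime this short, the container's own output is usually more informative than the runtime log; `kubectl logs` with the `--previous` flag on the tigera-operator pod is one way to retrieve it.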

Now, after adding the master node, I get this error: Configuring bootstrap node(s) custom-e136bf3d2c94: waiting for probes: calico

[root@localhost ~]# journalctl -f -u rancher-system-agent
-- Logs begin at Tue 2023-12-05 10:38:44 CST. --
Dec 05 11:00:41 localhost rancher-system-agent[2275]: time="2023-12-05T11:00:41+08:00" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20231205-110041/b34804e5329668e67f41db88f16a50bc6c56b5c4d36b1c2071d4704660ab616c_0"
Dec 05 11:00:41 localhost rancher-system-agent[2275]: time="2023-12-05T11:00:41+08:00" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
Dec 05 11:00:41 localhost rancher-system-agent[2275]: time="2023-12-05T11:00:41+08:00" level=info msg="[b34804e5329668e67f41db88f16a50bc6c56b5c4d36b1c2071d4704660ab616c_0:stdout]: Name Location Size Created"
Dec 05 11:00:41 localhost rancher-system-agent[2275]: time="2023-12-05T11:00:41+08:00" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: and exit code: 0"
Dec 05 11:00:41 localhost rancher-system-agent[2275]: time="2023-12-05T11:00:41+08:00" level=info msg="[K8s] updated plan secret fleet-default/custom-e136bf3d2c94-machine-plan with feedback"
Dec 05 11:10:42 localhost rancher-system-agent[2275]: time="2023-12-05T11:10:42+08:00" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20231205-111042/b34804e5329668e67f41db88f16a50bc6c56b5c4d36b1c2071d4704660ab616c_0"
Dec 05 11:10:42 localhost rancher-system-agent[2275]: time="2023-12-05T11:10:42+08:00" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
Dec 05 11:10:43 localhost rancher-system-agent[2275]: time="2023-12-05T11:10:43+08:00" level=info msg="[b34804e5329668e67f41db88f16a50bc6c56b5c4d36b1c2071d4704660ab616c_0:stdout]: Name Location Size Created"
Dec 05 11:10:43 localhost rancher-system-agent[2275]: time="2023-12-05T11:10:43+08:00" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: and exit code: 0"
Dec 05 11:10:43 localhost rancher-system-agent[2275]: time="2023-12-05T11:10:43+08:00" level=info msg="[K8s] updated plan secret fleet-default/custom-e136bf3d2c94-machine-plan with feedback"
Dec 05 10:42:26 localhost rke2[2377]: I1205 10:42:26.076360 2377 event.go:294] "Event occurred" object="kube-system/rke2-ingress-nginx" fieldPath="" kind="HelmChart" apiVersion="helm.cattle.io/v1" type="Normal" reason="ApplyJob" message="Applying HelmChart using Job kube-system/helm-install-rke2-ingress-nginx"
Dec 05 10:42:26 localhost rke2[2377]: I1205 10:42:26.076418 2377 event.go:294] "Event occurred" object="kube-system/rke2-ingress-nginx" fieldPath="" kind="HelmChart" apiVersion="helm.cattle.io/v1" type="Normal" reason="ApplyJob" message="Applying HelmChart using Job kube-system/helm-install-rke2-ingress-nginx"
Dec 05 10:42:26 localhost rke2[2377]: time="2023-12-05T10:42:26+08:00" level=error msg="error syncing 'kube-system/rke2-metrics-server': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-metrics-server\" not found, requeuing"
Dec 05 10:42:26 localhost rke2[2377]: I1205 10:42:26.467786 2377 event.go:294] "Event occurred" object="kube-system/rke2-metrics-server" fieldPath="" kind="HelmChart" apiVersion="helm.cattle.io/v1" type="Normal" reason="ApplyJob" message="Applying HelmChart using Job kube-system/helm-install-rke2-metrics-server"
Dec 05 10:42:26 localhost rke2[2377]: I1205 10:42:26.508008 2377 event.go:294] "Event occurred" object="kube-system/rke2-metrics-server" fieldPath="" kind="HelmChart" apiVersion="helm.cattle.io/v1" type="Normal" reason="ApplyJob" message="Applying HelmChart using Job kube-system/helm-install-rke2-metrics-server"

"Event occurred" object="kube-system/rke2-coredns" fieldPath="" kind="HelmChart" apiVersion="helm.cattle.io/v1" type="Normal" reason="ApplyJob" message="Applying HelmChart using Job kube-system/helm-install-rke2-coredns"
Dec 05 10:42:25 localhost rke2[2377]: I1205 10:42:25.824444 2377 event.go:294] "Event occurred" object="kube-system/rke2-metrics-server" fieldPath="" kind="HelmChart" apiVersion="helm.cattle.io/v1" type="Normal" reason="ApplyJob" message="Applying HelmChart using Job kube-system/helm-install-rke2-metrics-server"
Dec 05 10:42:26 localhost rke2[2377]: time="2023-12-05T10:42:26+08:00" level=error msg="error syncing 'kube-system/rke2-ingress-nginx': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-ingress-nginx\" not found, requeuing"

jackyting825 · January 14, 2024, 11:28

Hello. Has this problem been solved?

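yatou · January 22, 2024, 07:07

Was this ever solved? On my side, after binding a domain name and recreating the cluster, I get this same error, and now I am completely stuck.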

ksd · January 22, 2024, 07:42

Please post again with a detailed description of the steps you performed.

yatou · January 22, 2024, 07:54

The cluster reports: Waiting for agent to check in and apply initial plan