Adding a cluster: all of the cluster's nodes show "waiting for node ref"

Environment information:
RKE2 version:

v1.26.15+rke2r1

Node CPU architecture, OS and version:

Linux homeserver 5.14.0-362.8.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Nov 8 17:36:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Cluster configuration:

Three KVM guests:

  1. etcd + control plane + worker
  2. worker
  3. worker

Problem description:
One physical host with three KVM guests installed on it.
The host OS is Rocky Linux 9; the guests are all Rocky Linux 8.

Docker is installed on the host, and Rancher is started from its Docker image.
The local K3s cluster works fine.

I created a custom cluster in the UI
and pasted the copied registration command into each KVM guest.
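For context, the registration command copied from the Rancher UI looks roughly like this; the URL, token, and checksum below are placeholders, and the role flags differ per node:

curl --insecure -fL https://<rancher-host>:4443/system-agent-install.sh | sudo sh -s - \
  --server https://<rancher-host>:4443 --token <token> --ca-checksum <checksum> \
  --etcd --controlplane --worker   # first node; the other two get only --worker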

The same thing happened on 2.8.5; I have since downgraded to 2.7.6 and it still happens.

Logs

Node logs:

journalctl -u rancher-system-agent
# all three nodes appear to be stuck on this line:
pkg/mod/github.com/rancher/client-go@v1.24.0-rancher1/tools/cache/reflector.go:168: watch of *v1.Secret ended with: an error on the server  ("unable to decode an event from the watch stream: stream error: stream ID 159; INTERNAL_ERROR; received from peer") has prevented the request from succeeding

You could check the Rancher server and cluster agent logs, or follow the RKE2 commands reference to inspect the rke2 logs and the state of its containers.
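For reference, a minimal sketch of those checks (assuming the default RKE2 install paths under /var/lib/rancher/rke2; adjust if yours differ):

# cluster agent logs, from a node that runs the control plane
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
  -n cattle-system logs -l app=cattle-cluster-agent --tail=50

# state of the containers run by rke2's containerd, on any node
/var/lib/rancher/rke2/bin/crictl \
  --runtime-endpoint unix:///run/k3s/containerd/containerd.sock ps -a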

I'm not quite sure exactly what to look for, so here is each of them.
If you mean the node's rke2-server:

[root@kvm151 ~]# journalctl -f -u rke2-server
-- Logs begin at Mon 2024-08-26 21:52:24 EDT. --
8月 27 00:00:02 kvm151 rke2[869]: {"level":"info","ts":"2024-08-27T00:00:02.237879-0400","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-kvm151-1724731202.part"}
8月 27 00:00:02 kvm151 rke2[869]: {"level":"info","ts":"2024-08-27T00:00:02.239663-0400","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
8月 27 00:00:02 kvm151 rke2[869]: {"level":"info","ts":"2024-08-27T00:00:02.240378-0400","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
8月 27 00:00:02 kvm151 rke2[869]: {"level":"info","ts":"2024-08-27T00:00:02.341515-0400","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:220","msg":"completed snapshot read; closing"}
8月 27 00:00:02 kvm151 rke2[869]: {"level":"info","ts":"2024-08-27T00:00:02.353939-0400","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"11 MB","took":"now"}
8月 27 00:00:02 kvm151 rke2[869]: {"level":"info","ts":"2024-08-27T00:00:02.354017-0400","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-kvm151-1724731202"}
8月 27 00:00:02 kvm151 rke2[869]: time="2024-08-27T00:00:02-04:00" level=info msg="Saving snapshot metadata to /var/lib/rancher/rke2/server/db/.metadata/etcd-snapshot-kvm151-1724731202"
8月 27 00:00:02 kvm151 rke2[869]: time="2024-08-27T00:00:02-04:00" level=info msg="Applying snapshot retention=5 to local snapshots with prefix etcd-snapshot in /var/lib/rancher/rke2/server/db/snapshots"
8月 27 00:00:02 kvm151 rke2[869]: time="2024-08-27T00:00:02-04:00" level=info msg="Reconciling ETCDSnapshotFile resources"
8月 27 00:00:02 kvm151 rke2[869]: time="2024-08-27T00:00:02-04:00" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"

If you mean the Rancher service logs:

docker logs rancher


2024/08/27 06:33:49 [INFO] [planner] rkecluster fleet-default/cluster1: non-ready bootstrap machine(s) custom-3e15b71f1cc0 and join url to be available on bootstrap node
2024/08/27 06:34:23 [ERROR] Failed to install system chart fleet: failed to install , pod cattle-system/helm-operation-t6fbg exited 123
2024/08/27 06:34:27 [ERROR] Failed to install system chart fleet: failed to install , pod cattle-system/helm-operation-6s2jz exited 123
W0827 06:37:42.124782      61 transport.go:313] Unable to cancel request for *client.addQuery
W0827 06:37:42.187241      61 transport.go:313] Unable to cancel request for *client.addQuery
W0827 06:37:42.195398      61 transport.go:313] Unable to cancel request for *client.addQuery
...(35 more identical "Unable to cancel request" warnings between 06:37:42 and 06:37:44 omitted)...
W0827 06:37:44.012077      61 transport.go:313] Unable to cancel request for *client.addQuery
W0827 06:37:44.013187      61 transport.go:313] Unable to cancel request for *client.addQuery
W0827 06:37:44.023524      61 transport.go:313] Unable to cancel request for *client.addQuery
2024/08/27 06:38:35 [INFO] [planner] rkecluster fleet-default/cluster1: non-ready bootstrap machine(s) custom-3e15b71f1cc0 and join url to be available on bootstrap node
2024/08/27 06:38:35 [INFO] [planner] rkecluster fleet-default/cluster1: non-ready bootstrap machine(s) custom-3e15b71f1cc0 and join url to be available on bootstrap node
2024/08/27 06:39:23 [ERROR] Failed to install system chart fleet: failed to install , pod cattle-system/helm-operation-8gmbf exited 123
2024/08/27 06:39:27 [ERROR] Failed to install system chart fleet: failed to install , pod cattle-system/helm-operation-xcdkh exited 123
2024/08/27 06:43:40 [INFO] [planner] rkecluster fleet-default/cluster1: non-ready bootstrap machine(s) custom-3e15b71f1cc0 and join url to be available on bootstrap node
2024/08/27 06:43:40 [INFO] [planner] rkecluster fleet-default/cluster1: non-ready bootstrap machine(s) custom-3e15b71f1cc0 and join url to be available on bootstrap node
2024/08/27 06:43:51 [INFO] [planner] rkecluster fleet-default/cluster1: non-ready bootstrap machine(s) custom-3e15b71f1cc0 and join url to be available on bootstrap node
2024/08/27 06:44:22 [ERROR] Failed to install system chart fleet: failed to install , pod cattle-system/helm-operation-q5tkt exited 123
2024/08/27 06:44:26 [ERROR] Failed to install system chart fleet: failed to install , pod cattle-system/helm-operation-64xk6 exited 123
2024/08/27 06:48:46 [INFO] [planner] rkecluster fleet-default/cluster1: non-ready bootstrap machine(s) custom-3e15b71f1cc0 and join url to be available on bootstrap node
2024/08/27 06:48:46 [INFO] [planner] rkecluster fleet-default/cluster1: non-ready bootstrap machine(s) custom-3e15b71f1cc0 and join url to be available on bootstrap node
2024/08/27 06:49:19 [ERROR] error syncing 'cattle-fleet-system/helm-operation-6wvjg': handler helm-operation: Operation cannot be fulfilled on operations.catalog.cattle.io "helm-operation-6wvjg": StorageError: invalid object, Code: 4, Key: /registry/catalog.cattle.io/operations/cattle-fleet-system/helm-operation-6wvjg, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 4eb73d43-8a6b-4551-be3c-3e84fdb5616d, UID in object meta: , requeuing
2024/08/27 06:49:23 [ERROR] Failed to install system chart fleet: failed to install , pod cattle-system/helm-operation-6qxdj exited 123
2024/08/27 06:49:27 [ERROR] Failed to install system chart fleet: failed to install , pod cattle-system/helm-operation-vmdhc exited 123
2024/08/27 06:53:53 [INFO] [planner] rkecluster fleet-default/cluster1: non-ready bootstrap machine(s) custom-3e15b71f1cc0 and join url to be available on bootstrap node
2024/08/27 06:53:53 [INFO] [planner] rkecluster fleet-default/cluster1: non-ready bootstrap machine(s) custom-3e15b71f1cc0 and join url to be available on bootstrap node
2024/08/27 06:53:53 [INFO] [planner] rkecluster fleet-default/cluster1: non-ready bootstrap machine(s) custom-3e15b71f1cc0 and join url to be available on bootstrap node
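The repeated "Failed to install system chart fleet ... exited 123" lines can be dug into from the Rancher container itself. A minimal sketch, assuming the container is named rancher and using the kubectl bundled inside it (the helm-operation pod names are the ones from the log above and change on every retry):

docker exec rancher kubectl -n cattle-system get pods | grep helm-operation
docker exec rancher kubectl -n cattle-system logs helm-operation-64xk6   # substitute a pod name from your own log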

If you mean rancher-system-agent:

journalctl -f -u rancher-system-agent
-- Logs begin at Mon 2024-08-26 21:52:24 EDT. --
8月 27 02:49:15 kvm151 rancher-system-agent[4573]: W0827 02:49:15.399624    4573 reflector.go:443] pkg/mod/github.com/rancher/client-go@v1.24.0-rancher1/tools/cache/reflector.go:168: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 143; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
8月 27 02:50:45 kvm151 rancher-system-agent[4573]: W0827 02:50:45.937104    4573 reflector.go:443] pkg/mod/github.com/rancher/client-go@v1.24.0-rancher1/tools/cache/reflector.go:168: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 147; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
8月 27 02:52:43 kvm151 rancher-system-agent[4573]: W0827 02:52:43.055210    4573 reflector.go:443] pkg/mod/github.com/rancher/client-go@v1.24.0-rancher1/tools/cache/reflector.go:168: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 151; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
8月 27 02:53:53 kvm151 rancher-system-agent[4573]: time="2024-08-27T02:53:53-04:00" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20240827-025353/56c57b7165a002240d397da3e12a16dee7e36762938262947feaf3b178560f77_0"
8月 27 02:53:53 kvm151 rancher-system-agent[4573]: time="2024-08-27T02:53:53-04:00" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
8月 27 02:53:53 kvm151 rancher-system-agent[4573]: time="2024-08-27T02:53:53-04:00" level=info msg="[56c57b7165a002240d397da3e12a16dee7e36762938262947feaf3b178560f77_0:stdout]: Name                            Location                                                                         Size     Created"
8月 27 02:53:53 kvm151 rancher-system-agent[4573]: time="2024-08-27T02:53:53-04:00" level=info msg="[56c57b7165a002240d397da3e12a16dee7e36762938262947feaf3b178560f77_0:stdout]: etcd-snapshot-kvm151-1724716802 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-kvm151-1724716802 18128928 2024-08-26T20:00:02-04:00"
8月 27 02:53:53 kvm151 rancher-system-agent[4573]: time="2024-08-27T02:53:53-04:00" level=info msg="[56c57b7165a002240d397da3e12a16dee7e36762938262947feaf3b178560f77_0:stdout]: etcd-snapshot-kvm151-1724731202 file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-kvm151-1724731202 10928160 2024-08-27T00:00:02-04:00"
8月 27 02:53:53 kvm151 rancher-system-agent[4573]: time="2024-08-27T02:53:53-04:00" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
8月 27 02:53:53 kvm151 rancher-system-agent[4573]: time="2024-08-27T02:53:53-04:00" level=info msg="[K8s] updated plan secret fleet-default/custom-3e15b71f1cc0-machine-plan with feedback"
8月 27 02:56:14 kvm151 rancher-system-agent[4573]: W0827 02:56:14.496248    4573 reflector.go:443] pkg/mod/github.com/rancher/client-go@v1.24.0-rancher1/tools/cache/reflector.go:168: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 155; INTERNAL_ERROR; received from peer") has prevented the request from succeeding

Can your hosts pull images from Docker Hub?

By "host" do you mean the machine running Rancher? Rancher is running inside Docker.

On the machine running Rancher, following GitHub - DaoCloud/public-image-mirror (many images, e.g. from gcr, are hosted abroad and download slowly from within China; the project provides a stable, reliable, and secure mirror service),

I rewrote the Docker daemon configuration:

  "registry-mirrors": [
    "https://dockerhub.icu",
    "https://docker.chenby.cn",
    "https://docker.1panel.live",
    "https://docker.awsl9527.cn",
    "https://docker.anyhub.us.kg",
    "https://dhub.kubesre.xyz"
  ]
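For the record, these mirrors only take effect after a daemon restart (assuming the snippet above lives in /etc/docker/daemon.json); docker info then lists the active mirrors:

sudo systemctl restart docker
docker info | grep -A7 "Registry Mirrors"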

As for the three machines in the cluster, I haven't changed anything on them; I only turned off the firewall:

systemctl stop firewalld
systemctl disable firewalld

That won't be enough: downstream RKE2 clusters use containerd, which by default pulls images from Docker Hub, and those pulls are currently failing because of the network issues, which is why the cluster never finishes creating. You can refer to: 如何使用国内资源安装 Rancher (how to install Rancher using China-local resources) and 使用阿里云镜像仓库创建下游集群 (creating a downstream cluster using the Aliyun image registry).
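As a rough sketch of what that amounts to on each downstream node (the endpoint below is a placeholder for a mirror you can actually reach; /etc/rancher/rke2/registries.yaml is the file RKE2 hands to containerd at startup):

cat > /etc/rancher/rke2/registries.yaml <<'EOF'
mirrors:
  docker.io:
    endpoint:
      - "https://<your-reachable-mirror>"  # placeholder
EOF
systemctl restart rke2-server   # rke2-agent on the worker-only nodes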

Thanks a lot, I'll give it a try.

It didn't quite work. I took a look, and it turns out my setup already fills in the registry automatically.

This is the docker compose file I use to start Rancher:

version: '3'
services:
  rancher:
    image: rancher/rancher:v2.8.5
    container_name: rancher
    privileged: true
    restart: unless-stopped
    ports:
      - "4000:80"
      - "4443:443"
      - "30000-30100:30000-30100"
    environment:
      - CATTLE_SYSTEM_DEFAULT_REGISTRY=registry.cn-hangzhou.aliyuncs.com  # have Rancher prefix all system images with this registry
    volumes:
      - ./data:/var/lib/rancher

I found that with Rancher started this way, the registry field is already filled in automatically when creating the RKE2 cluster.
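One way to double-check what actually landed on a node (assuming rancher-system-agent provisioned it: the agent writes its RKE2 config under /etc/rancher/rke2/config.yaml.d/, and RKE2 renders the containerd config under /var/lib/rancher/rke2/agent/etc/containerd/):

grep -r system-default-registry /etc/rancher/rke2/config.yaml.d/ 2>/dev/null
grep -i registry /var/lib/rancher/rke2/agent/etc/containerd/config.toml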

But the result is still the same; just like in the original question, the cluster nodes are stuck on

unable to decode an event from the watch stream: stream error: stream ID 33; INTERNAL_ERROR; received from peer
journalctl -f -u rancher-system-agent


8月 28 04:04:21 kvm151 rancher-system-agent[5711]: time="2024-08-28T04:04:21-04:00" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20240828-040421/ee1e1fedb17de6e9dbf59d658234c00d080993b210f53fdaee804ed0ce89651d_0"
8月 28 04:04:21 kvm151 rancher-system-agent[5711]: time="2024-08-28T04:04:21-04:00" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
8月 28 04:04:21 kvm151 rancher-system-agent[5711]: time="2024-08-28T04:04:21-04:00" level=info msg="[ee1e1fedb17de6e9dbf59d658234c00d080993b210f53fdaee804ed0ce89651d_0:stdout]: Name Location Size Created"
8月 28 04:04:21 kvm151 rancher-system-agent[5711]: time="2024-08-28T04:04:21-04:00" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
8月 28 04:04:21 kvm151 rancher-system-agent[5711]: time="2024-08-28T04:04:21-04:00" level=info msg="[K8s] updated plan secret fleet-default/custom-bf7782305410-machine-plan with feedback"
8月 28 04:05:42 kvm151 rancher-system-agent[5711]: W0828 04:05:42.026579    5711 reflector.go:456] pkg/mod/github.com/rancher/client-go@v1.27.4-rancher1/tools/cache/reflector.go:231: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 71; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
8月 28 04:06:43 kvm151 rancher-system-agent[5711]: W0828 04:06:43.103786    5711 reflector.go:456] pkg/mod/github.com/rancher/client-go@v1.27.4-rancher1/tools/cache/reflector.go:231: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 77; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
8月 28 04:07:45 kvm151 rancher-system-agent[5711]: W0828 04:07:45.979219    5711 reflector.go:456] pkg/mod/github.com/rancher/client-go@v1.27.4-rancher1/tools/cache/reflector.go:231: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 81; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
8月 28 04:08:50 kvm151 rancher-system-agent[5711]: W0828 04:08:50.926615    5711 reflector.go:456] pkg/mod/github.com/rancher/client-go@v1.27.4-rancher1/tools/cache/reflector.go:231: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 85; INTERNAL_ERROR; received from peer") has prevented the request from succeeding

Here is another snippet of the log:

8月 28 04:21:47 kvm151 rancher-system-agent[5711]: W0828 04:21:47.226747    5711 reflector.go:456] pkg/mod/github.com/rancher/client-go@v1.27.4-rancher1/tools/cache/reflector.go:231: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 123; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
8月 28 04:22:53 kvm151 rancher-system-agent[5711]: W0828 04:22:53.528291    5711 reflector.go:456] pkg/mod/github.com/rancher/client-go@v1.27.4-rancher1/tools/cache/reflector.go:231: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 127; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
8月 28 04:24:03 kvm151 rancher-system-agent[5711]: W0828 04:24:03.876070    5711 reflector.go:456] pkg/mod/github.com/rancher/client-go@v1.27.4-rancher1/tools/cache/reflector.go:231: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 131; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
8月 28 04:24:24 kvm151 rancher-system-agent[5711]: time="2024-08-28T04:24:24-04:00" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20240828-042424/ee1e1fedb17de6e9dbf59d658234c00d080993b210f53fdaee804ed0ce89651d_0"
8月 28 04:24:24 kvm151 rancher-system-agent[5711]: time="2024-08-28T04:24:24-04:00" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
8月 28 04:24:24 kvm151 rancher-system-agent[5711]: time="2024-08-28T04:24:24-04:00" level=info msg="[ee1e1fedb17de6e9dbf59d658234c00d080993b210f53fdaee804ed0ce89651d_0:stdout]: Name Location Size Created"
8月 28 04:24:24 kvm151 rancher-system-agent[5711]: time="2024-08-28T04:24:24-04:00" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
8月 28 04:24:24 kvm151 rancher-system-agent[5711]: time="2024-08-28T04:24:24-04:00" level=info msg="[K8s] updated plan secret fleet-default/custom-bf7782305410-machine-plan with feedback"
8月 28 04:24:24 kvm151 rancher-system-agent[5711]: time="2024-08-28T04:24:24-04:00" level=error msg="[K8s] received secret to process that was older than the last secret operated on. (19311 vs 22338)"
8月 28 04:24:24 kvm151 rancher-system-agent[5711]: time="2024-08-28T04:24:24-04:00" level=error msg="error syncing 'fleet-default/custom-bf7782305410-machine-plan': handler secret-watch: secret received was too old, requeuing"