During failure testing of the cluster, the cluster suddenly fails to start — why?

RKE version:
rke version v1.2.21
Docker version: (docker version, docker info)

Client: Docker Engine - Community
 Version:           20.10.21
 API version:       1.41
 Go version:        go1.18.7
 Git commit:        baeda1f
 Built:             Tue Oct 25 18:04:24 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.21
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.7
  Git commit:       3056208
  Built:            Tue Oct 25 18:02:38 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.9
  GitCommit:        1c90a442489720eec95342e1789ee8a5e1b9536f
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Operating system and kernel: (cat /etc/os-release, uname -r)
5.4.224-1.el7.elrepo.x86_64
Host type and provider: China Mobile Cloud (移动云)

Problem description:
Cluster configuration:

Hostname     Specs               Role
k8s-node01   4 threads / 8 GB    controlplane, etcd, worker
k8s-node02   4 threads / 8 GB    controlplane, etcd, worker
k8s-node03   4 threads / 8 GB    controlplane, etcd, worker
nginx        2 threads / 4 GB    nginx load balancer

The production cluster worked normally for a week after it was set up. On Monday we planned to restart the servers to test whether the 3-node cluster would still work with one node shut down. After the restart, the Rancher UI suddenly reported an error and could not connect to the cluster nodes.
Error message:
Cluster health check failed: Failed to communicate with API server during namespace check: Get "https://10.43.0.1:443/api/v1/namespaces/kube-system?timeout=45s": context deadline exceeded
After SSHing into the node servers, every container in docker ps -a was in the Exited state, and comparing container counts shows that node01-03 have different numbers of components; some components are simply gone.

[root@k8s-node-01 ~]# docker ps -a|wc -l
33
[root@k8s-node-02 ~]# docker ps -a|wc -l
23
[root@k8s-node-03 ~]# docker ps -a|wc -l
21

Steps to reproduce:
Built a 3-node cluster in VMware Workstation Pro, added it to Rancher as a custom cluster, then shut down all three cluster VMs and powered them back on; the problem reproduced:

Result:

Logs

Component error logs

kubelet error log

E1120 09:41:54.000349    1855 kubelet_node_status.go:93] Unable to register node "192.168.100.111" with API server: Post "https://127.0.0.1:6443/api/v1/nodes": read tcp 127.0.0.1:52484->127.0.0.1:6443: read: connection reset by peer
I1120 09:41:54.350192    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:54.909215    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:55.349941    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:55.909332    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:56.350082    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:56.908854    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:57.349755    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:57.909607    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:58.349770    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:58.349798    1855 kubelet.go:449] kubelet nodes not sync
E1120 09:41:58.349804    1855 kubelet.go:2268] nodes have not yet been read at least once, cannot construct node object
I1120 09:41:58.449920    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:58.449973    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:58.909106    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:59.450152    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:59.909439    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:00.450888    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:00.909297    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:01.001356    1855 kubelet_node_status.go:362] Setting node annotation to enable volume controller attach/detach
I1120 09:42:01.016185    1855 kubelet_node_status.go:554] Recording NodeHasSufficientMemory event message for node 192.168.100.111
I1120 09:42:01.016233    1855 kubelet_node_status.go:554] Recording NodeHasNoDiskPressure event message for node 192.168.100.111
I1120 09:42:01.016241    1855 kubelet_node_status.go:554] Recording NodeHasSufficientPID event message for node 192.168.100.111
I1120 09:42:01.016259    1855 kubelet_node_status.go:71] Attempting to register node 192.168.100.111
I1120 09:42:01.450268    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:01.909599    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:01.909632    1855 kubelet.go:449] kubelet nodes not sync
E1120 09:42:01.909649    1855 eviction_manager.go:260] eviction manager: failed to get summary stats: failed to get node info: nodes have not yet been read at least once, cannot construct node object
I1120 09:42:02.450233    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:03.450344    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:04.450243    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:05.002410    1855 csi_plugin.go:1039] Failed to contact API server when waiting for CSINode publishing: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1/csinodes/192.168.100.111": net/http: TLS handshake timeout
I1120 09:42:05.002524    1855 trace.go:205] Trace[337662251]: "Reflector ListAndWatch" name:k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46 (20-Nov-2022 09:41:49.838) (total time: 15164ms):
Trace[337662251]: [15.164090374s] [15.164090374s] END
E1120 09:42:05.002542    1855 reflector.go:138] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to watch *v1.Pod: failed to list *v1.Pod: Get "https://127.0.0.1:6443/api/v1/pods?fieldSelector=spec.nodeName%3D192.168.100.111&limit=500&resourceVersion=0": net/http: TLS handshake timeout
I1120 09:42:05.450155    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:06.450746    1855 kubelet.go:449] kubelet nodes not sync
E1120 09:42:07.155989    1855 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/192.168.100.111?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
I1120 09:42:07.450321    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:08.450372    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:08.450391    1855 kubelet.go:449] kubelet nodes not sync
E1120 09:42:08.450397    1855 kubelet.go:2268] nodes have not yet been read at least once, cannot construct node object
I1120 09:42:08.551436    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:08.551470    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:09.552388    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:10.552402    1855 kubelet.go:449] kubelet nodes not sync
E1120 09:42:11.016998    1855 kubelet_node_status.go:93] Unable to register node "192.168.100.111" with API server: Post "https://127.0.0.1:6443/api/v1/nodes": net/http: TLS handshake timeout
I1120 09:42:11.552190    1855 kubelet.go:449] kubelet nodes not sync
E1120 09:42:11.754531    1855 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"192.168.100.111.172938d44bb67119", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"192.168.100.111", UID:"192.168.100.111", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"NodeHasNoDiskPressure", Message:"Node 192.168.100.111 status is now: NodeHasNoDiskPressure", Source:v1.EventSource{Component:"kubelet", Host:"192.168.100.111"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc0d69201562b7f19, ext:7657864109, loc:(*time.Location)(0x70e3140)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc0d692294255eb7a, ext:167325100036, loc:(*time.Location)(0x70e3140)}}, Count:15, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Patch "https://127.0.0.1:6443/api/v1/namespaces/default/events/192.168.100.111.172938d44bb67119": net/http: TLS handshake timeout'(may retry after sleeping)
I1120 09:42:11.910590    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:11.910624    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:12.552144    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:12.911618    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:13.551946    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:13.911393    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:14.552494    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:14.911079    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:15.552238    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:15.910761    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:16.552033    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:16.910808    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:17.552465    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:17.911118    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:18.017786    1855 kubelet_node_status.go:362] Setting node annotation to enable volume controller attach/detach
I1120 09:42:18.033242    1855 kubelet_node_status.go:554] Recording NodeHasSufficientMemory event message for node 192.168.100.111
I1120 09:42:18.033288    1855 kubelet_node_status.go:554] Recording NodeHasNoDiskPressure event message for node 192.168.100.111
I1120 09:42:18.033295    1855 kubelet_node_status.go:554] Recording NodeHasSufficientPID event message for node 192.168.100.111
I1120 09:42:18.033312    1855 kubelet_node_status.go:71] Attempting to register node 192.168.100.111
I1120 09:42:18.552174    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:18.552204    1855 kubelet.go:449] kubelet nodes not sync
E1120 09:42:18.552211    1855 kubelet.go:2268] nodes have not yet been read at least once, cannot construct node object
I1120 09:42:18.653348    1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:18.653386    1855 kubelet.go:449] kubelet nodes not sync
net.bridge.bridge-nf-call-iptables = 1
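All of the kubelet errors above point at the local API server endpoint on 127.0.0.1:6443 (connection reset, TLS handshake timeout), so the kubelet is probably a symptom rather than the cause. A rough check of the local kube-apiserver on a controlplane node, assuming the standard RKE container name kube-apiserver:

# Is the local kube-apiserver container running, and what did it log before exiting?
docker ps -a --filter "name=kube-apiserver"
docker logs --tail 50 kube-apiserver

# Probe the local API server port directly; even an Unauthorized response shows TLS is answering
curl -k https://127.0.0.1:6443/healthz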

etcd error log

2022-11-20 09:46:38.189285 W | rafthttp: health check for peer c3a369a54123aeaa could not connect: dial tcp 192.168.100.113:2380: connect: connection refused
2022-11-20 09:46:38.189338 W | rafthttp: health check for peer 1c6ea6e3eae3082b could not connect: dial tcp 192.168.100.112:2380: connect: connection refused
2022-11-20 09:46:38.189346 W | rafthttp: health check for peer 1c6ea6e3eae3082b could not connect: dial tcp 192.168.100.112:2380: connect: connection refused
2022-11-20 09:46:38.193789 W | rafthttp: health check for peer c3a369a54123aeaa could not connect: dial tcp 192.168.100.113:2380: connect: connection refused
2022-11-20 09:46:42.284780 E | etcdserver: publish error: etcdserver: request timed out
2022-11-20 09:46:43.189733 W | rafthttp: health check for peer c3a369a54123aeaa could not connect: dial tcp 192.168.100.113:2380: connect: connection refused
2022-11-20 09:46:43.189781 W | rafthttp: health check for peer 1c6ea6e3eae3082b could not connect: dial tcp 192.168.100.112:2380: connect: connection refused
2022-11-20 09:46:43.189792 W | rafthttp: health check for peer 1c6ea6e3eae3082b could not connect: dial tcp 192.168.100.112:2380: connect: connection refused
2022-11-20 09:46:43.194241 W | rafthttp: health check for peer c3a369a54123aeaa could not connect: dial tcp 192.168.100.113:2380: connect: connection refused
raft2022/11/20 09:46:43 INFO: 4a26a3091b1149f3 is starting a new election at term 2339
raft2022/11/20 09:46:43 INFO: 4a26a3091b1149f3 became candidate at term 2340
raft2022/11/20 09:46:43 INFO: 4a26a3091b1149f3 received MsgVoteResp from 4a26a3091b1149f3 at term 2340
raft2022/11/20 09:46:43 INFO: 4a26a3091b1149f3 [logterm: 24, index: 65697] sent MsgVote request to 1c6ea6e3eae3082b at term 2340
raft2022/11/20 09:46:43 INFO: 4a26a3091b1149f3 [logterm: 24, index: 65697] sent MsgVote request to c3a369a54123aeaa at term 2340
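The etcd log shows this member repeatedly starting new elections while it cannot reach its peers on 192.168.100.112:2380 and 192.168.100.113:2380. With two of the three members unreachable etcd has no quorum, and the kube-apiserver will then fail in exactly the way the kubelet log above shows. As a sketch (assuming the standard RKE etcd container, which normally has etcdctl pre-configured via environment variables), the member and endpoint state can be checked on each etcd node with:

# List the etcd members this node knows about
docker exec etcd etcdctl member list

# Check endpoint health for the cluster
docker exec etcd etcdctl endpoint health

# Confirm the etcd container is actually up on the unreachable peers (.112 and .113)
docker ps -a --filter "name=etcd"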

I have a Rancher cluster that is shut down and powered back on on a schedule almost every day, and it has been running stably for a year.
However, I have always been running on the 2.6 baseline.

Could my issue be a version problem? I deployed based on this article: RKE安装k8s及部署高可用rancher_曦风雨后的博客-CSDN博客