RKE version:
rke version v1.2.21
Docker version: (`docker version`, `docker info`)
```
Client: Docker Engine - Community
 Version:           20.10.21
 API version:       1.41
 Go version:        go1.18.7
 Git commit:        baeda1f
 Built:             Tue Oct 25 18:04:24 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.21
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.7
  Git commit:       3056208
  Built:            Tue Oct 25 18:02:38 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.9
  GitCommit:        1c90a442489720eec95342e1789ee8a5e1b9536f
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
```
OS and kernel: (`cat /etc/os-release`, `uname -r`)

```
5.4.224-1.el7.elrepo.x86_64
```
Host type and vendor: China Mobile Cloud (移动云)
Problem description:

Cluster configuration:

| Hostname | Specs | Roles |
| --- | --- | --- |
| k8s-node01 | 4 vCPU / 8 GB | controlplane,etcd,worker |
| k8s-node02 | 4 vCPU / 8 GB | controlplane,etcd,worker |
| k8s-node03 | 4 vCPU / 8 GB | controlplane,etcd,worker |
| nginx | 2 vCPU / 4 GB | nginx load balancer |
The production cluster had been running normally for a week after setup. On Monday we rebooted the servers to test whether the 3-node cluster would keep working with one node down. After the reboot, the Rancher UI suddenly showed an error and could no longer reach the cluster nodes.
Error message:
<small>Cluster health check failed: Failed to communicate with API server during namespace check: Get "https://10.43.0.1:443/api/v1/namespaces/kube-system?timeout=45s": context deadline exceeded</small>
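To narrow down whether the failure is in Rancher's connection or in the downstream control plane itself, one can probe the kube-apiserver directly from a controlplane node. A minimal sketch (6443 is RKE's default apiserver port; `-k` is used only because reachability, not certificate validity, is being tested):

```shell
# Probe the local kube-apiserver from a controlplane node.
# A healthy apiserver answers /healthz with "ok"; a timeout or
# connection reset points at the control plane itself rather than
# at Rancher's connection to it.
curl -sk -m 5 https://127.0.0.1:6443/healthz || echo "apiserver not responding"
```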
Logging in to the node servers over SSH, every container listed by `docker ps -a` is in the Exited state. The container counts also differ between node01–03; some components are simply gone.
```
[root@k8s-node-01 ~]# docker ps -a|wc -l
33
[root@k8s-node-02 ~]# docker ps -a|wc -l
23
[root@k8s-node-03 ~]# docker ps -a|wc -l
21
```
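Raw counts do not say *which* components are missing; capturing and diffing the container name lists per node does. A sketch using `comm` (the two lists below are illustrative stand-ins for real `docker ps -a --format '{{.Names}}' | sort` output, and the file paths are hypothetical):

```shell
# On each node one would capture the sorted container names, e.g.:
#   docker ps -a --format '{{.Names}}' | sort > /tmp/node01-containers.txt
# Here two such captures are faked to show the comparison itself.
printf 'etcd\nkube-apiserver\nkube-scheduler\nkubelet\n' > /tmp/node01-containers.txt
printf 'etcd\nkube-apiserver\nkubelet\n' > /tmp/node02-containers.txt

# comm -23 prints lines present only in the first file, i.e. the
# components node02 is missing compared to node01.
comm -23 /tmp/node01-containers.txt /tmp/node02-containers.txt
```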
Steps to reproduce:
Build a 3-node cluster in VMware Workstation Pro, register it with Rancher as a custom cluster, then power off all three VMs and boot them again; the problem reproduces:
Result:

Logs

Component error logs

kubelet error log:
```
E1120 09:41:54.000349 1855 kubelet_node_status.go:93] Unable to register node "192.168.100.111" with API server: Post "https://127.0.0.1:6443/api/v1/nodes": read tcp 127.0.0.1:52484->127.0.0.1:6443: read: connection reset by peer
I1120 09:41:54.350192 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:54.909215 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:55.349941 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:55.909332 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:56.350082 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:56.908854 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:57.349755 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:57.909607 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:58.349770 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:58.349798 1855 kubelet.go:449] kubelet nodes not sync
E1120 09:41:58.349804 1855 kubelet.go:2268] nodes have not yet been read at least once, cannot construct node object
I1120 09:41:58.449920 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:58.449973 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:58.909106 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:59.450152 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:41:59.909439 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:00.450888 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:00.909297 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:01.001356 1855 kubelet_node_status.go:362] Setting node annotation to enable volume controller attach/detach
I1120 09:42:01.016185 1855 kubelet_node_status.go:554] Recording NodeHasSufficientMemory event message for node 192.168.100.111
I1120 09:42:01.016233 1855 kubelet_node_status.go:554] Recording NodeHasNoDiskPressure event message for node 192.168.100.111
I1120 09:42:01.016241 1855 kubelet_node_status.go:554] Recording NodeHasSufficientPID event message for node 192.168.100.111
I1120 09:42:01.016259 1855 kubelet_node_status.go:71] Attempting to register node 192.168.100.111
I1120 09:42:01.450268 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:01.909599 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:01.909632 1855 kubelet.go:449] kubelet nodes not sync
E1120 09:42:01.909649 1855 eviction_manager.go:260] eviction manager: failed to get summary stats: failed to get node info: nodes have not yet been read at least once, cannot construct node object
I1120 09:42:02.450233 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:03.450344 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:04.450243 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:05.002410 1855 csi_plugin.go:1039] Failed to contact API server when waiting for CSINode publishing: Get "https://127.0.0.1:6443/apis/storage.k8s.io/v1/csinodes/192.168.100.111": net/http: TLS handshake timeout
I1120 09:42:05.002524 1855 trace.go:205] Trace[337662251]: "Reflector ListAndWatch" name:k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46 (20-Nov-2022 09:41:49.838) (total time: 15164ms):
Trace[337662251]: [15.164090374s] [15.164090374s] END
E1120 09:42:05.002542 1855 reflector.go:138] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to watch *v1.Pod: failed to list *v1.Pod: Get "https://127.0.0.1:6443/api/v1/pods?fieldSelector=spec.nodeName%3D192.168.100.111&limit=500&resourceVersion=0": net/http: TLS handshake timeout
I1120 09:42:05.450155 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:06.450746 1855 kubelet.go:449] kubelet nodes not sync
E1120 09:42:07.155989 1855 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/192.168.100.111?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
I1120 09:42:07.450321 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:08.450372 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:08.450391 1855 kubelet.go:449] kubelet nodes not sync
E1120 09:42:08.450397 1855 kubelet.go:2268] nodes have not yet been read at least once, cannot construct node object
I1120 09:42:08.551436 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:08.551470 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:09.552388 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:10.552402 1855 kubelet.go:449] kubelet nodes not sync
E1120 09:42:11.016998 1855 kubelet_node_status.go:93] Unable to register node "192.168.100.111" with API server: Post "https://127.0.0.1:6443/api/v1/nodes": net/http: TLS handshake timeout
I1120 09:42:11.552190 1855 kubelet.go:449] kubelet nodes not sync
E1120 09:42:11.754531 1855 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"192.168.100.111.172938d44bb67119", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"192.168.100.111", UID:"192.168.100.111", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"NodeHasNoDiskPressure", Message:"Node 192.168.100.111 status is now: NodeHasNoDiskPressure", Source:v1.EventSource{Component:"kubelet", Host:"192.168.100.111"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc0d69201562b7f19, ext:7657864109, loc:(*time.Location)(0x70e3140)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc0d692294255eb7a, ext:167325100036, loc:(*time.Location)(0x70e3140)}}, Count:15, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Patch "https://127.0.0.1:6443/api/v1/namespaces/default/events/192.168.100.111.172938d44bb67119": net/http: TLS handshake timeout'(may retry after sleeping)
I1120 09:42:11.910590 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:11.910624 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:12.552144 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:12.911618 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:13.551946 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:13.911393 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:14.552494 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:14.911079 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:15.552238 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:15.910761 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:16.552033 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:16.910808 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:17.552465 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:17.911118 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:18.017786 1855 kubelet_node_status.go:362] Setting node annotation to enable volume controller attach/detach
I1120 09:42:18.033242 1855 kubelet_node_status.go:554] Recording NodeHasSufficientMemory event message for node 192.168.100.111
I1120 09:42:18.033288 1855 kubelet_node_status.go:554] Recording NodeHasNoDiskPressure event message for node 192.168.100.111
I1120 09:42:18.033295 1855 kubelet_node_status.go:554] Recording NodeHasSufficientPID event message for node 192.168.100.111
I1120 09:42:18.033312 1855 kubelet_node_status.go:71] Attempting to register node 192.168.100.111
I1120 09:42:18.552174 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:18.552204 1855 kubelet.go:449] kubelet nodes not sync
E1120 09:42:18.552211 1855 kubelet.go:2268] nodes have not yet been read at least once, cannot construct node object
I1120 09:42:18.653348 1855 kubelet.go:449] kubelet nodes not sync
I1120 09:42:18.653386 1855 kubelet.go:449] kubelet nodes not sync
```
net.bridge.bridge-nf-call-iptables = 1
etcd error log:
```
2022-11-20 09:46:38.189285 W | rafthttp: health check for peer c3a369a54123aeaa could not connect: dial tcp 192.168.100.113:2380: connect: connection refused
2022-11-20 09:46:38.189338 W | rafthttp: health check for peer 1c6ea6e3eae3082b could not connect: dial tcp 192.168.100.112:2380: connect: connection refused
2022-11-20 09:46:38.189346 W | rafthttp: health check for peer 1c6ea6e3eae3082b could not connect: dial tcp 192.168.100.112:2380: connect: connection refused
2022-11-20 09:46:38.193789 W | rafthttp: health check for peer c3a369a54123aeaa could not connect: dial tcp 192.168.100.113:2380: connect: connection refused
2022-11-20 09:46:42.284780 E | etcdserver: publish error: etcdserver: request timed out
2022-11-20 09:46:43.189733 W | rafthttp: health check for peer c3a369a54123aeaa could not connect: dial tcp 192.168.100.113:2380: connect: connection refused
2022-11-20 09:46:43.189781 W | rafthttp: health check for peer 1c6ea6e3eae3082b could not connect: dial tcp 192.168.100.112:2380: connect: connection refused
2022-11-20 09:46:43.189792 W | rafthttp: health check for peer 1c6ea6e3eae3082b could not connect: dial tcp 192.168.100.112:2380: connect: connection refused
2022-11-20 09:46:43.194241 W | rafthttp: health check for peer c3a369a54123aeaa could not connect: dial tcp 192.168.100.113:2380: connect: connection refused
raft2022/11/20 09:46:43 INFO: 4a26a3091b1149f3 is starting a new election at term 2339
raft2022/11/20 09:46:43 INFO: 4a26a3091b1149f3 became candidate at term 2340
raft2022/11/20 09:46:43 INFO: 4a26a3091b1149f3 received MsgVoteResp from 4a26a3091b1149f3 at term 2340
raft2022/11/20 09:46:43 INFO: 4a26a3091b1149f3 [logterm: 24, index: 65697] sent MsgVote request to 1c6ea6e3eae3082b at term 2340
raft2022/11/20 09:46:43 INFO: 4a26a3091b1149f3 [logterm: 24, index: 65697] sent MsgVote request to c3a369a54123aeaa at term 2340
```
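The connection-refused errors on port 2380 suggest the etcd containers on the peer nodes never came back after the reboot, so this member keeps starting elections it cannot win. On an RKE node the `etcd` container has `etcdctl` preconfigured (certificates and endpoints are set in the container environment), so health can be checked directly; this sketch assumes RKE's default container name `etcd`:

```shell
# Check etcd endpoint health and cluster membership from inside the
# RKE etcd container (etcdctl TLS settings are preset in its env).
docker exec etcd etcdctl endpoint health
docker exec etcd etcdctl member list

# If the etcd container itself is Exited, its last log lines
# usually say why it stopped:
docker logs --tail 50 etcd
```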