Three master nodes, two of them failed, and the remaining master is also inaccessible

lchuanqi · November 20, 2024, 10:24

Environment:
K3s version: v1.29.3+k3s1

Node CPU architecture, operating system, and version:
Linux k3s01-201 5.10.0-60.139.0.166.oe2203.x86_64 #1 SMP Thu May 30 05:17:57 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster configuration:
The cluster has 3 nodes in total, all with the master role. It uses embedded etcd, and the container runtime is Docker.

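Problem description:
Of the three nodes, two servers failed and were shut down. The remaining server is running normally, but the cluster cannot be accessed.

Running kubectl get no produces the following output: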
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes)

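Steps to reproduce:
Deploy a new three-node cluster in which every node has the master role. Shut down two of the machines directly; the cluster becomes inaccessible.
Start either one of the machines that was just shut down, so that two machines in the cluster are healthy, and the cluster becomes accessible again.
Running kubectl get no then returns output normally.

How can I access this failed cluster? (Only one master is left in the cluster.)

The error log shows that the surviving node cannot connect to the other two nodes: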

Nov 20 18:15:46 k3s01-201 k3s[5238]: {"level":"warn","ts":"2024-11-20T18:15:46.963443+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"c6be6f8e2dfb10fd","rtt":"0s","error":"dial tcp 172.16.254.202:2380: connect: no route to host"}
Nov 20 18:15:46 k3s01-201 k3s[5238]: {"level":"warn","ts":"2024-11-20T18:15:46.963592+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"c6be6f8e2dfb10fd","rtt":"0s","error":"dial tcp 172.16.254.202:2380: connect: no route to host"}
Nov 20 18:15:46 k3s01-201 k3s[5238]: {"level":"warn","ts":"2024-11-20T18:15:46.963708+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"603d72d8634281b3","rtt":"0s","error":"dial tcp 172.16.254.203:2380: connect: no route to host"}
Nov 20 18:15:46 k3s01-201 k3s[5238]: {"level":"warn","ts":"2024-11-20T18:15:46.963748+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"603d72d8634281b3","rtt":"0s","error":"dial tcp 172.16.254.203:2380: connect: no route to host"}
Nov 20 18:15:49 k3s01-201 k3s[5238]: {"level":"info","ts":"2024-11-20T18:15:49.375525+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"89e2916f55fcffc5 is starting a new election at term 5"}
Nov 20 18:15:49 k3s01-201 k3s[5238]: {"level":"info","ts":"2024-11-20T18:15:49.375629+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"89e2916f55fcffc5 became pre-candidate at term 5"}
Nov 20 18:15:49 k3s01-201 k3s[5238]: {"level":"info","ts":"2024-11-20T18:15:49.375661+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"89e2916f55fcffc5 received MsgPreVoteResp from 89e2916f55fcffc5 at term 5"}
Nov 20 18:15:49 k3s01-201 k3s[5238]: {"level":"info","ts":"2024-11-20T18:15:49.375701+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"89e2916f55fcffc5 [logterm: 5, index: 5487329] sent MsgPreVote request to 603d72d8634281b3 at term 5"}
Nov 20 18:15:49 k3s01-201 k3s[5238]: {"level":"info","ts":"2024-11-20T18:15:49.375723+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"89e2916f55fcffc5 [logterm: 5, index: 5487329] sent MsgPreVote request to c6be6f8e2dfb10fd at term 5"}
Nov 20 18:15:51 k3s01-201 k3s[5238]: time="2024-11-20T18:15:51+08:00" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:6443/v1-k3s/readyz: 500 Internal Server Error"
Nov 20 18:15:51 k3s01-201 k3s[5238]: {"level":"warn","ts":"2024-11-20T18:15:51.964668+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"603d72d8634281b3","rtt":"0s","error":"dial tcp 172.16.254.203:2380: connect: no route to host"}
Nov 20 18:15:51 k3s01-201 k3s[5238]: {"level":"warn","ts":"2024-11-20T18:15:51.964787+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"603d72d8634281b3","rtt":"0s","error":"dial tcp 172.16.254.203:2380: connect: no route to host"}
Nov 20 18:15:51 k3s01-201 k3s[5238]: {"level":"warn","ts":"2024-11-20T18:15:51.964739+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"c6be6f8e2dfb10fd","rtt":"0s","error":"dial tcp 172.16.254.202:2380: connect: no route to host"}
Nov 20 18:15:51 k3s01-201 k3s[5238]: {"level":"warn","ts":"2024-11-20T18:15:51.964765+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"c6be6f8e2dfb10fd","rtt":"0s","error":"dial tcp 172.16.254.202:2380: connect: no route to host"}
Nov 20 18:15:56 k3s01-201 k3s[5238]: {"level":"warn","ts":"2024-11-20T18:15:56.031628+0800","caller":"etcdserver/server.go:2085","msg":"failed to publish local member to cluster through raft","local-member-id":"89e2916f55fcffc5","local-member-attributes":"{Name:k3s01-201-ff5da987 ClientURLs:[https://172.16.254.201:2379]}","request-path":"/0/members/89e2916f55fcffc5/attributes","publish-timeout":"15s","error":"etcdserver: request timed out"}
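
For anyone who wants to condense a log like the one above, here is a minimal Python sketch that collects the remote etcd peers reported as unhealthy. It assumes each etcd entry is a JSON object following the journal prefix, as in the lines pasted here; the function name unreachable_peers and the sample list are made up for illustration and are not part of K3s or etcd.

```python
import json
from collections import defaultdict


def unreachable_peers(lines):
    """Map each remote etcd peer ID to the set of dial errors logged for it."""
    peers = defaultdict(set)
    for line in lines:
        # Forum rendering may turn straight quotes into curly ones; undo that.
        line = line.replace("\u201c", '"').replace("\u201d", '"')
        start = line.find("{")
        if start == -1:
            continue  # no JSON payload, e.g. the plain-text k3s "time=..." lines
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # payload is not valid JSON, skip it
        if entry.get("msg") == "prober detected unhealthy status":
            peers[entry["remote-peer-id"]].add(entry["error"])
    return dict(peers)


# Hypothetical sample, shortened from the log in this post.
sample = [
    'Nov 20 18:15:46 k3s01-201 k3s[5238]: {"level":"warn","msg":"prober detected unhealthy status","remote-peer-id":"c6be6f8e2dfb10fd","error":"dial tcp 172.16.254.202:2380: connect: no route to host"}',
    'Nov 20 18:15:46 k3s01-201 k3s[5238]: {"level":"warn","msg":"prober detected unhealthy status","remote-peer-id":"603d72d8634281b3","error":"dial tcp 172.16.254.203:2380: connect: no route to host"}',
    'Nov 20 18:15:51 k3s01-201 k3s[5238]: time="2024-11-20T18:15:51+08:00" level=info msg="server is not ready"',
]

for peer_id, errors in sorted(unreachable_peers(sample).items()):
    print(peer_id, sorted(errors))
```

Applied to the full log, this should list both remote peers (172.16.254.202 and 172.16.254.203) as unreachable, which matches the two machines that were shut down.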

ksd · November 20, 2024, 12:17

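As I recall, etcd needs a majority (quorum) of its members to be available, so a 3-node high-availability cluster can tolerate the failure of only one host.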
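
To make the arithmetic concrete, here is a small Python sketch of the majority rule that Raft-based stores such as etcd follow. The function names quorum and fault_tolerance are just illustrative, not an etcd API.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority is still available."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={fault_tolerance(n)}")
```

With 3 members the quorum is 2, so losing two hosts leaves a single member that cannot win an election on its own. That is consistent with the repeated pre-vote attempts in the log.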

lchuanqi · November 21, 2024, 03:20

You remembered correctly.