cattle-cluster-agent pod fails to start on an RKE2 node installed on Kylin V10

Hardware CPU: Intel x86
Operating system: Kylin Advanced Server OS (AMD64 edition) V10
Rancher version: v2.8.0
RKE2 version: v1.27.12+rke2r1

Environment information:
RKE2 version:
rke2 version v1.27.12+rke2r1 (25b27b4e4709a2ac4c550609ad730a9e172d110a)
go version go1.21.8 X:boringcrypto

Node CPU architecture, operating system, and version:
[root@ky-test ~]# uname -a
Linux ky-test 4.19.90-89.11.v2401.ky10.x86_64 #1 SMP Tue May 7 18:33:01 CST 2024 x86_64 x86_64 x86_64 GNU/Linux

Problem description:
cattle-cluster-agent never manages to start.

Install command:
curl --insecure -fL https://192.168.40.220/system-agent-install.sh | sudo sh -s - --server https://192.168.40.220 --label 'cattle.io/os=linux' --token 55hd59467fc45cdxlpt5n67nhv2rwpmnprmgf64vfdtxwlcrj4rcwp --ca-checksum dfc43177c7204bb998b3076aa1619c2c5110b3adf73f173b528ff0ea40d187bc --etcd --controlplane --worker

The logs suggest the agent cannot connect to Rancher, yet ping works. Is this a certificate problem? How can I fix it?
[root@ky-test ~]# curl https://192.168.40.220/ping
curl: (60) SSL certificate problem: self signed certificate in certificate chain
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
[root@ky-test ~]# ping 192.168.40.220
PING 192.168.40.220 (192.168.40.220) 56(84) bytes of data.
64 bytes from 192.168.40.220: icmp_seq=1 ttl=64 time=2.96 ms
64 bytes from 192.168.40.220: icmp_seq=2 ttl=64 time=0.114 ms
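
On the certificate question: curl error 60 above only means the host's trust store does not contain the self-signed CA, and the install command already passes a --ca-checksum for it. The following is a minimal sketch for checking a downloaded CA file against that checksum. It assumes the checksum is the plain SHA-256 of the CA file and that Rancher serves that file at /cacerts; both are assumptions, not something this thread confirms. The function name ca_checksum_matches and the /tmp path are made up for illustration.

```shell
#!/bin/sh
# Sketch: compare the SHA-256 of a CA file with the value passed to --ca-checksum.
# Assumption: the checksum is the plain SHA-256 of the PEM file served by Rancher.

# ca_checksum_matches FILE EXPECTED
# Returns 0 if the SHA-256 of FILE equals EXPECTED, 1 otherwise.
ca_checksum_matches() {
    actual=$(sha256sum "$1" | awk '{print $1}')
    [ "$actual" = "$2" ]
}

# Example (not run here). The /cacerts path is an assumption:
#   curl --insecure -fsSL https://192.168.40.220/cacerts -o /tmp/rancher-ca.pem
#   ca_checksum_matches /tmp/rancher-ca.pem \
#     dfc43177c7204bb998b3076aa1619c2c5110b3adf73f173b528ff0ea40d187bc \
#     && echo "CA matches --ca-checksum"
```

If the checksum matches, the certificate is probably not the cause, which would fit the pod log further down reporting a failed TCP connection rather than a TLS error.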

Logs:

kubectl -n cattle-system logs cattle-cluster-agent-96f86b66b-k5jds --previous
INFO: Environment: CATTLE_ADDRESS=10.42.120.2 CATTLE_CA_CHECKSUM=dfc43177c7204bb998b3076aa1619c2c5110b3adf73f173b528ff0ea40d187bc CATTLE_CLUSTER=true CATTLE_CLUSTER_AGENT_PORT=tcp://10.43.65.46:80 CATTLE_CLUSTER_AGENT_PORT_443_TCP=tcp://10.43.65.46:443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_ADDR=10.43.65.46 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PORT=443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_PORT_80_TCP=tcp://10.43.65.46:80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_ADDR=10.43.65.46 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PORT=80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_SERVICE_HOST=10.43.65.46 CATTLE_CLUSTER_AGENT_SERVICE_PORT=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTP=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTPS_INTERNAL=443 CATTLE_CLUSTER_REGISTRY=repo.yousen.plus CATTLE_FEATURES=embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false CATTLE_INGRESS_IP_DOMAIN=sslip.io CATTLE_INSTALL_UUID=f25f086d-3060-4582-97c9-268715f1c0c7 CATTLE_INTERNAL_ADDRESS= CATTLE_IS_RKE=false CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-96f86b66b-k5jds CATTLE_RANCHER_WEBHOOK_VERSION= CATTLE_SERVER=https://192.168.40.220 CATTLE_SERVER_VERSION=v2.8.3
INFO: Using resolv.conf: search cattle-system.svc.cluster.local svc.cluster.local cluster.local nameserver 10.43.0.10 options ndots:5
ERROR: https://192.168.40.220/ping is not accessible (Failed to connect to 192.168.40.220 port 443 after 1055 ms: Couldn't connect to server)

[root@ky-test ~]# telnet 192.168.40.220 443
Trying 192.168.40.220...
Connected to 192.168.40.220.
Escape character is '^]'.
^CConnection closed by foreign host.

[root@ky-test-master ~]# kubectl get events --namespace=cattle-system --watch
LAST SEEN TYPE REASON OBJECT MESSAGE
3m3s Warning FailedScheduling pod/cattle-cluster-agent-7f766b8f87-q5lv8 0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling…
2m37s Normal Scheduled pod/cattle-cluster-agent-7f766b8f87-q5lv8 Successfully assigned cattle-system/cattle-cluster-agent-7f766b8f87-q5lv8 to ky-test-master
2m37s Warning FailedCreatePodSandBox pod/cattle-cluster-agent-7f766b8f87-q5lv8 Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a8f96d6496157e4fc5d29befafb1335804306429e1ac545f207e45d6b9f1dfc5": plugin type="calico" failed (add): stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
2m24s Normal SandboxChanged pod/cattle-cluster-agent-7f766b8f87-q5lv8 Pod sandbox changed, it will be killed and re-created.
2m23s Normal Pulling pod/cattle-cluster-agent-7f766b8f87-q5lv8 Pulling image "repo.yousen.plus/rancher/rancher-agent:v2.8.3"
101s Normal Pulled pod/cattle-cluster-agent-7f766b8f87-q5lv8 Successfully pulled image "repo.yousen.plus/rancher/rancher-agent:v2.8.3" in 42.013581071s (42.013589033s including waiting)
52s Normal Created pod/cattle-cluster-agent-7f766b8f87-q5lv8 Created container cluster-register
52s Normal Started pod/cattle-cluster-agent-7f766b8f87-q5lv8 Started container cluster-register
52s Normal Pulled pod/cattle-cluster-agent-7f766b8f87-q5lv8 Container image "repo.yousen.plus/rancher/rancher-agent:v2.8.3" already present on machine
13s Warning BackOff pod/cattle-cluster-agent-7f766b8f87-q5lv8 Back-off restarting failed container cluster-register in pod cattle-cluster-agent-7f766b8f87-q5lv8_cattle-system(1a2d5d7b-f37e-4fc8-9135-448ba8d706b8)
3m4s Normal SuccessfulCreate replicaset/cattle-cluster-agent-7f766b8f87 Created pod: cattle-cluster-agent-7f766b8f87-q5lv8
3m5s Normal ScalingReplicaSet deployment/cattle-cluster-agent Scaled up replica set cattle-cluster-agent-7f766b8f87 to 1
0s Normal Pulled pod/cattle-cluster-agent-7f766b8f87-q5lv8 Container image "repo.yousen.plus/rancher/rancher-agent:v2.8.3" already present on machine
0s Normal Created pod/cattle-cluster-agent-7f766b8f87-q5lv8 Created container cluster-register

In later use, it looks like the cluster-internal DNS is broken: Services that I create cannot be reached from inside containers.

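Confirmed: on Kylin, several IPv4-related parameters in /etc/sysctl.conf default to 0, which prevented the vxlan interface from being created. After changing them to 1, the vxlan interface can be created
and calico-node runs normally.
However, the cattle-cluster-agent container still fails to run.
Its logs show that it cannot reach Rancher's ping endpoint.
The install command did include --insecure: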

curl --insecure -fL https://192.168.40.220/system-agent-install.sh | sudo  sh -s - --server https://192.168.40.220 --label 'cattle.io/os=linux' --token tp4trblqqk9tbvzzkqbzjt74wx4bt9ftvh742s86qw5qxgtmjl2zl8 --ca-checksum dfc43177c7204bb998b3076aa1619c2c5110b3adf73f173b528ff0ea40d187bc --etcd --controlplane --worker

How can this problem be solved?
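
The post above does not name the IPv4 parameters that were changed. As a rough sketch for finding candidates, the function below lists every net.ipv4.* key that a sysctl-style file sets to 0. The function name list_ipv4_zero is hypothetical, and the script only reports keys; it does not decide whether setting any of them to 1 is appropriate.

```shell
#!/bin/sh
# Sketch: list net.ipv4.* keys that a sysctl-style file sets to 0.
# Which of these matter for vxlan is NOT stated in the thread; this only
# narrows down where to look.

# list_ipv4_zero [FILE]   (FILE defaults to /etc/sysctl.conf)
list_ipv4_zero() {
    # Strip whitespace, drop comment lines, keep "net.ipv4.<key>=0" entries.
    sed -e 's/[[:space:]]//g' -e '/^[#;]/d' "${1:-/etc/sysctl.conf}" |
        awk -F= '$1 ~ /^net\.ipv4\./ && $2 == "0" {print $1}'
}

# Example (not run here):
#   list_ipv4_zero /etc/sysctl.conf
```

Files under /etc/sysctl.d/ can override /etc/sysctl.conf, so the same function may need to be run against those as well.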

Run cat /proc/sys/net/ipv4/ip_forward to check the host's ip_forward value. If it is 0, you need to enable ip_forward.
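
The check suggested above can be wrapped in a small function, shown here as a sketch. The function name ip_forward_status is made up for illustration; the commented sysctl commands are the usual way to turn forwarding on, but whether they are enough on this Kylin host is not confirmed in the thread.

```shell
#!/bin/sh
# Sketch: report whether IPv4 forwarding is enabled on this host.

# ip_forward_status [FILE]   (FILE defaults to /proc/sys/net/ipv4/ip_forward)
# Prints "enabled" if the file contains 1, "disabled" otherwise.
ip_forward_status() {
    value=$(cat "${1:-/proc/sys/net/ipv4/ip_forward}")
    if [ "$value" = "1" ]; then
        echo enabled
    else
        echo disabled
    fi
}

# To enable forwarding (needs root, not run here):
#   sysctl -w net.ipv4.ip_forward=1                      # running kernel only
#   echo 'net.ipv4.ip_forward = 1' >> /etc/sysctl.conf   # persist across reboots
#   sysctl -p                                            # reload /etc/sysctl.conf
```

Calling ip_forward_status with no argument reads the live kernel value; passing a file path is only there to make the function easy to try against a copy.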


Is Kylin OS installed directly on a physical machine, or is it running on a virtualization platform?

For now it is virtualized.

Which virtualization platform are you using?

virsh

I looked at the coredns logs:
[root@jt ~]# kubectl logs -n kube-system -f pod/rke2-coredns-rke2-coredns-bcb9d866c-jb9dv
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.EndpointSlice: Get "https://10.43.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.43.0.1:443: connect: no route to host

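This is most likely a network problem between the hosts. I have seen a similar issue in the community before, caused by running on the Sangfor (深信服) platform.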

OK. Thank you!