Single-node Rancher restarts unexpectedly

Rancher Server 设置

  • Rancher version: v2.5.11
  • Installation option (Docker install/Helm Chart): Docker install
  • Online or air-gapped deployment: online

Downstream cluster information

  • Kubernetes version: v1.20.12
  • Cluster type (Local/Downstream): Custom

Problem description:
Rancher sometimes restarts for no apparent reason, with the following errors:
E0530 08:44:55.398066 7 leaderelection.go:301] Failed to release lock: resource name may not be empty
2022/05/30 08:44:55 [FATAL] leaderelection lost for cattle-controllers

We would need the more detailed logs from before the FATAL to investigate further; normally a large goroutine stack dump is printed at FATAL time.
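A quick way to pull that context out of the container log (a sketch: the container name `rancher` and the output file name are assumptions, check `docker ps -a` for your actual name):

```shell
# Dump the full Rancher container log with timestamps
# ("rancher" is an assumed container name -- substitute your own).
docker logs rancher --timestamps > rancher-full.log 2>&1

# Locate the FATAL line, then print the ~200 lines leading up to it --
# that is where the stack dump usually lands.
grep -n 'FATAL' rancher-full.log
sed -n '1,/FATAL/p' rancher-full.log | tail -n 200
```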

It looks like you're running a single-Docker install. In my experience, insufficient CPU/memory causes this problem: in single-Docker mode the Rancher server container also runs an embedded k3s, and once compute resources become unstable and the k3s API starts flapping, this is exactly what you see.
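If you suspect resource starvation, a quick spot-check along those lines (a sketch; the container name `rancher` is an assumption):

```shell
# One-shot snapshot of the Rancher container's CPU/memory usage
# ("rancher" is an assumed container name -- substitute your own).
docker stats rancher --no-stream

# Host-side headroom: core count, load average, and memory.
# Sustained load well above the core count, or little available
# memory, matches the "resources unstable -> k3s API flaps" pattern.
nproc
uptime
grep -E 'MemTotal|MemAvailable' /proc/meminfo
```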

Everything after the FATAL is the restart log. Below are some abnormal k3s log lines from just before the restart:
time="2022-05-30T08:44:25.401443352Z" level=info msg="error in remotedialer server [400]: read tcp 172.17.0.2:6443->172.17.0.2:57898: i/o timeout"
E0530 08:44:25.458119 43 leaderelection.go:361] Failed to update lock: resource name may not be empty
I0530 08:44:26.451576 43 event.go:291] "Event occurred" object="kube-system/cloud-controller-manager" kind="Endpoints" apiVersion="v1" type="Normal" reason="LeaderElection" message="c95e7471a7ae_804013e6-b8e6-48ea-9fd9-80c464d5c6f8 stopped leading"
I0530 08:44:26.457128 43 trace.go:205] Trace[311098023]: "Create" url:/apis/project.cattle.io/v3/namespaces/u-tonu55gyjp/sourcecoderepositories,user-agent:rancher/v0.0.0 (linux/amd64) kubernetes/$Format,client:127.0.0.1 (30-May-2022 08:44:12.360) (total time: 11106ms):
Trace[311098023]: ---"Object stored in database" 1081ms (08:44:00.441)
Trace[311098023]: [11.106714268s] [11.106714268s] END
E0530 08:44:24.468878 43 leaderelection.go:361] Failed to update lock: resource name may not be empty
I0530 08:44:28.370601 43 leaderelection.go:278] failed to renew lease kube-system/cloud-controller-manager: timed out waiting for the condition
E0530 08:44:28.450274 43 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I0530 08:44:32.462581 43 trace.go:205] Trace[532708969]: "iptables Monitor CANARY check" (30-May-2022 08:44:06.909) (total time: 25553ms):
Trace[532708969]: [25.553247623s] [25.553247623s] END
I0530 08:44:32.462685 43 event.go:291] "Event occurred" object="" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="c95e7471a7ae_8a6aaacb-cf25-4c4f-a805-1e66bee22362 stopped leading"
I0530 08:44:32.462704 43 event.go:291] "Event occurred" object="kube-system/kube-controller-manager" kind="Endpoints" apiVersion="v1" type="Normal" reason="LeaderElection" message="c95e7471a7ae_8a6aaacb-cf25-4c4f-a805-1e66bee22362 stopped leading"
E0530 08:44:32.475183 43 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I0530 08:44:33.364741 43 leaderelection.go:278] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
I0530 08:44:33.429304 43 garbagecollector.go:146] Shutting down garbage collector controller
I0530 08:44:33.469206 43 cleaner.go:91] Shutting down CSR cleaner controller
I0530 08:44:34.424344 43 dynamic_serving_content.go:145] Shutting down csr-controller::/var/lib/rancher/k3s/server/tls/client-ca.crt::/var/lib/rancher/k3s/server/tls/client-ca.key
I0530 08:44:34.424430 43 dynamic_serving_content.go:145] Shutting down csr-controller::/var/lib/rancher/k3s/server/tls/client-ca.crt::/var/lib/rancher/k3s/server/tls/client-ca.key
I0530 08:44:34.424445 43 dynamic_serving_content.go:145] Shutting down csr-controller::/var/lib/rancher/k3s/server/tls/client-ca.crt::/var/lib/rancher/k3s/server/tls/client-ca.key
I0530 08:44:45.348533 43 gc_controller.go:100] Shutting down GC controller
I0530 08:44:45.369471 43 endpoints_controller.go:201] Shutting down endpoint controller
I0530 08:44:45.369510 43 clusterroleaggregation_controller.go:161] Shutting down ClusterRoleAggregator
I0530 08:44:45.369524 43 pv_protection_controller.go:95] Shutting down PV protection controller
I0530 08:44:45.382294 43 certificate_controller.go:130] Shutting down certificate controller "csrsigning-kubelet-client"
I0530 08:44:45.382344 43 certificate_controller.go:130] Shutting down certificate controller "csrsigning-kube-apiserver-client"
I0530 08:44:45.382354 43 certificate_controller.go:130] Shutting down certificate controller "csrsigning-kubelet-serving"
I0530 08:44:45.382363 43 certificate_controller.go:130] Shutting down certificate controller "csrsigning-legacy-unknown"
I0530 08:44:45.382375 43 stateful_set.go:158] Shutting down statefulset controller
I0530 08:44:45.382393 43 horizontal.go:180] Shutting down HPA controller
I0530 08:44:45.382407 43 daemon_controller.go:299] Shutting down daemon sets controller
I0530 08:44:45.382422 43 pvc_protection_controller.go:122] Shutting down PVC protection controller
I0530 08:44:45.382439 43 attach_detach_controller.go:362] Shutting down attach detach controller
I0530 08:44:45.382453 43 namespace_controller.go:212] Shutting down namespace controller
I0530 08:44:50.368978 43 horizontal.go:215] horizontal pod autoscaler controller worker shutting down
I0530 08:44:50.395830 43 event.go:291] "Event occurred" object="" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="c95e7471a7ae_804013e6-b8e6-48ea-9fd9-80c464d5c6f8 stopped leading"
I0530 08:44:51.437490 43 cronjob_controller.go:100] Shutting down CronJob Manager
I0530 08:44:52.429894 43 disruption.go:348] Shutting down disruption controller
I0530 08:44:53.416134 43 endpointslicemirroring_controller.go:224] Shutting down EndpointSliceMirroring controller
I0530 08:44:53.463512 43 dynamic_serving_content.go:145] Shutting down csr-controller::/var/lib/rancher/k3s/server/tls/client-ca.crt::/var/lib/rancher/k3s/server/tls/client-ca.key
I0530 08:44:45.459108 43 node_lifecycle_controller.go:589] Shutting down node controller
I0530 08:44:54.431976 43 serviceaccounts_controller.go:129] Shutting down service account controller
I0530 08:44:54.432115 43 endpointslice_controller.go:253] Shutting down endpoint slice controller
I0530 08:44:54.432143 43 expand_controller.go:315] Shutting down expand controller
I0530 08:44:54.432454 43 pv_controller_base.go:319] Shutting down persistent volume controller
I0530 08:44:54.446050 43 ttl_controller.go:130] Shutting down TTL controller

I'm seeing the same thing, identical logs, also a single-node online install started with `docker run`.

Check your monitoring: is there a disk I/O bottleneck when Rancher goes down? After I switched the disk to an SSD things improved, and crashes became much less frequent :sweat_smile:
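One way to confirm a disk bottleneck like this (a sketch: `iostat` comes from the sysstat package, and the probe path is an assumption, point it at whatever filesystem backs /var/lib/docker):

```shell
# Sample extended disk stats 5 times, 2 seconds apart. %util close
# to 100 with high await values indicates a saturated disk; the
# embedded k3s datastore is very sensitive to sync-write latency.
iostat -x 2 5

# Rough synchronous-write latency probe on the Docker data directory
# (the file name is just a scratch file; needs root for this path).
dd if=/dev/zero of=/var/lib/docker/.fsync-probe bs=4k count=500 oflag=dsync
rm -f /var/lib/docker/.fsync-probe
```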