Pod unhealthy && cannot exec into the container, yet the Pod is still in the Running state

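Rancher Server setup

  • Rancher version: v2.6.8
  • Installation option (Docker install/Helm Chart): Helm Chart
    • If Helm Chart, provide the Local cluster type (RKE1, RKE2, k3s, EKS, etc.) and version: rke1, k3s
  • Online or air-gapped deployment: online

Downstream cluster information

  • Kubernetes version: v1.24.4
  • Cluster Type (Local/Downstream):
    • If Downstream, what type of cluster? (Custom/Imported or Hosted, etc.): Local

User information

  • What is the role of the logged-in user? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom): Admin
    • If Custom, the custom permission set:

Host OS: CentOS Linux release 7.9.2009 (Core) / Linux 5.4.219-1.el7.elrepo.x86_64 x86_64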

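**Problem description:** The calico-kube-controllers pod is unhealthy: its probes fail and I cannot exec into the container, yet it stays in the Running state and has not been terminated. This has gone on for several days. What could be the cause?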
Unhealthy Pod calico-kube-controllers-868c4689cb-vcpln

Liveness probe errored: rpc error: code = Unknown desc = operation timeout: context deadline exceeded

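**Steps to reproduce:** None; it occurred on its own

**Result:** The problem occurred

**Expected result:** The Pod runs healthily, or, when it is unhealthy, it is terminated and a new Pod instance is started

Screenshots:



**Additional context:** None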

Logs
The following is the kubelet log

time="2022-11-11T01:25:44Z" level=error msg="Get \"http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c/stats?stream=0\": context deadline exceeded Failed to get stats from container 7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c"
time="2022-11-11T01:25:47Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:25:52Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:25:57Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:25:59Z" level=error msg="Get \"http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c/stats?stream=0\": context deadline exceeded Failed to get stats from container 7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c"
time="2022-11-11T01:26:02Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:26:08Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:26:13Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:26:14Z" level=error msg="Get \"http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c/stats?stream=0\": context deadline exceeded Failed to get stats from container 7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c"
time="2022-11-11T01:26:18Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:26:23Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:26:28Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:26:29Z" level=error msg="Get \"http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c/stats?stream=0\": context deadline exceeded Failed to get stats from container 7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c"
time="2022-11-11T01:26:33Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:26:38Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:26:43Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:26:44Z" level=error msg="Get \"http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c/stats?stream=0\": context deadline exceeded Failed to get stats from container 7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c"
time="2022-11-11T01:26:48Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:26:49Z" level=error msg="Get \"http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c/stats?stream=0\": context deadline exceeded Failed to get stats from container 7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c"
E1111 01:26:51.549342    4674 remote_runtime.go:680] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = operation timeout: context deadline exceeded" containerID="7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c" cmd=[/usr/bin/check-status -r]
time="2022-11-11T01:26:53Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
W1111 01:26:58.514976    4674 watcher.go:93] Error while processing event ("/sys/fs/cgroup/pids/system.slice/frpc.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/pids/system.slice/frpc.service: no such file or directory
time="2022-11-11T01:26:58Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:26:59Z" level=error msg="Get \"http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c/stats?stream=0\": context deadline exceeded Failed to get stats from container 7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c"
E1111 01:27:00.125954    4674 remote_runtime.go:680] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = operation timeout: context deadline exceeded" containerID="7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c" cmd=[/usr/bin/check-status -l]
time="2022-11-11T01:27:03Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:27:08Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:27:13Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:27:14Z" level=error msg="Get \"http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c/stats?stream=0\": context deadline exceeded Failed to get stats from container 7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c"
time="2022-11-11T01:27:18Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:27:23Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:27:28Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:27:29Z" level=error msg="Get \"http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c/stats?stream=0\": context deadline exceeded Failed to get stats from container 7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c"
time="2022-11-11T01:27:33Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
E1111 01:27:34.142014    4674 remote_runtime.go:578] "ContainerStatus from runtime service failed" err="rpc error: code = Unknown desc = operation timeout: context deadline exceeded" containerID="7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c"
E1111 01:27:34.142045    4674 container_log_manager.go:233] "Failed to get container status" err="rpc error: code = Unknown desc = operation timeout: context deadline exceeded" containerID="7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c"
time="2022-11-11T01:27:38Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:27:43Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:27:44Z" level=error msg="Get \"http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c/stats?stream=0\": context deadline exceeded Failed to get stats from container 7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c"
time="2022-11-11T01:27:49Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:27:54Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:27:59Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:27:59Z" level=error msg="Get \"http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c/stats?stream=0\": context deadline exceeded Failed to get stats from container 7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c"
time="2022-11-11T01:28:04Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:28:09Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:28:14Z" level=error msg="Get \"http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c/stats?stream=0\": context deadline exceeded Failed to get stats from container 7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c"
time="2022-11-11T01:28:14Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:28:19Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:28:24Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:28:29Z" level=error msg="Get \"http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c/stats?stream=0\": context deadline exceeded Failed to get stats from container 7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c"
time="2022-11-11T01:28:29Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:28:34Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:28:39Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:28:44Z" level=error msg="Get \"http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c/stats?stream=0\": context deadline exceeded Failed to get stats from container 7dcf78b1ea92a7fe939c8478a635cc8b4a691e970f5c186584cbb7477c357f1c"
time="2022-11-11T01:28:45Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
time="2022-11-11T01:28:50Z" level=info msg="Using CNI configuration file /etc/cni/net.d/10-canal.conflist"
E1111 01:28:50.550101    4674 remote_runtime.go:680] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = operation timeout: context deadline exceeded" containerID="7dcf78b1ea92a7fe939c8478a635cc8b4a691e97

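calico-kube-controllers is managed by a Deployment. If you cannot view its logs even with kubectl, try deleting this pod, wait for it to be recreated, and then check the logs again.

As for not being able to exec into it, that is normal. I can't get in either :smile:

Also, this looks like an RKE cluster, not the k3s cluster you mentioned at the start.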

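With kubectl I can see the probe-failure logs, and the newly started instance runs normally. Two questions:

  1. Is there a way to find out why the pod is unhealthy and why I cannot get into the container?
  2. How can the unhealthy Pod be terminated and a new Pod instance started automatically?

The cluster is RKE; the local cluster is k3s.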

Check how this Pod's liveness probe is configured, and try sending a request to that address from the kubelet's network to see what happens.
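
The probe commands can also be read straight out of the kubelet log posted above, since each failed `ExecSync` line carries a `cmd=[...]` field. Here is a minimal Python sketch, not an official tool, that tallies those failures per container and command. The function name `summarize_exec_failures` and the regular expression are assumptions of mine, based only on the line format shown in this thread; other kubelet versions may log differently.

```python
import re
from collections import Counter

# Matches kubelet lines of the form seen in this thread, e.g.
#   E1111 ... "ExecSync cmd from runtime service failed" err="..."
#   containerID="<64 hex chars>" cmd=[/usr/bin/check-status -r]
# The pattern is an assumption derived from that sample only.
EXEC_SYNC = re.compile(
    r'"ExecSync cmd from runtime service failed".*?'
    r'containerID="(?P<cid>[0-9a-f]+)"\s+cmd=\[(?P<cmd>[^\]]*)\]'
)


def summarize_exec_failures(lines):
    """Count failed ExecSync calls per (short container ID, command)."""
    counts = Counter()
    for line in lines:
        match = EXEC_SYNC.search(line)
        if match:
            # Keep the first 12 hex characters, like `docker ps` does.
            key = (match.group("cid")[:12], match.group("cmd"))
            counts[key] += 1
    return counts


def format_summary(counts):
    """Render the counts as text lines, most frequent first."""
    return [
        f"{n:4d}  {cid}  {cmd}"
        for (cid, cmd), n in counts.most_common()
    ]
```

To use it, pass any iterable of log lines, for example an open file handle for a saved copy of the kubelet log. Run on the excerpt above, it picks up `/usr/bin/check-status -r` and `/usr/bin/check-status -l` for container `7dcf78b1ea92` (presumably the readiness and liveness checks). That suggests the probes are commands executed inside the container rather than requests to a network address. The truncated last line of the excerpt is skipped because its `containerID` field is incomplete.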