Rancher 2.6.6 RKE1: cluster creation fails

Rancher Server settings

  • Rancher version: 2.6.6
  • Installation option (Docker install/Helm Chart): RKE1
  • Online or air-gapped deployment: online

Downstream cluster information

  • Kubernetes version: v1.21.13-rancher1-1
  • Cluster Type (Local/Downstream):
    • If Downstream, what type of cluster? (Custom/Imported or Hosted, etc.): local

Host operating system:

CentOS 7.8

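Problem description:

After creating a new cluster with RKE1 on Rancher 2.6.6, the cluster stays stuck at "Waiting for API to be available".

The log shows: [Disconnected] Cluster agent is not connected

Our project is already under way and we need these servers deployed right away, so this is urgent. Any help would be much appreciated.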

Screenshots:

Additional context:

Logs
The kubelet log keeps repeating these errors:

I1206 09:36:23.701062   24564 scope.go:110] "RemoveContainer" containerID="8e5d9a8b4f5580d5716c41cc68f2960a79686114e400f5ba514b8eebd6451635"
I1206 09:36:23.701392   24564 scope.go:110] "RemoveContainer" containerID="a675e7942523c05e58b97bd768327410a599bee98ff1f3af3bc02f692870969a"
E1206 09:36:23.701600   24564 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-register\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-5647775bd4-z66pf_cattle-system(a14b8612-692f-418b-b24f-a3d67c587596)\"" pod="cattle-system/cattle-cluster-agent-5647775bd4-z66pf" podUID=a14b8612-692f-418b-b24f-a3d67c587596
I1206 09:36:34.304729   24564 scope.go:110] "RemoveContainer" containerID="a675e7942523c05e58b97bd768327410a599bee98ff1f3af3bc02f692870969a"
E1206 09:36:34.304952   24564 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-register\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-5647775bd4-z66pf_cattle-system(a14b8612-692f-418b-b24f-a3d67c587596)\"" pod="cattle-system/cattle-cluster-agent-5647775bd4-z66pf" podUID=a14b8612-692f-418b-b24f-a3d67c587596
I1206 09:36:49.305169   24564 scope.go:110] "RemoveContainer" containerID="a675e7942523c05e58b97bd768327410a599bee98ff1f3af3bc02f692870969a"
E1206 09:36:49.305409   24564 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-register\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-5647775bd4-z66pf_cattle-system(a14b8612-692f-418b-b24f-a3d67c587596)\"" pod="cattle-system/cattle-cluster-agent-5647775bd4-z66pf" podUID=a14b8612-692f-418b-b24f-a3d67c587596
I1206 09:37:03.304580   24564 scope.go:110] "RemoveContainer" containerID="a675e7942523c05e58b97bd768327410a599bee98ff1f3af3bc02f692870969a"
E1206 09:37:03.304830   24564 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-register\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-5647775bd4-z66pf_cattle-system(a14b8612-692f-418b-b24f-a3d67c587596)\"" pod="cattle-system/cattle-cluster-agent-5647775bd4-z66pf" podUID=a14b8612-692f-418b-b24f-a3d67c587596
I1206 09:37:15.304338   24564 scope.go:110] "RemoveContainer" containerID="a675e7942523c05e58b97bd768327410a599bee98ff1f3af3bc02f692870969a"
E1206 09:37:15.304573   24564 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-register\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-5647775bd4-z66pf_cattle-system(a14b8612-692f-418b-b24f-a3d67c587596)\"" pod="cattle-system/cattle-cluster-agent-5647775bd4-z66pf" podUID=a14b8612-692f-418b-b24f-a3d67c587596
I1206 09:37:29.304068   24564 scope.go:110] "RemoveContainer" containerID="a675e7942523c05e58b97bd768327410a599bee98ff1f3af3bc02f692870969a"
E1206 09:37:29.304638   24564 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-register\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-5647775bd4-z66pf_cattle-system(a14b8612-692f-418b-b24f-a3d67c587596)\"" pod="cattle-system/cattle-cluster-agent-5647775bd4-z66pf" podUID=a14b8612-692f-418b-b24f-a3d67c587596
I1206 09:37:44.304003   24564 scope.go:110] "RemoveContainer" containerID="a675e7942523c05e58b97bd768327410a599bee98ff1f3af3bc02f692870969a"
E1206 09:37:44.304272   24564 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-register\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-5647775bd4-z66pf_cattle-system(a14b8612-692f-418b-b24f-a3d67c587596)\"" pod="cattle-system/cattle-cluster-agent-5647775bd4-z66pf" podUID=a14b8612-692f-418b-b24f-a3d67c587596
I1206 09:37:56.303872   24564 scope.go:110] "RemoveContainer" containerID="a675e7942523c05e58b97bd768327410a599bee98ff1f3af3bc02f692870969a"
E1206 09:37:56.304160   24564 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-register\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-5647775bd4-z66pf_cattle-system(a14b8612-692f-418b-b24f-a3d67c587596)\"" pod="cattle-system/cattle-cluster-agent-5647775bd4-z66pf" podUID=a14b8612-692f-418b-b24f-a3d67c587596
I1206 09:38:07.304629   24564 scope.go:110] "RemoveContainer" containerID="a675e7942523c05e58b97bd768327410a599bee98ff1f3af3bc02f692870969a"
E1206 09:38:07.304848   24564 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-register\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-5647775bd4-z66pf_cattle-system(a14b8612-692f-418b-b24f-a3d67c587596)\"" pod="cattle-system/cattle-cluster-agent-5647775bd4-z66pf" podUID=a14b8612-692f-418b-b24f-a3d67c587596
I1206 09:38:19.304954   24564 scope.go:110] "RemoveContainer" containerID="a675e7942523c05e58b97bd768327410a599bee98ff1f3af3bc02f692870969a"
E1206 09:38:19.305192   24564 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-register\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-5647775bd4-z66pf_cattle-system(a14b8612-692f-418b-b24f-a3d67c587596)\"" pod="cattle-system/cattle-cluster-agent-5647775bd4-z66pf" podUID=a14b8612-692f-418b-b24f-a3d67c587596
I1206 09:38:30.303782   24564 scope.go:110] "RemoveContainer" containerID="a675e7942523c05e58b97bd768327410a599bee98ff1f3af3bc02f692870969a"
E1206 09:38:30.304047   24564 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-register\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-5647775bd4-z66pf_cattle-system(a14b8612-692f-418b-b24f-a3d67c587596)\"" pod="cattle-system/cattle-cluster-agent-5647775bd4-z66pf" podUID=a14b8612-692f-418b-b24f-a3d67c587596
I1206 09:38:45.304680   24564 scope.go:110] "RemoveContainer" containerID="a675e7942523c05e58b97bd768327410a599bee98ff1f3af3bc02f692870969a"
E1206 09:38:45.304907   24564 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-register\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-5647775bd4-z66pf_cattle-system(a14b8612-692f-418b-b24f-a3d67c587596)\"" pod="cattle-system/cattle-cluster-agent-5647775bd4-z66pf" podUID=a14b8612-692f-418b-b24f-a3d67c587596

You need to look at the container logs of the cluster-agent on the downstream cluster.
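
The kubelet lines above already name the failing pod and container, so the log to fetch can be derived from them. A minimal sketch, assuming kubectl access to the downstream cluster (on the node itself, docker logs against the matching k8s_cluster-register_... container should show the same output). The function names crashloop_target and logs_command are made up for this example.

```python
import re

# pod="namespace/name" field and the failing container name, as they appear
# in the kubelet "Error syncing pod" lines posted above.
POD_RE = re.compile(r'pod="(?P<namespace>[^/"]+)/(?P<pod>[^"]+)"')
CONTAINER_RE = re.compile(r'failed container=(?P<container>\S+)')


def crashloop_target(line):
    """Return (namespace, pod, container) for a CrashLoopBackOff line, else None."""
    if "CrashLoopBackOff" not in line:
        return None
    pod = POD_RE.search(line)
    container = CONTAINER_RE.search(line)
    if not pod or not container:
        return None
    return pod.group("namespace"), pod.group("pod"), container.group("container")


def logs_command(namespace, pod, container):
    """Build a kubectl command that prints the logs of the previous (crashed) run."""
    return ["kubectl", "-n", namespace, "logs", pod, "-c", container, "--previous"]


# One of the kubelet lines from the post, copied unchanged.
sample = r'E1206 09:36:23.701600   24564 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-register\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-register pod=cattle-cluster-agent-5647775bd4-z66pf_cattle-system(a14b8612-692f-418b-b24f-a3d67c587596)\"" pod="cattle-system/cattle-cluster-agent-5647775bd4-z66pf" podUID=a14b8612-692f-418b-b24f-a3d67c587596'

target = crashloop_target(sample)
if target:
    print(" ".join(logs_command(*target)))
```

The --previous flag asks for the log of the last terminated run, which is usually the useful one when a container is in CrashLoopBackOff.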

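Found the problem.
Once RKE finishes deploying the cluster, the k8s_cluster-register_cattle-cluster-agent container reports back to Rancher that the deployment is complete. However, the hosts file inside that container had no entry for our Rancher server's domain name (the cluster and Rancher are not on the same network), so the container could not reach Rancher and registration never succeeded. I manually edited /etc/hosts inside the container to add our domain, and registration went through. I'm not sure whether this is a bug or a missing feature.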
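
A quick way to confirm this kind of failure is to test name resolution from the same environment as the agent rather than from the node. A minimal sketch using only the Python standard library; it assumes you can run Python 3 somewhere that shares the pod's resolver (for example a throwaway pod in the downstream cluster), and rancher.example.com is a placeholder for the real Rancher server hostname.

```python
import socket


def can_resolve(hostname):
    """Return True if the local resolver (hosts file or DNS) finds an address."""
    try:
        # The port is only needed to complete the lookup call; no connection is made.
        socket.getaddrinfo(hostname, 443)
    except socket.gaierror:
        return False
    return True


# Usage (placeholder hostname, replace with the Rancher server's name):
#   print(can_resolve("rancher.example.com"))
```

If this returns True on the node but False inside the cluster, the name is only known to the node, which matches the symptom described above.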

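This is neither a bug nor a missing feature; it is by design.

You should use a domain name that resolves normally. Relying on custom hosts entries is not a sound IT setup to begin with, and the program does not account for special cases like this.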

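Right, it was indeed caused by my own domain pointing to an internal IP.

However, before registering the new cluster with RKE, I had already mapped the domain to its IP on the host machines. Otherwise the cluster could not have been created at all.

This was also what threw me off most while troubleshooting: I had already pointed the domain to the public IP and the cluster had mostly deployed successfully, with only the last step failing, so it did not occur to me that name resolution could be the problem. In the end I had to go through the logs of every container one by one to find the cause. The k8s-related containers all inherit my host's hosts file, while the other Rancher-related containers do not.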
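
For pods that do not use host networking, the kubelet writes the pod's /etc/hosts itself, so entries added by hand on the node do not show up there. A small sketch that makes the gap visible by comparing the two files; the hosts contents below are illustrative only (203.0.113.10, 10.42.0.5 and rancher.example.com are placeholders, not values from this thread), and in practice you would paste in the node's file and the output of cat /etc/hosts from inside the container.

```python
def hosts_names(text):
    """Return the set of hostnames defined in hosts-file text."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        fields = line.split()
        if len(fields) >= 2:
            names.update(fields[1:])  # fields[0] is the IP address
    return names


def missing_in_container(host_text, container_text):
    """Hostnames the node knows from its hosts file but the container does not."""
    return sorted(hosts_names(host_text) - hosts_names(container_text))


# Placeholder contents for illustration.
node_hosts = """\
127.0.0.1 localhost
203.0.113.10 rancher.example.com  # added by hand on the node
"""
pod_hosts = """\
# Kubernetes-managed hosts file.
127.0.0.1 localhost
10.42.0.5 cattle-cluster-agent-5647775bd4-z66pf
"""
print(missing_in_container(node_hosts, pod_hosts))
```

Any name this reports is one that processes on the node can resolve through the hosts file while the container has to get it from DNS.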

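Because I was not familiar with how Rancher's registration process works, this was very hard to troubleshoot and I could not pinpoint the problem.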