RKE installation fails

**RKE version:** v1.2.8

**Docker version:** (docker version, docker info)
Server: Docker Engine - Community
Engine:
Version: 20.10.0
containerd:
Version: v1.4.3

**Operating system and kernel:** (cat /etc/os-release, uname -r)

Kernel Version: 4.19.90-2112.8.0.0131.oe1.x86_64
Operating System: openEuler 20.03 (LTS-SP3)

**Host type and provider:** (VirtualBox/Bare-metal/AWS/GCE/DO)
VMware
**cluster.yml file:**
nodes:
  - address: 192.168.159.3
    user: rancher
    role: ["controlplane", "etcd", "worker"]
  - address: 192.168.159.4
    user: rancher
    # role: ["worker", "etcd"]
    role: ["controlplane", "etcd", "worker"]

private_registries:
  - url: ******
    user: *********
    password: *********
    is_default: true

services:
  etcd:
    backup_config:
      enabled: true
      interval_hours: 12
      retention: 6
    extra_args:
      quota-backend-bytes: "6442450944"
      auto-compaction-retention: 240

  kube-api:
    extra_args:
      watch-cache: true
      default-watch-cache-size: 1500
      # Event retention time, default 1 hour
      event-ttl: 1h0m0s
      # Default 400; 0 means unlimited. As a rule of thumb, about 15 in-flight requests per 25-30 Pods.
      max-requests-inflight: 800
      # Default 200; 0 means unlimited
      max-mutating-requests-inflight: 400
      http2-max-streams-per-connection: 1000

  kube-controller:
    extra_args:
      # Per-node subnet size (CIDR mask length). Default is 24 (254 usable IPs); 23 gives 510; 22 gives 1022.
      node-cidr-mask-size: "24"
      # How often the controller checks that node communication is healthy; default 5s
      node-monitor-period: "5s"
      ## After node communication fails, Kubernetes waits this long before marking the node NotReady.
      ## This period must be N times the kubelet's nodeStatusUpdateFrequency (default 10s),
      ## where N is the number of retries the kubelet is allowed for posting node status; default 40s.
      node-monitor-grace-period: "20s"
      ## If communication keeps failing for this additional period, Kubernetes marks the node unhealthy; default 1m0s.
      node-startup-grace-period: "30s"
      ## If the node stays unreachable even longer, Kubernetes starts migrating the lost node's Pods; default 5m0s.
      pod-eviction-timeout: "1m"

      # Default 5. Number of deployments synced concurrently.
      concurrent-deployment-syncs: 5
      # Default 5. Number of endpoints synced concurrently.
      concurrent-endpoint-syncs: 5
      # Default 20. Number of garbage collector workers running concurrently.
      concurrent-gc-syncs: 20
      # Default 10. Number of namespaces synced concurrently.
      concurrent-namespace-syncs: 10
      # Default 5. Number of replica sets synced concurrently.
      concurrent-replicaset-syncs: 5
      # Default 5m0s. Number of resource quotas synced concurrently. (Deprecated in newer versions.)
      # concurrent-resource-quota-syncs: 5m0s
      # Default 1. Number of services synced concurrently.
      concurrent-service-syncs: 1
      # Default 5. Number of service account tokens synced concurrently.
      concurrent-serviceaccount-token-syncs: 5
      # Default 30s. Deployment sync period.
      deployment-controller-sync-period: 30s
      # Default 15s. PV and PVC sync period.
      pvclaimbinder-sync-period: 15s

  kubelet:
    extra_args:
      # MTU value passed to the network plugin, overriding its default; 0 means use the default of 1460
      network-plugin-mtu: "1500"
      # Maximum number of Pods per node
      max-pods: "250"
      # Sync interval for Secrets and ConfigMaps, default 1 minute
      sync-frequency: "3s"
      # Number of files the kubelet process may open (default 1000000); adjust to the node's capacity
      max-open-files: "2000000"
      # Burst size for requests to the apiserver, default 10
      kube-api-burst: "30"
      # QPS for requests to the apiserver, default 5; QPS = concurrency / average response time
      kube-api-qps: "15"
      # By default the kubelet pulls one image at a time; set to false to pull several images in parallel.
      # This requires the overlay2 storage driver, and Docker's download concurrency should be raised
      # accordingly (see the Docker configuration).
      serialize-image-pulls: "false"
      # Maximum burst of image pulls; bursts may not exceed registry-qps.
      # Only used when registry-qps is greater than 0 (default 10). If registry-qps is 0, pulls are not rate-limited (default 5).
      registry-burst: "10"
      registry-qps: "0"
      cgroups-per-qos: "true"
      cgroup-driver: "cgroupfs"
      # Node resource reservation
      enforce-node-allocatable: "pods"
      system-reserved: "cpu=0.25,memory=200Mi"
      kube-reserved: "cpu=0.25,memory=1500Mi"

      eviction-hard: "memory.available<300Mi,nodefs.available<5%,imagefs.available<5%,nodefs.inodesFree<5%"
      ## Soft eviction thresholds
      ### The following four settings work together. When available resources drop below these values but stay
      ### above the hard thresholds, the kubelet waits for eviction-soft-grace-period, re-checking every 10s;
      ### if the last check still exceeds the soft threshold, eviction starts. Eviction does not kill the Pod
      ### outright: a stop signal is sent first, then the kubelet waits for eviction-max-pod-grace-period;
      ### if the Pod has still not exited after that, it is force-killed.
      eviction-soft: "memory.available<500Mi,nodefs.available<10%,imagefs.available<10%,nodefs.inodesFree<10%"
      eviction-soft-grace-period: "memory.available=1m30s,nodefs.available=1m30s,imagefs.available=1m30s,nodefs.inodesFree=1m30s"
      eviction-max-pod-grace-period: "30"
      eviction-pressure-transition-period: "30s"

  kubeproxy:
    extra_args:
      # Burst size for requests to the Kubernetes apiserver, default 10
      kube-api-burst: 20
      # QPS for requests to the Kubernetes apiserver, default 5; QPS = concurrency / average response time
      kube-api-qps: 10

**Steps to reproduce:**
su - rancher
./rke up
**Result:**
INFO[0000] Running RKE version: v1.2.8
INFO[0000] Initiating Kubernetes cluster
INFO[0001] [certificates] GenerateServingCertificate is disabled, checking if there are unused kubelet certificates
INFO[0001] [certificates] Generating admin certificates and kubeconfig
INFO[0001] Successfully Deployed state file at [./cluster.rkestate]
INFO[0001] Building Kubernetes cluster
INFO[0001] [dialer] Setup tunnel for host [192.168.159.4]
INFO[0001] [dialer] Setup tunnel for host [192.168.159.3]
WARN[0001] Failed to set up SSH tunneling for host [192.168.159.3]: Can't retrieve Docker Info: error during connect: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info": Unable to access node with address [192.168.159.3:22] using SSH. Please check if you are able to SSH to the node using the specified SSH Private Key and if you have configured the correct SSH username. Error: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
WARN[0001] Failed to set up SSH tunneling for host [192.168.159.4]: Can't retrieve Docker Info: error during connect: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info": Unable to access node with address [192.168.159.4:22] using SSH. Please check if you are able to SSH to the node using the specified SSH Private Key and if you have configured the correct SSH username. Error: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
WARN[0001] Removing host [192.168.159.3] from node lists
WARN[0001] Removing host [192.168.159.4] from node lists
FATA[0001] Cluster must have at least one etcd plane host: failed to connect to the following etcd host(s) [192.168.159.3]

To install a K8s cluster with RKE, you first need to set up passwordless SSH login to the nodes. In addition, the user configured in cluster.yml must have permission to use Docker.
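
For example, a minimal sketch of setting up the passwordless login from the machine that runs rke (the key path and node IPs here are just examples from this thread, adjust to your environment):

```bash
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa   # skip if a key pair already exists
ssh-copy-id rancher@192.168.159.3            # install the public key on each node
ssh-copy-id rancher@192.168.159.4
ssh rancher@192.168.159.3 true               # should succeed without a password prompt
```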

Based on your logs, the rancher user may not have been added to the docker group. You can confirm with the following command:

ssh rancher@192.168.159.3 docker ps

To add the user to the docker group, see: Rancher Docs: Requirements
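
For reference, adding the user to the docker group on each node would look roughly like this (a sketch; it assumes the docker group was created by the Docker installation):

```bash
# Run on each node:
sudo usermod -aG docker rancher
# Log out and back in (or open a new SSH session) so the group change takes effect,
# then verify from the machine running rke:
ssh rancher@192.168.159.3 docker ps
```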

Passwordless login and the rancher user's permission to use Docker were both already set up, and I verified them before starting RKE, yet this error still appears when RKE runs.

Then please post a screenshot of running this command:

ssh rancher@192.168.159.3 docker ps

Found the cause: TCP forwarding was not enabled in sshd_config.
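
For other users who hit the same error, the change described here would look roughly like the following on each node (a sketch; the restart command assumes systemd):

```bash
# In /etc/ssh/sshd_config, make sure TCP forwarding is allowed:
#   AllowTcpForwarding yes
# Then restart sshd so the change takes effect:
sudo systemctl restart sshd
```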

Could you post the exact SSH configuration, so other users can refer to it?

One question though: our RKE and Rancher installs are done as a one-click Ansible deployment (on CentOS 7.6), and this new project uses Huawei openEuler. On the old systems we never enabled AllowTcpForwarding yes and the install still worked. Is the difference caused by the operating system?

Maybe, I'm not sure about that.
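
If you want to compare the two systems, you can dump sshd's effective configuration on each node and check the value there (a generic OpenSSH check, not RKE-specific):

```bash
# Print the effective sshd setting (run as root on each node):
sudo sshd -T | grep -i allowtcpforwarding
```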