Background:

On a Kubernetes cluster installed with sealos, re-adding a node after deleting it fails.

Error: failed to add hosts: run command /var/lib/sealos/data/default/rootfs/opt/sealctl hosts add --ip 10.0.0.Master --domain sealos.hub on 10.0.0.Node:22, output: , error: Process exited with status 139 from signal SEGV,

Environment:

Application Info
os Ubuntu
sealos v4.2.2
Kubernetes v1.25.7

Problem 1: checksum verification fails during file transfer

Steps:

sealos delete --nodes 10.0.0.X 
sealos add --nodes 10.0.0.X 

The command fails with:

sha256 sum not match

The sealos --debug flag is a global option that turns on debug mode, which gives a more detailed view of what the tool is doing when something goes wrong.

sealos add --nodes 10.0.0.X  --debug

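sealos compares checksums when it transfers files, and that comparison is what raises the error. A GitHub issue offers a workaround: turn the check off.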

#  export SEALOS_SCP_CHECKSUM=false

With the check disabled, a different problem appeared:

Problem 2: sealctl fails to add the host entry

Error: failed to add hosts: run command /var/lib/sealos/data/default/rootfs/opt/sealctl hosts add --ip 10.0.0.Master --domain sealos.hub on 10.0.0.Node:22, output: , error: Process exited with status 139 from signal SEGV,
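The status 139 in this message follows the usual shell convention: an exit status above 128 means the process was killed by signal (status - 128). A minimal sketch of that arithmetic, where signal_from_status is a hypothetical helper name chosen for this example:

```shell
#!/usr/bin/env bash
# Decode a shell-style exit status into a signal name.
# Convention: status > 128 means "terminated by signal (status - 128)".
signal_from_status() {
  local status=$1
  if [ "$status" -gt 128 ]; then
    kill -l "$((status - 128))"   # map the signal number to its name
  else
    echo "none"                   # ordinary exit, no signal involved
  fi
}

signal_from_status 139   # prints SEGV on Linux
```

Here 139 - 128 = 11, which is SIGSEGV on Linux and matches the "from signal SEGV" part of the error: sealctl crashed on the node rather than returning an ordinary failure.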

Details from debug mode:

2023-06-28T09:56:43 debug show registry info, IP: 10.0.0.X:22, Domain: sealos.hub, Data: /var/lib/registry
2023-06-28T09:56:43 debug start to exec /var/lib/sealos/data/default/rootfs/opt/sealctl hosts add --ip 10.0.0.X --domain sealos.hub on 10.0.0.M:22
2023-06-28T09:56:44 error Applied to cluster error: failed to add hosts: run command /var/lib/sealos/data/default/rootfs/opt/sealctl hosts add --ip 10.0.0.X --domain sealos.hub on 10.0.0.M:22, output: , error: Process exited with status 139 from signal SEGV,
2023-06-28T09:56:44 debug save objects into local: /root/.sealos/default/Clusterfile, objects: [apiVersion: apps.sealos.io/v1beta1
kind: Cluster

As the log shows, the step that fails is "sealctl hosts add --ip 10.0.0.X --domain sealos.hub". The sealctl file is copied over from the master:

/var/lib/containers/storage/overlay/5dd64dde7bc046cfd7554458e2950c41c0d86536e4e96532cfa2ee60685404d4/merged/opt to dst /var/lib/sealos/data/default/rootfs/opt
2023-06-28T09:56:40 debug remote copy files src /var/lib/containers/storage/overlay/44ce75c1b3e483c61ecfbc07a72f87da8627ce40d9df9451f2d04dfad8dffc65/merged/opt to dst /var/lib/sealos/data/default/rootfs/opt
2023-06-28T09:56:42 debug remote copy files src

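When we copied the file to the node by hand, it ran normally, so our first conclusion was that the file gets corrupted during the transfer. In the issue discussion the suggestion was: "the scp process may be at fault; compare checksums to see whether something is wrong: export SEALOS_SCP_CHECKSUM=true". With the check turned back on, "sha256 sum not match" appeared again, so this did not help.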
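To confirm by hand whether the copy on the node differs from the one on the master, the two digests can be compared directly. The following is a rough sketch, assuming sha256sum (coreutils) is installed on both machines and key-based SSH access to the node is available. The function names are hypothetical and are not part of sealos.

```shell
#!/usr/bin/env bash
# Print only the hex SHA-256 digest of a local file.
sha256_of() {
  sha256sum "$1" | awk '{print $1}'
}

# Succeed (exit 0) only if two local files have the same digest.
same_sha256() {
  [ "$(sha256_of "$1")" = "$(sha256_of "$2")" ]
}

# Compare a local file with the file at the same path on a remote host.
# Assumption: "$host" accepts SSH logins and has sha256sum installed.
remote_matches() {
  local host=$1 path=$2 remote
  remote=$(ssh "$host" "sha256sum '$path'" | awk '{print $1}')
  [ "$(sha256_of "$path")" = "$remote" ]
}
```

Run on the master, a call such as remote_matches root@10.0.0.X /var/lib/sealos/data/default/rootfs/opt/sealctl returns success only when the node holds a byte-identical copy. That separates a corrupted transfer from a problem with the binary itself.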

Resolution

This is a bug in v4.2.2; hopefully the next release fixes it. The temporary workaround we ended up with is to reset the cluster:

# sealos reset 

After resetting the cluster, adding the node worked. Keep in mind that a reset tears down the whole cluster.

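Summary

This post records how we investigated and worked around the failure to re-add a node after deleting it on sealos v4.2.2. The cause is a problem in sealos v4.2.2 itself. First, this version fails when copying the sealctl file to the node: the sha256 comparison does not match, and our investigation points to the file being corrupted in transit. Second, resetting the cluster is a temporary way to recover. Hope this helps.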


Background:

On a Kubernetes cluster installed with sealos, the master node shows a very large number of ports in use.

Environment:

Application Info
os Ubuntu
sealos v4.1.7
Kubernetes v1.23.9

Investigation

Count established connections per local address and sort by count:

netstat -anp | grep ESTABLISHED | awk '{print $4}' | sort | uniq -c | sort -n
    28 10.0.0.101:6443
    111 127.0.0.1:2379
  9009 10.0.0.101:5000

Find the process behind the local port:

  lsof -i :5000
  image-cri 2811249 root 4120u  IPv4 291818320      0t0  TCP master1:41710->master1:5000 (ESTABLISHED)

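So the connections belong to image-cri-shim (lsof truncates the command name to image-cri).

How image-cri-shim works

image-cri-shim is a gRPC shim built on CRI (Container Runtime Interface) and kubelet. CRI is the interface Kubernetes uses to talk to container runtimes, and kubelet is the Kubernetes component that maintains container state and manages node-level resources.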
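The shim's logs later in this post show its visible effect: a request for k8s.gcr.io/pause:3.6 is answered as sealos.hub:5000/pause:3.6. The sketch below only illustrates that name mapping and is not the shim's actual code. The function name rewrite_image and the rule used to detect a registry host are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Illustrative only: map an image reference onto a local registry.
# Assumption: the first path component is a registry host when it
# contains '.' or ':' (or is "localhost"); otherwise it is kept.
rewrite_image() {
  local image=$1 registry=${2:-sealos.hub:5000}
  local first=${image%%/*}
  case "$image" in
    */*)
      case "$first" in
        *.*|*:*|localhost) image=${image#*/} ;;  # drop the registry host
      esac ;;
  esac
  echo "$registry/$image"
}

rewrite_image k8s.gcr.io/pause:3.6   # prints sealos.hub:5000/pause:3.6
```

The real shim does more than this, for instance authenticating against the registry, as the criOfflineAuth log line suggests. The sketch only covers the renaming step.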

Common operations:

  • Start the service: systemctl start image-cri-shim
  • Stop the service: systemctl stop image-cri-shim
  • Restart the service: systemctl restart image-cri-shim
  • Check the service status: systemctl status image-cri-shim
  • Follow the logs: journalctl -u image-cri-shim -f

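Reproducing the problem

  1. Install the same sealos version, 4.1.7, in a test environment. The steps are not repeated here; see https://sealos.io/zh-Hans/docs/getting-started/kuberentes-life-cycle

  2. After installation, the image-cri-shim version turns out to be wrong (4.1.3):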

image-cri-shim --version
image-cri-shim version 4.1.3-ed0a75b9
  3. Update image-cri-shim to 4.1.7:
wget https://github.com/labring/sealos/releases/download/v4.1.7/sealos_4.1.7_linux_amd64.tar.gz && tar xvf sealos_4.1.7_linux_amd64.tar.gz image-cri-shim
sealos exec -r master,node "systemctl stop image-cri-shim"
sealos scp "./image-cri-shim" "/usr/bin/image-cri-shim"
sealos exec -r master,node "systemctl start image-cri-shim"
sealos exec -r master,node "image-cri-shim -v"

Output similar to the following indicates success:

image-cri-shim version 4.1.7-ed0a75b9
192.168.1.101:22: image-cri-shim version 4.1.7-ed0a75b9
192.168.1.102:22: image-cri-shim version 4.1.7-ed0a75b9
  4. After the restart, watch the logs:
root@master1:~# journalctl -u image-cri-shim -f
-- Logs begin at Wed 2023-04-19 22:42:32 UTC. --
May 25 06:49:38 master1 image-cri-shim[1116090]: 2023-05-25T06:49:38 info actual imageName: pause:3.6
May 25 06:49:38 master1 image-cri-shim[1116090]: 2023-05-25T06:49:38 info image: k8s.gcr.io/pause:3.6, newImage: sealos.hub:5000/pause:3.6, action: ImageStatus
May 25 06:54:38 master1 image-cri-shim[1116090]: 2023-05-25T06:54:38 info actual imageName: pause:3.6
May 25 06:54:38 master1 image-cri-shim[1116090]: 2023-05-25T06:54:38 info image: k8s.gcr.io/pause:3.6, newImage: sealos.hub:5000/pause:3.6, action: ImageStatus
May 25 06:59:38 master1 image-cri-shim[1116090]: 2023-05-25T06:59:38 info actual imageName: pause:3.6
May 25 06:59:38 master1 image-cri-shim[1116090]: 2023-05-25T06:59:38 info image: k8s.gcr.io/pause:3.6, newImage: sealos.hub:5000/pause:3.6, action: ImageStatus
  5. Monitor how the connection count changes:
netstat -na | grep 5000 | wc -l
96
116
120
...
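  6. Every 5 minutes the number of ports in use grows by about 20, in step with the 5-minute interval of the ImageStatus log entries. The problem is reproduced; next, let's see how to deal with it.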
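The manual sampling in step 5 can be automated. The following sketch uses the same netstat -na source as above but matches the port more strictly, so that a port such as 50001 is not counted. The names count_port and watch_port are hypothetical.

```shell
#!/usr/bin/env bash
# Count netstat-style lines (read from stdin) in which the given port
# appears as a local or remote port, i.e. ":<port>" followed by a space.
count_port() {
  grep -c -- ":$1 " || true   # grep -c prints 0 but exits 1 on no match
}

# Print "<time> <count> [delta=<change>]" every <interval> seconds.
# Runs until interrupted with Ctrl-C.
watch_port() {
  local port=$1 interval=${2:-300} prev="" now
  while true; do
    now=$(netstat -na | count_port "$port")
    if [ -n "$prev" ]; then
      echo "$(date +%T) $now delta=$((now - prev))"
    else
      echo "$(date +%T) $now"
    fi
    prev=$now
    sleep "$interval"
  done
}
```

Calling watch_port 5000 300 prints one sample every 5 minutes together with the change since the previous sample, which makes steady growth easy to spot.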

Upgrading to 4.2.0

An issue on the sealos GitHub repository offers a fix: upgrade image-cri-shim to 4.2.0.

wget https://github.com/labring/sealos/releases/download/v4.2.0/sealos_4.2.0_linux_amd64.tar.gz && tar xvf sealos_4.2.0_linux_amd64.tar.gz image-cri-shim
sealos exec -r master,node "systemctl stop image-cri-shim"
sealos scp "./image-cri-shim" "/usr/bin/image-cri-shim"
sealos exec -r master,node "systemctl start image-cri-shim"
sealos exec -r master,node "image-cri-shim -v"

The following output indicates success:

image-cri-shim version 4.2.0-f696a621
192.168.1.101:22: image-cri-shim version 4.2.0-f696a621
192.168.1.102:22: image-cri-shim version 4.2.0-f696a621
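With more than a couple of nodes, checking this output by eye gets tedious. A small sketch that reads version lines in the format shown above and succeeds only if every line reports the expected version follows. The name all_nodes_at_version is hypothetical, and it is an assumption that the version lines arrive on standard output.

```shell
#!/usr/bin/env bash
# Read "image-cri-shim version X.Y.Z-<commit>" lines on stdin.
# Exit 0 only if at least one such line was seen and all of them
# report the wanted version; print any mismatching line.
all_nodes_at_version() {
  local want=$1
  awk -v want="$want" '
    /image-cri-shim version/ {
      seen++
      v = $NF              # last field, e.g. 4.2.0-f696a621
      sub(/-.*/, "", v)    # strip the commit suffix
      if (v != want) { bad++; print "mismatch: " $0 }
    }
    END { exit ((seen == 0 || bad > 0) ? 1 : 0) }
  '
}
```

For example, sealos exec -r master,node "image-cri-shim -v" | all_nodes_at_version 4.2.0 would fail if any node still ran the old binary.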

Verifying the fix

  1. Watch the image-cri-shim logs:
root@master1:~# journalctl -u image-cri-shim -f
-- Logs begin at Wed 2023-04-19 22:42:32 UTC. --
May 25 07:44:09 master1 image-cri-shim[1159432]: 2023-05-25T07:44:09 info Timeout: {15m0s}
May 25 07:44:09 master1 image-cri-shim[1159432]: 2023-05-25T07:44:09 info criRegistryAuth: map[]
May 25 07:44:09 master1 image-cri-shim[1159432]: 2023-05-25T07:44:09 info criOfflineAuth: map[sealos.hub:5000:{Username:admin Password:passw0rd Auth: Email: ServerAddress:http://sealos.hub:5000 IdentityToken: RegistryToken:}]
May 25 07:44:09 master1 image-cri-shim[1159432]: 2023-05-25T07:44:09 info socket info shim: /var/run/image-cri-shim.sock ,image: /run/containerd/containerd.sock, registry: http://sealos.hub:5000
May 25 07:44:09 master1 image-cri-shim[1159432]: 2023-05-25T07:44:09 info changed ownership of socket "/var/run/image-cri-shim.sock" to root/root
May 25 07:44:09 master1 image-cri-shim[1159432]: 2023-05-25T07:44:09 info changed permissions of socket "/var/run/image-cri-shim.sock" to -rw-rw----
May 25 07:44:41 master1 image-cri-shim[1159432]: 2023-05-25T07:44:41 info image: k8s.gcr.io/pause:3.6, newImage: sealos.hub:5000/pause:3.6, action: ImageStatus
May 25 07:49:42 master1 image-cri-shim[1159432]: 2023-05-25T07:49:42 info image: k8s.gcr.io/pause:3.6, newImage: sealos.hub:5000/pause:3.6, action: ImageStatus
May 25 07:54:42 master1 image-cri-shim[1159432]: 2023-05-25T07:54:42 info image: k8s.gcr.io/pause:3.6, newImage: sealos.hub:5000/pause:3.6, action: ImageStatus
  2. Monitor how the connection count changes:
root@master1:~# netstat -na | grep 5000 | wc -l
1 
6 
1
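  3. Requests still occur, but the number of ports in use no longer grows, so the problem appears to be fixed.

PS: We reproduced the problem on Kubernetes v1.23.9. In day-to-day work we have also seen a similar problem on another version, v1.25.7; verification on that version will be added in a later update.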

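Summary

This post records how we investigated, reproduced, and fixed the large number of ports in use on the master node of a Kubernetes cluster installed with sealos. The cause is the image-cri-shim version: the old version sends frequent requests to the master and opens a large number of connections, each of which occupies a port. Upgrading image-cri-shim resolves the problem.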


jefffff
