Background:

On a Kubernetes cluster installed with sealos, the master node showed a very large number of ports in use.

Environment:

Component    Details
OS           Ubuntu
sealos       v4.1.7
Kubernetes   v1.23.9

Troubleshooting

Count the ESTABLISHED connections per local address and sort by count:

netstat -anp | grep ESTABLISHED | awk '{print $4}' | sort | uniq -c | sort -n
     28 10.0.0.101:6443
    111 127.0.0.1:2379
   9009 10.0.0.101:5000
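
The same count can be wrapped in a small helper. This is a minimal sketch, not part of the original investigation: the function name count_established_by_local is made up for illustration, and it assumes the Linux netstat column layout used above (local address in column 4, state in column 6). Matching on the state column instead of using grep avoids counting lines that merely contain the word ESTABLISHED somewhere else.

```shell
# Hypothetical helper: read netstat output on stdin and print
# "<count> <local address>" for ESTABLISHED TCP connections, smallest first.
count_established_by_local() {
  awk '$1 ~ /^tcp/ && $6 == "ESTABLISHED" { n[$4]++ }
       END { for (a in n) printf "%d %s\n", n[a], a }' | sort -n
}

# Usage:
#   netstat -anp | count_established_by_local
```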

Find the process behind that local port:

  lsof -i :5000
  image-cri 2811249 root 4120u  IPv4 291818320      0t0  TCP master1:41710->master1:5000 (ESTABLISHED)

So the connections belong to image-cri-shim (lsof truncates command names to 9 characters by default, hence "image-cri").

How image-cri-shim works

image-cri-shim is a gRPC shim built on CRI (Container Runtime Interface) and kubelet. CRI is the interface Kubernetes uses to talk to container runtimes, and kubelet is the Kubernetes component that maintains container state and manages node-level resources.
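
The logs later in this post show one visible effect of the shim: an image such as k8s.gcr.io/pause:3.6 is reported as "actual imageName: pause:3.6" and then as "newImage: sealos.hub:5000/pause:3.6". The sketch below only imitates that rewrite so the log lines are easier to read. It is not the shim's implementation: the function name rewrite_image is hypothetical, and the rule for spotting a registry host (the first path component contains a dot or colon, or is "localhost") is an assumption borrowed from the usual image-reference convention.

```shell
# Hypothetical illustration of the rename seen in the logs.
# $1: image reference, $2: target registry (defaults to sealos.hub:5000)
rewrite_image() {
  ri_image=$1
  ri_registry=${2:-sealos.hub:5000}
  ri_name=$ri_image
  case "$ri_image" in
    */*)
      ri_first=${ri_image%%/*}
      # Assumption: a first component with "." or ":" (or "localhost") is a registry host.
      case "$ri_first" in
        *.*|*:*|localhost) ri_name=${ri_image#*/} ;;
      esac
      ;;
  esac
  printf '%s/%s\n' "$ri_registry" "$ri_name"
}

# Usage:
#   rewrite_image k8s.gcr.io/pause:3.6
```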

Common operations:

  • Start the service: systemctl start image-cri-shim
  • Stop the service: systemctl stop image-cri-shim
  • Restart the service: systemctl restart image-cri-shim
  • Check the service status: systemctl status image-cri-shim
  • Follow the logs: journalctl -u image-cri-shim -f

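Reproducing the problem

  1. Install the same sealos version (4.1.7) in a test environment. The installation is not covered here; see https://sealos.io/zh-Hans/docs/getting-started/kuberentes-life-cycle

  2. After installation, image-cri-shim turns out to be at the wrong version (4.1.3):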

image-cri-shim --version
image-cri-shim version 4.1.3-ed0a75b9
  3. Update image-cri-shim to 4.1.7:
wget https://github.com/labring/sealos/releases/download/v4.1.7/sealos_4.1.7_linux_amd64.tar.gz && tar xvf sealos_4.1.7_linux_amd64.tar.gz image-cri-shim
sealos exec -r master,node "systemctl stop image-cri-shim"
sealos scp "./image-cri-shim" "/usr/bin/image-cri-shim"
sealos exec -r master,node "systemctl start image-cri-shim"
sealos exec -r master,node "image-cri-shim -v"

Output similar to the following indicates success:

image-cri-shim version 4.1.7-ed0a75b9
192.168.1.101:22: image-cri-shim version 4.1.7-ed0a75b9
192.168.1.102:22: image-cri-shim version 4.1.7-ed0a75b9
  4. After the service starts, watch the logs:
root@master1:~# journalctl -u image-cri-shim -f
-- Logs begin at Wed 2023-04-19 22:42:32 UTC. --
May 25 06:49:38 master1 image-cri-shim[1116090]: 2023-05-25T06:49:38 info actual imageName: pause:3.6
May 25 06:49:38 master1 image-cri-shim[1116090]: 2023-05-25T06:49:38 info image: k8s.gcr.io/pause:3.6, newImage: sealos.hub:5000/pause:3.6, action: ImageStatus
May 25 06:54:38 master1 image-cri-shim[1116090]: 2023-05-25T06:54:38 info actual imageName: pause:3.6
May 25 06:54:38 master1 image-cri-shim[1116090]: 2023-05-25T06:54:38 info image: k8s.gcr.io/pause:3.6, newImage: sealos.hub:5000/pause:3.6, action: ImageStatus
May 25 06:59:38 master1 image-cri-shim[1116090]: 2023-05-25T06:59:38 info actual imageName: pause:3.6
May 25 06:59:38 master1 image-cri-shim[1116090]: 2023-05-25T06:59:38 info image: k8s.gcr.io/pause:3.6, newImage: sealos.hub:5000/pause:3.6, action: ImageStatus
  5. Monitor how the number of connections changes:
netstat -na | grep 5000 | wc -l
96
116
120
...
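  6. The count grows by roughly 20 every 5 minutes, the same interval at which the ImageStatus requests appear in the log. The problem is reproduced, so next we look at how to fix it.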
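
The sampling can be scripted so that each reading carries a timestamp. The sketch below is illustrative only: count_port_conns and watch_port are hypothetical names, and the 300-second default is simply the 5-minute interval seen in the log. Unlike a plain grep 5000, the match is anchored to the port at the end of the local or peer address, so ports such as 50001 are not counted. Like the original command, it counts netstat lines, so a loopback connection whose two ends are both on this host shows up twice.

```shell
# Hypothetical helper: read netstat output on stdin and count TCP lines
# whose local or peer port is exactly the given port.
count_port_conns() {
  awk -v port="$1" '
    $1 ~ /^tcp/ && ($4 ~ (":" port "$") || $5 ~ (":" port "$")) { c++ }
    END { print c + 0 }'
}

# Hypothetical sampler: print "<timestamp> <count>" every interval seconds.
# $1: port, $2: interval in seconds (defaults to 300)
watch_port() {
  wp_port=$1
  wp_interval=${2:-300}
  while :; do
    printf '%s %s\n' "$(date '+%F %T')" "$(netstat -ant | count_port_conns "$wp_port")"
    sleep "$wp_interval"
  done
}

# Usage:
#   watch_port 5000
```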

Upgrading to 4.2.0

An issue on the sealos GitHub repository provides a fix: upgrade image-cri-shim to 4.2.0.

wget https://github.com/labring/sealos/releases/download/v4.2.0/sealos_4.2.0_linux_amd64.tar.gz && tar xvf sealos_4.2.0_linux_amd64.tar.gz image-cri-shim
sealos exec -r master,node "systemctl stop image-cri-shim"
sealos scp "./image-cri-shim" "/usr/bin/image-cri-shim"
sealos exec -r master,node "systemctl start image-cri-shim"
sealos exec -r master,node "image-cri-shim -v"

Output like the following indicates success:

image-cri-shim version 4.2.0-f696a621
192.168.1.101:22: image-cri-shim version 4.2.0-f696a621
192.168.1.102:22: image-cri-shim version 4.2.0-f696a621
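
The upgrade commands in this post differ only in the version number, so they can be parameterized. This is a sketch under assumptions: release_url and upgrade_shim are hypothetical names, and the URL pattern is inferred from the two download links used in this post rather than from any documented scheme. Passing -O to wget pins the output file name, which keeps a repeated download from being saved under a ".1" suffix that the tar step would then miss.

```shell
# Hypothetical helper: build the release tarball URL for a sealos version.
release_url() {
  printf 'https://github.com/labring/sealos/releases/download/v%s/sealos_%s_linux_amd64.tar.gz\n' "$1" "$1"
}

# Hypothetical wrapper around the steps shown above; stops at the first failure.
# $1: sealos release version, e.g. 4.2.0
upgrade_shim() {
  us_version=$1
  us_tarball="sealos_${us_version}_linux_amd64.tar.gz"
  wget -O "$us_tarball" "$(release_url "$us_version")" &&
    tar xvf "$us_tarball" image-cri-shim &&
    sealos exec -r master,node "systemctl stop image-cri-shim" &&
    sealos scp "./image-cri-shim" "/usr/bin/image-cri-shim" &&
    sealos exec -r master,node "systemctl start image-cri-shim" &&
    sealos exec -r master,node "image-cri-shim -v"
}

# Usage:
#   upgrade_shim 4.2.0
```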

Verifying the fix

  1. Watch the image-cri-shim logs:
root@master1:~# journalctl -u image-cri-shim -f
-- Logs begin at Wed 2023-04-19 22:42:32 UTC. --
May 25 07:44:09 master1 image-cri-shim[1159432]: 2023-05-25T07:44:09 info Timeout: {15m0s}
May 25 07:44:09 master1 image-cri-shim[1159432]: 2023-05-25T07:44:09 info criRegistryAuth: map[]
May 25 07:44:09 master1 image-cri-shim[1159432]: 2023-05-25T07:44:09 info criOfflineAuth: map[sealos.hub:5000:{Username:admin Password:passw0rd Auth: Email: ServerAddress:http://sealos.hub:5000 IdentityToken: RegistryToken:}]
May 25 07:44:09 master1 image-cri-shim[1159432]: 2023-05-25T07:44:09 info socket info shim: /var/run/image-cri-shim.sock ,image: /run/containerd/containerd.sock, registry: http://sealos.hub:5000
May 25 07:44:09 master1 image-cri-shim[1159432]: 2023-05-25T07:44:09 info changed ownership of socket "/var/run/image-cri-shim.sock" to root/root
May 25 07:44:09 master1 image-cri-shim[1159432]: 2023-05-25T07:44:09 info changed permissions of socket "/var/run/image-cri-shim.sock" to -rw-rw----
May 25 07:44:41 master1 image-cri-shim[1159432]: 2023-05-25T07:44:41 info image: k8s.gcr.io/pause:3.6, newImage: sealos.hub:5000/pause:3.6, action: ImageStatus
May 25 07:49:42 master1 image-cri-shim[1159432]: 2023-05-25T07:49:42 info image: k8s.gcr.io/pause:3.6, newImage: sealos.hub:5000/pause:3.6, action: ImageStatus
May 25 07:54:42 master1 image-cri-shim[1159432]: 2023-05-25T07:54:42 info image: k8s.gcr.io/pause:3.6, newImage: sealos.hub:5000/pause:3.6, action: ImageStatus
  2. Monitor how the number of connections changes:
root@master1:~# netstat -na | grep 5000 | wc -l
1 
6 
1
  3. The requests still occur, but the number of connections no longer grows, so the problem appears to be fixed.
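
Whether a series of samples is "growing" can also be checked mechanically. The helper below is a rough heuristic for illustration, not a rigorous test: the name trend is hypothetical, and with only a handful of samples the verdict should be read with care. It takes the last field of each input line as the count and reports growth only if the series never drops and ends higher than it started.

```shell
# Hypothetical helper: read one sample per line on stdin (the last field is the count)
# and print "growing" or "not growing".
trend() {
  awk 'NF {
         v = $NF + 0
         if (n == 0) { first = v } else if (v < prev) { dropped = 1 }
         prev = v
         n++
       }
       END {
         if (n >= 2 && !dropped && prev > first) print "growing"
         else print "not growing"
       }'
}

# Usage with the samples recorded in this post:
#   printf '96\n116\n120\n' | trend
#   printf '1\n6\n1\n' | trend
```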

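PS: We reproduced the problem on Kubernetes v1.23.9. In day-to-day use we have seen a similar problem on v1.25.7 as well; that version has not been verified yet and this post will be updated once it is.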

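Summary

This post records how we diagnosed, reproduced, and fixed heavy port usage on the master node of a Kubernetes cluster installed with sealos. The cause is the image-cri-shim version: the old version sends frequent requests to the master and creates a large number of connections, each of which occupies a port. Upgrading image-cri-shim resolves the problem.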
