Background:
A Kubernetes cluster installed with sealos shows a large number of occupied ports on the master node.
Environment:
Component | Details |
---|---|
os | Ubuntu |
sealos | v4.1.7 |
Kubernetes | v1.23.9 |
Troubleshooting
Count ESTABLISHED connections per local address and port, sorted by count:
netstat -anp | grep ESTABLISHED | awk '{print $4}' | sort | uniq -c | sort -n
28 10.0.0.101:6443
111 127.0.0.1:2379
9009 10.0.0.101:5000
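The pipeline above can also be folded into a single awk pass that matches the state column exactly instead of grepping the whole line. This is a minimal sketch, assuming the usual Linux `netstat -ant` column layout (local address in column 4, state in column 6). The function name `count_established` is made up for illustration.

```shell
# Hypothetical helper: count ESTABLISHED TCP connections per local address:port.
# Reads netstat-style lines from stdin so it can be tried on saved output too.
count_established() {
  awk '$6 == "ESTABLISHED" { n[$4]++ }
       END { for (addr in n) print n[addr], addr }' | sort -n
}

# Usage on a live node (not executed here):
#   netstat -ant | count_established
```

Because the result is sorted numerically, the busiest local address ends up on the last line.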
Find the process behind the local port:
lsof -i :5000
image-cri 2811249 root 4120u IPv4 291818320 0t0 TCP master1:41710->master1:5000 (ESTABLISHED)
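As shown, image-cri-shim is the process holding these ports (lsof truncates the command name to image-cri).
How image-cri-shim works
image-cri-shim is a gRPC (remote procedure call) shim built on the CRI (Container Runtime Interface) and the kubelet. CRI is the interface Kubernetes uses to talk to container runtimes, and the kubelet is the Kubernetes component that maintains container state and manages node-level resources.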
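Judging from the log lines shown further down in this post (image: k8s.gcr.io/pause:3.6, newImage: sealos.hub:5000/pause:3.6), one thing the shim does is map image references to the cluster's local registry. The following is only a rough shell sketch of that mapping, not the shim's actual implementation. The helper name `rewrite_image` and the rule for detecting a registry host are assumptions.

```shell
# Hypothetical helper, not part of sealos: rewrite an image reference so that
# it points at the local registry, mirroring the mapping seen in the shim logs.
rewrite_image() (
  image="$1"
  registry="${2:-sealos.hub:5000}"   # default taken from the logs in this post
  name="$image"
  case "$image" in
    */*)
      first="${image%%/*}"
      # Assumption: the first path component is a registry host only if it
      # contains a dot or a colon, or is "localhost".
      case "$first" in
        *.*|*:*|localhost) name="${image#*/}" ;;
      esac
      ;;
  esac
  printf '%s/%s\n' "$registry" "$name"
)

# Usage:
#   rewrite_image k8s.gcr.io/pause:3.6
```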
Common operations:
- Start the service: systemctl start image-cri-shim
- Stop the service: systemctl stop image-cri-shim
- Restart the service: systemctl restart image-cri-shim
- Check the service status: systemctl status image-cri-shim
- Follow the logs: journalctl -u image-cri-shim -f
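Reproducing the problem
- Install the same sealos version (4.1.7) in a test environment. The steps are not repeated here; see https://sealos.io/zh-Hans/docs/getting-started/kuberentes-life-cycle
- After installation, the image-cri-shim version turns out to be 4.1.3, which does not match: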
image-cri-shim --version
image-cri-shim version 4.1.3-ed0a75b9
- Update image-cri-shim to 4.1.7
wget https://github.com/labring/sealos/releases/download/v4.1.7/sealos_4.1.7_linux_amd64.tar.gz && tar xvf sealos_4.1.7_linux_amd64.tar.gz image-cri-shim
sealos exec -r master,node "systemctl stop image-cri-shim"
sealos scp "./image-cri-shim" "/usr/bin/image-cri-shim"
sealos exec -r master,node "systemctl start image-cri-shim"
sealos exec -r master,node "image-cri-shim -v"
Output similar to the following indicates success:
image-cri-shim version 4.1.7-ed0a75b9
192.168.1.101:22: image-cri-shim version 4.1.7-ed0a75b9
192.168.1.102:22: image-cri-shim version 4.1.7-ed0a75b9
- After the service starts, watch the logs:
root@master1:~# journalctl -u image-cri-shim -f
-- Logs begin at Wed 2023-04-19 22:42:32 UTC. --
May 25 06:49:38 master1 image-cri-shim[1116090]: 2023-05-25T06:49:38 info actual imageName: pause:3.6
May 25 06:49:38 master1 image-cri-shim[1116090]: 2023-05-25T06:49:38 info image: k8s.gcr.io/pause:3.6, newImage: sealos.hub:5000/pause:3.6, action: ImageStatus
May 25 06:54:38 master1 image-cri-shim[1116090]: 2023-05-25T06:54:38 info actual imageName: pause:3.6
May 25 06:54:38 master1 image-cri-shim[1116090]: 2023-05-25T06:54:38 info image: k8s.gcr.io/pause:3.6, newImage: sealos.hub:5000/pause:3.6, action: ImageStatus
May 25 06:59:38 master1 image-cri-shim[1116090]: 2023-05-25T06:59:38 info actual imageName: pause:3.6
May 25 06:59:38 master1 image-cri-shim[1116090]: 2023-05-25T06:59:38 info image: k8s.gcr.io/pause:3.6, newImage: sealos.hub:5000/pause:3.6, action: ImageStatus
- Monitor how the connection count changes
netstat -na | grep 5000 | wc -l
96
116
120
...
- As shown, the count grows by about 20 every 5 minutes. The problem is reproduced; next, let's look at how to fix it.
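The manual check above uses `grep 5000`, which also matches unrelated lines such as port 50001. The growth can be tracked a little more precisely with a helper that matches only addresses ending in the given port, plus a timestamped sampling loop. This is a sketch under assumptions: the function names `count_port_conns` and `sample_port_conns` are made up, and the 300-second default simply mirrors the 5-minute interval seen in the logs.

```shell
# Hypothetical helper: count TCP lines whose local or foreign address ends in
# ":<port>". Reads netstat-style output from stdin.
count_port_conns() (
  port="$1"
  awk -v p=":$port" '
    function ends_with(s, suf) {
      return substr(s, length(s) - length(suf) + 1) == suf
    }
    $1 ~ /^tcp/ && (ends_with($4, p) || ends_with($5, p)) { n++ }
    END { print n + 0 }
  '
)

# Hypothetical sampler: print "<timestamp> <count>" every <interval> seconds.
# Defined only; call it by hand on a node, e.g.: sample_port_conns 5000 300
sample_port_conns() (
  port="$1"
  interval="${2:-300}"
  while true; do
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" \
      "$(netstat -ant | count_port_conns "$port")"
    sleep "$interval"
  done
)
```

Like the original command, this counts both ends of a connection when client and server run on the same host, so absolute numbers matter less than the trend between samples.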
Upgrading to 4.2.0
An issue in the sealos GitHub repository offers a fix: upgrade image-cri-shim to 4.2.0.
wget https://github.com/labring/sealos/releases/download/v4.2.0/sealos_4.2.0_linux_amd64.tar.gz && tar xvf sealos_4.2.0_linux_amd64.tar.gz image-cri-shim
sealos exec -r master,node "systemctl stop image-cri-shim"
sealos scp "./image-cri-shim" "/usr/bin/image-cri-shim"
sealos exec -r master,node "systemctl start image-cri-shim"
sealos exec -r master,node "image-cri-shim -v"
The following output indicates success:
image-cri-shim version 4.2.0-f696a621
192.168.1.101:22: image-cri-shim version 4.2.0-f696a621
192.168.1.102:22: image-cri-shim version 4.2.0-f696a621
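With more than a couple of nodes, comparing version lines by eye gets error-prone. A small check can do it instead: the sketch below reads version lines in the format shown above and fails if any node reports a different release. The function name `check_versions` is hypothetical, and it assumes the version string is always the last field in the form `<release>-<commit>`.

```shell
# Hypothetical helper: verify that every "image-cri-shim version ..." line on
# stdin reports the expected release. Exits non-zero on any mismatch, or if no
# version line was seen at all.
check_versions() (
  expected="$1"
  awk -v want="$expected" '
    /image-cri-shim version/ {
      seen++
      # Last field looks like 4.2.0-f696a621; compare the part before "-".
      split($NF, parts, "-")
      if (parts[1] != want) { print "MISMATCH: " $0; bad++ }
    }
    END { if (bad > 0 || seen == 0) exit 1 }
  '
)

# Usage (not executed here):
#   sealos exec -r master,node "image-cri-shim -v" | check_versions 4.2.0
```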
Verifying the fix
- Watch the image-cri-shim logs:
root@master1:~# journalctl -u image-cri-shim -f
-- Logs begin at Wed 2023-04-19 22:42:32 UTC. --
May 25 07:44:09 master1 image-cri-shim[1159432]: 2023-05-25T07:44:09 info Timeout: {15m0s}
May 25 07:44:09 master1 image-cri-shim[1159432]: 2023-05-25T07:44:09 info criRegistryAuth: map[]
May 25 07:44:09 master1 image-cri-shim[1159432]: 2023-05-25T07:44:09 info criOfflineAuth: map[sealos.hub:5000:{Username:admin Password:passw0rd Auth: Email: ServerAddress:http://sealos.hub:5000 IdentityToken: RegistryToken:}]
May 25 07:44:09 master1 image-cri-shim[1159432]: 2023-05-25T07:44:09 info socket info shim: /var/run/image-cri-shim.sock ,image: /run/containerd/containerd.sock, registry: http://sealos.hub:5000
May 25 07:44:09 master1 image-cri-shim[1159432]: 2023-05-25T07:44:09 info changed ownership of socket "/var/run/image-cri-shim.sock" to root/root
May 25 07:44:09 master1 image-cri-shim[1159432]: 2023-05-25T07:44:09 info changed permissions of socket "/var/run/image-cri-shim.sock" to -rw-rw----
May 25 07:44:41 master1 image-cri-shim[1159432]: 2023-05-25T07:44:41 info image: k8s.gcr.io/pause:3.6, newImage: sealos.hub:5000/pause:3.6, action: ImageStatus
May 25 07:49:42 master1 image-cri-shim[1159432]: 2023-05-25T07:49:42 info image: k8s.gcr.io/pause:3.6, newImage: sealos.hub:5000/pause:3.6, action: ImageStatus
May 25 07:54:42 master1 image-cri-shim[1159432]: 2023-05-25T07:54:42 info image: k8s.gcr.io/pause:3.6, newImage: sealos.hub:5000/pause:3.6, action: ImageStatus
- Monitor how the connection count changes
root@master1:~# netstat -na | grep 5000 | wc -l
1
6
1
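- As shown, requests are still being made, but the connection count no longer grows, so the problem appears to be fixed.
PS: We reproduced the problem on Kubernetes v1.23.9. In practice we have seen a similar problem on another version, v1.25.7; verification on that version will be added in a later update.
Summary
This post records how we investigated, reproduced, and fixed the problem of a sealos-installed Kubernetes cluster holding a large number of ports on the master node. The cause is the image-cri-shim version: the older version frequently sends requests to the master, creating a large number of connections that occupy ports. Upgrading image-cri-shim resolves the problem.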