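Installing NVIDIA on Linux GPU Nodes
1. Environment Check
Check that SSH, Docker, and the GPU driver (on GPU servers) are working on every node, and that the clocks are synchronized.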
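# SSH | if the output contains no sshd, the ssh-server service is not installed yet
ps -e | grep ssh
# Docker | prints the installed version
docker --version
# GPU driver | prints the driver version and the GPU model
nvidia-smi
# NVIDIA container runtime | it is a binary, not a systemd service, so check
# the executable and the runtimes registered with Docker instead of using systemctl
command -v nvidia-container-runtime
docker info | grep -i runtime
# Show the current time
date
# Set the time zone to UTC+8 (Beijing time)
sudo timedatectl set-timezone Asia/Shanghai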
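The individual checks can be gathered into one script that prints a status per item, which is easier to run across many nodes. This is a minimal sketch assuming a systemd-based host; `check_cmd` and `report` are hypothetical helper names introduced here, not part of any tool.

```shell
#!/bin/sh
# Sketch of a per-node pre-flight check.
# check_cmd and report are hypothetical helpers for this example.

# Return 0 if the named command is on PATH, 1 otherwise.
check_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Print one status line per tool name given as an argument.
report() {
    for tool_name in "$@"; do
        if check_cmd "$tool_name"; then
            echo "OK       $tool_name"
        else
            echo "MISSING  $tool_name"
        fi
    done
}

# Note: sshd usually lives in /usr/sbin, which may not be on a
# non-root user's PATH, so run this as root for a reliable result.
report sshd docker nvidia-smi nvidia-container-runtime

# Time sync: on systemd hosts timedatectl exposes NTPSynchronized (yes/no).
if check_cmd timedatectl; then
    echo "NTP synchronized: $(timedatectl show -p NTPSynchronized --value 2>/dev/null || echo unknown)"
fi
```

The script only reports whether each tool is present; it does not prove that the Docker daemon or the GPU driver is healthy, so the commands above are still worth running by hand on a new node.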
Check the GPU driver:
[root@192.168.1.101 ~]# nvidia-smi
Wed Jul 31 14:53:21 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:13:00.0 Off |                  N/A |
| 23%   29C    P8     7W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@192.168.1.101 ~]#
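For scripted checks, the same information can be read in a machine-friendly form with `nvidia-smi --query-gpu`. The sketch below prints one CSV line per GPU and compares the driver's major version with a minimum; the threshold 525 and the helper `driver_major` are assumptions made for illustration.

```shell
#!/bin/sh
# driver_major is a hypothetical helper: it extracts the major version
# from a driver version string such as 525.89.02.
driver_major() {
    echo "${1%%.*}"
}

MIN_MAJOR=525   # assumed minimum, adjust to your CUDA requirements

if command -v nvidia-smi >/dev/null 2>&1; then
    # One CSV line per GPU: name, driver version, total memory
    nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader

    # Driver version as reported for the first GPU
    version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)
    if [ -z "$version" ]; then
        echo "nvidia-smi is present but returned no driver version"
    elif [ "$(driver_major "$version")" -ge "$MIN_MAJOR" ]; then
        echo "driver $version meets the minimum $MIN_MAJOR"
    else
        echo "driver $version is older than $MIN_MAJOR"
    fi
else
    echo "nvidia-smi not found: the GPU driver is probably not installed"
fi
```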
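Installing the GPU driver on the nodes is not covered here.
Installing the NVIDIA Container Runtime and nvidia-docker
Install the NVIDIA container runtime: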
# Set a variable identifying the OS release (for example ubuntu20.04)
distribution=$(. /etc/os-release; echo "$ID$VERSION_ID")
# Add the repository's public key (apt-key is deprecated on recent Debian/Ubuntu releases)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
# Add the repository source list
curl -s -L "https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list" | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
# Install
sudo apt-get update
sudo apt-get install -y nvidia-container-runtime
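The `distribution` variable decides which repository list is downloaded, so it helps to see what it expands to. The sketch below reproduces the expansion against a sample os-release file with assumed values (`ubuntu`, `20.04`) rather than the real `/etc/os-release`; `distribution_of` is a hypothetical helper.

```shell
#!/bin/sh
# distribution_of is a hypothetical helper that reproduces the
# $ID$VERSION_ID expansion for any os-release style file.
distribution_of() {
    # Source the file in a subshell so ID and VERSION_ID do not leak
    # into the calling shell.
    ( . "$1" && echo "${ID}${VERSION_ID}" )
}

# Sample file with assumed values
sample=$(mktemp)
printf 'ID=ubuntu\nVERSION_ID="20.04"\n' > "$sample"

distribution_of "$sample"    # prints ubuntu20.04
rm -f "$sample"

# On a real node:
# distribution_of /etc/os-release
```

If the repository has no directory for the resulting value, the downloaded list file will not contain valid apt sources, so it is prudent to inspect the file under /etc/apt/sources.list.d/ before running apt-get update.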
Install nvidia-docker2:
# Set a variable identifying the OS release (for example ubuntu20.04)
distribution=$(. /etc/os-release; echo "$ID$VERSION_ID")
# Add the repository's public key (apt-key is deprecated on recent Debian/Ubuntu releases)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
# Add the repository source list
curl -s -L "https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list" | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# Install
sudo apt-get update
sudo apt-get install -y nvidia-docker2
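After both installs, it can be useful to confirm that the packages are actually registered with dpkg. A small sketch, assuming a Debian or Ubuntu host; `pkg_installed` is a hypothetical helper built on `dpkg-query`.

```shell
#!/bin/sh
# pkg_installed is a hypothetical helper: it returns 0 only when dpkg
# reports the package as fully installed.
pkg_installed() {
    command -v dpkg-query >/dev/null 2>&1 || return 1
    pkg_status=$(dpkg-query -W -f='${Status}' "$1" 2>/dev/null) || return 1
    [ "$pkg_status" = "install ok installed" ]
}

for pkg in nvidia-container-runtime nvidia-docker2; do
    if pkg_installed "$pkg"; then
        echo "$pkg: installed"
    else
        echo "$pkg: not installed"
    fi
done
```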
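Configure the Docker daemon to use the NVIDIA container runtime by setting the default runtime to nvidia in /etc/docker/daemon.json. Create the file if it does not exist.
sudo vim /etc/docker/daemon.json
# After saving the file, signal dockerd to reload its configuration
# (options such as data-root, hosts, or storage-driver need a full restart: sudo systemctl restart docker)
sudo pkill -SIGHUP dockerd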
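A runtime-only configuration can also be written from a script instead of an editor. The sketch below writes to a scratch path by default so that it can be tried without touching the live file; the `TARGET` variable and the `write_runtime_config` helper are hypothetical, and an existing daemon.json with other settings should be merged by hand rather than overwritten.

```shell
#!/bin/sh
# write_runtime_config is a hypothetical helper that writes a minimal
# daemon.json selecting the nvidia runtime as the default.
write_runtime_config() {
    cat > "$1" <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "args": []
        }
    }
}
EOF
}

# Scratch path by default; set TARGET=/etc/docker/daemon.json (as root) to apply.
TARGET=${TARGET:-$(mktemp)}

# Keep a copy of any existing, non-empty configuration before overwriting it.
if [ -s "$TARGET" ]; then
    cp "$TARGET" "$TARGET.bak"
fi

write_runtime_config "$TARGET"
echo "wrote $TARGET"

# Optional syntax check, only if python3 happens to be available.
if command -v python3 >/dev/null 2>&1; then
    python3 -m json.tool "$TARGET" >/dev/null && echo "JSON is valid"
fi
```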
View the Docker configuration:
[root@192.168.1.11 ~]# cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
    "registry-mirrors": [
        "https://docker.mirrors.ustc.edu.cn",
        "https://registry.docker-cn.com",
        "https://cr.console.aliyun.com",
        "https://hub-mirror.c.163.com",
        "https://192.168.1.64",
        "https://magic-harbor.magic.com",
        "https://22amdajy.mirror.aliyuncs.com"
    ],
    "insecure-registries": ["magic-harbor.magic.com"],
    "data-root": "/data/docker_data",
    "hosts": ["tcp://0.0.0.0:2375", "unix:///var/run/docker.sock"],
    "storage-driver": "overlay2",
    "log-driver": "json-file",
    "log-opts": {
        "max-file": "3",
        "max-size": "10m",
        "env": "os,customer",
        "labels": "somelabel"
    }
}
[root@192.168.1.11 ~]#
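Once the configuration is in place, it is worth confirming that the running daemon really picked up the nvidia runtime. The following is a sketch only: `default_runtime_of` is a hypothetical helper, the smoke test is opt-in through the assumed `RUN_GPU_SMOKE_TEST` variable because it pulls an image, and the image tag is an assumption that should be matched to the installed driver.

```shell
#!/bin/sh
# default_runtime_of is a hypothetical helper that reads the
# "default-runtime" value from a daemon.json style file.
default_runtime_of() {
    sed -n 's/.*"default-runtime"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

CONFIG=/etc/docker/daemon.json
if [ -r "$CONFIG" ]; then
    echo "configured default runtime: $(default_runtime_of "$CONFIG")"
fi

if command -v docker >/dev/null 2>&1; then
    # What the running daemon actually uses
    docker info --format '{{.DefaultRuntime}}'

    # Optional smoke test: run nvidia-smi inside a CUDA container.
    # The image tag below is an assumption.
    if [ "${RUN_GPU_SMOKE_TEST:-0}" = "1" ]; then
        docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
    fi
fi
```

If the file says nvidia but `docker info` reports a different default runtime, the daemon has most likely not reloaded its configuration yet.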
Free to repost: non-commercial, no derivatives, attribution required (Creative Commons 3.0 license)