Installing the NVIDIA Container Runtime on Linux GPU Nodes

1. Environment Check

On every node, check that SSH, Docker, and (on GPU servers) the GPU driver are working, and that the clocks are synchronized.

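#  SSH  |  if the output contains no sshd, the ssh-server service is not installed yet
ps -e | grep ssh
#  Docker  |  prints the installed version
docker --version
#  GPU driver  |  prints the driver version and the GPU model
nvidia-smi
#  Check whether the NVIDIA container runtime is installed and registered with Docker
#  (nvidia-container-runtime is a binary, not a systemd service)
command -v nvidia-container-runtime
docker info | grep -i runtime
#  Show the current time
date
#  Set the time zone to UTC+8 (Beijing time)
sudo timedatectl set-timezone Asia/Shanghai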
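
The checks above can be bundled into a small preflight script that is easy to run on each node. This is only a sketch: the helper name check_cmd and the list of commands are assumptions made for illustration, and a CPU-only node will legitimately report the NVIDIA tools as missing.

```shell
#!/bin/sh
# check_cmd is a hypothetical helper: it reports whether a command
# can be found on PATH and returns 0 (found) or 1 (missing).
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "OK       $1"
        return 0
    fi
    echo "MISSING  $1"
    return 1
}

# Count missing tools instead of stopping at the first one.
missing=0
for cmd in docker nvidia-smi nvidia-container-runtime; do
    check_cmd "$cmd" || missing=$((missing + 1))
done
echo "missing tools: $missing"
```

A non-zero count only says that a binary is absent from PATH; it does not prove that the driver or the Docker daemon is actually healthy, so the individual commands above are still worth running by hand.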

Check the GPU driver:

[root@192.168.1.101 ~]# nvidia-smi
Wed Jul 31 14:53:21 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:13:00.0 Off |                  N/A |
| 23%   29C    P8     7W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@192.168.1.101 ~]#
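
The table above is meant for people and is awkward to parse in scripts. nvidia-smi also offers a CSV query mode, which the sketch below uses to print one line per GPU. The helper driver_major is hypothetical and exists only to show how the driver version could be compared against a minimum.

```shell
#!/bin/sh
# driver_major is a hypothetical helper: it prints the major component
# of a driver version string such as 525.89.02.
driver_major() {
    echo "${1%%.*}"
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # One CSV line per GPU: name, driver version, total memory.
    nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader |
    while IFS=, read -r name driver mem; do
        driver=${driver# }   # drop the space that follows the comma
        echo "GPU: $name | driver major: $(driver_major "$driver") | memory:$mem"
    done
else
    echo "nvidia-smi not found; the GPU driver may not be installed"
fi
```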

Installing the GPU driver on the nodes is not covered here.

Installing the NVIDIA Container Runtime and nvidia-docker

Install the NVIDIA container runtime:

#  Set a variable that identifies the OS release (for example ubuntu18.04)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
#  Add the repository's GPG public key
curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
#  Add the apt source list for this distribution
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
#  Install
sudo apt-get update
sudo apt-get install -y nvidia-container-runtime
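
The distribution variable decides which source list gets downloaded, so it can help to look at its value before running the curl step. The sketch below shows how the string is assembled from an os-release style file; distribution_of is a hypothetical helper introduced only for this illustration.

```shell
#!/bin/sh
# distribution_of is a hypothetical helper: it prints ID followed by
# VERSION_ID from an os-release style file, e.g. ubuntu18.04.
distribution_of() {
    # Source the file in a subshell so its variables do not leak out.
    ( . "$1" && echo "${ID:-}${VERSION_ID:-}" )
}

if [ -r /etc/os-release ]; then
    distribution=$(distribution_of /etc/os-release)
    echo "distribution: $distribution"
    echo "source list:  https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list"
fi
```

Because the curl commands above do not use a fail flag, a distribution string that the repository has no list for would likely write an error page into the sources file, which is why checking the printed URL first seems worthwhile.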

Install nvidia-docker2:

#  Set a variable that identifies the OS release (for example ubuntu18.04)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
#  Add the repository's GPG public key
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
#  Add the apt source list for this distribution
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
#  Install
sudo apt-get update
sudo apt-get install -y nvidia-docker2
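
After both installs, a quick look at the package database confirms that apt really placed the packages on the node. The following is a sketch for Debian/Ubuntu systems; pkg_installed is a hypothetical helper, not part of any NVIDIA tooling.

```shell
#!/bin/sh
# pkg_installed is a hypothetical helper: it succeeds only if dpkg
# reports the package as fully installed. It returns 2 when dpkg-query
# itself is unavailable (non-Debian systems).
pkg_installed() {
    command -v dpkg-query >/dev/null 2>&1 || return 2
    status=$(dpkg-query -W -f='${Status}' "$1" 2>/dev/null) || return 1
    [ "$status" = "install ok installed" ]
}

for pkg in nvidia-container-runtime nvidia-docker2; do
    if pkg_installed "$pkg"; then
        echo "installed:     $pkg"
    else
        echo "not installed: $pkg"
    fi
done
```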

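Configure the Docker daemon to use the NVIDIA container runtime by setting the default runtime to nvidia in /etc/docker/daemon.json. If the file does not exist, create it.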

#  Edit (or create) the daemon configuration
sudo vim /etc/docker/daemon.json
#  Then signal dockerd to reload its configuration
sudo pkill -SIGHUP dockerd
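
If the file has to be created from scratch, a minimal version only needs the runtime entries. The sketch below writes such a file to a scratch path rather than directly into /etc/docker, so it can be reviewed first; write_minimal_daemon_json and the scratch file name are assumptions made for this example.

```shell
#!/bin/sh
# write_minimal_daemon_json is a hypothetical helper. It writes a
# daemon.json that contains only the NVIDIA runtime settings.
write_minimal_daemon_json() {
    cat > "$1" <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime"
        }
    }
}
EOF
}

# Write to a scratch file; after reviewing it, copy it into place with
#   sudo cp "$scratch" /etc/docker/daemon.json
scratch="${TMPDIR:-/tmp}/daemon.json.example"
write_minimal_daemon_json "$scratch"
cat "$scratch"
```

On a node that already has a daemon.json, these keys should be merged into the existing file by hand instead of overwriting it, since the file usually carries other settings such as registry mirrors.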

View the Docker configuration:

[root@192.168.1.11 ~]# cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
    "registry-mirrors": ["https://docker.mirrors.ustc.edu.cn","https://registry.docker-cn.com","https://cr.console.aliyun.com","https://hub-mirror.c.163.com", "https://192.168.1.64","https://magic-harbor.magic.com","https://22amdajy.mirror.aliyuncs.com"],
    "insecure-registries": ["magic-harbor.magic.com"],
    "data-root": "/data/docker_data",
    "hosts": ["tcp://0.0.0.0:2375", "unix:///var/run/docker.sock"],
    "storage-driver": "overlay2",
    "log-driver": "json-file",
    "log-opts": {
        "max-file": "3",
        "max-size": "10m",
        "env": "os,customer",
        "labels": "somelabel"
    }
}
[root@192.168.1.11 ~]#
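
Reading the file shows what was configured, not what the daemon is actually using. The sketch below compares the two: it pulls the default-runtime value out of the file with sed and then asks the running daemon through docker info. default_runtime_of is a hypothetical helper, and the sed expression assumes the key and its value sit on one line, as in the listing above.

```shell
#!/bin/sh
# default_runtime_of is a hypothetical helper: it prints the value of
# the "default-runtime" key, assuming key and value are on one line.
default_runtime_of() {
    sed -n 's/.*"default-runtime"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

conf=/etc/docker/daemon.json
if [ -r "$conf" ]; then
    echo "configured default runtime: $(default_runtime_of "$conf")"
fi

# Ask the running daemon which runtime it uses by default.
if command -v docker >/dev/null 2>&1; then
    docker info --format '{{.DefaultRuntime}}' 2>/dev/null ||
        echo "docker is installed but the daemon did not answer"
fi
```

If both lines print nvidia, running nvidia-smi inside a CUDA base image with docker run should show the same GPU table as on the host.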
