Scenario
The NVIDIA Fabric Manager service enables multiple A100/A800 GPUs to interconnect through NVSwitch. For more information about NVSwitch, see the NVIDIA official website.
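Notes
For instances equipped with A100/A800 GPUs, see the instance specifications. If the NVIDIA Fabric Manager service that matches the GPU driver version is not installed, you cannot use this type of GPU instance properly.
When you upgrade the GPU driver on an instance equipped with A100/A800 GPUs, you must upgrade Fabric Manager at the same time. Otherwise, the instance will not work properly. For details, see "How do I upgrade the NVIDIA Tesla driver?".
Public images provided by Volcengine have NVIDIA Fabric Manager and its devel package installed by default. You only need to start NVIDIA Fabric Manager to enable NVSwitch interconnection.
If you use a custom image that does not include NVIDIA Fabric Manager, then after purchasing a GPU instance with multiple A100/A800 GPUs you must install the NVIDIA Fabric Manager package that matches the GPU driver version.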
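Step 1: Install NVIDIA Fabric Manager
You can install the NVIDIA Fabric Manager service either from a downloaded package or from a package repository. The following uses GPU driver version 470.57.02 as an example to show how to install and start the service. To install another version, replace the version number in the commands with your GPU driver version (for the .deb packages, this includes the 470 branch prefix in the package name). Run the nvidia-smi command to check the GPU driver version.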
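As a minimal sketch of the version lookup described above, the script below reads the driver version with nvidia-smi and derives the driver branch (the part before the first dot). The function names driver_branch and detect_driver_version are hypothetical helpers introduced for illustration, and the fallback to 470.57.02 only mirrors the example version used in this guide.

```shell
#!/bin/sh
# Hypothetical helper: print the driver branch, e.g. 470.57.02 -> 470.
driver_branch() {
    printf '%s\n' "${1%%.*}"
}

# Hypothetical helper: print the driver version reported for the first GPU.
# Prints nothing when nvidia-smi is not installed.
detect_driver_version() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
    fi
}

DRIVER_VERSION="$(detect_driver_version)"
# Assumption for illustration only: fall back to the example version from this guide.
DRIVER_VERSION="${DRIVER_VERSION:-470.57.02}"
echo "Driver version: ${DRIVER_VERSION} (branch $(driver_branch "${DRIVER_VERSION}"))"
```

The resulting version string is the value to substitute into the installation commands that follow.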
Method 1: Install from a package
CentOS 8.x
wget https://developer.download.nvidia.cn/compute/cuda/repos/rhel8/x86_64/nvidia-fabric-manager-470.57.02-1.x86_64.rpm
rpm -ivh nvidia-fabric-manager-470.57.02-1.x86_64.rpm
CentOS 7.x
wget https://developer.download.nvidia.cn/compute/cuda/repos/rhel7/x86_64/nvidia-fabric-manager-470.57.02-1.x86_64.rpm
rpm -ivh nvidia-fabric-manager-470.57.02-1.x86_64.rpm
Ubuntu 20.04
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/nvidia-fabricmanager-470_470.57.02-1_amd64.deb
dpkg -i nvidia-fabricmanager-470_470.57.02-1_amd64.deb
Ubuntu 18.04
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64/nvidia-fabricmanager-470_470.57.02-1_amd64.deb
dpkg -i nvidia-fabricmanager-470_470.57.02-1_amd64.deb
Debian 10, veLinux 1.0
wget https://developer.download.nvidia.cn/compute/cuda/repos/debian10/x86_64/nvidia-fabricmanager-470_470.57.02-1_amd64.deb
dpkg -i nvidia-fabricmanager-470_470.57.02-1_amd64.deb
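The download URLs above differ only in the repository directory and the driver version, so they can be generated rather than edited by hand. The sketch below uses a hypothetical function, fm_package_url, and assumes that other driver versions follow the same file-naming pattern and "-1" release suffix as the 470.57.02 packages shown here. That may not hold for every release, so confirm that the URL exists before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: print the Fabric Manager package URL for a repository
# directory name (rhel7, rhel8, ubuntu1804, ubuntu2004, debian10) and a driver version.
# Assumption: the naming pattern of the 470.57.02 packages applies to other versions.
fm_package_url() {
    distro="$1"
    version="$2"
    base="https://developer.download.nvidia.cn/compute/cuda/repos"
    branch="${version%%.*}"   # e.g. 470.57.02 -> 470
    case "$distro" in
        rhel7|rhel8)
            echo "${base}/${distro}/x86_64/nvidia-fabric-manager-${version}-1.x86_64.rpm"
            ;;
        ubuntu1804|ubuntu2004|debian10)
            echo "${base}/${distro}/x86_64/nvidia-fabricmanager-${branch}_${version}-1_amd64.deb"
            ;;
        *)
            echo "unsupported distro: ${distro}" >&2
            return 1
            ;;
    esac
}

# Example: print the Ubuntu 20.04 URL used in this guide.
fm_package_url ubuntu2004 470.57.02
```

The printed URL can then be passed to wget, followed by rpm -ivh or dpkg -i on the downloaded file as in the commands above.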
Method 2: Install from a repository
CentOS 8.x
dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
dnf module enable -y nvidia-driver:470
dnf install -y nvidia-fabric-manager-0:470.57.02-1
CentOS 7.x
yum -y install yum-utils
yum-config-manager --add-repo https://developer.download.nvidia.cn/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
yum install -y nvidia-fabric-manager-470.57.02-1
Ubuntu 20.04
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
apt-key add 7fa2af80.pub
rm 7fa2af80.pub
echo "deb http://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64 /" | tee /etc/apt/sources.list.d/cuda.list
apt-get update
apt-get -y install nvidia-fabricmanager-470=470.57.02-1
Ubuntu 18.04
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt-key add 7fa2af80.pub
rm 7fa2af80.pub
echo "deb http://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 /" | tee /etc/apt/sources.list.d/cuda.list
apt-get update
apt-get -y install nvidia-fabricmanager-470=470.57.02-1
Step 2: Install nvidia-fabric-manager-devel
CentOS 7.x/8.x
yum install nvidia-fabric-manager-devel-470.57.02-1 -y
Ubuntu 20.04/18.04, Debian 10, veLinux 1.0
dpkg -i nvidia-fabric-manager-devel-470.57.02-1_amd64.deb
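Because Fabric Manager must match the GPU driver version, it can be worth comparing the two once the packages are installed. The sketch below is one possible check, not an official tool: normalize_version and versions_match are hypothetical helpers, and the package names queried (nvidia-fabric-manager for RPM, nvidia-fabricmanager-<branch> for DEB) are taken from the install commands in this guide.

```shell
#!/bin/sh
# Hypothetical helper: strip an optional epoch ("0:") and release suffix ("-1"),
# e.g. 0:470.57.02-1 -> 470.57.02.
normalize_version() {
    v="${1#*:}"
    printf '%s\n' "${v%%-*}"
}

# Hypothetical helper: succeed when both versions are equal after normalization.
versions_match() {
    [ "$(normalize_version "$1")" = "$(normalize_version "$2")" ]
}

# Run the live comparison only on a host where the driver tools are available.
if command -v nvidia-smi >/dev/null 2>&1; then
    driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if command -v dpkg-query >/dev/null 2>&1; then
        # DEB package name carries the driver branch, e.g. nvidia-fabricmanager-470.
        fm="$(dpkg-query -W -f='${Version}' "nvidia-fabricmanager-${driver%%.*}" 2>/dev/null)"
    else
        fm="$(rpm -q --queryformat '%{VERSION}' nvidia-fabric-manager 2>/dev/null)"
    fi
    if versions_match "$driver" "$fm"; then
        echo "OK: driver ${driver} matches Fabric Manager ${fm}"
    else
        echo "MISMATCH: driver ${driver}, Fabric Manager '${fm}'"
    fi
fi
```

A MISMATCH result suggests reinstalling Fabric Manager with the version number reported by nvidia-smi.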
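Step 3: Start NVIDIA Fabric Manager
Run the following command to start the Fabric Manager service.
sudo systemctl start nvidia-fabricmanager
Run the following command to check whether the Fabric Manager service started properly. An output of active (running) indicates that the service started successfully.
sudo systemctl status nvidia-fabricmanager
Run the following command to configure the Fabric Manager service to start automatically when the instance boots.
sudo systemctl enable nvidia-fabricmanager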
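For scripted checks, for example in an instance initialization script, the status and auto-start checks can be reduced to one line each with systemctl is-active and systemctl is-enabled. The following is a minimal sketch under that assumption. describe_state is a hypothetical helper that only maps the is-active output to a short verdict.

```shell
#!/bin/sh
# Hypothetical helper: map the output of `systemctl is-active` to a short verdict.
describe_state() {
    case "$1" in
        active)          echo "running" ;;
        inactive|failed) echo "not running" ;;
        *)               echo "unknown" ;;
    esac
}

# Query the service only when systemctl is available on this host.
if command -v systemctl >/dev/null 2>&1; then
    state="$(systemctl is-active nvidia-fabricmanager 2>/dev/null || true)"
    enabled="$(systemctl is-enabled nvidia-fabricmanager 2>/dev/null || true)"
    echo "nvidia-fabricmanager: $(describe_state "$state") (boot setting: ${enabled:-unknown})"
fi
```

If the service is reported as not running, sudo systemctl status nvidia-fabricmanager shows the detailed state and recent log lines.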