Nvidia GPU Device Plugin
Name: nvidia-gpu
The Nvidia device plugin exposes Nvidia GPUs to Nomad. The plugin is built into Nomad and does not need to be downloaded separately.
Fingerprinted Attributes
Attribute | Unit
--- | ---
`memory` | MiB
`power` | W (Watt)
`bar1` | MiB
`driver_version` | string
`cores_clock` | MHz
`memory_clock` | MHz
`pci_bandwidth` | MB/s
`display_state` | string
`persistence_mode` | string
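These attributes can be used to steer scheduling in a job's device stanza. The following is a minimal sketch using Nomad's device `constraint` and `affinity` syntax over the fingerprinted `memory` attribute; the memory thresholds shown are illustrative values, not recommendations:

```hcl
device "nvidia/gpu" {
  count = 1

  # Hard requirement: only place on GPUs with at least 2 GiB of memory.
  constraint {
    attribute = "${device.attr.memory}"
    operator  = ">="
    value     = "2 GiB"
  }

  # Soft preference: favor GPUs with 4 GiB or more.
  affinity {
    attribute = "${device.attr.memory}"
    operator  = ">="
    value     = "4 GiB"
    weight    = 75
  }
}
```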
Runtime Environment
The `nvidia-gpu` device plugin exposes the following environment variables:

- `NVIDIA_VISIBLE_DEVICES` - List of Nvidia GPU IDs available to the task.
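As a quick way to see this variable from inside a task, the following is a minimal sketch of a batch job; the image and shell command are illustrative assumptions, not part of the plugin:

```hcl
job "gpu-env" {
  datacenters = ["dc1"]
  type        = "batch"

  group "env" {
    task "print-devices" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:9.0-base"
        command = "sh"
        # Print the GPU UUIDs the plugin exposed to this task.
        args    = ["-c", "echo $NVIDIA_VISIBLE_DEVICES"]
      }

      resources {
        device "nvidia/gpu" {
          count = 1
        }
      }
    }
  }
}
```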
Additional Task Configurations
Additional environment variables can be set by the task to influence the runtime environment. See Nvidia's documentation.
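For instance, a task might set variables that Nvidia's container runtime reads. A minimal sketch follows; the specific variables and values are assumptions drawn from Nvidia's container runtime documentation, not from Nomad itself:

```hcl
task "cuda-app" {
  driver = "docker"

  # These variables are consumed by Nvidia's container runtime, not by Nomad.
  # Consult Nvidia's documentation for the full list and accepted values.
  env {
    NVIDIA_DRIVER_CAPABILITIES = "compute,utility" # driver libraries to mount
    NVIDIA_REQUIRE_CUDA        = "cuda>=9.0"       # fail if CUDA is too old
  }

  config {
    image = "nvidia/cuda:9.0-base"
  }

  resources {
    device "nvidia/gpu" {
      count = 1
    }
  }
}
```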
Installation Requirements
In order to use the `nvidia-gpu` device plugin, the following prerequisites must be met:

- GNU/Linux x86_64 with kernel version > 3.10
- NVIDIA GPU with Architecture > Fermi (2.1)
- NVIDIA drivers >= 340.29 with the binary `nvidia-smi`
Docker Driver Requirements
In order to use the `nvidia-gpu` device plugin with the Docker driver, please follow the installation instructions for nvidia-docker.
Plugin Configuration
plugin "nvidia-gpu" { config { enabled = true ignored_gpu_ids = ["GPU-fef8089b", "GPU-ac81e44d"] fingerprint_period = "1m" }}
The `nvidia-gpu` device plugin supports the following configuration in the agent config:

- `enabled` `(bool: true)` - Control whether the plugin should be enabled and running.
- `ignored_gpu_ids` `(array<string>: [])` - Specifies the set of GPU UUIDs that should be ignored when fingerprinting.
- `fingerprint_period` `(string: "1m")` - The period in which to fingerprint for device changes.
Restrictions
The Nvidia integration only works with drivers that natively integrate with Nvidia's container runtime library.

Nomad has tested support with the `docker` driver and plans to bring support to the built-in `exec` and `java` drivers. Support for `lxc` should be possible by installing the Nvidia hook but is not tested or documented by Nomad.
Examples
Inspect a node with a GPU:
```shell
$ nomad node status 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m43s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2674/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Allocations
No allocations placed
```
Display detailed statistics on a node with a GPU:
```shell
$ nomad node status -stats 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m59s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2673/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

// ...TRUNCATED...

Device Stats
Device = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
BAR1 buffer state   = 2 / 16384 MiB
Decoder utilization = 0 %
ECC L1 errors       = 0
ECC L2 errors       = 0
ECC memory errors   = 0
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C

Allocations
No allocations placed
```
Run the following example job to see that the GPU was mounted in the container:
job "gpu-test" { datacenters = ["dc1"] type = "batch" group "smi" { task "smi" { driver = "docker" config { image = "nvidia/cuda:9.0-base" command = "nvidia-smi" } resources { device "nvidia/gpu" { count = 1 # Add an affinity for a particular model affinity { attribute = "${device.model}" value = "Tesla K80" weight = 50 } } } } }}
```shell
$ nomad run example.nomad
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed
ID                  = d250baed
Eval ID             = 21bd7584
Name                = gpu-test.smi[0]
Node ID             = 4d46e59f
Job ID              = example
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 7s ago
Modified            = 2s ago

Task "smi" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
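The affinity in the example job only expresses a scheduling preference. If a job must run on a particular GPU model, a `constraint` block can be used instead; the following is a minimal sketch of the device stanza with a hard requirement:

```hcl
device "nvidia/gpu" {
  count = 1

  # Unlike affinity, a constraint makes the model a hard placement requirement.
  constraint {
    attribute = "${device.model}"
    value     = "Tesla K80"
  }
}
```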