geopm_pio_dcgm(7) – IOGroup providing signals and controls for NVIDIA GPUs

Description

The DCGMIOGroup implements the geopm::IOGroup(3) interface to provide hardware signals for NVIDIA GPUs from the NVIDIA Datacenter GPU Manager (DCGM). This IOGroup is intended for use with the NVMLIOGroup.

Requirements

To use the GEOPM DCGM signals and controls, GEOPM must be compiled against the NVML and DCGM libraries and must be run on a system with hardware supported by DCGM. To compile against these libraries, GEOPM must be configured with both the --enable-nvml flag and the --enable-dcgm flag. The optional flags --with-nvml and --with-dcgm may be used to indicate the path of the required libraries. See configure --help for more information about these flags.

When enabling the DCGM signals, a small modification should be made to the geopm.service systemd configuration file to declare its dependency on the DCGM service. The three lines below should be added to the [Unit] section:

[Unit]
Wants=nvidia-dcgm.service
After=nvidia-dcgm.service
PartOf=nvidia-dcgm.service
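
As a quick sanity check after editing the unit file, the following minimal Python sketch reports which of the three directives are still missing from the [Unit] section. The helper name missing_dcgm_directives is hypothetical and not part of GEOPM; the parser handles only simple key=value lines and is not a full systemd unit parser.

```python
REQUIRED_KEYS = ("Wants", "After", "PartOf")
DCGM_UNIT = "nvidia-dcgm.service"


def missing_dcgm_directives(unit_text):
    """Return the [Unit] keys that do not yet list nvidia-dcgm.service.

    Hypothetical helper for illustration only; not part of GEOPM.
    """
    section = None
    found = {key: False for key in REQUIRED_KEYS}
    for raw_line in unit_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]  # entering a new section
            continue
        if section != "Unit" or "=" not in line:
            continue  # only directives inside [Unit] are relevant
        key, _, value = line.partition("=")
        key = key.strip()
        # A directive may list several space-separated units.
        if key in found and DCGM_UNIT in value.split():
            found[key] = True
    return [key for key in REQUIRED_KEYS if not found[key]]
```

An empty list means all three directives are present in the [Unit] section; the path of the installed geopm.service file depends on the distribution and is left to the caller.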

The DCGM IOGroup requires that an instance of nv-hostengine be running on the node before the IOGroup is loaded. For DCGM installation and usage information see the DCGM Getting Started Guide. The DCGM IOGroup will connect to a running nv-hostengine instance if one is available when the GEOPM service is started. If nv-hostengine is stopped, the GEOPM DCGM signals will raise an error of the form:

"Error getting latest values for fields in read_batch: Host engine connection invalid/disconnected"

Both nv-hostengine and the geopm service must be restarted to restore access to the DCGM signals.
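
A client that reads DCGM signals may want to recognize this failure and tell the operator what to do. The sketch below is a minimal illustration, assuming the error reaches Python as a RuntimeError whose text contains the message quoted above; the function names are hypothetical and not part of GEOPM.

```python
DISCONNECT_MARKER = "Host engine connection invalid/disconnected"


def needs_hostengine_restart(error_message):
    """True if the error text matches the host engine disconnect message."""
    return DISCONNECT_MARKER in error_message


def read_or_flag(read_fn):
    """Call read_fn() and return (value, None) on success.

    If the call fails with the disconnect message, return (None, hint).
    Assumption: the failure surfaces as a RuntimeError; any other error
    is re-raised unchanged.
    """
    try:
        return read_fn(), None
    except RuntimeError as err:
        if needs_hostengine_restart(str(err)):
            return None, "restart nv-hostengine and the geopm service"
        raise
```

Here read_fn stands for any zero-argument callable that performs the read, so the detection logic can be exercised without a GPU present.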

Signals

DCGM::SM_ACTIVE

Streaming Multiprocessor (SM) activity expressed as a ratio of cycles.

  • Aggregation: average

  • Domain: gpu

  • Format: double

  • Unit: n/a

DCGM::SM_OCCUPANCY

Warp residency expressed as a ratio of maximum warps.

  • Aggregation: average

  • Domain: gpu

  • Format: double

  • Unit: n/a

DCGM::DRAM_ACTIVE

DRAM send and receive metrics expressed as a ratio of cycles.

  • Aggregation: average

  • Domain: gpu

  • Format: double

  • Unit: n/a
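
The three signals above share the gpu domain and the average aggregation. The following Python sketch reads each signal on every GPU and averages the per-GPU values, which is what the average aggregation implies when the signal is requested at a coarser domain. The reader is passed in as a callable so the example runs without GEOPM installed; sample_gpu_signals is a hypothetical name.

```python
SIGNALS = ("DCGM::SM_ACTIVE", "DCGM::SM_OCCUPANCY", "DCGM::DRAM_ACTIVE")


def sample_gpu_signals(read_signal, num_gpu):
    """Read each DCGM signal on every GPU and average across GPUs.

    read_signal is any callable taking (signal_name, domain_name,
    domain_index) and returning a float.
    """
    if num_gpu < 1:
        raise ValueError("num_gpu must be at least 1")
    result = {}
    for name in SIGNALS:
        per_gpu = [read_signal(name, "gpu", idx) for idx in range(num_gpu)]
        result[name] = {
            "per_gpu": per_gpu,
            # Mirrors the "average" aggregation listed for these signals.
            "average": sum(per_gpu) / len(per_gpu),
        }
    return result
```

With the GEOPM Python bindings installed, the reader could be geopmdpy.pio.read_signal and the GPU count could come from geopmdpy.topo.num_domain("gpu"); both are assumed here to accept the domain name as a string. From the command line, geopmread DCGM::SM_ACTIVE gpu 0 reads the same signal for the first GPU.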

Controls

Every control is exposed as a signal with the same name. The relevant signal aggregation information is provided below.

DCGM::FIELD_UPDATE_RATE

Rate at which field data is polled.

  • Aggregation: expect_same

  • Domain: board

  • Format: double

  • Unit: seconds

DCGM::MAX_STORAGE_TIME

The maximum time field data will be stored.

  • Aggregation: expect_same

  • Domain: board

  • Format: double

  • Unit: seconds

DCGM::MAX_SAMPLES

The maximum number of samples to be stored. Zero implies no limit.

  • Aggregation: expect_same

  • Domain: board

  • Format: integer

  • Unit: n/a
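
The Python sketch below shows one way to apply the three controls at the board domain, along with an illustration of the expect_same aggregation. It assumes a writer callable with the argument order (control_name, domain_name, domain_index, setting) and assumes that expect_same yields the common value when all inputs agree and NaN otherwise, as described in geopm::Agg(3). The function names are hypothetical.

```python
import math


def expect_same(values):
    """Return the common value if all values match, otherwise NaN."""
    values = list(values)
    if not values:
        return math.nan
    first = values[0]
    return first if all(v == first for v in values) else math.nan


def configure_polling(write_control, update_rate_s, max_storage_s, max_samples):
    """Write the three DCGM controls on board 0 and return what was written.

    write_control is any callable taking (control_name, domain_name,
    domain_index, setting).
    """
    settings = [
        ("DCGM::FIELD_UPDATE_RATE", float(update_rate_s)),
        ("DCGM::MAX_STORAGE_TIME", float(max_storage_s)),
        # Sample count is an integer; zero implies no limit.
        ("DCGM::MAX_SAMPLES", float(int(max_samples))),
    ]
    for name, value in settings:
        write_control(name, "board", 0, value)
    return settings
```

With the GEOPM Python bindings installed, geopmdpy.pio.write_control could serve as the writer, assuming it accepts the domain name as a string. The command-line equivalent for a single control is geopmwrite DCGM::FIELD_UPDATE_RATE board 0 0.1, where the value 0.1 is only an example.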

Aliases

This IOGroup provides the following high-level aliases:

Signal Aliases

GPU_CORE_ACTIVITY

Maps to DCGM::SM_ACTIVE.

GPU_UNCORE_ACTIVITY

Maps to DCGM::DRAM_ACTIVE.
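
The alias mapping can be summarized as a small lookup. This Python sketch is only an illustration of the two mappings listed above; GEOPM resolves aliases internally, and resolve_signal is a hypothetical helper.

```python
ALIASES = {
    "GPU_CORE_ACTIVITY": "DCGM::SM_ACTIVE",
    "GPU_UNCORE_ACTIVITY": "DCGM::DRAM_ACTIVE",
}


def resolve_signal(name):
    """Return the DCGM signal behind an alias, or the name unchanged."""
    return ALIASES.get(name, name)
```

A request for GPU_CORE_ACTIVITY therefore returns the same data as a request for DCGM::SM_ACTIVE, while names that are not aliases pass through unchanged.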

See Also

DCGM API, geopm(7), geopm::IOGroup(3), geopmwrite(1), geopmread(1), geopm::Agg(3), geopm_pio_nvml(7)