geopm_pio_dcgm(7) -- IOGroup providing signals and controls for NVIDIA GPUs =========================================================================== Description ----------- The DCGMIOGroup implements the :doc:`geopm::IOGroup(3) <geopm::IOGroup.3>` interface to provide hardware signals for NVIDIA GPUs from the NVIDIA Datacenter GPU Manager. This IO Group is intended for use with the :doc:`NVMLIOGroup <geopm_pio_nvml.7>` Requirements ^^^^^^^^^^^^ To use the GEOPM DCGM signals and controls GEOPM must be compiled against the NVML and DCGM libraries and must be run on a system with hardware supported by DCGM. To compile against the NVML and DCGM libraries GEOPM must be configured using both the ``--enable-nvml`` flag and the ``--enable-dcgm`` flag. The optional flags ``--with-nvml`` and ``--with-dcgm`` may be used to indicate the path of the required libraries. See ``configure --help`` for more information about these flags. When enabling the DCGM signals, a small modification should be made to the `geopm.service <https://github.com/geopm/geopm/blob/dev/service/geopm.service>`_ systemd configuration file to encode the requirements on the DGCM service. The three lines below should be added to the ``[Unit]`` section: .. code-block:: [Unit] Wants=nvidia-dcgm.service After=nvidia-dcgm.service PartOf=nvidia-dcgm.service The DCGM IOGroup requires an instance of nv-hostengine be running on the node prior to loading the IOGroup. For DCGM installation and usage information see the `DCGM Getting Started Guide <https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/getting-started.html>`_. The DCGM IOGroup will connect to a running nv-hostengine instance if it is available when the GEOPM service is started. If the nv-hostengine is stopped the geopm DCGM Signals will throw an error of the form: .. code-block:: "Error getting latest values for fields in read_batch: Host engine connection invalid/disconnected" Restarting the nv-hostengine and the geopm service are required to restore access to DCGM signals. Signals ------- ``DCGM::SM_ACTIVE`` Streaming Multiprocessor (SM) activity expressed as a ratio of cycles. * **Aggregation**: average * **Domain**: gpu * **Format**: double. * **Unit**: n/a ``DCGM::SM_OCCUPANCY`` Warp residency expressed as a ratio of maximum warps. * **Aggregation**: average * **Domain**: gpu * **Format**: double * **Unit**: n/a ``DCGM::DRAM_ACTIVE`` DRAM send and receive metrics expressed as a ratio of cycles. * **Aggregation**: average * **Domain**: gpu * **Format**: double * **Unit**: n/a Controls -------- Every control is exposed as a signal with the same name. The relevant signal aggregation information is provided below. ``DCGM::FIELD_UPDATE_RATE`` Rate at which field data is polled. * **Aggregation**: expect_same * **Domain**: board * **Format**: double * **Unit**: seconds ``DCGM::MAX_STORAGE_TIME`` The maximum time field data will be stored. * **Aggregation**: expect_same * **Domain**: board * **Format**: double * **Unit**: seconds ``DCGM::MAX_SAMPLES`` The maximum number of samples to be stored. Zero implies no limit. * **Aggregation**: expect_same * **Domain**: board * **Format**: integer * **Unit**: seconds Aliases ------- This IOGroup provides the following high-level aliases: Signal Aliases ^^^^^^^^^^^^^^ ``GPU_CORE_ACTIVITY`` Maps to ``DCGM::SM_ACTIVE``. ``GPU_UNCORE_ACTIVITY`` Maps to ``DCGM::DRAM_ACTIVE``. See Also -------- `DCGM API <https://docs.nvidia.com/datacenter/dcgm/latest/>`_\ , :doc:`geopm(7) <geopm.7>`\ , :doc:`geopm::IOGroup(3) <geopm::IOGroup.3>`\ , :doc:`geopmwrite(1) <geopmwrite.1>`\ , :doc:`geopmread(1) <geopmread.1>`, :doc:`geopm::Agg(3) <geopm::Agg.3>` :doc:`geopm_pio_nvml(7) <geopm_pio_nvml.7>`\ ,