geopm_pio_levelzero(7) – IOGroup providing signals and controls for Intel GPUs

Description

The LevelZeroIOGroup implements the geopm::IOGroup(3) interface to provide hardware signals and controls for Intel GPUs.

Requirements

To use the GEOPM LevelZero signals and controls, GEOPM must be compiled against the oneAPI LevelZero libraries and run on a system with discrete GPUs supported by LevelZero. To compile against the oneAPI LevelZero libraries, GEOPM must be configured with the --enable-levelzero flag. The optional --with-levelzero flag may be used to indicate the path of the required libraries. In addition, the user must export ZES_ENABLE_SYSMAN=1, as specified by the Intel oneAPI Level Zero Sysman documentation. See the Sysman specification for more information on related environment variables and their usage.

Because signals and controls are exposed via the Sysman API, they are affected by Sysman environment variables. Please review oneAPI LevelZero Sysman Environment Variables and oneAPI LevelZero Core Programming Guide Environment Variables.
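As an illustration of the environment requirement above, the following minimal sketch builds an environment for a child process with ZES_ENABLE_SYSMAN set. The helper name levelzero_env and the example command are hypothetical and chosen only for this illustration; the geopmread arguments follow the usual signal, domain, index order.

```python
import os


def levelzero_env(base_env=None):
    """Return a copy of the environment with Level Zero Sysman enabled.

    levelzero_env is a hypothetical helper, not part of GEOPM.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Required by this IOGroup, per the Requirements section.
    env["ZES_ENABLE_SYSMAN"] = "1"
    return env


# Hypothetical command line: read GPU energy for GPU 0.
command = ["geopmread", "LEVELZERO::GPU_ENERGY", "gpu", "0"]
env = levelzero_env({"PATH": "/usr/bin"})
print(env["ZES_ENABLE_SYSMAN"])
```

On a system with GEOPM installed, command and env could be passed to subprocess.run(command, env=env); the sketch stops short of that so it runs anywhere.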

Note on RAS Signals

Monitoring RAS counters has a high overhead (0.5 seconds each to read). For this reason, any errors related to these signals (for example, errors due to unsupported firmware) are not reported until the user actually attempts to read one of them.
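To give a feel for the overhead noted above, the following rough sketch estimates how long one sweep over every RAS signal could take. It assumes that reads are issued serially and that each signal read on each gpu_chip costs the stated 0.5 seconds; both are assumptions made for illustration, and the helper names are hypothetical.

```python
RAS_READ_SECONDS = 0.5  # per-read overhead stated in the note above

# Category stems taken from the RAS signal names listed in this page.
RAS_CATEGORIES = [
    "RESET_COUNT",
    "PROGRAMMING_ERRCOUNT",
    "DRIVER_ERRCOUNT",
    "COMPUTE_ERRCOUNT",
    "NONCOMPUTE_ERRCOUNT",
    "CACHE_ERRCOUNT",
    "DISPLAY_ERRCOUNT",
]


def ras_signal_names():
    """Build the correctable and uncorrectable RAS signal names."""
    return [
        f"LEVELZERO::GPU_RAS_{category}_{kind}"
        for kind in ("CORRECTABLE", "UNCORRECTABLE")
        for category in RAS_CATEGORIES
    ]


def ras_sweep_seconds(num_gpu_chips, names=None):
    """Estimate the time to read every RAS signal once on every chip.

    Assumes serial reads at RAS_READ_SECONDS each (an assumption).
    """
    names = ras_signal_names() if names is None else names
    return len(names) * num_gpu_chips * RAS_READ_SECONDS


print(len(ras_signal_names()), ras_sweep_seconds(2))
```

Under these assumptions, sampling all RAS signals in a fast control loop is impractical, so reading only the counters of interest, and reading them rarely, is the safer design.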

Signals

LEVELZERO::GPU_CORE_FREQUENCY_STATUS

The current frequency of the GPU Compute Hardware.

  • Aggregation: average

  • Domain: gpu_chip

  • Format: double

  • Unit: hertz

LEVELZERO::GPU_CORE_FREQUENCY_EFFICIENT

The efficient minimum frequency of the GPU Compute Hardware.

  • Aggregation: average

  • Domain: gpu_chip

  • Format: double

  • Unit: hertz

LEVELZERO::GPU_CORE_FREQUENCY_MAX_AVAIL

The maximum supported frequency of the GPU Compute Hardware.

  • Aggregation: expect_same

  • Domain: gpu_chip

  • Format: double

  • Unit: hertz

LEVELZERO::GPU_CORE_FREQUENCY_MIN_AVAIL

The minimum supported frequency of the GPU Compute Hardware.

  • Aggregation: expect_same

  • Domain: gpu_chip

  • Format: double

  • Unit: hertz

LEVELZERO::GPU_CORE_TEMPERATURE_MAXIMUM

The maximum measured temperature across all sensors in the GPU accelerator.

  • Aggregation: max

  • Domain: gpu_chip

  • Format: double

  • Unit: celsius

LEVELZERO::GPU_MEMORY_TEMPERATURE_MAXIMUM

The maximum measured temperature across all sensors in the GPU memory.

  • Aggregation: max

  • Domain: gpu_chip

  • Format: double

  • Unit: celsius

LEVELZERO::GPU_CORE_FREQUENCY_STEP

The GPU Compute Hardware frequency step size in hertz. The average step size is provided in the case where the step size is variable.

  • Aggregation: expect_same

  • Domain: gpu

  • Format: double

  • Unit: hertz

LEVELZERO::GPU_ENERGY

GPU energy in joules.

  • Aggregation: sum

  • Domain: gpu

  • Format: double

  • Unit: joules

LEVELZERO::GPU_CORE_ENERGY

GPU Compute Hardware chip energy in joules.

  • Aggregation: sum

  • Domain: gpu_chip on multi-chip systems, or gpu on systems with a single chip per GPU

  • Format: double

  • Unit: joules

LEVELZERO::GPU_CORE_ENERGY_TIMESTAMP

GPU compute hardware domain energy timestamp in seconds. Value cached on LEVELZERO::GPU_CORE_ENERGY read.

  • Aggregation: sum

  • Domain: gpu_chip on multi-chip systems, or gpu on systems with a single chip per GPU

  • Format: double

  • Unit: seconds

LEVELZERO::GPU_ENERGY_TIMESTAMP

Timestamp for the GPU energy read in seconds.

  • Aggregation: sum

  • Domain: gpu

  • Format: double

  • Unit: seconds

LEVELZERO::GPU_CORE_PERFORMANCE_FACTOR

Performance Factor of the GPU Compute Hardware Domain. Expresses a trade-off between energy provided to the GPU compute hardware and the supporting units. A value of 1 indicates a compute-focused energy trade-off; a value of 0 indicates a memory-focused energy trade-off. The default value is 0.5.

  • Aggregation: average

  • Domain: gpu_chip on multi-chip systems, or gpu on systems with a single chip per GPU

  • Format: double

  • Unit: none

LEVELZERO::GPU_UNCORE_FREQUENCY_STATUS

The current frequency of the GPU Memory hardware.

  • Aggregation: average

  • Domain: gpu_chip

  • Format: double

  • Unit: hertz

LEVELZERO::GPU_UNCORE_FREQUENCY_MAX_AVAIL

The maximum supported frequency of the GPU Memory Hardware.

  • Aggregation: expect_same

  • Domain: gpu_chip

  • Format: double

  • Unit: hertz

LEVELZERO::GPU_UNCORE_FREQUENCY_MIN_AVAIL

The minimum supported frequency of the GPU Memory Hardware.

  • Aggregation: expect_same

  • Domain: gpu_chip

  • Format: double

  • Unit: hertz

LEVELZERO::GPU_POWER_LIMIT_DEFAULT

Default power limit of the GPU in watts.

  • Aggregation: sum

  • Domain: gpu

  • Format: double

  • Unit: watts

LEVELZERO::GPU_POWER_LIMIT_MIN_AVAIL

The minimum supported power limit in watts.

  • Aggregation: sum

  • Domain: gpu

  • Format: double

  • Unit: watts

LEVELZERO::GPU_POWER_LIMIT_MAX_AVAIL

The maximum supported power limit in watts.

  • Aggregation: sum

  • Domain: gpu

  • Format: double

  • Unit: watts

LEVELZERO::GPU_RAS_RESET_COUNT_CORRECTABLE

The number of correctable accelerator engine resets by the driver.

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_RAS_PROGRAMMING_ERRCOUNT_CORRECTABLE

The number of correctable hardware exceptions generated by the way workloads have programmed the hardware.

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_RAS_DRIVER_ERRCOUNT_CORRECTABLE

The number of correctable low level driver communication errors.

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_RAS_COMPUTE_ERRCOUNT_CORRECTABLE

The number of correctable errors in the compute accelerator hardware.

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_RAS_NONCOMPUTE_ERRCOUNT_CORRECTABLE

The number of correctable errors in the fixed-function accelerator hardware.

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_RAS_CACHE_ERRCOUNT_CORRECTABLE

The number of correctable errors in caches (L1/L3/register file/shared local memory/sampler).

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_RAS_DISPLAY_ERRCOUNT_CORRECTABLE

The number of correctable errors in the display.

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_RAS_RESET_COUNT_UNCORRECTABLE

The number of uncorrectable accelerator engine resets by the driver.

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_RAS_PROGRAMMING_ERRCOUNT_UNCORRECTABLE

The number of uncorrectable hardware exceptions generated by the way workloads have programmed the hardware.

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_RAS_DRIVER_ERRCOUNT_UNCORRECTABLE

The number of uncorrectable low level driver communication errors.

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_RAS_COMPUTE_ERRCOUNT_UNCORRECTABLE

The number of uncorrectable errors in the compute accelerator hardware.

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_RAS_NONCOMPUTE_ERRCOUNT_UNCORRECTABLE

The number of uncorrectable errors in the fixed-function accelerator hardware.

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_RAS_CACHE_ERRCOUNT_UNCORRECTABLE

The number of uncorrectable errors in caches (L1/L3/register file/shared local memory/sampler).

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_RAS_DISPLAY_ERRCOUNT_UNCORRECTABLE

The number of uncorrectable errors in the display.

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_ACTIVE_TIME

Time that this resource is actively running a workload in unspecified units. See the Intel oneAPI Level Zero Sysman documentation for more info.

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_ACTIVE_TIME_TIMESTAMP

The timestamp for the LEVELZERO::GPU_ACTIVE_TIME read in unspecified units. See the Intel oneAPI Level Zero Sysman documentation for more info.

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_CORE_ACTIVE_TIME

Time that the GPU compute engines (EUs) are actively running a workload in unspecified units. See the Intel oneAPI Level Zero Sysman documentation for more info.

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_CORE_ACTIVE_TIME_TIMESTAMP

The timestamp for the LEVELZERO::GPU_CORE_ACTIVE_TIME signal read in unspecified units. See the Intel oneAPI Level Zero Sysman documentation for more info.

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_UNCORE_ACTIVE_TIME

Time that the GPU copy engines are actively running a workload in unspecified units. See the Intel oneAPI Level Zero Sysman documentation for more info.

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_UNCORE_ACTIVE_TIME_TIMESTAMP

The timestamp for the LEVELZERO::GPU_UNCORE_ACTIVE_TIME signal read in unspecified units. See the Intel oneAPI Level Zero Sysman documentation for more info.

  • Aggregation: sum

  • Domain: gpu_chip

  • Format: double

  • Unit: none
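The active time signals above are each paired with a timestamp signal. As a hedged sketch, a utilization-style fraction can be formed from two samples as the change in active time divided by the change in timestamp, which matches how the Sysman engine documentation describes utilization. The units are unspecified, so the sketch assumes only that the active time and its timestamp share the same unit; the function name is hypothetical.

```python
import math


def active_fraction(active_start, stamp_start, active_end, stamp_end):
    """Fraction of the sampling interval spent actively running work.

    Inputs are two samples of an ACTIVE_TIME signal and its matching
    ACTIVE_TIME_TIMESTAMP signal, assumed to share the same unit.
    """
    elapsed = stamp_end - stamp_start
    if elapsed <= 0:
        # No time has passed between samples; the ratio is undefined.
        return math.nan
    return (active_end - active_start) / elapsed


# Made-up sample values for illustration only.
print(active_fraction(100.0, 1000.0, 350.0, 2000.0))
```

Because only a ratio of differences is used, the result is unitless regardless of what unit the driver reports.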

LEVELZERO::GPU_POWER

Average GPU power over 40ms (via geopmread) or 8 control loop iterations. Derivative signal based on LEVELZERO::GPU_ENERGY.

  • Aggregation: average

  • Domain: gpu

  • Format: double

  • Unit: watts

LEVELZERO::GPU_CORE_POWER

Average GPU Compute Hardware power over 40ms (via geopmread) or 8 control loop iterations. Derivative signal based on LEVELZERO::GPU_CORE_ENERGY.

  • Aggregation: average

  • Domain: gpu_chip

  • Format: double

  • Unit: watts
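The two power signals above are described as derivatives of the corresponding energy signals. The following is a simplified two-sample sketch of that idea, not GEOPM's actual implementation, which averages over a window of samples. It assumes energy in joules and timestamps in seconds, as listed for LEVELZERO::GPU_ENERGY and LEVELZERO::GPU_ENERGY_TIMESTAMP; the function name is hypothetical.

```python
import math


def average_power(energy_start, time_start, energy_end, time_end):
    """Average power in watts between two energy samples.

    Energy values are in joules and timestamps in seconds.
    """
    elapsed = time_end - time_start
    if elapsed <= 0.0:
        # Samples taken at the same instant carry no rate information.
        return math.nan
    return (energy_end - energy_start) / elapsed


# Made-up samples 40 ms apart: 6 J consumed, so roughly 150 W.
print(average_power(1000.0, 0.0, 1006.0, 0.04))
```

A short window such as the 40 ms used by geopmread makes the estimate responsive but noisy; a longer window smooths it at the cost of lag.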

LEVELZERO::GPU_UTILIZATION

Utilization of all GPU engines. Level Zero logical engines may map to the same hardware, resulting in a reduced signal range (i.e. less than 0 to 1) in some cases. See the LevelZero Sysman Engine documentation for more info.

  • Aggregation: average

  • Domain: gpu

  • Format: double

  • Unit: none

LEVELZERO::GPU_CORE_UTILIZATION

Utilization of the GPU Compute Engines (EUs). Level Zero logical engines may map to the same hardware, resulting in a reduced signal range (i.e. less than 0 to 1) in some cases. See the LevelZero Sysman Engine documentation for more info.

  • Aggregation: average

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_UNCORE_UTILIZATION

Utilization of the GPU Copy Engines. Level Zero logical engines may map to the same hardware, resulting in a reduced signal range (i.e. less than 0 to 1) in some cases. See the LevelZero Sysman Engine documentation for more info.

  • Aggregation: average

  • Domain: gpu_chip

  • Format: double

  • Unit: none

LEVELZERO::GPU_CORE_THROTTLE_REASONS

GPU Compute Hardware throttle reasons. See oneAPI Level Zero Sysman Spec for decoding.

  • Aggregation: integer_bitwise_or

  • Domain: gpu_chip

  • Format: integer

  • Unit: none
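As a sketch of how the throttle reasons value might be decoded, the example below maps bit positions to names. The bit assignments are believed to mirror zes_freq_throttle_reason_flags_t in the Level Zero Sysman header, but they are reproduced from memory as an assumption and should be checked against the installed zes_api.h. The function names are hypothetical. The second function illustrates the integer_bitwise_or aggregation listed above.

```python
from functools import reduce
from operator import or_

# Assumed bit positions; verify against zes_freq_throttle_reason_flags_t.
THROTTLE_FLAGS = {
    0: "AVE_PWR_CAP",
    1: "BURST_PWR_CAP",
    2: "CURRENT_LIMIT",
    3: "THERMAL_LIMIT",
    4: "PSU_ALERT",
    5: "SW_RANGE",
    6: "HW_RANGE",
}


def decode_throttle_reasons(value):
    """Return the names of all flags set in a throttle reasons value."""
    bits = int(value)  # signal values may arrive as floating point
    return [
        name
        for bit, name in sorted(THROTTLE_FLAGS.items())
        if bits & (1 << bit)
    ]


def aggregate_throttle_reasons(per_chip_values):
    """Combine per-chip values with a bitwise OR."""
    return reduce(or_, (int(v) for v in per_chip_values), 0)


combined = aggregate_throttle_reasons([1, 8])
print(combined, decode_throttle_reasons(combined))
```

With a bitwise OR, a reason reported by any chip appears in the aggregated value, so a value of zero means no chip reported throttling.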

Controls

Every control is exposed as a signal with the same name. The relevant signal aggregation information is provided below.

LEVELZERO::GPU_CORE_FREQUENCY_MIN_CONTROL

Sets the minimum frequency request for the GPU Compute Hardware.

  • Aggregation: expect_same

  • Domain: gpu_chip

  • Format: double

  • Unit: hertz

LEVELZERO::GPU_CORE_FREQUENCY_MAX_CONTROL

Sets the maximum frequency request for the GPU Compute Hardware.

  • Aggregation: expect_same

  • Domain: gpu_chip

  • Format: double

  • Unit: hertz
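Before writing either frequency control above, a caller may want to keep the request inside the range reported by LEVELZERO::GPU_CORE_FREQUENCY_MIN_AVAIL and LEVELZERO::GPU_CORE_FREQUENCY_MAX_AVAIL. The sketch below clamps a request to that range and optionally snaps it to a multiple of LEVELZERO::GPU_CORE_FREQUENCY_STEP. Whether the driver performs such rounding itself is not stated here, so this is an illustrative assumption, and the function name is hypothetical.

```python
def clamp_frequency(request_hz, min_avail_hz, max_avail_hz, step_hz=0.0):
    """Clamp a frequency request to the supported range.

    If step_hz is positive, snap to the nearest step counted from
    min_avail_hz, without exceeding max_avail_hz.
    """
    clamped = min(max(request_hz, min_avail_hz), max_avail_hz)
    if step_hz > 0.0:
        steps = round((clamped - min_avail_hz) / step_hz)
        clamped = min(min_avail_hz + steps * step_hz, max_avail_hz)
    return clamped


# Made-up range: 300 MHz to 1.6 GHz in 50 MHz steps.
print(clamp_frequency(1.23e9, 3.0e8, 1.6e9, 5.0e7))
```

The returned value could then be written to the minimum or maximum control, and the control read back as a signal to see what was applied.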

LEVELZERO::GPU_CORE_PERFORMANCE_FACTOR_CONTROL

Performance Factor of the GPU Compute Hardware Domain. Expresses a trade-off between energy provided to the GPU compute hardware and the supporting units. A value of 1 indicates a compute-focused energy trade-off; a value of 0 indicates a memory-focused energy trade-off. The default value is 0.5.

  • Aggregation: average

  • Domain: gpu_chip

  • Format: double

  • Unit: none

Aliases

This IOGroup provides the following high-level aliases:

Signal Aliases

GPU_ENERGY

Maps to LEVELZERO::GPU_ENERGY.

GPU_POWER

Maps to LEVELZERO::GPU_POWER.

GPU_CORE_ENERGY

Maps to LEVELZERO::GPU_CORE_ENERGY.

GPU_CORE_POWER

Maps to LEVELZERO::GPU_CORE_POWER.

GPU_UTILIZATION

Maps to LEVELZERO::GPU_UTILIZATION.

GPU_CORE_ACTIVITY

Maps to LEVELZERO::GPU_CORE_UTILIZATION.

GPU_UNCORE_ACTIVITY

Maps to LEVELZERO::GPU_UNCORE_UTILIZATION.

GPU_CORE_FREQUENCY_STATUS

Maps to LEVELZERO::GPU_CORE_FREQUENCY_STATUS.

GPU_CORE_FREQUENCY_MIN_AVAIL

Maps to LEVELZERO::GPU_CORE_FREQUENCY_MIN_AVAIL.

GPU_CORE_FREQUENCY_MAX_AVAIL

Maps to LEVELZERO::GPU_CORE_FREQUENCY_MAX_AVAIL.

GPU_CORE_FREQUENCY_MIN_CONTROL

Maps to LEVELZERO::GPU_CORE_FREQUENCY_MIN_CONTROL.

GPU_CORE_FREQUENCY_MAX_CONTROL

Maps to LEVELZERO::GPU_CORE_FREQUENCY_MAX_CONTROL.

GPU_CORE_FREQUENCY_STEP

Maps to LEVELZERO::GPU_CORE_FREQUENCY_STEP.

LEVELZERO::GPU_CORE_PERFORMANCE_FACTOR_CONTROL

Maps to LEVELZERO::GPU_CORE_PERFORMANCE_FACTOR. Writes to the performance factor may not be granted. To confirm the actual control setting, the signal must be read.
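Because a write to the performance factor may not be granted, a write followed by a read-back is one way to confirm the setting. The sketch below takes the write and read operations as callables so that it runs without GPU hardware; all names in it are hypothetical. On a real system the callables might wrap the GEOPM write-control and read-signal interfaces, which is an assumption and not shown here.

```python
def request_performance_factor(write_control, read_signal, value,
                               tolerance=1e-3):
    """Write a performance factor, then read it back to confirm.

    Returns (actual_value, granted), where granted is True when the
    read-back value is within tolerance of the request.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("performance factor must be between 0 and 1")
    write_control(value)
    actual = read_signal()
    return actual, abs(actual - value) <= tolerance


class FakeDevice:
    """Stand-in device that silently ignores writes."""

    def __init__(self, value=0.5):
        self.value = value

    def write(self, setting):
        pass  # the request is not granted

    def read(self):
        return self.value


device = FakeDevice()
actual, granted = request_performance_factor(device.write, device.read, 1.0)
print(actual, granted)
```

Here the fake device keeps its default of 0.5, so the helper reports that the request for 1.0 was not granted.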

Control Aliases

GPU_CORE_FREQUENCY_MAX_CONTROL

Maps to LEVELZERO::GPU_CORE_FREQUENCY_MAX_CONTROL.

GPU_CORE_FREQUENCY_MIN_CONTROL

Maps to LEVELZERO::GPU_CORE_FREQUENCY_MIN_CONTROL.

See Also

oneAPI LevelZero Sysman, geopm(7), geopm::IOGroup(3), geopmwrite(1), geopmread(1)