geopm_pio_levelzero(7) – IOGroup providing signals and controls for Intel GPUs
Description
The LevelZeroIOGroup implements the geopm::IOGroup(3) interface to provide hardware signals and controls for Intel GPUs.
Requirements
To use the GEOPM LevelZero signals and controls, GEOPM must be compiled against the oneAPI LevelZero libraries and must be run on a system with discrete GPUs supported by LevelZero. To compile against the oneAPI LevelZero libraries, GEOPM must be configured with the --enable-levelzero flag. The optional --with-levelzero flag may be used to indicate the path of the required libraries. In addition, the user must export ZES_ENABLE_SYSMAN=1 as specified by the Intel oneAPI Level Zero Sysman documentation. See the Sysman specification for more info on related environment variables and their usage.
Since signals and controls are exposed via the Sysman API, they will be impacted by Sysman environment variables. Please review the oneAPI LevelZero Sysman Environment Variables and the oneAPI LevelZero Core Programming Guide Environment Variables.
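As a minimal sketch of the environment requirement above, a launcher script could check that ZES_ENABLE_SYSMAN is exported before starting a GEOPM tool. The helper name sysman_enabled is hypothetical and is not part of GEOPM.

```python
import os


def sysman_enabled(env=None):
    """Return True if ZES_ENABLE_SYSMAN is set to "1".

    `env` defaults to the current process environment. This helper is
    hypothetical; it only mirrors the requirement stated above.
    """
    if env is None:
        env = os.environ
    return env.get("ZES_ENABLE_SYSMAN") == "1"


if __name__ == "__main__":
    if not sysman_enabled():
        print("ZES_ENABLE_SYSMAN=1 is not exported; "
              "LEVELZERO signals may be unavailable.")
```

Passing a dictionary instead of relying on os.environ keeps the check easy to exercise without modifying the real environment.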
Note on RAS Signals
Monitoring RAS counters has a high overhead (0.5 seconds per read). As a result, any errors that occur while monitoring these signals (e.g., due to unsupported firmware) are not reported until the user actually attempts to read one of these signals.
Signals
LEVELZERO::GPU_CORE_FREQUENCY_STATUS
The current frequency of the GPU Compute Hardware.
Aggregation: average
Domain: gpu_chip
Format: double
Unit: hertz
LEVELZERO::GPU_CORE_FREQUENCY_EFFICIENT
The efficient minimum frequency of the GPU Compute Hardware.
Aggregation: average
Domain: gpu_chip
Format: double
Unit: hertz
LEVELZERO::GPU_CORE_FREQUENCY_MAX_AVAIL
The maximum supported frequency of the GPU Compute Hardware.
Aggregation: expect_same
Domain: gpu_chip
Format: double
Unit: hertz
LEVELZERO::GPU_CORE_FREQUENCY_MIN_AVAIL
The minimum supported frequency of the GPU Compute Hardware.
Aggregation: expect_same
Domain: gpu_chip
Format: double
Unit: hertz
LEVELZERO::GPU_CORE_TEMPERATURE_MAXIMUM
The maximum measured temperature across all sensors in the GPU accelerator.
Aggregation: max
Domain: gpu_chip
Format: double
Unit: celsius
LEVELZERO::GPU_MEMORY_TEMPERATURE_MAXIMUM
The maximum measured temperature across all sensors in the GPU memory.
Aggregation: max
Domain: gpu_chip
Format: double
Unit: celsius
LEVELZERO::GPU_CORE_FREQUENCY_STEP
The GPU Compute Hardware frequency step size in hertz. The average step size is provided in the case where the step size is variable.
Aggregation: expect_same
Domain: gpu
Format: double
Unit: hertz
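For illustration, a caller might combine the step size with the minimum and maximum available frequencies to pick a valid request. The sketch below is hypothetical: snap_frequency is not a GEOPM function, and it assumes the supported frequencies form a uniform grid starting at the minimum, which may not hold when the step size is variable.

```python
def snap_frequency(request_hz, min_hz, max_hz, step_hz):
    """Clamp a request to [min_hz, max_hz] and round to the nearest step.

    Assumes a uniform grid min_hz, min_hz + step_hz, ... (an assumption,
    since the reported step may be an average of variable steps).
    """
    if step_hz <= 0 or min_hz > max_hz:
        raise ValueError("invalid frequency range or step")
    clamped = min(max(request_hz, min_hz), max_hz)
    steps = round((clamped - min_hz) / step_hz)
    snapped = min_hz + steps * step_hz
    # Rounding up may overshoot the maximum; step back down if so.
    if snapped > max_hz:
        snapped -= step_hz
    return snapped
```

The inputs would come from LEVELZERO::GPU_CORE_FREQUENCY_MIN_AVAIL, LEVELZERO::GPU_CORE_FREQUENCY_MAX_AVAIL, and LEVELZERO::GPU_CORE_FREQUENCY_STEP.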
LEVELZERO::GPU_ENERGY
GPU energy in joules.
Aggregation: sum
Domain: gpu
Format: double
Unit: joules
LEVELZERO::GPU_CORE_ENERGY
GPU Compute Hardware chip energy in joules.
Aggregation: sum
Domain: gpu_chip on multi-chip systems, or gpu on systems with a single chip per GPU
Format: double
Unit: joules
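The Aggregation field on each entry names how values from a finer domain (such as gpu_chip) are combined when a coarser domain (such as gpu) is requested. The sketch below gives plain Python stand-ins for the aggregation types used on this page. The function names are hypothetical, and the NaN results for empty or mismatched inputs are assumptions about the intended behavior rather than a statement of GEOPM's implementation.

```python
import math


def agg_sum(values):
    """Total across sub-domains, e.g. per-chip energy in joules."""
    return sum(values)


def agg_average(values):
    """Mean across sub-domains, e.g. per-chip frequency in hertz."""
    return sum(values) / len(values) if values else math.nan


def agg_max(values):
    """Largest value across sub-domains, e.g. temperature in celsius."""
    return max(values) if values else math.nan


def agg_expect_same(values):
    """Return the common value, or NaN if the values differ (assumed)."""
    if not values:
        return math.nan
    first = values[0]
    return first if all(v == first for v in values) else math.nan
```

For example, agg_sum over the two per-chip LEVELZERO::GPU_CORE_ENERGY readings of a two-chip GPU yields one value for the whole GPU.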
LEVELZERO::GPU_CORE_ENERGY_TIMESTAMP
GPU Compute Hardware domain energy timestamp in seconds. The value is cached when LEVELZERO::GPU_CORE_ENERGY is read.
Aggregation: sum
Domain: gpu_chip on multi-chip systems, or gpu on systems with a single chip per GPU
Format: double
Unit: seconds
LEVELZERO::GPU_ENERGY_TIMESTAMP
Timestamp for the GPU energy read in seconds.
Aggregation: sum
Domain: gpu
Format: double
Unit: seconds
LEVELZERO::GPU_CORE_PERFORMANCE_FACTOR
Performance Factor of the GPU Compute Hardware Domain. Expresses a trade-off between energy provided to the GPU compute hardware and the supporting units. A value of 1 indicates a compute focused energy trade-off, a value of 0 indicates a memory focused energy trade-off. Default value is 0.5.
Aggregation: average
Domain: gpu_chip on multi-chip systems, or gpu on systems with a single chip per GPU
Format: double
Unit: none
LEVELZERO::GPU_UNCORE_FREQUENCY_STATUS
The current frequency of the GPU Memory hardware.
Aggregation: average
Domain: gpu_chip
Format: double
Unit: hertz
LEVELZERO::GPU_UNCORE_FREQUENCY_MAX_AVAIL
The maximum supported frequency of the GPU Memory Hardware.
Aggregation: expect_same
Domain: gpu_chip
Format: double
Unit: hertz
LEVELZERO::GPU_UNCORE_FREQUENCY_MIN_AVAIL
The minimum supported frequency of the GPU Memory Hardware.
Aggregation: expect_same
Domain: gpu_chip
Format: double
Unit: hertz
LEVELZERO::GPU_POWER_LIMIT_DEFAULT
Default power limit of the GPU in watts.
Aggregation: sum
Domain: gpu
Format: double
Unit: watts
LEVELZERO::GPU_POWER_LIMIT_MIN_AVAIL
The minimum supported power limit in watts.
Aggregation: sum
Domain: gpu
Format: double
Unit: watts
LEVELZERO::GPU_POWER_LIMIT_MAX_AVAIL
The maximum supported power limit in watts.
Aggregation: sum
Domain: gpu
Format: double
Unit: watts
LEVELZERO::GPU_RAS_RESET_COUNT_CORRECTABLE
The number of correctable accelerator engine resets by the driver.
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_RAS_PROGRAMMING_ERRCOUNT_CORRECTABLE
The number of correctable hardware exceptions generated by the way workloads have programmed the hardware.
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_RAS_DRIVER_ERRCOUNT_CORRECTABLE
The number of correctable low level driver communication errors.
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_RAS_COMPUTE_ERRCOUNT_CORRECTABLE
The number of correctable errors in the compute accelerator hardware.
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_RAS_NONCOMPUTE_ERRCOUNT_CORRECTABLE
The number of correctable errors in the fixed-function accelerator hardware.
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_RAS_CACHE_ERRCOUNT_CORRECTABLE
The number of correctable errors in caches (L1/L3/register file/shared local memory/sampler).
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_RAS_DISPLAY_ERRCOUNT_CORRECTABLE
The number of correctable errors in the display.
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_RAS_RESET_COUNT_UNCORRECTABLE
The number of uncorrectable accelerator engine resets by the driver.
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_RAS_PROGRAMMING_ERRCOUNT_UNCORRECTABLE
The number of uncorrectable hardware exceptions generated by the way workloads have programmed the hardware.
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_RAS_DRIVER_ERRCOUNT_UNCORRECTABLE
The number of uncorrectable low level driver communication errors.
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_RAS_COMPUTE_ERRCOUNT_UNCORRECTABLE
The number of uncorrectable errors in the compute accelerator hardware.
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_RAS_NONCOMPUTE_ERRCOUNT_UNCORRECTABLE
The number of uncorrectable errors in the fixed-function accelerator hardware.
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_RAS_CACHE_ERRCOUNT_UNCORRECTABLE
The number of uncorrectable errors in caches (L1/L3/register file/shared local memory/sampler).
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_RAS_DISPLAY_ERRCOUNT_UNCORRECTABLE
The number of uncorrectable errors in the display.
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_ACTIVE_TIME
Time that this resource is actively running a workload in unspecified units. See the Intel oneAPI Level Zero Sysman documentation for more info.
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_ACTIVE_TIME_TIMESTAMP
The timestamp for the LEVELZERO::GPU_ACTIVE_TIME read in unspecified units. See the Intel oneAPI Level Zero Sysman documentation for more info.
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
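Because the active time and its timestamp are both reported in the same unspecified units, they are most naturally consumed as a ratio of deltas. The sketch below follows the approach described in the Level Zero Sysman engine documentation, dividing the change in active time by the change in timestamp between two snapshots. The function name is hypothetical, and whether GEOPM derives its utilization signals in exactly this way is an assumption.

```python
def utilization(active_prev, timestamp_prev, active_curr, timestamp_curr):
    """Fraction of elapsed time spent active between two snapshots.

    Both quantities must be in the same (unspecified) units, so the
    ratio is unitless. Returns NaN when no time has elapsed.
    """
    elapsed = timestamp_curr - timestamp_prev
    if elapsed <= 0:
        return float("nan")
    return (active_curr - active_prev) / elapsed
```

Each snapshot pairs a LEVELZERO::GPU_ACTIVE_TIME read with the matching LEVELZERO::GPU_ACTIVE_TIME_TIMESTAMP read.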
LEVELZERO::GPU_CORE_ACTIVE_TIME
Time that the GPU compute engines (EUs) are actively running a workload in unspecified units. See the Intel oneAPI Level Zero Sysman documentation for more info.
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_CORE_ACTIVE_TIME_TIMESTAMP
The timestamp for the LEVELZERO::GPU_CORE_ACTIVE_TIME signal read in unspecified units. See the Intel oneAPI Level Zero Sysman documentation for more info.
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_UNCORE_ACTIVE_TIME
Time that the GPU copy engines are actively running a workload in unspecified units. See the Intel oneAPI Level Zero Sysman documentation for more info.
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_UNCORE_ACTIVE_TIME_TIMESTAMP
The timestamp for the LEVELZERO::GPU_UNCORE_ACTIVE_TIME signal read in unspecified units. See the Intel oneAPI Level Zero Sysman documentation for more info.
Aggregation: sum
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_POWER
Average GPU power over 40 ms (via geopmread) or 8 control loop iterations. Derivative signal based on LEVELZERO::GPU_ENERGY.
Aggregation: average
Domain: gpu
Format: double
Unit: watts
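To illustrate what a derivative signal means here, the sketch below estimates average power as the change in energy divided by the change in time across a window of samples. This is a simplified finite difference under stated assumptions; GEOPM's actual derivative computation may use a different estimator, and the function name is hypothetical.

```python
def average_power(samples):
    """Estimate average power in watts from (time_s, energy_j) samples.

    Uses only the first and last samples in the window. Returns NaN if
    fewer than two samples are given or no time has elapsed.
    """
    if len(samples) < 2:
        return float("nan")
    (t0, e0), (t1, e1) = samples[0], samples[-1]
    if t1 <= t0:
        return float("nan")
    return (e1 - e0) / (t1 - t0)
```

The samples would pair LEVELZERO::GPU_ENERGY_TIMESTAMP with LEVELZERO::GPU_ENERGY, collected once per control loop iteration.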
LEVELZERO::GPU_CORE_POWER
Average GPU Compute Hardware power over 40 ms (via geopmread) or 8 control loop iterations. Derivative signal based on LEVELZERO::GPU_CORE_ENERGY.
Aggregation: average
Domain: gpu_chip
Format: double
Unit: watts
LEVELZERO::GPU_UTILIZATION
Utilization of all GPU engines. Level Zero logical engines may map to the same hardware, resulting in a reduced signal range (i.e., narrower than the full 0 to 1 range) in some cases. See the LevelZero Sysman Engine documentation for more info.
Aggregation: average
Domain: gpu
Format: double
Unit: none
LEVELZERO::GPU_CORE_UTILIZATION
Utilization of the GPU Compute Engines (EUs). Level Zero logical engines may map to the same hardware, resulting in a reduced signal range (i.e., narrower than the full 0 to 1 range) in some cases. See the LevelZero Sysman Engine documentation for more info.
Aggregation: average
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_UNCORE_UTILIZATION
Utilization of the GPU Copy Engines. Level Zero logical engines may map to the same hardware, resulting in a reduced signal range (i.e., narrower than the full 0 to 1 range) in some cases. See the LevelZero Sysman Engine documentation for more info.
Aggregation: average
Domain: gpu_chip
Format: double
Unit: none
LEVELZERO::GPU_CORE_THROTTLE_REASONS
GPU Compute Hardware throttle reasons, reported as a bit field. See the oneAPI Level Zero Sysman specification for decoding.
Aggregation: integer_bitwise_or
Domain: gpu_chip
Format: integer
Unit: none
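A minimal sketch of decoding the throttle reasons value follows. The bit positions are assumed from the zes_freq_throttle_reason_flags_t enumeration in the Level Zero Sysman specification and should be verified against the installed Level Zero headers; the function names are hypothetical. The second helper shows how a bitwise-OR aggregation would combine per-chip values into a single value.

```python
# Bit positions assumed from zes_freq_throttle_reason_flags_t in the
# Level Zero Sysman specification; verify against your zes_api.h.
THROTTLE_REASONS = {
    0: "AVE_PWR_CAP",
    1: "BURST_PWR_CAP",
    2: "CURRENT_LIMIT",
    3: "THERMAL_LIMIT",
    4: "PSU_ALERT",
    5: "SW_RANGE",
    6: "HW_RANGE",
}


def decode_throttle_reasons(value):
    """Return the names of all throttle reason bits set in `value`."""
    bits = int(value)
    return [name for bit, name in sorted(THROTTLE_REASONS.items())
            if bits & (1 << bit)]


def aggregate_bitwise_or(values):
    """Combine per-chip values the way a bitwise-OR aggregation would."""
    result = 0
    for value in values:
        result |= int(value)
    return result
```

With this table, a value of 9 decodes to AVE_PWR_CAP and THERMAL_LIMIT, and a value of 0 means no throttle reason is reported.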
Controls
Every control is exposed as a signal with the same name. The relevant signal aggregation information is provided below.
LEVELZERO::GPU_CORE_FREQUENCY_MIN_CONTROL
Sets the minimum frequency request for the GPU Compute Hardware.
Aggregation: expect_same
Domain: gpu_chip
Format: double
Unit: hertz
LEVELZERO::GPU_CORE_FREQUENCY_MAX_CONTROL
Sets the maximum frequency request for the GPU Compute Hardware.
Aggregation: expect_same
Domain: gpu_chip
Format: double
Unit: hertz
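As a sketch of how the two frequency controls might be driven together, the helper below builds geopmwrite(1) argument lists in the form CONTROL_NAME DOMAIN_TYPE DOMAIN_INDEX VALUE. The helper name is hypothetical, and nothing is executed; the lists are only constructed.

```python
def frequency_range_commands(chip_index, min_hz, max_hz):
    """Build geopmwrite argument lists that set a GPU chip frequency range.

    The lists can be passed to subprocess.run() on a system where
    geopmwrite is installed; nothing is executed here.
    """
    if min_hz > max_hz:
        raise ValueError("min_hz must not exceed max_hz")
    controls = (
        ("LEVELZERO::GPU_CORE_FREQUENCY_MIN_CONTROL", min_hz),
        ("LEVELZERO::GPU_CORE_FREQUENCY_MAX_CONTROL", max_hz),
    )
    return [["geopmwrite", name, "gpu_chip", str(chip_index),
             repr(float(value))]
            for name, value in controls]
```

The order in which the two writes should be issued may depend on the current settings, for example when the new minimum is above the current maximum.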
LEVELZERO::GPU_CORE_PERFORMANCE_FACTOR_CONTROL
Performance Factor of the GPU Compute Hardware Domain. Expresses a trade-off between energy provided to the GPU compute hardware and the supporting units. A value of 1 indicates a compute focused energy trade-off, a value of 0 indicates a memory focused energy trade-off. Default value is 0.5.
Aggregation: average
Domain: gpu_chip
Format: double
Unit: none
Aliases
This IOGroup provides the following high-level aliases:
Signal Aliases
GPU_ENERGY
Maps to LEVELZERO::GPU_ENERGY.
GPU_POWER
Maps to LEVELZERO::GPU_POWER.
GPU_CORE_ENERGY
Maps to LEVELZERO::GPU_CORE_ENERGY.
GPU_CORE_POWER
Maps to LEVELZERO::GPU_CORE_POWER.
GPU_UTILIZATION
Maps to LEVELZERO::GPU_UTILIZATION.
GPU_CORE_ACTIVITY
Maps to LEVELZERO::GPU_CORE_UTILIZATION.
GPU_UNCORE_ACTIVITY
Maps to LEVELZERO::GPU_UNCORE_UTILIZATION.
GPU_CORE_FREQUENCY_STATUS
Maps to LEVELZERO::GPU_CORE_FREQUENCY_STATUS.
GPU_CORE_FREQUENCY_MIN_AVAIL
Maps to LEVELZERO::GPU_CORE_FREQUENCY_MIN_AVAIL.
GPU_CORE_FREQUENCY_MAX_AVAIL
Maps to LEVELZERO::GPU_CORE_FREQUENCY_MAX_AVAIL.
GPU_CORE_FREQUENCY_MIN_CONTROL
Maps to LEVELZERO::GPU_CORE_FREQUENCY_MIN_CONTROL.
GPU_CORE_FREQUENCY_MAX_CONTROL
Maps to LEVELZERO::GPU_CORE_FREQUENCY_MAX_CONTROL.
GPU_CORE_FREQUENCY_STEP
Maps to LEVELZERO::GPU_CORE_FREQUENCY_STEP.
LEVELZERO::GPU_CORE_PERFORMANCE_FACTOR_CONTROL
Maps to LEVELZERO::GPU_CORE_PERFORMANCE_FACTOR. Writes to performance factor may not be granted. To confirm the actual control setting the signal must be read.
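Since a performance factor write may not be granted, a caller can follow each write with a read of the corresponding signal. The sketch below shows that write-then-verify pattern with caller-supplied callables; write_control and read_signal are hypothetical stand-ins for a real GEOPM write and read, and the tolerance is an arbitrary choice.

```python
def write_and_verify(write_control, read_signal, requested, tolerance=1e-6):
    """Write a control value, read it back, and report whether it was granted.

    `write_control(value)` and `read_signal()` are caller-supplied
    callables (hypothetical stand-ins for a real GEOPM write and read).
    Returns a tuple (granted, actual).
    """
    write_control(requested)
    actual = read_signal()
    return abs(actual - requested) <= tolerance, actual
```

When granted is False, actual holds the setting that is really in effect, which is the value subsequent decisions should be based on.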
Control Aliases
GPU_CORE_FREQUENCY_MAX_CONTROL
Maps to LEVELZERO::GPU_CORE_FREQUENCY_MAX_CONTROL.
GPU_CORE_FREQUENCY_MIN_CONTROL
Maps to LEVELZERO::GPU_CORE_FREQUENCY_MIN_CONTROL.
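The alias lists above amount to a lookup from a high-level name to an IOGroup-specific name. As a small sketch, the table below transcribes a subset of those mappings and resolves a name through it. The resolve helper is hypothetical and is not how GEOPM itself performs alias resolution.

```python
# A subset of the alias mappings listed above.
SIGNAL_ALIASES = {
    "GPU_ENERGY": "LEVELZERO::GPU_ENERGY",
    "GPU_POWER": "LEVELZERO::GPU_POWER",
    "GPU_CORE_ACTIVITY": "LEVELZERO::GPU_CORE_UTILIZATION",
    "GPU_UNCORE_ACTIVITY": "LEVELZERO::GPU_UNCORE_UTILIZATION",
    "GPU_CORE_FREQUENCY_STATUS": "LEVELZERO::GPU_CORE_FREQUENCY_STATUS",
}

CONTROL_ALIASES = {
    "GPU_CORE_FREQUENCY_MAX_CONTROL":
        "LEVELZERO::GPU_CORE_FREQUENCY_MAX_CONTROL",
    "GPU_CORE_FREQUENCY_MIN_CONTROL":
        "LEVELZERO::GPU_CORE_FREQUENCY_MIN_CONTROL",
}


def resolve(name, table):
    """Return the mapped name for an alias, or the name unchanged."""
    return table.get(name, name)
```

Note that GPU_CORE_ACTIVITY and GPU_UNCORE_ACTIVITY resolve to the UTILIZATION signals rather than to identically named ones.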
See Also
oneAPI LevelZero Sysman, geopm(7), geopm::IOGroup(3), geopmwrite(1), geopmread(1)