geopm_agent_gpu_activity(7) – agent for selecting GPU frequency based on GPU compute activity
Description
Note
This is currently an experimental agent and is only available when
building GEOPM with the --enable-beta flag. Some areas or aspects that
are subject to change include its interface (e.g. the policy) and
algorithm. It is also possible that this agent may be refactored and
combined with other agents.
The goal of GPUActivityAgent is to save GPU energy by scaling GPU frequency
based upon the compute activity of each GPU as provided by the
GPU_CORE_ACTIVITY signal and modified by the GPU_UTILIZATION signal.
The agent scales frequency in the range of Fe to Fmax, where Fmax is provided
by the NVMLIOGroup or LevelZeroIOGroup and Fe is provided by the
ConstConfigIOGroup or LevelZeroIOGroup.
Low activity regions (compute activity of 0.0) run at the Fe frequency, high
activity regions (compute activity of 1.0) run at the Fmax frequency, and
regions in between the extremes run at a frequency (F) selected using the
equation:
F = Fe + (Fmax - Fe) * GPU_CORE_ACTIVITY/GPU_UTILIZATION
GPU_UTILIZATION is used to scale the GPU_CORE_ACTIVITY in order
to scale frequency selection with the percentage of time a kernel is running on
the GPU. This tends to help with workloads that contain short but highly
scalable GPU phases.
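The frequency selection equation above can be sketched in Python as follows. This is a minimal illustration rather than the agent's implementation: the function name select_gpu_frequency is hypothetical, and both the handling of a zero GPU_UTILIZATION reading (treated here as no activity) and the clamping of the scaled activity to the range 0.0 to 1.0 are assumptions that the text does not specify.

```python
def select_gpu_frequency(f_e, f_max, core_activity, utilization):
    """Return F = Fe + (Fmax - Fe) * GPU_CORE_ACTIVITY / GPU_UTILIZATION.

    Hypothetical helper for illustration only. Frequencies are in hertz
    and both activity signals are fractions between 0.0 and 1.0.
    """
    if utilization > 0.0:
        scaled = core_activity / utilization
    else:
        # Assumption: with no kernel running, treat the GPU as idle.
        scaled = 0.0
    # Assumption: clamp so the result always stays within [Fe, Fmax].
    scaled = min(max(scaled, 0.0), 1.0)
    return f_e + (f_max - f_e) * scaled
```

With illustrative values of Fe = 1.0 GHz and Fmax = 1.5 GHz, a GPU_CORE_ACTIVITY of 0.25 combined with a GPU_UTILIZATION of 0.5 gives a scaled activity of 0.5 and a selected frequency of 1.25 GHz.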
Fe is intended to be an energy efficient frequency that is selected via system
characterization. The recommended approach to selecting Fe is to perform a
frequency sweep on the GPUs of interest using a workload that scales strongly
with frequency. With this approach, Fe will be the frequency that provides the
lowest GPU energy consumption for the workload.
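As a rough sketch of the selection step, assuming the sweep results have already been reduced to one GPU energy total per frequency, Fe is simply the frequency with the smallest energy. The function name and the shape of the sweep mapping are hypothetical and are not part of GEOPM.

```python
def pick_efficient_frequency(sweep):
    """Return the frequency with the lowest GPU energy.

    `sweep` is a hypothetical mapping of GPU frequency in hertz to the
    GPU energy in joules measured for one run of the workload.
    """
    if not sweep:
        raise ValueError("frequency sweep is empty")
    # min() over the keys, ranked by the energy stored for each key.
    return min(sweep, key=sweep.get)
```

In practice each frequency is usually run several times, so the per-frequency energy would first be averaged across runs before a selection like this is made.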
Fmax is intended to be the maximum allowable frequency, and may be set as the
default GPU maximum frequency, or limited based upon user/admin preference.
The agent provides an optional input of phi that allows for biasing the
frequency range used by the agent. The default phi value of 0.5 provides
frequency selection in the full range from Fe to Fmax. A phi value less than
0.5 biases the agent towards higher frequencies by increasing the Fe value.
In the extreme case (phi of 0) Fe will be raised to Fmax. A phi value greater
than 0.5 biases the agent towards lower frequencies by reducing the Fmax value.
In the extreme case (phi of 1.0) Fmax will be lowered to Fe.
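The effect of phi on the frequency range can be sketched as below. The text fixes only the endpoints (phi of 0, 0.5, and 1.0), so the linear interpolation between them is an assumption, and the function name resolve_frequency_range is hypothetical.

```python
import math


def resolve_frequency_range(f_e, f_max, phi=0.5):
    """Return the (Fe, Fmax) pair after applying the phi bias.

    Hypothetical helper. Linear interpolation between the documented
    endpoints is an assumption made for this sketch.
    """
    if math.isnan(phi):
        phi = 0.5  # NAN falls back to the default phi.
    if not 0.0 <= phi <= 1.0:
        raise ValueError("phi must be between 0.0 and 1.0")
    if phi < 0.5:
        # Bias towards higher frequencies: raise Fe towards Fmax.
        f_e = f_e + (f_max - f_e) * (0.5 - phi) / 0.5
    elif phi > 0.5:
        # Bias towards lower frequencies: lower Fmax towards Fe.
        f_max = f_max - (f_max - f_e) * (phi - 0.5) / 0.5
    return f_e, f_max
```

Under this assumption, a phi of 0.75 with illustrative values of Fe = 1.0 GHz and Fmax = 1.5 GHz narrows the range to 1.0 GHz through 1.25 GHz.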
For NVIDIA based systems the agent should be used with DCGM settings of
DCGM::FIELD_UPDATE_RATE = 100 ms, DCGM::MAX_STORAGE_TIME = 1 s, and
DCGM::MAX_SAMPLES = 100. While the DCGM documentation indicates that users
should generally query no faster than 100 ms, the interface allows for setting
the polling rate in the microsecond range. If the agent is intended to be used
with workloads that exhibit extremely short phase behavior, a 1 ms polling rate
can be used. As the 1 ms polling rate is not officially recommended by the DCGM
API, the 100 ms setting should be used by default.
Agent Name
The agent described in this manual is selected in many geopm interfaces with
the "gpu_activity" agent name. This name can be passed to geopmlaunch(1) as
the argument to the --geopm-agent option, or the GEOPM_AGENT environment
variable can be set to this name (see geopm(7)). This name can also be passed
to geopmagent(1) as the argument to the -a option.
Policy Parameters
The phi input is the only policy value.
GPU_PHI
:The performance bias knob. The value must be between 0.0 and 1.0. If NAN is passed, the default value of 0.5 is used.
ConstConfigIOGroup Configuration File Generation
This version of the agent uses the ConstConfigIOGroup to provide per-node Fe values.
The GPU compute activity ConstConfigIOGroup configuration file can be generated by running:
integration/experiment/gpu_frequency_sweep/gen_gpu_activity_constconfig_recommendation.py --path <GPU_SWEEP_DIR>
Depending on the number of runs, system noise, and other factors, there may be more than one reasonable
value for Fe. In these cases a warning similar to the following will be provided:
'Warning: Found N possible alternate Fe value(s) within 5% energy consumption of Fe for <frequency>.
Consider using the energy-margin options.\n'
If this occurs, the user may choose to use the provided configuration file or rerun the recommendation script with
the energy-margin option --gpu-energy-margin along with a value such as 0.05 (5%).
This option attempts to identify a lower Fe for the GPU domain that costs less than the energy consumed at Fe
plus the energy-margin percentage provided.
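The idea behind the energy-margin option can be illustrated with the sketch below. It does not reproduce the recommendation script: the function name, the sweep mapping of frequency to energy, and the choice of the lowest qualifying frequency are all assumptions made for illustration.

```python
def pick_frequency_with_margin(sweep, energy_margin=0.05):
    """Return a frequency at or below Fe that stays within the margin.

    `sweep` is a hypothetical mapping of GPU frequency in hertz to GPU
    energy in joules. Fe is the minimum-energy frequency, and a lower
    frequency qualifies when its energy is below the energy at Fe plus
    the margin.
    """
    f_e = min(sweep, key=sweep.get)
    budget = sweep[f_e] * (1.0 + energy_margin)
    candidates = [freq for freq, energy in sweep.items()
                  if freq < f_e and energy < budget]
    # Assumption: prefer the lowest qualifying frequency, else keep Fe.
    return min(candidates, default=f_e)
```

A margin of 0.0 never admits an alternate value, because no lower frequency can cost strictly less than the minimum energy, so the function then returns Fe unchanged.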
An example ConstConfigIOGroup configuration file is provided below:
{
    "GPU_FREQUENCY_EFFICIENT_HIGH_INTENSITY": {
        "domain": "board",
        "description": "Defines the efficient compute frequency to use for GPUs. This value is based on a workload that scales strongly with the frequency domain.",
        "units": "hertz",
        "aggregation": "average",
        "values": [982000000.0]
    }
}
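A configuration file of this shape is plain JSON, so it can be inspected with a few lines of Python before it is deployed. The sketch below parses an abbreviated copy of the example above and extracts the Fe value. The function name is hypothetical, and treating the position in "values" as the index within the "board" domain is an assumption.

```python
import json

# Abbreviated copy of the example configuration shown above.
EXAMPLE_CONFIG = """
{
    "GPU_FREQUENCY_EFFICIENT_HIGH_INTENSITY": {
        "domain": "board",
        "description": "Defines the efficient compute frequency to use for GPUs.",
        "units": "hertz",
        "aggregation": "average",
        "values": [982000000.0]
    }
}
"""


def efficient_frequency(config_text, index=0):
    """Return the Fe value in hertz stored at `index` (hypothetical helper)."""
    config = json.loads(config_text)
    entry = config["GPU_FREQUENCY_EFFICIENT_HIGH_INTENSITY"]
    if entry["units"] != "hertz":
        raise ValueError("expected Fe to be expressed in hertz")
    return entry["values"][index]


print(efficient_frequency(EXAMPLE_CONFIG) / 1e6, "MHz")
```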
Example Policy
An example policy is provided below:
{"GPU_PHI": 0.5}
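A policy like the one above can also be produced programmatically. The following sketch uses only the Python standard library, and the function name make_gpu_activity_policy is hypothetical. It rejects values outside the documented range of 0.0 to 1.0 and also rejects NAN, since how NAN is encoded in a policy file is not covered here.

```python
import json


def make_gpu_activity_policy(phi=0.5):
    """Return a JSON policy string for the gpu_activity agent.

    Hypothetical helper. A NAN input fails the range comparison below
    and is therefore rejected as well.
    """
    if not 0.0 <= phi <= 1.0:
        raise ValueError("GPU_PHI must be between 0.0 and 1.0")
    return json.dumps({"GPU_PHI": phi})


print(make_gpu_activity_policy())
```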
Report Extensions
GPU Frequency Requests
:The number of frequency requests made by the agent
Resolved Max Frequency
:Fmax after phi has been taken into account
Resolved Efficient Frequency
:Fe after phi has been taken into account
Resolved Frequency Range
:The frequency selection range of the agent after phi has been taken into account
GPU # Active Region Energy
:Per GPU energy reading during the Region of Interest (ROI) where ROI is determined as the first sample of GPU activity to the last sample of GPU activity.
GPU # Active Region Time
:Per GPU time during the Region of Interest (ROI) where ROI is determined as the first sample of GPU activity to the last sample of GPU activity.
GPU # Active Region Start Time
:Per GPU start time for the Region of Interest (ROI) where ROI is determined as the first sample of GPU activity to the last sample of GPU activity.
GPU # Active Region Stop Time
:Per GPU stop time for the Region of Interest (ROI) where ROI is determined as the first sample of GPU activity to the last sample of GPU activity.
Control Loop Rate
The agent gates the control loop to a cadence of 20 ms.
SEE ALSO
geopm(7), geopm_agent_monitor(7), geopm::Agent(3), geopm_agent(3), geopm_prof(3), geopmagent(1), geopmlaunch(1)